Publications & Notes
Short, technical writeups: experiments, paper reproductions, security analyses, and tutorials. Written to learn in public and contribute to the AI security community.
Benchmarking Prompt Injection Filters: 6 Approaches Tested Against 200 Adversarial Inputs
I built a test harness with 200 adversarial prompt injection inputs — ranging from trivial instruction overwrites to multi-turn jailbreak sequences — and clocked six different mitigation strategies: keyword blocklists, regex patterns, LLM-based classifiers, semantic similarity filters, dual-LLM isolation, and system-prompt hardening. The results were humbling for two strategies I expected to work well, and revealing about where prompt injection defense actually fails.
Reproducing "Universal Adversarial Perturbations" — Moosavi-Dezfooli et al., 2017
A faithful reproduction in PyTorch of the Universal Adversarial Perturbations paper, which showed that a single image-agnostic perturbation can fool a classifier on most images. I compare my results to the original paper, discuss the geometric interpretation (proximity to decision boundaries), and evaluate how adversarial training in 2026 handles these universal attacks. Short answer: better than 2017, but not nearly enough.
MITRE ATLAS Walkthrough: Mapping Three Real-World AI Attacks to the Framework
A practical walkthrough of MITRE ATLAS using three documented real-world AI attack scenarios: the 2021 Microsoft Tay-class poisoning incident, the 2023 GPT function-calling prompt injection demonstrated by security researchers, and the 2024 supply chain attack against a popular open-source ML framework. I map each attack to specific ATLAS tactics and techniques, then propose countermeasures using the framework's recommended mitigations.
Building a Secure RAG Pipeline: Sanitizing Retrieval Outputs Before LLM Injection
Step-by-step tutorial for building a RAG system that treats retrieved document chunks as untrusted input. Covers indirect prompt injection via document content, how to sanitize retrieved text before injecting it into LLM context, and why the standard LangChain RetrievalQA chain is insecure by default. Includes working code with tests and a CVE walkthrough for a real RAG-based prompt injection in a popular open-source chatbot.
Reproducing Madry et al. PGD Adversarial Training on CIFAR-10
Motivated by the Madry et al. claim that PGD-based adversarial training yields "the most robust" classifiers against ℓ∞ attacks. I reproduced their CIFAR-10 results, then deliberately pushed the limits: what happens at larger epsilon budgets, with AutoAttack (the stronger successor to PGD), and with distribution-shifted test sets? The robustness-accuracy tradeoff is real and steeper than the paper implies for practical deployments.
How Many Jailbreak Attempts Before a Safe Model Breaks? — A Statistical Analysis
I ran a structured experiment across GPT-4o, Claude 3.5 Sonnet, and Llama 3 70B using 50 jailbreak template variants per model across 10 sensitive topic categories. Measured attack success rate, average attempts required, and which categories were hardest to defend. Then modeled the empirical distribution of "attempts to first success" and estimated what it means for adversaries with different persistence thresholds. The results have practical implications for red team budgeting and safety layer design.
Threat Modeling a Production LLM Chatbot: 14 Attack Surfaces, Ranked by Exploitability
Applied STRIDE to a production-equivalent LLM chatbot architecture — user-facing frontend, API gateway, memory/RAG layer, tool-use integrations, model serving, and logging — and enumerated 14 distinct attack surfaces. Ranked each by exploitability and potential impact, then mapped to OWASP Top 10 for LLMs and MITRE ATLAS. A practical reference for any team shipping an LLM-powered product and trying to right-size their security investments.
Implementing FGSM, PGD, and C&W Attacks in PyTorch — From Theory to Code
A beginner-to-intermediate tutorial implementing the three most important adversarial attacks from scratch in PyTorch. No libraries — raw gradient computation, projected gradient descent, and the full Carlini-Wagner optimization loop. Annotated line-by-line with the underlying math, implementation gotchas, and tips for numerical stability. Tested on MNIST and CIFAR-10 with visualizations at each ε step.
Does Embedding-Based Anomaly Detection Actually Work on Synthetic Logs?
I generated 50,000 synthetic syslog entries across normal operations and 8 attack scenarios (brute force, lateral movement, data exfiltration, etc.), embedded them with three different models (text-embedding-ada-002, BGE-large, and e5-large), and evaluated whether k-NN anomaly detection in embedding space could reliably surface the attack logs. Spoiler: it works excellently for high-entropy anomalies and poorly for low-and-slow attacks — which is exactly the class that matters most.
OWASP Top 10 for LLMs — My Annotated Study Notes with Practical Examples
My annotated study notes on the OWASP Top 10 for LLMs (2025 edition), with a concrete attack example and code-level mitigation for each of the 10 categories. Written for the developer audience who knows web security but is new to LLM-specific risks. Covers everything from prompt injection (LLM01) to model theft (LLM10) with realistic attack scenarios drawn from public incident reports and my own red-teaming experiments.
Reproducing "Membership Inference Attacks Against Machine Learning Models" — Shokri et al., 2017
Reproducing the foundational membership inference attack on ML models using the shadow model training approach. I extended the original paper with a comparison against more recent attack variants (LiRA and RMIA) and evaluated how differential privacy training (DP-SGD) affects attack success rates at different privacy budgets. The privacy-utility tradeoff results are troubling for anyone planning to just "add DP" and call it done.
Setting Up a Secure LLM API Gateway: Rate-Limiting, Input Validation, and Audit Logging
End-to-end tutorial for building a production-grade API gateway in front of any LLM API. Covers: input length and content validation, semantic prompt injection detection, per-user rate-limiting with Redis, prompt/response audit logging to an append-only store, DLP scanning on outputs, and structured error responses that don't leak internal system prompts. Includes a Docker Compose setup for local testing with realistic load generation.
New Notes Monthly
Technical writeups focused on AI security — no newsletter fluff, no tutorials you already know.
Follow on GitHub →