Research Notes

01

Experiment Apr 02, 2026 · 14 min read

Benchmarking Prompt Injection Filters: 6 Approaches Tested Against 200 Adversarial Inputs

Tags: LLM Security · Prompt Injection · Evaluation · Python

I built a test harness with 200 adversarial prompt injection inputs — ranging from trivial instruction overwrites to multi-turn jailbreak sequences — and clocked six different mitigation strategies: keyword blocklists, regex patterns, LLM-based classifiers, semantic similarity filters, dual-LLM isolation, and system-prompt hardening. The results were humbling for two strategies I expected to work well, and revealing about where prompt injection defense actually fails.

Read Writeup →

02

Reproduction Mar 24, 2026 · 22 min read

Reproducing "Universal Adversarial Perturbations" — Moosavi-Dezfooli et al., 2017

Tags: Adversarial ML · Universal Perturbations · PyTorch · CIFAR-10

A faithful reproduction in PyTorch of the Universal Adversarial Perturbations paper, which showed that a single image-agnostic perturbation can fool a classifier on most images. I compare my results to the original paper, discuss the geometric interpretation (proximity to decision boundaries), and evaluate how adversarial training in 2026 handles these universal attacks. Short answer: better than 2017, but not nearly enough.

GitHub Notebook → Read Writeup →

03

Analysis Mar 15, 2026 · 18 min read

MITRE ATLAS Walkthrough: Mapping Three Real-World AI Attacks to the Framework

Tags: MITRE ATLAS · AI Threat Modeling · Red Teaming · Case Studies

A practical walkthrough of MITRE ATLAS using three documented real-world AI attack scenarios: the 2021 Microsoft Tay-class poisoning incident, the 2023 GPT function-calling prompt injection demonstrated by security researchers, and the 2024 supply chain attack against a popular open-source ML framework. I map each attack to specific ATLAS tactics and techniques, then propose countermeasures using the framework's recommended mitigations.

Read Writeup →

04

Tutorial Mar 05, 2026 · 25 min read

Building a Secure RAG Pipeline: Sanitizing Retrieval Outputs Before LLM Injection

Tags: RAG Security · Prompt Injection · LangChain · Python

Step-by-step tutorial for building a RAG system that treats retrieved document chunks as untrusted input. Covers indirect prompt injection via document content, how to sanitize retrieved text before injecting it into LLM context, and why the standard LangChain RetrievalQA chain is insecure by default. Includes working code with tests and a CVE walkthrough for a real RAG-based prompt injection in a popular open-source chatbot.

Code on GitHub → Read Tutorial →

05

Reproduction Feb 22, 2026 · 20 min read

Reproducing Madry et al. PGD Adversarial Training on CIFAR-10

Tags: PGD · Adversarial Training · Robustness · PyTorch

Motivated by the Madry et al. claim that PGD-based adversarial training yields "the most robust" classifiers against ℓ∞ attacks. I reproduced their CIFAR-10 results, then deliberately pushed the limits: what happens at larger epsilon budgets, with AutoAttack (the stronger successor to PGD), and with distribution-shifted test sets? The robustness-accuracy tradeoff is real and steeper than the paper implies for practical deployments.

GitHub Notebook → Read Writeup →

06

Experiment Feb 11, 2026 · 12 min read

How Many Jailbreak Attempts Before a Safe Model Breaks? — A Statistical Analysis

Tags: Jailbreaking · LLM Safety · Red Teaming · Statistical Analysis

I ran a structured experiment across GPT-4o, Claude 3.5 Sonnet, and Llama 3 70B using 50 jailbreak template variants per model across 10 sensitive topic categories. Measured attack success rate, average attempts required, and which categories were hardest to defend. Then modeled the empirical distribution of "attempts to first success" and estimated what it means for adversaries with different persistence thresholds. The results have practical implications for red team budgeting and safety layer design.

Read Writeup →

07

Analysis Jan 29, 2026 · 15 min read

Threat Modeling a Production LLM Chatbot: 14 Attack Surfaces, Ranked by Exploitability

Tags: Threat Modeling · LLM Security · STRIDE · Production Systems

Applied STRIDE to a production-equivalent LLM chatbot architecture — user-facing frontend, API gateway, memory/RAG layer, tool-use integrations, model serving, and logging — and enumerated 14 distinct attack surfaces. Ranked each by exploitability and potential impact, then mapped to OWASP Top 10 for LLMs and MITRE ATLAS. A practical reference for any team shipping an LLM-powered product and trying to right-size their security investments.

Read Analysis →

08

Tutorial Jan 14, 2026 · 30 min read

Implementing FGSM, PGD, and C&W Attacks in PyTorch — From Theory to Code

Tags: FGSM · PGD · Carlini-Wagner · PyTorch · Adversarial ML

A beginner-to-intermediate tutorial implementing the three most important adversarial attacks from scratch in PyTorch. No libraries — raw gradient computation, projected gradient descent, and the full Carlini-Wagner optimization loop. Annotated line-by-line with the underlying math, implementation gotchas, and tips for numerical stability. Tested on MNIST and CIFAR-10 with visualizations at each ε step.

Code on GitHub → Read Tutorial →

09

Experiment Jan 05, 2026 · 16 min read

Does Embedding-Based Anomaly Detection Actually Work on Synthetic Logs?

Tags: Log Analysis · Embeddings · Anomaly Detection · Evaluation

I generated 50,000 synthetic syslog entries across normal operations and 8 attack scenarios (brute force, lateral movement, data exfiltration, etc.), embedded them with three different models (text-embedding-ada-002, BGE-large, and e5-large), and evaluated whether k-NN anomaly detection in embedding space could reliably surface the attack logs. Spoiler: it works excellently for high-entropy anomalies and poorly for low-and-slow attacks — which is exactly the class that matters most.

Notebook → Read Writeup →

10

Analysis Dec 18, 2025 · 11 min read

OWASP Top 10 for LLMs — My Annotated Study Notes with Practical Examples

Tags: OWASP · LLM Security · Study Notes · Best Practices

My annotated study notes on the OWASP Top 10 for LLMs (2025 edition), with a concrete attack example and code-level mitigation for each of the 10 categories. Written for the developer audience who knows web security but is new to LLM-specific risks. Covers everything from prompt injection (LLM01) to model theft (LLM10) with realistic attack scenarios drawn from public incident reports and my own red-teaming experiments.

Read Notes →

11

Reproduction Dec 04, 2025 · 19 min read

Reproducing "Membership Inference Attacks Against Machine Learning Models" — Shokri et al., 2017

Tags: Membership Inference · Privacy · Shadow Models · PyTorch

Reproducing the foundational membership inference attack on ML models using the shadow model training approach. I extended the original paper with a comparison against more recent attack variants (LiRA and RMIA) and evaluated how differential privacy training (DP-SGD) affects attack success rates at different privacy budgets. The privacy-utility tradeoff results are troubling for anyone planning to just "add DP" and call it done.

GitHub Notebook → Read Writeup →

12

Tutorial Nov 20, 2025 · 28 min read

Setting Up a Secure LLM API Gateway: Rate-Limiting, Input Validation, and Audit Logging

Tags: API Security · FastAPI · LLM Security · Production · Python

End-to-end tutorial for building a production-grade API gateway in front of any LLM API. Covers: input length and content validation, semantic prompt injection detection, per-user rate-limiting with Redis, prompt/response audit logging to an append-only store, DLP scanning on outputs, and structured error responses that don't leak internal system prompts. Includes a Docker Compose setup for local testing with realistic load generation.

Code on GitHub → Read Tutorial →

Get Notified When I Publish