Adversarial Attack & Defense Demo

Understanding Adversarial Attacks on ML Models

Adversarial examples are one of the most counterintuitive vulnerabilities in machine learning. By adding carefully calculated perturbations to an input — changes so small they are invisible to the human eye — an attacker can cause a well-trained model to make completely wrong predictions with high confidence. This is not a bug in a specific model but a fundamental property of how neural networks learn to classify inputs. The decision boundaries in high-dimensional input spaces are closer to legitimate data points than we intuitively expect, and gradient-based perturbations can efficiently cross those boundaries.

The interactive demo above simulates this concept using simple geometric pattern recognition. Real adversarial attacks operate on neural network gradients in high-dimensional spaces (images with hundreds of thousands to millions of pixel values), but the underlying principle is the same: small, structured noise added to a clean input shifts the model's internal representation enough to cross a decision boundary. The epsilon parameter controls the magnitude of this perturbation. At low values the changes are imperceptible; at high values the noise becomes visible, but the classification flips far more reliably.

How FGSM Works

The Fast Gradient Sign Method (FGSM), introduced by Goodfellow et al. in 2014, is one of the simplest and fastest adversarial attacks. It computes the gradient of the loss function with respect to the input image, then takes the sign of each gradient element, so each pixel is perturbed by either +epsilon or -epsilon. The key insight is that neural networks behave approximately linearly in a small neighborhood of the input, so a single step along the sign of the gradient is often enough to cross the decision boundary. The perturbation is computed as: x_adv = x + epsilon * sign(gradient_x(loss(model(x), y_true))). FGSM is fast (a single forward and backward pass) but is not the strongest attack, because it does not optimize the perturbation iteratively.
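The formula above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: it assumes a toy logistic-regression classifier (so the input gradient has a closed form), inputs scaled to [0, 1], and made-up weights and input values. The helper names (`predict_proba`, `loss_gradient_wrt_input`, `fgsm`) are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Probability of class 1 under a toy logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def loss_gradient_wrt_input(w, b, x, y_true):
    """Gradient of binary cross-entropy with respect to the input x.

    For logistic regression this has the closed form (p - y) * w.
    """
    p = predict_proba(w, b, x)
    return [(p - y_true) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y_true, epsilon):
    """One FGSM step: x + epsilon * sign(grad_x loss), clipped to [0, 1]."""
    grad = loss_gradient_wrt_input(w, b, x, y_true)
    x_adv = [xi + epsilon * sign(g) for xi, g in zip(x, grad)]
    return [min(1.0, max(0.0, v)) for v in x_adv]

# Hypothetical weights and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.5, -1.0], 0.1
x, y_true = [0.6, 0.3, 0.5, 0.4], 1

x_adv = fgsm(w, b, x, y_true, epsilon=0.15)
print("clean p(class 1):", predict_proba(w, b, x))
print("adversarial p(class 1):", predict_proba(w, b, x_adv))
```

With these illustrative numbers, the clean input is classified as class 1 and the perturbed input falls on the other side of the 0.5 threshold, even though no feature moved by more than epsilon. For a real network the gradient would come from automatic differentiation rather than a closed form.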

How PGD Works

Projected Gradient Descent (PGD), established as a standard attack by Madry et al. in 2017, is an iterative extension of FGSM. Instead of taking a single large step, PGD takes many small steps, projecting back onto the epsilon-ball around the original input after each step so that the total perturbation stays within bounds. This iterative optimization typically finds much stronger adversarial examples than FGSM at the same epsilon. PGD is widely used as the standard strong attack for evaluating adversarial robustness: a model that is empirically robust to PGD is likely to be robust to most other first-order attacks. The tradeoff is compute cost, since PGD typically requires 10-100 forward/backward passes compared to FGSM's single pass.
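The step-then-project loop can be sketched as follows. This is an illustrative sketch under stated assumptions: an L-infinity epsilon-ball, inputs in [0, 1], and a caller-supplied `grad_fn` that returns the loss gradient with respect to the input. The logistic-regression model, its weights, and the step size `alpha` are hypothetical values used only to make the example runnable.

```python
import math

def sign(v):
    return (v > 0) - (v < 0)

def pgd(grad_fn, x, epsilon, alpha, steps):
    """L-infinity PGD: repeated signed-gradient steps with projection.

    grad_fn(x) must return the gradient of the loss with respect to x.
    """
    x_adv = list(x)
    for _ in range(steps):
        grad = grad_fn(x_adv)
        # Small signed step in the direction that increases the loss.
        x_adv = [xi + alpha * sign(g) for xi, g in zip(x_adv, grad)]
        # Project back onto the epsilon-ball around the original input.
        x_adv = [min(x0 + epsilon, max(x0 - epsilon, xi))
                 for xi, x0 in zip(x_adv, x)]
        # Keep values inside the valid input range [0, 1].
        x_adv = [min(1.0, max(0.0, xi)) for xi in x_adv]
    return x_adv

# Hypothetical logistic-regression model, used only to supply a gradient.
w, b = [2.0, -3.0, 1.5, -1.0], 0.1

def predict_proba(x):
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def grad_fn(x, y_true=1):
    # Closed-form input gradient of binary cross-entropy: (p - y) * w.
    p = predict_proba(x)
    return [(p - y_true) * wi for wi in w]

x = [0.6, 0.3, 0.5, 0.4]
x_adv = pgd(grad_fn, x, epsilon=0.15, alpha=0.03, steps=10)
print("clean p(class 1):", predict_proba(x))
print("adversarial p(class 1):", predict_proba(x_adv))
```

Because this toy model is linear, the gradient sign never changes and PGD ends up at the same corner of the epsilon-ball that a single FGSM step would reach. The advantage of iterating shows up on nonlinear models, where the gradient direction shifts as the input moves.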

Defense Techniques Explained

Input preprocessing defenses work by transforming the input before it reaches the classifier, removing or disrupting the adversarial perturbation in the process. JPEG compression discards high-frequency detail through lossy compression, which can disrupt carefully structured perturbation patterns. Spatial smoothing (Gaussian blur) averages neighboring pixels, smearing out pixel-level perturbations. Bit-depth reduction quantizes pixel values to fewer levels, collapsing small perturbations into the same quantized bucket. Median filtering replaces each pixel with the median of its neighborhood, which is effective at removing salt-and-pepper style noise. Feature squeezing builds on these ideas: it applies squeezers such as bit-depth reduction and smoothing, and can flag an input as suspicious when the model's prediction on the squeezed input differs sharply from its prediction on the original.
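Two of these transformations, bit-depth reduction and median filtering, are simple enough to sketch directly. The example below is a minimal illustration that assumes a grayscale image stored as a list of rows with pixel values in [0, 1]. The function names and the 3x3 patch are hypothetical, and the sketch says nothing about how well these filters hold up against an adaptive attacker.

```python
from statistics import median

def reduce_bit_depth(image, bits):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    steps = 2 ** bits - 1
    return [[round(p * steps) / steps for p in row] for row in image]

def median_filter(image, radius=1):
    """Replace each pixel with the median of its square neighborhood.

    Neighborhoods are truncated at the image border.
    """
    h, w = len(image), len(image[0])
    out = []
    for r in range(h):
        row = []
        for c in range(w):
            window = [image[rr][cc]
                      for rr in range(max(0, r - radius), min(h, r + radius + 1))
                      for cc in range(max(0, c - radius), min(w, c + radius + 1))]
            row.append(median(window))
        out.append(row)
    return out

# Hypothetical 3x3 grayscale patch and a small checkerboard perturbation of it.
clean = [[0.20, 0.20, 0.80],
         [0.20, 0.20, 0.80],
         [0.20, 0.20, 0.80]]
perturbed = [[p + 0.03 if (r + c) % 2 == 0 else p - 0.03
              for c, p in enumerate(row)]
             for r, row in enumerate(clean)]

squeezed_clean = reduce_bit_depth(clean, bits=2)
squeezed_perturbed = reduce_bit_depth(perturbed, bits=2)
print("perturbation removed by quantization:", squeezed_clean == squeezed_perturbed)
```

In this toy case the perturbation of 0.03 is smaller than half a quantization step, so both images collapse to the same quantized values. A larger perturbation, or one placed near a bucket boundary, would survive quantization.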

These preprocessing defenses are simple to implement and require no model retraining, but they have limitations. An adaptive attacker who knows the preprocessing is applied can craft adversarial examples that survive the transformation. This is why defense in depth — combining multiple techniques — is more effective than any single defense. The demo above lets you see how each defense individually affects the adversarial input and whether it restores correct classification.

Adversarial Training

One of the most effective empirical defenses known is adversarial training, where the model is trained on both clean and adversarial examples. During training, PGD attacks are generated on each batch against the current model weights, and the model learns to classify both the clean and adversarial versions correctly. This produces models that are substantially more robust to perturbations within the epsilon-ball used during training. The cost is significant: adversarial training typically takes 3-10 times longer than standard training, depending on the number of PGD steps, and it reduces clean accuracy, often by several percentage points and sometimes more on harder datasets or at larger epsilon. Adversarial training with PGD (the Madry defense) remains the gold standard, but it only provides robustness up to the epsilon used during training.
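The training loop described above can be sketched on a toy problem. To keep the example short, this sketch substitutes a single signed-gradient (FGSM-style) step for multi-step PGD and trains a logistic-regression model with plain SGD. The dataset, its [-1, 1] feature range, the learning rate, and all function names are assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    # Clamp to avoid overflow in math.exp for extreme logits.
    z = max(-60.0, min(60.0, z))
    return 1.0 / (1.0 + math.exp(-z))

def predict(w, b, x):
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, epsilon):
    """Single signed-gradient step; the input gradient of the loss is (p - y) * w."""
    p = predict(w, b, x)
    return [xi + epsilon * sign((p - y) * wi) for xi, wi in zip(x, w)]

def sgd_step(w, b, x, y, lr):
    """One SGD update on binary cross-entropy for a single example."""
    err = predict(w, b, x) - y
    w = [wi - lr * err * xi for wi, xi in zip(w, x)]
    return w, b - lr * err

def adversarial_train(data, epsilon, epochs=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    data = list(data)  # Copy so the caller's list is not reordered.
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            # Train on the clean example, then on an adversarial version
            # regenerated against the current weights.
            w, b = sgd_step(w, b, x, y, lr)
            w, b = sgd_step(w, b, fgsm(w, b, x, y, epsilon), y, lr)
    return w, b

# Hypothetical two-class dataset: (features, label).
data = [([-1.0, -0.8], 0), ([-0.7, -1.1], 0), ([-0.9, -0.6], 0),
        ([0.8, 1.0], 1), ([1.1, 0.7], 1), ([0.6, 0.9], 1)]

w, b = adversarial_train(data, epsilon=0.3)
robust = all((predict(w, b, fgsm(w, b, x, y, 0.3)) > 0.5) == (y == 1)
             for x, y in data)
print("all adversarial examples classified correctly:", robust)
```

The important detail is that adversarial examples are regenerated at every step against the current weights, which is also where the extra training cost comes from. In a real setup the inner attack would be multi-step PGD and robustness would be measured on held-out data, not the training set.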

Real-World Implications

Adversarial vulnerabilities matter most in safety-critical applications. Autonomous vehicles use neural networks for perception (object detection, lane detection, sign recognition), and adversarial patches have been demonstrated that cause stop signs to be misclassified as speed limit signs. Medical imaging systems that miss cancerous tumors due to adversarial perturbations could have fatal consequences. Content moderation systems can be fooled by adversarial modifications to bypass harmful content detection. Even financial ML models are vulnerable: adversarial manipulation of input features can fool fraud detection systems into approving fraudulent transactions.

For teams deploying ML models in production, the key question is: who controls the inputs? If users can submit arbitrary inputs to your model (image upload, text input, sensor data), adversarial attacks are a real threat. If inputs come from trusted internal systems, the risk is lower, but supply chain attacks on upstream data sources remain possible. Use the ML Threat Model Generator to systematically evaluate adversarial risks in your specific deployment context, and the ML Security Checklist to verify that your defenses are comprehensive.

Transferability and Black-Box Attacks

One of the most surprising properties of adversarial examples is transferability: an adversarial example crafted to fool one model often fools a different model trained on the same task, even if the architectures are completely different. This enables black-box attacks where the attacker has no access to the target model's internals. The attacker trains a local substitute model, generates adversarial examples against it, and transfers them to the target. Transfer rates vary (typically 30-70% for untargeted attacks) but are high enough to be practically dangerous. Query-based black-box attacks can be even more effective: they use repeated API queries to estimate gradients numerically, with no access to model internals. However, they often require thousands of queries per example, so they can be slowed by rate limiting and detected through query pattern analysis.

Frequently Asked Questions

What are adversarial examples in machine learning?

Adversarial examples are inputs intentionally designed to cause a machine learning model to make incorrect predictions. They are created by adding small, often imperceptible perturbations to legitimate inputs. For image classifiers, this means modifying pixel values by tiny amounts that are invisible to humans but cause the model to misclassify with high confidence. The vulnerability exists because neural network decision boundaries in high-dimensional spaces are closer to data points than intuition suggests.

What is the FGSM attack and how does it work?

FGSM (Fast Gradient Sign Method) is a single-step adversarial attack. It computes the gradient of the loss function with respect to the input, then perturbs each pixel by +epsilon or -epsilon in the gradient's sign direction. It exploits the local linearity of neural networks to cross the decision boundary in one step. FGSM is computationally cheap (one forward + backward pass) but produces weaker adversarial examples than iterative methods like PGD.

How do you defend against adversarial attacks on ML models?

Key defenses include adversarial training (training on adversarial examples), input preprocessing (JPEG compression, spatial smoothing, bit-depth reduction, median filtering), certified defenses like randomized smoothing, ensemble methods, and input validation. No single defense is perfect — adaptive attackers can bypass any individual technique. Defense in depth combining multiple techniques provides the best protection against both known and unknown attacks.

Are adversarial attacks a real-world threat or just academic?

Adversarial attacks are a real-world threat, especially for safety-critical systems. Physical adversarial examples have been demonstrated against autonomous vehicles (patches on stop signs), facial recognition (adversarial glasses and makeup), malware classifiers, and content moderation systems. Any ML system that accepts untrusted inputs in production should evaluate adversarial robustness as part of its security posture.

What is the difference between white-box and black-box adversarial attacks?

White-box attacks assume full access to the model (architecture, weights, gradients) and use this information to craft optimal perturbations (FGSM, PGD, C&W). Black-box attacks work without model internals, using only query access. Black-box methods include transfer attacks (crafting adversarial examples on a substitute model), query-based attacks (estimating gradients from output differences), and decision-based attacks (using only the final label). Black-box attacks reflect a more realistic threat model, but query-based and decision-based variants can require thousands of queries.

Michael Lip

Related Tools

ML Threat Model Generator

ML Privacy Techniques

ML Security Checklist
