See how tiny perturbations fool ML classifiers. Interactive visual demo of adversarial examples, noise injection, and defense techniques including input preprocessing and spatial smoothing.
Select a pattern to classify, then add adversarial noise with the epsilon slider. Watch how a simulated classifier's confidence shifts as perturbation increases. The defense panel shows how preprocessing can restore correct classification.
Apply defense preprocessing to the adversarial input and see how it affects classification. Each defense modifies the input before the classifier sees it, potentially neutralizing the adversarial perturbation.
Compare single-step (FGSM) and iterative (PGD) attacks at the same epsilon. PGD is stronger but slower: it finds more effective perturbations by taking multiple gradient steps.
Adversarial examples are one of the most counterintuitive vulnerabilities in machine learning. By adding carefully calculated perturbations to an input, changes so small that they are often imperceptible to the human eye, an attacker can cause a well-trained model to make completely wrong predictions with high confidence. This is not a bug in a specific model but a general property of how neural networks learn to classify inputs. The decision boundaries in high-dimensional input spaces lie closer to legitimate data points than we intuitively expect, and gradient-based perturbations can cross those boundaries efficiently.
The interactive demo above simulates this concept using simple geometric pattern recognition. Real adversarial attacks operate on neural network gradients in high-dimensional spaces (images with thousands to millions of pixel values), but the principle is the same: small, structured noise added to a clean input shifts the model's internal representation enough to cross a decision boundary. The epsilon parameter controls the magnitude of this perturbation. At low values the changes are imperceptible; at high values the noise becomes visible, but the classification flips far more reliably.
The Fast Gradient Sign Method (FGSM), introduced by Goodfellow et al. in 2014, is one of the simplest and fastest adversarial attacks. It computes the gradient of the loss function with respect to the input image, then takes the sign of each gradient element, so that each pixel is perturbed by either +epsilon or -epsilon. The key insight is that neural networks behave almost linearly in a small neighborhood of the input, so a single step along the sign of the gradient is often enough to cross the decision boundary. The perturbation is computed as: x_adv = x + epsilon * sign(gradient_x(loss(model(x), y_true))). FGSM is fast (a single forward and backward pass) but not the strongest attack, because it does not refine the perturbation iteratively.
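The FGSM formula can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not a real attack: it assumes a toy logistic-regression "classifier" whose input gradient has a closed form, and the weights, input values, and helper names (predict, input_gradient, fgsm) are hypothetical. A real attack would obtain the gradient from a deep learning framework's automatic differentiation.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(w, b, x, y_true):
    """Gradient of the cross-entropy loss with respect to the input x.

    For logistic regression this has the closed form (p - y_true) * w.
    """
    p = predict(w, b, x)
    return [(p - y_true) * wi for wi in w]

def fgsm(w, b, x, y_true, epsilon):
    """x_adv = x + epsilon * sign(gradient), clipped to the pixel range [0, 1]."""
    grad = input_gradient(w, b, x, y_true)
    signs = [(g > 0) - (g < 0) for g in grad]
    return [min(1.0, max(0.0, xi + epsilon * s)) for xi, s in zip(x, signs)]

# Hypothetical weights and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.5, -1.0], 0.0
x, y_true = [0.6, 0.3, 0.5, 0.4], 1

x_adv = fgsm(w, b, x, y_true, epsilon=0.1)
print(round(predict(w, b, x), 3))      # clean input: above 0.5, class 1
print(round(predict(w, b, x_adv), 3))  # perturbed input: below 0.5, class 0
```

Every coordinate moves by at most epsilon, yet the predicted class changes, because each small change pushes the model's score in the same harmful direction.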
Projected Gradient Descent (PGD), introduced by Madry et al. in 2017, is an iterative extension of FGSM. Instead of taking a single large step, PGD takes many small steps and projects the result back onto the epsilon-ball after each one, so the total perturbation stays within bounds. This iterative optimization usually finds much stronger adversarial examples than FGSM at the same epsilon. PGD is considered the standard strong attack for evaluating adversarial robustness: if a model is robust to PGD, it is likely robust to most first-order attacks. The tradeoff is compute cost. PGD typically requires 10-100 forward/backward passes compared to FGSM's single pass.
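The step-then-project loop can be sketched as follows. This is a simplified illustration under stated assumptions: it reuses a hypothetical toy logistic-regression model, measures the epsilon-ball in the L-infinity norm, and starts from the clean input (the published method also uses a random starting point inside the ball, which is omitted here). On a purely linear model like this one, PGD ends up at the same point a single FGSM step would reach; its advantage shows up on nonlinear networks, where the gradient changes between steps.

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(w, b, x, y_true):
    """Closed-form gradient of the cross-entropy loss with respect to x."""
    p = predict(w, b, x)
    return [(p - y_true) * wi for wi in w]

def pgd(w, b, x, y_true, epsilon, step_size, num_steps):
    """Iterated signed gradient steps, projected onto the L-infinity epsilon-ball."""
    x_adv = list(x)
    for _ in range(num_steps):
        grad = input_gradient(w, b, x_adv, y_true)
        # Small signed step that increases the loss.
        x_adv = [xa + step_size * ((g > 0) - (g < 0)) for xa, g in zip(x_adv, grad)]
        # Projection: keep every coordinate within epsilon of the clean input.
        x_adv = [min(xi + epsilon, max(xi - epsilon, xa)) for xi, xa in zip(x, x_adv)]
        # Keep values inside the valid pixel range [0, 1].
        x_adv = [min(1.0, max(0.0, xa)) for xa in x_adv]
    return x_adv

# Hypothetical weights and input, chosen only for illustration.
w, b = [2.0, -3.0, 1.5, -1.0], 0.0
x, y_true = [0.6, 0.3, 0.5, 0.4], 1

x_adv = pgd(w, b, x, y_true, epsilon=0.1, step_size=0.025, num_steps=10)
print(predict(w, b, x) > 0.5)      # True: clean input is class 1
print(predict(w, b, x_adv) > 0.5)  # False: adversarial input is class 0
```

Each iteration needs one gradient evaluation, which is where the 10-100x cost relative to FGSM comes from.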
Input preprocessing defenses work by transforming the input before it reaches the classifier, removing or disrupting the adversarial perturbation in the process. JPEG compression discards high-frequency detail through lossy compression, which can destroy carefully structured perturbation patterns. Spatial smoothing (Gaussian blur) averages neighboring pixels, smearing out pixel-level perturbations. Bit-depth reduction quantizes pixel values to fewer levels, collapsing small perturbations into the same quantized bucket. Median filtering replaces each pixel with the median of its neighborhood, which is especially effective against salt-and-pepper style noise. Feature squeezing applies several such transformations and compares the model's predictions on the original and the squeezed input; a large disagreement flags the input as likely adversarial.
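Two of these defenses, bit-depth reduction and median filtering, are small enough to sketch directly. The example below is a simplification that assumes a single row of grayscale pixels with values in [0, 1]; real implementations work on 2-D images and usually rely on an image library. The function names and sample values are hypothetical.

```python
from statistics import median

def reduce_bit_depth(pixels, bits):
    """Quantize values in [0, 1] to 2**bits evenly spaced levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]

def median_filter_1d(pixels, radius=1):
    """Replace each pixel with the median of its neighborhood (window shrinks at the edges)."""
    n = len(pixels)
    return [median(pixels[max(0, i - radius):min(n, i + radius + 1)]) for i in range(n)]

# A clean binary pattern and a copy with small perturbations (at most 0.03 per pixel).
clean = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
perturbed = [0.03, 0.0, 0.97, 1.0, 0.03, 0.0]
restored = reduce_bit_depth(perturbed, bits=1)
print(restored == clean)  # True: the small perturbations fall into the original buckets

# A single outlier pixel is removed by the median filter.
spiky = [0.2, 0.2, 0.9, 0.2, 0.2]
print(median_filter_1d(spiky))  # [0.2, 0.2, 0.2, 0.2, 0.2]
```

Quantization only helps while the perturbation is smaller than half a quantization step, which is one reason a larger epsilon or an attacker who anticipates the transformation can defeat it.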
These preprocessing defenses are simple to implement and require no model retraining, but they have limitations. An adaptive attacker who knows which preprocessing is applied can craft adversarial examples that survive the transformation. This is why defense in depth, which combines multiple techniques, tends to be more effective than any single defense. The demo above lets you see how each defense individually affects the adversarial input and whether it restores correct classification.
The most effective empirical defense known is adversarial training, where the model is trained on both clean and adversarial examples. During training, PGD attacks are generated on each batch, and the model learns to classify both the clean and adversarial versions correctly. This produces models that are substantially more robust to perturbations within the epsilon-ball used during training. The cost is significant: adversarial training takes roughly 3-10 times longer than standard training and reduces clean accuracy, often by a few percentage points and sometimes by more on harder datasets. Adversarial training with PGD (the Madry defense) remains the gold standard, but it offers no guarantee against perturbations larger than the training epsilon.
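The training loop can be sketched on a toy problem. This is a minimal illustration with several assumptions that differ from the setup described above: the model is a two-feature logistic regression, the data set is hypothetical, and the inner attack is single-step FGSM rather than PGD (for a linear model, one signed step already yields the worst-case L-infinity perturbation, so the shortcut keeps the sketch short without weakening it).

```python
import math

def predict(w, b, x):
    """Probability of class 1 under a toy logistic-regression classifier."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def fgsm(w, b, x, y, epsilon):
    """Single-step attack: move each feature by epsilon along the loss gradient's sign."""
    p = predict(w, b, x)
    grad = [(p - y) * wi for wi in w]
    x_adv = [xi + epsilon * ((g > 0) - (g < 0)) for xi, g in zip(x, grad)]
    return [min(1.0, max(0.0, v)) for v in x_adv]

def adversarial_train(data, epsilon, lr=0.5, epochs=2000):
    """Full-batch gradient descent on clean examples plus freshly crafted adversarial ones."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        # Adversarial examples are regenerated against the current weights each epoch.
        batch = data + [(fgsm(w, b, x, y, epsilon), y) for x, y in data]
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, y in batch:
            err = predict(w, b, x) - y
            grad_w = [gw + err * xi for gw, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * gw / len(batch) for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / len(batch)
    return w, b

# Hypothetical two-feature data set: class 0 near (0.2, 0.2), class 1 near (0.8, 0.8).
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.2, 0.3], 0), ([0.3, 0.2], 0),
        ([0.8, 0.7], 1), ([0.7, 0.8], 1), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]

epsilon = 0.1
w, b = adversarial_train(data, epsilon)
robust = all((predict(w, b, fgsm(w, b, x, y, epsilon)) > 0.5) == (y == 1) for x, y in data)
print(robust)  # True: every example stays correctly classified under attack
```

Crafting a fresh attack for every example in every epoch is what makes adversarial training several times more expensive than standard training.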
Adversarial vulnerabilities matter most in safety-critical applications. Autonomous vehicles use neural networks for perception (object detection, lane detection, sign recognition), and adversarial patches have been demonstrated that cause stop signs to be misclassified as speed limit signs. A medical imaging system that misses a cancerous tumor because of an adversarial perturbation could have fatal consequences. Content moderation systems can be fooled by adversarial modifications that let harmful content slip past detection. Even financial ML models are vulnerable: adversarial manipulation of input features can fool fraud detection systems into approving fraudulent transactions.
For teams deploying ML models in production, the key question is: who controls the inputs? If users can submit arbitrary inputs to your model (image upload, text input, sensor data), adversarial attacks are a real threat. If inputs come from trusted internal systems, the risk is lower but supply chain attacks on upstream data sources remain possible. Use the ML Threat Model Generator to systematically evaluate adversarial risks in your specific deployment context, and the ML Security Checklist to verify your defenses are comprehensive.
One of the most surprising properties of adversarial examples is transferability: an adversarial example crafted to fool one model often fools a different model trained on the same task, even if the architectures are completely different. This enables black-box attacks where the attacker has no access to the target model's internals. The attacker trains a local substitute model, generates adversarial examples against it, and transfers them to the target. Transfer rates vary (typically 30-70% for untargeted attacks) but are high enough to be practically dangerous. Query-based black-box attacks can be even more effective. They use repeated API queries to estimate gradients numerically, without access to the model's weights or architecture, though they require thousands of queries and can be detected through rate limiting and query pattern analysis.
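The query-based approach can be sketched with central finite differences. In this illustration the BlackBox class is a hypothetical stand-in for a remote prediction API: the attack code only calls it and never reads its weights. The probe size delta and all values are assumptions chosen for the example.

```python
import math

class BlackBox:
    """Hypothetical stand-in for a remote API: returns only P(class 1) and counts queries."""
    def __init__(self):
        self._w, self._b = [2.0, -3.0, 1.5, -1.0], 0.0  # hidden from the attacker
        self.queries = 0

    def __call__(self, x):
        self.queries += 1
        z = sum(wi * xi for wi, xi in zip(self._w, x)) + self._b
        return 1.0 / (1.0 + math.exp(-z))

def estimate_gradient(query, x, delta=1e-3):
    """Central finite differences: two queries per input dimension."""
    grad = []
    for i in range(len(x)):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += delta
        x_minus[i] -= delta
        grad.append((query(x_plus) - query(x_minus)) / (2 * delta))
    return grad

model = BlackBox()
x = [0.6, 0.3, 0.5, 0.4]  # input the model assigns to class 1
epsilon = 0.1

grad = estimate_gradient(model, x)
queries_used = model.queries  # 8 queries for 4 input dimensions

# Step against the estimated gradient to lower the class-1 probability.
x_adv = [min(1.0, max(0.0, xi - epsilon * ((g > 0) - (g < 0)))) for xi, g in zip(x, grad)]
print(queries_used)
print(model(x) > 0.5, model(x_adv) > 0.5)  # True False
```

Because the cost is two queries per input dimension, a naive estimate on a real image would need far more queries than this toy example, which is why practical attacks use more query-efficient estimators and why rate limiting is a meaningful obstacle.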
Adversarial examples are inputs intentionally designed to cause a machine learning model to make incorrect predictions. They are created by adding small, often imperceptible perturbations to legitimate inputs. For image classifiers, this means modifying pixel values by tiny amounts that are invisible to humans but cause the model to misclassify with high confidence. The vulnerability exists because neural network decision boundaries in high-dimensional spaces are closer to data points than intuition suggests.
FGSM (Fast Gradient Sign Method) is a single-step adversarial attack. It computes the gradient of the loss function with respect to the input, then perturbs each pixel by +epsilon or -epsilon in the gradient's sign direction. It exploits the local linearity of neural networks to cross the decision boundary in one step. FGSM is computationally cheap (one forward + backward pass) but produces weaker adversarial examples than iterative methods like PGD.
Key defenses include adversarial training (training on adversarial examples), input preprocessing (JPEG compression, spatial smoothing, bit-depth reduction, median filtering), certified defenses like randomized smoothing, ensemble methods, and input validation. No single defense is perfect: adaptive attackers have found ways around most individual techniques. Defense in depth, combining multiple techniques, provides the best protection against both known and unknown attacks.
Adversarial attacks are a real-world threat, especially for safety-critical systems. Physical adversarial examples have been demonstrated against autonomous vehicles (patches on stop signs), facial recognition (adversarial glasses and makeup), malware classifiers, and content moderation systems. Any ML system that accepts untrusted inputs in production should evaluate adversarial robustness as part of its security posture.
White-box attacks assume full access to the model (architecture, weights, gradients) and use this information to craft optimal perturbations (FGSM, PGD, C&W). Black-box attacks work without model internals, using only query access. Black-box methods include transfer attacks (crafting adversarial examples on a substitute model), query-based attacks (estimating gradients from output differences), and decision-based attacks (using only the final label). Black-box attacks reflect a more realistic threat model, but they typically need many queries to the target or a well-matched substitute model.