Data Poisoning Detection

Q: How does Isolation Forest detect poisoned data points?

Isolation Forest works by randomly selecting features and split values to isolate data points in a tree structure. Anomalous points (including poisoned samples) are easier to isolate because they differ from the majority of the data, requiring fewer random splits. The algorithm assigns an anomaly score based on the average path length to isolate each point across multiple random trees. Points with shorter average path lengths are flagged as anomalies. Isolation Forest is effective for poisoning detection because poisoned samples typically lie in sparse regions of the feature space.

Q: What statistical tests can detect distribution shifts from data poisoning?

The Kolmogorov-Smirnov test compares cumulative distribution functions between clean and potentially poisoned datasets. The chi-squared test detects changes in categorical feature distributions. The t-test identifies shifts in feature means, and the Mann-Whitney U test identifies shifts in location without assuming normality. The Anderson-Darling test checks whether data follows an expected distribution. Population Stability Index (PSI) measures overall distribution drift. For multivariate detection, Mahalanobis distance identifies points that are outliers once feature correlations are taken into account. Combining multiple tests provides more robust detection than any single method.
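As an illustration of one of these measures, the following is a minimal PSI sketch using only the Python standard library. The equal-width binning over the reference range, the bin count, and the small `eps` floor for empty bins are assumptions of this sketch, not settings taken from the tool.

```python
import math
import random

def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Bin edges are equal-width over the reference range, and eps avoids
    log(0) for empty bins. Both choices are assumptions of this sketch.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp values outside the reference range into the edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        return [max(c / len(values), eps) for c in counts]

    p_ref, p_cur = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(p_ref, p_cur))

# Toy data: a reference feature and a copy shifted by one standard deviation.
rng = random.Random(0)
reference = [rng.gauss(0, 1) for _ in range(2000)]
shifted = [v + 1.0 for v in reference]
print(f"PSI vs itself:  {psi(reference, reference):.4f}")
print(f"PSI vs shifted: {psi(reference, shifted):.4f}")
```

A PSI of zero means the binned proportions are identical; larger values indicate more drift. The threshold that should trigger a review depends on the dataset and is best calibrated on known-clean batches.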

Q: How do you detect label flipping attacks in training data?

Label flip detection uses several approaches: confident learning identifies samples where the model's prediction strongly disagrees with the provided label across multiple training runs. Cross-validation filtering trains on subsets and flags samples consistently misclassified by models trained without them. Feature-space analysis examines whether a sample's features are more consistent with a different class than its assigned label. Influence function analysis identifies samples that disproportionately affect model parameters when included or excluded from training. The cleanlab library implements many of these techniques.

Q: What percentage of poisoned data can these detection methods catch?

Detection rates depend on the poisoning strategy and the percentage of poisoned samples. For random label flipping affecting 5-10% of data, statistical methods typically detect 85-95% of poisoned samples with 5-15% false positive rates. For targeted backdoor attacks affecting 1-3% of data, Isolation Forest and spectral analysis catch 60-80% of poisoned samples. Clean-label attacks are hardest to detect, with typical detection rates of 40-60%. Combining multiple detection methods (ensemble approach) improves overall detection to 80-95% for most attack types while reducing false positives.
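The two quantities quoted above, detection rate and false positive rate, can be computed from ground-truth poisoning labels and detector flags as in this short sketch. The function name and the toy data are hypothetical; they only illustrate the definitions.

```python
def detection_metrics(is_poisoned, is_flagged):
    """Return (detection_rate, false_positive_rate) from two boolean lists."""
    tp = sum(p and f for p, f in zip(is_poisoned, is_flagged))
    fp = sum((not p) and f for p, f in zip(is_poisoned, is_flagged))
    n_poisoned = sum(is_poisoned)
    n_clean = len(is_poisoned) - n_poisoned
    # Detection rate: share of poisoned samples that were flagged.
    detection_rate = tp / n_poisoned if n_poisoned else 0.0
    # False positive rate: share of clean samples that were flagged.
    false_positive_rate = fp / n_clean if n_clean else 0.0
    return detection_rate, false_positive_rate

# Hypothetical example: 100 samples, the first 10 poisoned.
truth = [i < 10 for i in range(100)]
# A detector that flags 9 of the poisoned samples and 9 clean ones.
flags = [i < 9 or 50 <= i < 59 for i in range(100)]
rate, fpr = detection_metrics(truth, flags)
print(f"detection rate={rate:.2f}, false positive rate={fpr:.2f}")
```

Measuring both numbers requires a dataset where the poisoned samples are known, which in practice usually means simulating an attack on clean data.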

Understanding Data Poisoning Attacks

Data poisoning is one of the most insidious attacks against machine learning systems because it happens before the model is even trained. Unlike adversarial attacks that target a deployed model at inference time, data poisoning corrupts the training process itself. An attacker who can manipulate even a small percentage of training data, often as little as 1-5%, can cause the resulting model to exhibit systematically incorrect behavior while passing standard evaluation metrics on clean test sets.

There are several distinct data poisoning strategies, each requiring different detection approaches. Label flipping is the simplest: the attacker changes correct labels to incorrect ones, causing the model to learn wrong associations. For a spam classifier, this means labeling spam emails as legitimate or vice versa. Outlier injection inserts data points that lie far from the normal data distribution, pulling the model's decision boundaries in unintended directions. Backdoor poisoning is more sophisticated: the attacker inserts samples with a specific trigger pattern (a pixel patch in images, a specific phrase in text) labeled with a target class, so the deployed model misclassifies any input containing the trigger while behaving normally otherwise.
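To evaluate a detector, it helps to have a dataset where the poisoned samples are known. The sketch below simulates random label flipping on a toy label list so that detection results can be scored against ground truth. The function name, parameters, and data are hypothetical and exist only for illustration.

```python
import random

def flip_labels(labels, fraction, n_classes, seed=0):
    """Return (poisoned_labels, flipped_indices) with `fraction` of labels changed."""
    rng = random.Random(seed)
    n_flip = int(round(fraction * len(labels)))
    flipped = sorted(rng.sample(range(len(labels)), n_flip))
    poisoned = list(labels)
    for i in flipped:
        # Pick any class other than the original one.
        choices = [c for c in range(n_classes) if c != labels[i]]
        poisoned[i] = rng.choice(choices)
    return poisoned, flipped

labels = [i % 2 for i in range(200)]  # toy binary labels
poisoned, flipped = flip_labels(labels, fraction=0.05, n_classes=2)
print(len(flipped), "labels flipped")
```

The returned `flipped` indices serve as the ground truth when measuring how many simulated flips a detector recovers.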

Isolation Forest for Anomaly Detection

Isolation Forest is one of the most effective algorithms for detecting poisoned data points. Its core insight is that anomalous points are easier to isolate than normal points. The algorithm builds multiple random decision trees, each on a random subsample of the data. At every node, a tree picks a random feature and a random split value between that feature's minimum and maximum among the points in the node. Normal data points, which cluster together, require many random splits to isolate. Anomalous points, which sit in sparse regions, can be isolated with very few splits.

The anomaly score for each data point is derived from its average path length across all trees in the forest, normalized against the expected path length for a dataset of the same size, so that shorter paths produce higher scores. Points with anomaly scores above a threshold (typically the 90th or 95th percentile) are flagged as suspicious. The demo above simulates this process: when you run detection, the Isolation Forest scores are visualized as a histogram where you can see the distribution of scores for clean data (clustered at lower scores) versus poisoned data (distributed toward higher scores).
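The scoring described above can be sketched in plain Python. This is a small illustrative implementation, not the code behind the demo: the tree count, subsample size, height limit, toy data, and all function names are assumptions of the sketch. The score follows the usual Isolation Forest form, 2 raised to the negative ratio of average path length to the expected path length c(n), so it lies between 0 and 1 and is higher for points that isolate quickly.

```python
import math
import random

EULER_GAMMA = 0.5772156649

def c_factor(n):
    """Expected path length for n points (used for normalization)."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + EULER_GAMMA) - 2.0 * (n - 1) / n

def build_tree(points, depth, max_depth, rng):
    # Stop when the node is isolated or the height limit is reached.
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    feature = rng.randrange(len(points[0]))
    lo = min(p[feature] for p in points)
    hi = max(p[feature] for p in points)
    if lo == hi:
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[feature] < split]
    right = [p for p in points if p[feature] >= split]
    return {"feature": feature, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}

def path_length(point, node, depth=0):
    if "size" in node:
        # Leaf that may still hold several points: add the expected extra depth.
        return depth + c_factor(node["size"])
    child = "left" if point[node["feature"]] < node["split"] else "right"
    return path_length(point, node[child], depth + 1)

def anomaly_scores(data, n_trees=100, sample_size=256, seed=0):
    """Score every point; assumes at least two data points."""
    rng = random.Random(seed)
    sample_size = min(sample_size, len(data))
    max_depth = math.ceil(math.log2(sample_size))
    trees = [build_tree(rng.sample(data, sample_size), 0, max_depth, rng)
             for _ in range(n_trees)]
    norm = c_factor(sample_size)
    return [2.0 ** (-sum(path_length(p, t) for t in trees) / (n_trees * norm))
            for p in data]

# Toy data: a 2-D Gaussian cluster plus three injected far-away points.
rng = random.Random(1)
clean = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(300)]
outliers = [(8.0, 8.0), (-9.0, 7.5), (10.0, -8.0)]
scores = anomaly_scores(clean + outliers)
threshold = sorted(scores)[int(0.95 * len(scores))]  # 95th percentile cut
flagged = [i for i, s in enumerate(scores) if s >= threshold]
print("flagged indices:", flagged)
```

For real workloads, a maintained implementation such as scikit-learn's `IsolationForest` is the more practical choice. Note that its `score_samples` method uses the opposite sign convention, where lower values are more abnormal.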

Distribution Shift Analysis

Distribution shift analysis compares the statistical properties of the potentially poisoned dataset against a known-clean reference distribution. The Kolmogorov-Smirnov (KS) test is the primary tool: it measures the maximum difference between the cumulative distribution functions of two samples. If the KS statistic exceeds the critical value for your significance level, the null hypothesis (that both samples come from the same distribution) is rejected. A rejection shows that the dataset differs from the reference, which is consistent with data manipulation but can also result from benign drift, so it calls for investigation rather than proving an attack.
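The two-sample KS statistic and its critical value can be computed directly, as in this standard-library sketch. The critical value uses the common large-sample approximation, and the toy data and function names are assumptions made for illustration.

```python
import math
import random

def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance past every value equal to x in both samples (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def ks_critical_value(n, m, alpha=0.05):
    """Large-sample approximation of the two-sample critical value."""
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    return c_alpha * math.sqrt((n + m) / (n * m))

# Toy data: a clean reference feature and a batch whose mean has moved by 0.5.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(500)]
batch = [rng.gauss(0.5, 1.0) for _ in range(500)]
d = ks_statistic(reference, batch)
crit = ks_critical_value(len(reference), len(batch))
print(f"D={d:.3f}, critical value={crit:.3f}, reject null: {d > crit}")
```

In practice `scipy.stats.ks_2samp` provides the same statistic together with a p-value, which avoids relying on the approximation for small samples.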

For multivariate data, single-feature KS tests may miss poisoning that affects feature correlations without changing marginal distributions. The Mahalanobis distance addresses this by measuring how far each point is from the data centroid in terms of the covariance structure. Points with high Mahalanobis distance have unusual feature combinations even if each individual feature value appears normal. The Population Stability Index (PSI) provides an overall measure of distribution drift between time periods, which is particularly useful for detecting gradual poisoning over time.
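The effect of the covariance structure is easiest to see with two correlated features. The sketch below is restricted to two dimensions so that the covariance matrix can be inverted in closed form; the function name and toy data are hypothetical. The injected point (2, -2) has unremarkable values on each axis but violates the strong positive correlation between the features.

```python
import math
import random

def mahalanobis_2d(points):
    """Mahalanobis distance of each 2-D point from the sample centroid."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy ** 2
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form d^T S^-1 d using the closed-form 2x2 inverse.
        q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(math.sqrt(q))
    return distances

# Toy data: the second feature closely tracks the first.
rng = random.Random(0)
data = []
for _ in range(500):
    x = rng.gauss(0, 1)
    data.append((x, x + rng.gauss(0, 0.2)))
data.append((2.0, -2.0))  # normal on each axis, abnormal in combination

dist = mahalanobis_2d(data)
worst = dist.index(max(dist))
print(f"largest distance {dist[worst]:.1f} at index {worst}")
```

For more than two features the same quadratic form applies, with the inverse covariance matrix computed numerically, for example with NumPy.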

Label Flip Detection

Label flip attacks are detected by identifying samples whose features are inconsistent with their assigned labels. Confident learning, implemented in the cleanlab library, collects out-of-sample prediction probabilities for every sample (typically through cross-validation) and flags samples where the model's confident prediction disagrees with the given label. If a sample is consistently predicted as class A by models trained on the rest of the data, but it is labeled as class B, it is likely a label flip.
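The core idea, comparing each label with a prediction from a model that never saw that sample, can be sketched as follows. This is a simplified stand-in and not cleanlab's algorithm: a nearest-centroid classifier replaces the real model, and the `min_margin` parameter, function name, and toy data are all hypothetical.

```python
import math
import random

def out_of_fold_label_issues(features, labels, n_folds=5, min_margin=0.5, seed=0):
    """Flag samples whose held-out prediction disagrees with the given label.

    `min_margin` is the required relative distance gap, between 0 and 1.
    """
    rng = random.Random(seed)
    order = list(range(len(features)))
    rng.shuffle(order)
    folds = [order[k::n_folds] for k in range(n_folds)]
    classes = sorted(set(labels))
    flagged = []
    for held_out in folds:
        held = set(held_out)
        # Fit class centroids on every sample outside the held-out fold.
        centroids = {}
        for c in classes:
            rows = [features[i] for i in order if i not in held and labels[i] == c]
            centroids[c] = [sum(col) / len(rows) for col in zip(*rows)]
        for i in held_out:
            dist = {c: math.dist(features[i], centroids[c]) for c in classes}
            predicted = min(dist, key=dist.get)
            if predicted == labels[i]:
                continue
            # A margin near 1 means the sample is far closer to the predicted class.
            margin = 1.0 - dist[predicted] / dist[labels[i]]
            if margin >= min_margin:
                flagged.append(i)
    return sorted(flagged)

# Toy data: two well-separated clusters, then three labels flipped on purpose.
rng = random.Random(42)
features = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(100)]
features += [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(100)]
labels = [0] * 100 + [1] * 100
for i in (3, 57, 120):
    labels[i] = 1 - labels[i]

print(out_of_fold_label_issues(features, labels))
```

On real data the classes overlap, so a probabilistic model and per-class confidence thresholds, as used by confident learning, give far better results than this distance heuristic.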

The influence function approach takes a different angle: it measures how much each training sample affects the model's parameters and predictions. Samples with abnormally high influence (their inclusion or exclusion dramatically changes model behavior) are flagged as suspicious. This approach is more computationally expensive but can detect subtle poisoning that confident learning misses, particularly in cases where the poisoned labels create internally consistent clusters.
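Influence functions approximate the effect of removing a sample without retraining the model. For a model cheap enough to refit, the quantity they approximate can be computed directly by leave-one-out retraining, which is what this sketch does for a one-feature least-squares line. It is a brute-force illustration of the idea, not an influence function implementation, and the names and data are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def leave_one_out_influence(xs, ys):
    """Absolute change in slope when each sample is removed and the line refit."""
    full_slope, _ = fit_line(xs, ys)
    influence = []
    for i in range(len(xs)):
        slope, _ = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        influence.append(abs(slope - full_slope))
    return influence

# Toy data on the line y = 2x + 1, with the last target value poisoned.
xs = [float(x) for x in range(20)]
ys = [2.0 * x + 1.0 for x in xs]
ys[19] = -40.0

influence = leave_one_out_influence(xs, ys)
most_influential = influence.index(max(influence))
print(f"most influential sample: {most_influential}")
```

Leave-one-out retraining scales linearly with the number of samples times the training cost, which is why approximations are needed for larger models.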

Detection Techniques Compared

No single detection method is sufficient for all poisoning strategies. Isolation Forest excels at detecting outlier injection attacks where poisoned points lie in sparse feature regions, but it struggles with clean-label attacks where the poisoned points are correctly labeled and lie within normal feature ranges. KS tests and distribution shift analysis are effective when the poisoning affects enough samples to shift the overall distribution (typically above 3-5% poisoning rate) but miss targeted attacks on individual samples. Label flip detection catches mislabeled samples but does not detect correctly-labeled poisoning.

The ensemble approach used in this tool combines all four detection methods and flags a sample as suspicious if it is identified by two or more methods. This significantly reduces false positives while maintaining high detection rates. In practice, the ensemble approach achieves 80-95% detection rates for most attack types with false positive rates below 10%. The visualization above shows you exactly which samples each method flags, so you can see where the methods agree and disagree.
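The two-or-more voting rule can be expressed in a few lines. The method names and flagged indices below are hypothetical placeholders, not output from the tool.

```python
from collections import Counter

def ensemble_flags(flags_by_method, min_votes=2):
    """Return sorted sample indices flagged by at least `min_votes` methods."""
    votes = Counter()
    for flagged in flags_by_method.values():
        # Use a set so a method cannot vote twice for the same sample.
        votes.update(set(flagged))
    return sorted(i for i, v in votes.items() if v >= min_votes)

# Hypothetical per-method results (sample indices).
flags_by_method = {
    "isolation_forest": [3, 17, 42, 88],
    "distribution_shift": [17, 42, 60],
    "mahalanobis": [42, 88, 91],
    "label_flip": [5, 17],
}
suspicious = ensemble_flags(flags_by_method)
print(suspicious)
```

Raising `min_votes` trades detection rate for fewer false positives; lowering it to 1 reduces the ensemble to a simple union of all flags.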

Applying Detection to Real Datasets

For production use, integrate poisoning detection into your data pipeline before training. Run detection on each new batch of training data, comparing it against historical distributions. Set up automated alerts when KS statistics exceed thresholds or when the Isolation Forest flags more samples than the expected baseline rate. For federated learning scenarios, run detection on client updates before aggregation; the Federated Learning Guide covers this in detail.
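A per-batch alerting rule along these lines might look like the following sketch. The `BatchReport` structure, the `tolerance` multiplier, and the example thresholds are assumptions for illustration; real thresholds should be calibrated on historical clean batches.

```python
from dataclasses import dataclass

@dataclass
class BatchReport:
    ks_statistic: float      # KS statistic of the batch against the reference data
    flagged_fraction: float  # share of the batch flagged by Isolation Forest

def batch_alerts(report, ks_threshold, baseline_flag_rate, tolerance=2.0):
    """Return a list of alert messages for one incoming batch."""
    alerts = []
    if report.ks_statistic > ks_threshold:
        alerts.append(
            f"KS statistic {report.ks_statistic:.3f} exceeds {ks_threshold:.3f}")
    if report.flagged_fraction > tolerance * baseline_flag_rate:
        alerts.append(
            f"flagged fraction {report.flagged_fraction:.1%} is more than "
            f"{tolerance:g}x the baseline {baseline_flag_rate:.1%}")
    return alerts

# Hypothetical batches checked against a 5% baseline flag rate.
quiet = batch_alerts(BatchReport(0.05, 0.04), ks_threshold=0.086, baseline_flag_rate=0.05)
noisy = batch_alerts(BatchReport(0.20, 0.15), ks_threshold=0.086, baseline_flag_rate=0.05)
print(quiet)
print(noisy)
```

Returning messages instead of raising lets the pipeline decide whether to block training, quarantine the batch, or only notify a reviewer.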

When the tool flags suspicious samples, do not automatically remove them. Manual review is essential because false positives are inevitable, and removing legitimate edge cases can degrade model performance on rare but important inputs. Instead, quarantine flagged samples, review them with domain experts, and maintain a log of decisions. This audit trail is valuable for compliance with the EU AI Act's data governance requirements (Article 10), which set quality and governance obligations for the training data of high-risk AI systems. The ML Compliance Checker can assess whether your data governance practices meet regulatory requirements.

Limitations and Future Directions

Current detection methods have known limitations. Clean-label attacks (where correctly labeled but strategically chosen samples shift the decision boundary) remain the hardest to detect because the individual samples appear legitimate. Representation-based defenses like spectral signatures can detect backdoor patterns but require training a model first, so they cannot serve as a purely pre-training check. Certified defenses using randomized smoothing can provide provable robustness bounds but at significant accuracy cost. Active research in 2026 focuses on using foundation models to identify anomalous samples in context, leveraging their broad knowledge of normal data patterns to flag poisoning without requiring domain-specific reference distributions.

Frequently Asked Questions

What is data poisoning in machine learning?

Data poisoning is an attack where adversaries manipulate training data to corrupt model behavior. There are three main types: label flipping (changing correct labels), backdoor poisoning (inserting triggered samples that cause specific misclassifications), and clean-label poisoning (strategically chosen correctly-labeled samples that degrade performance). It is particularly dangerous because it can go unnoticed by standard model evaluation on clean test sets.

How does Isolation Forest detect poisoned data points?

Isolation Forest randomly selects features and split values to isolate data points in tree structures. Anomalous points (including poisoned samples) require fewer random splits to isolate because they differ from the majority. The anomaly score is based on average path length across multiple trees. Shorter paths indicate anomalies. It is effective because poisoned samples typically lie in sparse feature regions.

What statistical tests can detect distribution shifts from data poisoning?

The Kolmogorov-Smirnov test compares cumulative distributions, the chi-squared test detects categorical changes, the t-test identifies mean shifts, the Mann-Whitney U test identifies location shifts without assuming normality, the Anderson-Darling test checks fit to an expected distribution, and the Population Stability Index measures overall drift. Mahalanobis distance detects multivariate outliers by taking feature correlations into account. Combining multiple tests provides more robust detection.

How do you detect label flipping attacks in training data?

Confident learning identifies samples where model predictions strongly disagree with labels across multiple folds. Cross-validation filtering flags consistently misclassified samples. Feature-space analysis checks if features match a different class. Influence function analysis identifies samples with disproportionate impact on model parameters. The cleanlab library implements many of these techniques.

What percentage of poisoned data can these detection methods catch?

Random label flipping (5-10% of data): 85-95% detection with 5-15% false positives. Targeted backdoor (1-3% of data): 60-80% detection with Isolation Forest and spectral analysis. Clean-label attacks: 40-60% detection. Ensemble approach combining multiple methods: 80-95% for most attack types with reduced false positives. Detection rates improve with higher poisoning rates.

[Interactive demo panels: Data Source; Clean vs Poisoned Scatter; Isolation Forest Anomaly Scores; Feature Distribution Comparison; Detection Accuracy; Statistical Test Results; Suspicious Data Points (Top 15); Detection Report]

Michael Lip

Related Tools

ML Model Security Audit Tool

ML Threat Model Generator

Federated Learning Introduction
