ML Model Security Checklist
Machine learning models are valuable intellectual property and critical infrastructure. As organizations deploy ML systems in production, the attack surface expands from training data pipelines through inference endpoints. A single vulnerability in your ML pipeline can lead to data exfiltration, model theft, adversarial manipulation, or compliance violations under the EU AI Act.
This interactive checklist covers 50+ security controls across five categories. Use it to assess your current ML security posture, identify gaps, and build a remediation plan. Every control maps to real-world attack vectors documented by OWASP, MITRE ATLAS, and NIST AI RMF.
How to Use This Checklist
Check each control your organization has implemented. The tool calculates your overall security score (0-100), risk level per category, and generates an exportable Markdown report. Severity ratings (Critical / High / Medium / Low) indicate remediation priority. Start with unchecked Critical items.
ML Security Threat Landscape 2026
The machine learning security landscape has shifted dramatically. In 2024, adversarial attacks were primarily academic concerns. By 2026, they are operational threats. MITRE ATLAS documents over 90 real-world case studies of ML-targeted attacks. The convergence of three trends has made ML model security a board-level concern:
Commoditized Attack Tooling
Open-source adversarial ML libraries like ART, Foolbox, and TextAttack have lowered the barrier to entry. Attackers no longer need PhD-level knowledge to craft adversarial examples or run model extraction campaigns. Automated attack frameworks can probe thousands of endpoints per hour.
Expanding Attack Surface
With LLMs embedded in customer-facing applications, every chat interface is a potential attack vector. RAG pipelines introduce data injection risks. Multi-modal models (vision + text + code) multiply the input channels an attacker can exploit. The average enterprise now deploys 35+ ML models in production.
Regulatory Pressure
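The EU AI Act entered into force in August 2024, and its obligations phase in over several years: the prohibitions applied from February 2025, obligations for general-purpose AI models from August 2025, and most high-risk requirements are scheduled for August 2026. NIST AI RMF is now the de facto US standard. Organizations deploying high-risk AI systems must demonstrate robustness testing, data governance, and incident response capabilities. Non-compliance penalties reach up to 7% of worldwide annual turnover.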
The cost of an ML security breach extends beyond traditional cyber incidents. Model theft eliminates competitive advantages built over years. Data poisoning can cause silent model degradation that goes undetected for months. Adversarial manipulation of safety-critical systems (autonomous vehicles, medical diagnosis, fraud detection) creates liability exposure that traditional cybersecurity insurance does not cover.
OWASP ML Top 10 Mapped to Practical Defenses
The OWASP Machine Learning Security Top 10 provides a standardized framework for categorizing ML-specific threats. Below, each risk category is mapped to concrete, implementable defenses that your engineering team can adopt immediately.
| OWASP ML Risk | Attack Vector | Practical Defense | Priority |
|---|---|---|---|
| ML01: Input Manipulation | Adversarial examples, prompt injection, evasion attacks | Input validation, adversarial training, ensemble voting, confidence thresholds | Critical |
| ML02: Data Poisoning | Backdoor insertion, label flipping, clean-label attacks | Data provenance tracking, statistical outlier detection, robust aggregation | Critical |
| ML03: Model Inversion | Reconstructing training data from model outputs | Differential privacy, output perturbation, membership inference monitoring | High |
| ML04: Membership Inference | Determining if a data point was in the training set | Differential privacy guarantees, regularization, confidence calibration | High |
| ML05: Model Theft | API-based model extraction, side-channel analysis | Rate limiting, query budgets, watermarking, output truncation | Critical |
| ML06: AI Supply Chain | Compromised pre-trained models, malicious dependencies | Model signing, SBOM, dependency scanning, SafeTensors format | High |
| ML07: Transfer Learning Attack | Exploiting vulnerabilities inherited from base models | Base model auditing, fine-tuning validation, adversarial probing of transferred layers | Medium |
| ML08: Model Skewing | Training-serving skew, concept drift exploitation | Drift detection pipelines, shadow models, continuous validation | Medium |
| ML09: Output Integrity | Manipulating model outputs post-inference | Output signing, integrity checks, secure serving infrastructure | High |
| ML10: Model Poisoning | Directly modifying model weights or parameters | Artifact integrity verification, access controls, cryptographic checksums | Critical |
Data Poisoning Prevention
Data poisoning remains the most underestimated threat in machine learning security. Unlike inference-time attacks that require ongoing access, a successful poisoning attack embeds permanent vulnerabilities into the model itself. The model behaves normally on clean inputs but produces attacker-controlled outputs when triggered by specific patterns.
Types of Data Poisoning
Backdoor attacks insert a trigger pattern (a specific pixel patch, word, or feature) into a subset of training samples, associating them with a target label. The resulting model achieves normal accuracy on clean data but misclassifies any input containing the trigger. In 2025, researchers demonstrated backdoor attacks that survived fine-tuning, making them persistent even when downstream users retrain on clean data.
Clean-label attacks are more sophisticated. The attacker does not modify labels -- only subtly perturbs features in correctly-labeled data. This makes poisoned samples pass manual inspection. The perturbations shift decision boundaries in ways that benefit the attacker at inference time. Detection requires statistical methods beyond simple label-checking.
Label-flipping attacks are the simplest form: the attacker changes labels on a fraction of training data. Even flipping 1-3% of labels can degrade model accuracy by 10-15% on targeted classes. This is particularly dangerous when training data is crowdsourced or scraped from the web.
Defense Strategy
- Data provenance tracking: Maintain a cryptographic chain of custody for every training sample. Hash datasets at ingestion and verify before training. Use tools like DVC (Data Version Control) to track dataset lineage.
- Statistical anomaly detection: Run distribution analysis on incoming data batches. Flag samples that deviate significantly from established feature distributions. Use spectral signature analysis to detect inserted patterns.
- Robust aggregation: Use trimmed-mean or median aggregation instead of simple averaging when combining data from multiple sources. This limits the influence of any single poisoned source.
- Influence function analysis: After training, use influence functions to identify which training samples most strongly affect predictions on a validation set. High-influence outliers warrant manual review.
- Certified defenses: Techniques like randomized smoothing and DP-SGD (Differentially Private Stochastic Gradient Descent) provide mathematical guarantees against a bounded number of poisoned samples.
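The first item in the list above (hash datasets at ingestion, verify before training) can be sketched with the standard library alone. This is a minimal illustration, not a replacement for a tool like DVC; the function names `build_manifest` and `verify_manifest` are hypothetical, and the sketch assumes the dataset is a directory of files.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir: Path) -> dict:
    """Record one hash per file under data_dir (run this at ingestion time)."""
    return {
        p.relative_to(data_dir).as_posix(): sha256_file(p)
        for p in sorted(data_dir.rglob("*"))
        if p.is_file()
    }


def verify_manifest(data_dir: Path, manifest: dict) -> list:
    """Return the paths that were added, removed, or modified since ingestion."""
    current = build_manifest(data_dir)
    changed = {
        name
        for name in manifest.keys() | current.keys()
        if manifest.get(name) != current.get(name)
    }
    return sorted(changed)
```

A training job would call `verify_manifest` first and refuse to start if the returned list is non-empty. The manifest itself needs to be stored somewhere an attacker with write access to the data cannot also modify.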
Model Extraction and Theft Defense
Model extraction attacks aim to steal your model's functionality by querying it repeatedly and training a surrogate model on the input-output pairs. A well-executed extraction attack can replicate 90-99% of a model's accuracy using as few as 10,000 API queries. For models that cost millions to train, this represents a catastrophic loss of intellectual property.
Attack Mechanics
The attacker sends carefully chosen inputs to your model API and collects the predictions. By selecting inputs that maximize information gain (active learning), the attacker builds a training set for a clone model. Modern extraction attacks work even when the API only returns top-1 predictions without confidence scores. For LLMs, attackers can extract fine-tuning data, system prompts, and behavioral alignments.
Layered Defense
Rate limiting and query budgets: Implement per-user query limits that reset daily. Track cumulative query patterns, not just instantaneous rates. A sophisticated attacker will throttle their queries to stay under simple rate limits, so use weekly and monthly aggregate tracking.
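As a rough sketch of cumulative tracking, the class below enforces several sliding-window budgets at once, so a client that stays under the daily limit can still exhaust a weekly one. The class name `QueryBudget`, the in-memory storage, and the example limits are assumptions for illustration; a production service would keep these counters in a shared store.

```python
import time
from collections import defaultdict, deque
from typing import Optional

DAY = 86_400  # seconds


class QueryBudget:
    """Per-user sliding-window query budgets over several time horizons."""

    def __init__(self, limits: dict):
        # limits maps window length in seconds -> max queries in that window
        self.limits = dict(limits)
        self.max_window = max(self.limits)
        self.history = defaultdict(deque)  # user_id -> timestamps of allowed queries

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        events = self.history[user_id]
        # Forget queries older than the longest window we track.
        while events and events[0] <= now - self.max_window:
            events.popleft()
        # Deny if any window's budget is already used up.
        for window, limit in self.limits.items():
            recent = sum(1 for t in events if t > now - window)
            if recent >= limit:
                return False
        events.append(now)
        return True
```

For example, `QueryBudget({DAY: 1000, 7 * DAY: 3000})` would let a user send 1,000 queries per day but stop them partway through the fourth day of sustained maximum usage.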
Query pattern detection: Monitor for extraction signatures: systematic grid sampling across the input space, boundary-probing queries that test decision boundaries, and unusual input distributions that don't match legitimate traffic. Deploy anomaly detection on the query stream itself.
Output truncation: Return only the minimum information needed. For classification, return the predicted class without confidence scores. For regression, round outputs to reduce precision. Never expose logits, embeddings, or intermediate layer activations through the API.
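A minimal sketch of output truncation at the API boundary is shown below. The function names and the `step` granularity are hypothetical choices; the right precision depends on what legitimate clients actually need.

```python
def truncate_classification(probs: dict) -> dict:
    """Return only the top-1 label; all scores are dropped from the response."""
    label = max(probs, key=probs.get)
    return {"label": label}


def truncate_regression(value: float, step: float = 0.5) -> float:
    """Round a regression output to the nearest multiple of `step`."""
    return round(value / step) * step
```

Applying these functions in the serving layer, rather than in each client library, means that the full-precision outputs never leave the service.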
Model watermarking: Embed verifiable watermarks in your model's behavior. If a stolen model surfaces, you can prove ownership by demonstrating that the watermark transfers to the clone. Techniques include training on specially crafted "key" inputs that produce predetermined outputs.
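Verification of a key-input watermark can be sketched as a simple match-rate check against a suspect model. Everything here is illustrative: `predict` stands for any callable that returns a label, `key_set` is the secret list of (input, predetermined output) pairs, and the threshold is an assumption that should be set well above the rate an unrelated model would reach by chance.

```python
def watermark_match_rate(predict, key_set) -> float:
    """Fraction of secret key inputs on which the model returns the planted output."""
    hits = sum(1 for x, expected in key_set if predict(x) == expected)
    return hits / len(key_set)


def shows_watermark(predict, key_set, threshold: float = 0.9) -> bool:
    """Treat a high match rate on the key set as evidence of the watermark."""
    return watermark_match_rate(predict, key_set) >= threshold
```

The key set only works as evidence if it stays secret and is large enough that a high match rate is implausible for an independently trained model.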
Differential privacy: Training with differential privacy (for example, epsilon < 10) limits how much any single training example influences outputs, which reduces what an extracted copy can reveal about individual training records. It offers formal guarantees for training-data privacy rather than for model functionality as a whole, and it comes with an accuracy-privacy tradeoff.
Adversarial Example Detection
Adversarial examples are inputs crafted to fool ML models while appearing normal to humans. In computer vision, imperceptible pixel perturbations can cause misclassification with near-100% success rates. In NLP, synonym substitutions and character-level modifications bypass text classifiers and content moderation systems. In 2026, adversarial attacks against multi-modal models combine visual and textual perturbations for maximum effect.
Detection Approaches
Input preprocessing: Apply transformations (JPEG compression, spatial smoothing, feature squeezing) to inputs before inference. Adversarial perturbations are often fragile and get destroyed by preprocessing. Compare model predictions on raw vs. preprocessed inputs -- divergence signals an adversarial example.
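The raw-versus-preprocessed comparison can be sketched with bit-depth reduction, one form of feature squeezing. The sketch assumes pixel values in [0, 1] and a `predict_proba` callable returning a list of class probabilities; the 3-bit depth and the 0.5 threshold are placeholders that would need tuning on clean validation data.

```python
def squeeze_bit_depth(pixels, bits: int = 3):
    """Quantize pixel values in [0, 1] to 2**bits levels."""
    levels = (1 << bits) - 1
    return [round(p * levels) / levels for p in pixels]


def l1_divergence(p, q) -> float:
    """L1 distance between two probability vectors (ranges from 0 to 2)."""
    return sum(abs(a - b) for a, b in zip(p, q))


def is_suspicious(predict_proba, pixels, threshold: float = 0.5, bits: int = 3) -> bool:
    """Flag inputs whose prediction changes sharply after squeezing."""
    raw = predict_proba(pixels)
    squeezed = predict_proba(squeeze_bit_depth(pixels, bits))
    return l1_divergence(raw, squeezed) > threshold
```

An input that sits just across a decision boundary tends to flip class when quantized, which produces a large divergence; a clean input far from the boundary usually does not.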
Gradient monitoring: Adversarial inputs often produce anomalous gradient patterns. Where the serving stack can compute input gradients, monitor their magnitude and direction during inference. Legitimate inputs tend to produce smooth, low-magnitude gradients, while adversarial inputs tend to produce sharp, high-magnitude gradient spikes.
Ensemble disagreement: Maintain an ensemble of models trained on the same task with different architectures or random seeds. Adversarial examples crafted for one model rarely transfer perfectly to all ensemble members. Flag inputs where ensemble predictions diverge significantly.
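A minimal sketch of the disagreement check follows, assuming each ensemble member is a callable that returns a hashable label. The `min_agreement` value of 0.8 is a hypothetical default, not a recommended setting.

```python
from collections import Counter


def ensemble_vote(models, x):
    """Return the majority label and the fraction of members that agree with it."""
    votes = Counter(model(x) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)


def flag_disagreement(models, x, min_agreement: float = 0.8) -> bool:
    """Flag inputs on which fewer than min_agreement of the members agree."""
    _, agreement = ensemble_vote(models, x)
    return agreement < min_agreement
```

Flagged inputs can be routed to a slower path (human review or a more robust model) instead of being rejected outright, since natural edge cases also cause disagreement.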
Certified robustness: Randomized smoothing provides mathematically proven robustness within an L2 ball around each input. For any input, you can certify that no perturbation within the certified radius will change the prediction. This is the gold standard but applies primarily to classification tasks with continuous inputs.
Confidence calibration: Well-calibrated models tend to assign lower confidence to many perturbed inputs, although strong attacks can still produce confident misclassifications. Apply temperature scaling or Platt scaling, then reject predictions below a confidence threshold. This is a lightweight defense that can be added to any existing model without retraining.
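Temperature scaling followed by a rejection threshold can be sketched as below. The temperature and threshold values in the usage note are made up for illustration; in practice the temperature is fitted on a held-out validation set and the threshold is chosen from the resulting confidence distribution.

```python
import math


def softmax(logits, temperature: float = 1.0):
    """Softmax over logits divided by a temperature (T > 1 softens confidence)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def predict_or_reject(logits, temperature: float, threshold: float):
    """Return (class_index, confidence), or (None, confidence) when below threshold."""
    probs = softmax(logits, temperature)
    confidence = max(probs)
    if confidence < threshold:
        return None, confidence
    return probs.index(confidence), confidence
```

With logits `[4.0, 1.0, 0.0]` and a threshold of 0.8, the uncalibrated model (temperature 1.0) is accepted, while the same logits at temperature 2.5 fall below the threshold and are rejected.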
Supply Chain Security for ML
The ML supply chain introduces unique risks beyond traditional software supply chain concerns. Pre-trained models are opaque binaries that can contain backdoors. Serialization formats like Python pickle allow arbitrary code execution on deserialization. Popular model hubs are targets for dependency confusion and typosquatting attacks.
Dependency Risks
ML projects typically depend on 200+ Python packages, each of which is a potential attack vector. In 2025, malicious packages mimicking popular ML libraries were found on PyPI with embedded cryptocurrency miners and data exfiltration code. The PyTorch-nightly dependency confusion incident of December 2022, in which a malicious torchtriton package on PyPI shadowed the legitimate dependency, demonstrated how even sophisticated engineering teams can be tricked into installing compromised packages.
- Pin all versions: Use pip freeze or poetry.lock to pin exact dependency versions. Never use >= version specifiers in production requirements.
- Scan continuously: Integrate Snyk, Safety, or Dependabot into your CI/CD pipeline. Scan on every commit, not just weekly.
- Maintain an SBOM: Generate a Software Bill of Materials for your ML stack using tools like Syft or CycloneDX. This enables rapid vulnerability triage when new CVEs are published.
- Use private registries: Mirror approved packages in a private PyPI server (e.g., Artifactory, Nexus). Configure pip to install only from your private mirror.
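The version-pinning rule in the list above lends itself to a simple CI check. The sketch below is a deliberately narrow assumption about the file format: it handles plain `name==version` lines (with optional extras and environment markers) and treats everything else, including wildcard pins and hash-annotated lines, as unpinned. The function name `unpinned_requirements` is hypothetical.

```python
import re

# Matches e.g. "numpy==1.26.4" or "uvicorn[standard]==0.29.0 ; python_version >= '3.9'"
PINNED = re.compile(
    r"^[A-Za-z0-9][A-Za-z0-9._-]*(\[[^\]]+\])?\s*==\s*[^\s;*]+\s*(;.*)?$"
)


def unpinned_requirements(text: str) -> list:
    """Return requirement lines that are not pinned to one exact version."""
    findings = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # skip blank lines and pip options such as "-r base.txt"
        if not PINNED.match(line):
            findings.append(line)
    return findings
```

A CI job could fail the build whenever this function returns a non-empty list for the production requirements file.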
Pre-trained Model Risks
Downloading models from Hugging Face Hub, TensorFlow Hub, or PyTorch Hub is the ML equivalent of curl | bash. Pre-trained models can contain:
- Backdoors: Trained triggers that produce attacker-controlled outputs on specific inputs
- Arbitrary code: Pickle-serialized models can execute arbitrary Python code during loading. A malicious .pt or .pkl file is functionally equivalent to a trojan
- Data leakage: Models may memorize and expose sensitive training data through extraction attacks
- Licensing violations: Using a model with restrictive licensing (e.g., non-commercial) in a commercial product creates legal liability
Mitigations: Use SafeTensors format instead of pickle for model serialization. Verify model checksums against official repository signatures. Run downloaded models through adversarial probing on a known-good validation set before deployment. Implement model signing with Sigstore/Cosign for internal model registries. Check the ML Model License Guide before integrating any pre-trained model.
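To illustrate why pickle is risky, a pickle stream can be inspected without executing it by walking its opcodes with the standard library's `pickletools.genops`. The sketch below lists opcodes that import objects or call them during loading. The opcode set and function name are assumptions for illustration, and the check is coarse: legitimate framework checkpoints also use these opcodes to rebuild tensors, so a practical scanner would need an allowlist of expected imports.

```python
import pickle
import pickletools

# Opcodes that import objects or invoke callables while unpickling.
RISKY_OPCODES = {
    "GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX",
}


def risky_pickle_opcodes(data: bytes) -> list:
    """List risky opcode names in a pickle stream without loading it."""
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if opcode.name in RISKY_OPCODES
    ]


# Plain containers of numbers and strings need no imports or calls.
benign = pickle.dumps({"weights": [0.1, 0.2]})
print(risky_pickle_opcodes(benign))
```

An object whose `__reduce__` method returns a callable and arguments serializes to a stream containing an import opcode followed by `REDUCE`, which is exactly the mechanism a malicious model file relies on.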
Monitoring and Incident Response
Traditional application monitoring (uptime, latency, error rates) is necessary but insufficient for ML systems. ML-specific monitoring must detect silent failures: models that continue serving predictions but produce degraded, biased, or manipulated outputs without throwing errors.
What to Monitor
Input drift: Track the statistical distribution of inputs over time. Significant drift from training data distribution signals potential adversarial manipulation or natural concept drift. Use KL divergence, Population Stability Index (PSI), or Kolmogorov-Smirnov tests on feature distributions.
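As one concrete instance, the Population Stability Index for a single numeric feature can be computed as the sum over bins of (actual - expected) * ln(actual / expected), where both terms are bin fractions. The sketch below assumes fixed, sorted bin edges derived from the training data and a small `eps` floor to avoid division by zero; both are illustrative choices.

```python
import math


def psi(expected, actual, edges, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a live sample."""

    def bin_fractions(values):
        counts = [0] * (len(edges) + 1)
        for v in values:
            # The bin index is the number of edges at or below the value.
            counts[sum(1 for e in edges if v >= e)] += 1
        return [max(c / len(values), eps) for c in counts]

    base = bin_fractions(expected)
    live = bin_fractions(actual)
    return sum((a - b) * math.log(a / b) for b, a in zip(base, live))
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but these cutoffs are conventions rather than guarantees and should be validated against the feature's normal variation.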
Output distribution: Monitor the distribution of model predictions. A sudden shift in class distribution (e.g., fraud detection model suddenly flagging 2x more transactions) may indicate model manipulation or genuine environmental change. Both require investigation.
Confidence distributions: Track the distribution of prediction confidence scores. A spike in low-confidence predictions may indicate adversarial probing. A cluster of predictions at exactly the decision boundary suggests targeted evasion attempts.
Query patterns: Log all inference requests with timestamps, user identifiers, and input characteristics. This enables retroactive forensics when a breach is discovered. Store logs for at least six months, the minimum retention period the EU AI Act sets for logs of high-risk systems.
Incident Response for ML
Traditional incident response playbooks assume a binary state: compromised or not. ML incidents exist on a spectrum. A model can be partially compromised (backdoored on specific trigger patterns) while appearing fully functional on routine monitoring. Your ML incident response plan should include:
- Detection: Automated alerts on drift metrics, anomalous query patterns, and performance degradation
- Containment: Ability to roll back to a known-good model version within minutes. Pre-stage validated rollback artifacts
- Analysis: Forensic comparison of suspect model behavior vs. baseline on a comprehensive test suite including adversarial probes
- Remediation: Retrain from verified clean data if poisoning is confirmed. Update access controls if extraction is detected. Patch inference pipeline if adversarial bypass is found
- Communication: Notify affected stakeholders. For regulated systems under the EU AI Act, file required incident reports within mandated timeframes
Compliance Considerations: EU AI Act
The EU AI Act is the world's first comprehensive AI regulation. Its security requirements directly impact ML engineering practices. Organizations that place AI systems on the EU market, or whose systems' outputs are used in the EU, must comply regardless of where the organization is headquartered.
Risk Categories and Requirements
Unacceptable risk (banned): Social scoring systems, real-time biometric identification in public spaces (with limited law enforcement exceptions), manipulation techniques targeting vulnerable groups. If your ML system falls into this category, no amount of security hardening makes it compliant -- it must be discontinued.
High risk: Systems used in critical infrastructure, education, employment, law enforcement, migration, and democratic processes. These systems require: conformity assessments before deployment, risk management systems, data governance documentation, technical documentation and logging, human oversight mechanisms, accuracy and robustness standards, and cybersecurity measures proportionate to the risk.
Limited risk: Chatbots, emotion recognition systems, deep fakes. These require transparency obligations -- users must be informed they are interacting with an AI system.
Minimal risk: Spam filters, video game AI, inventory management. No specific requirements beyond existing law.
Security-Specific Obligations
For high-risk AI systems, the EU AI Act mandates:
- Robustness testing: Systems must be resilient to errors, faults, and attempts to alter their use or performance by unauthorized third parties. This includes adversarial robustness testing.
- Data governance: Training, validation, and testing datasets must meet quality criteria. Data collection practices must be documented. Bias testing is mandatory.
- Logging: Automatic recording of events during system operation for traceability. Logs must be retained and made available to authorities on request.
- Accuracy metrics: Declared levels of accuracy must be maintained throughout the system's lifecycle. Performance degradation triggers compliance obligations.
- Cybersecurity: The system must be designed to achieve appropriate levels of accuracy, robustness, and cybersecurity, and to perform consistently in those respects throughout its lifecycle.
Non-compliance penalties scale with severity: up to 35 million EUR or 7% of worldwide annual turnover for prohibited AI practices, and up to 15 million EUR or 3% for other infringements. The checklist above maps directly to these compliance requirements.