
ML Model Security Checklist

By Michael Lip · May 16, 2026 · 12 min read

Machine learning models are valuable intellectual property and critical infrastructure. As organizations deploy ML systems in production, the attack surface expands from training data pipelines through inference endpoints. A single vulnerability in your ML pipeline can lead to data exfiltration, model theft, adversarial manipulation, or compliance violations under the EU AI Act.

This interactive checklist covers 50+ security controls across five categories. Use it to assess your current ML security posture, identify gaps, and build a remediation plan. Every control maps to real-world attack vectors documented by OWASP, MITRE ATLAS, and NIST AI RMF.

How to Use This Checklist

Check each control your organization has implemented. The tool calculates your overall security score (0-100), risk level per category, and generates an exportable Markdown report. Severity ratings (Critical / High / Medium / Low) indicate remediation priority. Start with unchecked Critical items. For the main LockML security analysis tool, visit the homepage.


ML Security Threat Landscape 2026

The machine learning security landscape has shifted dramatically. In 2024, adversarial attacks were primarily academic concerns. By 2026, they are operational threats. MITRE ATLAS documents over 90 real-world case studies of ML-targeted attacks. The convergence of three trends has made ML model security a board-level concern:

Trend 1

Commoditized Attack Tooling

Open-source adversarial ML libraries like ART, Foolbox, and TextAttack have lowered the barrier to entry. Attackers no longer need PhD-level knowledge to craft adversarial examples or run model extraction campaigns. Automated attack frameworks can probe thousands of endpoints per hour.

Trend 2

Expanding Attack Surface

With LLMs embedded in customer-facing applications, every chat interface is a potential attack vector. RAG pipelines introduce data injection risks. Multi-modal models (vision + text + code) multiply the input channels an attacker can exploit. The average enterprise now deploys 35+ ML models in production.

Trend 3

Regulatory Pressure

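The EU AI Act entered into force in August 2024, and its obligations apply in phases: prohibitions from February 2025, general-purpose AI obligations from August 2025, and most high-risk requirements scheduled from August 2026. NIST AI RMF, although voluntary, is widely used as the US reference framework. Organizations deploying high-risk AI systems must demonstrate robustness testing, data governance, and incident response capabilities. Non-compliance penalties reach up to 7% of worldwide annual turnover for prohibited practices.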

The cost of an ML security breach extends beyond traditional cyber incidents. Model theft eliminates competitive advantages built over years. Data poisoning can cause silent model degradation that goes undetected for months. Adversarial manipulation of safety-critical systems (autonomous vehicles, medical diagnosis, fraud detection) creates liability exposure that traditional cybersecurity insurance does not cover.

OWASP ML Top 10 Mapped to Practical Defenses

The OWASP Machine Learning Security Top 10 provides a standardized framework for categorizing ML-specific threats. Below, each risk category is mapped to concrete, implementable defenses that your engineering team can adopt immediately.

| OWASP ML Risk | Attack Vector | Practical Defense | Priority |
|---|---|---|---|
| ML01: Input Manipulation | Adversarial examples, prompt injection, evasion attacks | Input validation, adversarial training, ensemble voting, confidence thresholds | Critical |
| ML02: Data Poisoning | Backdoor insertion, label flipping, clean-label attacks | Data provenance tracking, statistical outlier detection, robust aggregation | Critical |
| ML03: Model Inversion | Reconstructing training data from model outputs | Differential privacy, output perturbation, membership inference monitoring | High |
| ML04: Membership Inference | Determining if a data point was in the training set | Differential privacy guarantees, regularization, confidence calibration | High |
| ML05: Model Theft | API-based model extraction, side-channel analysis | Rate limiting, query budgets, watermarking, output truncation | Critical |
| ML06: AI Supply Chain | Compromised pre-trained models, malicious dependencies | Model signing, SBOM, dependency scanning, SafeTensors format | High |
| ML07: Transfer Learning Attack | Exploiting vulnerabilities inherited from base models | Base model auditing, fine-tuning validation, adversarial probing of transferred layers | Medium |
| ML08: Model Skewing | Training-serving skew, concept drift exploitation | Drift detection pipelines, shadow models, continuous validation | Medium |
| ML09: Output Integrity | Manipulating model outputs post-inference | Output signing, integrity checks, secure serving infrastructure | High |
| ML10: Model Poisoning | Directly modifying model weights or parameters | Artifact integrity verification, access controls, cryptographic checksums | Critical |

Data Poisoning Prevention

Data poisoning remains the most underestimated threat in machine learning security. Unlike inference-time attacks that require ongoing access, a successful poisoning attack embeds permanent vulnerabilities into the model itself. The model behaves normally on clean inputs but produces attacker-controlled outputs when triggered by specific patterns.

Types of Data Poisoning

Backdoor attacks insert a trigger pattern (a specific pixel patch, word, or feature) into a subset of training samples, associating them with a target label. The resulting model achieves normal accuracy on clean data but misclassifies any input containing the trigger. In 2025, researchers demonstrated backdoor attacks that survived fine-tuning, making them persistent even when downstream users retrain on clean data.

Clean-label attacks are more sophisticated. The attacker does not modify labels -- only subtly perturbs features in correctly-labeled data. This makes poisoned samples pass manual inspection. The perturbations shift decision boundaries in ways that benefit the attacker at inference time. Detection requires statistical methods beyond simple label-checking.

Label-flipping attacks are the simplest form: the attacker changes labels on a fraction of training data. Even flipping 1-3% of labels can degrade model accuracy by 10-15% on targeted classes. This is particularly dangerous when training data is crowdsourced or scraped from the web.
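One inexpensive way to surface flipped labels is to check whether each sample's label agrees with the labels of its nearest neighbours. The following is a minimal sketch using only the Python standard library; the function name `flag_suspect_labels`, the choice of k, and brute-force Euclidean distance are illustrative assumptions, not a method taken from a specific framework.

```python
from collections import Counter
import math

def flag_suspect_labels(features, labels, k=3):
    """Return indices whose label disagrees with the majority label of
    their k nearest neighbours (Euclidean distance, brute force)."""
    suspects = []
    for i, x in enumerate(features):
        # Distances to every other sample, nearest first.
        neighbours = sorted(
            (math.dist(x, y), j) for j, y in enumerate(features) if j != i
        )[:k]
        votes = Counter(labels[j] for _, j in neighbours)
        majority_label, _ = votes.most_common(1)[0]
        if majority_label != labels[i]:
            suspects.append(i)
    return suspects

# Toy data: two tight clusters; sample 2 sits in cluster "a" but is labelled "b".
X = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
     (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1)]
y = ["a", "a", "b", "a", "b", "b", "b", "b"]
suspects = flag_suspect_labels(X, y)
print(suspects)  # [2]
```

Flagged samples are candidates for manual review, not proof of poisoning. On real data this check would typically run on learned embeddings rather than raw features, and it will not catch clean-label attacks, where labels are correct by construction.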

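Defense Strategy

Defenses against poisoning combine the controls named elsewhere in this guide: data provenance tracking to verify source integrity, statistical outlier detection on training data distributions, spectral signature analysis to detect inserted patterns, influence functions to identify high-impact training samples, and robust aggregation.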

Model Extraction and Theft Defense

Model extraction attacks aim to steal your model's functionality by querying it repeatedly and training a surrogate model on the input-output pairs. A well-executed extraction attack can replicate 90-99% of a model's accuracy using as few as 10,000 API queries. For models that cost millions to train, this represents a catastrophic loss of intellectual property.

Attack Mechanics

The attacker sends carefully chosen inputs to your model API and collects the predictions. By selecting inputs that maximize information gain (active learning), the attacker builds a training set for a clone model. Modern extraction attacks work even when the API only returns top-1 predictions without confidence scores. For LLMs, attackers can extract fine-tuning data, system prompts, and behavioral alignments.

Layered Defense

Rate limiting and query budgets: Implement per-user query limits that reset daily. Track cumulative query patterns, not just instantaneous rates. A sophisticated attacker will throttle their queries to stay under simple rate limits, so use weekly and monthly aggregate tracking.
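The aggregate tracking described above can be sketched as a budget checked against several rolling windows at once. This is an illustrative in-memory version; the class name `QueryBudget`, the window sizes, and the limits are assumptions, and a production system would keep the counters in a shared store.

```python
from collections import defaultdict, deque

class QueryBudget:
    """Track per-user query counts over several rolling windows."""

    def __init__(self, limits):
        # limits maps window length in seconds -> max queries in that window.
        self.limits = dict(limits)
        self.longest = max(self.limits)
        self.history = defaultdict(deque)  # user_id -> accepted timestamps

    def allow(self, user_id, now):
        """Return True and record the query if it fits every window;
        return False (without recording it) otherwise."""
        stamps = self.history[user_id]
        # Drop timestamps that have aged out of the longest window.
        while stamps and stamps[0] <= now - self.longest:
            stamps.popleft()
        for window, limit in self.limits.items():
            recent = sum(1 for t in stamps if t > now - window)
            if recent >= limit:
                return False
        stamps.append(now)
        return True

DAY = 86_400
budget = QueryBudget({DAY: 3, 7 * DAY: 5})
# Three queries a day stays under the daily limit, but the sixth query
# in the same week trips the weekly budget.
results = [budget.allow("u1", day * DAY + i) for day in range(2) for i in range(3)]
print(results)  # [True, True, True, True, True, False]
```

The weekly window is what catches the throttled attacker in this example: no single day exceeds its limit, yet the cumulative count does.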

Query pattern detection: Monitor for extraction signatures: systematic grid sampling across the input space, boundary-probing queries that test decision boundaries, and unusual input distributions that don't match legitimate traffic. Deploy anomaly detection on the query stream itself.
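One extraction signature mentioned above, boundary probing, can be approximated by tracking how often a user's queries land near a decision boundary. The sketch below assumes the serving layer records, for each query, the margin between the top-1 and top-2 class probabilities; the function names, the 0.1 margin threshold, and the 3x ratio are hypothetical tuning choices.

```python
def low_margin_rate(margins, margin_threshold=0.1):
    """Fraction of queries whose top-1 vs. top-2 probability margin is small."""
    if not margins:
        return 0.0
    return sum(1 for m in margins if m < margin_threshold) / len(margins)

def looks_like_boundary_probing(user_margins, baseline_rate,
                                ratio=3.0, min_queries=20):
    """Flag a user whose low-margin rate is far above the population baseline."""
    if len(user_margins) < min_queries:
        return False  # not enough evidence yet
    return low_margin_rate(user_margins) > ratio * baseline_rate

# Baseline: 10% of legitimate traffic falls near a boundary (assumed).
normal_user = [0.9, 0.8, 0.95, 0.7, 0.05] * 4   # 20% low-margin queries
prober = [0.02, 0.05, 0.08, 0.6] * 5            # 75% low-margin queries
print(looks_like_boundary_probing(normal_user, baseline_rate=0.1))  # False
print(looks_like_boundary_probing(prober, baseline_rate=0.1))       # True
```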

Output truncation: Return only the minimum information needed. For classification, return the predicted class without confidence scores. For regression, round outputs to reduce precision. Never expose logits, embeddings, or intermediate layer activations through the API.
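As a rough illustration of output truncation, the response layer can strip scores before anything leaves the service. The function names and the one-decimal rounding below are assumptions chosen for the example.

```python
def truncate_classification(probabilities):
    """Return only the predicted class label, dropping all scores."""
    return max(probabilities, key=probabilities.get)

def truncate_regression(value, decimals=1):
    """Round a regression output to limit the precision exposed to callers."""
    return round(value, decimals)

raw_scores = {"fraud": 0.8731, "legit": 0.1269}
label = truncate_classification(raw_scores)
amount = truncate_regression(412.73891)
print(label)   # fraud
print(amount)  # 412.7
```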

Model watermarking: Embed verifiable watermarks in your model's behavior. If a stolen model surfaces, you can prove ownership by demonstrating that the watermark transfers to the clone. Techniques include training on specially crafted "key" inputs that produce predetermined outputs.
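Verifying a key-input watermark comes down to measuring how often a suspect model reproduces the predetermined outputs. The sketch below assumes the suspect model is reachable as a plain `predict` callable; the function names, the key set, and the 0.9 threshold are hypothetical.

```python
def watermark_match_rate(predict, key_inputs, key_outputs):
    """Fraction of secret key inputs on which the model returns the
    predetermined watermark output."""
    matches = sum(1 for x, y in zip(key_inputs, key_outputs) if predict(x) == y)
    return matches / len(key_inputs)

def carries_watermark(predict, key_inputs, key_outputs, threshold=0.9):
    """True if the match rate is high enough to suggest the model was
    derived from the watermarked original."""
    return watermark_match_rate(predict, key_inputs, key_outputs) >= threshold

# Stand-in models: a clone that inherited the watermark and an unrelated model.
keys = ["k1", "k2", "k3", "k4"]
marks = [7, 3, 9, 1]
clone = dict(zip(keys, marks)).get
unrelated = lambda x: 0
print(carries_watermark(clone, keys, marks))      # True
print(carries_watermark(unrelated, keys, marks))  # False
```

The threshold should be set so that an independently trained model is very unlikely to reach it by chance, which depends on the number of keys and output classes.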

Differential privacy: Training with differential privacy (epsilon < 10) limits how much any single training example influences outputs, making extraction less faithful. This is the strongest theoretical defense but comes with an accuracy-privacy tradeoff.

Adversarial Example Detection

Adversarial examples are inputs crafted to fool ML models while appearing normal to humans. In computer vision, imperceptible pixel perturbations can cause misclassification with near-100% success rates. In NLP, synonym substitutions and character-level modifications bypass text classifiers and content moderation systems. In 2026, adversarial attacks against multi-modal models combine visual and textual perturbations for maximum effect.

Detection Approaches

Input preprocessing: Apply transformations (JPEG compression, spatial smoothing, feature squeezing) to inputs before inference. Adversarial perturbations are often fragile and get destroyed by preprocessing. Compare model predictions on raw vs. preprocessed inputs -- divergence signals an adversarial example.
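The raw-versus-preprocessed comparison can be sketched with bit-depth reduction, one form of feature squeezing. The toy linear classifier, its weights, and the 3-bit setting below are made up for illustration; a real deployment would call the production model.

```python
def squeeze_bit_depth(pixels, bits=3):
    """Quantize pixel values in [0, 1] to 2**bits evenly spaced levels."""
    levels = 2 ** bits - 1
    return [round(p * levels) / levels for p in pixels]

def prediction_diverges(predict, pixels, bits=3):
    """True if the prediction changes once the input is squeezed."""
    return predict(pixels) != predict(squeeze_bit_depth(pixels, bits))

def toy_classifier(pixels):
    # Deliberately brittle linear model: tiny alternating nudges flip it.
    weights = [10, -10, 10, -10]
    score = sum(w * p for w, p in zip(weights, pixels))
    return "cat" if score > 0.1 else "dog"

clean = [0.43, 0.43, 0.43, 0.43]
perturbed = [0.44, 0.42, 0.44, 0.42]  # small nudges aligned with the weights
print(prediction_diverges(toy_classifier, clean))      # False
print(prediction_diverges(toy_classifier, perturbed))  # True
```

Squeezing maps both inputs to the same quantized image, so only the perturbed one changes its prediction.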

Gradient monitoring: Adversarial inputs produce anomalous gradient patterns. Monitor the magnitude and direction of input gradients during inference. Legitimate inputs produce smooth, low-magnitude gradients. Adversarial inputs produce sharp, high-magnitude gradient spikes.

Ensemble disagreement: Maintain an ensemble of models trained on the same task with different architectures or random seeds. Adversarial examples crafted for one model rarely transfer perfectly to all ensemble members. Flag inputs where ensemble predictions diverge significantly.
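A simple disagreement check takes the label predicted by each ensemble member and flags the input when too few members agree. The function names and the 0.8 agreement threshold are assumptions for this sketch.

```python
from collections import Counter

def ensemble_agreement(predictions):
    """Fraction of ensemble members voting for the most common label."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][1] / len(predictions)

def flag_input(predictions, min_agreement=0.8):
    """True if the ensemble disagrees enough to warrant review."""
    return ensemble_agreement(predictions) < min_agreement

unanimous = ["stop", "stop", "stop", "stop", "stop"]
split = ["stop", "yield", "stop", "speed_60", "stop"]
print(flag_input(unanimous))  # False
print(flag_input(split))      # True
```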

Certified robustness: Randomized smoothing provides a provable robustness guarantee within an L2 ball around each input. In practice the certificate is estimated by sampling, so it holds with high probability rather than absolutely. For a certified input, no perturbation within the certified radius will change the smoothed classifier's prediction. This is the gold standard but applies primarily to classification tasks with continuous inputs.
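For Gaussian randomized smoothing, the certified L2 radius in its commonly used simplified form is the noise standard deviation times the inverse normal CDF of a lower bound on the top-class probability. The sketch below computes that formula only; it assumes the lower bound `p_lower` has already been estimated from noisy samples, and the function name is illustrative.

```python
from statistics import NormalDist

def certified_l2_radius(p_lower, sigma):
    """Certified L2 radius sigma * Phi^{-1}(p_lower) for a smoothed classifier.

    p_lower: lower confidence bound on the top-class probability under
             Gaussian noise (must be below 1.0).
    sigma:   standard deviation of the Gaussian noise.
    Returns 0.0 when p_lower <= 0.5, meaning no certificate is issued.
    """
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)

radius = certified_l2_radius(p_lower=0.99, sigma=0.25)
print(round(radius, 3))  # 0.582
print(certified_l2_radius(p_lower=0.45, sigma=0.25))  # 0.0
```

Larger noise increases the achievable radius but usually lowers accuracy, so sigma is a tuning decision rather than a free win.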

Confidence calibration: Calibrated models often, though not always, assign lower confidence to adversarial inputs; strong attacks can still produce high-confidence errors. Apply temperature scaling or Platt calibration, then reject predictions below a confidence threshold. This is a lightweight defense that can be added to any existing model without retraining.
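Temperature scaling followed by a rejection threshold can be sketched in a few lines. The temperature would normally be fitted on a held-out validation set; the values 3.0 and 0.7 below, and the function names, are placeholders for illustration.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (T > 1 softens scores)."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def predict_or_reject(logits, labels, temperature, min_confidence):
    """Return the top label, or None if calibrated confidence is too low."""
    probs = softmax_with_temperature(logits, temperature)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < min_confidence:
        return None
    return labels[best]

labels = ["cat", "dog", "bird"]
logits = [4.0, 2.0, 1.0]
uncalibrated = predict_or_reject(logits, labels, temperature=1.0, min_confidence=0.7)
calibrated = predict_or_reject(logits, labels, temperature=3.0, min_confidence=0.7)
print(uncalibrated)  # cat
print(calibrated)    # None
```

The same logits pass the threshold before calibration and are rejected after it, which is the intended effect when the raw model is overconfident.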

Supply Chain Security for ML

The ML supply chain introduces unique risks beyond traditional software supply chain concerns. Pre-trained models are opaque binaries that can contain backdoors. Serialization formats like Python pickle allow arbitrary code execution on deserialization. Popular model hubs are targets for dependency confusion and typosquatting attacks.

Dependency Risks

ML projects typically depend on 200+ Python packages, each of which is a potential attack vector. In 2025, malicious packages mimicking popular ML libraries were found on PyPI with embedded cryptocurrency miners and data exfiltration code. The December 2022 compromise of PyTorch nightly builds, in which a malicious torchtriton package on PyPI shadowed the legitimate dependency, demonstrated how even sophisticated engineering teams can be tricked into installing compromised packages.

Pre-trained Model Risks

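Downloading models from Hugging Face Hub, TensorFlow Hub, or PyTorch Hub is the ML equivalent of curl | bash. Pre-trained models can contain backdoors, and models serialized with pickle can carry code that executes as soon as the file is loaded.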

Mitigations: Use SafeTensors format instead of pickle for model serialization. Verify model checksums against official repository signatures. Run downloaded models through adversarial probing on a known-good validation set before deployment. Implement model signing with Sigstore/Cosign for internal model registries. Check the ML Model License Guide before integrating any pre-trained model.
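Checksum verification is straightforward to automate before a model file is ever loaded. The sketch below hashes a file in chunks and compares it to a pinned SHA-256 value; the function names are illustrative, and the temporary file stands in for a downloaded artifact whose expected digest would come from the publisher.

```python
import hashlib
import hmac
import os
import tempfile

def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large model files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_artifact(path, expected_sha256):
    """True only if the file's SHA-256 matches the pinned value."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())

# Demo with a stand-in file; a real pipeline would pin the published digest.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"fake model weights")
    model_path = tmp.name
pinned = hashlib.sha256(b"fake model weights").hexdigest()
ok = verify_artifact(model_path, pinned)
tampered = verify_artifact(model_path, "0" * 64)
os.remove(model_path)
print(ok, tampered)  # True False
```

A matching checksum shows the file is the one the publisher released; it says nothing about whether that release is itself free of backdoors, which is why behavioural probing is still needed.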

Monitoring and Incident Response

Traditional application monitoring (uptime, latency, error rates) is necessary but insufficient for ML systems. ML-specific monitoring must detect silent failures: models that continue serving predictions but produce degraded, biased, or manipulated outputs without throwing errors.

What to Monitor

Input drift: Track the statistical distribution of inputs over time. Significant drift from training data distribution signals potential adversarial manipulation or natural concept drift. Use KL divergence, Population Stability Index (PSI), or Kolmogorov-Smirnov tests on feature distributions.

Output distribution: Monitor the distribution of model predictions. A sudden shift in class distribution (e.g., fraud detection model suddenly flagging 2x more transactions) may indicate model manipulation or genuine environmental change. Both require investigation.

Confidence distributions: Track the distribution of prediction confidence scores. A spike in low-confidence predictions may indicate adversarial probing. A cluster of predictions at exactly the decision boundary suggests targeted evasion attempts.

Query patterns: Log all inference requests with timestamps, user identifiers, and input characteristics so that retroactive forensics are possible when a breach is discovered. For high-risk systems, the EU AI Act requires automatically generated logs to be kept for at least six months, so treat that as the minimum retention period.

Incident Response for ML

Traditional incident response playbooks assume a binary state: compromised or not. ML incidents exist on a spectrum. A model can be partially compromised (backdoored on specific trigger patterns) while appearing fully functional on routine monitoring. Your ML incident response plan should include:

  1. Detection: Automated alerts on drift metrics, anomalous query patterns, and performance degradation
  2. Containment: Ability to roll back to a known-good model version within minutes. Pre-stage validated rollback artifacts
  3. Analysis: Forensic comparison of suspect model behavior vs. baseline on a comprehensive test suite including adversarial probes
  4. Remediation: Retrain from verified clean data if poisoning is confirmed. Update access controls if extraction is detected. Patch inference pipeline if adversarial bypass is found
  5. Communication: Notify affected stakeholders. For regulated systems under the EU AI Act, file required incident reports within mandated timeframes

Compliance Considerations: EU AI Act

The EU AI Act is the world's first comprehensive AI regulation. Its security requirements directly impact ML engineering practices. Organizations that place AI systems on the EU market, or whose systems' outputs are used in the EU, must comply regardless of where the organization is headquartered.

Risk Categories and Requirements

Unacceptable risk (banned): Social scoring systems, real-time biometric identification in public spaces (with limited law enforcement exceptions), manipulation techniques targeting vulnerable groups. If your ML system falls into this category, no amount of security hardening makes it compliant -- it must be discontinued.

High risk: Systems used in critical infrastructure, education, employment, law enforcement, migration, and democratic processes. These systems require: conformity assessments before deployment, risk management systems, data governance documentation, technical documentation and logging, human oversight mechanisms, accuracy and robustness standards, and cybersecurity measures proportionate to the risk.

Limited risk: Chatbots, emotion recognition systems, deep fakes. These require transparency obligations -- users must be informed they are interacting with an AI system.

Minimal risk: Spam filters, video game AI, inventory management. No specific requirements beyond existing law.

Security-Specific Obligations

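For high-risk AI systems, the EU AI Act mandates an appropriate level of accuracy, robustness, and cybersecurity throughout the system's lifecycle, including resilience against data poisoning, model poisoning, and adversarial examples, alongside the risk management, data governance, logging, and human oversight requirements listed above.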

Non-compliance penalties scale with severity: up to 35 million EUR or 7% of worldwide annual turnover for prohibited AI practices, and up to 15 million EUR or 3% for other infringements. The checklist above maps directly to these compliance requirements.
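To make the penalty ceilings concrete, the arithmetic can be written out. This sketch assumes the general rule for companies that the higher of the fixed amount and the turnover percentage applies (smaller enterprises are treated differently), and the function name and turnover figures are hypothetical.

```python
def max_fine_eur(worldwide_turnover_eur, prohibited_practice):
    """Upper bound of the fine: the higher of the fixed cap and the
    percentage of worldwide annual turnover (integer euros)."""
    if prohibited_practice:
        fixed_cap, percent = 35_000_000, 7
    else:
        fixed_cap, percent = 15_000_000, 3
    return max(fixed_cap, worldwide_turnover_eur * percent // 100)

large = max_fine_eur(1_000_000_000, prohibited_practice=True)
small = max_fine_eur(200_000_000, prohibited_practice=True)
other = max_fine_eur(1_000_000_000, prohibited_practice=False)
print(large)  # 70000000
print(small)  # 35000000
print(other)  # 30000000
```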

Frequently Asked Questions

What are the biggest security risks to ML models in 2026?

The top ML security risks in 2026 are data poisoning (manipulating training data to introduce backdoors), model extraction (replicating a model's behavior through repeated API queries), adversarial examples (crafted inputs that cause misclassification), supply chain attacks (compromised dependencies or pre-trained models), and prompt injection for LLM-based systems. The OWASP ML Top 10 provides a standardized framework for categorizing these threats.

How do I prevent model theft and extraction attacks?

Prevent model theft by: (1) rate-limiting API calls to prevent systematic querying, (2) monitoring for extraction patterns like sequential boundary-probing queries, (3) adding watermarks to model outputs for post-theft detection, (4) using differential privacy during training to limit what can be learned from outputs, (5) restricting prediction confidence scores (return top-1 class only, not full probability distributions), and (6) implementing query budgets per user/API key.

What is data poisoning and how do I detect it?

Data poisoning is the deliberate manipulation of training data to introduce vulnerabilities or backdoors into an ML model. Detection methods include: statistical analysis of training data distributions to find anomalies, comparing model behavior on clean vs. potentially poisoned validation sets, using influence functions to identify high-impact training samples, implementing data provenance tracking to verify source integrity, and running spectral signature analysis to detect inserted patterns.

Does the EU AI Act affect ML model security requirements?

Yes. The EU AI Act entered into force in August 2024 and applies in phases: prohibitions from February 2025, general-purpose AI obligations from August 2025, and most high-risk requirements scheduled from August 2026. It imposes security requirements especially for high-risk AI systems. These include mandatory risk assessments, robustness testing against adversarial attacks, data governance requirements, logging and traceability, human oversight, and cybersecurity requirements. Non-compliance can result in fines up to 35 million EUR or 7% of global turnover.

How do I secure ML dependencies and the supply chain?

Secure your ML supply chain by: (1) pinning all dependency versions, (2) scanning with Snyk, Dependabot, or Safety, (3) verifying pre-trained model checksums, (4) using model signing (Sigstore/Cosign), (5) avoiding pickle-based serialization (use SafeTensors), (6) auditing hub downloads before deployment, and (7) maintaining a software bill of materials (SBOM).

What is an adversarial example and how do I defend against it?

An adversarial example is a deliberately crafted input designed to cause incorrect predictions while appearing normal to humans. Defenses include adversarial training, input preprocessing and sanitization, ensemble methods, certified defenses like randomized smoothing, gradient monitoring, and confidence thresholds that reject uncertain predictions.
