Interactive infrastructure hardening checklist for production ML systems. Select your deployment environment and get a tailored security checklist covering container security, API gateways, model serving, secrets management, network policies, and monitoring.
Different infrastructure stacks have different security profiles. The checklist adapts recommendations to your specific environment.
Deploying machine learning models to production introduces security challenges that do not exist in traditional software systems. A conventional web application serves deterministic responses from code and data that are fully controlled by the development team. An ML system serves probabilistic predictions from a learned model that encodes information about the training data in its weights, accepts arbitrary user inputs that influence its behavior, and often relies on specialized hardware (GPUs, TPUs) with unique security properties. The attack surface expands in three dimensions simultaneously: the model artifact itself becomes a target, the inference pipeline introduces new vulnerability classes, and the operational infrastructure must accommodate ML-specific requirements like GPU memory management and model hot-swapping.
The most dangerous vulnerability class unique to ML deployments is the model serialization attack. Python's pickle format, used by scikit-learn and PyTorch's default save mechanism, allows arbitrary code execution on deserialization. An attacker who can replace a model artifact with a malicious pickle file achieves remote code execution in the model serving environment. This is not a theoretical concern: multiple CVEs have been published for model file poisoning, and the SafeTensors format was created specifically to address this class of vulnerability. The deployment checklist below includes specific controls for safe model loading, artifact integrity verification, and sandboxed deserialization.
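Artifact integrity verification can be sketched as a digest check that runs before any deserialization. This is a minimal sketch, assuming the expected SHA-256 digest arrives through a trusted channel such as a signed deployment manifest; the function names `sha256_of_file` and `verify_artifact` are hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file through SHA-256 so large artifacts are never read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, expected_sha256: str) -> None:
    """Raise ValueError unless the artifact matches the pinned digest.

    Assumption: expected_sha256 is a hex digest obtained from a trusted source,
    not from the same location as the artifact itself.
    """
    actual = sha256_of_file(path)
    if not hmac.compare_digest(actual, expected_sha256.lower()):
        raise ValueError(f"digest mismatch for {path.name}: got {actual}")
```

Calling `verify_artifact` before the model loader means a swapped file fails closed instead of being deserialized. A checksum only proves the file matches what was pinned; it says nothing about whether the pinned file was safe to begin with.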
Most production ML models are served from containers, typically using frameworks like TensorFlow Serving, Triton Inference Server, TorchServe, or custom Flask/FastAPI applications. Container security for ML workloads requires additional considerations beyond standard container hardening. ML containers often need GPU access, which requires privileged device access that conflicts with the principle of least privilege. The NVIDIA Container Toolkit provides a secure way to expose GPUs to containers without full host access, but misconfiguration can allow container escape via GPU driver vulnerabilities. ML containers also tend to be large (multi-GB images with CUDA libraries and model weights), making image scanning slower and more complex. Best practices include using multi-stage builds to minimize the final image size, running model loading and inference as a non-root user, mounting model artifacts as read-only volumes, and using seccomp profiles that restrict the system calls available during model inference.
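Several of these practices map onto fields of a Kubernetes container spec. The sketch below builds such a fragment as a Python dictionary that can be serialized to JSON. It assumes a Kubernetes deployment with the NVIDIA device plugin installed (which exposes the `nvidia.com/gpu` resource); the container name, mount path, and volume name are hypothetical.

```python
import json


def hardened_serving_container(image: str, model_volume: str = "model-store") -> dict:
    """Return a container spec fragment with restrictive defaults for model serving.

    Assumptions: the names "model-server", "/models", and "model-store" are
    placeholders, and one GPU per container is requested.
    """
    return {
        "name": "model-server",
        "image": image,
        "securityContext": {
            "runAsNonRoot": True,               # refuse to start if the image runs as root
            "allowPrivilegeEscalation": False,
            "readOnlyRootFilesystem": True,
            "capabilities": {"drop": ["ALL"]},
            # RuntimeDefault applies the container runtime's default seccomp profile;
            # a tighter custom profile would use type "Localhost" instead.
            "seccompProfile": {"type": "RuntimeDefault"},
        },
        # GPU access is requested as a resource rather than via privileged mode.
        "resources": {"limits": {"nvidia.com/gpu": 1}},
        "volumeMounts": [
            {"name": model_volume, "mountPath": "/models", "readOnly": True}
        ],
    }
```

With `readOnlyRootFilesystem` enabled, a serving framework that writes temporary files will likely need a separate writable scratch volume, which this sketch leaves out.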
The inference API is the primary external attack surface of a deployed ML model. Beyond standard API security (authentication, TLS, input validation), ML-specific concerns include model extraction through systematic querying, adversarial example probing, and resource exhaustion via computationally expensive inputs. Rate limiting for ML endpoints must be more granular than traditional API rate limiting. Limits should be set per user, per endpoint, per model, and per input size. A single large image sent to a vision model can consume significantly more GPU time than a small text query to an NLP model. Token-based rate limiting (limiting total tokens processed rather than just request count) is essential for language model endpoints. The API gateway should also implement request body size limits, input format validation, and timeout enforcement, so that slowloris-style slow requests cannot hold connections open and long-running inference calls cannot tie up GPU resources.
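Token-based rate limiting can be sketched as a per-user token bucket in which each request is charged its token count rather than a flat cost of one. This is a minimal in-memory sketch; the class name `TokenRateLimiter` and its parameters are hypothetical, it is not thread-safe, and a real gateway would likely keep this state in a shared store.

```python
import time
from typing import Callable, Dict, Tuple


class TokenRateLimiter:
    """Per-user token bucket where each request costs the number of tokens it processes."""

    def __init__(self, capacity: int, refill_per_second: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock
        # user_id -> (tokens currently available, time of last update)
        self._buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, user_id: str, token_cost: int) -> bool:
        now = self.clock()
        available, last_seen = self._buckets.get(user_id, (float(self.capacity), now))
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - last_seen
        available = min(float(self.capacity), available + elapsed * self.refill_per_second)
        allowed = token_cost <= available
        if allowed:
            available -= token_cost
        self._buckets[user_id] = (available, now)
        return allowed
```

Because the cost is the token count, one very long prompt drains the same budget as many short ones, which is the behavior a plain request counter cannot express. The injectable `clock` exists only to make the limiter testable.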
ML deployments involve multiple credential types: API keys for inference endpoints, model registry access tokens, feature store credentials, monitoring service tokens, and cloud provider IAM credentials for GPU instances. These credentials must never be hardcoded in container images, model serving configurations, or CI/CD pipelines. Use a secrets management service (HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager) with short-lived credentials rotated automatically. Model registries (MLflow, Weights and Biases, SageMaker Model Registry, Vertex AI Model Registry) should be treated as critical infrastructure with access controls, audit logging, and integrity verification. Implement least-privilege access: the model serving process should only have read access to model artifacts, not write access to the registry.
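On the consuming side, short-lived credentials imply that the serving process re-fetches a secret instead of reading it once at startup. The sketch below caches a value and refreshes it after a time-to-live. The `fetch` callable is a stand-in for whatever client call retrieves the secret from the secrets manager; no specific vendor API is assumed, and the class name is hypothetical.

```python
import time
from typing import Callable, Optional


class ShortLivedSecret:
    """Hold a secret in memory only and re-fetch it once its TTL has elapsed."""

    def __init__(self, fetch: Callable[[], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch              # hypothetical wrapper around a secrets-manager client
        self._ttl = ttl_seconds
        self._clock = clock
        self._value: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._value is None or now >= self._expires_at:
            # Refresh from the secrets manager; nothing is written to disk or the image.
            self._value = self._fetch()
            self._expires_at = now + self._ttl
        return self._value
```

Setting the TTL below the rotation interval keeps the process from holding a credential that has already been revoked.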
ML inference workloads should be network-segmented from training workloads, data processing pipelines, and general-purpose compute. In Kubernetes, use NetworkPolicies to restrict which pods can communicate with the model serving pods. The inference pods should be able to receive requests from the API gateway and send responses back, but should not have outbound internet access (preventing data exfiltration) or access to the training data storage (preventing unauthorized data access if the inference pod is compromised). For managed ML platforms, use VPC peering or private endpoints to keep model traffic within the private network. GPU-to-GPU communication (used in multi-GPU inference) should be isolated to dedicated subnets with no external routing.
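The Kubernetes rule described above can be sketched as a NetworkPolicy manifest, built here as a Python dictionary and serializable to JSON. The labels, policy name, and port are hypothetical, and the sketch assumes the API gateway pods run in the same namespace as the serving pods and that the cluster's network plugin enforces NetworkPolicies.

```python
import json


def inference_network_policy(namespace: str, gateway_label: str = "api-gateway",
                             serving_label: str = "model-server", port: int = 8080) -> dict:
    """Allow ingress to serving pods only from gateway pods and deny all egress.

    Assumptions: pods are selected by a hypothetical "app" label, and the
    gateway lives in the same namespace as the serving pods.
    """
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "inference-ingress-only", "namespace": namespace},
        "spec": {
            "podSelector": {"matchLabels": {"app": serving_label}},
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{
                "from": [{"podSelector": {"matchLabels": {"app": gateway_label}}}],
                "ports": [{"protocol": "TCP", "port": port}],
            }],
            # Listing Egress in policyTypes with no egress rules denies all outbound traffic.
            "egress": [],
        },
    }
```

Writing `json.dumps(inference_network_policy("ml-serving"), indent=2)` to a file produces a manifest that `kubectl apply -f` accepts. Replies on connections the gateway opened are still allowed, but the empty egress list also blocks DNS lookups, so a pod that must resolve internal service names would need an explicit egress rule for that.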
Production ML monitoring extends traditional infrastructure monitoring with ML-specific signals. Model prediction distribution should be tracked in real time: a sudden shift in output distribution may indicate a model swap attack, data pipeline corruption, or an adversarial campaign. Input data quality metrics (out-of-distribution detection, schema validation failure rates) provide early warning of adversarial probing. Performance baselines for inference latency and throughput enable detection of resource exhaustion attacks. All prediction requests should be logged with hashed inputs, outputs, model version, timestamp, and authenticated user identity, creating an audit trail for forensic analysis. Incident response for ML systems should include model rollback (reverting to the previous known-good model version), prediction quarantine (holding suspicious predictions for review), and model isolation (disconnecting the inference endpoint while preserving the model for analysis).
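One simple way to track prediction distribution for a classifier is to compare the class frequencies in a recent window against a baseline. The sketch below uses total variation distance as the shift measure. That choice, the function names, and the 0.2 threshold are illustrative assumptions rather than recommended values.

```python
from collections import Counter
from typing import Dict, Iterable


def class_frequencies(labels: Iterable[str]) -> Dict[str, float]:
    """Convert a window of predicted labels into relative frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no predictions to summarize")
    return {label: n / total for label, n in counts.items()}


def total_variation_distance(baseline: Dict[str, float], recent: Dict[str, float]) -> float:
    """Half the L1 distance between two distributions: 0 means identical, 1 means disjoint."""
    labels = set(baseline) | set(recent)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - recent.get(k, 0.0)) for k in labels)


def drift_alert(baseline: Dict[str, float], recent_labels: Iterable[str],
                threshold: float = 0.2) -> bool:
    """Flag the window when its distribution moves further than the threshold.

    Assumption: 0.2 is a placeholder; a real threshold would be tuned on
    historical traffic to balance false alarms against missed shifts.
    """
    return total_variation_distance(baseline, class_frequencies(recent_labels)) > threshold
```

An alert from a check like this does not identify the cause. A model swap, upstream data corruption, and a genuine change in traffic all look the same at this level, so it is best treated as a trigger for investigation.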
The ML supply chain is broader and more fragile than the traditional software supply chain. It includes Python packages (PyTorch, TensorFlow, scikit-learn, and their transitive dependencies), pre-trained model weights downloaded from model hubs (Hugging Face, TensorFlow Hub), training data sourced from external providers, and CUDA/cuDNN libraries from NVIDIA. Each component represents a supply chain attack vector. Pin all dependency versions in requirements files, use hash verification for downloaded packages, scan container images for CVEs before deployment, verify model file integrity with cryptographic checksums, and maintain a software bill of materials (SBOM) for your ML stack. The SLSA framework (Supply-chain Levels for Software Artifacts) provides a maturity model for supply chain security that maps well to ML-specific assets when extended to cover model artifacts and training data provenance.
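The pinning and hash-verification advice can be checked mechanically in CI. The sketch below flags requirement lines that lack an exact `==` pin or a `--hash=sha256:` option, which is the form pip expects when installing with `--require-hashes`. The function name is hypothetical and the parsing is deliberately simple: it does not handle every requirement syntax pip accepts, such as URLs or environment markers.

```python
import re
from typing import List

# A package name (optionally with extras) followed by an exact version pin.
_PINNED = re.compile(r"^[A-Za-z0-9][A-Za-z0-9._\-\[\],]*==[^\s;]+")


def unpinned_or_unhashed(requirements_text: str) -> List[str]:
    """Return requirement lines that lack an exact '==' pin or a sha256 hash option."""
    problems = []
    # Join backslash continuations so each requirement becomes one logical line.
    logical = requirements_text.replace("\\\n", " ")
    for line in logical.splitlines():
        line = line.split(" #", 1)[0].strip()  # drop trailing comments
        if not line or line.startswith("#") or line.startswith("-"):
            continue  # skip blanks, comment lines, and pip options such as --index-url
        if not _PINNED.match(line) or "--hash=sha256:" not in line:
            problems.append(line)
    return problems
```

A CI job can fail the build whenever the returned list is non-empty, which catches a loosened pin before it reaches a container image.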
Key risks include model artifact tampering, inference API abuse, container escape vulnerabilities, supply chain attacks, pickle deserialization exploits, GPU memory leaks exposing sensitive data, and side-channel attacks revealing model architecture. Each requires specific mitigations beyond standard software security.
Cryptographically sign models using GPG or Sigstore, encrypt at rest with AES-256, embed SHA-256 checksums in deployment manifests, use access-controlled model registries with audit logging, enforce immutable storage for production versions, and separate training from serving environments.
Use both API keys and mTLS in a defense-in-depth approach. API keys handle application-level authentication, while mTLS provides transport-level mutual authentication. For internal services, use mTLS via a service mesh. For external endpoints, combine API keys or OAuth2 with TLS.
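For the application-level half, API key verification can be sketched as follows, assuming the service stores only SHA-256 hashes of issued keys and compares them in constant time. The function names are hypothetical, and the transport-level half (TLS or mTLS) is assumed to be handled by the gateway or service mesh.

```python
import hashlib
import hmac
from typing import Iterable


def hash_api_key(api_key: str) -> str:
    """Hash a key for storage so a leaked key table does not expose usable keys."""
    return hashlib.sha256(api_key.encode("utf-8")).hexdigest()


def is_valid_api_key(presented_key: str, stored_key_hashes: Iterable[str]) -> bool:
    """Check a presented key against stored hashes using constant-time comparison."""
    presented = hash_api_key(presented_key)
    # Evaluate every comparison before deciding, so timing does not depend on
    # which stored entry (if any) matched.
    matches = [hmac.compare_digest(presented, stored) for stored in stored_key_hashes]
    return any(matches)
```

A plain hash is reasonable here only because API keys are assumed to be long random strings. Low-entropy secrets such as passwords would need a slow, salted password hash instead.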
Use safe formats that do not rely on pickle (ONNX, SavedModel, SafeTensors) when possible. Never load untrusted models. Scan pickle-based files with fickling or picklescan before loading. Sandbox deserialization with restricted filesystem and network access, and use seccomp profiles to limit system calls. In PyTorch, torch.load accepts weights_only=True, which restricts unpickling to tensors and other allow-listed types instead of arbitrary objects.
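The idea behind scanning can be illustrated with the standard library alone: `pickletools.genops` walks a pickle stream opcode by opcode without executing it. The sketch below reports opcodes that import objects or call them. It is a simplified illustration, not a reimplementation of fickling or picklescan, and the opcode set and function name are assumptions made for this example.

```python
import pickle
import pickletools
from typing import List

# Opcodes that import a global or construct/call an object during unpickling.
_SUSPICIOUS = {"GLOBAL", "STACK_GLOBAL", "REDUCE", "INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"}


def suspicious_opcodes(data: bytes) -> List[str]:
    """List import/call opcodes present in a pickle stream, without unpickling it."""
    found = {op.name for op, _arg, _pos in pickletools.genops(data) if op.name in _SUSPICIOUS}
    return sorted(found)


def is_plain_data(data: bytes) -> bool:
    """True when the stream contains only built-in containers and scalars."""
    return not suspicious_opcodes(data)
```

Legitimate pickled models also import classes and call constructors, so this check flags them too. It can only confirm that a stream is pure data. Deciding whether a flagged file is malicious means inspecting which globals it references, which is what the dedicated scanners do.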
Monitor inference latency, prediction distribution drift, input validation failures, authentication failures, model version mismatches, GPU resource anomalies, and error rates. Maintain complete audit logs of all requests with hashed inputs, outputs, model version, and user identity.
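An audit log entry with these fields can be sketched as one JSON line per request. The field names and the `audit_record` function are hypothetical, and the prediction is assumed to be JSON-serializable.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any


def audit_record(raw_input: bytes, prediction: Any, model_version: str, user_id: str) -> str:
    """Build one JSON audit line that stores a hash of the input, not the input itself."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user_id": user_id,
        "model_version": model_version,
        # Only the digest is logged, so the log can link repeated inputs
        # without retaining their content.
        "input_sha256": hashlib.sha256(raw_input).hexdigest(),
        "prediction": prediction,
    }
    return json.dumps(record, sort_keys=True)
```

An unkeyed hash of a short or guessable input can be reversed by brute force. If inputs are sensitive and low in entropy, a keyed hash (for example HMAC with a key held outside the log store) is a safer variant of the same idea.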