The development of machine learning systems for production environments requires a fundamentally different approach than experimental or academic model building. While training a model with satisfactory performance on a held-out test set remains necessary, it represents a minor fraction of the total engineering effort in a production-grade AI system. Industry surveys consistently identify data integration, versioning, validation, and monitoring as the primary sources of technical debt and operational failures. According to a 2022 analysis, approximately 68% of production incidents in ML-enabled systems originated outside the model code itself – in data pipelines, feature stores, or inference serving layers.
This article provides a systematic engineering framework for end-to-end AI system design. The framework comprises seven discrete stages, each with defined inputs, outputs, validation criteria, and common failure modes.
The Seven-Stage Framework
The table below consolidates the essential components of a production-ready AI system. Each row corresponds to a mandatory stage; the rightmost column specifies the acceptance criterion that must be satisfied before proceeding to the subsequent stage.
| Stage | Input | Primary Activities | Output Artifact | Acceptance Criterion |
|---|---|---|---|---|
| 1. Problem formulation & metric selection | Business requirements, system constraints | Define prediction target, select loss function, and establish SLA thresholds | ML specification document (one page) | Metric aligned with business cost structure; latency SLA documented |
| 2. Data ingestion & versioning | Raw data (logs, images, time series) | Import pipeline, checksum computation, registration in the version control system | Versioned dataset in object storage | Full dataset reconstruction from sources in < 1 hour |
| 3. Data validation | Versioned dataset | Apply expectations (distributions, nulls, types, cross-field constraints) | Validation report (JSON + HTML) | All expectations passed or explicitly overridden with justification |
| 4. Feature engineering & storage | Validated raw data | Transformations, aggregations, encoding | Feature store (offline + online backends) | Reproducible feature retrieval by ID under 50 ms |
| 5. Model training & experiment tracking | Features + labels | Hyperparameter optimization, cross-validation, artifact logging | Registered model in the experiment registry | Metrics reproducible upon re-run; no data leakage |
| 6. Deployment (online or batch) | Registered model | Containerization, endpoint implementation, canary strategy | Container image + deployed service | 99.9% successful requests over 48 hours of canary traffic |
| 7. Observability & drift monitoring | Inference logs, delayed ground truth | Drift detection (PSI, KS test), dashboard configuration | Alerts + visualizations | Alert triggered at 5% performance degradation or PSI > 0.2 |
Stage-by-Stage Technical Analysis
Stage 1: Problem Formulation and Metric Selection
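Teams often adopt convenient metrics instead of economically meaningful ones. In imbalanced problems such as brain CT screening, false negatives and false positives carry different costs, so a single unweighted score hides the trade-off that matters. Use an F-beta metric with beta chosen to reflect the cost structure: beta > 1 weights recall more heavily, which suits cases where a missed finding is the costlier error. Required deliverable: a one-page ML spec containing the prediction target, inference-time inputs, latency SLA, target metric, and error cost structure.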
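As a minimal sketch of the metric, F-beta can be computed directly from confusion-matrix counts. The function name `f_beta` and the example counts are illustrative assumptions, not part of any particular library.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta from confusion-matrix counts.

    beta > 1 weights recall more heavily than precision;
    beta < 1 does the opposite; beta = 1 gives the usual F1.
    """
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    # With no positives and no predictions the score is undefined;
    # returning 0.0 is a convention chosen for this sketch.
    return (1 + b2) * tp / denominator if denominator else 0.0


# Hypothetical counts: 80 true positives, 20 false positives, 40 false negatives.
f1_score = f_beta(80, 20, 40, beta=1.0)
f2_score = f_beta(80, 20, 40, beta=2.0)
```

With these counts recall is lower than precision, so F2 comes out lower than F1: raising beta penalizes the missed cases more strongly.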
Stage 2: Data Ingestion and Versioning
Raw storage without versioning makes results irreproducible. Proper versioning requires object storage (S3, GCS), a manifest with cryptographic hashes (SHA-256), and a version identifier managed by a tool such as DVC or by a custom solution. These controls become essential when training data originates from multiple internal sources, annotation vendors, or external AI data collection companies, where maintaining dataset consistency and traceability is critical for reproducible ML systems.
Example manifest row (CSV format):
brain_ct/train/patient_042/slice_013.dcm, a3f5c6e1d8a7b9c0, 2025-11-01
The manifest enables the validation stage to detect unintended file modifications. If a file changes outside the version control process, validation fails before training begins.
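The manifest check can be sketched with the standard library alone. The function names (`sha256_of`, `build_manifest`, `verify_manifest`) and the three-column layout (relative path, hash, registration date) are assumptions modelled on the sample row. The sketch writes full 64-character SHA-256 digests, whereas the sample row shows a shorter value.

```python
import csv
import hashlib
from datetime import date
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large scans do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root, manifest_path):
    """Write one CSV row per file: relative path, SHA-256, registration date.

    Keep manifest_path outside root so the manifest does not list itself.
    """
    root = Path(root)
    today = date.today().isoformat()
    rows = [
        (p.relative_to(root).as_posix(), sha256_of(p), today)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    ]
    with open(manifest_path, "w", newline="") as fh:
        csv.writer(fh).writerows(rows)
    return rows


def verify_manifest(root, manifest_path):
    """Return relative paths that are missing or whose content hash changed."""
    root = Path(root)
    problems = []
    with open(manifest_path, newline="") as fh:
        for rel_path, expected_hash, _registered in csv.reader(fh):
            target = root / rel_path
            if not target.is_file() or sha256_of(target) != expected_hash:
                problems.append(rel_path)
    return problems
```

A pipeline would typically call `verify_manifest` as the first validation step and refuse to start training if the returned list is non-empty.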
Stage 3: Data Validation
Validation operates in three layers: (A) schema expectations (types, ranges, nulls) via Great Expectations or Pandera; (B) distributional expectations, using the KS test for continuous features or the chi-square test for categorical ones, with a shift flagged when p < 0.01; (C) target variable expectations (class proportion deviation below 15%).
Example from practice: After a scanner hardware upgrade, the mean pixel intensity in brain CT images shifted by 1.8 standard deviations. Layer B captured this within the first batch. The pipeline halted, and the model was retrained on data from the new scanner. Without validation, degraded performance would have persisted for weeks.
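The Layer B check described above can be illustrated with a two-sample KS test. In practice a library routine such as `scipy.stats.ks_2samp` would normally be used. The sketch below is a dependency-free approximation that uses the asymptotic Kolmogorov series for the p-value, and the function name, sample sizes, and synthetic data are assumptions made for illustration.

```python
import math
import random


def ks_two_sample(reference, current):
    """Return (D, approximate p-value) for a two-sample KS test.

    D is the largest gap between the two empirical CDFs. The p-value
    uses the asymptotic Kolmogorov series, so it is only approximate
    for small samples.
    """
    ref, cur = sorted(reference), sorted(current)
    n, m = len(ref), len(cur)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        x = min(ref[i], cur[j])
        while i < n and ref[i] <= x:
            i += 1
        while j < m and cur[j] <= x:
            j += 1
        d = max(d, abs(i / n - j / m))
    effective_n = n * m / (n + m)
    lam = (math.sqrt(effective_n) + 0.12 + 0.11 / math.sqrt(effective_n)) * d
    if lam < 0.2:
        # The series converges slowly here and the p-value is ~1 anyway.
        return d, 1.0
    p = 2.0 * sum(
        (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
        for k in range(1, 101)
    )
    return d, min(max(p, 0.0), 1.0)


# Synthetic stand-in for per-image mean intensity (standardized units).
rng = random.Random(0)
baseline = [rng.gauss(0.0, 1.0) for _ in range(500)]
after_upgrade = [rng.gauss(1.8, 1.0) for _ in range(500)]  # 1.8 SD shift

d_stat, p_value = ks_two_sample(baseline, after_upgrade)
halt_pipeline = p_value < 0.01  # Layer B threshold from the text
```

A shift of 1.8 standard deviations separates the two empirical CDFs so widely that the p-value falls far below the 0.01 threshold, which is why a check of this kind can fire on the first batch.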
Stage 4: Feature Engineering and Storage
A feature store separates computation from training. Offline store uses columnar formats (Parquet, Delta Lake) with versioned metadata. Online store uses low-latency DBs (Redis, DynamoDB) for serving.
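The separation can be sketched as one shared transformation plus a keyed online lookup. Everything here is hypothetical: `compute_features`, the `pixel_values` field, and `InMemoryOnlineStore` (a dictionary standing in for Redis or DynamoDB) are illustrative names, not the API of any feature-store product.

```python
import time


def compute_features(raw):
    """Single transformation shared by the offline and online paths.

    Reusing one function for both paths avoids training/serving skew.
    """
    pixels = raw["pixel_values"]
    mean = sum(pixels) / len(pixels)
    variance = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    return {"mean_intensity": mean, "std_intensity": variance ** 0.5}


class InMemoryOnlineStore:
    """Stand-in for a low-latency key-value store used at serving time."""

    def __init__(self):
        self._rows = {}

    def put(self, entity_id, feature_version, features):
        # Keying by (entity, version) keeps retrieval reproducible.
        self._rows[(entity_id, feature_version)] = dict(features)

    def get(self, entity_id, feature_version):
        return self._rows[(entity_id, feature_version)]


def timed_get(store, entity_id, feature_version):
    """Fetch features and report lookup latency in milliseconds."""
    start = time.perf_counter()
    features = store.get(entity_id, feature_version)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return features, elapsed_ms
```

In a real deployment the same `compute_features` output would also be written to the offline store (for example as Parquet), and `timed_get` would be the place to compare latency against the retrieval budget in the framework table.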
Stage 5: Model Training Without Data Leakage
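Requirements: hyperparameter optimization, cross-validation, and experiment logging (MLflow, Weights & Biases). Train/validation/test splits must be time-aware, meaning every training record predates every validation record, which in turn predates every test record, so that no information from the future leaks into training. Register the model only after reproducibility is confirmed.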
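A time-aware split can be sketched as a partition on two cutoff timestamps. The function name, the `timestamp` key, and the cutoff dates are assumptions for illustration.

```python
from datetime import datetime


def time_aware_split(records, train_end, val_end, key="timestamp"):
    """Partition records chronologically.

    train: records before train_end
    val:   records from train_end up to (not including) val_end
    test:  records at or after val_end
    """
    ordered = sorted(records, key=lambda r: r[key])
    train = [r for r in ordered if r[key] < train_end]
    val = [r for r in ordered if train_end <= r[key] < val_end]
    test = [r for r in ordered if r[key] >= val_end]
    return train, val, test


# Hypothetical records: one per month of 2025.
records = [
    {"id": month, "timestamp": datetime(2025, month, 1)}
    for month in range(1, 13)
]
train, val, test = time_aware_split(
    records,
    train_end=datetime(2025, 9, 1),
    val_end=datetime(2025, 11, 1),
)
```

Unlike a random shuffle, this guarantees that the model is always evaluated on data newer than anything it was trained on, which mirrors how it will be used in production.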
Stage 6: Deployment (Online Inference)
A production model endpoint must satisfy a minimum contract:
Request: Structured JSON with required fields, or base64-encoded image with metadata.
Response:
{
  "prediction": 1,
  "probability": 0.92,
  "model_version": "v2.1.0",
  "latency_ms": 147
}
Infrastructure checklist: containerization (Docker image under 5 GB), a health endpoint (/health), structured logging (JSONL with request_id, latency, prediction, and model version), and canary deployment (5% of traffic for 48 hours, then ramp-up). For batch inference, prioritize idempotency and checkpointing.
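The response contract and the structured-logging item can be sketched together. This is a framework-agnostic illustration: `predict_proba` is a hypothetical stand-in that returns a fixed value instead of calling a model, and `handle_request` and the 0.5 decision threshold are assumptions rather than part of any serving library.

```python
import json
import time
import uuid

MODEL_VERSION = "v2.1.0"  # version string taken from the example response


def predict_proba(payload):
    """Hypothetical stand-in for the real model call."""
    return 0.92


def handle_request(payload, log_stream, threshold=0.5):
    """Build the response contract and emit one JSONL log record."""
    request_id = str(uuid.uuid4())
    start = time.perf_counter()
    probability = predict_proba(payload)
    latency_ms = round((time.perf_counter() - start) * 1000)

    response = {
        "prediction": int(probability >= threshold),
        "probability": probability,
        "model_version": MODEL_VERSION,
        "latency_ms": latency_ms,
    }
    log_record = {
        "request_id": request_id,
        "latency_ms": latency_ms,
        "prediction": response["prediction"],
        "model_version": MODEL_VERSION,
    }
    # One JSON object per line keeps the log trivially parseable (JSONL).
    log_stream.write(json.dumps(log_record) + "\n")
    return response
```

In a service, `log_stream` would usually be standard output or a log handler. Passing it in explicitly keeps the handler easy to test.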
Stage 7: Observability and Drift Monitoring
Two distinct types of drift must be monitored, each requiring different detection methods.
- Covariate shift (data drift): The distribution of input features has changed. For brain CT, possible causes include new scanner hardware, a changed acquisition protocol, or a different patient population. Detection metric: Population Stability Index (PSI). Alert threshold: PSI > 0.2.
- Concept shift: The relationship between features and target has changed. This is more dangerous and typically requires model retraining. Detection requires delayed ground truth. The system compares recent prediction quality (on newly labeled data) against the declared baseline. Alert threshold: F1 drop > 5% relative to baseline.
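Both alert rules above can be sketched in a few lines. PSI is computed here as the sum over bins of (actual - expected) * ln(actual / expected). The equal-width binning on the baseline range, the bin count of 10, and the small floor `eps` for empty bins are implementation assumptions, and the function names are illustrative.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a recent sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins
    if width == 0:
        width = 1.0  # degenerate baseline: all values fall into one bin

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Values outside the baseline range are clamped to the edge bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor empty bins at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def should_alert(psi_value, baseline_f1, recent_f1,
                 psi_threshold=0.2, max_relative_drop=0.05):
    """Apply both thresholds from the text: PSI > 0.2 or F1 drop > 5%."""
    relative_drop = (baseline_f1 - recent_f1) / baseline_f1
    return psi_value > psi_threshold or relative_drop > max_relative_drop


# Hypothetical feature values: a baseline and a shifted recent window.
baseline_values = [i / 100 for i in range(100)]
recent_values = [v + 0.5 for v in baseline_values]
drift_score = psi(baseline_values, recent_values)
```

The PSI half of the check runs on inference inputs as they arrive. The F1 half can run only once delayed ground truth is available, which is why the two are evaluated on different schedules in practice.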
Conclusion
End-to-end AI system design is a structured engineering discipline. Each of the seven stages, from problem formulation to observability, guards against its own class of failure. Omitting data validation, feature versioning, or drift monitoring leads to technical debt and incidents. The model attracts most of the attention, but the surrounding infrastructure requires equally deliberate design.
Following the framework avoids common failure modes in production ML. Key practices: write ML specs before code, validate before training, version datasets and features, deploy with canary strategies and structured logging, and configure drift alerts before production. These practices separate reliable systems from ones that fail silently. The choice is not speed versus rigour; it is rigour now versus incidents later.