Layer 1: Microservices Foundation

The distributed system that generates real logs, traces, and metrics for the AI intelligence layer. It includes chaos engineering controls to simulate the realistic, unpredictable failure patterns seen in production systems: you can run controlled experiments with probabilistic, time-bound, and intensity-based failures across dependent services.
Client
│
▼
Auth Service :8001 ← Entry point, token generation, trace_id origin
│
▼
Product Service :8003 ← Business logic, calls auth + db
│
▼
DB Service :8002 ← In-memory store, primary failure generator
Client
│
▼
Payment Service :8004 ← Checkout flow, calls auth + db (+ Stripe)
Cluster Observability Stack
│
▼
Anomaly Detector :8006 ← ML Model: Proxies Prometheus metrics & runs Isolation Forest
│
▼
AI Operator :8005 ← AI Model: Receives ML webhooks, executes Gemini LLM recovery
Dependency graph: product → auth, product → db, payment → auth, payment → db, payment → stripe
This chain is what enables Root Cause Analysis in Layer 2.
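To make the dependency chain concrete, the sketch below walks the graph to list which upstream dependencies RCA should examine when a service degrades. This is illustrative only; `DEPS` and `root_cause_candidates` are invented names, not the repo's actual Layer 2 code.

```python
# Hypothetical sketch: the dependency graph from this section as an
# adjacency map (service -> services it calls). Names match the README.
DEPS = {
    "product": ["auth", "db"],
    "payment": ["auth", "db", "stripe"],
    "auth": [],
    "db": [],
    "stripe": [],
}

def root_cause_candidates(failing: str) -> list:
    """Transitive dependencies of a failing service: where RCA looks first."""
    seen, stack, order = set(), list(DEPS[failing]), []
    while stack:
        svc = stack.pop()
        if svc not in seen:
            seen.add(svc)
            order.append(svc)
            stack.extend(DEPS[svc])
    return order

print(root_cause_candidates("payment"))  # auth, db, stripe in some order
```

When product-service reports timeouts, the walk immediately narrows the suspect list to auth and db, which is exactly the cascade the experiments below demonstrate.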
- Probabilistic failure simulation: Trigger failures randomly using per-request probability controls.
- Time-bound failures: Configure automatic failure deactivation after a specified duration.
- Intensity-based chaos controls: Scale timeout length and CPU-pressure impact safely during experiments.
- Cascading failure testing: Observe upstream/downstream impact when auth or DB degrades.
- Distributed trace tracking: Follow request flow across services using a shared trace_id/X-Trace-ID header.
- Structured observability output: JSON logs and metrics-ready behavior for analysis and RCA.
- Cart clearing after payment: Automatically clears user cart upon successful checkout (real or simulated).
- Mobile-friendly chaos control: Phone-based UI to inject failures and run load tests from anywhere.
- K6 load testing: Pre-built load test scenarios (normal, spike, endurance, stress) for performance validation.
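The trace-tracking feature above comes down to one convention: reuse the incoming X-Trace-ID if present, otherwise mint one, and always forward it. A minimal sketch of that convention (helper names are assumptions for illustration, not the services' real code):

```python
import uuid

def ensure_trace_id(incoming_headers: dict) -> str:
    """Reuse the caller's trace id, or originate a new one (as auth does)."""
    return incoming_headers.get("X-Trace-ID") or str(uuid.uuid4())

def outbound_headers(trace_id: str, token: str = "") -> dict:
    """Headers for a downstream call; the trace id is always propagated."""
    headers = {"X-Trace-ID": trace_id}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return headers

# Example hop: a request with no incoming header originates an id,
# and product-service forwards that same id to db-service.
tid = ensure_trace_id({})
hop = outbound_headers(tid)
assert hop["X-Trace-ID"] == tid
```

Because every hop copies the header verbatim, one id stitches the whole Client → Auth → Product → DB chain together in Layer 2.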
To run the full stack (including apps, Prometheus, Jaeger, Loki, and Grafana) locally on a Kubernetes cluster:
# 1. Start Minikube
minikube start
# 2. Point terminal to Minikube's Docker daemon
# PowerShell:
minikube docker-env | Invoke-Expression
# Bash/Zsh:
eval $(minikube docker-env)
# 3. Build application images directly into the Minikube registry
docker build -t auth-service:latest ./services/auth
docker build -t db-service:latest ./services/db
docker build -t product-service:latest ./services/product
docker build -t payment-service:latest ./services/payment
docker build -t anomaly-detector:latest ./services/anomaly-detector
docker build -t ai-operator:latest ./services/ai-operator
# 4. Deploy Base Services and Observability Stack
kubectl apply -k infra/k8s/base
kubectl apply -k infra/k8s/observability
# 5. Wait for pods to initialize
kubectl get pods -A -w
# 6. Expose Dashboards via Port-Forwarding
# (Run these in separate terminal tabs)
kubectl port-forward svc/grafana 3000:3000 -n observability
kubectl port-forward svc/jaeger-all-in-one-query 16686:16686 -n observability
kubectl port-forward svc/prometheus 9090:9090 -n observability
# Expose Application Services
kubectl port-forward svc/product-service 8003:8003 -n archaics
kubectl port-forward svc/payment-service 8004:8004 -n archaics
# Expose AI-Ops Services
kubectl port-forward svc/ai-operator 8005:8005 -n archaics
kubectl port-forward svc/anomaly-detector 8006:8006 -n archaics
# View AI-Ops Pipeline Logs
# kubectl logs -f deployment/anomaly-detector -n archaics
# kubectl logs -f deployment/ai-operator -n archaics

To run the stack with Docker Compose instead:

docker compose -f infra/docker/docker-compose.yml up --build

All four services start with health checks. product-service and payment-service wait for auth/db before starting.
# Terminal 1 — Auth Service
cd services/auth
pip install -r requirements.txt
uvicorn main:app --port 8001 --reload
# Terminal 2 — DB Service
cd services/db
pip install -r requirements.txt
uvicorn main:app --port 8002 --reload
# Terminal 3 — Product Service
cd services/product
pip install -r requirements.txt
AUTH_SERVICE_URL=http://localhost:8001 DB_SERVICE_URL=http://localhost:8002 \
uvicorn main:app --port 8003 --reload
# Terminal 4 — Payment Service
cd services/payment
pip install -r requirements.txt
AUTH_SERVICE_URL=http://localhost:8001 DB_SERVICE_URL=http://localhost:8002 STRIPE_API_KEY=sk_test_dummy \
uvicorn main:app --port 8004 --reload

To interact with the services from your local machine (e.g. via the Dashboard), you must port-forward them:
# Run each in a separate terminal or as background jobs
kubectl port-forward svc/auth-service 8001:8001 -n archaics
kubectl port-forward svc/db-service 8002:8002 -n archaics
kubectl port-forward svc/product-service 8003:8003 -n archaics
kubectl port-forward svc/payment-service 8004:8004 -n archaics
kubectl port-forward svc/ai-operator 8005:8005 -n archaics
kubectl port-forward svc/anomaly-detector 8006:8006 -n archaics

A helper script, .\scripts\port-forward.ps1, is provided to run all of these automatically in the background on Windows.
The repo includes a Next.js dashboard and observability services:
cd apps/dashboard
npm install
npm run dev

Open: http://localhost:7000
Phone-friendly UI to inject failures and run load tests:
./scripts/serve-control.sh

Open on laptop: http://localhost:8080
Open on phone: http://<laptop-ip>:8080
By default the dashboard targets the canonical service ports:
AUTH_SERVICE_URL=http://127.0.0.1:8001
DB_SERVICE_URL=http://127.0.0.1:8002
PRODUCT_SERVICE_URL=http://127.0.0.1:8003
PAYMENT_SERVICE_URL=http://127.0.0.1:8004
AI_OPERATOR_URL=http://127.0.0.1:8005
ANOMALY_DETECTOR_URL=http://127.0.0.1:8006
Override those environment variables before npm run dev if your services are exposed elsewhere.
Install k6:
brew install k6 # macOS
apt install k6 # Ubuntu/Debian
choco install k6 # Windows

# Normal load test (10 VUs, 30s)
./load/run-test.sh normal
# Spike test (0→50→0 VUs)
./load/run-test.sh spike
# Endurance test (5 VUs, 2 minutes)
./load/run-test.sh endurance
# Stress test (0→100 VUs)
./load/run-test.sh stress
# Run with failures pre-injected
./load/run-test.sh normal --with-failures

| Script | VUs | Duration | Scenario | Purpose |
|---|---|---|---|---|
| normal.js | 10 | 30s | Full user journey | Establish baseline |
| spike.js | 0→50→0 | 45s | Sudden traffic spike | Test spike handling |
| endurance.js | 5 | 2min | Repeated operations | Find memory leaks |
| stress.js | 0→100 | 70s | Ramp to breaking point | Find limits |
Mobile-friendly interface to control the entire system:
./scripts/serve-control.sh

- Failure Injection: Select type, service, duration
- Load Testing: Run k6 tests from the UI
- Quick Actions: Error storms, high latency triggers
- Response Logging: Real-time feedback on actions
./scripts/chaos-help.sh

1. Start the control dashboard:
   ./scripts/serve-control.sh
2. Open it on phone/laptop:
   http://localhost:8080 (laptop) or http://<laptop-ip>:8080 (from phone)
3. Run a baseline load test:
   ./load/run-test.sh normal
4. From the dashboard, inject a failure (e.g., a timeout on product-service)
5. Run a spike test to see the impact:
   ./load/run-test.sh spike
6. Watch the effects on the dashboard (http://localhost:7000)
7. Click "Reset All Failures" to recover
8. Verify recovery with another baseline test
| Method | Endpoint | Description |
|---|---|---|
| POST | /signup | Register a user |
| POST | /login | Login, get JWT token |
| GET | /validate | Validate a token (used by product-service) |
| GET | /health | Health + failure state |
| POST | /inject-failure?type=X | Inject failure |
| POST | /reset | Clear failure |
| Method | Endpoint | Description |
|---|---|---|
| GET | /products | All products |
| POST | /cart/add | Add item to cart |
| GET | /cart/{user_id} | Get user cart |
| POST | /cart/clear | Clear user cart |
| GET | /health | Health + failure state |
| POST | /inject-failure?type=X | Inject failure |
| POST | /reset | Clear failure |
| Method | Endpoint | Description |
|---|---|---|
| GET | /products | Fetch catalog (requires auth token) |
| POST | /cart/add | Add to cart (requires auth token) |
| GET | /cart | View cart (requires auth token) |
| POST | /cart/clear | Clear cart (requires auth token) |
| GET | /health | Health + failure state |
| POST | /inject-failure?type=X | Inject failure |
| POST | /reset | Clear failure |
| Method | Endpoint | Description |
|---|---|---|
| POST | /create-checkout-session | Creates a Stripe checkout session |
| GET | /health | Health + failure state |
| POST | /inject-failure?type=X | Inject failure |
| POST | /reset | Clear failure |
| Method | Endpoint | Description |
|---|---|---|
| POST | /analyze | Receives anomaly webhook, queries Gemini & initiates fixes |
| GET | /health | Health check for the AI agent |
| Method | Endpoint | Description |
|---|---|---|
| GET | /health | Returns ML status and number of baseline samples gathered |
Each service supports POST /inject-failure with query params:
- `type`: failure mode (`timeout`, `error`, `cpu`, `crash`; plus `bad_data` on the DB service)
- `intensity`: positive integer multiplier (default: `1`)
- `probability`: trigger chance per request, from `0.0` to `1.0` (default: `1.0`)
- `duration`: optional active window in seconds; the failure auto-disables when it elapses
Reset any service with POST /reset.
| Type | Effect |
|---|---|
| timeout | Adds async delay (2 * intensity seconds) to simulate latency/hangs |
| error | Returns simulated HTTP 500 failure response |
| cpu | Starts CPU pressure workload in background to simulate resource exhaustion |
| crash | Terminates service process (os._exit(1)) |
| bad_data | Returns intentionally corrupted payloads (DB service only) |
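The probability/duration/intensity semantics above amount to a small gate each request passes through. A minimal sketch of that gate, under assumed names (`FailureState`, `should_fire`); this is illustrative, not the services' actual implementation:

```python
import random
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class FailureState:
    """One injected-failure record per service (hypothetical shape)."""
    type: Optional[str] = None          # "timeout", "error", "cpu", "crash", "bad_data"
    intensity: int = 1
    probability: float = 1.0
    expires_at: Optional[float] = None  # set when duration= was supplied

def should_fire(state: FailureState, now: float, rng=random.random) -> bool:
    """Decide whether the injected failure affects this particular request."""
    if state.type is None:
        return False
    if state.expires_at is not None and now >= state.expires_at:
        state.type = None               # duration elapsed: auto-disable
        return False
    return rng() < state.probability    # per-request probabilistic trigger

# Example: a timeout at 40% probability, active for the next 60 seconds,
# that would add the README's 2 * intensity seconds of delay when it fires.
s = FailureState(type="timeout", intensity=2, probability=0.4,
                 expires_at=time.time() + 60)
delay_seconds = 2 * s.intensity
```

Injecting `rng` makes the probabilistic branch deterministic in tests, which is useful when asserting on intermittent-failure behavior.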
# Probabilistic failure on Product service (30% of requests fail with error)
curl -X POST "http://localhost:8003/inject-failure?type=error&probability=0.3"
# Duration-based timeout on Auth service (active for 45 seconds)
curl -X POST "http://localhost:8001/inject-failure?type=timeout&intensity=2&duration=45"
# DB bad_data for 20 seconds at full probability
curl -X POST "http://localhost:8002/inject-failure?type=bad_data&probability=1.0&duration=20"
# Reset after experiment
curl -X POST http://localhost:8001/reset
curl -X POST http://localhost:8002/reset
curl -X POST http://localhost:8003/reset

# 1. Sign up
curl -X POST http://localhost:8001/signup \
-H "Content-Type: application/json" \
-d '{"email": "alice@example.com", "password": "secure123"}'
# 2. Login → get token
TOKEN=$(curl -s -X POST http://localhost:8001/login \
-H "Content-Type: application/json" \
-d '{"email": "alice@example.com", "password": "secure123"}' | python -c "import sys,json; print(json.load(sys.stdin)['access_token'])")
# 3. Fetch products (trace flows Auth → Product → DB)
curl http://localhost:8003/products -H "Authorization: Bearer $TOKEN"
# 4. Add to cart
curl -X POST http://localhost:8003/cart/add \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"product_id": "p1", "quantity": 2}'
# 5. View cart
curl http://localhost:8003/cart -H "Authorization: Bearer $TOKEN"

# Inject DB timeout
curl -X POST "http://localhost:8002/inject-failure?type=timeout"
# Now call product-service — it calls DB, detects timeout, logs upstream impact
curl http://localhost:8003/products -H "Authorization: Bearer $TOKEN"
# Logs show:
# db-service: "Injected DB timeout — sleeping 15s"
# product-service: "DB-service timeout after 8002ms — upstream impact detected"
# Reset
curl -X POST http://localhost:8002/reset

Expected RCA: Root cause = db-service timeout → cascaded to product-service.
# Force Auth errors
curl -X POST "http://localhost:8001/inject-failure?type=error&probability=1.0"
# Product depends on Auth token validation, so protected calls fail upstream
curl http://localhost:8003/products -H "Authorization: Bearer $TOKEN"

Expected behavior: product-service returns an auth-related failure path (401/upstream-unavailability behavior), and logs show the dependency impact.
# Inject corrupted data responses in DB
curl -X POST "http://localhost:8002/inject-failure?type=bad_data&duration=30"
# Product fetch now receives malformed DB payload content
curl http://localhost:8003/products -H "Authorization: Bearer $TOKEN"

Expected behavior: the DB returns intentionally degraded fields (for example name: null, invalid prices), enabling downstream resilience testing.
# Random timeout spikes on DB at 40% probability
curl -X POST "http://localhost:8002/inject-failure?type=timeout&intensity=2&probability=0.4&duration=60"
# Repeated calls show intermittent success/failure patterns
for i in {1..10}; do
curl -s -o /dev/null -w "request $i -> %{http_code}\n" http://localhost:8003/products -H "Authorization: Bearer $TOKEN"
done

Expected behavior: mixed response outcomes that emulate real distributed instability and intermittent degradation.
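Intermittent failures like these are exactly what client-side retries are meant to absorb. A minimal retry-with-backoff sketch (illustrative only; how the repo's callers actually handle flaky dependencies may differ):

```python
import random
import time

def with_retries(call, attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry a flaky callable with exponential backoff plus jitter."""
    for attempt in range(attempts):
        try:
            return call()
        except Exception:
            if attempt == attempts - 1:
                raise                   # out of attempts: surface the error
            # jittered exponential backoff avoids synchronized retry storms
            sleep(base_delay * (2 ** attempt) * (1 + random.random()))

# Simulate the 40%-probability timeout scenario: fail twice, then succeed.
outcomes = iter([True, True, False])
def flaky():
    if next(outcomes):
        raise TimeoutError("injected timeout")
    return 200

assert with_retries(flaky, sleep=lambda s: None) == 200
```

Note that retries amplify load on an already-degraded dependency, which is itself worth observing with the k6 stress scenarios above.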
Every log line is valid JSON:
{
"service": "product-service",
"level": "INFO",
"message": "DB products fetch success: 5 items in 12.3ms",
"timestamp": "2024-01-15T10:30:00.000Z",
"trace_id": "550e8400-e29b-41d4-a716-446655440000"
}

The same trace_id appears across all services for a single request chain, enabling distributed tracing in Layer 2.
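A log line in that shape can be produced with the standard library alone. This is a minimal sketch whose field names follow the example above; the formatter itself and the `trace_id` plumbing via `extra` are assumptions, not the repo's actual logging code:

```python
import json
import logging
from datetime import datetime, timezone

SERVICE_NAME = "product-service"  # assumed per-service constant

class JsonFormatter(logging.Formatter):
    """Emit each record as one JSON object matching the README's layout."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "service": SERVICE_NAME,
            "level": record.levelname,
            "message": record.getMessage(),
            "timestamp": datetime.now(timezone.utc).isoformat(timespec="milliseconds"),
            "trace_id": getattr(record, "trace_id", None),
        })

logger = logging.getLogger(SERVICE_NAME)
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

# Pass the request's trace id via `extra` so it lands in the JSON line.
logger.info("DB products fetch success: 5 items in 12.3ms",
            extra={"trace_id": "550e8400-e29b-41d4-a716-446655440000"})
```

Because every field is machine-readable, Loki queries and the RCA layer can filter on `service`, `level`, and `trace_id` directly.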
The system includes two AI-Ops observability services that monitor the Prometheus metrics and automatically orchestrate fixes using an LLM:
- anomaly-detector: Continuously analyzes multivariate metrics (latency, error rate, CPU usage) using an internal Isolation Forest to predict system decay.
- ai-operator: Triggered by the anomaly detector to orchestrate automatic fixes.
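The Isolation Forest approach can be sketched in a few lines with scikit-learn. The feature columns and thresholds below are illustrative assumptions, not the anomaly-detector's actual configuration:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Fit on a baseline of healthy samples: (latency_ms, error_rate, cpu_pct).
# The distributions here are made up for illustration.
rng = np.random.default_rng(42)
baseline = np.column_stack([
    rng.normal(50, 5, 500),        # latency around 50 ms
    rng.normal(0.01, 0.005, 500),  # roughly 1% error rate
    rng.normal(30, 5, 500),        # roughly 30% CPU
])

model = IsolationForest(contamination=0.01, random_state=0).fit(baseline)

healthy = [[52, 0.012, 31]]
degraded = [[400, 0.35, 95]]       # latency spike + error storm + CPU pressure

print(model.predict(healthy))      # typically [1]: inlier, no action
print(model.predict(degraded))     # typically [-1]: anomaly -> webhook to ai-operator
```

A `-1` prediction is what would trigger the webhook to the AI Operator's `/analyze` endpoint; the multivariate fit is what lets a combined latency/error/CPU shift register even when no single metric crosses a static threshold.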
We have provided a convenient script, generate_errors.ps1 (or .sh), to inject failures and generate authentic traffic hitting the DB and Auth endpoints so that the anomaly detection system triggers an alert and sends an automated payload to the AI operator.
# Before running make sure to expose the local ports for the services
# Run the error generation script
./generate_errors.ps1

Watch the autonomous system detect, analyze, and resolve the issue:
# Terminal 1: Watch the Anomaly Detector compute anomalies and trigger webhooks
kubectl logs -f deployment/anomaly-detector -n archaics
# Terminal 2: Watch the AI Operator analyze the metrics, determine root causes, and execute fixes
kubectl logs -f deployment/ai-operator -n archaics
# Terminal 3: Watch your Kubernetes pods auto-recover
kubectl get pods -n archaics -w