Covered: homelab or professional experience
Gap: ramp-up required
AI platform domain (new territory)
External / third-party integration
Onboarding roadmap — phase overview
Day 1 → Week 4 → Month 3 → Month 6
Phase 1 · Days 1–14
Infrastructure audit · cluster access · existing CI/CD review · Debian system familiarisation
onboarding
Phase 2 · Weeks 3–6
GitOps migration · ArgoCD bootstrap · Helm charts for existing services · container orchestration foundation
GitOps foundation
Phase 3 · Weeks 7–12
AI pipeline infrastructure · MCP server hosting · vector store deployment · low-latency S2S pipeline foundations
AI platform
Phase 4 · Months 4–6
Automated QA module · evaluation scheduling · feedback loop observability · ELK + tracing layer
self-evolving ops
↓
Linux systems & infrastructure baseline
Debian · VM fleet · HA across DCs · provisioning automation · Ansible
Linux (Ubuntu/Debian)
5+ yrs Ubuntu administration · Proxmox host management · systemd · kernel tuning
sysadmin
Ansible
Cluster provisioning · cloud-init integration · idempotent playbooks · role-based structure
config mgmt
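The "idempotent playbooks" point can be illustrated outside Ansible. The following is a minimal Python sketch of the same contract that Ansible modules follow: apply the desired state, and report a change only when something was actually modified. The function name `ensure_line` and the file it edits are hypothetical, chosen for illustration only.

```python
import tempfile
from pathlib import Path


def ensure_line(path: Path, line: str) -> bool:
    """Ensure `line` is present in the file; return True only if the file changed."""
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # desired state already holds, so nothing is written
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "sysctl.conf"  # hypothetical target file
        first = ensure_line(target, "vm.swappiness = 10")
        second = ensure_line(target, "vm.swappiness = 10")
        print(f"first run changed={first}, second run changed={second}")
```

Running the function twice with the same arguments changes the file only once, which is the property that makes repeated playbook runs safe.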
Terraform
IaC for cloud resources · AWS & Azure providers · state management · modular design
IaC
Bash / Python scripting
Automation scripts · operational tooling · Django/Flask backend experience · REST API authoring
scripting
gap
BGP / multi-DC routing
BGP concepts and HA routing design across data centres · subnetting at scale · familiar conceptually
networking
Networking fundamentals
Subnetting · Calico CNI routing · NetworkPolicy egress · pre-DNAT behaviour understood in production
CNI · netpol
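As a small worked example of the subnetting item, the sketch below carves per-node /24 blocks out of a larger pod network using Python's standard `ipaddress` module. The CIDR `10.244.0.0/16` and the helper name `carve_subnets` are assumptions for illustration; they are not taken from the cluster described here.

```python
import ipaddress
from itertools import islice


def carve_subnets(supernet: str, new_prefix: int, count: int) -> list[str]:
    """Return up to `count` subnets of size /new_prefix from inside `supernet`."""
    network = ipaddress.ip_network(supernet)
    # subnets() yields the child networks lazily, in address order
    return [str(subnet) for subnet in islice(network.subnets(new_prefix=new_prefix), count)]


if __name__ == "__main__":
    # Hypothetical pod CIDR split into one /24 per node
    for block in carve_subnets("10.244.0.0/16", 24, 3):
        print(block)
```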
↓
Container orchestration — VM → Kubernetes migration
kubeadm · Helm · ArgoCD · namespace isolation · gradual workload lift-and-shift
Kubernetes (kubeadm)
Production 3-node cluster · v1.31 · CKA in preparation (Q2 2026) · full lifecycle management
K8s v1.31
ArgoCD App-of-Apps
Full GitOps engine · multi-source Helm · selfHeal · bootstrap layer · staging/production split
GitOps
Helm
Pinned chart versions · multi-source apps · values per environment · dry-run validation pattern
package mgmt
Docker / containerd
Container authoring · multi-stage builds · OCI image standards · image promotion pipeline
OCI
gap
Docker Swarm
Swarm orchestration mode · possibly an interim state in the IONOS VM fleet · skills transferable from K8s
swarm
gap
JFrog Artifactory
Enterprise artifact registry · Helm chart hosting · image promotion — same pattern as GHCR/crane
artifact registry
↓
CI/CD pipelines & image security
GitLab CI · GitHub Actions · Trivy CVE gate · image promotion · vulnerability tracking
Implemented pipeline (homelab)
GH Actions (ARC) · self-hosted K8s
→
Trivy scan · block HIGH/CRIT
→
crane promote · push to registry
→
Helm values update · tag → main commit
→
ArgoCD sync → staging / prod
IONOS target pipeline (GitLab CI)
GitLab CI · build + unit test
→
Trivy + SBOM · dependency track
→
JFrog Artifactory · artifact push
→
Manual gate · prod approval
→
ArgoCD sync → production ns
GitHub Actions + ARC
Self-hosted runner in K8s · image build/test/promote · security gate validated in production
CI/CD
GitLab CI
4 yrs professional GitLab CI experience at GNS · pipeline authoring · runner configuration
CI/CD · GNS
Trivy (v0.69.3)
Image CVE scanning · HIGH/CRITICAL block gate · OS + app scope · validated in production pipeline
CVE scan
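The HIGH/CRITICAL block gate can be sketched as a small post-processing step over a Trivy JSON report. This is an illustration under assumptions, not the gate used in the pipeline above: it relies on the report's `Results[].Vulnerabilities[].Severity` layout, and the sample report with its placeholder IDs is made up. In practice the same effect is usually achieved natively with `trivy image --severity HIGH,CRITICAL --exit-code 1`.

```python
import sys

BLOCKING_SEVERITIES = {"HIGH", "CRITICAL"}


def blocking_findings(report: dict) -> list[tuple[str, str, str]]:
    """Collect (target, vulnerability id, severity) for findings that should fail the build."""
    findings = []
    for result in report.get("Results", []):
        # Trivy may omit the key or set it to null when a target is clean
        for vuln in result.get("Vulnerabilities") or []:
            if vuln.get("Severity") in BLOCKING_SEVERITIES:
                findings.append(
                    (result.get("Target", ""), vuln.get("VulnerabilityID", ""), vuln["Severity"])
                )
    return findings


# Made-up report fragment; the IDs are placeholders, not real CVEs
SAMPLE_REPORT = {
    "Results": [
        {"Target": "app (debian 12)", "Vulnerabilities": [
            {"VulnerabilityID": "CVE-0000-0001", "Severity": "CRITICAL"},
            {"VulnerabilityID": "CVE-0000-0002", "Severity": "LOW"},
        ]},
        {"Target": "requirements.txt", "Vulnerabilities": None},
    ]
}

if __name__ == "__main__":
    found = blocking_findings(SAMPLE_REPORT)
    for target, vuln_id, severity in found:
        print(f"{severity}: {vuln_id} in {target}")
    sys.exit(1 if found else 0)  # non-zero exit blocks the pipeline stage
```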
gap
JFrog Artifactory
Enterprise Helm + image registry · Xray scanning integration · same operational pattern as crane/GHCR
artifact mgmt
Jenkins
Listed in JD — Groovy pipeline authoring familiar from previous work; transferable from GitLab CI experience
CI/CD alt
↓
Observability — metrics · logs · traces
Prometheus · Grafana · Loki · ELK · Jaeger · alerting for AI pipeline health
Prometheus
kube-prometheus-stack · ServiceMonitor CRDs · custom alerting rules · cluster-wide scrape
metrics
Grafana
Custom dashboards · OIDC SSO via Keycloak · public dashboards · alerting integration
dashboards
Loki + Promtail
Log aggregation · DaemonSet · MinIO S3 backend for long-term retention · LogQL queries
logs
gap
ELK Stack
Elasticsearch · Logstash · Kibana: operational pattern comparable to Loki (ship, store, query); tool familiarisation needed
log analytics
gap
Jaeger / OpenTelemetry
Distributed tracing for LLM chains and S2S pipelines · trace context propagation · latency analysis
tracing
AI pipeline alerting
Custom metrics for hallucination rate · response quality scores · latency p95/p99 · nightly eval job alerts
AI observability
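To make the p95/p99 latency item concrete, here is a minimal sketch that computes tail percentiles with the nearest-rank method and flags those above a budget. The 500 ms budget mirrors the sub-500ms target stated for the S2S pipeline; the function names and sample values are hypothetical. In a Prometheus setup this would normally be done with histograms and `histogram_quantile` rather than in application code.

```python
import math

LATENCY_BUDGET_MS = 500.0  # mirrors the sub-500ms target; adjust as needed


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def tail_latency_breaches(samples: list[float], budget_ms: float = LATENCY_BUDGET_MS) -> dict[str, float]:
    """Return the tail percentiles (p95, p99) that exceed the latency budget."""
    tails = {"p95": percentile(samples, 95), "p99": percentile(samples, 99)}
    return {name: value for name, value in tails.items() if value > budget_ms}


if __name__ == "__main__":
    # Hypothetical latencies in milliseconds: mostly fast, with two slow outliers
    latencies = [100.0] * 98 + [700.0, 800.0]
    print(tail_latency_breaches(latencies))
```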
Uptime Kuma
Endpoint health monitoring · SLA tracking · public status page · incident notification
SLA · uptime
Alertmanager
Alert routing · inhibition rules · rapid response for unhealthy AI pipeline patterns
alerting
↓
Security, identity & secrets management
Vault · ESO · Keycloak · OIDC · NetworkPolicy · ISO security standards · zero static secrets
HashiCorp Vault
KV v2 · Shamir unseal · Longhorn PVC · ESO ClusterSecretStore · no static secrets in Git
secrets
External Secrets Operator
Vault → K8s secret sync · ESO v2.1.0 · per-namespace stores · least-privilege remoteRefs
ESO v2.1.0
Keycloak 26.x
OIDC/OAuth2 broker · SSO for ArgoCD / Grafana / Nextcloud · group-based RBAC · MFA
OIDC · SSO
NetworkPolicy
Default-deny · Calico enforcement · pre-DNAT egress · documented per-namespace exceptions
default-deny
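A default-deny posture with documented exceptions can be sketched as two manifests: one policy that selects every pod and allows nothing, and one narrow exception. The sketch below builds them as Python dictionaries and prints JSON, which `kubectl apply -f` accepts. The policy names, the namespace `staging`, and the choice of DNS as the example exception are assumptions for illustration.

```python
import json


def default_deny_policy(namespace: str) -> dict:
    """NetworkPolicy that selects all pods and lists both directions with no allow rules."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "default-deny-all", "namespace": namespace},
        "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
    }


def allow_dns_egress(namespace: str) -> dict:
    """Example exception: let all pods in the namespace resolve names via kube-system DNS."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-dns-egress", "namespace": namespace},
        "spec": {
            "podSelector": {},
            "policyTypes": ["Egress"],
            "egress": [{
                "to": [{"namespaceSelector": {
                    "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}}}],
                "ports": [{"protocol": "UDP", "port": 53}, {"protocol": "TCP", "port": 53}],
            }],
        },
    }


if __name__ == "__main__":
    for policy in (default_deny_policy("staging"), allow_dns_egress("staging")):
        print(json.dumps(policy, indent=2))
```

Because NetworkPolicies are additive, the exception widens what the default-deny policy blocks without replacing it.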
RBAC + PSS
Least-privilege service accounts · restricted PSS enforced · no wildcards · audit logging
RBAC · PSS
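The "no wildcards" rule lends itself to an automated check. The following sketch scans the `rules` of a Role or ClusterRole, loaded as a dictionary, and reports every field that uses `*`. The helper name and the sample role are hypothetical; a real setup might enforce the same rule with an admission policy instead.

```python
def wildcard_violations(role: dict) -> list[str]:
    """Return the paths of rule fields that grant access via a '*' wildcard."""
    violations = []
    for index, rule in enumerate(role.get("rules", [])):
        for field in ("apiGroups", "resources", "verbs"):
            if "*" in rule.get(field, []):
                violations.append(f"rules[{index}].{field}")
    return violations


# Hypothetical Role: the first rule is least-privilege, the second is too broad
SAMPLE_ROLE = {
    "kind": "Role",
    "metadata": {"name": "eval-runner", "namespace": "ai-platform"},
    "rules": [
        {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list"]},
        {"apiGroups": ["batch"], "resources": ["*"], "verbs": ["*"]},
    ],
}

if __name__ == "__main__":
    for path in wildcard_violations(SAMPLE_ROLE):
        print(f"wildcard found at {path}")
```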
cert-manager
Let's Encrypt DNS-01 · Cloudflare API · wildcard TLS · mTLS between services
TLS · PKI
gap
ISO 27001 formal
ISMS documentation · audit process · formal control mapping · security posture is implemented; the gap is formal documentation
ISMS
gap
Ansible Vault
Ansible-native secret encryption · transferable from HashiCorp Vault + SOPS experience
ansible secrets
↓
Storage, databases & backup
PostgreSQL · Redis · MinIO S3 · Longhorn PVs · Velero · 3-2-1 backup strategy
CloudNativePG
PostgreSQL operator · WAL archiving to MinIO · PITR · TLS · automated failover · per-app clusters
WAL · PITR
Redis
Caching layer · distributed locking · mandatory for stateful multi-replica workloads
cache
MinIO (S3-compat)
Object storage for CNPG backups · Velero backend · Loki long-term storage · WAL archive target
S3-compat
Longhorn
Default StorageClass · RF=2 replication · PVs for stateful workloads · Velero integration
RF=2 · PV
Velero + rclone
3-2-1 backup strategy · CSI snapshots · offsite OneDrive replication · tested restore procedures
backup · DR
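The 3-2-1 rule (at least three copies, on at least two kinds of media, with at least one offsite) is simple enough to verify mechanically. Below is a minimal sketch of such a check; the `BackupCopy` type and the media labels assigned to Longhorn, MinIO and OneDrive are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    medium: str    # storage type, e.g. "block", "object", "cloud"
    offsite: bool  # True if the copy lives outside the primary site


def satisfies_3_2_1(copies: list[BackupCopy]) -> bool:
    """Check: >= 3 copies, >= 2 distinct media, >= 1 offsite copy."""
    media = {copy.medium for copy in copies}
    return len(copies) >= 3 and len(media) >= 2 and any(copy.offsite for copy in copies)


# Hypothetical inventory mirroring the components named in this section
INVENTORY = [
    BackupCopy("Longhorn volume (primary)", "block", offsite=False),
    BackupCopy("Velero backup in MinIO", "object", offsite=False),
    BackupCopy("rclone replica in OneDrive", "cloud", offsite=True),
]

if __name__ == "__main__":
    print("3-2-1 satisfied:", satisfies_3_2_1(INVENTORY))
```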
Vector store (AI)
pgvector on CNPG or Qdrant/Weaviate for LLM context retrieval · persistent PV · backup included
vector DB
↓
AI platform infrastructure — MCP · LLM · S2S pipeline
MCP servers · vector stores · Speech-to-Speech · LLM hosting · sub-500ms latency target
MCP server hosting
Model Context Protocol servers as K8s Deployments · Helm charts · NetworkPolicy egress to CRM APIs · ESO secrets
MCP · K8s
LLM inference hosting
Open-source LLM inference (vLLM / Ollama) · GPU node scheduling · resource limits · HPA on token throughput
LLM ops
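For the "HPA on token throughput" item, the scaling decision follows the formula documented for the Horizontal Pod Autoscaler: desired replicas = ceil(current replicas × current metric / target metric), skipped when the ratio is within a tolerance of 1. The sketch below reproduces that arithmetic. Using tokens per second per pod as the metric, and the specific numbers, are assumptions; exposing such a metric to the HPA would need a custom metrics adapter, which is not shown.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Replica count per the HPA formula, clamped to [min_replicas, max_replicas]."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Hypothetical: 2 pods each serving 900 tokens/s against a 600 tokens/s target
    print(desired_replicas(current_replicas=2, current_metric=900, target_metric=600))
```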
Vector store
pgvector on CNPG or Qdrant · persistent storage · backup strategy consistent with platform pattern
RAG · retrieval
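What the vector store does at query time can be shown with a toy, in-memory version: rank stored embeddings by cosine similarity to the query embedding and return the closest ones. This is only a sketch of the retrieval idea, not how pgvector or Qdrant are implemented; both use indexes to avoid scanning every vector. The document IDs and two-dimensional vectors are made up.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("zero vector has no direction")
    return dot / norm


def top_k(query: list[float], documents: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query, best first."""
    ranked = sorted(
        documents.items(),
        key=lambda item: cosine_similarity(query, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


if __name__ == "__main__":
    # Made-up 2-D embeddings; real embeddings have hundreds of dimensions
    store = {"faq-billing": [1.0, 0.0], "faq-shipping": [0.0, 1.0], "faq-general": [1.0, 1.0]}
    print(top_k([1.0, 0.1], store, k=2))
```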
gap
S2S / WSS / SRTP
Speech-to-Speech latency pipeline · WebSocket Secure · SRTP media transport · sub-500ms RTT design
voice · streaming
gap
Telephony gateway
SIP/RTP protocols · Twilio / Amazon Connect · IONOS telephony stack — listed as nice-to-have
SIP · RTP
Stateful workload ops
Nextcloud + CNPG + Redis + MinIO — same multi-component pattern required for AI pipeline services
ops pattern
↓
Automated QA & evaluation module
Nightly eval jobs · quality metrics · hallucination alerting · feedback loop plumbing
Kubernetes CronJobs
Scheduled eval runs · nightly test suite · resource-bounded · failure alerting via Alertmanager
scheduling
Prometheus custom metrics
Pushgateway for batch job results · quality score gauges · hallucination rate counters · latency histograms
custom metrics
Alertmanager rules
Threshold breach alerts → PagerDuty / Slack · rapid response SLA for quality degradation · inhibition logic
alerting
Eval result storage
CNPG PostgreSQL for eval run history · Grafana dashboard for quality trends · MinIO for raw eval artifacts
eval persistence
gap
LLM eval frameworks
DeepEval / RAGAS / LangSmith — programmatic quality measurement frameworks · ramp-up during Phase 3
eval tooling
Python pipeline authoring
5 yrs Python incl. ML pipelines · data extraction / preprocessing / validation · directly applicable to eval plumbing
Python · ML ops
Security & compliance envelope
ISO 27001 controls
GDPR / DSGVO
BSI IT-Grundschutz SYS.1.6
OpenID Connect · OAuth2
Zero static secrets in Git
CVE gate in every pipeline
RBAC least-privilege
Audit log → Loki