About
Sole owner of the production infrastructure for an enterprise platform serving 5 clients across higher education, manufacturing, and public transit: 12 Kubernetes clusters, 45 nodes, ~10K RPS, 99.85% uptime. Scope runs end to end, down to an on-prem cluster assembled and deployed on-site in a client's data center. Also designed and shipped an internal CPU-only LLM analytics system, now used daily by DevOps and Analytics teams.
Work Experience
SMART
Infrastructure / ML Platform Engineer
- Sole owner of all production infrastructure for an enterprise platform deployed per-client across higher education, manufacturing, and public transit (5 clients live in production, each on a dedicated cluster for strict isolation of sensitive personal data). The platform spans up to 25 microservices per deployment: identity/SSO, campus & site access (dynamic QR passes, guest management, events), service requests, analytics dashboards, and iOS/Android mobile apps.
- Run 12 Kubernetes clusters: 5 production (one per client; bare-metal + Selectel IaaS), 2 staging, and 5 dev (bare-metal, built from scratch), with full ownership of upgrades, on-call, and incident response.
- Hand-built and deployed an on-prem production cluster for a security-sensitive client: assembled the Supermicro server, installed it on-site in the client's data center, and brought it online inside their locked-down network over a secure WireGuard tunnel (single ingress IP).
- Migrated all 25 platform services off VMware vCenter VMs onto self-built k3s clusters, replacing ad-hoc developer deployments with standardized, centrally managed environments.
- Put every production microservice behind a GitOps pipeline (GitLab CI + ArgoCD): tag-triggered build, rollout, and rollback, raising deploy frequency from weekly to daily and reaching 0 manual production deployments.
- Own backup and disaster recovery across all clusters: automated PostgreSQL backups with WAL-based point-in-time recovery, replicated to multi-region S3; recovery procedures proven against real production incidents.
- Built an internal field operations tracking system deployed across 2 enterprise clients (100+ buildings): FastAPI backend, Grafana dashboards with map visualization and automated issue routing, cutting site issue resolution from months to under a week.
- Prototyped and validated Rook/Ceph on k3s/EC2 (AWS, Terraform) as a replacement for legacy bare-metal MinIO; the evaluation drove adoption, now running on all production bare-metal clusters.
- Designed, built, and own an internal LLM-based infrastructure analytics system, from idea to production: a CPU-only inference pipeline (llama.cpp + Ray) that correlates Kubernetes events, Prometheus metrics, Alertmanager alerts, and Uptime Kuma data to surface root-cause hypotheses; reports are delivered via Telegram and used daily by DevOps and Analytics teams.
Education
ITMO University
2024–2026, Computer Science