Peter Netupskii

Platform / Infrastructure Engineer · Kubernetes · Bare-metal · LLM Systems

Senior · Remote · Batumi, Georgia · Open to relocation

1 yr 10 mos of experience · 47 skills

About

Sole owner of the production infrastructure for an enterprise platform serving 5 clients across higher education, manufacturing, and public transit: 12 Kubernetes clusters, 45 nodes, ~10K RPS, 99.85% uptime. Scope runs end to end, down to an on-prem cluster assembled and deployed on-site in a client's data center. Also designed and shipped an internal CPU-only LLM analytics system, now used daily by DevOps and Analytics teams.

Work experience

SMART

10.2024 – present · 1 yr 10 mos

Infrastructure / ML Platform Engineer

Remote

Sole owner of all production infrastructure for an enterprise platform deployed per-client across higher education, manufacturing, and public transit (5 clients live in production, each on a dedicated cluster for strict isolation of sensitive personal data). The platform spans up to 25 microservices per deployment: identity/SSO, campus & site access (dynamic QR passes, guest management, events), service requests, analytics dashboards, and iOS/Android mobile apps.
Run 12 Kubernetes clusters: 5 production (one per client; bare-metal + Selectel IaaS), 2 staging, and 5 dev (bare-metal, built from scratch), with full ownership of upgrades, on-call, and incident response.
Hand-built and deployed an on-prem production cluster for a security-sensitive client: assembled the Supermicro server, installed it on-site in the client's data center, and brought it online inside their locked-down network over a secure WireGuard tunnel (single ingress IP).
Migrated all 25 platform services off VMware vCenter VMs onto self-built k3s clusters, replacing ad-hoc developer deployments with standardized, centrally managed environments.
Put every production microservice behind a GitOps pipeline (GitLab CI + ArgoCD): tag-triggered build, rollout, and rollback, raising deploy frequency from weekly to daily and reaching 0 manual production deployments.
Own backup and disaster recovery across all clusters: automated PostgreSQL backups with point-in-time recovery (WAL archiving), replicated to multi-region S3; recovery procedures proven against real production incidents.
Built an internal field operations tracking system deployed across 2 enterprise clients (100+ buildings): FastAPI backend, Grafana dashboards with map visualization and automated issue routing, cutting site issue resolution from months to under a week.
Prototyped and validated Rook/Ceph on k3s/EC2 (AWS, Terraform) as a replacement for legacy bare-metal MinIO; the evaluation drove adoption, and Rook/Ceph now runs on all production bare-metal clusters.
Designed, built, and own an internal LLM-based infrastructure analytics system (idea to production): a CPU-only inference pipeline (llama.cpp + Ray) that correlates Kubernetes events, Prometheus metrics, Alertmanager alerts, and Uptime Kuma data to surface root-cause hypotheses; reports are delivered via Telegram and used daily by DevOps and Analytics teams.

Education

ITMO University

2024 – 2026

Computer Science

Skills

Python · Bash · Kubernetes · k3s · Helm · Kustomize · ArgoCD · GitLab CI · AWS · Terraform · Selectel IaaS · MinIO · Prometheus · Grafana · Loki · Uptime Kuma · Alertmanager · VictoriaMetrics · PostgreSQL · MongoDB · ClickHouse · SQLite · Docker · Ansible · VMware · QEMU · Rook/Ceph · Cilium · MetalLB · WireGuard · NetBird · OPNsense · MikroTik · DNS · Ray · Qdrant · pgvector · LLM inference pipelines · RHEL · Ubuntu · Arch Linux · DevOps · Production · MLOps · SRE · Redis · Bare-metal

Languages

English C1 (Advanced)

Russian (Native)

Personal details

Citizenship: Russian Federation