Pattern

AI Infrastructure & Inference Engineer (AWS, Open-Source Models)

Job type

  • Permanent

  • Full-time

Location

Noida, Uttar Pradesh

Benefits

  • Paid time off

Full job description

LOCATION: IN-PERSON

SHIFT TIME: OVERNIGHT, 8:30 PM - 6:30 AM IST

Job Description:

Own the end-to-end path from open-source model → production endpoint on AWS. You’ll provision GPU infrastructure (EKS/EC2), containerize model servers (vLLM/TGI/TensorRT-LLM), ship autoscaling and observability, and push throughput (tokens/s) up and p95 latency down while keeping costs predictable.

Key Responsibilities:

  • Build & harden AWS GPU stacks (EKS or EC2 ASG): VPC, subnets, SGs, IAM, ECR, ALB/NLB.
  • Containerize and operate vLLM/TGI/TensorRT-LLM with streaming, batching, KV-cache.
  • Implement autoscaling (HPA/Cluster Autoscaler), blue/green & canary deploys.
  • Codify everything with Terraform + Helm; wire CI/CD (GitHub Actions).
  • Add observability: Prometheus/Grafana, CloudWatch, logs/traces, SLOs, on-call alerts.
  • Optimize cost/perf: quantization (AWQ/GPTQ/FP8), spot strategies, node packing, caching.
  • Secure secrets (KMS/Secrets Manager), private ECR, least-privilege IAM, image signing.
  • Publish runbooks and incident playbooks; partner with research to productionize new models.

Minimum Qualifications:

  • 4+ years DevOps/SRE/Platform or ML-infra experience with AWS (EC2/EKS/ECR/ALB/ASG/IAM/S3).
  • Kubernetes, Helm, Docker expertise; Infrastructure-as-Code (Terraform) in production.
  • Hands-on with at least one model server (vLLM or TGI); CUDA/NVIDIA drivers/NCCL basics.
  • Python or Bash for tooling; CI/CD pipelines; strong debugging skills.
  • Experience with Triton Inference Server, TensorRT-LLM, or SageMaker endpoints.

Nice to have:

  • Redis/RocksDB caches, MoE model serving, multi-region DR.
  • Experience with Llama-3, Mistral, Qwen, Mixtral; GGUF/ONNX export for edge.

Success in 30/60/90 Days:

30 days: EKS provisioned with Terraform and first GPU endpoint (vLLM) live, with a Grafana dashboard and runbook.

60 days: Autoscaling & canary deploys in place; first cost/perf report and savings plan.

90 days: Multi-model routing, SLOs with alerts, incident playbooks, perf regression tests.

Interview Process
Intro screen → Technical deep-dive → Team panel → Founder chat → Offer

Notes

Beware of fake consultants. Don't pay any amount to anyone. Jobpana does not charge anyone any fees.

Huntsman

  • Salary: 15,000/month
  • Education: Bachelor's (Preferred)
  • Job Type: Full-time
  • Experience: 1 year
  • Date: 28-Aug-2025