Support Engineer (US Time Zone)
Experience: 1–3 years
Location: Remote
Work Hours: US Time Zone
About the Role
We’re looking for a coding-forward Support Engineer to own L2 investigations and fixes across our Intel OPEA-based GenAI services. You’ll dive into microservices (retriever/embedding/reranker/agent), APIs, and infra, reproducing issues, shipping small patches, and partnering with Platform/Dev teams to keep customer workloads healthy and fast. OPEA uses a composable, microservice architecture for enterprise GenAI (e.g., RAG blueprints, agents, OpenAI-compatible inference endpoints), which you’ll support and extend in production.
What You’ll Do
- Own L2 incidents end-to-end: triage, root-cause, hotfix (small code changes), and drive long-term fixes for OPEA services (e.g., retriever/embedding/reranker services, agents, inference gateway).
- Debug microservices & APIs: reproduce issues locally with Docker Compose/Kubernetes; verify health checks (a minimal health-check sketch follows this list); trace requests across components (LLM, vector DB, tool/agent).
- Code to unblock customers: write focused patches and scripts (Python/TypeScript, FastAPI/Node) for data prep, adapters, and service hardening.
- Pipeline reliability: monitor and tune RAG/agent pipelines (token/latency budgets, timeouts, batching, retries, circuit-breakers).
- Observability first: build/run dashboards and alerts (logs, metrics, traces; OpenTelemetry where applicable).
- CI/CD & IaC: maintain build/deploy for OPEA components; contribute to Terraform/Helm changes with DevOps.
- Compatibility & model routing: validate OpenAI-compatible endpoints, model switches, and fallbacks (on-prem/cloud).
- Docs & learning loops: keep high-signal runbooks, RCAs, and “best known methods” for recurring issues.
- Participate in US-hours on-call rotations; provide crisp stakeholder updates.
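To give a flavor of the day-to-day triage work, here is a minimal sketch of probing a service health endpoint with a timeout and bounded retries. The URL, port, and `/v1/health_check` route are illustrative assumptions, not a documented OPEA contract:

```python
"""Probe a microservice health endpoint with a timeout and bounded retries.

Illustrative only: the service URL and /v1/health_check route are assumptions.
"""
import time
import requests

HEALTH_URL = "http://localhost:7000/v1/health_check"  # hypothetical retriever service


def probe(url: str, attempts: int = 3, timeout_s: float = 2.0, backoff_s: float = 1.0) -> bool:
    """Return True once the service answers 200 within the timeout."""
    for attempt in range(1, attempts + 1):
        try:
            resp = requests.get(url, timeout=timeout_s)
            if resp.status_code == 200:
                print(f"attempt {attempt}: healthy ({resp.elapsed.total_seconds():.3f}s)")
                return True
            print(f"attempt {attempt}: HTTP {resp.status_code}")
        except requests.RequestException as exc:
            print(f"attempt {attempt}: {exc}")
        time.sleep(backoff_s * attempt)  # linear backoff between retries
    return False


if __name__ == "__main__":
    raise SystemExit(0 if probe(HEALTH_URL) else 1)
```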
Required Skills
- 1–3 years in Support/Platform/Dev/DevOps roles with significant coding in Python (preferred) or TypeScript/Node.
- Solid microservices debugging: REST/gRPC, auth, queues, caching, concurrency, rate limits.
- Containers & orchestration: Docker, Docker Compose; working knowledge of Kubernetes.
- Linux fluency and shell scripting.
- Cloud familiarity: AWS/Azure/GCP (networking, IAM, storage, managed K8s).
- Version control & CI/CD: Git + a common CI (GitHub Actions/Jenkins).
- Strong troubleshooting, crisp written/verbal comms, and customer empathy.
Nice to Have
- OPEA ecosystem familiarity (GenAIComps microservices like retriever/embedding/reranker; Agent service built on LangChain/LangGraph).
- Vector databases (Milvus/pgvector/FAISS), RAG patterns, prompt/tool/agent debugging.
- OpenAI-compatible API experience; gateway/proxy patterns; token accounting (see the fallback sketch after this list).
- Observability: Grafana/Prometheus, ELK/Datadog, OpenTelemetry traces.
- Infra & MLOps: Helm/Terraform; KServe/Ray/Airflow basics.
- Intel stack awareness (Xeon, Gaudi accelerators, OpenVINO) helpful but not required.
- Jira/ServiceNow/Zendesk for incident workflows; Agile practices.
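To illustrate the OpenAI-compatible routing and fallback work mentioned above, here is a minimal sketch using the `openai` Python SDK to try a primary endpoint and fall back to a secondary one. The base URLs, model names, and API key are hypothetical placeholders:

```python
"""Try a primary OpenAI-compatible endpoint, fall back to a secondary one.

The base URLs, model names, and key below are hypothetical placeholders.
"""
from openai import OpenAI, OpenAIError

ENDPOINTS = [
    # (base_url, model) pairs tried in order: on-prem first, cloud fallback.
    ("http://gateway.internal:8080/v1", "local-llm"),
    ("https://api.example.com/v1", "hosted-llm"),
]


def chat_with_fallback(prompt: str) -> str:
    last_error: Exception | None = None
    for base_url, model in ENDPOINTS:
        client = OpenAI(base_url=base_url, api_key="dummy-key", timeout=10.0)
        try:
            resp = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": prompt}],
            )
            return resp.choices[0].message.content or ""
        except OpenAIError as exc:
            last_error = exc  # record the failure and try the next endpoint
    raise RuntimeError(f"all endpoints failed: {last_error}")


if __name__ == "__main__":
    print(chat_with_fallback("ping"))
```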
What Success Looks Like
- Can reproduce and fix common OPEA microservice issues locally (compose/k8s), validate via health endpoints, and contribute small PRs.
- Ship/run dashboards + actionable alerts for latency, error budgets, and throughput across RAG/agent paths (a minimal metrics sketch follows this list).
- Improve customer-visible SLOs (availability, P50/P95 latency) through code/config changes.
- Author clean runbooks and RCAs that prevent repeat incidents.
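As a concrete example of the dashboards-and-alerts expectation, here is a minimal sketch using `prometheus_client` to expose latency and error metrics that a Grafana/Prometheus stack could alert on. The metric names, labels, and port are illustrative choices, not a fixed schema:

```python
"""Expose request latency and error counts for Prometheus to scrape.

Metric names, labels, and the port are illustrative, not a fixed schema.
"""
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

REQUEST_LATENCY = Histogram(
    "rag_request_latency_seconds", "End-to-end RAG request latency", ["stage"]
)
REQUEST_ERRORS = Counter(
    "rag_request_errors_total", "Failed RAG requests", ["stage"]
)


def handle_request() -> None:
    """Simulated retriever call, timed and error-counted."""
    with REQUEST_LATENCY.labels(stage="retriever").time():
        time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        if random.random() < 0.05:
            REQUEST_ERRORS.labels(stage="retriever").inc()
            raise RuntimeError("simulated retriever failure")


if __name__ == "__main__":
    start_http_server(9100)  # metrics served at :9100/metrics
    while True:
        try:
            handle_request()
        except RuntimeError:
            pass
```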