Looking for US-based candidates only.
Location: Remote
Key Responsibilities
· Architect and deploy agentic multi-agent AI frameworks.
· Develop scalable pipelines integrating LLM → RAG → VectorDB → Agents.
· Build and deploy MCP (Model Context Protocol) servers for agentic AI agents and integrations.
· Build observability, latency optimization, and performance monitoring systems.
· Implement self-refining / feedback-loop learning architectures.
1. Multi-Agent System Architecture & Deployment
· Architect, design, and deploy agentic multi-agent frameworks where multiple AI agents collaborate autonomously.
· Design and implement inter-agent communication protocols, coordination strategies, and workflow orchestration layers.
· Integrate with frameworks such as LangGraph, CrewAI, AutoGen, or Swarm to develop distributed, event-driven agentic ecosystems.
· Develop containerized deployments (Docker / Kubernetes) for multi-agent clusters running in hybrid or multi-cloud environments.
2. Intelligent Pipeline Development
· Build end-to-end scalable pipelines integrating LLMs → RAG → VectorDB → Agents, ensuring optimal latency and retrieval quality.
· Implement retrieval-augmented generation (RAG) architectures using FAISS, Chroma, Weaviate, Milvus, or Pinecone.
· Develop embedding generation, storage, and query pipelines using OpenAI, Hugging Face, or local LLMs.
· Orchestrate data movement, context caching, and memory persistence for agentic reasoning loops.
3. Agentic Infrastructure & Orchestration
· Build and maintain MCP (Model Context Protocol) servers for agentic AI agents and integrations.
· Develop APIs, microservices, and serverless components for flexible integration with third-party systems.
· Implement distributed task scheduling and event orchestration using Celery, Airflow, Temporal, or Prefect.
4. Observability, Performance, and Optimization
· Build observability stacks for multi-agent systems with centralized logging, distributed tracing, and metrics visualization.
· Optimize latency, throughput, and inference cost across LLM and RAG layers.
· Implement performance benchmarking and automated regression testing for large-scale agent orchestration.
· Monitor LLM response quality, drift, and fine-tuning performance through continuous feedback loops.
5. Self-Refining & Feedback Loop Architectures
· Implement self-refining / reinforcement-learning feedback mechanisms so agents iteratively improve their performance.
· Integrate auto-evaluation agents to assess output correctness and reduce hallucination.
· Design memory systems (episodic, semantic, long-term) for adaptive agent learning and contextual persistence.
· Experiment with tool-use capabilities, chaining, and adaptive reasoning strategies to enhance autonomous capabilities.
Technical Skills Required
· Programming: Expert-level Python (async, multiprocessing, API design, performance tuning).
· LLM Ecosystem: Familiarity with OpenAI, Anthropic, Hugging Face, Ollama, LangChain, LangGraph, CrewAI, or AutoGen.
· Databases: VectorDBs (FAISS, Weaviate, Milvus, Pinecone), NoSQL (MongoDB, Redis), SQL (PostgreSQL, MySQL).
· Cloud Platforms: AWS / Azure / GCP; experience with Kubernetes, Docker, Terraform, and serverless architecture.
· Observability: Prometheus, Grafana, OpenTelemetry, ELK Stack, Datadog, or New Relic.
· CI/CD & DevOps: GitHub Actions, Jenkins, ArgoCD, Cloud Build, and testing frameworks (PyTest, Locust, etc.).
· Other Tools: FastAPI, gRPC, REST, Kafka, Redis Streams, or event-driven frameworks.
Preferred Experience
· Experience designing agentic workflows or AI orchestration systems in production environments.
· Background in applied AI infrastructure, ML Ops, or distributed system design.
· Exposure to RAG-based conversational AI or autonomous task delegation frameworks.
· Strong understanding of context management, caching, and inference optimization for large models.
· Experience with multi-agent benchmarking or simulation environments.
Soft Skills
· Ability to translate conceptual AI architectures into production-grade systems.
· Strong problem-solving and debugging capabilities in distributed environments.
· Collaboration mindset: working closely with AI researchers, data scientists, and backend teams.
· Passion for innovation in agentic intelligence, orchestration systems, and AI autonomy.
Education & Experience
· Bachelor’s or Master’s in Computer Science, AI/ML, or related technical field.
· 5+ years of experience in backend, cloud, or AI infrastructure engineering.
· 2+ years in applied AI or LLM-based system development preferred.
Nice-to-Haves
· Knowledge of Reinforcement Learning from Human Feedback (RLHF) or self-improving AI systems.
· Experience deploying on-premise or private LLMs or integrating custom fine-tuned models.
· Familiarity with graph-based reasoning or knowledge representation systems.
· Understanding of AI safety, alignment, and autonomous agent governance.