Looking for US-based candidates only.
Location: Remote
Key Responsibilities
· Architect and deploy agentic multi-agent AI frameworks.
· Develop scalable pipelines integrating LLM → RAG → VectorDB → Agents.
· Build and deploy MCP (Model Context Protocol) servers for agentic AI agents and integrations.
· Build observability, latency optimization, and performance monitoring systems.
· Implement self-refining / feedback-loop learning architectures.
1. Multi-Agent System Architecture & Deployment
· Architect, design, and deploy agentic multi-agent frameworks where multiple AI agents collaborate autonomously.
· Design and implement inter-agent communication protocols, coordination strategies, and workflow orchestration layers.
· Integrate with frameworks such as LangGraph, CrewAI, AutoGen, or Swarm to develop distributed, event-driven agentic ecosystems.
· Develop containerized deployments (Docker / Kubernetes) for multi-agent clusters running in hybrid or multi-cloud environments.
2. Intelligent Pipeline Development
· Build end-to-end scalable pipelines integrating LLMs → RAG → VectorDB → Agents, ensuring optimal latency and retrieval quality.
· Implement retrieval-augmented generation (RAG) architectures using FAISS, Chroma, Weaviate, Milvus, or Pinecone.
· Develop embedding generation, storage, and query pipelines using OpenAI, Hugging Face, or local LLMs.
· Orchestrate data movement, context caching, and memory persistence for agentic reasoning loops.
3. Agentic Infrastructure & Orchestration
· Build and maintain MCP (Model Context Protocol) servers for agentic AI agents and integrations.
· Develop APIs, microservices, and serverless components for flexible integration with third-party systems.
· Implement distributed task scheduling and event orchestration using Celery, Airflow, Temporal, or Prefect.
4. Observability, Performance, and Optimization
· Build observability stacks for multi-agent systems with centralized logging, distributed tracing, and metrics visualization.
· Optimize latency, throughput, and inference cost across LLM and RAG layers.
· Implement performance benchmarking and automated regression testing for large-scale agent orchestration.
· Monitor LLM response quality, drift, and fine-tuning performance through continuous feedback loops.
5. Self-Refining & Feedback Loop Architectures
· Implement self-refining / reinforcement-learning feedback mechanisms so agents iteratively improve their performance.
· Integrate auto-evaluation agents to assess output correctness and reduce hallucination.
· Design memory systems (episodic, semantic, long-term) for adaptive agent learning and contextual persistence.
· Experiment with tool-use capabilities, chaining, and adaptive reasoning strategies to enhance autonomous capabilities.
Technical Skills Required
· Programming: Expert-level Python (async, multiprocessing, API design, performance tuning).
· LLM Ecosystem: Familiarity with OpenAI, Anthropic, Hugging Face, Ollama, LangChain, LangGraph, CrewAI, or AutoGen.
· Databases: VectorDBs (FAISS, Weaviate, Milvus, Pinecone), NoSQL (MongoDB, Redis), SQL (PostgreSQL, MySQL).
· Cloud Platforms: AWS / Azure / GCP; experience with Kubernetes, Docker, Terraform, and serverless architecture.
· Observability: Prometheus, Grafana, OpenTelemetry, ELK Stack, Datadog, or New Relic.
· CI/CD & DevOps: GitHub Actions, Jenkins, ArgoCD, Cloud Build, and testing frameworks (PyTest, Locust, etc.).
· Other Tools: FastAPI, gRPC, REST, Kafka, Redis Streams, or event-driven frameworks.
Preferred Experience
· Experience designing agentic workflows or AI orchestration systems in production environments.
· Background in applied AI infrastructure, ML Ops, or distributed system design.
· Exposure to RAG-based conversational AI or autonomous task delegation frameworks.
· Strong understanding of context management, caching, and inference optimization for large models.
· Experience with multi-agent benchmarking or simulation environments.
Soft Skills
· Ability to translate conceptual AI architectures into production-grade systems.
· Strong problem-solving and debugging capabilities in distributed environments.
· Collaboration mindset: working closely with AI researchers, data scientists, and backend teams.
· Passion for innovation in agentic intelligence, orchestration systems, and AI autonomy.
Education & Experience
· Bachelor’s or Master’s in Computer Science, AI/ML, or related technical field.
· 5+ years of experience in backend, cloud, or AI infrastructure engineering.
· 2+ years in applied AI or LLM-based system development preferred.
Nice-to-Haves
· Knowledge of Reinforcement Learning from Human Feedback (RLHF) or self-improving AI systems.
· Experience deploying on-premise or private LLMs or integrating custom fine-tuned models.
· Familiarity with graph-based reasoning or knowledge representation systems.
· Understanding of AI safety, alignment, and autonomous agent governance.