Job Openings

About the Role

We are a fast-growing AI infrastructure company building cutting-edge GPU cloud platforms and high-performance inference solutions that empower AI developers, startups, and enterprises worldwide. As we scale our global operations, we are looking for a skilled and hands-on AI Infra Engineer – SRE (Kubernetes) to join our Global Infrastructure team.

Role Overview

This is a critical hands-on position focused on the reliability, performance, and operational excellence of large-scale, high-performance AI/ML GPU clusters in our data centers. As an AI Infra Engineer – SRE (Kubernetes), you will design, operate, and optimize Kubernetes-based infrastructure to ensure maximum uptime, efficiency, and scalability for demanding AI workloads. You will bring deep expertise in system-level troubleshooting, GPU cluster management, and automation to keep our platforms running at peak performance.

Key Responsibilities

  • Design, build, and maintain scalable, production-grade AI/ML infrastructure using Kubernetes.
  • Proactively monitor GPU cluster health, performance, and utilization across compute, accelerators, storage, and networking layers, performing root-cause analysis and resolution.
  • Develop and implement automation for infrastructure provisioning, configuration, and ongoing management.
  • Own the complete GPU node lifecycle — including provisioning, dynamic scaling, maintenance, decommissioning, and zero-downtime upgrades of GPU-enabled nodes in Kubernetes environments.
  • Build and improve CI/CD pipelines for reliable infrastructure deployment and orchestration.
  • Enforce security best practices, compliance standards, and operational excellence across the infrastructure stack.
  • Lead incident response and post-incident improvements for issues related to GPUs, CPUs, high-speed storage, and networks.
  • Manage end-to-end customer GPU resource provisioning — from request intake and configuration to onboarding, troubleshooting, and support — ensuring high levels of customer satisfaction.
  • Stay up to date with the latest GPU hardware, software, and orchestration technologies, integrating relevant advancements into our platforms.
  • Be available for occasional regional or international travel to data center locations as required.

Requirements

  • Bachelor’s degree in Computer Science, Engineering, or a related technical field.
  • 3+ years of practical experience in data center operations, infrastructure engineering, or site reliability engineering.
  • Strong background in infrastructure automation using tools such as Terraform and Ansible.
  • Deep hands-on experience with Kubernetes in large-scale environments, including:
    • NVIDIA GPU Operator for GPU driver management, device plugins, container toolkit, and monitoring (DCGM).
    • NVIDIA Network Operator for high-performance networking, RDMA, and GPUDirect support.
    • CNI (Container Network Interface) and CSI (Container Storage Interface) plugins tailored for AI/ML workloads.
    • Integration with job schedulers such as Slurm in Kubernetes clusters.
  • Proficiency in Linux system administration and scripting (Python, Bash).
  • Experience with observability stacks including Prometheus, Grafana, and Loki.
  • Solid understanding of GPU architecture, NVIDIA CUDA, NCCL, and AI/ML frameworks is a strong plus.
  • Excellent troubleshooting skills with the ability to analyze complex system logs and performance metrics.
  • Strong communication and collaboration skills to work effectively with engineering and operations teams.

Reports To: Director of Cloud Infrastructure

About the Role

We are seeking a Senior Technical Architect to drive architecture, design, and technology strategy for our enterprise-level revenue optimization and performance management platform. The ideal candidate will bring deep experience in cloud-native architecture, distributed systems, and modern application frameworks, with proven expertise in security, scalability, integrations, and enterprise data processing pipelines.

As a senior leader, you will collaborate with engineering, DevOps, InfoSec, product, and business stakeholders to ensure our platform is resilient, secure, compliant, and scalable while supporting our roadmap for growth and innovation.

Key Responsibilities

Architecture & Design

  • Cloud-Native Architecture: Expertise in designing AWS-based cloud architectures for scalability, high availability, and cost optimization (EKS, EMR, RDS, Redshift, S3, Lambda, VPC, IAM).
  • Microservices & API Design: Strong experience in microservices architecture, service decomposition, API gateway patterns, REST, GraphQL, gRPC, and event-driven messaging (Kafka, SQS, SNS).
  • Data Architecture: Ability to design data pipelines, ETL/ELT frameworks, data lakes, data warehouses, and distributed processing systems, ensuring data quality, schema evolution, and reconciliation.
  • Security & Compliance by Design: Embed security, access control, encryption, and compliance (SOC2, PCI, GDPR, ISO 27001) into all layers of architecture.
  • Scalability & Resilience: Design systems that handle geo-distributed deployments, multi-tenancy, auto-scaling, failover, and disaster recovery.
  • Observability & Monitoring: Define logging, monitoring, alerting, and performance tuning standards for applications and infrastructure.
  • CI/CD & Deployment Architecture: Design pipelines that enforce code quality, automated testing, versioning, and secure deployments across environments.
  • Technical Documentation & Decision Records: Clearly document architecture diagrams, design decisions, trade-offs, and rationale for stakeholders and auditors.
  • Future-State Roadmapping: Ability to plan evolutionary architecture and modernization strategies, including monolith-to-microservices migration, cloud adoption, and AI/ML integration.
  • Performance & Cost Optimization: Design for efficient compute, storage, and network usage while maintaining required SLA, latency, and throughput.

Qualifications & Skills

  • 10+ years of experience in enterprise software engineering and architecture.
  • Strong expertise in AWS Cloud services including EKS, EMR, RDS, Redshift, S3, Glue, Lambda, VPC, and IAM.
  • Proven experience in microservices architecture, API design (REST, GraphQL, gRPC), and event-driven systems (Kafka, SQS, SNS).
  • Deep expertise in data pipelines, ETL/ELT, data lakes, data warehousing, and distributed processing.
  • Experience with containerization and orchestration (Docker, Kubernetes, Helm, service mesh).
  • Strong understanding of security architecture, IAM, OAuth 2.0, OIDC, SAML, Auth0, and compliance frameworks (SOC2, PCI-DSS, GDPR, ISO 27001).
  • Proficiency in DevOps, CI/CD, GitOps, Jenkins, ArgoCD, GitHub Actions, and Infrastructure-as-Code (Terraform, Ansible, Pulumi).
  • Expertise in observability, monitoring, logging, tracing, and performance tuning using New Relic, OpenTelemetry, Prometheus, Grafana, ELK/EFK.
  • Extensive experience in database design and management: relational (MySQL, Postgres), NoSQL (DocumentDB, DynamoDB), and data warehouse (Redshift, Snowflake).
  • Experience designing geo-distributed, multi-tenant, high-availability, and resilient SaaS architectures.
  • Familiarity with frontend frameworks (Angular, React), backend frameworks (Java Spring Boot), and mobile application architecture.
  • Strong skills in architectural governance, technical debt management, and future-state roadmap planning.
  • Understanding of AI/ML workflows for anomaly detection, KPI forecasting, and optimization.
  • Expertise in cost optimization, scalability, disaster recovery, and high-performance infrastructure design.
  • Hands-on experience with software lifecycle best practices, agile methodologies, and code quality governance.
  • Ability to evaluate tools, frameworks, and platforms for strategic enterprise adoption.
  • Strong analytical, problem-solving, and decision-making skills with the ability to balance trade-offs.

Reports To: Director of Product Development

Why You’ll Love Working With Us

This isn’t just another AI job. You’ll be part of a pioneering team pushing the boundaries of what autonomous AI systems can do for Platform Engineering. You’ll have the freedom to innovate, the resources to build at scale, and the support of a collaborative, forward-thinking environment.

We are seeking a skilled AI Agentic Engineer to join our innovative team. The ideal candidate will have a strong background in artificial intelligence, natural language processing, and software development. This role involves designing, developing, and implementing advanced RAG systems and agentic AI-based solutions that enhance user interaction and content retrieval.

What We’re Looking For

  • You eat, sleep, and breathe generative AI — prompt and context engineering, retrieval-augmented generation (RAG), and autonomous agents are your playground.
  • You know your way around modern AI agent frameworks like LangChain, LlamaIndex, Semantic Kernel, crewAI, and AutoGen — and you’re excited to push their limits.
  • Vector databases like Pinecone, Weaviate, or Chroma? You’re comfortable querying and managing them to power semantic search.
  • Full-stack skills? Absolutely. React + TypeScript on the frontend, Node.js or Python microservices on the backend, and REST or gRPC APIs.
  • DevOps savvy: Kubernetes, Terraform or AWS CDK, plus monitoring tools like Grafana and Prometheus are in your toolkit.

Key Responsibilities

  • Design and Build AI Agents: Create autonomous agents that can reason, plan, act, and collaborate.
  • Prompt Engineering: Develop advanced prompts and roles to guide agent behavior.
  • Memory & Context Integration: Implement systems for agents to store, recall, and use memory in conversations.
  • Tool & API Integration: Enable agents to use external tools and APIs for real-world automation.
  • Multi-Step Reasoning: Build agents capable of decomposing tasks, planning, and self-correcting.
  • Multi-Agent Systems: Deploy and manage groups of agents that collaborate and delegate.
  • Ecosystem Automation: Launch and maintain agentic systems that automate business and technical workflows.
  • Collaborate with data scientists and engineers to refine algorithms and improve the performance of AI models.
  • Conduct thorough testing and validation of developed systems to ensure accuracy and reliability.
  • Stay updated with industry trends and advancements in AI, machine learning, and natural language processing.

The Tech We Love

  • AI Agent & Orchestration: LangChain, LangGraph, crewAI, LlamaIndex, Semantic Kernel, AutoGen
  • Protocols: MCP, Agent2Agent
  • Vector DBs: Pinecone, Weaviate, Chroma
  • Observability & Evaluation: LangSmith, Helicone, PromptLayer, RAGAS
  • CI/CD for LLMs: PromptOps, LlamaTest, GitHub Actions with AI evaluation workflows
  • Telephony: Twilio Programmable Voice, SIP, VAPI

If you are passionate about advancing AI technologies and have the skills to build innovative RAG and agentic systems, we encourage you to apply. Join us in shaping the future of intelligent applications!

Reports To: Director of Cloud Infrastructure

Role Overview:
Join our Core Kubernetes Operator Development team, where we’re pushing the boundaries of Kubernetes innovation. As a Kubernetes Controller Developer (Golang), you will play a crucial role in building “01”, our cloud-agnostic Platform as a Service (PaaS), driven by full-fledged Kubernetes operators and agents.

This position requires a strong background in Kubernetes internals and Golang programming, particularly in developing and managing Kubernetes controllers. If you’re a proactive problem solver with experience in building cloud-native infrastructure, this is your opportunity to contribute to a transformative platform.

We highly encourage candidates with a solid programming foundation and a hunger to explore the cloud-native world to apply. Comprehensive onboarding and professional development support will be provided.

Key Responsibilities (Not limited to):

  • Collaborate in Agile teams, taking ownership of development stories with minimal supervision.
  • Partner with internal teams and clients to accurately capture technical requirements.
  • Design, build, deploy, and maintain Kubernetes controllers and operators using Golang.
  • Identify gaps in current systems and propose or implement technical improvements.
  • Apply best practices across the full software development lifecycle.
  • Create and execute unit, regression, and E2E tests for operator reliability.
  • Work in Linux environments and troubleshoot issues in containerized applications.
  • Contribute to CI/CD workflows for seamless testing and deployment.

Essential Skillset:

  • Kubernetes Controller Development: Proven expertise in building and maintaining controllers and operators.
  • Proficiency in Golang: 2+ years writing idiomatic, well-tested Go code for Kubernetes projects.
  • Deep understanding of Kubernetes APIs and libraries including client-go, CRDs, and API extensions.
  • Hands-on experience with:
    • Kubebuilder – For scaffolding controllers and CRDs
    • Operator SDK – For building Operators with OLM support
    • controller-runtime – For abstracting Kubernetes client logic
  • Strong testing skills, including unit, load, and E2E tests for operators.
  • Familiarity with containerization (Docker) and orchestration (Kubernetes).
  • Comfortable working in Linux with debugging tools and CLI.
  • 2+ years of experience working with CI/CD tools like Jenkins, GitHub Actions, Tekton, or similar.

Preferred Skills (Nice to Have):

  • CKA or CKAD certifications.
  • Hands-on experience managing production-grade Kubernetes clusters.
  • Knowledge of Infrastructure as Code tools (e.g., Terraform).
  • Exposure to major cloud providers: AWS, GCP, or Azure.
  • Scripting experience in Shell or Python.

What We Offer:

  • A chance to build infrastructure automation tools that power real-world workloads.
  • Opportunity to work on bleeding-edge cloud-native technologies with a global impact.
  • Collaborative and innovation-driven culture, with strong engineering mentorship.
  • Remote-friendly setup and flexible work culture.
  • Career development in one of the most in-demand areas of DevOps.

Copyright © 2026 BerryBytes. All Rights Reserved.