Job Openings

An easier way to deploy your apps

Automated infrastructure management platform for high availability
Minimizes manual effort and operational overhead
Uses predictive auto-scaling to handle demand dynamically
Provides instant failover to prevent downtime
Leverages intelligent orchestration for seamless operations
Ensures systems stay online 24/7

Simplify Kubernetes operations with 01Agents

Intelligent Kubernetes remediation platform
Automatically converts alerts into actionable solutions
Powered by a LangGraph-driven agent for smart analysis
Analyzes real-time cluster signals and gathers live context
Generates ready-to-run remediation scripts within seconds
Integrates with modern UI, Slack, and Microsoft Teams
Streamlines incident response and reduces downtime
Enhances Kubernetes management with speed and clarity

AI-Powered Cybersecurity Platform

Autonomous, AI-driven cybersecurity agent
Functions as an embedded expert-level threat hunter
Continuously learns and adapts to your network behavior
Detects hidden vulnerabilities proactively
Neutralizes complex cyber attacks in real time
Operates without requiring human intervention

About the Role

We are a fast-growing AI infrastructure company building cutting-edge GPU cloud platforms and high-performance inference solutions that empower AI developers, startups, and enterprises worldwide. As we scale our global operations, we are looking for a skilled and hands-on AI Infra Engineer – SRE (Kubernetes) to join our Global Infrastructure team.

Role Overview

This is a critical hands-on position focused on the reliability, performance, and operational excellence of large-scale, high-performance AI/ML GPU clusters in our data centers. As an AI Infra Engineer – SRE (Kubernetes), you will design, operate, and optimize Kubernetes-based infrastructure to ensure maximum uptime, efficiency, and scalability for demanding AI workloads. You will bring deep expertise in system-level troubleshooting, GPU cluster management, and automation to keep our platforms running at peak performance.

Key Responsibilities

Design, build, and maintain scalable, production-grade AI/ML infrastructure using Kubernetes.
Proactively monitor GPU cluster health, performance, and utilization across compute, accelerators, storage, and networking layers, performing root-cause analysis and resolution.
Develop and implement automation for infrastructure provisioning, configuration, and ongoing management.
Own the complete GPU node lifecycle — including provisioning, dynamic scaling, maintenance, decommissioning, and zero-downtime upgrades of GPU-enabled nodes in Kubernetes environments.
Build and improve CI/CD pipelines for reliable infrastructure deployment and orchestration.
Enforce security best practices, compliance standards, and operational excellence across the infrastructure stack.
Lead incident response and post-incident improvements for issues related to GPUs, CPUs, high-speed storage, and networks.
Manage end-to-end customer GPU resource provisioning — from request intake and configuration to onboarding, troubleshooting, and support — ensuring high levels of customer satisfaction.
Stay up to date with the latest GPU hardware, software, and orchestration technologies, integrating relevant advancements into our platforms.
Be available for occasional regional or international travel to data center locations as required.

Requirements

Bachelor’s degree in Computer Science, Engineering, or a related technical field.
3+ years of practical experience in data center operations, infrastructure engineering, or site reliability engineering.
Strong background in infrastructure automation using tools such as Terraform and Ansible.
Deep hands-on experience with Kubernetes in large-scale environments, including:
- NVIDIA GPU Operator for GPU driver management, device plugins, container toolkit, and monitoring (DCGM).
- NVIDIA Network Operator for high-performance networking, RDMA, and GPUDirect support.
- CNI (Container Network Interface) and CSI (Container Storage Interface) plugins tailored for AI/ML workloads.
- Integration with job schedulers such as Slurm in Kubernetes clusters.
Proficiency in Linux system administration and scripting (Python, Bash).
Experience with observability stacks including Prometheus, Grafana, and Loki.
Solid understanding of GPU architecture, NVIDIA CUDA, NCCL, and AI/ML frameworks is a strong plus.
Excellent troubleshooting skills with the ability to analyze complex system logs and performance metrics.
Strong communication and collaboration skills to work effectively with engineering and operations teams.

Reports To: Director of Product Development

Why You’ll Love Working With Us

This isn’t just another AI job. You’ll be part of a pioneering team pushing the boundaries of what autonomous AI systems can do for Platform Engineering. You’ll have the freedom to innovate, the resources to build at scale, and the support of a collaborative, forward-thinking environment.

We are seeking a skilled AI Agentic Engineer to join our innovative team. The ideal candidate will have a strong background in artificial intelligence, natural language processing, and software development. This role involves designing, developing, and implementing advanced RAG systems and agentic AI-based solutions that enhance user interaction and content retrieval.

What We’re Looking For

You eat, sleep, and breathe generative AI — prompt and context engineering, retrieval-augmented generation (RAG), and autonomous agents are your playground.
You know your way around modern AI agent frameworks like LangChain, LlamaIndex, Semantic Kernel, crewAI, and AutoGen — and you’re excited to push their limits.
Vector databases like Pinecone, Weaviate, or Chroma? You’re comfortable querying and managing them to power semantic search.
Full-stack skills? Absolutely. React + TypeScript on the frontend, Node.js or Python microservices on the backend, and REST or gRPC APIs.
DevOps savvy: Kubernetes, Terraform or AWS CDK, plus monitoring tools like Grafana and Prometheus are in your toolkit.

Key Responsibilities

Design and Build AI Agents: Create autonomous agents that can reason, plan, act, and collaborate.
Prompt Engineering: Develop advanced prompts and roles to guide agent behavior.
Memory & Context Integration: Implement systems for agents to store, recall, and use memory in conversations.
Tool & API Integration: Enable agents to use external tools and APIs for real-world automation.
Multi-Step Reasoning: Build agents capable of decomposing tasks, planning, and self-correcting.
Multi-Agent Systems: Deploy and manage groups of agents that collaborate and delegate.
Ecosystem Automation: Launch and maintain agentic systems that automate business and technical workflows.
Collaborate with data scientists and engineers to refine algorithms and improve the performance of AI models.
Conduct thorough testing and validation of developed systems to ensure accuracy and reliability.
Stay updated with industry trends and advancements in AI, machine learning, and natural language processing.

The Tech We Love

AI Agent & Orchestration: LangChain, LangGraph, CrewAI, LlamaIndex, Semantic Kernel, AutoGen
Protocols: MCP, Agent2Agent
Vector DBs: Pinecone, Weaviate, Chroma
Observability & Evaluation: LangSmith, Helicone, PromptLayer, RAGAS
CI/CD for LLMs: PromptOps, LlamaTest, GitHub Actions with AI evaluation workflows
Telephony: Twilio Programmable Voice, SIP, VAPI

If you are passionate about advancing AI technologies and have the skills to build innovative RAG and agentic systems, we encourage you to apply. Join us in shaping the future of intelligent applications!

Reports To: Director of Cloud Infrastructure

Role Overview:
Join our Core Kubernetes Operator Development team, where we’re pushing the boundaries of Kubernetes innovation. As a Kubernetes Controller Developer (Golang), you will play a crucial role in building “01”, our cloud-agnostic Platform as a Service (PaaS), driven by full-fledged Kubernetes operators and agents.

This position requires a strong background in Kubernetes internals and Golang programming, particularly in developing and managing Kubernetes controllers. If you’re a proactive problem solver with experience in building cloud-native infrastructure, this is your opportunity to contribute to a transformative platform.

We highly encourage candidates with a solid programming foundation and a hunger to explore the cloud-native world to apply. Comprehensive onboarding and professional development support will be provided.

Key Responsibilities (Not limited to):

Collaborate in Agile teams, taking ownership of development stories with minimal supervision.
Partner with internal teams and clients to accurately capture technical requirements.
Design, build, deploy, and maintain Kubernetes controllers and operators using Golang.
Identify gaps in current systems and propose or implement technical improvements.
Apply best practices across the full software development lifecycle.
Create and execute unit, regression, and E2E tests for operator reliability.
Work in Linux environments and troubleshoot issues in containerized applications.
Contribute to CI/CD workflows for seamless testing and deployment.

Essential Skillset:

Kubernetes Controller Development: Proven expertise in building and maintaining controllers and operators.
Proficiency in Golang: 2+ years writing idiomatic, well-tested Go code for Kubernetes projects.
Deep understanding of Kubernetes APIs and libraries including client-go, CRDs, and API extensions.
Hands-on experience with:
- Kubebuilder – For scaffolding controllers and CRDs
- Operator SDK – For building Operators with OLM support
- controller-runtime – For abstracting Kubernetes client logic
Strong testing skills, including unit, load, and E2E tests for operators.
Familiarity with containerization (Docker) and orchestration (Kubernetes).
Comfortable working in Linux with debugging tools and CLI.
2+ years of experience working with CI/CD tools like Jenkins, GitHub Actions, Tekton, or similar.

Preferred Skills (Nice to Have):

CKA or CKAD certifications.
Hands-on experience managing production-grade Kubernetes clusters.
Knowledge of Infrastructure as Code tools (e.g., Terraform).
Exposure to major cloud providers: AWS, GCP, or Azure.
Scripting experience in Shell or Python.

What We Offer:

A chance to build infrastructure automation tools that power real-world workloads.
Opportunity to work on bleeding-edge cloud-native technologies with a global impact.
Collaborative and innovation-driven culture, with strong engineering mentorship.
Remote-friendly setup and flexible work culture.
Career development in one of the most in-demand areas of DevOps.

Get the latest BerryBytes updates by subscribing to our Newsletter!
