At Radiant Digital, we provide IT solutions and consulting services to help government agencies and businesses in the USA, Canada, the Middle East, and Southeast Asia. On the federal side, we support agencies like NASA, the Department of State (DOS), the IRS, ACL, ACF, USDA, and many others, along with numerous state and local government agencies.
We work with industries such as telecom, healthcare, entertainment, and oil and gas, offering solutions designed to meet their specific needs. We focus on improving systems, making better use of data, and updating applications to keep up with changing markets.
Position: Kubernetes Engineer
Duration: 12+ Months
Location: Hybrid - Dallas, TX
Job Description:
In this role, you will design, implement, and optimise GPU-accelerated container platforms at scale, enabling high-performance workloads (AI/ML, HPC, LLM training) across hybrid or on-prem environments.
You will have deep expertise in both the NVIDIA and Kubernetes ecosystems, including GPU scheduling, device plugins and custom operators.
Key responsibilities of the role include:
· Architecting and operating Kubernetes clusters optimised for GPU workloads, leveraging NVIDIA GPU Operator, Network Operator and DCGM
· Developing, deploying and maintaining custom Kubernetes operators and controllers to automate infrastructure services
· Integrating NVIDIA device plugins, Multi-Instance GPU (MIG) and GPU sharing features into the scheduling layer
· Optimising GPU utilisation and job placement through scheduler extensions and batch schedulers, such as kube-scheduler plugins, Volcano and Slurm
· Collaborating with HPC, ML and DevOps teams to ensure multi-tenant, high-throughput cluster performance
· Driving observability and telemetry integrations using Prometheus, Grafana, DCGM Exporter and OpenTelemetry
· Implementing secure multi-user and multi-namespace GPU isolation, with RBAC and policy enforcement, such as OPA or Gatekeeper
· Maintaining CI/CD pipelines for Kubernetes infrastructure using GitOps tooling, such as ArgoCD and FluxCD
· Contributing to infrastructure-as-code using Terraform, Helm and Kustomize
· Participating in performance tuning, incident response and production readiness reviews