Remote

Sr AI Engineer

WEX, Inc.
life insurance, paid time off, tuition reimbursement
United States
Mar 27, 2026
This is a remote position; however, the candidate must reside within 30 miles of one of the following locations: Portland, ME; Boston, MA; Chicago, IL; Dallas, TX; San Francisco Bay Area, CA; or Seattle, WA.

About the Role/Team

We are the backbone of the AI organization, building the high-performance compute foundation that powers our generative AI and machine learning initiatives. Our team bridges the gap between hardware and software, ensuring that our researchers and data scientists have a reliable, scalable, and efficient platform to train and deploy models. We focus on maximizing GPU utilization, minimizing inference latency, and creating a seamless "paved road" for AI development.

How You'll Make an Impact

You are a systems thinker who loves solving hard infrastructure challenges. You will architect the underlying platform that serves our production AI workloads, ensuring they are resilient, secure, and cost-effective. By optimizing our compute layer and deployment pipelines, you will directly accelerate the velocity of the entire AI product team, transforming how we deliver AI at scale.

Responsibilities

Platform Architecture: Design and maintain a robust, Kubernetes-based AI platform that supports distributed training and high-throughput inference serving.
Inference Optimization: Engineer low-latency serving solutions for LLMs and other models, optimizing inference engines (e.g., vLLM) to maximize throughput and minimize cost per token.
Compute Orchestration: Manage and scale GPU clusters in the cloud (AWS/Azure), implementing efficient scheduling, auto-scaling, and spot instance management to optimize costs.
Operational Excellence (MLOps): Build and maintain "Infrastructure as Code" (Terraform) and CI/CD pipelines to automate the lifecycle of model deployments and infrastructure provisioning.
Reliability & Observability: Implement comprehensive monitoring (Prometheus, Grafana) for GPU health, model latency, and system resource usage; lead incident response for critical AI infrastructure.
Developer Experience: Create tools and abstraction layers (SDKs, CLI tools) that allow data scientists to self-serve compute resources without managing underlying infrastructure.
Security & Compliance: Ensure all AI infrastructure meets strict security standards, handling sensitive data encryption and access controls (IAM, VPCs) effectively.

Experience You'll Bring

Machine Learning Infrastructure: 2+ years of experience building and maintaining ML infrastructure for production workloads.
Production Expertise: Proven experience managing large-scale production clusters (Kubernetes) and distributed systems.
Automation-First Mindset: Strong advocate for "Everything as Code"; skilled at automating repetitive tasks using Python, Go, or Bash. Experience using Jupyter, MLflow, and other related ML tools.
CI/CD & Deployment: Experience designing and maintaining CI/CD pipelines for ML workloads and containerized applications.
Monitoring & Observability: Hands-on experience implementing monitoring, logging, and alerting at scale for production ML systems.
Security & Compliance: Familiarity with securing ML infrastructure, enforcing access controls, and maintaining compliance standards.
Collaboration: Experience working closely with data scientists, ML engineers, and DevOps teams to operationalize models.
Performance Optimization: Ability to profile, debug, and optimize ML workloads and infrastructure for throughput, latency, and cost efficiency.
Scalable Architecture Design: Experience designing scalable and reliable infrastructure to support high-traffic ML applications.

Technical Skills

Core Engineering: Expert proficiency in Python and Go; comfortable digging into lower-level system performance.
Orchestration & Containers: Mastery of Kubernetes (EKS/AKS), Helm, Docker, and container runtimes. Experience with Ray is a huge plus.
Infrastructure as Code: Advanced skills with Terraform.
Cloud Platforms: Experience with at least one of AWS, Azure, or GCP.
Observability: Proficiency with Prometheus, Grafana, and tracing tools (OpenTelemetry).
Networking: Understanding of service mesh (Istio), load balancing, and high-performance networking.

The base pay range represents the anticipated low and high end of the pay range for this position. Actual pay rates will vary and will be based on various factors, such as your qualifications, skills, competencies, and proficiency for the role. Base pay is one component of WEX's total compensation package. Most sales positions are eligible for commission under the terms of an applicable plan. Non-sales roles are typically eligible for a quarterly or annual bonus based on their role and applicable plan.

WEX's comprehensive and market-competitive benefits are designed to support your personal and professional well-being. Benefits include health, dental, and vision insurance, retirement savings plan, paid time off, health savings account, flexible spending accounts, life insurance, disability insurance, tuition reimbursement, and more. For more information, check out the "About Us" section.

Pay Range: $140,200.00 - $185,800.00