Overview
Position Overview: We are seeking an experienced Senior Systems Engineer with a US Government Top Secret/SCI security clearance and Full Scope Polygraph to support a small standalone system dedicated to high-performance computing (HPC) and artificial intelligence (AI) workloads. This role demands a blend of operational expertise and strategic technical vision, focusing on the management and optimization of our standalone HPC/AI system. The ideal candidate will manage the technical operation of our infrastructure, develop standardized procedures for hardware, network, and software management across the system, and oversee cluster management, including provisioning, optimization, and monitoring of clustered resources for HPC/AI workloads with tools such as NVIDIA BCM.
What will you do?
This position requires broad expertise in HPC/AI system administration, with a focus on:
- Refining infrastructure management frameworks
- Traditional infrastructure management (hardware, networking, directory services)
- Modern HPC/AI support (Linux/Ubuntu, Proxmox, NVIDIA BCM, WEKA storage)
- Designing scalable, secure, and highly available system architectures
Do you have what it takes?
- REQUIRED ON DAY 1: Active Top Secret/SCI security clearance with Full Scope Polygraph
- Bachelor's degree in engineering, computer science, software engineering, or a related technical field, or equivalent experience
- 7+ years of experience in systems engineering or a related field
- Operating Systems & Infrastructure:
o Expert-level Linux systems engineering
o Windows client operating systems deployment/maintenance
o Linux (Ubuntu) server operating systems deployment/maintenance
o Server hardware
o Network hardware, wiring, and switching configurations
- Virtualization & Containerization:
o Virtualization (ideally Proxmox)
o Containerization (ideally Docker/Podman with Ray or Kubernetes)
- Management & Orchestration:
o Directory services and PKI infrastructure deployment/maintenance
o Configuration management (ideally Ansible, Puppet, Chef, or DSC)
o Cluster orchestration (ideally NVIDIA Base Command Manager (BCM))
- Development Support & Software Management:
o Development support services (GitLab, Jenkins, Nexus)
o Operating system software repository synchronization (Apt, Snap, Yum)