Specialist Engineer, High Performance Computing

Job Description

Virgin Galactic is looking for a Specialist Engineer in High-Performance Computing. This is a high-visibility position within engineering that will perform the system administrator role for current High-Performance Computing (HPC) systems while also playing a key role in defining the path forward for HPC at Virgin Galactic. You will work with internal users across several functional areas to maximize the performance of the current systems and remove roadblocks to access and utilization. You will provide strategic insight into the path forward for HPC, including data integrity planning and digital thread integration. The role will also require you to work with external vendors to maintain the existing infrastructure and to lead expansion and upgrade work. The right candidate will balance technically advanced approaches with bootstrapped innovation to keep solutions cost-effective.

Responsibilities

This role is both systems-facing and user-facing. In it, you will use your in-depth knowledge of Linux, your cluster administration experience, and your passion for supporting ground-breaking engineering work daily. Your role is crucial in designing, implementing, and maintaining our advanced computing infrastructure.

* HPC Infrastructure Maintenance: Manage the day-to-day system administration of Linux-based cluster computing and storage environments, and associated network infrastructure, in alignment with applicable company, regulatory agency, and/or contractual security and privacy requirements.
* Ensure users have the environment, tools, compilers, and any additional resources needed to deploy applications across the clusters, including open source, proprietary, and in-house developed codes.
* Slurm: Manage all aspects of Slurm to provide efficient resource allocation and job scheduling across the clusters. This includes managing job accounting databases and generating utilization reports.
* User Support: Collaborate with colleagues and team members to understand their computing needs, provide technical assistance, and troubleshoot issues related to system performance and job execution. Provide user consultation and training.
* Performance Monitoring: Monitor system performance, diagnose bottlenecks, and take necessary actions to improve system performance.
* Documentation: Maintain detailed documentation of system configurations, procedures, and troubleshooting guides to facilitate knowledge sharing and team collaboration. Develop user-facing documentation.
* Planning: Meet regularly with internal and external stakeholders to understand existing challenges, anticipated needs, and opportunities for closer collaboration. Develop a roadmap for system improvements and life cycling, making recommendations to leadership. Create data integrity plans and a strategy for integrating data into the digital thread.

Required Skills and Experience

* Relevant bachelor’s degree and 10 years of increasingly technical work experience, or a combination of education and relevant experience.
* In-depth experience managing multiuser HPC clusters and distributed storage environments.
* Working knowledge of engineering simulation tools such as CFD, FEM, and heat transfer codes that typically run on clusters.
* Independent and proactive working style.
* Demonstrated ability to communicate with a diverse set of stakeholders.
* This position requires in-depth knowledge of and hands-on experience with:
  * Linux cluster system administration (Red Hat/CentOS/Rocky)
  * Slurm configuration and management
  * Active Directory authentication for Linux systems
  * SMB file shares between Windows and Linux systems
  * BeeGFS configuration and management
  * Scripting for system management and task automation
  * Networking technologies (InfiniBand, Message Passing Interface)
  * Installing and repairing servers and associated cluster hardware

* Problem-solving and troubleshooting
* Experience with stateless node management and provisioning (OpenHPC/Warewulf)
* Experience with the proprietary ACT ClusterVisor tools
* Experience with hybrid on-prem/cloud cluster technologies and containerization in the context of HPC
* Tape backup systems
* Working knowledge of Digital Thread concepts
* Working knowledge of the 3DEXPERIENCE platform

The annual U.S. base salary range for this full-time position is $132,100.00-$201,550.00. The base pay actually offered will vary depending on job-related knowledge, skills, location, and experience and take into account internal equity. Other forms of pay (e.g., bonus or long term incentive) may be provided as part of the compensation package, in addition to a full range of medical, financial, and other benefits, dependent on the position offered.
For more information regarding Virgin Galactic benefits, please visit

Recommended Skills

  • Administration
  • Automation
  • Consulting
  • Curiosity
  • Data Integration
  • Databases