HPC (High Performance Computing) Systems Engineer
Our client is seeking a High Performance Computing (HPC) Systems Engineer for a NASA NACS High Performance Computing contract. You will be an active member of the ASRC Federal account team, interacting with the Program Manager, Site Lead, Customer, and site staff; attending regularly scheduled customer meetings to keep the customer informed of activities and progress; and answering customer inquiries concerning all aspects of the program.
An individual at this skill level should have demonstrated problem-solving ability in the relevant area of expertise, supported by technical publications and formal technical presentations, and should have some experience mentoring and leading others in small team environments.
Duties and Responsibilities:
- Design (architect), implement and troubleshoot large-scale (tens of Petabytes) storage systems. This includes developing technical drawings including all required cables and connectivity to existing systems, and communicating with key stakeholders.
- Serve as a GPFS SME for the Discover HPC team as well as other teams running GPFS both within and outside of the immediate organization.
- Develop and execute test plans for filesystem upgrades and address with vendor support any issues that arise during testing.
- Troubleshoot and resolve user-reported application issues, whether in the filesystem, RDMA interconnect, kernel, operating system, MPI middleware, etc.
- Develop minimal reproducers for user-reported application issues so that the relevant code can be provided to vendors.
- Provide support to the NCCS Applications team in installing user-requested applications.
- Evaluate and test proposed changes to the Discover supercomputer's production operating environment (e.g. MPI upgrades, OS Patches, Kernel parameter changes, etc.) for impact to performance or stability and develop an upgrade and potential backout plan.
- Maintain the Discover TDS (Test and Development System), keeping it as close as is reasonably possible to the production system configuration.
- Provide 24x7 on-call support as required.
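Day-to-day GPFS triage of the kind the duties above describe typically starts with a few standard IBM Spectrum Scale administration commands. The sketch below is illustrative only; it assumes an existing, configured cluster, and no cluster or filesystem details from this contract are implied:

```shell
# Check that the GPFS daemon is active on all nodes in the cluster
mmgetstate -a

# Summarize overall cluster and component health
mmhealth cluster show

# Look for long waiters, a common first symptom of filesystem stalls
mmdiag --waiters

# Confirm which nodes currently have the filesystems mounted
mmlsmount all -L
```

On a healthy cluster, `mmdiag --waiters` output is short or empty; long-lived waiters usually point at a slow disk, a stuck node, or fabric trouble.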
Requirements:
- Bachelor's degree in Computer Science, Management Information Systems, or another technical discipline plus 5 years of experience required; a Master's degree or equivalent experience is preferred.
- At least five years of experience as a High-Performance Computing parallel-filesystem storage administrator, including experience with IBM Spectrum Scale (GPFS), Lustre, or equivalent.
- Familiarity with optimizing IBM Spectrum Scale (GPFS) for performance, reliability, and security.
- In-depth knowledge of HPC parallel filesystems and the ability to troubleshoot complex problems. Must be comfortable monitoring and managing clustered filesystems, and able to examine GPL driver code when required, participating in mailing list discussions and interacting with developer support as needed.
- Experience with deploying parallel filesystem upgrades in a rolling fashion with no overall system downtime.
- In-depth knowledge of Linux NFS server/client implementation and ability to troubleshoot issues with NFS.
- In-depth knowledge of SAN technologies (such as FC, FCoE, RoCE, NVMe-oF, iSER, SRP) and awareness of high-level protocol function, management approaches, and performance tuning (e.g. how to configure zoning and host initiators, or tune kernel modules for SRP).
- Deep experience with InfiniBand or Omni-Path high-speed fabrics, including subnet management, IPoIB and/or IPoOPA mechanisms, fabric topology, health monitoring, and integration with MPI.
- Knowledge of Ethernet networking (VLANs, etc.).
- In-depth knowledge of MPI Implementations (Intel MPI, MVAPICH2, OpenMPI, HPE/SGI MPT) and troubleshooting MPI application stability (e.g. application crashes) and performance problems (e.g. slow collectives).
- Experience debugging Linux kernel issues, including examining source code and using tools such as crash and SystemTap. Ability to produce patches to fix issues as required. Experience applying patches and building custom kernels to implement functionality or address security concerns is also desired.
- Experience deploying and managing large HPC clusters using an image-based cluster management tool such as xCAT.
- Familiarity with developing applications in C (including familiarity with threading).
- Experience in building, installing and debugging scientific applications (e.g. MPI, NetCDF, HDF, WRF).
- Experience in submitting parallel applications to a batch scheduler (ideally SLURM).
- Knowledge of configuration management tools (Puppet, CFEngine).
- Working knowledge of scripting and programming languages such as C, C++, Fortran, Bash, csh, tcsh, Perl, Python, Ruby.
- Good organizational skills to balance and prioritize work, and the ability to multitask.
- Good communication skills for interacting with support personnel, customers, and managers.
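For the batch-scheduler requirement above, a minimal Slurm submission script for an MPI job might look like the sketch below. The job name, binary (`./mpi_app`), module name, and resource figures are all placeholders, not details from this posting:

```shell
#!/bin/bash
#SBATCH --job-name=mpi_test       # placeholder job name
#SBATCH --nodes=4                 # number of nodes to allocate
#SBATCH --ntasks-per-node=28      # MPI ranks per node (hardware dependent)
#SBATCH --time=01:00:00           # wall-clock limit
#SBATCH --output=mpi_test.%j.out  # %j expands to the job ID

# Load the site's MPI environment (module name is site specific)
module load mpi

# Launch the application across the full allocation
srun ./mpi_app
```

The script would be submitted with `sbatch job.sh`, and `squeue -u $USER` shows its state in the queue.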
Preferred Skills:
- Experience with cloud technologies (AWS, Azure, GCP), OpenStack, and Kubernetes
- Experience with GPFS Cluster Export Services, Clustered NFS, and GPFS Multi-cluster.
- Broad knowledge of distributed file systems and object stores such as Lustre, HDFS, BeeGFS, LizardFS, Gluster, Ceph, and Swift.
- Experience with revision control via Git.
- Familiarity with time-series databases and associated tools (such as InfluxDB, Graphite, Grafana, Elasticsearch, Kibana).
- Knowledge of virtualization technologies (particularly qemu/kvm) and managing large numbers of virtual machines.