Software Engineer (ML Infrastructure, Distributed Systems)


Are you driven by the challenge of building scalable systems that shape the future of AI? We are looking for a Senior/Principal Software Engineer with a strong passion for innovation in distributed machine learning infrastructure. As leaders in advanced machine learning compute solutions, we bridge theoretical AI and practical applications, ensuring that complex models are trained efficiently and at scale.


In this position, you will play a crucial role in designing and developing our distributed training frameworks, facilitating the rapid iteration and deployment of cutting-edge machine learning models. Your contributions will directly enhance the performance and capabilities of our next-generation AI technologies.


What We Offer

  • Competitive compensation package with comprehensive benefits
  • Generous professional development opportunities (conferences, training, etc.)
  • Access to state-of-the-art tools and technologies
  • Collaborative team environment with regular retreats
  • Opportunity to publish and present your work


Key Responsibilities

  • Design, develop, and optimize distributed training frameworks for large-scale machine learning models
  • Lead technical initiatives to enhance the scalability, performance, and efficiency of our compute infrastructure
  • Collaborate with research teams to integrate advanced algorithms and models into our systems
  • Innovate in areas such as model parallelism, data parallelism, and hybrid approaches
  • Mentor and guide junior engineers, fostering a culture of technical excellence


We seek individuals with a robust software engineering background and proven expertise in distributed systems and cloud infrastructure (AWS, GCP, Azure, etc.). Candidates should have a successful history of building and deploying large-scale machine learning systems. A deep understanding of distributed training frameworks (e.g., Horovod, PyTorch DDP, TensorFlow Distributed) is essential. Proficiency in Python and/or C++ is required, along with excellent communication and collaboration skills.


Experience with high-performance computing (HPC) clusters, GPU acceleration, or optimization techniques for distributed training is a plus. A background in machine learning research or development, contributions to open-source projects, or publications in relevant fields would also be highly valued.


Keywords: Distributed computing, parallel processing, cluster computing, scalability, multi-node architecture, data partitioning, load balancing, high-performance computing (HPC), fault tolerance, model parallelism, data parallelism, Horovod, PyTorch DDP, TensorFlow Distributed, machine learning infrastructure, cloud-based ML, GPU/CPU collaboration.





Acceler8 Talent

07/11/2024

All cities, CA