Software Engineer (ML Infrastructure, Distributed Systems)


Are you driven by the challenge of building scalable systems that shape the future of AI? We are looking for a Senior/Principal Software Engineer with a strong passion for innovation in distributed machine learning infrastructure. As leaders in advanced machine learning compute solutions, we bridge theoretical AI and practical applications, ensuring that complex models are trained efficiently and at scale.


In this position, you will play a crucial role in designing and developing our distributed training frameworks, facilitating the rapid iteration and deployment of cutting-edge machine learning models. Your contributions will directly enhance the performance and capabilities of our next-generation AI technologies.


What We Offer

  • Competitive compensation package with comprehensive benefits
  • Generous professional development opportunities (conferences, training, etc.)
  • Access to state-of-the-art tools and technologies
  • Collaborative team environment with regular retreats
  • Opportunity to publish and present your work


Key Responsibilities

  • Design, develop, and optimize distributed training frameworks for large-scale machine learning models
  • Lead technical initiatives to enhance the scalability, performance, and efficiency of our compute infrastructure
  • Collaborate with research teams to integrate advanced algorithms and models into our systems
  • Innovate in areas such as model parallelism, data parallelism, and hybrid approaches
  • Mentor and guide junior engineers, fostering a culture of technical excellence


We seek individuals with a robust software engineering background and proven expertise in distributed systems and cloud infrastructure (AWS, GCP, Azure, etc.). Candidates should have a successful history of building and deploying large-scale machine learning systems. A deep understanding of distributed training frameworks (e.g., Horovod, PyTorch DDP, TensorFlow Distributed) is essential. Proficiency in Python and/or C++ is required, along with excellent communication and collaboration skills.


Experience with high-performance computing (HPC) clusters, GPU acceleration, or optimization techniques for distributed training is a plus. A background in machine learning research or development, contributions to open-source projects, or publications in relevant fields would also be highly valued.


Keywords: Distributed computing, parallel processing, cluster computing, scalability, multi-node architecture, data partitioning, load balancing, high-performance computing (HPC), fault tolerance, model parallelism, data parallelism, Horovod, PyTorch DDP, TensorFlow Distributed, machine learning infrastructure, cloud-based ML, GPU/CPU collaboration.





Acceler8 Talent

07/11/2024

All cities, CA