About the Department
Cloudflare's Infrastructure group is responsible for building our global network. Our Hardware Engineering team helps research, develop, test, and deploy new equipment enabling 20% of the world's Internet traffic to be served smoothly. Deployed across 285 cities in 100+ countries, the hardware we select helps improve the security, reliability, and performance of the Internet.
About the Role
We need to make thoughtful infrastructure choices affecting a significant portion of the Internet. Hardware we work with includes servers, routers, switches, optical equipment, power distribution units, cables, optics, and more. As a Hardware Systems Engineer, you will work with colleagues on the Hardware Engineering, Product, and Hardware Sourcing teams to troubleshoot and maintain Cloudflare's worldwide fleet of storage and compute servers.
What you'll do
- Develop and maintain automation tools to update firmware on servers and components in Cloudflare's fleet
- Work with software teams to validate bug fixes and performance of new firmware revisions
- Test and deploy firmware updates to the fleet, monitoring the progress of the rollout for compliance and reliability
- Work with server and component vendors to obtain, debug, and maintain the latest updates
- Work with our Site Reliability Engineering teams to triage bug reports
- Support our Data Centre Engineering teams in resolving hardware issues
- Communicate your results and updates through blog posts, internal talks, and tickets
Examples of desirable skills, knowledge and experience
- Bachelor's degree in Computer Engineering, Electrical Engineering, or Computer Science
- Desire to learn about the Cloudflare hardware used by almost 20% of all websites
- Desire to learn how a diverse server fleet is managed at scale
- Desire to learn the tools Cloudflare uses to maintain and monitor our hardware
- Knowledge of PXE booting
- Knowledge of configuration management; in particular, we use Salt to manage our fleet
- Knowledge of Redfish, IPMI, and other server remote management protocols
- Knowledge of running mission-critical production systems
Bonus Points
- Familiarity with server hardware architecture
- Knowledge of debugging server hardware faults and the ability to engage with our sourcing team and vendors to improve quality
- Experience managing large fleets comprising thousands of servers
- Experience with observability and monitoring tools such as Prometheus and Grafana, and the ability to observe trends over time
- Experience scripting and programming, in particular Python and Bash
- Experience with software development tools and processes such as Git, Bitbucket, TeamCity, and Jira