Site Reliability Engineer (SRE)

FLUIX is building the AI operating system that plans, designs, and optimizes AI infrastructure. We are based in Silicon Valley. We specialize in providing AI-driven solutions for data centers and power providers, leveraging cutting-edge Machine Learning (ML) and Artificial Intelligence (AI) technologies. Our mission is to double America’s compute capacity without building new data centers.

We are seeking a skilled Site Reliability Engineer to join our growing team. The ideal candidate will help ensure the reliability, scalability, and performance of our hybrid (cloud and on-prem) platform while supporting our AI/ML infrastructure. You will work closely with our engineering, AI, and operations teams to build and maintain robust systems that support our cutting-edge solutions. Your expertise in ML/AI and experience with data center sites will be crucial in driving the success of our platform.

Who you’ll work closely with

Abhi Sastri

Founder & CEO

Chase Overcash

CTO

What you’ll do

  • Design, implement, and maintain scalable systems while optimizing performance, ensuring high availability and disaster recovery, and assisting with codebase refactoring for modular deployment.

  • Develop and maintain automation tools to streamline operations, improve efficiency, and automate repetitive tasks to enhance system reliability.

  • Collaborate with engineering and data science teams to integrate ML and AI models into production environments, while ensuring seamless integration and high performance of cutting-edge models within our technology stack.

  • Identify areas for improvement and drive initiatives to enhance system reliability and performance, while staying updated on industry trends and advancements in SRE practices, ML, and AI technologies.

  • Respond to and resolve incidents to minimize impact and ensure timely resolution, while conducting post-incident reviews and implementing improvements to prevent recurrence.

  • Create and manage multiple cloud instances (dev, staging, test), optimize cloud infrastructure and data center operations, and ensure the security and compliance of both infrastructure and applications.

Your background

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

  • Proven experience as a Site Reliability Engineer or in a similar role in a SaaS environment, with a strong background in managing and optimizing cloud infrastructure (AWS preferred; GCP or Azure also relevant), experience with ML and AI technologies, and familiarity with data center operations and integrations.

  • Proficiency in programming and scripting languages (e.g., Python), experience with containerization and orchestration tools (Kubernetes), a strong understanding of networking, security, and performance optimization, and knowledge of CI/CD pipelines and DevOps practices.

  • Excellent problem-solving skills with attention to detail, strong communication and collaboration abilities, and the capacity to thrive in a fast-paced, dynamic startup environment.

Culture Fit

  • We are looking for obsessed individuals who want to give it their all.

  • We are not afraid to get our hands dirty with physical and software systems.

  • We are eager to visit and work with clients and understand the importance and gravitas of their mission-critical work.

  • We are eager to come into the office and on-site, as our work directly affects physical environments.

  • Due to our mission-critical work, we understand and are eager to help our teammates and co-workers during holidays, weekends, and emergencies.

  • We are cordial and over-communicate with teammates, co-workers, and management.

Benefits

Competitive Salary

Attractive compensation package, including equity options.

Benefits

Comprehensive health, dental, and vision insurance, along with other standard benefits.

Work Environment

A dynamic and collaborative San Francisco Bay Area work environment.

Growth Opportunities

Opportunities for professional growth and development, with the chance to shape the future of technology in the industry.

www.fluix.ai

Full-time | San Francisco

Apply for this role

FLUIX AI is a rapidly growing enterprise B2B SaaS startup based in the San Francisco Bay Area. We specialize in providing innovative solutions for data centers and facilities, leveraging the latest advancements in Machine Learning (ML) and Artificial Intelligence (AI). Our mission is to use AI to solve the world’s inefficiencies, starting with the world’s most important buildings. Facilities that provide the world with communication, data, food, manufactured goods, and more are ultimately inefficient and require real-time, dynamic optimization. With A.I.M.I., our Artificial Intelligence for Managing Infrastructure platform, we will usher in a new age of automation and optimization for facilities.

We are seeking a skilled Site Reliability Engineer to join our growing team. The ideal candidate will help ensure the reliability, scalability, and performance of our hybrid (cloud and on-prem) platform while supporting our AI/ML infrastructure. You will work closely with our engineering, data science, and operations teams to build and maintain robust systems that support our cutting-edge solutions. Your expertise in ML and AI, integration with GenAI model providers, and experience with data centers and manufacturing sites will be crucial in driving the success of our platform.

Your background

  • Bachelor’s degree in Computer Science, Engineering, or a related field (or equivalent experience).

  • Proven experience as a Site Reliability Engineer or similar role in a SaaS environment, with a strong background in managing and optimizing cloud infrastructure (AWS preferred, or GCP, Azure), experience with ML and AI technologies including GenAI model integration, and familiarity with data center operations and manufacturing site integrations.

  • Proficiency in programming and scripting languages (e.g., Python, Go, Bash), experience with containerization and orchestration tools (Docker, Kubernetes), a strong understanding of networking, security, and performance optimization, and knowledge of CI/CD pipelines and DevOps practices.

  • Excellent problem-solving skills with attention to detail, strong communication and collaboration abilities, and the capacity to thrive in a fast-paced, dynamic startup environment.

What you’ll do

  • Design, implement, and maintain scalable systems while optimizing performance, ensuring high availability and disaster recovery, and assisting with codebase refactoring for modular deployment.

  • Develop and maintain automation tools to streamline operations, improve efficiency, and automate repetitive tasks to enhance system reliability.

  • Collaborate with engineering and data science teams to integrate ML and AI models into production environments, while working with the GenAI community to ensure seamless integration and high performance of cutting-edge models within our technology stack.

  • Respond to and resolve incidents to minimize impact and ensure timely resolution, while conducting post-incident reviews and implementing improvements to prevent recurrence.

  • Create and manage multiple cloud instances (dev, staging, test), optimize cloud infrastructure and data center operations, and ensure the security and compliance of both infrastructure and applications.

  • Identify areas for improvement and drive initiatives to enhance system reliability and performance, while staying updated on industry trends and advancements in SRE practices, ML, and AI technologies.
