Senior Site Reliability Engineer (SRE) - (Dublin, CA) Job at Articul8, Dublin, CA

  • Articul8
  • Dublin, CA

Job Description

About Us

Articul8 AI is at the forefront of Generative AI innovation, delivering cutting-edge SaaS products that transform how businesses operate. Our platform empowers organizations to leverage the power of artificial intelligence in a reliable, scalable, and secure environment.

Position Overview

We are seeking an experienced Site Reliability Engineer (SRE) to join our team and help ensure the reliability, performance, and scalability of our GenAI SaaS platform. As an SRE, you will bridge the gap between development and operations, implementing automation and best practices to maintain our service reliability objectives while supporting rapid innovation.

Key Responsibilities

  • Architect and maintain scalable, highly available infrastructure for our GenAI platform.
  • Design and implement robust monitoring, alerting, and observability solutions to proactively ensure system health and performance.
  • Automate deployment, scaling, and management of our cloud-native infrastructure, reducing toil and improving efficiency.
  • Define, measure, and improve Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to deliver outstanding service quality.
  • Participate in on-call rotations and provide rapid response to production incidents, minimizing downtime and user impact.
  • Collaborate closely with development teams to build reliable, scalable, and efficient systems for complex AI workloads.
  • Lead incident response efforts, conduct thorough post-mortems, and champion continuous improvement initiatives.
  • Optimize infrastructure for performance, scalability, and cost-effectiveness, especially for high-demand AI workloads.
  • Implement and enforce security best practices across all systems and environments.
  • Create and maintain comprehensive documentation, including runbooks and knowledge base articles, to foster a culture of shared knowledge.

Qualifications

Required

  • Bachelor's degree in Computer Science, Engineering, or related field, or equivalent practical experience
  • 5+ years of experience in DevOps, SRE, or similar roles
  • Strong experience with cloud platforms (AWS, GCP, or Azure)
  • Proficiency in at least one programming/scripting language (Python, Go, Bash, etc.)
  • Hands-on experience with infrastructure as code tools (Terraform, CloudFormation, etc.)
  • Solid background in containerization technologies (Docker, Kubernetes)
  • Proven experience with monitoring and observability tools (Prometheus, Grafana, ELK stack, etc.)
  • Strong understanding of CI/CD pipelines and automation
  • Exceptional problem-solving skills and the ability to troubleshoot complex systems

Preferred

  • Experience supporting AI/ML systems in production
  • Knowledge of GPU infrastructure management and optimization
  • Familiarity with distributed systems and high-performance computing
  • Experience with database systems (SQL and NoSQL)
  • Certifications in cloud platforms (AWS, GCP, Azure)
  • Experience with chaos engineering and resilience testing
  • Knowledge of security best practices and compliance requirements

Ready to shape the future of resilient software systems? Apply now and help drive the reliability of tomorrow’s AI at Articul8 AI!

Similar Jobs

Ross Stores, Inc.

Area Supervisor/Department Manager Job at Ross Stores, Inc.

 ...Overview Area Supervisor is a member of Store Leadership responsible for a specific, assigned area of the Store as well as the general...  ...operations and supervision of the Store when functioning as the Manager on Duty. They are responsible for opening and closing the Store,... 

Vaco by Highspring

IT Technical Writer Job at Vaco by Highspring

 ...our team. The ideal candidate will have a strong background in writing technical content for IT systems, applications, and processes....  ...be eligible for discretionary bonuses, and can participate in medical, dental, and vision benefits as well as the company's 401(k) retirement... 

Aniesispharma

Software developer Job at Aniesispharma

 ..., 20 hours per week Position: Software Developer Company Overview: Aniesispharma is a leading pharmaceutical company based in California...  ...of our research and development processes. This is a part-time position, with a commitment of 20 hours per week, based in... 

Zobility

Trading Enablement Assistant Job at Zobility

 ...Execute and confirm routine securities transactions initiated by customers. Provide quality customer service addressing issues, trading strategies, market terminology, and account status. Identify and recommend opportunities for process improvement and risk control development... 

L3Harris Technologies

Statistician Job at L3Harris Technologies

Overview Join to apply for the Statistician role at L3Harris Technologies. L3Harris is dedicated to recruiting and developing high-performing talent who are passionate about what they do. Our employees are unified in a shared dedication to our customers' mission and quest...