Site Reliability Engineer – Logging & Monitoring
Location: Bengaluru (Bangalore), Karnataka, India (Hybrid)
Department: Infrastructure & Technology
Employment Type: Full-time, Professional
🚀 Introduction
Embark on a high-impact role within IBM’s Software division, driving innovation across cloud-native, AI-powered solutions. At IBM, we prize curiosity, collaboration, and continuous growth. In this role, you’ll join a world-class team working on Cloud Object Storage (COS)—transforming data availability and reliability for enterprise-scale workloads.
IBM’s ecosystem spans Research, Software, Infrastructure, and Services. As an SRE in this domain, you’ll be at the core of innovation, building resilient, scalable observability solutions. Whether your strength is monitoring frameworks, container orchestration, or automation pipelines, your work will shape how our global clients operate their mission-critical storage services.
Key Responsibilities
As an SRE focusing on logging and monitoring for IBM’s COS offering, your responsibilities include:
- Design & Architecture of Observability Stack
  - Architect, configure, and deploy a robust monitoring system tailored to Cloud Object Storage.
  - Evaluate and integrate technologies such as Elasticsearch, Logstash, Kibana (the ELK stack), Kafka (including Kafka MirrorMaker), Filebeat, Grafana, and Sysdig.
  - Deliver scalable, maintainable observability solutions for global-scale services.
- Automation & Infrastructure as Code (IaC)
  - Leverage Ansible, Terraform, Jenkins, and Travis CI to automate provisioning, configuration, deployment, and updates of observability infrastructure.
  - Build and maintain CI/CD pipelines for continuous enhancements and streamlined deployments.
- Distributed & Cloud-Native Systems Experience
  - Work with microservices and distributed architectures deployed in containers and Kubernetes.
  - Tune log ingestion, metrics collection, and dashboard performance to manage high-throughput data volumes.
- Linux Systems & Programming Proficiency
  - Employ Linux administration best practices: package management, performance tuning, and system diagnostics.
  - Write production-grade code in Python, Java, and SQL for integration, custom tools, or data pipelines.
- Performance Tuning & Scale Engineering
  - Analyze data flow and ingestion patterns to optimize system performance as COS usage grows.
  - Implement proactive scaling and alerting to prevent bottlenecks or failures.
- Strategic Leadership & Observability Evangelism
  - Propose architectural enhancements, champion observability best practices, and guide the team in continuous improvement.
  - Collaborate with SREs and Dev teams to evolve visibility into the COS ecosystem.
- 24×7 On‑Call & Incident Management
  - Participate in rotational, around-the-clock support coverage.
  - Triage and resolve critical incidents: apply alerting logic, perform root cause analysis, and drive post-incident retrospectives.
- Dashboards & Analytics Design
  - Craft intuitive, actionable dashboards using Grafana, Kibana, or Sysdig UI.
  - Present metric trends such as request latency, error rates, resource usage, and storage performance.
- Integrated Alerting & Incident Response
  - Build alerting pipelines that tie Sysdig alerts to PagerDuty, email, Slack, and other channels.
  - Enable proactive incident detection, escalation, and automated recovery.
Required Qualifications
To excel in this role, you should bring:
- Educational Background
  - Bachelor’s degree in Computer Science, Engineering, or a related field.
- Experience & Technical Expertise (2–5 Years)
  - Proven Linux administration experience (Ubuntu, RHEL, CentOS).
  - Programming skills in Python (OOP) and Java, plus proficiency with SQL.
  - Hands-on experience with observability tools: Elasticsearch, Kibana, Filebeat, Kafka, Grafana, Sysdig.
  - Experience with infrastructure automation tools: Ansible, Terraform, Jenkins, Travis CI.
  - Well-versed in microservices, Docker, Kubernetes, and distributed system patterns.
  - Familiarity with Agile/Scrum development environments.
  - Competent with Jira and GitHub workflows.
- Core Competencies
  - Strong analytical and problem-solving mindset.
  - Tenacious approach to debugging and issue resolution.
  - Effective communicator who collaborates well with cross-functional teams.
  - Ownership attitude: champions observability and stability at scale.
About IBM Systems
IBM Systems empowers global enterprises with cognitive infrastructure that’s intelligent, adaptive, and hybrid-cloud optimized. Our services help businesses remain ahead of disruptions—turning data into insight and systems into active partners. As an SRE in this business unit, you’ll shape cutting-edge solutions that underpin everything from financial systems to AI platforms.
What It Means to Be an IBMer
At IBM, we believe our people are our greatest invention. Here’s how being part of the IBMer community empowers you:
- Client-Centricity & Impact
  - Deliver transformative solutions that matter; every project aligns with real-world impact and our customers’ growth.
- Continuous Learning & Development
  - Embrace a growth mindset: iterate, learn, adapt. IBM fosters exploration through mentorship, internal learning platforms, and certification opportunities.
- Courage & Innovation
  - We encourage experimentation: try new technologies, pivot quickly, and learn from outcomes.
- Belonging & Trust
  - A safe, respectful environment where diverse voices thrive. IBMer culture celebrates different backgrounds, perspectives, and experiences.
- Shared Accountability
  - We encourage feedback, collaborative goal-setting, and mutual support to achieve outstanding outcomes.
Equal Opportunity & Diversity Statement
IBM India is committed to being an equitable and inclusive employer. All qualified applicants will be considered without regard to race, color, religion, gender, gender identity or expression, sexual orientation, caste, genetics, pregnancy, disability, neurodiversity, age, veteran status, or any other protected characteristic. We welcome diverse perspectives and are committed to equitable hiring practices—without bias toward citizenship or immigration status.
Application Guidance
- Role Match – Quality Over Quantity:
  - IBM recommends applying to only 1–3 roles per year. Focus on roles where your background strongly aligns with the core requirements, particularly in observability, automation, and COS-scale systems.
- Location & Work Mode:
  - The position is based in Bengaluru, operating on a hybrid work model. Specific work location details will be clarified during recruitment.
- Next Steps:
  - Submit your resume and cover letter via the IBM Careers portal.
  - Tailor your application to highlight hands-on experience with ELK (or equivalent), Terraform, Kubernetes, and Python automation pipelines.
  - Showcase examples where you designed monitoring systems, improved alerting effectiveness, or supported mission-critical services.
Why This Role Matters
Reliability at Scale: COS is the bedrock of enterprise storage workloads—ensuring durability, performance, and visibility into data operations. As an SRE, you’ll enhance system observability to maintain seamless, 24×7 service delivery.
End-to-End Observability: You’ll own the full lifecycle—from log ingestion and metrics collection to dashboarding and incident management. Your work will directly impact how engineers and SREs perceive system health and respond to issues.
Global Impact & Learning: IBM’s client base spans industries—finance, healthcare, research, IoT—that rely on COS. As such, your observability solutions will serve high-profile global use cases, helping you grow while delivering measurable impact.
Cutting-Edge Stack: Gain hands-on exposure to top-tier open-source tools and enterprise-grade solutions. Work with Elasticsearch, Kafka, Kubernetes, Sysdig, Grafana, and more, all orchestrated through Terraform, Ansible, and CI/CD pipelines.
Career Progression: IBM’s hybrid ecosystem—from Systems to Cloud and Services—offers avenues to transition into roles like Platform Reliability Engineer, Site Reliability Engineering Lead, or Cloud Infrastructure Architect.
Sample Candidate Profile
Meet “Asha”, a mid-level SRE based in Bengaluru:
- Education: B.Tech in Computer Science
- SRE Experience (3 years):
  - Built a centralized ELK stack for real-time application logs.
  - Implemented Kafka-driven log forwarding and Filebeat agents on 500+ nodes.
  - Designed Grafana dashboards displaying request latency, error rates, and resource utilization.
  - Automated deployment using Terraform and Ansible, spinning up monitoring clusters in under 30 minutes.
  - Handled on-call rotations using PagerDuty and Slack and improved the incident resolution SLA by 35%.
  - Developed Python utilities to parse logs, generate alerts, and manage analytics.
- Skills: Kubernetes, Docker, Jenkins, Linux shell scripting, Jira/GitHub.
This role is the next step for Asha—at IBM she’ll lead observability at massive scale, mentor teams, and strengthen her expertise in reliability engineering.
Conclusion
If you’re passionate about observability, automation, and systems reliability, this is the role for you. At IBM Bengaluru, you’ll shape the future of cloud-native storage services, improve visibility into enterprise systems, and grow within an organization built on curiosity, inclusion, and continuous innovation.
Apply today to become part of the IBM Software Systems team—where reliability meets intelligence and infrastructure meets impact.
✅ Ready to Apply?
- Ensure your resume highlights:
  - Cloud Object Storage or large-scale distributed system experience
  - ELK/Kafka/Filebeat/Grafana/Sysdig implementations
  - Terraform/Ansible CI/CD workflows
  - Python/Java/SQL tooling and automation pipelines
- Limit applications to 1–3 roles per year to optimize recruiter attention and candidate experience.
- Have questions? Ask the IBM recruiter about office locations, hybrid work schedule, team structure, or technologies you’ll own.