Senior Site Reliability Engineer - Observability
Company: Dimensional
Location: Austin
Posted on: April 1, 2026
Job Description:

About the Role:
We are looking for a Senior SRE to join our Platform Engineering
team as the operations owner of our observability platforms. You’ll
be responsible for the reliability, scalability, and continued
evolution of the tools that give our engineering organization
visibility into everything they build and run. The current
observability platform consists primarily of an on-premises ELK
Stack (Elasticsearch, Logstash, Kibana) and Grafana, with some
exposure to New Relic and SolarWinds.

This is a hybrid role: roughly half your time will be spent on
steady-state operations and platform support, and the other half on
engineering projects that meaningfully advance the platforms you
support. It’s a great fit for someone who is genuinely motivated by
the pursuit of excellence – not just sustaining what works but
relentlessly refining it. You take pride in the platforms you own,
and that pride drives you to keep improving them, whether that
means tightening an SLO, eliminating a source of toil, or building
something that gives teams faster insight into their systems.

What You’ll Work On:

Operations & Reliability (~50%)
- Serve as a primary escalation point for production support
  involving the ELK Stack, Grafana, and New Relic
- Own platform health, capacity planning, and performance tuning for
  on-premises observability infrastructure – including Elasticsearch
  cluster management, index lifecycle policies, and retention
  strategies
- Monitor and maintain SLOs for the observability platforms, ensuring
  the tools engineers depend on are highly available and performant
- Support engineering teams in onboarding to observability platforms
  – helping teams instrument their applications, build dashboards,
  and define meaningful alerts
- Manage patching, upgrades, and configuration management across the
  observability stack
- Collaborate with security to harden platform configurations and
  manage software vulnerabilities
- Contribute to on-call rotations and maintain runbooks and
  escalation procedures

Platform Engineering (~50%)
- Design and build tooling and automation to reduce toil and improve
  the experience for teams using observability platforms
- Lead or contribute to platform modernization initiatives – e.g.,
  improving ingestion pipelines, scaling platform capacity,
  standardizing Grafana dashboard and alerting patterns, or
  evaluating new capabilities within the existing stack
- Develop and maintain infrastructure-as-code (Terraform, Helm,
  Ansible, etc.) for platform components
- Build and enforce standards around logging, metrics, and alerting
  that help engineering teams adopt observability best practices at
  scale
- Participate in design reviews and contribute to the overall
platform roadmap

What We’re Looking For:
- Bachelor’s degree in a technical field or equivalent practical
  experience
- 5 years of experience in SRE, DevOps, or platform engineering roles
- Deep hands-on experience with the ELK Stack – Elasticsearch cluster
  operations, Logstash pipeline development, Kibana, and index
  lifecycle management
- Strong experience with Grafana, including data source integrations,
  dashboard design, and alerting
- Solid understanding of observability principles
- Experience operating on-premises infrastructure, including capacity
  planning, server management, and the operational tradeoffs compared
  with managed cloud services
- Proficiency in Python for automation and tooling; familiarity with
  shell scripting
- Strong Linux systems knowledge and comfort working with
  configuration management tools (e.g., Ansible, Chef, Puppet)
- Demonstrated ability to drive incidents to resolution and
  communicate clearly under pressure
- A bias toward
automation and a low tolerance for repetitive manual work

Nice to Have:
- Experience with Prometheus
- Experience with New Relic administration or APM instrumentation
- Familiarity with log shipping agents and pipeline tools such as
  Beats, Fluentd, or Fluent Bit
- Experience with distributed tracing tools like OpenTelemetry
- Exposure to cloud-based observability offerings and experience
  thinking through hybrid strategies
- Prior experience building or governing observability standards
  across a large engineering organization

LI-Hybrid

Dimensional offers a variety of programs to
help take care of you, your family, and your career, including
comprehensive benefits, educational initiatives, and special
celebrations of our history, culture, and growth. It is the policy
of the Company to provide equal opportunity for all employees and
applicants. The Company recruits, hires, trains, promotes,
compensates, and administers all personnel actions without regard
to actual or perceived race, color, religion, religious practice,
creed, sex, sex stereotyping, pregnancy (which includes pregnancy,
childbirth, and medical conditions related to pregnancy,
childbirth, or breastfeeding), caregiver status, gender, gender
identity, gender expression, transgender identity, national origin,
age, mental or physical disability, ancestry, medical condition,
marital status, familial status, domestic partnership status,
military or veteran status or service, unemployment status,
citizenship status or alienage, sexual orientation, status as a
victim of domestic violence, status as a victim of stalking, status
as a victim of sex offenses, genetic information, political
activities or recreational activities, arrest or conviction record,
salary history, natural hairstyle or any other status protected by
applicable law except as otherwise required or permitted by law or
regulation applicable to the Company or its affiliates.
Keywords: Dimensional, Temple, Senior Site Reliability Engineer - Observability, Engineering, Austin, Texas