How We Reduced Our p99 Latency by 80% Without Rewriting Anything

Taylor Liu · Signal
AWS · Observability · PostgreSQL

Our p99 latency was 4.2 seconds. Our SLA promised 2 seconds. We’d been living with this discrepancy for six months, assuming it would require a significant architectural change to fix. It required four targeted changes over three weeks, and none of them involved rewriting application code.

Finding the Real Root Cause

The first step was actually measuring where time was going in the p99 requests — something we’d been doing inadequately. We had average latency dashboards but not tail latency breakdowns by service and operation. Adding distributed tracing to our slowest endpoints immediately surfaced the pattern: p99 requests were hitting database query timeouts caused by lock contention on a specific table.
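
Here's a minimal sketch of that kind of instrumentation using the OpenTelemetry Python SDK. The endpoint and span names are illustrative, and a real deployment would export to a collector rather than the console:

    # Every request gets a root span; each operation inside it gets a
    # child span, so tail requests can be broken down by operation.
    from opentelemetry import trace
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter

    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    tracer = trace.get_tracer(__name__)

    def fetch_orders(customer_id):
        return []  # stand-in for the real database call

    def handle_request(customer_id):
        with tracer.start_as_current_span("GET /orders") as span:
            span.set_attribute("customer.id", customer_id)
            # Child spans like this one are what expose the database
            # as the tail-latency culprit.
            with tracer.start_as_current_span("db.query orders_by_customer"):
                return fetch_orders(customer_id)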

Change 1: Index Addition

The lock contention was caused by a full table scan on a write-heavy table during read operations. A composite index eliminated the scan. Two hours of work. p99 dropped from 4.2 seconds to 2.8 seconds.
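
The fix was a single DDL statement; here's a sketch against a hypothetical write-heavy events table (our real table and columns differ). CONCURRENTLY matters because a plain CREATE INDEX would block writes on exactly the table that was already contended:

    # Add a composite index without blocking writers.
    import psycopg2

    conn = psycopg2.connect("dbname=app")
    conn.autocommit = True  # CREATE INDEX CONCURRENTLY can't run in a transaction
    with conn.cursor() as cur:
        # Matches the read path's filter + sort so the planner can
        # skip the full table scan that was causing the contention.
        cur.execute("""
            CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_events_account_created
            ON events (account_id, created_at DESC)
        """)
    conn.close()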

Change 2: Connection Pool Tuning

Our connection pool was too small for our concurrency: under p99 load, requests were queueing while they waited for a free connection. Increasing the pool size and adding a connection timeout with a retry cut queueing latency significantly. p99 dropped to 1.9 seconds.
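
As a sketch of what that tuning looks like, here's a SQLAlchemy engine configuration; the numbers are illustrative, not our production values:

    # Size the pool for real concurrency and fail fast instead of
    # letting requests queue for the whole SLA budget.
    from sqlalchemy import create_engine, text
    from sqlalchemy.exc import TimeoutError as PoolTimeout

    engine = create_engine(
        "postgresql+psycopg2://app@db/app",
        pool_size=50,        # sized to peak concurrent request volume
        max_overflow=20,     # headroom for short bursts
        pool_timeout=2,      # seconds to wait for a connection
        pool_pre_ping=True,  # don't hand out dead connections
    )

    def query_with_retry(sql, retries=2):
        # A quick retry absorbs transient pool exhaustion without
        # stalling the request indefinitely.
        for attempt in range(retries + 1):
            try:
                with engine.connect() as conn:
                    return conn.execute(text(sql)).fetchall()
            except PoolTimeout:
                if attempt == retries:
                    raise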

Changes 3 and 4

Change 3 was caching a frequently read but rarely updated dataset that had been generating expensive queries on every request. Change 4 was shifting read-heavy endpoints toward our read replicas, raising replica utilization from 40% to 80%. Final p99: 0.85 seconds. Total engineering time: three weeks across two engineers.
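
A sketch of both changes together, with hypothetical hostnames, cache key, and TTL:

    # Change 3: read-through cache for a rarely-updated dataset.
    # Change 4: route the read path to a replica instead of the primary.
    import json
    import redis
    from sqlalchemy import create_engine, text

    replica = create_engine("postgresql+psycopg2://app@db-replica/app")
    cache = redis.Redis(host="cache", port=6379)

    def get_plan_catalog():
        cached = cache.get("plan_catalog")
        if cached is not None:
            return json.loads(cached)
        with replica.connect() as conn:  # reads no longer touch the primary
            rows = [dict(r._mapping) for r in conn.execute(text("SELECT * FROM plans"))]
        cache.set("plan_catalog", json.dumps(rows, default=str), ex=300)  # 5 min TTL
        return rows

The trade-off is a short staleness window on rarely-updated data in exchange for removing the expensive query from the hot path.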

Taylor Liu
Cloud infrastructure lead. Writes about cost optimization, Kubernetes, and platform engineering.
