Watch the full demo — detection to Slack alert in under 90 seconds
At 2:18am, payment_settlement_daily crashed. The task writing settled transactions to Postgres failed after 14 seconds — 103 connections open against a database capped at 100. Three concurrent DAGs had quietly exhausted the pool.
The page went out. That part is the same as always.
What was different: by the time the page fired, Tapifra had already read the traceback, cross-referenced the concurrent DAG schedule, identified the root cause, verified the fix, and written it all to Slack. The oncall engineer saw it at 8am over coffee — root cause, one-line fix, done. No war room. No log diving. No 3am Zoom call.
"The page isn't the problem. Waking someone up to investigate is."
What actually happened
Three DAGs — payment_settlement_daily, reconciliation_hourly, and audit_report_weekly — were all scheduled within the same 5-minute window. Each one opens its own set of database connections. Under normal conditions this is fine. At 2:18am, with all three running concurrently, the connection count crept past 100.
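To see how three individually harmless schedules add up to an outage, here is a minimal sketch of the arithmetic. The start times, end times, and per-DAG connection counts are hypothetical, chosen only so the overlap totals the 103 connections reported in the incident; the `Run` type and `peak_connections` helper are illustrative names, not part of Airflow or Tapifra.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    dag_id: str
    start_min: int    # minutes after midnight
    end_min: int      # minutes after midnight
    connections: int  # hypothetical connections held while the run is active


def peak_connections(runs: List[Run]) -> int:
    """Return the highest number of connections held at any instant."""
    events = []
    for run in runs:
        events.append((run.start_min, run.connections))   # acquire at start
        events.append((run.end_min, -run.connections))    # release at end
    # Sort by time; at equal times, releases (negative) come before acquisitions.
    events.sort()
    current = peak = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


MAX_CONNECTIONS = 100

# Hypothetical numbers: all three runs start between 2:15am and 2:17am.
overlapping = [
    Run("payment_settlement_daily", 135, 150, 45),
    Run("reconciliation_hourly", 136, 148, 38),
    Run("audit_report_weekly", 137, 152, 20),
]

peak = peak_connections(overlapping)
print(peak, peak > MAX_CONNECTIONS)  # 103 True
```

With these made-up figures no single DAG comes close to the limit on its own. Only the overlap does, which is why this class of failure tends to go unnoticed until the schedules happen to line up.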
Postgres doesn't negotiate. When the 101st connection request came in, it returned:
Traceback (most recent call last):
  File "dags/payment_settlement_daily.py", line 112, in db_write_settlements
    conn = pool.getconn()
  File "psycopg2/pool.py", line 144, in getconn
    return self._connect(key)
psycopg2.OperationalError: FATAL: sorry, too many clients already
DETAIL: max_connections=100, active_connections=103
HINT: Consider increasing max_connections or using a connection pooler such as PgBouncer.
RuntimeError: DB connection pool exhausted — 3 concurrent runs of payment_settlement_daily are holding connections. Active: 103, limit: 100.
The settlement task hit its retry limit and died. Four downstream tasks — including notify_finance_team — never ran.
[Figure: Airflow DAG graph. Three tasks green, db_write_settlements red, notify_finance_team upstream-failed.]
What Tapifra did
Tapifra's monitoring loop caught the failure within 90 seconds of the task state flipping to failed. Here's what happened next.
1 — Detection
The agentic loop watching payment_settlement_daily flagged the failing task and pulled the full execution log for db_write_settlements. Not a summary — the raw traceback, Airflow task metadata, and run context (which other DAGs were running at the same time).
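As a rough illustration of what "the raw traceback" makes available, the sketch below pulls the connection limit and active count out of the DETAIL line shown in the traceback earlier. It assumes the log keeps that exact `max_connections=..., active_connections=...` wording; the function name and return shape are hypothetical and say nothing about how Tapifra parses logs internally.

```python
import re
from typing import Dict, Optional

# Matches the DETAIL line format from the incident log (an assumption about the log shape).
DETAIL_RE = re.compile(r"max_connections=(\d+),\s*active_connections=(\d+)")


def extract_connection_counts(log_text: str) -> Optional[Dict[str, int]]:
    """Return limit, active count, and overshoot, or None if the line is absent."""
    match = DETAIL_RE.search(log_text)
    if match is None:
        return None
    limit, active = (int(group) for group in match.groups())
    return {"limit": limit, "active": active, "over_by": active - limit}


sample = (
    "psycopg2.OperationalError: FATAL: sorry, too many clients already\n"
    "DETAIL: max_connections=100, active_connections=103\n"
)
print(extract_connection_counts(sample))
# {'limit': 100, 'active': 103, 'over_by': 3}
```

Structured numbers like these are what make the later steps checkable: a diagnosis of pool exhaustion can be compared against the actual counts instead of taken on faith.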
2 — Investigation with Claude Opus 4.7
Tapifra sent the failure context to Claude Opus 4.7 with a structured prompt built for SRE root cause analysis. The model read the traceback, cross-referenced the concurrent DAG schedule, and returned a precise diagnosis:
Root cause: Connection pool exhaustion. Active connections reached 103 against max_connections=100.
Immediate fix: Set max_active_runs=1 on payment_settlement_daily. Stagger DAG start times by 15 minutes.
Structural fix: Deploy PgBouncer in transaction pooling mode. Allows connection multiplexing and gives headroom as pipeline volume grows.
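The immediate fix could look roughly like the settings below. This is a sketch under stated assumptions: the cron templates and the 2am base hour are hypothetical, since the original schedules are not given here. `max_active_runs` is a real Airflow DAG argument, and `schedule` is the argument name in Airflow 2.4 and later (older versions use `schedule_interval`). The dictionary stands in for the keyword arguments each DAG definition would receive.

```python
STAGGER_MINUTES = 15

# Hypothetical cron templates, one per DAG; {m} is the minute field.
CRON_TEMPLATES = {
    "payment_settlement_daily": "{m} 2 * * *",  # daily, in the 2am hour
    "reconciliation_hourly": "{m} * * * *",     # every hour
    "audit_report_weekly": "{m} 2 * * 0",       # Sundays, in the 2am hour
}

dag_settings = {}
for index, (dag_id, template) in enumerate(CRON_TEMPLATES.items()):
    # Offset each DAG's start minute by 15 minutes from the previous one.
    settings = {"schedule": template.format(m=index * STAGGER_MINUTES)}
    if dag_id == "payment_settlement_daily":
        # Allow only one run of this DAG at a time, per the diagnosis.
        settings["max_active_runs"] = 1
    dag_settings[dag_id] = settings

for dag_id, settings in dag_settings.items():
    print(dag_id, settings)
```

In a real DAG file these values would be passed to the DAG constructor for each pipeline. The stagger keeps the three start times from landing in the same window, while `max_active_runs=1` stops runs of the settlement DAG from stacking up on each other.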
3 — Verifier pass
Before posting anything to Slack, Tapifra ran a second reasoning pass — checking that the proposed fix was consistent with the error signal, that no alternative explanation fit the evidence better, and that the suggested parameters were reasonable for the observed workload. False confidence in an incident is worse than no alert.
The verifier confirmed: connection pool exhaustion, high confidence. Fix is correct.
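The verifier described above is a second reasoning pass, not a rule engine. Still, the kind of consistency check it performs can be illustrated with a few plain rules. The sketch below is a simplified, rule-based stand-in: the `Diagnosis` type, the `connection_pool_exhaustion` label, and the specific checks are assumptions made for illustration, not Tapifra's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Diagnosis:
    root_cause: str
    proposed_max_active_runs: int


def verify(diagnosis: Diagnosis, error_text: str, active: int, limit: int) -> List[str]:
    """Return a list of objections; an empty list means the diagnosis fits the evidence."""
    objections = []
    if diagnosis.root_cause == "connection_pool_exhaustion":
        # The error signal should actually mention a client limit.
        if "too many clients" not in error_text:
            objections.append("error text does not mention a client limit")
        # The observed counts should actually exceed the limit.
        if active <= limit:
            objections.append("active connections do not exceed the limit")
    else:
        objections.append("unrecognized root cause label")
    # The suggested parameter should be a sensible value.
    if diagnosis.proposed_max_active_runs < 1:
        objections.append("max_active_runs must be at least 1")
    return objections


diagnosis = Diagnosis("connection_pool_exhaustion", proposed_max_active_runs=1)
print(verify(diagnosis, "FATAL: sorry, too many clients already", active=103, limit=100))
# []
```

An empty list corresponds to the "high confidence" outcome in this incident. Any objection would be a reason to hold the alert back or flag it as uncertain.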
[Figure: Tapifra's investigation output, showing root cause, severity, and fix suggestion.]
4 — Slack alert with root cause and fix
Within minutes of detection, this landed in #tapifra-demo:
[Figure: The actual Slack alert, with root cause, fix suggestion, and severity, posted by Tapifra automatically.]
The engineer on call never had to get out of bed to investigate. They woke up at 8am, saw the Slack message, applied max_active_runs=1, and added PgBouncer to the next sprint. A one-line config change, made in daylight, with full context already in front of them.
Why this matters for fintech specifically
Settlement pipelines aren't optional. When payment_settlement_daily fails, you're not just looking at a system alert — you're looking at unprocessed transactions, downstream ledger inconsistencies, and a finance team that opens their morning to broken numbers. Every minute of MTTR in a payment system has a real cost.
The industry reflex is to throw PagerDuty at the problem and rotate oncall engineers through a schedule nobody wants. That works until your team is three people and your pipelines run at 2am across six time zones.
What Tapifra is not
Tapifra is not a fancier alert router. It doesn't just reformat a log line and send it to Slack. The value in the scenario above isn't that we detected the failure — Airflow already knew it failed. The value is that we identified why, verified the diagnosis, and surfaced a concrete fix, all before any human had to open a laptop.
That distinction matters enormously at 2am. An alert that says "DAG failed" gets acknowledged and deferred. An alert that says "here's the root cause, here's the exact config change to make, and here's why" gets actioned.
A pager doesn't know the fix is max_active_runs=1. Tapifra does. It's the difference between a pager and an SRE.
The right question to ask your team
Not "do we have monitoring?" — you have monitoring. The question is: when a pipeline fails at 2am, does your team know the root cause before a human touches a keyboard?
If the answer is no, you're paying oncall engineers to do work that a system should be doing. That's expensive, it burns out your best engineers, and it scales badly as your pipeline count grows.
Tapifra is built by engineers with 8 years of production SRE experience. We've been the person on call at 2am. We built the thing we wish we'd had.
We're onboarding a small group of fintech engineering teams first. No per-seat pricing. 46% cheaper than PagerDuty. Connect in under 5 minutes.
Get early access