Ardent Performance Computing

Jeremy Schneider

    Losing Data is Harder Than I Expected

    Posted by Jeremy Schneider. 1 Comment.
    Filed Under: kubernetes, Planet, PostgreSQL, Technical

    This is a follow‑up to the last article: Run Jepsen against CloudNativePG to see sync replication prevent data loss. In that post, we set up a Jepsen lab to make data loss visible when synchronous replication was disabled, and to show that enabling synchronous replication prevents it under crash‑induced failovers.

    Since then, I’ve been trying to make data loss happen more reliably in the “async” configuration so students can observe it on their own hardware and in the cloud. Along the way, I learned that losing data on purpose is trickier than I expected.


    Methodology and a Kubernetes caveat

    To simulate an abrupt primary crash, the lab uses a forced pod deletion, which is effectively a kill -9 for Postgres:

    kubectl delete pod -l role=primary --grace-period=0 --force --wait=false

    This mirrors the very first sanity check I used to run on Oracle RAC clusters about 15 years ago: “unplug the server.” It isn’t a perfect simulation, but it’s a simple, repeatable crash model that’s easy to reason about.

    I should note that the label role is deprecated by CNPG and will be removed. I originally used it for brevity, but I will update the labs and scripts to use the label cnpg.io/instanceRole instead.
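    For reference, the equivalent forced deletion using the non-deprecated label would look like this (assuming the same CNPG cluster setup; this requires a running cluster, so it is shown here only as the command):

```shell
# Same forced deletion as above, but selecting the primary via the
# non-deprecated CNPG label instead of the legacy "role" label.
kubectl delete pod -l cnpg.io/instanceRole=primary --grace-period=0 --force --wait=false
```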

    After publishing my original blog post, someone pointed out an important Kubernetes caveat with forced deletions:

    Irrespective of whether a force deletion is successful in killing a Pod, it will immediately free up the name from the apiserver. This would let the StatefulSet controller create a replacement Pod with that same identity; this can lead to the duplication of a still-running Pod

    https://kubernetes.io/docs/tasks/run-application/force-delete-stateful-set-pod/

    This caveat would apply to the CNPG controller just like a StatefulSet controller. In practice, for my tests, this caveat did not undermine the goal of demonstrating that synchronous replication prevents data loss. The lab includes an automation script (Exercise 3) to run the 5‑minute Jepsen test in a loop for many hours and collect results automatically.
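    The shape of that loop can be sketched roughly as follows. The script and file names here are hypothetical placeholders; the lab's Exercise 3 script is the authoritative version (the results path follows Jepsen's usual store/ layout):

```shell
# Hypothetical sketch of the looped test. "run-jepsen-append.sh" is a
# placeholder, not the lab's actual script name.
for i in $(seq 1 100); do
  # Crash the primary partway through the 5-minute run, in the background.
  ( sleep 150 && kubectl delete pod -l cnpg.io/instanceRole=primary \
      --grace-period=0 --force --wait=false ) &
  ./run-jepsen-append.sh                          # one 5-minute Jepsen "append" run
  wait                                            # let the background crash finish
  cp store/latest/results.edn "results-$i.edn"    # collect the report for later analysis
done
```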

    Hardware included an inexpensive HP EliteBook (Ryzen Pro 5, $299 on Amazon) running two CNPG Lab VMs via Hyper‑V, plus multiple cloud instance types. I ran long loops (8–20 hours) and aggregated failure rates across configurations.

    I’m considering bringing Chaos Mesh into the lab in the future, but for now I’m sticking with the explicit crash model above because it’s easy for folks to see exactly what it does.

    High‑level results:

    • With synchronous replication: 1,061 five‑minute runs, 0 data‑loss failures.
    • With asynchronous replication: 1,448 runs, 478 data‑loss failures.

    These are the total counts across all runs from three different sets of experiments.
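    As a quick arithmetic check, the aggregate failure rates work out from these totals directly (a small awk one-off, nothing lab-specific):

```shell
# Aggregate data-loss rates across all runs (counts from the totals above).
awk 'BEGIN {
  printf "sync:  %.1f%% of 1061 runs lost data\n", 100 * 0 / 1061
  printf "async: %.1f%% of 1448 runs lost data\n", 100 * 478 / 1448
}'
# prints:
# sync:  0.0% of 1061 runs lost data
# async: 33.0% of 1448 runs lost data
```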

    Experiment 1: Checkpoints and replica count

    Hypothesis A: Increase replication traffic (shorter checkpoint intervals, which cause more full‑page writes) to raise the odds of “unshipped” WAL at crash time ⇒ more losses with async.

    Hypothesis B: Fewer replicas (2 instances total instead of 3) might make losses more likely.

    Each row below shows the fraction of async runs that showed data loss.

    I also ran two of the configurations with sync replication enabled. No data loss was observed in either of the runs with sync replication.

    Checkpoint         3 instances            2 instances
    5 min (default)    5% async / 0% sync     24% async
    30 seconds         5% async               15% async / 0% sync
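    The 30-second checkpoint setting is a plain Postgres parameter; in a CNPG manifest it can be expressed roughly like this (a sketch using CNPG's postgresql.parameters passthrough; verify the field path against the docs for your operator version):

```yaml
# Sketch: shortening checkpoint_timeout via CNPG's Postgres parameter
# passthrough. 30s is the minimum value Postgres allows for this setting.
spec:
  postgresql:
    parameters:
      checkpoint_timeout: "30s"
```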

    Findings: Hypothesis B was right: two instances amplified data loss. Hypothesis A was wrong: shorter checkpoints did not increase loss rates here, and even correlated with slightly fewer losses.

    Experiment 2: Jepsen rate and thread count

    I varied the transaction rate and the number of client threads. My intuition was that higher rates would increase the chance of a commit landing during a crash window, and that fewer threads might improve per‑thread throughput (given CPU saturation).

    Rate    50 threads               20 threads
    1000    24% (cf. experiment 1)   8%
    2000    51%                      38%
    3000    80%                      39%
    4000    N/A                      N/A

    Findings: Higher rates increased loss frequency (as expected). Reducing thread count lowered CPU pressure, and surprisingly it also reduced loss frequency, even when achieving similar rates. The “4000” rate did not complete successfully; Jepsen analysis stalled and timed out.

    The most reliable async configuration for provoking visible loss so far: 2 instances total, rate 3000, 50 threads.

    Experiment 3: Hardware differences

    To ensure reproducibility beyond my laptop, I repeated runs on several cloud instance types.

    Hardware                      async   sync
    AWS m7g                       26%     0%
    AWS m6g                       23%     0%
    Azure Dpsv6                   51%     0%
    HP EliteBook (Ryzen 5675U)    75%     0%

    I didn’t expect the spread in async failure rates. My current guess is that some combination of CPU and/or IO saturation characteristics changes the window for unreplicated commits. The takeaway for teachers and students: if you want to reliably see data loss, Azure Dpsv6 performed best in my runs (about half of iterations saw data loss).

    What this means

    • Synchronous replication remains the guardrail. Across thousands of minutes of testing, I did not observe a single instance of data loss with sync enabled under these test configurations.
    • Topology matters. Two instances (one replica) increases the chance of async loss versus three instances.
    • Workload shape matters. Higher rates raise loss frequency; fewer client threads can reduce it even at similar throughput.
    • Hardware matters. Different CPU/IO profiles change how often you’ll catch an in‑flight commit during a crash.

    Reproduce it yourself

    Use the CloudNativePG LAB and Exercise 3 to run the Jepsen “append” workload and induce rapid primary failures. The looped test and automatic report upload are included. If your goal is to demonstrate loss in async mode, start with:

    • 2 instances
    • rate 3000
    • 50 threads

    If Jepsen analysis is stalling and timing out, try reducing the rate to 2000. And if you have the option, try Azure Dpsv6 for the highest chance of observing loss quickly.
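    To flip the same cluster over to synchronous replication and verify the zero-loss result, a minimal CNPG manifest along these lines should work. This is a sketch, not the lab's actual manifest: the cluster name is hypothetical, and the minSyncReplicas/maxSyncReplicas fields should be checked against the CNPG Cluster API docs for your operator version:

```yaml
# Hypothetical sketch of a 2-instance CNPG cluster with synchronous
# replication enabled (one synchronous standby).
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: jepsen-lab        # placeholder name
spec:
  instances: 2
  minSyncReplicas: 1
  maxSyncReplicas: 1
  storage:
    size: 1Gi
```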


    About Jeremy

    Building and running reliable data platforms that scale and perform. about.me/jeremy_schneider

    Discussion

    Trackbacks/Pingbacks

    1. Pingback: Losing Data with PostgreSQL and Jepsen – Curated SQL, September 29, 2025


    Disclaimer

    This is my personal website. The views expressed here are mine alone and may not reflect the views of my employer. I am currently looking for consulting and/or contracting work in the USA around the Oracle database ecosystem.

    contact: 312-725-9249 or schneider @ ardentperf.com


