Ardent Performance Computing

Jeremy Schneider


    Kubernetes Requests and Limits for Postgres


    As Joe Drumgoole said a few days ago: so many Postgres providers. Aiven, AWS, Azure, Crunchy, DigitalOcean, EDB, GCP, Heroku, Neon, Nile, Oracle, Supabase, Tembo, Timescale, Xata, Yugabyte… 🤯 I’m sure there’s more I missed. And that’s not even the providers using Postgres underneath services they offer with a different focus than Postgres compatibility. (I noticed Qian Li’s upcoming PGConf NYC talk in 2 weeks… I have questions about DBOS!)

    Kubernetes. I have a theory that more people are using kubernetes to run Postgres than we realize – even people on that list above. Neon’s architecture docs describe their sprinkling of k8s stardust (but not quite vanilla k8s; Neon did a little extra engineering here). There are hints around the internet suggesting some others on that list also found out about kubernetes.

    And of course there are the Postgres operators. Crunchy and Zalando were first out of the gate in 2017. But not far behind, we had ongres and percona and kubegres and cloudnativepg.

    Edit Nov 2: The first out of the gate was stolon in 2015. I missed it when I originally published this article.

    We are database people. We are not actually a priesthood (the act is only for fun), but we are different. We are not like application people who can spin a compute container anywhere and everywhere without a care in the world. We have state. We are the arch enemies of the storage people. When the ceph team says they have finished their fio performance testing, we laugh and kick off the database benchmark and watch them panic as their storage crumbles under the immense beating of our IOPS and their caches utterly fail to predict our read/write patterns. (I jest. We don’t really surprise each other that much anymore, except for occasional harmless office pranks.)

    But we all have at least one thing in common: none of us want to pay for a bunch of servers to sit around and do nothing, unless it’s really necessary. Since the dawn of time. From mainframes to PowerVM to VMware and now to kubernetes. We’re hooked on consolidating better and saving more money and kubernetes is the best drug yet.

    In kubernetes, you manage consolidation with two things: requests and limits.
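    To make the two knobs concrete, here is a minimal sketch of a Postgres pod where the request is deliberately below the limit (Burstable QoS). It assumes the official kubernetes Python client is installed; the pod name, image, and sizes are hypothetical placeholders, not recommendations.

```python
# Minimal sketch: a Postgres pod with "request < limit" (Burstable QoS).
# Assumes the official kubernetes Python client; name, image, and sizes
# are hypothetical placeholders, not recommendations.
from kubernetes import client

resources = client.V1ResourceRequirements(
    requests={"cpu": "500m", "memory": "1Gi"},  # what the scheduler reserves on a node
    limits={"cpu": "2", "memory": "4Gi"},       # cgroup ceiling enforced by the kernel
)

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="pg-dev"),
    spec=client.V1PodSpec(
        containers=[
            client.V1Container(
                name="postgres",
                image="postgres:16",
                resources=resources,
            )
        ]
    ),
)

# Setting requests == limits instead would put the pod in the "Guaranteed"
# QoS class, which comes up again later in this post.
print(client.ApiClient().sanitize_for_serialization(pod))
```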

    The Production Kubernetes book says this is a common question in the field.

    Because we’re hooked, I’ve noticed that in kubernetes circles there are proponents of no limits. There are counterarguments too. But for modern distributed microservice applications the no-limits argument has real merit. Especially if pods can be rescheduled (ie. shut down on one node and started on a different node) in a way that’s non-disruptive as traffic is seamlessly rerouted.

    Vanilla open source Postgres is not multi-master. Only one server is allowed to make changes to data. There is the open source BDR project which does active-active with logical replication (and requires careful setup, good understanding of tradeoffs, and much more hands-on operations) and there’s some interesting commercial development happening in this space – but today there is nothing that approaches the seamless experience we’d like for shutting down a Postgres node without some connections getting errors and/or rolled back transactions. On vanilla open source Postgres you can’t avoid at least a few seconds of unavailability during the failover. You can build your application to attempt reconnections and hide the errors; but that’s not trivial and it’s only useful for greenfield apps where you’re willing to do the extra dev work.
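    For illustration, here is a minimal sketch of that “attempt reconnections and hide the errors” idea, assuming the psycopg (v3) driver and a hypothetical DSN. A real implementation also has to think about transaction state, idempotency of retried statements, and backoff limits.

```python
# Minimal sketch: retry a query across a brief failover window.
# Assumes the psycopg (v3) driver; the DSN is a hypothetical placeholder.
import time
import psycopg

DSN = "host=pg-dev port=5432 dbname=app user=app"  # hypothetical

def run_with_retry(sql, params=None, attempts=5, delay=1.0):
    for attempt in range(1, attempts + 1):
        try:
            # a fresh connection per attempt keeps the sketch simple;
            # a real app would reuse pooled connections until one breaks
            with psycopg.connect(DSN) as conn:
                with conn.cursor() as cur:
                    cur.execute(sql, params)
                    return cur.fetchall()
        except psycopg.OperationalError:
            # connection refused or terminated, e.g. during a failover
            if attempt == attempts:
                raise
            time.sleep(delay)

print(run_with_retry("SELECT now()"))
```

    Note this only hides errors for read-only or idempotent statements; a write that was in flight when the connection dropped cannot simply be replayed without more care.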

    For this reason, I get a sense that most postgres+kubernetes people are definitely not in the “no limits” camp. I’m still asking around, but anecdotally I’ve heard the other extreme – the “request==limit” mentality – where leaving idle resources on the table is just the price we pay to avoid periodic unavailability and errors from a kubernetes scheduler decision to fail over your production database (say, during peak business hours, because someone kicked off end-of-month reporting in a different production pod on the same node).

    Can we do better? Can we do some oversubscription and find a way to reduce the risks?

    Let’s just start with dev machines, where we can care a little bit less about a few seconds of unavailability, as long as k8s rescheduling isn’t too frequent.

    CPU is fairly easy to reason about. But it’s harder to figure out how memory pressure will work, and how to size the buffer cache if we want to experiment with “request<limit”.

    We need to solve memory because that will be the limiting factor for consolidation. If we set the buffer cache and request size on the smaller side, and then put a bunch of developer DBs on a single machine hoping for burst, is the linux kernel good at sensing demand and growing a pod’s memory – giving it more page cache – when some developer is actively using their database? And will it release that memory under reduced load to avoid triggering rescheduling events?

    There are two things that can cause a pod to be terminated for memory reasons: (1) the kubernetes scheduler and (2) the linux kernel OOM killer. Both are important. Based on my initial digging around, I think these notes are accurate:

    • k8s control plane scheduler uses “request” to schedule pods. it will put lots of pods on a node if requests are low.
    • linux kernel uses k8s “limit” setting to set cgroups limits.
    • on an interval of 10+ seconds, k8s control plane scheduler also has the ability to evict & reschedule based on a memory metric. from checking references (1) (2), it looks like it’s using cgroups v2 memory.current which “includes page cache, in-kernel data structures such as inodes, and network buffers” and then subtracting the size of the inactive page list (see the sketch after this list). It assumes memory from inactive_file is reclaimable under pressure.
    • node memory is also managed by the linux kernel OOM killer. once a bunch of pods are running, if they start using a lot of memory before the k8s scheduler takes action, then the OOM killer can kick in.
    • important to also note that request==limit puts a pod in a different “QoS class” ensuring that other pods should be evicted first by the k8s scheduler. I don’t think this influences the OOM killer, which could still be triggered by a rapid workload spike. kubelet also adjusts OOM killer behavior based on QoS classes.
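    Here is a rough sketch of that working-set arithmetic for a single cgroup, read from inside the container (assuming cgroups v2 mounted at /sys/fs/cgroup; kubelet and cadvisor do the equivalent computation from their own vantage points, so treat this only as an approximation of the metric).

```python
# Rough sketch: approximate the "working set" number described in the list
# above, for the cgroup we are running in. Assumes cgroups v2 mounted at
# /sys/fs/cgroup (the usual view from inside a container).
from pathlib import Path

CGROUP = Path("/sys/fs/cgroup")

def read_int(name):
    return int((CGROUP / name).read_text())

def read_stat(name):
    stats = {}
    for line in (CGROUP / name).read_text().splitlines():
        key, value = line.split()
        stats[key] = int(value)
    return stats

current = read_int("memory.current")  # includes page cache, kernel structures, etc.
stat = read_stat("memory.stat")
working_set = current - stat["inactive_file"]  # inactive_file treated as reclaimable

mib = 2**20
print(f"memory.current : {current / mib:8.1f} MiB")
print(f"inactive_file  : {stat['inactive_file'] / mib:8.1f} MiB")
print(f"working set    : {working_set / mib:8.1f} MiB  (what eviction decisions see)")
```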

    Linux kernel memory management is tightly integrated with hardware capabilities. Modern processor memory management units (MMUs) automatically set bits in the page table entry (PTE) when pages are accessed or dirtied – and linux leverages these hardware features in its page eviction algorithm. If you’re not familiar with linux kernel memory management, you can start with the kernel docs at https://www.kernel.org/doc/gorman/html/understand/understand013.html (and if you’re new, LLMs are great for explaining a sentence or a word on pages like this).

    Per that page: “The LRU in Linux consists of two lists called the active_list and inactive_list. The objective is for the active_list to contain the working set of all processes and the inactive_list to contain reclaim candidates.”

    For a little more depth, this is good https://biriukov.dev/docs/page-cache/4-page-cache-eviction-and-page-reclaim/

    That page describes how each cgroup has two sets of active/inactive lists: one active/inactive set for anonymous memory and a separate active/inactive set for the page cache. (It might even be a little more complicated; there might be processor-local lists?)
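    Those four per-cgroup lists are visible in the same memory.stat file used above. A small sketch (same assumptions as the earlier one) to see how a pod’s memory splits between anonymous memory and page cache, active and inactive:

```python
# Sketch: per-cgroup LRU list sizes from cgroup v2 memory.stat, split between
# anonymous memory and page cache (assumes /sys/fs/cgroup is our own cgroup).
from pathlib import Path

stat = {}
for line in Path("/sys/fs/cgroup/memory.stat").read_text().splitlines():
    key, value = line.split()
    stat[key] = int(value)

mib = 2**20
print(f"anonymous : active={stat['active_anon'] / mib:8.1f} MiB"
      f"  inactive={stat['inactive_anon'] / mib:8.1f} MiB")
print(f"page cache: active={stat['active_file'] / mib:8.1f} MiB"
      f"  inactive={stat['inactive_file'] / mib:8.1f} MiB")
```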

    My hope is kindled. If the page cache “working set” is factored into cgroup memory management and k8s scheduler decisions, then there’s a chance that postgres doing a lot of file reads would be correctly interpreted by the linux kernel and k8s, allowing a cgroup to increase its memory usage (and page cache) up to the memory limit, and allowing the cgroup to release some of that memory if the system is idle.

    However, based on my non-expert understanding of how the active/inactive lists work, I worry that Postgres might bias toward keeping pages on the active list and using more memory even under smaller loads. After all – deciding what the phrase “working set” means in database-land is full of nuance, and the linux memory management algorithm is relatively naive. So this might work for almost totally idle dev containers, but if people are using their databases then it might trigger k8s rescheduling more than we’d like – rather than just letting some dev pods do a little more IO. Or the opposite problem – not increasing memory available to the page cache even under heavy IO. I don’t know. And either way, I don’t think linux exposes any tuning knobs for this?

    The journey continues. I still have a lot to learn and figure out here.


    Edit 7:30pm: I’d read Joe Conway’s 2021 article about this topic before, but last time I read it I was mostly interested in the cgroup bits and not the kubernetes bits. There’s also a 2022 follow-up article. Honestly I completely forgot it was mainly about kubernetes before publishing this, and I was just reminded. That’s a bit embarrassing. Re-reading it now :)


    Edit 9:00pm:

    Joe’s 2021 article highlighted that OOM problems were frequently being seen a few years back. His article is also eye-opening about some kubernetes quirks I hadn’t stumbled on yet. I was a bit familiar with cgroups v1 and v2, linux namespaces, overlaying filesystems, and some other underlying concepts – but I started directly learning kubernetes somewhat recently.

    A few follow-up things I wonder after reading Joe’s article:

    • I’m aware there are folks in the Postgres community who think overcommit should be disabled, but I lean toward disagreeing with this approach (and I’d question whether there’s consensus on the idea)
      • HOWEVER – “Kubernetes actively sets vm.overcommit_memory=1” worries me. I agree with Joe that this promiscuous overcommit doesn’t seem right. (A small sketch for checking a node’s overcommit posture follows this list.)
    • “an OOM kill can happen even when the host node does not have any memory pressure. When the memory usage of a cgroup (pod) exceeds its memory limit, the OOM killer will reap one or more processes in the cgroup.”
      • I think that linux will still try to evict candidate pages (like dropping clean pagecache pages) before invoking the OOM killer? Isn’t the issue just that it will NOT error out a malloc() call with “out of memory”? Erroring out malloc() is absolutely the behavior we want (and I think we can still get it with vm.overcommit_memory=0 under heavy pressure). That causes a single query to fail rather than crashing the whole database.
    • Running without swap seems like a not-great-idea to me too – and I didn’t realize back in 2021 that was the only option for kubernetes. It sounds like they added swap support but now I need to go figure out current state and whether this is enabled by default.
    • It’s interesting to me that back in 2021 Joe recommended choosing between “request==limit” (with 2x over-provisioning of pod memory) or “no limit” (with 2x over-provisioning of node memory). The amount of suggested memory over-provisioning makes me 😢
    • He also didn’t directly address how to set the buffer_cache for “request<>limit” cases. Off the top of my head, I think something like 50% of “request” might be a reasonable starting point.
    • I think Joe’s blog is based on cgroups v1 and I will need to review the changes in cgroups v2 to see how much it changes the picture specifically on memory management.
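    As promised above, here is a small sketch for checking a node’s overcommit posture – the sysctl value plus the commit accounting from /proc/meminfo. It assumes /proc is visible (the values are node-wide, not per-cgroup).

```python
# Small sketch: report the node's memory overcommit posture.
# Mode meanings: 0 = heuristic, 1 = always overcommit (what kubernetes sets),
# 2 = never overcommit (malloc can actually fail with ENOMEM).
from pathlib import Path

mode = int(Path("/proc/sys/vm/overcommit_memory").read_text())

meminfo = {}
for line in Path("/proc/meminfo").read_text().splitlines():
    key, rest = line.split(":", 1)
    meminfo[key] = int(rest.split()[0])  # values are reported in kB

print(f"vm.overcommit_memory = {mode}")
print(f"CommitLimit  = {meminfo['CommitLimit'] / 1024:10.1f} MiB")   # only enforced in mode 2
print(f"Committed_AS = {meminfo['Committed_AS'] / 1024:10.1f} MiB")  # memory promised so far
```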

    This also led me to re-read the kubernetes “Node-pressure Eviction” doc page more thoroughly (also linked above). A few key notes:

    But the most significant thing is the very last section of the page.

    Known issues: active_file memory is not considered as available memory

    The active_file statistic is the cgroup equivalent of Active(File) in meminfo which 100% means the page cache. This agrees with what I said above: “Postgres might bias toward keeping pages on the active list” and heavy I/O could aggressively trigger kubernetes rescheduling events. This doc suggests “request==limit”. We could still really use that kernel enhancement Joe mentioned – so that when someone runs a query and explicitly tells it to do a big in-memory sort, Postgres can error the query (via malloc error) instead of crashing the node.

    The journey continues. I still have a lot to learn and figure out here. So far this was mostly reading and reviewing past work… next should be some data collection.
