-
Hello, first of all I want to thank you for the fantastic work you've done on this project!! The trend of separating data from compute has been pushed to the extreme, introducing a lot of sometimes unneeded complexity.

I took a look at the extension code and was a little surprised to see that models are loaded per connection process. This is not a huge deal for small models, but it could be a serious bottleneck for LLMs. In addition, you lose the ability to batch requests, and you are constrained to keep those expensive connections running, which is less of an issue if you have a connection pooler like pgcat. As I understand it, taking the embedding function as an example, the extension calls into a Python-wrapped function that uses the transformers lib. I think this design is highly inefficient for both memory and compute. I opened this as a discussion because I don't really know the ins and outs of building a pg extension and would like to understand the limits of this suggestion. I would also love to help if you accept contributions on this front 😺!
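To make the concern concrete, here is a rough sketch of the per-connection pattern I'm describing, using sentence-transformers as a stand-in for the Python path. This is illustrative only, not PostgresML's actual code, and the model name is just an example:

```python
# Illustrative only -- not PostgresML's implementation. If every Postgres
# backend process runs something like this, each connection pays the full
# load cost and holds its own copy of the weights.
from sentence_transformers import SentenceTransformer

# Executed once per connection process: hundreds of MB to many GB resident,
# plus seconds of load time before the first embedding is returned.
model = SentenceTransformer("intfloat/e5-small-v2")

def embed(texts: list[str]):
    # Each call only sees the texts from its own connection, so requests
    # arriving concurrently on other connections cannot share a forward pass.
    return model.encode(texts, batch_size=32)
```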
-
I think there are a few issues here to unpack.

Current State

Because of 1) & 2), the lowest-latency way to get predictions that use all available compute is often a single model in a single connection, supported by a query queue in PgCat. It's better to queue outside of Postgres than inside, and having fewer concurrent connections lowers lock contention while still letting you achieve large-scale resource utilization. This is also good because the query cache stays warm across many clients without having to re-initialize the model every time a client connects, similar to having a background worker hold the model. The drawback is that you need to set up PgCat or some other proxy/pooler, but that's pretty common in production environments at scale these days, so I think this is an acceptable state of affairs, although we could definitely use better documentation on this front. (There's a minimal sketch of this pattern at the end of this reply.)

Next Steps

I would prioritize work on removing Python from the inference path through one of the three following options so we can get around the GIL. This is part of the plan for PostgresML 3.0, although it's not currently under active development. I'd love to see some benchmarks and configurations of LLMs in these runtimes to see how much we can improve things. PRs are welcome on this front, but I'd love to coordinate somewhat closely as we explore the possibilities. This is somewhat tricky in that not all LLMs are supported by these options, although my impression is we could get most of the mainstream ones working.

End State

After we've removed Python from the execution path, it will be simpler to potentially share a model across multiple connections.

Additionally

The largest LLMs need to be loaded on the GPU, and shared memory on the GPU is even trickier than normal shared memory, but moving outside of Python is a prerequisite for that as well.

Finally

We've implemented a background worker that can share models across databases, not just connections, so that our serverless users can have even cheaper access to the latest LLMs, but we haven't open sourced it yet since we're still testing it internally. It still involves Python, and there is a fair amount of complexity around queuing, locks, memory management, etc. (e.g. what happens when someone tries to load a model and there isn't enough memory?). Our goal is to work more on documentation for the many use cases before adding too much more complexity and options, but if you can help drive things forward while we clean up the existing bits, PRs will always be welcome.
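For illustration, here is a minimal sketch of the "one warm model, queue in front" idea from the Current State section, expressed in application code rather than PgCat. The names, batch size, and threading layout are assumptions for the example, not anything in PostgresML or PgCat:

```python
# Sketch of queuing outside the model holder: a single worker keeps one warm
# model and opportunistically batches whatever requests have accumulated.
import queue
import threading
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")  # loaded once, stays warm
requests: queue.Queue = queue.Queue()

def worker():
    while True:
        text, reply = requests.get()
        batch = [(text, reply)]
        # Drain up to 32 pending requests so one forward pass serves many callers.
        while not requests.empty() and len(batch) < 32:
            batch.append(requests.get())
        embeddings = model.encode([t for t, _ in batch])
        for (_, reply_q), emb in zip(batch, embeddings):
            reply_q.put(emb)

threading.Thread(target=worker, daemon=True).start()

def embed(text: str):
    reply: queue.Queue = queue.Queue(maxsize=1)
    requests.put((text, reply))
    return reply.get()
```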
-
Wow! Thanks a lot for the detailed explanation and rationale behind the current design. I do have some further questions if that's ok.
If I understand correctly, the choice to load a model per connection in the database works as model replication, and the pooler is basically load-balancing requests and reusing the already-warm models. That's all great if the
I don't really get this point: the Python GIL prevents multithreading, not multiprocessing. I might be wrong here or have a simplistic view of PG, but as I understand it, Postgres spawns a process per connection. This means that either way we need synchronization across processes, not threads, if we want a shared-model architecture 🤔?
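For what it's worth, here's a minimal sketch of what I mean by cross-process sharing, using POSIX shared memory from Python's standard library. The segment name and array shape are made up for the example, and this is obviously not how PostgresML works today:

```python
# Hypothetical sketch: publish a weight buffer in shared memory so several
# processes (e.g. Postgres backends) could map the same bytes instead of each
# loading its own copy. Synchronization and lifecycle management are the hard
# parts and are omitted here.
import numpy as np
from multiprocessing import shared_memory

# "Owner" process: materialize the weights once and copy them into a named segment.
weights = np.random.rand(4096, 4096).astype(np.float32)  # stand-in for real weights
shm = shared_memory.SharedMemory(create=True, size=weights.nbytes, name="model_weights")
np.ndarray(weights.shape, dtype=weights.dtype, buffer=shm.buf)[:] = weights

# Any other process attaches by name and builds a zero-copy view of the same memory.
attached = shared_memory.SharedMemory(name="model_weights")
view = np.ndarray((4096, 4096), dtype=np.float32, buffer=attached.buf)
# ... run inference against `view` without reloading or duplicating the weights ...

attached.close()
shm.close()
shm.unlink()  # the owner removes the segment when it's no longer needed
```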
This sounds really cool, but isn't it a little bit against the idea of PostgresML, i.e. colocating compute and data? Okay, I might be pedantic here 😄 and as a data scientist I do understand that these huge models might need to be separated, but on the other hand there is a push to optimize model inference speed for LLMs and to use alternative accelerators, which would match perfectly with PostgresML.
I have contributed to the
Thanks again, I really appreciate the response you provided 👍🏼
-
I agree, this is an interesting question, and it depends on the complexity of the VIEW being used for inference, but I think that in the case where you only have the resources for one copy of the model, it is likely a large model, i.e. > 10GB. Even on a GPU, models at this scale take > 50ms per inference (depending on the number of tokens), which is significantly longer than SSD access times to page data in, so I'd suspect that even in this case you should be able to achieve > 90% GPU utilization with a single model/connection queuing outside the database. There could be other stalls though, like Python tokenization for smaller models, so any examples would be informative for us to optimize further. I still think Python is the likely low-hanging fruit here.
Correct. When I was writing that sentence, I was thinking about achieving inference with the weights in truly shared memory, across multiple threads in a single process (either a connection or a multithreaded background worker), similar to how connections can do parallel scans.
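As a rough sketch of that single-process direction: several threads driving one copy of the weights. How well this scales depends on how much of the work escapes the GIL (native kernels mostly do, Python-side glue does not); the library and model are just example choices, and concurrent encode() calls are assumed to be safe for the purposes of the sketch:

```python
# One copy of the weights in a single process, shared by several threads.
# The weights are not duplicated, but Python-level work still serializes on
# the GIL, which is why removing Python from the inference path matters.
import threading
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/e5-small-v2")  # single set of weights

def serve(texts):
    # Assumed thread-safe here for the sake of the sketch.
    return model.encode(texts)

threads = [threading.Thread(target=serve, args=(["example text"] * 8,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```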
Yep, you've nailed the thesis, which is another reason why this work hasn't been open sourced while we're in the exploration phase. Have you seen Yandex's recent work on TabR? Tight data/model integration is only going to get more and more important.
That'd be a great starting point to flesh out the best path forward. Thanks for your thoughts and for contributing to the discussion!