postgresml/postgresml


Ergonomic pain points when designing a multi-model system #1653

zzzint started this conversation in General


zzzint
Nov 8, 2024

Hi PGML team,

I've been building out a system backed by PostgresML and wanted to share some feedback around my experience.

What's been going well:

The general vision for PGML feels genuinely great. The ability to do embedding generation, model training and prediction all within SQL allows for some incredibly convenient system designs.
Being able to just generate embeddings as computed columns is wonderful.

Pain Points / Oddities:

Vector Compatibility with Clustering Models
- Clustering models don't seem to accept vectors directly when training or predicting.
  [XX000] ERROR: Unhandled type for quantitative scalar column: embeddings "vector"
- Vectors seem to need to be cast to something like a real[].
- This isn't really that big of a deal, it just strikes me as odd. When predicting, we end up needing both:
  - Embeddings as a real[] to predict on.
  - Embeddings as a vector for vector-to-vector comparisons.
User-Specific Model Training
- This is where the product is unfortunately feeling genuinely unusable to me at the moment.
- We're looking to train clustering models on per-user embeddings. Imagine something like below: a table partitioned by user such that we can pass an exact partition to .train.
- .train errors when encountering the user_id column:
  [XX000] ERROR: Unhandled type for quantitative scalar column: user_id "uuid"
- There doesn't seem to be a way to tell PGML to ignore specific columns during training.
- There doesn't seem to be any other reasonable way of accomplishing this kind of training:
  - Subqueries can't be passed to .train in place of a relation name.
  - Tables or views per user with no actual reference to the user feel like far too big of a design smell to actually consider.

CREATE TABLE user_embeddings (
    user_id    uuid NOT NULL
        CONSTRAINT user_embeddings_user_id_pk PRIMARY KEY
        CONSTRAINT user_embeddings_user_id_fk REFERENCES users
            ON UPDATE RESTRICT ON DELETE CASCADE,
    embeddings vector(384) NOT NULL
)
    PARTITION BY LIST (user_id);
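
As a rough sketch of the two representations described under Vector Compatibility: the same column can be cast to real[] for pgml.predict while remaining a vector for pgvector's distance operators. The project name 'user_clusters' is hypothetical, and the query assumes a clustering project has already been trained on real[] features.

```sql
-- Assumes the pgml and pgvector extensions are installed and that a
-- clustering project named 'user_clusters' (hypothetical) already exists.
SELECT
    a.user_id,
    -- pgml.predict takes a real[] feature array, so the vector is cast.
    pgml.predict('user_clusters', a.embeddings::real[]) AS cluster_id,
    -- pgvector operators work on the vector type directly;
    -- <=> is cosine distance.
    a.embeddings <=> b.embeddings AS cosine_distance
FROM user_embeddings AS a
CROSS JOIN user_embeddings AS b
WHERE a.user_id <> b.user_id
LIMIT 10;
```

The cast happens at query time, so only the vector column needs to be stored.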

Questions

Am I missing something that would help solve for this per-user training situation?
If not, should we start some discussion around how we might improve the ergonomics of .train so that a use case like the above could be designed around?
Is it unreasonable to expect clustering model train & predict methods to accept vectors?


Replies: 2 comments 1 reply


montanalow
Nov 8, 2024
Maintainer

Thank you for this feedback.

1. Let's create an issue to track vector compatibility. Our APIs predate pgvector's dominance, but we should move in that direction with the community. This isn't a particularly big lift, but it's a little bit odd for one extension to depend on another, since Postgres extensions don't have an official dependency resolution mechanism. We can probably sort this out dynamically.
2. User-specific models are interesting.
   a) You could create a view for each user partition, but yeah, this is a hack.
   b) Many ML libs (our dependencies) don't support uuid as inputs. The quick fix is to map this to an int, but high-cardinality categoricals don't usually make good features. It's usually better to do some feature engineering on whatever attributes your users might have, like account age, email domain, referrer domain, number of interactions...
   c) We could add support for partitioned models. This is feasible, but partitioning data/models usually creates a bunch of inferior models at much greater computational training expense. Can you share more details about this use case so we can prioritize appropriately?
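
For reference, the view-per-partition workaround from (a) might look roughly like the following. This is only a sketch: the partition name user_embeddings_u1, the view name user_u1_features, the project name 'user_u1_clusters', and the n_clusters value are all hypothetical. It also assumes the partition holds enough rows to cluster.

```sql
-- Hypothetical partition for a single user; the view exposes only the
-- embedding, cast from vector to real[], so .train never sees user_id.
CREATE VIEW user_u1_features AS
SELECT embeddings::real[] AS embeddings
FROM user_embeddings_u1;

-- Train one clustering project for that user. n_clusters = 5 is an
-- arbitrary placeholder, not a recommendation.
SELECT * FROM pgml.train(
    'user_u1_clusters',
    'clustering',
    'user_u1_features',
    algorithm   => 'kmeans',
    hyperparams => '{"n_clusters": 5}'
);
```

The view keeps the uuid column out of the feature set, at the cost of one view and one project per user.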


1 reply


zzzint Nov 8, 2024
Author

Ah ya I can definitely see being hesitant around relying on another extension. I can take care of creating an issue tomorrow if one hasn't been created by then.

2.b) Ya that makes sense - I think my intuitive expectation would either be:

Some kind of way to omit user_id (or any subset of columns) from being considered during training.
- This jives well with this partitioned table strategy, where we're passing along a specific partition as the relation_name, but maintaining partitioned tables is a hassle that I'd be hesitant to recommend as a first-class strategy for solving this kind of problem.
Some kind of way to inform .train via a group-by-esque argument that we want a model per whatever we're grouping by.
- This jives well with just passing along a single table or view, but you lose the explicitness of providing an exact project_name per invocation of .train.
The ability to pass along a subquery rather than a table or view.

I'm not extremely familiar with the ML world, so these intuitive expectations are strictly from an API-design perspective and much less of a "what might consumers of ML tools generally expect" perspective. I could be way off base!
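
Until something like a group-by argument exists, the "model per group" idea can be approximated with a PL/pgSQL loop. This is only a sketch under several assumptions: user_embeddings holds many rows per user, the features_ and clusters_ naming is hypothetical, and n_clusters is an arbitrary placeholder.

```sql
DO $$
DECLARE
    u         record;
    view_name text;
BEGIN
    FOR u IN SELECT DISTINCT user_id FROM user_embeddings LOOP
        -- Hypothetical naming scheme: one view per user, with dashes
        -- replaced so the name is easier to read.
        view_name := 'features_' || replace(u.user_id::text, '-', '_');

        -- The view filters to one user and exposes only the embedding,
        -- cast from vector to real[].
        EXECUTE format(
            'CREATE OR REPLACE VIEW %I AS
             SELECT embeddings::real[] AS embeddings
             FROM user_embeddings
             WHERE user_id = %L',
            view_name, u.user_id);

        -- One project per user keeps project_name explicit per model.
        PERFORM pgml.train(
            project_name  => 'clusters_' || u.user_id::text,
            task          => 'clustering',
            relation_name => view_name,
            algorithm     => 'kmeans',
            hyperparams   => '{"n_clusters": 5}'
        );
    END LOOP;
END
$$;
```

This still creates a view per user, so it automates the workaround rather than removing it.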

2.c) The high level gist is:

Given a user's catalog of liked songs and various metadata on those songs, we want to expose to them their unique clusters of musical tastes.
We don't have a great dataset to train a general clustering model on, and even if we did, the current intuition is that per-user models would generate more accurate clusters for them specifically.


GitTorres
Nov 8, 2024

Adding to @zzzint's point in 2.c: even if a single global model is generally more accurate at identifying proper clusters for his application, there will no doubt be cases where a user-specific model segments that user's data more meaningfully than the global model, possibly leading to a much better experience for that particular user. So I can understand the interest in exploring the effectiveness of user-specific models alongside a single global model.
