stratified random sampling #1290


Closed
ChuckHend wants to merge 0 commits into postgresml:master from ChuckHend:master

Conversation

ChuckHend
Contributor

Implements a stratified random sampling strategy and sets it as the new default for test_sampling.

montanalow reacted with eyes emoji
Contributor

@montanalow left a comment


Thanks for hammering this out!

ChuckHend reacted with rocket emoji
#[derive(PostgresEnum, Copy, Clone, Eq, PartialEq, Debug, Deserialize)]
#[allow(non_camel_case_types)]
pub enum Sampling {
    random,
    last,
    stratified_random,
Contributor

Thinking ahead to other stratification strategies that use columns other than the target y_column_name to guarantee you have true out-of-sample rows: e.g. user_id may be used to train with multiple instances from a particular user, but you want to ensure there is no data leakage where the model just memorizes user_ids rather than learning something more abstract, so user_id should be excluded as a feature but used for stratification.

I think that could work as an additional stratified_column_name parameter to train. In that case, though, the sampling wouldn't be stratified_random. So we'd need to add a different stratified type, or we could just call this stratified, and if you don't specify a column, it's random by the y_column_name. If you do specify stratification column(s), then those column(s) get removed from features and are used strictly for stratification.

This can happen in a follow-up PR; I'm just commenting so we get a forward-looking name on the sampling strategy.
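
A minimal sketch of one possible reading of the idea above: hold out whole groups (for example, all rows of a given user_id) so no group appears in both train and test, and drop the grouping column from the features. The names `split_by_group` and `drop_column` are hypothetical and not part of pgml, and the deterministic choice of held-out groups is an assumption made to keep the sketch free of external crates.

```rust
use std::collections::BTreeSet;

/// Split row indices so that every group (e.g. one user_id) lands entirely
/// in train or entirely in test. Hypothetical helper, not part of pgml.
fn split_by_group(group_ids: &[u32], test_size: f64) -> (Vec<usize>, Vec<usize>) {
    // Distinct groups in sorted order. Assumption: a real implementation
    // would shuffle the groups before choosing which ones to hold out.
    let groups: Vec<u32> = group_ids
        .iter()
        .copied()
        .collect::<BTreeSet<u32>>()
        .into_iter()
        .collect();

    // Number of whole groups to hold out, rounded to the nearest integer.
    let n_test = (groups.len() as f64 * test_size.clamp(0.0, 1.0)).round() as usize;
    let held_out: BTreeSet<u32> = groups[groups.len() - n_test..].iter().copied().collect();

    let mut train = Vec::new();
    let mut test = Vec::new();
    for (i, id) in group_ids.iter().enumerate() {
        if held_out.contains(id) {
            test.push(i);
        } else {
            train.push(i);
        }
    }
    (train, test)
}

/// Remove the grouping column from a feature row so the model cannot
/// memorize it. `column` is the position of that column in the row.
fn drop_column(row: &[f32], column: usize) -> Vec<f32> {
    row.iter()
        .enumerate()
        .filter(|(i, _)| *i != column)
        .map(|(_, v)| *v)
        .collect()
}
```

With group ids `[1, 1, 2, 2, 3, 3, 4, 4]` and a test size of 0.25, one of the four groups is held out in full, so the test rows share no group with the training rows.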

ChuckHend reacted with thumbs up emoji
ChuckHend reacted with eyes emoji
Contributor, Author

Good ideas/points. I changed it to just stratified. For this PR, there will be no ability to specify the columns to stratify by; stratified only uses y_column_name. I think adding an optional parameter in the future that changes it from y_column_name to something else should be a non-breaking change.
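
For readers unfamiliar with the technique, here is a minimal sketch of stratified random sampling keyed on the target column. It is not the PR's actual implementation: the function name `stratified_split`, the `i64` label type, and the small xorshift generator (used only to avoid external crates) are all assumptions for illustration.

```rust
use std::collections::BTreeMap;

/// Tiny xorshift64 generator so the sketch needs no external crates.
/// Assumption: real code would more likely use the `rand` crate.
struct XorShift64(u64);

impl XorShift64 {
    fn next(&mut self) -> u64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

/// Fisher-Yates shuffle driven by the generator above.
fn shuffle(items: &mut [usize], rng: &mut XorShift64) {
    for i in (1..items.len()).rev() {
        let j = (rng.next() % (i as u64 + 1)) as usize;
        items.swap(i, j);
    }
}

/// Returns (train_indices, test_indices). Each distinct label contributes
/// roughly `test_size` of its rows to the test set, so class proportions
/// are preserved in both splits.
fn stratified_split(labels: &[i64], test_size: f64, seed: u64) -> (Vec<usize>, Vec<usize>) {
    // Group row indices by the value of the target column.
    let mut by_class: BTreeMap<i64, Vec<usize>> = BTreeMap::new();
    for (i, &label) in labels.iter().enumerate() {
        by_class.entry(label).or_default().push(i);
    }

    let fraction = test_size.clamp(0.0, 1.0);
    let mut rng = XorShift64(seed.max(1)); // xorshift state must be non-zero
    let mut train = Vec::new();
    let mut test = Vec::new();
    for (_, mut rows) in by_class {
        // Shuffle within the class, then cut off the test portion.
        shuffle(&mut rows, &mut rng);
        let n_test = (rows.len() as f64 * fraction).round() as usize;
        test.extend_from_slice(&rows[..n_test]);
        train.extend_from_slice(&rows[n_test..]);
    }
    (train, test)
}
```

Given eight rows labeled 0 and four labeled 1 with a test size of 0.25, the test set receives two rows of class 0 and one of class 1, whichever rows the shuffle happens to pick.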

@montanalow
Contributor

This will also need a migration in sql/ to update the Postgres sampling enum.

ChuckHend reacted with thumbs up emoji

@ChuckHend
Contributor, Author

This will also need a migration in sql/ to update the Postgres sampling enum.

It looks like 2.8.2 hasn't gone out yet, so I put the migration in ./sql/pgml--2.8.1--2.8.2.sql. I can create a 2.8.3 migration if that would be preferred, though.

Comment on lines 28 to 36

-- src/orm/sampling.rs:6
-- pgml::orm::sampling::Sampling
DROP TYPE IF EXISTS pgml.Sampling;
CREATE TYPE pgml.Sampling AS ENUM (
    'random',
    'last',
    'stratified'
);
Contributor, Author

I still need to do some more testing to make sure this migration works as intended.

@montanalow , are there integration tests that assert migrations work? I didn't see any...

Contributor

@montanalow, Jan 19, 2024 (edited)

There aren't integration tests. I typically run test/test.sql to populate a database with the previous version, then alter extension pgml update. You'll want to alter type add value... here rather than dropping the enum, as that would only work on an empty database.

ChuckHend reacted with thumbs up emoji
ChuckHend reacted with eyes emoji
@ChuckHend marked this pull request as ready for review January 29, 2024 14:46
@ChuckHend mentioned this pull request Feb 29, 2024
Reviewers

@montanalow left review comments

2 participants
@ChuckHend, @montanalow
