
DOC Update plots in Categorical Feature Support in GBDT example#31062


Merged

betatim merged 13 commits into scikit-learn:main from ArturoAmorQ:change_plots

Jul 17, 2025


Conversation

Member

ArturoAmorQ commented Mar 24, 2025

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Inspired by this example's way of showing the test_score/fit_time trade-off, this PR replaces the bar plots with a scatter plot, which I find easier to read and interpret.

This PR also introduces a log scale for fit times.

Current plot:

This PR (most recent render):

sphx_glr_plot_gradient_boosting_categorical_001

Any other comments?

I took the opportunity to show the intermediate HTML diagrams and introduce a wording tweak.

ArturoAmorQ added 2 commits

March 24, 2025 16:17

DOC Update plots in plot_gradient_boosting_categorical

78409cb

Display pipelines and wording tweak

85a9ffb

github-actions bot added the Documentation label

Mar 24, 2025

github-actions bot commented Mar 24, 2025

✔️ Linting Passed

All linting checks passed. Your pull request is in excellent shape! ☀️

Generated for commit: 96b7c7c. Link to the linter CI: here

sylvaincom mentioned this pull request

Apr 23, 2025

feat(ComparisonReport): Suggestions of plots probabl-ai/skore#1340

Open

Member

ogrisel commented Apr 30, 2025

As discussed during the bi-weekly meeting, it would be great to make it explicit that the best models are in the bottom left corner, maybe using a matplotlib arrow annotation with the text "best models" pointing towards the point at coordinate (0, 0) (or (0.1, 0.1) because of the log scale on the x axis).

Also, could you please add more ticks on the x axis, e.g.: 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4.
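A rough sketch of how the arrow and the denser ticks could be done, not the code from this PR. The `results` mapping, the `dense_ticks` helper, and the axis labels are made up for illustration. The annotation is placed in axes-fraction coordinates so that it lands in the bottom left corner regardless of the log scale.

```python
def dense_ticks(start=0.1, stop=0.4, step=0.05):
    """Return evenly spaced tick locations from start to stop, inclusive."""
    n_steps = round((stop - start) / step)
    # Round away floating-point noise such as 0.15000000000000002.
    return [round(start + i * step, 10) for i in range(n_steps + 1)]


def plot_tradeoff(results):
    """Scatter test error against fit time on a log-scaled x axis.

    `results` is a hypothetical mapping: model name -> (fit_time, error).
    """
    # Imported lazily so dense_ticks stays usable without matplotlib.
    import matplotlib.pyplot as plt
    from matplotlib import ticker

    fig, ax = plt.subplots()
    for name, (fit_time, error) in results.items():
        ax.scatter(fit_time, error, label=name)

    # Set the scale first: changing the scale resets locators and formatters.
    ax.set_xscale("log")
    ax.set_xticks(dense_ticks())
    ax.xaxis.set_major_formatter(ticker.ScalarFormatter())
    ax.xaxis.set_minor_formatter(ticker.NullFormatter())

    # Arrow head at xy, text at xytext, both as fractions of the axes.
    ax.annotate(
        "best models",
        xy=(0.05, 0.05),
        xytext=(0.25, 0.25),
        xycoords="axes fraction",
        arrowprops=dict(arrowstyle="->"),
    )
    ax.set_xlabel("Mean fit time (s)")
    ax.set_ylabel("Mean test error")
    ax.legend()
    return fig, ax
```

Calling `plot_tradeoff({"Native": (0.12, 0.2), "One Hot": (0.35, 0.21)})` with such made-up numbers would draw two points, the explicit ticks, and the arrow.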

Also, please add the TargetEncoder to this plot.

ArturoAmorQ added 3 commits

May 2, 2025 15:07

Merge main

71a19b9

Add arrow and denser ticks as per Olivier's suggestion

07c9764

Avoid override of previous results

c9771b9

Member, Author

ArturoAmorQ commented May 2, 2025

Current result:

> Also, please add the TargetEncoder to this plot.

As that also requires adding narrative on the interpretation, I would rather leave that to a follow-up PR.

ogrisel approved these changes

May 5, 2025

View reviewed changes

Member

ogrisel left a comment

LGTM, thanks!

ArturoAmorQ requested a review from lucyleeow

May 14, 2025 09:55

ArturoAmorQ mentioned this pull request

Jul 9, 2025

Add dynamic placement of arrow in string encoders benchmark plot skrub-data/skrub#1495

Merged

ArturoAmorQ added 2 commits

July 10, 2025 12:01

Merge main

a201d7d

Simplify code

f210c2d

lucyleeow approved these changes

Jul 11, 2025

View reviewed changes

Member

lucyleeow left a comment

Sorry this took me so long to get to!
This is a much nicer graph, thank you!

I couldn't comment on the line but maybe we could add the name of the category handling strategy in the bullet list at the top, i.e.:

- "Dropped": dropping the categorical features
- "One Hot": using a :class:`~preprocessing.OneHotEncoder`
- "Ordinal": using an :class:`~preprocessing.OrdinalEncoder` and treat categories as
  ordered, equidistant quantities
- "Native": using an :class:`~preprocessing.OrdinalEncoder` and rely on the :ref:`native
  category support <categorical_support_gbdt>` of the
  :class:`~ensemble.HistGradientBoostingRegressor` estimator.

Do you think it's worth mentioning somewhere on the graph that the error bars are 1 standard deviation or is it obvious what the error bars are?
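For reference, a minimal sketch of what "error bars of 1 standard deviation" would mean here, assuming each model yields one error value per cross-validation fold. The `summarize` helper and the numbers are invented for illustration. The population standard deviation is used, which matches the default of `numpy.std`; the example itself may compute it differently.

```python
import statistics


def summarize(values):
    """Return (mean, std) of per-fold values, using the population std."""
    return statistics.fmean(values), statistics.pstdev(values)


# Made-up per-fold test errors for a single model.
fold_errors = [0.10, 0.12, 0.14]
mean_error, std_error = summarize(fold_errors)

# The point would be drawn at mean_error, with an error bar of
# length std_error on each side (e.g. via ax.errorbar(..., yerr=std_error)).
label = "Native (error bars: 1 std)"
```

Spelling this out in the legend or the axis label, as in the hypothetical `label` above, is one way to make the meaning of the bars explicit.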

I saw the comment on adding TargetEncoder but maybe that could be left to another PR...?

It's nice that we make it clear to the user where the best models are (with the arrow and 'best models' text), but it did confuse me at first:

- I expect it to be pointing to something, e.g., a scatter point. We could add a circle that it points to, but this is not necessarily clear either.
- The graph axes do not start at (0, 0), so it's possibly misleading?

I can't think how to improve this. Maybe it is clear enough if people read the axis labels, that smaller error/faster is better? Maybe we could add to the x and y axis labels, e.g., an arrow ("<- faster fitting" and "↓ better model/lower error"). Or just explaining it in the text may be enough?

Edit: Maybe we could just have the text 'Best models' in the bottom left corner and no arrow?

examples/ensemble/plot_gradient_boosting_categorical.py Outdated

examples/ensemble/plot_gradient_boosting_categorical.py Outdated



		class CustomLogFormatter(ticker.Formatter):
		    def __call__(self, x, pos=None):

Member

lucyleeow Jul 11, 2025

Maybe a docstring here?

Member

lucyleeow Jul 11, 2025

Also I'm surprised it's so hard to format superscript in mpl?!

Member

lucyleeow Jul 11, 2025

Maybe scientific notation is good enough? e.g., "%1.1e"?

Member, Author

ArturoAmorQ Jul 16, 2025

You are right, that simplifies the code a lot.
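To illustrate the suggestion, here is a small sketch of what the "%1.1e" format produces. The `sci_label` name is invented; its `(x, pos)` signature follows the convention matplotlib uses for tick formatter callables.

```python
def sci_label(x, pos=None):
    """Format a tick value in scientific notation with one decimal."""
    return "%1.1e" % x


print(sci_label(0.05))  # 5.0e-02
print(sci_label(1234))  # 1.2e+03

# With matplotlib, the same format string can be passed directly:
#   ax.xaxis.set_major_formatter(ticker.FormatStrFormatter("%1.1e"))
```

This removes the need for a custom `ticker.Formatter` subclass, at the cost of labels like `5.0e-02` instead of a typeset power of ten.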

examples/ensemble/plot_gradient_boosting_categorical.py


		def plot_performance_tradeoff(results, title):
		    fig, ax = plt.subplots()
		    markers = ["s", "o", "^", "x"]

Member

lucyleeow Jul 11, 2025

Nitpick: in the last graph, it is a bit tricky to quickly tell apart the native and ordinal markers/error bars.
Would it be 'easy' to change the colour of the marker to be the same colour as the scatter points?

Member, Author

ArturoAmorQ Jul 16, 2025

I think that risks leading the reader to think that the marker is an actual scatter point.

Member

lucyleeow Jul 17, 2025

That's a good point.

Stupid question, for 'Ordinal', why is the black error bar (?) marker not centered between the horizontal and vertical error bars?

Member

lucyleeow Jul 17, 2025

Never mind, I see that it is centered in the newly rendered doc; for some reason the horizontal error bar was not rendering properly in the first one above.

examples/ensemble/plot_gradient_boosting_categorical.py Outdated

		coeff_str = f"{coeff:.1f}x"

		# Format exponent using Unicode superscripts
		superscripts = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")

Member

lucyleeow Jul 11, 2025

Would math notation work? e.g., "$10^{%d}$"
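A sketch comparing the two approaches in this thread: the Unicode superscript translation from the snippet above, and a mathtext string that matplotlib would typeset. The helper names and the coefficient/exponent split are assumptions, since the full formatter is not shown here, and the split assumes a strictly positive value.

```python
import math

# Translation table taken from the snippet under review.
SUPERSCRIPTS = str.maketrans("-0123456789", "⁻⁰¹²³⁴⁵⁶⁷⁸⁹")


def split_sci(x):
    """Split a positive x into (coefficient, exponent) with x = coeff * 10**exp."""
    exponent = math.floor(math.log10(x))
    return x / 10**exponent, exponent


def unicode_label(x):
    """Label using Unicode superscript characters for the exponent."""
    coeff, exponent = split_sci(x)
    return f"{coeff:.1f}x10{str(exponent).translate(SUPERSCRIPTS)}"


def mathtext_label(x):
    """Label using matplotlib mathtext, which renders the exponent itself."""
    coeff, exponent = split_sci(x)
    return r"$%.1f \times 10^{%d}$" % (coeff, exponent)


print(unicode_label(0.05))   # 5.0x10⁻²
print(mathtext_label(0.05))  # $5.0 \times 10^{-2}$
```

The mathtext variant avoids the translation table entirely, because matplotlib interprets text between dollar signs as math and renders the superscript.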

ArturoAmorQ and others added 4 commits

July 16, 2025 11:52

Update examples/ensemble/plot_gradient_boosting_categorical.py

1458d42

Co-authored-by: Lucy Liu <jliu176@gmail.com>

Address Lucy's comments

10b4a04

Apply suggestions from code review

4d48f5d

Mention that error bars correspond to 1 std

23be2cc

Member, Author

ArturoAmorQ commented Jul 16, 2025

> I saw the #31062 (comment) on adding TargetEncoder but maybe that could be left to another PR...?

Yes. I prefer reducing the scope of this PR.

Merge branch 'main' into change_plots

272ea7f

lucyleeow approved these changes

Jul 17, 2025

View reviewed changes

Member

lucyleeow left a comment

LGTM thanks!

Sorry, I meant that the words 'faster fitting'/'lower error' could be added to the x and y axis labels.
They also make sense here, but maybe they are too long inside the graph and 'Best models' was better?