Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork2.2k
Refactor and optimize schema cleaning logic#11244
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
cloudflare-workers-and-pagesbot commentedJan 9, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Deploying pydantic-docs with |
Latest commit: | 76b7109 |
Status: | ✅ Deploy successful! |
Preview URL: | https://2c6bd658.pydantic-docs.pages.dev |
Branch Preview URL: | https://ref-schema-walking.pydantic-docs.pages.dev |
codspeed-hqbot commentedJan 9, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
CodSpeed Performance ReportMerging#11244 willimprove performances by 33.86%Comparing Summary
Benchmarks breakdown
|
48dc1a7
to818269d
CompareI think we can mark this as closing#10655 |
As in, in a new commit on this PR, you're just refraining for now? |
A few questions:
I think it would be really good to loop in@adriangb here - he worked a lot with refs/defs in core schema gen, or so the blame says 😉. |
sydney-runkle commentedJan 22, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Alright, a few things before I dive into an in depth review:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Generally, seems promising. I've already voiced some concerns in independent comments that we can chat about.
I think it'd be helpful to see the deletion of those other methods for comparison purposes on this PR 👍
'$defs': {'MySeq_int_': {'items': {'type': 'integer'}, 'type': 'array'}}, | ||
'properties': {'my_int_seq': {'$ref': '#/$defs/MySeq_int_'}}, | ||
'$defs': {'MyIntSeq': {'items': {'type': 'integer'}, 'type': 'array'}}, | ||
'properties': {'my_int_seq': {'$ref': '#/$defs/MyIntSeq'}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
A note about this change, discussed in detail previously:#10655 (comment)
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
It it the same
If you have more that 1 reference to a) It does affect JSON Schema gen
Only failing tests are because of the match statement that I need to change.
yes sorry the PR description isn't complete yet. |
The issue is not just verbose schemas but also duplicate SchemaValidator's. |
I'll note that this is only an issue with the following setup: classStandaloneModel(BaseModel): ...classM1(BaseModel):m:StandaloneModel# If StandaloneModel is referenced another time,# M1's core schema is transformed in a `'definitions'` schema,# and we don't have any duplication of StandaloneModel's cs.classM2(BaseModel):m:StandaloneModel# Same note as M1...classOuter(BaseModel):m:StandaloneModel# Same note as M1 and M2m1:M1# If M1 is referenced another time,# Outer's core schema is transformed in a `'definitions'` schema,# and we don't have any duplication of M1's csm2:M2# Same ...mx:MX Which is relatively uncommon. Boxy's work will help here as well in reusing validator instances. JSON Schema changes is still an issue though (still valid, but the structure changes and this generally create churn in user code), so I'm looking into a way to have the previous behavior). |
b258a3e
tofe0db49
CompareThanks for the clarification@Viicos. I'm less concerned now about this defs/refs simplification now that I understand it's semi-limited. That being said, I do think think we might want to revisit an extra pass here in the future if/when we can get significant memory usage improvements with Boxy's work (cc@davidhewitt). Happy to take another pass over this - could you remove the existing walk / discriminator logic so I can compare? Thanks for all of your work here and the detailed explanations of some complex patterns. |
dbced49
to0321015
Comparegithub-actionsbot commentedJan 24, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Coverage reportClick to see where and how coverage changed
This report was generated bypython-coverage-comment-action |
return schema | ||
def _can_be_inlined(def_ref: core_schema.DefinitionReferenceSchema) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Difference from MS PR: instead of checking for specific keys in the metadata schema, we just check that there are no metadata at all.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Why can't we inline if there's metadata, ex json schema related metadata?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
fromtypingimportAnnotatedfrompydanticimportBaseModel,WithJsonSchematypeTest=intclassModel(BaseModel):f:Annotated[Test,WithJsonSchema({})]Model.__pydantic_core_schema__{│'type':'definitions',│'schema': {│ │'type':'model',│ │'cls':<class'__main__.Model'>,│ │'schema': {│ │ │'type':'model-fields',│ │ │'fields': {│ │ │ │'t1': {'type':'model-field','schema': {'type':'int'},'metadata': {}},│ │ │ │'t2': {│ │ │ │ │'type':'model-field',│ │ │ │ │'schema': {'type':'definition-ref','schema_ref':'__main__.Test:124259184101792','metadata': {'<stripped>'}},│ │ │ │ │'metadata': {}│ │ │ │ }│ │ │ },│ │ │'model_name':'Model',│ │ │'computed_fields': []│ │ },│ │'config': {'title':'Model'},│ │'ref':'__main__.Model:107106664271472',│ │'metadata': {'<stripped>'}│ },│'definitions': [{'type':'int','ref':'__main__.Test:124259184101792'}]}
If you inline the definition ref, you loose the JSON Schema metadata. It could still be inlined and the metadata moved to the referenced schema, but you'll need to make a copy of it and merge metadata somehow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you add a note about this to the code?
@pytest.mark.parametrize('deep_ref', [False,True]) | ||
@pytest.mark.xfail( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Difference from MS PR due to the type refs changes (specifically the GPCS method change). The generated schemas are still valid, but in this case the output is different:
On this branch, the core schema for M1 is:
{│'type':'definitions',│'schema': {│ │'type':'model',│ │'cls':<class'__main__.M1'>,│ │'schema': {│ │ │'type':'model-fields',│ │ │'fields': {'a': {'type':'model-field','schema': {'type':'definition-ref','schema_ref':'__main__.M2:101691608428272'},'metadata': {}}},│ │ │'model_name':'M1',│ │ │'computed_fields': []│ │ },│ │'config': {'title':'M1'},│ │'ref':'__main__.M1:101691610240000',│ │'metadata': {'<stripped>'}│ },│'definitions': [│ │ {│ │ │'type':'model',│ │ │'cls':<class'__main__.M2'>,│ │ │'schema': {│ │ │ │'type':'model-fields',│ │ │ │'fields': {│ │ │ │ │'b': {│ │ │ │ │ │'type':'model-field',│ │ │ │ │ │'schema': {│ │ │ │ │ │ │'type':'model',│ │ │ │ │ │ │'cls':<class'__main__.M1'>,│ │ │ │ │ │ │'schema': {│ │ │ │ │ │ │ │'type':'model-fields',│ │ │ │ │ │ │ │'fields': {│ │ │ │ │ │ │ │ │'a': {│ │ │ │ │ │ │ │ │ │'type':'model-field',│ │ │ │ │ │ │ │ │ │'schema': {'type':'definition-ref','schema_ref':'__main__.M2:101691608428272'},│ │ │ │ │ │ │ │ │ │'metadata': {}│ │ │ │ │ │ │ │ │ }│ │ │ │ │ │ │ │ },│ │ │ │ │ │ │ │'model_name':'M1',│ │ │ │ │ │ │ │'computed_fields': []│ │ │ │ │ │ │ },│ │ │ │ │ │ │'config': {'title':'M1'},│ │ │ │ │ │ │'ref':'__main__.M1:101691610240000',│ │ │ │ │ │ │'metadata': {'<stripped>'}│ │ │ │ │ │ },│ │ │ │ │ │'metadata': {}│ │ │ │ │ }│ │ │ │ },│ │ │ │'model_name':'M2',│ │ │ │'computed_fields': []│ │ │ },│ │ │'config': {'title':'M2'},│ │ │'ref':'__main__.M2:101691608428272',│ │ │'metadata': {'<stripped>'}│ │ }│ ]}
As we can see, the core schema forM1
appears twice and could be inlined.
c1fffce
toe093879
CompareThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
In general, I feel like_generate_schema.py
could use a lot more documentation, given that schema gen + ref/def simplification is complicated by nature.
This constitutes a really significant improvement though, great work.
I like the new typed nature of the schema gathering results, etc. Easier to follow that logic as well! I'd like to take one more pass over_schema_gather.py
, but generally, really happy with this improvement.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
return schema | ||
def _can_be_inlined(def_ref: core_schema.DefinitionReferenceSchema) -> bool: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Why can't we inline if there's metadata, ex json schema related metadata?
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
d74190f
to7340799
CompareThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Looking really good. Happy to approve pending consideration of my final nit picks :)
Uh oh!
There was an error while loading.Please reload this page.
Uh oh!
There was an error while loading.Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Great work@Viicos, perf wins like this take a lot of deep thought and work.
@MarkusSintonen, thanks for the ideas and blueprint here! We're really excited about this perf boost.
8fe3aae
intomainUh oh!
There was an error while loading.Please reload this page.
MarkusSintonen commentedJan 30, 2025 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
Thanks@sydney-runkle /@Viicos! Why the generic Generic models construction is the most painful issue as that has been so slow. |
It's not entirely clear to me why this benchmark had better results than the other ones. I compared both flamegraphs on this benchmark, on this PR and yours, and couldn't find any significant time difference being spent in schema gathering (as that's the only significant difference between the two PRs -- this one is implemented in Python and so is a bit slower, however this isn't significant). Note that since then, we merged other performance improvements so the percentages aren't really comparable. |
Uh oh!
There was an error while loading.Please reload this page.
Change Summary
Closes#10655,fixes#10285
This is slightly identical to#10655, with the difference that:
pydantic-core
. Out of 3-4s, the rust implementation takes 20ms, while the Python one takes 230ms. I don't know yet if we should keep it in Python (it has advantages: type checking, easier debugging).Concerns (1 so far):
Consequence of the new schema traversing logic: we don't traverse schemas that have been unpacked from another model. With the following:
The core schema of M1 is:
M1 core schema
As you can see,
Model
is inlined inside the core schema for fieldm
.Now if you do:
StandaloneModel
is used twice inM2
(inm
andM1.m
). However, the inlined core schema forM1.StandaloneModel
is not transformed back into a'definition-ref'
schema (and doing so would require a copy of the schema to avoid mutating the core schema ofM1
fromM2
!):M2 core schema
On
main
,StandaloneModel
is properly stored in definitions, and'definition-ref'
schemas are used:M2 core schema on main
Both approaches are valid, but it might be that this cause unexpected issues (JSON Schema differences, etc). Regarding memory consumption, it is slightly affected. Taking the
k8s_v2.py
file as an example (it is a good one because the "issue" described here happens several times in it), we don't see any increase/decrease in %RAM consumption, although profile over time differs:main