Evolving Pydantic: BaseStruct #10032

@samuelcolvin


(Edited by @davidhewitt to show the current roadmap; see the original post by Samuel in the expandable section at the bottom of this post.)

This is a feature plan for the next step in the `pydantic` API, which we'd like to call `BaseStruct`.

The high-level goal is to take some of the lessons we've learned from `BaseModel` and present a new type which should achieve higher performance, more features and better default semantics. `BaseModel` may eventually gain some of these changes itself, but we would prefer to be extremely conservative with changing `BaseModel`, and instead introduce a new type so that users can opt in to the changed semantics.

We will merge `BaseStruct` into the `pydantic.experimental.structs` namespace while this type is being developed, possibly as soon as Pydantic 2.13. We anticipate that Pydantic V3 may be the milestone where `BaseStruct` is stabilised, although this is not strictly required.

Rough outline of the planned semantics of `BaseStruct`:

  • Performance
    • `BaseStruct` is planned to keep more data as (internal?) native state rather than Python objects
    • Field accessors will probably be cached, so that once a Python value is accessed it will have normal Python semantics
    • `BaseStruct` will validate by iterating over the input data, rather than iterating over the expected fields and using attribute lookup (see notes in Samuel's original post for details/motivation)
  • Ergonomics
    • `BaseStruct` will have no methods (`model_validate`, `model_dump`). Instead we will expose free functions such as `validate`, `validate_json`, `to_python` and `to_json`
    • There will likely be a `StructMethodsMixin` type to add methods such as `struct_validate` where desired
    • We might try to do something different with generic structs so that we don't create new subclasses when doing generic parameterisation, and instead make "regular" generic aliases
  • Features
    • We would like to eventually offer compatibility between struct types and dataframes, particularly arrow "struct arrays". The exact form of this is yet to be explored.
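The free-function style described above can be sketched in plain Python. The names `validate`, `validate_json`, `to_python` and `to_json` come from the roadmap; the dataclass-based stub implementation is purely illustrative (real `BaseStruct` validation would happen in Rust):

```python
import json
from dataclasses import dataclass, fields

@dataclass
class User:
    """Stand-in for a future BaseStruct subclass."""
    id: int
    name: str

def validate(cls, data: dict):
    # real BaseStruct validation would type-check and coerce in Rust
    return cls(**{f.name: data[f.name] for f in fields(cls)})

def validate_json(cls, raw: str):
    return validate(cls, json.loads(raw))

def to_python(obj) -> dict:
    return {f.name: getattr(obj, f.name) for f in fields(obj)}

def to_json(obj) -> str:
    return json.dumps(to_python(obj))

user = validate_json(User, '{"id": 1, "name": "Samuel"}')
```

The point of the free-function shape is that the struct type itself stays free of method-name collisions with user fields.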
Original post by Samuel

This is a write-up of a plan I've had for a long time to improve the performance of Pydantic/pydantic-core by another big step, hopefully up to 5x-10x in some cases.

Summary

The core idea is to significantly improve the performance of pydantic model validation by:

  1. Keeping values as Rust types until they need to be materialized as Python objects, e.g. when they're accessed as attributes or the developer calls `.model_dump()`. In the case where you're doing `m = MyModel.model_validate_json(input_data); m.model_dump_json()`, the Python objects are never created: Python is just an orchestration language for the Rust logic
  2. Since we aren't building a Python dict, we have the option to validate the input data using a faster algorithm
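A pure-Python toy of point 1 (all names here are hypothetical, and the real raw state would be a Rust value, not a dict): attribute values are pulled out of the raw state only on first access, then cached so later accesses behave like normal Python attributes:

```python
class LazyModel:
    """Toy stand-in: raw parsed data is materialized per attribute, on demand."""

    def __init__(self, raw: dict):
        self._raw = raw      # in the real design: native Rust state
        self._cache = {}     # materialized Python values

    def __getattr__(self, item):
        # only called when normal attribute lookup fails, i.e. for field names
        cache = self.__dict__['_cache']
        if item not in cache:
            raw = self.__dict__['_raw']
            if item not in raw:
                raise AttributeError(item)
            cache[item] = raw[item]  # first access materializes the value
        return cache[item]

m = LazyModel({'id': 1, 'name': 'Samuel'})
```

If no attribute is ever touched, `_cache` stays empty, which is the "Python objects are never created" case from point 1.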

The idea would be to introduce this as `RustModel` in the `.experimental` namespace in a minor release of V2.

If it:

  1. Proves to be significantly faster than the existing `BaseModel` implementation
  2. Does not cause significant compatibility issues with `BaseModel` (or those can all be solved)

Then we might replace `BaseModel` with the `RustModel` implementation in V3. Again: only if we can be very confident it won't break things.


Details

At the heart of Pydantic validation (presumably of any validation of a dict/mapping object) lies a choice between the following two algorithms:

1. iterate over expected fields, getting values from the input data

This is what both Pydantic V1 and V2 do (using Python for brevity; in reality this logic is implemented in Rust in `pydantic-core`). It looks something like this:

```python
def validate(model: ModelValidator, input_data: dict[str, Any]) -> BaseModel:
    output_data = {}
    errors = []
    for field in model.fields:
        try:
            input_value = input_data[field.alias]
        except KeyError:
            errors.append(MissingError())
            continue
        try:
            output_data[field.name] = field.validate(input_value)
        except ValidationError as e:
            errors.extend(e.errors())
    if model.extra in ('forbid', 'allow'):
        for key in input_data.keys():
            if key not in model.fields:
                if model.extra == 'forbid':
                    errors.append(ExtraFieldError(key))
                else:
                    output_data[key] = input_data[key]
    if errors:
        raise ValidationError(errors)
    else:
        return ModelThing.create(output_data)
```

Advantages:

  • conceptually simple
  • can easily be extended to support alias choices or even alias paths (as V2 supports)
  • if there are lots of extra keys in `input_data` which you want to ignore, they don't cost you anything extra
  • `output_data` order matches `model.fields` order without any extra work

Disadvantages:

  • you need to look up `input_data` for each field of the model, which is somewhat slower than iterating over `input_data` (provided `input_data` is roughly the right size, which is very common)
  • `input_data` must be a mapping, e.g. you can't process a generator or equivalent. This is particularly problematic in `pydantic-core`, where we'd like to be able to validate JSON as we parse it, without the need to allocate an intermediate `HashMap`
  • if you have `extra='forbid'` or `extra='allow'` you need to iterate over `input_data` after iterating over the fields

or,

2. iterate over input data, and look up fields

Using Python for brevity, it looks something like this:

```python
def validate(model: ModelValidator, input_data: Iterable[tuple[Any, Any]]) -> BaseModel:
    output_data = {}
    errors = []
    for key, value in input_data:
        try:
            field = model.fields[key]
        except KeyError:
            if model.extra == 'forbid':
                errors.append(ExtraFieldError())
            elif model.extra == 'allow':
                output_data[key] = value
            continue
        try:
            output_data[field.name] = field.validate(value)
        except ValidationError as e:
            errors.extend(e.errors())
    if len(output_data) < len(model.fields):
        for field in model.fields:
            if field.name not in output_data:
                errors.append(MissingError())
    if errors:
        raise ValidationError(errors)
    else:
        output_data = reorder_output_to_match_fields(model, output_data)
        return ModelThing.create(output_data)
```

The advantages/disadvantages of this are really just the reverse of the above.


The point is that this approach should be faster in most cases.
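The mapping-vs-iterable distinction between the two approaches can be made concrete with a toy stream of key/value pairs (illustrative names only): approach 1 needs random access, so a streaming source must first be materialized into a dict, while approach 2 can consume the pairs one at a time:

```python
def stream_pairs():
    # models a streaming JSON parser yielding key/value pairs
    yield ('id', 1)
    yield ('name', 'Samuel')

field_order = ['id', 'name']

# approach 1: per-field lookup requires a mapping, so the stream must be
# materialized into an intermediate dict first
data = dict(stream_pairs())
values_by_field = [data[f] for f in field_order]

# approach 2: the stream can be consumed directly, pair by pair
consumed = list(stream_pairs())
```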

The major problem with this approach in a Python world (the step referred to as `reorder_output_to_match_fields` in the code above) is that it's very cheap to build a dict in Python whose order matches the build order, but prohibitively slow to "reorder" the dict (actually, to build a new dict) to match some other desired order. (We want the order of data in `.model_dump()` or `.model_dump_json()` to match how the fields are defined on the model.)
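For illustration, a plain-Python version of the hypothetical reordering helper named in the pseudocode shows why this step amounts to building a whole new dict:

```python
def reorder_output_to_match_fields(field_order: list, output_data: dict) -> dict:
    # Python dicts preserve insertion order, so matching the field-definition
    # order means re-inserting every key into a brand-new dict
    ordered = {name: output_data[name] for name in field_order if name in output_data}
    # keep any extra='allow' keys after the declared fields
    ordered.update({k: v for k, v in output_data.items() if k not in ordered})
    return ordered

out = reorder_output_to_match_fields(['id', 'name'], {'name': 'Samuel', 'id': 1, 'extra': True})
```

Every key and value is touched a second time, which is the overhead the text calls prohibitive.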

This is the reason we haven't been able to use the second approach in V2:

  • in the end, `pydantic-core` creates a new model, then sets the `.__dict__` attribute to `output_data`
  • it's a hard requirement of Python that `.__dict__` is a vanilla dict
  • the performance overhead of reordering the output data is too high
  • so we're stuck with approach 1

`RustModel` implementation

The core idea here is that we have a Rust struct exposed as a Python class (thanks, PyO3) which is used to store the validated data.

We never have to create the `output_data` Python dict, and don't need to create the Python objects until they're needed (the attribute is accessed or `model_dump` is called), if at all.

The skeleton of `RustModel` would look something like:

```python
class RustModel:
    __slots__ = ('__pydantic_raw_data__',)
    __pydantic_raw_data__: InternalData

    def __getattr__(self, item):
        return self.__pydantic_raw_data__.get(item)

    def model_dump(self):
        return self.__pydantic_raw_data__.model_dump()

    def model_dump_json(self):
        return self.__pydantic_raw_data__.model_dump_json()
```

`InternalData` is the Python class exported from Rust, which would look something like this (again, lots of detail and nuance omitted):

```rust
#[pyclass]
struct InternalData {
    data: Vec<Option<FieldData>>,
    key_lookup: Arc<HashMap<String, usize>>,
}

impl InternalData {
    fn new_empty(key_lookup: Arc<HashMap<String, usize>>) -> Self {
        Self {
            data: vec![None; key_lookup.len()],
            key_lookup,
        }
    }

    fn set(&mut self, index: usize, value: FieldData) {
        self.data[index] = Some(value);
    }

    fn finish(&self) -> Vec<Error> {
        // raise a missing error for any slot that was never filled
        self.data
            .iter()
            .filter_map(|v| v.is_none().then(|| Error::MissingField))
            .collect()
    }
}

#[pymethods]
impl InternalData {
    fn get(&self, py: Python, key: String) -> PyResult<PyObject> {
        if let Some(index) = self.key_lookup.get(&key) {
            // detail omitted: a None slot would need its own error handling
            let f = self.data[*index].as_ref().unwrap();
            Ok(f.get_python_value(py))
        } else {
            Err(PyKeyError::new_err(key))
        }
    }

    fn model_dump(&self, py: Python) -> PyObject {
        let dict = PyDict::new(py);
        for f in self.data.iter().flatten() {
            dict.set_item(&f.key, f.get_python_value(py)).unwrap();
        }
        dict.into()
    }

    fn model_dump_json(&self) -> String {
        let mut json_builder = JsonBuilder::new();
        for f in self.data.iter().flatten() {
            json_builder.insert(&f.key, f.get_json_value());
        }
        json_builder.finish()
    }
}
```
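A Python mock of the same slot-vector layout (names illustrative) shows why output order comes for free: the shared name-to-index map fixes the slot order at model-build time, so validation can fill slots in whatever order the input arrives:

```python
class InternalDataMock:
    """Python stand-in for the Rust InternalData: fixed slots + shared index map."""

    def __init__(self, key_lookup: dict):
        self.key_lookup = key_lookup          # shared per model, like Arc<HashMap>
        self.data = [None] * len(key_lookup)  # like Vec<Option<FieldData>>

    def set(self, key: str, value):
        self.data[self.key_lookup[key]] = value

    def missing(self) -> list:
        # names of slots that were never filled, in definition order
        return [name for name, i in self.key_lookup.items() if self.data[i] is None]

    def model_dump(self) -> dict:
        # iterating the lookup map yields field-definition order: no reordering
        return {name: self.data[i] for name, i in self.key_lookup.items()}

d = InternalDataMock({'id': 0, 'name': 1})
d.set('name', 'Samuel')   # input may arrive in any order
d.set('id', 1)
```

This is the key contrast with the dict-based approach 2 above: definition order is a property of the layout, not something recovered by rebuilding the output.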

I've omitted a bunch of details here, but I think this is a powerful enough concept to be worth working on.
