Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

Improve prefetching of relationships#921

SafaAlfulaij started this conversation inIdeas
Apr 17, 2021· 4 comments· 18 replies
Discussion options

Refs:#337,#600

Currently, to improve performance one can specifyselect_for_includes andprefetch_for_includes in viewsets. This ensures that the required fields are loaded in the minimum amount of queries when?includeed.

I haven't seen anything talking about how to reuse defined fields in other views.
Example:
CommentViewSet has these attributes defined:

select_for_includes= {"writer": ["author__bio"]}prefetch_for_includes= {"__all__": [],"author": ["author__bio","author__entries"],    }

EntrySerializer allows one to includecomments, but since we haven't defined something like the above forEntryViewSet, nothing will be pre-fetched.
We can go and define it like so:

prefetch_for_includes= {"__all__": [],"comments.writer": ["comments__author__bio","comments__author__entries"],"comments.author": ["comments__author__bio","comments__author__entries"],    }

As we see, we kind of repeat ourselves.
We also should note thatwriter was aselect_related field, which is nowprefetch_related field.

Actual discussion

Now since serializers themselves are the ones defining all the includes, they should be the ones defining what to prefetch and what to not, and how, without having to repeat the same fields many times across different viewsets
It also should be smart enough to understand that a field can be selected in a viewset, and prefetch in another viewset, depending on the "parent" field.

You must be logged in to vote

Replies: 4 comments 18 replies

Comment options

And while we are at this, we need to think how to support prefetching related fields inSerializerMethodResourceRelatedField results.
Currently,http://127.0.0.1:8000/entries?include=suggested.authors will add 9 more queries to fetch the authors.
What comes to mind is passing extra kwargs toget_<field_name> letting the user handle the rest if needed.

You must be logged in to vote
0 replies
Comment options

I have run into this issue as well that definingprefetch_for_includes is a bit of a head ache. Main issue here is though thatAutoPrefetchMixin is not smart enough.AutoPrefetchMixin should be able to automatically prefetch the relationships of an included resource and mostprefetch_for_includes configuration would be obsolete (at least the once I have done in the past... 😄).

It is an interesting idea though to move the prefetching configuration to the serializer. A proof of concept would be needed also in terms of what API would fit. I would see above issue improving ofAutoPrefetchMixin the higher priority though as it is a quick win for all users without a need for configuration.

You must be logged in to vote
5 replies
@SafaAlfulaij
Comment options

The problem I found withAutoPrefetchMixin is that it's completely automatic, with no easy way (but to rewrite it) to fine-tune it.
Many times I need to usePrefetch to get a subset of related items (based on user permissions, etc), whichAutoPrefetchMixin doesn't offer.
I don't think that the use cases where you'll prefetchevery related entity is more than when you need to customize it and limit it more.
Most simple ones are not showing hidden/deleted/removed/inactive/flagged-for-review entities, and the list goes on.

@sliverc
Comment options

I agree automatic will never cover all use cases but the most common once it should. And for the rest manually configuring prefetching needs to be possible.

You are talking about having a subset of related items - where do you usually do this? By filtering the queryset of aResourceRelatedField or somewhere else?

@SafaAlfulaij
Comment options

The queryset ofResourceRelatedField is afaik for input validation. I do it inprefetch_for_includes, where I modifiedPreloadIncludesMixin to accept a callable (passing the request to it), to filter against user, etc.

@sliverc
Comment options

I was a bit too quick when writing 😉 but of coursequeryset onResourceRelatedField is about input validation. I would rather do this by overwriting theget_queryset on the view.

I do see your problem though when it comes to nested includes and multiple definition of thePrefetch configuration on different views. This brings the question whetherincluded_serializers is actually the correct way of defining includes and whether it should not be configured on the view asincluded_views. Idea behind this is that a include should be serialized exactly the same then when the specific end point is called. And theget_queryset is important in this as it defines what data can actually been seen.

I haven't thought this through but just came to my mind when thinking about your issue because serializers are about serializing objects and the view about what data should actually be serialized.

@sliverc
Comment options

On second thought I do not think prefetching etc. would work with included views and basicallyincluded_serializers is like nested serializers in DRF (but simply optional by theinclude parameter).
I do think thoughrelated_serializers feature needs to be thought through and should rather use views instead of serializers directly (but I guess this is rather the topic of discussion#916)

I think the limitation with usingincluded_serializers (as in fact the same in DRF with nested serialzers) though is that the permission of the parent object apply. So if a relationship has different permissions than main object the included feature should not be activated for this relationship.

Comment options

My two cents: I started playing with this a while back and also came to the conclusion that all the prefetching logic needs to be defined on the serializer, else includes are either left out in the dark, or require a tonne of repetition to ensure they're supported across all views.

My approach would've been to define aSparseFieldsMeta inner class on the serializer which'd be responsible for:

  • Annotations -- Performing particular annotations on the queryset when particular fields are requested. (eg something like.annotate_with_totals() whentotal is in the sparsefields request). And of course conversely avoiding calling the potentially expensive annotation when the field is not needed.
  • Permissioning -- refusing to return certain fields under certain conditions. For this package I think it'd be nice to support permission-based field control out of the box, but leave space for extensibility, so end users can customise this further to check for arbitrary feature flags etc.
  • On-demand-only -- Perhaps breaking scope of this discussion completely, however currently when no sparsefields are defined, ALL fields are returned. We've found it useful in our own project to roll with a concept called "on_demand_only" fields, which will not come through be default (because they're eg usually irrelevant, or expensive to compute), and must be explicitly requested through sparsefields. The reason I mention it here is that such a a thing would also fit quite well into thisSparseFieldsMeta concept.
  • Prefetching/Selectrelating -- and of course the meat of this whole discussion. This needs to be defined in a way that allows for recursive construction ofPrefectch objects, as well as appropriate attachment ofselect_related andprefetch_related calls. It needs to be intelligent -- a relationship requested through sparsefields isn't enough to trigger it; the include is needed too. Further, all of the other points above (eg qs annotations) would need to be packed into these Prefetch objects too, since we're looking at a nested qs that might be built of out multiple layers.

Here's an example of what such a thing would look like:

MySerializer:    class SparseFieldMeta:        # Example of a permission-based field. LHS defines req'd perm, RHS is        # a list of fields protected by that perm.        permission_required = {            'accounts.view_rate': ['rate'],        }        # Example of annotations & on-demand-only. LHS defines annotation        # that will get called when any of the fields on the RHS are requested.        # Fields on RHS will not come through by default, only when explicitly        # listed in the sparsefields list.        on_demand_only = {            'all': ['__str__', 'contractor_set', 'client_set', 'clientgroup_set'],            'annotate_with_usage': ['usage_percentage'],        }        # Example of prefetching logic -- if any of these are listed in the includes,        # they will be added on as a prefetch too. This section would potentially        # need to support tuples in the list as well, for more complex Prefetch        # definitions. Conversely, perhaps this section would be entirely inferable        # from the Serializer definition and other entries in the SparseFieldMeta.        prefetch_related = [            'branches',            'groups',        ]        # Example of select_related logic. LHS defines additional FKs to be        # select_related when any of the fields on the RHS are requested. The        # fields on RHS would likely be computed on the fly and require data from        # these relationships to build.        select_related = {            'type': ['__str__'],            'variant': ['__str__'],        }        # Example of a custom addition that hides fields on the RHS if the        # feature_flag on the LHS isn't enabled. The logic for supporting this        # config would need to live somewhere comfortable, probably as a        # Serializer mixin that can interact cleanly with whatever consumes        # SparseFieldMeta.        feature_flag_required = {            'branches': ['branches'],        }        # Example of a custom addition that restricts non-staff users to only a        # small subset of the field. Ditto re mixin.        nonstaff_fields = [            '__str__', 'name',        ]
You must be logged in to vote
11 replies
@jarekwg
Comment options

Hey@sliverc thanks for taking a look!

1a.SerializerMethodResourceRelatedFields aren't covered (and can't be), because the relation won't exist on the model, soprefetch_related just can't be used. In these cases whatever these things are relying on could be handled manually inViewSet.get_queryset (in the absence ofprefetch_for_related). This wouldn't work with nested includes, but it doesn't currently either. So no feature loss. We could attempt to discuss how we could find a way to fold in support for these as part of the work I'm proposing. Potentially there could be some arg onSerializerMethodResourceRelatedField that dictates how to approach prefetching them. I think if wedo consider something like this, it'd be in a future iteration though.

1b. I have a solution for user-based visibility reduction in my production code, but didn't want to bloat out the PR further. Summarising briefly, where I've putOnDemandFieldsMixin in this PR, in our production code I've called itRestrictedFieldsMixin, and it's responsible for popping off not only expensive fields, but also fields that are restricted based on various rules (feature flags, user permissions, etc). I'll paste a sanitised copy of it in a followup comment. Let me know your thoughts. I don't want to bloat out this work too much, but we could potentially just leave a stub method in there, for developers to be able to overwrite if they want to be doing such magic. That same mixin also allows for explicitly defining one's own ondemand fields, other than the expensive ones. Eg stuff like cached_properties or calculations that one might not want to always auto-include.

2.. Agreed. I'll move the ondemand stuff into a separate PR. The two work best hand-in-hand, but given what I've said in 1b, we could probs put a bit more work into the ondemand mixin to support configurability in addition to the automatic excludes. This'd cover the expensive file field you mentioned too.

I'll close the current PR, and we can spin up a couple of smaller ones off it, one at a time (can happen over several releases, even).
I'm thinking, in this order:

  1. Automatic nested prefetching: UpgradeAutoPrefetchMixin to useadd_nested_prefetches_to_qs logic.
  2. Automatic ondemand: IntroduceOnDemandFieldsMixin to automatically remove "expensive" fields from the Serializer unless they've been explicitly asked for in the sparsefieldset. By default, "expensive fields" would just be reverse relations and m2ms, toggleable with a settings option.
  3. ExtendOnDemandFieldsMixin to allow explicitly defining one's own "expensive" fields. I have an implementation already in use in production, but it introduces an opinionated config syntax for serializers, so would want to workshop this here first and make sure the syntax is as appropriate as possible for the users of this package before continuing.
  4. Field permissioning: Introduce aRestrictedFieldsMixin to automatically remove fields from the Serializer based on information on the request. Could be user permissions, api version, system feature flags, etc. These fields would always get excluded, regardless of whether sparsefieldsets tried to ask for them or not.
  5. Remove/DeprecatePreloadIncludesMixin, given all use cases are covered by the above.

Let me know what you think. Gimme a green light and I'm on it. I'm itching to get it done while it's still fresh in memory hehe. I'd also be keen to do some housekeeping cleanup on this package too. There's currently a lot of cruft, and the single utils file is uniweldy.

@jarekwg
Comment options

Here's the fullRestrictedFields implementation I'm using in production. Would split it into points 2,3,4 above, using two separate mixins.

RestrictedFieldsMixin
classRestrictedFieldsMixin:def__init_subclass__(cls,**kwargs):"""Run a smidge of validation at class declaration, to avoid silly mistakes."""# Throw error if there are expensive relations listed in the on_demand_only config.ifrestricted_fields_meta:=getattr(cls,'RestrictedFields',None):expensive_relation_fields=set(get_expensive_relational_fields(cls))forfieldsingetattr(restricted_fields_meta,'on_demand_only', {}).values():ifmistake_fields:=expensive_relation_fields&set(fields):raiseAttributeError(f"{cls.__name__!r} includes relational fields in its `on_demand_only` list."" Expensive relations are already considered ondemand implicitly."f" Please remove:{mistake_fields}"                    )returnsuper().__init_subclass__(**kwargs)def__init__(self,*args,**kwargs):super().__init__(*args,**kwargs)# Pop any fields off the serializer that shouldn't come through.forfieldinself.get_excluded_restricted_fields()|self.get_excluded_ondemand_fields():self.fields.pop(field,None)defget_excluded_restricted_fields(self)->list[str]:"""        Determines which fields may not be accessed, based on any of the following reasons:        - Non-staff -- user is not staff, and thus only has access to a limited subset of all fields.        - Permission -- user is lacking particular permissions to view certain fields.        - Extension -- site has certain extensions disabled, concealing certain fields.        """restricted_fields_meta=getattr(self,'RestrictedFields',None)restricted_fields=set()ifuser:=getattr(self.context.get('request',None),'user',None):# If user is not staff, swap out the base list of fields for a cut-down version.ifnot (user.is_stafforuser.is_superuser):nonstaff_fields=getattr(restricted_fields_meta,'nonstaff_fields', [])# Mark all serializer fields for removal, except those listed in nonstaff_fields.# Also, custom fields are handled elsewhere, so don't remove any of those either.restricted_fields|= {fforfinself.fields.keys()iffnotinnonstaff_fieldsandnotf.startswith('extra_fields_')}# If the user is lacking certain permissions, remove corresponding fields.forperm,fieldsingetattr(restricted_fields_meta,'permission_required', {}).items():ifnotuser.has_perm(perm):restricted_fields|=set(fields)# If the customer has certain extensions disabled, remove corresponding fields.forapp,fieldsingetattr(restricted_fields_meta,'extension_required', {}).items():ifnotExtension.objects.is_installed(app):restricted_fields|=set(fields)returnrestricted_fieldsdefget_excluded_ondemand_fields(self)->list[str]:"""        Determine which fields should be popped off if not explicitly asked for.        Will not nominate any fields that have been designated as `demanded_fields` in context.        Ondemand fields are determined in two ways:        - Fields that we automatically determine to be expensive, and thus automatically remove           from the default offering. Currently such fields are M2Ms and reverse FKs.        - Explicitly listed on the serializer, in the `RestrictedFields.on_demand_only` section,           oft accompanied by an annotation method that needs to be run in order for the ondemand           field to exist.        """restricted_fields_meta=getattr(self,'RestrictedFields',None)# M2Ms and reverse FKs.if<some_settings_flag>:ondemand_fields=set()# Expensive fields are not considered.else:ondemand_fields=set(get_expensive_relational_fields(type(self)))# `on_demand_only` declarations.forfieldsingetattr(restricted_fields_meta,'on_demand_only', {}).values():ondemand_fields|=set(fields)# If we've instantiated the serializer ourselves, we'll have fed `demanded_fields` into its context.# If it's happened as part of drf render internals, then we have a fallback where the view# has provided the entire sparsefields context for us to pick through.if'demanded_fields'inself.context:demanded_fields=set(self.context.get('demanded_fields'))else:resource_name=get_resource_type_from_serializer(type(self))demanded_fields=set(self.context.get('all_sparsefields', {}).get(resource_name, []))# We only want to exclude those ondemand fields that haven't been explicitly requested.returnondemand_fields-set(demanded_fields)
@sliverc
Comment options

Thanks for your response.

In terms ofRestrictedFieldsMixin that was not what I wanted to refer to. I also think your code snippet is not REST conform, see more inthis discussion. What I was actually referring to is when you have a M2M relation and the API should not render relations which for instance are marked as deleted.

In DRF someone could overwriteget_queryset like the following (a dirty example):

defget_queryset(self):returnmodels.Blog.objects.prefetch_related(Prefetch('tags',models.Tags.objects.filter(is_deleted=False)) )

This is defined on the view but actually does not work with the DJA included feature. For this we need a solution whichPreloadIncludesMixin currently also does not offer.

There are other use cases where an automatic approach won't work. For instance on a serializer aSerializerMethodField is used which calculates a name which it gets from two different relations (FK). This serializer gets then defined as a include in another serializer.PreloadIncludesMixin currently supports such use cases where the configuration defines what relations need to be prefetched for such a field in case serializer is actually included in the request.

All in all It is really important to makeAutoPrefetchMixin better but I believe there will always be a need for manual configuration as only the API developer knows what actually needs to be prefetched especially in complex cases.

In terms of priorities as DJA is an open source project it really depends on the contributors what they want to invest time on.

From a maintainer perspective though I certainly prefer to fix existing features first before adding new features. So this said I would prefer to makeAutoPrefetchMixin andPreloadIncludesMixin work well first.

In terms of the on demand only fields I prefer this to be discussed in another discussion thread where we can go deeper on what the actual use cases are for this feature and how those use cases can be best covered.

@jarekwg
Comment options

I also think your code snippet is not REST conform, see more inthis discussion.

Not sure I'm following on this one. What I believe you're saying here is that whether I hit/api/a/?include=b, or/api/b/, the shape ofb should present the exact same way regardless. Am I understanding correctly? Where in what I've proposed is that conformance broken?

What I was actually referring to is when you have a M2M relation and the API should not render relations which for instance are marked as deleted.

In DRF someone could overwriteget_queryset like the following (a dirty example):

defget_queryset(self):returnmodels.Blog.objects.prefetch_related(Prefetch('tags',models.Tags.objects.filter(is_deleted=False)) )

This is defined on the view but actually does not work with the DJA included feature. For this we need a solution whichPreloadIncludesMixin currently also does not offer.

Yeah, so the solution I'm currently running with is to use the queryset defined on the ResourceRelatedField on the serializer. This is the best place to specify this, as it means we can abide by it at all levels of nesting. However I'm aware this is somewhat of an abuse, asResourceRelatedField.queryset is designed more for filtering options/validating input, rather than the actual related items. There is an overlap in concepts here, but the overlap isn't perfect.. my thoughts of how to get around this are either to introduce an additional field onResourceRelatedField, which allows defining aget_queryset() method for this purpose (this'd be a hard sell i suspect...) or conjuring up a way to tell the automatic prefetcher to skip the given field, because the user will deal with its prefetching themselves.

There are other use cases where an automatic approach won't work. For instance on a serializer aSerializerMethodField is used which calculates a name which it gets from two different relations (FK). This serializer gets then defined as a include in another serializer.PreloadIncludesMixin currently supports such use cases where the configuration defines what relations need to be prefetched for such a field in case serializer is actually included in the request.

It "supports" it, but at great cost -- any viewset that might potentially drill down to hitting this field, needs to have its own version of specifications for this include in its config. The solution I'm currently running builds this into the ondemand fields concept: you can designate certain fields as ondemand_only, and as part of that config you can nominate queryset methods you want run when said fields are requested, to perform any additional annotations or prefetch theSerializerMethodField might need. In short, I believe this'd be buildable into some sort of config on the serializer, or through introduction of additional arguments toSerializerMethodField, to achieve the same result.

All in all It is really important to makeAutoPrefetchMixin better but I believe there will always be a need for manual configuration as only the API developer knows what actually needs to be prefetched especially in complex cases.

Agreed, there need to be escape hatches for non-standard behaviour. DJA should be opinionated tosome degree, to prevent foot-shooting, and to push REST/JSONAPI spec compliance, but we certainly shouldn't be regressing away from what's currently possible, at least not without a very strong reason.

In terms of priorities as DJA is an open source project it really depends on the contributors what they want to invest time on.

From a maintainer perspective though I certainly prefer to fix existing features first before adding new features. So this said I would prefer to makeAutoPrefetchMixin andPreloadIncludesMixin work well first.

Can we agreePreloadIncludesMixin is inherently flawed and the goal should be to offer sufficient parallel functionality in order to be able to remove it? Viewset-based prefetching/select_relating willnever play nicely with nested includes.

I see what you're saying though, and I definitely appreciate a measured approach to this. Stability is important when there are so many projects depending on the package, and I wish it had come sooner (eg stuff likethis needless and expensive pluralisation, using inflection library, no less, should never have made it in)

In my view, as it stands, the n+1 problem requires significant changes for it to get addressed properly. I've currently got some energy to make these changes, and am happy to put some time into drilling down into this discussion with yourself or whoever, step by step, to implement these changes in as minimally regressive a way as possible, with an approach we all agree on. If you're not sold on the 5 steps I proposed above, what sort of roadmap would you suggest?

In terms of the on demand only fields I prefer this to be discussed in another discussion thread where we can go deeper on what the actual use cases are for this feature and how those use cases can be best covered.

Sounds good. I wanted to mention them initially here, for context, as they play a role in addressing the problem too, however they can be fleshed out more carefully down the line, separately.

Cheers,

@sliverc
Comment options

Thanks for your lengthy response and willingness to work on this. I am not gonna address all what you mentioned above as of lack of time (can follow up on the different issue in the respective discussions).

I noticed though that my sentence aboutAutoPrefetchMixin andPreloadIncludesMixin can be misunderstood. I am not suggesting that those mixins remain but the concept behind them ofauto-prefetching andmanual configuration of prefetching do. As those concepts are already introduced in DJA they have in my opinion priority over totally new features.

Hence this would be my action plan:

  1. Improve auto prefetching: As far as I see is there no more discussion needed on this and you could open an issue explaining what needs to be improved and create a PR which address this. (so far what I see is the missing sparse field and recurse M2M and reverse relationship support. There might be more.)
  2. Continue discussion on how manual configuration of prefetching could work. As this is the initial idea of this discussion I would continue here. A PR with some POC code might also help.
  3. Open another discussion about on demand only fields so we can talk about use cases and possible implementation.

The different discussions can of course also happen in parallel.

Comment options

In issue#337@sliverc specifically mentions an existing N+1 query problem for any reverse relationships (eg. one to one in reverse or one to many) and how that's not handled byAutoPrefetchMixin

In our use of the framework, we make our own base classes and have amendedAutoPrefetchMixin to fix the N+1 query reverse relationship issue. Basically our mixin is just a copy of theAutoPrefetchMixin with added logic at the beginning, examing the users "include" strings and adding auto fetch of reverse relationships 1 level deeper.

Since it is using the serializersincluded_serializers, it can only apply auto prefetch when it is able to fetch the classes (in our projects, we require thatincluded_serializers is always complete and have added validations for this), but even without having completeness the addition to the mixin will still work to reduce N+1 queries in any cases where the class can be found.

The core of the solution is examining the relationships for models for each serializer:

# snip ... looping through the base view serializer and any serializers that the user asked for with "include" ...level_model=getattr(level_serializer_class.Meta,"model")info=model_meta.get_field_info(level_model)relations_set= {xforxininfo.relations.keys()ifnotx.endswith("_set")        }forfieldinrelations_set:field_info=info.relations[field]iffield_info.reverseorfield_info.to_many:included_recursive_prefetch.add(f"{level_agg}{field}")#... snip# And then resume the work of `AutoPrefetchMixin`included_resources=get_included_resources(self.request,self.serializer_class)+list(included_recursive_prefetch)qs=super().get_queryset(*args,**kwargs)# ...

Full gist here:https://gist.github.com/dsschneidermann/ddf9c9e4a782c2f6769d503ff0a0a42e

An additional optimization missing from here would be restricting the queries to doSELECT id FROM .. rather than fetching the full instances - but I honestly have no idea how to make that effect happen in Django ORM. And in that case it would not anymore be an addition toAutoPrefetchMixin but a new thing entirely.

As it is, it can be used as a drop-in replacement forAutoPrefetchMixin to fix only N+1 queries on reverse relationships and it'll bring no new complexity to the framework, which I think is a major advantage.

You must be logged in to vote
2 replies
@jarekwg
Comment options

Nice approach! Minimising added complexity is a definite boon, however correctness/useability come first.

Currently the two offerings in drfjsonapi don't cut it:

  • PreloadIncludesMixin forces unnecessary manual effort, and is fundamentally flawed in that it cannot be set up to consider nesting properly.
  • AutoPrefetchMixin doesn't recurse (which your solution addresses), nor does it offer a way of opting out of fetching the additional data if sparsefieldsets haven't asked for it. Specifically, for reverse relations and M2Ms, there are three cases to consider: [1] don't return the field at all (no prefetch), [2] return only the ids (simple prefetch), and [3] return the entire relations (deep prefetch and recurse this same logic at the next level down).

So I don't think avoiding added complexity or shaking this up too much is an option here.

@sliverc
Comment options

Thanks@dsschneidermann for posting your mixin here. Very helpful for people who stumble upon this thread and need a quick solution which might very well work for their use case. As@jarekwg mentioned and he is already working on in#964 I would prefer when in the futureAutoPrefetchMixin also consider sparse field sets.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Category
Ideas
Labels
None yet
4 participants
@SafaAlfulaij@sliverc@jarekwg@dsschneidermann

[8]ページ先頭

©2009-2025 Movatter.jp