Deprecate spider attributes that can be replaced by settings, round 2 (#7039)
Conversation
Pull Request Overview
This PR deprecates spider attributes that can be replaced by settings, continuing the work from issue #7038. The changes introduce a centralized warning function for deprecated spider attributes and apply deprecation warnings across multiple middleware components.
- Adds a centralized `warn_on_deprecated_spider_attribute` function for consistent deprecation messaging
- Replaces inline deprecation warnings with calls to the new function across multiple middleware files
- Updates redirect middleware to include `REDIRECT_ALLOWED_HTTP_CODES` setting support
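The centralized helper could look roughly like the following. This is a sketch, not the PR's actual code: the exact message wording and `stacklevel` are assumptions, and Scrapy's `ScrapyDeprecationWarning` is approximated here with the stdlib `DeprecationWarning` to keep the example self-contained.

```python
import warnings


def warn_on_deprecated_spider_attribute(attribute: str, setting: str) -> None:
    # Single place to format the message, so every middleware emits
    # identical wording for deprecated spider attributes.
    warnings.warn(
        f"The {attribute} spider attribute is deprecated, "
        f"use the {setting} setting instead.",
        DeprecationWarning,  # Scrapy would use ScrapyDeprecationWarning
        stacklevel=3,  # point at the middleware's caller, not this helper
    )
```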
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| scrapy/utils/deprecate.py | Adds centralized function for spider attribute deprecation warnings |
| scrapy/spidermiddlewares/httperror.py | Integrates REDIRECT_ALLOWED_HTTP_CODES setting into allowed codes list |
| scrapy/extensions/throttle.py | Adds deprecation warning for download_delay spider attribute |
| scrapy/downloadermiddlewares/useragent.py | Adds deprecation warning for user_agent spider attribute |
| scrapy/downloadermiddlewares/redirect.py | Adds REDIRECT_ALLOWED_HTTP_CODES setting support and handle_httpstatus_list deprecation |
| scrapy/downloadermiddlewares/httpcompression.py | Refactors to use centralized deprecation warning function |
| scrapy/downloadermiddlewares/downloadtimeout.py | Adds deprecation warning for download_timeout spider attribute |
| scrapy/core/http2/protocol.py | Adds deprecation warnings for download size attributes |
| scrapy/core/downloader/handlers/http11.py | Adds deprecation warnings for download size attributes |
| scrapy/core/downloader/__init__.py | Refactors to use centralized deprecation warning function |
| docs/topics/downloader-middleware.rst | Updates documentation to reference new setting instead of deprecated attribute |
```python
self.handle_httpstatus_list: list[int] = [
    *settings.getlist("HTTPERROR_ALLOWED_CODES"),
    *settings.getlist("REDIRECT_ALLOWED_HTTP_CODES"),
]
```
We already have `HTTPERROR_ALLOWED_CODES`, which overrides `handle_httpstatus_list`. Even though the setting name has the non-generic `HTTPERROR_` prefix, we could use it in `RedirectMiddleware` instead of creating a new setting.
Right, `handle_httpstatus_list` (the spider attr and the meta key) is mostly for `HttpErrorMiddleware`; `RedirectMiddleware` is its additional user. Similarly for `HTTPERROR_ALLOW_ALL` (the setting) and `handle_httpstatus_all` (the spider attr and the meta key). But out of these, `RedirectMiddleware` only uses both meta keys and the `handle_httpstatus_list` spider attr, not the spider attr and the setting for *all*. This is inconsistent (though I haven't considered what the spider behavior would be if this were made consistent).
I don’t like the HTTP errors middleware reading the `REDIRECT_ALLOWED_HTTP_CODES` setting…
I wonder if we should take a step back and try to think through what the ideal “settings and meta API” should be, and then try to move towards it in a backward-compatible way with deprecations where needed.
I feel this should be similar to `dont_filter`, where you have a generic setting/meta and a specific setting/meta (e.g. `allow_offsite`). I think the specific should override the generic, i.e. no merging: if the generic is `[401]` and you want to extend it with 402 for a specific middleware, you set the specific setting/meta to `[401, 402]`, not just `[402]`. But I am not completely sure.
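The specific-overrides-generic resolution described above could be sketched as follows. Everything here is hypothetical (the helper name, the plain-dict settings, and `REDIRECT_ALLOWED_HTTP_CODES` as the specific setting); it is a sketch of the proposal, not current Scrapy behavior.

```python
def resolve_allowed_codes(settings: dict, specific_setting: str) -> list[int]:
    # The middleware-specific setting, if set, fully replaces the generic
    # one -- values are never merged across the two levels.
    specific = settings.get(specific_setting)
    if specific is not None:
        return list(specific)
    return list(settings.get("HTTPERROR_ALLOWED_CODES", []))
```

So with a generic `[401]` and a specific `[401, 402]`, a middleware reading its specific setting sees `[401, 402]`; if it only set `[402]`, 401 would no longer be allowed for it.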
codecov bot commented Sep 3, 2025 (edited)
❌ 1 test failed.
To view more test analytics, go to the Test Analytics Dashboard.
wRAR commented Sep 3, 2025 (edited)
There are tests that use these newly deprecated attributes and need to be fixed or removed. For download handler tests that likely means creating the crawler and the handler in the test function, like already done in …
wRAR commented Sep 14, 2025 (edited)
In case you've missed it (as it's only in …
```python
        ),
    ],
)
def test_startdelay_definition(min_spider, min_setting, start_setting, expected):
```
Is this test useless/duplicate without `min_spider`?
It seems so; the only assert is `assert spider.download_delay == expected`. Let me know if you want to keep it and drop `min_spider`; I'm not sure what we would check in that case.
OK, this is stupid :)
The extension sets `spider.download_delay` in `_spider_opened()` (based on `AUTOTHROTTLE_START_DELAY`, `DOWNLOAD_DELAY` and the original value of `spider.download_delay`, just like the params of this test), so that the downloader can use it as the start delay value for newly created slots in `scrapy.core.downloader._get_concurrency_delay()`. So if we want to stop using `spider.download_delay` we need this to work in some other way (and I don't think we can change the `DOWNLOAD_DELAY` setting value here).
@Gallaecio thoughts?
Puff, headache incoming… 😓
Maybe we should keep `download_delay` until #6660 is addressed?
Laerte commented Sep 14, 2025
Thanks for flagging, fixed.
```python
if hasattr(spider, "download_maxsize"):
    warn_on_deprecated_spider_attribute("download_maxsize", "DOWNLOAD_MAXSIZE")
if hasattr(spider, "download_warnsize"):
    warn_on_deprecated_spider_attribute(
        "download_warnsize", "DOWNLOAD_WARNSIZE"
    )
```
I wonder if we should try to find a way to trigger this only once per spider, but no strong opinion. Maybe louder is better 🙂
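A once-per-spider variant of the helper could stash the already-warned attribute names on the spider instance; the helper name, the `_deprecated_attrs_warned` attribute, and the storage strategy are all assumptions for illustration, and `DeprecationWarning` again stands in for Scrapy's `ScrapyDeprecationWarning`.

```python
import warnings


def warn_once_per_spider(spider, attribute: str, setting: str) -> None:
    # Remember which attributes we have already warned about on this
    # spider, so repeated downloads don't flood the log.
    warned = getattr(spider, "_deprecated_attrs_warned", None)
    if warned is None:
        warned = set()
        spider._deprecated_attrs_warned = warned
    if attribute in warned:
        return
    warned.add(attribute)
    warnings.warn(
        f"The {attribute} spider attribute is deprecated, "
        f"use the {setting} setting instead.",
        DeprecationWarning,  # Scrapy would use ScrapyDeprecationWarning
        stacklevel=2,
    )
```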
```python
    in getattr(self.crawler.spider, "handle_httpstatus_list", [])
    or response.status in request.meta.get("handle_httpstatus_list", [])
    or response.status in self.handle_httpstatus_list
```
I wonder if we should handle this somewhat differently.
i.e., when both the setting and the spider attribute are defined, the spider attribute, even if deprecated, should be considered an override of the setting value, and the setting value should be ignored.
In fact, I wonder if we should also consider the meta value an override of both the setting and the spider attribute, and hence ignore those if the meta value is set. But I am fine with keeping that out of scope here.
Either way, we should make sure we clarify this in the docs and the release notes.
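The strict-override precedence proposed above (meta over spider attribute over setting, with no merging between levels) could be sketched like this; it is a hypothetical helper illustrating the proposal, not current Scrapy behavior.

```python
def allowed_status_codes(setting_value, spider, request_meta) -> list[int]:
    # Most specific source wins outright; lower-priority sources are
    # ignored entirely, never merged in.
    if "handle_httpstatus_list" in request_meta:
        return list(request_meta["handle_httpstatus_list"])
    spider_value = getattr(spider, "handle_httpstatus_list", None)
    if spider_value is not None:
        return list(spider_value)
    return list(setting_value)
```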
Closes #7038.