Description
In the JavaScript version, the error handler is able to access the `Page` object via `PlaywrightCrawlingContext.page`. I discovered that the Python version doesn't implement this when porting the `ContextPipeline` to JavaScript.
Test case
```python
async def test_error_handler_can_access_page(server_url: URL) -> None:
    crawler = PlaywrightCrawler(max_request_retries=2)

    request_handler = mock.AsyncMock(side_effect=RuntimeError('Intentional crash'))
    crawler.router.default_handler(request_handler)

    error_handler_calls: list[str | None] = []

    @crawler.error_handler
    async def error_handler(context: BasicCrawlingContext | PlaywrightCrawlingContext, _error: Exception) -> None:
        error_handler_calls.append(
            await context.page.content() if isinstance(context, PlaywrightCrawlingContext) else None
        )

    await crawler.run([str(server_url / 'hello-world')])

    assert error_handler_calls == [HELLO_WORLD, HELLO_WORLD, HELLO_WORLD]
```
Possible solutions
- Run the error handlers before the cleanup step of the context pipeline
  - this is a fairly big change and we probably want to do it after "fix: Only apply requestHandlerTimeout to request handler" (#1474)
  - changing this in the adaptive playwright crawler will be especially tricky
- Add some "deferred cleanup" step to the context pipeline and call that after error handlers are done
  - it's unclear how this would fit in the current async generator based middleware model
  - considerable refactoring of `_run_request_handler` and `__run_task_function` would still be necessary - error handlers are called by the latter and the context pipeline is only handled in the former
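
To make the ordering problem concrete, here is a minimal, self-contained sketch. It does not use Crawlee's actual internals: `FakePage`, `page_middleware`, `run_request`, and the `defer_cleanup` flag are all hypothetical names invented for illustration. It assumes a middleware written as an async generator whose cleanup closes the page, and compares running that cleanup before the error handler with deferring it until the error handler is done.

```python
import asyncio
from collections.abc import AsyncGenerator, Awaitable, Callable
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a Playwright page (not the real API)."""

    closed: bool = False

    async def content(self) -> str:
        if self.closed:
            raise RuntimeError('page is closed')
        return '<html>hello</html>'

    async def close(self) -> None:
        self.closed = True


async def page_middleware() -> AsyncGenerator[FakePage, None]:
    """Async-generator middleware: setup before the yield, cleanup after it."""
    page = FakePage()
    try:
        yield page
    finally:
        await page.close()


async def run_request(
    request_handler: Callable[[FakePage], Awaitable[None]],
    error_handler: Callable[[FakePage, Exception], Awaitable[None]],
    *,
    defer_cleanup: bool,
) -> None:
    """Hypothetical driver; `defer_cleanup` selects the cleanup ordering."""
    middleware = page_middleware()
    page = await middleware.__anext__()  # run the setup part of the middleware
    try:
        await request_handler(page)
    except Exception as error:
        if not defer_cleanup:
            # Cleanup runs first, so the error handler sees a closed page.
            await middleware.aclose()
        await error_handler(page, error)
    finally:
        # Deferred cleanup; a no-op if the generator was already closed.
        await middleware.aclose()


async def demo(defer_cleanup: bool) -> list[str]:
    seen: list[str] = []

    async def request_handler(page: FakePage) -> None:
        raise RuntimeError('Intentional crash')

    async def error_handler(page: FakePage, error: Exception) -> None:
        try:
            seen.append(await page.content())
        except RuntimeError as exc:
            seen.append(f'error: {exc}')

    await run_request(request_handler, error_handler, defer_cleanup=defer_cleanup)
    return seen


if __name__ == '__main__':
    print(asyncio.run(demo(defer_cleanup=False)))
    print(asyncio.run(demo(defer_cleanup=True)))
```

In this sketch the only difference between the two orderings is when `aclose()` is called on the generator. The real difficulty noted above remains: in the actual code the pipeline and the error handlers are driven from different functions, so whoever calls the error handler would need a handle on the pending cleanup.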