How to share and export consistent statistics across multiple crawlers? #966
-
Hi, I want to use the same statistics for different crawlers. I have one HTTP client that I pass to two crawlers (PlaywrightCrawler and BeautifulSoupCrawler). However, when I execute these crawlers, I receive different statistics. Additionally, I want to export these statistics into FinalStatistics and save them in a storage format (JSON or CSV). My goal is to manage multiple scrapers and save their statistics so I can analyze them.
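For the JSON/CSV export part of the question, here is a minimal stdlib-only sketch. This is not crawlee's API: it assumes the final statistics are available as a plain mapping (the field names mirror the crawler's log output), and the `export_stats` helper is hypothetical:

```python
import csv
import json

# Hypothetical per-run statistics; field names follow the crawler log tables.
final_stats = {
    "requests_finished": 4,
    "requests_failed": 0,
    "request_total_duration": 15.566624,
    "requests_total": 4,
    "crawler_runtime": 18.088441,
}

def export_stats(stats: dict, json_path: str, csv_path: str) -> None:
    """Persist one run's statistics as JSON and as a two-column CSV."""
    with open(json_path, "w") as f:
        json.dump(stats, f, indent=2)
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["metric", "value"])
        writer.writerows(stats.items())

export_stats(final_stats, "stats.json", "stats.csv")
```

JSON is the easier format to merge across runs later; CSV is handy if the goal is loading the numbers into a spreadsheet or pandas for analysis.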
Replies: 2 comments 3 replies
-
Hello, you can make a custom `Statistics` instance and pass it to both crawlers. I'm not sure I understand what you want to do with `FinalStatistics`, though.
-
Thanks @janbuchar. Your answer works fine for me, but the runtime of the first scraper's execution is not added to the second scraper's statistics. Here are the logs:

```
[crawlee.statistics._statistics] INFO  Statistics
┌───────────────────────────────┬─────────┐
│ requests_finished             │ 0       │
│ requests_failed               │ 0       │
│ retry_histogram               │ [0]     │
│ request_avg_failed_duration   │ None    │
│ request_avg_finished_duration │ None    │
│ requests_finished_per_minute  │ 0       │
│ requests_failed_per_minute    │ 0       │
│ request_total_duration        │ 0.0     │
│ requests_total                │ 0       │
│ crawler_runtime               │ 0.02422 │
└───────────────────────────────┴─────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Navigating to ...
[crawlee.crawlers._playwright._playwright_crawler] INFO  --- Fetch cookies ---
[crawlee.crawlers._playwright._playwright_crawler] INFO  --- End of cookies ---
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute  │ 4         │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.137499 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 14.854891 │
└───────────────────────────────┴───────────┘
[rich] INFO  Found 12 cookies

>>>>> Here finished the execution of the first scraper (Playwright) <<<<<<

[rich] INFO  Obteniendo ítems...
[crawlee.statistics._statistics] INFO  Statistics
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute  │ 2543      │
│ requests_failed_per_minute   │ 0         │
│ request_total_duration        │ 11.137499 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 0.023598  │
└───────────────────────────────┴───────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO  - Page index: 0. Items remaining: 31 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO  - Page index: 1. Items remaining: 11 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO  - Page index: 2. Items remaining: 0 of 40
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO  - All items already processed
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 4         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [4]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 3.891656  │
│ requests_finished_per_minute  │ 13        │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 15.566624 │
│ requests_total                │ 4         │
│ crawler_runtime               │ 18.088441 │
└───────────────────────────────┴───────────┘
```

Could it be the way I'm running them? Here is a summary of the code:

```python
http_client = ...
my_stats = ...
crawler_1 = PlaywrightCrawler(statistics=my_stats)
# ... here I have stored the cookies in a storage/dataset/cookies
await crawler_1.run([my_url])
crawler_2 = BeautifulSoupCrawler(statistics=my_stats)
# Here I have added the cookies I have stored before.
await crawler_2.run([my_url])
```
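One workaround for the reset runtime, until the shared-statistics behavior improves, is to keep each run's final numbers and combine them yourself. This is a stdlib-only sketch: the `merge_stats` helper is mine, not part of crawlee, and the sample dicts reuse the numbers from the logs. Additive fields are summed and the average is recomputed from the summed totals:

```python
# Merge per-run statistics dicts: counters and durations add up,
# the average finished duration is recomputed from the summed totals.
def merge_stats(*runs: dict) -> dict:
    additive = ("requests_finished", "requests_failed", "requests_total",
                "request_total_duration", "crawler_runtime")
    merged = {key: sum(run.get(key, 0) for run in runs) for key in additive}
    finished = merged["requests_finished"]
    merged["request_avg_finished_duration"] = (
        merged["request_total_duration"] / finished if finished else None
    )
    return merged

# Numbers taken from the two "Final request statistics" tables above.
playwright_run = {"requests_finished": 1, "requests_failed": 0,
                  "requests_total": 1, "request_total_duration": 11.137499,
                  "crawler_runtime": 14.854891}
http_run = {"requests_finished": 4, "requests_failed": 0,
            "requests_total": 4, "request_total_duration": 15.566624,
            "crawler_runtime": 18.088441}

combined = merge_stats(playwright_run, http_run)
# combined["crawler_runtime"] now covers both runs (≈ 32.94 s).
```

Per-minute rates are deliberately left out: they only make sense relative to a single run's runtime, so they should be recomputed from the merged totals if needed.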
-
Yeah, it looks like the `crawler_runtime` is reset when the second crawler starts, even though you pass the same statistics instance to both. Could you try to explain what you're trying to do so that we can come up with a solution or perhaps an idea how to improve the `Statistics` interface?
-
I want to extract cookies from a page using PlaywrightCrawler because they differ from those obtained with BeautifulSoupCrawler (which I recently replaced with HttpCrawler to simplify my scrapers). However, sometimes the Playwright-based scraper fails, requiring a retry. Once I successfully retrieve the cookies, I save them, close the PlaywrightCrawler (terminating it), and then continue scraping using HttpCrawler, passing the saved cookies to it. The reason for this approach is that I need specific cookies to extract information from the site's hidden API, which is used to hydrate the page. Additionally, I want to save statistics for the entire process because, in the future, I plan to run multiple distributed scrapers. I will use Kubernetes to manage them, and these statistics will help me monitor and evaluate their performance.
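The cookie hand-off between the two stages can be sketched with the standard library alone. Everything here (`save_cookies`, `load_cookie_header`, the file name) is illustrative, not crawlee's API; Playwright-style cookies are assumed to be dicts with `name` and `value` keys:

```python
import json
from pathlib import Path

# Hypothetical hand-off: the Playwright stage saves cookies to a file,
# the HTTP stage reloads them and builds a Cookie request header value.
COOKIE_FILE = Path("cookies.json")

def save_cookies(cookies: list[dict]) -> None:
    """Persist cookies captured by the browser-based crawler."""
    COOKIE_FILE.write_text(json.dumps(cookies, indent=2))

def load_cookie_header() -> str:
    """Rebuild a Cookie header value for the HTTP-based crawler."""
    cookies = json.loads(COOKIE_FILE.read_text())
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)

save_cookies([{"name": "session", "value": "abc123"},
              {"name": "csrf", "value": "xyz"}])
header = load_cookie_header()  # "session=abc123; csrf=xyz"
```

Persisting the cookies to a file (or a crawlee key-value store) also survives the retry case: if the Playwright stage fails midway, the HTTP stage can still reuse the last successfully saved cookies.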
-
I see. Then I'd probably make a fresh `Statistics` instance for each crawler.