How to share and export consistent statistics across multiple crawlers? #966

Unanswered
francomanca93 asked this question in Q&A

Hi,

I want to use the same statistics for different crawlers. I have one HTTP client that I pass to two crawlers (PlaywrightCrawler and BeautifulSoupCrawler). However, when I execute these crawlers, I receive different statistics.

Additionally, I want to export these statistics into FinalStatistics and save them in a storage format (JSON or CSV). My goal is to manage multiple scrapers and save the statistics to analyze them.


@janbuchar

Hello, you can make a custom instance of crawlee.statistics.Statistics and pass it to your crawlers when you instantiate them - e.g., PlaywrightCrawler(statistics=your_custom_statistics).
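
A minimal sketch of that suggestion, assuming the no-argument `Statistics()` constructor works in your crawlee version:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, PlaywrightCrawler
from crawlee.statistics import Statistics


async def main() -> None:
    # One Statistics instance shared by both crawlers (assumption: the
    # no-argument constructor is available in your crawlee version).
    shared_stats = Statistics()

    crawler_1 = PlaywrightCrawler(statistics=shared_stats)
    crawler_2 = BeautifulSoupCrawler(statistics=shared_stats)

    # ... register request handlers here, then run the crawlers in sequence.


if __name__ == '__main__':
    asyncio.run(main())
```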

I'm not sure I understand what you want to do with FinalStatistics, but that object is calculated by calling the Statistics.calculate() method - you can do that instead of relying on BasicCrawler.run to return it.
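
For the export part of the question, here is a sketch under the assumption that `FinalStatistics` is a plain dataclass (its fields match the rows in the log tables below), so it can be serialized by hand:

```python
import json
from dataclasses import asdict, is_dataclass

# Take a snapshot whenever you like; BasicCrawler.run also returns one.
final = shared_stats.calculate()

# Assumption: FinalStatistics is a dataclass; fall back to vars() otherwise.
data = asdict(final) if is_dataclass(final) else vars(final)

with open('stats.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, default=str)  # default=str covers non-JSON types
```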

@francomanca93

Thanks @janbuchar. Your answer works for me, but the runtime of the first scraper is not carried over into the second scraper's statistics. Here are the logs:

```
[crawlee.statistics._statistics] INFO  Statistics
┌───────────────────────────────┬─────────┐
│ requests_finished             │ 0       │
│ requests_failed               │ 0       │
│ retry_histogram               │ [0]     │
│ request_avg_failed_duration   │ None    │
│ request_avg_finished_duration │ None    │
│ requests_finished_per_minute  │ 0       │
│ requests_failed_per_minute    │ 0       │
│ request_total_duration        │ 0.0     │
│ requests_total                │ 0       │
│ crawler_runtime               │ 0.02422 │
└───────────────────────────────┴─────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Navigating to ...
[crawlee.crawlers._playwright._playwright_crawler] INFO  --- Fetch cookies ---
[crawlee.crawlers._playwright._playwright_crawler] INFO  --- End of cookies ---
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute  │ 4         │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.137499 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 14.854891 │
└───────────────────────────────┴───────────┘
[rich] INFO  Found 12 cookies
>>>>> Here finished the execution of the first scraper (Playwright) <<<<<<
[rich] INFO  Fetching items...
[crawlee.statistics._statistics] INFO  Statistics
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute  │ 2543      │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.137499 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 0.023598  │
└───────────────────────────────┴───────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 0. Items remaining: 31 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 1. Items remaining: 11 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 2. Items remaining: 0 of 40
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - All items already processed
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 4         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [4]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 3.891656  │
│ requests_finished_per_minute  │ 13        │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 15.566624 │
│ requests_total                │ 4         │
│ crawler_runtime               │ 18.088441 │
└───────────────────────────────┴───────────┘
```

Could it be the way I'm running them? Here is a summary of the code:

```python
http_client = ...
my_stats = ...

crawler_1 = PlaywrightCrawler(statistics=my_stats)
# ... here I have stored the cookies in a storage/dataset/cookies
await crawler_1.run([my_url])

crawler_2 = BeautifulSoupCrawler(statistics=my_stats)
# Here I add the cookies I stored before.
await crawler_2.run([my_url])
```
@janbuchar

Yeah, it looks like the Statistics class overwrites the timestamp of the start of the crawl. Sadly, this renders the per_minute statistics in the second FinalStatistics object unusable.

Could you explain what you're trying to do, so that we can come up with a solution, or perhaps an idea for improving the Statistics class?

@francomanca93

I want to extract cookies from a page using PlaywrightCrawler because they differ from those obtained with BeautifulSoupCrawler (which I recently replaced with HttpCrawler to simplify my scrapers). However, sometimes the Playwright-based scraper fails, requiring a retry. Once I successfully retrieve the cookies, I save them, close the PlaywrightCrawler (terminating it), and then continue scraping using HttpCrawler, passing the saved cookies to it.

The reason for this approach is that I need specific cookies to extract information from the site's hidden API, which is used to hydrate the page.
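
A rough sketch of that hand-off follows. The handler name and the `Cookie`-header approach are my own assumptions rather than an established crawlee recipe; Playwright itself does expose the browser context's cookies:

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

collected_cookies: list[dict] = []

playwright_crawler = PlaywrightCrawler()


@playwright_crawler.router.default_handler
async def grab_cookies(context: PlaywrightCrawlingContext) -> None:
    # The Playwright page gives access to its browser context's cookies.
    collected_cookies.extend(await context.page.context.cookies())


# After playwright_crawler.run([...]) finishes, serialize the cookies into
# a header value that the HTTP-based crawler can send with its requests.
cookie_header = '; '.join(
    f'{c["name"]}={c["value"]}' for c in collected_cookies
)
```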

Additionally, I want to save statistics for the entire process because, in the future, I plan to run multiple distributed scrapers. I will use Kubernetes to manage them, and these statistics will help me monitor and evaluate their performance.

@janbuchar

I see. Then I'd probably make a fresh Statistics instance for every crawler run, and after the run finishes, I'd store it separately. This way, you can be sure about how to interpret the data there.
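
In code, that per-run approach might look like the sketch below (same assumptions as above about the `Statistics()` constructor and the dataclass export; request handlers omitted):

```python
import json
from dataclasses import asdict

from crawlee.crawlers import HttpCrawler, PlaywrightCrawler
from crawlee.statistics import Statistics


async def run_with_fresh_stats(my_url: str) -> None:
    runs: list[dict] = []

    # First run: Playwright, with its own Statistics instance.
    pw_stats = Statistics()
    pw_crawler = PlaywrightCrawler(statistics=pw_stats)
    await pw_crawler.run([my_url])
    runs.append({'crawler': 'playwright', **asdict(pw_stats.calculate())})

    # Second run: plain HTTP, again with a fresh instance.
    http_stats = Statistics()
    http_crawler = HttpCrawler(statistics=http_stats)
    await http_crawler.run([my_url])
    runs.append({'crawler': 'http', **asdict(http_stats.calculate())})

    # One JSON document per process run, ready for later analysis.
    with open('run_stats.json', 'w', encoding='utf-8') as f:
        json.dump(runs, f, indent=2, default=str)
```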
