How to share and export consistent statistics across multiple crawlers? #966

Unanswered
francomanca93 asked this question in Q&A

Hi,

I want to use the same statistics for different crawlers. I have one HTTP client that I pass to two crawlers (PlaywrightCrawler and BeautifulSoupCrawler). However, when I execute these crawlers, I receive different statistics.

Additionally, I want to export these statistics into FinalStatistics and save them in a storage format (JSON or CSV). My goal is to manage multiple scrapers and save the statistics to analyze them.


@janbuchar

Hello, you can make a custom instance of crawlee.statistics.Statistics and pass it to your crawlers when you instantiate them - e.g., PlaywrightCrawler(statistics=your_custom_statistics).
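
A minimal sketch of that suggestion, assuming the no-argument `Statistics()` constructor works in your crawlee version:

```python
import asyncio

from crawlee.crawlers import BeautifulSoupCrawler, PlaywrightCrawler
from crawlee.statistics import Statistics


async def main() -> None:
    # One Statistics instance shared by both crawlers (assumption: the
    # no-argument constructor is available in your crawlee version).
    shared_stats = Statistics()

    crawler_1 = PlaywrightCrawler(statistics=shared_stats)
    crawler_2 = BeautifulSoupCrawler(statistics=shared_stats)

    # ... register request handlers here, then run the crawlers in sequence.


if __name__ == '__main__':
    asyncio.run(main())
```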

I'm not sure I understand what you want to do with FinalStatistics, but that object is calculated by calling the Statistics.calculate() method - you can do that instead of relying on BasicCrawler.run to return it.
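
For the export part of the question, here is a sketch under the assumption that `FinalStatistics` is a plain dataclass (its fields match the rows in the log tables below), so it can be serialized by hand:

```python
import json
from dataclasses import asdict, is_dataclass

# Take a snapshot whenever you like; BasicCrawler.run also returns one.
final = shared_stats.calculate()

# Assumption: FinalStatistics is a dataclass; fall back to vars() otherwise.
data = asdict(final) if is_dataclass(final) else vars(final)

with open('stats.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, indent=2, default=str)  # default=str covers non-JSON types
```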

@francomanca93

Thanks @janbuchar. Your answer works for me, but the runtime of the first scraper is not carried over into the second scraper's statistics. Here are the logs:

```
[crawlee.statistics._statistics] INFO  Statistics
┌───────────────────────────────┬─────────┐
│ requests_finished             │ 0       │
│ requests_failed               │ 0       │
│ retry_histogram               │ [0]     │
│ request_avg_failed_duration   │ None    │
│ request_avg_finished_duration │ None    │
│ requests_finished_per_minute  │ 0       │
│ requests_failed_per_minute    │ 0       │
│ request_total_duration        │ 0.0     │
│ requests_total                │ 0       │
│ crawler_runtime               │ 0.02422 │
└───────────────────────────────┴─────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0.0; mem = 0.0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._playwright._playwright_crawler] INFO  Navigating to ...
[crawlee.crawlers._playwright._playwright_crawler] INFO  --- Fetch cookies ---
[crawlee.crawlers._playwright._playwright_crawler] INFO  --- End of cookies ---
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._playwright._playwright_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute  │ 4         │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.137499 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 14.854891 │
└───────────────────────────────┴───────────┘
[rich] INFO  Found 12 cookies
>>>>> Here finished the execution of the first scraper (Playwright) <<<<<<
[rich] INFO  Fetching items...
[crawlee.statistics._statistics] INFO  Statistics
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 1         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [1]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 11.137499 │
│ requests_finished_per_minute  │ 2543      │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 11.137499 │
│ requests_total                │ 1         │
│ crawler_runtime               │ 0.023598  │
└───────────────────────────────┴───────────┘
[crawlee._autoscaling.autoscaled_pool] INFO  current_concurrency = 0; desired_concurrency = 2; cpu = 0; mem = 0; event_loop = 0.0; client_info = 0.0
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 0. Items remaining: 31 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 1. Items remaining: 11 of 31
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - Page index: 2. Items remaining: 0 of 40
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO - All items already processed
[crawlee._autoscaling.autoscaled_pool] INFO  Waiting for remaining tasks to finish
[crawlee.crawlers._abstract_http._abstract_http_crawler] INFO  Final request statistics:
┌───────────────────────────────┬───────────┐
│ requests_finished             │ 4         │
│ requests_failed               │ 0         │
│ retry_histogram               │ [4]       │
│ request_avg_failed_duration   │ None      │
│ request_avg_finished_duration │ 3.891656  │
│ requests_finished_per_minute  │ 13        │
│ requests_failed_per_minute    │ 0         │
│ request_total_duration        │ 15.566624 │
│ requests_total                │ 4         │
│ crawler_runtime               │ 18.088441 │
└───────────────────────────────┴───────────┘
```

Could it be the way I'm running them? Here is a summary of the code:

```python
http_client = ...
my_stats = ...

crawler_1 = PlaywrightCrawler(statistics=my_stats)
# ... here I have stored the cookies in a storage/dataset/cookies
await crawler_1.run([my_url])

crawler_2 = BeautifulSoupCrawler(statistics=my_stats)
# Here I add the cookies I stored before.
await crawler_2.run([my_url])
```
@janbuchar

Yeah, it looks like the Statistics class overwrites the timestamp of the start of the crawl. Sadly, this renders the per_minute statistics in the second FinalStatistics object unusable.

Could you explain what you're trying to do, so that we can come up with a solution, or perhaps an idea for improving the Statistics class?

@francomanca93

I want to extract cookies from a page using PlaywrightCrawler because they differ from those obtained with BeautifulSoupCrawler (which I recently replaced with HttpCrawler to simplify my scrapers). However, sometimes the Playwright-based scraper fails, requiring a retry. Once I successfully retrieve the cookies, I save them, close the PlaywrightCrawler (terminating it), and then continue scraping using HttpCrawler, passing the saved cookies to it.

The reason for this approach is that I need specific cookies to extract information from the site's hidden API, which is used to hydrate the page.
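
A rough sketch of that hand-off follows. The handler name and the `Cookie`-header approach are my own assumptions rather than an established crawlee recipe; Playwright itself does expose the browser context's cookies:

```python
from crawlee.crawlers import PlaywrightCrawler, PlaywrightCrawlingContext

collected_cookies: list[dict] = []

playwright_crawler = PlaywrightCrawler()


@playwright_crawler.router.default_handler
async def grab_cookies(context: PlaywrightCrawlingContext) -> None:
    # The Playwright page gives access to its browser context's cookies.
    collected_cookies.extend(await context.page.context.cookies())


# After playwright_crawler.run([...]) finishes, serialize the cookies into
# a header value that the HTTP-based crawler can send with its requests.
cookie_header = '; '.join(
    f'{c["name"]}={c["value"]}' for c in collected_cookies
)
```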

Additionally, I want to save statistics for the entire process because, in the future, I plan to run multiple distributed scrapers. I will use Kubernetes to manage them, and these statistics will help me monitor and evaluate their performance.

@janbuchar

I see. Then I'd probably make a fresh Statistics instance for every crawler run, and after the run finishes, I'd store it separately. This way, you can be sure about how to interpret the data there.
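
In code, that per-run approach might look like the sketch below (same assumptions as above about the `Statistics()` constructor and the dataclass export; request handlers omitted):

```python
import json
from dataclasses import asdict

from crawlee.crawlers import HttpCrawler, PlaywrightCrawler
from crawlee.statistics import Statistics


async def run_with_fresh_stats(my_url: str) -> None:
    runs: list[dict] = []

    # First run: Playwright, with its own Statistics instance.
    pw_stats = Statistics()
    pw_crawler = PlaywrightCrawler(statistics=pw_stats)
    await pw_crawler.run([my_url])
    runs.append({'crawler': 'playwright', **asdict(pw_stats.calculate())})

    # Second run: plain HTTP, again with a fresh instance.
    http_stats = Statistics()
    http_crawler = HttpCrawler(statistics=http_stats)
    await http_crawler.run([my_url])
    runs.append({'crawler': 'http', **asdict(http_stats.calculate())})

    # One JSON document per process run, ready for later analysis.
    with open('run_stats.json', 'w', encoding='utf-8') as f:
        json.dump(runs, f, indent=2, default=str)
```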
