flairNLP/fundus

A very simple news crawler with a funny name

A very simple news crawler in Python. Developed at Humboldt University of Berlin.

Quick Start | Tutorials | News Sources | Paper

Disclaimer: Although we try to provide an indication of whether a publisher has not explicitly objected to the training of AI models on its data, we would like to point out that this information must be verified independently before their content is used. More details can be found here.

Fundus is:

A static news crawler. Fundus lets you crawl online news articles with only a few lines of Python code, be it from live websites or the CC-NEWS dataset.
An open-source Python package. Fundus is built on the idea of building something together. We welcome your contribution to help Fundus grow!

Quick Start

To install from pip, simply do:

pip install fundus

Fundus requires Python 3.8+.

Example 1: Crawl a bunch of English-language news articles

Let's use Fundus to crawl 2 articles from publishers based in the US.

    from fundus import PublisherCollection, Crawler

    # initialize the crawler for news publishers based in the US
    crawler = Crawler(PublisherCollection.us)

    # crawl 2 articles and print
    for article in crawler.crawl(max_articles=2):
        print(article)

That's all it takes!

If you run this code, it should print out something like this:

    Fundus-Article including 1 image(s):
    - Title: "Feinstein's Return Not Enough for Confirmation of Controversial New [...]"
    - Text:  "89-year-old California senator arrived hour late to Judiciary Committee hearing
              to advance President Biden's stalled nominations  Democrats [...]"
    - URL:    https://freebeacon.com/politics/feinsteins-return-not-enough-for-confirmation-of-controversial-new-hampshire-judicial-nominee/
    - From:   The Washington Free Beacon (2023-05-11 18:41)

    Fundus-Article including 3 image(s):
    - Title: "Northwestern student government freezes College Republicans funding over [...]"
    - Text:  "Student government at Northwestern University in Illinois "indefinitely" froze
              the funds of the university's chapter of College Republicans [...]"
    - URL:    https://www.foxnews.com/us/northwestern-student-government-freezes-college-republicans-funding-poster-critical-lgbtq-community
    - From:   Fox News (2023-05-09 14:37)

This printout tells you that you successfully crawled two articles!

For each article, the printout details:

the number of images included in the article
the "Title" of the article, i.e. its headline
the "Text", i.e. the main article body text
the "URL" from which it was crawled
the news source it is "From"
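The fields listed above can also be collected programmatically. The following is a minimal sketch, not Fundus's own serialization: `article_to_record` is a hypothetical helper, and the attribute names `title`, `plaintext`, and `publisher` are assumptions about the article object that should be checked against the Fundus tutorials. A stand-in object is used so the snippet runs without network access.

```python
import json
from types import SimpleNamespace


def article_to_record(article):
    """Collect a few article fields into a plain dict.

    The attribute names below are assumptions for illustration; missing
    attributes fall back to None instead of raising.
    """
    return {
        "title": getattr(article, "title", None),
        "text": getattr(article, "plaintext", None),
        "publisher": getattr(article, "publisher", None),
    }


# Stand-in for a crawled article (hypothetical values, no network needed).
fake_article = SimpleNamespace(
    title="Example headline",
    plaintext="Example body text.",
    publisher="Example News",
)

record = article_to_record(fake_article)
print(json.dumps(record, indent=2))
```

In a real crawl, the same helper could be applied to each article yielded by `crawler.crawl(...)`, provided the assumed attribute names match.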

Example 2: Crawl a specific news source

Maybe you want to crawl a specific news source instead. Let's crawl news articles from The New Yorker only:

    from fundus import PublisherCollection, Crawler

    # initialize the crawler for The New Yorker
    crawler = Crawler(PublisherCollection.us.TheNewYorker)

    # crawl 2 articles and print
    for article in crawler.crawl(max_articles=2):
        print(article)

Example 3: Crawl 1 Million articles

To crawl such a vast amount of data, Fundus relies on the CommonCrawl web archive, in particular the news crawl CC-NEWS. If you're not familiar with CommonCrawl or CC-NEWS, check out their websites. Simply import our CCNewsCrawler and make sure to check out our tutorial beforehand.

    from fundus import PublisherCollection, CCNewsCrawler

    # initialize the crawler using all publishers supported by fundus
    crawler = CCNewsCrawler(*PublisherCollection)

    # crawl 1 million articles and print
    for article in crawler.crawl(max_articles=1000000):
        print(article)

Note: By default, the crawler utilizes all available CPU cores on your system. For optimal performance, we recommend manually setting the number of processes using the processes parameter. A good rule of thumb is to allocate one process per 200 Mbps of bandwidth. This can vary depending on core speed.
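The rule of thumb above can be turned into a small calculation. This is a sketch only: `suggest_processes` is a hypothetical helper, not part of Fundus, and capping the result at the number of CPU cores is an added assumption rather than something the note prescribes.

```python
import os


def suggest_processes(bandwidth_mbps, mbps_per_process=200, cpu_count=None):
    """Suggest a process count from the one-process-per-200-Mbps rule of thumb.

    The result is capped at the number of CPU cores and never drops below 1.
    """
    if cpu_count is None:
        cpu_count = os.cpu_count() or 1
    by_bandwidth = int(bandwidth_mbps // mbps_per_process)
    return max(1, min(cpu_count, by_bandwidth))


print(suggest_processes(1000, cpu_count=16))  # 1000 // 200 = 5 processes
print(suggest_processes(100, cpu_count=16))   # below 200 Mbps: still 1 process
```

The resulting value could then be supplied through the `processes` parameter mentioned in the note; since core speed also matters, treat it as a starting point for tuning.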

Note: The crawl above took ~7 hours using the entire PublisherCollection on a machine with a 1000 Mbps connection, a Core i9-13905H, 64 GB RAM, and Windows 11, without printing the articles. The estimated time can vary substantially depending on the publishers used and the available bandwidth. Additionally, not all publishers are included in the CC-NEWS crawl (especially US-based publishers). For large corpus creation, one can also use the regular crawler restricted to sitemaps, which requires significantly less bandwidth.

    from fundus import PublisherCollection, Crawler, Sitemap

    # initialize a crawler for US/UK-based publishers and restrict to Sitemaps only
    crawler = Crawler(PublisherCollection.us, PublisherCollection.uk, restrict_sources_to=[Sitemap])

    # crawl 1 million articles and print
    for article in crawler.crawl(max_articles=1000000):
        print(article)
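For planning purposes, the ~7 hour figure reported for the CC-NEWS crawl can be turned into a rough throughput estimate. This is a back-of-the-envelope sketch that assumes crawl time scales linearly with the number of articles, which may not hold across publishers, bandwidths, or the sitemap-based crawler; `estimate_hours` is a hypothetical helper.

```python
REFERENCE_ARTICLES = 1_000_000  # articles crawled in the reported run
REFERENCE_HOURS = 7.0           # approximate duration of that run


def articles_per_second():
    """Average throughput of the reported reference run."""
    return REFERENCE_ARTICLES / (REFERENCE_HOURS * 3600)


def estimate_hours(n_articles):
    """Linearly scale the reference run to another corpus size (assumption)."""
    return n_articles / REFERENCE_ARTICLES * REFERENCE_HOURS


print(f"{articles_per_second():.1f} articles/s")             # 39.7 articles/s
print(f"{estimate_hours(250_000):.2f} h for 250k articles")  # 1.75 h
```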

Example 4: Crawl some images

By default, Fundus tries to parse the images included in every crawled article. Let's crawl an article and print out the images for some more details.

    from fundus import PublisherCollection, Crawler

    # initialize the crawler for The LA Times
    crawler = Crawler(PublisherCollection.us.LATimes)

    # crawl 1 article and print the images
    for article in crawler.crawl(max_articles=1):
        for image in article.images:
            print(image)

For this article, you will get the following output:

    Fundus-Article Cover-Image:
    -URL: 'https://ca-times.brightspotcdn.com/dims4/default/41c9bc4/2147483647/strip/true/crop/4598x3065+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F77%2Feb%2F7fed2d3942fd97b0f7325e7060cf%2Flakers-timberwolves-basketball-33765.jpg'
    -Description: 'Minnesota Timberwolves forward Julius Randle (30) works toward the basket.'
    -Caption: 'Minnesota Timberwolves forward Julius Randle, left, controls the ball in front of Lakers forward Anthony Davis during the first half of the Lakers’ 97-87 loss Friday.'
    -Authors: ['Abbie Parr / Associated Press']
    -Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

    Fundus-Article Image:
    -URL: 'https://ca-times.brightspotcdn.com/dims4/default/9a22715/2147483647/strip/true/crop/4706x3137+0+0/resize/1200x800!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2Ff7%2F52%2Fdcd6b263480ab579ac583a4fdbbf%2Flakers-timberwolves-basketball-48004.jpg'
    -Description: 'Lakers coach JJ Redick talks with forward Anthony Davis during a loss to the Timberwolves.'
    -Caption: 'Lakers coach JJ Redick, right, talks with forward Anthony Davis during the first half of a 97-87 loss to the Timberwolves on Friday night.'
    -Authors: ['Abbie Parr / Associated Press']
    -Versions: [320x213, 568x379, 768x512, 1024x683, 1200x800]

    Fundus-Article Image:
    -URL: 'https://ca-times.brightspotcdn.com/dims4/default/580bae4/2147483647/strip/true/crop/5093x3470+0+0/resize/1200x818!/format/webp/quality/75/?url=https%3A%2F%2Fcalifornia-times-brightspot.s3.amazonaws.com%2F3b%2Fdf%2F64c0198b4c2fb2b5824aaccb64b7%2F1486148-sp-nba-lakers-trailblazers-25-gmf.jpg'
    -Description: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James.'
    -Caption: 'Lakers star LeBron James sits in street clothes on the bench next to his son, Bronny James, during a win over Portland at Crypto.com Arena on Dec. 8.'
    -Authors: ['Gina Ferazzi / Los Angeles Times']
    -Versions: [320x218, 568x387, 768x524, 1024x698, 1200x818]

For each image, the printout details:

The cover image designation (if applicable).
The URL for the highest-resolution version of the image.
A description of the image.
The image's caption.
The name of the copyright holder.
A list of all available versions of the image.
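The `Versions` entry in the printout lists resolutions as `WIDTHxHEIGHT` strings. As a minimal sketch that works on the printed text only (it does not use Fundus's image objects, whose attributes are not shown here), the hypothetical helper `largest_version` picks the resolution with the most pixels:

```python
def parse_version(version):
    """Turn a 'WIDTHxHEIGHT' string such as '320x213' into (width, height)."""
    width, height = version.lower().split("x")
    return int(width), int(height)


def largest_version(versions):
    """Return the version string covering the most pixels."""
    def pixel_count(version):
        width, height = parse_version(version)
        return width * height
    return max(versions, key=pixel_count)


# Versions copied from the cover image in the printout above.
versions = ["320x213", "568x379", "768x512", "1024x683", "1200x800"]
print(largest_version(versions))  # 1200x800
```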

Tutorials

We provide quick tutorials to get you started with the library.

If you wish to contribute, check out the contribution tutorials.

Currently Supported News Sources

You can find the publishers currently supported here.

Also: Adding a new publisher is easy - consider contributing to the project!

Evaluation Benchmark

Check out our evaluation benchmark.

The following table summarizes the overall performance of Fundus and evaluated scrapers in terms of averaged ROUGE-LSum precision, recall and F1-score and their standard deviation. The table is sorted in descending order over the F1-score:

Scraper	Precision	Recall	F1-Score	Version
Fundus	99.89 ± 0.57	96.75 ± 12.75	97.69 ± 9.75	0.4.1
Trafilatura	93.91 ± 12.89	96.85 ± 15.69	93.62 ± 16.73	1.12.0
news-please	97.95 ± 10.08	91.89 ± 16.15	93.39 ± 14.52	1.6.13
BTE	81.09 ± 19.41	98.23 ± 8.61	87.14 ± 15.48	/
jusText	86.51 ± 18.92	90.23 ± 20.61	86.96 ± 19.76	3.0.1
BoilerNet	85.96 ± 18.55	91.21 ± 19.15	86.52 ± 18.03	/
Boilerpipe	82.89 ± 20.65	82.11 ± 29.99	79.90 ± 25.86	1.3.0
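Note that the F1 column is not simply the harmonic mean of the Precision and Recall columns. Assuming the scores are computed per article and then averaged (as the word "averaged" in the description suggests), the mean of per-article F1 scores generally differs from the F1 of the mean precision and recall. The per-article numbers in this sketch are made up for illustration and do not come from the benchmark:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Made-up per-article (precision, recall) pairs, for illustration only.
scores = [(1.0, 1.0), (1.0, 0.5)]

mean_of_f1 = mean(f1(p, r) for p, r in scores)
f1_of_means = f1(mean(p for p, _ in scores), mean(r for _, r in scores))

print(round(mean_of_f1, 4))   # 0.8333
print(round(f1_of_means, 4))  # 0.8571

# Harmonic mean of the Fundus row's averaged precision and recall:
print(round(f1(99.89, 96.75), 2))  # 98.29, versus the reported 97.69
```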

Cite

Please cite the following paper when using Fundus or building upon our work:

    @inproceedings{dallabetta-etal-2024-fundus,
        title = "Fundus: A Simple-to-Use News Scraper Optimized for High Quality Extractions",
        author = "Dallabetta, Max  and
          Dobberstein, Conrad  and
          Breiding, Adrian  and
          Akbik, Alan",
        editor = "Cao, Yixin  and
          Feng, Yang  and
          Xiong, Deyi",
        booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)",
        month = aug,
        year = "2024",
        address = "Bangkok, Thailand",
        publisher = "Association for Computational Linguistics",
        url = "https://aclanthology.org/2024.acl-demos.29",
        pages = "305--314",
    }

Contact

Please email your questions or comments to Max Dallabetta.

Contributing

Thanks for your interest in contributing! There are many ways to get involved; start with our contributor guidelines and then check these open issues for specific tasks.

License

MIT
