Common Crawl Foundation

Common Crawl provides an archive of webpages going back to 2007.

Verified
We've verified that the organization commoncrawl controls the domain:
- commoncrawl.org

Pinned

cc-pyspark Public
Process Common Crawl data with Python and Spark
Python · 422 stars · 89 forks
cc-crawl-statistics Public
Statistics of Common Crawl monthly archives mined from URL index files
Python · 175 stars · 11 forks
cc-index-table Public
Index Common Crawl archives in tabular format
Java · 113 stars · 9 forks
cc-warc-examples Public
Forked from Smerity/cc-warc-examples
CommonCrawl WARC/WET/WAT examples and processing code for Java + Hadoop
Java · 38 stars · 18 forks
cc-citations Public
Scientific articles using or citing Common Crawl data
Jupyter Notebook · 19 stars · 3 forks
cc-notebooks Public
Various Jupyter notebooks about Common Crawl data
Jupyter Notebook · 51 stars · 9 forks

Repositories

Showing 10 of 65 repositories

robotstxt-experiments Public
How is the Robots Exclusion Protocol (robots.txt) used in the WWW? This project tries to gain some insights by mining Common Crawl's robots.txt captures from the years 2016 – 2024.
Jupyter Notebook · 0 stars · MIT · 0 forks · 0 issues · 0 pull requests · Updated Mar 27, 2025
web-languages Public
Crowd-sourced lists of URLs to help Common Crawl crawl under-resourced languages. See https://github.com/commoncrawl/web-languages-code/ for the code.
38 stars · 39 forks · 0 issues · 0 pull requests · Updated Mar 24, 2025
cc-downloader Public
A polite and user-friendly downloader for Common Crawl data
Rust · 36 stars · Apache-2.0 · 1 fork · 2 issues (1 issue needs help) · 0 pull requests · Updated Mar 22, 2025
cc-citations Public
Scientific articles using or citing Common Crawl data
Jupyter Notebook · 19 stars · 3 forks · 0 issues · 0 pull requests · Updated Mar 21, 2025
cc-crawl-statistics Public
Statistics of Common Crawl monthly archives mined from URL index files
Python · 175 stars · Apache-2.0 · 11 forks · 0 issues · 0 pull requests · Updated Mar 16, 2025
nutch Public Forked from Aloisius/nutch
Common Crawl fork of Apache Nutch
Java · 32 stars · Apache-2.0 · 1,260 forks · 6 issues (1 issue needs help) · 0 pull requests · Updated Mar 15, 2025
crawler-commons Public Forked from crawler-commons/crawler-commons
A set of reusable Java components that implement functionality common to any web crawler
Java · 2 stars · Apache-2.0 · 82 forks · 0 issues · 0 pull requests · Updated Mar 13, 2025
ia-web-commons Public Forked from Aloisius/ia-web-commons
Web archiving utility library
Java · 11 stars · Apache-2.0 · 76 forks · 4 issues · 1 pull request · Updated Mar 12, 2025
web-languages-code Public
The code used to generate templates for the web-languages repo: https://github.com/commoncrawl/web-languages
Python · 2 stars · Apache-2.0 · 1 fork · 0 issues · 1 pull request · Updated Mar 11, 2025
cc-index-table Public
Index Common Crawl archives in tabular format
Java · 113 stars · Apache-2.0 · 9 forks · 5 issues · 3 pull requests · Updated Mar 10, 2025