Tools to construct and process web graphs from Common Crawl data
Java 11 or higher is required.
The Java tools are compiled and packaged by Maven. If Maven is installed, just run `mvn package`. The Java tools can then be run via

```
java -cp target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar <classname> <args>...
```
The assembly jar file also includes the WebGraph and LAW packages required to compute PageRank and Harmonic Centrality.
The Javadocs are created by `mvn javadoc:javadoc`. Then open the file `target/site/apidocs/index.html` in a browser.
Note that the webgraphs are usually multiple gigabytes in size and require for processing

- a sufficient Java heap size (Java option `-Xmx`)
- enough disk space to store the graphs and temporary data.

The exact requirements depend on the graph size and the task – graph exploration or ranking, etc.
The host-level web graph is built with the help of PySpark; the corresponding code is found in the project cc-pyspark. Instructions are found in the script build_hostgraph.sh.
The domain-level web graph is distilled from the host-level graph by mapping host names to domain names. The ID mapping is kept in memory, as an int array or as FastUtil's big array if the host-level graph has more vertices than a Java array can hold (around 2³¹). The Java tool to fold the host graph is best run from the script host2domaingraph.sh.
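FastUtil's big arrays are two-dimensional Java arrays (`int[][]`) addressed through a single `long` index, which lifts the ~2³¹ length limit of a plain `int[]`. A minimal sketch of the idea, not the repository's actual code (the class name and the sizes here are hypothetical, and a big array of this size needs a correspondingly large heap):

```java
import it.unimi.dsi.fastutil.ints.IntBigArrays;

public class HostToDomainMappingSketch {
    public static void main(String[] args) {
        // Hypothetical: more host vertices than a single int[] can hold.
        final long numHosts = 3_000_000_000L;

        // A "big array" is an int[][] addressed by one long index.
        int[][] hostToDomain = IntBigArrays.newBigArray(numHosts);

        // Record that host vertex 2_500_000_000 maps to domain vertex 42.
        IntBigArrays.set(hostToDomain, 2_500_000_000L, 42);

        // Look the mapping up again.
        int domainId = IntBigArrays.get(hostToDomain, 2_500_000_000L);
        System.out.println("domain vertex: " + domainId);
    }
}
```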
To analyze the graph structure and calculate rankings, you may further process the graphs using software from the Laboratory for Web Algorithmics (LAW) at the University of Milano, namely the WebGraph framework and the LAW library.
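To give a flavor of the WebGraph API before diving into the scripts below, here is a minimal sketch (it assumes a graph was already built under the basename `graph_name`; for graphs with more than 2³¹ vertices, the webgraph-big variant `it.unimi.dsi.big.webgraph` offers the same API with long IDs):

```java
import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.ImmutableGraph;
import it.unimi.dsi.webgraph.LazyIntIterator;

public class GraphPeekSketch {
    public static void main(String[] args) throws Exception {
        // Load a BV-compressed graph by its basename; for graphs larger
        // than the heap, BVGraph.loadMapped() memory-maps the data instead.
        ImmutableGraph graph = BVGraph.load("graph_name");

        System.out.println("nodes: " + graph.numNodes());
        System.out.println("arcs:  " + graph.numArcs());

        // Enumerate the out-links of vertex 0.
        LazyIntIterator successors = graph.successors(0);
        for (int d = graph.outdegree(0); d-- != 0; ) {
            System.out.println("0 -> " + successors.nextInt());
        }
    }
}
```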
A couple of scripts to help you run the WebGraph tools to build and process the graphs are provided in `src/script/webgraph_ranking/`. They're also used to prepare the Common Crawl web graph releases.
To process a webgraph and rank the nodes, you should first adapt the configuration to your graph and hardware setup:

```
vi ./src/script/webgraph_ranking/webgraph_config.sh
```
After running

```
./src/script/webgraph_ranking/process_webgraph.sh graph_name vertices.txt.gz edges.txt.gz output_dir
```

the `output_dir/` should contain all generated files, e.g. `graph_name.graph` and `graph_name-ranks.txt.gz`.
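As an illustration of the expected input layout (based on the published Common Crawl web graph releases, not a format specification): `vertices.txt.gz` lists one vertex per line as a numeric ID and a label (host names written in reversed domain name order), and `edges.txt.gz` lists one arc per line as a pair of vertex IDs, tab-separated:

```
# vertices.txt.gz: <id> <label>, one vertex per line
0	com.example
1	com.example.www
2	org.example

# edges.txt.gz: <from> <to>, one arc per line
0	2
1	0
1	2
```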
The shell script is easily adapted to your needs. Please refer to the LAW dataset tutorial, the API docs of LAW and webgraph for further information.
The Common Crawl webgraph data sets are announced on the Common Crawl web site.
For instructions on how to explore the webgraphs using the JShell, please see the tutorial Interactive Graph Exploration. For an older approach using Jython and pyWebGraph, see the cc-notebooks project.
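As a quick teaser (the tutorial shows the project's own helper classes; this sketch uses only the webgraph API and assumes a graph built under the basename `graph_name`): start JShell with the assembly jar on the class path, e.g. `jshell --class-path target/cc-webgraph-0.1-SNAPSHOT-jar-with-dependencies.jar`, then:

```java
import it.unimi.dsi.webgraph.BVGraph;
import it.unimi.dsi.webgraph.ImmutableGraph;
import java.util.Arrays;

// memory-map the graph instead of loading it onto the heap
ImmutableGraph graph = BVGraph.loadMapped("graph_name");

int v = 12345;                 // hypothetical vertex ID
int d = graph.outdegree(v);    // number of out-links of v

// successorArray(v) may be longer than the outdegree;
// only its first d entries are valid successors
int[] out = Arrays.copyOf(graph.successorArray(v), d);
System.out.println(v + " -> " + Arrays.toString(out));
```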
Thanks to the authors of the WebGraph framework used to process the graphs and compute PageRank and harmonic centrality. See also Sebastiano Vigna's projects webgraph and webgraph-big.