Apache Nutch is an extensible and scalable web crawler

License

Apache-2.0

Apache Nutch README

For the latest information about Nutch, please visit our website at:

https://nutch.apache.org/

and our wiki, at:

https://cwiki.apache.org/confluence/display/NUTCH/Home

To get started using Nutch, read the tutorial:

https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

Contributing

To contribute a patch, follow these instructions (note that installing hub is not strictly required, but is recommended).

1. Download and install hub from hub.github.com.
2. File a JIRA issue for your fix at https://issues.apache.org/jira/projects/NUTCH/issues. You will get an issue ID of the form NUTCH-xxxx, where xxxx is the issue number.
3. git clone https://github.com/apache/nutch.git
4. cd nutch
5. git checkout -b NUTCH-xxxx
6. Edit files (please try to include a test case if possible).
7. git status (make sure it shows the files you expected to edit)
8. Make sure that your code complies with the Nutch code formatting template, which is basically two-space indents.
9. git add <files>
10. git commit -m "fix for NUTCH-xxxx contributed by <your username>"
11. hub fork (if hub is not installed, you can fork the project using the "fork" button on the Nutch GitHub project page)
12. git push -u <your git username> NUTCH-xxxx
13. hub pull-request (if hub is not installed, please follow the instructions on how to create a pull request from a fork)
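
For contributors who skip hub, the steps above reduce to a short sequence of plain git commands. The sketch below only prints that sequence as a dry run for a given issue and user. The values `NUTCH-1234` and `alice`, the helper name `contribution_commands`, and the fork URL pattern `https://github.com/<user>/nutch.git` are illustrative assumptions, not values taken from the Nutch project.

```shell
#!/bin/sh
# Dry-run sketch: print the git commands for one contribution.
# Nothing is executed against a repository; the output is meant to be read or copied.

contribution_commands() {
  issue="$1"   # JIRA issue ID, e.g. NUTCH-1234 (hypothetical)
  user="$2"    # your GitHub username, e.g. alice (hypothetical)
  cat <<EOF
git clone https://github.com/apache/nutch.git
cd nutch
git checkout -b ${issue}
git status
git add <files>
git commit -m "fix for ${issue} contributed by ${user}"
git remote add ${user} https://github.com/${user}/nutch.git
git push -u ${user} ${issue}
EOF
}

# Example invocation with placeholder values.
contribution_commands "NUTCH-1234" "alice"
```

The `git remote add` line is only needed when the fork was created with the web button and assumes the fork lives at the URL shown; adjust it if your fork is named differently.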

IDE setup

Eclipse

Generate the Eclipse project files with

ant eclipse

and follow the instructions in "Importing existing projects".

You must configure the nutch-site.xml before running. Make sure you've added the http.agent.name and plugin.folders properties. The plugin.folders property normally points to <project_root>/build/plugins.
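
A minimal nutch-site.xml with these two properties might look like the file this sketch writes. The agent name `MyNutchCrawler`, the `NUTCH_HOME` and `SITE_XML` variables, and the default output path are assumptions for illustration. The script overwrites the target file, so point `SITE_XML` somewhere safe if you already have a configuration.

```shell
#!/bin/sh
# Sketch: write a minimal nutch-site.xml containing http.agent.name and plugin.folders.
# NUTCH_HOME and SITE_XML are hypothetical variables introduced for this example.
NUTCH_HOME="${NUTCH_HOME:-$(pwd)}"
SITE_XML="${SITE_XML:-$NUTCH_HOME/conf/nutch-site.xml}"

mkdir -p "$(dirname "$SITE_XML")"

# WARNING: this overwrites any existing file at $SITE_XML.
cat > "$SITE_XML" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <!-- Placeholder agent name; choose your own. -->
    <value>MyNutchCrawler</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <!-- Normally <project_root>/build/plugins. -->
    <value>${NUTCH_HOME}/build/plugins</value>
  </property>
</configuration>
EOF

echo "Wrote $SITE_XML"
```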

Now create a Java Application Configuration, choose org.apache.nutch.crawl.Injector as the main class, and add two paths as arguments. The first is the crawldb directory; the second is the URL directory from which the injector reads URLs. Now run your configuration.
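
The two program arguments can be prepared ahead of time. This sketch creates a URL directory with a one-line seed file and prints the argument string to paste into the run configuration. The directory names `crawl/crawldb` and `urls`, the file name `seed.txt`, and the `CRAWL_BASE` variable are assumptions, and the seed URL is simply the project website.

```shell
#!/bin/sh
# Sketch: prepare the two Injector arguments (crawldb directory, URL directory).
# CRAWL_BASE and the directory/file names below are hypothetical.
CRAWL_BASE="${CRAWL_BASE:-$(pwd)}"
CRAWLDB_DIR="$CRAWL_BASE/crawl/crawldb"
URL_DIR="$CRAWL_BASE/urls"

# The URL directory holds plain-text files with one URL per line.
mkdir -p "$URL_DIR"
printf '%s\n' "https://nutch.apache.org/" > "$URL_DIR/seed.txt"

# First argument: crawldb directory; second argument: URL directory.
INJECTOR_ARGS="$CRAWLDB_DIR $URL_DIR"
echo "Program arguments: $INJECTOR_ARGS"
```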

If you still see the message No plugins found on paths of property plugin.folders="plugins", update plugin.folders in nutch-default.xml. This is a quick fix, but it is not recommended.

IntelliJ IDEA

First install the IvyIDEA plugin, then run ant eclipse. This will create the necessary .classpath and .project files so that IntelliJ can import the project in the next step.

In IntelliJ IDEA, select File > New > Project from Existing Sources. Select the nutch home directory and click "Open".

On the "Import Project" screen select the "Import project from external model" radio button and select "Eclipse". Click "Create". On the next screen the "Eclipse projects directory" should already be set to the nutch folder. Leave the "Create module files near .classpath files" radio button selected. Click "Next" on the next screens. On the project SDK screen select Java 11 and click "Create".

N.B. For anyone on a Mac with a Homebrew-installed openjdk, you need to use the directory under libexec: <openjdk11_directory>/libexec/openjdk.jdk/Contents/Home.

Once the project is imported, you will see a popup saying "Ant build scripts found", "Frameworks detected - IvyIDEA Framework detected". Click "Import". If you don't get the popup, go through the steps again, as this happens from time to time. There is another Ant popup that asks you to configure the project. Do NOT click "Configure".

To import the code style, go to IntelliJ IDEA > Preferences > Editor > Code Style > Java.

For the Scheme dropdown select "Project". Click the gear icon and select "Import Scheme" > "Eclipse XML file".

Select the eclipse-codeformat.xml file and click "Open". On the next screen check the "Current Scheme" checkbox and hit OK.

Running in IntelliJ IDEA

1. Open Run/Debug Configurations.
2. Select "+" to create a new configuration and select "Application".
3. For "Main Class" enter a class with a main function (e.g. org.apache.nutch.indexer.IndexingJob).
4. For "Program Arguments" add the arguments needed for the class. You can get these by running the crawl executable for your job. Use fully qualified paths (e.g. /Users/kamil/workspace/external/nutch/crawl/crawldb /Users/kamil/workspace/external/nutch/crawl/segments/20221222160141 -deleteGone).
5. For "Working Directory" enter the runtime/local directory of your checkout (e.g. /Users/kamil/workspace/external/nutch/runtime/local).
6. Select "Modify options" > "Modify Classpath" and add the config directory belonging to the "Working Directory" from the previous step (e.g. /Users/kamil/workspace/external/nutch/runtime/local/conf). This will allow the resource loader to load that configuration.
7. Select "Modify options" > "Add VM Options". Add the VM options needed. You can get these by running the crawl executable for your job (e.g. -Xmx4096m -Dhadoop.log.dir=/Users/kamil/workspace/external/nutch/runtime/local/logs -Dhadoop.log.file=hadoop.log -Dmapreduce.job.reduces=2 -Dmapreduce.reduce.speculative=false -Dmapreduce.map.speculative=false -Dmapreduce.map.output.compress=true).
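
As a cross-check of the run configuration, the same pieces (working directory, VM options, classpath, main class, program arguments) can be assembled into a plain java command line. The sketch below only prints the command and does not run it. `NUTCH_LOCAL`, the shortened option list, the `<segment>` placeholder, and the classpath layout (`conf` plus the jars under `lib`) are assumptions about a built runtime/local directory, so compare the result against what the crawl executable actually reports.

```shell
#!/bin/sh
# Dry-run sketch: print a java command that mirrors the run configuration.
# NUTCH_LOCAL is a hypothetical variable pointing at <nutch>/runtime/local.
NUTCH_LOCAL="${NUTCH_LOCAL:-/path/to/nutch/runtime/local}"
MAIN_CLASS="org.apache.nutch.indexer.IndexingJob"

# VM options (a subset of the example in the steps above).
VM_OPTS="-Xmx4096m -Dhadoop.log.dir=$NUTCH_LOCAL/logs -Dhadoop.log.file=hadoop.log"

# Classpath: conf directory first so the resource loader finds the configuration,
# then the jars under lib (assumed layout).
CLASSPATH_ARG="$NUTCH_LOCAL/conf:$NUTCH_LOCAL/lib/*"

# Program arguments: crawldb, segment, and flags. <segment> is a placeholder.
PROGRAM_ARGS="/path/to/nutch/crawl/crawldb /path/to/nutch/crawl/segments/<segment> -deleteGone"

# The leading cd mirrors the "Working Directory" setting.
RUN_CMD="cd \"$NUTCH_LOCAL\" && java $VM_OPTS -cp \"$CLASSPATH_ARG\" $MAIN_CLASS $PROGRAM_ARGS"
echo "$RUN_CMD"
```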

Note: You will need to manually trigger a build through Ant to pick up the latest changes when running. This is because the Ant build system is separate from the IntelliJ one.
