A scalable web crawler framework for Java.
A scalable crawler framework. It covers the whole lifecycle of a crawler: downloading, URL management, content extraction and persistence. It simplifies the development of a specific crawler.
- Simple core with high flexibility.
- Simple API for HTML extraction.
- Annotation-based POJO mapping to customize a crawler, with no configuration.
- Multi-threading and distributed crawling support.
- Easy to integrate.
Add dependencies to your pom.xml:
```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>${webmagic.version}</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
</dependency>
```
WebMagic uses slf4j with the slf4j-log4j12 implementation. If you use your own slf4j implementation, please exclude slf4j-log4j12:
```xml
<exclusions>
    <exclusion>
        <groupId>org.slf4j</groupId>
        <artifactId>slf4j-log4j12</artifactId>
    </exclusion>
</exclusions>
```
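For context, the `<exclusions>` element goes inside the dependency declaration itself. A sketch for the webmagic-extension dependency, assuming the same `${webmagic.version}` property as above:

```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>${webmagic.version}</version>
    <exclusions>
        <exclusion>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-log4j12</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```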
Write a class that implements PageProcessor. For example, here is a crawler for GitHub repository information:
```java
public class GithubRepoPageProcessor implements PageProcessor {

    private Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        page.addTargetRequests(page.getHtml().links().regex("(https://github\\.com/\\w+/\\w+)").all());
        page.putField("author", page.getUrl().regex("https://github\\.com/(\\w+)/.*").toString());
        page.putField("name", page.getHtml().xpath("//h1[@class='public']/strong/a/text()").toString());
        if (page.getResultItems().get("name") == null) {
            // skip this page
            page.setSkip(true);
        }
        page.putField("readme", page.getHtml().xpath("//div[@id='readme']/tidyText()"));
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new GithubRepoPageProcessor()).addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```
`page.addTargetRequests(links)` adds URLs to the crawl queue.
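To illustrate what the `regex()` selector on the URL does in the processor above, here is the same author-extraction pattern expressed in plain `java.util.regex` (the class and helper names are hypothetical, for illustration only):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class UrlRegexDemo {

    // Apply the pattern from the processor to a URL and
    // return the first capture group (the repository owner).
    static String extractAuthor(String url) {
        Matcher m = Pattern.compile("https://github\\.com/(\\w+)/.*").matcher(url);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Prints "code4craft": group 1 captures the path segment after github.com/
        System.out.println(extractAuthor("https://github.com/code4craft/webmagic"));
    }
}
```

In WebMagic itself this is handled by the selector chain on `page.getUrl()`, so you rarely need to write the matcher by hand.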
You can also use the annotation style:
```java
@TargetUrl("https://github.com/\\w+/\\w+")
@HelpUrl("https://github.com/\\w+")
public class GithubRepo {

    @ExtractBy(value = "//h1[@class='public']/strong/a/text()", notNull = true)
    private String name;

    @ExtractByUrl("https://github\\.com/(\\w+)/.*")
    private String author;

    @ExtractBy("//div[@id='readme']/tidyText()")
    private String readme;

    public static void main(String[] args) {
        OOSpider.create(Site.me().setSleepTime(1000),
                new ConsolePageModelPipeline(), GithubRepo.class)
                .addUrl("https://github.com/code4craft").thread(5).run();
    }
}
```
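As a sanity check on the two annotation patterns above, the `@TargetUrl` regex only matches repository pages (two path segments) while the `@HelpUrl` regex matches user pages (one segment). A standalone sketch in plain `java.util.regex` (the class name is hypothetical):

```java
import java.util.regex.Pattern;

public class UrlPatternDemo {

    // The same patterns used in @TargetUrl / @HelpUrl above.
    static final Pattern TARGET = Pattern.compile("https://github\\.com/\\w+/\\w+");
    static final Pattern HELP = Pattern.compile("https://github\\.com/\\w+");

    static boolean isTarget(String url) { return TARGET.matcher(url).matches(); }
    static boolean isHelp(String url) { return HELP.matcher(url).matches(); }

    public static void main(String[] args) {
        System.out.println(isTarget("https://github.com/code4craft/webmagic")); // true
        System.out.println(isTarget("https://github.com/code4craft"));          // false
        System.out.println(isHelp("https://github.com/code4craft"));            // true
    }
}
```

Pages matching `@HelpUrl` are crawled only to discover links; fields are extracted only from pages matching `@TargetUrl`.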
Documentation: http://webmagic.io/docs/
The architecture of webmagic (modeled after Scrapy)
There are more examples in the webmagic-samples package.
Licensed under the Apache 2.0 license.
To write webmagic, I referred to the projects below:
- Scrapy, a crawler framework in Python.
- Spiderman, another crawler framework in Java.
- Mailing list: https://groups.google.com/forum/#!forum/webmagic-java
- QQ mailing list: http://list.qq.com/cgi-bin/qf_invite?id=023a01f505246785f77c5a5a9aff4e57ab20fcdde871e988
- QQ Groups: 373225642, 542327088
A web console based on WebMagic for Spider configuration and management.