fivesmallq/web-data-extractor

License

Apache-2.0 license

# web-data-extractor

Extracting and parsing structured data with jQuery selectors, XPath, or JsonPath from common web formats such as HTML, XML, and JSON.

Implementations:

- jQuery selector: Jsoup and Jerry
- XPath: Jdom2
- JsonPath: JsonPath

### Usage

To add a dependency on Web-Data-Extractor using Maven, use the following:

```xml
<dependency>
    <groupId>im.nll.data</groupId>
    <artifactId>extractor</artifactId>
    <version>0.9.6</version>
</dependency>
```

To add a dependency using Gradle:

```groovy
dependencies {
    compile 'im.nll.data:extractor:0.9.6'
}
```

## Examples

### Extract single data

```java
String followers = Extractors.on(baseHtml)
        .extract(new SelectorExtractor("div.followers"))
        .with(new RegexExtractor("\\d+"))
        .asString();
```

Or use the static methods:

```java
String followers = Extractors.on(baseHtml)
        .extract(selector("div.followers"))
        .with(regex("\\d+"))
        .asString();
```

Or use the short string form:

```java
String followers = Extractors.on(baseHtml)
        .extract("selector:div.followers")
        .with(regex("\\d+"))
        .asString();
```

More methods:

```java
String year = Extractors.on("<div> Talk is cheap. Show me the code. - Fri, 25 Aug 2000 </div>")
        .extract(selector("div"))          // extract with selector
        .filter(value -> value.trim())     // trim result
        .with(regex("20\\d{2}"))           // get year with regex
        .filter(value -> "from " + value)  // prepend 'from' string
        .asString();
Assert.assertEquals("from 2000", year);
```
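To make the order of the steps in that chain concrete, here is a minimal JDK-only sketch of the same trim, regex, prepend sequence. It does not use web-data-extractor: `ChainSketch`, `firstMatch`, and `extractYear` are hypothetical names, and returning an empty string when nothing matches is an assumption of this sketch, not documented library behavior.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of web-data-extractor.
class ChainSketch {
    // Returns a step that yields the first regex match,
    // or an empty string if there is none (an assumption of this sketch).
    static Function<String, String> firstMatch(String regex) {
        Pattern pattern = Pattern.compile(regex);
        return input -> {
            Matcher matcher = pattern.matcher(input);
            return matcher.find() ? matcher.group() : "";
        };
    }

    // Applies the steps in the same order as the chain above:
    // trim, then take the year, then prepend "from ".
    static String extractYear(String text) {
        Function<String, String> trim = String::trim;
        Function<String, String> pipeline = trim
                .andThen(firstMatch("20\\d{2}"))
                .andThen(value -> "from " + value);
        return pipeline.apply(text);
    }

    public static void main(String[] args) {
        // prints "from 2000"
        System.out.println(extractYear(" Talk is cheap. Show me the code. - Fri, 25 Aug 2000 "));
    }
}
```

Each step receives the output of the previous one, which is why the regex runs on the trimmed text and the prefix is added last.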

### Extract data to map

```java
@Test
public void testToMap() throws Exception {
    Map<String, String> dataMap = Extractors.on(baseHtml)
            .extract("title", selector("a.title"))
            .extract("followers", selector("div.followers")).with(regex("\\d+"))
            .extract("description", selector("div.description"))
            .asMap();
    Assert.assertEquals("fivesmallq", dataMap.get("title"));
    Assert.assertEquals("29671", dataMap.get("followers"));
    Assert.assertEquals("Talk is cheap. Show me the code.", dataMap.get("description"));
}
```

### Extract data to map list

```java
@Test
public void testToMapList() throws Exception {
    // the split parameter must implement ListableExtractor
    List<Map<String, String>> languages = Extractors.on(listHtml)
            .split(selector("tr.item.html"))
            .extract("type", selector("td.type"))
            .extract("name", selector("td.name"))
            .extract("url", selector("td.url"))
            .asMapList();
    Assert.assertNotNull(languages);
    Map<String, String> second = languages.get(1);
    Assert.assertEquals(3, languages.size());
    Assert.assertEquals("dynamic", second.get("type"));
    Assert.assertEquals("Ruby", second.get("name"));
    Assert.assertEquals("https://www.ruby-lang.org", second.get("url"));
}
```
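The split-then-extract pattern can be pictured with the JDK's built-in XPath support: first select the repeating nodes, then evaluate each field expression relative to one node. The sketch below is JDK-only and does not use web-data-extractor. `SplitSketch`, `toMapList`, and the `SAMPLE` markup are hypothetical, since the real `listHtml` fixture is not shown here. Only the Ruby row mirrors the values asserted in the test.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of web-data-extractor.
class SplitSketch {
    // Hypothetical, well-formed sample input; the first and third rows are invented.
    static final String SAMPLE =
            "<table>"
            + "<tr class='item'><td>static</td><td>Java</td><td>https://www.java.com</td></tr>"
            + "<tr class='item'><td>dynamic</td><td>Ruby</td><td>https://www.ruby-lang.org</td></tr>"
            + "<tr class='item'><td>dynamic</td><td>Python</td><td>https://www.python.org</td></tr>"
            + "</table>";

    static List<Map<String, String>> toMapList(String xml) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            // "Split": select every repeating row node.
            NodeList rows = (NodeList) xpath.evaluate(
                    "//tr[@class='item']",
                    new InputSource(new StringReader(xml)),
                    XPathConstants.NODESET);
            List<Map<String, String>> result = new ArrayList<>();
            for (int i = 0; i < rows.getLength(); i++) {
                Node row = rows.item(i);
                // "Extract": evaluate each field relative to the current row.
                Map<String, String> fields = new LinkedHashMap<>();
                fields.put("type", xpath.evaluate("td[1]", row));
                fields.put("name", xpath.evaluate("td[2]", row));
                fields.put("url", xpath.evaluate("td[3]", row));
                result.add(fields);
            }
            return result;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // prints the second row as a map
        System.out.println(toMapList(SAMPLE).get(1));
    }
}
```

The field expressions are relative (`td[1]`, not `//td[1]`), so each one is resolved inside its own row rather than against the whole document.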

### Extract data to bean

```java
@Test
public void testToBean() throws Exception {
    Base base = Extractors.on(baseHtml)
            .extract("title", selector("a.title"))
            .extract("followers", selector("div.followers")).with(regex("\\d+"))
            .extract("description", selector("div.description"))
            .asBean(Base.class);
    Assert.assertEquals("fivesmallq", base.getTitle());
    Assert.assertEquals("29671", base.getFollowers());
    Assert.assertEquals("Talk is cheap. Show me the code.", base.getDescription());
}
```
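How the library binds extracted fields to a bean is not described here. As one possible mental model, the sketch below copies a map of field names to values into a bean through its `String` setters using plain reflection. `BeanSketch`, `toBean`, and this `Base` class are hypothetical stand-ins. The real `Base` class and the library's actual binding mechanism may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper class, not part of web-data-extractor.
class BeanSketch {
    // Hypothetical bean with the getters used in the test above.
    public static class Base {
        private String title;
        private String followers;
        private String description;

        public String getTitle() { return title; }
        public void setTitle(String title) { this.title = title; }
        public String getFollowers() { return followers; }
        public void setFollowers(String followers) { this.followers = followers; }
        public String getDescription() { return description; }
        public void setDescription(String description) { this.description = description; }
    }

    // Creates a bean and calls setXxx(String) for every map entry "xxx".
    static <T> T toBean(Map<String, String> fields, Class<T> type) {
        try {
            T bean = type.getDeclaredConstructor().newInstance();
            for (Map.Entry<String, String> entry : fields.entrySet()) {
                String name = entry.getKey();
                String setter = "set" + Character.toUpperCase(name.charAt(0)) + name.substring(1);
                type.getMethod(setter, String.class).invoke(bean, entry.getValue());
            }
            return bean;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not populate " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("title", "fivesmallq");
        fields.put("followers", "29671");
        Base base = toBean(fields, Base.class);
        System.out.println(base.getTitle() + " " + base.getFollowers());
    }
}
```

In this sketch the names passed as the first argument of `extract(...)` would have to match the bean's property names, which is consistent with the `title`, `followers`, and `description` getters used in the test.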

### Extract data to bean list

```java
@Test
public void testToBeanList() throws Exception {
    List<Language> languages = Extractors.on(listHtml)
            .split(selector("tr.item.html"))
            .extract("type", selector("td.type"))
            .extract("name", selector("td.name"))
            .extract("url", selector("td.url"))
            .asBeanList(Language.class);
    Assert.assertNotNull(languages);
    Language second = languages.get(1);
    Assert.assertEquals(3, languages.size());
    Assert.assertEquals("dynamic", second.getType());
    Assert.assertEquals("Ruby", second.getName());
    Assert.assertEquals("https://www.ruby-lang.org", second.getUrl());
}
```

### Filter

`before` and `after` are global filters.

```java
@Test
public void testToBeanListFilterBeforeAndAfter() throws Exception {
    List<Language> languages = Extractors.on(listHtml)
            // before and after just process the last extracted value.
            .before(value -> "|before|" + value)
            .after(value -> value + "|after|")
            .split(xpath("//tr[@class='item']"))
            .extract("type", xpath("//td[1]/text()")).filter(value -> "filter:" + value)
            .extract("name", xpath("//td[2]/text()")).filter(value -> "filter:" + value)
            .extract("url", xpath("//td[3]/text()")).filter(value -> "filter:" + value)
            .asBeanList(Language.class);
    Assert.assertNotNull(languages);
    Language second = languages.get(1);
    Assert.assertEquals(3, languages.size());
    Assert.assertEquals("filter:|before|dynamic|after|", second.getType());
    Assert.assertEquals("filter:|before|Ruby|after|", second.getName());
    Assert.assertEquals("filter:|before|https://www.ruby-lang.org|after|", second.getUrl());
}
```
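The expected values in that test imply an ordering: the global `before` and `after` functions wrap the extracted value first, and the per-field `filter` is applied to the result. The JDK-only sketch below reproduces just that ordering with function composition. `FilterOrderSketch` and `process` are hypothetical names, and the ordering is inferred from the assertions, not from the library's source.

```java
import java.util.function.Function;

// Hypothetical helper class, not part of web-data-extractor.
class FilterOrderSketch {
    // Applies the global before/after functions, then the per-field filter.
    static String process(String extracted) {
        Function<String, String> before = value -> "|before|" + value;
        Function<String, String> after = value -> value + "|after|";
        Function<String, String> fieldFilter = value -> "filter:" + value;
        return before.andThen(after).andThen(fieldFilter).apply(extracted);
    }

    public static void main(String[] args) {
        // prints "filter:|before|Ruby|after|"
        System.out.println(process("Ruby"));
    }
}
```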

See Example.

Project page: fivesmallq.github.io/web-data-extractor

Latest release: v0.9.6 (Apr 26, 2016); 3 releases in total.
