Extracting and parsing structured data with jQuery selectors, XPath, or JsonPath from common web formats such as HTML, XML, and JSON.


fivesmallq/web-data-extractor



Implements:

### Usage

To add a dependency on Web-Data-Extractor using Maven, use the following:

    <dependency>
        <groupId>im.nll.data</groupId>
        <artifactId>extractor</artifactId>
        <version>0.9.6</version>
    </dependency>

To add a dependency using Gradle:

    dependencies {
        compile 'im.nll.data:extractor:0.9.6'
    }

## Examples

### Extract a single value

    String followers = Extractors.on(baseHtml)
            .extract(new SelectorExtractor("div.followers"))
            .with(new RegexExtractor("\\d+"))
            .asString();

Or use the static factory methods:

    String followers = Extractors.on(baseHtml)
            .extract(selector("div.followers"))
            .with(regex("\\d+"))
            .asString();

Or use the short string form:

    String followers = Extractors.on(baseHtml)
            .extract("selector:div.followers")
            .with(regex("\\d+"))
            .asString();
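The short string form packs the extractor type and its expression into one string. The README shows only the `selector:` prefix and does not show how the library parses it, so the following is a minimal sketch under the assumption that the text before the first colon names the extractor type and everything after it is the expression. `ShortFormSketch` and `parse` are hypothetical names, not part of Web-Data-Extractor.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical helper, not part of Web-Data-Extractor.
final class ShortFormSketch {

    // Splits "type:expression" at the FIRST colon only, so expressions that
    // themselves contain colons (URLs, for example) are kept intact.
    static Map.Entry<String, String> parse(String shortForm) {
        int colon = shortForm.indexOf(':');
        if (colon <= 0) {
            throw new IllegalArgumentException("expected type:expression but got: " + shortForm);
        }
        String type = shortForm.substring(0, colon);
        String expression = shortForm.substring(colon + 1);
        return new SimpleEntry<>(type, expression);
    }

    public static void main(String[] args) {
        Map.Entry<String, String> parsed = parse("selector:div.followers");
        System.out.println(parsed.getKey() + " -> " + parsed.getValue());
    }
}
```

Splitting at the first colon rather than the last is a deliberate choice in this sketch, since an expression is more likely than a type name to contain a colon.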

More methods:

    String year = Extractors.on("<div> Talk is cheap. Show me the code. - Fri, 25 Aug 2000 </div>")
            .extract(selector("div")) // extract with selector
            .filter(value -> value.trim()) // trim result
            .with(regex("20\\d{2}")) // get year with regex
            .filter(value -> "from " + value) // append 'from' string
            .asString();
    Assert.assertEquals("from 2000", year);
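To make the individual steps of this chain concrete, here is a rough plain-Java equivalent that uses only `java.util.regex`. It starts from the text already selected from the `<div>`, because the selector step needs an HTML parser. `YearPipelineSketch`, `firstMatch`, and `yearLabel` are hypothetical names for illustration, and this is not how the library is implemented.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Web-Data-Extractor.
final class YearPipelineSketch {

    // Returns the first substring of input that matches pattern, if any.
    static Optional<String> firstMatch(String input, String pattern) {
        Matcher matcher = Pattern.compile(pattern).matcher(input);
        return matcher.find() ? Optional.of(matcher.group()) : Optional.empty();
    }

    // Mirrors the trim, regex, and prefix steps of the chain above.
    // Assumption of this sketch: a missing match yields an empty string.
    static String yearLabel(String divText) {
        String trimmed = divText.trim();
        String year = firstMatch(trimmed, "20\\d{2}").orElse("");
        return "from " + year;
    }

    public static void main(String[] args) {
        // prints "from 2000"
        System.out.println(yearLabel(" Talk is cheap. Show me the code. - Fri, 25 Aug 2000 "));
    }
}
```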

### Extract data to a map

    @Test
    public void testToMap() throws Exception {
        Map<String, String> dataMap = Extractors.on(baseHtml)
                .extract("title", selector("a.title"))
                .extract("followers", selector("div.followers")).with(regex("\\d+"))
                .extract("description", selector("div.description"))
                .asMap();
        Assert.assertEquals("fivesmallq", dataMap.get("title"));
        Assert.assertEquals("29671", dataMap.get("followers"));
        Assert.assertEquals("Talk is cheap. Show me the code.", dataMap.get("description"));
    }

### Extract data to a list of maps

    @Test
    public void testToMapList() throws Exception {
        // the split parameter must implement ListableExtractor
        List<Map<String, String>> languages = Extractors.on(listHtml)
                .split(selector("tr.item.html"))
                .extract("type", selector("td.type"))
                .extract("name", selector("td.name"))
                .extract("url", selector("td.url"))
                .asMapList();
        Assert.assertNotNull(languages);
        Map<String, String> second = languages.get(1);
        Assert.assertEquals(languages.size(), 3);
        Assert.assertEquals(second.get("type"), "dynamic");
        Assert.assertEquals(second.get("name"), "Ruby");
        Assert.assertEquals(second.get("url"), "https://www.ruby-lang.org");
    }

### Extract data to a bean

    @Test
    public void testToBean() throws Exception {
        Base base = Extractors.on(baseHtml)
                .extract("title", selector("a.title"))
                .extract("followers", selector("div.followers")).with(regex("\\d+"))
                .extract("description", selector("div.description"))
                .asBean(Base.class);
        Assert.assertEquals("fivesmallq", base.getTitle());
        Assert.assertEquals("29671", base.getFollowers());
        Assert.assertEquals("Talk is cheap. Show me the code.", base.getDescription());
    }

### Extract data to a list of beans

    @Test
    public void testToBeanList() throws Exception {
        List<Language> languages = Extractors.on(listHtml)
                .split(selector("tr.item.html"))
                .extract("type", selector("td.type"))
                .extract("name", selector("td.name"))
                .extract("url", selector("td.url"))
                .asBeanList(Language.class);
        Assert.assertNotNull(languages);
        Language second = languages.get(1);
        Assert.assertEquals(languages.size(), 3);
        Assert.assertEquals(second.getType(), "dynamic");
        Assert.assertEquals(second.getName(), "Ruby");
        Assert.assertEquals(second.getUrl(), "https://www.ruby-lang.org");
    }

### Filter

`before` and `after` are global filters.

    @Test
    public void testToBeanListFilterBeforeAndAfter() throws Exception {
        List<Language> languages = Extractors.on(listHtml)
                // before and after only process the last extracted value.
                .before(value -> "|before|" + value)
                .after(value -> value + "|after|")
                .split(xpath("//tr[@class='item']"))
                .extract("type", xpath("//td[1]/text()")).filter(value -> "filter:" + value)
                .extract("name", xpath("//td[2]/text()")).filter(value -> "filter:" + value)
                .extract("url", xpath("//td[3]/text()")).filter(value -> "filter:" + value)
                .asBeanList(Language.class);
        Assert.assertNotNull(languages);
        Language second = languages.get(1);
        Assert.assertEquals(languages.size(), 3);
        Assert.assertEquals(second.getType(), "filter:|before|dynamic|after|");
        Assert.assertEquals(second.getName(), "filter:|before|Ruby|after|");
        Assert.assertEquals(second.getUrl(), "filter:|before|https://www.ruby-lang.org|after|");
    }
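The assertions above imply an ordering: the per-field `filter` wraps a value that `before` and `after` have already processed. That ordering can be modelled with plain `java.util.function.Function` composition, as in the sketch below. It only reproduces the observable result and is not the library's implementation; `FilterOrderSketch` is a hypothetical name. Because `before` adds a prefix and `after` adds a suffix here, the two commute, so this example alone cannot show which of them runs first.

```java
import java.util.function.Function;

// Hypothetical helper, not part of Web-Data-Extractor.
final class FilterOrderSketch {

    // Applies the global before and after functions, then the per-field filter.
    static String apply(String extracted) {
        Function<String, String> before = value -> "|before|" + value;
        Function<String, String> after = value -> value + "|after|";
        Function<String, String> fieldFilter = value -> "filter:" + value;
        return before.andThen(after).andThen(fieldFilter).apply(extracted);
    }

    public static void main(String[] args) {
        // prints "filter:|before|Ruby|after|"
        System.out.println(apply("Ruby"));
    }
}
```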

See Example.
