parallel execution of RSelenium
parsel is a framework for parallelized dynamic web-scraping using RSelenium. Leveraging parallel processing, it allows you to run any RSelenium web-scraping routine on multiple browser instances simultaneously, thus greatly increasing the efficiency of your scraping. parsel utilizes chunked input processing as well as error catching and logging to ensure seamless execution of your scraping routine and minimal data loss, even in the presence of unforeseen RSelenium errors. parsel additionally provides convenient wrapper functions around RSelenium methods that allow you to quickly generate safe scraping code with minimal coding on your end.
You can find the parsel website here.
``` r
# Install parsel from CRAN
install.packages("parsel")

# Or the development version from GitHub:
# install.packages("devtools")
devtools::install_github("till-tietz/parsel")
```
The following example will hopefully serve to illustrate the functionality and ideas behind how parsel operates. We'll set up the following scraping job:
- navigate to a random Wikipedia article
- retrieve its title
- navigate to the first linked page on the article
- retrieve the linked page’s title and first section
and parallelize it with parsel.
parsel requires two things:
- a scraping function defining the actions to be executed in each
  RSelenium instance. Actions to be executed in each browser instance
  should be written in the conventional RSelenium syntax, with `remDr$`
  specifying the remote driver.
- some input `x` to those actions (e.g. search terms to be entered in
  search boxes, or links to navigate to, etc.)
``` r
library(RSelenium)
library(parsel)

# let's define our scraping function input
# we want to run our function 4 times and we want it to start on the wikipedia main page each time
input <- rep("https://de.wikipedia.org", 4)

# let's define our scraping function
get_wiki_text <- function(x) {
  input_i <- x

  # navigate to input page (i.e. wikipedia)
  remDr$navigate(input_i)

  # find and click random article
  rand_art <- remDr$findElement(using = "id", "n-randompage")$clickElement()

  # get random article title
  title <- remDr$findElement(using = "id", "firstHeading")$getElementText()[[1]]

  # check if there is a linked page
  link_exists <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]"))

  # if no linked page fill output with NA
  if (is(link_exists, "try-error")) {
    first_link_title <- NA
    first_link_text <- NA

  # if there is a linked page
  } else {
    # click on link
    link <- remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]")$clickElement()

    # get link page title
    first_link_title <- try(remDr$findElement(using = "id", "firstHeading"))

    if (is(first_link_title, "try-error")) {
      first_link_title <- NA
    } else {
      first_link_title <- first_link_title$getElementText()[[1]]
    }

    # get 1st section of link page
    first_link_text <- try(remDr$findElement(using = "xpath", "/html/body/div[3]/div[3]/div[5]/div[1]/p[1]"))

    if (is(first_link_text, "try-error")) {
      first_link_text <- NA
    } else {
      first_link_text <- first_link_text$getElementText()[[1]]
    }
  }

  out <- data.frame(
    "random_article"   = title,
    "first_link_title" = first_link_title,
    "first_link_text"  = first_link_text
  )
  return(out)
}
```
Now that we have our scrape function and input, we can parallelize the execution of the function. For speed and efficiency reasons, it is advisable to specify the headless browser option in the extraCapabilities argument. parscrape will show a progress bar, as well as elapsed and estimated remaining time, so you can keep track of scraping progress.
``` r
wiki_text <- parsel::parscrape(
  scrape_fun = get_wiki_text,
  scrape_input = input,
  cores = 2,
  packages = c("RSelenium", "XML"),
  browser = "firefox",
  scrape_tries = 1,
  extraCapabilities = list(
    "moz:firefoxOptions" = list(args = list('--headless'))
  )
)
```
parscrape returns a list with two elements:
- a list of your scrape function output
- a data.frame of inputs it was unable to scrape, and the associated error messages
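Since each successful scrape returns a one-row data.frame, a natural next step is to row-bind the successes and inspect the failures separately. A minimal post-processing sketch; the positional access below is an assumption, so check `names(wiki_text)` on your own output:

``` r
# sketch: combine the scraped data.frames and inspect failures;
# element order is an assumption -- verify with names(wiki_text)
scraped <- wiki_text[[1]]           # list of per-input data.frames
failed  <- wiki_text[[2]]           # data.frame of inputs + error messages

results <- do.call(rbind, scraped)  # one row per successful scrape
```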
parsel allows you to generate safe scraping code with minimal hassle by simply composing constructor functions, which effectively act as wrappers around RSelenium methods, in a pipe. You can return a scraper function defined by constructors to the environment by starting your pipe with start_scraper() and ending it with build_scraper(). Alternatively, you can dump the code generated by your constructor pipe to the console via show(). We'll reproduce a slightly stripped-down version of the RSelenium code in the above Wikipedia scraping routine via the parsel constructor functions.
``` r
library(parsel)

# returning a scraper function
start_scraper(args = "x", name = "get_wiki_text") %>>%
  go(url = "x") %>>%
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  build_scraper()
#> [1] "scraping function get_wiki_text constructed and in environment"
#> [1] "scraping function get_wiki_text constructed and in environment"

ls()
#> [1] "get_wiki_text"

# dumping generated code to console
go(url = "x") %>>%
  click(using = "id", value = "'n-randompage'", name = "rand_art") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "title") %>>%
  click(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]'", name = "link") %>>%
  get_element(using = "id", value = "'firstHeading'", name = "first_link_title") %>>%
  get_element(using = "xpath", value = "'/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'", name = "first_link_text") %>>%
  show()
#> # navigate to url
#> not_loaded <- TRUE
#> remDr$navigate(x)
#> while(not_loaded){
#>   Sys.sleep(0.25)
#>   current <- remDr$getCurrentUrl()[[1]]
#>   if(current == x){
#>     not_loaded <- FALSE
#>   }
#> }
#>
#> rand_art <- remDr$findElement(using = 'id', 'n-randompage')
#> rand_art$clickElement()
#> Sys.sleep(0.25)
#>
#> title <- try(remDr$findElement(using = 'id', 'firstHeading'))
#> if(is(title,'try-error')){
#>   title <- NA
#> } else {
#>   title <- title$getElementText()[[1]]
#> }
#>
#> link <- remDr$findElement(using = 'xpath', '/html/body/div[3]/div[3]/div[5]/div[1]/p[1]/a[1]')
#> link$clickElement()
#> Sys.sleep(0.25)
#>
#> first_link_title <- try(remDr$findElement(using = 'id', 'firstHeading'))
#> if(is(first_link_title,'try-error')){
#>   first_link_title <- NA
#> } else {
#>   first_link_title <- first_link_title$getElementText()[[1]]
#> }
#>
#> first_link_text <- try(remDr$findElement(using = 'xpath', '/html/body/div[3]/div[3]/div[5]/div[1]/p[1]'))
#> if(is(first_link_text,'try-error')){
#>   first_link_text <- NA
#> } else {
#>   first_link_text <- first_link_text$getElementText()[[1]]
#> }
```
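Because build_scraper() places get_wiki_text in the environment, the constructor-built scraper can be handed straight to parscrape() like any hand-written scrape function. A sketch reusing the input vector and parscrape() arguments from the earlier example (a running Selenium setup is assumed):

``` r
# run the constructor-built scraper in parallel,
# reusing the input vector from the earlier example
wiki_text <- parsel::parscrape(
  scrape_fun = get_wiki_text,
  scrape_input = input,
  cores = 2,
  packages = c("RSelenium"),
  browser = "firefox",
  scrape_tries = 1,
  extraCapabilities = list(
    "moz:firefoxOptions" = list(args = list("--headless"))
  )
)
```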