ropensci/epubr

Read EPUB files in R

epubr

Read EPUB text and metadata.

The epubr package provides functions supporting the reading and parsing of internal e-book content from EPUB files. E-book metadata and text content are parsed separately and joined together in a tidy, nested tibble data frame.

E-book formatting is not completely standardized across all literature. It can be challenging to curate parsed e-book content across an arbitrary collection of e-books perfectly and in completely general form, to yield a singular, consistently formatted output. Many EPUB files do not even contain all the same pieces of information in their respective metadata.

EPUB file parsing functionality in this package is intended for relatively general application to arbitrary EPUB e-books. However, poorly formatted e-books or e-books with highly uncommon formatting may not work with this package. There may even be cases where an EPUB file has DRM or some other property that makes it impossible to read with epubr.

Text is read ‘as is’ for the most part. The only nominal changes are minor substitutions, for example curly quotes changed to straight quotes. Substantive changes are expected to be performed subsequently by the user as part of their text analysis. Additional text cleaning can be performed at the user’s discretion, such as with functions from packages like tm or qdap.
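As a rough illustration of the kind of follow-up cleaning left to the user, here is a minimal base R sketch. The sample string and the `clean_text()` helper are hypothetical and are not part of epubr, tm, or qdap. Which steps are appropriate, lowercasing in particular, depends on the analysis.

```r
# Hypothetical sample text; not taken from any real e-book.
raw_text <- "CHAPTER I\n\nIt   was a dark night.\n\n\nThe wind howled."

# Hypothetical helper: a few common normalisation steps in base R.
clean_text <- function(x) {
  x <- gsub("[ \t]+", " ", x)   # collapse runs of spaces and tabs
  x <- gsub("\n{2,}", "\n", x)  # collapse repeated line breaks
  x <- trimws(x)                # drop leading and trailing whitespace
  tolower(x)                    # lowercase so word counts ignore case
}

cleaned <- clean_text(raw_text)

# Split into words on anything that is not a lowercase letter or apostrophe.
words <- strsplit(cleaned, "[^a-z']+")[[1]]
words <- words[nzchar(words)]
```

The same function could be applied to a character column of text in a parsed table, since `gsub()`, `trimws()`, and `tolower()` are all vectorised.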

Installation

Install epubr from CRAN with:

install.packages("epubr")

Install the development version from GitHub with:

# install.packages("remotes")
remotes::install_github("ropensci/epubr")

Example

Bram Stoker’s Dracula novel sourced from Project Gutenberg is a good example of an EPUB file with unfortunate formatting. The first thing that stands out is that the naming convention, item followed by some ordered digits, does not differentiate sections like the book preamble from the chapters. The numbering also starts in a weird place. But it is actually worse than this. Notice that sections are not broken into chapters; they can begin and end in the middle of chapters!

These annoyances aside, the metadata and contents can still be read into a convenient table. Text mining analyses can still be performed on the overall book, if not so easily on individual chapters. See the package vignette for examples on how to further improve the structure of an e-book with formatting like this.

library(epubr)

file <- system.file("dracula.epub", package = "epubr")
(x <- epub(file))
#> # A tibble: 1 × 9
#>   rights         identifier creator title language subject date  source data
#>   <chr>          <chr>      <chr>   <chr> <chr>    <chr>   <chr> <chr>  <list>
#> 1 Public domain… http://ww… Bram S… Drac… en       Horror… 1995… http:… <tibble>

x$data[[1]]
#> # A tibble: 15 × 4
#>    section           text                                            nword nchar
#>    <chr>             <chr>                                           <int> <int>
#>  1 item6             "The Project Gutenberg EBook of Dracula, by Br… 11446 60972
#>  2 item7             "But I am not in heart to describe beauty, for… 13879 71798
#>  3 item8             "\" 'Lucy, you are an honest-hearted girl, I k… 12474 65522
#>  4 item9             "CHAPTER VIIIMINA MURRAY'S JOURNAL\nSame day, … 12177 62724
#>  5 item10            "CHAPTER X\nLetter, Dr. Seward to Hon. Arthur … 12806 66678
#>  6 item11            "Once again we went through that ghastly opera… 12103 62949
#>  7 item12            "CHAPTER XIVMINA HARKER'S JOURNAL\n23 Septembe… 12214 62234
#>  8 item13            "CHAPTER XVIDR. SEWARD'S DIARY-continued\nIT w… 13990 72903
#>  9 item14            "\"Thus when we find the habitation of this ma… 13356 69779
#> 10 item15            "\"I see,\" I said. \"You want big things that… 12866 66921
#> 11 item16            "CHAPTER XXIIIDR. SEWARD'S DIARY\n3 October.-T… 11928 61550
#> 12 item17            "CHAPTER XXVDR. SEWARD'S DIARY\n11 October, Ev… 13119 68564
#> 13 item18            " \nLater.-Dr. Van Helsing has returned. He ha…  8435 43464
#> 14 item19            "End of the Project Gutenberg EBook of Dracula…  2665 18541
#> 15 coverpage-wrapper ""                                                  0     0
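The package vignette is the place to look for epubr's own approach to restructuring a book like this. Purely to illustrate the general idea, the base R sketch below joins section texts that break mid-chapter and cuts them again at each chapter heading. The `sections` vector is made up, the heading pattern is an assumption about how chapters are labelled, and none of the names here come from epubr.

```r
# Hypothetical section texts that begin and end mid-chapter.
sections <- c(
  "Preamble text. CHAPTER I First part of one.",
  " Rest of one. CHAPTER II All of two. CHAPTER III Start of three",
  " and the end of three."
)

# Join everything into one string, then locate each chapter heading.
full_text <- paste(sections, collapse = "")
starts <- as.integer(gregexpr("CHAPTER [IVXLC]+", full_text)[[1]])
stopifnot(starts[1] > 0)  # gregexpr() returns -1 when nothing matches

# Each chapter runs from its heading to just before the next heading.
# Text before the first heading (the preamble) is dropped in this sketch.
ends <- c(starts[-1] - 1L, nchar(full_text))
chapters <- trimws(substring(full_text, starts, ends))

result <- data.frame(
  section = sub("^(CHAPTER [IVXLC]+).*$", "\\1", chapters),
  text = chapters,
  nword = lengths(strsplit(chapters, "[[:space:]]+")),
  stringsAsFactors = FALSE
)
```

A pattern this simple would misread headings fused to the following text, such as the "CHAPTER VIIIMINA" seen in the output above, so a real book would likely need a more careful pattern.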

Related packages

tesseract by @jeroen for more direct control of the OCR process.

pdftools for extracting metadata and text from PDF files (therefore more specific to PDF, and without a Java dependency).

tabulizer by @leeper and @tpaskhalis, bindings for the Tabula PDF Table Extractor Library, to extract tables, therefore not text, from PDF files.

rtika by @goodmansasha for more general text parsing.

gutenbergr by @dgrtwo for searching and downloading public domain texts from Project Gutenberg.

Please note that the epubr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

About

Documentation: docs.ropensci.org/epubr

Topics: r, epub, rstats, r-package, epub-format, epub-files, peer-reviewed

Licenses found: Unknown, MIT

Latest release: epubr 0.6.5 (Sep 19, 2024), one of 8 releases
