ropensci/rtika

R Interface to Apache Tika

License

Apache-2.0 license

rtika

Extract text or metadata from over a thousand file types.

Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java library, has server and command-line editions suitable for use from other programming languages …

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities. (From https://en.wikipedia.org/wiki/Apache_Tika, accessed Jan 18, 2018)

This is an R interface to the Tika software.

Installation

To start, you need R and Java 11 or newer (e.g. OpenJDK 17+). To check your version, run the command java -version from a terminal. Get Java installation tips at https://www.java.com/en/download/ or https://openjdk.org/install/. Because the rJava package is not required, installation is simple. You can cut and paste the following snippet:

    install.packages('rtika', repos = 'https://cloud.r-project.org')
    library('rtika')
    # You need to install the Apache Tika .jar once.
    install_tika()

Read an introductory article at https://docs.ropensci.org/rtika/articles/rtika_introduction.html.

Key Features

tika_text() to extract plain text.
tika_xml() and tika_html() to get a structured XHTML rendition.
tika_json() to get metadata as .json, with XHTML content.
tika_json_text() to get metadata as .json, with plain text content.
tika() is the main function the others above inherit from.
tika_fetch() to download files with a file extension matching the Content-Type.
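
As a rough sketch of how these functions can be combined, the snippet below wraps tika_json() and parses its result with jsonlite::fromJSON(). The helper name extract_metadata is hypothetical and not part of rtika, and the sketch assumes that jsonlite is installed alongside rtika and the Tika .jar.

```r
# Hypothetical helper (not part of rtika): parse Tika's JSON output for one file.
# Assumes rtika, the Tika .jar (see install_tika()), and jsonlite are installed.
extract_metadata <- function(path) {
  stopifnot(is.character(path), length(path) == 1)
  json <- rtika::tika_json(path)  # one JSON string per input file
  jsonlite::fromJSON(json[1])     # convert the JSON string to an R object
}
```

For instance, extract_metadata(system.file("extdata", "jsonlite.pdf", package = "rtika")) should return the parsed metadata for one of the bundled test files. The fields it contains depend on the file type and the Tika version.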

Supported File Types

Tika parses and extracts text or metadata from over one thousand digital formats, including:

Portable Document Format (.pdf)
Microsoft Office document formats (Word, PowerPoint, Excel, etc.)
Rich Text Format (.rtf)
Electronic Publication Format (.epub)
Image formats (.jpeg, .png, etc.)
Mail formats (.mbox, Outlook)
HyperText Markup Language (.html)
XML and derived formats (.xml, etc.)
Compression and packaging formats (.gzip, .rar, etc.)
OpenDocument Format
iWorks document formats
WordPerfect document formats
Text formats
Feed and Syndication formats
Help formats
Audio formats
Video formats
Java class files and archives
Source code
CAD formats
Font formats
Scientific formats
Executable programs and libraries
Crypto formats

For a list of MIME types, look for the “Supported Formats” page here: https://tika.apache.org/

Get Plain Text

The rtika package processes batches of documents efficiently, so I recommend batches. Currently, the tika() parsers take a little time to spin up, and that delay becomes annoying over hundreds of separate calls to the functions.

    # Test files
    batch <- c(
      system.file("extdata", "jsonlite.pdf", package = "rtika"),
      system.file("extdata", "curl.pdf", package = "rtika"),
      system.file("extdata", "table.docx", package = "rtika"),
      system.file("extdata", "xml2.pdf", package = "rtika"),
      system.file("extdata", "R-FAQ.html", package = "rtika"),
      system.file("extdata", "calculator.jpg", package = "rtika"),
      system.file("extdata", "tika.apache.org.zip", package = "rtika")
    )

    # batches are best, and can also be piped with magrittr.
    text <- tika_text(batch)

    # text has one string for each document:
    length(text)
    #> [1] 7

    # A snippet:
    cat(substr(text[1], 54, 190))
    #> lite’
    #> June 1, 2017
    #>
    #> Version 1.5
    #>
    #> Title A Robust, High Performance JSON Parser and Generator for R
    #>
    #> License MIT + file LICENSE
    #>
    #> NeedsCompi

To learn more and find out how to extract structured text and metadata, read the vignette: https://docs.ropensci.org/rtika/articles/rtika_introduction.html.

Enhancements

Tika can also interact with the Tesseract OCR program on some Linux variants, to extract plain text from images of text. If tesseract-ocr is installed, Tika should automatically locate and use it for images and PDFs that contain images of text. However, this does not seem to work on OS X or Windows. To try on Linux, first follow the Tesseract installation instructions. The next time Tika is run, it should work. For a different approach, I suggest the tesseract package by @jeroen, which is a specialized R interface.
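
Before relying on OCR, it may help to confirm that a Tesseract executable is visible on the PATH of the R session. The base R sketch below assumes the executable is named tesseract, which is typical but may differ on some systems. The name has_tesseract is a hypothetical helper, not part of rtika.

```r
# Hypothetical helper: TRUE if an executable with the given name is on the PATH.
# The default name "tesseract" is an assumption about the local installation.
has_tesseract <- function(exe = "tesseract") {
  unname(nzchar(Sys.which(exe)))
}

if (has_tesseract()) {
  message("Tesseract found at: ", Sys.which("tesseract"))
} else {
  message("Tesseract not found on the PATH; Tika will probably not run OCR.")
}
```

This only checks that the program can be found. It does not confirm that Tika will actually use it.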

The Apache Tika community welcomes your feedback. Issues regarding the R interface should be raised at the rTika GitHub Issue Tracker. If you are confident the issue concerns Tika or one of its underlying parsers, use the Tika Bugtracking System.

Using the Tika App Directly

If your project or package needs to use the Tika App .jar, you can include rTika as a dependency and call the rtika::tika_jar() function to get the path to the Tika app installed on the system.
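
A minimal sketch of that approach follows. It assumes a java executable is on the PATH and uses the Tika App's --text command-line flag. The helper tika_cli_args and the file name report.pdf are hypothetical, used only for illustration.

```r
# Hypothetical helper: build the arguments for
#   java -jar <tika-app.jar> --text <file>
# Paths are quoted because system2() passes arguments through a shell.
tika_cli_args <- function(jar, file) {
  c("-jar", shQuote(jar), "--text", shQuote(file))
}

# Example call, commented out because it needs Java and an installed Tika .jar:
# out <- system2("java", tika_cli_args(rtika::tika_jar(), "report.pdf"),
#                stdout = TRUE)
```

Keeping argument construction in its own function makes it easy to inspect the command before running it.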

Similar R Packages

There are a number of specialized parsers that overlap in functionality. For example, the pdftools package extracts metadata and text from PDF files, the antiword package extracts text from recent versions of Word, and the epubr package by @leonawicz processes epub files. These packages do not depend on Java, while rTika does.

The big difference between Tika and a specialized parser is that Tika integrates dozens of specialist libraries maintained by the Apache Foundation. Apache Tika processes over a thousand file types and multiple versions of each. This eases the processing of digital archives that contain unpredictable files. For example, researchers use Tika to process archives from court cases, governments, or the Internet Archive that span multiple years. These archives frequently contain diverse formats and multiple versions of each format. Because Tika finds the matching parser for each individual file, it is well suited to diverse sets of documents. In general, the parsing quality is good and consistently so. In contrast, specialized parsers may only work with a particular version of a file, or require extra tinkering.

On the other hand, a specialized library can offer more control and features when it comes to structured data and formatting. For example, the tabulapdf package by @leeper and @tpaskhalis includes bindings to the ‘Tabula PDF Table Extractor Library’. Because PDF files store tables as a series of positions with no obvious boundaries between data cells, extracting a data.frame or matrix requires heuristics and customization, which that package provides. To be fair to Tika, there are some formats where rtika will extract data as table-like XML. For example, with Word and Excel documents, Tika extracts simple tables as XHTML data that can be turned into a tabular data.frame using the rvest::html_table() function.
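
To illustrate that last point, the sketch below converts a small XHTML table into a data.frame with rvest. The XHTML string is a hand-written stand-in for the kind of output tika_xml() might produce for a Word table, not actual Tika output, and the sketch assumes the rvest package is installed.

```r
library(rvest)

# Hand-written stand-in for XHTML returned by tika_xml(); not real Tika output.
xhtml <- '<html xmlns="http://www.w3.org/1999/xhtml"><body>
  <table>
    <tr><td>name</td><td>value</td></tr>
    <tr><td>a</td><td>1</td></tr>
    <tr><td>b</td><td>2</td></tr>
  </table>
</body></html>'

# html_table() returns one table per <table> element; take the first one
# and treat its first row as the header.
tables <- html_table(read_html(xhtml), header = TRUE)
df <- as.data.frame(tables[[1]])
df
```

With a real document, the same html_table() call would be applied to the string returned by tika_xml() for that file.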

History

In September 2017, github.com user kyusque released tikaR, which uses the rJava package to interact with Tika (See: https://github.com/kyusque/tikaR). As of writing, it provided similar text and metadata extraction, but only xml output.

Back in March 2012, I started a similar project to interface with Apache Tika. My code also used low-level functions from the rJava package. I halted development after discovering that the Tika command line interface (CLI) was easier to use. My empty repository is at https://r-forge.r-project.org/projects/r-tika/.

I chose to finally develop this package after getting excited by Tika’s new ‘batch processor’ module, written in Java. The batch processor is very efficient when processing tens of thousands of documents. Further, it is not too slow for a single document either, and handles errors gracefully. Connecting R to the Tika batch processor turned out to be relatively simple, because the R code is simple. It uses the CLI to point Tika to the files. Simplicity, along with continuous testing, should ease integration. I anticipate that some researchers will need plain text output, while others will want json output. Some will want multiple processing threads to speed things up. These features are now implemented in rtika, although apparently not in tikaR yet.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
