Extract text or metadata from over a thousand filetypes.
Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java library, has server and command-line editions suitable for use from other programming languages …
For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities. (From https://en.wikipedia.org/wiki/Apache_Tika, accessed Jan 18, 2018)
This is an R interface to the Tika software.
To start, you need R and Java 11 or newer (e.g. OpenJDK 17+). To check your version, run the command java -version from a terminal. Get Java installation tips at https://www.java.com/en/download/ or https://openjdk.org/install/. Because the rJava package is not required, installation is simple. You can cut and paste the following snippet:
```r
install.packages('rtika', repos = 'https://cloud.r-project.org')

library('rtika')

# You need to install the Apache Tika .jar once.
install_tika()
```

Read an introductory article at https://docs.ropensci.org/rtika/articles/rtika_introduction.html.
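To confirm the setup, a quick sanity check is possible from the R console. The snippet below is a minimal sketch: it assumes the java executable is on your PATH, and uses tika_jar(), which returns the path of the installed Tika App .jar.

```r
# Print the Java version R will shell out to (java -version writes to stderr).
writeLines(system2("java", args = "-version", stdout = TRUE, stderr = TRUE))

# After install_tika(), this should return the path of the Tika App .jar.
rtika::tika_jar()
```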
- tika_text() to extract plain text.
- tika_xml() and tika_html() to get a structured XHTML rendition.
- tika_json() to get metadata as .json, with XHTML content.
- tika_json_text() to get metadata as .json, with plain text content.
- tika() is the main function the others above inherit from.
- tika_fetch() to download files with a file extension matching the Content-Type (a short example follows the list of formats below).

Tika parses and extracts text or metadata from over one thousand digital formats, including:
- Portable Document Format (.pdf)
- Rich Text Format (.rtf)
- Electronic Publication Format (.epub)
- Image formats (.jpeg, .png, etc.)
- Mail formats (.mbox, Outlook)
- HyperText Markup Language (.html)
- XML and derived formats (.xml, etc.)
- Compression and packaging formats (.gzip, .rar, etc.)

For a list of MIME types, look for the "Supported Formats" page here: https://tika.apache.org/
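As noted above, tika_fetch() pairs with the parsers to process documents straight from the web. The snippet below is a minimal sketch, assuming an internet connection and that tika_fetch() accepts a vector of URLs, saves each file with an extension matching the reported Content-Type, and returns the local paths.

```r
library(rtika)

# Download a page and get its structured XHTML rendition.
path  <- tika_fetch("https://tika.apache.org/")
xhtml <- tika_html(path)

# Peek at the beginning of the rendition.
substr(xhtml, 1, 100)
```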
The rtika package processes batches of documents efficiently, so I recommend batches. Currently, the tika() parsers take a tiny bit of time to spin up, and that will get annoying with hundreds of separate calls to the functions.
```r
# Test files
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

# batches are best, and can also be piped with magrittr.
text <- tika_text(batch)

# text has one string for each document:
length(text)
#> [1] 7

# A snippet:
cat(substr(text[1], 54, 190))
#> lite’
#> June 1, 2017
#>
#> Version 1.5
#>
#> Title A Robust, High Performance JSON Parser and Generator for R
#>
#> License MIT + file LICENSE
#>
#> NeedsCompi
```

To learn more and find out how to extract structured text and metadata, read the vignette: https://docs.ropensci.org/rtika/articles/rtika_introduction.html.
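For a taste of the metadata side, the sketch below runs tika_json_text() on the same batch and parses the result with jsonlite. It is a minimal sketch, assuming jsonlite is installed and that each returned string follows Tika's recursive JSON metadata format, which includes a Content-Type field.

```r
library(jsonlite)

# One JSON string per document in the batch above.
json <- tika_json_text(batch)

# Parse each document's metadata.
metadata <- lapply(json, fromJSON)

# e.g. the Content-Type detected for the first document:
metadata[[1]]$`Content-Type`
```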
Tika can also interact with the Tesseract OCR program on some Linux variants, to extract plain text from images of text. If tesseract-ocr is installed, Tika should automatically locate and use it for images and PDFs that contain images of text. However, this does not seem to work on OS X or Windows. To try it on Linux, first follow the Tesseract installation instructions. The next time Tika is run, it should work. For a different approach, I suggest the tesseract package by @jeroen, which is a specialized R interface.
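On Linux, a quick way to see whether Tika picks up Tesseract is to check for the binary and run tika_text() on one of the bundled image files. This is a minimal sketch; the exact OCR output will depend on the Tesseract version installed.

```r
# Only attempt OCR if a tesseract binary is on the PATH.
if (nzchar(Sys.which("tesseract"))) {
  img <- system.file("extdata", "calculator.jpg", package = "rtika")
  cat(tika_text(img))  # Tika should hand the image to Tesseract automatically
}
```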
The Apache Tika community welcomes your feedback. Issues regarding the R interface should be raised at the rTika Github Issue Tracker. If you are confident the issue concerns Tika or one of its underlying parsers, use the Tika Bug Tracking System.
If your project or package needs to use the Tika App .jar, you can include rTika as a dependency and call the rtika::tika_jar() function to get the path to the Tika app installed on the system.
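For example, a dependent package could shell out to that .jar directly. The snippet below is a minimal sketch, assuming java is on the PATH and that the Tika App accepts the --version flag, which prints its version number.

```r
# Locate the installed Tika App .jar and call it directly via the CLI.
jar <- rtika::tika_jar()
system2("java", args = c("-jar", shQuote(jar), "--version"))
```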
There are a number of specialized parsers that overlap in functionality. For example, the pdftools package extracts metadata and text from PDF files, the antiword package extracts text from Word documents, and the epubr package by @leonawicz processes epub files. These packages do not depend on Java, while rTika does.
The big difference between Tika and a specialized parser is that Tika integrates dozens of specialist libraries maintained by the Apache Foundation. Apache Tika processes over a thousand file types and multiple versions of each. This eases the processing of digital archives that contain unpredictable files. For example, researchers use Tika to process archives from court cases, governments, or the Internet Archive that span multiple years. These archives frequently contain diverse formats and multiple versions of each format. Because Tika finds the matching parser for each individual file, it is well suited to diverse sets of documents. In general, the parsing quality is good and consistently so. In contrast, specialized parsers may only work with a particular version of a format, or require extra tinkering.
On the other hand, a specialized library can offer more control and features when it comes to structured data and formatting. For example, the tabulapdf package by @leeper and @tpaskhalis includes bindings to the 'Tabula PDF Table Extractor Library'. Because PDF files store tables as a series of positions with no obvious boundaries between data cells, extracting a data.frame or matrix requires heuristics and customization, which that package provides. To be fair to Tika, there are some formats where rtika will extract data as table-like XML. For example, with Word and Excel documents, Tika extracts simple tables as XHTML data that can be turned into a tabular data.frame using the rvest::html_table() function.
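To illustrate, the sketch below runs tika_html() on the bundled table.docx test file and converts the first table with rvest. It is a minimal sketch, assuming that file contains a simple table and that rvest is installed.

```r
library(rtika)
library(rvest)

# Turn a simple Word table into a data.frame via the XHTML rendition.
docx   <- system.file("extdata", "table.docx", package = "rtika")
xhtml  <- tika_html(docx)
tables <- html_table(read_html(xhtml))

# The first table found in the document:
tables[[1]]
```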
In September 2017, github.com user kyusque released tikaR, which uses the rJava package to interact with Tika (see: https://github.com/kyusque/tikaR). As of writing, it provided similar text and metadata extraction, but only xml output.
Back in March 2012, I started a similar project to interface with Apache Tika. My code also used low-level functions from the rJava package. I halted development after discovering that the Tika command line interface (CLI) was easier to use. My empty repository is at https://r-forge.r-project.org/projects/r-tika/.
I chose to finally develop this package after getting excited by Tika’s new ‘batch processor’ module, written in Java. The batch processor has very good efficiency when processing tens of thousands of documents. Further, it is not too slow for a single document either, and handles errors gracefully. Connecting R to the Tika batch processor turned out to be relatively simple, because the R code is simple: it just uses the CLI to point Tika to the files. Simplicity, along with continuous testing, should ease integration. I anticipate that some researchers will need plain text output, while others will want json output. Some will want multiple processing threads to speed things up. These features are now implemented in rtika, although apparently not in tikaR yet.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.