ropensci/rtika
Extract text or metadata from over a thousand file types.
Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java library, has server and command-line editions suitable for use from other programming languages …
For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities. (From https://en.wikipedia.org/wiki/Apache_Tika, accessed Jan 18, 2018)
This is an R interface to the Tika software.
To start, you need R and Java 11 or newer (for example, OpenJDK 17). To check your version, run the command `java -version` from a terminal. Get Java installation tips at https://www.java.com/en/download/ or https://openjdk.org/install/. Because the rJava package is not required, installation is simple. You can cut and paste the following snippet:

```r
install.packages('rtika', repos = 'https://cloud.r-project.org')
library('rtika')
# You need to install the Apache Tika .jar once.
install_tika()
```
Read an introductory article at https://docs.ropensci.org/rtika/articles/rtika_introduction.html.

- `tika_text()` to extract plain text.
- `tika_xml()` and `tika_html()` to get a structured XHTML rendition.
- `tika_json()` to get metadata as `.json`, with XHTML content.
- `tika_json_text()` to get metadata as `.json`, with plain text content.
- `tika()` is the main function the others above inherit from.
- `tika_fetch()` to download files with a file extension matching the Content-Type.
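As a rough sketch of how these functions relate, the following runs one of the bundled test files through two of them. It assumes `install_tika()` has already been run and that the `jsonlite` package (not part of rtika) is installed for parsing the JSON string. Which metadata fields come back depends on the file and on the Tika version.

```r
library('rtika')

# One of the test files bundled with the package.
input <- system.file("extdata", "jsonlite.pdf", package = "rtika")

# Plain text: one character string per input file.
text <- tika_text(input)

# Metadata as a JSON string, with plain text content.
json <- tika_json_text(input)

# Assumption: jsonlite is installed; it is used here only to parse the JSON.
metadata <- jsonlite::fromJSON(json[1])

# Inspect which metadata fields Tika returned for this file.
names(metadata)
```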
Tika parses and extracts text or metadata from over one thousand digital formats, including:
- Portable Document Format (`.pdf`)
- Microsoft Office document formats (Word, PowerPoint, Excel, etc.)
- Rich Text Format (`.rtf`)
- Electronic Publication Format (`.epub`)
- Image formats (`.jpeg`, `.png`, etc.)
- Mail formats (`.mbox`, Outlook)
- HyperText Markup Language (`.html`)
- XML and derived formats (`.xml`, etc.)
- Compression and packaging formats (`.gzip`, `.rar`, etc.)
- OpenDocument Format
- iWorks document formats
- WordPerfect document formats
- Text formats
- Feed and Syndication formats
- Help formats
- Audio formats
- Video formats
- Java class files and archives
- Source code
- CAD formats
- Font formats
- Scientific formats
- Executable programs and libraries
- Crypto formats
For a list of MIME types, look for the “Supported Formats” page here: https://tika.apache.org/
The rtika package processes batches of documents efficiently, so I recommend batches. Currently, the `tika()` parsers take a moment to spin up, and that delay adds up over hundreds of separate calls to the functions.

```r
# Test files
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "table.docx", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika"),
  system.file("extdata", "R-FAQ.html", package = "rtika"),
  system.file("extdata", "calculator.jpg", package = "rtika"),
  system.file("extdata", "tika.apache.org.zip", package = "rtika")
)

# batches are best, and can also be piped with magrittr.
text <- tika_text(batch)

# text has one string for each document:
length(text)
#> [1] 7

# A snippet:
cat(substr(text[1], 54, 190))
#> lite’
#> June 1, 2017
#>
#> Version 1.5
#>
#> Title A Robust, High Performance JSON Parser and Generator for R
#>
#> License MIT + file LICENSE
#>
#> NeedsCompi
```

To learn more and find out how to extract structured text and metadata, read the vignette: https://docs.ropensci.org/rtika/articles/rtika_introduction.html.
Tika also can interact with the Tesseract OCR program on some Linux variants, to extract plain text from images of text. If `tesseract-ocr` is installed, Tika should automatically locate and use it for images and PDFs that contain images of text. However, this does not seem to work on OS X or Windows. To try on Linux, first follow the Tesseract installation instructions. The next time Tika is run, it should work. For a different approach, I suggest the tesseract package by @jeroen, which is a specialized R interface.
The Apache Tika community welcomes your feedback. Issues regarding the R interface should be raised at the rTika GitHub Issue Tracker. If you are confident the issue concerns Tika or one of its underlying parsers, use the Tika Bugtracking System.
If your project or package needs to use the Tika App `.jar`, you can include rTika as a dependency and call the `rtika::tika_jar()` function to get the path to the Tika app installed on the system.
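A minimal sketch of that pattern: a hypothetical helper, `tika_version()`, locates the jar with `rtika::tika_jar()` and calls the Tika App directly. It assumes a `java` executable is on the PATH and that `install_tika()` has been run. The `--version` flag belongs to the Tika App command line, not to rtika.

```r
# Hypothetical helper for a downstream package; not part of rtika.
tika_version <- function(java = "java") {
  jar <- rtika::tika_jar()
  # Assumption: tika_jar() yields a usable path once install_tika() has been run.
  if (length(jar) != 1 || is.na(jar) || !file.exists(jar)) {
    stop("Tika App .jar not found; run rtika::install_tika() first.")
  }
  # Run 'java -jar <path> --version' and capture the printed output.
  system2(java, args = c("-jar", shQuote(jar), "--version"), stdout = TRUE)
}

tika_version()
```

Checking the path before calling Java keeps the error message readable when the jar has not been installed yet.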
There are a number of specialized parsers that overlap in functionality. For example, the pdftools package extracts metadata and text from PDF files, the antiword package extracts text from recent versions of Word, and the epubr package by @leonawicz processes epub files. These packages do not depend on Java, while rTika does.
The big difference between Tika and a specialized parser is that Tika integrates dozens of specialist libraries maintained by the Apache Foundation. Apache Tika processes over a thousand file types and multiple versions of each. This eases the processing of digital archives that contain unpredictable files. For example, researchers use Tika to process archives from court cases, governments, or the Internet Archive that span multiple years. These archives frequently contain diverse formats and multiple versions of each format. Because Tika finds the matching parser for each individual file, it is well suited to diverse sets of documents. In general, the parsing quality is good and consistently so. In contrast, specialized parsers may only work with a particular version of a file, or require extra tinkering.
On the other hand, a specialized library can offer more control and features when it comes to structured data and formatting. For example, the tabulapdf package by @leeper and @tpaskhalis includes bindings to the ‘Tabula PDF Table Extractor Library’. Because PDF files store tables as a series of positions with no obvious boundaries between data cells, extracting a `data.frame` or `matrix` requires heuristics and customization, which that package provides. To be fair to Tika, there are some formats where rtika will extract data as table-like XML. For example, with Word and Excel documents, Tika extracts simple tables as XHTML data that can be turned into a tabular `data.frame` using the `rvest::html_table()` function.
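The Word-table case might look roughly like the sketch below, using the bundled `table.docx` test file. It assumes `install_tika()` has been run and that the `xml2` and `rvest` packages (not part of rtika) are installed. How many tables come back, and how clean they are, depends on the document.

```r
library('rtika')

# The bundled Word test document.
input <- system.file("extdata", "table.docx", package = "rtika")

# Tika renders the document as an XHTML string.
html <- tika_html(input)

# Assumption: xml2 and rvest are installed; parse the XHTML string into a document.
doc <- xml2::read_html(html[1])

# html_table() on a document returns a list with one data frame per <table> element.
tables <- rvest::html_table(doc)
length(tables)
```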
In September 2017, github.com user kyusque released tikaR, which uses the rJava package to interact with Tika (See: https://github.com/kyusque/tikaR). As of writing, it provided similar text and metadata extraction, but only `xml` output.
Back in March 2012, I started a similar project to interface with Apache Tika. My code also used low-level functions from the rJava package. I halted development after discovering that the Tika command line interface (CLI) was easier to use. My empty repository is at https://r-forge.r-project.org/projects/r-tika/.
I chose to finally develop this package after getting excited by Tika’s new ‘batch processor’ module, written in Java. The batch processor has very good efficiency when processing tens of thousands of documents. Further, it is not too slow for a single document either, and handles errors gracefully. Connecting R to the Tika batch processor turned out to be relatively simple, because the R code is simple. It uses the CLI to point Tika to the files. Simplicity, along with continuous testing, should ease integration. I anticipate that some researchers will need plain text output, while others will want `json` output. Some will want multiple processing threads to speed things up. These features are now implemented in rtika, although apparently not in tikaR yet.
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
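As an illustration of those options, the sketch below requests plain text and `json` output for the same small batch and sets the number of processing threads. It assumes that `threads` is an argument of `tika()` that the wrapper functions pass through; check `?tika` for the arguments available in the installed version.

```r
library('rtika')

# A small batch built from the bundled test files.
batch <- c(
  system.file("extdata", "jsonlite.pdf", package = "rtika"),
  system.file("extdata", "curl.pdf", package = "rtika"),
  system.file("extdata", "xml2.pdf", package = "rtika")
)

# Assumption: 'threads' is passed through to tika(); see ?tika to confirm.
text <- tika_text(batch, threads = 2)
json <- tika_json(batch, threads = 2)

# Each result holds one element per input file.
length(text)
length(json)
```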
