R Interface to Apache Tika


ropensci/rtika


Extract text or metadata from over a thousand file types.


Apache Tika is a content detection and analysis framework, written in Java, stewarded at the Apache Software Foundation. It detects and extracts metadata and text from over a thousand different file types, and as well as providing a Java library, has server and command-line editions suitable for use from other programming languages …

For most of the more common and popular formats, Tika then provides content extraction, metadata extraction and language identification capabilities. (From https://en.wikipedia.org/wiki/Apache_Tika, accessed Jan 18, 2018)

This is an R interface to the Tika software.

Installation

To start, you need R and Java 11 or newer (for example, OpenJDK 17). To check your version, run the command java -version from a terminal. Get Java installation tips at https://www.java.com/en/download/ or https://openjdk.org/install/. Because the rJava package is not required, installation is simple. You can cut and paste the following snippet:

    install.packages('rtika', repos = 'https://cloud.r-project.org')
    library('rtika')

    # You need to install the Apache Tika .jar once.
    install_tika()
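The Java version check can also be done from inside R. The following is only a sketch: java_major_version() is a hypothetical helper, not part of rtika, and it assumes that java -version prints a line containing something like version "17.0.2".

```r
# Hypothetical helper (not part of rtika): extract the major Java version
# from the lines printed by `java -version`.
java_major_version <- function(lines) {
  # Keep only the 'version "..."' fragment from any line that has one.
  hit <- regmatches(lines, regexpr('version "[^"]+"', lines))
  if (length(hit) == 0) return(NA_integer_)
  version <- sub('version "([^"]+)"', "\\1", hit[1])
  parts <- strsplit(version, "[._-]")[[1]]
  # Java 8 and older report themselves as "1.8.0_301", so skip the leading 1.
  major <- if (parts[1] == "1" && length(parts) > 1) parts[2] else parts[1]
  as.integer(major)
}

if (nzchar(Sys.which("java"))) {
  # `java -version` writes to stderr, so capture both streams.
  out <- system2("java", "-version", stdout = TRUE, stderr = TRUE)
  java_major_version(out) >= 11
}
```

If the result is FALSE or NA, install a newer Java before calling install_tika().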

Read an introductory article at https://docs.ropensci.org/rtika/articles/rtika_introduction.html.

Key Features

  • tika_text() to extract plain text.
  • tika_xml() and tika_html() to get a structured XHTML rendition.
  • tika_json() to get metadata as .json, with XHTML content.
  • tika_json_text() to get metadata as .json, with plain text content.
  • tika() is the main function the others above inherit from.
  • tika_fetch() to download files with a file extension matching the Content-Type.
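As an illustration of the metadata functions, the sketch below parses the output of tika_json() with the jsonlite package. parse_tika_json() is a hypothetical helper, not part of rtika, and the sketch assumes that tika_json() returns one JSON string per document, with NA for a document that could not be processed.

```r
# Hypothetical helper (not part of rtika): parse each JSON string returned by
# tika_json(), leaving NULL where a document produced no output (NA).
parse_tika_json <- function(json) {
  lapply(json, function(x) if (is.na(x)) NULL else jsonlite::fromJSON(x))
}

# Run only when rtika and the Tika .jar are available; tika_jar() is assumed
# to return NA when the .jar has not been installed.
if (requireNamespace("rtika", quietly = TRUE) && !is.na(rtika::tika_jar())) {
  pdf <- system.file("extdata", "jsonlite.pdf", package = "rtika")
  metadata <- parse_tika_json(rtika::tika_json(pdf))
  # Inspect the metadata field names of the first document.
  names(metadata[[1]])
}
```

Swapping tika_json() for tika_json_text() in the call should give the same metadata with plain text content instead of XHTML.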

Supported File Types

Tika parses and extracts text or metadata from over one thousand digitalformats, including:

  • Portable Document Format (.pdf)
  • Microsoft Office document formats (Word, PowerPoint, Excel, etc.)
  • Rich Text Format (.rtf)
  • Electronic Publication Format (.epub)
  • Image formats (.jpeg, .png, etc.)
  • Mail formats (.mbox, Outlook)
  • HyperText Markup Language (.html)
  • XML and derived formats (.xml, etc.)
  • Compression and packaging formats (.gzip, .rar, etc.)
  • OpenDocument Format
  • iWorks document formats
  • WordPerfect document formats
  • Text formats
  • Feed and Syndication formats
  • Help formats
  • Audio formats
  • Video formats
  • Java class files and archives
  • Source code
  • CAD formats
  • Font formats
  • Scientific formats
  • Executable programs and libraries
  • Crypto formats

For a list of MIME types, look for the “Supported Formats” page here: https://tika.apache.org/

Get Plain Text

The rtika package processes batches of documents efficiently, so I recommend batches. Currently, the tika() parsers take a little time to spin up, and that delay adds up quickly across hundreds of separate calls to the functions.

    # Test files
    batch <- c(
      system.file("extdata", "jsonlite.pdf", package = "rtika"),
      system.file("extdata", "curl.pdf", package = "rtika"),
      system.file("extdata", "table.docx", package = "rtika"),
      system.file("extdata", "xml2.pdf", package = "rtika"),
      system.file("extdata", "R-FAQ.html", package = "rtika"),
      system.file("extdata", "calculator.jpg", package = "rtika"),
      system.file("extdata", "tika.apache.org.zip", package = "rtika")
    )

    # batches are best, and can also be piped with magrittr.
    text <- tika_text(batch)

    # text has one string for each document:
    length(text)
    #> [1] 7

    # A snippet:
    cat(substr(text[1], 54, 190))
    #> lite’
    #> June 1, 2017
    #>
    #> Version 1.5
    #>
    #> Title A Robust, High Performance JSON Parser and Generator for R
    #>
    #> License MIT + file LICENSE
    #>
    #> NeedsCompi
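In practice a batch often comes from a directory rather than a hand-written vector. The sketch below shows one way to build such a batch with base R; collect_batch() is a hypothetical helper, not part of rtika.

```r
# Hypothetical helper (not part of rtika): gather every file under `dir`
# into one character vector of full paths.
collect_batch <- function(dir, pattern = NULL) {
  list.files(dir, pattern = pattern, full.names = TRUE, recursive = TRUE)
}

# Run only when rtika and the Tika .jar are available; tika_jar() is assumed
# to return NA when the .jar has not been installed.
if (requireNamespace("rtika", quietly = TRUE) && !is.na(rtika::tika_jar())) {
  batch <- collect_batch(system.file("extdata", package = "rtika"))
  # One call for the whole batch, so the parsers spin up once.
  text <- rtika::tika_text(batch)
  # Pair each result with the file name it came from.
  names(text) <- basename(batch)
}
```

Passing a pattern such as "\\.pdf$" restricts the batch to one file type, although Tika is designed to handle mixed batches.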

To learn more and find out how to extract structured text and metadata, read the vignette: https://docs.ropensci.org/rtika/articles/rtika_introduction.html.

Enhancements

Tika can also interact with the Tesseract OCR program on some Linux variants, to extract plain text from images of text. If tesseract-ocr is installed, Tika should automatically locate and use it for images and PDFs that contain images of text. However, this does not seem to work on OS X or Windows. To try on Linux, first follow the Tesseract installation instructions. The next time Tika is run, it should work. For a different approach, I suggest the tesseract package by @jeroen, which is a specialized R interface.

The Apache Tika community welcomes your feedback. Issues regarding the R interface should be raised at the rTika GitHub Issue Tracker. If you are confident the issue concerns Tika or one of its underlying parsers, use the Tika Bugtracking System.

Using the Tika App Directly

If your project or package needs to use the Tika App .jar, you can include rTika as a dependency and call the rtika::tika_jar() function to get the path to the Tika app installed on the system.
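A minimal sketch of calling the Tika App from R through that path follows. tika_app_args() is a hypothetical helper, not part of rtika; the sketch assumes java is on the PATH and that the Tika App accepts the --version flag to print its version number.

```r
# Hypothetical helper (not part of rtika): build the arguments for
# `java -jar <tika-app.jar> ...`, quoting the path in case it contains spaces.
tika_app_args <- function(jar, ...) {
  c("-jar", shQuote(jar), ...)
}

# Run only when rtika and the Tika .jar are available; tika_jar() is assumed
# to return NA when the .jar has not been installed.
if (requireNamespace("rtika", quietly = TRUE) && !is.na(rtika::tika_jar())) {
  jar <- rtika::tika_jar()
  # Ask the Tika App for its version and capture what it prints.
  system2("java", tika_app_args(jar, "--version"), stdout = TRUE)
}
```

Any other Tika App command-line option can be passed the same way, as additional arguments after the path.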

Similar R Packages

There are a number of specialized parsers that overlap in functionality. For example, the pdftools package extracts metadata and text from PDF files, the antiword package extracts text from recent versions of Word, and the epubr package by @leonawicz processes epub files. These packages do not depend on Java, while rTika does.

The big difference between Tika and a specialized parser is that Tika integrates dozens of specialist libraries maintained by the Apache Foundation. Apache Tika processes over a thousand file types and multiple versions of each. This eases the processing of digital archives that contain unpredictable files. For example, researchers use Tika to process archives from court cases, governments, or the Internet Archive that span multiple years. These archives frequently contain diverse formats and multiple versions of each format. Because Tika finds the matching parser for each individual file, it is well suited to diverse sets of documents. In general, the parsing quality is good and consistently so. In contrast, specialized parsers may only work with a particular version of a file, or require extra tinkering.

On the other hand, a specialized library can offer more control and features when it comes to structured data and formatting. For example, the tabulapdf package by @leeper and @tpaskhalis includes bindings to the ‘Tabula PDF Table Extractor Library’. Because PDF files store tables as a series of positions with no obvious boundaries between data cells, extracting a data.frame or matrix requires heuristics and customization, which that package provides. To be fair to Tika, there are some formats where rtika will extract data as table-like XML. For example, with Word and Excel documents, Tika extracts simple tables as XHTML data that can be turned into a tabular data.frame using the rvest::html_table() function.
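The Word example can be sketched as follows, using the table.docx sample file that ships with rtika. tables_from_xhtml() is a hypothetical helper, not part of rtika, and the sketch assumes the rvest package is installed.

```r
# Hypothetical helper (not part of rtika): turn the XHTML for one document
# into a list of data frames, one per <table> element.
tables_from_xhtml <- function(xhtml) {
  rvest::html_table(rvest::read_html(xhtml))
}

# Run only when rtika and the Tika .jar are available; tika_jar() is assumed
# to return NA when the .jar has not been installed.
if (requireNamespace("rtika", quietly = TRUE) && !is.na(rtika::tika_jar())) {
  docx <- system.file("extdata", "table.docx", package = "rtika")
  xhtml <- rtika::tika_html(docx)
  tables <- tables_from_xhtml(xhtml[1])
  str(tables)
}
```

Merged cells or nested tables may still need manual cleanup after this step.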

History

In September 2017, github.com user kyusque released tikaR, which uses the rJava package to interact with Tika (See: https://github.com/kyusque/tikaR). As of writing, it provided similar text and metadata extraction, but only xml output.

Back in March 2012, I started a similar project to interface with Apache Tika. My code also used low-level functions from the rJava package. I halted development after discovering that the Tika command line interface (CLI) was easier to use. My empty repository is at https://r-forge.r-project.org/projects/r-tika/.

I chose to finally develop this package after getting excited by Tika’s new ‘batch processor’ module, written in Java. The batch processor is very efficient when processing tens of thousands of documents. Further, it is not too slow for a single document either, and handles errors gracefully. Connecting R to the Tika batch processor turned out to be relatively simple, because the R code is simple. It uses the CLI to point Tika to the files. Simplicity, along with continuous testing, should ease integration. I anticipate that some researchers will need plain text output, while others will want json output. Some will want multiple processing threads to speed things up. These features are now implemented in rtika, although apparently not in tikaR yet.

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.
