
Apache Tika

From Wikipedia, the free encyclopedia

Open-source content analysis framework

Tika

Developer(s)	Apache Software Foundation

Stable release	3.1.0^[1] / 31 January 2025

Repository	Tika Repository
Written in	Java
Operating system	Cross-platform
Type	Search and index API
License	Apache License 2.0
Website	tika.apache.org

Apache Tika is a content detection and analysis framework, written in Java and stewarded at the Apache Software Foundation.^[2] It detects and extracts metadata and text from over a thousand different file types. In addition to a Java library, it has server and command-line editions suitable for use from other programming languages.

History


The project originated as part of the Apache Nutch codebase, to provide content identification and extraction when crawling. In 2007, it was separated out to make it more extensible and usable by content management systems, other Web crawlers, and information retrieval systems. The standalone Tika was founded by Jérôme Charron, Chris Mattmann and Jukka Zitting.^[3] In 2011, Chris Mattmann and Jukka Zitting released the Manning book "Tika in Action", and the project released version 1.0.

Features


Tika provides capabilities for identification of more than 1400 file types from the Internet Assigned Numbers Authority taxonomy of MIME types. For most of the more common and popular formats,^[4] Tika then provides content extraction, metadata extraction and language identification capabilities.

It can also extract text from images by using the OCR software Tesseract.^[5]

While Tika is written in Java, it is widely used from other languages.^[6] The RESTful server and CLI tool permit non-Java programs to access the Tika functionality.
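As an illustration of access from a non-Java program, the following minimal Python sketch builds HTTP requests for a Tika server using only the standard library. It assumes a server instance is already running locally on its default port (9998) and uses the server's `/detect/stream`, `/tika` and `/meta` resources. The helper names `build_request` and `call_tika` are hypothetical and exist only for this example.

```python
import urllib.request

# Assumption: a Tika server is listening locally on its default port.
TIKA_URL = "http://localhost:9998"

# Server resources used in this sketch:
#   /detect/stream -> MIME type, /tika -> extracted text, /meta -> metadata
ENDPOINTS = {"detect": "/detect/stream", "text": "/tika", "meta": "/meta"}


def build_request(action, data, accept="text/plain", base_url=TIKA_URL):
    """Build (but do not send) a PUT request for one Tika server resource."""
    if action not in ENDPOINTS:
        raise ValueError(f"unknown action: {action!r}")
    return urllib.request.Request(
        base_url.rstrip("/") + ENDPOINTS[action],
        data=data,
        method="PUT",
        headers={
            "Accept": accept,
            # Generic type, so the server detects the format from the bytes.
            "Content-Type": "application/octet-stream",
        },
    )


def call_tika(action, data, accept="text/plain", timeout=30):
    """Send document bytes to the server and return the decoded response body."""
    request = build_request(action, data, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a server running, `call_tika("detect", open("report.pdf", "rb").read())` would return the detected MIME type as a string, `call_tika("text", ...)` the extracted text, and `call_tika("meta", ..., accept="application/json")` the metadata as JSON. The file name `report.pdf` is a placeholder. Separating request construction from sending keeps the sketch easy to inspect without a live server.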

Notable uses


Tika is used by financial institutions including the Fair Isaac Corporation (FICO)^[7] and Goldman Sachs,^[8] by NASA and academic researchers,^[9] and by major content management systems including Drupal^[10] and Alfresco^[11] to analyze large amounts of content and to make it available in common formats using information retrieval techniques.

On April 4, 2016,^[12] Forbes published an article identifying Tika as one of the key technologies used by more than 400 journalists to analyze 11.5 million leaked documents that exposed an international scandal involving world leaders storing money in offshore shell corporations. The leaked documents and the project to analyze them are referred to as the Panama Papers.

References


^ https://dist.apache.org/repos/dist/release/tika/3.1.0/CHANGES-3.1.0.txt.
^ "Apache Tika". Retrieved 2016-04-15.
^ "Tika Proposal". Retrieved 2016-04-15.
^ "The Apache Software Foundation". Apache Tika formats page. Retrieved 2016-04-16.
^ "TikaOCR". Apache Tika. 2019-03-26. Retrieved 2019-12-02.
^ "API Bindings for Tika". Apache Tika. Retrieved 2016-04-17.
^ "FICO to Engage Kaggle's Community of 180,000 Data Scientists to Drive Innovation in the FICO Analytic Cloud | FICO". FICO | Decisions. Archived from the original on 2016-06-03. Retrieved 2016-04-15.
^ "Goldman Sachs Puts Elasticsearch To Work - InformationWeek". InformationWeek. Retrieved 2017-06-21.
^ "Studying polar data with the help of Apache Tika". Opensource.com. Retrieved 2016-04-15.
^ "Text Extract for Drupal using Tika | Drupal.org". www.drupal.org. 2012-07-30. Retrieved 2016-04-15.
^ "Content Transformation and Metadata Extraction with Apache Tika - alfrescowiki". wiki.alfresco.com. 2015-06-05. Retrieved 2016-04-15.
^ Fox-Brewster, Thomas. "From Encrypted Drives To Amazon's Cloud -- The Amazing Flight Of The Panama Papers". Forbes. Retrieved 2016-04-15.
