Python bindings for Apache Tika
fedelemantuano/tika-app-python
tika-app-python is a wrapper for Apache Tika App. With this library you can analyze:
- a file on disk
- a payload in base64
- a file object (like standard input)
Analyzing file objects requires Apache Tika version 1.17 or later.
tika-app-python can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.
Fedele Mantuano (Twitter: @fedelemantuano)
Clone the repository:
git clone https://github.com/fedelemantuano/tika-app-python.git
and install tika-app-python with setup.py:
cd tika-app-python
python setup.py install
or use pip:
pip install tika-app
Import the TikaApp class:
from tikapp import TikaApp
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")
To get the content type:
tika_client.detect_content_type("your_file")
To detect the language:
tika_client.detect_language("your_file")
To extract all metadata and content:
tika_client.extract_all_content("your_file")
To extract only the content:
tika_client.extract_only_content("your_file")
To extract only the metadata:
tika_client.extract_only_metadata("your_file")
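The calls above can be combined into a single helper. The sketch below is only an illustration: `describe_file` is a hypothetical name, not part of tika-app-python, and it assumes nothing beyond the method names shown above. The return types of those methods are not stated here, so the values are passed through unchanged.

```python
def describe_file(client, path):
    """Collect content type, language and metadata for one file.

    `client` is expected to be a TikaApp instance, or any object that
    exposes the same three methods. This helper is hypothetical and is
    not provided by tika-app-python.
    """
    return {
        "content_type": client.detect_content_type(path),
        "language": client.detect_language(path),
        "metadata": client.extract_only_metadata(path),
    }
```

With the client created earlier, this would be called as `describe_file(tika_client, "your_file")`.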
You can analyze a base64 payload with the same methods by passing the payload argument:
tika_client.detect_content_type(payload="base64_payload")
tika_client.detect_language(payload="base64_payload")
tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
tika_client.extract_only_metadata(payload="base64_payload")
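If the data starts out as a file, a base64 payload can be built with the standard library. This is a minimal sketch: `file_to_base64_payload` is a hypothetical helper, and returning a text string rather than bytes is an assumption about what the payload argument accepts.

```python
import base64


def file_to_base64_payload(path):
    """Read a file in binary mode and return its content as base64 text.

    Hypothetical helper; decoding to an ASCII string is an assumption
    about the form the `payload` argument expects.
    """
    with open(path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("ascii")
```

The result could then be passed on, for example as `tika_client.detect_content_type(payload=file_to_base64_payload("your_file"))`.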
You can also analyze a file object (like standard input) with the same methods by passing the objectInput argument:
tika_client.detect_language(objectInput="objectInput")
tika_client.extract_all_content(objectInput="objectInput")
tika_client.extract_only_content(objectInput="objectInput")
tika_client.extract_only_metadata(objectInput="objectInput")
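As an illustration of the file object form, the sketch below opens a file and hands the open handle to the client. `extract_from_file_object` is a hypothetical helper, and opening the file in binary mode is an assumption, since it is not stated here whether the library expects a text or a binary stream.

```python
def extract_from_file_object(client, path):
    """Open `path` and pass the open handle as `objectInput`.

    `client` is expected to be a TikaApp instance. Binary mode is an
    assumption; switch to text mode if the library requires it.
    """
    with open(path, "rb") as handle:
        return client.extract_all_content(objectInput=handle)
```

Any other file-like object, such as standard input, could be passed in the same way.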
If you installed tika-app-python with pip or setup.py, you can use it from the command line. To use tika-app-python you must provide the Apache Tika app JAR. You can:
- set the environment variable TIKA_APP_JAR
- use the --jar switch
The --jar switch overrides the environment variable.
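The precedence between the two options can be sketched as follows. This is not the tool's actual code: `resolve_jar` is a hypothetical function that only illustrates the rule that a value given with --jar wins over TIKA_APP_JAR, and raising an error when neither is set is an assumption.

```python
import os


def resolve_jar(cli_jar=None, environ=None):
    """Return the JAR path, preferring the --jar value over TIKA_APP_JAR.

    Hypothetical illustration of the precedence rule; `environ` defaults
    to the process environment and can be replaced for testing.
    """
    environ = os.environ if environ is None else environ
    jar = cli_jar or environ.get("TIKA_APP_JAR")
    if not jar:
        # Assumed behavior: fail loudly when no JAR is configured.
        raise ValueError("Set TIKA_APP_JAR or pass --jar")
    return jar
```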
These are all the switches:
usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l] [-m] [-a] [-v]

Wrapper for Apache Tika App.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  File to submit (default: None)
  -p PAYLOAD, --payload PAYLOAD
                        Base64 payload to submit (default: None)
  -k, --stdin           Enable parsing from stdin (default: False)
  -j JAR, --jar JAR     Apache Tika app JAR (default: None)
  -d, --detect          Detect document type (default: False)
  -t, --text            Output plain text content (default: False)
  -l, --language        Output only language (default: False)
  -m, --metadata        Output only metadata (default: False)
  -a, --all             Output metadata and content from all embedded files (default: False)
  -v, --version         show program's version number and exit
Example from a file on disk:
$ tikapp -f example_file -a
Example from standard input:
$ tikapp -a -k < example_file
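The same command line can also be assembled from Python. The sketch below only builds the argument list, using the -f, -a and -j switches from the usage text; `build_tikapp_command` is a hypothetical helper and not part of tika-app-python.

```python
def build_tikapp_command(file_path, jar=None):
    """Build the argument list for `tikapp -f FILE -a [-j JAR]`.

    Hypothetical helper: it only assembles the arguments and does not
    run anything.
    """
    command = ["tikapp", "-f", file_path, "-a"]
    if jar is not None:
        command += ["-j", jar]
    return command
```

The resulting list can be passed to `subprocess.run`, assuming the tikapp command is installed and on the PATH.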
These are the results of the performance tests in the tests folder:
(Python 2)
tika_content_type() 0.704840 sec
tika_detect_language() 1.592066 sec
magic_content_type() 0.000215 sec
tika_extract_all_content() 0.816366 sec
tika_extract_only_content() 0.788667 sec
(Python 3)
tika_content_type() 0.698357 sec
tika_detect_language() 1.593452 sec
magic_content_type() 0.000226 sec
tika_extract_all_content() 0.785915 sec
tika_extract_only_content() 0.766517 sec
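How the scripts in the tests folder take these measurements is not shown here. As one possible approach, a per-call timing can be obtained with the standard `timeit` module. `average_seconds` is a hypothetical helper, and the repeat count of 3 is an arbitrary choice.

```python
import timeit


def average_seconds(func, repeat=3):
    """Return the mean wall-clock time in seconds of calling func().

    Hypothetical helper; this is not necessarily how the bundled
    performance tests are implemented.
    """
    total = timeit.timeit(func, number=repeat)
    return total / repeat
```

A call such as `average_seconds(lambda: tika_client.detect_language("your_file"))` would time one of the methods listed above.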