Python bindings for Apache Tika

fedelemantuano/tika-app-python

tika-app-python

Overview

tika-app-python is a wrapper for Apache Tika App. With this library you can analyze:

  • a file on disk
  • a payload in base64
  • a file object (like standard input)

To analyze a file object, you need Apache Tika version >= 1.17.

Apache 2 Open Source License

tika-app-python can be downloaded, used, and modified free of charge. It is available under the Apache 2 license.

Authors

Main Author

Fedele Mantuano (Twitter: @fedelemantuano)

Installation

Clone the repository:

git clone https://github.com/fedelemantuano/tika-app-python.git

and install tika-app-python with setup.py:

cd tika-app-python
python setup.py install

or use pip:

pip install tika-app

Usage in a project

Import the TikaApp class and create a client that points at your Apache Tika app JAR:

from tikapp import TikaApp
tika_client = TikaApp(file_jar="/opt/tika/tika-app-1.18.jar")

To get the content type:

tika_client.detect_content_type("your_file")

To detect the language:

tika_client.detect_language("your_file")

To extract all metadata and content:

tika_client.extract_all_content("your_file")

To extract only the content:

tika_client.extract_only_content("your_file")

To extract only the metadata:

tika_client.extract_only_metadata("your_file")

You can analyze a base64 payload with the same methods by passing the payload argument:

tika_client.detect_content_type(payload="base64_payload")
tika_client.detect_language(payload="base64_payload")
tika_client.extract_all_content(payload="base64_payload")
tika_client.extract_only_content(payload="base64_payload")
tika_client.extract_only_metadata(payload="base64_payload")
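The payload argument expects base64 data, so a file's bytes have to be encoded first. A minimal sketch of that step using only the standard library; the helper name file_to_base64_payload is hypothetical, and it assumes the library accepts the payload as a text string:

```python
import base64


def file_to_base64_payload(path):
    """Read a file and return its contents as a base64 text string."""
    with open(path, "rb") as handle:
        raw = handle.read()
    # b64encode returns bytes; decode to str so it can be passed as text.
    # (Assumption: the payload argument accepts a text string.)
    return base64.b64encode(raw).decode("ascii")


# Possible usage with the client created earlier:
# payload = file_to_base64_payload("your_file")
# tika_client.detect_content_type(payload=payload)
```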

You can also analyze a file object (like standard input) with the same methods by passing the objectInput argument:

tika_client.detect_language(objectInput="objectInput")
tika_client.extract_all_content(objectInput="objectInput")
tika_client.extract_only_content(objectInput="objectInput")
tika_client.extract_only_metadata(objectInput="objectInput")
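As an illustration of the file object case, the sketch below wraps extract_only_content so that it reads from standard input unless another stream is given. The function name extract_from_stream is hypothetical, and the client is assumed to be a TikaApp instance backed by Apache Tika >= 1.17:

```python
import sys


def extract_from_stream(client, stream=None):
    """Return the content extracted from an open file object.

    `client` is assumed to be a TikaApp instance; `stream` defaults to
    standard input when no file object is passed.
    """
    if stream is None:
        stream = sys.stdin
    # objectInput is the keyword documented in this README.
    return client.extract_only_content(objectInput=stream)


# Possible usage:
# with open("your_file", "rb") as handle:
#     text = extract_from_stream(tika_client, handle)
```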

Usage from command-line

If you installed tika-app-python with pip or setup.py, you can use it from the command line. To run it, you must provide the Apache Tika app JAR. You can:

  • set the environment variable TIKA_APP_JAR
  • use the --jar switch

The --jar switch overrides the environment variable.

These are all the switches:
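The precedence between the two options can be sketched as a small lookup. This is not the tool's actual code; resolve_tika_jar is a hypothetical helper that only illustrates the rule that --jar wins over TIKA_APP_JAR:

```python
import os


def resolve_tika_jar(cli_jar=None, environ=None):
    """Pick the JAR path: the --jar value first, then TIKA_APP_JAR."""
    if environ is None:
        environ = os.environ
    # A value given on the command line takes precedence.
    if cli_jar:
        return cli_jar
    # Otherwise fall back to the environment variable.
    jar = environ.get("TIKA_APP_JAR")
    if jar:
        return jar
    raise RuntimeError("No Tika app JAR: set TIKA_APP_JAR or use --jar")
```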

usage: tikapp [-h] (-f FILE | -p PAYLOAD | -k) [-j JAR] [-d] [-t] [-l]
              [-m] [-a] [-v]

Wrapper for Apache Tika App.

optional arguments:
  -h, --help            show this help message and exit
  -f FILE, --file FILE  File to submit (default: None)
  -p PAYLOAD, --payload PAYLOAD
                        Base64 payload to submit (default: None)
  -k, --stdin           Enable parsing from stdin (default: False)
  -j JAR, --jar JAR     Apache Tika app JAR (default: None)
  -d, --detect          Detect document type (default: False)
  -t, --text            Output plain text content (default: False)
  -l, --language        Output only language (default: False)
  -m, --metadata        Output only metadata (default: False)
  -a, --all             Output metadata and content from all embedded files
                        (default: False)
  -v, --version         show program's version number and exit

Example from file on disk:

$ tikapp -f example_file -a

Example from standard input:

$ tikapp -a -k < example_file
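If you want to drive the command-line tool from a script, one option is to build the argument list from the switches listed above and run it with subprocess. The helper names below are hypothetical; only the switches themselves come from the usage text:

```python
import subprocess

# Long-form output switches taken from the usage text.
OUTPUT_MODES = {"--detect", "--text", "--language", "--metadata", "--all"}


def build_tikapp_command(file_path, jar=None, mode="--all"):
    """Build the argument list for the tikapp command-line tool."""
    if mode not in OUTPUT_MODES:
        raise ValueError("unsupported mode: %s" % mode)
    command = ["tikapp", "--file", file_path, mode]
    if jar:
        command += ["--jar", jar]
    return command


def run_tikapp(file_path, jar=None, mode="--all"):
    """Run tikapp and return its standard output as text."""
    command = build_tikapp_command(file_path, jar=jar, mode=mode)
    return subprocess.check_output(command).decode("utf-8", "replace")
```

Calling build_tikapp_command("example_file") yields the same invocation as the file-on-disk example, written with long switches.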

Performance tests

These are the results of the performance tests in the tests folder:

(Python 2)
tika_content_type()             0.704840 sec
tika_detect_language()          1.592066 sec
magic_content_type()            0.000215 sec
tika_extract_all_content()      0.816366 sec
tika_extract_only_content()     0.788667 sec

(Python 3)
tika_content_type()             0.698357 sec
tika_detect_language()          1.593452 sec
magic_content_type()            0.000226 sec
tika_extract_all_content()      0.785915 sec
tika_extract_only_content()     0.766517 sec
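To take a comparable measurement on your own files, a single call can be timed with the standard library. This is only a sketch and not necessarily how the scripts in the tests folder measure; time_call is a hypothetical helper:

```python
from timeit import default_timer


def time_call(func, *args, **kwargs):
    """Return (elapsed_seconds, result) for a single call of func."""
    start = default_timer()
    result = func(*args, **kwargs)
    elapsed = default_timer() - start
    return elapsed, result


# Possible usage with the client created earlier:
# elapsed, content_type = time_call(tika_client.detect_content_type, "your_file")
# print("tika_content_type() %f sec" % elapsed)
```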