An Elasticsearch plugin which provides the ability to export data by query on server side.
This Elasticsearch plugin provides the ability to export data by query on the server side, outputting the data directly on the corresponding node. The export can happen on all indexes, on a specific index, or on a specific document type.
The data will get exported as one JSON object per line:
{"_id":"id1","_source":{"type":"myObject","value":"value1"},"_version":1,"_index":"myIndex","_type":"myType"}{"_id":"id2","_source":{"type":"myObject","value":"value2"},"_version":2,"_index":"myIndex","_type":"myType"}The exported data can also be imported into elasticsearch again. The import ishappening on each elasticsearch node by processing the files located in thespecified directory.
Below are some examples demonstrating what can be done with the elasticsearch inout plugin. The example commands require installation on a UNIX system. The plugin may also work with different commands on other operating systems supporting Elasticsearch, but this has not been tested yet.
Export data to files in the node's file system. The file names will be expanded by index and shard names (e.g. /tmp/dump-myIndex-0):
```
curl -X POST 'http://localhost:9200/_export' -d '{
    "fields": ["_id", "_source", "_version", "_index", "_type"],
    "output_file": "/tmp/es-data/dump-${index}-${shard}"
}'
```

Do GZIP compression on file exports:
```
curl -X POST 'http://localhost:9200/_export' -d '{
    "fields": ["_id", "_source", "_version", "_index", "_type"],
    "output_file": "/tmp/es-data/dump-${index}-${shard}.gz",
    "compression": "gzip"
}'
```

Pipe the export data through a single argumentless command on the corresponding node, like cat. This command actually returns the export data in the JSON result's stdout field:
```
curl -X POST 'http://localhost:9200/_export' -d '{
    "fields": ["_id", "_source", "_version", "_index", "_type"],
    "output_cmd": "cat"
}'
```

Pipe the export data through commands with arguments (e.g. a shell script, or provide your own sophisticated script on the node). This command transforms the data to lower case and writes the file to the node's file system:
```
curl -X POST 'http://localhost:9200/_export' -d '{
    "fields": ["_id", "_source", "_version", "_index", "_type"],
    "output_cmd": ["/bin/sh", "-c", "tr [A-Z] [a-z] > /tmp/outputcommand.txt"]
}'
```

Limit the exported data with a query. The same query syntax as for search can be used:
```
curl -X POST 'http://localhost:9200/_export' -d '{
    "fields": ["_id", "_source", "_version", "_index", "_type"],
    "output_file": "/tmp/es-data/query-${index}-${shard}",
    "query": {
        "match": { "someField": "someValue" }
    }
}'
```

Export only objects of a specific index:
```
curl -X POST 'http://localhost:9200/myIndex/_export' -d '{
    "fields": ["_id", "_source", "_version", "_type"],
    "output_file": "/tmp/es-data/dump-${index}-${shard}"
}'
```

Export only objects of a specific type of an index:
```
curl -X POST 'http://localhost:9200/myIndex/myType/_export' -d '{
    "fields": ["_id", "_source", "_version"],
    "output_file": "/tmp/es-data/dump-${index}-${shard}"
}'
```

Import previously exported data into Elasticsearch again, for example into a newly set up Elasticsearch server with empty indexes. Take care to have the indexes prepared with correct mappings. The files must reside on the file system of the Elasticsearch node(s):
```
curl -X POST 'http://localhost:9200/_import' -d '{
    "directory": "/tmp/es-data"
}'
```

Import data from gzipped files:
```
curl -X POST 'http://localhost:9200/_import' -d '{
    "directory": "/tmp/es-data",
    "compression": "gzip"
}'
```

Import data into a specific index. This can be used if no _index is given in the export data, or to force data of other indexes to be imported into a specific index:
```
curl -X POST 'http://localhost:9200/myNewIndex/_import' -d '{
    "directory": "/tmp/es-data"
}'
```

Import data into a specific type of an index:
```
curl -X POST 'http://localhost:9200/myNewIndex/myType/_import' -d '{
    "directory": "/tmp/es-data"
}'
```

Use a regular expression to filter imported file names (e.g. for specific indexes):
```
curl -X POST 'http://localhost:9200/_import' -d '{
    "directory": "/tmp/es-data",
    "file_pattern": "dump-myindex-(\\d).json"
}'
```

A list of fields to export. Describes which data is exported for every object. A field name can be any property that is defined in the index/type mapping with "store": "yes", or one of the following special fields (prefixed with _):
- _id: Delivers the ID of the object
- _index: Delivers the index of the object
- _routing: Delivers the routing value of the object
- _source: Delivers the stored JSON values of the object
- _timestamp: Delivers the time stamp when the object was created (or the externally provided timestamp). Works only if the _timestamp field is enabled and set to "store": "yes" in the index/type mapping of the object.
- _ttl: Delivers the expiration time stamp of the object if the _ttl field is enabled in the index/type mapping.
- _type: Delivers the document type of the object
- _version: Delivers the current version of the object
Example assuming that the properties name and address are defined in the index/type mapping with the property "store": "yes":
"fields": ["_id", "name", "address"]
The fields element is required in the POST data of the request.
"output_cmd": "cat"
"output_cmd": ["/location/yourcommand", "argument1", "argument2"]
The command to execute. Can be defined as a string or as an array. The content to export will get piped to stdin of the command to execute. Some variable substitution is possible (see Variable Substitution).
- Required (if output_file has been omitted)
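What the plugin does with an array-form output_cmd can be simulated locally. A sketch, assuming the lower-casing command from the examples above and a single exported line as input:

```shell
# Simulate the plugin piping an exported line into the command's stdin;
# the command lower-cases the data and writes it to a file
echo '{"_ID":"ID1","_SOURCE":{"VALUE":"ABC"}}' \
  | /bin/sh -c "tr '[A-Z]' '[a-z]' > /tmp/outputcommand.txt"

cat /tmp/outputcommand.txt
```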
"output_file": "/tmp/dump"
A path to the resulting output file. The containing directory of the given output_file has to exist. The given output_file MUST NOT exist, unless the parameter force_overwrite is set to true.
If the path of the output file is relative, the files will be stored relative to each node's first node data location, which is usually a subdirectory of the configured data location. This absolute path can be seen in the JSON response of the request. If you don't know where this location is, you can do a dry run with the explain element set to true to find out.
Some variable substitution in the output_file's name is also possible (see Variable Substitution).
- Required (if output_cmd has been omitted)
"force_overwrite": true
Boolean flag to force overwriting an existing output_file. This option only makes sense if output_file has been defined.
- Optional (defaults to false)
"explain": true
Option to evaluate the command to execute without running it (like a dry run).
- Optional (defaults to false)
"compression": "gzip"
Option to activate compression of the output. Works whether output_file or output_cmd has been defined. Currently only the gzip compression type is available. Omitting the option will result in uncompressed output to files or processes.
- Optional (default is no compression)
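A gzip-compressed dump written this way can be read back with standard tools. A sketch with a hypothetical file name:

```shell
# Simulate a gzip-compressed export file and decompress it to stdout
echo '{"_id":"id1","_source":{"value":"value1"}}' \
  | gzip > /tmp/dump-myIndex-0.gz

gzip -dc /tmp/dump-myIndex-0.gz
```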
The query element within the export request body allows defining a query using the Query DSL. See http://www.elasticsearch.org/guide/reference/query-dsl/
- Optional
"settings": true
Option to generate an index settings file next to the data files on all corresponding shards. The generated settings file has the generated name of the output file with a .settings extension. This option is only possible if the option output_file has been defined.
- Optional (defaults to false)
"mappings": true
Option to generate an index mapping file next to the data files on all corresponding shards. The generated mapping file has the generated name of the output file with a .mapping extension. This option is only possible if the option output_file has been defined.
- Optional (defaults to false)
The API provides the general behavior of the REST API. See http://www.elasticsearch.org/guide/reference/api/
Controls the preference of which shard replicas to execute the export request on. Unlike in the search API, preference is set to "_primary" by default. See http://www.elasticsearch.org/guide/reference/api/search/preference/
The following placeholders will be replaced with the actual value in the output_file or output_cmd fields:
- ${cluster}: The name of the cluster
- ${index}: The name of the index
- ${shard}: The ID of the shard
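The substitution itself happens inside the plugin; as an illustration only, it can be sketched with sed, assuming example values for one shard on one node:

```shell
# Hypothetical values for one shard on one node
index=myIndex
shard=0

# Single quotes keep the shell from expanding the placeholders itself
template='/tmp/es-data/dump-${index}-${shard}'
echo "$template" | sed -e "s/\${index}/$index/" -e "s/\${shard}/$shard/"
```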
The _export query returns a JSON response with information about the export status. The output differs a bit depending on whether an output command or an output file is given in the request body.
The JSON response may look like this if an output file is given in therequest body:
{ "exports" : [ { "index" : "myIndex", "shard" : 0, "node_id" : "the_node_id", "numExported" : 5, "output_file" : "/tmp/dump-myIndex-0" } ], "totalExported" : 5, "_shards" : { "total" : 2, "successful" : 1, "failed" : 1, "failures" : [ { "index" : "myIndex", "shard" : 1, "reason" : "..." } ] }}The JSON response may look like this if an output command is given in therequest body:
{ "exports" : [ { "index" : "myIndex", "shard" : 0, "node_id" : "the_node_id", "numExported" : 5, "output_cmd" : [ "/bin/sh", "-c", "tr [A-Z] [a-z] > /tmp/outputcommand.txt" ], "stderr" : "", "stdout" : "", "exitcode" : 0 } ], "totalExported" : 5, "_shards" : { "total" : 2, "successful" : 1, "failed" : 1, "failures": [ { "index" : "myIndex", "shard" : 1, "reason" : "..." } ] }}Hint
- exports: List of successful exports
- totalExported: Number of total exported objects
- _shards: Shard information
- index: The name of the exported index
- shard: The number of the exported shard
- node_id: The node ID where the export happened
- numExported: The number of exported objects in the shard
- output_file: The file name of the output file with substituted variables
- failures: List of failing shard operations
- reason: The error report of a specific shard failure
- output_cmd: The executed command on the node with substituted variables
- stderr: The first 8K of the standard error log of the executed command
- stdout: The first 8K of the standard output log of the executed command
- exitcode: The exit code of the executed command
The import data requires the same format as the format that is generated by the export, so every line in the import file represents an object in JSON format.
The _source field is required for a successful import of an object. If the _id field is not given, a random ID is generated for the object. The _index and _type fields are also required, as long as they are not given in the request URI (e.g. http://localhost:9200/<index>/<type>/_import).
Further optional fields are _routing, _timestamp, _ttl and _version. See the fields section on export for more details on the fields.
Specifies the directory where the files to be imported reside. Every single node of the cluster imports files from that directory on its file system.
If the directory is a relative path, it is based on the absolute path of each node's first node data location. See output_file in the export documentation for more information.
"compression": "gzip"
Option to activate decompression of the import files. Currently only the gzip compression type is available.
- Optional (default is no decompression)
"file_pattern": "index-(.*)-(\d).json"
Option to import only files matching a given regular expression. Take care of double escaping, as the JSON is decoded too in the process. For more information on regular expressions visit http://www.regular-expressions.info/
- Optional (default is no filtering)
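Which file names a pattern would select can be checked locally with grep. A sketch, using [0-9] instead of \d for portability with POSIX grep (the doubled backslash in the JSON request arrives at the plugin as a single \d):

```shell
# Candidate file names as they might appear in the import directory
printf '%s\n' dump-myindex-1.json dump-myindex-2.json dump-other-1.json \
  | grep -E 'dump-myindex-[0-9]\.json'
```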
"settings": true
Option to import index settings. All settings files in the import directory that have an eponymous data file (same name without the .settings extension) will be handled. Also use the file_pattern option to reduce the imported settings files. The format of a settings file is the same as the JSON output of _settings GET requests.
- Optional (defaults to false)
"mappings": true
Option to import index mappings. All mapping files in the import directory that have an eponymous data file (same name without the .mapping extension) will be handled. Also use the file_pattern option to reduce the imported mapping files. The format of a mapping file is the same as the JSON output of _mapping GET requests.
- Optional (defaults to false)
The JSON response of an import may look like this:
{ "imports" : [ { "node_id" : "7RKUKxNDQlq0OzeOuZ02pg", "took" : 61, "imported_files" : [ { "file_name" : "dump-myIndex-1.json", "successes" : 150, "failures" : 0 }, { "file_name" : "dump-myIndex-2.json", "successes" : 149, "failures" : 1, "invalidated" : 1 } ] }, { "node_id" : "IrMCOlKCTtW4aDhjXiYzTw", "took" : 63, "imported_files" : [ { "file_name" : "dump-myIndex-3.json", "successes" : 150, "failures" : 0 } ] } ], "failures" : [ { "node_id" : "OATwHz48TEOshAISZlepcA", "reason" : "..." } ]}Hint
- imports: List of successful imports
- node_id: The node ID where the import happened
- took: Operation time of all imports on the node in milliseconds
- imported_files: List of imported files in the import directory of the node's file system
- file_name: File name of the handled file
- successes: Number of successfully imported objects per file
- failures (in the imported_files list): Number of objects not imported because of a failure
- invalidated: Number of objects not imported because of invalidation (time to live exceeded)
- failures (in the root): List of failing node operations
- reason: The error report of a specific node failure
The idea behind dump is to export all relevant data to recreate thecluster as it was at the time of the dump.
The basic usage of the endpoint is:
```
curl -X POST 'http://localhost:9200/_dump'
```
All data (also including settings and mappings) will get saved to a subfolder within each node's data directory.
It's possible to call _dump on root level, on index level or on type level.
The directory option defines where to store exported files. If the directory is a relative path, it is based on the absolute path of each node's first node data location. See output_file in the export documentation for more information. If the directory is omitted, the default location dump within the node data location will be used.
"force_overwrite": true
Boolean flag to force overwriting an existing output_file. This option is identical to the force_overwrite option of the _export endpoint.
Dumped data is intended to get restored. This can be done by the _restoreendpoint:
```
curl -X POST 'http://localhost:9200/_restore'
```
It's possible to call _restore on root level, on index level or on type level.
Specifies the directory where the files to be restored reside. See directory in the import documentation for more details. If the directory is omitted, the default location dump within the node data location will be used.
Defaults to true on restore. See the Import documentation for more details.
The _reindex endpoint can reindex documents of a given search query.
Reindex all indexes:
```
curl -X POST 'http://localhost:9200/_reindex'
```
Reindex a specific index:
```
curl -X POST 'http://localhost:9200/myIndex/_reindex'
```
Reindex documents of a specified query:
```
curl -X POST 'http://localhost:9200/myIndex/aType/_reindex' -d '{
    "query": {"text": {"name": "tobereindexed"}}
}'
```

An example can be found in the Reindex DocTest.
Via the _search_into endpoint it is possible to put the result of a given query directly into an index:
```
curl -X POST 'http://localhost:9200/oldindex/_search_into' -d '{
    "fields": ["_id", "_source", ["_index", "'newindex'"]]
}'
```

An example can be found in the Search Into DocTest.
- Clone this repo with git clone git@github.com:crate/elasticsearch-inout-plugin.git
- Checkout the tag (find out via git tag) you want to build (possibly master is not for your Elasticsearch version)
- Run mvn clean package -DskipTests=true – this does not run any unit tests, as they take some time. If you want to run them, run mvn clean package instead
- Install the plugin: /path/to/elasticsearch/bin/plugin -install elasticsearch-inout-plugin -url file:///$PWD/target/elasticsearch-inout-plugin-$version.jar