Bulk indexing command line tool for elasticsearch.
Fast parallel command line bulk loading utility for elasticsearch. Data is read from a newline delimited JSON file or stdin and indexed into elasticsearch in bulk and in parallel. The shortest command would be:
```
$ esbulk -index my-index-name < file.ldj
```
Caveat: If indexing pressure on the bulk API is too high (dozens or hundreds of parallel workers, large batch sizes, depending on your setup), esbulk will halt and report an error:
```
$ esbulk -index my-index-name -w 100 file.ldj
2017/01/02 16:25:25 error during bulk operation, try less workers (lower -w value) or increase thread_pool.bulk.queue_size in your nodes
```
Please note that, in such a case, some documents are indexed and some are not. Your index will be in an inconsistent state, since there is no transactional bracket around the indexing process.
However, using defaults (parallelism: number of cores) on a single node setup will just work. For larger clusters, increase the number of workers until you see full CPU utilization. After that, more workers won't buy any more speed.
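If a run does halt partway and the input file is the complete source of truth, one pragmatic recovery is to wipe the index and reindex from scratch with the `-purge` flag; a sketch, where index and file names are placeholders:

```
$ esbulk -purge -index my-index-name file.ldj
```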
Currently, esbulk is tested against elasticsearch versions 2, 5, 6, 7 and 8 using testcontainers. Originally written for Leipzig University Library, project finc.
```
$ go install github.com/miku/esbulk/cmd/esbulk@latest
```

For deb or rpm packages, see: https://github.com/miku/esbulk/releases
```
$ esbulk -h
Usage of esbulk:
  -0	set the number of replicas to 0 during indexing
  -c string
    	create index mappings, settings, aliases, https://is.gd/3zszeu
  -cpuprofile string
    	write cpu profile to file
  -id string
    	name of field to use as id field, by default ids are autogenerated
  -index string
    	index name
  -mapping string
    	mapping string or filename to apply before indexing
  -memprofile string
    	write heap profile to file
  -optype string
    	optype (index - will replace existing data, create - will only create a new doc, update - create new or update existing data) (default "index")
  -p string
    	pipeline to use to preprocess documents
  -purge
    	purge any existing index before indexing
  -purge-pause duration
    	pause after purge (default 1s)
  -r string
    	Refresh interval after import (default "1s")
  -server value
    	elasticsearch server, this works with https as well
  -size int
    	bulk batch size (default 1000)
  -skipbroken
    	skip broken json
  -type string
    	elasticsearch doc type (deprecated since ES7)
  -u string
    	http basic auth username:password, like curl -u
  -v	prints current program version
  -verbose
    	output basic progress
  -w int
    	number of workers to use (default 8)
  -z	unzip gz'd file on the fly
```
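As a hypothetical example combining several of these flags, the following would apply a mapping from a file, skip broken JSON lines, and set replicas to 0 during indexing (`mapping.json` is an assumed placeholder file):

```
$ esbulk -0 -skipbroken -mapping mapping.json -index example file.ldj
```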
To index a JSON file that contains one document per line, just run:
```
$ esbulk -index example file.ldj
```
Where `file.ldj` is line delimited JSON, like:

```
{"name": "esbulk", "version": "0.2.4"}
{"name": "estab", "version": "0.1.3"}
...
```
By default esbulk will use as many parallel workers as there are cores. To tweak the indexing process, adjust the `-size` and `-w` parameters.
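For example, to try larger batches with fewer workers (the values here are purely illustrative, not recommendations):

```
$ esbulk -index example -size 5000 -w 4 file.ldj
```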
You can index from gzipped files as well, using the `-z` flag:

```
$ esbulk -z -index example file.ldj.gz
```
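Since esbulk also reads from stdin, the following pipeline is equivalent, assuming a standard gunzip is available:

```
$ gunzip -c file.ldj.gz | esbulk -index example
```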
Starting with 0.3.7 the preferred method to set a non-default server host and port is via `-server`, e.g.

```
$ esbulk -server https://0.0.0.0:9201
```
This way, you can use https as well, which was not possible before. The options `-host` and `-port` are gone as of esbulk 0.5.0.
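A complete invocation against a TLS-enabled node might look like this (host, port, index and file names are placeholders):

```
$ esbulk -server https://localhost:9200 -index example file.ldj
```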
Since version 0.3.8: if you want to reuse IDs from your documents in elasticsearch, you can specify the ID field via the `-id` flag:
```
$ cat file.json
{"x": "doc-1", "db": "mysql"}
{"x": "doc-2", "db": "mongo"}
```
Here, we would like to reuse the ID from field `x`.
```
$ esbulk -id x -index throwaway -verbose file.json
...

$ curl -s http://localhost:9200/throwaway/_search | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-2",
        "_score": 1,
        "_source": {
          "x": "doc-2",
          "db": "mongo"
        }
      },
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-1",
        "_score": 1,
        "_source": {
          "x": "doc-1",
          "db": "mysql"
        }
      }
    ]
  }
}
```
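Since the IDs now come from the data, re-running the same command with the default `-optype index` should replace rather than duplicate documents; one way to verify this, assuming a local node, is the `_count` API, which should report 2 no matter how often you reindex this two-line file:

```
$ esbulk -id x -index throwaway file.json
$ curl -s http://localhost:9200/throwaway/_count | jq .count
2
```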
Version 0.4.3 adds support for nested ID fields:
```
$ cat fixtures/pr-8-1.json
{"a": {"b": 1}}
{"a": {"b": 2}}
{"a": {"b": 3}}

$ esbulk -index throwaway -id a.b < fixtures/pr-8-1.json
...
```
Version 0.4.3 adds support for IDs that are the concatenation of multiple fields:
```
$ cat fixtures/pr-8-2.json
{"a": {"b": 1}, "c": "a"}
{"a": {"b": 2}, "c": "b"}
{"a": {"b": 3}, "c": "c"}

$ esbulk -index throwaway -id a.b,c < fixtures/pr-8-2.json
...
    {
      "_index": "xxx",
      "_type": "default",
      "_id": "1a",
      "_score": 1,
      "_source": {
        "a": {
          "b": 1
        },
        "c": "a"
      }
    },
```
Since 0.4.2: support for secured elasticsearch nodes:

```
$ esbulk -u elastic:changeme -index myindex file.ldj
```
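Basic auth can be combined with `-server` for nodes that are both secured and TLS-enabled; a hypothetical example:

```
$ esbulk -u elastic:changeme -server https://localhost:9200 -index myindex file.ldj
```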
A similar project has been started for solr, called solrbulk.
Contributors:

- klaubert
- sakshambathla
- mumoshu
- albertpastrana
- faultlin3
- gransy
- Christoph Kepper
- Christian Solomon
- Mikael Byström
and others.
Some measurements:

```
$ csvlook -I measurements.csv
| es    | esbulk | docs      | avg_b | nodes | cores | total_heap_gb | t_s   | docs_per_s | repl |
|-------|--------|-----------|-------|-------|-------|---------------|-------|------------|------|
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     | 32    | 64            | 6420  | 22100      | 1    |
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     | 8     | 30            | 27360 | 5100       | 1    |
| 6.1.2 | 0.4.8  | 1000000   | 2000  | 1     | 4     | 1             | 300   | 3300       | 1    |
| 6.1.2 | 0.4.8  | 10000000  | 26    | 1     | 4     | 8             | 122   | 81000      | 1    |
| 6.1.2 | 0.4.8  | 10000000  | 26    | 1     | 32    | 64            | 32    | 307000     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 26253 | 5444       | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 11113 | 12831      | 0    |
| 6.2.3 | 0.4.13 | 15000000  | 6000  | 2     | 64    | 128           | 2460  | 6400       | 0    |
```
Why not add a row?