Bulk indexing command line tool for elasticsearch.
Fast parallel command line bulk loading utility for elasticsearch. Data is read from a newline delimited JSON file or stdin and indexed into elasticsearch in bulk and in parallel. The shortest command would be:
```
$ esbulk -index my-index-name < file.ldj
```
Caveat: If indexing pressure on the bulk API is too high (dozens or hundreds of parallel workers, large batch sizes, depending on your setup), esbulk will halt and report an error:
```
$ esbulk -index my-index-name -w 100 file.ldj
2017/01/02 16:25:25 error during bulk operation, try less workers (lower -w value) or increase thread_pool.bulk.queue_size in your nodes
```
Please note that, in such a case, some documents are indexed and some are not. Your index will be in an inconsistent state, since there is no transactional bracket around the indexing process.
However, using defaults (parallelism: number of cores) on a single node setup will just work. For larger clusters, increase the number of workers until you see full CPU utilization. After that, more workers won't buy any more speed.
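If a run does halt partway and the input file is the complete source of truth, one pragmatic recovery is to wipe the index and reindex from scratch with the `-purge` flag; a sketch, where index and file names are placeholders:

```
$ esbulk -purge -index my-index-name file.ldj
```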
Currently, esbulk is tested against elasticsearch versions 2, 5, 6, 7 and 8 using testcontainers. Originally written for Leipzig University Library, project finc.
```
$ go install github.com/miku/esbulk/cmd/esbulk@latest
```

For deb or rpm packages, see: https://github.com/miku/esbulk/releases
```
$ esbulk -h
Usage of esbulk:
  -0	set the number of replicas to 0 during indexing
  -c string
    	create index mappings, settings, aliases, https://is.gd/3zszeu
  -cpuprofile string
    	write cpu profile to file
  -id string
    	name of field to use as id field, by default ids are autogenerated
  -index string
    	index name
  -mapping string
    	mapping string or filename to apply before indexing
  -memprofile string
    	write heap profile to file
  -optype string
    	optype (index - will replace existing data, create - will only create a new doc, update - create new or update existing data) (default "index")
  -p string
    	pipeline to use to preprocess documents
  -purge
    	purge any existing index before indexing
  -purge-pause duration
    	pause after purge (default 1s)
  -r string
    	Refresh interval after import (default "1s")
  -server value
    	elasticsearch server, this works with https as well
  -size int
    	bulk batch size (default 1000)
  -skipbroken
    	skip broken json
  -type string
    	elasticsearch doc type (deprecated since ES7)
  -u string
    	http basic auth username:password, like curl -u
  -v	prints current program version
  -verbose
    	output basic progress
  -w int
    	number of workers to use (default 8)
  -z	unzip gz'd file on the fly
```
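As a hypothetical example combining several of these flags, the following would apply a mapping from a file, skip broken JSON lines, and set replicas to 0 during indexing (`mapping.json` is an assumed placeholder file):

```
$ esbulk -0 -skipbroken -mapping mapping.json -index example file.ldj
```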
To index a JSON file that contains one document per line, just run:
```
$ esbulk -index example file.ldj
```
Where `file.ldj` is line delimited JSON, like:

```
{"name": "esbulk", "version": "0.2.4"}
{"name": "estab", "version": "0.1.3"}
...
```
By default esbulk will use as many parallel workers as there are cores. To tweak the indexing process, adjust the `-size` and `-w` parameters.
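For example, to try larger batches with fewer workers (the values here are purely illustrative, not recommendations):

```
$ esbulk -index example -size 5000 -w 4 file.ldj
```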
You can index from gzipped files as well, using the `-z` flag:

```
$ esbulk -z -index example file.ldj.gz
```
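Since esbulk also reads from stdin, the following pipeline is equivalent, assuming a standard gunzip is available:

```
$ gunzip -c file.ldj.gz | esbulk -index example
```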
Starting with 0.3.7 the preferred method to set a non-default server host and port is via `-server`, e.g.

```
$ esbulk -server https://0.0.0.0:9201
```
This way, you can use https as well, which was not possible before. The options `-host` and `-port` are gone as of esbulk 0.5.0.
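A complete invocation against a TLS-enabled node might look like this (host, port, index and file names are placeholders):

```
$ esbulk -server https://localhost:9200 -index example file.ldj
```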
Since version 0.3.8: if you want to reuse IDs from your documents in elasticsearch, you can specify the ID field via the `-id` flag:
```
$ cat file.json
{"x": "doc-1", "db": "mysql"}
{"x": "doc-2", "db": "mongo"}
```
Here, we would like to reuse the ID from field `x`.
```
$ esbulk -id x -index throwaway -verbose file.json
...

$ curl -s http://localhost:9200/throwaway/_search | jq
{
  "took": 2,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-2",
        "_score": 1,
        "_source": {
          "x": "doc-2",
          "db": "mongo"
        }
      },
      {
        "_index": "throwaway",
        "_type": "default",
        "_id": "doc-1",
        "_score": 1,
        "_source": {
          "x": "doc-1",
          "db": "mysql"
        }
      }
    ]
  }
}
```
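Since the IDs now come from the data, re-running the same command with the default `-optype index` should replace rather than duplicate documents; one way to verify this, assuming a local node, is the `_count` API, which should report 2 no matter how often you reindex this two-line file:

```
$ esbulk -id x -index throwaway file.json
$ curl -s http://localhost:9200/throwaway/_count | jq .count
2
```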
Version 0.4.3 adds support for nested ID fields:
```
$ cat fixtures/pr-8-1.json
{"a": {"b": 1}}
{"a": {"b": 2}}
{"a": {"b": 3}}

$ esbulk -index throwaway -id a.b < fixtures/pr-8-1.json
...
```
Version 0.4.3 adds support for IDs that are the concatenation of multiple fields:
```
$ cat fixtures/pr-8-2.json
{"a": {"b": 1}, "c": "a"}
{"a": {"b": 2}, "c": "b"}
{"a": {"b": 3}, "c": "c"}

$ esbulk -index throwaway -id a.b,c < fixtures/pr-8-2.json
...
    {
      "_index": "xxx",
      "_type": "default",
      "_id": "1a",
      "_score": 1,
      "_source": {
        "a": {
          "b": 1
        },
        "c": "a"
      }
    },
```
Since 0.4.2: support for secured elasticsearch nodes:

```
$ esbulk -u elastic:changeme -index myindex file.ldj
```
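Basic auth can be combined with `-server` for nodes that are both secured and TLS-enabled; a hypothetical example:

```
$ esbulk -u elastic:changeme -server https://localhost:9200 -index myindex file.ldj
```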
A similar project has been started for solr, called solrbulk.
Contributors:

- klaubert
- sakshambathla
- mumoshu
- albertpastrana
- faultlin3
- gransy
- Christoph Kepper
- Christian Solomon
- Mikael Byström
and others.
Some measurements:

```
$ csvlook -I measurements.csv
| es    | esbulk | docs      | avg_b | nodes | cores | total_heap_gb | t_s   | docs_per_s | repl |
|-------|--------|-----------|-------|-------|-------|---------------|-------|------------|------|
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     | 32    | 64            | 6420  | 22100      | 1    |
| 6.1.2 | 0.4.8  | 138000000 | 2000  | 1     | 8     | 30            | 27360 | 5100       | 1    |
| 6.1.2 | 0.4.8  | 1000000   | 2000  | 1     | 4     | 1             | 300   | 3300       | 1    |
| 6.1.2 | 0.4.8  | 10000000  | 26    | 1     | 4     | 8             | 122   | 81000      | 1    |
| 6.1.2 | 0.4.8  | 10000000  | 26    | 1     | 32    | 64            | 32    | 307000     | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 26253 | 5444       | 1    |
| 6.2.3 | 0.4.10 | 142944530 | 2000  | 2     | 64    | 128           | 11113 | 12831      | 0    |
| 6.2.3 | 0.4.13 | 15000000  | 6000  | 2     | 64    | 128           | 2460  | 6400       | 0    |
```
Why not add a row?