Skip to content

Dump an Elasticsearch (v7 or v8) index via scrolling. Achieve speeds up to 1 million doc/s on a single node

License

Notifications You must be signed in to change notification settings

hchargois/esdump

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

8 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

esdump

esdump is a simple and efficient CLI tool to dump (retrieve) the documents contained in an Elasticsearch index via scrolling.

It outputs to standard output, in JSON lines (a.k.a. JSONL or NDJSON) format.

It works with Elasticsearch versions 7.x and 8.x.

Features

  • Correctly clears the scroll contexts before exiting, to free Elasticsearch resources
  • Automatically uses slicing for best performance
  • You can specify which document fields to output
  • Options to scroll in a random order and only a given number of docs, to easily get samples
  • You can specify a query in full Elasticsearch format or in the query_string (a.k.a. Lucene) syntax
  • Adaptive throttling
  • Works with security (username/password & HTTPS with custom certificate)
  • Option to turn HTTP gzip compression off (uses less CPU if network is not a bottleneck)
  • ... and more!

Install

Requires Go >= 1.18

go install github.com/hchargois/esdump@latest

Examples

Usage is detailed in sections below, but here are a few simple examples:

# Dump an index to a file using default settings
esdump http://localhost myindex > out.jsonl

# Dump multiple indexes using multi-target notation
esdump http://localhost myindex1,myindex2*

# Select the document fields to dump
esdump http://localhost myindex --fields id,date,description

# Specify a query in "query_string" format
esdump http://localhost myindex --query "rabbit OR bunny"

# Specify a search query on standard input
echo '{"query": {"term": {"animal": "rabbit"}}}' | esdump http://localhost myindex

# Dump a random sample of 1000 documents
esdump http://localhost myindex --random --count 1000

# Access an Elasticsearch server secured with TLS (with a custom cert) and username/password
esdump https://user:pass@localhost myindex --verify=cacert.pem

Usage

esdump base-url index-target [flags]

Arguments:

  base-url      The base URL of the Elasticsearch server (e.g. http://localhost)
                If the port is not specified, 9200 is assumed
  index-target  The name of the index you want to dump. Multi-target syntax is
                also supported (e.g. myindex1,myindex2 or myindex*)

Flags:

  -f, --fields string             comma-separated list of fields to include in the output, or if starting with ^ to exclude
  -q, --query string              filter the documents with a "query_string" query
  -t, --throttle float32          delay factor for adaptive throttling, set 0 to disable throttling (default 4)
  -n, --count uint                output that many documents maximum (default unlimited)
  -s, --scroll-size int           number of hits per scroll request (default 1000)
  -m, --metadata                  include hit metadata (_index, _id, _source...), if not set only outputs the contents of _source
  -M, --metadata-only             only include hit metadata (_index, _id...), no _source
  -r, --random                    dump the documents in a random order
  -z, --no-compression            disable HTTP gzip compression
      --verify string             certificate file to verify the server's certificate, or "no" to skip all TLS verification
      --slices int                max number of slices per index (default 10)
      --scroll-timeout duration   scroll timeout (default 1m0s)
      --http-timeout duration     HTTP client timeout (default 1m0s)

How to...

Filter the documents to dump

There are two options to selectively dump a subset of the index documents.

The simple one is using the -q flag with a "query string" query, described here https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

The more advanced one is supplying a full Elasticsearch query to standard input. The format to use in this case is the format that is suitable for the /_search Elasticsearch endpoint, i.e. a JSON object containing a "query" key. You can also use this to set a custom sort order.

Choose what to dump

By default, esdump dumps only the documents, i.e. the contents of the "_source" in the Elasticsearch hits:

{"title": "lorem ipsum", "price": 1.23}
{"title": "dolor sit amet", "price": 4.56}
{"title": "consectetur adipiscing elit", "price": 7.89}

With the -m/--metadata flag, the documents are wrapped with their metadata:

{"_index": "items", "_id": "1", "_source": {"title": "lorem ipsum", "price": 1.23}}
{"_index": "items", "_id": "2", "_source": {"title": "dolor sit amet", "price": 4.56}}
{"_index": "items", "_id": "3", "_source": {"title": "consectetur adipiscing elit", "price": 7.89}}

With the -M/--metadata-only flag, only the metadata is output, which is the fastest way to retrieve only the _id:

{"_index": "items", "_id": "1"}
{"_index": "items", "_id": "2"}
{"_index": "items", "_id": "3"}

The -f/--fields can be used to specify a set of fields to include or exclude:

  • -f a,b,c will only output the fields a, b and c
  • -f ^a,b,c will output all the fields except a, b and c

Adjust the load on the server with adaptive throttling

esdump uses a very simple but effective throttling algorithm that automatically adapts to the capabilities and current load of the Elasticsearch cluster.

Instead of using the usual token bucket with a fixed rate, or a fixed delay between requests, you can set a relative throttle factor, which depends on the time taken by the last request.

For example, if you set a throttle factor of 10 (-t10), and the first scroll request takes 10 ms, then esdump will sleep for 100 ms before sending the next scroll request. If the next request takes only 5 ms, esdump will then sleep for 50 ms. This ensures that the scroll requests take a constant proportion of the cluster's load, even if it becomes more or less loaded for other reasons.

By default, a throttle factor of 4 is used.

To completely disable throttling, set a 0 throttle factor (-t0).

Go fast

  • Disable throttling with -t0
  • Select only the required fields with -f...
  • ... or if you only need the document _ids, use -M to only retrieve the metadata
  • If the network is fast, try disabling the gzip compression with -z (automatically done if the server is on a loopback address)
  • Increase the maximum number of slices with --slices; but this will only have an effect if your index has at least as many shards
  • Do not use random scrolling (no -r); do not specify a custom sort order in a query supplied on stdin

Work with a secured Elasticsearch cluster

If the cluster uses TLS, make sure to use the HTTPS scheme in the URL:

https://localhost

By default, your host's CA bundle is used to verify the server certificate. To specify a custom trusted certificate, use --verify:

esdump https://localhost myindex --verify cert.pem

... or turn off certificate validation with --verify no

To use a username and password, simply include them in the URL:

https://username:password@localhost

Benchmark

Against what seems to be the most popular alternative, elasticsearch-dump/elasticsearch-dump (7k+ stars!)

Dumping an index that has 2 million documents of around 500 bytes each, in 10 shards, on a single node (a desktop computer) on localhost.

command time speed
elasticdump --input=http://localhost:9200/testindex1 --output=out.json 5 hours 33 min 100 docs/s
esdump http://localhost testindex1 > out.json (default, 4x throttling) 7.28 s 274 774 docs/s
esdump http://localhost testindex1 -t0 > out.json (no throttling) 1.67 s 1 199 154 docs/s

No, there is no error. esdump really is ten thousand times faster than elasticdump. The (main) reason is that elasticdump has an insane throttling of 5 requests every 5 seconds. And the worst part of it? It is hardcoded, so it can't be configured, and it's also undocumented. Yep. It's that bad.

Alternatives

You may ask, surely there must already exist such a tool, right? Well, I searched for a few hours and couldn't find something that "just worked". So I had to make my own...

  • elasticsearch-dump/elasticsearch-dump: apart from the throttling issue described above, it also doesn't use slicing, and doesn't clear scroll contexts...
  • miku/esdump: does not produce JSONL, cannot specify an Elasticsearch query (only a Lucene one), no slicing...
  • wubin1989/esdump: seems that it cannot actually dump but only reindex into another Elasticsearch index?
  • shinexia/elasticdump: cannot dump to stdout, doesn't use slicing...

There are also a few that are simply too old and don't work with recent (>=7.x) Elasticsearch versions:

About

Dump an Elasticsearch (v7 or v8) index via scrolling. Achieve speeds up to 1 million doc/s on a single node

Topics

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages