[ENHANCEMENT] Bulk API helper functions #62
I don't even understand how to do a bulk. From the docs:

```rust
let client = Elasticsearch::default();
let mut body: Vec<JsonBody<_>> = Vec::with_capacity(4);

// add the first operation and document
body.push(json!({"index": {"_id": "1"}}).into());
body.push(json!({
    "id": 1,
    "user": "kimchy",
    "post_date": "2009-11-15T00:00:00Z",
    "message": "Trying out Elasticsearch, so far so good?"
}).into());

// add the second operation and document
body.push(json!({"index": {"_id": "2"}}).into());
body.push(json!({
    "id": 2,
    "user": "forloop",
    "post_date": "2020-01-08T00:00:00Z",
    "message": "Bulk indexing with the rust client, yeah!"
}).into());

let response = client
    .bulk(BulkParts::Index("tweets"))
    .body(body)
    .send()
    .await?;
```

So "_id" is set to "1" (not zero... anyway, I'll pass over that nonsense). Then "_id" becomes "id" in the document: why repeat the information twice? Why does it take two lines to do one operation? Then the index is "tweets", so no "id" or "_id" there? I don't understand a thing of all this; the doc is unreadable. Reading https://www.elastic.co/fr/blog/what-is-an-elasticsearch-index:
> ...nuanced...

Sentences "like" that don't make any sense to me. There is no mention of any "id" or "_id", and no clear pointer to where to find the information. So what does this mean? Do I need to create an id? Do I need to make it unique? Do I really need to put it in the actual document? Can I use an integer? Is there a limit on the number of operations? So many questions. If I need to write 99% of the HTTP myself, I don't see what this crate is for. I will for sure find the information somewhere (I already found some), but this should be clear; for now I have to play Cluedo every time I deal with anything related to ES. Working with Elasticsearch is painful!
@Stargateur In answer to your questions:

The bulk API expects newline delimited JSON (NDJSON), where the operation to perform and the optional document involved in the operation are on consecutive lines. For example, a bulk index operation looks like

```json
{"index": {"_id": "1"}}
{"id": 1, "user": "kimchy", "post_date": "2009-11-15T00:00:00Z", "message": "Trying out Elasticsearch, so far so good?"}
```

The first line contains information about the operation, and the second line is the document. Take a look at the bulk API documentation for more details.
The blog post is from 7 years ago, and the database analogy has been retired as it isn't a great one. There's a better overview on the website. An index in Elasticsearch is a collection of documents. An index has a mapping, either explicitly defined or implicitly inferred from the documents indexed into it, that indicates how fields in documents are mapped in Elasticsearch.

A running instance of Elasticsearch is typically referred to as an Elasticsearch cluster. A cluster is made up of one or more nodes. An index may persist data across more than one node for high availability, fault tolerance, etc. An index is made up of one or more shards, and it is these shards that may be spread across more than one node.
You don't need to create an id for a document; if you don't, Elasticsearch will generate one for the document upon indexing it. This can be a suitable approach for many types of data. However, you typically want to assign an id in cases where you need to identify and operate on a specific document. The id must be unique within the index: if you index a document with the id of an existing document, the new document will overwrite the previous one. You can use an integer for the id.
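For instance, a minimal sketch of indexing without ids, assuming the same `tweets` index as the example above (the document fields here are made up): leaving the action metadata empty makes Elasticsearch generate the `_id` itself.

```rust
use elasticsearch::{http::request::JsonBody, BulkParts, Elasticsearch};
use serde_json::{json, Value};

async fn index_without_ids(client: &Elasticsearch) -> Result<(), elasticsearch::Error> {
    let mut body: Vec<JsonBody<Value>> = Vec::with_capacity(2);

    // No "_id" in the action metadata, so Elasticsearch assigns one on indexing.
    body.push(json!({"index": {}}).into());
    body.push(json!({
        "user": "someone",
        "message": "Elasticsearch generates the id for this document"
    }).into());

    let response = client
        .bulk(BulkParts::Index("tweets"))
        .body(body)
        .send()
        .await?;

    // Each item in the response body echoes back the generated "_id".
    println!("{}", response.status_code());
    Ok(())
}
```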
The bulk API sends multiple documents in one HTTP request, so you typically want to keep the overall size of one bulk request reasonable, around 5 MB. If you have lots of documents to index, you may want to send multiple concurrent bulk requests, which is what this issue is here to discuss.
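As a sketch of sending concurrent bulk requests, assuming pre-chunked bodies and using the `futures` crate's `buffer_unordered` to cap how many requests are in flight at once (the chunking itself is left out here):

```rust
use elasticsearch::{http::request::JsonBody, BulkParts, Elasticsearch};
use futures::stream::{self, StreamExt};
use serde_json::Value;

async fn send_bulk_chunks(
    client: &Elasticsearch,
    chunks: Vec<Vec<JsonBody<Value>>>,
) -> Result<(), elasticsearch::Error> {
    // At most 4 bulk requests in flight at any one time.
    let mut responses = stream::iter(chunks)
        .map(|chunk| {
            client
                .bulk(BulkParts::Index("tweets"))
                .body(chunk)
                .send()
        })
        .buffer_unordered(4);

    while let Some(response) = responses.next().await {
        // A real helper would parse the body for per-item failures and
        // retry transient errors such as 429 Too Many Requests.
        let response = response?;
        assert!(response.status_code().is_success());
    }
    Ok(())
}
```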
You're free to choose whether to use this crate or not. There are many reasons why you might want to though, one of which is that it provides functions for all stable Elasticsearch APIs.
The best place to ask questions is the Discuss forums, which have a lot of active community members willing to share knowledge and help with issues. There are also the webinars on elastic.co and the reference documentation.
Thanks a lot, everything is much clearer now; I think your answer will help a lot of people. The link I found was the top result in my search and looked official; I didn't expect the article to be obsolete. That's totally my bad. That said, if I understand correctly, this means the body is not valid JSON. Why make life complicated? Just do:

```json
[{
    "op": {
        "index": {
            "_id": "1"
        }
    },
    "content": {
        "id": 1,
        "user": "kimchy",
        "post_date": "2009-11-15T00:00:00Z",
        "message": "Trying out Elasticsearch, so far so good?"
    }
}]
```
No, I'm forced to. I wouldn't work with ES if it were my choice; rs-es is full of bugs and not async, this crate is still on tokio 0.2, and I didn't find anything else decent in Rust. And I'm not even talking about the problems we have with the server. The only reason we use ES is that our R&D team wants to, for reasons I don't agree with at all. Sorry, that had to come out. ES makes breaking changes in minor versions (AFAIK, see benashford/rs-es#148), forcing me to make random patches to a crate I don't know, for a tech I don't know much about, where the doc is hard to find; so many problems make me angry. So, after my first try 6-7 months ago, I'm now trying the official client for the second time. And again, for the second time, I want to tear my house down. That said, https://docs.rs/elasticsearch/7.10.0-alpha.1/elasticsearch/struct.BulkOperation.html#method.create is probably what I need.
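For reference, a short sketch along the lines of the crate's `BulkOperations` docs, using `create` so that indexing fails if a document with that id already exists (the document content is made up):

```rust
use elasticsearch::{BulkOperation, BulkOperations, BulkParts, Elasticsearch};
use serde_json::json;

async fn bulk_create(client: &Elasticsearch) -> Result<(), Box<dyn std::error::Error>> {
    let mut ops = BulkOperations::new();

    // Unlike index, create fails if a document with this id already exists.
    ops.push(BulkOperation::create("1", json!({
        "id": 1,
        "user": "kimchy",
        "message": "Trying out Elasticsearch, so far so good?"
    })))?;

    let response = client
        .bulk(BulkParts::Index("tweets"))
        .body(vec![ops])
        .send()
        .await?;

    println!("{}", response.status_code());
    Ok(())
}
```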
How can I know how many bytes my data will take? If I understand correctly, it's the crate that handles the JSON conversion, so "reasonable" is not an acceptable answer for an API. Putting a magic number in my code, like 5000 items per bulk, without knowing whether one day my bulk will be too big, will for sure blow up eventually. I could maybe get the average size of my documents in Mongo and estimate my max items from that, but that could also blow up.
Take a look at "Tune for indexing speed" for guidance on how to size bulk requests.
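One way to avoid a magic item count is to bound the serialized size of each request instead; a rough sketch (the helper name and the byte counting are assumptions, not crate functionality):

```rust
use serde_json::Value;

// Split documents into chunks whose serialized size stays under
// `max_bytes`, so no single bulk request grows unbounded. Call it
// with e.g. max_bytes = 5 * 1024 * 1024 as a starting point.
fn chunk_by_bytes(docs: Vec<Value>, max_bytes: usize) -> Vec<Vec<Value>> {
    let mut chunks = Vec::new();
    let mut current = Vec::new();
    let mut current_bytes = 0usize;

    for doc in docs {
        // +1 for the newline each NDJSON line carries; the action
        // metadata line adds a few more bytes per document in practice.
        let doc_bytes = serde_json::to_vec(&doc).map(|v| v.len() + 1).unwrap_or(0);
        if !current.is_empty() && current_bytes + doc_bytes > max_bytes {
            chunks.push(std::mem::take(&mut current));
            current_bytes = 0;
        }
        current.push(doc);
        current_bytes += doc_bytes;
    }
    if !current.is_empty() {
        chunks.push(current);
    }
    chunks
}
```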
Seems to work quite nicely.
Is there any progress regarding the original issue? My objective is to stream a `File` for bulk indexing in Elasticsearch, where the file already contains all the correctly formatted documents to be indexed. It's worth noting that reqwest, which is used internally by elasticsearch-rs, supports `From<File>` for its `Body` type. However, the current implementation of `Body` in elasticsearch-rs reads the entire request body into memory before performing the POST request, which is not the desired behavior in my case.
Landed here with the same use-case: I want to stream a `File` as the body of a bulk request.
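In the meantime, a sketch of a workaround that streams an NDJSON file straight to the `_bulk` endpoint with plain reqwest, bypassing the crate's `Body` type; this assumes reqwest's `stream` feature plus `tokio-util`, and the URL and index name are placeholders:

```rust
use reqwest::{Body, Client};
use tokio::fs::File;
use tokio_util::io::ReaderStream;

async fn stream_bulk_file(path: &str) -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open(path).await?;
    // ReaderStream yields the file as chunks of bytes, so the whole
    // payload is never buffered in memory at once.
    let stream = ReaderStream::new(file);

    let response = Client::new()
        .post("http://localhost:9200/tweets/_bulk")
        .header("content-type", "application/x-ndjson")
        .body(Body::wrap_stream(stream))
        .send()
        .await?;

    println!("bulk response status: {}", response.status());
    Ok(())
}
```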
The bulk API can be used to index multiple documents into Elasticsearch by constructing a bulk request containing multiple documents and executing it against Elasticsearch. When the number of documents is large, however, a consumer needs to construct multiple bulk requests, each containing a slice of the documents to be indexed, and execute these against Elasticsearch.

Many of the existing Elasticsearch clients provide a "bulk helper" for this purpose. The helper can be responsible for concerns such as retrying bulk requests that fail, for example with a Too Many Requests HTTP response. An example helper is the BulkAllObservable from the C#/.NET client. The Rust client should provide a similar, idiomatic way of helping consumers bulk index a large collection of documents.
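As a purely hypothetical starting point for discussion, such a helper's surface might look something like the following; none of these names exist in elasticsearch-rs today:

```rust
use elasticsearch::Elasticsearch;
use serde_json::Value;

// Invented for illustration: knobs a bulk-all style helper would likely need.
pub struct BulkAllConfig {
    /// Upper bound on the serialized size of a single bulk request.
    pub max_request_bytes: usize,
    /// Maximum number of bulk requests in flight concurrently.
    pub max_concurrency: usize,
    /// Retries for requests failing with 429 Too Many Requests.
    pub max_retries: u32,
}

// A helper could take any iterator of documents, slice it into
// bounded-size bulk requests, and drive them concurrently with retries.
pub async fn bulk_all<I>(
    _client: &Elasticsearch,
    _index: &str,
    _docs: I,
    _config: BulkAllConfig,
) -> Result<(), elasticsearch::Error>
where
    I: Iterator<Item = Value>,
{
    // Shape sketch only: the chunking, concurrency, and retry logic
    // would combine the ideas sketched earlier in this thread.
    unimplemented!()
}
```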