
import_bulk fails to import large input files #41

Open
hadisinaee opened this issue Jan 30, 2024 · 3 comments
Labels
question (Further information is requested)

Comments

@hadisinaee
Collaborator

When trying to use the import_bulk API from python-arango, I noticed that it fails to import all the docs. The following is my input to the function:

res = collection.import_bulk(
    documents=map(lambda x: x.to_dict(), documents),
    overwrite=self.reset_collection,
)

I have to call .to_dict on all of the objects because they are instances of the IndalekoObjects class; to make them JSON serializable, we have to turn each one into a dictionary.

The error I get is:

Exception: Can't connect to host(s) within limit (3)

The input contains 827,481 documents.

@hadisinaee added the question label on Jan 30, 2024
@fsgeek
Contributor

fsgeek commented Jan 30, 2024

Interesting. I haven't tried to use the bulk uploader API call, and I haven't seen this issue with the arangoimport tool, even using it to upload a file of ~5 million entries to a WAN-based ArangoDB instance.

Is the issue that by using a lambda, you're injecting the time to construct the dictionaries into the "connect" sequence? Would it be more resilient to just build the dictionaries first? Plus, one reason I avoided going down this path is the concerns I had about batching entries (which the external tool already seems to handle).
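
For reference, a minimal sketch of what building the dictionaries up front might look like, reusing the collection, documents, and self.reset_collection names from the snippet above (illustrative only, not code from this repo):

# Materialize the dictionaries before calling import_bulk, so the
# serialization cost is paid before the request is issued.
docs = [doc.to_dict() for doc in documents]

res = collection.import_bulk(
    documents=docs,
    overwrite=self.reset_collection,
)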

@hadisinaee
Collaborator Author

Interesting. I haven't tried to use the bulk uploader API call, and I haven't seen this issue with the arangoimport tool, even using it to upload a file of ~5 million entries to a WAN-based ArangoDB instance.

Yeah, the arangoimport can handle large files, but the API seems to be tricky to use.

Is the issue that by using a lambda, you're injecting the time to construct the dictionaries into the "connect" sequence? Would it be more resilient to just build the dictionaries first? Plus, one reason I avoided going down this path is the concerns I had about batching entries (which the external tool already seems to handle).

Yes, it might be that. I can try to build the array first and pass it to the function. If that doesn't work properly, I'll fall back to simply running arangoimport from my Python script, along the lines of the sketch below. I'll give it a try.
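
A rough sketch of driving arangoimport from Python (the file path, collection name, endpoint, and credentials here are placeholders, and the flags should be double-checked against the installed ArangoDB client tools):

import json
import subprocess

# Dump the documents to a JSONL file, one JSON object per line.
with open("objects.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc.to_dict()) + "\n")

# Shell out to arangoimport; --overwrite clears the collection first.
subprocess.run(
    [
        "arangoimport",
        "--file", "objects.jsonl",
        "--type", "jsonl",
        "--collection", "Objects",
        "--server.endpoint", "tcp://127.0.0.1:8529",
        "--server.database", "_system",
        "--overwrite", "true",
    ],
    check=True,
)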

@hadisinaee
Collaborator Author

I worked on this issue and tried the following methods:

  • In the issue body, I explained that I used map to build the list of documents to import. One hypothesis was that this is the bottleneck, because the list has to be built during ingestion and that takes a lot of time. So I built the list before passing it to the function for import. TL;DR: it didn’t work; the Docker container shuts down! I don’t know what the problem is with it. I think Tony runs ArangoDB on his own machine; I was wondering whether he encounters the same issue.
  • The second approach I tried was to split the array into chunks and insert them into the database (roughly the sketch shown after this list). It worked, but it took ~3-5 minutes to ingest both the vertices (~830K docs) and the relationships (1.6M docs) with a chunk size of 50K docs per upload. I tried different chunk sizes, but 50K was the maximum I could use.
  • The third approach I tried was to parallelize the second one: each thread ingested its own portion of the data, initially over a single shared connection. Unsurprisingly, it failed, because it ends up similar to the first approach. I wanted to see whether it is the database connection that cannot handle an array that large or the database itself, so I then created one connection per thread, but that didn’t work either.
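
A minimal sketch of the chunked import from the second bullet, assuming docs is the pre-built list of dictionaries and collection is the python-arango collection handle:

# 50K documents per request was the largest chunk size that worked for me.
CHUNK_SIZE = 50_000

for start in range(0, len(docs), CHUNK_SIZE):
    chunk = docs[start:start + CHUNK_SIZE]
    collection.import_bulk(documents=chunk)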
