
import_bulk fails to import large input files #41

Open
hadisinaee opened this issue Jan 30, 2024 · 3 comments
Labels
question (Further information is requested)

Comments

@hadisinaee
Collaborator

When trying to use the import_bulk API from python-arango, I noticed that it fails to import all the docs. The following is my input to the function:

res = collection.import_bulk(
    documents=map(lambda x: x.to_dict(), documents),
    overwrite=self.reset_collection,
)

I have to call .to_dict on all of the objects because they are instances of the IndalekoObjects class; to make them JSON serializable, we have to turn each one into a dictionary.

The error I get is:

Exception: Can't connect to host(s) within limit (3)

The input contains 827,481 documents.

@hadisinaee added the question label on Jan 30, 2024
@fsgeek
Contributor

fsgeek commented Jan 30, 2024

Interesting. I haven't tried to use the bulk uploader API call, and I haven't seen this issue with the arangoimport tool, even using it to upload a file of ~5 million entries to a WAN-based ArangoDB instance.

Is the issue that by using a lambda, you're injecting the time to construct the dictionaries into the "connect" sequence? Would it be more resilient to just build the dictionaries first? Plus, one reason I avoided going down this path is the concerns I had about batching entries (which the external tool already seems to handle).
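
For reference, a minimal sketch of what building the dictionaries up front might look like, reusing the collection, documents, and self.reset_collection names from the snippet above (illustrative only, not code from this repo):

# Materialize the dictionaries before calling import_bulk, so the
# serialization cost is paid before the request is issued.
docs = [doc.to_dict() for doc in documents]

res = collection.import_bulk(
    documents=docs,
    overwrite=self.reset_collection,
)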

@hadisinaee
Collaborator Author

Interesting. I haven't tried to use the bulk uploader API call, and I haven't seen this issue with the arangoimport tool, even using it to upload a file of ~5 million entries to a WAN-based ArangoDB instance.

Yeah, the arangoimport can handle large files, but the API seems to be tricky to use.

Is the issue that by using a lambda, you're injecting the time to construct the dictionaries into the "connect" sequence? Would it be more resilient to just build the dictionaries first? Plus, one reason I avoided going down this path is the concerns I had about batching entries (which the external tool already seems to handle).

Yes, it might be that. I can try to build the array first and pass it to the function. If that doesn't work properly, I'll fall back to simply running arangoimport from my Python script, along the lines of the sketch below. I'll give it a try.
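
A rough sketch of driving arangoimport from Python (the file path, collection name, endpoint, and credentials here are placeholders, and the flags should be double-checked against the installed ArangoDB client tools):

import json
import subprocess

# Dump the documents to a JSONL file, one JSON object per line.
with open("objects.jsonl", "w") as f:
    for doc in documents:
        f.write(json.dumps(doc.to_dict()) + "\n")

# Shell out to arangoimport; --overwrite clears the collection first.
subprocess.run(
    [
        "arangoimport",
        "--file", "objects.jsonl",
        "--type", "jsonl",
        "--collection", "Objects",
        "--server.endpoint", "tcp://127.0.0.1:8529",
        "--server.database", "_system",
        "--overwrite", "true",
    ],
    check=True,
)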

@hadisinaee
Collaborator Author

I worked on this issue and tried the following methods:

  • In the issue body, I explained that I used map to build the list of documents to import. One hypothesis was that this is the bottleneck, because the list has to be built during ingestion and that takes a lot of time. So I built the list before passing it to the function for import. TL;DR: it didn’t work; the Docker container shuts down! I don’t know what the problem is with it. I think Tony runs ArangoDB on his own machine; I was wondering whether he encounters the same issue.
  • The second approach I tried was to split the array into chunks and insert them into the database (roughly the sketch shown after this list). It worked, but it took ~3-5 minutes to ingest both the vertices (~830K docs) and the relationships (1.6M docs) with a chunk size of 50K docs per upload. I tried different chunk sizes, but 50K was the maximum I could use.
  • The third approach I tried was to parallelize the second one: each thread ingested its own portion of the data, initially over a single shared connection. Unsurprisingly, it failed, because it ends up similar to the first approach. I wanted to see whether it is the database connection that cannot handle an array that large or the database itself, so I then created one connection per thread, but that didn’t work either.
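
A minimal sketch of the chunked import from the second bullet, assuming docs is the pre-built list of dictionaries and collection is the python-arango collection handle:

# 50K documents per request was the largest chunk size that worked for me.
CHUNK_SIZE = 50_000

for start in range(0, len(docs), CHUNK_SIZE):
    chunk = docs[start:start + CHUNK_SIZE]
    collection.import_bulk(documents=chunk)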
