Example in readme should not use chunksize=2 #25
Comments
It would be even better to be able to configure npartitions directly rather than chunksize. Then we can remove the duplicate query for the $count. Something like:
# Read Dask Bag from Mongo database
b = dask_mongo.read_mongo(
database="your_database",
collection="your_collection",
connection_kwargs={"host": "localhost", "port": 27017},
npartitions=16,
)
Hey @ShaneHarvey, thanks for bringing this to our attention. In general, I agree that specifying
Adding an inline comment would be a great start, maybe something like
Support for the npartitions feature would be a nice addition, since I believe it would scale better in most cases. When npartitions is added, we could update the readme to use npartitions=16. Sure, the same problem of a poor default value exists here (maybe npartitions=4 or npartitions=50 is better for a given query), but I imagine npartitions=16 is much better than chunksize=2 in general.
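Until npartitions is supported, a target partition count can be approximated by deriving a chunksize from it. The following is only a sketch: `chunksize_for` is a hypothetical helper, not part of dask_mongo, and it assumes the total document count is already known (for example from a separate count query).

```python
import math


def chunksize_for(total_docs: int, npartitions: int) -> int:
    """Hypothetical helper: pick a chunksize that yields at most `npartitions` chunks.

    Assumes each partition holds up to `chunksize` documents.
    """
    if total_docs < 0:
        raise ValueError("total_docs must be >= 0")
    if npartitions < 1:
        raise ValueError("npartitions must be >= 1")
    # Round up so the last partition absorbs the remainder; never return 0.
    return max(1, math.ceil(total_docs / npartitions))


if __name__ == "__main__":
    # 1000 documents spread over 16 partitions
    print(chunksize_for(1000, 16))
```

The result could then be passed as chunksize in the existing call. Because of the rounding, the actual number of partitions can come out slightly lower than the target.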
The example in the readme should not use chunksize=2:
IIUC, a chunksize of 2 will cause read_mongo to execute 1 query for every 2 documents in the result set, so with 1000 docs there will be 500 queries. Let's use a realistic chunksize (maybe 10,000?) so users don't copy/paste a poor default value.
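To make the arithmetic concrete, here is a small sketch of the query count under the assumption stated above (one query per chunk of chunksize documents). `queries_for` is a hypothetical helper used only for illustration; it does not call dask_mongo.

```python
import math


def queries_for(total_docs: int, chunksize: int) -> int:
    """Hypothetical helper: number of queries, assuming one query per chunk."""
    if chunksize < 1:
        raise ValueError("chunksize must be >= 1")
    return math.ceil(total_docs / chunksize)


if __name__ == "__main__":
    # Compare the readme value with the suggested one for 1000 documents.
    for chunksize in (2, 10_000):
        print(f"chunksize={chunksize}: {queries_for(1000, chunksize)} queries")
```

With 1000 documents, chunksize=2 gives 500 queries, while chunksize=10,000 gives a single query.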