Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Example in readme should not use chunksize=2 #25

Open
ShaneHarvey opened this issue Oct 7, 2022 · 3 comments
Open

Example in readme should not use chunksize=2 #25

ShaneHarvey opened this issue Oct 7, 2022 · 3 comments

Comments

@ShaneHarvey
Copy link

The example in readme should not use chunksize=2:

# Read Dask Bag from Mongo database
b = dask_mongo.read_mongo(
    database="your_database",
    collection="your_collection",
    connection_kwargs={"host": "localhost", "port": 27017},
    chunksize=2,
)

IIUC a chunksize of 2 will cause read_mongo to execute 1 query for every 2 documents in the result set, with 1000 docs there will be 500 queries. Let's use a realistic value of chunksize (maybe 10,000?) to avoid users copy/pasting a poor default value.

@ShaneHarvey
Copy link
Author

It would be even better to be able to configure the npartitions directly rather than the chunksize. The we can remove the duplicate query for the $count. Something like:

# Read Dask Bag from Mongo database
b = dask_mongo.read_mongo(
    database="your_database",
    collection="your_collection",
    connection_kwargs={"host": "localhost", "port": 27017},
    npartitions=16,
)

@phobson
Copy link

phobson commented Nov 29, 2022

Hey @ShaneHarvey thanks for bringing this to our attention. In general, I agree that specifying npartitions would be better, but the example in the README is very small and self-contained. Do you think that it would be better to generate a bigger example dask bag at the risk of making the example less concrete and grok-able? Or perhaps we could add comments next to the chunksize and npartitions kwargs

@ShaneHarvey
Copy link
Author

Adding an inline comment would be a great start, maybe something like chunksize=2, # Increase this value for better performance.. I still worry about using a poor default value even if it's a toy example.

Support for the npartitions feature would be a nice addition since I believe it would scale better in most cases. When npartitions is added, we could update the readme to use npartitions=16. Sure, the same problem of using a poor default value exists here (maybe npartitions=4 or npartitions=50 is better for a given query) but I imagine npartitions=16 is much better than chunksize=2 in general.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants