-
Notifications
You must be signed in to change notification settings - Fork 341
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Streaming refacto #353
base: main
Are you sure you want to change the base?
Streaming refacto #353
Conversation
#339 related issue |
one alternative I'm considering (but maybe that can simply live next to batch and service instead of replacing them): implement things as tasks and then use a distributed queue like celery to run things. I don't like making installing a service like reddis part of user requirements though |
maybe ray https://stackoverflow.com/a/54738967 is an alternative to celery |
https://github.com/veonua/laion-yt-dlp example for celery |
I think this branch is an interesting POC but it's too many changes I think a PR that would only extract the core as a standalone function / process would be a good start. It may need dropping the keys based on shard first |
https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/downloader.py#L29-L51 I'm not sure I like the pydantic part |
maybe https://abseil.io/docs/python/quickstart is a good alternative |
https://docs.celeryq.dev/en/stable/userguide/tasks.html celery take on how to configure tasks |
the flags should be much more separated process by process |
https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/resizer.py#L98 gets duplicated compared to https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/resizer.py#L13C7-L13C22 ; how can we improve it |
https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/core/downloader.py#L133 fiddle/omegaconf might automate this line |
next step here: try to POC a few different ideas of parameter definition instead of pydantic |
anything can be generated into a fastapi model at the end if necessary https://github.com/rom1504/img2dataset/blob/streaming_refacto/img2dataset/service/service.py#L37 |
maybe check a few projects using ray or celery and see if something helps |
configuration for processes |
one idea:
input and output calls are generated based on feature names so have a generic config for processes then apply that here but also any2dataset/video2dataset |
try to write that out as json maybe so all actual config here would be one json per process we would also have support top level for the flat args at top level for legacy reasons |
write it out and see how that plays out for all the projects |
being able to configure these processes properly here should make all the wrapper around it (service, batch, distributors) come along naturally |
resizer options:
img_stream, blurring_bbox_list=None <- input feature this is a process that applies to one image json for this? |
maybe a typed thing is better here?
|
write down some of these and some potential schema and find what sticks |
Refactor img2dataset in 3 parts:
The main ideas here are:
The implementation is mostly working but requires refinement before considering merging.
Was implemented some 8 months ago, but I'm thinking to finish it now