Add a Corpus Download Step #25

Open
varisd opened this issue Jul 19, 2024 · 3 comments
Labels
feature New feature or request

Comments

@varisd
Contributor

varisd commented Jul 19, 2024

We should implement a CorpusStep that downloads a corpus from an existing URL and sets it up for the follow-up steps.
Ideally, this step would be included in the usage examples so that users do not have to set up the data directories (for RawStep) themselves.
In the end, a pipeline with the download step should run out of the box using a single init and run command, without requiring any knowledge of OpusPocus.
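
To make the proposal concrete, here is a minimal sketch of the download logic such a step could wrap. It is an illustration only: `download_corpus`, its parameters, and the `.part` temporary-file convention are hypothetical and not part of the existing OpusPocus API.

```python
import shutil
import urllib.request
from pathlib import Path
from typing import Optional


def download_corpus(url: str, data_dir: Path, filename: Optional[str] = None) -> Path:
    """Fetch `url` into `data_dir` and return the local path.

    Hypothetical helper for illustration; not an OpusPocus function.
    """
    data_dir.mkdir(parents=True, exist_ok=True)

    # Default to the last path component of the URL as the local file name.
    target = data_dir / (filename or url.rstrip("/").rsplit("/", 1)[-1])

    # Skip the download if the file is already present, so re-running is cheap.
    if target.exists():
        return target

    # Write to a temporary name first so an interrupted download
    # never leaves a truncated file under the final name.
    tmp = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(tmp, "wb") as out:
        shutil.copyfileobj(response, out)
    tmp.replace(target)
    return target
```

A step built around this would call the helper once per corpus file and then expose `data_dir` to the follow-up steps in the same layout they currently expect from a manually prepared directory.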

@varisd varisd added the feature New feature or request label Jul 19, 2024
@bhaddow
Contributor

bhaddow commented Jul 19, 2024

This is an essential step for model building, but should OpusPocus handle downloading itself, or defer it to OpusCleaner (or even mtdata)?

@varisd
Contributor Author

varisd commented Jul 19, 2024

If I remember correctly, OpusCleaner did not support corpus download outside of the web UI at the time it was integrated into OpusPocus, so downloading was avoided. However, I am not against putting this burden on OpusCleaner if it can handle it.

On the other hand, in practice the end user will still need to get the filter.json files for OpusCleaner cleaning from somewhere. That brings us back to the original question: should we just include a dedicated step that downloads the data and the filter.json files from specific URLs (data and filters can be in different locations)?
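
As a rough sketch of what "data and filters in different locations" could mean for such a step, the snippet below pairs each source URL with a destination in one raw data directory. Everything here is an assumption made for illustration: the `CorpusSource` class, the `plan_downloads` function, the example.org URLs, and the `<name>.filters.json` destination name are hypothetical, not an existing OpusPocus or OpusCleaner interface.

```python
from dataclasses import dataclass
from pathlib import Path, PurePosixPath
from typing import List, Tuple
from urllib.parse import urlsplit


@dataclass(frozen=True)
class CorpusSource:
    """One dataset whose data and filter definition live at separate URLs."""

    name: str         # dataset identifier chosen by the user (hypothetical)
    data_url: str     # where the corpus file is hosted
    filters_url: str  # where the matching filter.json is hosted


def plan_downloads(source: CorpusSource, raw_dir: Path) -> List[Tuple[str, Path]]:
    """Return (url, destination) pairs; nothing is downloaded here."""
    # Keep the corpus file name exactly as it appears in the URL path.
    data_name = PurePosixPath(urlsplit(source.data_url).path).name
    # Assumed layout: store the filter file next to the data, keyed by dataset name.
    filters_name = f"{source.name}.filters.json"
    return [
        (source.data_url, raw_dir / data_name),
        (source.filters_url, raw_dir / filters_name),
    ]


if __name__ == "__main__":
    toy = CorpusSource(
        name="toy",
        data_url="https://data.example.org/corpora/toy.en-de.tsv.gz",
        filters_url="https://filters.example.org/toy/filter.json",
    )
    for url, destination in plan_downloads(toy, Path("data/raw")):
        print(url, "->", destination)
```

Separating the plan from the actual fetching would leave open whether the download itself is done by OpusPocus or delegated to another tool.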

@bhaddow
Contributor

bhaddow commented Jul 19, 2024

There is now an opuscleaner-download binary in OpusCleaner (I need to get it into PyPI).

But yes, we still need OpusPocus to download the filter.json and call this.
