We should implement a CorpusStep that downloads data from an existing URL and sets it up for the follow-up steps.
This step would ideally be included in the usage examples so that users do not have to set up the data directories (for RawStep) themselves.
In the end, a pipeline with the download step should run out of the box with a single init and run command, without requiring any knowledge of OpusPocus.
If I remember correctly, OpusCleaner did not support corpus download outside of the web UI at the time it was integrated into OpusPocus, so downloading was avoided. However, I am not against putting this burden on OpusCleaner if it can handle it.
On the other hand, in practice the end user will still need to get the filters.json files for OpusCleaner cleaning from somewhere, which brings us back to the original question: should we just include a dedicated step that downloads the data and the filters.json files from specified URLs (the data and the filters can be in different locations)?
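To make the dedicated-step option more concrete, here is a rough sketch of the download logic such a step could wrap. It uses only the Python standard library and is not based on the actual OpusPocus or OpusCleaner API: the function names `download_file` and `prepare_raw_dir`, and the assumption that corpora and filter files can live side by side in one directory, are hypothetical and would need to be adapted to whatever layout RawStep expects.

```python
import shutil
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def download_file(url: str, dest_dir: Path) -> Path:
    """Fetch `url` into `dest_dir`, keeping the remote file name (hypothetical helper)."""
    dest_dir.mkdir(parents=True, exist_ok=True)
    name = Path(urlparse(url).path).name
    if not name:
        raise ValueError(f"cannot derive a file name from {url!r}")
    target = dest_dir / name
    if target.exists():
        # Already fetched by an earlier run; do not download again.
        return target
    # Write to a temporary file first so that an interrupted download
    # never leaves a truncated file under the final name.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target


def prepare_raw_dir(corpus_urls, filter_urls, raw_dir):
    """Download corpora and filter definitions, which may come from different hosts.

    Assumption: both kinds of files can be placed in the same directory.
    """
    raw_dir = Path(raw_dir)
    corpora = [download_file(url, raw_dir) for url in corpus_urls]
    filters = [download_file(url, raw_dir) for url in filter_urls]
    return corpora, filters
```

Keeping the corpus URLs and the filter URLs as two separate lists covers the case where the data and the filters are hosted in different locations, and skipping files that already exist means the step can be rerun safely.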