Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

download remote files on demand #59

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

bertsky
Copy link
Contributor

@bertsky bertsky commented Feb 27, 2023

Fix #52

@bertsky
Copy link
Contributor Author

bertsky commented Feb 27, 2023

Not sure if this has the unintended consequence of changing remote URLs to local OTHER refs when in edit mode, because OCR-D core does not differentiate between .url and .local_filename – but this will be improved in core soon.

It works for me – perhaps if you merge with your new test branch?

@hnesk
Copy link
Owner

hnesk commented Feb 27, 2023

This is of course the simplest solution possible and I like it, but I'm a bit worried about UI responsiveness at startup when a lot of files need to be downloaded. With GTK and the Python GIL the UI will freeze until everything is downloaded.
Still better than crashing, like it is now.
The "right" solution would be to populate the page browser with the remote files, and download them in the background from another thread, so the UI stays responsive.
I will merge your changes into the test branch and will try the threaded solution based on that.

@bertsky
Copy link
Contributor Author

bertsky commented Feb 27, 2023

Indeed, it's not more than a workaround as it stands.

But apart from a multithreaded solution, couldn't we also replace remote images with placeholder images? These could be downloaded on (actual) demand in resolve_image...

@hnesk
Copy link
Owner

hnesk commented Mar 2, 2023

Can you try if the support_remote_images branch works for you?
I implemented the threaded download solution there. I'm not sure yet if we need additionally "download on demand"/"download by priority"-functionality, because if you want to check the last page of 500-page document that could still be a lot of waiting.

@bertsky
Copy link
Contributor Author

bertsky commented Mar 5, 2023

Can you try if the support_remote_images branch works for you?
I implemented the threaded download solution there.

Fantastic! That works perfectly well, and the wait time is no issue IMO – it does not crash, and eventually serves the image.

Before rebuilding the Docker image, one needs to update ocrd_browser/ui.gresource though.

And I did see a glitch on the command line (but not in the UI):

Traceback (most recent call last):
  File "/ocrd_browser/model/page.py", line 104, in get_image
    page_image, page_coords, page_image_info = ws.image_from_page(self.page, self.id, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/ocrd/workspace.py", line 606, in image_from_page
    page_image_info = self.resolve_image_exif(page.imageFilename)
AttributeError: 'NoneType' object has no attribute 'imageFilename'

(Might have been for a page without a PAGE though, so no issue really.)

@bertsky
Copy link
Contributor Author

bertsky commented Jun 28, 2023

Should I close here, so you can proceed merging https://github.com/hnesk/browse-ocrd/tree/support_remote_images?

@bertsky
Copy link
Contributor Author

bertsky commented Jan 18, 2024

I did find another glitch on your support_remote_images branch:

Traceback (most recent call last):
  File "/data/ocr-d/ocrd_browser/ocrd_browser/ui/icon_store.py", line 79, in _collect_workers
    row[i] = new_row_data[i]
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/gi/overrides/Gtk.py", line 1155, in __setitem__
    self.model.set_value(self.iter, key, value)
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/gi/overrides/Gtk.py", line 1049, in set_value
    value = self._convert_value(column, value)
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/gi/overrides/Gtk.py", line 921, in _convert_value
    return GObject.Value(self.get_column_type(column), value)
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/gi/overrides/GObject.py", line 208, in __init__
    self.set_value(py_value)
  File "/data/ocr-d/ocrd_all/venv38/lib/python3.8/site-packages/gi/overrides/GObject.py", line 240, in set_value
    raise TypeError("Expected string but got %s%s" %
TypeError: Expected string but got THUMBS/FILE_0001_THUMBS.png<class 'pathlib.PosixPath'>

It's a bit fuzzy – running multiple times will download all the missing files eventually.

The cause is a change in recent ocrd_models.OcrdFile.local_filename (after 2.54), which now is pathlib.Path instead of str.
@kba – was that an intentional API change? (I did not see it in the changelog.)

I can see the following spots which need adaptation (i.e. an extra str() around local_filename):

  • ui/page_store.py:202 and ui/page_store.py:207
  • view/images.py:134

@bertsky
Copy link
Contributor Author

bertsky commented Jan 19, 2024

The latter will instead be fixed in core: OCR-D/core#1167

@kba
Copy link
Contributor

kba commented Mar 13, 2024

OcrdFile.local_filename is back to being a str consistently since v2.63.0, so at least that blocker is gone.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support remote images
3 participants