Add dataset: old_book_illustrations #69

Open
1 task done
giganttheo opened this issue Jul 22, 2022 · 14 comments
Labels
dataset Dataset to be added

Comments

@giganttheo

A URL for this dataset

https://www.oldbookillustrations.com/

Dataset description

The Old Book Illustrations website hosts a collection of illustrations scanned from old books. Each illustration page also gives information about the illustrator, the illustration, and the book it is taken from, as well as a title, a description, and a few keywords. As of today, the website contains 3150 images.

I already wrote a script to scrape all the content, since the API does not give access to all the information (for instance, the image is not in the best resolution).

Is it a dataset that is relevant for this project?

About the license, the website reads:

  • Text content (descriptions, translations, etc.) is published under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
  • Although we do our best to offer only Illustrations that are considered public domain in most countries, copyright laws vary from one jurisdiction to another, and you agree that you are solely responsible for abiding by all laws and regulations that may be applicable to using the Illustrations.

More info on the terms of use page.

Dataset modality

Image

Dataset licence

Creative Commons Public Domain Dedication and Certification

Other licence

No response

How can you access this data

Other

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

@giganttheo giganttheo added the candidate-dataset Proposed dataset to be added label Jul 22, 2022
@davanstrien davanstrien changed the title Add dataset: [old_book_illustrations] Add dataset: old_book_illustrations Jul 25, 2022
@davanstrien
Collaborator

@giganttheo, thanks for suggesting this. I think it's a super interesting dataset, but I have a few questions about how we could access this dataset. On the terms of use they say:

You are welcome to download as many pictures as you wish, with no restriction in time or quantity; but we do not approve of the use of offline browsing software, or website downloaders, such as HTTRack, WebReaper, etc, due to the heavy load they put on the server. Please don’t use them.

I think a scraping script would likely fall under this category. I suggest that it might be worth reaching out to the website creators to ask if they would be keen to contribute a dataset derived from the site. It may also be possible to get some additional metadata about the items. In particular, it would be helpful to have a citation to the source for each image included in the dataset, so it's possible to confirm the copyright status of those items if needed.

WDYT?

@giganttheo
Author

Yeah you are right.
I just sent an email to the contact address from the website, to ask for a mirror or a special authorization to use a scraping tool.
I'll update this if I have a response.

@davanstrien
Collaborator

Thanks! It would be great to have this available, so hopefully they are keen :)

@giganttheo
Author

Update, I received a response from the webmaster:

Hi Théo,
Thank you for your interest in oldbookillustrations.com. We do indeed restrict bulk downloads, simply because we're on fairly cheap hosting and are concerned the server might not withstand the strain of full-on pounding that's often involved in site-scraping. Having said that, I have no objection to allowing you a one-time access to get the data needed for your project, provided we can agree on "gentle" settings for your scraper.
I would imagine that allowing around 3 to 5 seconds between the download of each image file, and a pause of about 10 seconds every ten downloads would be safe. I'm not sure what else you would need and how many hits that would generate, but the same amount of caution would be in order. Even better if the bulk of the activity can take place between 3 am and 7 am GMT.
The total size of the available image files is around 8 GB.
Hope this helps.
With best regards,
Harvey Livet,
Webmaster of oldbookillustrations.com

I will run a scraping tool within the restrictions agreed upon, and then upload the dataset to the hub!
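As a rough illustration of the "gentle" settings described in the webmaster's email, here is a minimal Python sketch of the pacing logic. It is not the script actually used for this dataset. The function names, the fixed 4-second delay (picked from the suggested 3 to 5 second range), and the choice to add the 10-second pause on top of the per-download delay are all assumptions made for the example.

```python
import time
from datetime import datetime, timezone
from urllib.request import urlopen


def in_quiet_window(now=None, start_hour=3, end_hour=7):
    """True when the UTC (GMT) hour falls within [start_hour, end_hour)."""
    now = now or datetime.now(timezone.utc)
    return start_hour <= now.hour < end_hour


def pause_after(count, per_item=4.0, batch_size=10, batch_pause=10.0):
    """Seconds to wait after the count-th download (1-based).

    Assumption: the longer pause is added to the per-download delay
    after every batch_size downloads.
    """
    extra = batch_pause if count % batch_size == 0 else 0.0
    return per_item + extra


def polite_download(urls, fetch=None, sleep=time.sleep):
    """Fetch each URL in turn, sleeping after every request."""
    # The default fetcher is a plain HTTP GET; pass your own to customise.
    fetch = fetch or (lambda url: urlopen(url).read())
    results = {}
    for count, url in enumerate(urls, start=1):
        results[url] = fetch(url)
        sleep(pause_after(count))
    return results
```

A run could be started only when `in_quiet_window()` returns true, to keep most of the activity between 3 am and 7 am GMT as requested. The `fetch` and `sleep` parameters are injectable so the pacing can be checked without touching the server.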

@davanstrien
Collaborator

Awesome, the only other thing I would check is that when you download the images we can get sufficient metadata for each image to verify the licence/copyright. What information is downloaded at the moment?

@giganttheo
Author

We have access to:
the artist name (of the illustration), the engravers, the book and author (of the book), as well as the source of the illustration, and the Open Library record. For instance, you can check this page: https://www.oldbookillustrations.com/illustrations/pula-temple-augustus/

All the information that is shown on this page can be scraped. (I prefer scraping the page to downloading the "JSON record", which lacks some image sizes and the keywords, for instance.)

@giganttheo
Author

The source reads:

The New York Public Library believes that this item is in the public domain under the laws of the United States, but did not make a determination as to its copyright status under the copyright laws of other countries.

@davanstrien
Collaborator

Great, that looks good. I think if we can include the source information/URL that would be great. My own preference would also be to include as much information as possible for each image, since it may be useful for someone working with the data. WDYT?

@giganttheo
Author

giganttheo commented Jul 28, 2022

Last night, I scraped the pages from the website, by following the restrictions agreed upon. This is the resulting dataset, stored on the hub: https://huggingface.co/datasets/gigant/oldbookillustrations_2

Do you think any other information might be interesting? It contains most of the data from the pages, along with the URLs and the sources.

If that's ok for you, I can add it to the BigLAM org, and create a comprehensive dataset card.

@davanstrien
Collaborator

This looks amazing! Happy for you to move to the BigLAM org -- let me know if you want any help with the dataset card.

@giganttheo
Author

Update:

Here is the dataset on the hub, with a comprehensive dataset card:
https://huggingface.co/datasets/biglam/oldbookillustrations

Let me know what you think can be improved.

@davanstrien
Collaborator

Thanks so much for this. Having given this a bit more thought, I think it probably makes sense to try and filter out the items which may have copyright issues. I think the best way to do this would be to filter out based on the artist_date field and set a reasonably conservative threshold. My suggestion would be to push this to a new dataset and keep the other one private. This means we can use the current version to update the cleaned version in the future.

@giganttheo WDYT?

@davanstrien
Collaborator

davanstrien commented Aug 18, 2022

Suggested approach in this notebook: https://gist.github.com/davanstrien/e34e239cbf792057f79e2e2162d1e4b1
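The notebook is not reproduced here. As a rough illustration of the general idea of thresholding on the artist_date field, a minimal Python sketch might look like the following. The format of artist_date (free text containing four-digit years), the helper names, and the 1850 cutoff are assumptions made for the example, and the cutoff is a placeholder rather than a recommended value.

```python
import re

# Matches four-digit years from 1000 to 2099 (assumed to cover the field's values).
YEAR = re.compile(r"\b(1[0-9]{3}|20[0-9]{2})\b")


def latest_year(artist_date):
    """Return the latest four-digit year found in the field, or None."""
    if not artist_date:
        return None
    years = [int(y) for y in YEAR.findall(str(artist_date))]
    return max(years) if years else None


def keep_record(record, cutoff_year=1850):
    """Keep a record only if its latest artist year is at or before the cutoff.

    Records with a missing or unparseable date are dropped, which is the
    conservative choice for a copyright filter.
    """
    year = latest_year(record.get("artist_date"))
    return year is not None and year <= cutoff_year
```

With the Hugging Face `datasets` library, a predicate like this could be passed to `Dataset.filter` to build the cleaned version while the full version is kept separately.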

@giganttheo
Author

I agree that it would be nice to make sure the version we share does not have any copyright infringement issue. However, from my understanding, checking if a work is public domain might not be as straightforward as the filter you set up, since it depends on whether the work was published or not, as well as the country of origin. According to the "public domain" Wikipedia Page:

Determination of whether a copyright has expired depends on an examination of the copyright in its source country.

I think adding a filter to know if an artwork is public domain is a good idea, but it will require a lot more work: the dataset would need to include some information I left out (I only included the artist's birth date, for instance, but really it's the death date that matters more in this case), and we need some basics of copyright law for the source countries of the artworks.

In my opinion, we could keep a complete version available at first, with a warning about copyright issues, and then, when a new version is ready, add another one with the public domain artworks only. What do you think?

I will investigate this when I have some time.
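To make the point about death dates and source countries concrete, here is a minimal Python sketch of a check that takes both into account. It is only an illustration: the field names `artist_death_year` and `country` are hypothetical (the dataset does not contain them at this point), the per-country terms must be supplied by the caller, and the sketch ignores other factors mentioned above, such as whether the work was published.

```python
from datetime import date


def public_domain_status(death_year, term_years, today_year=None):
    """Return True or False, or None when the answer cannot be determined."""
    if death_year is None or term_years is None:
        return None
    if today_year is None:
        today_year = date.today().year
    # Assumption: the term runs to the end of the calendar year, so the
    # work is treated as free only from the following year onwards.
    return today_year > death_year + term_years


def classify(record, terms_by_country, today_year=None):
    """Look up the term for the record's country and apply the check."""
    term = terms_by_country.get(record.get("country"))
    return public_domain_status(record.get("artist_death_year"), term, today_year)
```

Returning None instead of guessing keeps records with a missing death date or an unknown country visible, so they can be reviewed by hand rather than silently kept or dropped.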


2 participants