Add dataset: old_book_illustrations #69
Comments
@giganttheo, thanks for suggesting this. I think it's a super interesting dataset, but I have a few questions about how we could access this dataset. On the terms of use they say:
I think a scraping script would likely fall under this category. I suggest that it might be worth reaching out to the website creators to ask if they would be keen to contribute a dataset derived from the site. It may also be possible to get some additional metadata about the items. In particular, it would be helpful to have a citation to the source for each image included in the dataset, so it's possible to confirm the copyright status of those items if needed. WDYT? |
Yeah you are right. |
Thanks! would be great to have this available so hopefully they are keen :) |
Update, I received a response from the webmaster:
I will use a scraping tool by respecting the restrictions agreed upon. And then upload the dataset to the hub! |
Awesome, the only other thing I would check is that, when you download the images, we can get sufficient metadata for each image to verify the licence/copyright. What information is downloaded at the moment? |
We have access to: All the information that is shown on this page can be scraped. (I prefer scraping the page to downloading the "json record", which lacks some image sizes and the keywords, for instance.) |
The source reads: The New York Public Library believes that this item is in the public domain under the laws of the United States, but did not make a determination as to its copyright status under the copyright laws of other countries. |
Great, that looks good. I think if we can include the source information/URL that would be great. My own preference would also be to include as much information as possible for each image, since it may be useful for someone working with the data. WDYT? |
Last night, I scraped the pages from the website, following the restrictions agreed upon. This is the resulting dataset, stored on the hub: https://huggingface.co/datasets/gigant/oldbookillustrations_2 Do you think any other information might be interesting? It contains most of the data from the pages, with the URLs and the sources. If that's OK with you, I can add it to the BigLAM org and create a comprehensive dataset card. |
This looks amazing! Happy for you to move to the BigLAM org -- let me know if you want any help with the dataset card. |
Update: Here is the dataset on the hub, with a comprehensive dataset card: Let me know what you think can be improved. |
Thanks so much for this. Having given this a bit more thought, I think it probably makes sense to try and filter out the items which may have copyright issues. I think the best way to do this would be to filter out based on the @giganttheo WDYT? |
suggested approach in this notebook https://gist.github.com/davanstrien/e34e239cbf792057f79e2e2162d1e4b1 |
I agree that it would be nice to make sure the version we share does not have any copyright infringement issue. However, from my understanding, checking if a work is public domain might not be as straightforward as the filter you set up, since it depends on whether the work was published or not, as well as the country of origin. According to the "public domain" Wikipedia Page:
I think adding a filter to know if an artwork is public domain is a good idea, but it will require a lot more work: the dataset should include some information I missed (I only included the artist's birth date, for instance, but really it's the death date that matters more in that case), and we need some basics of copyright law for the source countries of the artworks. In my opinion, we could keep a complete version available, with a warning about copyright issues at first, and then when a new version is ready we could add another one with the public domain artworks only. What do you think? I will investigate this when I have some time. |
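To make the death-date idea above concrete, here is a minimal sketch of what such a filter might look like. It is only an illustration: the `Illustration` class, the `artist_death_year` field and the single "life plus 70 years" term are assumptions, not fields or rules taken from the dataset or the website. As noted above, the real rule depends on the country of origin and on whether the work was published, so this should be read as a conservative first pass rather than a legal determination.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Illustration:
    """Hypothetical record; the real dataset columns may differ."""
    title: str
    artist_death_year: Optional[int]  # None when the death date is unknown


def is_likely_public_domain(item: Illustration, current_year: int,
                            term_years: int = 70) -> bool:
    """Assumed rule: protection ends at the close of the year that falls
    `term_years` after the artist's death. Unknown death dates are treated
    as not public domain, to stay on the safe side."""
    if item.artist_death_year is None:
        return False
    return current_year - item.artist_death_year > term_years


def filter_public_domain(items: List[Illustration], current_year: int,
                         term_years: int = 70) -> List[Illustration]:
    """Keep only the items that pass the assumed life-plus-term check."""
    return [i for i in items
            if is_likely_public_domain(i, current_year, term_years)]


# Made-up sample records, for illustration only.
sample = [
    Illustration("Plate A", 1883),
    Illustration("Plate B", 1960),
    Illustration("Plate C", None),
]
kept = filter_public_domain(sample, current_year=2022)
kept_titles = [i.title for i in kept]
```

Treating unknown death dates as "not public domain" keeps the filtered version conservative; those items would still be available in the complete version with the copyright warning.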
A URL for this dataset
https://www.oldbookillustrations.com/
Dataset description
The Old Book Illustrations website contains a dataset of illustrations scanned from old books. Each illustration page also contains information about the illustrator, the illustration and the book it's taken from, as well as a title, a description, and a few keywords. As of today, the website contains 3150 images.
I already wrote a script to scrape all the content, since the API does not give access to all the information (for instance, the image is not in the best resolution).
Is it a dataset that is relevant for this project?
About the license, the website reads:
More info on the terms of use page.
Dataset modality
Image
Dataset licence
Creative Commons Public Domain Dedication and Certification
Other licence
No response
How can you access this data
Other
Confirm the dataset has an open licence
Contact details for data custodian
No response