Add dataset: biodiversity_heritage_library #72

Open
1 task done
nabsiddiqui opened this issue Jul 27, 2022 · 12 comments
Labels
candidate-dataset Proposed dataset to be added

Comments

@nabsiddiqui

A URL for this dataset

https://www.flickr.com/photos/biodivlibrary/

Dataset description

Dataset of images uploaded from the Biodiversity Heritage Library hosted at the Smithsonian Institution. Useful for GANs, understanding the history of biodiversity, understanding changing aesthetic values over time, etc.

Dataset modality

Image

Dataset licence

Creative Commons Attribution Non Commercial Share Alike 4.0 International

Other licence

No response

How can you access this data

As a download from a repository/website

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

nabsiddiqui added the candidate-dataset label Jul 27, 2022
@nabsiddiqui
Author

nabsiddiqui commented Jul 28, 2022

@MikeTrizna, @cceyda, and @shamikbose. This is the new repository for the biodiversity heritage library Flickr.

@MikeTrizna, I think the easiest way is for me to download the images and extract the EXIF data using R. It shouldn't take me too long. I will then upload it to Hugging Face. After that, I think the thing to do is create a variety of metadata that we can just upload in a CSV. This is where you could add the embeddings information.
My goal is to have something like the following:

Image File | Description | Keyword | Embeddings | Labels from VGG16 | Labels from VGG19 | Other Metadata

I agree that ImageNet is not ideal, but we could use Memespector to run through the images just in case. I haven't used Memespector with this many images before, so let's see how it goes.

I will try to think of some ideas for the Journal of Open Humanities Data article and create a Google Doc after that and invite everyone to contribute.

@davanstrien
Collaborator

Sorry for coming to this discussion a bit late. My preference would be to include the tags/descriptions generated by cataloguers/Flickr users rather than add labels output by a generic ML model.

For example, this image https://www.flickr.com/photos/biodivlibrary/7268619716/in/photolist-2kM8xSd-9vVsjU-2kM8xXD-XGMoPR-2kF7kHZ-wnaCYy-2jAiBTp-bu7ij8-c5izpA-bvSk4u-atGxWy-atDT9z-aUT4DB-atGxQS-akH6mL-d9s1hq-2m8ksrw-vGLxu7-wEgNjr-eV4dYa-wBsXJj-2j1mFUu-wEgJuH-wnhPxM-wEgKMH-ag15Gu-ag15CN-bK4bht contains the following EXIF data on Flickr

JFIFVersion - 1.02
X-Resolution - 96 dpi
Y-Resolution - 96 dpi
XMPToolkit - Image::ExifTool 12.30
About - uuid:faf5bdd5-ba3d-11da-ad31-d33d75182f1b
Creator - Craig, Hugh, ed.
Description - Johnson's household book of nature,. New York,H.J.Johnson,[1880]. http://biodiversitylibrary.org/page/39741740
Rights - Public Domain
Subject - Mammals
Credit - Image courtesy of BHL
Source - http://biodiversitylibrary.org 

and has the following tags:
[Screenshot: Flickr tags for this image]

Although that information is noisy, I think it is more useful and could be used to evaluate noisy supervision and/or contrastive learning methods.

I'm not familiar with memespector but from looking at the GitHub for the tool, it seems like this offers a GUI to run images through one of the commercial vision platforms?

I just put an example through the Google API:

[Screenshot: labels returned by the Google API for the example image]

Although this prediction isn't 'wrong', it's also arguably not that useful. I'm also quite wary of using commercial platforms like this because it's often difficult/impossible to get a complete list of possible labels they can assign to an image. Potentially something that could be evaluated using the existing metadata for the images is how closely the predictions of one of the commercial API vision services match the labels assigned on Flickr?
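
A rough sketch of how that comparison could be scored for one image, assuming the API predictions and the Flickr tags are both available as plain lists of strings. The function names and the example labels are hypothetical, and matching is by exact string after lowercasing, so near-matches such as "Mammal" and "mammals" count as misses.

```python
def normalize(label: str) -> str:
    """Lowercase and collapse whitespace so 'Art ' and 'art' compare equal."""
    return " ".join(label.lower().split())


def label_agreement(predicted, flickr_tags):
    """Compare API-predicted labels with Flickr tags for a single image.

    Returns precision (share of predictions found among the tags),
    recall (share of tags recovered by the predictions) and the
    Jaccard index (overlap divided by union).
    """
    pred = {normalize(p) for p in predicted}
    tags = {normalize(t) for t in flickr_tags}
    overlap = pred & tags
    union = pred | tags
    return {
        "precision": len(overlap) / len(pred) if pred else 0.0,
        "recall": len(overlap) / len(tags) if tags else 0.0,
        "jaccard": len(overlap) / len(union) if union else 0.0,
    }


# Hypothetical labels, not taken from the actual API output or Flickr page.
scores = label_agreement(["Mammal", "Art", "Drawing"], ["mammals", "art", "bhl"])
print(scores)
```

Averaging these scores over all images would give one number per vision service. Stemming or a synonym list would be needed before treating near-matches as agreement.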

Maybe it makes sense to see what @MikeTrizna already has prepared and take it from there?

@nabsiddiqui
Author

@davanstrien That makes sense. It turns out my ISP has a limit on downloading so I can't download all the images. @MikeTrizna will have to upload them.

I think what @MikeTrizna has is the metadata of the files, which includes the EXIF data along with some fields from the Flickr API. I will go ahead and clean that data up and make it a CSV file. I'll put the results here.

As for Memespector: alongside the commercial APIs, it also offers free options such as the VGG models. It can also do a lot more interesting network analysis. See http://tallerdeletras.letras.uc.cl/index.php/Disena/article/view/27271/33509.

The open-source options do produce labels for each item, but running them over 300k images is probably too much. I may use them for a different dataset in the future. For this one, I think the metadata should be sufficient. I will start working on that.

@MikeTrizna

I don't think I ever mentioned that I would be uploading EXIF data, and as @davanstrien showed above, it is superfluous to the tags and information provided directly through Flickr. And Daniel, thanks for the clarification on the model-generated labels. I agree that it makes sense to leave those out, since they can always be regenerated separately as new and better models are released.

A quick question on the metadata: Is it possible to upload multiple CSVs to an image dataset? I'm specifically thinking of the tag data, which I would have to come up with some sort of concatenation scheme to squeeze into a single table.

@MikeTrizna

Oh, and @nabsiddiqui, where did you get the "Creative Commons Attribution Non Commercial Share Alike 4.0 International" license information that you show above? I haven't checked through all of the images, but it was my understanding that they are all marked as https://creativecommons.org/publicdomain/mark/1.0/. But if you saw somewhere in the documentation that maybe the dataset as a whole is CC BY-NC-SA 4.0, we can defer to that.

@shamikbose

shamikbose commented Jul 28, 2022 via email

@nabsiddiqui
Author

@MikeTrizna I looked at the script that you had. I believe the keywords, tags, etc. that you have are copies of the EXIF data, plus additional metadata such as "ispublic" from the Flickr API. I am using the same fields and am cleaning the data up a bit so that it is easier to work with. For instance, I've moved the page information in "keywords" into a separate column called "page". Hope that makes sense. I'll upload both the JSON and CSV, following the suggestion of @shamikbose.
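
A minimal sketch of that kind of column split, assuming the page reference shows up as a biodiversitylibrary.org page URL like the one in the Description field quoted earlier in this thread. The record keys ("keywords", "description") and the function name are hypothetical. The real export may store the page differently.

```python
import re

# Matches the page ID in URLs such as http://biodiversitylibrary.org/page/39741740
PAGE_URL = re.compile(r"biodiversitylibrary\.org/page/(\d+)")


def split_page(record: dict) -> dict:
    """Return a copy of the record with a separate 'page' column.

    The first page URL found in the 'keywords' or 'description' text
    is stored as an integer; 'page' is None when no URL is present.
    """
    cleaned = dict(record)
    cleaned["page"] = None
    for field in ("keywords", "description"):
        match = PAGE_URL.search(record.get(field) or "")
        if match:
            cleaned["page"] = int(match.group(1))
            break
    return cleaned


example = {
    "description": (
        "Johnson's household book of nature,. New York,H.J.Johnson,[1880]. "
        "http://biodiversitylibrary.org/page/39741740"
    ),
    "keywords": "Mammals",
}
row = split_page(example)
print(row["page"])
```

The original text fields are left untouched, so the split can be re-run if the pattern turns out to miss some records.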

And yes, if you send me a CSV or JSON, I can merge it.

In terms of the copyright, I got it from https://about.biodiversitylibrary.org/help/copyright-and-reuse/#reuse. This may be incorrect.

@davanstrien
Collaborator

> A quick question on the metadata: Is it possible to upload multiple CSVs to an image dataset? I'm specifically thinking of the tag data, which I would have to come up with some sort of concatenation scheme to squeeze into a single table.

In the actual dataset you can have nested data/sequences so it could look something like:

| image | width | height | tags |
| --- | --- | --- | --- |
| image1.jpg | 200 | 100 | ["bird of paradise", "botany"] |

This also makes it a bit easier to deal with tags of different lengths etc.

This is for example how the annotations are stored in this dataset: https://huggingface.co/datasets/biglam/nls_chapbook_illustrations.

Practically it's possible to upload CSV/JSON files etc that contain this metadata. The dataset script could then do a lookup to get relevant tags etc. @MikeTrizna I'm happy to have a look at how the data is structured at the moment and see how it might fit in a dataset loading script if useful.
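
As an illustration of that lookup, here is a small sketch using only the standard library. It assumes a hypothetical long-format tag table (one row per image and tag) next to a per-image table. The column names and sample rows are invented, and this is not the actual loading script.

```python
import csv
import io
from collections import defaultdict

# Hypothetical long-format tag table: one row per (image, tag) pair.
TAGS_CSV = """image,tag
image1.jpg,bird of paradise
image1.jpg,botany
image2.jpg,mammals
"""

# Hypothetical per-image table.
IMAGES_CSV = """image,width,height
image1.jpg,200,100
image2.jpg,300,150
image3.jpg,120,80
"""


def load_tag_lookup(handle):
    """Group tags by image name, keeping the order in which they appear."""
    lookup = defaultdict(list)
    for row in csv.DictReader(handle):
        lookup[row["image"]].append(row["tag"])
    return lookup


def build_records(images_handle, tag_lookup):
    """Attach a (possibly empty) list of tags to every image row."""
    records = []
    for row in csv.DictReader(images_handle):
        records.append({
            "image": row["image"],
            "width": int(row["width"]),
            "height": int(row["height"]),
            "tags": list(tag_lookup.get(row["image"], [])),
        })
    return records


tag_lookup = load_tag_lookup(io.StringIO(TAGS_CSV))
records = build_records(io.StringIO(IMAGES_CSV), tag_lookup)
for record in records:
    print(record)
```

Keeping the tags in their own long-format file avoids any concatenation scheme. The nesting happens only when the records are assembled, and images without tags simply get an empty list.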

@MikeTrizna

@nabsiddiqui, thanks for pointing out that copyright page. That's the exact info I was looking for. I don't think I pulled down the image-level copyright status with my download, but I can go back and get that. If a large majority are public domain, we might be better off filtering to those so that we can label the whole dataset as such.
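
One possible way to do that filtering, sketched under the assumption that each record carries a "rights" string like the "Rights - Public Domain" EXIF field shown earlier in the thread. The key name, the sample records, and the exact-match rule are all hypothetical. The image-level status may instead come back from Flickr in another form.

```python
def is_public_domain(record: dict) -> bool:
    """True when the record's rights statement is exactly 'Public Domain'
    (ignoring case and surrounding whitespace)."""
    rights = (record.get("rights") or "").strip().lower()
    return rights == "public domain"


def filter_public_domain(records):
    """Return the public-domain records and their share of the total."""
    kept = [r for r in records if is_public_domain(r)]
    share = len(kept) / len(records) if records else 0.0
    return kept, share


# Hypothetical sample, not real BHL metadata.
sample = [
    {"id": "a", "rights": "Public Domain"},
    {"id": "b", "rights": "CC BY-NC-SA 4.0"},
    {"id": "c", "rights": None},
    {"id": "d", "rights": " public domain "},
]
kept, share = filter_public_domain(sample)
print([r["id"] for r in kept], share)
```

The returned share gives a quick check of whether a large majority really is public domain before deciding to drop the rest.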

@nabsiddiqui
Author

@davanstrien Yes, what you are describing is what I am working on. It requires extracting the tags from the JSON and cleaning them up. There are non-tags in the tags section, so there are some errors that I have to go through, and there are about 300k files.

@MikeTrizna I'm fine with doing that but don't know how we would filter it.

@MikeTrizna

In checking out the example that @davanstrien linked, I noticed that the files are being pulled directly from NLS servers. I'm sure that the data custodians at BHL would be more amenable to directly serving these files from BHL servers (or the Smithsonian FigShare repo) to show a little bit more "ownership". I am in discussions to meet with the BHL data manager next week, and I can run it by them then.

So in the meantime, @nabsiddiqui, I would hold off on the JSON file you are building.

@nabsiddiqui
Author

Sounds good. I just have the script for the JSON file running in the background. I'll still upload it here just in case but we can delete it if it's not needed.

I wanted to practice using the Flickr API anyway.
