Add dataset: biodiversity_heritage_library #72
@MikeTrizna, @cceyda, and @shamikbose: this is the new repository for the Biodiversity Heritage Library Flickr collection. @MikeTrizna, I think the easiest way is for me to download the images and extract the EXIF data using R. It shouldn't take me too long. I will then upload it to Hugging Face. After that, I think the thing to do is create a variety of metadata that we can just upload in a CSV. This is where you could add the embeddings information.
I agree that ImageNet is not ideal, but we could use memespector to run through the images just in case. I haven't used memespector with so many images before, so let's see how it goes. I will try to think of some ideas for the Journal of Open Humanities Data article, create a Google Doc after that, and invite everyone to contribute.
Sorry for coming to this discussion a bit late. My preference would be to include the tags/descriptions generated by cataloguers/Flickr users rather than add labels output from a generic ML model. For example, this image https://www.flickr.com/photos/biodivlibrary/7268619716/in/photolist-2kM8xSd-9vVsjU-2kM8xXD-XGMoPR-2kF7kHZ-wnaCYy-2jAiBTp-bu7ij8-c5izpA-bvSk4u-atGxWy-atDT9z-aUT4DB-atGxQS-akH6mL-d9s1hq-2m8ksrw-vGLxu7-wEgNjr-eV4dYa-wBsXJj-2j1mFUu-wEgJuH-wnhPxM-wEgKMH-ag15Gu-ag15CN-bK4bht contains the following EXIF data on Flickr:
Although that information is noisy, I think it is more useful, and it could be used to evaluate noisy supervision and/or contrastive learning methods. I'm not familiar with memespector, but from looking at the tool's GitHub page, it seems to offer a GUI for running images through one of the commercial vision platforms? I just put an example through the Google API: Although this prediction isn't 'wrong', it's also arguably not that useful. I'm also quite wary of using commercial platforms like this because it's often difficult/impossible to get a complete list of possible labels they can assign to an image. One thing that could potentially be evaluated against the existing metadata is how closely the predictions of a commercial vision API match the labels assigned on Flickr. Maybe it makes sense to see what @MikeTrizna already has prepared and take it from there?
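The comparison floated above could be scored quite simply. As a minimal sketch (the function name and the lowercasing normalisation are my own illustrative choices, not anything from the thread), one could compute the Jaccard overlap between an API's predicted labels and the Flickr tags:

```python
# Hypothetical sketch: Jaccard overlap between a vision API's predicted
# labels and the tags assigned on Flickr. Names and normalisation are
# illustrative assumptions, not part of any existing pipeline here.
def label_overlap(api_labels, flickr_tags):
    a = {label.lower() for label in api_labels}
    b = {tag.lower() for tag in flickr_tags}
    if not (a | b):
        return 0.0  # treat two empty label sets as zero overlap
    return len(a & b) / len(a | b)
```

A score near 1.0 would suggest the commercial API adds little beyond the existing Flickr labels; a low score would flag images where the two sources disagree and might deserve a closer look.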
@davanstrien That makes sense. It turns out my ISP has a limit on downloading, so I can't download all the images; @MikeTrizna will have to upload them. I think what @MikeTrizna has is the metadata of the files, which includes the EXIF data along with some Flickr API fields. I will go ahead and clean that data up and make it a CSV file. I'll put the results here. In terms of memespector: there are the commercial APIs it provides, along with free versions like VGNet, etc. It can also do a lot of more interesting network analysis. See http://tallerdeletras.letras.uc.cl/index.php/Disena/article/view/27271/33509. The open-source options provide labels for the items, but for 300k images, it is probably too much. I will perhaps use that for a different dataset in the future. For this, I think the metadata should be sufficient. I will start working on that.
I don't think I ever mentioned that I would be uploading EXIF data, and as @davanstrien showed above, it is superfluous to the tags and information provided directly through Flickr. And Daniel, thanks for the clarification on the model-generated labels. I agree that it makes sense to leave those out, since they can always be re-generated separately as new and better models are released. A quick question on the metadata: is it possible to upload multiple CSVs to an image dataset? I'm specifically thinking of the tag data, for which I would have to come up with some sort of concatenation scheme to squeeze it into a single table.
Oh, and @nabsiddiqui, where did you get the "Creative Commons Attribution Non Commercial Share Alike 4.0 International" license information that you show above? I haven't checked through all of the images, but it was my understanding that they are all marked as https://creativecommons.org/publicdomain/mark/1.0/. But if you saw somewhere in the documentation that maybe the dataset as a whole is CC BY-NC-SA 4.0, we can defer to that.
For multiple metadata tags, you could put them in a list/other iterable in the schema and it should be fine.
-Regards, Shamik Bose
@MikeTrizna I looked at the script that you had. The keywords, tags, etc. that I believe you have are copies of the EXIF data, but they include additional metadata such as "ispublic" from the Flickr API. I am using the same thing and am cleaning the data up a bit so that it is easier to work with. For instance, I've made the page information in the "keywords" into a separate column called "page". Hope that makes sense. I'll upload both the JSON and CSV following the suggestion of @shamikbose. And yes, if you send me a CSV or JSON, I can merge it. In terms of the copyright, I got it from https://about.biodiversitylibrary.org/help/copyright-and-reuse/#reuse. This may be incorrect.
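The kind of cleaning described above might look something like the sketch below. Note that the `bhl:page=<n>` tag pattern is a made-up example for illustration; the real BHL Flickr tags may encode page information in a different machine-tag format.

```python
import re

# Hypothetical sketch: split a "page" value out of a raw tag list into
# its own field. The pattern "bhl:page=<n>" is an illustrative guess at
# how page info might be encoded, not the confirmed BHL tag format.
PAGE_PATTERN = re.compile(r"^bhl:page=(\d+)$")

def split_page_from_tags(raw_tags):
    page = None
    tags = []
    for tag in raw_tags:
        match = PAGE_PATTERN.match(tag)
        if match:
            page = match.group(1)  # last matching tag wins
        else:
            tags.append(tag)
    return {"tags": tags, "page": page}
```

Run over each photo record, this would produce the cleaned "tags" list plus the separate "page" column mentioned above.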
In the actual dataset you can have nested data/sequences so it could look something like:
This also makes it a bit easier to deal with tags of different lengths, etc. This is, for example, how the annotations are stored in this dataset: https://huggingface.co/datasets/biglam/nls_chapbook_illustrations. Practically, it's possible to upload CSV/JSON files etc. that contain this metadata. The dataset script could then do a lookup to get the relevant tags. @MikeTrizna I'm happy to have a look at how the data is structured at the moment and see how it might fit in a dataset loading script, if useful.
@nabsiddiqui, thanks for pointing out that copyright page. That's the exact info I was looking for. I don't think I pulled down the image-level copyright status with my download, but I can go back and get that. If a large majority are public domain, we might be better off filtering to those so that we can label the whole dataset as such. |
@davanstrien yes. What you are describing is what I am working on doing. It requires extracting the tags from the JSON and cleaning them up. There are non-tags in the tags section, so there are some errors I have to go through, and it is about 300k files. @MikeTrizna I'm fine with doing that, but I don't know how we would filter it.
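One possible way to do the filtering, sketched below under the assumption that each record carries the numeric "license" code from the Flickr API (7 = "No known copyright restrictions", 9 = CC0, 10 = Public Domain Mark in Flickr's license list); the record layout itself is made up for illustration.

```python
# Hypothetical sketch: keep only records whose Flickr "license" code
# indicates a public-domain status. The code values follow Flickr's
# license list; the record structure is an illustrative assumption.
PUBLIC_DOMAIN_CODES = {"7", "9", "10"}

def filter_public_domain(records):
    return [r for r in records if str(r.get("license")) in PUBLIC_DOMAIN_CODES]
```

If the image-level license status gets pulled down as suggested above, a pass like this could confirm whether the large majority really are public domain before labeling the whole dataset as such.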
In checking out the example that @davanstrien linked, I noticed that the files are being pulled directly from NLS servers. I'm sure that the data custodians at BHL would be more amenable to directly serving these files from BHL servers (or the Smithsonian FigShare repo) to show a little bit more "ownership". I am in discussions to meet with the BHL data manager next week, and I can run it by them then. So in the meantime, @nabsiddiqui, I would hold off on the JSON file you are building.
Sounds good. I just have the script for the JSON file running in the background. I'll still upload it here just in case but we can delete it if it's not needed. I wanted to practice using the Flickr API anyway. |
A URL for this dataset
https://www.flickr.com/photos/biodivlibrary/
Dataset description
Dataset of images uploaded from the Biodiversity Heritage Library hosted at the Smithsonian Institution. Useful for GANs, understanding the history of biodiversity, understanding changing aesthetic values over time, etc.
Dataset modality
Image
Dataset licence
Creative Commons Attribution Non Commercial Share Alike 4.0 International
Other licence
No response
How can you access this data
As a download from a repository/website
Confirm the dataset has an open licence
Contact details for data custodian
No response