Add dataset: odeuropa_smell_objects #71

Open
1 task done
davanstrien opened this issue Jul 27, 2022 · 17 comments
Labels: dataset (Dataset to be added)

Comments

@davanstrien
Collaborator

A URL for this dataset

https://doi.org/10.5281/zenodo.6367776

Dataset description

From the Zenodo page:

This dataset is released as part of the Odeuropa project. The annotations are identical to the training set of the ICPR2022-ODOR Challenge.
It contains bounding box annotations for smell-active objects in historical artworks gathered from various digital collections.
The smell-active objects annotated in the dataset either carry smells themselves or hint at the presence of smells.
The dataset provides 15484 bounding boxes on 2116 artworks in 87 object categories.
An additional csv file contains further image-level metadata such as artist, collection, or year of creation.

Object detection datasets are time-consuming to collect, and there are relatively few object detection datasets that use LAM data. Those that do exist often rely on the output of one of the various YOLO models, which may be of some interest but often includes categories that are unlikely to be particularly useful for research or curation of LAM collections. This dataset, in contrast, includes categories related to smell: a topic of interest to both art historians and social historians. As a result, it offers a much richer exploration of the possibilities of using object detection with historical paintings.

Dataset modality

Image

Dataset licence

Creative Commons Attribution 4.0 International

Other licence

No response

How can you access this data

Other

Confirm the dataset has an open licence

  • To the best of my knowledge, this dataset is accessible via an open licence

Contact details for data custodian

No response

@davanstrien added the candidate-dataset label (Proposed dataset to be added) on Jul 27, 2022
@davanstrien
Collaborator Author

Happy to help anyone who wants to work on this. I have a WIP loading script for another COCO-formatted dataset: https://huggingface.co/datasets/biglam/nls_chapbook_illustrations
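
For anyone picking this up, here is a minimal sketch of how COCO-style annotations could be grouped per image before writing a loading script. It assumes the annotation file follows the standard COCO layout (`images`, `annotations`, `categories`, with boxes as `[x, y, width, height]`); the helper name `group_boxes_by_image` and the example values are made up for illustration and are not taken from the dataset.

```python
from collections import defaultdict


def group_boxes_by_image(coco: dict) -> dict:
    """Map each image file name to a list of (category name, bbox) pairs."""
    categories = {c["id"]: c["name"] for c in coco["categories"]}
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        # COCO stores each box as [x, y, width, height] in pixels.
        grouped[file_names[ann["image_id"]]].append(
            (categories[ann["category_id"]], ann["bbox"])
        )
    return dict(grouped)


# Tiny in-memory example with invented values. A real annotation file would
# first be parsed with json.load() into a dict of the same shape.
example = {
    "images": [{"id": 1, "file_name": "a.jpg"}],
    "categories": [{"id": 7, "name": "flower"}],
    "annotations": [
        {"image_id": 1, "category_id": 7, "bbox": [10, 20, 30, 40]},
    ],
}
print(group_boxes_by_image(example))
```

Grouping by file name rather than by numeric image id makes it easier to join the boxes with the image-level metadata CSV later, if that CSV is keyed by file name.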

@davanstrien added the dataset label (Dataset to be added) and removed the candidate-dataset label (Proposed dataset to be added) on Jul 27, 2022
@davanstrien
Collaborator Author

Also, I really want to call this dataset smelly_objects...

@shamikbose

shamikbose commented Jul 27, 2022

I'd love to work on this! Will be a good change from the text datasets so far.

@shamikbose

#self-assign

@davanstrien
Collaborator Author

Awesome, and don't worry if you can't finish this before you go away. It can wait until you're back too 🙂

@shamikbose

Hopefully, I should be able to get it done. From the Zenodo page:

Due to licensing issues, we cannot provide the images directly, but instead provide a collection of links and a download script.

Should the dataset just contain the links to the images then?

@davanstrien
Collaborator Author

> Hopefully, I should be able to get it done. From the Zenodo page:
>
> > Due to licensing issues, we cannot provide the images directly, but instead provide a collection of links and a download script.
>
> Should the dataset just contain the links to the images then?

Yes, I think that would be best for this one. We can provide example code for downloading the images in the dataset card.
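
A rough sketch of what such download code might look like, using only the standard library. This is not the download script shipped with the dataset; the function names `filename_from_url` and `download_images`, the 30-second timeout, and the choice to name files after the last URL path segment are all assumptions made for illustration.

```python
import urllib.request
from pathlib import Path, PurePosixPath
from urllib.parse import unquote, urlparse


def filename_from_url(url: str) -> str:
    """Derive a local file name from the last path segment of a URL."""
    return unquote(PurePosixPath(urlparse(url).path).name)


def download_images(urls, out_dir, timeout=30):
    """Try each URL once and return the list of URLs that failed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                (out_dir / filename_from_url(url)).write_bytes(resp.read())
        except OSError:
            # URLError and socket timeouts are both OSError subclasses.
            failed.append(url)
    return failed


# Hypothetical usage, with `urls` taken from the "Image URL" metadata column:
# failed = download_images(urls, "images")
```

Returning the failed URLs instead of raising lets a user rerun the download on just the images that did not come through.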

@shamikbose

@davanstrien This dataset has a lot of associated metadata

       ['File Name', 'Artist', 'Title', 'Query', 'Part', 'Earliest Date',
       'Latest Date', 'Margin Years', 'Genre', 'Material', 'Medium',
       'Height of Image Field', 'Width of Image Field', 'Type of Object',
       'Height of Object', 'Width of Object', 'Diameter of Object',
       'Position of Depiction on Object', 'Current Location',
       'Repository Number', 'Original Location', 'Original Place',
       'Original Position', 'Context', 'Place of Discovery',
       'Place of Manufacture', 'Associated Scenes', 'Object Categories',
       'Related Works of Art', 'Type of Similarity', 'Inscription',
       'Text Source', 'Bibliography', 'Photo Archive', 'Image URL',
       'Details URL', 'Additional Information']

Should they all be included in the dataset? Most of them are missing, from a cursory glance at the data. Current Location, Earliest Date, Latest Date, Genre, Material and Medium are populated for most of the images. I was thinking some of the fields like Material and Medium could be used for classification, maybe

@davanstrien
Collaborator Author

> @davanstrien This dataset has a lot of associated metadata
>
>        ['File Name', 'Artist', 'Title', 'Query', 'Part', 'Earliest Date',
>        'Latest Date', 'Margin Years', 'Genre', 'Material', 'Medium',
>        'Height of Image Field', 'Width of Image Field', 'Type of Object',
>        'Height of Object', 'Width of Object', 'Diameter of Object',
>        'Position of Depiction on Object', 'Current Location',
>        'Repository Number', 'Original Location', 'Original Place',
>        'Original Position', 'Context', 'Place of Discovery',
>        'Place of Manufacture', 'Associated Scenes', 'Object Categories',
>        'Related Works of Art', 'Type of Similarity', 'Inscription',
>        'Text Source', 'Bibliography', 'Photo Archive', 'Image URL',
>        'Details URL', 'Additional Information']
>
> Should they all be included in the dataset? Most of them are missing, from a cursory glance at the data. Current Location, Earliest Date, Latest Date, Genre, Material and Medium are populated for most of the images. I was thinking some of the fields like Material and Medium could be used for classification, maybe

My own feeling would be to include as much as possible. If fields are often missing, one option would be to put some of this metadata in an additional metadata column as a dictionary. That way it doesn't get lost, but it is also less distracting than having a lot of columns with mostly missing data.

@shamikbose

Yeah, I was building out the features as follows:

import datasets

# Core fields are top-level columns; other image-level metadata is grouped under "metadata".
features = datasets.Features(
    {
        "id": datasets.Value("string"),
        "url": datasets.Value("string"),
        "annotations": datasets.Value("string"),
        "date": datasets.Value("string"),
        "genre": datasets.Value("string"),
        "material": datasets.Value("string"),
        "metadata": {
            "artist": datasets.Value("string"),
            "query": datasets.Value("string"),
            "title": datasets.Value("string"),
            "height": datasets.Value("string"),
            "width": datasets.Value("string"),
        },
    }
)
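
As an illustration, a row from the metadata CSV could be mapped onto this layout roughly as follows. The column-to-field mapping is a guess based on the column names listed earlier in the thread (for example, taking `height` and `width` from the "Image Field" columns and joining the two date columns with a slash), and `row_to_example` is a hypothetical helper, so treat it as a sketch rather than the final loading code.

```python
import json


def row_to_example(row: dict, boxes: list) -> dict:
    """Build one record matching the feature layout sketched above."""

    def get(key: str) -> str:
        # Missing cells become empty strings so every field stays a string.
        value = row.get(key)
        return "" if value is None else str(value)

    # Assumption: "date" combines whichever of the two date columns are present.
    dates = [d for d in (get("Earliest Date"), get("Latest Date")) if d]
    return {
        "id": get("File Name"),
        "url": get("Image URL"),
        # Boxes are serialised to JSON because "annotations" is a string field.
        "annotations": json.dumps(boxes),
        "date": "/".join(dates),
        "genre": get("Genre"),
        "material": get("Material"),
        "metadata": {
            "artist": get("Artist"),
            "query": get("Query"),
            "title": get("Title"),
            "height": get("Height of Image Field"),
            "width": get("Width of Image Field"),
        },
    }


# Invented example row; real rows would come from csv.DictReader.
row = {"File Name": "a.jpg", "Image URL": "http://example.org/a.jpg",
       "Earliest Date": "1600", "Latest Date": "1650", "Genre": "still life"}
print(row_to_example(row, [[10, 20, 30, 40]]))
```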

I'll probably get back to this in about two weeks, after I come back from vacation

@davanstrien
Collaborator Author

> I'll probably get back to this in about two weeks, after I come back from vacation

Have a great vacation!

@shamikbose

shamikbose commented Sep 10, 2022

@davanstrien I'm back to working on this dataset, but it seems like the URLs aren't accessible. Even the download script provided in the dataset gives the following error:
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Example from the first image in the metadata document:
URL: http://www.sigecweb.beniculturali.it/images/fullsize/ICCD50007114/ICCD4644613_SBAS%20RM%20223305.jpg
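
If the host is only intermittently unreachable, one possible workaround is to retry with exponential backoff instead of failing on the first timeout. The sketch below is generic: `with_retries` and its parameters are hypothetical, and `fetch` stands for whatever function actually downloads a URL. Whether retrying helps in this particular case is an open question.

```python
import time


def with_retries(fetch, url, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch(url), retrying on OSError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch(url)
        except OSError:
            # TimeoutError and connection errors are OSError subclasses.
            if attempt == attempts - 1:
                raise  # Out of attempts: surface the last error.
            sleep(base_delay * 2 ** attempt)  # Waits 1s, 2s, 4s, ...


# Hypothetical usage with the standard library:
# import urllib.request
# data = with_retries(lambda u: urllib.request.urlopen(u, timeout=30).read(), url)
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting.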

@davanstrien
Collaborator Author

@shamikbose hey, hope you had a good break!

I'll try and take a look at this too but also tagging @kiymetakdemir who works on this project and might be able to help with this.

@shamikbose

@davanstrien I did! It was a much-needed break.
Thanks for adding @kiymetakdemir. Hoping this data can still be accessed

@kiymetakdemir

Hi @shamikbose, can you check it again? I just tried downloading the images with the given script and didn't encounter any errors; they downloaded successfully.

@shamikbose

@kiymetakdemir I was able to download them today. Thanks!
