Major TOM: Expandable Datasets for Earth Observation #176
---
That's an amazing summary!
To clarify, we directly save the bytestream from the original band. I'm not sure what is meant by ranged reads, so happy to clarify that too.
If you have any questions, we are keen to discuss!
---
Hi @brunosan ! Great to see you're taking so much interest in Major TOM! Here's a few details that may elucidate things a little further.
Let me know if you have any other questions or anything! Looking forward to seeing how you guys can use the dataset, and possibly add to the effort in the future 😄 edit: removed some duplication of info from @mikonvergence's comment above
---
Interview on the effort: https://www.youtube.com/watch?v=KonWxQ1mCpA
This paper comes from @aliFrancis and @mikonvergence.
The main takeaway is that "Major TOM" is mostly a simple grid spec, plus a large dataset built on top of it.
To my knowledge this is the largest EO dataset by Earth coverage. I also appreciate the simplicity of the grid spec. As they also highlight in the intro, it makes me wonder about streaming datasets instead of pre-baked ones.
My main takeaway:
In the end this dataset can be recreated fully from the metadata, without the actual pixel values... so it begs the question of whether we could stream pixels directly into training. Since streaming is MUCH slower (10x?) than disk reading, we might need `p` parallel workers constantly streaming the next `n` batches before they are read, deleting the data after use, and then repeating for each epoch.

Especially when the Earth coverage is very large (PBs), as we need for Clay, the cost of buffering and re-downloading for each epoch seems low compared to the highly redundant storage of pixel values, where only a small number of values is needed at a time. Storing only the metadata also makes the transparency and provenance chain much easier to create and maintain (provided the source stays up). I believe this is the goal of what @yellowcap is testing in #165.
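A minimal sketch of that buffering loop, assuming a hypothetical `fetch_batch` that range-reads the pixels for one batch of grid cells (all names and numbers here are illustrative, not from Major TOM or Clay):

```python
# Minimal sketch of streaming batches during training instead of pre-baking
# them to disk. fetch_batch() is hypothetical: it would range-read the pixels
# for one batch of grid cells from the source archive.
from concurrent.futures import ThreadPoolExecutor

P_WORKERS = 8  # "p" parallel download workers
N_AHEAD = 4    # "n" batches kept in flight ahead of the trainer

def fetch_batch(cell_ids):
    """Hypothetical: download the pixels for a batch of grid cells."""
    raise NotImplementedError

def stream_batches(batched_cell_ids):
    """Yield batches in order while P_WORKERS prefetch N_AHEAD of them."""
    with ThreadPoolExecutor(max_workers=P_WORKERS) as pool:
        futures, it = [], iter(batched_cell_ids)
        for cells in it:                     # prime the buffer
            futures.append(pool.submit(fetch_batch, cells))
            if len(futures) >= N_AHEAD:
                break
        for cells in it:
            batch = futures.pop(0).result()  # wait for the oldest download
            futures.append(pool.submit(fetch_batch, cells))  # keep buffer full
            yield batch                      # consumed, then garbage-collected
        for f in futures:                    # drain the tail
            yield f.result()

# Re-running the generator each epoch re-downloads instead of storing:
# for epoch in range(num_epochs):
#     for batch in stream_batches(all_batches):
#         train_step(batch)
```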
On the spec:
The defining property of this grid system is to be as simple as possible while avoiding projection effects. The rest of the spec is, or can be, standard STAC.
The grid needs a "size", which in their case is `10km`: every cell in the grid is 10 km on a side on the ground. The origin is (0,0) long/lat, and the cells are named Up, Down (for North and South) and Right, Left (for East and West). E.g. `395U, 218R` is `395` rows of 10 km up from the equator and `218` columns to the right (in Crete). This code defines the creation of the grid.
On the dataset:
This seems to be full coverage of Sentinel (except very cloudy places, e.g. Greenland), where each grid cell holds the least cloudy patch in the Sentinel archive. Earth-wise, coverage is reported to be 50%.
The dataset is saved as parquet files, with `4,491,772` rows and a total size of `23TB`. Each band is saved as a rasterio `MemoryFile`, which seems to allow seeks (range reads). I could not find the actual code used to create it, but I assume it favors the smallest `nodata`, newer over older, nadir if possible, and low cloud cover (as provided in the metadata, and then calculated with another classifier).