Major TOM: Expandable Datasets for Earth Observation #176
---
That's an amazing summary!
To clarify, we directly save the bytestream from the original band. I'm not sure what is meant by ranged reads, so happy to clarify that too.
If you have any questions, we are keen to discuss!
---
Hi @brunosan ! Great to see you're taking so much interest in Major TOM! Here's a few details that may elucidate things a little further.
Let me know if you have any other questions or anything! Looking forward to seeing how you guys can use the dataset, and possibly add to the effort in the future 😄 edit: removed some duplication of info from @mikonvergence's comment above
---
Interview on the effort: https://www.youtube.com/watch?v=KonWxQ1mCpA
This paper comes from @aliFrancis and @mikonvergence.
The main takeaway is that "Major TOM" is mostly a simple grid spec, plus a large dataset built on top of it.
To my knowledge this is the largest EO dataset by Earth coverage. I also appreciate the simplicity of the grid spec. As they also highlight in the intro, it makes me wonder about streaming datasets instead of pre-baked ones.
My main takeaway:
In the end this dataset can be recreated fully from the metadata, without the actual pixel values... so it begs the question of whether we could stream pixels directly into training. Since streaming is MUCH slower (10x?) than disk reading, we might need `p` parallel workers constantly streaming the next `n` batches before they are read, deleting the data after use, and then repeating for each epoch.

Especially when the Earth coverage is very large (PBs), as we need for Clay, the cost of buffering and re-downloading for each epoch seems low compared to the highly redundant storage of pixel values, where only a small number of values is needed at a time. Storing only the metadata also makes the transparency and provenance chain much easier to create and maintain (provided the source stays up). I believe this is the goal of what @yellowcap is testing in #165.
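A minimal sketch of that buffering loop, assuming a hypothetical `fetch_batch` that range-reads the pixels for one batch of grid cells (all names and numbers here are illustrative, not from Major TOM or Clay):

```python
# Minimal sketch of streaming batches during training instead of pre-baking
# them to disk. fetch_batch() is hypothetical: it would range-read the pixels
# for one batch of grid cells from the source archive.
from concurrent.futures import ThreadPoolExecutor

P_WORKERS = 8  # "p" parallel download workers
N_AHEAD = 4    # "n" batches kept in flight ahead of the trainer

def fetch_batch(cell_ids):
    """Hypothetical: download the pixels for a batch of grid cells."""
    raise NotImplementedError

def stream_batches(batched_cell_ids):
    """Yield batches in order while P_WORKERS prefetch N_AHEAD of them."""
    with ThreadPoolExecutor(max_workers=P_WORKERS) as pool:
        futures, it = [], iter(batched_cell_ids)
        for cells in it:                     # prime the buffer
            futures.append(pool.submit(fetch_batch, cells))
            if len(futures) >= N_AHEAD:
                break
        for cells in it:
            batch = futures.pop(0).result()  # wait for the oldest download
            futures.append(pool.submit(fetch_batch, cells))  # keep buffer full
            yield batch                      # consumed, then garbage-collected
        for f in futures:                    # drain the tail
            yield f.result()

# Re-running the generator each epoch re-downloads instead of storing:
# for epoch in range(num_epochs):
#     for batch in stream_batches(all_batches):
#         train_step(batch)
```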
On the spec:
The defining property of this grid system is to be as simple as possible while avoiding projection effects. The rest of the spec is, or can be, standard STAC.
The grid needs a "size", which in their case is `10km`: every cell in the grid is 10 km on a side on the ground. The origin is (0,0) long/lat, and the cells are named Up, Down (for North and South) and Right, Left (for East and West). E.g. `395U, 218R` is `395` rows of 10 km up from the equator and `218` columns to the right (in Crete). This code defines the creation of the grid.
On the dataset:
This seems to be full coverage of Sentinel (except very cloudy places, e.g. Greenland), where each grid cell holds the least cloudy patch in the Sentinel archive. Earth-wise, coverage is reported to be 50%.
The dataset is saved as parquet files, with `4,491,772` rows and a total size of `23TB`. Each band is saved as a rasterio `MemoryFile`, which seems to allow seeks (range reads). I could not find the actual code used to create it, but I assume it favors the smallest `nodata`, newer over older, nadir if possible, and low cloud cover (as provided in the metadata, and then calculated with another classifier).