Working with large and wide-extent datasets is very limited #1677
Replies: 1 comment 11 replies
-
Do you mean that you have your large region divided into multiple GeoTIFFs? If so, then yes, passing them all to a single raster source will result in a VRT covering the whole region. If you can share some example code for how you are currently doing it, I might be able to help with this.
-
Hello,
I have been working intensively with rastervision for about three months now, and I greatly prefer its clear geodata workflows over the approach I used before. Previously, I used the gdal2tiles.py functionality to split datasets into small chips of the larger geodata, ran the training/classification task, and used a custom script to recombine the tiles. This had many downsides compared to the way rastervision approaches geodatasets. However, there is one area where your framework seems somewhat limited: large and/or wide-extent datasets. As already described in #1657, when working with a large extent at high resolution, RAM quickly becomes a problem. As @AdeelH pointed out, there are some workarounds, but an out-of-the-box approach would (at least in my case studies) eliminate a lot of pre- and postprocessing steps.
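For context, the old chip-and-recombine workflow amounted to something like the following toy sketch (plain numpy rather than the actual gdal2tiles.py pipeline; the function names here are made up for illustration):

```python
import numpy as np

def chip(arr, size):
    """Split a 2-D array into size x size tiles, keyed by their top-left
    corner (assumes dimensions are divisible by size)."""
    h, w = arr.shape
    return {(y, x): arr[y:y + size, x:x + size]
            for y in range(0, h, size)
            for x in range(0, w, size)}

def recombine(tiles, shape, dtype):
    """Reassemble the tiles into a full-size array."""
    out = np.empty(shape, dtype=dtype)
    for (y, x), tile in tiles.items():
        out[y:y + tile.shape[0], x:x + tile.shape[1]] = tile
    return out

raster = np.arange(16).reshape(4, 4)   # stand-in for a large GeoTIFF
tiles = chip(raster, 2)                # 4 tiles of 2 x 2
restored = recombine(tiles, raster.shape, raster.dtype)
```

The real workflow of course also had to carry the georeferencing through the split and merge, which is exactly the bookkeeping rastervision's scene abstraction removes.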
Furthermore, I stumbled across a new problem today: given multiple reasonably small geodatasets (at most 2000 m x 1000 m) spread over a large region (in my case ~300 km x 150 km), rastervision will automatically create a VRT of the whole region. Consequently, during a training/prediction task, the entire RAM is consumed, which (at least in my case) leads to the process being killed by the OS; hence, I have no stack trace to share. Going through the documentation, I figured this problem might be solved using AOIs. However, the problem persisted, and when looking into the source code (`pytorch_learner/dataset/dataset.py`, `SlidingWindowGeoDataset.init_windows`, line 220), I saw that the windows are precomputed before being filtered with the AOI polygons and therefore still use up all available RAM.
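One possible direction, sketched here with hypothetical names and plain Python rather than rastervision's actual classes: yield windows lazily from a generator and filter them against the AOI as they are produced, so peak memory is proportional to the number of windows kept, not to the extent of the VRT:

```python
def sliding_windows(extent_w, extent_h, size, stride):
    """Lazily yield (x, y, w, h) windows over an extent instead of
    materializing the full list up front."""
    for y in range(0, extent_h - size + 1, stride):
        for x in range(0, extent_w - size + 1, stride):
            yield (x, y, size, size)

def intersects(win, aoi):
    """Axis-aligned rectangle intersection test."""
    x, y, w, h = win
    ax, ay, aw, ah = aoi
    return x < ax + aw and ax < x + w and y < ay + ah and ay < y + h

# Hypothetical numbers: a 300 km x 150 km region at 1 m resolution,
# with one small dataset (2 km x 1 km) as the AOI.
region = (300_000, 150_000)
aoi = (10_000, 10_000, 2_000, 1_000)

# Only the windows touching the AOI are ever stored.
kept = [w for w in sliding_windows(*region, size=256, stride=256)
        if intersects(w, aoi)]
```

In a real implementation the generator would presumably iterate only over the bounding boxes of the AOI polygons rather than the whole region, which would also avoid the time cost of scanning empty space.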
Especially when working with open data (e.g. OSM), the available data will, in most cases, be spread over a huge area, so I think a way around this problem could be highly beneficial for many training scenarios. That said, I was not sure whether others are affected by this limitation too, and I am not entirely sure whether it is doable without a monstrous workload and changes to the whole architecture. Consequently, I wanted to discuss it before creating a feature request.
If you are interested, I can lay out my ideas on how to approach this, but I would also like to hear your opinion on the described limitation. In any case, I would be very willing to contribute here.