Working with large and wide-extent datasets is very limited #1677
Replies: 1 comment 11 replies
-
Do you mean that you have your large region divided into multiple GeoTIFFs? If so, then yes, passing them all to a single raster source will result in a VRT covering the whole region. If you can share some example code for how you are currently doing it, I might be able to help with this.
-
Hello,
I have been working intensively with rastervision for about three months now, and I greatly prefer its clear geodata workflows over the approach I used before. Previously, I used the gdal2tiles.py functionality to split datasets into small chips of the larger geodata, ran the training/classification task, and used a custom script to recombine the tiles. This had many downsides compared to the way rastervision approaches geodatasets. However, there is one area where your framework seems somewhat limited: large and/or wide-extent datasets. As already described in #1657, when working with a large extent at high resolution, RAM quickly becomes a problem. As @AdeelH pointed out, there are some workarounds, but an out-of-the-box approach would (at least in my case studies) eliminate a lot of pre- and postprocessing steps.
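For context, the old chip-and-recombine workflow amounted to something like the following toy sketch (plain numpy rather than the actual gdal2tiles.py pipeline; the function names here are made up for illustration):

```python
import numpy as np

def chip(arr, size):
    """Split a 2-D array into size x size tiles, keyed by their top-left
    corner (assumes dimensions are divisible by size)."""
    h, w = arr.shape
    return {(y, x): arr[y:y + size, x:x + size]
            for y in range(0, h, size)
            for x in range(0, w, size)}

def recombine(tiles, shape, dtype):
    """Reassemble the tiles into a full-size array."""
    out = np.empty(shape, dtype=dtype)
    for (y, x), tile in tiles.items():
        out[y:y + tile.shape[0], x:x + tile.shape[1]] = tile
    return out

raster = np.arange(16).reshape(4, 4)   # stand-in for a large GeoTIFF
tiles = chip(raster, 2)                # 4 tiles of 2 x 2
restored = recombine(tiles, raster.shape, raster.dtype)
```

The real workflow of course also had to carry the georeferencing through the split and merge, which is exactly the bookkeeping rastervision's scene abstraction removes.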
Furthermore, I stumbled across a new problem today: given multiple reasonably small geodatasets (at most 2000 m x 1000 m) spread over a large region (in my case ~300 km x 150 km), rastervision will automatically create a VRT of the whole region. Consequently, during a training/prediction task, the entire RAM is consumed, which (at least in my case) leads to the process being killed by the OS; hence, I have no stack trace to share. Going through the documentation, I figured this problem might be solved using AOIs. However, the problem persisted, and when looking into the source code (`pytorch_learner/dataset/dataset.py`, `SlidingWindowGeoDataset.init_windows`, line 220), I saw that the windows are precomputed before being filtered with the AOI polygons and therefore still use up all available RAM.
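One possible direction, sketched here with hypothetical names and plain Python rather than rastervision's actual classes: yield windows lazily from a generator and filter them against the AOI as they are produced, so peak memory is proportional to the number of windows kept, not to the extent of the VRT:

```python
def sliding_windows(extent_w, extent_h, size, stride):
    """Lazily yield (x, y, w, h) windows over an extent instead of
    materializing the full list up front."""
    for y in range(0, extent_h - size + 1, stride):
        for x in range(0, extent_w - size + 1, stride):
            yield (x, y, size, size)

def intersects(win, aoi):
    """Axis-aligned rectangle intersection test."""
    x, y, w, h = win
    ax, ay, aw, ah = aoi
    return x < ax + aw and ax < x + w and y < ay + ah and ay < y + h

# Hypothetical numbers: a 300 km x 150 km region at 1 m resolution,
# with one small dataset (2 km x 1 km) as the AOI.
region = (300_000, 150_000)
aoi = (10_000, 10_000, 2_000, 1_000)

# Only the windows touching the AOI are ever stored.
kept = [w for w in sliding_windows(*region, size=256, stride=256)
        if intersects(w, aoi)]
```

In a real implementation the generator would presumably iterate only over the bounding boxes of the AOI polygons rather than the whole region, which would also avoid the time cost of scanning empty space.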
Especially when working with open data (e.g. OSM), the available data will, in most cases, be spread over a huge area, so I think a way around this problem could be highly beneficial for many training scenarios. That said, I was not sure whether others are affected by this limitation too, and I am not entirely sure whether it is doable without a monstrous workload and changes to the whole architecture. Consequently, I wanted to discuss it before creating a feature request.
If you are interested, I can lay out my ideas on how to approach this, but I would also like to hear your opinion on the described limitation. In any case, I would be very willing to contribute here.