[WIP] Switch to using pickle instead of npz to store intermediate results #54

betatim · 2018-03-22T16:14:15Z

Closes #51

Work-in-progress code to try out storing intermediate results as pickles instead of NPZ files.

I'm not quite sure how to benchmark this nicely. As an example, this speeds up the time between running label-maker images and it printing "Downloading 10874 tiles to ...". With this branch there is nearly no delay between starting label-maker and seeing that printout. With the npz setup it takes "a while" with the zurich.json below (a while == minutes or longer; I can measure it later).

(This branch needs cleaning up a bit before merging, but wanted to show the basic idea.)

{
    "country": "switzerland",
    "bounding_box": [8.488103,47.359111,8.582088,47.407637],
    "zoom": 19,
    "classes": [
      { "name": "Pools", "filter": ["==", "leisure", "swimming_pool"] },
      { "name": "Bridge", "filter": ["has", "bridge"], "buffer": 5 },
      { "name": "Roads", "filter": ["all",
        ["has", "highway"],
        ["in", "highway", "motorway", "primary", "secondary", "residential"]
      ], "buffer": 3
      },
      { "name": "Buildings", "filter": ["has", "building"], "buffer": 3 },
      { "name": "Water", "filter": ["==", "natural", "water"] },
      { "name": "Forest", "filter": ["==", "landuse", "forest"] }
    ],
    "imagery": "http://a.tiles.mapbox.com/v4/mapbox.satellite/{z}/{x}/{y}.jpg?access_token=your_token_here",
    "background_ratio": 1,
    "ml_type": "classification"
}

drewbo · 2018-04-17T20:13:05Z

@betatim update here: I timed the two on a separate dataset (50k tiles) and pickle was considerably faster to load (~4 seconds vs. 90 seconds). I don't totally understand why, but this line is the culprit; my guess is that iterating over the file list of an npz object is much less efficient than .items() on a dict.

Do you want to remove the older commented code and then I'll merge?
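For anyone wanting to reproduce a comparison like this, here is a minimal sketch of how the two load paths could be timed. It is not the label-maker code: the tile names, the array shape, and the helper names (make_labels, load_npz, load_pickle, time_call) are all made up for illustration, and the numbers will depend heavily on tile count and disk.

```python
import os
import pickle
import tempfile
import time

import numpy as np


def make_labels(n_tiles, n_classes=7, seed=0):
    """Build a dict of tile name -> small label array (synthetic stand-in data)."""
    rng = np.random.default_rng(seed)
    return {
        "{}-{}-19".format(i, i + 1): rng.integers(0, 2, size=n_classes)
        for i in range(n_tiles)
    }


def load_npz(path):
    """Read every array back by iterating over the archive's keys."""
    with np.load(path) as archive:
        return {key: archive[key] for key in archive.files}


def load_pickle(path):
    """Read the whole dict back with a single pickle.load call."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def time_call(func, *args):
    """Return (result, elapsed seconds) for one call of func."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


labels = make_labels(2000)
with tempfile.TemporaryDirectory() as tmp:
    npz_path = os.path.join(tmp, "labels.npz")
    pkl_path = os.path.join(tmp, "labels.pkl")

    # Write the same dict in both formats.
    np.savez(npz_path, **labels)
    with open(pkl_path, "wb") as handle:
        pickle.dump(labels, handle)

    from_npz, npz_seconds = time_call(load_npz, npz_path)
    from_pkl, pkl_seconds = time_call(load_pickle, pkl_path)

print("npz: {:.3f}s  pickle: {:.3f}s".format(npz_seconds, pkl_seconds))
```

A single run like this is noisy; repeating each load a few times and taking the minimum would give a steadier figure.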

betatim · 2018-04-18T15:12:40Z

I think the problem is in how the npz is read. If you dig a bit into the NumPy docs, they suggest that (maybe) it is implemented as one file per key. So I would not be surprised if, on the inside, there is one open() call for each key. This would be much slower than open('foo.pkl').read().
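A quick way to check the "one file per key" part is to open an npz with the standard zipfile module, since the np.savez docs describe the output as a zip archive holding one .npy file per array. The sketch below uses two throwaway arrays named a and b (made up for illustration). It only shows the on-disk layout; it does not by itself prove that the per-key reads are what makes loading slow.

```python
import io
import zipfile

import numpy as np

# Write a tiny npz into an in-memory buffer instead of a real file.
buffer = io.BytesIO()
np.savez(buffer, a=np.arange(3), b=np.zeros(2))
buffer.seek(0)

# An npz is a zip archive, so zipfile can list its members directly.
with zipfile.ZipFile(buffer) as archive:
    member_names = sorted(archive.namelist())

print(member_names)  # ['a.npy', 'b.npy']
```

With tens of thousands of tiles stored under their own keys, the archive would hold that many members, each of which has to be located and decoded separately when it is accessed.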

I'll remove the commented code and take a look at the failing tests.

Switch to using pickle instead of npz to store intermediate results

061d80e
