[WIP] Switch to using pickle instead of npz to store intermediate results #54

Open
wants to merge 1 commit into
base: master

betatim
Contributor

@betatim betatim commented Mar 22, 2018

Closes #51

Work-in-progress code to try out storing intermediate results as pickles instead of NPZ files.

I'm not quite sure how to benchmark this nicely. One thing it speeds up is the time between running label-maker images and it printing "Downloading 10874 tiles to ...". With this branch there is nearly no delay between starting label-maker and seeing that printout. With the npz setup it takes "a while" with the zurich.json below (a while == minutes or longer; I can measure it later).

(This branch needs cleaning up a bit before merging, but wanted to show the basic idea.)


{
    "country": "switzerland",
    "bounding_box": [8.488103,47.359111,8.582088,47.407637],
    "zoom": 19,
    "classes": [
      { "name": "Pools", "filter": ["==", "leisure", "swimming_pool"] },
      { "name": "Bridge", "filter": ["has", "bridge"], "buffer": 5 },
      { "name": "Roads", "filter": ["all",
        ["has", "highway"],
        ["in", "highway", "motorway", "primary", "secondary", "residential"]
      ], "buffer": 3
      },
      { "name": "Buildings", "filter": ["has", "building"], "buffer": 3 },
      { "name": "Water", "filter": ["==", "natural", "water"] },
      { "name": "Forest", "filter": ["==", "landuse", "forest"] }
    ],
    "imagery": "http://a.tiles.mapbox.com/v4/mapbox.satellite/{z}/{x}/{y}.jpg?access_token=your_token_here",
    "background_ratio": 1,
    "ml_type": "classification"
}

@drewbo
Contributor

drewbo commented Apr 17, 2018

@betatim an update here: I timed the two on a separate dataset (50k tiles), and pickle was considerably faster to load (~4 seconds vs. ~90 seconds). I don't totally understand why, but this line is the culprit; my guess is that iterating over the file list of an npz object is much less efficient than calling .items() on a dict.
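For anyone who wants to reproduce a comparison like this, here is a minimal sketch of how the two load paths could be timed. It is not the label-maker code: the tile keys, the label vectors, and the helper names (make_labels, load_npz, load_pickle, compare) are all made up for illustration, and the absolute numbers will depend on the machine and the dataset size.

```python
import os
import pickle
import tempfile
import time

import numpy as np


def make_labels(n_tiles, n_classes=7):
    """Build a synthetic {tile_key: label_vector} dict (made-up data)."""
    rng = np.random.default_rng(0)
    return {f"{i}-{i}-19": rng.integers(0, 2, size=n_classes) for i in range(n_tiles)}


def load_npz(path):
    # Each key lookup reads and decodes one .npy member of the archive.
    with np.load(path) as npz:
        return {key: npz[key] for key in npz.files}


def load_pickle(path):
    # The whole dict is deserialized from a single file in one call.
    with open(path, "rb") as f:
        return pickle.load(f)


def timed(loader, path):
    """Return the loaded dict and the wall-clock seconds the load took."""
    start = time.perf_counter()
    loaded = loader(path)
    return loaded, time.perf_counter() - start


def compare(n_tiles):
    """Write the same dict as .npz and .pkl, then time loading each."""
    labels = make_labels(n_tiles)
    with tempfile.TemporaryDirectory() as tmp:
        npz_path = os.path.join(tmp, "labels.npz")
        pkl_path = os.path.join(tmp, "labels.pkl")
        np.savez(npz_path, **labels)
        with open(pkl_path, "wb") as f:
            pickle.dump(labels, f)
        from_npz, npz_seconds = timed(load_npz, npz_path)
        from_pkl, pkl_seconds = timed(load_pickle, pkl_path)
    return from_npz, from_pkl, npz_seconds, pkl_seconds


# Small run so the sketch finishes quickly; raise n_tiles to approach the 50k case.
from_npz, from_pkl, npz_seconds, pkl_seconds = compare(n_tiles=2000)
print(f"npz: {npz_seconds:.3f}s  pickle: {pkl_seconds:.3f}s")
```

Both loaders return the same dict, so the only difference being measured is the on-disk format and how it is read back.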

Do you want to remove the older commented code and then I'll merge?

@betatim
Contributor Author

betatim commented Apr 18, 2018

I think the problem is in how the npz is read. If you dig a bit into the numpy docs, they suggest that (maybe) it is implemented as one file per key. So I would not be surprised if, on the inside, there is one open() call for each key. That would be much slower than open('foo.pkl').read().
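The one-file-per-key layout can be checked directly, since an .npz written by np.savez is a zip archive. Below is a small sketch that saves two arrays and lists the archive members with the standard zipfile module. The key names tile_a and tile_b are placeholders, not real label-maker keys, and this only shows the file layout; it does not by itself prove where the load time goes.

```python
import io
import zipfile

import numpy as np

# Write a tiny archive into memory; the keys are hypothetical stand-ins.
buf = io.BytesIO()
np.savez(buf, tile_a=np.array([1, 0, 1]), tile_b=np.array([0, 1, 0]))
buf.seek(0)

# An .npz is a zip file: list its members to see how the keys are stored.
with zipfile.ZipFile(buf) as zf:
    members = zf.namelist()

print(members)  # one .npy member per key
```

So reading every key means opening and decoding one archive member per tile, whereas the pickle is read in a single pass.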

I'll remove the commented code and take a look at the failing tests.
