Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Corrupt WOF records are not skipped during import #405

Open
nilsnolde opened this issue Sep 29, 2018 · 9 comments
Open

Corrupt WOF records are not skipped during import #405

nilsnolde opened this issue Sep 29, 2018 · 9 comments

Comments

@nilsnolde
Copy link

nilsnolde commented Sep 29, 2018

Hi,
there were quite a few corrupt WOF postal code records during download, e.g.

error downloading whosonfirst-data-postalcode-ca-latest.tar.bz2 bundle: Error: Command failed: curl https://dist.whosonfirst.org/bundles/whosonfirst-data-postalcode-ca-latest.tar.bz2 | tar -xj --strip-components=1 --exclude=README.txt -C /srv/pelias_importer_ext/data/wof && mv /srv/pelias_importer_ext/data/wof/whosonfirst-data-postalcode-ca-latest.csv /srv/pelias_importer_ext/data/wof/meta

That was a few days ago. Weirdly,
curl https://dist.whosonfirst.org/bundles/whosonfirst-data-postalcode-ca-latest.tar.bz2 | tar -xj --strip-components=1 --exclude=README.txt
works now, if done manually. Even though the timestamp of that record didn't change since July..

There's a bunch of bzip2 and tar errors in the logs for the above command. Same for Japan, Portugal and GB postal codes. When downloading is finished and he's trying to inject them to ES, the following fatal error occurs:

2018-09-27T19:29:31.322Z - ESC[32minfoESC[39m: [whosonfirst] Loading whosonfirst-data-postalcode-ca-latest.csv records from /srv/pelias_importer_ext/data/wof/meta
events.js:183
      throw er; // Unhandled 'error' event
      ^

Error: ENOENT: no such file or directory, open '/srv/pelias_importer_ext/data/wof/meta/whosonfirst-data-postalcode-ca-latest.csv'
npm ERR! code ELIFECYCLE
npm ERR! errno 1
npm ERR! [email protected] start: `node import.js`
npm ERR! Exit status 1
npm ERR! 
npm ERR! Failed at the [email protected] start script.
npm ERR! This is probably not a problem with npm. There is likely additional logging output above.

npm ERR! A complete log of this run can be found in:
npm ERR!     /home/ubuntu/.npm/_logs/2018-09-27T19_29_31_479Z-debug.log

Then the WOF importer ends on error instead of skipping the Canada postal codes.
Any chance of improving that behavior?

@nilsnolde
Copy link
Author

And if this would be a good first issue, let me know. I'm serious about contributing. At some point I wanna have another Pelias importer for custom user data anyways. Getting to know the stack with smaller issues would be helpful. But then a quick pointer in the right direction would be appreciated:)

@missinglink
Copy link
Member

Hi @nilsnolde. What do you think the issue is here?

Do you think maybe it was just that the server was having a bad day and now it works, or is there a difference between executing the commands manually on the CLI vs. using the Pelias scripts?

@nilsnolde
Copy link
Author

nilsnolde commented Oct 4, 2018

Well, it has to be a network issue. I'm not saying there's a flaw in the code, obviously I executed the exact same command manually without problems.
I does happen consistently though: canada won't get processed when using the importer, it's always the above error over the last 2-3 weeks. The real problem is likely curl and its GnuTLS. So, yes, I have to dig in there.

My question here is: the importer could also skip files, in case it doesn't find them right. Instead of failing entirely.

@orangejulius
Copy link
Member

We had a bug a while back where one layer's data was being downloaded twice, and the two downloads interacted badly together, so something like that could happen still.

However, in general I think all our importers need to support a missingFilesAreFatal flag everywhere, like this importer already does in some places.

Sometimes, you want a build to fail if anything at all goes wrong. Sometimes you want to ignore failures to download/parse data. Most Pelias code was written to fail if anything goes wrong, since that's what we wanted for Mapzen Search, but I think we need to support both modes everwhere.

@nilsnolde
Copy link
Author

Ah, damn it, there's the switch right there!! And of course it's set true in our pelias.json..

Thanks @orangejulius again!

And for the record: I agree, having a switch is really good. At least for importers importing heaps of file like wof and oa.

@orangejulius
Copy link
Member

Oh, I didn't know it would actually work :) There's a lot of ways to download files in this repo, and I bet in at least some of the code paths, failures are still not ignored. So if you do see that, let us know.

@nilsnolde
Copy link
Author

Hm ok, sorry, apparently setting missingFilesAreFatal=false didn't fix the issue:

0 info it worked if it ends with ok
1 verbose cli [ '/usr/bin/node', '/usr/bin/npm', 'start' ]
2 info using [email protected]
3 info using [email protected]
4 verbose run-script [ 'prestart', 'start', 'poststart' ]
5 info lifecycle [email protected]~prestart: [email protected]
6 info lifecycle [email protected]~start: [email protected]
7 verbose lifecycle [email protected]~start: unsafe-perm in lifecycle true
8 verbose lifecycle [email protected]~start: PATH: /usr/lib/node_modules/npm/node_modules/npm-lifecycle/node-gyp-bin:/home/ubuntu/whosonfirst/node_modules/.bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin
9 verbose lifecycle [email protected]~start: CWD: /home/ubuntu/whosonfirst
10 silly lifecycle [email protected]~start: Args: [ '-c', './bin/start' ]
11 silly lifecycle [email protected]~start: Returned: code: 1  signal: null
12 info lifecycle [email protected]~start: Failed to exec start script
13 verbose stack Error: [email protected] start: `./bin/start`
13 verbose stack Exit status 1
13 verbose stack     at EventEmitter.<anonymous> (/usr/lib/node_modules/npm/node_modules/npm-lifecycle/index.js:285:16)
13 verbose stack     at emitTwo (events.js:126:13)
13 verbose stack     at EventEmitter.emit (events.js:214:7)
13 verbose stack     at ChildProcess.<anonymous> (/usr/lib/node_modules/npm/node_modules/npm-lifecycle/lib/spawn.js:55:14)
13 verbose stack     at emitTwo (events.js:126:13)
13 verbose stack     at ChildProcess.emit (events.js:214:7)
13 verbose stack     at maybeClose (internal/child_process.js:925:16)
13 verbose stack     at Process.ChildProcess._handle.onexit (internal/child_process.js:209:5)
14 verbose pkgid [email protected]
15 verbose cwd /home/ubuntu/whosonfirst
16 verbose Linux 4.4.0-28-generic
17 verbose argv "/usr/bin/node" "/usr/bin/npm" "start"
18 verbose node v8.11.3
19 verbose npm  v5.6.0
20 error code ELIFECYCLE
21 error errno 1
22 error [email protected] start: `./bin/start`
22 error Exit status 1
23 error Failed at the [email protected] start script.
23 error This is probably not a problem with npm. There is likely additional logging output above.
24 verbose exit [ 1, true ]

Any idea?

@nilsnolde nilsnolde reopened this Oct 13, 2018
@jeremy-rutman
Copy link

jeremy-rutman commented Apr 6, 2019

I also hit a download error (in a file that downloads with wget in a few seconds) and so set missingFileAreFatal to false, after which the downloads seem to be fine - either due to the switch or a serendipitously faster connection but this didnt help matters . I did see that the tar took a long time (more than 5 min) once I had the file with wget . SO possibly the line in download_data_all.js

`curl -s ${wofDataHost}/bundles/${bundle} | tar -xj --strip-components=1 --exclude=README.txt -C ` +
`${directory} && mv ${path.join(directory, csvFilename)} ${path.join(directory, 'meta')}`;

should be broken into several lines. I don;t know if wget is more reliable than curl

@jeremy-rutman
Copy link

ok by changing to wget and local files this runs to completion , I will do a pr

  function generateCommand(bundle, directory) {
    const csvFilename = bundle.replace(/-\d{8}T\d{6}-/, '-latest-') // support timestamped downloads
                              .replace('.tar.bz2', '.csv');
//    return `curl -s ${wofDataHost}/bundles/${bundle} | tar -xj --strip-components=1 --exclude=README.txt -C ` +
//      `${directory} && mv ${path.join(directory, csvFilename)} ${path.join(directory, 'meta')}`;
     var command=`wget ${wofDataHost}/bundles/${bundle} && tar -xvj --strip-components=1 --exclude=README.txt -C ${directory} -$
     console.log(command);
     return command;
  }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants