- [X] Modify runRelayFetch logic to store and use last known noteId (most recently fetched).
- [X] Implement a note iterator/generator with a limit count
- [X] Use named cursors, e.g. 'front/abstract', 'rear/*'
- [X] Run fetching w/ sort num:asc
- [ ] Create logic to re-run fetch for all papers
- [ ] Use named cursors, stored in MongoDB (see the iterator/cursor sketch after this list)
- [X] Create test plan
- [X] Run against live site w/dev config
- [X] Run against mocked API
- [X] Koa-based API mimicking OpenReview (see the mock-server sketch after this list)
- [X] /login API mockup
- [X] /notes API mockup
- [X] Profile and report API fetch times (see the timing sketch after this list)
- [ ] Delete downloaded HTML files/artifacts when done
- [ ] Delete /tmp files created by Chrome
- [ ] Reap dead Chrome instances (see the cleanup sketch after this list)
- [ ] Deploy new code manually
- [ ] Merge adam -> iesl
- [ ] Get PM2 running w/o Bree wrapper (see the PM2 ecosystem sketch at the end of these notes)
- [ ] Refactor the monitor/stats module to show basic functionality
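
A minimal sketch of the resumable iterator plus named-cursor idea from the fetch items above, assuming notes arrive sorted num:asc through a caller-supplied fetchNotes function and that cursors are persisted in a MongoDB collection keyed by name. The CursorDoc shape, field names, and batch size are illustrative assumptions, not the project's actual schema.

```ts
// Sketch only: resumable note iteration with a named cursor persisted in MongoDB.
// The CursorDoc shape, batch size, and fetchNotes signature are assumptions.
import { Collection } from 'mongodb';

export interface Note { id: string; number: number; content?: Record<string, unknown>; }

export interface CursorDoc {
  name: string;        // named cursor, e.g. 'front/abstract' or 'rear/*'
  noteId?: string;     // last known noteId (most recently fetched)
  offset: number;      // position in the num:asc ordering
  updatedAt: Date;
}

export async function* noteIterator(
  cursors: Collection<CursorDoc>,
  fetchNotes: (offset: number, batchSize: number) => Promise<Note[]>, // notes sorted num:asc
  cursorName: string,
  limit = 100,         // max notes yielded per run
): AsyncGenerator<Note> {
  const cursor: CursorDoc =
    (await cursors.findOne({ name: cursorName })) ??
    { name: cursorName, offset: 0, updatedAt: new Date() };

  let yielded = 0;
  while (yielded < limit) {
    const batch = await fetchNotes(cursor.offset, Math.min(50, limit - yielded));
    if (batch.length === 0) break;   // caught up; nothing new to fetch

    for (const note of batch) {
      yield note;
      yielded += 1;
      cursor.offset += 1;
      cursor.noteId = note.id;
    }
    // Persist progress so the next run resumes instead of re-fetching everything.
    await cursors.updateOne(
      { name: cursorName },
      { $set: { noteId: cursor.noteId, offset: cursor.offset, updatedAt: new Date() } },
      { upsert: true },
    );
  }
}
```

Passing fetchNotes in as a parameter keeps the single OpenReview API call path in one place and makes the iterator easy to point at the mocked API below.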
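A sketch of the Koa-based mock, assuming the fetcher only needs /login and /notes. The route paths match the test-plan items above, but the response shapes (token, notes/count, content fields) are simplified approximations rather than the real OpenReview API schema.

```ts
// Sketch only: a tiny Koa server standing in for the OpenReview API in tests.
// Response shapes are simplified approximations, not the real API schema.
import Koa from 'koa';
import Router from '@koa/router';

const fakeNotes = Array.from({ length: 25 }, (_, i) => ({
  id: `note-${i + 1}`,
  number: i + 1,
  content: { title: `Paper ${i + 1}`, abstract: `Abstract for paper ${i + 1}` },
}));

const app = new Koa();
const router = new Router();

// /login mockup: always succeeds with a fake token.
router.post('/login', (ctx) => {
  ctx.body = { token: 'mock-token', user: { id: '~Mock_User1' } };
});

// /notes mockup: offset/limit paging over the fake notes, already in num:asc order.
router.get('/notes', (ctx) => {
  const offset = Number(ctx.query.offset ?? 0);
  const limit = Number(ctx.query.limit ?? 10);
  ctx.body = { notes: fakeNotes.slice(offset, offset + limit), count: fakeNotes.length };
});

app.use(router.routes()).use(router.allowedMethods());
app.listen(4100, () => console.log('mock OpenReview API listening on :4100'));
```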
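A small helper of the kind the profiling item implies; where the timings end up is left to the monitor/stats module, so this sketch just logs them.

```ts
// Sketch only: time an API fetch and report the elapsed milliseconds.
// Reporting here is just a log line; wiring it into the stats module is a TODO.
import { performance } from 'node:perf_hooks';

export async function timed<T>(label: string, fn: () => Promise<T>): Promise<T> {
  const start = performance.now();
  try {
    return await fn();
  } finally {
    const elapsedMs = performance.now() - start;
    console.log(`[timing] ${label}: ${elapsedMs.toFixed(1)} ms`);
  }
}

// Usage: const notes = await timed('notes.fetch', () => fetchNotes(offset, 50));
```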
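A best-effort cleanup sketch for the Chrome items; the /tmp profile prefixes and the ppid-1 orphan heuristic are assumptions about the deployment environment rather than verified project behavior.

```ts
// Sketch only: clean up leftover Chrome temp dirs and orphaned headless Chrome
// processes. The /tmp prefixes and the ppid==1 orphan heuristic are assumptions.
import { readdir, rm } from 'node:fs/promises';
import { join } from 'node:path';
import { exec } from 'node:child_process';
import { promisify } from 'node:util';

const execAsync = promisify(exec);

// Temp profile directories that puppeteer/Chrome commonly leave behind.
const TMP_PREFIXES = ['puppeteer_dev_chrome_profile-', '.org.chromium.Chromium.'];

export async function cleanupChromeTmp(tmpDir = '/tmp'): Promise<void> {
  for (const entry of await readdir(tmpDir)) {
    if (TMP_PREFIXES.some((prefix) => entry.startsWith(prefix))) {
      await rm(join(tmpDir, entry), { recursive: true, force: true });
    }
  }
}

export async function reapOrphanedChrome(): Promise<void> {
  // Headless Chrome whose launching worker died gets reparented (ppid 1 here);
  // kill those instances so they stop holding memory and /tmp space.
  const { stdout } = await execAsync('ps -eo pid,ppid,args');
  for (const line of stdout.split('\n')) {
    const match = line.trim().match(/^(\d+)\s+1\s+.*chrome.*--headless/);
    if (match) {
      try {
        process.kill(Number(match[1]), 'SIGKILL');
      } catch {
        // Process already exited between listing and kill; ignore.
      }
    }
  }
}
```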
- Keep track of slow extraction IDs
- Fetch should only hit the OpenReview API once.
- Keep track of a hash of the extracted fields and note when they change (see the hashing sketch below)
- Re-extracting from the beginning is a local-only operation; write a log record when updates should/do happen
- Use PM2 deploy hooks to auto-deploy (see the post-deploy hook in the ecosystem sketch below)
- Allow multiple extraction workers when responseUrl is known and hosts can be spread out over time
- Make the spider not write body/header files (use a CLI option to control this behavior)
- Use url_status responseUrl to avoid redirect issues
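
A sketch of the field-hash idea; the ExtractedFields shape and where the previous hash is stored are assumptions.

```ts
// Sketch only: hash extracted fields so re-extraction can detect changes.
// The ExtractedFields shape and where hashes are stored are assumptions.
import { createHash } from 'node:crypto';

export interface ExtractedFields {
  title?: string;
  abstract?: string;
  pdfLink?: string;
}

export function fieldHash(fields: ExtractedFields): string {
  // Sort entries so the hash is independent of field insertion order.
  const canonical = JSON.stringify(
    Object.fromEntries(Object.entries(fields).sort(([a], [b]) => a.localeCompare(b))),
  );
  return createHash('sha1').update(canonical).digest('hex');
}

// The previous hash would come from the stored extraction record; log when it changes.
export function fieldsChanged(fields: ExtractedFields, previousHash?: string): boolean {
  return fieldHash(fields) !== previousHash;
}
```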
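A sketch of what an ecosystem.config.js might look like once the Bree wrapper is gone, with a post-deploy hook for the auto-deploy note. PM2 reads this file as plain JavaScript (so this one sketch is JS rather than TypeScript), and every name, path, host, repo, and branch below is a placeholder.

```js
// ecosystem.config.js (sketch only) -- run the service under PM2 directly,
// without the Bree wrapper, and use PM2's deploy hooks for redeploys.
// App name, script path, repo, host, and branch are placeholders, not real values.
module.exports = {
  apps: [
    {
      name: 'extraction-service',     // placeholder app name
      script: './dist/service.js',    // entry point started by PM2, no Bree in between
      instances: 1,
      autorestart: true,
      env: { NODE_ENV: 'production' },
    },
  ],
  deploy: {
    production: {
      user: 'deploy',                                 // placeholder ssh user
      host: ['extraction-host.example.org'],          // placeholder host
      ref: 'origin/iesl',                             // branch from the merge item above
      repo: 'git@github.com:example/extraction.git',  // placeholder repo
      path: '/opt/extraction-service',
      // Deploy hook: build and reload the app after each `pm2 deploy`.
      'post-deploy':
        'npm install && npm run build && pm2 reload ecosystem.config.js --only extraction-service',
    },
  },
};
```

With this in place, `pm2 deploy ecosystem.config.js production setup` provisions the target once, and `pm2 deploy ecosystem.config.js production` pulls the branch and runs the post-deploy hook on each deploy.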