project-todo.org
Project plan for the current task

TASK Modify openreview note fetcher to use ‘?after=’ param

Fetch Service

  • [X] Modify runRelayFetch logic to store and use last known noteId (most recently fetched).
    • [X] Implement a note iterator/generator with a limit count
    • [X] Use named cursor e.g., ‘front/abstract’, ‘rear/*’
    • [X] Run fetching w/ sort num:asc
  • [ ] Create logic to re-run fetch for all papers
    • [ ] Use named cursors, stored in mongodb
  • [X] Create test plan
    • [X] Run against live site w/dev config
    • [X] Run against mocked api
      • [X] Koa-based api mimicking openreview
        • [X] /login api mockup
        • [X] /notes api mockup
    • [X] Profile and report api fetch times
  • [ ] Delete downloaded HTML files/artifacts when done
  • [ ] Delete /tmp files created by chrome
  • [ ] Reap dead chrome instances
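The note iterator/generator above can be sketched roughly as follows. This is a sketch, not the actual runRelayFetch code: the `FetchPage` signature, field names, and page shape are assumptions standing in for the real openreview `?after=`/limit request.

```typescript
// Hypothetical sketch of the note iterator with a limit count, resuming
// from the last known (most recently fetched) noteId via '?after='.
type Note = { id: string };
type FetchPage = (after: string | undefined, limit: number) => Promise<Note[]>;

async function* noteIterator(
  fetchPage: FetchPage,
  lastKnownId: string | undefined,
  limit: number
): AsyncGenerator<Note> {
  let after = lastKnownId;
  while (true) {
    const page = await fetchPage(after, limit);
    if (page.length === 0) return; // no more notes
    for (const note of page) yield note;
    after = page[page.length - 1].id; // remember most recently fetched noteId
  }
}
```

Run with sort num:asc so that the last id on a page is always a valid resume point.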

Next Actions

  • [ ] Deploy new code manually
    • [ ] merge adam -> iesl
    • [ ] Get PM2 running w/o bree wrapper
  • [ ] refactor monitor/stats module to show basic functionality

Ideas

  • keep track of slow extraction ids
  • Fetch should hit the openreview API only once.
    • keep track of a hash of the extracted fields; note when they change
    • re-extracting from the beginning is a local-only operation; write a log record when updates should/do happen
  • use PM2 hooks to autodeploy
  • allow multiple extraction workers when responseUrl is known and hosts can be spread out over time
  • Make spider not write body/header files (use cli option to control behavior)
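The field-hash idea above could look something like this. A sketch only: the field names (`abstract`, `pdfLink`) and the choice of sha1 are assumptions, not the project's actual schema.

```typescript
import { createHash } from 'crypto';

// Hypothetical extracted-field shape; real field names may differ.
type ExtractedFields = { abstract?: string; pdfLink?: string };

// Stable hash of extracted fields, so re-extraction can detect changes.
function fieldsHash(fields: ExtractedFields): string {
  // Sort keys so the hash does not depend on insertion order.
  const canonical = JSON.stringify(
    Object.keys(fields)
      .sort()
      .map((k) => [k, (fields as Record<string, unknown>)[k]])
  );
  return createHash('sha1').update(canonical).digest('hex');
}

function fieldsChanged(prevHash: string | undefined, fields: ExtractedFields): boolean {
  return prevHash !== fieldsHash(fields);
}
```

When `fieldsChanged` is true, that is the point to write the log record mentioned above.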

Cursor design/implementation

Cursor is (noteId, role)
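One possible shape for the (noteId, role) cursor as a mongodb document. Everything beyond `noteId` and `role` (the `name` and `lockedBy` fields, the role values) is an assumption based on the notes in this file.

```typescript
// Hypothetical cursor record; the named cursor ('front/abstract', 'rear/*')
// and the lock field come from the notes above.
type CursorRole = 'fetch' | 'extract';

interface Cursor {
  noteId: string;    // position: last note processed
  role: CursorRole;  // which pipeline stage this cursor belongs to
  name: string;      // named cursor, e.g. 'front/abstract' or 'rear/*'
  lockedBy?: string; // worker id when locked; undefined when free
}
```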

Fetch Cursor

Extraction Cursor filters

  • abstract/pdf-link/both
  • only process certain domains
    • Use url_status responseUrl to avoid redirect issues
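The domain filter could be sketched as below. The `responseUrl` field name comes from the url_status note above; the matching rule (exact host or subdomain) is an assumption.

```typescript
// Match against the final post-redirect URL (responseUrl) rather than the
// original URL, to avoid redirect issues.
function matchesDomainFilter(responseUrl: string, domains: string[]): boolean {
  const host = new URL(responseUrl).hostname;
  return domains.some((d) => host === d || host.endsWith('.' + d));
}
```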

Operations

Worker locks/owns current cursor

Worker unlocks cursor

Worker advances cursor to next available note
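The three operations above can be sketched in memory like this. A real implementation would use a single atomic mongodb update (e.g. findOneAndUpdate on lockedBy) so two workers cannot lock the same cursor; this in-memory version only illustrates the intended semantics.

```typescript
// Simplified cursor record for this sketch.
interface CursorRec { noteId: string; role: string; lockedBy?: string }

const cursors = new Map<string, CursorRec>();

// Worker locks/owns the cursor; fails if someone else already owns it.
function lockCursor(name: string, workerId: string): boolean {
  const c = cursors.get(name);
  if (!c || c.lockedBy !== undefined) return false;
  c.lockedBy = workerId;
  return true;
}

// Worker unlocks the cursor; only the owner may unlock.
function unlockCursor(name: string, workerId: string): void {
  const c = cursors.get(name);
  if (c && c.lockedBy === workerId) c.lockedBy = undefined;
}

// Worker advances the cursor to the next available note; owner only.
function advanceCursor(name: string, workerId: string, nextNoteId: string): boolean {
  const c = cursors.get(name);
  if (!c || c.lockedBy !== workerId) return false;
  c.noteId = nextNoteId;
  return true;
}
```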