Support module scrapers #12

Open
andykais opened this issue Feb 20, 2019 · 1 comment
Labels
enhancement New feature or request

Comments

@andykais
Owner

The dream here is to let other users maintain scrapers in a community repo, or under their own GitHub accounts, and let developers simply install them via npm.

npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login

ConfigInit:

scrape:
  module: '@community-scrapers/twitter-feed'

yields Config:

input:
  - '@community-scrapers/twitter-feed:username'
define:
  '@community-scrapers/twitter-feed:feedpage': ...
  '@community-scrapers/twitter-feed:post': ...
  '@community-scrapers/twitter-feed:post-media': ...
scrape:
  module: '@community-scrapers/twitter-feed'

Local define defs can override those inside module define.
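
A minimal sketch of how that override could work, assuming module scraper names are prefixed with the module name and a colon, as in the Config above. `mergeDefines` and the selectors are hypothetical names for illustration, not part of scrape-pages.

```javascript
// Hypothetical helper: prefix every scraper name in a module's `define`
// with the module name, then let local definitions win on key collisions.
function mergeDefines(moduleName, moduleDefine, localDefine = {}) {
  const namespaced = {}
  for (const [name, scraper] of Object.entries(moduleDefine)) {
    namespaced[`${moduleName}:${name}`] = scraper
  }
  // Spread order matters: later keys (local) override earlier ones (module).
  return { ...namespaced, ...localDefine }
}

// Made-up scraper bodies, only here to show which definition survives.
const merged = mergeDefines(
  '@community-scrapers/twitter-feed',
  {
    feedpage: { parse: { selector: '.feed' } },
    post: { parse: { selector: '.post' } }
  },
  { '@community-scrapers/twitter-feed:post': { parse: { selector: 'article' } } }
)
console.log(merged)
```

Here the local `post` definition replaces the module's, while `feedpage` is kept as the module shipped it.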

How to wire this stuff up?

inputs

Create an object in each ScrapeStep that came from a module. The object should map full input keys to the module's internal keys. The internal keys are the ones actually used in the Handlebars templates. E.g.

{
  '@community-scrapers/twitter-feed:username': 'username'
}
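
A rough sketch of how such a map might be applied before rendering a module's templates. `toInternalInputs` and the sample input values are assumptions for illustration only.

```javascript
// Hypothetical helper: translate run-level inputs (full, namespaced keys)
// into the short names a module's templates refer to.
function toInternalInputs(inputMap, inputs) {
  const internal = {}
  for (const [fullKey, internalKey] of Object.entries(inputMap)) {
    // Only copy keys the module declared; everything else stays invisible to it.
    if (fullKey in inputs) internal[internalKey] = inputs[fullKey]
  }
  return internal
}

const inputMap = { '@community-scrapers/twitter-feed:username': 'username' }
const internalInputs = toInternalInputs(inputMap, {
  '@community-scrapers/twitter-feed:username': 'andykais',
  unrelated: 'ignored'
})
console.log(internalInputs)
```

Keeping the map per ScrapeStep means two modules can both use an internal key such as `username` without colliding.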

scrape

Two options:

  1. Create a separate flow.ts instance for a module and hook that up to whatever is above/below it.
  2. Crawl through a module scraper, find all empty scrapeEach arrays and reattach the rest of the structure there.
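
Option 2 could look roughly like the sketch below. It assumes each step is a plain object of the shape `{ scraper, scrapeEach }`, which is a guess at the internal structure, and `attachDownstream` is a hypothetical name.

```javascript
// Hypothetical step shape: { scraper: string, scrapeEach: Step[] }.
// Returns a copy of the module's structure in which every empty
// `scrapeEach` array has been replaced by the downstream steps.
function attachDownstream(moduleStep, downstream) {
  const children = moduleStep.scrapeEach || []
  return {
    ...moduleStep,
    scrapeEach:
      children.length === 0
        ? downstream
        : children.map(child => attachDownstream(child, downstream))
  }
}

const moduleScrape = {
  scraper: 'feedpage',
  scrapeEach: [
    { scraper: 'post', scrapeEach: [{ scraper: 'post-media', scrapeEach: [] }] }
  ]
}
const wired = attachDownstream(moduleScrape, [
  { scraper: 'local-archive', scrapeEach: [] }
])
console.log(JSON.stringify(wired, null, 2))
```

The module's own structure is left untouched, so the same module could be wired into several places in one config.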

stateful values

There may be times when a local/module scraper gets a value that you want for the rest of the run. Most often this will be an auth/access token.

define:
  'user-likes-page':
    download:
      urlTemplate: 'https://twitter.com/likes'
      headerTemplates:
        'x-twitter-access-token': '{{ accessToken }}'
    parse:
      selector: '.post a'
      attribute: 'href'
scrape:
  module: '@community-scrapers/twitter-login'
  valueAsInput: 'accessToken'
  forEach:
    - scraper: 'user-likes-page'

This is essentially global state: whenever '@community-scrapers/twitter-login' gives us a value, we update the input value for 'accessToken' and replace the passed-down value with ''.
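
As a sketch of that behavior, assuming a step carries an optional `valueAsInput` key and the run keeps one shared inputs object. `applyValueAsInput` and the token value are made up for illustration.

```javascript
// Hypothetical helper: when a step declares `valueAsInput`, store its value
// in the shared inputs and hand '' to the steps below it.
function applyValueAsInput(step, value, inputs) {
  if (!step.valueAsInput) return value // ordinary step: pass the value through
  inputs[step.valueAsInput] = value // mutate the run-wide inputs
  return ''
}

const inputs = { accessToken: '' }
const loginStep = {
  module: '@community-scrapers/twitter-login',
  valueAsInput: 'accessToken'
}
const passedDown = applyValueAsInput(loginStep, 'token-123', inputs)
console.log(inputs.accessToken, JSON.stringify(passedDown))
```

Any template rendered after this point, such as the 'x-twitter-access-token' header above, would then see the updated `accessToken`.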

organizing dependencies

Module scrapers can live in a separate directory with its own dependencies, and the main process can run them from there using worker_threads.

mkdir scrape-pages-runners
cd scrape-pages-runners
npm init
npm i scrape-pages @community-scrapers/twitter-feed @community-scrapers/twitter-login

Your main Node.js process can then run something like:

const { Worker } = require('worker_threads')
const worker = new Worker('./scrape-pages-runners/worker.js', { workerData: { config, options } })
worker.on('message', ([event, data]) => console.log(event, data)) // wire up scraper events here
worker.on('exit', () => console.log('complete.'))
andykais added the enhancement label Feb 20, 2019
@andykais
Owner Author

For now, this is a back-burner issue. The biggest use case was to reuse login logic for different scrapers.

If I see a pressing reason, I will implement it. Until then, I will encourage the community to build fully independent configs & options.

sample consumable scraper:

example-scraper/
  package.json
  config.json
  options.json
  readme.md

andykais added a commit that referenced this issue Feb 27, 2019
for now, this is not a pressing feature. See #12 for details.