Merge pull request #6 from andykais/improved-logging
Improved Logging
andykais authored Jan 19, 2019
2 parents 23c670d + 35abcbf commit 065820a
Showing 111 changed files with 2,598 additions and 1,738 deletions.
2 changes: 2 additions & 0 deletions .eslintignore
@@ -0,0 +1,2 @@
+website
+lib
12 changes: 12 additions & 0 deletions .eslintrc
@@ -0,0 +1,12 @@
+{
+  "parser": "eslint-plugin-typescript/parser",
+  "plugins": ["typescript", "no-only-tests"],
+  "rules": {
+    "no-unused-expressions": "error",
+    "no-console": "warn",
+    "typescript/no-unused-vars": "error",
+    "typescript/explicit-member-accessibility": "error",
+    "typescript/member-ordering": "error",
+    "no-only-tests/no-only-tests": "error"
+  }
+}
5 changes: 5 additions & 0 deletions .travis.yml
@@ -37,6 +37,11 @@ jobs:
node_js: '8'
script: npm run lint

+  - stage: typecheck-test-lint
+    name: check formatting
+    node_js: '8'
+    script: npm run format:check
+
- stage: deploy-npm
node_js: '8'
script: skip
81 changes: 45 additions & 36 deletions README.md
@@ -17,10 +17,10 @@ npm install scrape-pages

## Usage

-lets download the five most recent images from nasa's image of the day archive
+let's download the five most recent images from NASA's image of the day archive

```javascript
-const ScrapePages = require('scrape-pages')
+const { scrape } = require('scrape-pages')
// create a config file
const config = {
scrape: {
@@ -42,21 +42,22 @@ const config = {
}
}
}
+const options = {
+  folder: './downloads',
+  logLevel: 'info',
+  logFile: './nasa-download.log'
+}

-// load the config into a new 'scraper'
-const siteScraper = new ScrapePages(config)
-// begin scraping
-const emitter = siteScraper.run({ folder: './downloads' })

-emitter.on('image:complete', (queryFor, { id }) =>
+const scraper = await scrape(config, options)
+const { on, emit, query } = scraper
+on('image:complete', id => {
  console.log('COMPLETED image', id)
-)
-
-emitter.on('done', async queryFor => {
+})
+on('done', () => {
  console.log('finished.')
-  const result = await queryFor({ scrapers: { images: ['filename'] } })
-  console.log(result)
-  // [{
+  const result = query({ scrapers: ['images'] })
+  // result = [{
  //   images: [{ filename: 'img1.jpg' }, { filename: 'img2.jpg' }, ...]
  // }]
})
```

@@ -66,41 +67,49 @@ For more real world examples, visit the [examples](examples) directory

## Documentation

-Detailed usage documentation is coming, but for now, [typescript](https://www.typescriptlang.org/) typings
-exist for the surface API.
-
-- for scraper config object documentation see [src/configuration/types.ts](src/configuration/types.ts)
-- for runtime options documentation see [src/run-options/types.ts](src/run-options/types.ts)
-
-The scraper instance created from a config object is meant to be reusable and cached. It only knows about the
-config object. `scraper.run` can be called multiple times, and, as long as different folders are
-provided, each run will work independently. `scraper.run` returns **emitter**
-
-### emitter
+### scrape

-#### Listenable events
+| param   | type             | required | type file                                                      | description                   |
+| ------- | ---------------- | -------- | -------------------------------------------------------------- | ----------------------------- |
+| config  | `ConfigInit`     | Yes      | [src/settings/config/types.ts](src/settings/config/types.ts)   | _what_ is being downloaded    |
+| options | `RunOptionsInit` | Yes      | [src/settings/options/types.ts](src/settings/options/types.ts) | _how_ something is downloaded |

-each event will return the **queryFor** function as its first argument
-
-- `'done'`: when the scraper has completed
-- `'error'`: when the scraper encounters an error (this also stops the scraper)
-- `'<scraper>:progress'`: emits progress of download until completed
-- `'<scraper>:queued'`: when a download is queued
-- `'<scraper>:complete'`: when a download is completed
+### scraper
+The `scrape` function returns a promise which yields these utilities (`on`, `emit`, and `query`)

-#### Emittable events
+#### on
+Listen for events from the scraper
+| event                  | callback arguments    | description                                |
+| ---------------------- | --------------------- | ------------------------------------------ |
+| `'done'`               | queryFor              | when the scraper has completed             |
+| `'error'`              | error                 | if the scraper encounters an error         |
+| `'<scraper>:progress'` | queryFor, download id | emits progress of download until completed |
+| `'<scraper>:queued'`   | queryFor, download id | when a download is queued                  |
+| `'<scraper>:complete'` | queryFor, download id | when a download is completed               |

-- `'useRateLimiter'`: pass a boolean to turn on or off the rate limit defined in the run options
-- `'stop'`: emit this event to stop the crawler (note that any in progress promises will still complete)
+#### emit

-### queryFor
+While the scraper is working, you can affect its behavior by emitting these events:
+| event              | arguments | description                                                           |
+| ------------------ | --------- | --------------------------------------------------------------------- |
+| `'useRateLimiter'` | boolean   | turn on or off the rate limit defined in the run options              |
+| `'stop'`           |           | stop the crawler (note that in progress requests will still complete) |

+each event will return the **queryFor** function as its first argument

-This function is used to get data back out of the scraper whenever you need it. The function takes an object
-with three keys:
+#### query

-- `scrapers`: `{ [name]: Array<'filename'|'parsedValue'> }`
-- `groupBy?`: name of a scraper which will delineate the values in `scrapers`
-- `stmtCacheKey?`: `Symbol` which helps the internal database cache queries.
+This function is an argument in the emitter callback and is used to get data back out of the scraper whenever
+you need it. These are its arguments:
+| name       | type       | required | description                                                           |
+| ---------- | ---------- | -------- | --------------------------------------------------------------------- |
+| `scrapers` | `string[]` | Yes      | scrapers that will return their filenames and parsed values, in order |
+| `groupBy`  | `string`   | Yes      | name of a scraper which will delineate the values in `scrapers`       |
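To make the result shape concrete, here is a mock of what a `query` call yields, matching the example earlier in the README; the flat `rows` input and the `mockQuery` helper are hypothetical, since the real function reads from the scraper's internal database:

```javascript
// Hypothetical flat rows, as an internal database might store downloads.
const rows = [
  { scraper: 'images', filename: 'img1.jpg' },
  { scraper: 'images', filename: 'img2.jpg' }
]

// Mock of query({ scrapers: [...] }): collect each requested scraper's
// downloads, in order, keyed by scraper name.
function mockQuery({ scrapers }) {
  const group = {}
  for (const name of scrapers) {
    group[name] = rows
      .filter(row => row.scraper === name)
      .map(({ filename }) => ({ filename }))
  }
  return [group]
}

const result = mockQuery({ scrapers: ['images'] })
// result = [{ images: [{ filename: 'img1.jpg' }, { filename: 'img2.jpg' }] }]
```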

## Motivation

7 changes: 7 additions & 0 deletions custom.d.ts
@@ -7,3 +7,10 @@ declare module 'flow-runtime' {
const content: any
export default content
}

+type ArgumentTypes<F extends Function> = F extends (...args: infer A) => any
+  ? A
+  : never
+
+type Nullable<T> = T | null
+type Voidable<T> = T | void
2 changes: 1 addition & 1 deletion examples/__tests__/deviantart.unit.test.ts
@@ -1,5 +1,5 @@
import deviantartConfig from '../deviantart.config.json'
-import { assertConfigType } from '../../src/configuration/site-traversal'
+import { assertConfigType } from '../../src/settings/config'

describe('deviantart config', () => {
it('is properly typed', () => {
2 changes: 1 addition & 1 deletion examples/__tests__/nasa-image-of-the-day.unit.test.ts
@@ -1,5 +1,5 @@
import nasaIotdConfig from '../nasa-image-of-the-day.config.json'
-import { assertConfigType } from '../../src/configuration/site-traversal'
+import { assertConfigType } from '../../src/settings/config'

describe('nasa iotd config', () => {
it('is properly typed', () => {
2 changes: 1 addition & 1 deletion examples/__tests__/simple-config.unit.test.ts
@@ -1,4 +1,4 @@
-import { assertConfigType } from '../../src/configuration/site-traversal'
+import { assertConfigType } from '../../src/settings/config'
import * as testingConfigs from '../../testing/resources/testing-configs'

describe('example simple config', () => {
2 changes: 1 addition & 1 deletion examples/__tests__/tumblr.unit.test.ts
@@ -1,5 +1,5 @@
import tumblrConfig from '../tumblr.config.json'
-import { assertConfigType } from '../../src/configuration/site-traversal'
+import { assertConfigType } from '../../src/settings/config'

describe('tumblr config', () => {
it('is properly typed', () => {
