How to prevent Dataset from being deleted when running the crawler? #1878

Answered by B4nan
Ramin-Bateni asked this question in Q&A

You need to purge on start, otherwise only requests that were never processed can go through. That's the whole point of the auto-purging: to clear the crawler state. If you disable it, things can run only once; afterwards the queue consists only of already-processed requests, so it's expected that you won't get inside the request handler at all.
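
For context, this is roughly what disabling the auto-purge looks like (a minimal sketch, assuming the purgeOnStart Configuration option; the crawler class and handler body here are placeholders):

import { CheerioCrawler, Configuration } from 'crawlee';

// With purgeOnStart disabled, the default storages survive between runs,
// so a re-run finds every request in the queue already marked as handled
// and the requestHandler is never invoked again.
const crawler = new CheerioCrawler(
  {
    requestHandler: async ({ request }) => {
      // ...reached only on the first run, while the queue still holds unhandled requests
    },
  },
  new Configuration({ purgeOnStart: false }),
);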

You could use a named dataset, those are never purged automatically (it's up to you to remove the data, either manually or via the drop() method).

import { Dataset } from 'crawlee';

// create (or open) a named dataset - named storages survive auto-purging
const ds = await Dataset.open('my-data');

// push to it (request and title come from the request handler scope)
await ds.pushData({
  url: request.loadedUrl,
  title,
});
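
And when the data is no longer needed, the dataset can be cleaned up via the drop method mentioned above:

// remove the named dataset and all of its data
await ds.drop();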

Answer selected by Ramin-Bateni