SimZal/crawler

 
 

Crawl links on a website

THIS IS A FORK OF THE SPATIE CRAWLER. IT ADDS A CALLBACK FUNCTION TO RECEIVE ALL THE LINKS FOUND ON THE CRAWLED PAGE.

This package provides a class to crawl links on a website.

Spatie is a webdesign agency in Antwerp, Belgium. You'll find an overview of all our open source projects on our website.

Installation

This package can be installed via Composer:

composer require spatie/crawler

Usage

The crawler can be instantiated like this:

Crawler::create()
    ->setCrawlObserver(<implementation of \Spatie\Crawler\CrawlObserver>)
    ->startCrawling($url);

The argument passed to setCrawlObserver must be an instance that implements the \Spatie\Crawler\CrawlObserver interface:

/**
 * Called when the crawler will crawl the given url.
 *
 * @param \Spatie\Crawler\Url $url
 */
public function willCrawl(Url $url);

/**
 * Called when the crawler has crawled the given url.
 *
 * @param \Spatie\Crawler\Url                 $url
 * @param \Psr\Http\Message\ResponseInterface $response
 */
public function hasBeenCrawled(Url $url, ResponseInterface $response);

/**
 * Called when the crawler has found links on the page
 *
 * @param \SimZal\Crawler\Url                       $url
 * @param \Illuminate\Support\Collection            $links
 */
public function foundLinks(Url $url, $links);

/**
 * Called when the crawl has ended.
 */
public function finishedCrawling();
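
As a rough illustration of the four callbacks above, here is a minimal sketch of an observer that records every event in an in-memory log. The class name LoggingCrawlObserver and its $log property are hypothetical and not part of the package. So that the sketch runs on its own, it does not declare the interface and it is driven by hand with plain strings and an array instead of Url objects and a Collection. The order in which the callbacks are invoked here, and the assumption that a Url can be cast to a string, are guesses that should be checked against the package source.

```php
<?php

// Hypothetical observer that records each callback in an in-memory log.
// In a real project the class would declare the package's CrawlObserver interface;
// it is left out here so the sketch runs without the package installed.
final class LoggingCrawlObserver
{
    /** @var string[] */
    public $log = [];

    public function willCrawl($url)
    {
        // Assumes $url can be cast to a string.
        $this->log[] = "will crawl: {$url}";
    }

    public function hasBeenCrawled($url, $response)
    {
        $this->log[] = "crawled: {$url}";
    }

    public function foundLinks($url, $links)
    {
        // foreach works for plain arrays and for iterable collections alike.
        foreach ($links as $link) {
            $this->log[] = "found on {$url}: {$link}";
        }
    }

    public function finishedCrawling()
    {
        $this->log[] = 'finished';
    }
}

// Simulate a crawl of a single page by calling the callbacks by hand.
// The call order is an assumption, not taken from the package.
$observer = new LoggingCrawlObserver();
$observer->willCrawl('https://example.com');
$observer->hasBeenCrawled('https://example.com', null);
$observer->foundLinks('https://example.com', ['https://example.com/about']);
$observer->finishedCrawling();

echo implode(PHP_EOL, $observer->log), PHP_EOL;
```

The parameters are left untyped on purpose: PHP allows an implementing class to widen parameter types, so the same method bodies should remain compatible once the interface is declared.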

Filtering certain URLs

You can tell the crawler not to visit certain URLs by using the setCrawlProfile function. That function expects an object that implements the Spatie\Crawler\CrawlProfile interface:

/**
 * Determine if the given url should be crawled.
 *
 * @param \Spatie\Crawler\Url $url
 *
 * @return bool
 */
public function shouldCrawl(Url $url);
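
One common use of a crawl profile is keeping the crawler on a single site. The sketch below shows one way shouldCrawl could make that decision. The class name SameHostCrawlProfile is hypothetical, the interface declaration is omitted so the example runs without the package, and the code assumes a Url can be cast to a string so that PHP's built-in parse_url can extract the host.

```php
<?php

// Hypothetical profile that only allows URLs on one host.
// In a real project the class would declare the package's CrawlProfile interface.
final class SameHostCrawlProfile
{
    /** @var string */
    private $host;

    public function __construct(string $host)
    {
        $this->host = strtolower($host);
    }

    public function shouldCrawl($url): bool
    {
        // parse_url returns null when the URL has no host and false when it is malformed.
        $host = parse_url((string) $url, PHP_URL_HOST);

        return is_string($host) && strtolower($host) === $this->host;
    }
}

$profile = new SameHostCrawlProfile('example.com');

var_dump($profile->shouldCrawl('https://example.com/about'));
var_dump($profile->shouldCrawl('https://other.test/'));
```

Relative links and schemes without a host (such as mailto:) are rejected by this sketch, because parse_url reports no host for them.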

Changelog

Please see CHANGELOG for more information on what has changed recently.

Contributing

Please see CONTRIBUTING for details.

Security

If you discover any security-related issues, please email [email protected] instead of using the issue tracker.

Credits

License

The MIT License (MIT). Please see License File for more information.
