Write a simple web crawler in Go. The crawler should be limited to one domain: when crawling, it should follow links within that domain but not external links, for example to the site's Facebook or Twitter accounts.
Given a URL, your program should output a site map showing each page's URL, title, static assets, internal links and external links.
The number of pages crawled should be configurable. We suggest crawling Wikipedia and limiting the number of pages to 100.
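A minimal sketch of the per-page record such a crawl could produce, plus the configurable page limit. The field and flag names are illustrative assumptions, not part of the spec:

```go
package main

import (
	"flag"
	"fmt"
)

// Page is one node in the site map. Field names are illustrative;
// the spec only asks for URL, title, static assets, internal links
// and external links per page.
type Page struct {
	URL           string   // address of the crawled page
	Title         string   // contents of the <title> element
	Assets        []string // static assets referenced by the page
	InternalLinks []string // links that stay on the start domain
	ExternalLinks []string // links that leave it (e.g. Facebook, Twitter)
}

// The page budget is configurable, as required; 100 matches the
// suggested Wikipedia run.
var limit = flag.Int("limit", 100, "maximum number of pages to crawl")

func main() {
	flag.Parse()
	fmt.Println("crawling with a limit of", *limit, "pages")
}
```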
- Scrape the starting page.
- As soon as an internal link is found, start scraping it concurrently (see the crawl skeleton below).
- Keep extracting internal links, external links and static assets (the "img", "audio", "script", "video", "embed" and "source" tags; see the extraction sketch below).
- Once all of a page's data is scraped, print that page's result.
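A sketch of the concurrent crawl described above: a visited set guarded by a mutex keeps each URL to one fetch, a host check keeps the crawl on one domain, and a WaitGroup lets the program wait for every spawned goroutine. The `crawler` type and its method names are assumptions; `fetch` stands in for the actual download-and-parse step:

```go
package main

import (
	"net/url"
	"sync"
)

// crawler spawns a goroutine per internal link, but each URL is
// claimed exactly once and only while the page budget lasts.
type crawler struct {
	host    string                  // only links on this host are followed
	limit   int                     // maximum number of pages to fetch
	fetch   func(u string) []string // downloads+parses a page, returns its internal links
	mu      sync.Mutex              // guards visited
	visited map[string]bool
	wg      sync.WaitGroup
}

// visit claims a URL. It returns false if the URL is off-domain,
// already seen, or over budget. The sketch assumes absolute URLs;
// a real crawler resolves relative hrefs against the page URL first.
func (c *crawler) visit(raw string) bool {
	u, err := url.Parse(raw)
	if err != nil || u.Host != c.host {
		return false
	}
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.visited[raw] || len(c.visited) >= c.limit {
		return false
	}
	c.visited[raw] = true
	return true
}

func (c *crawler) crawl(raw string) {
	defer c.wg.Done()
	for _, link := range c.fetch(raw) {
		if c.visit(link) {
			c.wg.Add(1)
			go c.crawl(link) // scrape each internal link as soon as it is found
		}
	}
}

func main() {
	c := &crawler{
		host:    "en.wikipedia.org",
		limit:   100,
		fetch:   func(string) []string { return nil }, // real fetch+extract goes here
		visited: map[string]bool{},
	}
	if start := "https://en.wikipedia.org/"; c.visit(start) {
		c.wg.Add(1)
		go c.crawl(start)
	}
	c.wg.Wait()
}
```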
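The extraction step, sketched with the golang.org/x/net/html tokenizer (goquery, mentioned in the TODOs below, is built on the same package). It pulls the title, the src of the six static-asset tags listed above, and every anchor href in a single pass; classifying links as internal or external is left to the caller, which knows the start domain:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"

	"golang.org/x/net/html"
)

// assetTags are the tags whose src attribute counts as a static asset.
var assetTags = map[string]bool{
	"img": true, "audio": true, "script": true,
	"video": true, "embed": true, "source": true,
}

// extract walks the token stream once, collecting the page title,
// static asset URLs and anchor hrefs.
func extract(r io.Reader) (title string, assets, links []string) {
	z := html.NewTokenizer(r)
	for {
		switch z.Next() {
		case html.ErrorToken: // io.EOF or malformed input: stop
			return
		case html.StartTagToken, html.SelfClosingTagToken:
			t := z.Token()
			switch {
			case t.Data == "title":
				if z.Next() == html.TextToken {
					title = strings.TrimSpace(z.Token().Data)
				}
			case assetTags[t.Data]:
				for _, a := range t.Attr {
					if a.Key == "src" {
						assets = append(assets, a.Val)
					}
				}
			case t.Data == "a":
				for _, a := range t.Attr {
					if a.Key == "href" {
						links = append(links, a.Val)
					}
				}
			}
		}
	}
}

func main() {
	resp, err := http.Get("https://en.wikipedia.org/")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	title, assets, links := extract(resp.Body)
	fmt.Printf("%q: %d assets, %d links\n", title, len(assets), len(links))
}
```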
The implementation is aggressively concurrent, which makes it fairly fast, but it needs further testing before it can be considered stable.
- Add a worker pool to bound the concurrency (see the sketch after this list).
- Reuse HTTP connections instead of opening a new one per request.
- Benchmark the data extraction.
- Consider goquery for parsing; regular expressions are not recommended for HTML.
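A sketch of the first two TODO items combined: a fixed pool of workers drains a URL channel, and all of them share one http.Client, whose default Transport keeps idle connections alive so requests to the same host reuse TCP connections. The worker count and sample URLs are arbitrary assumptions:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"sync"
	"time"
)

func main() {
	// One shared client: its Transport pools idle connections, so
	// repeated requests to the same host reuse TCP connections.
	client := &http.Client{Timeout: 10 * time.Second}

	urls := make(chan string)
	var wg sync.WaitGroup

	// A fixed number of workers replaces the unbounded
	// goroutine-per-link fan-out of the current implementation.
	const workers = 8
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range urls {
				resp, err := client.Get(u)
				if err != nil {
					fmt.Println("fetch failed:", err)
					continue
				}
				// Draining and closing the body is what lets the
				// connection return to the idle pool for reuse.
				io.Copy(io.Discard, resp.Body)
				resp.Body.Close()
				fmt.Println(u, resp.Status)
			}
		}()
	}

	for _, u := range []string{
		"https://en.wikipedia.org/",
		"https://en.wikipedia.org/wiki/Go_(programming_language)",
	} {
		urls <- u
	}
	close(urls)
	wg.Wait()
}
```

Note that in a real crawler the workers also discover new URLs; feeding those back into the same channel needs a separate queue (or careful buffering) to avoid deadlocking the pool.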
Run the suggested Wikipedia crawl (100-page limit):

$ make runwiki