gocrawler

Intro

Thank you for the opportunity to do this exercise! I don't have much Go experience, but I'm trying my best!

Badges: Go Report Card, Go Doc, TravisCI Build Status, Code Coverage

The code sits in a GitHub account I created to host this publicly, outside my main profile.

go get github.com/goncalopereira/gocrawler/cmd/crawler


Run

Local run/test

./scripts/run.local.sh (includes the -race flag)

./scripts/run.local.norace.sh

./scripts/test.local.sh

docker run --rm -i hadolint/hadolint < ./build/Dockerfile

Production/test-like Docker

./scripts/run.docker.sh

./scripts/test.docker.sh

Production-like Minikube

./scripts/run.minikube.sh

Results

Test command: curl --data-urlencode "url=https://www.monzo.com" http://localhost:8080
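For context, a minimal sketch of the kind of handler that curl call assumes, reading the `url` form field on port 8080 (the handler name and response text are illustrative, not necessarily the real code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// crawlHandler reads the "url" form field posted by curl; the real
// service would kick off a crawl here. Names are illustrative.
func crawlHandler(w http.ResponseWriter, r *http.Request) {
	target := r.FormValue("url")
	if target == "" {
		http.Error(w, "missing url parameter", http.StatusBadRequest)
		return
	}
	// ... enqueue target for crawling ...
	fmt.Fprintf(w, "crawl queued for %s\n", target)
}

func main() {
	http.HandleFunc("/", crawlHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```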

At monzo.com with full depth you'll find around 940 unique nodes in 30-40 seconds. In the output (stdout is the default; otherwise output.csv) there is:

  • An ordered priority list (number of pages linking in, a pseudo page ranking)
  • A breadth-first search display of all pages, starting from the 'heaviest' page according to the previous ranking (the starting domain); a sketch of both outputs follows
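A rough sketch of how both outputs can be derived from a crawled link graph (the toy graph and CSV shape here are assumptions, not the exact on-disk format):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// adjacency list: page -> pages it links to (toy data)
	graph := map[string][]string{
		"https://monzo.com/":      {"https://monzo.com/about", "https://monzo.com/blog"},
		"https://monzo.com/about": {"https://monzo.com/"},
		"https://monzo.com/blog":  {"https://monzo.com/", "https://monzo.com/about"},
	}

	// pseudo page ranking: count in-links per page, highest first
	inLinks := map[string]int{}
	for _, targets := range graph {
		for _, t := range targets {
			inLinks[t]++
		}
	}
	pages := make([]string, 0, len(inLinks))
	for p := range inLinks {
		pages = append(pages, p)
	}
	sort.Slice(pages, func(i, j int) bool { return inLinks[pages[i]] > inLinks[pages[j]] })
	for _, p := range pages {
		fmt.Printf("%d,%s\n", inLinks[p], p)
	}

	// breadth-first search display, starting on the heaviest page
	start := pages[0]
	seen := map[string]bool{start: true}
	queue := []string{start}
	for len(queue) > 0 {
		page := queue[0]
		queue = queue[1:]
		fmt.Println(page)
		for _, next := range graph[page] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
}
```

The pseudo ranking is just an in-link count per page; the BFS then walks the graph from the highest-ranked page.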

The service respects ENV settings; it crawls one external request at a time and queues the remaining external ones (sketched below).
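One common way to get "one external request at a time, queue the rest" is a single worker goroutine fed by a buffered channel; a minimal sketch, with an assumed CRAWL_DELAY environment variable standing in for the real ENV settings:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// read a delay from the environment (the variable name is illustrative)
	delay, err := time.ParseDuration(os.Getenv("CRAWL_DELAY"))
	if err != nil {
		delay = time.Second
	}

	urls := make(chan string, 100) // queued external requests

	// single worker: fetches one URL at a time, waiting between requests
	go func() {
		for u := range urls {
			resp, err := http.Get(u)
			if err == nil {
				resp.Body.Close()
				fmt.Println("fetched", u, resp.Status)
			}
			time.Sleep(delay)
		}
	}()

	urls <- "https://monzo.com/"
	time.Sleep(5 * time.Second) // crude wait, for demo purposes only
}
```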

On SIGTERM the service tries to wait for the crawl to end, so it can shut down gracefully.
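A typical Go pattern for that kind of graceful shutdown, sketched here with signal.NotifyContext (Go 1.16+); not the exact code in this repository:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// cancel the context when SIGTERM (or SIGINT) arrives
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-ctx.Done():
				fmt.Println("signal received, finishing current crawl")
				return
			case <-time.After(time.Second):
				fmt.Println("crawling...") // stand-in for real work
			}
		}
	}()

	wg.Wait() // wait for the crawl loop to wind down
	fmt.Println("graceful shutdown complete")
}
```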

A TravisCI build is available (see badge), as well as containers.

Original

We'd like you to write a simple web crawler in a programming language of your choice. Feel free to either choose one you're very familiar with or, if you'd like to learn some Go, you can also make this your first Go program! The crawler should be limited to one domain - so when you start with https://monzo.com/, it would crawl all pages within monzo.com, but not follow external links, for example to the Facebook and Twitter accounts. Given a URL, it should print a simple site map, showing the links between pages.

Ideally, write it as you would a production piece of code. Bonus points for tests and making it as fast as possible!
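For reference, a single-domain breadth-first crawler along the lines the exercise asks for can be sketched in a few dozen lines using golang.org/x/net/html (an illustration, not this repository's implementation):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"golang.org/x/net/html"
)

// links extracts href values from a parsed HTML document.
func links(doc *html.Node) []string {
	var out []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					out = append(out, a.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return out
}

func main() {
	start, _ := url.Parse("https://monzo.com/")
	seen := map[string]bool{start.String(): true}
	queue := []*url.URL{start}

	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		resp, err := http.Get(u.String())
		if err != nil {
			continue
		}
		doc, err := html.Parse(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}
		for _, href := range links(doc) {
			next, err := u.Parse(href) // resolve relative links
			if err != nil || next.Host != start.Host {
				continue // stay on the starting domain, skip external links
			}
			next.Fragment = ""
			if !seen[next.String()] {
				seen[next.String()] = true
				fmt.Printf("%s -> %s\n", u, next) // simple site map: link between pages
				queue = append(queue, next)
			}
		}
	}
}
```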

Requirements

  • Simple Web Crawler
  • Optional language
  • Limited to one domain (open question: include subdomains when following links?); no external links such as Twitter or Facebook
  • Simple Site Map
  • Production ready
  • Tests

Approach

Summary of characteristics in Web Crawlers

  • Respect HTTP cache headers such as If-Modified-Since and Last-Modified
  • Identify the crawler via the User-Agent header
  • Respect robots.txt: at least a 1s delay between queries, and obey Crawl-Delay (see the sketch below)
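A sketch of what honouring those conventions looks like at the request level; the User-Agent string, timeout, and delay below are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// politeGet identifies the crawler and revalidates against a cached copy.
func politeGet(client *http.Client, url, lastModified string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// identify ourselves; the string here is an example
	req.Header.Set("User-Agent", "gocrawler/0.1 (+https://github.com/goncalopereira/gocrawler)")
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	return client.Do(req)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := politeGet(client, "https://monzo.com/", "")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status, "Last-Modified:", resp.Header.Get("Last-Modified"))
	time.Sleep(time.Second) // at least 1s between queries
}
```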

Some characteristics in other products

  • Depth limits
  • Partition URLs
  • Rotate User-Agent (bad practice)
  • Disable cookies (bad practice)
  • Download delay (mentioned in robots.txt)
  • Skip existing URLs
  • Normalise URLs
  • URL filters
  • Fetch only if required
  • Parse in batches
  • Data-source scoring
  • Stop when a maximum URL count is reached
  • Output
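Of those, normalising URLs and skipping existing ones matter most for not crawling the same page twice; a small sketch of both (the normalisation rules chosen below are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalise reduces a URL to a canonical form so the same page
// isn't crawled twice; the exact rules here are illustrative.
func normalise(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""                          // #section is the same page
	u.Path = strings.TrimSuffix(u.Path, "/") // /about/ == /about
	return u.String(), nil
}

func main() {
	seen := map[string]bool{}
	for _, raw := range []string{
		"https://Monzo.com/about/",
		"https://monzo.com/about#careers", // normalises to the same URL as above
		"https://monzo.com/blog",
	} {
		n, err := normalise(raw)
		if err != nil || seen[n] {
			continue // skip existing URLs
		}
		seen[n] = true
		fmt.Println("new:", n)
	}
}
```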

Some externally bound concerns (bottlenecks, failures)

  • Web requests to the site
  • DNS lookups
  • Writes and/or indexing
  • Storage model
  • Map creation
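Most of those externally bound failures surface as slow or hung requests, so an http.Client with explicit timeouts at each stage is a reasonable first defence; a sketch with illustrative values:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// bound each stage of the request so a slow site, DNS lookup,
	// or TLS handshake cannot hang the crawler indefinitely
	client := &http.Client{
		Timeout: 15 * time.Second, // whole request, including body read
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout: 5 * time.Second, // TCP connect (covers DNS too)
			}).DialContext,
			TLSHandshakeTimeout:   5 * time.Second,
			ResponseHeaderTimeout: 10 * time.Second,
		},
	}

	resp, err := client.Get("https://monzo.com/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```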

Some Golang info I looked at whilst developing this

Decisions for Production and Testing
