gocrawler

Intro

Thank you for the opportunity to do this exercise! I don't have much Go experience, but I'm trying my best!

Badges: Go Report Card, Go Doc, TravisCI Build Status, Code Coverage

The code sits in a GitHub account I created to host this publicly, outside my main profile.

go get github.com/goncalopereira/gocrawler/cmd/crawler


Run

Local run/test

./scripts/run.local.sh (includes the -race flag)

./scripts/run.local.norace.sh

./scripts/test.local.sh

docker run --rm -i hadolint/hadolint < ./build/Dockerfile

Production/test-like Docker

./scripts/run.docker.sh

./scripts/test.docker.sh

Production-like Minikube

./scripts/run.minikube.sh

Results

Test command: curl --data-urlencode "url=https://www.monzo.com" http://localhost:8080
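For context, a minimal sketch of the kind of handler that curl call assumes, reading the `url` form field on port 8080 (the handler name and response text are illustrative, not necessarily the real code):

```go
package main

import (
	"fmt"
	"log"
	"net/http"
)

// crawlHandler reads the "url" form field posted by curl; the real
// service would kick off a crawl here. Names are illustrative.
func crawlHandler(w http.ResponseWriter, r *http.Request) {
	target := r.FormValue("url")
	if target == "" {
		http.Error(w, "missing url parameter", http.StatusBadRequest)
		return
	}
	// ... enqueue target for crawling ...
	fmt.Fprintf(w, "crawl queued for %s\n", target)
}

func main() {
	http.HandleFunc("/", crawlHandler)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```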

At monzo.com with full depth you'll find around 940 unique nodes in 30-40 seconds. In the output (stdout is the default; otherwise output.csv) there is:

  • An ordered priority list (number of pages linking in, a pseudo page ranking)
  • A breadth-first search display of all pages, starting from the 'heaviest' page according to the previous ranking (the starting domain); a sketch of both outputs follows
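A rough sketch of how both outputs can be derived from a crawled link graph (the toy graph and CSV shape here are assumptions, not the exact on-disk format):

```go
package main

import (
	"fmt"
	"sort"
)

func main() {
	// adjacency list: page -> pages it links to (toy data)
	graph := map[string][]string{
		"https://monzo.com/":      {"https://monzo.com/about", "https://monzo.com/blog"},
		"https://monzo.com/about": {"https://monzo.com/"},
		"https://monzo.com/blog":  {"https://monzo.com/", "https://monzo.com/about"},
	}

	// pseudo page ranking: count in-links per page, highest first
	inLinks := map[string]int{}
	for _, targets := range graph {
		for _, t := range targets {
			inLinks[t]++
		}
	}
	pages := make([]string, 0, len(inLinks))
	for p := range inLinks {
		pages = append(pages, p)
	}
	sort.Slice(pages, func(i, j int) bool { return inLinks[pages[i]] > inLinks[pages[j]] })
	for _, p := range pages {
		fmt.Printf("%d,%s\n", inLinks[p], p)
	}

	// breadth-first search display, starting on the heaviest page
	start := pages[0]
	seen := map[string]bool{start: true}
	queue := []string{start}
	for len(queue) > 0 {
		page := queue[0]
		queue = queue[1:]
		fmt.Println(page)
		for _, next := range graph[page] {
			if !seen[next] {
				seen[next] = true
				queue = append(queue, next)
			}
		}
	}
}
```

The pseudo ranking is just an in-link count per page; the BFS then walks the graph from the highest-ranked page.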

The service respects ENV settings; it crawls one external request at a time and queues the remaining external ones (sketched below).
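One common way to get "one external request at a time, queue the rest" is a single worker goroutine fed by a buffered channel; a minimal sketch, with an assumed CRAWL_DELAY environment variable standing in for the real ENV settings:

```go
package main

import (
	"fmt"
	"net/http"
	"os"
	"time"
)

func main() {
	// read a delay from the environment (the variable name is illustrative)
	delay, err := time.ParseDuration(os.Getenv("CRAWL_DELAY"))
	if err != nil {
		delay = time.Second
	}

	urls := make(chan string, 100) // queued external requests

	// single worker: fetches one URL at a time, waiting between requests
	go func() {
		for u := range urls {
			resp, err := http.Get(u)
			if err == nil {
				resp.Body.Close()
				fmt.Println("fetched", u, resp.Status)
			}
			time.Sleep(delay)
		}
	}()

	urls <- "https://monzo.com/"
	time.Sleep(5 * time.Second) // crude wait, for demo purposes only
}
```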

On SIGTERM the service tries to wait for the crawl to end, so it can shut down gracefully.
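A typical Go pattern for that kind of graceful shutdown, sketched here with signal.NotifyContext (Go 1.16+); not the exact code in this repository:

```go
package main

import (
	"context"
	"fmt"
	"os/signal"
	"sync"
	"syscall"
	"time"
)

func main() {
	// cancel the context when SIGTERM (or SIGINT) arrives
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGTERM, syscall.SIGINT)
	defer stop()

	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		for {
			select {
			case <-ctx.Done():
				fmt.Println("signal received, finishing current crawl")
				return
			case <-time.After(time.Second):
				fmt.Println("crawling...") // stand-in for real work
			}
		}
	}()

	wg.Wait() // wait for the crawl loop to wind down
	fmt.Println("graceful shutdown complete")
}
```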

A TravisCI build is available (see badge), as well as containers.

Original

We'd like you to write a simple web crawler in a programming language of your choice. Feel free to either choose one you're very familiar with or, if you'd like to learn some Go, you can also make this your first Go program! The crawler should be limited to one domain - so when you start with https://monzo.com/, it would crawl all pages within monzo.com, but not follow external links, for example to the Facebook and Twitter accounts. Given a URL, it should print a simple site map, showing the links between pages.

Ideally, write it as you would a production piece of code. Bonus points for tests and making it as fast as possible!
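For reference, a single-domain breadth-first crawler along the lines the exercise asks for can be sketched in a few dozen lines using golang.org/x/net/html (an illustration, not this repository's implementation):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"

	"golang.org/x/net/html"
)

// links extracts href values from a parsed HTML document.
func links(doc *html.Node) []string {
	var out []string
	var walk func(*html.Node)
	walk = func(n *html.Node) {
		if n.Type == html.ElementNode && n.Data == "a" {
			for _, a := range n.Attr {
				if a.Key == "href" {
					out = append(out, a.Val)
				}
			}
		}
		for c := n.FirstChild; c != nil; c = c.NextSibling {
			walk(c)
		}
	}
	walk(doc)
	return out
}

func main() {
	start, _ := url.Parse("https://monzo.com/")
	seen := map[string]bool{start.String(): true}
	queue := []*url.URL{start}

	for len(queue) > 0 {
		u := queue[0]
		queue = queue[1:]
		resp, err := http.Get(u.String())
		if err != nil {
			continue
		}
		doc, err := html.Parse(resp.Body)
		resp.Body.Close()
		if err != nil {
			continue
		}
		for _, href := range links(doc) {
			next, err := u.Parse(href) // resolve relative links
			if err != nil || next.Host != start.Host {
				continue // stay on the starting domain, skip external links
			}
			next.Fragment = ""
			if !seen[next.String()] {
				seen[next.String()] = true
				fmt.Printf("%s -> %s\n", u, next) // simple site map: link between pages
				queue = append(queue, next)
			}
		}
	}
}
```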

Requirements

  • Simple Web Crawler
  • Optional language
  • Limited to one domain (open question: include subdomains when following links?); no external links such as Twitter or Facebook
  • Simple Site Map
  • Production ready
  • Tests

Approach

Summary of characteristics in Web Crawlers

  • Respect HTTP cache headers such as If-Modified-Since and Last-Modified
  • Identify the crawler via the User-Agent header
  • Respect robots.txt: at least a 1s delay between queries, and obey Crawl-Delay (see the sketch below)
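A sketch of what honouring those conventions looks like at the request level; the User-Agent string, timeout, and delay below are illustrative:

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// politeGet identifies the crawler and revalidates against a cached copy.
func politeGet(client *http.Client, url, lastModified string) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	// identify ourselves; the string here is an example
	req.Header.Set("User-Agent", "gocrawler/0.1 (+https://github.com/goncalopereira/gocrawler)")
	if lastModified != "" {
		req.Header.Set("If-Modified-Since", lastModified)
	}
	return client.Do(req)
}

func main() {
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := politeGet(client, "https://monzo.com/", "")
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status, "Last-Modified:", resp.Header.Get("Last-Modified"))
	time.Sleep(time.Second) // at least 1s between queries
}
```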

Some characteristics in other products

  • Depth limits
  • Partition URLs
  • Rotate User-Agent (bad practice)
  • Disable cookies (bad practice)
  • Download delay (mentioned in robots.txt)
  • Skip existing URLs
  • Normalise URLs
  • URL filters
  • Fetch only if required
  • Parse in batches
  • Data-source scoring
  • Stop when a maximum URL count is reached
  • Output
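Of those, normalising URLs and skipping existing ones matter most for not crawling the same page twice; a small sketch of both (the normalisation rules chosen below are illustrative):

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// normalise reduces a URL to a canonical form so the same page
// isn't crawled twice; the exact rules here are illustrative.
func normalise(raw string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	u.Scheme = strings.ToLower(u.Scheme)
	u.Host = strings.ToLower(u.Host)
	u.Fragment = ""                          // #section is the same page
	u.Path = strings.TrimSuffix(u.Path, "/") // /about/ == /about
	return u.String(), nil
}

func main() {
	seen := map[string]bool{}
	for _, raw := range []string{
		"https://Monzo.com/about/",
		"https://monzo.com/about#careers", // normalises to the same URL as above
		"https://monzo.com/blog",
	} {
		n, err := normalise(raw)
		if err != nil || seen[n] {
			continue // skip existing URLs
		}
		seen[n] = true
		fmt.Println("new:", n)
	}
}
```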

Some externally bound concerns (bottlenecks, failures)

  • Web requests to the site
  • DNS lookups
  • Writes and/or indexing
  • Storage model
  • Map creation
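Most of those externally bound failures surface as slow or hung requests, so an http.Client with explicit timeouts at each stage is a reasonable first defence; a sketch with illustrative values:

```go
package main

import (
	"fmt"
	"net"
	"net/http"
	"time"
)

func main() {
	// bound each stage of the request so a slow site, DNS lookup,
	// or TLS handshake cannot hang the crawler indefinitely
	client := &http.Client{
		Timeout: 15 * time.Second, // whole request, including body read
		Transport: &http.Transport{
			DialContext: (&net.Dialer{
				Timeout: 5 * time.Second, // TCP connect (covers DNS too)
			}).DialContext,
			TLSHandshakeTimeout:   5 * time.Second,
			ResponseHeaderTimeout: 10 * time.Second,
		},
	}

	resp, err := client.Get("https://monzo.com/")
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```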

Some Golang info I looked at whilst developing this

Decisions for Production and Testing
