crawler

command module
v0.0.0-...-b28cec7
Published: Jul 14, 2020 License: MIT Imports: 10 Imported by: 0

README

crawler

A script that crawls HTML pages and follows their href links, indexing roughly 5k sites per second into a big-data graph database.
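
As a rough illustration of the link-extraction step, the sketch below pulls href values out of a page body and resolves them against the page URL. It is not taken from the crawler's source: the function name `extractLinks` is hypothetical, and the regular expression is a deliberate simplification where a real crawler would more likely use an HTML parser.

```go
package main

import (
	"fmt"
	"net/url"
	"regexp"
)

// hrefPattern matches href="..." and href='...' attributes.
// Assumption: a regex is good enough for a sketch; it is not a full HTML parser.
var hrefPattern = regexp.MustCompile(`(?i)href\s*=\s*["']([^"']+)["']`)

// extractLinks (hypothetical helper) returns the absolute URLs of every href
// found in body, resolved against pageURL and de-duplicated in order of first
// appearance.
func extractLinks(pageURL, body string) ([]string, error) {
	base, err := url.Parse(pageURL)
	if err != nil {
		return nil, fmt.Errorf("parse page URL: %w", err)
	}
	seen := make(map[string]bool)
	var links []string
	for _, match := range hrefPattern.FindAllStringSubmatch(body, -1) {
		ref, err := url.Parse(match[1])
		if err != nil {
			continue // skip malformed hrefs
		}
		abs := base.ResolveReference(ref)
		abs.Fragment = "" // "#section" anchors point at the same page
		link := abs.String()
		if !seen[link] {
			seen[link] = true
			links = append(links, link)
		}
	}
	return links, nil
}

func main() {
	page := `<a href="/wiki/Mozzarella">Mozzarella</a> <a href="/wiki/Mozzarella#History">History</a>`
	links, err := extractLinks("https://en.wikipedia.org/wiki/String_cheese", page)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, link := range links {
		fmt.Println(link)
	}
}
```

Each returned link would then become an edge from the current page to the linked page in the graph database.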

Build it

Binary
go get -u github.com/dgoldstein1/crawler
Docker
docker build . -t dgoldstein1/crawler

Run it

docker-compose up -d

Or, with dependencies running locally:

# run crawl on wikipedia
export GRAPH_DB_ENDPOINT="https://graphapi-twowaykv-dev.herokuapp.com/services/biggraph" # endpoint of graph database
export TWO_WAY_KV_ENDPOINT="https://graphapi-twowaykv-dev.herokuapp.com/services/twowaykv" # endpoint of k:v <-> v:k lookup metadata db
export STARTING_ENDPOINT="https://en.wikipedia.org/wiki/String_cheese" # if empty, finds random article
export PARALLELISM=20 # number of parallel threads to run
export MS_DELAY=5 # ms delay between each request
# export METRICS_PORT=8002 # port where prom metrics are served
export MAX_APPROX_NODES=1000 # approximate number of nodes to visit (+/- one order of magnitude), set to '-1' for unlimited crawl
export PORT=8888
export ENGLISH_WORD_LIST_PATH="/home/david/go/src/github.com/dgoldstein1/crawler/synonyms/english.txt"
build/crawler wikipedia
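
To make the settings above concrete, here is a minimal sketch of how a Go program might read them. The `Config` struct, its field names, and `loadConfig` are hypothetical and are not the crawler's actual code; treating the two endpoints as required is also an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// Config mirrors the environment variables listed above.
// Hypothetical type: the real crawler may organize its settings differently.
type Config struct {
	GraphDBEndpoint  string
	TwoWayKVEndpoint string
	StartingEndpoint string        // empty means "find a random article"
	Parallelism      int           // PARALLELISM
	Delay            time.Duration // MS_DELAY, converted from milliseconds
	MaxApproxNodes   int           // -1 means unlimited crawl
}

// loadConfig reads settings through getenv, so tests can supply a fake
// environment instead of mutating the real one.
func loadConfig(getenv func(string) string) (Config, error) {
	cfg := Config{
		GraphDBEndpoint:  getenv("GRAPH_DB_ENDPOINT"),
		TwoWayKVEndpoint: getenv("TWO_WAY_KV_ENDPOINT"),
		StartingEndpoint: getenv("STARTING_ENDPOINT"),
	}
	// Assumption: both database endpoints are mandatory.
	if cfg.GraphDBEndpoint == "" || cfg.TwoWayKVEndpoint == "" {
		return cfg, fmt.Errorf("GRAPH_DB_ENDPOINT and TWO_WAY_KV_ENDPOINT must be set")
	}
	var err error
	if cfg.Parallelism, err = strconv.Atoi(getenv("PARALLELISM")); err != nil {
		return cfg, fmt.Errorf("PARALLELISM: %w", err)
	}
	delayMS, err := strconv.Atoi(getenv("MS_DELAY"))
	if err != nil {
		return cfg, fmt.Errorf("MS_DELAY: %w", err)
	}
	cfg.Delay = time.Duration(delayMS) * time.Millisecond
	if cfg.MaxApproxNodes, err = strconv.Atoi(getenv("MAX_APPROX_NODES")); err != nil {
		return cfg, fmt.Errorf("MAX_APPROX_NODES: %w", err)
	}
	return cfg, nil
}

func main() {
	cfg, err := loadConfig(os.Getenv)
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", cfg)
}
```

Passing `os.Getenv` as a function value keeps the parsing logic independent of the process environment, which makes a missing or malformed variable easy to test for.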

Development

Local Development
./watch_dev_changes.sh
Testing
go test $(go list ./... | grep -v /vendor/)
Benchmarks
Parallelism   Nodes Added   Time (s)   Nodes / Sec   Delay
1             90055         28.9       3116.1        5ms
2             119649        29.2       4097.5        5ms
4             118064        22.5       5158.4        5ms
8             328674        29.2       11255.9       5ms
16            342114        29.0       11797.0       5ms
32            364773        28.2       12935.2       5ms

Time to get to 1001007 nodes: 3m18.5s
Nodes / Sec: 5055.5
Size of graph: 644kb
Size of entries: 32mb
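
The parallelism and delay figures in the benchmarks presumably correspond to the PARALLELISM and MS_DELAY settings. The sketch below shows one way such a bounded-parallel crawl could be structured: a breadth-first traversal in which at most `parallelism` fetches run at once and each fetch is preceded by a fixed delay. It is an illustration under assumptions, not the crawler's implementation: `crawl` and the `fetch` callback are hypothetical names, and the node limit here is a hard cap, whereas the README describes MAX_APPROX_NODES as approximate.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// crawl (hypothetical) visits pages breadth-first starting from start and
// returns how many distinct nodes were discovered. fetch returns the links
// found on a page. maxNodes < 0 means unlimited.
func crawl(start string, parallelism int, delay time.Duration, maxNodes int, fetch func(string) []string) int {
	if parallelism < 1 {
		parallelism = 1 // a zero-capacity semaphore would block forever
	}
	visited := map[string]bool{start: true}
	frontier := []string{start}
	var mu sync.Mutex // guards visited and next

	for len(frontier) > 0 {
		var next []string
		sem := make(chan struct{}, parallelism) // limits concurrent fetches
		var wg sync.WaitGroup
		for _, page := range frontier {
			wg.Add(1)
			sem <- struct{}{} // blocks while parallelism fetches are in flight
			go func(page string) {
				defer wg.Done()
				defer func() { <-sem }()
				time.Sleep(delay) // per-request delay, as with MS_DELAY
				links := fetch(page)
				mu.Lock()
				defer mu.Unlock()
				for _, link := range links {
					if maxNodes >= 0 && len(visited) >= maxNodes {
						return // node budget exhausted
					}
					if !visited[link] {
						visited[link] = true
						next = append(next, link)
					}
				}
			}(page)
		}
		wg.Wait() // finish the whole level before starting the next one
		frontier = next
	}
	return len(visited)
}

func main() {
	// A tiny in-memory "web" stands in for real HTTP requests.
	graph := map[string][]string{
		"a": {"b", "c"},
		"b": {"c", "d"},
		"c": {"a"},
	}
	fetch := func(page string) []string { return graph[page] }
	fmt.Println("nodes visited:", crawl("a", 4, 5*time.Millisecond, -1, fetch))
}
```

Dividing the returned node count by the elapsed wall-clock time gives a nodes-per-second figure of the kind reported in the table.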

Authors

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

Documentation

There is no documentation for this package.
