crawler

command module

v0.0.0-...-b28cec7 Published: Jul 14, 2020 License: MIT Imports: 10 Imported by: 0

Repository

github.com/dgoldstein1/crawler

README

crawler

Script that crawls HTML pages and adds their href links to a big-data graph DB, crawling and indexing about 5k sites / second.

Build it

Binary

go get -u github.com/dgoldstein1/crawler

Docker

docker build . -t dgoldstein1/crawler

Run it

docker-compose up -d

Or, with dependencies running locally:

# run crawl on wikipedia
export GRAPH_DB_ENDPOINT="https://graphapi-twowaykv-dev.herokuapp.com/services/biggraph" # endpoint of graph database
export TWO_WAY_KV_ENDPOINT="https://graphapi-twowaykv-dev.herokuapp.com/services/twowaykv" # endpoint of k:v <-> v:k lookup metadata db
export STARTING_ENDPOINT="https://en.wikipedia.org/wiki/String_cheese" # if empty, finds random article
export PARALLELISM=20 # number of parallel threads to run
export MS_DELAY=5 # ms delay between each request
# export METRICS_PORT=8002 # port where prom metrics are served
export MAX_APPROX_NODES=1000 # approximate number of nodes to visit (+/- one order of magnitude), set to '-1' for unlimited crawl
export PORT=8888
export ENGLISH_WORD_LIST_PATH="/home/david/go/src/github.com/dgoldstein1/crawler/synonyms/english.txt"
build/crawler wikipedia

Development

Local Development

Install inotifywait, then run:

./watch_dev_changes.sh

Testing

go test $(go list ./... | grep -v /vendor/)

Benchmarks

Parallelism	Nodes Added	Time (s)	Nodes / Sec	Delay
1	90055	28.9	3116.1	5ms
2	119649	29.2	4097.5	5ms
4	118064	22.5	5158.4	5ms
8	328674	29.2	11255.9	5ms
16	342114	29.0	11797.0	5ms
32	364773	28.2	12935.2	5ms

Time to get to 1001007 nodes: 3m18.5s
Nodes / Sec: 5055.5
Size of graph: 644kb
Size of entries: 32mb

Authors

David Goldstein - DavidCharlesGoldstein.com - Decipher Technology Studios

License

This project is licensed under the MIT License; see the LICENSE.md file for details.

Source Files

main.go

Directories

ar_synonyms
counties
crawler
db
synonyms
util
wikipedia
