gospider

command module
v0.0.0-...-04f6dc1 Latest
Published: Oct 16, 2017 License: MIT Imports: 1 Imported by: 0

gospider is a concurrent web spider. By default, it respects robots.txt entries.

Usage

make install

gospider is designed to be used either from the CLI or in code.

CLI

Run gospider from the CLI with:

gospider start -r "http://foo.bar/" > out.html

By default, gospider writes an HTML sitemap to stdout.

Use gospider --help for more options.

Code

The spider.New function follows the functional options pattern. The only required parameter is the root URL; all others default to sensible values if not supplied.

uri, err := url.Parse("http://foo.bar/")
if err != nil {
  log.Fatal("error parsing root URL: ", err)
}

// Named s so the variable does not shadow the spider package.
s := spider.New(
  spider.WithRoot(uri),
  spider.WithConcurrency(5),
  spider.WithTimeout(time.Second * 2),
)

if err := s.Run(); err != nil {
  log.Fatal("error running spider: ", err)
}
return s.Report(os.Stdout)
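
For readers unfamiliar with the functional options pattern, the following is a minimal, self-contained sketch of how such a constructor could be built. The Spider struct, its fields, and the default values are assumptions made for illustration; they are not taken from gospider's source.

```go
package main

import (
	"fmt"
	"net/url"
	"time"
)

// Spider is a hypothetical stand-in for gospider's spider type.
type Spider struct {
	root        *url.URL
	concurrency int
	timeout     time.Duration
}

// Option mutates a Spider during construction.
type Option func(*Spider)

// WithRoot sets the URL the crawl starts from.
func WithRoot(u *url.URL) Option { return func(s *Spider) { s.root = u } }

// WithConcurrency sets the number of workers.
func WithConcurrency(n int) Option { return func(s *Spider) { s.concurrency = n } }

// WithTimeout sets the per-request timeout.
func WithTimeout(d time.Duration) Option { return func(s *Spider) { s.timeout = d } }

// New fills in defaults first, then applies each option in order,
// so any option the caller omits keeps its default value.
func New(opts ...Option) *Spider {
	s := &Spider{concurrency: 1, timeout: 5 * time.Second} // assumed defaults
	for _, opt := range opts {
		opt(s)
	}
	return s
}

func main() {
	root, err := url.Parse("http://foo.bar/")
	if err != nil {
		panic(err)
	}
	// No WithTimeout here, so the assumed default applies.
	s := New(WithRoot(root), WithConcurrency(5))
	fmt.Println(s.root, s.concurrency, s.timeout)
}
```

Because each option is an ordinary function, new settings can be added later without changing the signature of New or breaking existing callers.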

Modularity

gospider ships with a simple HTML reporter and uses the default HTTP client to make requests. However, any requester or reporter can be used by supplying a struct which implements the Requester or Reporter interface. For example, to make requests through a proxy you could do:

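// Requires the context, io, net/http and net/url packages.
type proxyRequester struct {
  client *http.Client
}

func (r *proxyRequester) Request(ctx context.Context, uri *url.URL) ([]byte, error) {
  // Build the request with ctx so cancellation and timeouts propagate.
  req, err := http.NewRequestWithContext(ctx, http.MethodGet, uri.String(), nil)
  if err != nil {
    return nil, err
  }
  res, err := r.client.Do(req)
  if err != nil {
    return nil, err
  }
  defer res.Body.Close()
  return io.ReadAll(res.Body)
}

s := spider.New(
  spider.WithRoot(...),
  spider.WithRequester(&proxyRequester{
    client: &http.Client{
      Transport: &http.Transport{Proxy: http.ProxyURL(...)},
    },
  }),
)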
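
The same mechanism can be used to crawl without touching the network, for example in tests. The sketch below is self-contained: it declares a local Requester interface that mirrors the method signature shown above (the actual interface definition in gospider is assumed, not quoted), and staticRequester and fetch are hypothetical names introduced for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"net/url"
)

// Requester mirrors the Request method shown in the README.
// The real interface in gospider is assumed to look like this.
type Requester interface {
	Request(ctx context.Context, uri *url.URL) ([]byte, error)
}

// staticRequester is a hypothetical requester that serves pages from memory.
type staticRequester struct {
	pages map[string][]byte
}

func (r *staticRequester) Request(ctx context.Context, uri *url.URL) ([]byte, error) {
	// Honour cancellation before doing any work.
	if err := ctx.Err(); err != nil {
		return nil, err
	}
	body, ok := r.pages[uri.String()]
	if !ok {
		return nil, errors.New("page not found: " + uri.String())
	}
	return body, nil
}

// fetch is a hypothetical helper that depends only on the interface,
// so any Requester implementation can be passed in.
func fetch(r Requester, raw string) (string, error) {
	uri, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	body, err := r.Request(context.Background(), uri)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	r := &staticRequester{pages: map[string][]byte{
		"http://foo.bar/": []byte(`<a href="/about">About</a>`),
	}}
	body, err := fetch(r, "http://foo.bar/")
	fmt.Println(body, err)
}
```

Since the caller only sees the interface, swapping the in-memory requester for the proxy-backed one requires no change to the calling code.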

Concurrency

gospider uses a worker pool concurrency model. As URLs are found, they are added to a queue. The number of workers is set by the concurrency parameter, and each worker polls the queue for work. Once the queue is empty, the worker pool is drained and the spider stops.
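
The model described above can be sketched as follows. This is not gospider's implementation (which lives in internal/concurrency); it is a minimal illustration that assumes an in-memory link graph in place of real HTTP fetches, and the crawl function and its parameters are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// crawl visits every page reachable from root using a fixed number of workers.
// links is a hypothetical in-memory link graph standing in for real fetches.
func crawl(root string, links map[string][]string, workers int) []string {
	if workers < 1 {
		workers = 1
	}
	queue := make(chan string)
	var pending sync.WaitGroup // URLs queued but not yet fully processed

	var mu sync.Mutex
	seen := map[string]bool{}

	// enqueue records a URL once and hands it to the queue
	// from a new goroutine so the calling worker never blocks.
	enqueue := func(u string) {
		mu.Lock()
		dup := seen[u]
		seen[u] = true
		mu.Unlock()
		if dup {
			return
		}
		pending.Add(1)
		go func() { queue <- u }()
	}

	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for u := range queue { // each worker polls the shared queue
				for _, next := range links[u] {
					enqueue(next)
				}
				pending.Done()
			}
		}()
	}

	enqueue(root)
	pending.Wait() // nothing is queued and no worker is mid-page
	close(queue)   // drain the pool: workers leave their range loops
	wg.Wait()

	visited := make([]string, 0, len(seen))
	for u := range seen {
		visited = append(visited, u)
	}
	sort.Strings(visited)
	return visited
}

func main() {
	links := map[string][]string{
		"/":  {"/a", "/b"},
		"/a": {"/b", "/c"},
		"/b": {"/"},
	}
	fmt.Println(crawl("/", links, 3))
}
```

The pending counter is what lets the pool know when to stop: an empty queue alone is not enough, because a worker that is still processing a page may yet add more URLs.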

Directories

Path Synopsis
internal/concurrency
Package concurrency provides common concurrency patterns and utilities.