webpage-archiver

Version: v0.0.0-...-02e7294 (Latest)
Published: Nov 30, 2022 License: MIT


Capture and archive webpages to WARC files. Available both as a command-line tool and as a Go library.

Features:

  • Customizable output
    • WARC
    • Single file support via Obelisk
  • Screenshot support
  • Archives using a headless Chrome instance
    • Will automatically download a compatible headless browser

Capturing pages

Store WARC files in a specific directory:

webpage-archiver --output directory/ urlToArchive

To archive as a single file instead:

webpage-archiver --output fileOrDirectory --single-file urlToArchive

To also store a screenshot of each page, add --screenshot:

webpage-archiver --output directory/ --screenshot urlToArchive

Multiple URLs can be captured to the same archive:

webpage-archiver --output directory/ urlToArchive anotherUrlToArchive
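
When the tool is driven from a script, the invocations above can be assembled programmatically. The following is a minimal sketch that only builds the argument list from the flags shown in this section. The helper name buildArgs and the example URLs are hypothetical, and whether --single-file and --screenshot may be combined is an assumption this README does not confirm.

```go
package main

import (
	"fmt"
	"strings"
)

// buildArgs is a hypothetical helper that assembles command-line arguments
// for webpage-archiver from the flags documented above.
func buildArgs(output string, singleFile, screenshot bool, urls ...string) []string {
	args := []string{"--output", output}
	if singleFile {
		args = append(args, "--single-file")
	}
	if screenshot {
		args = append(args, "--screenshot")
	}
	// URLs come last; several URLs end up in the same archive.
	return append(args, urls...)
}

func main() {
	args := buildArgs("directory/", false, true,
		"https://example.com", "https://example.org")
	// Print the command line; it could instead be run with
	// exec.Command("webpage-archiver", args...) from os/exec.
	fmt.Println("webpage-archiver " + strings.Join(args, " "))
}
```

Keeping the argument construction in a separate function makes it possible to check the command line without launching a browser.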

Viewing pages

WARC files captured with this tool need to be replayed to be viewed. The easiest way to replay a capture is to use a tool like ReplayWeb.page.

Using as a Go library

go get github.com/aholstenson/webpage-archiver

Create an archiver instance to start capturing pages:

archiver, err := archiver.NewArchiver(ctx)

Capture pages using Capture:

output, err := warc.NewOutput(warc.WithDirectory(directory))

// err is already declared above, so assign with = instead of :=
err = archiver.Capture(ctx, url, output)

output.Close()

Close the archiver when it's no longer needed:

archiver.Close()

Tracking progress

Archiver can take an optional progress reporter that will be used to log actions, requests and responses:

reporter := progress.NewConsoleReporter()
archiver, err := archiver.NewArchiver(ctx, archiver.WithProgress(reporter))

To use a progress reporter for a specific capture:

err := archiver.Capture(ctx, url, output, archiver.WithProgress(reporter))

User agents

The user agent can be specified via WithUserAgent:

archiver.WithUserAgent("Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible) Safari/537.36")

This option can be applied both to NewArchiver and to Archiver.Capture.
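
Options that work on both the constructor and a single call are commonly built with Go's functional options pattern. The sketch below illustrates that pattern using only the standard library. It is not this library's implementation: the config type and the resolve function are hypothetical, and the rule that a per-capture option overrides the archiver-wide one is an assumption that this README does not state.

```go
package main

import "fmt"

// config is a hypothetical settings struct, used only for illustration.
type config struct {
	userAgent string
}

// Option mutates a config; WithUserAgent mirrors the option name above.
type Option func(*config)

// WithUserAgent returns an Option that sets the user agent.
func WithUserAgent(ua string) Option {
	return func(c *config) { c.userAgent = ua }
}

// resolve copies the defaults and applies the given options on top,
// so later options win over earlier ones and over the defaults.
func resolve(defaults config, opts ...Option) config {
	c := defaults // struct copy; the defaults stay untouched
	for _, opt := range opts {
		opt(&c)
	}
	return c
}

func main() {
	// Options given at construction time become the defaults.
	defaults := resolve(config{}, WithUserAgent("default-agent"))
	// Options given for one capture are layered on top (assumed precedence).
	perCapture := resolve(defaults, WithUserAgent("capture-agent"))

	fmt.Println(defaults.userAgent)   // prints "default-agent"
	fmt.Println(perCapture.userAgent) // prints "capture-agent"
}
```

Because resolve works on a copy, an option passed for one capture does not leak into later captures that use the same defaults.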

Screenshots

The WithScreenshot option can be passed to Capture to receive a screenshot of the page as it looks just before archiving ends:

archiver.Capture(ctx, url, output, archiver.WithScreenshot(func(data []byte) error {
  // Handle screenshot data here
  return nil
}))
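
A callback with the signature shown above, func(data []byte) error, could for instance write the bytes to disk. The following is a minimal sketch using only the standard library. The helper name screenshotSaver and the target path are hypothetical, and the .png extension is an assumption, since this README does not state the image format.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// screenshotSaver is a hypothetical helper. It returns a callback matching
// func(data []byte) error that writes the received bytes to path, creating
// parent directories as needed.
func screenshotSaver(path string) func(data []byte) error {
	return func(data []byte) error {
		if len(data) == 0 {
			return fmt.Errorf("empty screenshot for %s", path)
		}
		if err := os.MkdirAll(filepath.Dir(path), 0o755); err != nil {
			return err
		}
		return os.WriteFile(path, data, 0o644)
	}
}

func main() {
	path := filepath.Join(os.TempDir(), "webpage-archiver-demo", "page.png")
	save := screenshotSaver(path)

	// Placeholder bytes stand in for real image data in this demo.
	if err := save([]byte("placeholder")); err != nil {
		fmt.Println("saving failed:", err)
		return
	}
	fmt.Println("screenshot saved")
}
```

The returned function could then be passed as the argument to WithScreenshot. Returning an error from the callback lets the caller decide how a failed write should be handled.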
