linkstorage

package
v0.1.0
Warning

This package is not in the latest version of its module.

Published: Mar 21, 2021 License: AGPL-3.0 Imports: 11 Imported by: 0

Documentation


Constants

This section is empty.

Variables

This section is empty.

Functions

func ReplaceSQL

func ReplaceSQL(old, searchPattern string) string

ReplaceSQL replaces each occurrence of searchPattern in old with an increasing $n based sequence
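
To make the placeholder rewriting concrete, here is a minimal sketch of how such a function could behave. The name replaceSQL, the choice to start numbering at $1, and the links table in the demo are assumptions for illustration, not the package's actual implementation.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// replaceSQL is a hypothetical stand-in for ReplaceSQL: it swaps each
// occurrence of searchPattern in old, left to right, for $1, $2, ...
func replaceSQL(old, searchPattern string) string {
	if searchPattern == "" {
		return old // nothing sensible to replace
	}
	var b strings.Builder
	n := 0
	for {
		i := strings.Index(old, searchPattern)
		if i < 0 {
			break
		}
		n++
		b.WriteString(old[:i])
		b.WriteString("$" + strconv.Itoa(n))
		old = old[i+len(searchPattern):]
	}
	b.WriteString(old) // append whatever follows the last match
	return b.String()
}

func main() {
	// The table and column names are made up for the demo.
	q := replaceSQL("INSERT INTO links (from_u, to_u) VALUES (?, ?), (?, ?)", "?")
	fmt.Println(q)
}
```

Numbered placeholders are what PostgreSQL expects in prepared statements, which is presumably why a multi-row insert written with ? markers would be passed through a helper like this.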

Types

type Link

type Link struct {
	FromU    *url.URL
	ToU      *url.URL
	LinkText string
}

Link is a link object

type LinkBatcher

type LinkBatcher struct {
	// contains filtered or unexported fields
}

LinkBatcher is a simple batching system for recording links to the db

func NewLinkBatcher

func NewLinkBatcher(maxBatch int, s *Storage) *LinkBatcher

NewLinkBatcher is a helper function for constructing a LinkBatcher object

func (*LinkBatcher) AddLink

func (lb *LinkBatcher) AddLink(link *Link) error

AddLink is a lightweight function to just whack that link into the channel

func (*LinkBatcher) KillWorkers

func (lb *LinkBatcher) KillWorkers()

KillWorkers simply kills all previously spawned workers

func (*LinkBatcher) ResilientBatchAddLinks

func (lb *LinkBatcher) ResilientBatchAddLinks(links []*Link) error

ResilientBatchAddLinks retries the insert with progressively smaller batch sizes until it eventually works :shrug:
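
The shrinking retry can be sketched as follows. This is only one plausible strategy (halving the batch on failure, down to single items); the names resilientInsert and cappedInsert are hypothetical, and the package may shrink its batches differently.

```go
package main

import (
	"errors"
	"fmt"
)

// resilientInsert tries to insert the whole batch. If that fails, it splits
// the batch in half and retries each half, down to single items. Both halves
// are always attempted; the first error encountered is returned.
func resilientInsert(items []string, insert func([]string) error) error {
	if len(items) == 0 {
		return nil
	}
	err := insert(items)
	if err == nil {
		return nil
	}
	if len(items) == 1 {
		return fmt.Errorf("insert %q: %w", items[0], err)
	}
	mid := len(items) / 2
	errA := resilientInsert(items[:mid], insert)
	errB := resilientInsert(items[mid:], insert)
	if errA != nil {
		return errA
	}
	return errB
}

// cappedInsert fakes a database that rejects batches larger than limit and
// records every accepted item in *stored.
func cappedInsert(limit int, stored *[]string) func([]string) error {
	return func(batch []string) error {
		if len(batch) > limit {
			return errors.New("batch too large")
		}
		*stored = append(*stored, batch...)
		return nil
	}
}

func main() {
	var stored []string
	err := resilientInsert([]string{"a", "b", "c", "d", "e"}, cappedInsert(2, &stored))
	fmt.Println(stored, err)
}
```

With a cap of two rows per batch, the five items end up stored through three smaller inserts and the call returns no error.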

func (*LinkBatcher) SpawnWorkers

func (lb *LinkBatcher) SpawnWorkers(nWorkers int)

SpawnWorkers spawns nWorkers workers. Stop them with KillWorkers

func (*LinkBatcher) WaitUntilEmpty

func (lb *LinkBatcher) WaitUntilEmpty() <-chan bool

WaitUntilEmpty returns a channel that receives a value once the buffered channel is empty.
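
One way such a notification channel could work is by polling the length of the buffered channel. The sketch below assumes exactly that; waitUntilEmpty, its poll parameter, and the string-typed queue are hypothetical names, not the package's internals.

```go
package main

import (
	"fmt"
	"time"
)

// waitUntilEmpty returns a channel that receives true once queue holds no
// buffered items. It polls len(queue) every poll interval.
func waitUntilEmpty(queue chan string, poll time.Duration) <-chan bool {
	done := make(chan bool, 1) // buffered so the goroutine never leaks on send
	go func() {
		for len(queue) > 0 {
			time.Sleep(poll)
		}
		done <- true
	}()
	return done
}

func main() {
	queue := make(chan string, 4)
	queue <- "https://example.com/a"
	queue <- "https://example.com/b"

	// A consumer drains the queue in the background.
	go func() {
		for range queue {
		}
	}()

	<-waitUntilEmpty(queue, time.Millisecond)
	fmt.Println("buffered items left:", len(queue))
}
```

Note that an empty buffer only means every item has been picked up, not that a worker has finished writing its last batch, so a caller would likely still stop the workers before closing the storage.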

func (*LinkBatcher) Worker

func (lb *LinkBatcher) Worker(endSignal <-chan bool, doneChan chan<- bool)

Worker is the worker process for the link batcher. This is straight up nicked from https://blog.drkaka.com/batch-get-from-golangs-buffered-channel-9638573f0c6e
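
The general batch-get pattern for a buffered channel looks roughly like the sketch below: block for the first item, then greedily take whatever else is already buffered, up to a maximum batch size. The names batchWorker, flush, and runDemo are hypothetical, and the real Worker may differ in its details.

```go
package main

import "fmt"

// batchWorker groups items from in into batches of at most maxBatch and hands
// each batch to flush. When endSignal fires it reports on doneChan and exits.
// Closing in is not handled in this sketch.
func batchWorker(in <-chan string, maxBatch int, flush func([]string),
	endSignal <-chan bool, doneChan chan<- bool) {
	for {
		select {
		case <-endSignal:
			doneChan <- true
			return
		case first := <-in:
			batch := []string{first}
		fill:
			for len(batch) < maxBatch {
				select {
				case item := <-in:
					batch = append(batch, item)
				default:
					break fill // nothing buffered right now; flush what we have
				}
			}
			flush(batch)
		}
	}
}

// runDemo pre-fills a queue with five items and returns the batch sizes seen.
func runDemo() []int {
	in := make(chan string, 8)
	for _, u := range []string{"a", "b", "c", "d", "e"} {
		in <- u
	}
	sizes := make(chan int, 8)
	endSignal := make(chan bool)
	doneChan := make(chan bool)
	go batchWorker(in, 2, func(b []string) { sizes <- len(b) }, endSignal, doneChan)

	var got []int
	for total := 0; total < 5; {
		n := <-sizes
		got = append(got, n)
		total += n
	}
	endSignal <- true // ask the worker to stop once everything is flushed
	<-doneChan
	return got
}

func main() {
	fmt.Println(runDemo()) // prints [2 2 1]
}
```

The non-blocking inner select is what keeps a worker from waiting for a full batch when traffic is slow: a partial batch is flushed as soon as the channel runs dry.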

type Page

type Page struct {
	U *url.URL
}

Page is a page object

type PageBatcher

type PageBatcher struct {
	Cache *lru.Cache
	// contains filtered or unexported fields
}

PageBatcher is a simple batching system for recording pages to the db

func NewPageBatcher

func NewPageBatcher(maxBatch int, s *Storage) (*PageBatcher, error)

NewPageBatcher is a helper function for constructing a PageBatcher object

func (*PageBatcher) AddPage

func (pb *PageBatcher) AddPage(page *Page) bool

AddPage is a lightweight function to just whack that page into the channel. Returns true if it added the page (hadn't been added previously)
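
The "only add it once" behaviour can be illustrated with a seen-set in front of a buffered channel. This sketch uses a mutex-protected map in place of the LRU cache exposed as Cache, purely to stay self-contained; pageQueue, newPageQueue, and addPage are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// pageQueue is a hypothetical stand-in for PageBatcher: a set of URLs already
// seen, plus a buffered channel that workers would read from.
type pageQueue struct {
	mu   sync.Mutex
	seen map[string]struct{}
	ch   chan string
}

func newPageQueue(size int) *pageQueue {
	return &pageQueue{
		seen: make(map[string]struct{}),
		ch:   make(chan string, size),
	}
}

// addPage enqueues url and returns true only the first time url is seen.
func (q *pageQueue) addPage(url string) bool {
	q.mu.Lock()
	if _, ok := q.seen[url]; ok {
		q.mu.Unlock()
		return false
	}
	q.seen[url] = struct{}{}
	q.mu.Unlock()

	q.ch <- url // blocks if the buffer is full
	return true
}

func main() {
	q := newPageQueue(4)
	fmt.Println(q.addPage("https://example.com/")) // true
	fmt.Println(q.addPage("https://example.com/")) // false
}
```

An unbounded map grows with every distinct URL, which is a likely reason to prefer a bounded LRU cache in a long-running crawl, at the cost of occasionally re-adding a page that has been evicted.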

func (*PageBatcher) KillWorkers

func (pb *PageBatcher) KillWorkers()

KillWorkers simply kills all previously spawned workers

func (*PageBatcher) SpawnWorkers

func (pb *PageBatcher) SpawnWorkers(nWorkers int)

SpawnWorkers spawns nWorkers workers. Stop them with KillWorkers

func (*PageBatcher) WaitUntilEmpty

func (pb *PageBatcher) WaitUntilEmpty() <-chan bool

WaitUntilEmpty returns a channel that receives a value once the buffered channel is empty.

func (*PageBatcher) Worker

func (pb *PageBatcher) Worker(endSignal <-chan bool, doneChan chan<- bool)

Worker is the worker process for the page batcher. This is straight up nicked from https://blog.drkaka.com/batch-get-from-golangs-buffered-channel-9638573f0c6e

type Storage

type Storage struct {
	URI       string
	PageTable string
	LinkTable string
	// contains filtered or unexported fields
}

Storage implements a PostgreSQL storage backend for colly

func NewStorage

func NewStorage(
	uri string,
	pageTable string,
	linkTable string,
) (*Storage, error)

NewStorage is a wrapper for easily creating a storage object.

func (*Storage) AddLink

func (s *Storage) AddLink(link *Link) error

AddLink first checks that the link does not already exist, and then inserts it

func (*Storage) AddPage

func (s *Storage) AddPage(page *Page) error

AddPage first checks that the page does not already exist, and then inserts it

func (*Storage) BatchAddLinks

func (s *Storage) BatchAddLinks(links []*Link) error

BatchAddLinks takes a batch of links and inserts them without caring whether or not they clash

func (*Storage) BatchAddPages

func (s *Storage) BatchAddPages(pages []*Page) error

BatchAddPages takes a batch of pages and inserts them without caring whether or not they clash
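
A batch insert that tolerates clashes is commonly written in PostgreSQL as a single multi-row INSERT with ON CONFLICT DO NOTHING. The sketch below only builds such a statement; buildBatchInsert, the pages table, and the hash and url columns are assumptions for illustration, and the package's real query and schema are not shown in this documentation.

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert returns a multi-row INSERT for nRows rows of cols, using
// $1, $2, ... placeholders and skipping rows that clash with existing ones.
// table and cols are interpolated directly, so they must come from trusted
// configuration, never from user input.
func buildBatchInsert(table string, cols []string, nRows int) string {
	rows := make([]string, 0, nRows)
	n := 1
	for r := 0; r < nRows; r++ {
		ph := make([]string, len(cols))
		for c := range cols {
			ph[c] = fmt.Sprintf("$%d", n)
			n++
		}
		rows = append(rows, "("+strings.Join(ph, ", ")+")")
	}
	return fmt.Sprintf("INSERT INTO %s (%s) VALUES %s ON CONFLICT DO NOTHING",
		table, strings.Join(cols, ", "), strings.Join(rows, ", "))
}

func main() {
	// Hypothetical table and column names.
	fmt.Println(buildBatchInsert("pages", []string{"hash", "url"}, 2))
}
```

The row values would then be passed as a flat argument list in the same order as the placeholders.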

func (*Storage) CheckLinkExists

func (s *Storage) CheckLinkExists(fromU *url.URL, toU *url.URL) (bool, error)

CheckLinkExists checks whether the link exists in the visited database

func (*Storage) CheckPageExists

func (s *Storage) CheckPageExists(u *url.URL) (bool, error)

CheckPageExists checks whether the page exists in the visited database

func (*Storage) Close

func (s *Storage) Close() error

Close closes connections.

func (*Storage) CountLinks

func (s *Storage) CountLinks() (int, error)

CountLinks retrieves an estimate of the number of links scraped.

func (*Storage) CountPages

func (s *Storage) CountPages() (int, error)

CountPages retrieves an estimate of the number of pages scraped.

func (*Storage) GetLinksFrom

func (s *Storage) GetLinksFrom(pageHash string, limit int) ([]string, error)

GetLinksFrom retrieves the links from this page hash.

func (*Storage) GetLinksTo

func (s *Storage) GetLinksTo(pageHash string, limit int) ([]string, error)

GetLinksTo retrieves the links to this page hash.

func (*Storage) GetPage

func (s *Storage) GetPage(pageHash string) (*Page, error)

GetPage retrieves info about the page hash if it exists.

func (*Storage) GetPageHashesFromHost

func (s *Storage) GetPageHashesFromHost(host string, limit int) ([]string, error)

GetPageHashesFromHost retrieves the page hashes of all pages with this host.

func (*Storage) Init

func (s *Storage) Init() error

Init initializes the PostgreSQL storage

func (*Storage) KeepPingingOn

func (s *Storage) KeepPingingOn(d time.Duration) chan<- bool

KeepPingingOn periodically sends a ping to the db to keep the connection alive. You can kill this process by sending a boolean to the returned channel.
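
A keep-alive loop with a stop channel is typically a ticker inside a select. The sketch below shows that shape under stated assumptions: keepPinging and its ping callback are hypothetical, and in the real Storage the callback would presumably be the database handle's Ping method.

```go
package main

import (
	"fmt"
	"time"
)

// keepPinging calls ping every d until a value is sent on the returned channel.
func keepPinging(d time.Duration, ping func() error) chan<- bool {
	stop := make(chan bool)
	go func() {
		ticker := time.NewTicker(d)
		defer ticker.Stop()
		for {
			select {
			case <-ticker.C:
				_ = ping() // a real implementation would probably log failures
			case <-stop:
				return
			}
		}
	}()
	return stop
}

func main() {
	pings := make(chan bool, 16)
	stop := keepPinging(10*time.Millisecond, func() error {
		select {
		case pings <- true: // record the ping without ever blocking
		default:
		}
		return nil
	})

	<-pings
	<-pings
	stop <- true // sending any boolean ends the loop
	fmt.Println("stopped after at least two pings")
}
```

Because the stop channel is unbuffered in this sketch, the send returns only once the loop has actually received it.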
