scrapePkg

package v0.0.0-...-3f8eaf4
Published: Apr 19, 2024 License: GPL-3.0

README

chifra scrape

The chifra scrape application creates TrueBlocks' chunked index of address appearances -- the fundamental data structure of the entire system. It also, optionally, pins each chunk of the index to IPFS.

chifra scrape is a long-running process; we therefore advise running it as a service or in a terminal multiplexer such as tmux. You may start and stop chifra scrape as needed, but doing so means the scraper will not keep up with the front of the blockchain. The next time it starts, it will have to catch up to the chain, a process that may take several hours depending on how long ago it was last run. See the section below and the "Papers" section of our website for more information on how the scraping process works and the prerequisites for its proper operation.

You may adjust the speed of the index creation with the --sleep and --block_cnt options. On some machines, or when running against some EVM node software, the scraper may overburden the hardware; slowing things down ensures proper operation. Finally, you may optionally --pin each new chunk to IPFS, which naturally shards the database among all users. By default, pinning is against a locally running IPFS node, but the --remote option allows pinning to an IPFS pinning service such as Pinata.
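
For example (the flag values below are illustrative, not recommendations):

  chifra scrape --sleep 60 --block_cnt 500    # slow the scraper down on constrained hardware
  chifra scrape --pin --remote                # pin each new chunk to a remote pinning service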

Purpose:
  Scan the chain and update the TrueBlocks index of appearances.

Usage:
  chifra scrape [flags]

Flags:
  -n, --block_cnt uint   maximum number of blocks to process per pass (default 2000)
  -s, --sleep float      seconds to sleep between scraper passes (default 14)
  -l, --touch uint       first block to visit when scraping (snapped back to most recent snap_to_grid mark)
  -v, --verbose          enable verbose output
  -h, --help             display this help screen

Notes:
  - The --touch option may only be used for blocks after the latest scraped block (if any). It will be snapped back to the latest snap_to block.

Data models produced by this tool:

configuration

Each of the following additional configurable command line options is available.

Configuration file: trueBlocks.toml
Configuration group: [scrape.<chain>]

Item          Type    Default    Description
appsPerChunk  uint64  2000000    the number of appearances to build into a chunk before consolidating it
snapToGrid    uint64  250000     an override to apps_per_chunk to snap-to-grid at every modulo of this value; this allows easier corrections to the index
firstSnap     uint64  2000000    the first block at which snap_to_grid is enabled
unripeDist    uint64  28         the distance (in blocks) from the front of the chain under which (inclusive) a block is considered unripe
channelCount  uint64  20         the number of concurrent processing channels
allowMissing  bool    true       do not report errors for blockchains that contain blocks with zero addresses

Note that for Ethereum mainnet, the default values for appsPerChunk and firstSnap are 2,000,000 and 2,300,000 respectively. See the specification for a justification of these values.

These items may be set in three ways, each overriding the preceding method:

-- in the above configuration file under the [scrape.<chain>] group,
-- in the environment by exporting the configuration item as UPPER_CASE (with underbars removed) and prepended with TB_SCRAPE_CHAIN_, or
-- on the command line using the configuration item with leading dashes and in snake case (i.e., --snake_case).
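
As an illustrative sketch, here is a single item, snapToGrid, expressed in all three forms, assuming mainnet as the chain (the exact environment-variable spelling follows the rule above and is an assumption):

  # in trueBlocks.toml
  [scrape.mainnet]
  snapToGrid = 250000

  # in the environment
  export TB_SCRAPE_MAINNET_SNAPTOGRID=250000

  # on the command line
  chifra scrape --snap_to_grid 250000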

further information

Each time chifra scrape runs, it begins at the last block it completed processing (plus one). With each pass, the scraper descends as deeply as possible into each block's data. (This is why TrueBlocks requires a tracing node.) As the scraper encounters appearances of addresses in the block's data, it adds those appearances to a growing index. Periodically (after processing the block that contains the 2,000,000th appearance), the system consolidates an index chunk.

An index chunk is a portion of the index containing approximately 2,000,000 records (although this number is adjustable for different chains). As part of the consolidation, the scraper creates a Bloom filter representing the set membership in the associated index portion. The Bloom filters are an order of magnitude smaller than the index chunks. The system then pushes both the index chunk and the Bloom filter to IPFS. In this way, TrueBlocks creates an immutable, uncapturable index of appearances that can be used not only by TrueBlocks, but by any member of the community who needs it. (Hint: We all need it.)

Users of the TrueBlocks Explorer (or any other software) may subsequently download the Bloom filters, query them to determine which index chunks need to be downloaded, and thereby build a historical list of transactions for a given address. This is accomplished while imposing minimal resource requirements on the end user's machine.
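
The client-side flow can be sketched in Go as follows. This is a conceptual illustration only: the types (Appearance, Chunk, BloomRef) and the helper fetchChunk are hypothetical stand-ins, not the actual TrueBlocks API.

  // All types and helpers below are hypothetical illustrations.
  type Appearance struct {
  	BlockNumber uint64
  	TxIndex     uint64
  }

  // Chunk is an index portion downloaded from IPFS.
  type Chunk interface {
  	Find(addr string) []Appearance // exact lookup within one chunk
  }

  // BloomRef pairs a Bloom filter with the CID of its chunk.
  type BloomRef interface {
  	MayContain(addr string) bool // probabilistic: false positives possible, false negatives not
  	ChunkCID() string            // IPFS CID of the corresponding index chunk
  }

  // fetchChunk would download a chunk from IPFS by CID (stubbed here).
  func fetchChunk(cid string) Chunk { panic("illustrative stub") }

  // appearancesFor downloads only the chunks whose blooms may contain addr,
  // then performs the exact lookup inside each downloaded chunk.
  func appearancesFor(addr string, blooms []BloomRef) []Appearance {
  	var result []Appearance
  	for _, b := range blooms {
  		if !b.MayContain(addr) {
  			continue // definite miss: skip this chunk entirely
  		}
  		chunk := fetchChunk(b.ChunkCID())
  		result = append(result, chunk.Find(addr)...) // a false positive simply yields zero records
  	}
  	return result
  }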

Recently, we enabled end users to pin these downloaded index chunks and blooms on their own machines. The user needs the data for the software to operate -- sharing it requires minimal effort and makes the data available to other people. Everyone is better off. A naturally-occurring network effect.

prerequisites

chifra scrape works with any EVM-based blockchain, but does not currently work without a "tracing, archive" RPC endpoint. The Erigon blockchain node, given its minimal disk footprint for an archive node and its support of the required trace_ endpoint routines, is highly recommended.

Please see this article for more information about running the scraper and building and sharing the index of appearances.

Other Options

All tools accept the following additional flags, although in some cases, they have no meaning.

  -v, --version         display the current version of the tool
      --output string   write the results to file 'fn' and return the filename
      --append          for the --output option only, append to rather than replace the contents of the file
      --file string     specify multiple sets of command line options in a file

Note: For the --file option, you may place a series of valid command lines in a file using any valid flags. In some cases, this may significantly improve performance. A semi-colon at the start of any line makes it a comment.

Note: If you use the --output and --append options together with the --file option, you may not switch export formats within the command file. For example, a command file with two different commands, one using --fmt csv and the other --fmt json, will produce both invalid CSV and invalid JSON.
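
For illustration only, a command file might look like the following; the file name and the exact per-line syntax are assumptions here, and the point is that every line uses the same --fmt:

  ; cmds.txt -- a semi-colon at the start of a line makes it a comment
  chifra scrape --block_cnt 100 --fmt json
  chifra scrape --sleep 60 --fmt json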

Documentation

Overview

Package scrapePkg handles the chifra scrape command. It creates TrueBlocks' chunked index of address appearances -- the fundamental data structure of the entire system. It also, optionally, pins each chunk of the index to IPFS. chifra scrape is a long-running process; we therefore advise running it as a service or in a terminal multiplexer such as tmux. You may start and stop it as needed, but doing so means the scraper will not keep up with the front of the blockchain. The next time it starts, it will have to catch up to the chain, a process that may take several hours depending on how long ago it was last run. See the section below and the "Papers" section of our website for more information on how the scraping process works and the prerequisites for its proper operation. You may adjust the speed of the index creation with the --sleep and --block_cnt options. On some machines, or when running against some EVM node software, the scraper may overburden the hardware; slowing things down ensures proper operation. Finally, you may optionally --pin each new chunk to IPFS, which naturally shards the database among all users. By default, pinning is against a locally running IPFS node, but the --remote option allows pinning to an IPFS pinning service such as Pinata.

Index

Constants

This section is empty.

Variables

var ErrConfiguredButNotRunning = fmt.Errorf("listener is configured but not running")

Functions

func Notify

func Notify[T notify.NotificationPayload](notification notify.Notification[T]) error

Notify may be used to tell other processes about progress.

func NotifyChunkWritten

func NotifyChunkWritten(chunk index.Chunk, chunkPath string) (err error)

func NotifyConfigured

func NotifyConfigured() (bool, string)

NotifyConfigured returns true if the notification feature is configured

func ResetOptions

func ResetOptions(testMode bool)

func RunScrape

func RunScrape(cmd *cobra.Command, args []string) error

RunScrape handles the scrape command for the command line. Returns error only as per cobra.
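
As a minimal sketch, RunScrape's signature matches cobra's RunE field, so wiring it into a command looks roughly like the following (the import path is assumed from the module layout; the real chifra command tree performs this wiring itself):

  import (
  	"github.com/spf13/cobra"

  	scrapePkg "github.com/TrueBlocks/trueblocks-core/src/apps/chifra/internal/scrape" // assumed path
  )

  var scrapeCmd = &cobra.Command{
  	Use:   "scrape",
  	Short: "Scan the chain and update the TrueBlocks index of appearances",
  	RunE:  scrapePkg.RunScrape, // func(*cobra.Command, []string) error satisfies RunE
  }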

func ServeScrape

func ServeScrape(w http.ResponseWriter, r *http.Request) error

ServeScrape handles the scrape command for the API. Returns an error.
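
Because ServeScrape returns an error rather than matching http.HandlerFunc directly, a caller wraps it. A minimal sketch (the route path and error handling are illustrative):

  mux := http.NewServeMux()
  mux.HandleFunc("/scrape", func(w http.ResponseWriter, r *http.Request) {
  	if err := scrapePkg.ServeScrape(w, r); err != nil {
  		http.Error(w, err.Error(), http.StatusInternalServerError)
  	}
  })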

Types

type BlazeManager

type BlazeManager struct {
	// contains filtered or unexported fields
}

BlazeManager manages the scraper by keeping track of the progress of the scrape and maintaining the timestamp array and processed map. The processed map helps us know if every block was visited or not.

func (*BlazeManager) AllowMissing

func (bm *BlazeManager) AllowMissing() bool

AllowMissing returns true for all chains but mainnet and the value of the config item on mainnet (false by default). The scraper will halt if AllowMissing is false and a block with zero appearances is encountered.

func (*BlazeManager) AsciiFileToAppearanceMap

func (bm *BlazeManager) AsciiFileToAppearanceMap(fn string) (map[string][]index.AppearanceRecord, base.FileRange, int)

AsciiFileToAppearanceMap reads the appearances from the stage file and returns them as a map

func (*BlazeManager) BlockCount

func (bm *BlazeManager) BlockCount() base.Blknum

BlockCount returns the number of blocks to process for this pass of the scraper.

func (*BlazeManager) Consolidate

func (bm *BlazeManager) Consolidate(blocks []base.Blknum) (error, bool)

Consolidate calls into the block scraper to (a) call Blaze and (b) consolidate if applicable

func (*BlazeManager) EndBlock

func (bm *BlazeManager) EndBlock() base.Blknum

EndBlock returns the last block to process for this pass of the scraper.

func (*BlazeManager) FirstSnap

func (bm *BlazeManager) FirstSnap() base.Blknum

FirstSnap returns the first block to process.

func (*BlazeManager) HandleBlaze

func (bm *BlazeManager) HandleBlaze(blocks []base.Blknum) (err error, ok bool)

HandleBlaze does the actual scraping, walking through block_cnt blocks and querying traces and logs and then extracting addresses and timestamps from those data structures.

func (*BlazeManager) IsSnap

func (bm *BlazeManager) IsSnap(block base.Blknum) bool

IsSnap returns true if the block is a snap point.

func (*BlazeManager) IsTestMode

func (bm *BlazeManager) IsTestMode() bool

IsTestMode returns true if the scraper is running in test mode.

func (*BlazeManager) PerChunk

func (bm *BlazeManager) PerChunk() base.Blknum

PerChunk returns the number of blocks to process per chunk.

func (*BlazeManager) ProcessAppearances

func (bm *BlazeManager) ProcessAppearances(appearanceChannel chan scrapedData, appWg *sync.WaitGroup, tsChannel chan tslib.TimestampRecord) (err error)

ProcessAppearances processes scrapedData objects shoved down the appearanceChannel

func (*BlazeManager) ProcessBlocks

func (bm *BlazeManager) ProcessBlocks(blockChannel chan base.Blknum, blockWg *sync.WaitGroup, appearanceChannel chan scrapedData) (err error)

ProcessBlocks processes the block channel and, for each block, queries the node for both traces and logs. It sends the results down the appearanceChannel.

func (*BlazeManager) ProcessTimestamps

func (bm *BlazeManager) ProcessTimestamps(tsChannel chan tslib.TimestampRecord, tsWg *sync.WaitGroup) (err error)

ProcessTimestamps processes timestamp data (currently by printing to a temporary file)
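
Taken together, ProcessBlocks, ProcessAppearances, and ProcessTimestamps form a three-stage channel pipeline. The following sketch shows how such a pipeline might be wired from inside the package; the worker counts, WaitGroup lifecycle, and shutdown ordering are assumptions, not the actual HandleBlaze implementation:

  // Illustrative wiring only -- not the actual HandleBlaze implementation.
  blockChannel := make(chan base.Blknum)
  appearanceChannel := make(chan scrapedData)
  tsChannel := make(chan tslib.TimestampRecord)

  var blockWg, appWg, tsWg sync.WaitGroup

  blockWg.Add(1)
  go bm.ProcessBlocks(blockChannel, &blockWg, appearanceChannel) // stage 1: block numbers -> scraped data

  appWg.Add(1)
  go bm.ProcessAppearances(appearanceChannel, &appWg, tsChannel) // stage 2: scraped data -> index + timestamps

  tsWg.Add(1)
  go bm.ProcessTimestamps(tsChannel, &tsWg) // stage 3: persist timestamps

  for _, bn := range blocks {
  	blockChannel <- bn // feed the pipeline
  }

  // Assumed shutdown order: close each stage's input, then wait for it to drain.
  close(blockChannel)
  blockWg.Wait()
  close(appearanceChannel)
  appWg.Wait()
  close(tsChannel)
  tsWg.Wait()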

func (*BlazeManager) RipeFolder

func (bm *BlazeManager) RipeFolder() string

RipeFolder returns the folder where ripe blocks are stored.

func (*BlazeManager) ScrapeBatch

func (bm *BlazeManager) ScrapeBatch(blocks []base.Blknum) (error, bool)

ScrapeBatch is called each time around the forever loop. It calls into HandleBlaze and writes the timestamps if there's no error.

func (*BlazeManager) SnapTo

func (bm *BlazeManager) SnapTo() base.Blknum

SnapTo returns the grid interval (in blocks) to which chunk boundaries are snapped.

func (*BlazeManager) StageFolder

func (bm *BlazeManager) StageFolder() string

StageFolder returns the folder where the stage file is stored.

func (*BlazeManager) StartBlock

func (bm *BlazeManager) StartBlock() base.Blknum

StartBlock returns the start block for the current pass of the scraper.

func (*BlazeManager) UnripeFolder

func (bm *BlazeManager) UnripeFolder() string

UnripeFolder returns the folder where unripe blocks are stored.

func (*BlazeManager) WriteAppearances

func (bm *BlazeManager) WriteAppearances(bn base.Blknum, addrMap uniq.AddressBooleanMap) (err error)

WriteAppearances writes the appearances for a chunk to a file

func (*BlazeManager) WriteTimestamps

func (bm *BlazeManager) WriteTimestamps(blocks []base.Blknum) error

type ScrapeOptions

type ScrapeOptions struct {
	BlockCnt  uint64                `json:"blockCnt,omitempty"`  // Maximum number of blocks to process per pass
	Sleep     float64               `json:"sleep,omitempty"`     // Seconds to sleep between scraper passes
	Touch     uint64                `json:"touch,omitempty"`     // First block to visit when scraping (snapped back to most recent snap_to_grid mark)
	RunCount  uint64                `json:"runCount,omitempty"`  // Run the scraper this many times, then quit
	Publisher string                `json:"publisher,omitempty"` // For some query options, the publisher of the index
	DryRun    bool                  `json:"dryRun,omitempty"`    // Show the configuration that would be applied if run; no changes are made
	Settings  config.ScrapeSettings `json:"settings,omitempty"`  // Configuration items for the scrape
	Globals   globals.GlobalOptions `json:"globals,omitempty"`   // The global options
	Conn      *rpc.Connection       `json:"conn,omitempty"`      // The connection to the RPC server
	BadFlag   error                 `json:"badFlag,omitempty"`   // An error flag if needed
	// EXISTING_CODE
	PublisherAddr base.Address `json:"-"`
}

ScrapeOptions provides all command options for the chifra scrape command.

func GetOptions

func GetOptions() *ScrapeOptions

func GetScrapeOptions

func GetScrapeOptions(args []string, g *globals.GlobalOptions) *ScrapeOptions

GetScrapeOptions returns the options for this tool so other tools may use it.

func (*ScrapeOptions) HandleScrape

func (opts *ScrapeOptions) HandleScrape() error

HandleScrape enters a forever loop and continually scrapes --block_cnt blocks (or fewer if close to the head). The forever loop pauses each round for --sleep seconds (or, if not close to the head, for .25 seconds).

func (*ScrapeOptions) HandleTouch

func (opts *ScrapeOptions) HandleTouch() error

func (*ScrapeOptions) Prepare

func (opts *ScrapeOptions) Prepare() (ok bool, err error)

Prepare performs actions that need to be done prior to entering the forever loop. It returns true if processing should continue, false otherwise. The routine cleans the temporary folders (if any) and then makes sure the zero block is processed (reading the allocation file, if present).
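
Combining these pieces, a minimal programmatic driver might look like the following sketch; the arguments, the use of Prepare by external callers, and the globals value are assumptions:

  // runOnce is an illustrative driver, not part of the package's API.
  func runOnce() error {
  	opts := scrapePkg.GetScrapeOptions([]string{"--block_cnt", "500"}, &globals.GlobalOptions{})
  	if ok, err := opts.Prepare(); !ok || err != nil {
  		return err // assumed: do not enter the loop if Prepare declines
  	}
  	return opts.HandleScrape() // enters the forever loop described above
  }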

func (*ScrapeOptions) ScrapeInternal

func (opts *ScrapeOptions) ScrapeInternal() error

ScrapeInternal handles the internal workings of the scrape command. Returns an error.

func (*ScrapeOptions) String

func (opts *ScrapeOptions) String() string

String implements the Stringer interface
