scrapePkg

package v0.0.0-...-3f8eaf4
Published: Apr 19, 2024 License: GPL-3.0

README

chifra scrape

The chifra scrape application creates TrueBlocks' chunked index of address appearances -- the fundamental data structure of the entire system. It also, optionally, pins each chunk of the index to IPFS.

chifra scrape is a long-running process; we therefore advise running it as a service or in a terminal multiplexer such as tmux. You may start and stop chifra scrape as needed, but doing so means the scraper will not keep up with the front of the blockchain. The next time it starts, it will have to catch up to the chain, a process that may take several hours depending on how long ago it was last run. See the section below and the "Papers" section of our website for more information on how the scraping process works and the prerequisites for its proper operation.

You may adjust the speed of the index creation with the --sleep and --block_cnt options. On some machines, or when running against some EVM node software, the scraper may overburden the hardware; slowing things down ensures proper operation. Finally, you may optionally --pin each new chunk to IPFS, which naturally shards the database among all users. By default, pinning is against a locally running IPFS node, but the --remote option allows pinning to an IPFS pinning service such as Pinata.
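
For example (the flag values below are illustrative, not recommendations):

  chifra scrape --sleep 60 --block_cnt 500    # slow the scraper down on constrained hardware
  chifra scrape --pin --remote                # pin each new chunk to a remote pinning service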

Purpose:
  Scan the chain and update the TrueBlocks index of appearances.

Usage:
  chifra scrape [flags]

Flags:
  -n, --block_cnt uint   maximum number of blocks to process per pass (default 2000)
  -s, --sleep float      seconds to sleep between scraper passes (default 14)
  -l, --touch uint       first block to visit when scraping (snapped back to most recent snap_to_grid mark)
  -v, --verbose          enable verbose output
  -h, --help             display this help screen

Notes:
  - The --touch option may only be used for blocks after the latest scraped block (if any). It will be snapped back to the latest snap_to block.

Data models produced by this tool:

configuration

Each of the following additional configurable command line options is available.

Configuration file: trueBlocks.toml
Configuration group: [scrape.<chain>]

Item          Type    Default    Description
appsPerChunk  uint64  2000000    the number of appearances to build into a chunk before consolidating it
snapToGrid    uint64  250000     an override to apps_per_chunk to snap-to-grid at every modulo of this value; this allows easier corrections to the index
firstSnap     uint64  2000000    the first block at which snap_to_grid is enabled
unripeDist    uint64  28         the distance (in blocks) from the front of the chain under which (inclusive) a block is considered unripe
channelCount  uint64  20         the number of concurrent processing channels
allowMissing  bool    true       do not report errors for blockchains that contain blocks with zero addresses

Note that for Ethereum mainnet, the default values for appsPerChunk and firstSnap are 2,000,000 and 2,300,000 respectively. See the specification for a justification of these values.

These items may be set in three ways, each overriding the preceding method:

-- in the above configuration file under the [scrape.<chain>] group,
-- in the environment by exporting the configuration item as UPPER_CASE (with underbars removed) and prepended with TB_SCRAPE_CHAIN_, or
-- on the command line using the configuration item with leading dashes and in snake case (i.e., --snake_case).
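
As an illustrative sketch, here is a single item, snapToGrid, expressed in all three forms, assuming mainnet as the chain (the exact environment-variable spelling follows the rule above and is an assumption):

  # in trueBlocks.toml
  [scrape.mainnet]
  snapToGrid = 250000

  # in the environment
  export TB_SCRAPE_MAINNET_SNAPTOGRID=250000

  # on the command line
  chifra scrape --snap_to_grid 250000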

further information

Each time chifra scrape runs, it begins at the last block it completed processing (plus one). With each pass, the scraper descends as deeply as possible into each block's data. (This is why TrueBlocks requires a tracing node.) As the scraper encounters appearances of addresses in the block's data, it adds those appearances to a growing index. Periodically (after processing the block that contains the 2,000,000th appearance), the system consolidates an index chunk.

An index chunk is a portion of the index containing approximately 2,000,000 records (although this number is adjustable for different chains). As part of the consolidation, the scraper creates a Bloom filter representing the set membership in the associated index portion. The Bloom filters are an order of magnitude smaller than the index chunks. The system then pushes both the index chunk and the Bloom filter to IPFS. In this way, TrueBlocks creates an immutable, uncapturable index of appearances that can be used not only by TrueBlocks, but by any member of the community who needs it. (Hint: We all need it.)

Users of the TrueBlocks Explorer (or any other software) may subsequently download the Bloom filters, query them to determine which index chunks need to be downloaded, and thereby build a historical list of transactions for a given address. This is accomplished while imposing minimal resource requirements on the end user's machine.
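
The client-side flow can be sketched in Go as follows. This is a conceptual illustration only: the types (Appearance, Chunk, BloomRef) and the helper fetchChunk are hypothetical stand-ins, not the actual TrueBlocks API.

  // All types and helpers below are hypothetical illustrations.
  type Appearance struct {
  	BlockNumber uint64
  	TxIndex     uint64
  }

  // Chunk is an index portion downloaded from IPFS.
  type Chunk interface {
  	Find(addr string) []Appearance // exact lookup within one chunk
  }

  // BloomRef pairs a Bloom filter with the CID of its chunk.
  type BloomRef interface {
  	MayContain(addr string) bool // probabilistic: false positives possible, false negatives not
  	ChunkCID() string            // IPFS CID of the corresponding index chunk
  }

  // fetchChunk would download a chunk from IPFS by CID (stubbed here).
  func fetchChunk(cid string) Chunk { panic("illustrative stub") }

  // appearancesFor downloads only the chunks whose blooms may contain addr,
  // then performs the exact lookup inside each downloaded chunk.
  func appearancesFor(addr string, blooms []BloomRef) []Appearance {
  	var result []Appearance
  	for _, b := range blooms {
  		if !b.MayContain(addr) {
  			continue // definite miss: skip this chunk entirely
  		}
  		chunk := fetchChunk(b.ChunkCID())
  		result = append(result, chunk.Find(addr)...) // a false positive simply yields zero records
  	}
  	return result
  }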

Recently, we enabled end users to pin these downloaded index chunks and blooms on their own machines. The user needs the data for the software to operate -- sharing it requires minimal effort and makes the data available to other people. Everyone is better off. A naturally-occurring network effect.

prerequisites

chifra scrape works with any EVM-based blockchain, but does not currently work without a "tracing, archive" RPC endpoint. The Erigon blockchain node, given its minimal disk footprint for an archive node and its support of the required trace_ endpoint routines, is highly recommended.

Please see this article for more information about running the scraper and building and sharing the index of appearances.

Other Options

All tools accept the following additional flags, although in some cases, they have no meaning.

  -v, --version         display the current version of the tool
      --output string   write the results to file 'fn' and return the filename
      --append          for the --output option only, append to rather than replace the contents of the file
      --file string     specify multiple sets of command line options in a file

Note: For the --file option, you may place a series of valid command lines in a file using any valid flags. In some cases, this may significantly improve performance. A semi-colon at the start of any line makes it a comment.

Note: If you use the --output and --append options together with the --file option, you may not switch export formats within the command file. For example, a command file with two different commands, one using --fmt csv and the other --fmt json, will produce both invalid CSV and invalid JSON.
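
For illustration only, a command file might look like the following; the file name and the exact per-line syntax are assumptions here, and the point is that every line uses the same --fmt:

  ; cmds.txt -- a semi-colon at the start of a line makes it a comment
  chifra scrape --block_cnt 100 --fmt json
  chifra scrape --sleep 60 --fmt json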

Documentation

Overview

Package scrapePkg handles the chifra scrape command. It creates TrueBlocks' chunked index of address appearances -- the fundamental data structure of the entire system. It also, optionally, pins each chunk of the index to IPFS. chifra scrape is a long-running process; we therefore advise running it as a service or in a terminal multiplexer such as tmux. You may start and stop it as needed, but doing so means the scraper will not keep up with the front of the blockchain. The next time it starts, it will have to catch up to the chain, a process that may take several hours depending on how long ago it was last run. See the section below and the "Papers" section of our website for more information on how the scraping process works and the prerequisites for its proper operation. You may adjust the speed of the index creation with the --sleep and --block_cnt options. On some machines, or when running against some EVM node software, the scraper may overburden the hardware; slowing things down ensures proper operation. Finally, you may optionally --pin each new chunk to IPFS, which naturally shards the database among all users. By default, pinning is against a locally running IPFS node, but the --remote option allows pinning to an IPFS pinning service such as Pinata.

Index

Constants

This section is empty.

Variables

var ErrConfiguredButNotRunning = fmt.Errorf("listener is configured but not running")

Functions

func Notify

func Notify[T notify.NotificationPayload](notification notify.Notification[T]) error

Notify may be used to tell other processes about progress.

func NotifyChunkWritten

func NotifyChunkWritten(chunk index.Chunk, chunkPath string) (err error)

func NotifyConfigured

func NotifyConfigured() (bool, string)

NotifyConfigured returns true if the notification feature is configured

func ResetOptions

func ResetOptions(testMode bool)

func RunScrape

func RunScrape(cmd *cobra.Command, args []string) error

RunScrape handles the scrape command for the command line. Returns error only as per cobra.
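
As a minimal sketch, RunScrape's signature matches cobra's RunE field, so wiring it into a command looks roughly like the following (the import path is assumed from the module layout; the real chifra command tree performs this wiring itself):

  import (
  	"github.com/spf13/cobra"

  	scrapePkg "github.com/TrueBlocks/trueblocks-core/src/apps/chifra/internal/scrape" // assumed path
  )

  var scrapeCmd = &cobra.Command{
  	Use:   "scrape",
  	Short: "Scan the chain and update the TrueBlocks index of appearances",
  	RunE:  scrapePkg.RunScrape, // func(*cobra.Command, []string) error satisfies RunE
  }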

func ServeScrape

func ServeScrape(w http.ResponseWriter, r *http.Request) error

ServeScrape handles the scrape command for the API. Returns an error.
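
Because ServeScrape returns an error rather than matching http.HandlerFunc directly, a caller wraps it. A minimal sketch (the route path and error handling are illustrative):

  mux := http.NewServeMux()
  mux.HandleFunc("/scrape", func(w http.ResponseWriter, r *http.Request) {
  	if err := scrapePkg.ServeScrape(w, r); err != nil {
  		http.Error(w, err.Error(), http.StatusInternalServerError)
  	}
  })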

Types

type BlazeManager

type BlazeManager struct {
	// contains filtered or unexported fields
}

BlazeManager manages the scraper by keeping track of the progress of the scrape and maintaining the timestamp array and processed map. The processed map helps us know if every block was visited or not.

func (*BlazeManager) AllowMissing

func (bm *BlazeManager) AllowMissing() bool

AllowMissing returns true for all chains but mainnet and the value of the config item on mainnet (false by default). The scraper will halt if AllowMissing is false and a block with zero appearances is encountered.

func (*BlazeManager) AsciiFileToAppearanceMap

func (bm *BlazeManager) AsciiFileToAppearanceMap(fn string) (map[string][]index.AppearanceRecord, base.FileRange, int)

AsciiFileToAppearanceMap reads the appearances from the stage file and returns them as a map

func (*BlazeManager) BlockCount

func (bm *BlazeManager) BlockCount() base.Blknum

BlockCount returns the number of blocks to process for this pass of the scraper.

func (*BlazeManager) Consolidate

func (bm *BlazeManager) Consolidate(blocks []base.Blknum) (error, bool)

Consolidate calls into the block scraper to (a) call Blaze and (b) consolidate if applicable

func (*BlazeManager) EndBlock

func (bm *BlazeManager) EndBlock() base.Blknum

EndBlock returns the last block to process for this pass of the scraper.

func (*BlazeManager) FirstSnap

func (bm *BlazeManager) FirstSnap() base.Blknum

FirstSnap returns the first block to process.

func (*BlazeManager) HandleBlaze

func (bm *BlazeManager) HandleBlaze(blocks []base.Blknum) (err error, ok bool)

HandleBlaze does the actual scraping, walking through block_cnt blocks and querying traces and logs and then extracting addresses and timestamps from those data structures.

func (*BlazeManager) IsSnap

func (bm *BlazeManager) IsSnap(block base.Blknum) bool

IsSnap returns true if the block is a snap point.

func (*BlazeManager) IsTestMode

func (bm *BlazeManager) IsTestMode() bool

IsTestMode returns true if the scraper is running in test mode.

func (*BlazeManager) PerChunk

func (bm *BlazeManager) PerChunk() base.Blknum

PerChunk returns the number of blocks to process per chunk.

func (*BlazeManager) ProcessAppearances

func (bm *BlazeManager) ProcessAppearances(appearanceChannel chan scrapedData, appWg *sync.WaitGroup, tsChannel chan tslib.TimestampRecord) (err error)

ProcessAppearances processes scrapedData objects shoved down the appearanceChannel

func (*BlazeManager) ProcessBlocks

func (bm *BlazeManager) ProcessBlocks(blockChannel chan base.Blknum, blockWg *sync.WaitGroup, appearanceChannel chan scrapedData) (err error)

ProcessBlocks processes the block channel and, for each block, queries the node for both traces and logs. It sends the results down the appearanceChannel.

func (*BlazeManager) ProcessTimestamps

func (bm *BlazeManager) ProcessTimestamps(tsChannel chan tslib.TimestampRecord, tsWg *sync.WaitGroup) (err error)

ProcessTimestamps processes timestamp data (currently by printing to a temporary file)
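
Taken together, ProcessBlocks, ProcessAppearances, and ProcessTimestamps form a three-stage channel pipeline. The following sketch shows how such a pipeline might be wired from inside the package; the worker counts, WaitGroup lifecycle, and shutdown ordering are assumptions, not the actual HandleBlaze implementation:

  // Illustrative wiring only -- not the actual HandleBlaze implementation.
  blockChannel := make(chan base.Blknum)
  appearanceChannel := make(chan scrapedData)
  tsChannel := make(chan tslib.TimestampRecord)

  var blockWg, appWg, tsWg sync.WaitGroup

  blockWg.Add(1)
  go bm.ProcessBlocks(blockChannel, &blockWg, appearanceChannel) // stage 1: block numbers -> scraped data

  appWg.Add(1)
  go bm.ProcessAppearances(appearanceChannel, &appWg, tsChannel) // stage 2: scraped data -> index + timestamps

  tsWg.Add(1)
  go bm.ProcessTimestamps(tsChannel, &tsWg) // stage 3: persist timestamps

  for _, bn := range blocks {
  	blockChannel <- bn // feed the pipeline
  }

  // Assumed shutdown order: close each stage's input, then wait for it to drain.
  close(blockChannel)
  blockWg.Wait()
  close(appearanceChannel)
  appWg.Wait()
  close(tsChannel)
  tsWg.Wait()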

func (*BlazeManager) RipeFolder

func (bm *BlazeManager) RipeFolder() string

RipeFolder returns the folder where ripe blocks are stored.

func (*BlazeManager) ScrapeBatch

func (bm *BlazeManager) ScrapeBatch(blocks []base.Blknum) (error, bool)

ScrapeBatch is called each time around the forever loop. It calls into HandleBlaze and writes the timestamps if there's no error.

func (*BlazeManager) SnapTo

func (bm *BlazeManager) SnapTo() base.Blknum

SnapTo returns the grid interval (in blocks) to which chunk boundaries are snapped.

func (*BlazeManager) StageFolder

func (bm *BlazeManager) StageFolder() string

StageFolder returns the folder where the stage file is stored.

func (*BlazeManager) StartBlock

func (bm *BlazeManager) StartBlock() base.Blknum

StartBlock returns the start block for the current pass of the scraper.

func (*BlazeManager) UnripeFolder

func (bm *BlazeManager) UnripeFolder() string

UnripeFolder returns the folder where unripe blocks are stored.

func (*BlazeManager) WriteAppearances

func (bm *BlazeManager) WriteAppearances(bn base.Blknum, addrMap uniq.AddressBooleanMap) (err error)

WriteAppearances writes the appearances for a chunk to a file

func (*BlazeManager) WriteTimestamps

func (bm *BlazeManager) WriteTimestamps(blocks []base.Blknum) error

type ScrapeOptions

type ScrapeOptions struct {
	BlockCnt  uint64                `json:"blockCnt,omitempty"`  // Maximum number of blocks to process per pass
	Sleep     float64               `json:"sleep,omitempty"`     // Seconds to sleep between scraper passes
	Touch     uint64                `json:"touch,omitempty"`     // First block to visit when scraping (snapped back to most recent snap_to_grid mark)
	RunCount  uint64                `json:"runCount,omitempty"`  // Run the scraper this many times, then quit
	Publisher string                `json:"publisher,omitempty"` // For some query options, the publisher of the index
	DryRun    bool                  `json:"dryRun,omitempty"`    // Show the configuration that would be applied if run; no changes are made
	Settings  config.ScrapeSettings `json:"settings,omitempty"`  // Configuration items for the scrape
	Globals   globals.GlobalOptions `json:"globals,omitempty"`   // The global options
	Conn      *rpc.Connection       `json:"conn,omitempty"`      // The connection to the RPC server
	BadFlag   error                 `json:"badFlag,omitempty"`   // An error flag if needed
	// EXISTING_CODE
	PublisherAddr base.Address `json:"-"`
}

ScrapeOptions provides all command options for the chifra scrape command.

func GetOptions

func GetOptions() *ScrapeOptions

func GetScrapeOptions

func GetScrapeOptions(args []string, g *globals.GlobalOptions) *ScrapeOptions

GetScrapeOptions returns the options for this tool so other tools may use it.

func (*ScrapeOptions) HandleScrape

func (opts *ScrapeOptions) HandleScrape() error

HandleScrape enters a forever loop and continually scrapes --block_cnt blocks (or fewer if close to the head). The forever loop pauses each round for --sleep seconds (or, if not close to the head, for .25 seconds).

func (*ScrapeOptions) HandleTouch

func (opts *ScrapeOptions) HandleTouch() error

func (*ScrapeOptions) Prepare

func (opts *ScrapeOptions) Prepare() (ok bool, err error)

Prepare performs actions that need to be done prior to entering the forever loop. It returns true if processing should continue, false otherwise. The routine cleans the temporary folders (if any) and then makes sure the zero block is processed (reading the allocation file, if present).
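
Combining these pieces, a minimal programmatic driver might look like the following sketch; the arguments, the use of Prepare by external callers, and the globals value are assumptions:

  // runOnce is an illustrative driver, not part of the package's API.
  func runOnce() error {
  	opts := scrapePkg.GetScrapeOptions([]string{"--block_cnt", "500"}, &globals.GlobalOptions{})
  	if ok, err := opts.Prepare(); !ok || err != nil {
  		return err // assumed: do not enter the loop if Prepare declines
  	}
  	return opts.HandleScrape() // enters the forever loop described above
  }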

func (*ScrapeOptions) ScrapeInternal

func (opts *ScrapeOptions) ScrapeInternal() error

ScrapeInternal handles the internal workings of the scrape command. Returns an error.

func (*ScrapeOptions) String

func (opts *ScrapeOptions) String() string

String implements the Stringer interface
