ogbnmag

package
v0.9.1
Published: Apr 20, 2024 License: Apache-2.0 Imports: 25 Imported by: 0

Documentation

Overview

Package ogbnmag provides a `Download` function for the OGBN-MAG dataset, along with some dataset tools.

See https://ogb.stanford.edu/ for all Open Graph Benchmark (OGB) datasets. See https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag for the `ogbn-mag` dataset description.

The task is to predict the venue of publication of a paper, given its relations.

Index

Constants

This section is empty.

Variables

var (
	ZipURL         = "http://snap.stanford.edu/ogb/data/nodeproppred/mag.zip"
	ZipFile        = "mag.zip"
	ZipChecksum    = "2afe62ead87f2c301a7398796991d347db85b2d01c5442c95169372bf5a9fca4"
	DownloadSubdir = "downloads"
)
var (
	NumPapers        = 736389
	NumAuthors       = 1134649
	NumInstitutions  = 8740
	NumFieldsOfStudy = 59965

	// NumLabels is the number of labels for the papers. These correspond to publication venues.
	NumLabels = 349

	// PaperEmbeddingsSize is the size of the node features given.
	PaperEmbeddingsSize = 128

	// PapersEmbeddings contains the embeddings, shaped `(Float32)[NumPapers, PaperEmbeddingsSize]`
	PapersEmbeddings tensor.Tensor

	// PapersYears holds the publication year of each paper, counted from 2000 (so 10 corresponds to 2010). Shaped `(Int16)[NumPapers, 1]`.
	PapersYears tensor.Tensor

	// PapersLabels for each paper, values from 0 to 348 (so 349 in total). Shaped `(Int16)[NumPapers, 1]`.
	PapersLabels tensor.Tensor

	// TrainSplit, ValidSplit and TestSplit are the splits of the data.
	// These are indices to papers, with values in `[0, NumPapers-1]`. Shaped `(Int32)[n, 1]`.
	// They have 629571, 41939 and 64879 elements respectively.
	TrainSplit, ValidSplit, TestSplit tensor.Tensor

	// EdgesAffiliatedWith `(Int32)[1043998, 2]`, pairs with (author_id, institution_id).
	//
	// Thousands of institutions have only one affiliated author, with an exponentially decreasing number
	// of institutions having more affiliated authors, all the way to one institution with 27K authors.
	//
	// Most authors are affiliated with only 1 institution, with an exponentially decreasing number having
	// more affiliations, up to one author with 47 affiliations. ~300K authors have no affiliation.
	EdgesAffiliatedWith tensor.Tensor

	// EdgesWrites `(Int32)[7145660, 2]`, pairs with (author_id, paper_id).
	//
	// Every author writes at least one paper, and every paper has at least one author.
	//
	// Most authors (~600K) wrote one paper, with a substantial tail with thousands of authors having written hundreds of
	// papers, and in the extreme one author wrote 1046 papers.
	//
	// Papers are written on average by 3 authors (140K papers), in a bell-curve distribution with a long
	// tail: a dozen papers were written by thousands of authors (5050 authors in one case).
	EdgesWrites tensor.Tensor

	// EdgesCites `(Int32)[5416271, 2]`, pairs with (paper_id, paper_id).
	//
	// ~120K papers don't cite anyone, 95K papers cite only one paper, and a long exponential decreasing tail,
	// in the extreme a paper cites 609 other papers.
	//
	// ~100K papers are never cited, 155K are cited once, and again a long exponential decreasing tail, in the extreme
	// one paper is cited by 4744 other papers.
	EdgesCites tensor.Tensor

	// EdgesHasTopic `(Int32)[7505078, 2]`, pairs with (paper_id, topic_id).
	//
	// All papers have at least one "field of study" topic. Most (550K) papers have 12 or 13 topics. At most a paper has
	// 14 topics.
	//
	// All "fields of study" are associated with at least one paper. ~17K (out of ~60K) have only one paper associated.
	// ~50%+ of topics have < 10 papers associated. Some ~30% have < 1000 papers associated. A handful have tens of
	// thousands of papers associated, and one topic is associated with every paper.
	EdgesHasTopic tensor.Tensor

	// Counts for the various edge types.
	// These are all shaped `(Int32)[NumElements, 1]` for each of their entities.
	CountAuthorsAffiliations, CountInstitutionsAffiliations tensor.Tensor
	CountPapersCites, CountPapersIsCited                    tensor.Tensor
	CountPapersFieldsOfStudy, CountFieldsOfStudyPapers      tensor.Tensor
	CountAuthorsPapers, CountPapersAuthors                  tensor.Tensor
)
var (
	//
	// OgbnMagVariables maps variable names to a reference to their values.
	// We keep a reference to the values because the actual values change during the call to `Download()`
	//
	// They will be stored under the "/ogbnmag" scope.
	OgbnMagVariables = map[string]*tensor.Tensor{
		"PapersEmbeddings":              &PapersEmbeddings,
		"PapersLabels":                  &PapersLabels,
		"EdgesAffiliatedWith":           &EdgesAffiliatedWith,
		"EdgesWrites":                   &EdgesWrites,
		"EdgesCites":                    &EdgesCites,
		"EdgesHasTopic":                 &EdgesHasTopic,
		"CountAuthorsAffiliations":      &CountAuthorsAffiliations,
		"CountInstitutionsAffiliations": &CountInstitutionsAffiliations,
		"CountPapersCites":              &CountPapersCites,
		"CountPapersIsCited":            &CountPapersIsCited,
		"CountPapersFieldsOfStudy":      &CountPapersFieldsOfStudy,
		"CountFieldsOfStudyPapers":      &CountFieldsOfStudyPapers,
		"CountAuthorsPapers":            &CountAuthorsPapers,
		"CountPapersAuthors":            &CountPapersAuthors,
	}

	// OgbnMagVariablesScope is the absolute scope where the dataset variables are stored.
	OgbnMagVariablesScope = "/ogbnmag"
)
var (
	// ParamEmbedDropoutRate adds an extra dropout to learning embeddings.
	// This may be important because many embeddings are seen only once, so in testing many will likely never
	// have been seen, and we want the model to learn to handle missing embeddings (zero-initialized) well.
	ParamEmbedDropoutRate = "mag_embed_dropout_rate"

	// ParamSplitEmbedTablesSize makes embedding tables share one entry across this many entries.
	// Default is 1, which means no splitting.
	ParamSplitEmbedTablesSize = "mag_split_embed_tables"
)
var (
	// BatchSize used for the sampler: the value was taken from the TF-GNN OGBN-MAG demo colab, where it was
	// the best found with some hyperparameter tuning. It leads to using almost 7GB of GPU RAM,
	// but works fine on an Nvidia RTX 2080 Ti (with 11GB of memory).
	BatchSize = 128

	// ReuseShareableKernels will share the kernels across similar messages in the strategy tree.
	// So the authors-to-papers messages will be the same whether they come from the authors of the seed
	// papers or of the co-authored papers.
	// Default is true.
	ReuseShareableKernels = true

	// KeepDegrees will also make the sampler keep the degrees of the edges as separate tensors.
	// These can be used by the GNN pooling functions to multiply the sum by the actual degree.
	KeepDegrees = true

	// IdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	IdentitySubSeeds = true
)
var (
	ParamCheckpointPath = "checkpoint"

	// ParamNumCheckpoints is the number of past checkpoints to keep.
	// The default is 10.
	ParamNumCheckpoints = "num_checkpoints"

	// ParamReuseKernels is a context parameter that configures whether the kernels for similar sampling rules will be reused.
	ParamReuseKernels = "mag_reuse_kernels"

	// ParamIdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	ParamIdentitySubSeeds = "mag_identity_sub_seeds"
)
var WithReplacement = false

WithReplacement indicates whether the training dataset is created with replacement.

Functions

func Download

func Download(baseDir string) error

Download downloads and prepares the tensors with the data into `baseDir`.

If files are already there, it's assumed they were correctly generated and nothing is done.

The data files occupy ~415MB, but to keep a copy of the raw tensors (for faster start-up) you'll need ~1GB of free disk space.

func Eval

func Eval(ctx *context.Context, baseDir string, datasets ...train.Dataset) error

Eval GNN model based on configuration in `ctx`.

func FeaturePreprocessing

func FeaturePreprocessing(ctx *context.Context, strategy *sampler.Strategy, inputs []*Node) (graphInputs map[string]*sampler.ValueMask[*Node])

FeaturePreprocessing converts the `spec` and `inputs` given by the dataset into a map of node type name to its initial embeddings.

Many embeddings are seen only once per author/paper, so it is reasonable to expect that during validation/testing the model will see many zero-initialized embeddings.

func MagModelGraph

func MagModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node

MagModelGraph builds an OGBN-MAG GNN model that sends [ParamNumGraphUpdates] rounds of messages along its sampling strategy, and then adds a final layer on top of the seeds.

It returns 3 tensors:

* Predictions for all seeds, shaped `Float32[BatchSize, mag.NumLabels]`.
* Mask of the seeds, provided by the sampler, shaped `Bool[BatchSize]`.

func MagStrategy

func MagStrategy(magSampler *sampler.Sampler, batchSize int, seedIdsCandidates tensor.Tensor) *sampler.Strategy

MagStrategy takes a sampler created by ogbnmag.NewSampler, a desired batch size, and the set of seed ids to sample from (ogbnmag.TrainSplit, ogbnmag.ValidSplit or ogbnmag.TestSplit), and returns a sampling strategy that can be used to create datasets.

func MakeDatasets

func MakeDatasets(dataDir string) (trainDS, trainEvalDS, validEvalDS, testEvalDS train.Dataset, err error)

MakeDatasets takes a directory where to store the downloaded data and returns 4 datasets: "train", "trainEval", "validEval", "testEval".

It uses the package `ogbnmag` to download the data.

func NewSampler

func NewSampler(baseDir string) (*sampler.Sampler, error)

NewSampler will create a sampler.Sampler and configure it with the OGBN-MAG graph definition.

func PapersSeedDatasets

func PapersSeedDatasets(manager *Manager) (trainDS, validDS, testDS *mldata.InMemoryDataset, err error)

PapersSeedDatasets returns the train, validation and test datasets (`data.InMemoryDataset`) with only the papers seed nodes, to be used with FNNs (feed-forward neural networks). See [MakeDatasets] to make datasets with sampled sub-graphs for GNNs.

The datasets can be shuffled and batched as desired.

The yielded values are papers indices, and the corresponding labels.

func Train

func Train(ctx *context.Context, baseDir string) error

Train GNN model based on configuration in `ctx`.

func UploadOgbnMagVariables

func UploadOgbnMagVariables(ctx *context.Context) *context.Context

UploadOgbnMagVariables creates frozen variables with the various static tables of the OGBN-MAG dataset, so it can be used by models.

They will be stored under the "/ogbnmag" scope.

Types

This section is empty.

Directories

Path Synopsis
fnn	Package fnn implements a feed-forward neural network for the OGBN-MAG problem.
gnn	Package gnn implements a generic GNN modeling based on [TF-GNN MtAlbis].
