ogbnmag

package
v0.9.1
Published: Apr 20, 2024 License: Apache-2.0 Imports: 25 Imported by: 0

Documentation

Overview

Package ogbnmag provides a `Download` function for the OGBN-MAG dataset, along with some dataset tools.

See https://ogb.stanford.edu/ for all Open Graph Benchmark (OGB) datasets. See https://ogb.stanford.edu/docs/nodeprop/#ogbn-mag for the `ogbn-mag` dataset description.

The task is to predict the venue of publication of a paper, given its relations.

Index

Constants

This section is empty.

Variables

var (
	ZipURL         = "http://snap.stanford.edu/ogb/data/nodeproppred/mag.zip"
	ZipFile        = "mag.zip"
	ZipChecksum    = "2afe62ead87f2c301a7398796991d347db85b2d01c5442c95169372bf5a9fca4"
	DownloadSubdir = "downloads"
)
var (
	NumPapers        = 736389
	NumAuthors       = 1134649
	NumInstitutions  = 8740
	NumFieldsOfStudy = 59965

	// NumLabels is the number of labels for the papers. These correspond to publication venues.
	NumLabels = 349

	// PaperEmbeddingsSize is the size of the node features given.
	PaperEmbeddingsSize = 128

	// PapersEmbeddings contains the embeddings, shaped `(Float32)[NumPapers, PaperEmbeddingsSize]`
	PapersEmbeddings tensor.Tensor

	// PapersYears holds the publication year of each paper, counted from 2000 (so 10 corresponds to 2010). Shaped `(Int16)[NumPapers, 1]`.
	PapersYears tensor.Tensor

	// PapersLabels for each paper, values from 0 to 348 (so 349 in total). Shaped `(Int16)[NumPapers, 1]`.
	PapersLabels tensor.Tensor

	// TrainSplit, ValidSplit and TestSplit are the splits of the data.
	// These are indices to papers, with values in `[0, NumPapers-1]`. Shaped `(Int32)[n, 1]`.
	// They have 629571, 41939 and 64879 elements respectively.
	TrainSplit, ValidSplit, TestSplit tensor.Tensor

	// EdgesAffiliatedWith `(Int32)[1043998, 2]`, pairs with (author_id, institution_id).
	//
	// Thousands of institutions have only one affiliated author, with an exponentially decreasing number
	// of institutions having more affiliated authors, all the way to one institution with 27K authors.
	//
	// Most authors are affiliated with only 1 institution, with an exponentially decreasing number having
	// more affiliations, up to one author with 47 affiliations. ~300K authors have no affiliation.
	EdgesAffiliatedWith tensor.Tensor

	// EdgesWrites `(Int32)[7145660, 2]`, pairs with (author_id, paper_id).
	//
	// Every author writes at least one paper, and every paper has at least one author.
	//
	// Most authors (~600K) wrote one paper, with a substantial tail with thousands of authors having written hundreds of
	// papers, and in the extreme one author wrote 1046 papers.
	//
	// Papers are written on average by 3 authors (140K papers), in a bell-curve distribution with a long
	// tail: a dozen papers were written by thousands of authors (5050 authors in one case).
	EdgesWrites tensor.Tensor

	// EdgesCites `(Int32)[5416271, 2]`, pairs with (paper_id, paper_id).
	//
	// ~120K papers don't cite anyone, 95K papers cite only one paper, and a long exponential decreasing tail,
	// in the extreme a paper cites 609 other papers.
	//
	// ~100K papers are never cited, 155K are cited once, and again a long exponential decreasing tail, in the extreme
	// one paper is cited by 4744 other papers.
	EdgesCites tensor.Tensor

	// EdgesHasTopic `(Int32)[7505078, 2]`, pairs with (paper_id, topic_id).
	//
	// All papers have at least one "field of study" topic. Most (550K) papers have 12 or 13 topics. At most a paper has
	// 14 topics.
	//
	// All "fields of study" are associated with at least one paper. ~17K (out of ~60K) have only one paper associated.
	// ~50%+ of topics have < 10 papers associated. Some ~30% have < 1000 papers associated. A handful have tens of
	// thousands of papers associated, and one topic is associated with every paper.
	EdgesHasTopic tensor.Tensor

	// Counts for the various edge types.
	// These are all shaped `(Int32)[NumElements, 1]` for each of their entities.
	CountAuthorsAffiliations, CountInstitutionsAffiliations tensor.Tensor
	CountPapersCites, CountPapersIsCited                    tensor.Tensor
	CountPapersFieldsOfStudy, CountFieldsOfStudyPapers      tensor.Tensor
	CountAuthorsPapers, CountPapersAuthors                  tensor.Tensor
)
var (
	//
	// OgbnMagVariables maps variable names to a reference to their values.
	// We keep a reference to the values because the actual values change during the call to `Download()`
	//
	// They will be stored under the "/ogbnmag" scope.
	OgbnMagVariables = map[string]*tensor.Tensor{
		"PapersEmbeddings":              &PapersEmbeddings,
		"PapersLabels":                  &PapersLabels,
		"EdgesAffiliatedWith":           &EdgesAffiliatedWith,
		"EdgesWrites":                   &EdgesWrites,
		"EdgesCites":                    &EdgesCites,
		"EdgesHasTopic":                 &EdgesHasTopic,
		"CountAuthorsAffiliations":      &CountAuthorsAffiliations,
		"CountInstitutionsAffiliations": &CountInstitutionsAffiliations,
		"CountPapersCites":              &CountPapersCites,
		"CountPapersIsCited":            &CountPapersIsCited,
		"CountPapersFieldsOfStudy":      &CountPapersFieldsOfStudy,
		"CountFieldsOfStudyPapers":      &CountFieldsOfStudyPapers,
		"CountAuthorsPapers":            &CountAuthorsPapers,
		"CountPapersAuthors":            &CountPapersAuthors,
	}

	// OgbnMagVariablesScope is the absolute scope where the dataset variables are stored.
	OgbnMagVariablesScope = "/ogbnmag"
)
var (
	// ParamEmbedDropoutRate adds an extra dropout to learning embeddings.
	// This may be important because many embeddings are seen only once, so in testing many will likely never
	// have been seen, and we want the model to learn to handle missing embeddings (zero-initialized) well.
	ParamEmbedDropoutRate = "mag_embed_dropout_rate"

	// ParamSplitEmbedTablesSize makes embedding tables share one entry across this many entries.
	// Default is 1, which means no splitting.
	ParamSplitEmbedTablesSize = "mag_split_embed_tables"
)
var (
	// BatchSize used for the sampler: the value was taken from the TF-GNN OGBN-MAG demo colab, where it was
	// the best found with some hyperparameter tuning. It leads to using almost 7GB of GPU RAM,
	// but works fine on an Nvidia RTX 2080 Ti (with 11GB of memory).
	BatchSize = 128

	// ReuseShareableKernels will share the kernels across similar messages in the strategy tree.
	// So the authors-to-papers messages will be the same whether they come from the authors of the seed
	// papers or of the co-authored papers.
	// Default is true.
	ReuseShareableKernels = true

	// KeepDegrees will also make the sampler keep the degrees of the edges as separate tensors.
	// These can be used by the GNN pooling functions to multiply the sum by the actual degree.
	KeepDegrees = true

	// IdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	IdentitySubSeeds = true
)
var (
	ParamCheckpointPath = "checkpoint"

	// ParamNumCheckpoints is the number of past checkpoints to keep.
	// The default is 10.
	ParamNumCheckpoints = "num_checkpoints"

	// ParamReuseKernels is a context parameter that configures whether the kernels for similar sampling rules will be reused.
	ParamReuseKernels = "mag_reuse_kernels"

	// ParamIdentitySubSeeds controls whether to use an IdentitySubSeed, to allow more sharing of the kernel.
	ParamIdentitySubSeeds = "mag_identity_sub_seeds"
)
var WithReplacement = false

WithReplacement indicates whether the training dataset is created with replacement.

Functions

func Download

func Download(baseDir string) error

Download downloads and prepares the tensors with the data into `baseDir`.

If files are already there, it's assumed they were correctly generated and nothing is done.

The data files occupy ~415MB, but to keep a copy of the raw tensors (for faster start-up) you'll need ~1GB of free disk space.

func Eval

func Eval(ctx *context.Context, baseDir string, datasets ...train.Dataset) error

Eval GNN model based on configuration in `ctx`.

func FeaturePreprocessing

func FeaturePreprocessing(ctx *context.Context, strategy *sampler.Strategy, inputs []*Node) (graphInputs map[string]*sampler.ValueMask[*Node])

FeaturePreprocessing converts the `spec` and `inputs` given by the dataset into a map of node type name to its initial embeddings.

Many embeddings are seen only once per author/paper, so it is reasonable to expect that during validation/testing the model will see many zero-initialized embeddings.

func MagModelGraph

func MagModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node

MagModelGraph builds an OGBN-MAG GNN model that sends [ParamNumGraphUpdates] rounds of messages along its sampling strategy, and then adds a final layer on top of the seeds.

It returns 3 tensors:

* Predictions for all seeds, shaped `Float32[BatchSize, mag.NumLabels]`.
* Mask of the seeds, provided by the sampler, shaped `Bool[BatchSize]`.

func MagStrategy

func MagStrategy(magSampler *sampler.Sampler, batchSize int, seedIdsCandidates tensor.Tensor) *sampler.Strategy

MagStrategy takes a sampler created by ogbnmag.NewSampler, a desired batch size, and the set of seed ids to sample from (ogbnmag.TrainSplit, ogbnmag.ValidSplit or ogbnmag.TestSplit), and returns a sampling strategy that can be used to create datasets.

func MakeDatasets

func MakeDatasets(dataDir string) (trainDS, trainEvalDS, validEvalDS, testEvalDS train.Dataset, err error)

MakeDatasets takes a directory where to store the downloaded data and returns 4 datasets: "train", "trainEval", "validEval", "testEval".

It uses the package `ogbnmag` to download the data.

func NewSampler

func NewSampler(baseDir string) (*sampler.Sampler, error)

NewSampler will create a sampler.Sampler and configure it with the OGBN-MAG graph definition.

func PapersSeedDatasets

func PapersSeedDatasets(manager *Manager) (trainDS, validDS, testDS *mldata.InMemoryDataset, err error)

PapersSeedDatasets returns the train, validation and test datasets (`data.InMemoryDataset`) with only the papers seed nodes, to be used with FNNs (feed-forward neural networks). See [MakeDatasets] to make datasets with sampled sub-graphs for GNNs.

The datasets can be shuffled and batched as desired.

The yielded values are papers indices, and the corresponding labels.

func Train

func Train(ctx *context.Context, baseDir string) error

Train GNN model based on configuration in `ctx`.

func UploadOgbnMagVariables

func UploadOgbnMagVariables(ctx *context.Context) *context.Context

UploadOgbnMagVariables creates frozen variables with the various static tables of the OGBN-MAG dataset, so it can be used by models.

They will be stored under the "/ogbnmag" scope.

Types

This section is empty.

Directories

Path Synopsis
fnn	Package fnn implements a feed-forward neural network for the OGBN-MAG problem.
gnn	Package gnn implements a generic GNN modeling based on [TF-GNN MtAlbis].
