sampler

package
v0.9.1
Published: Apr 20, 2024 License: Apache-2.0 Imports: 16 Imported by: 0

Documentation

Constants

const PaddingIndex = 0

PaddingIndex is the value used to fill sample slots that could not be fulfilled. Note that 0 is also a valid node index, so always use the mask returned by the Sampler to check whether a value is padding.
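To illustrate why the mask is needed, here is a minimal, self-contained sketch that does not use the sampler package. The names `paddingIndex` and `padToCount` are hypothetical, introduced only for this example. It pads a list of candidates to a fixed length and returns the matching mask.

```go
package main

import "fmt"

// paddingIndex mirrors the documented PaddingIndex constant (0).
const paddingIndex int32 = 0

// padToCount copies up to count candidates into a fixed-size slice and fills
// the remaining slots with paddingIndex. mask[i] is true only for real samples.
func padToCount(candidates []int32, count int) (values []int32, mask []bool) {
	values = make([]int32, count)
	mask = make([]bool, count)
	for i := range values {
		if i < len(candidates) {
			values[i] = candidates[i]
			mask[i] = true
		} else {
			values[i] = paddingIndex
		}
	}
	return values, mask
}

func main() {
	// Node 0 is a real neighbor here, so the value alone cannot distinguish
	// a real sample from padding. Only the mask can.
	values, mask := padToCount([]int32{0, 7}, 4)
	fmt.Println(values, mask) // prints "[0 7 0 0] [true true false false]"
}
```

The first and third entries both hold 0, but only the first one is a real node.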

Variables

This section is empty.

Functions

func MapInputs

func MapInputs[T any](strategy *Strategy, inputs []T) map[string]*ValueMask[T]

MapInputs converts the inputs yielded by a sampler.Dataset into a map from each Rule's Name to the Value/Mask tensors holding that rule's samples for the example.

Example 1: using the outputs of a sampler.Dataset created by this Strategy directly:

spec, inputs, _, err := ds.Yield()
strategy := spec.(*sampler.Strategy)
graphSample := sampler.MapInputs(strategy, inputs)
seeds, mask := graphSample["Seeds"].Value, graphSample["Seeds"].Mask
...

Example 2: usage in a model that is fed the output of a sampler.Dataset:

func MyModelGraph(ctx *context.Context, spec any, inputs []*Node) []*Node {
	strategy := spec.(*sampler.Strategy)
	graphSample := sampler.MapInputs(strategy, inputs)
	seeds, mask := graphSample["Seeds"].Value, graphSample["Seeds"].Mask
	...
}
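The idea of turning a flat list of inputs into a name-keyed map of (Value, Mask) pairs can be sketched without the library. This is only an illustration: `valueMask` and `mapByName` are hypothetical names, and the value, mask, value, mask ordering is an assumption made for the example. The real layout is decided by the Strategy and is not described here.

```go
package main

import "fmt"

// valueMask mirrors the documented ValueMask[T]: a (Value, Mask) pair.
type valueMask[T any] struct {
	Value, Mask T
}

// mapByName pairs a flat inputs slice with rule names.
// ASSUMPTION: inputs are laid out as value, mask, value, mask, ... in the
// order of ruleNames. This layout is hypothetical.
func mapByName[T any](ruleNames []string, inputs []T) (map[string]*valueMask[T], error) {
	if len(inputs) != 2*len(ruleNames) {
		return nil, fmt.Errorf("expected %d inputs, got %d", 2*len(ruleNames), len(inputs))
	}
	out := make(map[string]*valueMask[T], len(ruleNames))
	for i, name := range ruleNames {
		out[name] = &valueMask[T]{Value: inputs[2*i], Mask: inputs[2*i+1]}
	}
	return out, nil
}

func main() {
	// Strings stand in for tensors or graph nodes.
	inputs := []string{"seedsValue", "seedsMask", "authorsValue", "authorsMask"}
	sample, err := mapByName([]string{"Seeds", "authors"}, inputs)
	if err != nil {
		panic(err)
	}
	fmt.Println(sample["authors"].Value, sample["authors"].Mask)
}
```

Because the function is generic, the same mapping works for concrete tensors and for graph nodes, which is why MapInputs takes a type parameter.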

func NameForNodeDependentDegree

func NameForNodeDependentDegree(ruleName, dependentName string) string

NameForNodeDependentDegree returns the name of the input field that contains the degree of the given rule node, with respect to the dependent rule node.

Types

type Dataset

type Dataset struct {
	// contains filtered or unexported fields
}

Dataset is created by a configured Strategy. Before using it (by calling Dataset.Yield), it can be configured to shuffle, to run for a given number of epochs, or to loop indefinitely. The batch size is not configurable in the Dataset: it is defined as part of the Strategy's Rules configuration (see Strategy.Nodes to define the seeds).

The Dataset is created to be re-entrant, so it can be used with [data.Parallel].

func (*Dataset) Epochs

func (ds *Dataset) Epochs(n int) *Dataset

Epochs configures the dataset to yield that many epochs. The default is 1.

Note that if there is more than one seed node type, an epoch is considered finished as soon as the first of the seed types is exhausted.

It returns itself to allow cascading configuration calls.

func (*Dataset) Infinite

func (ds *Dataset) Infinite() *Dataset

Infinite configures the dataset to loop over epochs indefinitely. The default is 1 epoch.

func (*Dataset) Name

func (ds *Dataset) Name() string

Name implements train.Dataset.

func (*Dataset) Reset

func (ds *Dataset) Reset()

Reset implements train.Dataset: it restarts a Dataset after it has been exhausted.

func (*Dataset) Shuffle

func (ds *Dataset) Shuffle() *Dataset

Shuffle configures the dataset to shuffle the seed nodes before sampling them. The seeds are reshuffled at every new epoch, resulting in random sampling without replacement.
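As an illustration of per-epoch shuffling without replacement, the sketch below permutes a list of seed indices and cuts it into batches. It is independent of the sampler package. `epochBatches` and its `seed` parameter are hypothetical, and how the real Dataset handles a short final batch is not covered here.

```go
package main

import (
	"fmt"
	"math/rand"
)

// epochBatches returns the seed batches of one epoch: a fresh permutation of
// seeds cut into chunks of batchSize. Every seed appears exactly once.
// The last batch may be shorter than batchSize.
func epochBatches(seed int64, seeds []int32, batchSize int) [][]int32 {
	if batchSize <= 0 {
		return nil
	}
	rng := rand.New(rand.NewSource(seed))
	shuffled := make([]int32, len(seeds))
	copy(shuffled, seeds)
	rng.Shuffle(len(shuffled), func(i, j int) {
		shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
	})
	var batches [][]int32
	for start := 0; start < len(shuffled); start += batchSize {
		end := start + batchSize
		if end > len(shuffled) {
			end = len(shuffled)
		}
		batches = append(batches, shuffled[start:end])
	}
	return batches
}

func main() {
	seeds := []int32{10, 11, 12, 13, 14}
	for epoch := int64(0); epoch < 2; epoch++ {
		// Using a different random seed per epoch reshuffles the order.
		fmt.Println("epoch", epoch, epochBatches(epoch, seeds, 2))
	}
}
```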

func (*Dataset) WithReplacement

func (ds *Dataset) WithReplacement() *Dataset

WithReplacement configures the dataset to yield with replacement. This automatically implies `Shuffle` and `Infinite`.
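The cascading configuration style, and the way WithReplacement implies the other two options, can be sketched with a toy type. `datasetConfig` is a hypothetical stand-in and says nothing about how the real Dataset stores its settings.

```go
package main

import "fmt"

// datasetConfig is a toy stand-in for the Dataset's configuration state.
type datasetConfig struct {
	epochs          int
	infinite        bool
	shuffle         bool
	withReplacement bool
}

// newDatasetConfig starts with the documented default of 1 epoch.
func newDatasetConfig() *datasetConfig {
	return &datasetConfig{epochs: 1}
}

// Each setter returns the receiver so calls can be chained.
func (c *datasetConfig) Epochs(n int) *datasetConfig {
	c.epochs = n
	return c
}

func (c *datasetConfig) Infinite() *datasetConfig {
	c.infinite = true
	return c
}

func (c *datasetConfig) Shuffle() *datasetConfig {
	c.shuffle = true
	return c
}

// WithReplacement also turns on shuffle and infinite, as documented.
func (c *datasetConfig) WithReplacement() *datasetConfig {
	c.withReplacement = true
	c.shuffle = true
	c.infinite = true
	return c
}

func main() {
	train := newDatasetConfig().Shuffle().Epochs(3)
	stream := newDatasetConfig().WithReplacement()
	fmt.Printf("%+v\n%+v\n", *train, *stream)
}
```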

func (*Dataset) Yield

func (ds *Dataset) Yield() (spec any, inputs, labels []tensor.Tensor, err error)

Yield implements train.Dataset. The returned spec is a pointer to the Strategy, and can be used to build a map of the names to the sampled tensors.

type EdgeType

type EdgeType struct {
	// SourceNodeType, TargetNodeType of the edges.
	Name, SourceNodeType, TargetNodeType string

	// Starts has one entry for each source node (shifted by 1): it points to the start of the list of
	// target nodes (edges) that this source node is connected to.
	//
	// So for source node `i`, the list of edges starts at `Starts[i-1]` and ends at `Starts[i]`,
	// except if `i == 0`, in which case the start is at 0.
	// It's normal for the list to be empty if the source node has no target nodes.
	//
	// The number of sources is given by `len(Starts)`.
	Starts []int32

	// List of target nodes ordered by source nodes.
	// The source node for each edge is given by `Starts` above.
	EdgeTargets []int32
	// contains filtered or unexported fields
}

EdgeType information used by the Sampler.

func (*EdgeType) EdgeTargetsForSourceIdx

func (et *EdgeType) EdgeTargetsForSourceIdx(srcIdx int32) []int32

EdgeTargetsForSourceIdx returns a slice with the target nodes for the given source node. Don't modify the returned slice, since it's in use by the Sampler -- make a copy if you need to modify it.
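The lookup follows directly from the `Starts` and `EdgeTargets` layout described in the struct comments. Below is a minimal sketch on a toy type. `edgeType` and `targetsForSource` are hypothetical names, not the package's implementation.

```go
package main

import "fmt"

// edgeType is a toy stand-in holding the two documented EdgeType fields.
type edgeType struct {
	Starts      []int32 // Starts[i] is the end offset of source i's targets.
	EdgeTargets []int32 // Targets grouped by source node.
}

// targetsForSource returns the targets of srcIdx as a sub-slice of
// EdgeTargets. The result aliases the underlying storage.
func (et *edgeType) targetsForSource(srcIdx int32) []int32 {
	start := int32(0)
	if srcIdx > 0 {
		start = et.Starts[srcIdx-1]
	}
	return et.EdgeTargets[start:et.Starts[srcIdx]]
}

func main() {
	// Source 0 -> {3, 4}, source 1 -> {}, source 2 -> {0}.
	et := &edgeType{Starts: []int32{2, 2, 3}, EdgeTargets: []int32{3, 4, 0}}
	for src := int32(0); src < int32(len(et.Starts)); src++ {
		// Copy before modifying: the returned slice shares memory with et.
		own := append([]int32(nil), et.targetsForSource(src)...)
		fmt.Println(src, own)
	}
	// prints:
	// 0 [3 4]
	// 1 []
	// 2 [0]
}
```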

func (*EdgeType) NumEdges

func (et *EdgeType) NumEdges() int

NumEdges for this type.

func (*EdgeType) NumSourceNodes

func (et *EdgeType) NumSourceNodes() int

NumSourceNodes for the source node type -- total number of nodes, even if they are not used by the edges.

func (*EdgeType) NumTargetNodes

func (et *EdgeType) NumTargetNodes() int

NumTargetNodes for the target node type -- total number of nodes, even if they are not used by the edges.

type Rule

type Rule struct {
	Sampler  *Sampler
	Strategy *Strategy

	// Name of the [Rule].
	Name string

	// ConvKernelScopeName doesn't affect sampling, but can be used to uniquely identify
	// the scope used for the kernels in a GNN to do convolutions on this rule.
	// If two rules have the same ConvKernelScopeName, they will share weights.
	ConvKernelScopeName string

	// UpdateKernelScopeName doesn't affect sampling, but can be used to uniquely identify
	// the scope used for the kernels in a GNN to do updates on this rule.
	// If two rules have the same UpdateKernelScopeName, they will share weights.
	UpdateKernelScopeName string

	// NodeTypeName of the nodes sampled by this rule.
	NodeTypeName string

	// NumNodes for NodeTypeName. Only used if NodeSet is not provided.
	NumNodes int32

	// SourceRule is the [Rule] this rule uses as source, or nil if
	// this is a "Node" sampling rule (a root/seed sampling).
	SourceRule *Rule

	// Dependents is the list of Rules that depend on this one.
	// That is other rules that have this Rule as [SourceRule].
	// This is to keep track of the graph, and are not involved on the sampling of this rule.
	Dependents []*Rule

	// EdgeType that connects the [SourceRule] node type, to the node type ([NodeTypeName]) of this Rule.
	// This is only set if this is an edge sampling rule. A node sampling rule (for seeds) has this set to nil.
	EdgeType *EdgeType

	// Count is the number of samples to create. It will define the last dimension of the tensor sampled.
	Count int

	// Shape of the sample for this rule.
	Shape shapes.Shape

	// NodeSet is a set of indices that a "Node" rule is allowed to sample from.
	// E.g.: have separate NodeSet for train, test and validation datasets.
	NodeSet []int32
}

Rule defines one rule of the sampling strategy. It's created by Strategy.Nodes, Strategy.NodesFromSet and Rule.FromEdges. Don't modify it directly.

func (*Rule) FromEdges

func (r *Rule) FromEdges(name, edgeTypeName string, count int) *Rule

FromEdges returns a Rule that samples nodes from the edges connecting the results of the current Rule `r`.
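A rough sketch of what sampling from edges produces: for each source node, up to `count` targets in a fixed-size row, plus a mask. This is not the package's algorithm. `sampleTargets` is a hypothetical helper, the neighbor map replaces the real edge storage, and drawing without replacement is an assumption of this example.

```go
package main

import (
	"fmt"
	"math/rand"
)

// sampleTargets draws up to count distinct targets for every source in
// sources. neighbors[src] lists the targets reachable from src.
// Every row has exactly count entries; mask marks the real samples.
func sampleTargets(seed int64, neighbors map[int32][]int32, sources []int32, count int) (values [][]int32, mask [][]bool) {
	rng := rand.New(rand.NewSource(seed))
	values = make([][]int32, len(sources))
	mask = make([][]bool, len(sources))
	for row, src := range sources {
		// The zero value doubles as padding, matching PaddingIndex = 0.
		values[row] = make([]int32, count)
		mask[row] = make([]bool, count)
		candidates := neighbors[src]
		perm := rng.Perm(len(candidates)) // Random order, no repeats.
		for i := 0; i < count && i < len(perm); i++ {
			values[row][i] = candidates[perm[i]]
			mask[row][i] = true
		}
	}
	return values, mask
}

func main() {
	neighbors := map[int32][]int32{0: {5, 6, 7}, 1: {8}}
	// Source 2 has no neighbors, so its row is all padding.
	values, mask := sampleTargets(1, neighbors, []int32{0, 1, 2}, 2)
	fmt.Println(values)
	fmt.Println(mask)
}
```

The output shape is `[len(sources), count]` regardless of how many neighbors each source has, which is what keeps tensor shapes fixed.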

func (*Rule) IdentitySubRule

func (r *Rule) IdentitySubRule(name string) *Rule

IdentitySubRule creates a sub-rule that copies over the current rule, adding one rank (but same size). This is useful when trying to split updates into different parts, with the "IdentitySubRule" taking a subset of the dependents.

func (*Rule) IsIdentitySubRule

func (r *Rule) IsIdentitySubRule() bool

IsIdentitySubRule returns whether this is an identity sub-rule with a 1-to-1 mapping.

func (*Rule) IsNode

func (r *Rule) IsNode() bool

IsNode returns whether this is a "Node" rule, which can also be seen as a root rule.

func (*Rule) String

func (r *Rule) String() string

String returns an informative description of the rule.

func (*Rule) WithKernelScopeName

func (r *Rule) WithKernelScopeName(name string) *Rule

WithKernelScopeName will set both ConvKernelScopeName and UpdateKernelScopeName to `name`.

type Sampler

type Sampler struct {
	EdgeTypes        map[string]*EdgeType
	NodeTypesToCount map[string]int32
	Frozen           bool // When true, it can no longer be changed.
}

Sampler can be used to dynamically sample a Graph to be used in GNNs. It implements the train.Dataset interface.

It always samples a fixed number of nodes, padding whenever there are not enough elements to sample from. This way the resulting tensors always have the same Shape, as required by XLA.

There are 3 phases when using the Sampler:

(1) Specify the full graph data: define the node types and edge types. For example, for the OGBN-MAG dataset:

s := sampler.New()
s.AddNodeType("papers", mag.NumPapers)
s.AddNodeType("authors", mag.NumAuthors)
s.AddEdgeType("writes", "authors", "papers", mag.EdgesWrites, /* reverse= */ false)
s.AddEdgeType("writtenBy", "authors", "papers", mag.EdgesWrites, /* reverse= */ true)
s.AddEdgeType("cites", "papers", "papers", mag.EdgesCites, /* reverse= */ false)
s.AddEdgeType("citedBy", "papers", "papers", mag.EdgesCites, /* reverse= */ true)

(2) Create and specify a sampling strategy: sampling always generates a tree of elements, with fixed-shape tensors. It uses padding when there are not enough elements to sample from. Example:

trainStrategy := s.NewStrategy()
seeds := trainStrategy.NodesFromSet("Seeds", "papers", batchSize, /* nodeSet= */ TrainSplits)
citedBy := seeds.FromEdges(/* name= */ "citedBy", /* edgeTypeName= */ "citedBy", 5)
authors := seeds.FromEdges(/* name= */ "authors", /* edgeTypeName= */ "writtenBy", 5)
coauthoredPapers := authors.FromEdges(/* name= */ "coauthoredPapers", /* edgeTypeName= */ "writes", 5)
citingAuthors := citedBy.FromEdges(/* name= */ "citingAuthors", /* edgeTypeName= */ "writtenBy", 5)

(3) Create a dataset and use it. The `spec` returned by `Yield` is a pointer to the Strategy object, and can be passed to MapInputs together with the inputs to build a map from rule names to the sampled tensors. Example:

trainDataset := trainStrategy.NewDataset("train")
for {
	spec, inputs, labels, err := trainDataset.Yield()
	if err != nil {
		break
	}
	strategy := spec.(*sampler.Strategy)
	sample := sampler.MapInputs(strategy, inputs)
	...
}

Each registration of an edge type creates a corresponding structure to store the edges, that will be used for sampling.

All the information kept by Sampler is available for reading, but avoid changing it directly, and instead use the provided methods.


func Load

func Load(filePath string) (s *Sampler, err error)

Load loads a previously saved Sampler. If filePath doesn't exist, it returns an error that can be checked with os.IsNotExist.
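The os.IsNotExist check is typically used to tell "no saved file yet" apart from a real failure. The sketch below shows the pattern with plain file reading from the standard library standing in for Load. `loadIfExists` and the file name are hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// loadIfExists reads path and reports whether it was found. A missing file
// is not treated as an error, so the caller can build the data from scratch
// and save it. Any other error is returned as is.
func loadIfExists(path string) (data []byte, found bool, err error) {
	data, err = os.ReadFile(path)
	if os.IsNotExist(err) {
		return nil, false, nil
	}
	if err != nil {
		return nil, false, err
	}
	return data, true, nil
}

func main() {
	_, found, err := loadIfExists("sampler-cache-that-does-not-exist.bin")
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	if !found {
		fmt.Println("no saved file: build from raw edges, then save")
	}
}
```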

func New

func New() *Sampler

New creates a new empty Sampler.

After creating it, use AddNodeType and AddEdgeType to define where to sample from.

func (*Sampler) AddEdgeType

func (s *Sampler) AddEdgeType(name, sourceNodeType, targetNodeType string, edges tensor.Tensor, reverse bool)

AddEdgeType adds the edge type to the list of known edges. It takes the node types names (must have been added with AddNodeType), and the `edges` given as pairs (source node, target node).

If `reverse` is true, it reverses the direction of the sampling. Note that `sourceNodeType` and `targetNodeType` are given before reversing the direction of the edges. So if `reverse` is true, the source is interpreted as the target and vice versa. The same applies to the values of `edges`.

The `edges` tensor must have Shape `(Int32)[N, 2]`. Its contents are changed in place: they are sorted by the source node (or by the target node if reversed). No edge information is lost, only the order changes.
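To make the sorting and the `reverse` flag concrete, here is a sketch that turns (source, target) pairs into the `Starts` / `EdgeTargets` layout described under EdgeType. `buildCSR` is a hypothetical helper operating on plain slices. Unlike the documented behavior, it sorts a copy instead of the input.

```go
package main

import (
	"fmt"
	"sort"
)

// buildCSR groups edges by source node. edges holds (source, target) pairs;
// if reverse is true, each pair is read as (target, source) instead.
// numSources is the node count of the source type after any reversal, and
// every source index is assumed to be smaller than numSources.
func buildCSR(edges [][2]int32, numSources int, reverse bool) (starts, targets []int32) {
	pairs := make([][2]int32, len(edges))
	for i, e := range edges {
		if reverse {
			pairs[i] = [2]int32{e[1], e[0]}
		} else {
			pairs[i] = e
		}
	}
	sort.SliceStable(pairs, func(i, j int) bool { return pairs[i][0] < pairs[j][0] })

	starts = make([]int32, numSources)
	targets = make([]int32, len(pairs))
	for i, p := range pairs {
		targets[i] = p[1]
		starts[p[0]]++ // First pass: count the edges of each source.
	}
	for i := 1; i < numSources; i++ {
		starts[i] += starts[i-1] // Prefix sum: starts[i] is source i's end offset.
	}
	return starts, targets
}

func main() {
	edges := [][2]int32{{0, 3}, {2, 0}, {0, 4}}
	fmt.Println(buildCSR(edges, 3, false)) // prints "[2 2 3] [3 4 0]"
	fmt.Println(buildCSR(edges, 5, true))  // prints "[1 1 1 2 3] [2 0 0]"
}
```

With `reverse` set, the second column becomes the source, so `numSources` has to be the size of the original target type.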

func (*Sampler) AddNodeType

func (s *Sampler) AddNodeType(name string, count int)

AddNodeType adds the node type with the given `name` and `count` to the collection of known node types. This assumes a dense representation of the node type: all indices from `0` to `count-1` are valid.

A sparse node type (e.g.: indices are random numbers from 0 to MAXINT-1 or strings) is not supported.
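If the raw data uses sparse or string identifiers, one common workaround is to remap them to dense indices before registering the node type. The sketch below shows one way to do that. `denseIndexer` is a hypothetical helper, not part of the package.

```go
package main

import "fmt"

// denseIndexer assigns consecutive indices 0, 1, 2, ... to arbitrary IDs,
// in order of first appearance.
type denseIndexer[K comparable] struct {
	toIndex map[K]int32
	ids     []K // ids[i] is the original ID of dense index i.
}

func newDenseIndexer[K comparable]() *denseIndexer[K] {
	return &denseIndexer[K]{toIndex: make(map[K]int32)}
}

// indexOf returns the dense index for id, allocating a new one if needed.
func (d *denseIndexer[K]) indexOf(id K) int32 {
	if idx, ok := d.toIndex[id]; ok {
		return idx
	}
	idx := int32(len(d.ids))
	d.toIndex[id] = idx
	d.ids = append(d.ids, id)
	return idx
}

// count is the number of distinct IDs seen so far, which is the value a
// dense node type would be registered with.
func (d *denseIndexer[K]) count() int {
	return len(d.ids)
}

func main() {
	papers := newDenseIndexer[string]()
	for _, id := range []string{"paper-a", "paper-b", "paper-a"} {
		fmt.Println(id, "->", papers.indexOf(id))
	}
	fmt.Println("count:", papers.count()) // prints "count: 2"
}
```

Keeping the `ids` slice around allows mapping sampled indices back to the original identifiers.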

func (*Sampler) NewStrategy

func (s *Sampler) NewStrategy() *Strategy

NewStrategy yields a new Strategy object, based on the graph data definitions of the Sampler object.

Once a strategy is created, the Sampler can no longer be changed -- but multiple strategies can be created based on the same Sampler.

func (*Sampler) Save

func (s *Sampler) Save(filePath string) (err error)

Save saves the Sampler, including the edge indices, so it can be reloaded ready to use.

func (*Sampler) String

func (s *Sampler) String() string

String returns a multi-line informative description of the Sampler data specification.

type Strategy

type Strategy struct {
	Sampler *Sampler

	// KeepDegrees means the sampler should add a tensor for all edges with the degrees of source sampling nodes.
	KeepDegrees bool

	// Rules lists all the rules of a strategy.
	// It can be used for reading, but don't change it.
	Rules map[string]*Rule

	// Seeds lists all the rules that are seeds.
	// It can be used for reading, but don't change it.
	Seeds []*Rule
	// contains filtered or unexported fields
}

Strategy is created by Sampler. A Sampler can create multiple [Strategy]s. A typical example is creating one for training, one for validation and one for testing.

After creation (see Sampler.NewStrategy), one defines what and how to sample a subgraph, by creating "Rules" (Rule) that will translate to sampled nodes.

Once the strategy is defined, it can be used to create one or more datasets -- and after datasets are created, the strategy can no longer be changed.

func (*Strategy) NewDataset

func (strategy *Strategy) NewDataset(name string) *Dataset

NewDataset creates a new Dataset from the configured Strategy. One can create multiple datasets from the same Strategy, but once a Dataset is created, the Strategy is considered frozen and can no longer be modified.

func (*Strategy) Nodes

func (strategy *Strategy) Nodes(name, nodeTypeName string, count int) *Rule

Nodes creates a rule (named `name`) to sample nodes randomly without replacement from the node type given by `nodeTypeName`.

The sampled nodes are indices from 0 to the number of elements of the given node type, minus 1.

Node sampling rules (as opposed to edge sampling rules) typically produce the "root nodes" or "seed nodes" of the tree being sampled, which represents the sampled sub-graph.

If this is used to sample the seed nodes, `count` will typically be the batch size.

func (*Strategy) NodesFromSet

func (strategy *Strategy) NodesFromSet(name, nodeTypeName string, count int, nodeSet []int32) *Rule

NodesFromSet creates a rule (named `name`) to sample nodes randomly without replacement from the node type given by `nodeTypeName`, but selecting only from the given `nodeSet`.

`nodeSet` is a list of valid node indices for the given node type from which to sample.

Node sampling rules (as opposed to edge sampling rules) typically produce the "root nodes" or "seed nodes" of the tree being sampled, which represents the sampled sub-graph.

If this is used to sample the seed nodes, `count` will typically be the batch size.

func (*Strategy) String

func (strategy *Strategy) String() string

String returns a multi-line informative description of the strategy.

type ValueMask

type ValueMask[T any] struct {
	Value, Mask T
}

ValueMask contains a pair of tensor.Tensor or [*graph.Node] (Value, Mask).
