grammar

package module
v0.0.0-...-fff725e
Published: Oct 30, 2021 License: Apache-2.0 Imports: 8 Imported by: 1

README

grammar

A parser generator where rules are defined as Go structs and code generation is optional. The concepts are introduced in the simple example below.

Rules

Rules can be written as Go structs. Here is a definition for s-exprs:


// SExpr ::= Number | String | Atom | List
type SExpr struct {
    grammar.OneOf                // This rule is a "one-of": one of the rules below has to match
    Number *Token `tok:"number"` // A pointer field means an optional match
    String *Token `tok:"string"` // This "tok" tag means only tokens of type "string" will match
    Atom   *Token `tok:"atom"`
    *List                        // All fields in a one-of rule must be optional
}

// List ::= "(" [Item] ")"
type List struct {
    grammar.Seq                   // This is a sequence rule: all fields must match, in order
    OpenBkt  Token   `tok:"bkt,("` // This "tok" tag means only "bkt" tokens with value "(" will match
    Items    []SExpr
    CloseBkt Token   `tok:"bkt,)"`
}

Tokens

This is not quite complete, as you need a Token type. You can create your own or, if your needs are simple enough, use grammar.SimpleToken:

type Token = grammar.SimpleToken

In order to parse your s-exprs into the data structures above, you also need a tokeniser. You can make your own tokeniser, or you can build one simply with the grammar.SimpleTokeniser function:

var tokenise = grammar.SimpleTokeniser([]grammar.TokenDef{
    {
        // If Name is empty, the token is skipped in the token stream
        Ptn: `\s+`, // This is a regular expression (must not contain groups)
    },
    {
        Name: "bkt", // This is the type of the token seen in the "tok" struct tag
        Ptn: `[()]`,
    },
    {
        Name: "string",
        Ptn: `"[^"]*"`,
    },
    {
        Name: "number",
        Ptn: `-?[0-9]+(?:\.[0-9]+)?`,
    },
    {
        Name: "atom",
        Ptn: `[a-zA-Z_][a-zA-Z0-9_-]*`,
    },
})
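
To check what the tokeniser produces, you can dump the resulting stream before parsing it. A small sketch, assuming the tokenise variable defined above (SimpleTokenStream.Dump is documented further down):

ts, err := tokenise(`(plus 1 2)`)
if err != nil {
    // handle tokenisation errors here (e.g. input that no pattern matches)
}
ts.Dump(os.Stdout) // print the tokens in the stream for inspection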

Parsing

Now, putting all this together, you can parse an s-expr of your choice:


tokenStream, _ := tokenise(`(cons a (list 123 "c")))`)
var sexpr SExpr
err := grammar.Parse(&sexpr, tokenStream)

Now sexpr's fields have been filled, and you can explore the syntax tree by traversing them: for example, sexpr.List.Items is now a slice of SExprs, and sexpr.List.Items[0].Atom is a Token with value "cons" (and type "atom").
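
For example (a small sketch, assuming the Parse call above succeeded and err is nil):

first := sexpr.List.Items[0]
fmt.Println(first.Atom.Value()) // prints "cons"
fmt.Println(first.Atom.Type())  // prints "atom"

nested := sexpr.List.Items[2].List          // the inner (list 123 "c") expression
fmt.Println(nested.Items[1].Number.Value()) // prints "123"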

There is a convenience function to output a rule struct:

grammar.PrettyWrite(os.Stdout, sexpr)

This will output a pretty representation of sexpr:

SExpr {
  List: List {
    OpenBkt: {bkt (}
    Items: [
      SExpr {
        Atom: {atom cons}
      }
      SExpr {
        Atom: {atom a}
      }
      SExpr {
        List: List {
          OpenBkt: {bkt (}
          Items: [
            SExpr {
              Atom: {atom list}
            }
            SExpr {
              Number: {number 123}
            }
            SExpr {
              String: {string "c"}
            }
          ]
          CloseBkt: {bkt )}
        }
      }
    ]
    CloseBkt: {bkt )}
  }
}

Generating a parser

The above works using reflection, which is fine but can be a little slow if you are parsing very big files. It is also possible to compile a parser.

First, install the command that generates the parser:

go install github.com/arnodel/grammar/cmd/genparse

Then the simplest approach is to add this line to the file containing the grammar you want to compile:

//go:generate genparse

You can now generate the compiled parser by running go generate in your package. If your file is called, say, grammar.go, this will generate a file in the same package called grammar.compiled.go. You can still parse files the same way as before using grammar.Parse(), but this will no longer use reflection! Note that you can always force reflection to be used by compiling your program with the nocompiledgrammar build tag.
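
For example, from the directory containing grammar.go:

go generate                      # writes grammar.compiled.go next to grammar.go
go build -tags nocompiledgrammar # optional: force the reflection-based parser anyway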

Documentation

Overview

Example
package main

import (
	"os"

	"github.com/arnodel/grammar"
)

type Token = grammar.SimpleToken

type SExpr struct {
	grammar.OneOf
	Number *Token `tok:"number"`
	String *Token `tok:"string"`
	Atom   *Token `tok:"atom"`
	*List
}

type List struct {
	grammar.Seq
	OpenBkt  Token `tok:"bkt,("`
	Items    []SExpr
	CloseBkt Token `tok:"bkt,)"`
}

var tokenise = grammar.SimpleTokeniser([]grammar.TokenDef{
	{
		// If Name is empty, the token is skipped in the token stream
		Ptn: `\s+`,
	},
	{
		Name: "bkt",
		Ptn:  `[()]`,
	},
	{
		Name: "string",
		Ptn:  `"[^"]*"`,
	},
	{
		Name: "number",
		Ptn:  `-?[0-9]+(?:\.[0-9]+)?`,
	},
	{
		Name: "atom",
		Ptn:  `[a-zA-Z_][a-zA-Z0-9_-]*`,
	},
})

func main() {
	tokenStream, _ := tokenise(`(cons a (list 123 "c")))`)
	var sexpr SExpr
	grammar.Parse(&sexpr, tokenStream)
	grammar.PrettyWrite(os.Stdout, sexpr)

}
Output:

SExpr {
  List: List {
    OpenBkt: {bkt (}
    Items: [
      SExpr {
        Atom: {atom cons}
      }
      SExpr {
        Atom: {atom a}
      }
      SExpr {
        List: List {
          OpenBkt: {bkt (}
          Items: [
            SExpr {
              Atom: {atom list}
            }
            SExpr {
              Number: {number 123}
            }
            SExpr {
              String: {string "c"}
            }
          ]
          CloseBkt: {bkt )}
        }
      }
    ]
    CloseBkt: {bkt )}
  }
}

Index

Examples

Constants

This section is empty.

Variables

var EOF = SimpleToken{
	TokType:  "EOF",
	TokValue: "EOF",
}

EOF is the token that is returned when trying to consume a token from an exhausted token stream.

var WithDefaultLogger = WithLogger(log.Default())
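
WithDefaultLogger is WithLogger applied to the standard library's default logger. Since ParseOptions are variadic on Parse, a sketch of enabling it (reusing the README example) looks like this:

err := grammar.Parse(&sexpr, tokenStream, grammar.WithDefaultLogger)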

Functions

func PrettyWrite

func PrettyWrite(out io.Writer, r interface{}) error

PrettyWrite outputs a pretty representation of the rule r on the writer. For a struct like this:

type RuleType struct {
    Field1 Rule1
    Field2 []Rule2
}

It looks like this:

RuleType {
  Field1: <pretty representation of the value of Field1>
  Field2: [
     <pretty representation of the first item in Field2>
     <pretty representation of the second item>
     <...>
  ]
}

Empty optional fields and empty repeated fields are omitted altogether.

func SimpleTokeniser

func SimpleTokeniser(tokenDefs []TokenDef) func(string) (*SimpleTokenStream, error)

SimpleTokeniser takes a list of TokenDefs and returns a function that can tokenise a string. Designed for simple use-cases.

Types

type Empty

type Empty struct{}

func (Empty) Parse

func (Empty) Parse(r interface{}, s *ParserState, opts ParseOptions) *ParseError

type FieldType

type FieldType struct {
	BaseType reflect.Type
	Pointer  bool
	Array    bool
}

type Mode

type Mode struct {
}

type OneOf

type OneOf struct{}

OneOf should be used as the first field of a Rule struct to signify that it should match exactly one of the fields.

func (OneOf) Parse

func (OneOf) Parse(r interface{}, s *ParserState, opts ParseOptions) *ParseError

type ParseError

type ParseError struct {
	Err error
	Token
	TokenParseOptions []TokenParseOptions
	Pos               int
}

func Parse

func Parse(dest interface{}, s TokenStream, opts ...ParseOption) *ParseError

Parse tries to interpret dest as a grammar rule and use it to parse the given token stream. Parse can panic if dest is not a valid grammar rule. It returns a non-nil *ParseError if the token stream does not match the rule.
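
A minimal usage sketch with error handling, reusing the SExpr rule and token stream from the example above:

var sexpr SExpr
if err := grammar.Parse(&sexpr, tokenStream); err != nil {
	// err is a *ParseError; Pos reports where in the token stream it occurred
	log.Fatalf("parse failed at position %d: %s", err.Pos, err)
}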

func ParseWithOptions

func ParseWithOptions(dest interface{}, s *ParserState, opts ParseOptions) *ParseError

ParseWithOptions is the same as Parse but the ParseOptions are explicitly given (this is mostly used by the parser generator).

func (*ParseError) Error

func (e *ParseError) Error() string

func (*ParseError) Merge

func (e *ParseError) Merge(e2 *ParseError) *ParseError

type ParseOption

type ParseOption func(s *ParserState)

func WithLogger

func WithLogger(l *log.Logger) ParseOption

type ParseOptions

type ParseOptions struct {
	TokenParseOptions []TokenParseOptions
}

func (ParseOptions) MatchToken

func (o ParseOptions) MatchToken(tok Token) (bool, bool)

type Parser

type Parser interface {
	Parse(r interface{}, s *ParserState, opts ParseOptions) *ParseError
}

A Parser can parse a token stream to populate itself. It must return an error if it fails to do so.

type ParserState

type ParserState struct {
	TokenStream
	// contains filtered or unexported fields
}

func (*ParserState) Debug

func (s *ParserState) Debug() bool

func (*ParserState) Logf

func (s *ParserState) Logf(fstr string, args ...interface{})

func (*ParserState) MergeError

func (s *ParserState) MergeError(err *ParseError) *ParseError

type RuleDef

type RuleDef struct {
	Name      string
	OneOf     bool
	Separator *RuleField
	Fields    []RuleField
}

type RuleField

type RuleField struct {
	FieldType
	ParseOptions
	SizeOptions
	Name  string
	Index int
}

type Seq

type Seq struct{}

Seq should be used as the first field of a Rule struct to signify that it should match all the fields in sequence.

func (Seq) Parse

func (Seq) Parse(r interface{}, s *ParserState, opts ParseOptions) *ParseError

type SimpleToken

type SimpleToken struct {
	TokType  string
	TokValue string
}

SimpleToken is a simple implementation of both the Token and the Parser interfaces, so it can be used in rules to match a token field. It is also the concrete type of the Tokens returned by SimpleTokenStream.

func (*SimpleToken) Parse

func (t *SimpleToken) Parse(_ interface{}, s *ParserState, opts ParseOptions) *ParseError

Parse tries to match the next token in the given TokenStream with the ParseOptions, using opts.MatchToken. If they match, the receiver is loaded with the token data; if not, a non-nil *ParseError is returned. In either case, the next token in the token stream has been consumed.

func (SimpleToken) Type

func (t SimpleToken) Type() string

Type returns the token type.

func (SimpleToken) Value

func (t SimpleToken) Value() string

Value returns the value of the token.

type SimpleTokenStream

type SimpleTokenStream struct {
	// contains filtered or unexported fields
}

SimpleTokenStream is a very simple implementation of the TokenStream interface which the Parse function requires.

func NewSimpleTokenStream

func NewSimpleTokenStream(toks []Token) *SimpleTokenStream
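
For instance, a stream can be built by hand from SimpleTokens instead of using SimpleTokeniser; a small sketch (the token types are chosen to match the README's grammar):

toks := []grammar.Token{
	grammar.SimpleToken{TokType: "bkt", TokValue: "("},
	grammar.SimpleToken{TokType: "atom", TokValue: "foo"},
	grammar.SimpleToken{TokType: "bkt", TokValue: ")"},
}
ts := grammar.NewSimpleTokenStream(toks)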

func (*SimpleTokenStream) Dump

func (s *SimpleTokenStream) Dump(w io.Writer)

func (*SimpleTokenStream) Next

func (s *SimpleTokenStream) Next() Token

Next consumes the next token in the token stream and returns it. If the stream is exhausted, EOF is returned.

func (*SimpleTokenStream) Restore

func (s *SimpleTokenStream) Restore(pos int)

Restore rewinds the token stream to the given position (which should have been obtained from s.Save()). In general this may panic; in this implementation it is always possible to rewind.

func (*SimpleTokenStream) Save

func (s *SimpleTokenStream) Save() int

Save returns the current position in the token stream.
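
Together, Save, Next and Restore are what allow a parser to backtrack. An illustrative sketch of the pattern:

pos := ts.Save()    // remember the current position
tok := ts.Next()    // consume one token
if tok.Type() != "bkt" {
	ts.Restore(pos) // not the token we wanted: rewind and try another rule
}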

type SizeOptions

type SizeOptions struct {
	Min, Max int
}

type Token

type Token interface {
	Type() string  // The type of the token (e.g. operator, number, string, identifier)
	Value() string // The value of the token (e.g. `+`, `123`, `"abc"`, `if`)
}

A Token has a Type() and Value() method. A TokenStream's Next() method must return a Token.

type TokenDef

type TokenDef struct {
	Ptn      string              // The regular expression the token should match
	Name     string              // The name given to this token type
	Special  func(string) string // If defined, it takes over the tokenising for this pattern
	Mode     string
	PushMode string
	PopMode  bool
}

A TokenDef defines a type of token and the pattern that matches it. Used by SimpleTokeniser to create tokenisers simply.

type TokenParseOptions

type TokenParseOptions struct {
	TokenType    string
	TokenValue   string
	DoNotConsume bool
}

func (TokenParseOptions) String

func (o TokenParseOptions) String() string

type TokenStream

type TokenStream interface {
	Next() Token // Advance the stream by one token and return it.
	Save() int   // Return the current position in the token stream.
	Restore(int) // Return the stream to the given position.
}

A TokenStream represents a sequence of tokens that will be consumed by the Parse() function. The SimpleTokenStream implementation in this package can be used, or users can provide their own implementation (e.g. with a richer data type for tokens, or "on demand" tokenisation).

The Restore() method may panic if it is not possible to return to the given position.
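
A minimal sketch of a custom implementation backed by a pre-tokenised slice (functionally similar to SimpleTokenStream; the type and field names are illustrative):

type sliceStream struct {
	toks []grammar.Token
	pos  int
}

func (s *sliceStream) Next() grammar.Token {
	if s.pos >= len(s.toks) {
		return grammar.EOF // signal exhaustion, as SimpleTokenStream does
	}
	tok := s.toks[s.pos]
	s.pos++
	return tok
}

func (s *sliceStream) Save() int       { return s.pos }
func (s *sliceStream) Restore(pos int) { s.pos = pos }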

Directories

Path Synopsis
cmd
