lexer

package
v0.0.0-...-c3230fc
Published: Aug 7, 2023 License: MIT Imports: 3 Imported by: 0

Documentation

Index

Constants

const (
	RUNE_OPEN_PAREN    = '('
	RUNE_CLOSE_PAREN   = ')'
	RUNE_QUESTION_MARK = '?'
	RUNE_COMMENT_SEMI  = ';'

	RUNE_BEGIN_ARROW_LD = '<'
	IMAGE_ARROW_LD      = "<="
)
const CURSOR_TAB_STOP = 4

Arbitrary size for \t alignment.

Variables

var EOF = Token{kTOKENPOS_ZERO, &eofToken{}}

EOF token indicates the end of the token stream. As EOF is not in the document, its TokenPos is always zero.
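
Since Token is a comparable struct, consumers can test for this sentinel directly. A minimal sketch, assuming `reader` is a TokenReader (see below) whose stream delivers this same EOF value as its final token:

for tok := range reader.TokenReceiver() {
	if tok == EOF {
		break // end of stream; equivalent to waiting for the channel to close
	}
	// handle tok
}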

Functions

func IsComment

func IsComment(pos TokenPos) bool

func IsMetaBlock

func IsMetaBlock(pos TokenPos) bool

func IsSentence

func IsSentence(pos TokenPos) bool

func ReadAll

func ReadAll(reader TokenReader) error

Repeatedly calls `NextToken()` until either end of file (EOF) is reached or an error is returned while attempting to read the next token. Unlike NextToken(), it does not forward the io.EOF error: if `EOF` is reached and no other errors are encountered, this method returns `nil`.
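
A minimal end-to-end sketch of a full read, assuming this package is imported as `lexer` along with fmt, log, and strings; the GDL input and the channel handling are illustrative, not prescribed by the API:

tokens := make(chan lexer.Token)
reader := lexer.NewTokenReader(strings.NewReader("(role robot)"), tokens)

// ReadAll blocks while sending tokens, so drain the channel concurrently.
done := make(chan struct{})
go func() {
	defer close(done)
	for tok := range reader.TokenReceiver() {
		fmt.Println(tok)
	}
}()

if err := lexer.ReadAll(reader); err != nil {
	log.Fatal(err) // read errors other than io.EOF
}
<-done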

Types

type Cursor

type Cursor interface {
	// NextRune is called to extend the cursor by reading the next rune from
	// input.  Also updates the pending string except when skipping spaces, and
	// returns the updated cursor and the rune that was read.
	NextRune(input io.RuneReader) (Cursor, rune)
	// Similar to NextRune() but will read from the pending buffer if nonempty,
	// and read from input, populating pending, if pending was empty.  Implicitly
	// ignores leading spaces if needing to read from input.
	FirstRune(input io.RuneReader) (Cursor, rune)

	// Consumes all characters in the pending rune list, updating pos to match.
	ConsumeAll() (Cursor, string)
	// Same as ConsumeAll() except the last rune is left in the pending buffer.
	ConsumeExceptFinal() (Cursor, string)

	// Resets the TokenPos for this cursor to (0, 0, UNKNOWN).
	ResetPos() Cursor

	// The current position of the next Token that would be produced by consuming
	// the contents of this Cursor, whether or not anything is in pending buffer.
	Pos() TokenPos

	// Returns true if there is nothing pending in the cursor.
	IsEmpty() bool

	// Returns true if the last ReadRune call returned an error.
	HasError() bool
	// Returns `true` if the embedded error is io.EOF.
	IsEOF() bool
	// Returns the error (or nil) from the most recent read of input.  If an error
	// is encountered, it will persist through update methods and prohibit reads.
	//
	// Intentionally not extending `error` interface by naming this ErrorValue.
	ErrorValue() error
}

The Cursor represents a few properties of the lexer's state that are invariably coupled to each other -- the token position, the runes ready to be integrated into the next token, and whether there is a pending rune waiting to be processed. The next token's position should always be the current position plus the size of the pending rune, if there is one, but that depends on whether scanning can be done in LL(1) or, in some cases such as `(` and `)`, LL(0). It also smelled bad to update one part of the lexer state and, non-atomically, another part that depended on it.

This interface, and its backing struct, are a solution to the above problems while also aiding the readability of the token-specific lexer code. The coupled updates are done within the Advance and Consume methods; there is no redundant next-position state and no ambiguity about the contents of the pending image. In addition, the cursor is copy-on-write: all updates are conveyed by the return value of the updating method, and the implementing methods use by-value receivers, so downcast-and-update has limited adverse effect.

However, a Cursor assumes that it is the only reader on the provided input and that its scan position is consistent between calls to Advance. If multiple concurrent cursors are needed on the same reader source, use a new reader for each cursor or tee the source RuneReader, rather than further complicating this code with management of byte offsets and seeks at each read, especially since tokenizing a byte stream is inherently single-threaded. Calling code is expected to manage this, typically via lexerState.

func NewCursor

func NewCursor() Cursor
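
The interface is enough to sketch how token-scanning code might drive a cursor. The helper below is purely illustrative (the package's real scanners are internal to the lexer) and is written as if inside the package:

// scanIdentifier is a hypothetical helper, not part of this package's API.
func scanIdentifier(input io.RuneReader) (Token, Cursor, error) {
	cur := NewCursor()
	cur, r := cur.FirstRune(input) // first rune, skipping leading spaces
	for !cur.HasError() && unicode.IsLetter(r) {
		cur, r = cur.NextRune(input) // extend the pending image
	}
	if cur.HasError() && !cur.IsEOF() {
		return EOF, cur, cur.ErrorValue() // a real read error; EOF merely ends the scan
	}
	pos := cur.Pos() // position of the token about to be produced
	var image string
	if cur.IsEOF() {
		cur, image = cur.ConsumeAll() // nothing follows; take everything
	} else {
		cur, image = cur.ConsumeExceptFinal() // leave the delimiter pending for the next token
	}
	return Identifier(image, pos), cur, nil
}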

type ExprEndToken

type ExprEndToken struct{ SymbolToken }

Token EXPR_END = ")"

var EXPR_END ExprEndToken

func (ExprEndToken) Image

func (tok ExprEndToken) Image() string

func (ExprEndToken) TypeString

func (tok ExprEndToken) TypeString() string

type ExprStartToken

type ExprStartToken struct{ SymbolToken }

Token EXPR_START = "("

var EXPR_START ExprStartToken

func (ExprStartToken) Image

func (tok ExprStartToken) Image() string

func (ExprStartToken) TypeString

func (tok ExprStartToken) TypeString() string

type KeywordToken

type KeywordToken struct {
	// contains filtered or unexported fields
}

All keywords are given the KEYWORD token type.

func (KeywordToken) At

func (tok KeywordToken) At(pos TokenPos) Token

Constructs a Token instance pointing to the singular KeywordToken instance for the specific keyword.

func (KeywordToken) Image

func (tok KeywordToken) Image() string

Satisfies the requirement for TokenType interface.

func (KeywordToken) TypeString

func (tok KeywordToken) TypeString() string

Satisfies the requirement for TokenType interface.

type LDArrowToken

type LDArrowToken struct{ SymbolToken }

Token ARROW_LD = "<="

var ARROW_LD LDArrowToken

func (LDArrowToken) Image

func (tok LDArrowToken) Image() string

func (LDArrowToken) TypeString

func (tok LDArrowToken) TypeString() string

type QMarkToken

type QMarkToken struct{ SymbolToken }

Token QUE_MARK = "?"

var QUE_MARK QMarkToken

func (QMarkToken) Image

func (tok QMarkToken) Image() string

func (QMarkToken) TypeString

func (tok QMarkToken) TypeString() string

type SymbolToken

type SymbolToken struct{}

Symbol tokens always have the same image, so they can share a common instance.

type Token

type Token struct {
	TokenPos
	TokenType
}

Represents a Token instance by its position in the source and its type. The TokenType is an embedded interface (see TokenType below) and may be initialized with state/context, or it may reuse a shared instance for the many tokens that are universally identical within their type (e.g. keywords, operator symbols). TokenPos is a 32-bit uint composite value defined in [token_pos.go].
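
Because both fields are embedded, the position and type methods are promoted onto Token itself. A small illustration using constructors documented below (printed values assume a fresh position at line 3, column 1):

tok := ExpressionStart(NewTokenPos(3, 1).InSentence())
fmt.Println(tok.Line(), tok.Column()) // promoted from TokenPos: 3 1
fmt.Println(tok.Image())              // promoted from TokenType: "("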

func ExpressionEnd

func ExpressionEnd(pos TokenPos) Token

Indicates the end of expressions and sub-expressions within a sentence.

func ExpressionStart

func ExpressionStart(pos TokenPos) Token

Begins all expressions, the main structural denotation in GDL syntax.

func Identifier

func Identifier(name string, pos TokenPos) Token

Identifier is a catch-all token for alphanumeric strings that are not keywords.

func Integer

func Integer(image string, pos TokenPos) Token

More complex numeric types can be constructed from sequences of unsigned integers and punctuation. This also keeps the tokenizer's state management simpler by defining negatives, floats, etc. in terms of production-rule semantics. GDL and GDL-II both assume only integer constants in [0, 100].

func KeywordAt

func KeywordAt(image string, pos TokenPos) Token

func LeftDoubleArrow

func LeftDoubleArrow(pos TokenPos) Token

Used in constructing relations.

func LineComment

func LineComment(image string, pos TokenPos) Token

Line comments are any sequence of characters beginning with a semicolon and extending until the next newline rune '\n'.

func QuestionMark

func QuestionMark(pos TokenPos) Token

Used in the production rule for Variable terms.

func UnexpectedToken

func UnexpectedToken(image string, pos TokenPos) Token

An unexpected token is used when a parse error is encountered even though no read errors occurred (read errors are returned by the NextToken call). Examples include incomplete Unicode bytes or a string without a closing quote. Illegal tokens retain the image of the scan up to and including the bad character.

func (Token) String

func (data Token) String() string

General implementation of the string conversion (i.e. for fmt interpolation). More specific Token types may override this String() method, but the only operations that should make use of it are logging, debugging, and testing.

type TokenPos

type TokenPos uint32

TokenPos encoded as a 32-bit uint:

	.LLLLLLLLLLLLLLLLLLLLCCCCCCCCCCFF.
	:[++++++++++++++++++]            :  20 bits LINE
	:                    [++++++++]  :  10 bits COLUMN
	:                              []:   2 bits FLAGS
	`10987654321098765432109876543210'

Use Line(), Column() and Next*() methods to read and update values.
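
The diagram implies the following bit packing. The decoder below is a sketch of the documented layout only, not the package's actual implementation; prefer the accessors:

// decode is illustrative of the documented 20/10/2 layout.
func decode(pos TokenPos) (line, col, flags uint32) {
	p := uint32(pos)
	flags = p & 0x3            // low 2 bits: FLAGS
	col = (p >> 2) & 0x3FF     // next 10 bits: COLUMN, 0-1023
	line = (p >> 12) & 0xFFFFF // high 20 bits: LINE
	return
}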

func NewTokenPos

func NewTokenPos(line, col uint) TokenPos

func (TokenPos) Column

func (pos TokenPos) Column() uint

Returns the 1-indexed column number of the position; zero means unknown. Token embeds TokenPos to adopt its Column() method.

func (TokenPos) InComment

func (pos TokenPos) InComment() TokenPos

Produces the same Token position, ensuring its flag is set to COMMENT mode.

func (TokenPos) InMetaBlock

func (pos TokenPos) InMetaBlock() TokenPos

Produces the same Token position, ensuring its flag is set to META_BLOCK mode.

func (TokenPos) InSentence

func (pos TokenPos) InSentence() TokenPos

Produces the same Token position, ensuring its flag is set to SENTENCE mode.

func (TokenPos) Line

func (pos TokenPos) Line() uint

Returns the 1-indexed line number of the position; zero means unknown. Token embeds TokenPos to adopt its Line() method.

func (TokenPos) NextAt

func (pos TokenPos) NextAt(lines, cols uint) TokenPos

Increments by number of lines then by number of columns.

func (TokenPos) NextCol

func (pos TokenPos) NextCol() TokenPos

Increments the column, keeping the current flag.

func (TokenPos) NextLine

func (pos TokenPos) NextLine() TokenPos

Increments the position to its next line, resetting the column as well. The flag is cleared if currently in comment mode, and retained otherwise.

func (TokenPos) ResetFlag

func (pos TokenPos) ResetFlag() TokenPos

Resets the flag value to unknown.

func (TokenPos) String

func (data TokenPos) String() string

String conversion for the TokenPos value. As a uint it already had a conversion available, but the integer value obscures the actual position data.

type TokenReader

type TokenReader interface {
	// Reads the next token, sending it to output, returning error or nil.  If an
	// io.EOF error was encountered it is returned here as well.
	NextToken() error

	// Read/Receive-only channel for Token values sent as being read from the input.
	// Calling NextToken() or ReadAll() will produce tokens on this channel and one
	// of those methods will close the channel when it encounters EOF. An EOF token
	// is also produced as the last token on the channel, so consumers can listen
	// for it specifically or listen until channel close using `for ... := range`.
	TokenReceiver() <-chan Token
}

Public interface for reading a stream of tokens, sending them to a channel. See also ReadAll(reader) which provides a simpler interface for full reads.

func NewTokenReader

func NewTokenReader(input io.RuneReader, output chan Token) TokenReader

Constructor function for a lexer-based token reader.
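
Note that the input must be an io.RuneReader. An os.File is not one, so a typical call site (an assumption about usage, not a requirement of this package) wraps the file with bufio:

f, err := os.Open("game.gdl") // hypothetical input file
if err != nil {
	log.Fatal(err)
}
defer f.Close()
reader := lexer.NewTokenReader(bufio.NewReader(f), make(chan lexer.Token))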

type TokenType

type TokenType interface {
	// Returns a string representation of the type of this token.
	TypeString() string
	// Returns a string representation of this token, its syntactic image.
	Image() string
}

TokenType intrinsically defines the subtype of a Token and provides identifying methods.
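
Any type providing these two methods satisfies the interface. A hypothetical example, mirroring the shape of the package's own token types:

// stringToken is hypothetical and not part of this package.
type stringToken struct{ text string }

func (t stringToken) TypeString() string { return "STRING" }
func (t stringToken) Image() string      { return t.text }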
