ragel

package module
v2.2.4 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Apr 18, 2020 License: MIT Imports: 3 Imported by: 3

README

ragel-go

godocb goreportb

This package provides a driver for ragel based scanners in Go for streamed input. It takes care of all the boilerplate code, letting the user focus on the ragel state machine definition.

It is only intended for use with ragel machines defined as scanners. See section 6.3 of the ragel guide.

Rationale

Streaming input data to ragel -- i.e. reading from an io.Reader as opposed to loading the whole input into memory, requires a driver, a piece of code that buffers input and manipulates pointers for the ragel state machine.

Here is the Go version of the main loop from the C-like scanner in the ragel examples:

func scanner(r io.Reader) {
    var (
        buf = make([]byte, BUFSIZE)
        cs, act, ts, te int
        have, curLine = 0, 1
        done bool
    )

    // ragel init code
    %% write init;

    for !done {
        var (
            err error
            p = have
            ln, pe int
            eof = -1
        )

        if ( p == len(buf) ) {
            fmt.Fprintf(os.Stderr, "out of buffer space\n" )
            os.Exit(1)
        }

        for {
            ln, err = r.Read(buf[p:])
            if ln != 0 || err != nil {
                break
            }
        }
        pe = p + ln;

        // trigger end of file on any error
        if err != nil {
            eof = pe
            done = true
        }

        // Run the ragel state machine
        %% write exec;

        if cs == clang_error {
            fmt.Fprintf(os.Stderr, "parse error (invalid char)\n" )
            break
        }

        if ts == 0 {
            have = 0
        } else {
            // There is a prefix to preserve, shift it over.
            have = pe - ts;
            copy(buf, buf[ts:pe])
            te -= ts
            ts = 0
        }
    }
}

In this example, the state machine actions just write tokens to stdout. This is a lot of scaffolding and yet a few important things are still missing, like a convenient way to retrieve the tokens and some mechanism to keep track of line and column position for these tokens.

This package provides all the boilerplate code needed to read data from an io.Reader, buffering, token position tracking (line and column) and error handling and lets you focus on your ragel state machine definition.

Quick start

In essence, creating a scanner is as simple as creating a ragel machine definition for your language of choice, along with a tiny Go interface implementation.

Ragel state machine definition

Here is an abbreviated version for the C-like language mentioned above (with support for comments, strings and char literals removed for brevity):

// scanner/myc.rl

// Package scanner provides a scanner for my-C language.
//
package scanner // import "github.com/me/myc/scanner"

import "github.com/db47h/ragel/v2"

// Tokens for my-C
//
const (
    Ident ragel.Token = iota
    Float
    Symbol
)

// ragel state machine definition.
%%{
    machine lang;

    # utf-8 support
    include WChar "utf8.rl";

    newline = '\n' @{ s.Newline(p) };

    main := |*
        alpha_u = uletter | '_';
        alnum_u = alpha_u | digit;

        # punctuation
        punct {
            s.Emit(ts, Symbol, string(data[ts:te]));
        };

        # identifiers
        alpha_u alnum_u* {
            s.Emit(ts, Ident, string(data[ts:te]))
        };

        # ignore white space and control codes
        newline | 0x00..0x20 | 0x7f;

        # integers
        digit+ ('.' digit+)* {
            s.Emit(ts, Float, string(data[ts:te]))
        };
    *|;
}%%

%%write data nofinal;

// MyC implements ragel.Interface. This is the canonical implementation
// available at https://godoc.org/github.com/db47h/ragel#Interface
// with the type name changed to MyC.
//
type MyC struct {}

func (MyC) Init(s *ragel.State) (int, int) {
    var cs, ts, te, act int
    %%write init;
    s.SaveVars(cs, ts, te, act)
    return %%{ write start; }%%, %%{ write error; }%%
}

func (MyC) Run(s *ragel.State, p, pe, eof int) (int, int) {
    cs, ts, te, act, data := s.GetVars()
    %%write exec;
    s.SaveVars(cs, ts, te, act)
    return p, pe
}

Note that this example imports a file named "utf8.rl" for UTF-8 support. This should be generated first and only once:

go run $GOPATH/github.com/db47h/ragel/cmd/unicode2ragel/main.go -o utf8.rl

Now supposing that the above example is saved in a file named myc.rl, run it through ragel with the following command:

    ragel -Z -G2 myc.rl -o myc.go

This will generate a myc.go file. I tend to use the -G2 because it yields the fastest running code in most of my use cases. Your mileage may vary.

You might also want add go:generate directives in some other .go file in the same package:

//go:generate ragel -Z -G2 myc.rl -o myc.go
//go:generate sed -i "1s|.*|// Code generated by ragel. DO NOT EDIT.|" myc.go

The second directive changes the first line of the generated code so that the generated file gets ignored by the Go tooling.

The MyC struct defined in the ragel state machine definition file is the ragel.Interface implementation that is used in the call to ragel.New. Besides changing the struct name to suit your needs (e.g. make it private), this bit of code must be copy-pasted as-is from the reference implementation. If you feel that you need to change it for some reason, please file an issue.

Scanning

Call ragel.New with an io.Reader and a MyC struct instance, then iterate over tokens with Scanner.Next:

// main.go

package main // import "github.com/me/myc"

import (
    "os"
    "fmt"

    "github.com/db47h/ragel/v2"
    "github.com/me/myc/scanner"
)

func main() {
    s := ragel.New("stdin", os.Stdin, scanner.MyC{})
    for {
        pos, tok, literal := s.Next()
        switch tok {
        case ragel.EOF:
            return
        case ragel.Error:
            fmt.Fprintf(os.Stderr, "scan error: %v: %v\n", s.Pos(pos), literal)
        default:
            fmt.Printf("%v: %d %s\n", s.Pos(), tok, literal)
        }
    }
}

That's it.

Supported Go and Ragel versions

Go

This package uses the Go modules convention introduced in Go 1.11. The current major version is v2, so it should be imported like this:

    import "github.com/db47h/ragel/v2"

This is supported by Go 1.11, 1.10.4 and 1.9.7.

It might be possible to get it to work with earlier versions using third-party package management tools, like dep, but this has not been tested.

Ragel

Ragel provides good Go support from version 6.9 onwards.

The development releases (versions 7.x) should be avoided for the time being. They have issues with that bit of code in the Go generator:

    return %%{ write start; }%%, %%{ write error; }%%

If you need to use a 7.x version, just change the write statements like so:

    return MachineName_start, MachineName_error

where MachineName is the FSM name as used in the ragel machine statement.

For all ragel versions, the alphabet data type needs to left to the default setting of byte, even for UTF-8 input. i.e. if you have any alphtype … ; statement, remove it.

Handling UTF-8 input

UTF-8 input is supported by including UTF-8 state machines automatically generated by the companion tool unicode2ragel.

Implementation details

The scanner reads input data in chunks of 32KiB (the default setting), then tokenizes the whole buffer and pushes the tokens found in a FIFO queue.

scanner.Next pops tokens from the queue and populates it as needed.

An alternative to using a queue would be to run the scanning loop in a goroutine and use a channel to emit tokens -- like text/template/parse in Go's standard library. This approach is however 2 to 5 times slower than using a queue (channels are great, just not here). See https://github.com/db47h/ragel/issues/4.

Error handling is done by returning a ragel.Error token. This can happen in 4 cases:

  • an IO error while reading from the input reader
  • the input buffer gets full
  • an illegal symbol is encountered (i.e. unhandled by the state machine definition)
  • a ragel action calls State.Errorf

The first two cases are non-recoverable errors; any subsequent call to scanner.Next returns ragel.EOF. For illegal symbols, scanning resumes at the following byte. Actions calling State.Errorf must update ragel's p variable (or issue an appropriate fgoto) for non-fatal errors.

The standard ragel state variables data, p, pe, eof, cs, ts, te and act are directly available in state machine definitions, i.e. data[p] is the current character.

The ragel.Interface

Ragel state machine definitions are templates where Go code is intermingled with ragel directives (like the %% write exec; in the above example) and ragel generates compilable Go code out of these templates.

The ragel directives %% write exec; and %% write init; generate code that expects a few variables to be accessible in the program scope where these directives appear. As a result, and although the variables names can be customized (they could even be struct fields), the ragel state machine definition is tightly coupled with the Go code that drives it. In order to have a package that is go-gettable and updatable independently of users' own state machine definitions, I needed to find a way to minimize the amount of Go code present in a state machine definition file.

Since there is no way around some copy-paste, I came up with the ragel.Interface interface and a canonical implementation that must be placed in state machine definition files. This implementation uses as little code as possible in order to limit potential bugs and avoid painful updates.

API

The ragel.Interface implementation discussed above brings a *State instance s into the scope of actions in ragel state machine definitions. Actions can use its Emit, Errorf and Newline methods. The first two are mostly self-explanatory. Users who need accurate line and column tracking must take care of calling Newline at the right places (see how this is done in the above example).

The State.SaveVars and State.GetVars methods manage the standard ragel state variables and are only meant to be used by the ragel.Interface implementation. As such, they must not be used by client code and should be considered private for all intents and purposes.

The Scanner type wraps things together and provides the scanning client API.

Behind the scenes, State and Scanner are thin wrappers around state, their sole purpose being a clean separation of concerns between the API used by state machine definitions from the API used by parsers.

LICENSE

Copyright 2018 Denis Bernard [email protected]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Documentation

Overview

Package ragel provides a driver for ragel based scanners.

It provides all the functionality needed to read data from an io.Reader, buffering, token position tracking (line and column) and error handling.

The scanner reads input data in chunks of 32KiB (the default setting), then tokenizes the whole buffer and pushes the tokens found in a FIFO queue.

scanner.Next pops tokens from the queue and populates it as needed.

Error handling is done by returning a ragel.Error token. This can happen in 4 cases:

  • an IO error while reading from the input reader
  • the input buffer gets full
  • an illegal symbol is encountered (i.e. unhandled by the state machine definition)
  • a ragel action calls State.Errorf

The first two cases are non-recoverable errors; any subsequent call to scanner.Next returns ragel.EOF. For illegal symbols, scanning resumes at the following byte. Actions calling State.Errorf must update ragel's p variable (or issue an appropriate fgoto) for non-fatal errors.

UTF-8 input is supported by including UTF-8 state machines automatically generated by the companion tool unicode2ragel (see the subfolder cmd/unicode2ragel).

Index

Constants

View Source
const DefaultBufferSize = 32768

DefaultBufferSize is the default buffer size for a new Scanner. This can be changed at creation time with the BufferSize option.

Variables

This section is empty.

Functions

This section is empty.

Types

type Interface

type Interface interface {
	// Init initializes the FSM. Returns the start and error state indices.
	Init(s *State) (start, err int)
	// Run runs the scanner on the current buffer contents.
	Run(s *State, p, pe, eof int) (np, npe int)
}

Interface provides an interface to generated code for a given ragel state machine.

The following canonical implementation must be present it the ragel state machine definition file:

type iface struct {}

func (iface) Init(s *ragel.State) (int, int) {
	var cs, ts, te, act int
	%%write init;
	s.SaveVars(cs, ts, te, act)
	return %%{ write start; }%%, %%{ write error; }%%
}

func (iface) Run(s *ragel.State, p, pe, eof int) (int, int) {
	cs, ts, te, act, data := s.GetVars()
	%%write exec;
	s.SaveVars(cs, ts, te, act)
	return p, pe
}

The actual type name and whether it is public or private do not matter since it is only used by client code in the call to ragel.New:

scanner := ragel.New("stdin", os.Stdin, iface{})

type Option added in v2.1.0

type Option interface {
	// contains filtered or unexported methods
}

Option is a scanner option for New.

func BufferSize added in v2.1.0

func BufferSize(size int) Option

BufferSize returns an Option that sets the input buffer size. The buffer size conditions the size of the longest possible token.

type Position

type Position struct {
	FileName string
	Line     int // 1-based line number
	Column   int // 1-based column number in bytes
}

Position represents the position of a token as its 1-based line and column numbers.

func (Position) String

func (p Position) String() string

type Scanner

type Scanner state

Scanner provides the API for scanner clients.

func New

func New(fileName string, r io.Reader, iface Interface, opts ...Option) *Scanner

New returns a new scanner for the given io.Reader and Interface.

The Interface implementation must be provided by the ragel generated code.

func (*Scanner) Next

func (s *Scanner) Next() (offset int, token Token, literal string)

Next returns the next token in the input stream.

func (*Scanner) Pos

func (s *Scanner) Pos(offset int) Position

Pos returns the line/column Position for the given offset.

func (*Scanner) Reset

func (s *Scanner) Reset(opts ...Option)

Reset resets the scanner to the start of the input stream. It will panic if the source reader is not an io.Seeker.

type State

type State state

State holds a scanner's internal state while processing its input.

func (*State) Emit

func (s *State) Emit(ts int, typ Token, value string)

Emit adds the given token to the output queue. The ts argument is usually the ragel variable ts.

func (*State) Errorf

func (s *State) Errorf(p int, fatal bool, format string, args ...interface{})

Errorf emits an Error token at position p. If the fatal argument is true, any subsequent call to Scanner.Next will return an EOF token, otherwise scanning will resume from position p (which the caller must update accordingly). The format and args arguments are passed through fmt.Sprintf to form the tokens's literal value.

func (*State) GetVars

func (s *State) GetVars() (cs, ts, te, act int, data []byte)

GetVars gets the ragel state variables and input buffer. This function must be called at the beginning of Interface.Run in order to initialize these variables properly.

func (*State) Newline

func (s *State) Newline(p int)

Newline increments the line count. The p argument should be the ragel variable p.

func (*State) SaveVars

func (s *State) SaveVars(cs, ts, te, act int)

SaveVars saves the ragel state variables. These variables must be saved between calls to Interface.Run, so SaveVars must be called before returning from Interface.Run and Interface.Init.

type Token

type Token int

Token represents a lexical token.

const (
	EOF Token = -1 - iota
	Error
)

Token types.

Directories

Path Synopsis
cmd
unicode2ragel
unicode2ragel generates ragel machines for UTF-8 input based on the unicode ranges defined in Go's unicode package.
unicode2ragel generates ragel machines for UTF-8 input based on the unicode ranges defined in Go's unicode package.
examples
csv
This example shows how to build a parser on top of a ragel scanner to parse CSV files.
This example shows how to build a parser on top of a ragel scanner to parse CSV files.

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL