Regexl

Regexl is a high-level language for regular expressions that can be used in any project as a simple library.

You can read about the reasoning for creating Regexl here.

Playground

There is a (WASM-based) playground where you can play with Regexl here.

Regexl Query Examples

  • /friend/i is equivalent to the regexl:
select 'friend'
  • /^friend/i is equivalent to the regexl:
// This is a regexl comment.
// This set_options configuration is equivalent to: '/i'
set_options({
    case_sensitive: false,
})

select starts_with('friend')
  • /Hello*/g is equivalent to the regexl:
set_options({
    find_all_matches: true,
})

//-- This '--' is to help the syntax highlighter :)
//-- The '+' performs a simple concatenation, as all functions return strings
select 'Hell' + zero_plus_of('o')
  • /^Golang$/i is equivalent to the regexl:
set_options({
    case_sensitive: false,
})
//-- Functions can be nested, as outputs are strings.
//-- Alternative regexl: select starts_and_ends_with('Golang')
select ends_with(starts_with('Golang'))
  • /[abcd]/ig (match any of these 4 letters) is equivalent to the regexl:
set_options({
    find_all_matches: true,
    case_sensitive: false,
})
//-- Can also be: select any_chars_of('abcd')
select any_chars_of('abc', 'd')
  • /[A-Z0-9]/ig (match letters and numbers only) is equivalent to the regexl:
set_options({
    find_all_matches: true,
    case_sensitive: false,
})
select any_chars_of(from_to('A', 'Z'), from_to(0, 9))
  • /[A-Z0-9._%+-]+@[A-Z0-9.-]+\.[A-Z]{2,10}/i (a 'simple' email regex) is equivalent to the regexl:
set_options({
    case_sensitive: false,
})
select
    //-- Converts to: [A-Z0-9._%+-]+
    one_plus_of(
        any_chars_of(from_to('A', 'Z'), from_to(0, 9), '._%+-')
    ) +
    //-- Converts to: @
    '@' +
    //-- Converts to: [A-Z0-9.-]+
    one_plus_of(
        any_chars_of(from_to('A', 'Z'), from_to(0, 9), '.-')
    ) +
    //-- Converts to: \.
    '.' +
    //-- Converts to: [A-Z]{2,10}
    count_between(
        any_chars_of(from_to('A', 'Z')),
        2,
        10
    )

Usage in Go

package main

import (
	"fmt"

	"github.com/bloeys/regexl"
)

func main() {

	regexlQuery := `
		set_options({
			find_all_matches: true,
			case_sensitive: false,
		})

		select starts_with('Hello there, ') + one_plus_of(any_chars_of(from_to('A', 'Z'), '.!-'))
	`

	rl := regexl.NewRegexl(regexlQuery)
	hasMatch := rl.MustCompile().CompiledRegexp.MatchString("Hello there, friend!")

	fmt.Printf("Produced regex: %s\nHas match: %v\n", rl.CompiledRegexp.String(), hasMatch)
}

Technical Details

The Regexl code is a very simple compiler, where the general steps are:

  1. Input query text is tokenized (implemented by parser.go)
  2. Tokens are used to create an Abstract Syntax Tree (AST) (implemented by ast.go)
  3. The AST is fed into a 'backend' that outputs a specific regex string (e.g. Go regex) (implemented by regex_go_backend.go)

To explain the above, let's look at how the following query is compiled:

select starts_with('hello')

By tokenization we mean turning the input string into higher-level segments, split on separators such as spaces, brackets, and so on. For the above query you get the following tokens:

  • Token value: select; Type: keyword
  • Token value: starts_with; Type: function name
  • Token value: (; Type: open bracket
  • Token value: hello; Type: string
  • Token value: ); Type: close bracket
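
For illustration, this tokenization step can also be driven directly through the exported Parser type. The sketch below is a rough example; the exact token stream (for instance, whether whitespace or comment tokens appear) may differ from the simplified list above.

package main

import (
	"fmt"

	"github.com/bloeys/regexl"
)

func main() {
	// Tokenize the example query into typed tokens.
	p := regexl.NewParser("select starts_with('hello')")
	tokens, err := p.Tokenize()
	if err != nil {
		panic(err)
	}

	// Print each token's value and type, e.g. "select" -> keyword.
	for _, t := range tokens {
		fmt.Printf("%q -> %s\n", t.Val, t.Type)
	}
}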

With this list of tokens, an AST is created. An Abstract Syntax Tree represents the structure of a program as a tree, where parent nodes depend on their child nodes. For example, if function A calls function B, the node for that call becomes a child of A's node, and the arguments of the call become children of the call node.

In our query, the linear token list produces this AST:

|-- select
|   |-- starts_with
|   |   |-- hello

With the AST in place, we can traverse the tree and generate some output. In general-purpose programming languages (e.g. C, Go, Python) the final output would be machine code, assembly, or perhaps bytecode to be interpreted.

In Regexl, the output is a regex string targeting a specific implementation, such as Go-compatible or Python-compatible regex (regex syntax and features differ between implementations).

The Go regex produced for our example Regexl query is:

(?i)^hello

Equivalent to the more common regex expression:

/^hello/i

The nice thing about this setup is that supporting a new regex implementation only requires implementing a new backend (step 3); tokenization and AST generation are reused as-is.
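
The same pipeline can be run step by step through the exported types. The sketch below is an approximation: the expected outputs in the comments come from the walkthrough above, and whether GoBackend needs any options set explicitly is an assumption.

package main

import (
	"fmt"

	"github.com/bloeys/regexl"
)

func main() {
	query := "select starts_with('hello')"

	// Step 1: tokenize the query text.
	tokens, err := regexl.NewParser(query).Tokenize()
	if err != nil {
		panic(err)
	}

	// Step 2: build the AST from the tokens.
	ast := regexl.NewAst(tokens)
	if err := ast.Gen(); err != nil {
		panic(err)
	}
	ast.PrintTree() // prints a tree similar to the diagram above

	// Step 3: hand the AST to the Go backend, which returns both the
	// compiled *regexp.Regexp and the regex string it generated.
	gb := &regexl.GoBackend{}
	re, regexStr, err := gb.AstToGoRegex(ast)
	if err != nil {
		panic(err)
	}

	fmt.Println(regexStr)                       // expected: (?i)^hello
	fmt.Println(re.MatchString("hello, world")) // expected: true
}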

Todo

  • Become feature complete with Go regex
  • Better error messages
  • More test cases

Documentation

Constants

const (
	AST_INVALID_INDEX = -1
)

Variables

var (
	PrintTokens  bool
	PrintAstJson bool
	PrintAstTree bool
)

Debug options. @TODO: remove or make something nicer.

Functions

This section is empty.

Types

type Ast

type Ast struct {
	Tokens []Token
	Nodes  []Node
}

func NewAst

func NewAst(tokens []Token) *Ast

func (*Ast) Gen

func (a *Ast) Gen() error

func (*Ast) GetToken

func (a *Ast) GetToken(index int) *Token

func (*Ast) PrintTree

func (a *Ast) PrintTree()

type AstError

type AstError struct {
	Err error
	Pos TokenPos
}

func (*AstError) Error

func (te *AstError) Error() string

type BinaryExpr

type BinaryExpr struct {
	Pos  TokenPos
	Type TokenType
	Lhs  Expr
	Rhs  Expr
}

func (*BinaryExpr) EndPos

func (e *BinaryExpr) EndPos() TokenPos

func (*BinaryExpr) StartPos

func (e *BinaryExpr) StartPos() TokenPos

type Expr

type Expr interface {
	Node
	// contains filtered or unexported methods
}

type FuncExpr

type FuncExpr struct {
	Pos             TokenPos
	Ident           IdentExpr
	Args            []Expr
	OpenBracketPos  TokenPos
	CloseBracketPos TokenPos
}

func (*FuncExpr) EndPos

func (e *FuncExpr) EndPos() TokenPos

func (*FuncExpr) StartPos

func (e *FuncExpr) StartPos() TokenPos

type GoBackend

type GoBackend struct {
	Opts RegexOptions
}

GoBackend produces valid Go regex strings, based on the rules here: https://pkg.golang.ir/regexp/syntax

func (*GoBackend) ApplyOptionsToRegexString

func (gb *GoBackend) ApplyOptionsToRegexString(regexString string) string

func (*GoBackend) AstToGoRegex

func (gb *GoBackend) AstToGoRegex(ast *Ast) (*regexp.Regexp, string, error)

type IdentExpr

type IdentExpr struct {
	Name string
	Pos  TokenPos
}

func (*IdentExpr) EndPos

func (e *IdentExpr) EndPos() TokenPos

func (*IdentExpr) StartPos

func (e *IdentExpr) StartPos() TokenPos

type KeyValExpr

type KeyValExpr struct {
	Key      IdentExpr
	Val      Expr
	ColonPos TokenPos
}

func (*KeyValExpr) EndPos

func (e *KeyValExpr) EndPos() TokenPos

func (*KeyValExpr) StartPos

func (e *KeyValExpr) StartPos() TokenPos

type LiteralExpr

type LiteralExpr struct {
	Pos  TokenPos
	Type TokenType
	// Value depends on the type, so it can contain a numeric value, a string, etc.
	Value string
}

func (*LiteralExpr) EndPos

func (e *LiteralExpr) EndPos() TokenPos

func (*LiteralExpr) StartPos

func (e *LiteralExpr) StartPos() TokenPos

type Node

type Node interface {
	// StartPos is the position of the first byte of the first character making up this node
	StartPos() TokenPos
	// EndPos is the position of the first byte of the first character that doesn't belong to this node.
	// This means EndPos is +1 of the last character, so it acts in the same way len() does
	EndPos() TokenPos
}
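
For example, in the query select 'friend', a node covering the string literal 'friend' (including its quotes) would have StartPos() == 7 and EndPos() == 15, so that query[7:15] yields exactly that literal, in the same half-open way Go slicing and len() work (the offsets here are illustrative, not output from the package).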

type ObjectLiteralExpr

type ObjectLiteralExpr struct {
	OpenCurly  TokenPos
	CloseCurly TokenPos
	KeyVals    []KeyValExpr
}

func (*ObjectLiteralExpr) EndPos

func (e *ObjectLiteralExpr) EndPos() TokenPos

func (*ObjectLiteralExpr) StartPos

func (e *ObjectLiteralExpr) StartPos() TokenPos

type Parser

type Parser struct {
	Query string
}

func NewParser

func NewParser(query string) *Parser

func (*Parser) GetNextRuneByByteIndex

func (p *Parser) GetNextRuneByByteIndex(index int) (rune, error)

func (*Parser) GetRuneByByteIndex

func (p *Parser) GetRuneByByteIndex(index int) (rune, error)

func (*Parser) Tokenize

func (p *Parser) Tokenize() (tokens []Token, err error)

func (*Parser) ValidateTokens

func (p *Parser) ValidateTokens(tokens []Token) error

type ParserError

type ParserError struct {
	Err error
	Pos TokenPos
}

func (*ParserError) Error

func (te *ParserError) Error() string

type RegexOptions

type RegexOptions struct {
	CaseSensitive  bool
	FindAllMatches bool
}

type Regexl

type Regexl struct {
	Query          string
	CompiledRegexp *regexp.Regexp
}

func NewRegexl

func NewRegexl(query string) *Regexl

func (*Regexl) Compile

func (rl *Regexl) Compile() error

Compile tries to compile the query within this Regexl object and then sets Regexl.CompiledRegexp. Regexl.CompiledRegexp is only set if no error is found; otherwise the error is returned and Regexl.CompiledRegexp is left unchanged.

func (*Regexl) MustCompile

func (rl *Regexl) MustCompile() *Regexl

MustCompile compiles the query within this Regexl object by calling Regexl.Compile and panics if an error occurs.
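
A minimal sketch of using both (the queries and expected outputs below are illustrative):

package main

import (
	"fmt"
	"log"

	"github.com/bloeys/regexl"
)

func main() {
	// Compile returns an error that the caller can handle.
	rl := regexl.NewRegexl("select starts_with('friend')")
	if err := rl.Compile(); err != nil {
		log.Fatalf("invalid regexl query: %v", err)
	}
	fmt.Println(rl.CompiledRegexp.MatchString("friends forever")) // expected: true

	// MustCompile panics on error instead, which suits queries known to be valid.
	rl2 := regexl.NewRegexl("select 'friend'").MustCompile()
	fmt.Println(rl2.CompiledRegexp.String())
}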

type SelectStmt

type SelectStmt struct {
	Pos  TokenPos
	Type TokenType
	Es   []Expr
}

func (*SelectStmt) EndPos

func (s *SelectStmt) EndPos() TokenPos

func (*SelectStmt) StartPos

func (s *SelectStmt) StartPos() TokenPos

type Stmt

type Stmt interface {
	Node
	// contains filtered or unexported methods
}

type Token

type Token struct {
	Val  string
	Type TokenType
	Pos  TokenPos
}

func (*Token) HasLoc

func (t *Token) HasLoc() bool

func (*Token) IsEmpty

func (t *Token) IsEmpty() bool

func (*Token) MakeEmpty

func (t *Token) MakeEmpty()

type TokenPos

type TokenPos int

type TokenType

type TokenType int
const (
	TokenType_Unknown TokenType = iota
	TokenType_Space
	TokenType_String
	// TokenType_Single_Quote
	TokenType_Int
	TokenType_Float
	TokenType_Operator
	TokenType_OpenBracket
	TokenType_CloseBracket
	TokenType_OpenCurlyBracket
	TokenType_CloseCurlyBracket
	TokenType_Colon
	TokenType_Comma
	TokenType_Bool
	TokenType_Plus
	TokenType_Comment
	TokenType_Object_Param
	TokenType_Function_Name
	TokenType_Keyword
)

func (TokenType) MarshalText

func (tt TokenType) MarshalText() (text []byte, err error)

func (TokenType) String

func (i TokenType) String() string
