ahocorasick

package module
v0.0.0-...-87defef Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Jul 14, 2019 License: MIT Imports: 6 Imported by: 0

README

Aho-Corasick

Build Status

Implementation of the Aho-Corasick string-search algorithm in Go.

Licensed under MIT License.

Details

This implementation does not use a Double-Array Trie as in my implementation from a couple of years back.

This reduces the build time drastically, but at the cost of higher memory consumption.

The search time is still fast, and comparable to other Go implementations I have found on github that claims to be fast (see performance).

Documentation

Can be found at godoc.org.

Example Usage

Use a TrieBuilder to build a Trie:

trie := NewTrieBuilder().
    AddStrings([]string{"or", "amet"}).
    Build()

Then go and match something interesting:

matches := trie.MatchString("Lorem ipsum dolor sit amet, consectetur adipiscing elit.")
fmt.Printf("Got %d matches.\n", len(matches))

// => Got 3 matches.

What did we match?

for _, match := range matches {
    fmt.Printf("Matched %q at position %d.\n", match.Match(), match.Pos())
}

// => Matched "or" at position 1.
// => Matched "or" at position 15.
// => Matched "amet" at position 22.

Building

You can easily load patterns from file:

builder := NewTrieBuilder()
builder.LoadPatterns("patterns.txt")
builder.LoadStrings("strings.txt")

Both functions expects a text file with one pattern per line. LoadPatterns expects the pattern to be in hexadecimal form.

Performance

Some simple benchmarking on my machine (Intel(R) Core(TM) i7-8750H CPU @ 2.20GHz, 32 GiB RAM).

Build and search time grows quite linearly with regards to number of patterns and input text length.

Building
BenchmarkTrieBuild/100-12                    10000              0.1460 ms/op
BenchmarkTrieBuild/1000-12                    1000              2.1643 ms/op
BenchmarkTrieBuild/10000-12                    100             14.3305 ms/op
BenchmarkTrieBuild/100000-12                    10            131.2442 ms/op
Searching
BenchmarkMatchIbsen/100-12                 2000000              0.0006 ms/op
BenchmarkMatchIbsen/1000-12                 300000              0.0042 ms/op
BenchmarkMatchIbsen/10000-12                 30000              0.0436 ms/op
BenchmarkMatchIbsen/100000-12                 3000              0.4310 ms/op
Compared to Others

Inspired by anknown I also wanted to check how my implementation compared to other Aho-Corasick implementations in Go.

I created a simple benchmark and ran it on my laptop. With 512,000 patterns, my implementation has comparable build time and faster search time than the other implementations:

anknown         512000     932.57ms    10.81ms      94000
bobusumisu      512000     631.77ms     7.20ms      94000
cloudflare      512000    4879.41ms     2.77ms       4490
iohub           512000     393.96ms    14.19ms      91986

cloudflare is implemented a bit differently though. It doesn't output position of matches but returns indices into the original patterns array.

benchmark plot

Memory Usage

As mentioned, the memory consumption will be quite high compared to a double-array trie implementation. Especially during the build phase (which currently contains a lot of object allocations).

Documentation

Index

Constants

This section is empty.

Variables

This section is empty.

Functions

func MatchEqual

func MatchEqual(a, b *Match) bool

MatchEqual check whether two matches are equal (i.e. at same position and same pattern).

Types

type Match

type Match struct {
	// contains filtered or unexported fields
}

Match represents a matched pattern in the input.

func (*Match) Match

func (m *Match) Match() []byte

Match returns the pattern matched.

func (*Match) MatchString

func (m *Match) MatchString() string

MatchString returns the pattern matched as a string.

func (*Match) Pos

func (m *Match) Pos() int

Pos returns the byte position of the match.

func (*Match) String

func (m *Match) String() string

type Trie

type Trie struct {
	// contains filtered or unexported fields
}

Trie represents a trie of patterns with extra links as per the Aho-Corasick algorithm.

func (*Trie) Match

func (tr *Trie) Match(input []byte) []*Match

Match runs the Aho-Corasick string-search algorithm on a byte input.

func (*Trie) MatchFirst

func (tr *Trie) MatchFirst(input []byte) *Match

MatchFirst is the same as Match, but returns after first successful match.

func (*Trie) MatchFirstString

func (tr *Trie) MatchFirstString(input string) *Match

MatchFirstString is the same as MatchString, but returns after first successful match.

func (*Trie) MatchString

func (tr *Trie) MatchString(input string) []*Match

MatchString runs the Aho-Corasick string-search algorithm on a string input.

type TrieBuilder

type TrieBuilder struct {
	// contains filtered or unexported fields
}

TrieBuilder is used to build Tries.

func NewTrieBuilder

func NewTrieBuilder() *TrieBuilder

NewTrieBuilder creates and initializes a new TrieBuilder.

func (*TrieBuilder) AddPattern

func (tb *TrieBuilder) AddPattern(pattern []byte) *TrieBuilder

AddPattern adds a byte pattern to the Trie under construction.

func (*TrieBuilder) AddPatterns

func (tb *TrieBuilder) AddPatterns(patterns [][]byte) *TrieBuilder

AddPatterns adds multiple byte patterns to the Trie.

func (*TrieBuilder) AddString

func (tb *TrieBuilder) AddString(pattern string) *TrieBuilder

AddString adds a string pattern to the Trie under construction.

func (*TrieBuilder) AddStrings

func (tb *TrieBuilder) AddStrings(patterns []string) *TrieBuilder

AddStrings add multiple strings to the Trie.

func (*TrieBuilder) Build

func (tb *TrieBuilder) Build() *Trie

Build constructs the Trie.

func (*TrieBuilder) LoadPatterns

func (tb *TrieBuilder) LoadPatterns(path string) error

LoadPatterns loads byte patterns from a file. Expects one pattern per line in hexadecimal form.

func (*TrieBuilder) LoadStrings

func (tb *TrieBuilder) LoadStrings(path string) error

LoadStrings loads string patterns from a file. Expects one pattern per line.

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL