dupi

package module

Version: v0.0.11 (latest) | Published: Sep 21, 2021 | License: Apache-2.0 | Imports: 20 | Imported by: 0

Repository

github.com/go-air/dupi


README ¶

⊧ dupi

Dupi is an engine for identifying and exploring duplicative text in sets of documents.

Status

Dupi is in an alpha/early-beta stage of development. Please feel free to give it a try (and file issues). We have run it successfully on several document sets, but it definitely needs more testing.

Input

Throw hundreds of thousands of textual documents at it. Or extract text from other documents and send that to dupi.

Output

Find and query for repeated chunks of text.

Tutorial

Design

Design Document

Library Reference

Documentation ¶

Overview ¶

Package dupi provides a library for exploring duplicate data in large sets of documents.
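To make "duplicate data" concrete, here is a toy sketch of one common way to find repeated text: fingerprint every window of seqLen tokens and record which documents contain each fingerprint. It is not dupi's algorithm or API. The names fingerprint and shared, the FNV-1a hash, and the whitespace tokenizer are all assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
	"strings"
)

// fingerprint hashes a window of tokens to a 32-bit value.
// FNV-1a is an arbitrary choice for this sketch.
func fingerprint(tokens []string) uint32 {
	h := fnv.New32a()
	for _, t := range tokens {
		h.Write([]byte(t))
		h.Write([]byte{0}) // separator, so ("ab","c") differs from ("a","bc")
	}
	return h.Sum32()
}

// shared maps each fingerprint that occurs in at least two documents
// to the sorted names of those documents. docs maps name to body.
func shared(docs map[string]string, seqLen int) map[uint32][]string {
	seen := make(map[uint32]map[string]bool)
	for name, body := range docs {
		toks := strings.Fields(strings.ToLower(body))
		for i := 0; i+seqLen <= len(toks); i++ {
			fp := fingerprint(toks[i : i+seqLen])
			if seen[fp] == nil {
				seen[fp] = make(map[string]bool)
			}
			seen[fp][name] = true
		}
	}
	out := make(map[uint32][]string)
	for fp, names := range seen {
		if len(names) < 2 {
			continue // witnessed in a single document only
		}
		for n := range names {
			out[fp] = append(out[fp], n)
		}
		sort.Strings(out[fp])
	}
	return out
}

func main() {
	docs := map[string]string{
		"a.txt": "the quick brown fox jumps",
		"b.txt": "a quick brown fox sleeps",
		"c.txt": "nothing in common here",
	}
	for fp, names := range shared(docs, 3) {
		fmt.Printf("%08x %v\n", fp, names)
	}
}
```

In this toy version every window is kept in memory. A real index over hundreds of thousands of documents needs sharding and on-disk storage, which is what the Config fields below configure.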

Index ¶

Variables
func RemoveIndex(idx *Index) error
func RemoveIndexer(idx *Indexer) error
type Blot
- func (b *Blot) Cap() int
- func (b *Blot) Doc(i int) *Doc
- func (b *Blot) Len() int
- func (b *Blot) Next(lim bool) *Doc
type Config
- func DefaultConfig(root string) (*Config, error)
- func NewConfig(root string, nbuckets, seqLen int) (*Config, error)
- func ReadConfig(root string) (*Config, error)
- func (cfg *Config) DmdPath() string
- func (cfg *Config) FnamesPath() string
- func (cfg *Config) IixPath(i int) string
- func (cfg *Config) LockPath() string
- func (cfg *Config) Path() string
- func (cfg *Config) PostPath(i int) string
- func (cfg *Config) Write() error
type Doc
- func NewDoc(path, body string) *Doc
- func (doc *Doc) Load() error
type Index
- func OpenIndex(root string) (*Index, error)
- func (x *Index) BlotDoc(dst []uint32, doc *Doc) []uint32
- func (x *Index) Blotter() blotter.T
- func (x *Index) Close() error
- func (x *Index) FindBlot(theBlot uint32, doc *Doc) (start, end uint32, err error)
- func (x *Index) FindBlots(m map[uint32][]byte, doc *Doc) (map[uint32][]byte, error)
- func (x *Index) JoinBlot(shard uint32, sblot uint16) uint32
- func (x *Index) NumShards() int
- func (x *Index) NumShatters() int
- func (x *Index) Root() string
- func (x *Index) SeqLen() int
- func (x *Index) SplitBlot(b uint32) (shard uint32, sblot uint16)
- func (x *Index) StartQuery(s QueryStrategy) *Query
- func (x *Index) Stats() (*Stats, error)
- func (x *Index) TokenFunc() token.TokenizerFunc
type Indexer
- func CreateIndexer(root string, nbuckets, seqLen int) (*Indexer, error)
- func IndexerFromConfig(cfg *Config) (*Indexer, error)
- func OpenIndexer(root string) (*Indexer, error)
- func (x *Indexer) Add(doc *Doc) error
- func (x *Indexer) Close() error
- func (x *Indexer) Root() string
type Query
- func (q *Query) Get(blot *Blot) error
- func (q *Query) Next(dst []Blot) (n int, err error)
type QueryStrategy
type Stats
- func (st *Stats) String() string

Constants ¶

This section is empty.

Variables ¶

var ErrInvalidQueryState = errors.New("query state invalid")

Functions ¶

func RemoveIndex ¶

func RemoveIndex(idx *Index) error

func RemoveIndexer ¶

func RemoveIndexer(idx *Indexer) error

Types ¶

type Blot ¶

type Blot struct {
	Blot uint32
	Docs []Doc
}

Blot represents a piece of a query or extraction. The field Blot gives the blot which was witnessed in the docs specified in the field Docs.

The caller of Query.Next supplies a slice of Blots, which tells the index/query implementation how many blots to return results for.

For each Blot in that slice, the field Docs can either be nil, indicating that all documents should be returned, or non-nil, in which case up to cap(Docs) - len(Docs) doc records are returned, each associated with Blot.
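The Docs convention can be illustrated with a small self-contained sketch. The Doc and Blot types below only mirror the exported fields listed in this reference, and fill is a hypothetical helper, not part of dupi. The sketch assumes the limit is the spare capacity of Docs, that is cap(Docs) - len(Docs).

```go
package main

import "fmt"

// Doc and Blot mirror the exported fields shown in this reference;
// they are local stand-ins, not the dupi types.
type Doc struct {
	Path       string
	Start, End uint32
}

type Blot struct {
	Blot uint32
	Docs []Doc
}

// fill is a hypothetical helper that appends matches to b.Docs.
// A nil Docs means "return everything"; a non-nil Docs is filled
// only until its capacity is exhausted. It returns the number added.
func fill(b *Blot, matches []Doc) int {
	if b.Docs == nil {
		b.Docs = append(b.Docs, matches...)
		return len(matches)
	}
	n := 0
	for _, m := range matches {
		if len(b.Docs) == cap(b.Docs) {
			break // no spare capacity left
		}
		b.Docs = append(b.Docs, m)
		n++
	}
	return n
}

func main() {
	matches := []Doc{{Path: "a.txt"}, {Path: "b.txt"}, {Path: "c.txt"}}

	all := Blot{Blot: 7} // Docs is nil: no limit
	limited := Blot{Blot: 7, Docs: make([]Doc, 0, 2)}

	fmt.Println(fill(&all, matches), fill(&limited, matches)) // prints "3 2"
}
```

Pre-allocating Docs with a fixed capacity lets a caller bound memory per blot without an extra limit parameter.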

func (*Blot) Cap ¶

func (b *Blot) Cap() int

func (*Blot) Doc ¶

func (b *Blot) Doc(i int) *Doc

func (*Blot) Len ¶

func (b *Blot) Len() int

func (*Blot) Next ¶

func (b *Blot) Next(lim bool) *Doc

type Config ¶

type Config struct {
	IndexRoot   string
	SeqLen      int
	NumShards   int
	NumShatters int

	// How frequently buckets write document
	// data to disk.  Higher= less memory,
	// more frequent i/o.
	// Frequency in terms of number of documents.
	DocFlushRate int

	TokenConfig token.Config
	BlotConfig  blotter.Config
}

func DefaultConfig ¶

func DefaultConfig(root string) (*Config, error)

func NewConfig ¶

func NewConfig(root string, nbuckets, seqLen int) (*Config, error)

func ReadConfig ¶

func ReadConfig(root string) (*Config, error)

func (*Config) DmdPath ¶

func (cfg *Config) DmdPath() string

func (*Config) FnamesPath ¶

func (cfg *Config) FnamesPath() string

func (*Config) IixPath ¶

func (cfg *Config) IixPath(i int) string

func (*Config) LockPath ¶

func (cfg *Config) LockPath() string

func (*Config) Path ¶

func (cfg *Config) Path() string

func (*Config) PostPath ¶

func (cfg *Config) PostPath(i int) string

func (*Config) Write ¶

func (cfg *Config) Write() error

type Doc ¶

type Doc struct {
	Path  string
	Start uint32
	End   uint32
	Dat   []byte `json:"-"`
}

func NewDoc ¶

func NewDoc(path, body string) *Doc

func (*Doc) Load ¶ added in v0.0.4

func (doc *Doc) Load() error

type Index ¶

type Index struct {
	// contains filtered or unexported fields
}

func OpenIndex ¶

func OpenIndex(root string) (*Index, error)

func (*Index) BlotDoc ¶ added in v0.0.4

func (x *Index) BlotDoc(dst []uint32, doc *Doc) []uint32

func (*Index) Blotter ¶ added in v0.0.3

func (x *Index) Blotter() blotter.T

func (*Index) Close ¶

func (x *Index) Close() error

func (*Index) FindBlot ¶ added in v0.0.4

func (x *Index) FindBlot(theBlot uint32, doc *Doc) (start, end uint32, err error)

func (*Index) FindBlots ¶ added in v0.0.10

func (x *Index) FindBlots(m map[uint32][]byte, doc *Doc) (map[uint32][]byte, error)

func (*Index) JoinBlot ¶ added in v0.0.4

func (x *Index) JoinBlot(shard uint32, sblot uint16) uint32

func (*Index) NumShards ¶ added in v0.0.3

func (x *Index) NumShards() int

func (*Index) NumShatters ¶ added in v0.0.3

func (x *Index) NumShatters() int

func (*Index) Root ¶

func (x *Index) Root() string

func (*Index) SeqLen ¶ added in v0.0.3

func (x *Index) SeqLen() int

func (*Index) SplitBlot ¶ added in v0.0.4

func (x *Index) SplitBlot(b uint32) (shard uint32, sblot uint16)
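The JoinBlot and SplitBlot signatures suggest that a 32-bit blot is composed of a shard number and a 16-bit per-shard blot. The reference does not say how the two are combined, so the bit layout below (shard in the upper 16 bits, sblot in the lower 16) is purely an assumption, and joinBlot and splitBlot are hypothetical local functions. The sketch only shows the round-trip property such a pair would be expected to satisfy.

```go
package main

import "fmt"

// joinBlot packs shard and sblot into one uint32.
// ASSUMED layout: shard in the high 16 bits, sblot in the low 16 bits.
// dupi's real encoding may differ.
func joinBlot(shard uint32, sblot uint16) uint32 {
	return shard<<16 | uint32(sblot)
}

// splitBlot is the inverse of joinBlot under the same assumed layout.
func splitBlot(b uint32) (shard uint32, sblot uint16) {
	return b >> 16, uint16(b)
}

func main() {
	b := joinBlot(3, 0x00ff)
	shard, sblot := splitBlot(b)
	fmt.Printf("%#08x %d %#x\n", b, shard, sblot)
}
```

Under this layout the round trip is lossless only while shard fits in 16 bits.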

func (*Index) StartQuery ¶

func (x *Index) StartQuery(s QueryStrategy) *Query

func (*Index) Stats ¶ added in v0.0.5

func (x *Index) Stats() (*Stats, error)

func (*Index) TokenFunc ¶ added in v0.0.3

func (x *Index) TokenFunc() token.TokenizerFunc

type Indexer ¶

type Indexer struct {
	// contains filtered or unexported fields
}

Indexer is a struct for duplicate indexing.

func CreateIndexer ¶

func CreateIndexer(root string, nbuckets, seqLen int) (*Indexer, error)

CreateIndexer attempts to create a new dupi index. root is the directory root of the index, and nbuckets states how many buckets to use. The original comment also describes two parameters that do not appear in the signature above: docCap, a conservative estimate of the number of documents, and toksPerDoc, roughly how many tokens are expected per document.

func IndexerFromConfig ¶

func IndexerFromConfig(cfg *Config) (*Indexer, error)

func OpenIndexer ¶

func OpenIndexer(root string) (*Indexer, error)

func (*Indexer) Add ¶

func (x *Indexer) Add(doc *Doc) error

Add adds 'doc' to the index.

func (*Indexer) Close ¶

func (x *Indexer) Close() error

Close attempts to flush all data associated with the index to disk.
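Because Close is what flushes data to disk, its error is worth checking rather than discarding in a bare defer. The sketch below shows one common Go pattern for that. To stay self-contained it uses hypothetical stand-ins: a local Doc type, an indexer interface with the same Add and Close method shapes as listed above, and an in-memory memIndexer. None of these are dupi code.

```go
package main

import (
	"errors"
	"fmt"
)

// Doc is a local stand-in for the document type.
type Doc struct{ Path, Body string }

// indexer captures the Add/Close method shapes from this reference.
type indexer interface {
	Add(doc *Doc) error
	Close() error
}

// memIndexer is a hypothetical in-memory fake used only for the sketch.
type memIndexer struct {
	pending   []*Doc
	flushed   int
	failFlush bool
}

func (m *memIndexer) Add(doc *Doc) error {
	m.pending = append(m.pending, doc)
	return nil
}

func (m *memIndexer) Close() error {
	if m.failFlush {
		return errors.New("flush failed")
	}
	m.flushed += len(m.pending)
	m.pending = nil
	return nil
}

// indexAll adds every document and always closes x. A Close error is
// reported unless an earlier Add error already explains the failure.
func indexAll(x indexer, docs []*Doc) (err error) {
	defer func() {
		if cerr := x.Close(); cerr != nil && err == nil {
			err = fmt.Errorf("close: %w", cerr)
		}
	}()
	for _, d := range docs {
		if aerr := x.Add(d); aerr != nil {
			return fmt.Errorf("add %s: %w", d.Path, aerr)
		}
	}
	return nil
}

func main() {
	docs := []*Doc{{Path: "a.txt", Body: "x"}, {Path: "b.txt", Body: "y"}}
	fmt.Println(indexAll(&memIndexer{}, docs))
	fmt.Println(indexAll(&memIndexer{failFlush: true}, docs))
}
```

The named return value lets the deferred function surface a failed flush even when every Add succeeded.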

func (*Indexer) Root ¶

func (x *Indexer) Root() string

Root returns the path to the root of the index 'x'. The returned root is an absolute path.

type Query ¶

type Query struct {
	// contains filtered or unexported fields
}

func (*Query) Get ¶ added in v0.0.3

func (q *Query) Get(blot *Blot) error

func (*Query) Next ¶

func (q *Query) Next(dst []Blot) (n int, err error)

type QueryStrategy ¶

type QueryStrategy int

const (
	QueryMaxBlot QueryStrategy = iota
	QueryMaxDoc
	QueryRandom
)
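Since the constants are declared with iota, their integer values follow from declaration order. The sketch below redeclares the type locally to show this. The name helper is hypothetical; this reference does not list a String method on QueryStrategy.

```go
package main

import "fmt"

// QueryStrategy and its constants are redeclared here exactly as
// shown above so the sketch compiles without the dupi module.
type QueryStrategy int

const (
	QueryMaxBlot QueryStrategy = iota
	QueryMaxDoc
	QueryRandom
)

// name is a hypothetical helper that returns the constant's identifier.
func name(s QueryStrategy) string {
	switch s {
	case QueryMaxBlot:
		return "QueryMaxBlot"
	case QueryMaxDoc:
		return "QueryMaxDoc"
	case QueryRandom:
		return "QueryRandom"
	}
	return fmt.Sprintf("QueryStrategy(%d)", int(s))
}

func main() {
	for s := QueryMaxBlot; s <= QueryRandom; s++ {
		fmt.Printf("%d %s\n", int(s), name(s))
	}
	// Prints:
	// 0 QueryMaxBlot
	// 1 QueryMaxDoc
	// 2 QueryRandom
}
```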

type Stats ¶ added in v0.0.5

type Stats struct {
	Root      string
	NumDocs   uint64
	NumPaths  uint64
	NumPosts  uint64
	NumBlots  uint64
	BlotMean  float64
	BlotSigma float64
}

func (*Stats) String ¶ added in v0.0.5

func (st *Stats) String() string

Directories ¶

Path	Synopsis
attic	Package attic contains interesting dead ends.
ibloom	Package ibloom implements a bloom filter on integer (uint32) sets.
trigram	Package trigram supports a trigram alphabet for dupy.
blotter	package blotter provides fingerprinting for dupi docs.
cmd
dupenron
dupi	Command dupi is the dupi command line.
dmd	Package dmd maps document, offset pairs to internal document ids.
internal
shard	Package shard implements sharded posting indices.
lock	Package lock provides file based cooperative locking.
post	Package post provides a data structure coupling dupi blots with dupi internal document ids.
token	Package token tokenizes data for dupi.
