tile-library-architecture/

directory
v0.0.0-...-7a0a068 Latest Latest
Published: Jun 7, 2023 License: AGPL-3.0


Golang Tile Library Architecture


This directory contains a series of packages for creating and managing tile libraries and genomes in Go.

The following Go packages are included:

  • structures, a package that defines the basic TileVariant structure used by tile libraries, an equality method for it, and, for convenience, the number of paths in the genome.
  • tilelibrary, a package containing ways to create, add to, compare, export, and import tile libraries. Requires the structures package.
  • genome, a package containing ways to create, export, and import genomes relative to a tile library. Requires the tilelibrary and the structures packages.

In addition, several programs are provided to create and merge libraries, lift over genomes, and convert genomes to numpy arrays directly from the command line.

Installation


Once this project is on GitHub, use the go get command as with any other Go package hosted there. Make sure to get every package, since tilelibrary and genome rely on structures, and genome relies on tilelibrary.

CreateLibrary Program Usage


The createlibrary program combines a series of directories of FastJ genome files into one library and writes the resulting library to disk.

Within the createlibrary folder, the command to run the program is ./createlibrary. Outside of this directory, the command will be the filepath to the createlibrary program in the createlibrary folder. All non-flag arguments given to the command should be after all flags, and should be directories of FastJ files to add to the library.

The following flags apply:

  • -version: specifies the version of the SGLF files to output. A value of 0 outputs regular SGLF files, and 1 outputs SGLFv2 files. Behavior is not defined when the -version flag is given any other number. The default is 0 (regular SGLF files).
  • -textfile: specifies the location and name of the text file where the command is allowed to write intermediate data. Default behavior is to write the intermediate data to a text file called test.txt in the current directory. Files are created if they do not exist.
  • -dir: specifies the directory to write the output files to. If the directory does not exist, it will be created. Default behavior is to write the output files to the current directory.

A successful run prints the directory that was written to as an absolute path, so that other programs can access the created files.
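The requirement that positional arguments follow all flags matches how Go's standard flag package behaves: parsing stops at the first non-flag argument. The sketch below illustrates this with the three flags described above. It is not the createlibrary source; the options type, the parseArgs helper, the "." default for -dir, and the choice to reject other -version values (which the program leaves undefined) are assumptions made for illustration.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// options is a hypothetical holder for the flags described above.
type options struct {
	version  int
	textFile string
	outDir   string
	inputs   []string // positional arguments: directories of FastJ files
}

// parseArgs parses flags first, then treats everything that remains
// as input directories.
func parseArgs(args []string) (options, error) {
	var opts options
	fs := flag.NewFlagSet("createlibrary", flag.ContinueOnError)
	fs.IntVar(&opts.version, "version", 0, "0 outputs SGLF files, 1 outputs SGLFv2 files")
	fs.StringVar(&opts.textFile, "textfile", "test.txt", "file for intermediate data")
	fs.StringVar(&opts.outDir, "dir", ".", "directory to write output files to")
	if err := fs.Parse(args); err != nil {
		return opts, err
	}
	// Assumption of this sketch: reject values whose behavior is undefined.
	if opts.version != 0 && opts.version != 1 {
		return opts, fmt.Errorf("unsupported -version value %d", opts.version)
	}
	opts.inputs = fs.Args()
	return opts, nil
}

func main() {
	opts, err := parseArgs(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("%+v\n", opts)
}
```

With this parsing, a flag placed after a directory name is not recognized as a flag; it is returned as one more positional argument.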

MergeLibraries Program Usage


The mergelibraries program merges libraries written to disk in the form of SGLFv2 files, and outputs either SGLF or SGLFv2 files for the merged library.

Within the mergelibraries folder, the command to run the program is ./mergelibraries. Outside of this directory, the command will be the filepath to the mergelibraries program in the mergelibraries folder. All non-flag arguments given to the command should be after all flags, and should be directories of SGLFv2 files to add to the library.

The following flags apply:

  • -version: specifies the version of the SGLF files to output. A value of 0 outputs regular SGLF files, and 1 outputs SGLFv2 files. Behavior is not defined when the -version flag is given any other number. The default is 0 (regular SGLF files).
  • -dir: specifies the directory to write the output files to. If the directory does not exist, it will be created. Default behavior is to write the output files to the current directory.

A successful run creates files silently. If an error is encountered, it is printed and the program exits.

LiftoverGenome Program Usage


The liftovergenome program lifts over a genome from a source library to a destination library and writes either a text file or numpy files to disk.

Within the liftovergenome folder, the command to run the program is ./liftovergenome. Outside of this directory, the command will be the filepath to the liftovergenome program in the liftovergenome folder.

The following flags apply:

  • -genome: specifies the location of the text file of the genome.
  • -source: specifies the directory of the SGLFv2 files for the source library. Uses the current directory by default.
  • -destination: specifies the directory of the SGLFv2 files for the destination library. Uses the current directory by default.
  • -path: specifies the location where the output should be written. If -npy is true, this must be a directory. By default, a text file called newgenome.txt is written in the current directory.
  • -npy: specifies what type of files to output. True outputs one numpy array for each path; false outputs one text file for the whole genome. The default is false.

A successful run prints the path that was written to as an absolute path, so that other programs can access the created files.

GenomesToNumpy Program Usage


The genomestonumpy program takes a series of genomes, a source library, an output directory, and an optional path number, and maps those genomes to the library. If a path number is given, it writes the numpy array for that path across all genomes. If no path number is given, it writes the numpy arrays for all paths.

Within the genomestonumpy folder, the command to run the program is ./genomestonumpy. Outside of this directory, the command will be the filepath to the genomestonumpy program in the genomestonumpy folder. All non-flag arguments given to the command should be after all flags, and should be directories of FastJ files for each genome that should be included.

The following flags apply:

  • -dir: specifies the directory to output files to. Uses the current directory by default.
  • -source: specifies the directory of the SGLFv2 files for the source library. Uses the current directory by default.
  • -path: specifies the path number of the numpy array that should be outputted. Outputs all numpy arrays instead by default.

A successful run prints the path that was written to as an absolute path, so that other programs can access the created files.

Go package descriptions


For package structures, the major structure here is the TileVariant, which contains information regarding the variant and the library it belongs to. Fields of the TileVariant can be called and modified directly if needed. An equality method based on the hash is provided.

Package tilelibrary contains methods for working with, modifying, and exporting and importing libraries. This allows for writing to SGLF and SGLFv2 files, along with creating libraries from SGLFv2 files. Tiles can be added and annotated, along with finding frequencies and existence of specific tiles. Libraries can be merged, and liftover mappings from one library to another can be created if genomes attached to a specific library need to reference another library. Tile libraries are given IDs based on the MD5 hash algorithm, to make them easy to refer to. Libraries are safe for concurrent use.

Package genome contains methods for creating, exporting, and importing genomes relative to a specific tile library. This includes writing to and reading from numpy arrays and text files, and also allows for lifting over genomes from one tile library to another. Genomes refer to a specific tile by using its tile variant number in the reference tile library.
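To make the idea of a liftover concrete, here is a minimal sketch of how variant numbers could be translated between two libraries by matching tile hashes. It does not use the tilelibrary or genome APIs. The buildMapping and liftover helpers, the representation of a step as a slice of hashes indexed by variant number, and the representation of one path of a genome as a slice of integers are all assumptions for illustration. Negative entries (the skipped-step and incomplete-tile markers described in the Notes section) are passed through unchanged.

```go
package main

import "fmt"

// buildMapping returns, for one step, the destination variant number for
// each source variant number. src and dst list tile hashes indexed by
// variant number (a hypothetical representation).
func buildMapping(src, dst []string) ([]int, error) {
	index := make(map[string]int, len(dst))
	for n, h := range dst {
		index[h] = n
	}
	mapping := make([]int, len(src))
	for n, h := range src {
		m, ok := index[h]
		if !ok {
			return nil, fmt.Errorf("variant %d (hash %s) is missing from the destination", n, h)
		}
		mapping[n] = m
	}
	return mapping, nil
}

// liftover rewrites one path of a genome, one entry per step.
// Negative entries are markers rather than variant numbers and are copied as-is.
func liftover(path []int, mappings [][]int) ([]int, error) {
	out := make([]int, len(path))
	for step, v := range path {
		switch {
		case v < 0:
			out[step] = v
		case step >= len(mappings) || v >= len(mappings[step]):
			return nil, fmt.Errorf("no mapping for variant %d at step %d", v, step)
		default:
			out[step] = mappings[step][v]
		}
	}
	return out, nil
}

func main() {
	// Step 0: the two libraries number the same tiles differently.
	m0, err := buildMapping([]string{"aa", "bb"}, []string{"bb", "cc", "aa"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Step 1: both libraries agree.
	m1, _ := buildMapping([]string{"dd"}, []string{"dd"})

	lifted, err := liftover([]int{1, 0, -1}, [][]int{m0, m1, nil})
	fmt.Println(lifted, err)
}
```

In this sketch the destination library must contain every tile of the source library, which is the case when the destination was produced by merging the source with other libraries.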

SGLFv2 Specification

This package introduces a new type of file for storing libraries, the SGLFv2 file. The format of the SGLFv2 file is as follows:

First, each file name must be a 4-digit hexadecimal number representing the path (a value between 0 and the number of paths in the genome), followed by the suffix .sglfv2. Every variant in that path for this library must be in that file.

The first line of each file follows the following format:

ID:LibraryID,Components:ComponentID1,ComponentID2,ComponentID3...

where the current library ID is the first ID, and the IDs of any components are separated by commas.
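As an illustration of the naming rule and the first-line format, the following sketch builds a file name from a path number and splits a header line into its parts. It is not taken from the tilelibrary package. The fileName and parseHeader helpers are hypothetical, and two details are assumptions: that hexadecimal digits are lowercase, and that a library with no components has nothing after "Components:".

```go
package main

import (
	"fmt"
	"strings"
)

// fileName returns the SGLFv2 file name for a path number,
// assuming lowercase, zero-padded, 4-digit hexadecimal.
func fileName(path int) string {
	return fmt.Sprintf("%04x.sglfv2", path)
}

// parseHeader splits "ID:<id>,Components:<c1>,<c2>,..." into the
// library ID and the list of component IDs.
func parseHeader(line string) (string, []string, error) {
	const idPrefix, compMarker = "ID:", ",Components:"
	if !strings.HasPrefix(line, idPrefix) {
		return "", nil, fmt.Errorf("header does not start with %q", idPrefix)
	}
	i := strings.Index(line, compMarker)
	if i < 0 {
		return "", nil, fmt.Errorf("header is missing %q", compMarker)
	}
	id := line[len(idPrefix):i]
	var components []string
	if list := line[i+len(compMarker):]; list != "" {
		components = strings.Split(list, ",")
	}
	return id, components, nil
}

func main() {
	fmt.Println(fileName(0x2c5))

	// The IDs below are shortened placeholders, not real library IDs.
	id, components, err := parseHeader("ID:abc123,Components:def456,0789ab")
	fmt.Println(id, components, err)
}
```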

The following lines are of the following format, where each line contains exactly one tile variant's information:

PATH.01.STEP.VARIANT+COUNT+LENGTH,HASH,BASES

where PATH and STEP are 4-digit hexadecimal representations of the tile's path and step; VARIANT is the variant number as a 3-digit hexadecimal number (variants are ordered from most common to least common, with ties sorted by increasing hash); COUNT is the frequency of the tile as an 8-digit hexadecimal number; LENGTH is the tile span in hexadecimal; HASH is the hash of the bases of the tile; and BASES is the string of bases of the tile (which may include nocalls).
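The line format can be read back with ordinary string splitting. The sketch below is one possible parser written for illustration, not the package's own reader; the variantLine type and parseVariantLine function are hypothetical names. The second dot-separated component (the literal 01 in the format line) is skipped rather than interpreted, and the hash in the example is a made-up placeholder.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// variantLine holds the fields of one tile variant line.
type variantLine struct {
	Path, Step, Variant int
	Count, Length       int
	Hash, Bases         string
}

// parseVariantLine parses "PATH.01.STEP.VARIANT+COUNT+LENGTH,HASH,BASES".
func parseVariantLine(line string) (variantLine, error) {
	var v variantLine
	fields := strings.SplitN(line, ",", 3)
	if len(fields) != 3 {
		return v, fmt.Errorf("expected 3 comma-separated fields, got %d", len(fields))
	}
	plus := strings.Split(fields[0], "+")
	if len(plus) != 3 {
		return v, fmt.Errorf("expected POSITION+COUNT+LENGTH, got %q", fields[0])
	}
	pos := strings.Split(plus[0], ".")
	if len(pos) != 4 {
		return v, fmt.Errorf("expected 4 dot-separated parts, got %q", plus[0])
	}
	// pos[1] is the fixed second component; it is not interpreted here.
	targets := []struct {
		text string
		dst  *int
	}{
		{pos[0], &v.Path}, {pos[2], &v.Step}, {pos[3], &v.Variant},
		{plus[1], &v.Count}, {plus[2], &v.Length},
	}
	for _, t := range targets {
		n, err := strconv.ParseInt(t.text, 16, 64) // all numbers are hexadecimal
		if err != nil {
			return v, fmt.Errorf("bad hexadecimal field %q: %w", t.text, err)
		}
		*t.dst = int(n)
	}
	v.Hash, v.Bases = fields[1], fields[2]
	return v, nil
}

func main() {
	// Placeholder hash, not a real MD5 of the bases.
	line := "02c5.01.0003.000+0000000a+1,0123456789abcdef0123456789abcdef,acgt"
	v, err := parseVariantLine(line)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", v)
}
```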

Lines are ordered first by increasing step, and then by increasing variant number within each step.
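The two ordering rules (variants numbered from most to least common with ties broken by increasing hash, and lines sorted by step and then by variant number) can be sketched with a single sort. This is an illustrative reconstruction, not the package's code. The tile type and orderTiles function are hypothetical, numbering is assumed to start at 0, and hashes are assumed to be same-case hexadecimal strings of equal length so that string comparison matches numeric order.

```go
package main

import (
	"fmt"
	"sort"
)

// tile is a hypothetical, minimal record for one tile variant in a path.
type tile struct {
	Step    int
	Count   int
	Hash    string
	Variant int // assigned by orderTiles
}

// orderTiles sorts tiles into file order and assigns variant numbers.
func orderTiles(tiles []tile) {
	sort.Slice(tiles, func(i, j int) bool {
		a, b := tiles[i], tiles[j]
		if a.Step != b.Step {
			return a.Step < b.Step // increasing step
		}
		if a.Count != b.Count {
			return a.Count > b.Count // most common first
		}
		return a.Hash < b.Hash // ties: increasing hash
	})
	// Within each step, number the variants in sorted order.
	for i := range tiles {
		if i == 0 || tiles[i].Step != tiles[i-1].Step {
			tiles[i].Variant = 0
		} else {
			tiles[i].Variant = tiles[i-1].Variant + 1
		}
	}
}

func main() {
	tiles := []tile{
		{Step: 1, Count: 2, Hash: "bb"},
		{Step: 0, Count: 1, Hash: "aa"},
		{Step: 1, Count: 2, Hash: "ab"},
		{Step: 1, Count: 5, Hash: "ff"},
	}
	orderTiles(tiles)
	for _, t := range tiles {
		fmt.Printf("step %d variant %d count %d hash %s\n", t.Step, t.Variant, t.Count, t.Hash)
	}
}
```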

The current hash algorithm for determining IDs and for hashing tile variants is MD5.
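For reference, an MD5 hash of a string of bases can be computed with Go's standard crypto/md5 package as sketched below. The hashBases helper is a hypothetical name, and the sketch assumes the bases are hashed exactly as written, with no case folding or other normalization; the package may prepare the bases differently before hashing.

```go
package main

import (
	"crypto/md5"
	"encoding/hex"
	"fmt"
)

// hashBases returns the MD5 digest of the bases as a 32-character
// lowercase hexadecimal string.
func hashBases(bases string) string {
	sum := md5.Sum([]byte(bases)) // sum is a [16]byte
	return hex.EncodeToString(sum[:])
}

func main() {
	fmt.Println(hashBases("acgt"))
}
```

Because the digest is a fixed 16 bytes, two tiles can be compared by hash without comparing their full base strings.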

Documentation


Once this project is on GitHub, documentation of the packages should appear on Godoc (godoc.org). For now, documentation for Go packages can be found within the .go files, and documentation for programs can be found here or by using the help flag (-h).

Tests


Tests are provided for the genome and the tilelibrary packages, in the files genome_test.go and tile-library_test.go. Both files contain variables set to empty strings where directories or file names are expected. Before running these tests, replace these empty strings with the directories and file names you want to use. Both relative and absolute file paths work for these tests.

Notes


  • Make sure to initialize libraries and genomes before use. The provided New functions automatically set up the necessary structures.
  • Check that libraries are initialized with the correct reference paths. Importing a library from SGLFv2 files is only allowed if the reference path is the directory of those SGLFv2 files.
  • Leftover files from libraries that need to be removed should be removed by the caller. The tilelibrary package does not delete intermediate files.
  • TileVariants are compared by hash only. Even if two TileVariants might have different fields elsewhere (for example, different annotations), equality is determined only by the hash of both variants.
  • Adding tiles directly to a library created from SGLFv2 files is not valid, since adding to an SGLFv2 file directly would cause all lookup reference numbers for tiles to be shifted over. One workaround is by merging this library with an empty library, as adding tiles to merged libraries is allowed.
  • In genomes, the number -1 represents a skipped step location because of a spanning tile. The number -2, which appears in text file and numpy array representations of genomes, represents an incomplete tile (that is, it contains a nocall).
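Two of the notes above, hash-only equality and the negative markers in genomes, can be illustrated with a short sketch. The tileVariant type, its fields, the equals method, the constant names, and the describe helper are stand-ins invented for this example and are not the definitions in the structures or genome packages.

```go
package main

import (
	"crypto/md5"
	"fmt"
)

// Markers that can appear in place of a variant number in a genome.
const (
	skippedStep    = -1 // step covered by a spanning tile
	incompleteTile = -2 // tile containing a nocall
)

// tileVariant is a hypothetical stand-in for the real TileVariant.
type tileVariant struct {
	Hash       [md5.Size]byte
	Annotation string
}

// equals compares by hash only; all other fields are ignored.
func (t tileVariant) equals(other tileVariant) bool {
	return t.Hash == other.Hash
}

// describe explains one entry of a genome.
func describe(entry int) string {
	switch entry {
	case skippedStep:
		return "skipped (spanning tile)"
	case incompleteTile:
		return "incomplete tile (nocall)"
	default:
		return fmt.Sprintf("tile variant %d", entry)
	}
}

func main() {
	a := tileVariant{Hash: md5.Sum([]byte("acgt")), Annotation: "first"}
	b := tileVariant{Hash: md5.Sum([]byte("acgt")), Annotation: "second"}
	fmt.Println(a.equals(b)) // same bases, different annotations

	for _, entry := range []int{3, skippedStep, incompleteTile} {
		fmt.Println(describe(entry))
	}
}
```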

Future goals/features


  • Parallelization of merging. This will speed up the merge process considerably, especially when merging more than two libraries together.
  • Parallelization by path in library creation.

Directories

  • Createlibrary is a command line function that parses directories of FastJ files into a tile library and writes files to a specified directory.
  • Package genome is a package for representing the genome, relative to a tile library, with Go data structures.
  • Program genomestonumpy takes a directory to write to, a directory for a source library, a path number, and any number of directories for genomes.
  • Program liftovergenome takes a genome, a source library, and a destination library, along with a destination filepath and a boolean.
  • Program mergelibraries merges a set of given directories of SGLFv2 files together into one library, and then writes SGLF or SGLFv2 files for the new library to disk.
  • Package structures is a basic package to hold basic structures, methods, and functions for tile libraries and genomes.
  • Package tilelibrary is a package for implementing tile libraries in Go.
