# post

`post` is a program for processing structured data files in bulk. It was
originally intended as an automation tool for generating LaTeX graphs from
`functionObject` data generated by OpenFOAM® simulations, but has since
evolved into a general structured data processor with optional graph
generation support.

Its primary use is processing and formatting data spread over multiple files
and/or archives. The main benefit is that the entire process is defined
through one or more YAML-formatted run files, so automating a data processing
pipeline is fairly simple and requires no programming.
## CLI usage

```
Usage:
  post [run file] [flags]
  post [command]

Available Commands:
  completion  Generate the autocompletion script for the specified shell
  graphfile   Generate graph file stub(s)
  help        Help about any command
  runfile     Generate a run file stub

Flags:
      --dry-run             check runfile syntax and exit
  -h, --help                help for post
      --no-graph            don't write or generate graphs
      --no-graph-generate   don't generate graphs
      --no-graph-write      don't write graph files
      --no-output           don't output data
      --no-process          don't process data
      --only-graphs         only write and generate graphs, skip input, processing and output
      --skip strings        a list of pipeline IDs to be skipped during processing
  -v, --verbose             verbose log output
```
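For example, a hypothetical run file `run.yml` could first be validated and
then executed while skipping one of its pipelines (the file name and
pipeline ID are illustrative):

```sh
# check the run file syntax without executing it
$ post run.yml --dry-run

# run, skipping the pipeline with ID 'cleanup', with verbose log output
$ post run.yml --skip cleanup -v
```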
## Run file structure

`post` is controlled by a run file in YAML format, supplied as a CLI
argument. The run file consists of a list of pipelines, each defining four
sections: `input`, `process`, `output` and `graph`. The `input` section
defines the input files and formats from which data is read; the `process`
section defines operations which are applied to the data; the `output`
section defines how the processed data is output/stored; and the `graph`
section defines how the data is graphed.

Even though all sections are technically optional, certain sections depend
on others: the `process` and `output` sections require an `input` section to
be defined in order to work, since 'some data' is necessary for
processing/output. The `graph` section is entirely optional; it can be
omitted, defined by itself, or defined as part of a pipeline.
A single pipeline has the following fields:

```yaml
- id:
  input:
    type:
    fields:
    type_spec:
  process:
    - type:
      type_spec:
  output:
    - type:
      type_spec:
  graph:
    type:
    graphs:
```
- `id`: the pipeline tag, used to reference the pipeline on the CLI; optional
- `input`: the input section
  - `type`: input type; see Input for type descriptions
  - `fields`: field (column) names of the input data; optional
  - `type_spec`: input type specific configuration
- `process`: the process section
  - `type`: process type; see Processing for type descriptions
  - `type_spec`: process type specific configuration
- `output`: the output section
  - `type`: output type; see Output for type descriptions
  - `type_spec`: output type specific configuration
- `graph`: the graph section
  - `type`: graph type; see Graphing for type descriptions
  - `graphs`: a list of graph type specific graph configurations
A simple run file example is shown below.
```yaml
- input:
    type: dat
    fields: [x, y]
    type_spec:
      file: 'xy.dat'
  process:
    - type: expression
      type_spec:
        expression: '100*y'
        result: 'result'
  output:
    - type: csv
      type_spec:
        file: 'output/data.csv'
  graph:
    type: tex
    graphs:
      - name: xy.tex
        directory: output
        table_file: 'output/data.csv'
        axes:
          - x:
              min: 0
              max: 1
              label: '$x$'
            y:
              min: 0
              max: 100
              label: '$100 y$'
        tables:
          - x_field: x
            y_field: result
            legend_entry: 'result'
```
The example run file instructs `post` to do the following:

- read data from a `DAT` formatted file `xy.dat` and rename the fields
  (columns) to `x` and `y`
- evaluate the expression `100*y` and store the result in a field named
  `result`
- output the data, now containing the fields `x`, `y` and `result`, to a
  `CSV` formatted file `output/data.csv`; if the directory `output` does not
  exist, it will be created
- generate a graph using TeX in the `output` directory, using
  `output/data.csv` as the table (data) file, with `x` as the abscissa and
  `result` as the ordinate

For more examples see the `test` directory.
## Input

The following is a list of available input types and their descriptions,
along with their run file configuration stubs.
- `archive` reads input from an archive. The archive format is inferred from
  the file name extension. The following archive formats are supported:
  `TAR`, `TAR-GZ`, `TAR-BZIP2`, `TAR-XZ`, `ZIP`. Note that `archive` input
  wraps one or more input types, i.e., the `archive` configuration only
  specifies how to read 'some data' from an archive, while the wrapped input
  type reads the actual data. Also note that the contents of the archive are
  stored in memory the first time the archive is read, so if the same
  archive is used multiple times as an input source, it is read from disk
  only once; each subsequent read comes directly from RAM. It is hence
  beneficial to use the `archive` input type when the data consists of a
  large number of input files, e.g., a large `time-series`.

  ```yaml
  type: archive
  type_spec:
    file: # file path of the archive
    format_spec: # input type configuration, e.g., a CSV input type
  ```
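  For instance, a `CSV` input wrapped in an archive might be configured as
  follows (a sketch: the archive and file names are hypothetical, and
  `format_spec` is assumed to hold a regular input configuration whose
  `file` refers to a path inside the archive):

  ```yaml
  type: archive
  type_spec:
    file: 'results.tar.gz'    # hypothetical archive
    format_spec:
      type: csv
      type_spec:
        file: 'data.csv'      # path inside the archive
  ```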
- `csv` reads from a `CSV` formatted file. If the file contains a header
  line, the `header` field should be set to `true` and the header column
  names will be used as the field names for the data. If no header line is
  present, the `header` field must be set to `false`.

  ```yaml
  type: csv
  type_spec:
    file: # file path of the CSV file
    header: # determines if the CSV file has a header; default 'true'
    comment: # character to denote comments; default '#'
    delimiter: # character to use as the field delimiter; default ','
  ```
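  For example, a headerless, semicolon-delimited file might be read as
  follows (the file name is illustrative; field names would then be supplied
  via the pipeline-level `fields` key):

  ```yaml
  type: csv
  type_spec:
    file: 'raw.csv'
    header: false
    delimiter: ';'
  ```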
- `dat` reads from a white-space-separated-value file. The type and amount
  of white space between columns is irrelevant, as are leading and trailing
  white space, as long as the number of columns (non-white-space fields) is
  consistent in each row.

  ```yaml
  type: dat
  type_spec:
    file: # file path of the DAT file
  ```
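  For instance, every row of the following (illustrative) file parses to two
  fields, despite the inconsistent spacing:

  ```
  0.0    1.00
    0.1  0.95
  0.2        0.91
  ```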
- `multiple` is a wrapper for multiple input types. Data is read from each
  input type specified, and once all inputs have been read, the data from
  each input is merged into a single data instance containing all fields
  (columns) from all inputs. The number and type of the specified inputs are
  arbitrary, but each input must yield data with the same number of rows.

  ```yaml
  type: multiple
  type_spec:
    format_specs: # a list of input type configurations
  ```
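  For instance, two files could be merged into one dataset as follows (a
  sketch with hypothetical file names; each entry in `format_specs` is
  assumed to be a regular input configuration):

  ```yaml
  type: multiple
  type_spec:
    format_specs:
      - type: csv
        type_spec:
          file: 'pressure.csv'
      - type: dat
        type_spec:
          file: 'velocity.dat'
  ```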
- `ram` reads data from an in-memory store. For the data to be read it must
  have been stored previously, e.g., by a `ram` output defined in the
  `output` section of a preceding pipeline.

  ```yaml
  type: ram
  type_spec:
    name: # key under which the data is stored
  ```
- `time-series` reads data from a time series of structured data files laid
  out in the following format:

  ```
  .
  ├── 0.0
  │   ├── data_0.csv
  │   ├── data_1.dat
  │   └── ...
  ├── 0.1
  │   ├── data_0.csv
  │   ├── data_1.dat
  │   └── ...
  └── ...
  ```

  where each `data_*.*` file contains the data, in some format, at the
  moment in time specified by the directory name. Each series dataset must
  be output to a different file, i.e., the `data_0.csv` files contain one
  dataset, the `data_1.dat` files another one, and so on.

  ```yaml
  type: time-series
  type_spec:
    file: # file name (base only) of the time-series data files
    directory: # path to the root directory of the time-series
    time_name: # the time field name; default is 'time'
    format_spec: # input type configuration, e.g., a CSV input type
  ```
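  For the directory tree above, the `data_0.csv` dataset might be read as
  follows (a sketch: the root directory name is hypothetical, and the
  wrapped `csv` input is assumed to need no `file` of its own since the file
  name is given at the `time-series` level):

  ```yaml
  type: time-series
  type_spec:
    file: 'data_0.csv'
    directory: 'cases/run1'   # hypothetical root containing 0.0, 0.1, ...
    format_spec:
      type: csv
  ```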
## Processing

The following is a list of available processor types and their descriptions,
along with their run file configuration stubs.
- `average-cycle` mutates the data by computing the ensemble average of a
  cycle for all numeric fields. The ensemble average is computed as:

  `Φ(ωt) = 1/N ∑ ϕ[ω(t + jT)],  j = 0 … N−1`

  where `ϕ` is the slice of values to be averaged, `ω` the angular velocity,
  `t` the time and `T` the period.

  The resulting data will contain the cycle average of all numeric fields
  and a time field (named `time`) containing the time for each row of cycle
  average data, in the range (0, T]. The time field will be the last field
  (column), while the order of the other fields is preserved.

  Time matching, along with the match precision, can optionally be specified
  by setting `time_field` and `time_precision` respectively in the
  configuration. This checks whether the time (step) is uniform and whether
  there is a mismatch between the read time and the expected time of the
  averaged value, as determined by the number of cycles defined in the
  configuration and the supplied data. The read time is the one read from
  the field named `time_field`. Note that in this case the output time field
  will be named after `time_field`, i.e., the time field name will remain
  unchanged.

  ```yaml
  type: average-cycle
  type_spec:
    n_cycles: # number of cycles to average over
    time_field: # time field name; optional
    time_precision: # time-matching precision; optional
  ```
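  For instance, data spanning 10 cycles, with time matching against a field
  named `t`, might be averaged as follows (the field name and cycle count
  are illustrative):

  ```yaml
  type: average-cycle
  type_spec:
    n_cycles: 10
    time_field: 't'
  ```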
- `expression` evaluates an arithmetic expression and appends the resulting
  field (column) to the data. The expression operands can be scalar values
  or fields (columns) present in the data, which are referenced by their
  names. Note that at least one of the operands must be a field present in
  the data. Each operation involving a field is applied element-wise. The
  following arithmetic operations are supported: `+`, `-`, `*`, `/`, `**`.

  ```yaml
  type: expression
  type_spec:
    expression: # an arithmetic expression
    result: # field name of the resulting field
  ```
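  For instance, for data containing fields `x` and `y`, their Euclidean norm
  could be computed using only the supported operators (field names
  illustrative):

  ```yaml
  type: expression
  type_spec:
    expression: '(x**2 + y**2)**0.5'
    result: 'norm'
  ```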
- `filter` mutates the data by applying a set of row filters as defined in
  the configuration. Each filter is described by the field name `field` to
  which the filter is applied, the comparison operator `op` and a comparison
  value `value`. Rows satisfying the comparison are kept, while all others
  are discarded. The following comparison operators are supported: `==`,
  `!=`, `>`, `>=`, `<`, `<=`.

  All defined filters are applied at the same time. The way in which they
  are aggregated is controlled by setting the `aggregation` field in the
  configuration; `and` and `or` aggregation modes are available. The `or`
  mode is the default if the `aggregation` field is unset.

  ```yaml
  type: filter
  type_spec:
    aggregation: # aggregation mode; defaults to 'or'
    filters:
      - field: # field name to which the filter is applied
        op: # filtering operation
        value: # comparison value
  ```
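  For instance, rows could be kept only where `x` lies in [0, 1] by
  combining two filters with `and` aggregation (the field name is
  illustrative):

  ```yaml
  type: filter
  type_spec:
    aggregation: and
    filters:
      - field: x
        op: '>='
        value: 0
      - field: x
        op: '<='
        value: 1
  ```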
- `resample` mutates the data by linearly interpolating all numeric fields,
  such that the resulting fields have `n_points` values at uniformly
  distributed values of the field `x_field`. If `x_field` is not set, a
  uniform resampling is performed, i.e., as if the values of each field were
  given at a uniformly distributed x, where x ∈ [0, 1]. The first and last
  values of a field are preserved in the resampled field.

  ```yaml
  type: resample
  type_spec:
    n_points: # number of resampling points
    x_field: # field name of the independent variable; optional
  ```
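  For instance, all fields could be resampled onto 100 uniformly spaced
  values of a field named `time` as follows (the field name is
  illustrative):

  ```yaml
  type: resample
  type_spec:
    n_points: 100
    x_field: 'time'
  ```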
- `select` mutates the data by selecting fields (extracting columns)
  specified by `fields`, which is a list of field names.

  ```yaml
  type: select
  type_spec:
    fields: # a list of field names
  ```
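Processors can be chained by listing them in a pipeline's `process` section.
The following sketch (all field names illustrative) filters rows, computes a
new field and keeps only the fields of interest:

```yaml
process:
  - type: filter
    type_spec:
      filters:
        - field: x
          op: '>'
          value: 0
  - type: expression
    type_spec:
      expression: '2*x'
      result: 'x2'
  - type: select
    type_spec:
      fields: [x, x2]
```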
## Output

The following is a list of available output types and their descriptions,
along with their run file configuration stubs.
- `csv` writes `CSV` formatted data to a file. If `header` is set to `true`,
  the file will contain a header line with the field names as the column
  names. Note that, if necessary, directories will be created so as to
  ensure that `file` specifies a valid path.

  ```yaml
  type: csv
  type_spec:
    file: # file path of the CSV file
    header: # determines if the CSV file has a header; default 'true'
    comment: # character to denote comments; default '#'
    delimiter: # character to use as the field delimiter; default ','
  ```
- `ram` stores data in an in-memory store. Once data is stored, any
  subsequent `ram` input type can access the data.

  ```yaml
  type: ram
  type_spec:
    name: # key under which the data is stored
  ```
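For instance, one pipeline could store its data in memory and a later
pipeline could pick it up, avoiding a second read from disk (a sketch; all
IDs, names and file paths are illustrative):

```yaml
- id: prepare
  input:
    type: dat
    type_spec:
      file: 'xy.dat'
  output:
    - type: ram
      type_spec:
        name: 'xy-data'
- id: consume
  input:
    type: ram
    type_spec:
      name: 'xy-data'
  output:
    - type: csv
      type_spec:
        file: 'output/xy.csv'
```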
## Graphing

Only TeX graphing, via `tikz` and `pgfplots`, is currently supported. Hence,
for graph generation to work, TeX needs to be installed along with any
dependent packages.

Graphing consists of two steps: generating TeX graph files from templates,
and generating the graphs from the TeX files. To see the default template
files run:

```sh
$ post graphfile --outdir=templates
```

The templates can be user-supplied by setting `template_directory` and
`template_main` (if necessary) in the run file configuration. The templates
use Go template syntax; see the Go template package documentation for more
information.
A `tex` graph configuration stub is given below; note that several fields
expect raw TeX as input.

```yaml
type: tex
graphs:
  - name: # used as a basename for all graph related files
    directory: # optional; output directory name, created if not present
    table_file: # optional; needed if 'tables.table_file' is undefined
    template_directory: # optional; template directory
    template_main: # optional; root template file name
    template_delims: # optional; go template delimiters; ['__{','}__'] by default
    tex_command: # optional; 'pdflatex' by default
    axes:
      - x:
          min:
          max:
          label: # raw TeX
        y:
          min:
          max:
          label: # raw TeX
        width: # optional; raw TeX, axis width option
        height: # optional; raw TeX, axis height option
        legend_style: # optional; raw TeX, axis legend style option
        raw_options: # optional; raw TeX, if defined all other options are ignored
    tables:
      - x_field:
        y_field:
        legend_entry: # raw TeX
        col_sep: # optional; 'comma' by default
        table_file: # optional; needed if 'graphs.table_file' is undefined
```