Module sdt, version v1.0.1
Published: Dec 1, 2022
License: MIT

Semantic Diff Tool (sdt)

The command-line tool sdt compares source files to identify which changes create semantic differences in the program operation, and specifically to exclude many changes which cannot be functionally important to the operation of a program or library.

Using sdt allows code reviewers and submitters to confirm that modifications made to improve the stylistic formatting of code (whether made by hand or with code-formatting tools) do not modify the underlying meaning of the code.

As designed, the tool is much more likely to produce false positives for the presence of semantic changes than false negatives. That is to say, sdt might indicate that a certain segment of the diff between versions is likely to contain a semantic difference in code, but upon human examination, a developer might decide that no actual behavior will change (or, of course, she might decide that the change in code behavior is a desired change).

It is unlikely that sdt will classify a file change, or a change to a particular segment of a diff, as semantically irrelevant when that change actually does alter behavior. However, this tool provides NO WARRANTY, and it remains up to your human developers and your CI/CD process to make final decisions on whether to accept code changes.

Installation

If the Go language is installed on your system, you may install sdt by cloning this repository and running go install ./... from its root. For example:

% git clone https://github.com/atlantistechnology/sdt.git
% cd sdt
% go install ./...

These commands will install both sdt itself and also the small support tools jsonformat, gotree, and treesit that may be used to evaluate changed JSON, Golang, and tree-sitter grammar files, respectively. Additional similar tools are likely to be added to this repository as new languages are supported.

However, you may also simply download pre-built binaries for all available Go cross-compilation targets directly to your system path. For example, on a Linux operating system and an AMD64 architecture, you might download with:

% sudo curl -o /usr/local/bin/sdt https://sdt.dev/linux/amd64/sdt
% sudo curl -o /usr/local/bin/jsonformat https://sdt.dev/linux/amd64/jsonformat
% sudo curl -o /usr/local/bin/gotree https://sdt.dev/linux/amd64/gotree
% sudo curl -o /usr/local/bin/treesit https://sdt.dev/linux/amd64/treesit
% sudo chmod a+x /usr/local/bin/sdt \
  /usr/local/bin/jsonformat /usr/local/bin/gotree /usr/local/bin/treesit

The extra tool jsonformat is not needed for users who prefer to use the much more powerful jq in their .sdt.toml configuration. The tool gotree produces a somewhat customized parse tree, so a different tool is unlikely to be compatible with sdt.

The separate step of setting the "executable bit" is usually required, because curl does not mark the files it downloads as executable. If you are installing to a directory on your PATH that needs only user permission, sudo is not necessary.

Integrations

Subcommand of git

You may wish to use sdt as a git subcommand. To do so, you can create an alias either locally within <repo>/.git/config or globally within $HOME/.gitconfig, depending on which better suits your workflow.

A straightforward and useful alias can provide something akin to an enhanced git diff. This alias will compare current files to the HEAD of the working branch, compare current files to another branch, or compare two branches/revisions to each other. For example:

% git sdt
INFO: Comparing HEAD to current changes on-disk
Changes not staged for commit:
    modified:   .gitignore
| No available semantic analyzer for this format
    modified:   README.md
| No available semantic analyzer for this format
% git sdt f7f6b934 ruby-samples
INFO: Comparing branches/revisions f7f6b934: to ruby-samples:
Changes between branches/revisions:
    samples/simple-filter.rb
| Segments with likely semantic changes
| @@ -3,6 +3,4 @@
|    items.to_a.select {|item| item % 5 == 0}
|  end
|
| -puts mod5? 1..50
| -puts mod5? 1..101
| -
| +puts mod5? 1..100

The alias for this behavior is:

[alias]
    sdt = "!f() { \
        if [ -z \"$1\" ]; then A=''; else A=\"--src $1:\"; fi; \
        if [ -z \"$2\" ]; then B=''; else B=\"--dst $2:\"; fi; \
        sdt semantic $A $B; }; f"
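The argument handling in the alias is easy to lose in the shell quoting. As a rough sketch in Go, the same logic looks like the following. The function name semanticArgs is hypothetical and is not part of sdt; only the --src and --dst flags that appear in the alias are assumed to exist.

```go
package main

import "fmt"

// semanticArgs mirrors the shell alias: each revision that is supplied
// becomes a flag whose value carries a trailing colon, and an empty
// revision contributes no arguments at all.
// (Hypothetical helper for illustration; not part of sdt.)
func semanticArgs(src, dst string) []string {
	args := []string{"semantic"}
	if src != "" {
		args = append(args, "--src", src+":")
	}
	if dst != "" {
		args = append(args, "--dst", dst+":")
	}
	return args
}

func main() {
	// No revisions: compare HEAD to the files on disk.
	fmt.Println(semanticArgs("", ""))
	// Two revisions: compare them to each other.
	fmt.Println(semanticArgs("f7f6b934", "ruby-samples"))
}
```

With no arguments the alias runs a bare sdt semantic; with one argument only --src is passed; with two, both flags are passed.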

Checking within GitHub Action

You may wish to have a semantic analysis of changes performed along with every PR. This can be accomplished using a GitHub Action, or in an analogous way on other repository management services such as GitLab or BitBucket. A GitHub Action could look like the one below (a similar workflow is used in the repository for SDT itself). You only need the middle steps that build the internal tools if your project includes JSON or Go; the "Analyze semantic changes" step is always needed. If you want to use tree-sitter grammars and/or custom versions of SDT-supported programming languages, you will need to install those within the workflow.

Since this is a GH Action for the sdt repo itself, we build the underlying tool(s) in the workflow. For an unrelated repository, you should simply follow the installation instructions, as you would for installation to a local development workstation, and include those steps in the YAML file.

name: Semantic Diff Tool analysis in PR comment

on:
  pull_request:

jobs:
  pr:
    runs-on: ubuntu-latest
    permissions:
      pull-requests: write
    steps:
      - uses: actions/checkout@v3
        with:
          fetch-depth: 2

      - name: Set up Go
        uses: actions/setup-go@v3
        with:
          go-version: 1.19

      - name: Build jsonformat tool
        run: |
          go build -o "$GITHUB_WORKSPACE/jsonformat" cmd/jsonformat/main.go &&
          echo "$GITHUB_WORKSPACE/" >> $GITHUB_PATH

      - name: Build gotree tool
        run: |
          go build -o "$GITHUB_WORKSPACE/gotree" cmd/gotree/main.go &&
          echo "$GITHUB_WORKSPACE/" >> $GITHUB_PATH

      - name: Analyze semantic changes
        id: pr
        run: |
          new=$(git log | grep 'Merge.*into.*' | head -1 | sed 's/ into .*$//;s/^ *Merge //')
          old=$(git log | grep 'Merge.*into.*' | head -1 | sed 's/^.* into //')
          status=$(go run cmd/sdt/main.go semantic -A "${old}:" -B "${new}:" -m -d)
          echo "Comparing revision $old to $new" >> SDT.analysis
          echo "<pre>" >> SDT.analysis
          echo "$status" >> SDT.analysis
          echo "</pre>" >> SDT.analysis
          echo "$status" # Workflow sees the report also
          
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

      - name: Add a comment to the PR
        uses: mshick/add-pr-comment@v2
        with:
          message-path: SDT.analysis
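
The "Analyze semantic changes" step recovers the two revisions from the most recent log line of the form "Merge <new> into <old>" using grep and sed. As a minimal sketch of that same string handling in Go (the function parseMerge is hypothetical, and the sketch assumes the line contains a single " into " separator):

```go
package main

import (
	"fmt"
	"strings"
)

// parseMerge splits a git log line such as "    Merge <new> into <old>"
// into its two revisions, as the sed expressions in the workflow do.
// (Hypothetical helper for illustration; not part of sdt.)
func parseMerge(line string) (newRev, oldRev string, ok bool) {
	line = strings.TrimSpace(line)
	if !strings.HasPrefix(line, "Merge ") {
		return "", "", false
	}
	newRev, oldRev, ok = strings.Cut(strings.TrimPrefix(line, "Merge "), " into ")
	if !ok {
		return "", "", false
	}
	return newRev, oldRev, true
}

func main() {
	newRev, oldRev, ok := parseMerge("    Merge 1a2b3c into 4d5e6f")
	fmt.Println(newRev, oldRev, ok)
}
```

The old revision becomes the first side of the comparison and the new revision the second, matching the order used in the workflow.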

Difftastic

Difftastic (difft) serves a purpose that largely overlaps with that of sdt. Difftastic builds on the substantial and longstanding work in tree-sitter, which is a parser generator tool for which many grammars and interfaces have been created.

Semantic Diff Tool is much newer and less developed; but sdt also takes a much more heuristic approach. That is, difft will answer the question "exactly what has changed between these two versions of a file at an AST level?" In contrast, sdt aims to answer the somewhat narrower question "which segments of the diff are likely to be semantically important?" Specifically, sdt is meant to aid code reviewers in excluding merely stylistic changes and focus on (likely) functional changes.

This difference is reflected in the different approaches to comparing files the two tools take. Instead of relying on fixed and compiled-in versions of parsers as difft does, sdt dynamically calls external tools (which can be whichever specific version of those is used by a project).

As a consequence, using difft with a project that, e.g., utilizes Python 3.8 is difficult to impossible. At the least, it would require recompiling the Rust tool with dependencies on whichever older version of tree-sitter-python has specifically 3.8-level syntax. In contrast, sdt can be easily and dynamically configured to utilize an appropriate Python executable, such as /usr/local/bin/python3.8.

As an illustration, in the following screenshots, git diff shows three segments where surface changes were made to a small test file. The first and third segments (but not the second) are semantically meaningful to what the program does. Both difft and sdt correctly identify this fact; but sdt presents a display closely modeled on git diff itself while difft highlights exactly those characters that represent a change (using a somewhat idiosyncratic, but clear, format). These examples are included as screenshots rather than as marked text to preserve the use of color highlighting by all three tools.

% git diff
[Screenshot: git diff]

% GIT_EXTERNAL_DIFF='difft --display=inline' git diff
[Screenshot: difft as GIT_EXTERNAL_DIFF]

% sdt semantic
[Screenshot: sdt semantic]

AST Explorer

AST Explorer is somewhat similar in concept to Difftastic. It is written in JavaScript, and uses language-native parsers to support numerous programming languages. It does not attempt to provide diff'ing, but adding that would be relatively straightforward. That was, in fact, the initial but discarded approach taken by the creator of sdt.

However, after some initial development, I realized that the quality of the third-party JS parsers that AST Explorer utilizes is often poor, and that they fail to recognize many commonplace constructs of the languages they respectively process. AST Explorer itself exposes a very nice web-based front end, but beyond the sample files of many source languages it provides, the tool often fails to parse valid source code.

Supported languages

Semantic Diff Tool does much of its work by calling other tools. You will need to install those other tools in your development environment separately. However, this requirement is generally fairly trivial, since the tools used are often the underlying runtime engines or compilers for the very same programming languages of those files whose changes are analyzed (in other words, the programming languages your project uses).

The configuration file $HOME/.sdt.toml allows you to choose specific versions of tools and specific switches to use. This is especially useful if a particular project utilizes a different version of a programming language than the one installed to the default path of a development environment. Absent an overriding configuration, each tool is assumed to reside on your $PATH, and a default collection of flags and switches is used.

For example, for Ruby files, the default command ruby --dump=parsetree is used to create an AST of the file being analyzed. Similarly, for Python files, python -m ast -a is used for the same purpose. Other tools produce canonical representations rather than ASTs, depending on what best serves the needs of a particular language (and depending on what tools are available and their quality). While overriding the configuration between different versions of a programming language or tool will probably not break the code that performs the semantic comparison, not all languages have been tested in all versions (especially versions that do not yet exist).
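
To make the idea of calling out to a language's own tooling concrete, here is a minimal sketch in Go. It is not the sdt implementation: the table defaultParsers, the helper names, the choice to select a tool by file extension, and the assumption that the file path is passed as the final argument are all illustrative. Only the two command lines themselves come from the text above.

```go
package main

import (
	"fmt"
	"os/exec"
	"path/filepath"
)

// defaultParsers maps a file extension to the default command described
// in this README. Selecting by extension is an assumption of the sketch.
var defaultParsers = map[string][]string{
	".rb": {"ruby", "--dump=parsetree"},
	".py": {"python", "-m", "ast", "-a"},
}

// parseTreeCommand returns the argv that would be run for path, or
// false if no tool is known for that kind of file.
func parseTreeCommand(path string) ([]string, bool) {
	base, ok := defaultParsers[filepath.Ext(path)]
	if !ok {
		return nil, false
	}
	// Copy the table entry so that callers cannot modify it.
	return append(append([]string{}, base...), path), true
}

// parseTree runs the external tool and returns whatever it prints.
// It requires the relevant interpreter to be on $PATH.
func parseTree(path string) ([]byte, error) {
	argv, ok := parseTreeCommand(path)
	if !ok {
		return nil, fmt.Errorf("no available analyzer for %s", path)
	}
	return exec.Command(argv[0], argv[1:]...).Output()
}

func main() {
	for _, p := range []string{"samples/simple-filter.rb", "tool.py", "README.md"} {
		if argv, ok := parseTreeCommand(p); ok {
			fmt.Println(argv)
		} else {
			fmt.Println(p, "-> no available analyzer")
		}
	}
}
```

An override for a project that needs a specific interpreter would amount to replacing the first element of an entry, for example with /usr/local/bin/python3.8.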

Ruby

Initial support created. Supports both semantic and parsetree subcommands.

Only a ruby interpreter is required (by default).

Python

Initial support created. Supports both semantic and parsetree subcommands.

Only a python interpreter is required (by default).

SQL

Initial support created. Uses canonicalization rather than parsing, so only the semantic subcommand is supported.

Requires the tool sqlformat (by default). See:

JavaScript

Initial support created. Supports both semantic and parsetree subcommands.

Requires the node interpreter and the library acorn (by default). See:

Use of the error-tolerant variant acorn-loose was contemplated and rejected (as a default). We believe this tool would be more likely to produce spurious differences in parse trees. See:

Note that Node 18.10 was used during development, but any node version supported by Acorn should work identically. If you wish to treat the files being analyzed as a specific ECMAScript version, see the option ecmaVersion, which can be configured in .sdt.toml and is discussed in the sample version of that file.

JSON

JSON is supported by the bundled tool jsonformat, which uses the Go standard library's json package to canonicalize documents. The much more sophisticated and commonly installed third-party tool jq may also be used, and is illustrated within the sample .sdt.toml.
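
Canonicalization here means rewriting a document into one fixed layout so that two files differing only in whitespace or key order become byte-identical. The following is a minimal sketch of that idea using encoding/json. It is not the source of jsonformat, and the function name canonicalJSON is an assumption.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// canonicalJSON decodes a document and re-encodes it with fixed
// indentation. encoding/json writes object keys in sorted order, so
// key order and whitespace in the input no longer matter.
// (Illustrative helper; not the jsonformat implementation.)
func canonicalJSON(src []byte) (string, error) {
	dec := json.NewDecoder(bytes.NewReader(src))
	dec.UseNumber() // keep numbers as written instead of converting to float64
	var doc any
	if err := dec.Decode(&doc); err != nil {
		return "", err
	}
	out, err := json.MarshalIndent(doc, "", "  ")
	if err != nil {
		return "", err
	}
	return string(out), nil
}

func main() {
	before := []byte(`{"name": "sdt", "tags": ["diff", "ast"], "major": 1}`)
	after := []byte("{\n  \"major\": 1,\n  \"tags\": [\"diff\", \"ast\"],\n  \"name\": \"sdt\"\n}")

	a, errA := canonicalJSON(before)
	b, errB := canonicalJSON(after)
	if errA != nil || errB != nil {
		fmt.Println("parse error:", errA, errB)
		return
	}
	fmt.Println("semantically identical:", a == b)
}
```

Array order is preserved, so reordering the elements of "tags" would still be reported as a difference. Because numbers are kept as written, 1 and 1.0 also compare as different, which errs on the side of a false positive.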

Golang

Go is supported by the bundled tool gotree, which uses the Go standard library packages go/ast, go/parser, and go/token. The parse tree format produced by this tool is designed to make the work SDT does easy, but another library is very unlikely to produce a format compatible with the assumptions made in evaluating this parse tree.
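
As an illustration of why a parse tree is useful for this purpose, the sketch below parses two versions of a Go file and renders each tree without source offsets, so that whitespace and comment changes disappear while a changed operator does not. This is not the gotree format, which is customized for SDT; the names dump and shape, and the rendering itself, are assumptions made for the example.

```go
package main

import (
	"fmt"
	"go/parser"
	"go/token"
	"reflect"
	"strings"
)

var posType = reflect.TypeOf(token.NoPos)

// dump writes a rendering of an AST value that omits source offsets.
// For token.Pos fields only presence is recorded, because a few of
// them (such as the "..." in a call) carry meaning by being set.
func dump(b *strings.Builder, v reflect.Value) {
	switch v.Kind() {
	case reflect.Interface, reflect.Pointer:
		if v.IsNil() {
			b.WriteString("nil ")
			return
		}
		dump(b, v.Elem())
	case reflect.Slice:
		b.WriteString("[ ")
		for i := 0; i < v.Len(); i++ {
			dump(b, v.Index(i))
		}
		b.WriteString("] ")
	case reflect.Struct:
		fmt.Fprintf(b, "(%s ", v.Type())
		for i := 0; i < v.NumField(); i++ {
			if v.Type().Field(i).IsExported() {
				dump(b, v.Field(i))
			}
		}
		b.WriteString(") ")
	default:
		if v.Type() == posType {
			fmt.Fprintf(b, "%t ", v.Int() != 0)
			return
		}
		fmt.Fprintf(b, "%v ", v.Interface())
	}
}

// shape parses Go source and returns its offset-free rendering.
// Comments are not parsed because parser.ParseComments is not set.
func shape(src string) (string, error) {
	fset := token.NewFileSet()
	f, err := parser.ParseFile(fset, "input.go", src, parser.SkipObjectResolution)
	if err != nil {
		return "", err
	}
	var b strings.Builder
	dump(&b, reflect.ValueOf(f))
	return b.String(), nil
}

func main() {
	original := "package p\nfunc add(a, b int) int { return a + b }\n"
	reformatted := "package p\n\nfunc add(a, b int) int {\n\t// sum\n\treturn a+b\n}\n"
	changed := "package p\nfunc add(a, b int) int { return a - b }\n"

	s1, _ := shape(original)
	s2, _ := shape(reformatted)
	s3, _ := shape(changed)
	fmt.Println("reformatted matches original:", s1 == s2)
	fmt.Println("changed matches original:", s1 == s3)
}
```

The sketch leans toward false positives in the same spirit as sdt: for instance, regrouping declarations inside parentheses changes the rendering even though behavior is the same. Because comments are dropped, comment directives such as build constraints are ignored, which a real tool would need to consider.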

Tree-Sitter languages

The widely used parser generator Tree-sitter has had a great many grammars designed for it. This tool is used by the Atom editor and by GitHub in its code highlighting and analysis facilities. As noted in the discussion of Difftastic, these grammars are all at "some version of the language" rather than allowing an arbitrary version to be configured; however, by utilizing this tool, SDT gets support for 63 languages (at time of this writing; some grammars more complete than others).

Installing the command-line tool tree-sitter requires building a project using Rust's cargo or Node.js' npm, and generating each grammar requires that Node.js is installed. A C compiler is also needed to build the core parser.

There are a number of moving pieces needed to install each supported language, but the instructions in the tree-sitter project documentation are fairly straightforward. If no more specialized language parser is defined in SDT-specific code, a final effort is made to use the bundled tool treesit to produce a parse tree for a modified file.

The tool treesit wraps a call to tree-sitter parse, and merely massages the output slightly to produce a format compatible with that produced by gotree. This allows SDT to produce semantic analysis for all languages having installed tree-sitter grammars. Available grammars include:

* Agda
* Bash
* C
* C#
* C++
* Common Lisp
* CSS
* CUDA
* D
* Dockerfile
* DOT
* Elixir
* Elm
* Emacs Lisp
* Eno
* ERB / EJS
* Erlang
* Fennel
* GLSL (OpenGL Shading Language)
* Go
* Go mod
* Hack
* Haskell
* HCL
* HTML
* Java
* JavaScript
* JSON
* Julia
* Kotlin
* Lua
* Make
* Markdown
* Nix
* Objective-C
* OCaml
* Org
* Perl
* PHP
* Protocol Buffers
* Python
* R
* Racket
* Ruby
* Rust
* Scala
* S-expressions
* Sourcepawn
* SPARQL
* SQL
* Svelte
* Swift
* SystemRDL
* TOML
* Turtle
* Twig
* TypeScript
* Verilog
* VHDL
* Vue
* WASM
* WGSL (WebGPU Shading Language)
* YAML

Others

What do you want most?

Directories

Path Synopsis
cmd
gotree
The purpose of this small program is to generate an AST for a Golang package.
jsonformat
The purpose of this small program is to canonicalize existing JSON files.
sdt
pkg
sql
