revdotcom/fstalign

An efficient OpenFST-based tool for calculating WER and aligning two transcript sequences.

License

Apache-2.0 license

fstalign

Overview

fstalign is a tool for creating an alignment between two sequences of tokens (hereafter referred to as "reference" and "hypothesis"). It has two key functions: computing word error rate (WER) and aligning NLP-formatted references with CTM hypotheses.

Due to its use of OpenFST and lazy algorithms for text-based alignment, fstalign is efficient for calculating WER while also providing significant flexibility for different measurement features and error analysis.
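
To make the WER metric concrete: it is the number of substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference tokens. The sketch below computes it with a plain dynamic-programming edit distance. This is only an illustration of the metric, not fstalign's OpenFST-based implementation, and the function name `word_error_rate` is made up for this example.

```python
def word_error_rate(reference, hypothesis):
    """Token-level WER using a plain Levenshtein distance.

    Illustrative only: fstalign composes FSTs lazily instead of
    filling a dynamic-programming table like this.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one token")

    # prev[j] holds the edit distance between the reference tokens
    # consumed so far and the first j hypothesis tokens.
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_tok != hyp_tok)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref = "welcome back to another episode of podcasts in color"
    hyp = "welcome back to another episode of podcast and color"
    print(f"WER: {word_error_rate(ref, hyp):.3f}")
```

The table-based approach costs time proportional to the product of the two sequence lengths, which is one reason a lazy graph traversal is attractive for long transcripts.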

What's new in 2.0

Version 2.0 introduces two major changes:

A new method to traverse the composition graph, which dramatically improves the overall speed, especially when the sequences are long or contain many errors. We have files that previously took 25 minutes to align and now take about 7 seconds. This is especially noticeable with the adapted composition (the default).
Some smarts were introduced for when --use-case and --use-punctuation are enabled. Now, by default, punctuation symbols can only be substituted by other punctuation symbols (or deleted/inserted). Also, words that differ only in the case of their first letter will be preferred for substitution.

Here's an example of the 1.x behavior compared with the 2.0 behavior:

    ==> v1.x sbs.txt <==
    ref_token   hyp_token   IsErr   Class   Wer_Tag_Entities
    Welcome     Welcome                     ###322_###|
    back        back
    to          to
    another     another
    episode     episode                     ###323_###|
    of          of
    Podcasts    Podcast     ERR             ###324_###|
    in          and         ERR
    Color       Color                       ###167_###|###325_###|
    :           of          ERR
    The         the         ERR
    Podcast     Podcast                     ###168_###|###326_###|
    .           .
    I           I

    ==> v2.0 sbs.txt <==
    ref_token   hyp_token   IsErr   Class   Wer_Tag_Entities
    Welcome     Welcome                     ###322_###|
    back        back
    to          to
    another     another
    episode     episode                     ###323_###|
    of          of
    Podcasts    Podcast     ERR             ###324_###|
    in          and         ERR
    Color       Color                       ###167_###|###325_###|
    <ins>       of          ERR
    :           <del>       ERR
    The         the         ERR
    Podcast     Podcast                     ###168_###|###326_###|

The confusion between : and of is no longer allowed.

Also, here's what the alignment looks like with and without favoring substitutions based on a case-insensitive comparison (such substitutions are still counted as errors):

    ==> v1.x sbs.txt <==
    ref_token   hyp_token   IsErr   Class   Wer_Tag_Entities
    shorten     shorten                     ###801_###|
    It's        it's        ERR
    Berry       Barry       ERR             ###785_###|###788_###|###802_###|
    .           .
    Just        Just
    Yeah        like        ERR             ###805_###|
    .           <del>       ERR
    Like        <del>       ERR
    ,           <del>       ERR
    I           I                           ###809_###|
    have        have
    a           a
    nickname    nickname

    ==> v2.0 sbs.txt <==
    ref_token   hyp_token   IsErr   Class   Wer_Tag_Entities
    It's        it's        ERR
    Berry       Barry       ERR             ###785_###|###788_###|###802_###|
    .           .
    Just        Just
    Yeah        <del>       ERR             ###805_###|
    .           <del>       ERR
    Like        like        ERR
    ,           <del>       ERR
    I           I                           ###809_###|
    have        have
    a           a
    nickname    nickname

Here, the Like <-> like substitution is favored. While this generally won't change the WER value itself (although it can), it will improve the timing alignments.

These behaviors, as well as the beam size (which has a default value of 50.0), can be controlled with the following new parameters:

    --disable-strict-punctuation  Disable strict punctuation alignment (which prevents punctuation aligning with words).
    --disable-favored-subs        Disable favored substitutions (which makes alignment favor substitutions between words which differ only by case).
    --favored-sub-cost FLOAT      Cost for favored substitutions (e.g., case diff). Default: 0.1
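
The effect of these options can be illustrated with a small cost model. The sketch below is a plain dynamic-programming aligner, not fstalign's graph traversal, and everything in it is an assumption made for illustration: the function names, the unit cost for ordinary edits, and the treatment of a token as punctuation when all of its characters are ASCII punctuation. Only the 0.1 favored-substitution default comes from the option list above.

```python
import string

INF = float("inf")


def is_punct(token):
    # Assumption: a token is punctuation if every character is ASCII punctuation.
    return bool(token) and all(ch in string.punctuation for ch in token)


def sub_cost(ref, hyp, strict_punctuation=True, favored_subs=True,
             favored_sub_cost=0.1):
    """Cost of aligning ref with hyp (hypothetical weights, except 0.1)."""
    if ref == hyp:
        return 0.0
    if strict_punctuation and is_punct(ref) != is_punct(hyp):
        return INF  # punctuation may not be substituted by a word
    if favored_subs and ref.lower() == hyp.lower():
        return favored_sub_cost  # still an error, but a cheap one
    return 1.0


def align(ref, hyp, **opts):
    """Return (ref_token, hyp_token) pairs using <ins>/<del> placeholders."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)
    for j in range(1, m + 1):
        cost[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1], **opts),
                cost[i - 1][j] + 1,  # deletion
                cost[i][j - 1] + 1,  # insertion
            )

    # Walk back from the end, preferring a match/substitution when it is optimal.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j]
                == cost[i - 1][j - 1] + sub_cost(ref[i - 1], hyp[j - 1], **opts)):
            pairs.append((ref[i - 1], hyp[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((ref[i - 1], "<del>"))
            i -= 1
        else:
            pairs.append(("<ins>", hyp[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


if __name__ == "__main__":
    ref = ["Color", ":", "The", "Podcast"]
    hyp = ["Color", "of", "the", "Podcast"]
    for row in align(ref, hyp):
        print(row)
```

With the defaults, the colon can no longer pair with "of" and the result contains an insertion and a deletion instead; passing `strict_punctuation=False` brings back the single `:` to `of` substitution seen in the 1.x output.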

Installation

Dependencies

We use git submodules to manage third-party dependencies. Initialize and update submodules before proceeding to the main build steps.

git submodule update --init --recursive

This will pull the current dependencies:

catch2 - for unit testing
spdlog - for logging
CLI11 - for CLI construction
csv - for CTM and NLP input parsing
jsoncpp - for JSON output construction
strtk - for various string utilities

Additionally, we have dependencies outside of the third-party submodules:

OpenFST - currently provided to the build system by setting the $OPENFST_ROOT environment variable or during the CMake command via -DOPENFST_ROOT.

Build

The current build framework is CMake. Install CMake following the instructions here (https://cmake.org/install/).

To build fstalign, run:

    mkdir build && cd build
    cmake .. -DOPENFST_ROOT="<path to OpenFST>" -DDYNAMIC_OPENFST=ON
    make

Note: -DDYNAMIC_OPENFST=ON is needed if OpenFST at OPENFST_ROOT is compiled as shared libraries. Otherwise static libraries are assumed.
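
For scripted builds, the configure step can be assembled programmatically. The helper below is a hypothetical sketch (the name `cmake_configure_args` and its parameters are not part of fstalign); it only combines the two CMake options described above and falls back to the OPENFST_ROOT environment variable when no path is given.

```python
import os


def cmake_configure_args(openfst_root=None, dynamic_openfst=False,
                         source_dir=".."):
    """Build the argument list for the CMake configure step.

    Hypothetical helper: it only reproduces the -DOPENFST_ROOT and
    -DDYNAMIC_OPENFST options mentioned in the build instructions.
    """
    root = openfst_root or os.environ.get("OPENFST_ROOT")
    if not root:
        raise ValueError("pass openfst_root or set the OPENFST_ROOT variable")

    args = ["cmake", source_dir, f"-DOPENFST_ROOT={root}"]
    if dynamic_openfst:
        # Only needed when OpenFST was compiled as shared libraries.
        args.append("-DDYNAMIC_OPENFST=ON")
    return args


if __name__ == "__main__":
    # "/opt/openfst" is a placeholder path for this example.
    print(" ".join(cmake_configure_args("/opt/openfst", dynamic_openfst=True)))
```

The returned list can be handed to `subprocess.run(args, cwd="build", check=True)` from the repository root once the build directory exists.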

Finally, tests can be run using:

make test

Docker

The fstalign Docker image is hosted on Docker Hub and can be easily pulled and run:

    docker pull revdotcom/fstalign
    docker run --rm -it revdotcom/fstalign

See https://hub.docker.com/r/revdotcom/fstalign/tags for the available versions/tags to pull. If you want to run the tool on local files, you can mount local directories with the -v flag of the docker run command.
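
As a sketch of the mount option, the helper below builds a docker run argument list that maps a local directory into the container. The helper name and the /data mount point are assumptions chosen for this example; only the image name and the --rm, -it, and -v flags come from the text above.

```python
import os


def docker_run_args(local_dir, container_dir="/data",
                    image="revdotcom/fstalign"):
    """Build a `docker run` command that mounts local_dir in the container.

    Hypothetical helper: /data is an arbitrary mount point, not a path
    that the fstalign image is known to expect.
    """
    host_path = os.path.abspath(local_dir)  # -v needs an absolute host path
    return [
        "docker", "run", "--rm", "-it",
        "-v", f"{host_path}:{container_dir}",
        image,
    ]


if __name__ == "__main__":
    print(" ".join(docker_run_args("/tmp/transcripts")))
```

Files placed in the local directory then appear under the chosen mount point inside the container.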

From inside the container:

/fstalign/build/fstalign --help

For development you can also build the docker image locally using:

docker build . -t fstalign-dev

Documentation

For more information on how to use fstalign, see our documentation.

Releases

18 releases. Latest: "New major release : v 2.0 !" (May 1, 2025).
