19 KiB
+++ title = "Addon for audio file support" [extra] author = "eikek" +++
Audio file support
Since version 0.36.0 Docspell can be extended by addons - external programs that are executed at some defined point in Docspell. This is a walk through the first addon that was created, mainly as an example: providing support for audio files.
I think it is interesting to provide support for audio files for a DMS, although admittedly I don't have much of a use :). But this is the kind of use-case that addons are for.
The idea
The idea is very simple: the real work is done by external programs, most notably coqui's stt a deep learning toolkit originally created at Mozilla. It provides a command line tool that accepts a WAV file and spits out text. Perfect!
With this text, a PDF file can be created and a preview image which is already enough for basic support. You can see the pdf in the web-ui and search for the text via SOLR or PostgreSQL.
Because a WAV file is not the most popular format today, ffmpeg
can
be used to transform any other audio to WAV.
The only thing now is to create a program that checks the uploaded files, filters out all audio files and runs them through the mentioned programs. So let's do this.
Preparation
Addons are external programs and can be written in whatever language…. For me this is a good opportunity to refresh my rusty scheme know-how a bit. So this addon is written in Scheme, in particular guile. Programming in scheme is fun and guile provides good integration into the (posix) OS and also has a nice JSON module. I had the reference docs open all the time - look at them for further details on the used functions.
It's usually good to play around with the tools at first. For stt, we first need to download a model. This will be used to "detect" the text in the audio data. They have a page where we can download model files for any supported language. For the addon, we will implement English and German.
When creating a PDF with wkhtmltopdf, we prettify it a little by embedding the plain text into some html template. This will also take care to specifiy UTF-8 as default encoding directly in the HTML template.
FFMpeg just works as usual. It figures out the input format automatically and knows from the extension of the output file what to do.
You can find the full code here. The following shows excerpts from it with some explanation.
The script
Helpers
After the preamble, there are two helper functions.
(define* (errln formatstr . args)
(apply format (current-error-port) formatstr args)
(newline))
;; Macro for executing system commands and making this program exit in
;; case of failure.
(define-syntax sysexec
(syntax-rules ()
((sysexec exp ...)
(let ((rc (apply system* (list exp ...))))
(unless (eqv? rc EXIT_SUCCESS)
(format (current-error-port) "> '~a …' failed with: ~#*~:*~d~%" exp ... rc)
(exit 1))
#t))))
As this addon wants to pass data back to Docspell via stdout, we use
the stderr for logging and printing general information. The function
errln
(short for "error line" :)) allows to conveniently print to
stderr and the second wraps the system*
procedure such that the
script fails whenever the external program fails. It is somewhat
similar to set -e
in bash.
Dependencies
Next is the declaration of external dependencies. At first all external programs are listed. This is important for later, when the script is packaged via nix. Nix will substitute these commands with absolute paths. Then it's good to not have them scattered around.
It also reads in the expected environment variables (only those we need) that are provided by Docspell. Since this addon only makes sense to work on an item, it quits early should some env vars are missing.
(define *curl* "curl")
(define *ffmpeg* "ffmpeg")
(define *stt* "stt")
(define *wkhtmltopdf* "wkhtmltopdf")
;; Getting some environment variables
(define *output-dir* (getenv "OUTPUT_DIR"))
(define *tmp-dir* (getenv "TMP_DIR"))
(define *cache-dir* (getenv "CACHE_DIR"))
(define *item-data-json* (getenv "ITEM_DATA_JSON"))
(define *original-files-json* (getenv "ITEM_ORIGINAL_JSON"))
(define *original-files-dir* (getenv "ITEM_ORIGINAL_DIR"))
;; fail early if not in the right context
(when (not *item-data-json*)
(errln "No item data json file found.")
(exit 1))
Input/Output
The input and output schemas can be defined now. This uses the guile-json module. It provides very convenient features for reading and writing json.
It is possible to define a record via define-json-type
that
generates readers and writers to/from JSON. For example, the record
<itemdata>
is defined to be an object with only one field id
. The
function json->scm
reads in json into scheme datastructures and then
the generated function scm->itemdata
creates the record from it. For
every record, accessor functions exists. For example: (itemdata-id data)
would lookup the field id
in the given itemdata record
data
.
Here we need it to get the item-id and the list of file properties belonging to the original uploaded files.
Another interesting definition is the <output>
record. This captures
(a subset of) the schema of what Docspell receives from this addon as
a result. A full example of this data is
here. We don't need commands
or
newItems
, so this schema only cares about the files
attribute.
(define-json-type <itemdata>
(id))
;; The array of original files
(define-json-type <original-file>
(id)
(name)
(position)
(language)
(mimetype)
(length)
(checksum))
;; The output record, what is returned to docspell
(define-json-type <itemfiles>
(itemId)
(textFiles)
(pdfFiles))
(define-json-type <output>
(files "files" #(<itemfiles>)))
;; Parses the JSON containing the item information
(define *itemdata-json*
(scm->itemdata (call-with-input-file *item-data-json* json->scm)))
;; The JSON file containing meta data for all source files as vector.
(define *original-meta-json*
(let ((props (vector->list (call-with-input-file *original-files-json* json->scm))))
(map scm->original-file props)))
Finding the audio file
The previously parsed json array *original-meta-json*
can now be
used to find any audio files within the original uploaded files, as
done in find-audio-files
. It simply goes through the list and keeps
those files whose mimetype starts with audio/
. The mimetype is
provided by Docspell in the file properties in ITEM_ORIGINAL_JSON
.
Before converting to wav with ffmpeg, it is quickly checked if it's not a wav already.
(define (is-wav? mime)
"Test whether the mimetype MIME is denoting a wav file."
(or (string-suffix? "/wav" mime)
(string-suffix? "/x-wav" mime)
(string-suffix? "/vnd.wav" mime)))
(define (find-audio-files)
"Find all source files that are audio files."
(filter! (lambda (el)
(string-prefix?
"audio/"
(original-file-mimetype el)))
*original-meta-json*))
(define (convert-wav id mime)
"Run ffmpeg to convert to wav."
(let ((src-file (format #f "~a/~a" *original-files-dir* id))
(out-file (format #f "~a/in.wav" *tmp-dir*)))
(if (is-wav? mime)
src-file
(begin
(errln "Running ffmpeg to convert wav file...")
(sysexec *ffmpeg* "-loglevel" "error" "-y" "-i" src-file out-file)
out-file))))
Speech to text
Once we have a wav file, we can run speech-to-text recognition on it. As said above, we need to download a model first, which is depending on a language. Luckily, Docspell provides the language of the file. This is the lanugage either given directly by the user when uploading or it's the collective's default language.
In the following snippet, we get the language as arguments. We will get it later from the file properties.
As seen below, the model file is stored to the CACHE_DIR
. This is
provided by Docspell and will survive the execution of this script.
All other directories involved will be deleted eventually. The
CACHE_DIR
is the place to store intermediate results you don't want
to loose between addon runs. But as any cache, it may not exist the
next time the addon is run. Docspell doesn't clear it automatically,
though.
The last function simply executes the stt
external command and puts
stdout into a file.
(define (get-model language)
(let* ((lang (or language "eng"))
(file (format #f "~a/model_~a.pbmm" *cache-dir* lang)))
(unless (file-exists? file)
(download-model lang file))
file))
(define (download-model lang file)
"Download model files per language. Nix has currently stt 0.9.3 packaged."
(let ((url (cond
((string= lang "eng") "https://coqui.gateway.scarf.sh/english/coqui/v0.9.3/model.pbmm")
((string= lang "deu") "https://coqui.gateway.scarf.sh/german/AASHISHAG/v0.9.0/model.pbmm")
(else (error "Unsupported language: " lang)))))
(errln "Downloading model file for language: ~a" lang)
(sysexec *curl* "-SsL" "-o" file url)
file))
(define (extract-text model input out)
"Runs stt for speech-to-text and writes the text into the file OUT."
(errln "Extracting text from audio…")
(with-output-to-file out
(lambda ()
(sysexec *stt* "--model" model "--audio" input))))
Create PDF
Creating the PDF is straight forward. The extracted text is embedded
into a HTML file which is then passed to wkhtmltopdf
. Since we don't
need this file for anything else, it is stored to the TMP_DIR
.
(define (create-pdf txt-file out)
(define (line str)
(format #t "~a\n" str))
(errln "Creating pdf file…")
(let ((tmphtml (format #f "~a/text.html" *tmp-dir*)))
(with-output-to-file tmphtml
(lambda ()
(line "<!DOCTYPE html>")
(line "<html>")
(line " <head><meta charset=\"UTF-8\"></head>")
(line " <body style=\"padding: 2em; font-size: large;\">")
(line " <div style=\"padding: 0.5em; font-size:normal; font-weight: bold; border: 1px solid black;\">")
(line " Extracted from audio using stt on ")
(display (strftime "%c" (localtime (current-time))))
(line " </div>")
(line " <p>")
(display (call-with-input-file txt-file read-string))
(line " </p>")
(line "</body></html>")))
(sysexec *wkhtmltopdf* tmphtml out)))
Putting it together
The main function now puts everything together. The process-file
function is called for every file that is returned from
(find-audio-files)
. It will extract the necessary information (like
the language) from the json document via record accessors (e.g.
original-file-lanugage file)
) and then calls the functions defined
above. At last it creates a <itemfile>
record with make-itemfiles
.
An <itemfile>
record contains now the important information for
Docspell. It requires the item-id and a mapping from attachment-ids to
files in OUTPUT_DIR
. For each attachment identified by its ID,
Docspell replaces the extracted text with the contents of the given
file and replaces the converted PDF file, respectively. In the code
below, two lists of such mappings are defined - the first for the text
files, the second for the converted pdf. The files must be specified
relative to OUTPUT_DIR
.
That means process-all
returns a list of <itemfile>
records which
is then used to create the <output>
record. And finally, a
output->json
function will turn the record into proper JSON which is
send to stdout.
(define (process-file itemid file)
"Processing a single audio file."
(let* ((id (original-file-id file))
(mime (original-file-mimetype file))
(lang (original-file-language file))
(txt-file (format #f "~a/~a.txt" *output-dir* id))
(pdf-file (format #f "~a/~a.pdf" *output-dir* id))
(wav (convert-wav id mime))
(model (get-model lang)))
(extract-text model wav txt-file)
(create-pdf txt-file pdf-file)
(make-itemfiles itemid
`((,id . ,(format #f "~a.txt" id)))
`((,id . ,(format #f "~a.pdf" id))))))
(define (process-all)
(let ((item-id (itemdata-id *itemdata-json*)))
(map (lambda (file)
(process-file item-id file))
(find-audio-files))))
(define (main args)
(let ((out (make-output (process-all))))
(format #t "~a" (output->json out))))
Example output:
{
"files": [
{
"itemId":"qZDnyGIAJsXr",
"textFiles": { "HPFvIDib6eA": "HPFvIDib6eA.txt" },
"pdfFiles": { "HPFvIDib6eA": "HPFvIDib6eA.pdf"}
}
]
}
Packaging
Now with that script some additional plumbing is needed to make it an "Addon" for Docspell.
The external tools - stt, ffmpeg, curl and wkhtmltopdf are required as well as guile to compile and interpret the script. Also the guile-json module must be installed.
This can turn into a quite tedious task. Luckily, there is nix that has an answer to this. A user who wants to use this script only needs to install nix. This package manager then takes care of providing the exact dependencies we need (down to the correct version and including guile as the language and runtime).
A flake
Everything is defined in the flake.nix
in the source root. It looks
like this:
{
description = "A docspell addon for basic audio file support";
inputs = {
utils.url = "github:numtide/flake-utils";
# Nixpkgs / NixOS version to use.
nixpkgs.url = "nixpkgs/nixos-21.11";
};
outputs = { self, nixpkgs, utils }:
utils.lib.eachSystem ["x86_64-linux"] (system:
let
pkgs = import nixpkgs {
inherit system;
overlays = [
];
};
name = "audio-files-addon";
in rec {
packages.${name} = pkgs.callPackage ./nix/addon.nix {
inherit name;
};
defaultPackage = packages.${name};
apps.${name} = utils.lib.mkApp {
inherit name;
drv = packages.${name};
};
defaultApp = apps.${name};
## … omitted for brevity
}
);
}
First sad thing is, that only x86_64
systems are supported. This is
due to stt
not being available on other platforms currently (as
provided by nixpkgs).
The rest is a bit magic: A package and "defaultPackage" is defined
with a reference to nix/addon.nix
. The important part is the line
inputs = {
# Nixpkgs / NixOS version to use.
nixpkgs.url = "nixpkgs/nixos-21.11";
};
It says that as input for "building" the script, we take all of nixpkgs which is a package collection defined for (and in) nix - including thousands of software packages. We can pick and choose from these. No surprise, all external tools we need are included!
A flake defines the inputs and outputs of a package. With all of nixpkgs as inputs, we can create a definition to elevate this script into a package.
Package definition
The definition for "building" the script is in nix/addon.nix
:
{ stdenv, bash, cacert, curl, stt, wkhtmltopdf, ffmpeg, guile, guile-json, lib, name }:
stdenv.mkDerivation {
inherit name;
src = lib.sources.cleanSource ../.;
buildInputs = [ guile guile-json ];
patchPhase = ''
TARGET=src/addon.scm
sed -i 's,\*curl\* "curl",\*curl\* "${curl}/bin/curl",g' $TARGET
sed -i 's,\*ffmpeg\* "ffmpeg",\*ffmpeg\* "${ffmpeg}/bin/ffmpeg",g' $TARGET
sed -i 's,\*stt\* "stt",\*stt\* "${stt}/bin/stt",g' $TARGET
sed -i 's,\*wkhtmltopdf\* "wkhtmltopdf",\*wkhtmltopdf\* "${wkhtmltopdf}/bin/wkhtmltopdf",g' $TARGET
'';
buildPhase = ''
guild compile -o ${name}.go src/addon.scm
'';
# module name must be same as <filename>.go
installPhase = ''
mkdir -p $out/{bin,lib}
cp ${name}.go $out/lib/
cat > $out/bin/${name} <<-EOF
#!${bash}/bin/bash
export SSL_CERT_FILE="${cacert}/etc/ssl/certs/ca-bundle.crt"
exec -a "${name}" ${guile}/bin/guile -C ${guile-json}/share/guile/ccache -C $out/lib -e '(${name}) main' -c "" \$@
EOF
chmod +x $out/bin/${name}
'';
}
With a bit of handwaving - this is a bash script that modifies
slightly the scheme script and runs a compile on it. We simply declare
all packages we need in the first line of { … }
- these are
arguments that are automatically filled by nix by searching the
corresponding package in nixpkgs.
First the patchPhase
is executed. It will replace the variables
containing the external tools with an absolute path to the version
that we currently get from nixpkgs. With this step nix takes care that
all these packages are available at runtime when executing the
script. All versions are finally fixed in flake.lock
and can be
upgraded manually.
The buildPhase
runs the guile compiler that produces some
intermediate code that will be loaded instead of compiling the script
on-the-fly.
At last, installPhase
creates a wrapper script that runs guile with
the correct load-path pointing to guile-json
and to our pre-compiled
script. Additionally, trusted root certificates are exported to make
the curl commands work. This script will be created in $out
directory that is provided by nix.
If you now run nix build
in the source root, it will execute all
these phases and produce a symlink pointing to the result. You can
then cat
the resulting file if you are curious.
This way the script is completely isolated from the system it runs on - as long as the nix package manager is available. It includes all the external tools, as well as the underlying runtime (guile)! The result is a tiny wrapper bash script that can be run "everywhere" (modulo all the restrictions, like non-x86_64 platforms, of course :)).
Addon Descriptor
At last, a small yaml file is needed to tell Docspell a little about the addon.
meta:
name: "audio-files-addon"
version: "0.1.0"
description: |
This addon adds support for audio files. Audio files are processed
by a speech-to-text engine and a pdf is generated.
It doesn't expect any user arguments at the moment. It requires
internet access to download model files.
triggers:
- final-process-item
- final-reprocess-item
- existing-item
runner:
nix:
enable: true
docker:
enable: false
trivial:
enable: true
exec: src/addon.scm
options:
networking: true
collectOutput: true
This tells Docspell via triggers
when this addon may be run. This
one only makes sense for an item. Thus it can be hooked up to run with
every file-processing job or a user can manually trigger it on an
item.
It also tells via runner:
that it can be build and run via nix, but
not via docker (I gave up after an hour to create a Dockerfile…). It
could also be run "as-is" but the user then needs to install all these
tools and guile manually.
Done
That's it. You can install this addon in Docspell and create a run configuration to let it execute when you want.