+++
title = "Addon for audio file support"
[extra]
author = "eikek"
+++

# Audio file support

Since version 0.36.0 Docspell can be extended by [addons](@/docs/addons/basics.md) - external programs that are executed at some defined point in Docspell. This is a walk-through of the first addon that was created, mainly as an example: providing support for audio files.

I think it is interesting to provide support for audio files for a DMS, although admittedly I don't have much of a use for it :). But this is the kind of use-case that addons are for.

# The idea

The idea is very simple: the real work is done by external programs, most notably [coqui's stt](https://github.com/coqui-ai/STT), a deep learning toolkit originally created at Mozilla. It provides a command line tool that accepts a WAV file and spits out text. Perfect!

With this text, a PDF file and a preview image can be created, which is already enough for basic support. You can see the pdf in the web-ui and search for the text via SOLR or PostgreSQL.

Because WAV is not the most popular format today, `ffmpeg` can be used to transform any other audio format to WAV. The only thing left is to create a program that checks the uploaded files, filters out all audio files and runs them through the mentioned programs. So let's do this.

# Preparation

Addons are external programs and can be written in whatever language…. For me this is a good opportunity to refresh my rusty scheme know-how a bit. So this addon is written in Scheme, in particular [guile](https://www.gnu.org/software/guile/). Programming in scheme is fun and guile provides good integration into the (posix) OS and also has a nice JSON module. I had the [reference docs](https://www.gnu.org/software/guile/docs/docs-2.2/guile-ref/index.html) open all the time - look at them for further details on the used functions.

It's usually good to play around with the tools at first. For stt, we first need to download a *model*. This will be used to "detect" the text in the audio data.
They have a [page](https://github.com/coqui-ai/STT-models/releases) where we can download model files for any supported language. For the addon, we will implement English and German.

When creating a PDF with wkhtmltopdf, we prettify it a little by embedding the plain text into some HTML template. This also takes care of specifying UTF-8 as the default encoding directly in the HTML template.

FFmpeg just works as usual. It figures out the input format automatically and knows from the extension of the output file what to do.

You can find the full code [here](https://github.com/docspell/audio-files-addon/blob/master/src/addon.scm). The following shows excerpts from it with some explanation.

# The script

## Helpers

After the preamble, there are two helper functions.

```lisp
;; Print a formatted line to stderr. The trailing newline goes to
;; stderr as well, so stdout stays reserved for the JSON result.
(define* (errln formatstr . args)
  (apply format (current-error-port) formatstr args)
  (newline (current-error-port)))

;; Macro for executing system commands and making this program exit in
;; case of failure.
(define-syntax sysexec
  (syntax-rules ()
    ((sysexec exp ...)
     (let ((rc (apply system* (list exp ...))))
       (unless (eqv? rc EXIT_SUCCESS)
         (format (current-error-port) "> '~a …' failed with: ~#*~:*~d~%" exp ... rc)
         (exit 1))
       #t))))
```

As this addon wants to pass data back to Docspell via stdout, we use stderr for logging and printing general information. The function `errln` (short for "error line" :)) conveniently prints to stderr. The second helper, `sysexec`, wraps the `system*` procedure such that the script fails whenever the external program fails. It is somewhat similar to `set -e` in bash.

## Dependencies

Next is the declaration of external dependencies. At first all external programs are listed. This is important for later, when the script is packaged via nix. Nix will substitute these commands with absolute paths, so it's good to not have them scattered around.

It also reads in the expected environment variables (only those we need) that are provided by Docspell.
Since this addon only makes sense when working on an item, it quits early should some env vars be missing.

```lisp
(define *curl* "curl")
(define *ffmpeg* "ffmpeg")
(define *stt* "stt")
(define *wkhtmltopdf* "wkhtmltopdf")

;; Getting some environment variables
(define *output-dir* (getenv "OUTPUT_DIR"))
(define *tmp-dir* (getenv "TMP_DIR"))
(define *cache-dir* (getenv "CACHE_DIR"))

(define *item-data-json* (getenv "ITEM_DATA_JSON"))
(define *original-files-json* (getenv "ITEM_ORIGINAL_JSON"))
(define *original-files-dir* (getenv "ITEM_ORIGINAL_DIR"))

;; fail early if not in the right context
(when (not *item-data-json*)
  (errln "No item data json file found.")
  (exit 1))
```

## Input/Output

The input and output schemas can be defined now. This uses the [guile-json](https://github.com/aconchillo/guile-json) module. It provides very convenient features for reading and writing json. It is possible to define a record via `define-json-type` that generates readers and writers to/from JSON.

For example, the record `<itemdata>` is defined to be an object with only one field `id`. The function `json->scm` reads json into scheme data structures and then the generated function `scm->itemdata` creates the record from it. For every record, accessor functions exist. For example, `(itemdata-id data)` would look up the field `id` in the given itemdata record `data`. Here we need it to get the item-id and the list of file properties belonging to the original uploaded files.

Another interesting definition is the `<output>` record. This captures (a subset of) the schema of what Docspell receives from this addon as a result. A full example of this data is [here](@/docs/addons/writing.md#output). We don't need `commands` or `newItems`, so this schema only cares about the `files` attribute.
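To get a feeling for what `define-json-type` generates, here is a small stand-alone sketch. The `<point>` record and its fields are made up for illustration and are not part of the addon; it assumes guile-json is available on the load path. The addon's own definitions follow the same pattern.

```lisp
(use-modules (json))

;; Hypothetical record, only for illustration: a JSON object with two fields.
(define-json-type <point>
  (x)
  (y))

;; scm->point builds the record from already parsed JSON (an alist);
;; json->point would parse it directly from a JSON string.
(define p (scm->point '(("x" . 1) ("y" . 2))))

;; Accessors are named after the record and the field.
(format (current-error-port) "x=~a y=~a~%" (point-x p) (point-y p))

;; point->json is the generated writer that turns the record into a JSON string.
(format (current-error-port) "~a~%" (point->json p))
```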
```lisp
(define-json-type <itemdata>
  (id))

;; The array of original files
(define-json-type <original-file>
  (id)
  (name)
  (position)
  (language)
  (mimetype)
  (length)
  (checksum))

;; The output record, what is returned to docspell
(define-json-type <itemfiles>
  (itemId)
  (textFiles)
  (pdfFiles))
(define-json-type <output>
  (files "files" #(<itemfiles>)))

;; Parses the JSON containing the item information
(define *itemdata-json*
  (scm->itemdata (call-with-input-file *item-data-json* json->scm)))

;; The JSON file containing meta data for all source files as vector.
(define *original-meta-json*
  (let ((props (vector->list (call-with-input-file *original-files-json* json->scm))))
    (map scm->original-file props)))
```

## Finding the audio file

The previously parsed json array `*original-meta-json*` can now be used to find any audio files within the original uploaded files, as done in `find-audio-files`. It simply goes through the list and keeps those files whose mimetype starts with `audio/`.

The mimetype is provided by Docspell in the file properties in `ITEM_ORIGINAL_JSON`.

Before converting to wav with ffmpeg, a quick check makes sure the file is not a wav already.

```lisp
(define (is-wav? mime)
  "Test whether the mimetype MIME is denoting a wav file."
  (or (string-suffix? "/wav" mime)
      (string-suffix? "/x-wav" mime)
      (string-suffix? "/vnd.wav" mime)))

(define (find-audio-files)
  "Find all source files that are audio files."
  (filter! (lambda (el)
             (string-prefix? "audio/" (original-file-mimetype el)))
           *original-meta-json*))

(define (convert-wav id mime)
  "Run ffmpeg to convert to wav."
  (let ((src-file (format #f "~a/~a" *original-files-dir* id))
        (out-file (format #f "~a/in.wav" *tmp-dir*)))
    (if (is-wav? mime)
        src-file
        (begin
          (errln "Running ffmpeg to convert wav file...")
          (sysexec *ffmpeg* "-loglevel" "error" "-y" "-i" src-file out-file)
          out-file))))
```

## Speech to text

Once we have a wav file, we can run speech-to-text recognition on it. As said above, we need to download a model first, which depends on the language.
Luckily, Docspell provides the language of the file. This is either the language given directly by the user when uploading or the collective's default language. In the following snippet, we get the language as an argument. We will get it later from the file properties.

As seen below, the model file is stored in the `CACHE_DIR`. This directory is provided by Docspell and will survive the execution of this script. All other directories involved will be deleted eventually. The `CACHE_DIR` is the place to store intermediate results you don't want to lose between addon runs. But as with any cache, it may not exist the next time the addon is run. Docspell doesn't clear it automatically, though.

The last function simply executes the `stt` external command and puts stdout into a file.

```lisp
(define (get-model language)
  (let* ((lang (or language "eng"))
         (file (format #f "~a/model_~a.pbmm" *cache-dir* lang)))
    (unless (file-exists? file)
      (download-model lang file))
    file))

(define (download-model lang file)
  "Download model files per language. Nix has currently stt 0.9.3 packaged."
  (let ((url (cond
              ((string= lang "eng") "https://coqui.gateway.scarf.sh/english/coqui/v0.9.3/model.pbmm")
              ((string= lang "deu") "https://coqui.gateway.scarf.sh/german/AASHISHAG/v0.9.0/model.pbmm")
              (else (error "Unsupported language: " lang)))))
    (errln "Downloading model file for language: ~a" lang)
    (sysexec *curl* "-SsL" "-o" file url)
    file))

(define (extract-text model input out)
  "Runs stt for speech-to-text and writes the text into the file OUT."
  (errln "Extracting text from audio…")
  (with-output-to-file out
    (lambda ()
      (sysexec *stt* "--model" model "--audio" input))))
```

## Create PDF

Creating the PDF is straightforward. The extracted text is embedded into an HTML file which is then passed to `wkhtmltopdf`. Since we don't need this file for anything else, it is stored in the `TMP_DIR`.
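One thing to keep in mind when embedding text into HTML: characters such as `<` or `&` would be interpreted as markup. The addon's code writes the extracted text as-is. If that ever becomes a problem, a small escaping helper could be applied to the text first. The `html-escape` function below is a hypothetical helper for illustration, not part of the addon.

```lisp
;; Hypothetical helper (not in the addon): replace the characters
;; &, < and > with their HTML entities.
(define (html-escape str)
  (string-concatenate
   (map (lambda (c)
          (case c
            ((#\&) "&amp;")
            ((#\<) "&lt;")
            ((#\>) "&gt;")
            (else (string c))))
        (string->list str))))
```

It could then wrap the text before it is written, for example `(display (html-escape text))`.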
```lisp
(define (create-pdf txt-file out)
  (define (line str)
    (format #t "~a\n" str))
  (errln "Creating pdf file…")
  (let ((tmphtml (format #f "~a/text.html" *tmp-dir*)))
    (with-output-to-file tmphtml
      (lambda ()
        ;; The markup here is a minimal stand-in; see the full source
        ;; for the exact template.
        (line "<!DOCTYPE html>")
        (line "<html>")
        (line "  <head><meta charset=\"UTF-8\"></head>")
        (line "  <body>")
        (line "    <p><em>")
        (line "      Extracted from audio using stt on ")
        (display (strftime "%c" (localtime (current-time))))
        (line "    </em></p>")
        (line "    <p>")
        (display (call-with-input-file txt-file read-string))
        (line "    </p>")
        (line "  </body></html>")))
    (sysexec *wkhtmltopdf* tmphtml out)))
```

## Putting it together

The main function now puts everything together. The `process-file` function is called for every file that is returned from `(find-audio-files)`. It extracts the necessary information (like the language) from the json document via record accessors (e.g. `(original-file-language file)`) and then calls the functions defined above. At last it creates an `<itemfiles>` record with `make-itemfiles`.

An `<itemfiles>` record now contains the important information for Docspell. It requires the item-id and a mapping from attachment-ids to files in `OUTPUT_DIR`. For each attachment identified by its ID, Docspell replaces the extracted text with the contents of the given file and replaces the converted PDF file, respectively. In the code below, two lists of such mappings are defined - the first for the text files, the second for the converted pdf. The files must be specified relative to `OUTPUT_DIR`.

That means `process-all` returns a list of `<itemfiles>` records which is then used to create the `<output>` record. And finally, the `output->json` function turns the record into proper JSON which is sent to stdout.

```lisp
(define (process-file itemid file)
  "Processing a single audio file."
  (let* ((id (original-file-id file))
         (mime (original-file-mimetype file))
         (lang (original-file-language file))
         (txt-file (format #f "~a/~a.txt" *output-dir* id))
         (pdf-file (format #f "~a/~a.pdf" *output-dir* id))
         (wav (convert-wav id mime))
         (model (get-model lang)))
    (extract-text model wav txt-file)
    (create-pdf txt-file pdf-file)
    (make-itemfiles itemid
                    `((,id . ,(format #f "~a.txt" id)))
                    `((,id . ,(format #f "~a.pdf" id))))))

(define (process-all)
  (let ((item-id (itemdata-id *itemdata-json*)))
    (map (lambda (file)
           (process-file item-id file))
         (find-audio-files))))

(define (main args)
  (let ((out (make-output (process-all))))
    (format #t "~a" (output->json out))))
```

Example output:

```json
{
  "files": [
    {
      "itemId": "qZDnyGIAJsXr",
      "textFiles": { "HPFvIDib6eA": "HPFvIDib6eA.txt" },
      "pdfFiles": { "HPFvIDib6eA": "HPFvIDib6eA.pdf" }
    }
  ]
}
```

# Packaging

Now with that script some additional plumbing is needed to make it an "Addon" for Docspell. The external tools - stt, ffmpeg, curl and wkhtmltopdf - are required, as well as guile to compile and interpret the script. Also the guile-json module must be installed.

This can turn into a quite tedious task. Luckily, there is [nix](https://nixos.org) that has an answer to this. A user who wants to use this script only needs to install nix. This package manager then takes care of providing the exact dependencies we need (down to the correct version and including guile as the language and runtime).

## A flake

Everything is defined in the `flake.nix` in the source root. It looks like this:

```nix
{
  description = "A docspell addon for basic audio file support";

  inputs = {
    utils.url = "github:numtide/flake-utils";

    # Nixpkgs / NixOS version to use.
    nixpkgs.url = "nixpkgs/nixos-21.11";
  };

  outputs = { self, nixpkgs, utils }:
    utils.lib.eachSystem ["x86_64-linux"] (system:
      let
        pkgs = import nixpkgs {
          inherit system;
          overlays = [ ];
        };
        name = "audio-files-addon";
      in rec {
        packages.${name} = pkgs.callPackage ./nix/addon.nix {
          inherit name;
        };

        defaultPackage = packages.${name};

        apps.${name} = utils.lib.mkApp {
          inherit name;
          drv = packages.${name};
        };
        defaultApp = apps.${name};

        ## … omitted for brevity
      }
    );
}
```

The first sad thing is that only `x86_64` systems are supported. This is due to `stt` not being available on other platforms currently (as provided by nixpkgs).
The rest is a bit magic: a package and a "defaultPackage" are defined with a reference to `nix/addon.nix`. The important part is the line

```nix
inputs = {
  # Nixpkgs / NixOS version to use.
  nixpkgs.url = "nixpkgs/nixos-21.11";
};
```

It says that as input for "building" the script, we take all of [nixpkgs](https://github.com/NixOS/nixpkgs), which is a package collection defined for (and in) nix - including thousands of software packages. We can pick and choose from these. No surprise, all external tools we need are included!

A flake defines the inputs and outputs of a package. With all of nixpkgs as inputs, we can create a definition to elevate this script into a *package*.

## Package definition

The definition for "building" the script is in `nix/addon.nix`:

```nix
{ stdenv, bash, cacert, curl, stt, wkhtmltopdf, ffmpeg, guile, guile-json, lib, name }:

stdenv.mkDerivation {
  inherit name;
  src = lib.sources.cleanSource ../.;

  buildInputs = [ guile guile-json ];

  patchPhase = ''
    TARGET=src/addon.scm
    sed -i 's,\*curl\* "curl",\*curl\* "${curl}/bin/curl",g' $TARGET
    sed -i 's,\*ffmpeg\* "ffmpeg",\*ffmpeg\* "${ffmpeg}/bin/ffmpeg",g' $TARGET
    sed -i 's,\*stt\* "stt",\*stt\* "${stt}/bin/stt",g' $TARGET
    sed -i 's,\*wkhtmltopdf\* "wkhtmltopdf",\*wkhtmltopdf\* "${wkhtmltopdf}/bin/wkhtmltopdf",g' $TARGET
  '';

  buildPhase = ''
    guild compile -o ${name}.go src/addon.scm
  '';

  # module name must be same as .go
  installPhase = ''
    mkdir -p $out/{bin,lib}
    cp ${name}.go $out/lib/

    cat > $out/bin/${name} <<-EOF
    #!${bash}/bin/bash
    export SSL_CERT_FILE="${cacert}/etc/ssl/certs/ca-bundle.crt"
    exec -a "${name}" ${guile}/bin/guile -C ${guile-json}/share/guile/ccache -C $out/lib -e '(${name}) main' -c "" \$@
    EOF
    chmod +x $out/bin/${name}
  '';
}
```

With a bit of handwaving - this is a bash script that slightly modifies the scheme script and runs a compile on it.
We simply declare all packages we need in the first line inside `{ … }` - these are arguments that are automatically filled in by nix by looking up the corresponding packages in nixpkgs.

First the `patchPhase` is executed. It replaces the variables containing the external tools with an absolute path to the version that we currently get from nixpkgs. With this step nix takes care that all these packages are available *at runtime* when executing the script. All versions are finally fixed in `flake.lock` and can be upgraded manually.

The `buildPhase` runs the guile compiler that produces some intermediate code that will be loaded instead of compiling the script on-the-fly.

At last, `installPhase` creates a wrapper script that runs guile with the correct load-path pointing to `guile-json` and to our pre-compiled script. Additionally, trusted root certificates are exported to make the curl commands work. This script will be created in the `$out` directory that is provided by nix.

If you now run `nix build` in the source root, it will execute all these phases and produce a symlink pointing to the result. You can then `cat` the resulting file if you are curious.

This way the script is completely isolated from the system it runs on - as long as the nix package manager is available. It includes all the external tools, as well as the underlying runtime (guile)! The result is a tiny wrapper bash script that can be run "everywhere" (modulo all the restrictions, like non-x86_64 platforms, of course :)).

## Addon Descriptor

At last, a small yaml file is needed to tell Docspell a little about the addon.

```yaml
meta:
  name: "audio-files-addon"
  version: "0.1.0"
  description: |
    This addon adds support for audio files. Audio files are processed
    by a speech-to-text engine and a pdf is generated.

    It doesn't expect any user arguments at the moment. It requires
    internet access to download model files.

triggers:
  - final-process-item
  - final-reprocess-item
  - existing-item

runner:
  nix:
    enable: true

  docker:
    enable: false

  trivial:
    enable: true
    exec: src/addon.scm

options:
  networking: true
  collectOutput: true
```

This tells Docspell via `triggers` when this addon may be run. This one only makes sense for an item. Thus it can be hooked up to run with every file-processing job, or a user can manually trigger it on an item.

It also tells via `runner:` that it can be built and run via nix, but not via docker (I gave up after an hour of trying to create a Dockerfile…). It could also be run "as-is", but the user then needs to install all these tools and guile manually.

# Done

That's it. You can install this addon in Docspell and create a run configuration to let it execute when you want.