Allows configuring external commands so that they receive different
arguments based on runtime values, such as the language. It extends the
existing config of a command with an `arg-mappings` section. An example
for ocrmypdf:
```conf
ocrmypdf = {
  enabled = true
  command = {
    program = "ocrmypdf"
    ### new arg-mappings
    arg-mappings = {
      "mylang" = {
        value = "{{lang}}"
        mappings = [
          {
            matches = "deu"
            args = [ "-l", "deu", "--pdf-renderer", "sandwich" ]
          },
          {
            matches = ".*"
            args = [ "-l", "{{lang}}" ]
          }
        ]
      }
    }
    #### end new arg-mappings
    args = [
      ### will be replaced with corresponding args from "mylang" mapping
      "{{mylang}}",
      "--skip-text",
      "--deskew",
      "-j", "1",
      "{{infile}}",
      "{{outfile}}"
    ]
    timeout = "5 minutes"
  }
  working-dir = ${java.io.tmpdir}"/docspell-convert"
}
```
The whole section is first processed to replace all `{{…}}` patterns
with their corresponding values. Then each entry in `arg-mappings` is
evaluated: the first match (value == matches) in its `mappings` array
provides the args that replace the entry's name (here `{{mylang}}`) in
the arguments to the command.
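
The two-step resolution described above can be sketched in Python. This is only an illustration, not the actual docspell implementation: the helper names `render` and `resolve_args` are hypothetical, and it assumes that `matches` is treated as a regular expression that must match the whole value (the `".*"` entry in the example suggests this, but the text does not confirm it).

```python
import re

def render(template: str, values: dict) -> str:
    """Replace every {{name}} with values[name]; unknown names are left as-is."""
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: values.get(m.group(1), m.group(0)), template)

def resolve_args(command: dict, values: dict) -> list:
    # Pick the args of the first matching candidate for each mapping.
    resolved = {}
    for name, mapping in command.get("arg-mappings", {}).items():
        value = render(mapping["value"], values)
        for candidate in mapping["mappings"]:
            # Assumption: `matches` is a regex that must match the whole value.
            if re.fullmatch(candidate["matches"], value):
                resolved[name] = [render(a, values) for a in candidate["args"]]
                break
    # Expand mapping names in the argument list, render everything else.
    args = []
    for arg in command["args"]:
        m = re.fullmatch(r"\{\{(\w+)\}\}", arg)
        if m and m.group(1) in resolved:
            args.extend(resolved[m.group(1)])
        else:
            args.append(render(arg, values))
    return args

# The command from the example config, written as a plain dict.
COMMAND = {
    "program": "ocrmypdf",
    "arg-mappings": {
        "mylang": {
            "value": "{{lang}}",
            "mappings": [
                {"matches": "deu",
                 "args": ["-l", "deu", "--pdf-renderer", "sandwich"]},
                {"matches": ".*", "args": ["-l", "{{lang}}"]},
            ],
        }
    },
    "args": ["{{mylang}}", "--skip-text", "--deskew", "-j", "1",
             "{{infile}}", "{{outfile}}"],
}

if __name__ == "__main__":
    runtime = {"lang": "deu", "infile": "in.pdf", "outfile": "out.pdf"}
    print(resolve_args(COMMAND, runtime))
```

With `lang` set to `deu` the first mapping applies and the sandwich renderer arguments are inserted. Any other language falls through to the `".*"` entry and only `-l <lang>` is added.
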

Addons allow executing external programs in some context inside
docspell. Currently they can be run after files have been processed.
Addons are provided as URLs to zip files.

This cuts down memory usage considerably when PDFs contain high-dpi
images. The test file, scanned at 600dpi and resulting in a 5.4M pdf
file, contains a 9900x13800 image. This image is loaded into memory so
that PDFBox can scale it down, which easily results in out-of-memory
errors (this image alone already requires ~400M). With subsampling, the
size is reduced by at most a factor of 8. It is still recommended to
avoid high-dpi image-only scans for text-based documents, or to
increase the heap size for joex.
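
The ~400M figure can be checked with a quick calculation. The sketch below assumes an uncompressed RGB raster at 3 bytes per pixel, and it assumes that the subsampling factor applies to each dimension. Neither assumption is stated in the text above, and the function names are hypothetical.

```python
def raster_bytes(width: int, height: int, bytes_per_pixel: int = 3) -> int:
    """Memory for an uncompressed raster (assumption: 3 bytes per pixel, RGB)."""
    return width * height * bytes_per_pixel

def subsampled_size(width: int, height: int, factor: int) -> tuple:
    """Dimensions when only every `factor`-th pixel is kept per axis.

    Assumption: the factor applies to each dimension (ceiling division).
    """
    return (-(-width // factor), -(-height // factor))

if __name__ == "__main__":
    full = raster_bytes(9900, 13800)
    print(f"full image: {full / 2**20:.0f} MiB")
    w, h = subsampled_size(9900, 13800, 8)
    print(f"subsampled to {w}x{h}: {raster_bytes(w, h) / 2**20:.1f} MiB")
```
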

The scaling factor can be set in the config file. When it changes,
images can be regenerated by POSTing to certain endpoints. It is
possible to regenerate the preview of a single attachment or all
previews within a collective.

HTML and text files are no longer assumed to be UTF-8. The encoding is
now detected, which may not work for all files. The default/fallback is
UTF-8.
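
As a rough illustration of detection with a UTF-8 fallback, the following sketch checks for a byte order mark, then for a `charset=` declaration near the start of the file, and otherwise falls back to UTF-8. It is not the detection docspell uses; the function names and the 2048-byte scan window are assumptions chosen for the example.

```python
import codecs
import re

# UTF-32 must be checked before UTF-16: the UTF-32-LE BOM begins with
# the same two bytes as the UTF-16-LE BOM.
BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]
CHARSET_DECL = re.compile(rb"""charset\s*=\s*["']?([A-Za-z0-9_.:-]+)""",
                          re.IGNORECASE)

def detect_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Guess the encoding of HTML/text bytes, falling back to UTF-8."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    # Look for e.g. <meta charset="iso-8859-1"> in the first 2048 bytes.
    declared = CHARSET_DECL.search(data[:2048])
    if declared:
        name = declared.group(1).decode("ascii")
        try:
            codecs.lookup(name)  # only accept encodings Python knows
            return name
        except LookupError:
            pass
    return fallback

def decode_text(data: bytes) -> str:
    """Decode using the detected encoding; undecodable bytes are replaced."""
    return data.decode(detect_encoding(data), errors="replace")
```

A file without a BOM or declaration is decoded as UTF-8, which matches the fallback described above. Files that declare one encoding but use another will still be decoded incorrectly.
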

There is still a problem with mails that contain HTML parts not encoded
in UTF-8. The mail text is always returned as a string, so the original
encoding is lost. The HTML is then stored as UTF-8 bytes, but
wkhtmltopdf reads it as latin1. It seems that the `--encoding` setting
doesn't override the encoding declared by the document.
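
The effect of this mismatch can be reproduced in a few lines. The sketch below only demonstrates what happens when UTF-8 bytes are read as latin1. It does not involve wkhtmltopdf, and the function name is hypothetical.

```python
def misread_as_latin1(text: str) -> str:
    """Store text as UTF-8 bytes, then read those bytes back as latin1."""
    stored = text.encode("utf-8")
    return stored.decode("latin-1")

if __name__ == "__main__":
    garbled = misread_as_latin1("Käse")
    print(garbled)  # each non-ASCII character turns into two characters
    # The damage is reversible as long as the bytes are unchanged:
    print(garbled.encode("latin-1").decode("utf-8"))
```
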

The restriction that only PDF files can be uploaded has been removed.
All files can now be uploaded, although not all of them may be
processed. It is still possible to restrict uploads by file type via
configuration.