External commands can now be configured to receive different arguments
depending on runtime values, such as the language. The existing config
of a command is extended with an optional `arg-mappings` section. An
example for ocrmypdf:
```conf
ocrmypdf = {
  enabled = true
  command = {
    program = "ocrmypdf"
    ### new arg-mappings
    arg-mappings = {
      "mylang" = {
        value = "{{lang}}"
        mappings = [
          {
            matches = "deu"
            args = [ "-l", "deu", "--pdf-renderer", "sandwich" ]
          },
          {
            matches = ".*"
            args = [ "-l", "{{lang}}" ]
          }
        ]
      }
    }
    #### end new arg-mappings
    args = [
      ### will be replaced with corresponding args from "mylang" mapping
      "{{mylang}}",
      "--skip-text",
      "--deskew",
      "-j", "1",
      "{{infile}}",
      "{{outfile}}"
    ]
    timeout = "5 minutes"
  }
  working-dir = ${java.io.tmpdir}"/docspell-convert"
}
```
The whole section is first processed to replace all `{{…}}` patterns
with their corresponding values. Then each entry in `arg-mappings` is
evaluated: the first element of its `mappings` array that matches
(`value` == `matches`) is selected, and its `args` replace the entry's
name (`{{mylang}}` in the example) in the arguments to the command.
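
The two steps described above can be sketched in a few lines of Python. This is only an illustration, not docspell's actual implementation: the function names `replace_vars` and `resolve_args` are hypothetical, and the sketch assumes that `matches` is a regular expression that must match the whole value (the `.*` entry in the example suggests this, but it is not stated explicitly).

```python
import re

# Matches a single {{name}} placeholder.
_VAR = re.compile(r"\{\{([\w-]+)\}\}")


def replace_vars(text, env):
    """Replace every {{name}} found in env; unknown names are left untouched."""
    return _VAR.sub(lambda m: env.get(m.group(1), m.group(0)), text)


def resolve_args(args, arg_mappings, env):
    """Return the final argument list for a command (hypothetical helper)."""
    # Step 1: replace runtime values such as {{lang}} or {{infile}}.
    args = [replace_vars(a, env) for a in args]

    # Step 2: for every mapping, pick the first entry that matches its value.
    resolved = {}
    for name, mapping in arg_mappings.items():
        value = replace_vars(mapping["value"], env)
        for candidate in mapping["mappings"]:
            # Assumption: `matches` is a regex covering the whole value.
            if re.fullmatch(candidate["matches"], value):
                resolved[name] = [replace_vars(a, env) for a in candidate["args"]]
                break

    # Step 3: splice the selected args in place of the mapping's name.
    # A mapping without any matching entry leaves its placeholder as is.
    out = []
    for arg in args:
        m = _VAR.fullmatch(arg)
        if m and m.group(1) in resolved:
            out.extend(resolved[m.group(1)])
        else:
            out.append(arg)
    return out
```

With the mapping from the example config, a document in language `deu` would get the `sandwich` renderer arguments, while any other language falls through to the `.*` entry and only receives `-l <lang>`.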

Addons allow executing external programs in certain contexts inside
docspell. Currently they can be run after files have been processed.
Addons are provided as URLs to zip files.

- Use another external tool for the pdf-to-pdf conversion, which also
adds the extracted text as an additional layer to the pdf.
- Although currently not used, the external conversion routine now
checks for an existing text file that is named like the pdf file but
with the extension `.txt`. If present, it is included in the conversion
result and used as the extracted text.
- Text extraction for pdf files now happens on the converted file,
because it may already contain the text from the conversion step. This
avoids running OCR twice.
- Errors during conversion are not fatal; processing continues
without a converted file.
- When converting html to pdf, wkhtmltopdf exits with an error if the
document contains invalid links. The content is now cleaned before
being handed to wkhtmltopdf.
- Update the emil library, which fixes a bug when reading mails without
an explicit transfer encoding (8bit).
- Add an info header to converted mails.
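
The check for an accompanying text file mentioned in the list above could look roughly like the following sketch. It is not docspell's code: the helper name `read_sidecar_text` is hypothetical, and it assumes that the `.txt` extension replaces the `.pdf` extension (`out.pdf` becomes `out.txt`) rather than being appended to it.

```python
from pathlib import Path
from typing import Optional


def read_sidecar_text(pdf_path: Path) -> Optional[str]:
    """Return the text stored next to the converted pdf, or None if absent."""
    # Assumption: "out.pdf" -> "out.txt" (extension replaced, not appended).
    txt_path = pdf_path.with_suffix(".txt")
    if txt_path.is_file():
        # Undecodable bytes are replaced instead of failing the conversion.
        return txt_path.read_text(encoding="utf-8", errors="replace")
    return None
```

A caller would use the returned text as the extracted text when it is not `None` and fall back to the regular text extraction otherwise.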

Html and text files are no longer required to be UTF-8. The encoding is
now detected, although detection may not work for all files. The
default/fallback is UTF-8.
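
To illustrate what "detect, then fall back to UTF-8" can mean, here is a minimal sketch using only the Python standard library. It is a simplified stand-in, not the detection docspell uses: it only looks at byte order marks and an html `<meta … charset=…>` declaration, and the names `detect_encoding` and `decode_text` are hypothetical.

```python
import codecs
import re

# Byte order marks and the codec that consumes them while decoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Finds charset=... inside a <meta> tag (both html4 and html5 forms).
_META = re.compile(rb"""<meta[^>]+charset\s*=\s*["']?([\w-]+)""", re.IGNORECASE)


def detect_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Guess the encoding of html/text bytes; use the fallback if unsure."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _META.search(data[:4096])
    if match:
        declared = match.group(1).decode("ascii")
        try:
            # Only accept names that Python knows as a codec.
            return codecs.lookup(declared).name
        except LookupError:
            pass
    return fallback


def decode_text(data: bytes) -> str:
    """Decode bytes with the detected encoding, never raising on bad input."""
    return data.decode(detect_encoding(data), errors="replace")
```

A plain text file without a byte order mark carries no hint at all in this sketch, so it always ends up with the fallback. That is one way in which detection "may not work for all files".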

There is still a problem with mails that contain html parts that are
not UTF-8 encoded. The mail text is always returned as a string, so the
original encoding is lost. The html is then stored as UTF-8 bytes, but
wkhtmltopdf reads it as latin1. It seems that the `--encoding` setting
does not override the encoding declared by the document.
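
The visible effect of this mismatch can be reproduced in isolation. The snippet below is only a demonstration of the general UTF-8 versus latin1 confusion, not of wkhtmltopdf itself; the function name `read_as_latin1` is made up for the example.

```python
def read_as_latin1(text: str) -> str:
    """Store text as UTF-8 bytes, then decode those bytes as latin1."""
    stored = text.encode("utf-8")    # how the html part is written out
    return stored.decode("latin-1")  # how a latin1 reader interprets it


html = "<p>Käse</p>"
garbled = read_as_latin1(html)
print(garbled)  # <p>KÃ¤se</p>

# The damage is reversible as long as the bytes themselves are intact.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == html)  # True
```

Every non-ASCII character turns into two latin1 characters, which is the typical symptom to look for in the generated pdf.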

The restriction that only pdf files can be uploaded has been removed;
all files can now be uploaded, although not every file may be
processed. It is still possible to restrict file uploads by type via
the configuration.