This cuts down memory usage considerably when high-dpi images are
provided in pdfs. The test file, scanned at 600dpi and resulting in a
5.4M pdf file, contains a 9900x13800 image. PDFBox loads this image
into memory in order to scale it down, which easily results in out of
memory errors (this image alone requires ~400M). With subsampling the
size is reduced at most by a factor of 8. It is still recommended to
avoid high-dpi image-only scans for text-based documents, or to
increase the heap size for joex.
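
To make the ~400M figure concrete, here is a rough back-of-the-envelope
sketch. It assumes 3 bytes per pixel (8-bit RGB) for the decoded image,
and it assumes the subsampling factor applies per axis; the text above
does not say whether the factor of 8 is per axis or for the total size.
The function name is hypothetical and not part of PDFBox or joex.

```python
def decoded_size_bytes(width, height, bytes_per_pixel=3, subsample=1):
    """Estimate the raw in-memory size of a decoded image.

    Assumptions: 3 bytes per pixel (RGB) and a subsampling factor
    that is applied to each axis separately.
    """
    w = -(-width // subsample)   # ceiling division
    h = -(-height // subsample)
    return w * h * bytes_per_pixel


full = decoded_size_bytes(9900, 13800)
print(f"{full / 1e6:.0f} MB")  # 410 MB, in line with the ~400M above

# Same image under the per-axis assumption with a factor of 8.
reduced = decoded_size_bytes(9900, 13800, subsample=8)
print(f"{reduced / 1e6:.1f} MB")
```
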

The scaling factor can be given in the config file. When it changes,
preview images can be regenerated by POSTing to certain endpoints. It
is possible to regenerate just one attachment preview or all previews
within a collective.

Html and text files are no longer assumed to be UTF-8. The encoding is
now detected, which may not work for all files. The default/fallback
is utf-8.
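
The detect-then-fall-back idea can be sketched as follows. This is a
deliberately simplified stand-in that only looks at byte order marks;
it is not the detector actually used, and the function names are
hypothetical. Files without a BOM simply get the utf-8 fallback, which
also illustrates why detection may not work for all files.

```python
import codecs

# BOM -> codec name. UTF-32 comes first because its little-endian BOM
# starts with the same two bytes as the UTF-16 little-endian BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Return a codec name based on a leading BOM, else the fallback."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return fallback


def decode_text(data: bytes) -> str:
    """Decode with the detected encoding; undecodable bytes are replaced."""
    return data.decode(detect_encoding(data), errors="replace")
```
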

There is still a problem with mails that contain html parts not in
utf-8 encoding. The mail text is always returned as a string and the
original encoding is lost. The html is then stored using utf-8 bytes,
but wkhtmltopdf reads it using latin1. It seems that the `--encoding`
setting doesn't override the encoding provided by the document.
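
The visible effect of this mismatch can be reproduced in a few lines.
This is only an illustration of utf-8 bytes being read as latin1; it
does not involve wkhtmltopdf, and the sample word is made up.

```python
original = "für"

stored = original.encode("utf-8")    # bytes as written to the html file
garbled = stored.decode("latin-1")   # what a latin1 reader makes of them

print(garbled)  # fÃ¼r

# The bytes themselves are intact, so reading them as utf-8 restores
# the original text.
assert garbled.encode("latin-1").decode("utf-8") == original
```
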

The restriction that only pdf files can be uploaded is removed. All
files can now be uploaded, although processing may not support every
file type. It is still possible to restrict file uploads by type via
the configuration.