Improve handling of encodings

HTML and text files are no longer assumed to be UTF-8. The encoding is
now detected, although detection may not work for all files. The
default/fallback is UTF-8.
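
As a rough illustration of "detect, else fall back to UTF-8", here is a
minimal sketch in Python. It is not the code from this commit: the
function names, the BOM and `<meta charset>` checks, and the 1024-byte
sniffing window are all assumptions made for the example.

```python
import codecs
import re

# Byte-order marks, longest first so UTF-32 is not mistaken for UTF-16.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches both <meta charset="..."> and the http-equiv content form.
_META_CHARSET = re.compile(
    rb"<meta[^>]+charset=[\"']?([A-Za-z0-9_\-]+)", re.IGNORECASE
)


def detect_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Hypothetical helper: guess the encoding, else return the fallback."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _META_CHARSET.search(data[:1024])  # only sniff the head
    if match:
        name = match.group(1).decode("ascii")
        try:
            return codecs.lookup(name).name  # reject unknown charset names
        except LookupError:
            pass
    return fallback


def decode_text(data: bytes) -> str:
    """Decode with the detected encoding; undecodable bytes are replaced."""
    return data.decode(detect_encoding(data), errors="replace")
```

A heuristic like this can guess wrong, for example when a file carries
no BOM and no charset declaration but is not UTF-8, which is the
"may not work for all files" caveat above.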

There is still a problem with mails that contain HTML parts not encoded
in UTF-8. The mail text is always returned as a string, so the original
encoding is lost. The HTML is then stored as UTF-8 bytes, but
wkhtmltopdf reads it as latin1. It seems that the `--encoding` setting
doesn't override the encoding declared by the document.
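
The effect of that mismatch can be reproduced in a few lines of Python.
This only simulates the two steps described above (store as UTF-8, read
back as latin1); it does not invoke wkhtmltopdf, and the function names
are made up for the example.

```python
def store_html(html: str) -> bytes:
    """The mail part arrives as a string and is written out as UTF-8."""
    return html.encode("utf-8")


def read_as_latin1(stored: bytes) -> str:
    """Simulates a reader that decodes the stored bytes as latin1."""
    return stored.decode("latin-1")


original = "<p>für</p>"
shown = read_as_latin1(store_html(original))
print(shown)  # <p>fÃ¼r</p>
```

Each non-ASCII character takes two bytes in UTF-8, and a latin1 reader
shows those bytes as two separate characters, so the text is garbled
rather than lost.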
Eike Kettner
2020-03-23 22:43:15 +01:00
parent b265421a46
commit cf7ccd572c
23 changed files with 383 additions and 92 deletions

NOTICE.txt Normal file

@@ -0,0 +1,13 @@
Docspell
Copyright 2019-2020
Licensed under the GPLv3
This software contains portions of code from tika-parser
https://tika.apache.org
Copyright (C) Apache Software Foundation (ASF) <https://www.apache.org>
Licensed under Apache License 2.0
This software contains portions of code from http4s
https://http4s.org
Copyright 2013-2018 http4s.org
Licensed under Apache License 2.0