Commit Graph

23 Commits

Author SHA1 Message Date
0599176ae8 Update scala to 2.13.3 2020-08-01 01:03:43 +02:00
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine now checks for an
  existing text file named like the pdf file but with the extension
  `.txt`. If present, it is included in the conversion result and used
  as the extracted text.

- Text extraction for pdf files now happens on the converted file,
  because it may already contain the text from the conversion step;
  this avoids running OCR twice.

- Errors during conversion are never fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
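
The `.txt` lookup and the non-fatal error handling described in commit 3d49ceaab5 can be sketched roughly as below. This is an illustration only, not the project's code (the project itself is Scala): `ConversionResult`, `collect_result` and `convert` are hypothetical names, and pairing `out.pdf` with `out.txt` is one assumed reading of "named like the pdf file".

```python
import subprocess
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, Sequence


@dataclass
class ConversionResult:
    pdf: Optional[Path]   # None when no converted file is available
    text: Optional[str]   # text taken from the sibling .txt file, if any


def collect_result(pdf_out: Path) -> ConversionResult:
    """Build a result from whatever the external tool left on disk."""
    if not pdf_out.is_file():
        return ConversionResult(pdf=None, text=None)
    # Assumption: "out.pdf" is paired with "out.txt" in the same directory.
    txt = pdf_out.with_suffix(".txt")
    text = txt.read_text(encoding="utf-8", errors="replace") if txt.is_file() else None
    return ConversionResult(pdf=pdf_out, text=text)


def convert(cmd: Sequence[str], pdf_out: Path, timeout: float = 120.0) -> ConversionResult:
    """Run the external command; any failure yields an empty result."""
    try:
        subprocess.run(cmd, check=True, timeout=timeout, capture_output=True)
    except (OSError, subprocess.SubprocessError):
        # Not fatal: the caller carries on without a converted file.
        return ConversionResult(pdf=None, text=None)
    return collect_result(pdf_out)
```

A caller would run text extraction only when `text` is `None`, which is how a second OCR pass is avoided in this sketch.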
347a029af8 Scalafix organize-imports 2020-06-28 21:20:47 +02:00
56624515a5 ScalafmtAll 2020-05-25 13:56:06 +02:00
ee394eae86 Try to streamline the different impls for MimeType 2020-05-25 09:24:24 +02:00
c41cdeefec Update scalafmt to 2.5.1 + scalafmtAll 2020-05-04 23:53:57 +02:00
b2ca314da9 Check code formatting with travis ci 2020-04-23 20:25:21 +02:00
362e1a5e14 Fix compile errors in test code 2020-04-07 23:00:25 +02:00
1206105f0b Fix several bugs with handling e-mail files
- When converting from html->pdf, the wkhtmltopdf program exits with
  errors if the document contains invalid links. The content is now
  cleaned before being handed to wkhtmltopdf.
- Update the emil library, which fixes a bug when reading mails without
  explicit transfer encoding (8bit)
- Add an info header to converted mails
2020-04-07 22:38:25 +02:00
aed5dfaff6 Fix mimetype extractors 2020-03-27 21:49:55 +01:00
9656ba62f4 scalafmtAll 2020-03-26 18:26:00 +01:00
cf7ccd572c Improve handling encodings
Html and text files are no longer assumed to be utf-8. The encoding is
now detected, although detection may not work for all files. The
default/fallback is utf-8.

There is still a problem with mails that contain html parts not
encoded in utf-8. The mail text is always returned as a string and the
original encoding is lost. The html is then stored using utf-8 bytes,
but wkhtmltopdf reads it using latin1. It seems that the `--encoding`
setting doesn't override the encoding provided by the document.
2020-03-23 22:51:28 +01:00
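
Commit cf7ccd572c does not say how the encoding is detected. As a minimal sketch of "detect, then fall back to utf-8", the following checks for a byte order mark, then for a charset declared in an html `<meta>` tag, and otherwise returns the default. The function names and the two heuristics are assumptions made for illustration, not the project's implementation.

```python
import codecs
import re

# UTF-32 is listed before UTF-16 because their little-endian BOMs share a prefix.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]

# Matches <meta charset="..."> and <meta ... content="text/html; charset=...">.
_META = re.compile(rb"<meta[^>]+charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def detect_encoding(data: bytes, default: str = "utf-8") -> str:
    """Guess the encoding of html/text bytes; return `default` if unsure."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    match = _META.search(data[:4096])
    if match:
        declared = match.group(1).decode("ascii")
        try:
            return codecs.lookup(declared).name
        except LookupError:
            pass  # unknown charset name: ignore it and use the default
    return default


def decode_text(data: bytes) -> str:
    """Decode with the detected encoding, replacing undecodable bytes."""
    return data.decode(detect_encoding(data), errors="replace")
```

Heuristics like these can guess wrong, which matches the caveat in the commit message that detection may not work for all files.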
3703dce9a6 Update fs2 to 2.3.0 2020-03-20 22:47:09 +01:00
2f87065b2e sbt scalafmtAll 2020-02-25 20:55:00 +01:00
ec419c7bfd Adapt nix modules to new config 2020-02-22 12:40:56 +01:00
97305d27ff Integrate support for more files into processing and upload
The restriction that only pdf files can be uploaded is removed. All
files can now be uploaded, although processing may not be able to
handle every file type. It is still possible to restrict uploads by
file type via configuration.
2020-02-19 23:27:00 +01:00
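
The configurable upload restriction mentioned in commit 97305d27ff could look something like the check below. The shape of the setting is not given in the commit message, so the list of MIME-type patterns and the rule that an empty list allows everything are assumptions made for this sketch; `is_upload_allowed` is a hypothetical name.

```python
from fnmatch import fnmatchcase
from typing import Sequence


def is_upload_allowed(mime_type: str, allowed_patterns: Sequence[str]) -> bool:
    """Return True if `mime_type` matches the configured patterns.

    Assumption: an empty pattern list means "no restriction".
    Patterns may use shell-style wildcards, e.g. "image/*".
    """
    if not allowed_patterns:
        return True
    mime = mime_type.strip().lower()
    return any(fnmatchcase(mime, pattern.lower()) for pattern in allowed_patterns)
```

With `["application/pdf", "image/*"]` configured, `image/png` would be accepted and `text/html` rejected.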
9b1349734e Convert some files to pdf 2020-02-19 02:03:10 +01:00
5869e2ee6e Streamline extern-conv stdin/infile 2020-02-18 12:43:47 +01:00
0dcc00836b Make logger configurable in system commands 2020-02-18 12:02:43 +01:00
bd605b8c94 Add first drafts for converting 2020-02-18 01:31:22 +01:00
c665c212a0 Early draft for running wkhtmltopdf 2020-02-17 14:02:23 +01:00
8143a4edcc Adding extraction primitives 2020-02-16 21:37:26 +01:00
ce22b727b1 Add new convert module and sketch its integration 2020-02-11 00:33:52 +01:00