Commit Graph

47 Commits

Author SHA1 Message Date
eikek
8269a73a83 Extend config for external commands (#2536)
Allows configuring external commands so that different arguments are
passed based on runtime values, such as the language. It extends the
current config of a command with an `arg-mappings` section. An example
for ocrmypdf:

```conf
ocrmypdf = {
  enabled = true
  command = {
    program = "ocrmypdf"
### new arg-mappings
    arg-mappings = {
      "mylang" = {
        value = "{{lang}}"
        mappings = [
          {
            matches = "deu"
            args = [ "-l", "deu", "--pdf-renderer", "sandwich" ]
          },
          {
            matches = ".*"
            args = [ "-l", "{{lang}}" ]
          }
        ]
      }
    }
#### end new arg-mappings
    args = [
      ### will be replaced with corresponding args from "mylang" mapping
      "{{mylang}}", 
      "--skip-text",
      "--deskew",
      "-j", "1",
      "{{infile}}",
      "{{outfile}}"
    ]
    timeout = "5 minutes"
  }
  working-dir = ${java.io.tmpdir}"/docspell-convert"
}
```

The whole section is first processed to replace all `{{…}}` patterns
with their corresponding values. Then each `arg-mappings` entry is
evaluated: the first element of its `mappings` array that matches
(value == matches) supplies the arguments that replace the entry's
name in the arguments to the command.
2024-03-08 21:34:42 +01:00
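
The two-step resolution described in the commit message above can be sketched roughly as follows. This is an illustrative Python sketch, not the project's implementation: the function names `substitute` and `resolve_args` are hypothetical, and it assumes that `matches` is treated as a regular expression that must match the whole value (the `".*"` entry in the example suggests this, but the message does not say so explicitly).

```python
import re

# Hypothetical helper: replace each {{name}} with values[name].
# Unknown names are left untouched so a later step can still see them.
def substitute(text, values):
    return re.sub(r"\{\{(\w+)\}\}",
                  lambda m: values.get(m.group(1), m.group(0)),
                  text)

# Hypothetical helper: build the final argument list for a command.
def resolve_args(args, arg_mappings, values):
    resolved = {}
    for name, mapping in arg_mappings.items():
        # Step 1: fill in runtime values such as {{lang}}.
        value = substitute(mapping["value"], values)
        # Step 2: the first entry whose `matches` fits the value wins.
        # Assumption: `matches` is a regex that must match the whole value.
        for entry in mapping["mappings"]:
            if re.fullmatch(entry["matches"], value):
                resolved[name] = [substitute(a, values) for a in entry["args"]]
                break

    out = []
    for arg in args:
        m = re.fullmatch(r"\{\{(\w+)\}\}", arg)
        if m and m.group(1) in resolved:
            # A mapping name expands to zero or more arguments.
            out.extend(resolved[m.group(1)])
        else:
            out.append(substitute(arg, values))
    return out


arg_mappings = {
    "mylang": {
        "value": "{{lang}}",
        "mappings": [
            {"matches": "deu", "args": ["-l", "deu", "--pdf-renderer", "sandwich"]},
            {"matches": ".*", "args": ["-l", "{{lang}}"]},
        ],
    }
}
args = ["{{mylang}}", "--skip-text", "{{infile}}", "{{outfile}}"]

print(resolve_args(args, arg_mappings,
                   {"lang": "eng", "infile": "in.pdf", "outfile": "out.pdf"}))
# ['-l', 'eng', '--skip-text', 'in.pdf', 'out.pdf']
```

In this sketch a mapping name that matches no entry is left in the argument list unchanged; how the real code handles that case is not stated in the commit message.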
eikek
924aaf720e Fix compile warnings after scala update 2024-03-03 18:43:54 +01:00
eikek
dd763e7796 Fix potential infinite loop
The code removed here was copied from another project some years back.
Now there is an improved version in fs2 that can be used.

Fixes: #2376
2023-11-12 13:04:03 +01:00
eikek
fe4a300b0e Update pdfbox to 3.0.0 2023-11-06 00:06:49 +01:00
Rehan Mahmood
2a39b2f6a6 Updated the following dependencies, as they need changes to the code to work properly:
- Scala
- fs2
- http4s
2023-10-31 14:24:00 -04:00
eikek
85094cc1f6 Fix html conversion for text files
It must honor the configuration when doing html->pdf.
2023-01-09 18:17:23 +01:00
eikek
df75fbddcd Allow to convert html->pdf via weasyprint 2022-11-07 10:31:25 +01:00
eikek
7fdd78ad06 Experiment with addons
Addons allow executing external programs in some context inside
docspell. Currently it is possible to run them after processing files.
Addons are provided by URLs to zip files.
2022-05-15 23:46:43 +02:00
eikek
9eb9497675 Fix logging in tests 2022-02-19 23:33:01 +01:00
eikek
e483a97de7 Adapt to new logging api 2022-02-19 21:41:38 +01:00
eikek
aa8f3b82fc Use passwords when reading PDFs 2021-09-30 11:48:59 +02:00
eikek
3c93b63c8a Add option to decrypt PDFs during conversion
Refs: #1074
2021-09-29 23:04:26 +02:00
eikek
9013f2de5b Update scalafmt settings 2021-09-22 17:23:24 +02:00
eikek
9785db0683 Change license header of all files 2021-09-21 22:35:38 +02:00
Scala Steward
e4fecefaea Reformat with scalafmt 3.0.0 2021-08-19 08:50:30 +02:00
eikek
1901fe1a8c Adopt deprecated APIs from fs2; use fs2.Path 2021-08-07 17:51:56 +02:00
eikek
8e5c88fd32 Add copyright header to source files 2021-07-04 10:57:53 +02:00
eikek
bd791b4593 Upgrade code base to CE3 2021-06-22 22:53:34 +02:00
Eike Kettner
e1bbc2edf5 Apply autoformat 2021-04-10 16:31:58 +02:00
Eike Kettner
6a63694a3e Convert unit tests to munit 2021-03-10 19:48:56 +01:00
Eike Kettner
3fabe0a582 Update to Scala 2.13.4 2020-11-27 20:26:24 +01:00
Eike Kettner
6db5c39d78 Fix converted filename
Mark it by default with a string from the config file.

Issue: 397
2020-11-08 09:45:03 +01:00
Eike Kettner
dd89e05cc2 Convert exceptions when converting to pdf into an error result
The file processing tries pdf conversion once and keeps going if it
fails. Some errors (e.g. timeouts) are raised via an exception.

Issue: #387
2020-10-26 19:51:02 +01:00
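
The idea in the commit above, turning a raised exception into an ordinary error result so that file processing can continue, can be sketched like this. It is a minimal Python illustration under stated assumptions: `ConversionResult` and `convert_safely` are hypothetical names, and the real code base is Scala and may model the result differently.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ConversionResult:
    """Hypothetical result type: either a converted PDF or an error text."""
    pdf: Optional[bytes] = None
    error: Optional[str] = None

    @property
    def ok(self) -> bool:
        return self.error is None


def convert_safely(convert: Callable[[bytes], bytes], data: bytes) -> ConversionResult:
    """Run `convert` and report any exception as an error result."""
    try:
        return ConversionResult(pdf=convert(data))
    except Exception as exc:
        # Errors such as timeouts arrive as exceptions; keep the message
        # so the caller can log it and carry on without a converted file.
        return ConversionResult(error=f"{type(exc).__name__}: {exc}")


def always_times_out(data: bytes) -> bytes:
    raise TimeoutError("conversion timed out")


result = convert_safely(always_times_out, b"<html></html>")
print(result.ok, result.error)
```

The caller only has to check `result.ok`, so a failed conversion does not need its own exception handling at every call site.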
Eike Kettner
c658677032 Autoformat 2020-09-09 00:29:32 +02:00
Eike Kettner
0599176ae8 Update scala to 2.13.3 2020-08-01 01:03:43 +02:00
Eike Kettner
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- Text extraction for pdf files now happens on the converted file,
  because it may already contain the text from the conversion step;
  this avoids running OCR twice.

- Errors during conversion are never fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
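
The lookup for an accompanying text file described in the second bullet above might look roughly like the following Python sketch. `read_conversion_output` is a hypothetical name, and the sketch assumes that "named as the pdf file with extension `.txt`" means the `.pdf` suffix is swapped for `.txt` (so `out.pdf` pairs with `out.txt`); the commit message does not settle that detail.

```python
from pathlib import Path
from typing import Optional, Tuple


def read_conversion_output(pdf_path: Path) -> Tuple[bytes, Optional[str]]:
    """Return the converted PDF and, if present, the text found next to it."""
    # Assumption: out.pdf is accompanied by out.txt in the same directory.
    txt_path = pdf_path.with_suffix(".txt")
    text = txt_path.read_text(encoding="utf-8") if txt_path.is_file() else None
    return pdf_path.read_bytes(), text
```

When the text is `None`, a caller would fall back to extracting text from the converted PDF itself.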
Eike Kettner
347a029af8 Scalafix organize-imports 2020-06-28 21:20:47 +02:00
Eike Kettner
56624515a5 ScalafmtAll 2020-05-25 13:56:06 +02:00
Eike Kettner
ee394eae86 Try to streamline the different impls for MimeType 2020-05-25 09:24:24 +02:00
Eike Kettner
c41cdeefec Update scalafmt to 2.5.1 + scalafmtAll 2020-05-04 23:53:57 +02:00
Eike Kettner
b2ca314da9 Check code formatting with travis ci 2020-04-23 20:25:21 +02:00
Eike Kettner
362e1a5e14 Fix compile errors in test code 2020-04-07 23:00:25 +02:00
Eike Kettner
1206105f0b Fix several bugs with handling e-mail files
- When converting from html->pdf, the wkhtmltopdf program exits with
  errors if the document contains invalid links. The content is now
  cleaned before being handed to wkhtmltopdf.
- Update emil library which fixes a bug when reading mails without
  explicit transfer encoding (8bit)
- Add an info header to converted mails
2020-04-07 22:38:25 +02:00
Eike Kettner
aed5dfaff6 Fix mimetype extractors 2020-03-27 21:49:55 +01:00
Eike Kettner
9656ba62f4 scalafmtAll 2020-03-26 18:26:00 +01:00
Eike Kettner
cf7ccd572c Improve handling encodings
Html and text files are no longer fixed to UTF-8. The encoding is now
detected, which may not work for all files. The default/fallback is
utf-8.

There is still a problem with mails that contain html parts not in
utf8 encoding. The mail text is always returned as a string and the
original encoding is lost. Then the html is stored using utf-8 bytes,
but wkhtmltopdf reads it using latin1. It seems that the `--encoding`
setting doesn't override the encoding provided by the document.
2020-03-23 22:51:28 +01:00
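
The "detect, otherwise fall back to utf-8" behaviour from the commit above can be illustrated with a deliberately simple Python sketch. It only sniffs byte-order marks, which is a stand-in technique chosen for brevity and not the detection the project uses; `detect_encoding` and `decode_text` are hypothetical names, and UTF-32 is ignored.

```python
import codecs

# Byte-order marks checked by this sketch, with the Python codec to use.
# "utf-8-sig" and "utf-16" both strip the mark while decoding.
_BOMS = [
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_encoding(data: bytes, fallback: str = "utf-8") -> str:
    """Guess the encoding from a byte-order mark, else use the fallback."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return fallback


def decode_text(data: bytes) -> str:
    """Decode with the detected encoding; undecodable bytes are replaced."""
    return data.decode(detect_encoding(data), errors="replace")
```

As the commit message notes for the real detection, a guess like this may not work for all files; files without a byte-order mark simply fall through to utf-8 here.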
Eike Kettner
3703dce9a6 Update fs2 to 2.3.0 2020-03-20 22:47:09 +01:00
Eike Kettner
2f87065b2e sbt scalafmtAll 2020-02-25 20:55:00 +01:00
Eike Kettner
ec419c7bfd Adapt nix modules to new config 2020-02-22 12:40:56 +01:00
Eike Kettner
97305d27ff Integrate support for more files into processing and upload
The restriction that only pdf files can be uploaded is removed. All
files can now be uploaded, although processing may not handle every
file type. It is still possible to restrict file uploads by type via
the configuration.
2020-02-19 23:27:00 +01:00
Eike Kettner
9b1349734e Convert some files to pdf 2020-02-19 02:03:10 +01:00
Eike Kettner
5869e2ee6e Streamline extern-conv stdin/infile 2020-02-18 12:43:47 +01:00
Eike Kettner
0dcc00836b Make logger configurable in system commands 2020-02-18 12:02:43 +01:00
Eike Kettner
bd605b8c94 Add first drafts for converting 2020-02-18 01:31:22 +01:00
Eike Kettner
c665c212a0 Early draft for running wkhtmltopdf 2020-02-17 14:02:23 +01:00
Eike Kettner
8143a4edcc Adding extraction primitives 2020-02-16 21:37:26 +01:00
Eike Kettner
ce22b727b1 Add new convert module and sketch its integration 2020-02-11 00:33:52 +01:00