Commit Graph

18 Commits

Author SHA1 Message Date
cec4948710 Add PDF metadata to the extracted text so it is included in the full-text index 2020-07-19 01:07:49 +02:00
209c068436 Use keywords in pdfs to search for existing tags
During processing, keywords stored in the PDF metadata are looked up
in the tag database, and any matching existing tags are associated
with the item.

See #175
2020-07-19 00:28:04 +02:00
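
The keyword lookup described in 209c068436 could be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the object and method names are hypothetical, the tag database is stood in for by a plain `Set[String]`, and splitting on commas and semicolons with case-insensitive matching is an assumption about how keywords are compared.

```scala
// Hypothetical sketch; names and matching rules are assumptions,
// not taken from the repository.
object KeywordTags {

  // Split the raw PDF "Keywords" string on commas or semicolons,
  // trim each entry and drop empty ones.
  def splitKeywords(raw: String): List[String] =
    raw.split("[,;]").toList.map(_.trim).filter(_.nonEmpty)

  // Return only those existing tags whose name equals one of the
  // keywords, ignoring case. Unknown keywords create no new tags.
  def matchTags(raw: String, existingTags: Set[String]): Set[String] = {
    val wanted = splitKeywords(raw).map(_.toLowerCase).toSet
    existingTags.filter(t => wanted.contains(t.toLowerCase))
  }
}
```

Only tags that already exist are returned, which mirrors the "associate any existing tags" wording of the commit message.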
da68405f9b Extract meta data from pdfs using pdfbox 2020-07-18 23:04:46 +02:00
347a029af8 Scalafix organize-imports 2020-06-28 21:20:47 +02:00
2e88207ff1 Post process all extracted text
Removes zero (NUL) bytes and leading/trailing whitespace.
2020-05-25 13:56:06 +02:00
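
The post-processing step from 2e88207ff1 amounts to something like the sketch below. The object and method names are hypothetical, and the actual commit may do more or apply the steps differently.

```scala
// Hypothetical sketch of the clean-up described in the commit message.
object TextCleanup {

  // Remove every NUL character, then strip leading and trailing whitespace.
  def postProcess(text: String): String =
    text.replace("\u0000", "").trim
}
```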
ee394eae86 Try to streamline the different implementations of MimeType 2020-05-25 09:24:24 +02:00
c41cdeefec Update scalafmt to 2.5.1 + scalafmtAll 2020-05-04 23:53:57 +02:00
9656ba62f4 scalafmtAll 2020-03-26 18:26:00 +01:00
cf7ccd572c Improve handling of encodings
HTML and text files are no longer assumed to be UTF-8. The encoding is
now detected, which may not work for all files. The default/fallback is
UTF-8.

There is still a problem with mails that contain HTML parts that are
not UTF-8 encoded. The mail text is always returned as a string and the
original encoding is lost. The HTML is then stored as UTF-8 bytes,
but wkhtmltopdf reads it as latin1. It seems that the `--encoding`
setting doesn't override the encoding declared by the document.
2020-03-23 22:51:28 +01:00
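
The "detect, else fall back to UTF-8" behaviour in cf7ccd572c can be illustrated with a deliberately simple stand-in. The sketch below only inspects a leading byte order mark; it is not the detection used by the commit, which is not shown here, and the object and method names are hypothetical.

```scala
import java.nio.charset.{Charset, StandardCharsets}

// Hypothetical sketch: byte-order-mark sniffing with a UTF-8 fallback.
// Real detection would need to look at more than the first bytes.
object CharsetGuess {

  def detect(bytes: Array[Byte]): Charset = {
    // Convert the first bytes to unsigned values for matching.
    val head = bytes.take(3).map(_ & 0xff).toList
    head match {
      case 0xef :: 0xbb :: 0xbf :: _ => StandardCharsets.UTF_8
      case 0xfe :: 0xff :: _         => StandardCharsets.UTF_16BE
      case 0xff :: 0xfe :: _         => StandardCharsets.UTF_16LE
      case _                         => StandardCharsets.UTF_8 // fallback
    }
  }
}
```

As the commit message notes, detection of this kind cannot work for all files: input without any marker simply ends up in the fallback branch.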
2f87065b2e sbt scalafmtAll 2020-02-25 20:55:00 +01:00
97305d27ff Integrate support for more files into processing and upload
The restriction that only PDF files can be uploaded is removed. All
files can now be uploaded, although processing may not handle every
file type. It is still possible to restrict uploads by file type via
configuration.
2020-02-19 23:27:00 +01:00
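
A type restriction of the kind mentioned in 97305d27ff might look like the sketch below. The configuration shape is an assumption made for illustration: `UploadConfig`, its field, and the rule that an empty set allows everything are all hypothetical and not taken from the repository.

```scala
// Hypothetical sketch; the real configuration keys and semantics
// are not shown in the commit log.
object UploadFilter {

  // An empty set is assumed to mean "no restriction".
  final case class UploadConfig(allowedMimeTypes: Set[String])

  // A file is accepted if no restriction is configured or if its
  // MIME type matches one of the allowed types, ignoring case.
  def isAllowed(cfg: UploadConfig, mimeType: String): Boolean =
    cfg.allowedMimeTypes.isEmpty ||
      cfg.allowedMimeTypes.exists(_.equalsIgnoreCase(mimeType))
}
```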
9b1349734e Convert some files to pdf 2020-02-19 02:03:10 +01:00
5869e2ee6e Streamline extern-conv stdin/infile 2020-02-18 12:43:47 +01:00
0dcc00836b Make logger configurable in system commands 2020-02-18 12:02:43 +01:00
e0682464b5 Configure pdf extraction; move Logger and DataType to common 2020-02-17 14:01:36 +01:00
3d615181e0 Early draft for text extraction 2020-02-17 01:57:22 +01:00
8143a4edcc Adding extraction primitives 2020-02-16 21:37:26 +01:00
851ee7ef0f Reorganize processing code
Use separate modules for

- text extraction
- conversion to pdf
- text analysis
2020-02-15 21:25:25 +01:00