Commit Graph

23 Commits

Author SHA1 Message Date
eikek
3c93b63c8a Add option to decrypt PDFs during conversion
Refs: #1074
2021-09-29 23:04:26 +02:00
eikek
1761526e20 Simplify MimeType class and parse mimetypes in a more lenient way 2021-09-23 14:10:24 +02:00
eikek
9013f2de5b Update scalafmt settings 2021-09-22 17:23:24 +02:00
eikek
9785db0683 Change license header of all files 2021-09-21 22:35:38 +02:00
eikek
8e5c88fd32 Add copyright header to source files 2021-07-04 10:57:53 +02:00
eikek
bd791b4593 Upgrade code base to CE3 2021-06-22 22:53:34 +02:00
Eike Kettner
e1bbc2edf5 Apply autoformat 2021-04-10 16:31:58 +02:00
Eike Kettner
6a63694a3e Convert unit tests to munit 2021-03-10 19:48:56 +01:00
Eike Kettner
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language italian
Improves and reorganizes how nlp pipelines are set up. Now users can
choose from many options, depending on their hardware and usage
scenario.

This is the base to use more languages without depending on what
stanford-nlp supports. Support then involves text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
Eike Kettner
4fd6e02ec0 Improve glob and filter archive entries 2020-11-11 21:01:23 +01:00
Eike Kettner
e26d7129e7 Add fix for mariadb text columns
The `text` data type can only store up to 64 KB of data, `mediumtext`
up to 16 MB, and `longtext` up to 4 GB.

Issue: #297
2020-10-02 16:50:51 +02:00
Eike Kettner
c658677032 Autoformat 2020-09-09 00:29:32 +02:00
Eike Kettner
0599176ae8 Update scala to 2.13.3 2020-08-01 01:03:43 +02:00
Eike Kettner
da68405f9b Extract meta data from pdfs using pdfbox 2020-07-18 23:04:46 +02:00
Eike Kettner
c41cdeefec Update scalafmt to 2.5.1 + scalafmtAll 2020-05-04 23:53:57 +02:00
Eike Kettner
9656ba62f4 scalafmtAll 2020-03-26 18:26:00 +01:00
Eike Kettner
4ed7a137f7 Add support for archive files
Each attachment is now first extracted into potentially multiple ones,
if it is recognized as an archive. This is the first step in
processing. The original archive file is also stored and the resulting
attachments are associated with their original archive.

First support is implemented for zip files.
2020-03-19 22:42:27 +01:00
Eike Kettner
2f87065b2e sbt scalafmtAll 2020-02-25 20:55:00 +01:00
Eike Kettner
bd605b8c94 Add first drafts for converting 2020-02-18 01:31:22 +01:00
Eike Kettner
3d615181e0 Early draft for text extraction 2020-02-17 01:57:22 +01:00
Eike Kettner
8143a4edcc Adding extraction primitives 2020-02-16 21:37:26 +01:00
Eike Kettner
3deba44282 Rename example files 2020-02-15 12:52:24 +01:00
Eike Kettner
5c3d2b2e28 Rename example-files to files 2020-02-14 11:14:09 +01:00