8b42708db2
Remove old log stuff
2022-02-19 22:01:49 +01:00
0b606e6b05
Use logfmt for log lines and remove ansi color codes
2021-12-19 22:29:56 +01:00
c21b2cdd29
Update scalafmt to 3.0.8
2021-12-11 22:46:55 +01:00
3c93b63c8a
Add option to decrypt PDFs during conversion
...
Refs: #1074
2021-09-29 23:04:26 +02:00
1761526e20
Simplify MimeType class and parse mimetypes in a more lenient way
2021-09-23 14:10:24 +02:00
9013f2de5b
Update scalafmt settings
2021-09-22 17:23:24 +02:00
9785db0683
Change license header of all files
2021-09-21 22:35:38 +02:00
8e5c88fd32
Add copyright header to source files
2021-07-04 10:57:53 +02:00
bd791b4593
Upgrade code base to CE3
2021-06-22 22:53:34 +02:00
e1bbc2edf5
Apply autoformat
2021-04-10 16:31:58 +02:00
6a63694a3e
Convert unit tests to munit
2021-03-10 19:48:56 +01:00
f01646aeb5
Reorganize nlp pipeline and add nlp-unsupported language italian
...
Improves and reorganizes how nlp pipelines are set up. Now users can
choose from many options, depending on their hardware and usage
scenario.
This is the base to use more languages without depending on what
stanford-nlp supports. Support is then limited to text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
4fd6e02ec0
Improve glob and filter archive entries
2020-11-11 21:01:23 +01:00
e26d7129e7
Add fix for mariadb text columns
...
The `text` data type can only store up to 64 KB of data; `mediumtext`
stores up to 16 MB and `longtext` up to 4 GB.
Issue: #297
2020-10-02 16:50:51 +02:00
c658677032
Autoformat
2020-09-09 00:29:32 +02:00
0599176ae8
Update scala to 2.13.3
2020-08-01 01:03:43 +02:00
da68405f9b
Extract meta data from pdfs using pdfbox
2020-07-18 23:04:46 +02:00
c41cdeefec
Update scalafmt to 2.5.1 + scalafmtAll
2020-05-04 23:53:57 +02:00
9656ba62f4
scalafmtAll
2020-03-26 18:26:00 +01:00
4ed7a137f7
Add support for archive files
...
Each attachment is now first extracted into potentially multiple ones,
if it is recognized as an archive. This is the first step in
processing. The original archive file is also stored and the resulting
attachments are associated with their original archive.
First support is implemented for zip files.
2020-03-19 22:42:27 +01:00
2f87065b2e
sbt scalafmtAll
2020-02-25 20:55:00 +01:00
bd605b8c94
Add first drafts for converting
2020-02-18 01:31:22 +01:00
3d615181e0
Early draft for text extraction
2020-02-17 01:57:22 +01:00
8143a4edcc
Adding extraction primitives
2020-02-16 21:37:26 +01:00
3deba44282
Rename example files
2020-02-15 12:52:24 +01:00
5c3d2b2e28
Rename example-files to files
2020-02-14 11:14:09 +01:00