Commit Graph

47 Commits

Author SHA1 Message Date
eikek
3c93b63c8a Add option to decrypt PDFs during conversion
Refs: #1074
2021-09-29 23:04:26 +02:00
eikek
20a829cf7a Refactoring for migrating to binny library 2021-09-22 14:18:43 +02:00
eikek
5d33b3841a Add a task to check for updates periodically
It must be enabled and configured by the admin.

Refs: #990
2021-08-20 00:25:27 +02:00
eikek
bdc7822f50 Add documentation about docker setup 2021-05-31 22:19:49 +02:00
Eike Kettner
d7bc963450 Cleanup nodes that are not reachable anymore 2021-02-18 00:37:18 +01:00
Eike Kettner
c7e850116f Make the text length limit optional 2021-01-22 23:06:50 +01:00
Eike Kettner
9957c3267e Add constraints from config to classifier training
For large and/or many documents, training the classifier can lead to
OOM errors. Some limits have been set by default.
2021-01-21 17:46:39 +01:00
Eike Kettner
a6c31be22f Update documentation 2021-01-20 22:47:15 +01:00
Eike Kettner
85ddc61d9d Move date proposal setting to nlp config 2021-01-20 19:17:29 +01:00
Eike Kettner
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language Italian
Improves and reorganizes how nlp pipelines are set up. Users can now
choose from many options, depending on their hardware and usage
scenario.

This is the base for using more languages without depending on what
stanford-nlp supports. Support for such languages is then limited to
text extraction and simple regex-ner processing.
2021-01-18 17:41:40 +01:00
Eike Kettner
aa937797be Choose nlp mode in config file 2021-01-17 22:56:33 +01:00
Eike Kettner
d77b5855e4 Set default pool-size to 1 2021-01-11 22:30:59 +01:00
Eike Kettner
a670bbb6c2 Make idle interval when clearing nlp cache configurable 2021-01-06 23:03:00 +01:00
Eike Kettner
f5ae389eea Cleanup remember-me tokens periodically 2020-12-04 17:59:25 +01:00
Eike Kettner
f4e50c5229 Provide endpoints to submit tasks to re-generate previews
The scaling factor can be given in the config file. When this changes,
images can be regenerated via POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all within a
collective.
2020-11-09 09:00:02 +01:00
Eike Kettner
6db5c39d78 Fix converted filename
Mark it by default with a string from the config file.

Issue: 397
2020-11-08 09:45:03 +01:00
Eike Kettner
4309bd8dfd Some cleanup 2020-09-02 21:22:30 +02:00
Eike Kettner
0c97b4ef76 Initial impl of a text classifier based on stanford-nlp 2020-09-02 18:28:14 +02:00
Eike Kettner
8c4f2e702b Add classifier settings 2020-09-02 18:28:14 +02:00
Eike Kettner
3473cbb773 Use collective data with NER annotation 2020-08-25 20:40:44 +02:00
Eike Kettner
1fc57fc2b2 Set default value for min-text-len to 500
This value is used to decide whether to try OCR or not. If the text is
shorter than this value, OCR is run and both results are compared. It
was set to 10, which is just one or two words. Since docspell deals
with documents, that value was too low.
2020-08-01 15:46:00 +02:00
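The threshold check described in 1fc57fc2b2 could be sketched roughly as below. This is an illustration in Python, not docspell's actual code: the names `should_try_ocr` and `MIN_TEXT_LEN` are hypothetical, and measuring the length in characters after stripping whitespace is an assumption. How the two results are compared is not stated in the commit message, so it is left out.

```python
MIN_TEXT_LEN = 500  # default named in the commit message; unit assumed to be characters


def should_try_ocr(extracted_text: str, min_text_len: int = MIN_TEXT_LEN) -> bool:
    """Return True when the extracted text is short enough to warrant an OCR run."""
    # "Below this value" is read as a strict comparison (assumption).
    return len(extracted_text.strip()) < min_text_len
```

With the old value of 10, a file yielding only a couple of words would have skipped OCR; with 500, it would not.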
Eike Kettner
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- Text extraction for pdf files now happens on the converted file,
  because it may already contain the text from the conversion step,
  which avoids running OCR twice.

- Errors during conversion are not fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
Eike Kettner
d79ae6233a Restrict proposals for due date
Avoid dates too far in the future.
2020-06-26 16:58:17 +02:00
Eike Kettner
91da3b149e Reducing default retries to 2
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these states, so the default is now set to a
lower value to avoid long wait times until an item arrives.
2020-06-25 23:57:01 +02:00
Eike Kettner
14213c4c27 Allow some solr query options in the config file 2020-06-24 23:37:20 +02:00
Eike Kettner
532caed84c Consistent logging of request/responses to solr
Using a middleware. Also add missing changesets for mariadb.
2020-06-24 21:25:46 +02:00
Eike Kettner
47697a8056 Set some logs to trace 2020-06-24 01:16:13 +02:00
Eike Kettner
ffbb16db45 Transport highlighting information to the client 2020-06-23 00:17:29 +02:00
Eike Kettner
2a0bf24088 Setup solr schema and index all data using a system task
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
2020-06-19 21:37:22 +02:00
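The "each step is memorized" behaviour from 2a0bf24088 might look something like the sketch below. It is a Python illustration under assumptions: `run_once` and the in-memory `done` set are hypothetical, and a real implementation would need to persist the completed steps across restarts.

```python
from typing import Callable, Set


def run_once(done: Set[str], step_name: str, action: Callable[[], None]) -> bool:
    """Run `action` unless `step_name` was already recorded as completed."""
    if step_name in done:
        return False  # step already executed on an earlier start
    action()
    done.add(step_name)  # record only after the action succeeded
    return True
```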
Eike Kettner
60c079f664 Add task to index current database state 2020-06-18 22:38:45 +02:00
Eike Kettner
522daaf57e Introducing fts client into codebase 2020-06-17 23:20:46 +02:00
Eike Kettner
d9782582d8 Use max-mails setting with higher priority
The `mail-chunk-size` is set to its configured value or `max-mails`,
whichever is lower.
2020-05-20 22:44:29 +02:00
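The rule from d9782582d8 amounts to taking a minimum. A tiny Python sketch follows; the function name `effective_chunk_size` is hypothetical, and only the two setting names come from the commit message.

```python
def effective_chunk_size(mail_chunk_size: int, max_mails: int) -> int:
    """Use the configured chunk size, capped by max-mails."""
    return min(mail_chunk_size, max_mails)
```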
Eike Kettner
c0259dba7e Allow to enable debug flag for javamail 2020-05-20 22:15:25 +02:00
Eike Kettner
f2d67dc816 Initial impl of import from mailbox user task 2020-05-20 17:52:38 +02:00
Eike Kettner
852455c610 Add upload operation to task arguments 2020-05-20 17:52:38 +02:00
Eike Kettner
0a1b3fcf95 Set list-id header for notification mails 2020-04-30 21:23:56 +02:00
Eike Kettner
6a1297fc95 Add a limit for text analysis 2020-03-27 22:54:49 +01:00
Eike Kettner
cf7ccd572c Improve handling encodings
HTML and text files are no longer assumed to be UTF-8. The encoding is
now detected, which may not work for all files. The default/fallback
is UTF-8.

There is still a problem with mails that contain html parts not in
utf8 encoding. The mail text is always returned as a string and the
original encoding is lost. Then the html is stored using utf-8 bytes,
but wkhtmltopdf reads it using latin1. It seems that the `--encoding`
setting doesn't override encoding provided by the document.
2020-03-23 22:51:28 +01:00
Eike Kettner
718e44a21c Add cleanup jobs task 2020-03-09 20:24:00 +01:00
Eike Kettner
854a596da3 Integrate periodic tasks
The first use case for periodic tasks is the cleanup of expired
invitation keys. This is part of a house-keeping periodic task.
2020-03-08 22:49:49 +01:00
Eike Kettner
1e598bd902 Sketch a scheduler for running periodic tasks
Periodic tasks are special in that they are usually kept around and
started based on a schedule. A new component checks periodic tasks and
submits them in the queue once they are due.

In order to avoid duplicate periodic jobs, the tracker of a job is
used to store the periodic job id. Each time a periodic task is due,
it is first checked if there is a job running (or queued) for this
task.
2020-03-08 12:55:03 +01:00
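The duplicate check described in 1e598bd902 can be illustrated with a small Python sketch. Everything here is an assumption for illustration: `JobQueue`, `submit_if_due` and the in-memory dictionary are hypothetical names, and the real scheduler is not shown in this log. The only idea taken from the commit message is that the job's tracker carries the periodic task id and is checked before submitting.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JobQueue:
    # tracker -> state ("queued" or "running") for jobs that have not finished
    active: Dict[str, str] = field(default_factory=dict)

    def has_active(self, tracker: str) -> bool:
        return tracker in self.active

    def submit(self, tracker: str) -> None:
        self.active[tracker] = "queued"

    def finish(self, tracker: str) -> None:
        self.active.pop(tracker, None)


def submit_if_due(queue: JobQueue, periodic_task_id: str, is_due: bool) -> bool:
    """Submit a job for a due periodic task unless one is already queued or running."""
    if not is_due or queue.has_active(periodic_task_id):
        return False
    # The periodic task id doubles as the job's tracker.
    queue.submit(periodic_task_id)
    return True
```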
Eike Kettner
ec419c7bfd Adapt nix modules to new config 2020-02-22 12:40:56 +01:00
Eike Kettner
3f316ab4d0 Update config file doc 2020-02-20 21:10:00 +01:00
Eike Kettner
97305d27ff Integrate support for more files into processing and upload
The restriction that only pdf files can be uploaded is removed. All
files can now be uploaded, although processing may not handle every
file type. It is still possible to restrict file uploads by type via
the configuration.
2020-02-19 23:27:00 +01:00
Eike Kettner
3be90d64d5 Move SystemCommand to common module 2020-02-10 22:23:06 +01:00
Eike Kettner
831cd8b655 Initial version.
Features:

- Upload PDF files and have them analyzed

- Manage meta data and items

- See processing in webapp
2019-09-21 22:02:36 +02:00
Eike Kettner
6154e6a387 Initial application stub 2019-09-21 14:54:03 +02:00