Commit Graph

65 Commits

Author SHA1 Message Date
eikek
81d8b6c9c1 Allow to configure a region for s3 backend
Closes: #2386
2023-11-17 21:27:13 +01:00
eikek
df75fbddcd Allow to convert html->pdf via weasyprint 2022-11-07 10:31:25 +01:00
eikek
d413b16b03 Allow to always use OCR extracted text
Fixes: #1628
2022-07-07 17:58:03 +02:00
eikek
3764f9265b Configure run/repair db migrations
Refs: #1517
2022-05-22 00:07:36 +02:00
eikek
47bd6cd0ba Fail fast when multiple addons are run 2022-05-21 00:40:26 +02:00
eikek
7fdd78ad06 Experiment with addons
Addons allow executing external programs in some context inside
docspell. Currently it is possible to run them after processing files.
Addons are provided by URLs to zip files.
2022-05-15 23:46:43 +02:00
eikek
5bdf728eb3 Improve logging configuration
- Log levels of specific loggers can be defined in the config
  file (doesn't work with env variables)

- Log events of background tasks now carry additional data
2022-04-30 18:26:19 +02:00
eikek
9851b71c45 Fix documentation about fulltext search 2022-04-24 18:34:22 +02:00
eikek
4488291319 Download multiple files as zip 2022-04-09 15:28:51 +02:00
eikek
21e13341e3 Configure postgres fts backend 2022-03-21 11:05:03 +01:00
eikek
cd3db6ea08 Run file integrity check in house keeping tasks 2022-03-13 15:20:33 +01:00
eikek
e82b00c582 Use different file stores based on config 2022-03-12 12:19:00 +01:00
eikek
9545431d59 Allow the user to set time zone
Fix timezone handling for periodic tasks
2022-03-01 23:15:59 +01:00
eikek
8103e25e32 Set default log format to fancy 2022-02-23 23:26:22 +01:00
eikek
8b42708db2 Remove old log stuff 2022-02-19 22:01:49 +01:00
eikek
e483a97de7 Adapt to new logging api 2022-02-19 21:41:38 +01:00
eikek
118d23c3a2 Add list of env variables to documentation
Issue: #1121
2021-10-25 00:23:20 +02:00
eikek
aa8f3b82fc Use passwords when reading PDFs 2021-09-30 11:48:59 +02:00
eikek
3c93b63c8a Add option to decrypt PDFs during conversion
Refs: #1074
2021-09-29 23:04:26 +02:00
eikek
20a829cf7a Refactoring for migrating to binny library 2021-09-22 14:18:43 +02:00
eikek
5d33b3841a Add a task to check for updates periodically
It must be enabled and configured by the admin.

Refs: #990
2021-08-20 00:25:27 +02:00
eikek
bdc7822f50 Add documentation about docker setup 2021-05-31 22:19:49 +02:00
Eike Kettner
d7bc963450 Cleanup nodes that are not reachable anymore 2021-02-18 00:37:18 +01:00
Eike Kettner
c7e850116f Make the text length limit optional 2021-01-22 23:06:50 +01:00
Eike Kettner
9957c3267e Add constraints from config to classifier training
For large and/or many documents, training the classifier can lead to
OOM errors. Some limits have been set by default.
2021-01-21 17:46:39 +01:00
Eike Kettner
a6c31be22f Update documentation 2021-01-20 22:47:15 +01:00
Eike Kettner
85ddc61d9d Move date proposal setting to nlp config 2021-01-20 19:17:29 +01:00
Eike Kettner
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language italian
Improves and reorganizes how nlp pipelines are set up. Now users can
choose from many options, depending on their hardware and usage
scenario.

This is the base to use more languages without depending on what
stanford-nlp supports. Support then involves text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
Eike Kettner
aa937797be Choose nlp mode in config file 2021-01-17 22:56:33 +01:00
Eike Kettner
d77b5855e4 Set default pool-size to 1 2021-01-11 22:30:59 +01:00
Eike Kettner
a670bbb6c2 Make idle interval when clearing nlp cache configurable 2021-01-06 23:03:00 +01:00
Eike Kettner
f5ae389eea Cleanup remember-me tokens periodically 2020-12-04 17:59:25 +01:00
Eike Kettner
f4e50c5229 Provide endpoints to submit tasks to re-generate previews
The scaling factor can be given in the config file. When this changes,
images can be regenerated via POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all within a
collective.
2020-11-09 09:00:02 +01:00
Eike Kettner
6db5c39d78 Fix converted filename
Mark it by default with a string from the config file.

Issue: 397
2020-11-08 09:45:03 +01:00
Eike Kettner
4309bd8dfd Some cleanup 2020-09-02 21:22:30 +02:00
Eike Kettner
0c97b4ef76 Initial impl of a text classifier based on stanford-nlp 2020-09-02 18:28:14 +02:00
Eike Kettner
8c4f2e702b Add classifier settings 2020-09-02 18:28:14 +02:00
Eike Kettner
3473cbb773 Use collective data with NER annotation 2020-08-25 20:40:44 +02:00
Eike Kettner
1fc57fc2b2 Set default value for min-text-len to 500
This value is used to decide whether to try OCR or not. If the text
length is below this value, OCR is run and both results are compared.
It was set to 10, which is just one or two words. Since docspell deals
with documents, this value is too low.
2020-08-01 15:46:00 +02:00
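The threshold check described in this commit can be sketched as follows. This is a minimal illustration, not docspell's code: the function name `extract_text`, the `run_ocr` callback, and the rule "keep the longer result" are assumptions, since the commit message only says that both results are compared.

```python
from typing import Callable

MIN_TEXT_LEN = 500  # default named in the commit message


def extract_text(embedded_text: str,
                 run_ocr: Callable[[], str],
                 min_text_len: int = MIN_TEXT_LEN) -> str:
    """Return the embedded text, falling back to OCR when it is too short."""
    stripped = embedded_text.strip()
    if len(stripped) >= min_text_len:
        # Enough text was found already; OCR is skipped.
        return stripped
    ocr_text = run_ocr().strip()
    # Assumption: "comparing" means keeping the longer of the two results.
    return ocr_text if len(ocr_text) > len(stripped) else stripped
```

With the old default of 10, a page containing only a short header would have passed the check and skipped OCR.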
Eike Kettner
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- Text extraction for pdf files now happens on the converted file,
  because it may already contain the text from the conversion step and
  thus avoids running OCR twice.

- Errors during conversion are not fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
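The check for an existing text file next to the converted pdf could look roughly like this. It is a sketch under assumptions: the helper name `sidecar_text` is hypothetical, and it assumes the text file replaces the `.pdf` suffix with `.txt` (the commit message would also fit a name such as `file.pdf.txt`).

```python
from pathlib import Path
from typing import Optional


def sidecar_text(pdf_path: Path) -> Optional[str]:
    """Return the content of the `.txt` file next to the pdf, if present."""
    # Assumption: "report.pdf" is accompanied by "report.txt".
    txt_path = pdf_path.with_suffix(".txt")
    if txt_path.is_file():
        return txt_path.read_text(encoding="utf-8")
    # No text file: the caller has to extract text from the pdf itself.
    return None
```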
Eike Kettner
d79ae6233a Restrict proposals for due date
Avoid dates too far in the future.
2020-06-26 16:58:17 +02:00
Eike Kettner
91da3b149e Reducing default retries to 2
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these states, so it is now set to a lower value
to avoid long wait times until an item arrives.
2020-06-25 23:57:01 +02:00
Eike Kettner
14213c4c27 Allow some solr query options in the config file 2020-06-24 23:37:20 +02:00
Eike Kettner
532caed84c Consistent logging of request/responses to solr
Using a middleware. Also add missing changesets for mariadb.
2020-06-24 21:25:46 +02:00
Eike Kettner
47697a8056 Set some logs to trace 2020-06-24 01:16:13 +02:00
Eike Kettner
ffbb16db45 Transport highlighting information to the client 2020-06-23 00:17:29 +02:00
Eike Kettner
2a0bf24088 Setup solr schema and index all data using a system task
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
2020-06-19 21:37:22 +02:00
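The idea of memorizing each step so that it is skipped on later starts can be sketched like this. The names `run_once`, `steps` and `done` are hypothetical, and an in-memory set stands in for whatever persistent store the application actually uses.

```python
from typing import Callable, Iterable, List, MutableSet, Tuple


def run_once(steps: Iterable[Tuple[str, Callable[[], None]]],
             done: MutableSet[str]) -> List[str]:
    """Run every step not yet recorded in `done`; return the names executed."""
    executed = []
    for name, action in steps:
        if name in done:
            # Already completed on an earlier start: skip it.
            continue
        action()
        # Record the step only after it finished without raising.
        done.add(name)
        executed.append(name)
    return executed
```

Calling `run_once` a second time with the same `done` set executes nothing, which mirrors the behaviour on subsequent application starts.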
Eike Kettner
60c079f664 Add task to index current database state 2020-06-18 22:38:45 +02:00
Eike Kettner
522daaf57e Introducing fts client into codebase 2020-06-17 23:20:46 +02:00
Eike Kettner
d9782582d8 Use max-mails setting with higher priority
The `mail-chunk-size` is set to its configured value or `max-mails`
whichever is lower.
2020-05-20 22:44:29 +02:00
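The rule from this commit message amounts to taking the lower of the two settings. A minimal sketch, with a hypothetical function name and plain integers standing in for the config values:

```python
def effective_mail_chunk_size(mail_chunk_size: int, max_mails: int) -> int:
    """Return the configured chunk size, capped by `max-mails`."""
    # Whichever value is lower wins, so `max-mails` has the higher priority.
    return min(mail_chunk_size, max_mails)
```

For example, with `mail-chunk-size` set to 50 and `max-mails` set to 20, mails would be fetched in chunks of 20.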