67 Commits

Author SHA1 Message Date
172513ce38 Move arg-mappings underneath command section
The argument mappings are part of the command configuration
2024-05-27 17:53:13 +02:00
e731d822dc Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR (#2505)
* Add Japanese Vertical Support 
* Adds Japanese Vertical mappings to default configuration.
2024-04-16 20:24:57 +02:00
81d8b6c9c1 Allow to configure a region for s3 backend
Closes: #2386
2023-11-17 21:27:13 +01:00
df75fbddcd Allow to convert html->pdf via weasyprint 2022-11-07 10:31:25 +01:00
d413b16b03 Allow to always use OCR extracted text
Fixes: #1628
2022-07-07 17:58:03 +02:00
3764f9265b Configure run/repair db migrations
Refs: #1517
2022-05-22 00:07:36 +02:00
47bd6cd0ba Fail fast when multiple addons are run 2022-05-21 00:40:26 +02:00
7fdd78ad06 Experiment with addons
Addons allow executing external programs in some context inside
docspell. Currently it is possible to run them after processing files.
Addons are provided by URLs to zip files.
2022-05-15 23:46:43 +02:00
5bdf728eb3 Improve logging configuration
- Log levels of specific loggers can be defined in the config
  file (doesn't work with env variables)

- Log events of background tasks now carry additional data
2022-04-30 18:26:19 +02:00
9851b71c45 Fix documentation about fulltext search 2022-04-24 18:34:22 +02:00
4488291319 Download multiple files as zip 2022-04-09 15:28:51 +02:00
21e13341e3 Configure postgres fts backend 2022-03-21 11:05:03 +01:00
cd3db6ea08 Run file integrity check in house keeping tasks 2022-03-13 15:20:33 +01:00
e82b00c582 Use different file stores based on config 2022-03-12 12:19:00 +01:00
9545431d59 Allow the user to set time zone
Fix timezone handling for periodic tasks
2022-03-01 23:15:59 +01:00
8103e25e32 Set default log format to fancy 2022-02-23 23:26:22 +01:00
8b42708db2 Remove old log stuff 2022-02-19 22:01:49 +01:00
e483a97de7 Adapt to new logging api 2022-02-19 21:41:38 +01:00
118d23c3a2 Add list of env variables to documentation
Issue: #1121
2021-10-25 00:23:20 +02:00
aa8f3b82fc Use passwords when reading PDFs 2021-09-30 11:48:59 +02:00
3c93b63c8a Add option to decrypt PDFs during conversion
Refs: #1074
2021-09-29 23:04:26 +02:00
20a829cf7a Refactoring for migrating to binny library 2021-09-22 14:18:43 +02:00
5d33b3841a Add a task to check for updates periodically
It must be enabled and configured by the admin.

Refs: #990
2021-08-20 00:25:27 +02:00
bdc7822f50 Add documentation about docker setup 2021-05-31 22:19:49 +02:00
d7bc963450 Cleanup nodes that are not reachable anymore 2021-02-18 00:37:18 +01:00
c7e850116f Make the text length limit optional 2021-01-22 23:06:50 +01:00
9957c3267e Add constraints from config to classifier training
For large and/or many documents, training the classifier can lead to
OOM errors. Some limits have been set by default.
2021-01-21 17:46:39 +01:00
a6c31be22f Update documentation 2021-01-20 22:47:15 +01:00
85ddc61d9d Move date proposal setting to nlp config 2021-01-20 19:17:29 +01:00
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language italian
Improves and reorganizes how nlp pipelines are set up. Now users can
choose from many options, depending on their hardware and usage
scenario.

This is the base to use more languages without depending on what
stanford-nlp supports. Support is then limited to text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
aa937797be Choose nlp mode in config file 2021-01-17 22:56:33 +01:00
d77b5855e4 Set default pool-size to 1 2021-01-11 22:30:59 +01:00
a670bbb6c2 Make idle interval when clearing nlp cache configurable 2021-01-06 23:03:00 +01:00
f5ae389eea Cleanup remember-me tokens periodically 2020-12-04 17:59:25 +01:00
f4e50c5229 Provide endpoints to submit tasks to re-generate previews
The scaling factor can be given in the config file. When this changes,
images can be regenerated via POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all within a
collective.
2020-11-09 09:00:02 +01:00
6db5c39d78 Fix converted filename
Mark it by default with a string from the config file.

Issue: #397
2020-11-08 09:45:03 +01:00
4309bd8dfd Some cleanup 2020-09-02 21:22:30 +02:00
0c97b4ef76 Initial impl of a text classifier based on stanford-nlp 2020-09-02 18:28:14 +02:00
8c4f2e702b Add classifier settings 2020-09-02 18:28:14 +02:00
3473cbb773 Use collective data with NER annotation 2020-08-25 20:40:44 +02:00
1fc57fc2b2 Set default value for min-text-len to 500
This value is used to decide whether to try OCR or not. If the text length is
below this value, OCR is run and both results are compared. It was set
to 10, which is just one or two words. Since docspell deals with
documents, that value is too low.
2020-08-01 15:46:00 +02:00
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine now checks
  for an existing text file named like the pdf file but with the extension
  `.txt`. If present, it is included in the conversion result and
  used as the extracted text.

- Text extraction for pdf files now happens on the converted file,
  because it may already contain the text from the conversion step;
  this avoids running OCR twice.

- Errors during conversion are never fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
d79ae6233a Restrict proposals for due date
Avoid dates too far in the future.
2020-06-26 16:58:17 +02:00
91da3b149e Reducing default retries to 2
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these states, so the default is now set to a lower
value to avoid long wait times until an item arrives.
2020-06-25 23:57:01 +02:00
14213c4c27 Allow some solr query options in the config file 2020-06-24 23:37:20 +02:00
532caed84c Consistent logging of request/responses to solr
Using a middleware. Also add missing changesets for mariadb.
2020-06-24 21:25:46 +02:00
47697a8056 Set some logs to trace 2020-06-24 01:16:13 +02:00
ffbb16db45 Transport highlighting information to the client 2020-06-23 00:17:29 +02:00
2a0bf24088 Setup solr schema and index all data using a system task
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
2020-06-19 21:37:22 +02:00
60c079f664 Add task to index current database state 2020-06-18 22:38:45 +02:00