docspell

mirror of https://github.com/TheAnachronism/docspell.git synced 2025-11-03 02:30:11 +00:00

Author	SHA1	Message	Date
eikek	3c93b63c8a	Add option to decrypt PDFs during conversion Refs: #1074	2021-09-29 23:04:26 +02:00
eikek	20a829cf7a	Refactoring for migrating to binny library	2021-09-22 14:18:43 +02:00
eikek	5d33b3841a	Add a task to check for updates periodically It must be enabled and configured by the admin. Refs: #990	2021-08-20 00:25:27 +02:00
eikek	bdc7822f50	Add documentation about docker setup	2021-05-31 22:19:49 +02:00
Eike Kettner	d7bc963450	Cleanup nodes that are not reachable anymore	2021-02-18 00:37:18 +01:00
Eike Kettner	c7e850116f	Make the text length limit optional	2021-01-22 23:06:50 +01:00
Eike Kettner	9957c3267e	Add constraints from config to classifier training For large and/or many documents, training the classifier can lead to OOM errors. Some limits have been set by default.	2021-01-21 17:46:39 +01:00
Eike Kettner	a6c31be22f	Update documentation	2021-01-20 22:47:15 +01:00
Eike Kettner	85ddc61d9d	Move date proposal setting to nlp config	2021-01-20 19:17:29 +01:00
Eike Kettner	f01646aeb5	Reorganize nlp pipeline and add nlp-unsupported language italian Improves and reorganizes how nlp pipelines are set up. Now users can choose from many options, depending on their hardware and usage scenario. This is the base to use more languages without depending on what stanford-nlp supports. Support is then limited to text extraction and simple regex-ner processing.	2021-01-18 17:41:40 +01:00
Eike Kettner	aa937797be	Choose nlp mode in config file	2021-01-17 22:56:33 +01:00
Eike Kettner	d77b5855e4	Set default pool-size to 1	2021-01-11 22:30:59 +01:00
Eike Kettner	a670bbb6c2	Make idle interval when clearing nlp cache configurable	2021-01-06 23:03:00 +01:00
Eike Kettner	f5ae389eea	Cleanup remember-me tokens periodically	2020-12-04 17:59:25 +01:00
Eike Kettner	f4e50c5229	Provide endpoints to submit tasks to re-generate previews The scaling factor can be given in the config file. When this changes, images can be regenerated via POSTing to certain endpoints. It is possible to regenerate just one attachment preview or all within a collective.	2020-11-09 09:00:02 +01:00
Eike Kettner	6db5c39d78	Fix converted filename Mark it by default with a string from the config file. Issue: 397	2020-11-08 09:45:03 +01:00
Eike Kettner	4309bd8dfd	Some cleanup	2020-09-02 21:22:30 +02:00
Eike Kettner	0c97b4ef76	Initial impl of a text classifier based on stanford-nlp	2020-09-02 18:28:14 +02:00
Eike Kettner	8c4f2e702b	Add classifier settings	2020-09-02 18:28:14 +02:00
Eike Kettner	3473cbb773	Use collective data with NER annotation	2020-08-25 20:40:44 +02:00
Eike Kettner	1fc57fc2b2	Set default value for min-text-len to 500 This value is used to decide whether to try OCR or not. If text is below this value, OCR is run and both results are compared. It was set to 10, which is just one or two words. Since docspell works with documents, this value is too low.	2020-08-01 15:46:00 +02:00
Eike Kettner	3d49ceaab5	Use ocrmypdf tool to create pdf/a during conversion - Use another external tool to convert pdf to pdf which also adds the extracted text as another layer into the pdf - Although not used, the external conversion routine will now check for an existing text file that is named as the pdf file with extension `.txt`. If present it is included in the conversion result and will be used as the extracted text. - text extraction for pdf files happens now on the converted file, because it may already contain the text from the conversion step and thus avoids running OCR twice. - All errors during conversion are not fatal; processing continues without a converted file.	2020-07-18 17:19:29 +02:00
Eike Kettner	d79ae6233a	Restrict proposals for due date Avoid dates too far in the future.	2020-06-26 16:58:17 +02:00
Eike Kettner	91da3b149e	Reducing default retries to 2 Many errors cannot be recovered from by retrying. There is currently no way to distinguish these states, so it is now set to a lower value to avoid long wait times until an item arrives.	2020-06-25 23:57:01 +02:00
Eike Kettner	14213c4c27	Allow some solr query options in the config file	2020-06-24 23:37:20 +02:00
Eike Kettner	532caed84c	Consistent logging of request/responses to solr Using a middleware. Also add missing changesets for mariadb.	2020-06-24 21:25:46 +02:00
Eike Kettner	47697a8056	Set some logs to trace	2020-06-24 01:16:13 +02:00
Eike Kettner	ffbb16db45	Transport highlighting information to the client	2020-06-23 00:17:29 +02:00
Eike Kettner	2a0bf24088	Setup solr schema and index all data using a system task The task runs on application start. It sets the schema using solr's schema api and then indexes all data in the database. Each step is memorized so that it is not executed again on subsequent starts.	2020-06-19 21:37:22 +02:00
Eike Kettner	60c079f664	Add task to index current database state	2020-06-18 22:38:45 +02:00
Eike Kettner	522daaf57e	Introducing fts client into codebase	2020-06-17 23:20:46 +02:00
Eike Kettner	d9782582d8	Use `max-mails` setting with higher priority The `mail-chunk-size` is set to its configured value or `max-mails` whichever is lower.	2020-05-20 22:44:29 +02:00
Eike Kettner	c0259dba7e	Allow to enable debug flag for javamail	2020-05-20 22:15:25 +02:00
Eike Kettner	f2d67dc816	Initial impl of import from mailbox user task	2020-05-20 17:52:38 +02:00
Eike Kettner	852455c610	Add upload operation to task arguments	2020-05-20 17:52:38 +02:00
Eike Kettner	0a1b3fcf95	Set list-id header for notification mails	2020-04-30 21:23:56 +02:00
Eike Kettner	6a1297fc95	Add a limit for text analysis	2020-03-27 22:54:49 +01:00
Eike Kettner	cf7ccd572c	Improve handling encodings Html and text files are no longer fixed to be UTF-8. The encoding is now detected, which may not work for all files. Default/fallback will be utf-8. There is still a problem with mails that contain html parts not in utf-8 encoding. The mail text is always returned as a string and the original encoding is lost. Then the html is stored using utf-8 bytes, but wkhtmltopdf reads it using latin1. It seems that the `--encoding` setting doesn't override the encoding provided by the document.	2020-03-23 22:51:28 +01:00
Eike Kettner	718e44a21c	Add cleanup jobs task	2020-03-09 20:24:00 +01:00
Eike Kettner	854a596da3	Integrate periodic tasks The first use case for periodic task is the cleanup of expired invitation keys. This is part of a house-keeping periodic task.	2020-03-08 22:49:49 +01:00
Eike Kettner	1e598bd902	Sketch a scheduler for running periodic tasks Periodic tasks are special in that they are usually kept around and started based on a schedule. A new component checks periodic tasks and submits them in the queue once they are due. In order to avoid duplicate periodic jobs, the tracker of a job is used to store the periodic job id. Each time a periodic task is due, it is first checked if there is a job running (or queued) for this task.	2020-03-08 12:55:03 +01:00
Eike Kettner	ec419c7bfd	Adapt nix modules to new config	2020-02-22 12:40:56 +01:00
Eike Kettner	3f316ab4d0	Update config file doc	2020-02-20 21:10:00 +01:00
Eike Kettner	97305d27ff	Integrate support for more files into processing and upload The restriction that only pdf files can be uploaded is removed. All files can now be uploaded. The processing may not process all. It is still possible to restrict file uploads by types via a configuration.	2020-02-19 23:27:00 +01:00
Eike Kettner	3be90d64d5	Move `SystemCommand` to common module	2020-02-10 22:23:06 +01:00
Eike Kettner	831cd8b655	Initial version. Features: - Upload PDF files and have them analyzed - Manage meta data and items - See processing in webapp	2019-09-21 22:02:36 +02:00
Eike Kettner	6154e6a387	Initial application stub	2019-09-21 14:54:03 +02:00

47 Commits