docspell

mirror of https://github.com/TheAnachronism/docspell.git synced 2025-06-22 02:18:26 +00:00

Author	SHA1	Message	Date
eikek	172513ce38	Move arg-mappings underneath `command` section The argument mappings are part of the command configuration	2024-05-27 17:53:13 +02:00
tenpai	e731d822dc	Add Japanese Vertical Support Branch for Tesseract and Ocrmypdf OCR (#2505) * Add Japanese Vertical Support * Adds Japanese Vertical mappings to default configuration.	2024-04-16 20:24:57 +02:00
eikek	81d8b6c9c1	Allow to configure a region for s3 backend Closes: #2386	2023-11-17 21:27:13 +01:00
eikek	df75fbddcd	Allow to convert html->pdf via weasyprint	2022-11-07 10:31:25 +01:00
eikek	d413b16b03	Allow to always use OCR extracted text Fixes: #1628	2022-07-07 17:58:03 +02:00
eikek	3764f9265b	Configure run/repair db migrations Refs: #1517	2022-05-22 00:07:36 +02:00
eikek	47bd6cd0ba	Fail fast when multiple addons are run	2022-05-21 00:40:26 +02:00
eikek	7fdd78ad06	Experiment with addons Addons allow to execute external programs in some context inside docspell. Currently it is possible to run them after processing files. Addons are provided by URLs to zip files.	2022-05-15 23:46:43 +02:00
eikek	5bdf728eb3	Improve logging configuration - Log levels of specific loggers can be defined in the config file (doesn't work with env variables) - Log events of background tasks carry now additional data	2022-04-30 18:26:19 +02:00
eikek	9851b71c45	Fix documentation about fulltext search	2022-04-24 18:34:22 +02:00
eikek	4488291319	Download multiple files as zip	2022-04-09 15:28:51 +02:00
eikek	21e13341e3	Configure postgres fts backend	2022-03-21 11:05:03 +01:00
eikek	cd3db6ea08	Run file integrity check in house keeping tasks	2022-03-13 15:20:33 +01:00
eikek	e82b00c582	Use different file stores based on config	2022-03-12 12:19:00 +01:00
eikek	9545431d59	Allow the user to set time zone Fix timezone handling for periodic tasks	2022-03-01 23:15:59 +01:00
eikek	8103e25e32	Set default log format to fancy	2022-02-23 23:26:22 +01:00
eikek	8b42708db2	Remove old log stuff	2022-02-19 22:01:49 +01:00
eikek	e483a97de7	Adopt to new logging api	2022-02-19 21:41:38 +01:00
eikek	118d23c3a2	Add list of env variables to documentation Issue: #1121	2021-10-25 00:23:20 +02:00
eikek	aa8f3b82fc	Use passwords when reading PDFs	2021-09-30 11:48:59 +02:00
eikek	3c93b63c8a	Add option to decrypt PDFs during conversion Refs: #1074	2021-09-29 23:04:26 +02:00
eikek	20a829cf7a	Refactoring for migrating to binny library	2021-09-22 14:18:43 +02:00
eikek	5d33b3841a	Add a task to check for updates periodically It must be enabled and configured by the admin. Refs: #990	2021-08-20 00:25:27 +02:00
eikek	bdc7822f50	Add documentation about docker setup	2021-05-31 22:19:49 +02:00
Eike Kettner	d7bc963450	Cleanup nodes that are not reachable anymore	2021-02-18 00:37:18 +01:00
Eike Kettner	c7e850116f	Make the text length limit optional	2021-01-22 23:06:50 +01:00
Eike Kettner	9957c3267e	Add constraints from config to classifier training For large and/or many documents, training the classifier can lead to OOM errors. Some limits have been set by default.	2021-01-21 17:46:39 +01:00
Eike Kettner	a6c31be22f	Update documentation	2021-01-20 22:47:15 +01:00
Eike Kettner	85ddc61d9d	Move date proposal setting to nlp config	2021-01-20 19:17:29 +01:00
Eike Kettner	f01646aeb5	Reorganize nlp pipeline and add nlp-unsupported language italian Improves and reorganizes how nlp pipelines are setup. Now users can choose from many options, depending on their hardware and usage scenario. This is the base to use more languages without depending on what stanford-nlp supports. Support then involves text extraction and simple regex-ner processing.	2021-01-18 17:41:40 +01:00
Eike Kettner	aa937797be	Choose nlp mode in config file	2021-01-17 22:56:33 +01:00
Eike Kettner	d77b5855e4	Set default pool-size to 1	2021-01-11 22:30:59 +01:00
Eike Kettner	a670bbb6c2	Make idle interval when clearing nlp cache configurable	2021-01-06 23:03:00 +01:00
Eike Kettner	f5ae389eea	Cleanup remember-me tokens periodically	2020-12-04 17:59:25 +01:00
Eike Kettner	f4e50c5229	Provide endpoints to submit tasks to re-generate previews The scaling factor can be given in the config file. When this changes, images can be regenerated via POSTing to certain endpoints. It is possible to regenerate just one attachment preview or all within a collective.	2020-11-09 09:00:02 +01:00
Eike Kettner	6db5c39d78	Fix converted filename Mark it by default with a string from the config file. Issue: 397	2020-11-08 09:45:03 +01:00
Eike Kettner	4309bd8dfd	Some cleanup	2020-09-02 21:22:30 +02:00
Eike Kettner	0c97b4ef76	Initial impl of a text classifier based on stanford-nlp	2020-09-02 18:28:14 +02:00
Eike Kettner	8c4f2e702b	Add classifier settings	2020-09-02 18:28:14 +02:00
Eike Kettner	3473cbb773	Use collective data with NER annotation	2020-08-25 20:40:44 +02:00
Eike Kettner	1fc57fc2b2	Set default value for min-text-len to 500 This value is used to decide whether to try OCR or not. If text is below this value, OCR is run and both results are compared. It was set to 10, which is just one or two words. Since the context for docspell are documents, this value is too low.	2020-08-01 15:46:00 +02:00
Eike Kettner	3d49ceaab5	Use ocrmypdf tool to create pdf/a during conversion - Use another external tool to convert pdf to pdf which also adds the extracted text as another layer into the pdf - Although not used, the external conversion routine will now check for an existing text file that is named as the pdf file with extension `.txt`. If present it is included in the conversion result and will be used as the extracted text. - text extraction for pdf files happens now on the converted file, because it may already contain the text from the conversion step and thus avoids running OCR twice. - All errors during conversion are not fatal; processing continues without a converted file.	2020-07-18 17:19:29 +02:00
Eike Kettner	d79ae6233a	Restrict proposals for due date Avoid dates too far in the future.	2020-06-26 16:58:17 +02:00
Eike Kettner	91da3b149e	Reducing default retries to 2 Many errors cannot be recovered from by retrying. There is currently no way to distinguish these states so it is now set to a lower value to have not long wait times until an item arrives.	2020-06-25 23:57:01 +02:00
Eike Kettner	14213c4c27	Allow some solr query options in the config file	2020-06-24 23:37:20 +02:00
Eike Kettner	532caed84c	Consistent logging of request/responses to solr Using a middleware. Also add missing changesets for mariadb.	2020-06-24 21:25:46 +02:00
Eike Kettner	47697a8056	Set some logs to trace	2020-06-24 01:16:13 +02:00
Eike Kettner	ffbb16db45	Transport highlighting information to the client	2020-06-23 00:17:29 +02:00
Eike Kettner	2a0bf24088	Setup solr schema and index all data using a system task The task runs on application start. It sets the schema using solr's schema api and then indexes all data in the database. Each step is memorized so that it is not executed again on subsequent starts.	2020-06-19 21:37:22 +02:00
Eike Kettner	60c079f664	Add task to index current database state	2020-06-18 22:38:45 +02:00


67 Commits