docspell

mirror of https://github.com/TheAnachronism/docspell.git synced 2025-02-22 14:03:26 +00:00

Author	SHA1	Message	Date
Eike Kettner	5fe532001b	Allow to specify document language with the request	2020-11-23 20:49:01 +01:00
Eike Kettner	5034e12bec	Add a subject filter to scan-mailbox args	2020-11-13 23:15:20 +01:00
mergify[bot]	e5ce1fd45f	Merge pull request #437 from eikek/upload-improvements Upload improvements	2020-11-12 22:58:08 +00:00
Eike Kettner	4fd6e02ec0	Improve glob and filter archive entries	2020-11-11 21:01:23 +01:00
Eike Kettner	27eb5d70de	Apply given tags in processing step Issue: #346	2020-11-11 21:01:23 +01:00
Eike Kettner	55a6f7aaf6	Add more properties to upload meta data	2020-11-11 21:01:23 +01:00
Eike Kettner	746e04c624	Improve logging when creating preview images	2020-11-10 22:25:46 +01:00
Eike Kettner	10305bc82d	Minor improvements	2020-11-09 21:16:53 +01:00
Eike Kettner	29455d638c	Add startup task to find page counts of existing files	2020-11-09 20:35:35 +01:00
Eike Kettner	a77f34b7ba	Add a processing step to retrieve page counts	2020-11-09 11:08:24 +01:00
Eike Kettner	f4e50c5229	Provide endpoints to submit tasks to re-generate previews The scaling factor can be given in the config file. When this changes, images can be regenerated via POSTing to certain endpoints. It is possible to regenerate just one attachment preview or all within a collective.	2020-11-09 09:00:02 +01:00
Eike Kettner	6037b54959	Don't fail processing if generating preview fails	2020-11-09 00:05:11 +01:00
Eike Kettner	709848244c	Create tasks to generate all previews There is a task to generate preview images per attachment. It can either add them (if not present yet) or overwrite them (e.g. some config has changed). There is a task that selects all attachments without previews and submits a task to create it. This is submitted on start automatically to generate previews for all existing attachments.	2020-11-08 23:46:02 +01:00
Eike Kettner	7ba6baf6f0	Make preview image smaller	2020-11-08 15:12:56 +01:00
Eike Kettner	6db5c39d78	Fix converted filename Mark it by default with a string from the config file. Issue: 397	2020-11-08 09:45:03 +01:00
Eike Kettner	ef7cb4e779	Create a preview image of all files during processing	2020-11-08 01:25:59 +01:00
Eike Kettner	ab1139523a	Let the convert-all task retry when pdf conversion fails	2020-10-26 23:39:26 +01:00
Eike Kettner	b59696a9d3	Make sure to only remove/retry items in premature states	2020-10-26 23:39:26 +01:00
Eike Kettner	26e89bf84e	Edit org/person/equipment of multiple items	2020-10-26 13:35:47 +01:00
Eike Kettner	2e6026b817	Edit dates of multiple items	2020-10-26 13:16:03 +01:00
Eike Kettner	3c0b86cb19	Fix regex patterns used for NER Patterns are split on whitespace by the nlp library and then compiled, so each "word" must be a valid regex. Fixes: #356	2020-10-21 00:55:14 +02:00
Eike Kettner	3f697f51aa	Autoformat	2020-10-06 23:31:09 +02:00
Eike Kettner	d4354b8b49	Skip pdf conversion if a converted file exists For images the conversion also returns the extracted text. If this would have failed to be saved, it is extracted in the following text-extraction step.	2020-10-02 17:39:39 +02:00
Eike Kettner	b6f23b038a	Fix finding attachments for retries The attachments to process again must be searched in sources and archives, too.	2020-10-02 17:39:34 +02:00
Eike Kettner	5e21552358	Don't do duplicate check on retries	2020-10-02 16:50:52 +02:00
Eike Kettner	f6f63000be	Prepend a duplicate check when uploading files	2020-09-23 23:37:00 +02:00
Eike Kettner	c658677032	Autoformat	2020-09-09 00:29:32 +02:00
Eike Kettner	76ccfb8a81	Only learn from confirmed items Text classification should only learn from confirmed items. Log if classification is disabled when processing an item.	2020-09-07 13:04:40 +02:00
Eike Kettner	4309bd8dfd	Some cleanup	2020-09-02 21:22:30 +02:00
Eike Kettner	237b960625	Guess a tag on item processing using a trained model if available	2020-09-02 18:28:14 +02:00
Eike Kettner	316b490008	Implement learning a text classifier from collective data	2020-09-02 18:28:14 +02:00
Eike Kettner	68bb65572b	Integrate learn-classifier task into the app	2020-09-02 18:28:14 +02:00
Eike Kettner	0c97b4ef76	Initial impl of a text classifier based on stanford-nlp	2020-09-02 18:28:14 +02:00
Eike Kettner	8c4f2e702b	Add classifier settings	2020-09-02 18:28:14 +02:00
Eike Kettner	3473cbb773	Use collective data with NER annotation	2020-08-25 20:40:44 +02:00
Eike Kettner	96d2f948f2	Use collective's addressbook to configure regexner	2020-08-24 14:40:52 +02:00
Eike Kettner	8628a0a8b3	Allow configuring stanford-ner and cache based on collective	2020-08-24 10:55:59 +02:00
Eike Kettner	3986487f11	Add api docs and cleanup	2020-08-13 21:22:54 +02:00
Eike Kettner	41ea071555	Add a task to convert all pdfs that have not been converted	2020-08-13 01:06:13 +02:00
Eike Kettner	07e9a9767e	Add a task to re-process files of an item	2020-08-12 22:29:56 +02:00
Eike Kettner	09d74b7e80	Return item notes with search results In order to not make the response very large, an admin can define a limit on how much to return.	2020-08-05 00:09:37 +02:00
Eike Kettner	45b0deeced	Print solr url on start This is useful info to see which url has been selected, same as db connection.	2020-08-01 15:59:14 +02:00
Eike Kettner	1fc57fc2b2	Set default value for min-text-len to 500 This value is used to decide whether to try OCR or not. If text is below this value, OCR is run and both results are compared. It was set to 10, which is just one or two words. Since the context for docspell is documents, this value is too low.	2020-08-01 15:46:00 +02:00
Eike Kettner	cec4948710	Add pdf meta data to extracted text to add it to full-text index	2020-07-19 01:07:49 +02:00
Eike Kettner	209c068436	Use keywords in pdfs to search for existing tags During processing, keywords stored in PDF metadata are used to look them up in the tag database and associate any existing tags to the item. See #175	2020-07-19 00:28:04 +02:00
Eike Kettner	bd20165d1a	Use given folder-id when adding initial fts docs	2020-07-18 23:04:01 +02:00
Eike Kettner	3d49ceaab5	Use ocrmypdf tool to create pdf/a during conversion - Use another external tool to convert pdf to pdf which also adds the extracted text as another layer into the pdf - Although not used, the external conversion routine will now check for an existing text file that is named as the pdf file with extension `.txt`. If present it is included in the conversion result and will be used as the extracted text. - text extraction for pdf files happens now on the converted file, because it may already contain the text from the conversion step and thus avoids running OCR twice. - All errors during conversion are not fatal; processing continues without a converted file.	2020-07-18 17:19:29 +02:00
Eike Kettner	5b01c93711	Add a folder-id to item processing This allows to define a folder when uploading files. All generated items are associated to this folder on creation.	2020-07-14 23:18:39 +02:00
Eike Kettner	259526a088	Organize imports	2020-07-12 13:51:52 +02:00
Eike Kettner	22fa1dba13	Apply folder restriction to fulltext only search And update index when folder changes.	2020-07-12 13:50:45 +02:00

140 Commits