27eb5d70de
Apply given tags in processing step
...
Issue: #346
2020-11-11 21:01:23 +01:00
55a6f7aaf6
Add more properties to upload meta data
2020-11-11 21:01:23 +01:00
746e04c624
Improve logging when creating preview images
2020-11-10 22:25:46 +01:00
10305bc82d
Minor improvements
2020-11-09 21:16:53 +01:00
29455d638c
Add startup task to find page counts of existing files
2020-11-09 20:35:35 +01:00
a77f34b7ba
Add a processing step to retrieve page counts
2020-11-09 11:08:24 +01:00
f4e50c5229
Provide endpoints to submit tasks to re-generate previews
...
The scaling factor can be set in the config file. When it changes,
images can be regenerated by POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all previews
within a collective.
2020-11-09 09:00:02 +01:00
6037b54959
Don't fail processing if generating preview fails
2020-11-09 00:05:11 +01:00
709848244c
Create tasks to generate all previews
...
There is a task to generate preview images per attachment. It can
either add them (if not present yet) or overwrite them (e.g. when some
config has changed).
There is a task that selects all attachments without previews and
submits a task to create them. It is submitted automatically on start
to generate previews for all existing attachments.
2020-11-08 23:46:02 +01:00
7ba6baf6f0
Make preview image smaller
2020-11-08 15:12:56 +01:00
6db5c39d78
Fix converted filename
...
Mark it by default with a string from the config file.
Issue: #397
2020-11-08 09:45:03 +01:00
ef7cb4e779
Create a preview image of all files during processing
2020-11-08 01:25:59 +01:00
ab1139523a
Let the convert-all task retry when pdf conversion fails
2020-10-26 23:39:26 +01:00
b59696a9d3
Make sure to only remove/retry items in premature states
2020-10-26 23:39:26 +01:00
26e89bf84e
Edit org/person/equipment of multiple items
2020-10-26 13:35:47 +01:00
2e6026b817
Edit dates of multiple items
2020-10-26 13:16:03 +01:00
3c0b86cb19
Fix regex patterns used for NER
...
Patterns are split on whitespace by the nlp library and then compiled,
so each "word" must be a valid regex.
Fixes: #356
2020-10-21 00:55:14 +02:00
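The constraint described in the commit above (patterns are split on whitespace and each piece is compiled separately) can be illustrated with a small sketch. It uses Python's `re` module rather than the nlp library itself, and the helper names `to_token_patterns` and `compiles` are hypothetical. Escaping each word is only one possible way to satisfy the constraint and is not necessarily what the commit does.

```python
import re


def to_token_patterns(name: str) -> list[str]:
    """Split a name on whitespace and escape each word so it compiles alone."""
    # Assumption: the words should match literally, so escaping is acceptable.
    return [re.escape(word) for word in name.split()]


def compiles(pattern: str) -> bool:
    """Return True if the pattern is a valid regex when compiled by itself."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


raw_words = "Miller (Sales Dept.)".split()
# "(Sales" and "Dept.)" each contain an unbalanced parenthesis, so they are
# invalid when compiled word by word, even though the whole string is balanced.
print([compiles(w) for w in raw_words])
print([compiles(p) for p in to_token_patterns("Miller (Sales Dept.)")])
```

The first list shows which raw words fail on their own; after escaping, every word compiles and still matches its literal text.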
3f697f51aa
Autoformat
2020-10-06 23:31:09 +02:00
d4354b8b49
Skip pdf conversion if a converted file exists
...
For images, the conversion also returns the extracted text. If saving
that text failed, it is extracted again in the following
text-extraction step.
2020-10-02 17:39:39 +02:00
b6f23b038a
Fix finding attachments for retries
...
The attachments to process again must be searched in sources and
archives, too.
2020-10-02 17:39:34 +02:00
5e21552358
Don't do duplicate check on retries
2020-10-02 16:50:52 +02:00
f6f63000be
Prepend a duplicate check when uploading files
2020-09-23 23:37:00 +02:00
c658677032
Autoformat
2020-09-09 00:29:32 +02:00
76ccfb8a81
Only learn from confirmed items
...
Text classification should only learn from confirmed items. Log if
classification is disabled when processing an item.
2020-09-07 13:04:40 +02:00
4309bd8dfd
Some cleanup
2020-09-02 21:22:30 +02:00
237b960625
Guess a tag on item processing using a trained model if available
2020-09-02 18:28:14 +02:00
316b490008
Implement learning a text classifier from collective data
2020-09-02 18:28:14 +02:00
68bb65572b
Integrate learn-classifier task into the app
2020-09-02 18:28:14 +02:00
0c97b4ef76
Initial impl of a text classifier based on stanford-nlp
2020-09-02 18:28:14 +02:00
8c4f2e702b
Add classifier settings
2020-09-02 18:28:14 +02:00
3473cbb773
Use collective data with NER annotation
2020-08-25 20:40:44 +02:00
96d2f948f2
Use collective's addressbook to configure regexner
2020-08-24 14:40:52 +02:00
8628a0a8b3
Allow configuring stanford-ner and cache based on collective
2020-08-24 10:55:59 +02:00
3986487f11
Add api docs and cleanup
2020-08-13 21:22:54 +02:00
41ea071555
Add a task to convert all pdfs that have not been converted
2020-08-13 01:06:13 +02:00
07e9a9767e
Add a task to re-process files of an item
2020-08-12 22:29:56 +02:00
09d74b7e80
Return item notes with search results
...
To keep the response from becoming very large, an admin can define a
limit on how much to return.
2020-08-05 00:09:37 +02:00
45b0deeced
Print solr url on start
...
This is useful for seeing which url has been selected, as with the db
connection.
2020-08-01 15:59:14 +02:00
1fc57fc2b2
Set default value for min-text-len to 500
...
This value is used to decide whether to try OCR or not. If the text is
shorter than this value, OCR is run and both results are compared. It
was set to 10, which is just one or two words. Since docspell deals
with documents, that value was too low.
2020-08-01 15:46:00 +02:00
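A minimal sketch of the decision described in the commit above, written in Python for illustration only. The function name `choose_text` and the `run_ocr` callback are hypothetical, and the commit does not say how the two results are compared; keeping the longer text is an assumption made here.

```python
from typing import Callable

MIN_TEXT_LEN = 500  # default value named in the commit message


def choose_text(
    embedded_text: str,
    run_ocr: Callable[[], str],
    min_text_len: int = MIN_TEXT_LEN,
) -> str:
    """Return the embedded text, or fall back to OCR when it is too short."""
    stripped = embedded_text.strip()
    if len(stripped) >= min_text_len:
        # Enough text was extracted directly, so OCR is skipped entirely.
        return stripped
    ocr_text = run_ocr().strip()
    # Assumption: "compared" is modelled as keeping the longer result.
    return ocr_text if len(ocr_text) > len(stripped) else stripped
```

With the previous threshold of 10, a document whose text layer holds only a couple of words would already skip OCR; with 500, such a document is run through OCR as well.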
cec4948710
Add pdf meta data to extracted text to add it to full-text index
2020-07-19 01:07:49 +02:00
209c068436
Use keywords in pdfs to search for existing tags
...
During processing, keywords stored in PDF metadata are looked up in
the tag database, and any existing tags that match are associated with
the item.
See #175
2020-07-19 00:28:04 +02:00
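The lookup described in the commit above could look roughly like the following Python sketch. The function name `existing_tags_for_keywords` is hypothetical, and two details are assumptions not stated in the commit: that keywords are separated by commas or semicolons, and that matching against tag names ignores case. Keywords without a matching tag are dropped, so no new tags are created.

```python
import re


def existing_tags_for_keywords(keywords: str, tag_names: list[str]) -> list[str]:
    """Return the existing tag names that match keywords from PDF metadata."""
    # Assumption: tag names are matched case-insensitively.
    by_key = {name.casefold(): name for name in tag_names}
    found: list[str] = []
    # Assumption: keywords are separated by commas or semicolons.
    for raw in re.split(r"[,;]", keywords):
        tag = by_key.get(raw.strip().casefold())
        if tag is not None and tag not in found:
            found.append(tag)
    return found


print(existing_tags_for_keywords("Invoice, unknown; TAX 2020", ["invoice", "Tax 2020"]))
```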
bd20165d1a
Use given folder-id when adding initial fts docs
2020-07-18 23:04:01 +02:00
3d49ceaab5
Use ocrmypdf tool to create pdf/a during conversion
...
- Use another external tool to convert pdf to pdf, which also adds the
extracted text as another layer to the pdf.
- Although not used, the external conversion routine now checks
for an existing text file that is named like the pdf file with extension
`.txt`. If present, it is included in the conversion result and will be
used as the extracted text.
- Text extraction for pdf files now happens on the converted file,
because it may already contain the text from the conversion step, which
avoids running OCR twice.
- Errors during conversion are never fatal; processing continues
without a converted file.
2020-07-18 17:19:29 +02:00
5b01c93711
Add a folder-id to item processing
...
This allows defining a folder when uploading files. All generated
items are associated with this folder on creation.
2020-07-14 23:18:39 +02:00
259526a088
Organize imports
2020-07-12 13:51:52 +02:00
22fa1dba13
Apply folder restriction to fulltext only search
...
And update index when folder changes.
2020-07-12 13:50:45 +02:00
aeba4ba913
Refactor full-text migrations and add folder to solr schema
2020-07-12 13:50:14 +02:00
e387b5513f
Remove items in non-member folders from sql search results
2020-07-11 22:25:56 +02:00
752a94a9e2
Implement space operations
2020-07-11 01:30:28 +02:00
347a029af8
Scalafix organize-imports
2020-06-28 21:20:47 +02:00