ab1139523a
Let the convert-all task retry when pdf conversion fails
2020-10-26 23:39:26 +01:00
b59696a9d3
Make sure to only remove/retry items in premature states
2020-10-26 23:39:26 +01:00
26e89bf84e
Edit org/person/equipment of multiple items
2020-10-26 13:35:47 +01:00
2e6026b817
Edit dates of multiple items
2020-10-26 13:16:03 +01:00
3c0b86cb19
Fix regex patterns used for NER
...
Patterns are split on whitespace by the nlp library and then compiled,
so each "word" must be a valid regex.
Fixes: #356
2020-10-21 00:55:14 +02:00
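The constraint described in the "Fix regex patterns used for NER" entry above (patterns are split on whitespace, then each piece is compiled on its own) can be illustrated with a small sketch. This is not the project's code: Python's `re` module stands in for the regex engine of the nlp library, and the helper name `invalid_tokens` is hypothetical.

```python
import re


def invalid_tokens(pattern: str) -> list[str]:
    """Return the whitespace-separated pieces of `pattern` that fail to compile.

    Assumption: the consumer splits on whitespace first and compiles each
    piece separately, as the commit message describes.
    """
    bad = []
    for token in pattern.split():
        try:
            re.compile(token)
        except re.error:
            bad.append(token)
    return bad


# The whole string is a valid regex, but its pieces "(Corp" and "Ltd)" are not.
print(invalid_tokens("Acme (Corp Ltd)"))
# Escaping the parentheses makes every piece valid on its own.
print(invalid_tokens(r"Acme \(Corp Ltd\)"))
```

A pattern can therefore be valid as a whole and still fail once it is split, which is why each "word" has to be checked individually.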
3f697f51aa
Autoformat
2020-10-06 23:31:09 +02:00
d4354b8b49
Skip pdf conversion if a converted file exists
...
For images, the conversion also returns the extracted text. If saving
that text failed, it is extracted again in the following
text-extraction step.
2020-10-02 17:39:39 +02:00
b6f23b038a
Fix finding attachments for retries
...
The attachments to process again must be searched in sources and
archives, too.
2020-10-02 17:39:34 +02:00
5e21552358
Don't do duplicate check on retries
2020-10-02 16:50:52 +02:00
f6f63000be
Prepend a duplicate check when uploading files
2020-09-23 23:37:00 +02:00
c658677032
Autoformat
2020-09-09 00:29:32 +02:00
76ccfb8a81
Only learn from confirmed items
...
Text classification should only learn from confirmed items. Log if
classification is disabled when processing an item.
2020-09-07 13:04:40 +02:00
4309bd8dfd
Some cleanup
2020-09-02 21:22:30 +02:00
237b960625
Guess a tag on item processing using a trained model if available
2020-09-02 18:28:14 +02:00
316b490008
Implement learning a text classifier from collective data
2020-09-02 18:28:14 +02:00
68bb65572b
Integrate learn-classifier task into the app
2020-09-02 18:28:14 +02:00
0c97b4ef76
Initial impl of a text classifier based on stanford-nlp
2020-09-02 18:28:14 +02:00
8c4f2e702b
Add classifier settings
2020-09-02 18:28:14 +02:00
3473cbb773
Use collective data with NER annotation
2020-08-25 20:40:44 +02:00
96d2f948f2
Use collective's addressbook to configure regexner
2020-08-24 14:40:52 +02:00
8628a0a8b3
Allow configuring stanford-ner and cache based on collective
2020-08-24 10:55:59 +02:00
3986487f11
Add api docs and cleanup
2020-08-13 21:22:54 +02:00
41ea071555
Add a task to convert all pdfs that have not been converted
2020-08-13 01:06:13 +02:00
07e9a9767e
Add a task to re-process files of an item
2020-08-12 22:29:56 +02:00
09d74b7e80
Return item notes with search results
...
To keep the response from growing very large, an admin can define a
limit on how much to return.
2020-08-05 00:09:37 +02:00
45b0deeced
Print solr url on start
...
This is useful for seeing which url has been selected, same as for the
db connection.
2020-08-01 15:59:14 +02:00
1fc57fc2b2
Set default value for min-text-len to 500
...
This value is used to decide whether to try OCR or not. If the
extracted text is shorter than this value, OCR is run and both results
are compared. It was set to 10, which is just one or two words. Since
docspell deals with documents, this value is too low.
2020-08-01 15:46:00 +02:00
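The role of `min-text-len` in the entry above can be sketched as follows. This is a minimal illustration, not the project's implementation. The function name `choose_text` and the `run_ocr` callback are hypothetical, and the rule used to compare the two results (keep the longer text) is an assumption, since the commit message does not say how they are compared.

```python
from typing import Callable

MIN_TEXT_LEN = 500  # default value named in the commit message


def choose_text(
    extracted: str,
    run_ocr: Callable[[], str],
    min_text_len: int = MIN_TEXT_LEN,
) -> str:
    """Return the text to keep for a file.

    OCR is only attempted when the directly extracted text is shorter
    than `min_text_len`.
    """
    text = extracted.strip()
    if len(text) >= min_text_len:
        return text  # enough text was found, skip the expensive OCR run
    ocr_text = run_ocr().strip()
    # Assumption: "comparing both results" means keeping the longer one.
    return ocr_text if len(ocr_text) > len(text) else text
```

With the old threshold of 10, a file containing only a short stamp or header would have skipped OCR; with 500, such a file is sent through OCR as well.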
cec4948710
Add pdf meta data to extracted text to add it to full-text index
2020-07-19 01:07:49 +02:00
209c068436
Use keywords in pdfs to search for existing tags
...
During processing, keywords stored in the PDF metadata are looked up
in the tag database and any matching existing tags are associated with
the item.
See #175
2020-07-19 00:28:04 +02:00
bd20165d1a
Use given folder-id when adding initial fts docs
2020-07-18 23:04:01 +02:00
3d49ceaab5
Use ocrmypdf tool to create pdf/a during conversion
...
- Use another external tool to convert pdf to pdf, which also adds the
extracted text as another layer to the pdf.
- Although not used, the external conversion routine now checks
for an existing text file named like the pdf file with the extension
`.txt`. If present, it is included in the conversion result and
used as the extracted text.
- Text extraction for pdf files now happens on the converted file,
because it may already contain the text from the conversion step,
which avoids running OCR twice.
- Errors during conversion are not fatal; processing continues
without a converted file.
2020-07-18 17:19:29 +02:00
5b01c93711
Add a folder-id to item processing
...
This allows defining a folder when uploading files. All generated
items are associated with this folder on creation.
2020-07-14 23:18:39 +02:00
259526a088
Organize imports
2020-07-12 13:51:52 +02:00
22fa1dba13
Apply folder restriction to fulltext only search
...
And update index when folder changes.
2020-07-12 13:50:45 +02:00
aeba4ba913
Refactor full-text migrations and add folder to solr schema
2020-07-12 13:50:14 +02:00
e387b5513f
Remove items in non-member folders from sql search results
2020-07-11 22:25:56 +02:00
752a94a9e2
Implement space operations
2020-07-11 01:30:28 +02:00
347a029af8
Scalafix organize-imports
2020-06-28 21:20:47 +02:00
41c0f70d3b
Fix cancelling jobs
...
A request to cancel a job was not processed correctly. The cancelling
routine of a task must run, regardless of the (non-final) state. Now
it works like this: if a job is currently running, it is interrupted
and its cancel routine is invoked. It then enters "cancelled" state.
If it is stuck, it is loaded and only its cancel routine is run. If it
is in a final state or waiting, it is removed from the queue.
2020-06-26 23:08:27 +02:00
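The cancel behaviour described in the "Fix cancelling jobs" entry above can be summarised as a small decision table. This is only a sketch: apart from "cancelled", the state names and the step names are hypothetical labels chosen for illustration, not identifiers from the project.

```python
from enum import Enum


class JobState(Enum):
    WAITING = "waiting"
    RUNNING = "running"
    STUCK = "stuck"
    SUCCESS = "success"      # hypothetical final state
    FAILED = "failed"        # hypothetical final state
    CANCELLED = "cancelled"  # final state named in the commit message


def cancel_steps(state: JobState) -> list[str]:
    """Return the steps taken when a cancel request arrives for a job."""
    if state is JobState.RUNNING:
        # Interrupt the job, invoke its cancel routine, then mark it cancelled.
        return ["interrupt", "run_cancel_routine", "set_cancelled"]
    if state is JobState.STUCK:
        # Nothing is running, so the job is loaded only to run its cancel routine.
        return ["load_job", "run_cancel_routine"]
    # Waiting jobs and jobs in a final state are removed from the queue.
    return ["remove_from_queue"]


for s in JobState:
    print(s.value, "->", cancel_steps(s))
```

The point of the fix is visible in the first two branches: the cancel routine runs for every job that is neither waiting nor in a final state.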
d79ae6233a
Restrict proposals for due date
...
Avoid dates too far in the future.
2020-06-26 16:58:17 +02:00
91da3b149e
Reducing default retries to 2
...
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these cases, so the default is now set to a lower
value to avoid long wait times until an item arrives.
2020-06-25 23:57:01 +02:00
dc8f1a0387
Fix global re-index task to re-create the schema
...
Otherwise new instances could not be re-indexed.
2020-06-25 23:02:06 +02:00
14213c4c27
Allow some solr query options in the config file
2020-06-24 23:37:20 +02:00
532caed84c
Consistent logging of request/responses to solr
...
Using a middleware. Also add missing changesets for mariadb.
2020-06-24 21:25:46 +02:00
47697a8056
Set some logs to trace
2020-06-24 01:16:13 +02:00
e06a3f8fdd
ScalafmtAll
2020-06-23 00:18:59 +02:00
ffbb16db45
Transport highlighting information to the client
2020-06-23 00:17:29 +02:00
cfe5aa8894
Use no-op fts-client if disabled + push this flag to the webui
2020-06-21 21:06:08 +02:00
0d8b03fc61
Add backend operations for re-creating the full-text index
2020-06-21 15:46:51 +02:00
14ea4091c4
Renaming things
2020-06-21 13:15:02 +02:00