Commit Graph

98 Commits

Author SHA1 Message Date
Eike Kettner
1fc57fc2b2 Set default value for min-text-len to 500
This value is used to decide whether to try OCR or not. If text is
below this value, OCR is run and both results are compared. It was set
to 10, which is just one or two words. Since the context for docspell
are documents, this value is too low.
2020-08-01 15:46:00 +02:00
Eike Kettner
cec4948710 Add pdf meta data to extracted text to add it to full-text index 2020-07-19 01:07:49 +02:00
Eike Kettner
209c068436 Use keywords in pdfs to search for existing tags
During processing, keywords stored in PDF metadata are used to look
them up in the tag database and associate any existing tags to the
item.

See #175
2020-07-19 00:28:04 +02:00
Eike Kettner
bd20165d1a Use given folder-id when adding initial fts docs 2020-07-18 23:04:01 +02:00
Eike Kettner
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- text extraction for pdf files happens now on the converted file,
  because it may already contain the text from the conversion step and
  thus avoids running OCR twice.

- All errors during conversion are not fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
Eike Kettner
5b01c93711 Add a folder-id to item processing
This allows to define a folder when uploading files. All generated
items are associated to this folder on creation.
2020-07-14 23:18:39 +02:00
Eike Kettner
259526a088 Organize imports 2020-07-12 13:51:52 +02:00
Eike Kettner
22fa1dba13 Apply folder restriction to fulltext only search
And update index when folder changes.
2020-07-12 13:50:45 +02:00
Eike Kettner
aeba4ba913 Refactor full-text migrations and add folder to solr schema 2020-07-12 13:50:14 +02:00
Eike Kettner
e387b5513f Remove items in non-member folders from sql search results 2020-07-11 22:25:56 +02:00
Eike Kettner
752a94a9e2 Implement space operations 2020-07-11 01:30:28 +02:00
Eike Kettner
347a029af8 Scalafix organize-imports 2020-06-28 21:20:47 +02:00
Eike Kettner
41c0f70d3b Fix cancelling jobs
A request to cancel a job was not processed correctly. The cancelling
routine of a task must run, regardless of the (non-final) state. Now
it works like this: if a job is currently running, it is interrupted
and its cancel routine is invoked. It then enters "cancelled" state.
If it is stuck, it is loaded and only its cancel routine is run. If it
is in a final state or waiting, it is removed from the queue.
2020-06-26 23:08:27 +02:00
Eike Kettner
d79ae6233a Restrict proposals for due date
Avoid dates too far in the future.
2020-06-26 16:58:17 +02:00
Eike Kettner
91da3b149e Reducing default retries to 2
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these states so it is now set to a lower value
to have not long wait times until an item arrives.
2020-06-25 23:57:01 +02:00
Eike Kettner
dc8f1a0387 Fix global re-index task to re-create the schema
Otherwise new instances could not be re-indexed.
2020-06-25 23:02:06 +02:00
Eike Kettner
14213c4c27 Allow some solr query options in the config file 2020-06-24 23:37:20 +02:00
Eike Kettner
532caed84c Consistent logging of request/responses to solr
Using a middleware. Also add missing changesets for mariadb.
2020-06-24 21:25:46 +02:00
Eike Kettner
47697a8056 Set some logs to trace 2020-06-24 01:16:13 +02:00
Eike Kettner
e06a3f8fdd ScalafmtAll 2020-06-23 00:18:59 +02:00
Eike Kettner
ffbb16db45 Transport highlighting information to the client 2020-06-23 00:17:29 +02:00
Eike Kettner
cfe5aa8894 Use no-op fts-client if disabled + push this flag to the webui 2020-06-21 21:06:08 +02:00
Eike Kettner
0d8b03fc61 Add backend operations for re-creating the full-text index 2020-06-21 15:46:51 +02:00
Eike Kettner
14ea4091c4 Renaming things 2020-06-21 13:15:02 +02:00
Eike Kettner
2f6e531c45 Refactoring index migration task 2020-06-21 01:37:23 +02:00
Eike Kettner
1f4ff0d4c4 Add language to schema, extend fts-client 2020-06-20 22:44:47 +02:00
Eike Kettner
2a0bf24088 Setup solr schema and index all data using a system task
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
2020-06-19 21:37:22 +02:00
Eike Kettner
60c079f664 Add task to index current database state 2020-06-18 22:38:45 +02:00
Eike Kettner
146d1b0562 Make data to index more flexible and extensible 2020-06-17 23:20:46 +02:00
Eike Kettner
522daaf57e Introducing fts client into codebase 2020-06-17 23:20:46 +02:00
Eike Kettner
897d91475e Update scalafmt-core to 2.6.0 2020-06-17 19:53:56 +02:00
Eike Kettner
7a3d2e4dc6 Extract OItemSearch from OItem 2020-06-15 23:13:48 +02:00
Eike Kettner
e5b90eff34 Allow client to load items in batches 2020-06-06 11:05:15 +02:00
Eike Kettner
4b0eb650f2 Rename package to avoid name clashes 2020-05-25 16:22:09 +02:00
Eike Kettner
56624515a5 ScalafmtAll 2020-05-25 13:56:06 +02:00
Eike Kettner
ee394eae86 Try streamline the different impls for MimeType 2020-05-25 09:24:24 +02:00
Eike Kettner
4694433e38 Fix attachment positions
It worked for new items, because the implicit offset was 0. when
adding archives to existing items, there are already attachments and
the new attachments are added to the end. This won't work if files are
added concurrently, because there is no quick and reliable way to
determine the offset then.
2020-05-24 15:13:30 +02:00
Eike Kettner
1dde43e092 Only process attachments in task arguments
When files are added to an item, the attachments already present must
not be "re-processed".
2020-05-24 13:29:38 +02:00
Eike Kettner
4e49c78e72 Change some log levels of item processing task 2020-05-24 12:54:35 +02:00
Eike Kettner
f4949446e3 Allow to specify an item id to amend files to existing items 2020-05-23 20:15:55 +02:00
Eike Kettner
25d089da6c Update state and proposals only on invalid items
Invalid items are those that are not ready, and not shown to the user.
When changing metadata, it should only be changed, if the item was not
already shown to the user.
2020-05-23 15:46:24 +02:00
Eike Kettner
855d4eefa8 Set progress in a linear way between each step 2020-05-23 15:33:58 +02:00
Eike Kettner
d9782582d8 Use max-mails setting with higher priority
The `mail-chunk-size` is set to its configured value or `max-mails`
whichever is lower.
2020-05-20 22:44:29 +02:00
Eike Kettner
c0259dba7e Allow to enable debug flag for javamail 2020-05-20 22:15:25 +02:00
Eike Kettner
2858d6b853 Notify job executors at the end of the task 2020-05-20 19:44:45 +02:00
Eike Kettner
31a1abf395 Add server limits to importing mails task 2020-05-20 17:52:38 +02:00
Eike Kettner
f2d67dc816 Initial impl of import from mailbox user task 2020-05-20 17:52:38 +02:00
Eike Kettner
852455c610 Add upload operation to task arguments 2020-05-20 17:52:38 +02:00
Eike Kettner
a4be63fd77 Add stub for scan-mailbox task 2020-05-20 17:52:38 +02:00
Eike Kettner
d65c1e0d36 Use date from e-mails to set item date 2020-05-17 11:58:51 +02:00