Eike Kettner
a670bbb6c2
Make idle interval when clearing nlp cache configurable
2021-01-06 23:03:00 +01:00
Eike Kettner
b08e88cd69
Add (unofficial) routes to get system information
2021-01-05 20:54:53 +01:00
Eike Kettner
611e480eb4
Use more prominent log line to indicate start of processing
...
Issue: #530
2021-01-02 21:47:54 +01:00
Eike Kettner
97dfcece97
Fix duplicate check on restarts
...
Issue: #530
2021-01-02 21:18:05 +01:00
Eike Kettner
2dff686fa0
Introduce unit condition
2020-12-15 21:03:47 +01:00
Eike Kettner
80406cabc2
Refactoring some code into separate files
2020-12-15 21:03:47 +01:00
Eike Kettner
5e2c5d2a50
Extends query builder
2020-12-15 21:03:46 +01:00
Eike Kettner
35c62049f5
Start converting QItem
2020-12-15 21:03:46 +01:00
Eike Kettner
613696539f
Minor refactorings
2020-12-15 21:03:46 +01:00
Eike Kettner
e3f6892abd
Convert job record
2020-12-15 21:03:46 +01:00
Eike Kettner
3cef932ccd
Convert more records
2020-12-15 21:03:46 +01:00
Eike Kettner
10b49fccf8
Converting user and userimap records
2020-12-15 21:03:46 +01:00
Eike Kettner
f5ae389eea
Cleanup remember-me tokens periodically
2020-12-04 17:59:25 +01:00
Eike Kettner
290989f67f
Reorder correspondent person suggestion based on org relationship
2020-12-01 23:39:45 +01:00
Eike Kettner
3fabe0a582
Update to Scala 2.13.4
2020-11-27 20:26:24 +01:00
Eike Kettner
5fe532001b
Allow specifying the document language with the request
2020-11-23 20:49:01 +01:00
Eike Kettner
5034e12bec
Add a subject filter to scan-mailbox args
2020-11-13 23:15:20 +01:00
mergify[bot]
e5ce1fd45f
Merge pull request #437 from eikek/upload-improvements
...
Upload improvements
2020-11-12 22:58:08 +00:00
Eike Kettner
4fd6e02ec0
Improve glob and filter archive entries
2020-11-11 21:01:23 +01:00
Eike Kettner
27eb5d70de
Apply given tags in processing step
...
Issue: #346
2020-11-11 21:01:23 +01:00
Eike Kettner
55a6f7aaf6
Add more properties to upload meta data
2020-11-11 21:01:23 +01:00
Eike Kettner
746e04c624
Improve logging when creating preview images
2020-11-10 22:25:46 +01:00
Eike Kettner
10305bc82d
Minor improvements
2020-11-09 21:16:53 +01:00
Eike Kettner
29455d638c
Add startup task to find page counts of existing files
2020-11-09 20:35:35 +01:00
Eike Kettner
a77f34b7ba
Add a processing step to retrieve page counts
2020-11-09 11:08:24 +01:00
Eike Kettner
f4e50c5229
Provide endpoints to submit tasks to re-generate previews
...
The scaling factor can be given in the config file. When this changes,
images can be regenerated via POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all within a
collective.
2020-11-09 09:00:02 +01:00
Eike Kettner
6037b54959
Don't fail processing if generating preview fails
2020-11-09 00:05:11 +01:00
Eike Kettner
709848244c
Create tasks to generate all previews
...
There is a task to generate preview images per attachment. It can
either add them (if not present yet) or overwrite them (e.g. some
config has changed).
There is a task that selects all attachments without previews and
submits a task to create it. This is submitted on start automatically
to generate previews for all existing attachments.
2020-11-08 23:46:02 +01:00
Eike Kettner
7ba6baf6f0
Make preview image smaller
2020-11-08 15:12:56 +01:00
Eike Kettner
6db5c39d78
Fix converted filename
...
Mark it by default with a string from the config file.
Issue: #397
2020-11-08 09:45:03 +01:00
Eike Kettner
ef7cb4e779
Create a preview image of all files during processing
2020-11-08 01:25:59 +01:00
Eike Kettner
ab1139523a
Let the convert-all task retry when pdf conversion fails
2020-10-26 23:39:26 +01:00
Eike Kettner
b59696a9d3
Make sure to only remove/retry items in premature states
2020-10-26 23:39:26 +01:00
Eike Kettner
26e89bf84e
Edit org/person/equipment of multiple items
2020-10-26 13:35:47 +01:00
Eike Kettner
2e6026b817
Edit dates of multiple items
2020-10-26 13:16:03 +01:00
Eike Kettner
3c0b86cb19
Fix regex patterns used for NER
...
Patterns are split on whitespace by the nlp library and then compiled,
so each "word" must be a valid regex.
Fixes: #356
2020-10-21 00:55:14 +02:00
Eike Kettner
3f697f51aa
Autoformat
2020-10-06 23:31:09 +02:00
Eike Kettner
d4354b8b49
Skip pdf conversion if a converted file exists
...
For images the conversion also returns the extracted text. If saving
it failed, the text is extracted in the following text-extraction
step.
2020-10-02 17:39:39 +02:00
Eike Kettner
b6f23b038a
Fix finding attachments for retries
...
The attachments to process again must be searched in sources and
archives, too.
2020-10-02 17:39:34 +02:00
Eike Kettner
5e21552358
Don't do duplicate check on retries
2020-10-02 16:50:52 +02:00
Eike Kettner
f6f63000be
Prepend a duplicate check when uploading files
2020-09-23 23:37:00 +02:00
Eike Kettner
c658677032
Autoformat
2020-09-09 00:29:32 +02:00
Eike Kettner
76ccfb8a81
Only learn from confirmed items
...
Text classification should only learn from confirmed items. Log if
classification is disabled when processing an item.
2020-09-07 13:04:40 +02:00
Eike Kettner
4309bd8dfd
Some cleanup
2020-09-02 21:22:30 +02:00
Eike Kettner
237b960625
Guess a tag on item processing using a trained model if available
2020-09-02 18:28:14 +02:00
Eike Kettner
316b490008
Implement learning a text classifier from collective data
2020-09-02 18:28:14 +02:00
Eike Kettner
68bb65572b
Integrate learn-classifier task into the app
2020-09-02 18:28:14 +02:00
Eike Kettner
0c97b4ef76
Initial impl of a text classifier based on stanford-nlp
2020-09-02 18:28:14 +02:00
Eike Kettner
8c4f2e702b
Add classifier settings
2020-09-02 18:28:14 +02:00
Eike Kettner
3473cbb773
Use collective data with NER annotation
2020-08-25 20:40:44 +02:00
Eike Kettner
96d2f948f2
Use collective's addressbook to configure regexner
2020-08-24 14:40:52 +02:00
Eike Kettner
8628a0a8b3
Allow configuring stanford-ner and cache based on collective
2020-08-24 10:55:59 +02:00
Eike Kettner
3986487f11
Add api docs and cleanup
2020-08-13 21:22:54 +02:00
Eike Kettner
41ea071555
Add a task to convert all pdfs that have not been converted
2020-08-13 01:06:13 +02:00
Eike Kettner
07e9a9767e
Add a task to re-process files of an item
2020-08-12 22:29:56 +02:00
Eike Kettner
09d74b7e80
Return item notes with search results
...
To keep the response from growing very large, an admin can define a
limit on how much to return.
2020-08-05 00:09:37 +02:00
Eike Kettner
45b0deeced
Print solr url on start
...
This is useful info to see which url has been selected, same as db
connection.
2020-08-01 15:59:14 +02:00
Eike Kettner
1fc57fc2b2
Set default value for min-text-len to 500
...
This value is used to decide whether to try OCR or not. If text is
below this value, OCR is run and both results are compared. It was set
to 10, which is just one or two words. Since docspell deals with
documents, this value is too low.
2020-08-01 15:46:00 +02:00
Eike Kettner
cec4948710
Add pdf meta data to extracted text to add it to full-text index
2020-07-19 01:07:49 +02:00
Eike Kettner
209c068436
Use keywords in pdfs to search for existing tags
...
During processing, keywords stored in PDF metadata are used to look
them up in the tag database and associate any existing tags to the
item.
See #175
2020-07-19 00:28:04 +02:00
Eike Kettner
bd20165d1a
Use given folder-id when adding initial fts docs
2020-07-18 23:04:01 +02:00
Eike Kettner
3d49ceaab5
Use ocrmypdf tool to create pdf/a during conversion
...
- Use another external tool to convert pdf to pdf which also adds the
extracted text as another layer into the pdf
- Although not used, the external conversion routine will now check
for an existing text file that is named as the pdf file with extension
`.txt`. If present it is included in the conversion result and will be
used as the extracted text.
- Text extraction for pdf files now happens on the converted file,
because it may already contain the text from the conversion step and
thus avoids running OCR twice.
- Errors during conversion are not fatal; processing continues
without a converted file.
2020-07-18 17:19:29 +02:00
Eike Kettner
5b01c93711
Add a folder-id to item processing
...
This allows defining a folder when uploading files. All generated
items are associated with this folder on creation.
2020-07-14 23:18:39 +02:00
Eike Kettner
259526a088
Organize imports
2020-07-12 13:51:52 +02:00
Eike Kettner
22fa1dba13
Apply folder restriction to fulltext only search
...
And update index when folder changes.
2020-07-12 13:50:45 +02:00
Eike Kettner
aeba4ba913
Refactor full-text migrations and add folder to solr schema
2020-07-12 13:50:14 +02:00
Eike Kettner
e387b5513f
Remove items in non-member folders from sql search results
2020-07-11 22:25:56 +02:00
Eike Kettner
752a94a9e2
Implement space operations
2020-07-11 01:30:28 +02:00
Eike Kettner
347a029af8
Scalafix organize-imports
2020-06-28 21:20:47 +02:00
Eike Kettner
41c0f70d3b
Fix cancelling jobs
...
A request to cancel a job was not processed correctly. The cancelling
routine of a task must run, regardless of the (non-final) state. Now
it works like this: if a job is currently running, it is interrupted
and its cancel routine is invoked. It then enters "cancelled" state.
If it is stuck, it is loaded and only its cancel routine is run. If it
is in a final state or waiting, it is removed from the queue.
2020-06-26 23:08:27 +02:00
Eike Kettner
d79ae6233a
Restrict proposals for due date
...
Avoid dates too far in the future.
2020-06-26 16:58:17 +02:00
Eike Kettner
91da3b149e
Reducing default retries to 2
...
Many errors cannot be recovered from by retrying. There is currently
no way to distinguish these cases, so the default is now set to a lower
value to avoid long wait times until an item arrives.
2020-06-25 23:57:01 +02:00
Eike Kettner
dc8f1a0387
Fix global re-index task to re-create the schema
...
Otherwise new instances could not be re-indexed.
2020-06-25 23:02:06 +02:00
Eike Kettner
14213c4c27
Allow some solr query options in the config file
2020-06-24 23:37:20 +02:00
Eike Kettner
532caed84c
Consistent logging of request/responses to solr
...
Using a middleware. Also add missing changesets for mariadb.
2020-06-24 21:25:46 +02:00
Eike Kettner
47697a8056
Set some logs to trace
2020-06-24 01:16:13 +02:00
Eike Kettner
e06a3f8fdd
ScalafmtAll
2020-06-23 00:18:59 +02:00
Eike Kettner
ffbb16db45
Transport highlighting information to the client
2020-06-23 00:17:29 +02:00
Eike Kettner
cfe5aa8894
Use no-op fts-client if disabled + push this flag to the webui
2020-06-21 21:06:08 +02:00
Eike Kettner
0d8b03fc61
Add backend operations for re-creating the full-text index
2020-06-21 15:46:51 +02:00
Eike Kettner
14ea4091c4
Renaming things
2020-06-21 13:15:02 +02:00
Eike Kettner
2f6e531c45
Refactoring index migration task
2020-06-21 01:37:23 +02:00
Eike Kettner
1f4ff0d4c4
Add language to schema, extend fts-client
2020-06-20 22:44:47 +02:00
Eike Kettner
2a0bf24088
Setup solr schema and index all data using a system task
...
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
2020-06-19 21:37:22 +02:00
Eike Kettner
60c079f664
Add task to index current database state
2020-06-18 22:38:45 +02:00
Eike Kettner
146d1b0562
Make data to index more flexible and extensible
2020-06-17 23:20:46 +02:00
Eike Kettner
522daaf57e
Introducing fts client into codebase
2020-06-17 23:20:46 +02:00
Eike Kettner
897d91475e
Update scalafmt-core to 2.6.0
2020-06-17 19:53:56 +02:00
Eike Kettner
7a3d2e4dc6
Extract OItemSearch from OItem
2020-06-15 23:13:48 +02:00
Eike Kettner
e5b90eff34
Allow client to load items in batches
2020-06-06 11:05:15 +02:00
Eike Kettner
4b0eb650f2
Rename package to avoid name clashes
2020-05-25 16:22:09 +02:00
Eike Kettner
56624515a5
ScalafmtAll
2020-05-25 13:56:06 +02:00
Eike Kettner
ee394eae86
Try to streamline the different impls for MimeType
2020-05-25 09:24:24 +02:00
Eike Kettner
4694433e38
Fix attachment positions
...
It worked for new items, because the implicit offset was 0. When
adding archives to existing items, there are already attachments and
the new attachments are added to the end. This won't work if files are
added concurrently, because there is no quick and reliable way to
determine the offset then.
2020-05-24 15:13:30 +02:00
Eike Kettner
1dde43e092
Only process attachments in task arguments
...
When files are added to an item, the attachments already present must
not be "re-processed".
2020-05-24 13:29:38 +02:00
Eike Kettner
4e49c78e72
Change some log levels of item processing task
2020-05-24 12:54:35 +02:00
Eike Kettner
f4949446e3
Allow specifying an item id to add files to existing items
2020-05-23 20:15:55 +02:00
Eike Kettner
25d089da6c
Update state and proposals only on invalid items
...
Invalid items are those that are not ready, and not shown to the user.
When changing metadata, it should only be changed, if the item was not
already shown to the user.
2020-05-23 15:46:24 +02:00
Eike Kettner
855d4eefa8
Set progress in a linear way between each step
2020-05-23 15:33:58 +02:00
Eike Kettner
d9782582d8
Use `max-mails` setting with higher priority
...
The `mail-chunk-size` is set to its configured value or `max-mails`,
whichever is lower.
2020-05-20 22:44:29 +02:00
Eike Kettner
c0259dba7e
Allow to enable debug flag for javamail
2020-05-20 22:15:25 +02:00
Eike Kettner
2858d6b853
Notify job executors at the end of the task
2020-05-20 19:44:45 +02:00
Eike Kettner
31a1abf395
Add server limits to importing mails task
2020-05-20 17:52:38 +02:00
Eike Kettner
f2d67dc816
Initial impl of import from mailbox user task
2020-05-20 17:52:38 +02:00
Eike Kettner
852455c610
Add upload operation to task arguments
2020-05-20 17:52:38 +02:00
Eike Kettner
a4be63fd77
Add stub for scan-mailbox task
2020-05-20 17:52:38 +02:00
Eike Kettner
d65c1e0d36
Use date from e-mails to set item date
2020-05-17 11:58:51 +02:00
Eike Kettner
3e10e2175a
Sort by weights better and save them
2020-05-17 11:58:51 +02:00
Scala Steward
5d6658770e
Update emil-common, emil-doobie, ... to 0.6.0
2020-05-17 11:55:53 +02:00
Eike Kettner
6747a86fea
Simplify jsoup sanitizer to reuse from emil
2020-05-14 23:56:08 +02:00
Eike Kettner
9c882e1be9
Fix package name
2020-05-10 21:03:12 +02:00
Eike Kettner
bd5066740d
Joex depends on backend module
...
The job executor depends on backend module, since it may control the
application via user tasks. The `ONode` can now be moved from the
store module into the backend module.
2020-05-10 21:03:12 +02:00
Eike Kettner
c41cdeefec
Update scalafmt to 2.5.1 + scalafmtAll
2020-05-04 23:53:57 +02:00
Eike Kettner
0a1b3fcf95
Set list-id header for notification mails
2020-04-30 21:23:56 +02:00
Eike Kettner
75a66ecb86
Update http4s to 0.21.4
2020-04-29 01:05:13 +02:00
Eike Kettner
fa10fe3fae
Update scala to 2.13.2
2020-04-24 22:24:31 +02:00
Eike Kettner
315ea63f44
Improve notify mail template
2020-04-23 23:17:34 +02:00
Eike Kettner
84e0ebf1a2
Add a flag for restricting overdue items
2020-04-23 21:37:03 +02:00
Eike Kettner
d52efdfcf0
Improve mail template
2020-04-22 23:41:09 +02:00
Eike Kettner
ffc1cdee51
Sort due items by their earliest due date
2020-04-22 22:21:28 +02:00
Eike Kettner
e1f9ae2629
Include links to items into mail template
2020-04-22 21:53:25 +02:00
Eike Kettner
2723d6b43b
Implement notify-due-items task
2020-04-22 21:08:45 +02:00
Eike Kettner
ad772c0c25
Server-side stub impl for notify-due-items
2020-04-22 21:08:45 +02:00
Eike Kettner
1206105f0b
Fix several bugs with handling e-mail files
...
- When converting from html->pdf, the wkhtmltopdf program exits with
errors if the document contains invalid links. The content is now
cleaned before being handed to wkhtmltopdf.
- Update emil library which fixes a bug when reading mails without
explicit transfer encoding (8bit)
- Add an info header to converted mails
2020-04-07 22:38:25 +02:00
Eike Kettner
6a1297fc95
Add a limit for text analysis
2020-03-27 22:54:49 +01:00
Eike Kettner
9656ba62f4
scalafmtAll
2020-03-26 18:26:00 +01:00
Eike Kettner
09ea724c13
Store message-id of eml files
2020-03-25 22:00:51 +01:00
Eike Kettner
e305b46708
Extract tnef attachments and fix incomplete html
...
wkhtmltopdf requires the content encoding to be set correctly in the
document.
2020-03-24 23:40:29 +01:00
Eike Kettner
0b80572664
Fix encodings for mails with non-utf8 html parts
2020-03-24 23:40:29 +01:00
Eike Kettner
cf7ccd572c
Improve handling encodings
...
Html and text files are not fixed to be UTF-8. The encoding is now
detected, which may not work for all files. Default/fallback will be
utf-8.
There is still a problem with mails that contain html parts not in
utf8 encoding. The mail text is always returned as a string and the
original encoding is lost. Then the html is stored using utf-8 bytes,
but wkhtmltopdf reads it using latin1. It seems that the `--encoding`
setting doesn't override encoding provided by the document.
2020-03-23 22:51:28 +01:00
Eike Kettner
cba466ed47
Set item due date candidate
...
After processing, set the due date of an item to the first candidate.
The earliest due date is considered best match.
2020-03-20 22:39:09 +01:00
Eike Kettner
6b1156182c
Add support for eml (rfc822 email) files
2020-03-19 22:42:40 +01:00
Eike Kettner
4ed7a137f7
Add support for archive files
...
Each attachment is now first extracted into potentially multiple ones,
if it is recognized as an archive. This is the first step in
processing. The original archive file is also stored and the resulting
attachments are associated to their original archive.
First support is implemented for zip files.
2020-03-19 22:42:27 +01:00
Eike Kettner
f0449dd2ce
Properly initialize thread pools
2020-03-17 22:37:12 +01:00
Eike Kettner
00ca6b5697
Improve text analysis
...
- Search for consecutive labels
- Sort list of candidates by a weight
- Search for organizations using person labels
2020-03-17 22:34:50 +01:00
Eike Kettner
718e44a21c
Add cleanup jobs task
2020-03-09 20:24:00 +01:00
Eike Kettner
854a596da3
Integrate periodic tasks
...
The first use case for periodic task is the cleanup of expired
invitation keys. This is part of a house-keeping periodic task.
2020-03-08 22:49:49 +01:00
Eike Kettner
616c333fa5
Implement storage routines for periodic scheduler
2020-03-08 13:56:23 +01:00
Eike Kettner
1e598bd902
Sketch a scheduler for running periodic tasks
...
Periodic tasks are special in that they are usually kept around and
started based on a schedule. A new component checks periodic tasks and
submits them in the queue once they are due.
In order to avoid duplicate periodic jobs, the tracker of a job is
used to store the periodic job id. Each time a periodic task is due,
it is first checked if there is a job running (or queued) for this
task.
2020-03-08 12:55:03 +01:00
Eike Kettner
2f87065b2e
sbt scalafmtAll
2020-02-25 20:55:00 +01:00
Eike Kettner
ec419c7bfd
Adapt nix modules to new config
2020-02-22 12:40:56 +01:00
Eike Kettner
3f316ab4d0
Update config file doc
2020-02-20 21:10:00 +01:00
Eike Kettner
97305d27ff
Integrate support for more files into processing and upload
...
The restriction that only pdf files can be uploaded is removed. All
files can now be uploaded. The processing may not process all. It is
still possible to restrict file uploads by types via a configuration.
2020-02-19 23:27:00 +01:00
Eike Kettner
0dcc00836b
Make logger configurable in system commands
2020-02-18 12:02:43 +01:00
Eike Kettner
bd605b8c94
Add first drafts for converting
2020-02-18 01:31:22 +01:00
Eike Kettner
e0682464b5
Configure pdf extraction; move Logger and DataType to common
2020-02-17 14:01:36 +01:00
Eike Kettner
3d615181e0
Early draft for text extraction
2020-02-17 01:57:22 +01:00
Eike Kettner
851ee7ef0f
Reorganize processing code
...
Use separate modules for
- text extraction
- conversion to pdf
- text analysis
2020-02-15 21:25:25 +01:00
Eike Kettner
ce22b727b1
Add new convert module and sketch its integration
2020-02-11 00:33:52 +01:00
Eike Kettner
3be90d64d5
Move SystemCommand to common module
2020-02-10 22:23:06 +01:00
Eike Kettner
ba3865ef5e
Starting to support more file types
...
First, files are converted to PDF for archiving. It is also easier
to create a preview. This is done via the `ConvertPdf` processing
task (which is not yet implemented).
Text extraction then tries first with the original file. If that
fails, OCR is done on the (potentially) converted pdf file.
To not lose information from the original file, it is saved using the
table `attachment_source`. If the original file is already a pdf, or
the conversion did not succeed, the `attachment` and
`attachment_source` record point to the same file.
2020-02-10 12:42:45 +01:00
Eike Kettner
fc3e22e399
Apply scalafmt to all files
2019-12-30 21:44:13 +01:00
Eike Kettner
2ad1586d00
Set stricter compile options and fix cookie data
2019-09-28 22:17:45 +02:00
Eike Kettner
831cd8b655
Initial version.
...
Features:
- Upload PDF files and have them analyzed
- Manage meta data and items
- See processing in webapp
2019-09-21 22:02:36 +02:00
Eike Kettner
6154e6a387
Initial application stub
2019-09-21 14:54:03 +02:00