238 Commits

Author SHA1 Message Date
Eike Kettner
e9ed998e3a Basic poc to search via custom query 2021-03-01 00:51:01 +01:00
Eike Kettner
186014a1c6 Refactor search to separate between a base query and user query
`findBase` adds only the strictly required conditions. Everything
else comes from the user.
2021-03-01 00:51:01 +01:00
Eike Kettner
e6d9ce2c37 Remove obsolete type capabilities
These are now detected by the new Scala compiler and lead to compile
errors.
2021-03-01 00:16:30 +01:00
Eike Kettner
d7bc963450 Cleanup nodes that are not reachable anymore 2021-02-18 00:37:18 +01:00
Eike Kettner
48eee00c0b Allow person to be correspondent, concerning or both 2021-02-16 22:49:55 +01:00
Eike Kettner
d99ce76d89 Remove person suggestion if it doesn't match with organization 2021-02-16 00:29:54 +01:00
Eike Kettner
dd935454c9 First version of new ui based on tailwind
This drops fomantic-ui as css toolkit and introduces tailwindcss. With
tailwind there are no predefined components, but it's very easy to
create those. So customizing the look and feel is much simpler; most of
the time no additional css is needed.

This requires a complete rewrite of the markup + styles. Luckily all
logic can be kept as is. The now old ui is not removed, it is still
available by using a request header `Docspell-Ui` with a value of `1`
for the old ui and `2` for the new ui.

Another addition is "dev mode", where docspell serves assets with a
no-cache header, to disable browser caching. This makes developing a
lot easier.
2021-02-14 01:46:13 +01:00
Eike Kettner
96612e0e59 Refactor scan mailbox form and add flag for post-processing
Mails are filtered once by using an imap search and then by some globs
to filter files and subjects. Imap can search by subject via a
string-contains, but not via globs or patterns (afaik). The subject
filter is applied to all downloaded mail headers. Now for post
processing (moving to some target folder or deleting), it can be
chosen to post-process all "seen" mails or only those that matched all
filters.
2021-01-24 01:46:31 +01:00
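The two-stage filtering and the post-processing flag described in the commit above can be sketched roughly as follows. This is a minimal Python illustration and not docspell's code: the `Mail` type, the function names and the `post_process_all` flag are hypothetical, and the case-insensitive glob matching is an assumption. The input list stands for the mails already returned by the imap search.

```python
from __future__ import annotations

from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass
class Mail:
    # Hypothetical stand-in for a downloaded mail header plus attachment names.
    subject: str
    attachments: list[str]


def glob_match(pattern: str, value: str) -> bool:
    # Case-insensitive glob match (assumed behaviour).
    return fnmatchcase(value.lower(), pattern.lower())


def matches_filters(mail: Mail, subject_glob: str, file_glob: str) -> bool:
    # A mail matches if its subject and at least one attachment name match.
    return glob_match(subject_glob, mail.subject) and any(
        glob_match(file_glob, name) for name in mail.attachments
    )


def select_mails(mails, subject_glob, file_glob, post_process_all):
    # `mails` are the mails that the imap search returned ("seen" mails).
    matched = [m for m in mails if matches_filters(m, subject_glob, file_glob)]
    # Post-process either every seen mail or only those that passed all filters.
    to_post_process = list(mails) if post_process_all else matched
    return matched, to_post_process
```

With `post_process_all=False`, only mails that passed every filter are moved or deleted; with `True`, every mail returned by the search is post-processed, whether or not it matched.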
Eike Kettner
c7e850116f Make the text length limit optional 2021-01-22 23:06:50 +01:00
Eike Kettner
4cba96f390 Always return classifier results as suggestion
The classifier results are spliced into the suggestion list at second
place. When linking they are only used if nlp didn't find anything.
2021-01-21 21:05:28 +01:00
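One way to read the commit above, as a small Python sketch: classifier results are inserted right after the first suggestion, and linking falls back to them only if nlp found nothing. The function names and the plain-list representation are assumptions made for illustration and are not taken from the docspell code base.

```python
def splice_at_second_place(suggestions, classifier_results):
    # Keep the first suggestion, then the classifier results, then the rest.
    return suggestions[:1] + classifier_results + suggestions[1:]


def candidate_for_linking(nlp_candidates, classifier_results):
    # Classifier results are only a fallback when nlp found nothing.
    if nlp_candidates:
        return nlp_candidates[0]
    return classifier_results[0] if classifier_results else None
```

For an empty suggestion list the classifier results simply become the whole list.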
Eike Kettner
9957c3267e Add constraints from config to classifier training
For large and/or many documents, training the classifier can lead to
OOM errors. Some limits have been set by default.
2021-01-21 17:46:39 +01:00
Eike Kettner
a6c31be22f Update documentation 2021-01-20 22:47:15 +01:00
Eike Kettner
85ddc61d9d Move date proposal setting to nlp config 2021-01-20 19:17:29 +01:00
Eike Kettner
b12d965223 Improve logging 2021-01-20 00:40:58 +01:00
Eike Kettner
27c24c128d Store tags guessed with classifier in database 2021-01-20 00:30:40 +01:00
Eike Kettner
9d83cb7fe4 Store item based proposals in separate table
Classifiers don't work on each attachment, but on all of them. So the
results must not be stored at an attachment. This reverts some previous
changes in order to put the classifier results for item entities into
their own table.
2021-01-19 23:48:09 +01:00
Eike Kettner
75573c905e Use classifier results as fallback when linking proposed metadata 2021-01-19 23:13:34 +01:00
Eike Kettner
8455d1badf Lookup results from classifier
The model may be out of date, since data may change. So the result
should be looked up to fetch the id and stay compatible with the next
stages.
2021-01-19 22:56:01 +01:00
Eike Kettner
1cd3441462 Run classifier for item entities (concerned, correspondent)
Store the results separately from nlp results in attachment metadata.
2021-01-19 22:08:29 +01:00
Eike Kettner
5c487ef7a9 Refactor running classifier in text analysis 2021-01-19 21:30:02 +01:00
Eike Kettner
99dcaae66b Learn classifiers for item entities
Learns classifiers for concerned and correspondent entities. This can
be used as an alternative to or after nlp.
2021-01-19 20:54:47 +01:00
Eike Kettner
a6f29153c4 Control what tag categories to use for auto-tagging 2021-01-19 01:20:13 +01:00
Eike Kettner
cce8878898 Exclude tags w/o category from classifying; remove obsolete models 2021-01-18 21:51:49 +01:00
Eike Kettner
249f9e6e2a Extend guessing tags to all tag categories 2021-01-18 21:51:45 +01:00
Eike Kettner
360cad3304 Refactoring solr/fts migration
When re-indexing everything, skip populating the index in intermediate
steps and do this as the very last step.

Parameterize adding new fields by their language.
2021-01-18 17:41:40 +01:00
Eike Kettner
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language Italian
Improves and reorganizes how nlp pipelines are set up. Now users can
choose from many options, depending on their hardware and usage
scenario.

This is the base to use more languages without depending on what
stanford-nlp supports. Support then involves text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
Eike Kettner
a70e9ab614 Store used language for processing on attachmentmeta
Issue: #570
2021-01-17 22:56:33 +01:00
Eike Kettner
aa937797be Choose nlp mode in config file 2021-01-17 22:56:33 +01:00
Eike Kettner
a699e87304 Separate ner from classification 2021-01-17 22:56:33 +01:00
Eike Kettner
f02f15e5bd Move blocker into constructor of text analyser 2021-01-17 22:56:33 +01:00
Eike Kettner
d77b5855e4 Set default pool-size to 1 2021-01-11 22:30:59 +01:00
Eike Kettner
bddafa7d28 Fix looping over already seen mails when they are skipped
Mails that are skipped due to a filter must still enter the
post-handling step. Otherwise they will be seen again on the next run.

Issue: #551
2021-01-09 15:07:18 +01:00
Eike Kettner
d712f8303d Make glob matching case-insensitive by default 2021-01-09 13:23:15 +01:00
Eike Kettner
a670bbb6c2 Make idle interval when clearing nlp cache configurable 2021-01-06 23:03:00 +01:00
Eike Kettner
b08e88cd69 Add (unofficial) routes to get system information 2021-01-05 20:54:53 +01:00
Eike Kettner
611e480eb4 Use more prominent log line to indicate start of processing
Issue: #530
2021-01-02 21:47:54 +01:00
Eike Kettner
97dfcece97 Fix duplicate check on restarts
Issue: #530
2021-01-02 21:18:05 +01:00
Eike Kettner
2dff686fa0 Introduce unit condition 2020-12-15 21:03:47 +01:00
Eike Kettner
80406cabc2 Refactoring some code into separate files 2020-12-15 21:03:47 +01:00
Eike Kettner
5e2c5d2a50 Extends query builder 2020-12-15 21:03:46 +01:00
Eike Kettner
35c62049f5 Start converting QItem 2020-12-15 21:03:46 +01:00
Eike Kettner
613696539f Minor refactorings 2020-12-15 21:03:46 +01:00
Eike Kettner
e3f6892abd Convert job record 2020-12-15 21:03:46 +01:00
Eike Kettner
3cef932ccd Convert more records 2020-12-15 21:03:46 +01:00
Eike Kettner
10b49fccf8 Converting user and userimap records 2020-12-15 21:03:46 +01:00
Eike Kettner
f5ae389eea Cleanup remember-me tokens periodically 2020-12-04 17:59:25 +01:00
Eike Kettner
290989f67f Reorder correspondent person suggestion based on org relationship 2020-12-01 23:39:45 +01:00
Eike Kettner
3fabe0a582 Update to Scala 2.13.4 2020-11-27 20:26:24 +01:00
Eike Kettner
5fe532001b Allow specifying the document language with the request 2020-11-23 20:49:01 +01:00
Eike Kettner
5034e12bec Add a subject filter to scan-mailbox args 2020-11-13 23:15:20 +01:00
mergify[bot]
e5ce1fd45f Merge pull request #437 from eikek/upload-improvements
Upload improvements
2020-11-12 22:58:08 +00:00
Eike Kettner
4fd6e02ec0 Improve glob and filter archive entries 2020-11-11 21:01:23 +01:00
Eike Kettner
27eb5d70de Apply given tags in processing step
Issue: #346
2020-11-11 21:01:23 +01:00
Eike Kettner
55a6f7aaf6 Add more properties to upload meta data 2020-11-11 21:01:23 +01:00
Eike Kettner
746e04c624 Improve logging when creating preview images 2020-11-10 22:25:46 +01:00
Eike Kettner
10305bc82d Minor improvements 2020-11-09 21:16:53 +01:00
Eike Kettner
29455d638c Add startup task to find page counts of existing files 2020-11-09 20:35:35 +01:00
Eike Kettner
a77f34b7ba Add a processing step to retrieve page counts 2020-11-09 11:08:24 +01:00
Eike Kettner
f4e50c5229 Provide endpoints to submit tasks to re-generate previews
The scaling factor can be given in the config file. When this changes,
images can be regenerated via POSTing to certain endpoints. It is
possible to regenerate just one attachment preview or all within a
collective.
2020-11-09 09:00:02 +01:00
Eike Kettner
6037b54959 Don't fail processing if generating preview fails 2020-11-09 00:05:11 +01:00
Eike Kettner
709848244c Create tasks to generate all previews
There is a task to generate preview images per attachment. It can
either add them (if not present yet) or overwrite them (e.g. some
config has changed).

There is a task that selects all attachments without previews and
submits a task to create it. This is submitted on start automatically
to generate previews for all existing attachments.
2020-11-08 23:46:02 +01:00
Eike Kettner
7ba6baf6f0 Make preview image smaller 2020-11-08 15:12:56 +01:00
Eike Kettner
6db5c39d78 Fix converted filename
Mark it by default with a string from the config file.

Issue: 397
2020-11-08 09:45:03 +01:00
Eike Kettner
ef7cb4e779 Create a preview image of all files during processing 2020-11-08 01:25:59 +01:00
Eike Kettner
ab1139523a Let the convert-all task retry when pdf conversion fails 2020-10-26 23:39:26 +01:00
Eike Kettner
b59696a9d3 Make sure to only remove/retry items in premature states 2020-10-26 23:39:26 +01:00
Eike Kettner
26e89bf84e Edit org/person/equipment of multiple items 2020-10-26 13:35:47 +01:00
Eike Kettner
2e6026b817 Edit dates of multiple items 2020-10-26 13:16:03 +01:00
Eike Kettner
3c0b86cb19 Fix regex patterns used for NER
Patterns are split on whitespace by the nlp library and then compiled,
so each "word" must be a valid regex.

Fixes: #356
2020-10-21 00:55:14 +02:00
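The constraint named in the commit above (every whitespace-separated "word" must be a valid regex on its own) can be illustrated with a short Python sketch. It uses Python's `re` module as a stand-in for the regex engine of the nlp library, and escaping each word is only one possible way to satisfy the constraint. The commit does not say this is how the fix was made, and the function names are hypothetical.

```python
import re


def is_valid_regex(pattern: str) -> bool:
    # True if the pattern compiles on its own.
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def word_patterns(name: str) -> list:
    # Split like the library would (on whitespace) and escape every word,
    # so that each piece is a valid regex by itself.
    return [re.escape(word) for word in name.split()]
```

A name such as `Foo (Bar Baz)` shows the problem: as a whole it is a valid regex, but the pieces `(Bar` and `Baz)` are not, because each contains an unbalanced parenthesis. After escaping, every piece compiles.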
Eike Kettner
3f697f51aa Autoformat 2020-10-06 23:31:09 +02:00
Eike Kettner
d4354b8b49 Skip pdf conversion if a converted file exists
For images the conversion also returns the extracted text. If saving
this text failed, it is extracted in the following text-extraction
step.
2020-10-02 17:39:39 +02:00
Eike Kettner
b6f23b038a Fix finding attachments for retries
The attachments to process again must be searched in sources and
archives, too.
2020-10-02 17:39:34 +02:00
Eike Kettner
5e21552358 Don't do duplicate check on retries 2020-10-02 16:50:52 +02:00
Eike Kettner
f6f63000be Prepend a duplicate check when uploading files 2020-09-23 23:37:00 +02:00
Eike Kettner
c658677032 Autoformat 2020-09-09 00:29:32 +02:00
Eike Kettner
76ccfb8a81 Only learn from confirmed items
Text classification should only learn from confirmed items. Log if
classification is disabled when processing an item.
2020-09-07 13:04:40 +02:00
Eike Kettner
4309bd8dfd Some cleanup 2020-09-02 21:22:30 +02:00
Eike Kettner
237b960625 Guess a tag on item processing using a trained model if available 2020-09-02 18:28:14 +02:00
Eike Kettner
316b490008 Implement learning a text classifier from collective data 2020-09-02 18:28:14 +02:00
Eike Kettner
68bb65572b Integrate learn-classifier task into the app 2020-09-02 18:28:14 +02:00
Eike Kettner
0c97b4ef76 Initial impl of a text classifier based on stanford-nlp 2020-09-02 18:28:14 +02:00
Eike Kettner
8c4f2e702b Add classifier settings 2020-09-02 18:28:14 +02:00
Eike Kettner
3473cbb773 Use collective data with NER annotation 2020-08-25 20:40:44 +02:00
Eike Kettner
96d2f948f2 Use collective's addressbook to configure regexner 2020-08-24 14:40:52 +02:00
Eike Kettner
8628a0a8b3 Allow configuring stanford-ner and cache based on collective 2020-08-24 10:55:59 +02:00
Eike Kettner
3986487f11 Add api docs and cleanup 2020-08-13 21:22:54 +02:00
Eike Kettner
41ea071555 Add a task to convert all pdfs that have not been converted 2020-08-13 01:06:13 +02:00
Eike Kettner
07e9a9767e Add a task to re-process files of an item 2020-08-12 22:29:56 +02:00
Eike Kettner
09d74b7e80 Return item notes with search results
In order to not make the response very large, an admin can define a
limit on how much to return.
2020-08-05 00:09:37 +02:00
Eike Kettner
45b0deeced Print solr url on start
This is useful info to see which url has been selected, same as db
connection.
2020-08-01 15:59:14 +02:00
Eike Kettner
1fc57fc2b2 Set default value for min-text-len to 500
This value is used to decide whether to try OCR or not. If text is
below this value, OCR is run and both results are compared. It was set
to 10, which is just one or two words. Since docspell deals with
documents, this value is too low.
2020-08-01 15:46:00 +02:00
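The role of `min-text-len` in the commit above can be sketched like this in Python. The commit only says that both results are "compared"; picking the longer text is an assumption made here for illustration, and `choose_text` and `run_ocr` are hypothetical names.

```python
def choose_text(embedded_text: str, run_ocr, min_text_len: int = 500) -> str:
    # Enough text was extracted directly: skip the OCR step.
    stripped = embedded_text.strip()
    if len(stripped) >= min_text_len:
        return stripped
    # Otherwise run OCR and compare both results.
    ocr_text = run_ocr().strip()
    # Assumption: the longer text is considered the better result.
    return ocr_text if len(ocr_text) > len(stripped) else stripped
```

With the old threshold of 10, a page that yields only a dozen stray characters would already count as "has text" and never reach OCR. A threshold of 500 sends such pages through OCR as well.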
Eike Kettner
cec4948710 Add pdf meta data to extracted text to add it to full-text index 2020-07-19 01:07:49 +02:00
Eike Kettner
209c068436 Use keywords in pdfs to search for existing tags
During processing, keywords stored in PDF metadata are used to look
them up in the tag database and associate any existing tags to the
item.

See #175
2020-07-19 00:28:04 +02:00
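The keyword lookup from the commit above might look roughly like this. It is a Python sketch under stated assumptions: tags are given as a name-to-id mapping, matching is case-insensitive, and unknown keywords are ignored rather than created as new tags. None of these details are confirmed by the commit message, and the function name is hypothetical.

```python
def tags_for_keywords(keywords, existing_tags):
    # Index existing tags by normalized name (assumed case-insensitive match).
    index = {name.strip().lower(): tag_id for name, tag_id in existing_tags.items()}
    found = []
    for keyword in keywords:
        tag_id = index.get(keyword.strip().lower())
        # Only existing tags are associated; each tag at most once.
        if tag_id is not None and tag_id not in found:
            found.append(tag_id)
    return found
```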
Eike Kettner
bd20165d1a Use given folder-id when adding initial fts docs 2020-07-18 23:04:01 +02:00
Eike Kettner
3d49ceaab5 Use ocrmypdf tool to create pdf/a during conversion
- Use another external tool to convert pdf to pdf which also adds the
  extracted text as another layer into the pdf

- Although not used, the external conversion routine will now check
  for an existing text file that is named as the pdf file with extension
  `.txt`. If present it is included in the conversion result and will be
  used as the extracted text.

- Text extraction for pdf files now happens on the converted file,
  because it may already contain the text from the conversion step and
  thus avoids running OCR twice.

- All errors during conversion are not fatal; processing continues
  without a converted file.
2020-07-18 17:19:29 +02:00
Eike Kettner
5b01c93711 Add a folder-id to item processing
This allows defining a folder when uploading files. All generated
items are associated with this folder on creation.
2020-07-14 23:18:39 +02:00
Eike Kettner
259526a088 Organize imports 2020-07-12 13:51:52 +02:00
Eike Kettner
22fa1dba13 Apply folder restriction to fulltext only search
And update index when folder changes.
2020-07-12 13:50:45 +02:00
Eike Kettner
aeba4ba913 Refactor full-text migrations and add folder to solr schema 2020-07-12 13:50:14 +02:00
Eike Kettner
e387b5513f Remove items in non-member folders from sql search results 2020-07-11 22:25:56 +02:00