diff --git a/Changelog.md b/Changelog.md
index 76a41d8b..4960b444 100644
--- a/Changelog.md
+++ b/Changelog.md
@@ -6,9 +6,9 @@
This release comes with major improvements to the text analysis
module. It is now much more configurable, has improved results and can
-learn tags from all categories. Additionally, more languages have been
-added (and it's now easier to add more, please open an issue if want
-more languages).
+learn tags from all categories. Additionally, more languages for
+document processing have been added and it's now easier to add more.
+Please open an issue if you want more languages to be included.
- text analysis improvements (#263, #570)
- docspell can now learn from all your tag categories
@@ -23,15 +23,15 @@ more languages).
- Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish,
Norwegian, Swedish, Russian, Romanian
- languages have different support for text-analysis, but there is
- some basic support for all, there is extended support for English,
- German and French through [Stanford
- CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp models
- - if you want more languages, please open an issue.
+ some basic support for all
+ - there is extended support for English, German and French through
+ [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp
+ models (as before)
- scan mailbox change (#576)
- The change from last version (#551) has been moved behind a flag
in the "scan mailbox settings". Please review your scan mailbox
- settings.
- - The scan mailbox settings form view has been changed to tab-style,
+ tasks in your user settings.
+ - The scan mailbox settings form view has been organized into tabs,
as it grew too large for a single form.
- nix tools package fixed (#584)
- If you are using docspell tools package for nix, it has now been
@@ -50,22 +50,23 @@ more languages).
- This was a bug introduced by the last release. When tag categories
can now be spelled upper- or lower-case. In 0.18.0 you had to
spell them lowercase, otherwise the search doesn't work.
-- adds a workaround for mails that don't specify the used charset (#591)
+- adds a workaround for mails that don't specify their used charset (#591)
### Breaking Changes
- The joex configuration changed around text analysis. If you had some
custom settings there, please review these wrt the new default
config.
-- The tools package renamed the scripts to be better distinguishable,
- since they all end up in `$PATH`. They are now prefixed by `ds-`.
+- When using the nix package manager: the tools package renamed the
+ scripts to be better distinguishable, since they all end up in
+ `$PATH`. They are now prefixed by `ds-`.
- The path of the consumedir script changed in the consumedir docker
image
- The settings of the scan-mailbox task has been extended by another
flag. It controls when to apply the post-processing (moving or
deleting). If you were relying that all mails (even those excluded
- by a subject filter) where moved away, you need to check the
- settings.
+ by a subject filter) were moved away, you need to check your
+ scan-mailbox task settings.
### REST Api Changes
@@ -82,6 +83,9 @@ more languages).
moved inside `text-analysis`. Please have a look at the new
[default config](https://docspell.org/docs/configure/#joex) if you
changed something there.
+ - The `regex-ner` section has changed: the `enabled` flag has been
+ removed. You can now limit the number of entries to apply using
+ `max-entries`; a value of `0` disables it.
## v0.18.0
diff --git a/website/site/content/docs/configure/_index.md b/website/site/content/docs/configure/_index.md
index a7bf0765..3b8ec7fd 100644
--- a/website/site/content/docs/configure/_index.md
+++ b/website/site/content/docs/configure/_index.md
@@ -341,13 +341,59 @@ will degrade to `regexonly` for these.
The mode `disabled` skips NLP processing completely. This has least
impact in memory consumption, obviously, but then only the classifier
-is used to find metadata.
+is used to find metadata (unless it is disabled, too).
You might want to try different modes and see what combination suits
best your usage pattern and machine running joex. If a powerful
-machine is used, simply leave the defaults. When running on an older
+machine is used, simply leave the defaults. When running on an
raspberry pi, for example, you might need to adjust things.
+### Memory Usage
+
+The memory requirements for the joex component depend on the document
+language and the enabled features for text-analysis. The `nlp.mode`
+setting has significant impact, especially when your documents are in
+German. Here are some rough numbers on jvm heap usage (the same file
+was used for all tries):
+
+| nlp.mode | English | German | French |
+|----------|---------|--------|--------|
+| full     | 420M    | 950M   | 490M   |
+| basic    | 170M    | 380M   | 390M   |
+
+Note that these are only rough numbers and they show the maximum used
+heap memory while processing a file.
+
+When using `mode=full`, a heap setting of at least `-Xmx1400M` is
+recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
+recommended.
+
+Other languages can't use these two modes, and so don't require this
+amount of memory (but don't have as good results). Then you can go
+with less heap. For these languages, the nlp mode is the same as
+`regexonly`.
+
+Training the classifier is also memory intensive; how much memory it
+needs depends solely on the size and number of documents used for
+training. However, training runs only periodically, for example
+every two weeks. When classifying new documents, memory requirements
+are lower, since the model already exists.
+
+More details about these modes can be found
+[here](@/docs/joex/file-processing.md#text-analysis).
+
+
+The restserver component is very lightweight, here you can use
+defaults.
+
+
# File Format
The format of the configuration files can be
diff --git a/website/site/content/docs/feed/_index.md b/website/site/content/docs/feed/_index.md
index 109a6db8..f193539a 100644
--- a/website/site/content/docs/feed/_index.md
+++ b/website/site/content/docs/feed/_index.md
@@ -42,6 +42,8 @@ directory and uploads all incoming files to Docspell. The script can
watch directories recursively and can skip files already uploaded, so
you can organize the files as you want in there (rename, move etc).
+This can be used multiple times on different machines, if desired.
+
The scanner should support 300dpi for better results. Docspell
converts the files into PDF adding a text layer of image-only files.
diff --git a/website/site/content/docs/install/quickstart.md b/website/site/content/docs/install/quickstart.md
index dfa6e583..0cdd649a 100644
--- a/website/site/content/docs/install/quickstart.md
+++ b/website/site/content/docs/install/quickstart.md
@@ -25,3 +25,8 @@ To get started, here are some quick links:
user provided [notes and unraid
templates](https://github.com/vakilando/unraid-docker-templates)
which can get you started. Thanks for providing these!
+
+Every [component](@/docs/intro/_index.md#components) (restserver, joex,
+consumedir) can run on different machines and multiple times. Most of
+the time, running everything on one machine is sufficient, and for
+simplicity the docker-compose setup reflects this variant.
diff --git a/website/site/content/docs/install/rpi.md b/website/site/content/docs/install/rpi.md
index edf35e88..4e268515 100644
--- a/website/site/content/docs/install/rpi.md
+++ b/website/site/content/docs/install/rpi.md
@@ -27,7 +27,7 @@ result in long processing times for OCR and text analysis. The board
should provide 4G of RAM (like the current RPi4), especially if also a
database and solr are running next to it. The memory required by joex
depends on the config and document language. Please pick a value that
-suits your setup from [here](@/docs/install/running.md#memory-usage).
+suits your setup from [here](@/docs/configure/_index.md#memory-usage).
For boards like the RPi, it might be necessary to use
`nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
joex pool size to 1.
diff --git a/website/site/content/docs/install/running.md b/website/site/content/docs/install/running.md
index b012f5d5..466ef352 100644
--- a/website/site/content/docs/install/running.md
+++ b/website/site/content/docs/install/running.md
@@ -45,42 +45,14 @@ when opened up to the outside, it is recommended to lock this down.
{% end %}
-## Memory Usage
+## Memory
-The memory requirements for the joex component depends on the document
-language and the configuration for [file
-processing](@/docs/configure/_index.md#file-processing). The
-`nlp.mode` setting has significant impact, especially when your
-documents are in German. Here are some rough numbers on jvm heap usage
-(the same small jpeg file was used for all tries):
-
-| nlp.mode | English | German | French |
-|----------|---------|--------|--------|
-| full     | 420M    | 950M   | 490M   |
-| basic    | 170M    | 380M   | 390M   |
-
-When using `mode=full`, a heap setting of at least `-Xmx1400M` is
-recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
-recommended.
-
-Other languages can't use these two modes, and so don't require this
-amount of memory (but don't have as good results). Then you can go
-with less heap.
-
-More details about these modes can be found
-[here](@/docs/joex/file-processing.md#text-analysis).
-
-
-The restserver component is very lightweight, here you can use
-defaults.
+Using the options below you can define how much memory the JVM process
+is able to use. This might need to be adapted depending on the
+usage scenario and configured text analysis features.
+Please have a look at the corresponding [configuration
+section](@/docs/configure/_index.md#memory-usage).
## Options
diff --git a/website/site/content/docs/intro/_index.md b/website/site/content/docs/intro/_index.md
index 31831b7c..75e6c7f1 100644
--- a/website/site/content/docs/intro/_index.md
+++ b/website/site/content/docs/intro/_index.md
@@ -24,9 +24,10 @@ candidates for:
- Correspondents
- Concerned person or things
- A date
+- Tags
-It will propose a few candidates and sets the most likely one to your
-item.
+For tags, it sets all that it thinks do apply. For the others, it will
+propose a few candidates and sets the most likely one to your item.
This might be wrong, so it is recommended to curate the results.
However, very often the correct one is either set or within the
diff --git a/website/site/content/docs/joex/file-processing.md b/website/site/content/docs/joex/file-processing.md
index f1343dea..dd5ca0a9 100644
--- a/website/site/content/docs/joex/file-processing.md
+++ b/website/site/content/docs/joex/file-processing.md
@@ -443,9 +443,10 @@ nlp processing as follows:
- mode `regexonly`: this doesn't load any statistical models and is
therefore much lighter on memory (depending on the address book
size, of course). It will use the address book to create regex rules
- and match them against your document.
-- mode = disabled: this disables nlp processing altogether. Then only
- the classifier is run (unless disabled).
+ and match them against your document. Memory usage then doesn't
+ depend on the document language.
+- mode `disabled`: this disables nlp processing. Then only the
+ classifier is run (unless disabled).
Note that mode `full` and `basic` is only relevant for the languages
where models are available. For all other languages, it is effectively
diff --git a/website/site/content/docs/webapp/scanmailbox-detail-01.png b/website/site/content/docs/webapp/scanmailbox-detail-01.png
new file mode 100644
index 00000000..3594198b
Binary files /dev/null and b/website/site/content/docs/webapp/scanmailbox-detail-01.png differ
diff --git a/website/site/content/docs/webapp/scanmailbox-detail-02.png b/website/site/content/docs/webapp/scanmailbox-detail-02.png
new file mode 100644
index 00000000..69176a40
Binary files /dev/null and b/website/site/content/docs/webapp/scanmailbox-detail-02.png differ
diff --git a/website/site/content/docs/webapp/scanmailbox-detail-03.png b/website/site/content/docs/webapp/scanmailbox-detail-03.png
new file mode 100644
index 00000000..48a7f12e
Binary files /dev/null and b/website/site/content/docs/webapp/scanmailbox-detail-03.png differ
diff --git a/website/site/content/docs/webapp/scanmailbox-detail-04.png b/website/site/content/docs/webapp/scanmailbox-detail-04.png
new file mode 100644
index 00000000..14cca316
Binary files /dev/null and b/website/site/content/docs/webapp/scanmailbox-detail-04.png differ
diff --git a/website/site/content/docs/webapp/scanmailbox-detail-05.png b/website/site/content/docs/webapp/scanmailbox-detail-05.png
new file mode 100644
index 00000000..79708b83
Binary files /dev/null and b/website/site/content/docs/webapp/scanmailbox-detail-05.png differ
diff --git a/website/site/content/docs/webapp/scanmailbox-detail-06.png b/website/site/content/docs/webapp/scanmailbox-detail-06.png
new file mode 100644
index 00000000..da793e00
Binary files /dev/null and b/website/site/content/docs/webapp/scanmailbox-detail-06.png differ
diff --git a/website/site/content/docs/webapp/scanmailbox-detail.png b/website/site/content/docs/webapp/scanmailbox-detail.png
deleted file mode 100644
index 4360a772..00000000
Binary files a/website/site/content/docs/webapp/scanmailbox-detail.png and /dev/null differ
diff --git a/website/site/content/docs/webapp/scanmailbox.md b/website/site/content/docs/webapp/scanmailbox.md
index b84fbfbd..cc5867d0 100644
--- a/website/site/content/docs/webapp/scanmailbox.md
+++ b/website/site/content/docs/webapp/scanmailbox.md
@@ -21,19 +21,22 @@ multiple e-mail accounts you want to import periodically.
# Details
-Creating a task requires the following information:
+## General
-{{ figure(file="scanmailbox-detail.png") }}
+{{ figure(file="scanmailbox-detail-01.png") }}
You can enable or disable this task. A disabled task will not run
periodically. You can still choose to run it manually if you click the
`Start Once` button.
-## E-Mail Settings
-
Then you need to specify which [IMAP
connection](@/docs/webapp/emailsettings.md#imap-settings) to use.
+
+## Processing
+
+{{ figure(file="scanmailbox-detail-02.png") }}
+
A list of folders is required. Docspell will only look into these
folders. You can specify multiple folders. The "Inbox" folder is a
special folder, which will usually appear translated in your web-mail
@@ -43,30 +46,20 @@ mails in your inbox. Any other folder is usually case-sensitive
except the INBOX folder). Type in a folder name and click the add
button on the right.
-The next two settings tell docspell what to do once a mail has been
-submitted to docspell. It can be moved into another folder in your
-mail account. This moves it out of the way for the next run. You can
-also choose to delete the mail, but *note that it will really be
-deleted and not moved to your trash folder*. If both options are off,
-nothing happens with that mail, it simply stays (and could be re-read
-on the next run).
-
-Be careful when mails are neither moved nor deleted after processing.
-They could be selected anew in the next run, meaning that the job can
-not progress, because it filters out the same mails all the time. You
-can however, simply schedule the task in an interval >= the `Received
-Since Hours` setting.
-
-
-## Filtering
-
-The following properties allow to filter mails that are imported.
-
Then the field *Received Since Hours* defines how many hours to go
back and look for mails. Usually there are many mails in your inbox
and importing them all at once is not feasible or desirable. It can
work together with the *Schedule* field below. For example, you could
-run this task all 6 hours and read mails from 8 hours back.
+run this task every 6 hours and read mails from 8 hours back. This
+setting is used to query the mail server.
+
+
+## Additional Filter
+
+{{ figure(file="scanmailbox-detail-03.png") }}
+
+The following properties let you filter which of the downloaded mails
+should be imported.
The *File Filter* can be specified as a glob to only import mail
attachments based on their file name. For example, a value of `*.pdf`
@@ -82,10 +75,38 @@ pattern. For example, if your scanner mails to you with a certain
subject like _"Scanned Document 214"_, you could include those via a
`Scanned Document*` pattern.
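
The glob patterns mentioned above can be tried out with a minimal
sketch. It uses Python's `fnmatch` module as a stand-in; docspell's own
glob matching may differ in details such as case sensitivity, and the
helper name `glob_matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def glob_matches(value: str, pattern: str) -> bool:
    """Case-sensitive glob match; `*` matches any run of characters.

    This only illustrates the idea of a glob filter. It is not
    docspell's implementation.
    """
    return fnmatchcase(value, pattern)


# File filter: only attachments whose name ends in ".pdf" pass
print(glob_matches("invoice.pdf", "*.pdf"))
print(glob_matches("photo.jpg", "*.pdf"))

# Subject filter: subjects starting with "Scanned Document" pass
print(glob_matches("Scanned Document 214", "Scanned Document*"))
```

The first and third calls match, the second does not.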
+## Post Processing
+
+{{ figure(file="scanmailbox-detail-04.png") }}
+
+The next settings tell docspell what to do once a mail has been read
+by docspell. It can be moved into another folder in your mail account.
+This moves it out of the way for the next run. You can also choose to
+delete the mail, but *note that it will really be deleted and not
+moved to your trash folder*. If both options are off, nothing happens
+with that mail, it simply stays (and could be re-read on the next
+run).
+
+Be careful when mails are neither moved nor deleted after processing.
+They could be selected anew in the next run, meaning that the job can
+not progress, because it filters out the same mails all the time. You
+can, however, simply schedule the task in an interval >= the `Received
+Since Hours` setting.
+
+By default, post-processing is only applied to mails that have been
+*submitted to docspell*. Some mails may have been skipped due to subject
+filtering. If you also want these skipped mails to be affected by
+post-processing, enable the *Apply post-processing to all fetched
+mails* option.
+
+
+
## Metadata
-The last properties allow to specify some metadata that are
-automatically attached to the items being created.
+{{ figure(file="scanmailbox-detail-05.png") }}
+
+These properties let you specify metadata that is automatically
+attached to the items being created.
Every item in docspell has a direction value (incoming or outgoing).
If you know that all mails you want to import have a specific
@@ -104,17 +125,26 @@ resulting items.
The *Tags* setting can be used to associate a fixed number of tags to
all items that are imported from this mail task.
-The last field is the *Schedule* which defines when and how often this
-task should run. The syntax is similiar to a date-time string, like
-`2019-09-15 12:32`, where each part is a pattern to also match multple
-values. The ui tries to help a little by displaying the next two
-date-times this task would execute. A more in depth help is available
-[here](https://github.com/eikek/calev#what-are-calendar-events). For
-example, to execute the task every monday at noon, you would write:
-`Mon *-*-* 12:00`. A date-time part can match all values (`*`), a list
-of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long lists may
-be written in a shorter way using a repetition value. It is written
-like this: `1/7` which is the same as a list with `1` and all
+The *Language* setting is applied when processing the mails. If not
+set, the default language of the collective is used.
+
+
+## Schedule
+
+{{ figure(file="scanmailbox-detail-06.png") }}
+
+Finally, the *Schedule* defines when and how often this task should
+run. The syntax is similar to a date-time string, like `2019-09-15
+12:32`, where each part is a pattern that can also match multiple
+values. The ui tries to help a little by displaying the next two
+date-times this task would execute. More in-depth help is available
+[here](https://github.com/eikek/calev#what-are-calendar-events).
+
+For example, to execute the task every Monday at noon, you would
+write: `Mon *-*-* 12:00`. A date-time part can match all values (`*`),
+a list of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long
+lists may be written in a shorter way using a repetition value. It is
+written like this: `1/7` which is the same as a list with `1` and all
multiples of `7` added to it. In other words, it matches `1`, `1+7`,
`1+7+7`, `1+7+7+7` and so on.
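
The repetition value described above can be illustrated with a short
Python sketch. The function name `expand_repetition` and the upper
bound are assumptions made for this example; it only mirrors the
arithmetic and is not how the calev library implements it.

```python
def expand_repetition(start: int, step: int, upper: int) -> list[int]:
    """Expand `start/step` into start, start+step, start+2*step, ...

    Values larger than `upper` (the largest value the date-time part
    can take, chosen by the caller) are left out.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(start, upper + 1, step))


# `1/7` in a day-of-month part, assuming 31 as the upper bound
print(expand_repetition(1, 7, 31))  # [1, 8, 15, 22, 29]
```

Written out as a list, `1/7` in the day part would therefore be the
same as `1,8,15,22,29`.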