Update documentation

2025-10-20 12:20:12 +00:00 · 2021-01-25 08:50:46 +01:00
parent e9a4f904c9
commit 946204e809
16 changed files with 154 additions and 93 deletions
--- a/Changelog.md
+++ b/Changelog.md
@@ -6,9 +6,9 @@

 This release comes with major improvements to the text analysis
 module. It is now much more configurable, has improved results and can
-learn tags from all categories. Additionally, more languages have been
-added (and it's now easier to add more, please open an issue if want
-more languages).
+learn tags from all categories. Additionally, more languages for
+document processing have been added and it's now easier to add more.
+Please open an issue if you want more languages to be included.

 - text analysis improvements (#263, #570)
  - docspell can now learn from all your tag categories
@@ -23,15 +23,15 @@ more languages).
  - Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish,
    Norwegian, Swedish, Russian, Romanian
  - languages have different support for text-analysis, but there is
-    some basic support for all, there is extended support for English,
-    German and French through [Stanford
-    CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp models
-  - if you want more languages, please open an issue.
+    some basic support for all
+  - there is extended support for English, German and French through
+    [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp
+    models (as before)
 - scan mailbox change (#576)
  - The change from last version (#551) has been moved behind a flag
    in the "scan mailbox settings". Please review your scan mailbox
-    settings.
-  - The scan mailbox settings form view has been changed to tab-style,
+    tasks in your user settings.
+  - The scan mailbox settings form view has been organized into tabs,
    as it grew too large for a single form.
 - nix tools package fixed (#584)
  - If you are using docspell tools package for nix, it has now been
@@ -50,22 +50,23 @@ more languages).
  - This was a bug introduced by the last release. When tag categories
    can now be spelled upper- or lower-case. In 0.18.0 you had to
    spell them lowercase, otherwise the search doesn't work.
- adds a workaround for mails that don't specify the used charset (#591)
+- adds a workaround for mails that don't specify their used charset (#591)

 ### Breaking Changes

 - The joex configuration changed around text analysis. If you had some
  custom settings there, please review these wrt the new default
  config.
- The tools package renamed the scripts to be better distinguishable,
-  since they all end up in `$PATH`. They are now prefixed by `ds-`.
+- When using the nix package manager: the tools package renamed the
+  scripts to be better distinguishable, since they all end up in
+  `$PATH`. They are now prefixed by `ds-`.
 - The path of the consumedir script changed in the consumedir docker
  image
 - The settings of the scan-mailbox task has been extended by another
  flag. It controls when to apply the post-processing (moving or
  deleting). If you were relying that all mails (even those excluded
-  by a subject filter) where moved away, you need to check the
-  settings.
+  by a subject filter) were moved away, you need to check your
+  scan-mailbox task settings.

 ### REST Api Changes

@@ -82,6 +83,9 @@ more languages).
    moved inside `text-anlysis`. Please have a look at the new
    [default config](https://docspell.org/docs/configure/#joex) if you
    changed something there.
+  - The `regex-ner` section has changed: the `enabled` flag has been
+    removed. You can now limit the number of entries to apply using
+    `max-entries`; a value of `0` disables it.


 ## v0.18.0
--- a/website/site/content/docs/configure/_index.md
+++ b/website/site/content/docs/configure/_index.md
@@ -341,13 +341,59 @@ will degrade to `regexonly` for these.

 The mode `disabled` skips NLP processing completely. This has least
 impact in memory consumption, obviously, but then only the classifier
-is used to find metadata.
+is used to find metadata (unless it is disabled, too).

 You might want to try different modes and see what combination suits
 best your usage pattern and machine running joex. If a powerful
-machine is used, simply leave the defaults. When running on an older
+machine is used, simply leave the defaults. When running on a
 raspberry pi, for example, you might need to adjust things.

+### Memory Usage
+
+The memory requirements for the joex component depend on the document
+language and the enabled features for text-analysis. The `nlp.mode`
+setting has significant impact, especially when your documents are in
+German. Here are some rough numbers on jvm heap usage (the same file
+was used for all tries):
+
+<table class="table is-hoverable is-striped">
+<thead>
+  <tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
+</thead>
+<tfoot>
+</tfoot>
+<tbody>
+  <tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
+  <tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
+</tbody>
+</table>
+
+Note that these are only rough numbers and they show the maximum used
+heap memory while processing a file.
+
+When using `mode=full`, a heap setting of at least `-Xmx1400M` is
+recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
+recommended.
+
+Other languages can't use these two modes, and so don't require this
+amount of memory (but don't have as good results). Then you can go
+with less heap. For these languages, the nlp mode is the same as
+`regexonly`.
+
+Training the classifier is also memory intensive; how much memory it
+needs depends solely on the size and number of documents used for
+training. However, the classifier is only trained periodically, for
+example every two weeks. When classifying new documents, memory
+requirements are lower, since the model already exists.
+
+More details about these modes can be found
+[here](@/docs/joex/file-processing.md#text-analysis).
+
+
+The restserver component is very lightweight; here you can use the
+defaults.
+
+
 # File Format

 The format of the configuration files can be
--- a/website/site/content/docs/feed/_index.md
+++ b/website/site/content/docs/feed/_index.md
@@ -42,6 +42,8 @@ directory and uploads all incoming files to Docspell. The script can
 watch directories recursively and can skip files already uploaded, so
 you can organize the files as you want in there (rename, move etc).

+This can be used multiple times on different machines, if desired.
+
 The scanner should support 300dpi for better results. Docspell
 converts the files into PDF adding a text layer of image-only files.

--- a/website/site/content/docs/install/quickstart.md
+++ b/website/site/content/docs/install/quickstart.md
@@ -25,3 +25,8 @@ To get started, here are some quick links:
  user provided [notes and unraid
  templates](https://github.com/vakilando/unraid-docker-templates)
  which can get you started. Thanks for providing these!
+
+Every [component](@/docs/intro/_index.md#components) (restserver, joex,
+consumedir) can run on different machines and multiple times. Most of
+the time, running everything on one machine is sufficient; for
+simplicity, the docker-compose setup reflects this variant.
--- a/website/site/content/docs/install/rpi.md
+++ b/website/site/content/docs/install/rpi.md
@@ -27,7 +27,7 @@ result in long processing times for OCR and text analysis. The board
 should provide 4G of RAM (like the current RPi4), especially if also a
 database and solr are running next to it. The memory required by joex
 depends on the config and document language. Please pick a value that
-suits your setup from [here](@/docs/install/running.md#memory-usage).
+suits your setup from [here](@/docs/configure/_index.md#memory-usage).
 For boards like the RPi, it might be necessary to use
 `nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
 joex pool size to 1.
--- a/website/site/content/docs/install/running.md
+++ b/website/site/content/docs/install/running.md
@@ -45,42 +45,14 @@ when opened up to the outside, it is recommended to lock this down.

 {% end %}

-## Memory Usage
+## Memory

-The memory requirements for the joex component depends on the document
-language and the configuration for [file
-processing](@/docs/configure/_index.md#file-processing). The
-`nlp.mode` setting has significant impact, especially when your
-documents are in German. Here are some rough numbers on jvm heap usage
-(the same small jpeg file was used for all tries):
-
-<table class="table is-hoverable is-striped">
-<thead>
-  <tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
-</thead>
-<tfoot>
-</tfoot>
-<tbody>
-  <tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
-  <tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
-</tbody>
-</table>
-
-When using `mode=full`, a heap setting of at least `-Xmx1400M` is
-recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
-recommended.
-
-Other languages can't use these two modes, and so don't require this
-amount of memory (but don't have as good results). Then you can go
-with less heap.
-
-More details about these modes can be found
-[here](@/docs/joex/file-processing.md#text-analysis).
-
-
-The restserver component is very lightweight, here you can use
-defaults.
+Using the options below you can define how much memory the JVM process
+is able to use. You might need to adapt this depending on the usage
+scenario and the configured text analysis features.

+Please have a look at the corresponding [configuration
+section](@/docs/configure/_index.md#memory-usage).

 ## Options

--- a/website/site/content/docs/intro/_index.md
+++ b/website/site/content/docs/intro/_index.md
@@ -24,9 +24,10 @@ candidates for:
 - Correspondents
 - Concerned person or things
 - A date
+- Tags

-It will propose a few candidates and sets the most likely one to your
-item.
+For tags, it sets all that it thinks apply. For the others, it
+proposes a few candidates and sets the most likely one on your item.

 This might be wrong, so it is recommended to curate the results.
 However, very often the correct one is either set or within the
--- a/website/site/content/docs/joex/file-processing.md
+++ b/website/site/content/docs/joex/file-processing.md
@@ -443,9 +443,10 @@ nlp processing as follows:
 - mode `regexonly`: this doesn't load any statistical models and is
  therefore much lighter on memory (depending on the address book
  size, of course). It will use the address book to create regex rules
-  and match them against your document.
- mode = disabled: this disables nlp processing altogether. Then only
-  the classifier is run (unless disabled).
+  and match them against your document. Memory usage then doesn't
+  depend on the document language.
+- mode `disabled`: this disables nlp processing. Then only the
+  classifier is run (unless disabled).

 Note that mode `full` and `basic` is only relevant for the languages
 where models are available. For all other languages, it is effectively
--- a/website/site/content/docs/webapp/scanmailbox-detail-01.png
+++ b/website/site/content/docs/webapp/scanmailbox-detail-01.png
--- a/website/site/content/docs/webapp/scanmailbox-detail-02.png
+++ b/website/site/content/docs/webapp/scanmailbox-detail-02.png
--- a/website/site/content/docs/webapp/scanmailbox-detail-03.png
+++ b/website/site/content/docs/webapp/scanmailbox-detail-03.png
--- a/website/site/content/docs/webapp/scanmailbox-detail-04.png
+++ b/website/site/content/docs/webapp/scanmailbox-detail-04.png
--- a/website/site/content/docs/webapp/scanmailbox-detail-05.png
+++ b/website/site/content/docs/webapp/scanmailbox-detail-05.png
--- a/website/site/content/docs/webapp/scanmailbox-detail-06.png
+++ b/website/site/content/docs/webapp/scanmailbox-detail-06.png
--- a/website/site/content/docs/webapp/scanmailbox-detail.png
+++ b/website/site/content/docs/webapp/scanmailbox-detail.png
--- a/website/site/content/docs/webapp/scanmailbox.md
+++ b/website/site/content/docs/webapp/scanmailbox.md
@@ -21,19 +21,22 @@ multiple e-mail accounts you want to import periodically.

 # Details

-Creating a task requires the following information:
+## General

-{{ figure(file="scanmailbox-detail.png") }}
+{{ figure(file="scanmailbox-detail-01.png") }}

 You can enable or disable this task. A disabled task will not run
 periodically. You can still choose to run it manually if you click the
 `Start Once` button.

-## E-Mail Settings
-
 Then you need to specify which [IMAP
 connection](@/docs/webapp/emailsettings.md#imap-settings) to use.

+
+## Processing
+
+{{ figure(file="scanmailbox-detail-02.png") }}
+
 A list of folders is required. Docspell will only look into these
 folders. You can specify multiple folders. The "Inbox" folder is a
 special folder, which will usually appear translated in your web-mail
@@ -43,30 +46,20 @@ mails in your inbox. Any other folder is usually case-sensitive
 except the INBOX folder). Type in a folder name and click the add
 button on the right.

-The next two settings tell docspell what to do once a mail has been
-submitted to docspell. It can be moved into another folder in your
-mail account. This moves it out of the way for the next run. You can
-also choose to delete the mail, but *note that it will really be
-deleted and not moved to your trash folder*. If both options are off,
-nothing happens with that mail, it simply stays (and could be re-read
-on the next run).
-
-Be careful when mails are neither moved nor deleted after processing.
-They could be selected anew in the next run, meaning that the job can
-not progress, because it filters out the same mails all the time. You
-can however, simply schedule the task in an interval >= the `Received
-Since Hours` setting.
-
-
-## Filtering
-
-The following properties allow to filter mails that are imported.
-
 Then the field *Received Since Hours* defines how many hours to go
 back and look for mails. Usually there are many mails in your inbox
 and importing them all at once is not feasible or desirable. It can
 work together with the *Schedule* field below. For example, you could
-run this task all 6 hours and read mails from 8 hours back.
+run this task every 6 hours and read mails from 8 hours back. This
+setting is used to query the mail server.
+
+
+## Additional Filter
+
+{{ figure(file="scanmailbox-detail-03.png") }}
+
+The following properties allow you to filter which of the downloaded
+mails should be imported.

 The *File Filter* can be specified as a glob to only import mail
 attachments based on their file name. For example, a value of `*.pdf`
@@ -82,10 +75,38 @@ pattern. For example, if your scanner mails to you with a certain
 subject like _"Scanned Document 214"_, you could include those via a
 `Scanned Document*` pattern.

+## Post Processing
+
+{{ figure(file="scanmailbox-detail-04.png") }}
+
+The next settings tell docspell what to do once a mail has been read
+by docspell. It can be moved into another folder in your mail account.
+This moves it out of the way for the next run. You can also choose to
+delete the mail, but *note that it will really be deleted and not
+moved to your trash folder*. If both options are off, nothing happens
+with that mail, it simply stays (and could be re-read on the next
+run).
+
+Be careful when mails are neither moved nor deleted after processing.
+They could be selected anew in the next run, meaning that the job can
+not progress, because it filters out the same mails all the time. You
+can however, simply schedule the task in an interval >= the `Received
+Since Hours` setting.
+
+By default, post-processing is only applied to mails that have been
+*submitted to docspell*. Some mails may have been skipped due to subject
+filtering. If you also want these skipped mails to be affected by
+post-processing, enable the *Apply post-processing to all fetched
+mails* option.
+
+
+
 ## Metadata

-The last properties allow to specify some metadata that are
-automatically attached to the items being created.
+{{ figure(file="scanmailbox-detail-05.png") }}
+
+These properties allow you to specify metadata that is automatically
+attached to the items being created.

 Every item in docspell has a direction value (incoming or outgoing).
 If you know that all mails you want to import have a specific
@@ -104,17 +125,26 @@ resulting items.
 The *Tags* setting can be used to associate a fixed number of tags to
 all items that are imported from this mail task.

-The last field is the *Schedule* which defines when and how often this
-task should run. The syntax is similiar to a date-time string, like
-`2019-09-15 12:32`, where each part is a pattern to also match multple
-values. The ui tries to help a little by displaying the next two
-date-times this task would execute. A more in depth help is available
-[here](https://github.com/eikek/calev#what-are-calendar-events). For
-example, to execute the task every monday at noon, you would write:
-`Mon *-*-* 12:00`. A date-time part can match all values (`*`), a list
-of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long lists may
-be written in a shorter way using a repetition value. It is written
-like this: `1/7` which is the same as a list with `1` and all
+The *Language* setting is applied when processing the mails. If not
+set, the default language of the collective is used.
+
+
+## Schedule
+
+{{ figure(file="scanmailbox-detail-06.png") }}
+
+Finally, the *Schedule* defines when and how often this task should
+run. The syntax is similar to a date-time string, like
+`2019-09-15 12:32`, where each part is a pattern that can also match
+multiple values. The ui tries to help a little by displaying the next
+two date-times this task would execute. More in-depth help is available
+[here](https://github.com/eikek/calev#what-are-calendar-events).
+
+For example, to execute the task every Monday at noon, you would
+write: `Mon *-*-* 12:00`. A date-time part can match all values (`*`),
+a list of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long
+lists may be written in a shorter way using a repetition value. It is
+written like this: `1/7` which is the same as a list with `1` and all
 multiples of `7` added to it. In other words, it matches `1`, `1+7`,
 `1+7+7`, `1+7+7+7` and so on.
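
Reviewer note on the repetition value described in the last hunk: the rule "`1` and all multiples of `7` added to it" can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the calev implementation. The function name `expand_repetition` and the upper bound of `31` (as for a day-of-month part) are assumptions made for this example.

```python
def expand_repetition(spec: str, upper: int) -> list[int]:
    """Expand a 'start/step' repetition into start, start+step, ... <= upper.

    Hypothetical helper for illustration; `upper` is an assumed inclusive
    limit for the date-time part (e.g. 31 for a day of the month).
    """
    start_text, step_text = spec.split("/")
    start, step = int(start_text), int(step_text)
    if step <= 0:
        raise ValueError("step must be positive")
    # range() stops before upper + 1, so `upper` itself is included.
    return list(range(start, upper + 1, step))


print(expand_repetition("1/7", 31))  # prints [1, 8, 15, 22, 29]
```

Under these assumptions, `1/7` is shorthand for the list `1,8,15,22,29`.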