Update documentation

Eike Kettner 2021-01-25 08:50:46 +01:00
parent e9a4f904c9
commit 946204e809
16 changed files with 154 additions and 93 deletions

View File

@ -6,9 +6,9 @@
This release comes with major improvements to the text analysis
module. It is now much more configurable, has improved results and can
learn tags from all categories. Additionally, more languages have been
added (and it's now easier to add more, please open an issue if you want
more languages).
learn tags from all categories. Additionally, more languages for
document processing have been added and it's now easier to add more.
Please open an issue if you want more languages to be included.
- text analysis improvements (#263, #570)
- docspell can now learn from all your tag categories
@ -23,15 +23,15 @@ more languages).
- Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish,
Norwegian, Swedish, Russian, Romanian
- languages have different support for text-analysis, but there is
some basic support for all, there is extended support for English,
German and French through [Stanford
CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp models
- if you want more languages, please open an issue.
some basic support for all
- there is extended support for English, German and French through
[Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp
models (as before)
- scan mailbox change (#576)
- The change from last version (#551) has been moved behind a flag
in the "scan mailbox settings". Please review your scan mailbox
settings.
- The scan mailbox settings form view has been changed to tab-style,
tasks in your user settings.
- The scan mailbox settings form view has been organized into tabs,
as it grew too large for a single form.
- nix tools package fixed (#584)
- If you are using docspell tools package for nix, it has now been
@ -50,22 +50,23 @@ more languages).
- This was a bug introduced by the last release. Tag categories
can now be spelled upper- or lower-case. In 0.18.0 you had to
spell them lowercase, otherwise the search didn't work.
- adds a workaround for mails that don't specify the used charset (#591)
- adds a workaround for mails that don't specify their used charset (#591)
### Breaking Changes
- The joex configuration changed around text analysis. If you had some
custom settings there, please review these wrt the new default
config.
- The tools package renamed the scripts to be better distinguishable,
since they all end up in `$PATH`. They are now prefixed by `ds-`.
- When using the nix package manager: the tools package renamed the
scripts to be better distinguishable, since they all end up in
`$PATH`. They are now prefixed by `ds-`.
- The path of the consumedir script changed in the consumedir docker
image
- The settings of the scan-mailbox task have been extended by another
flag. It controls when to apply the post-processing (moving or
deleting). If you were relying that all mails (even those excluded
by a subject filter) were moved away, you need to check the
settings.
by a subject filter) were moved away, you need to check your
scan-mailbox task settings.
### REST Api Changes
@ -82,6 +83,9 @@ more languages).
moved inside `text-analysis`. Please have a look at the new
[default config](https://docspell.org/docs/configure/#joex) if you
changed something there.
- The `regex-ner` section has changed: the `enabled` flag has been
removed. You can now limit the number of entries to apply using
`max-entries`; a value of `0` disables it.
## v0.18.0

View File

@ -341,13 +341,59 @@ will degrade to `regexonly` for these.
The mode `disabled` skips NLP processing completely. This has the least
impact on memory consumption, obviously, but then only the classifier
is used to find metadata.
is used to find metadata (unless it is disabled, too).
You might want to try different modes and see what combination best
suits your usage pattern and the machine running joex. If a powerful
machine is used, simply leave the defaults. When running on an older
machine is used, simply leave the defaults. When running on a
Raspberry Pi, for example, you might need to adjust things.
### Memory Usage
The memory requirements for the joex component depend on the document
language and the enabled features for text-analysis. The `nlp.mode`
setting has a significant impact, especially when your documents are in
German. Here are some rough numbers on JVM heap usage (the same file
was used for all runs):
<table class="table is-hoverable is-striped">
<thead>
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
</thead>
<tfoot>
</tfoot>
<tbody>
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
</tbody>
</table>
Note that these are only rough numbers and they show the maximum used
heap memory while processing a file.
When using `mode=full`, a heap setting of at least `-Xmx1400M` is
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
recommended.
Other languages can't use these two modes, and so don't require this
amount of memory (but don't have as good results). Then you can go
with less heap. For these languages, the nlp mode is the same as
`regexonly`.
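The recommendations above can be condensed into a small lookup. The following Python sketch is only an illustration: the function name `recommended_heap` and the language set are assumptions made for this example and are not part of docspell. The heap values are the ones quoted above.

```python
from typing import Optional

# Languages for which, per the text above, nlp models are available.
MODEL_LANGUAGES = {"english", "german", "french"}

# Minimum heap recommendations quoted above, keyed by nlp.mode.
RECOMMENDED_HEAP = {"full": "-Xmx1400M", "basic": "-Xmx500M"}


def recommended_heap(mode: str, language: str) -> Optional[str]:
    """Return the documented minimum heap flag, or None if none is stated."""
    if language.lower() not in MODEL_LANGUAGES:
        # Other languages effectively run as `regexonly`; the text gives
        # no concrete figure for them, so nothing is returned here.
        return None
    return RECOMMENDED_HEAP.get(mode)


print(recommended_heap("full", "German"))
print(recommended_heap("basic", "Danish"))
```

The returned value is meant to be passed to the JVM that runs joex. For languages without models, the sketch deliberately returns nothing rather than guessing a number.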
Training the classifier is also memory intensive; how much memory it
needs depends solely on the size and number of documents used for
training. However, the classifier is only trained periodically, for
example every two weeks. When classifying new documents, memory
requirements are lower, since the model already exists.
More details about these modes can be found
[here](@/docs/joex/file-processing.md#text-analysis).
The restserver component is very lightweight, here you can use
defaults.
# File Format
The format of the configuration files can be

View File

@ -42,6 +42,8 @@ directory and uploads all incoming files to Docspell. The script can
watch directories recursively and can skip files already uploaded, so
you can organize the files as you want in there (rename, move etc).
This can be used multiple times on different machines, if desired.
The scanner should support 300dpi for better results. Docspell
converts the files into PDF, adding a text layer to image-only files.

View File

@ -25,3 +25,8 @@ To get started, here are some quick links:
user provided [notes and unraid
templates](https://github.com/vakilando/unraid-docker-templates)
which can get you started. Thanks for providing these!
Every [component](@/docs/intro/_index.md#components) (restserver, joex,
consumedir) can run on different machines and multiple times. Most of
the time, running everything on one machine is sufficient; for
simplicity, the docker-compose setup reflects this variant.

View File

@ -27,7 +27,7 @@ result in long processing times for OCR and text analysis. The board
should provide 4G of RAM (like the current RPi4), especially if also a
database and solr are running next to it. The memory required by joex
depends on the config and document language. Please pick a value that
suits your setup from [here](@/docs/install/running.md#memory-usage).
suits your setup from [here](@/docs/configure/_index.md#memory-usage).
For boards like the RPi, it might be necessary to use
`nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
joex pool size to 1.

View File

@ -45,42 +45,14 @@ when opened up to the outside, it is recommended to lock this down.
{% end %}
## Memory Usage
## Memory
The memory requirements for the joex component depends on the document
language and the configuration for [file
processing](@/docs/configure/_index.md#file-processing). The
`nlp.mode` setting has significant impact, especially when your
documents are in German. Here are some rough numbers on jvm heap usage
(the same small jpeg file was used for all tries):
<table class="table is-hoverable is-striped">
<thead>
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
</thead>
<tfoot>
</tfoot>
<tbody>
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
</tbody>
</table>
When using `mode=full`, a heap setting of at least `-Xmx1400M` is
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
recommended.
Other languages can't use these two modes, and so don't require this
amount of memory (but don't have as good results). Then you can go
with less heap.
More details about these modes can be found
[here](@/docs/joex/file-processing.md#text-analysis).
The restserver component is very lightweight, here you can use
defaults.
Using the options below you can define how much memory the JVM process
is able to use. It might be necessary to adapt this depending on the
usage scenario and the configured text analysis features.
Please have a look at the corresponding [configuration
section](@/docs/configure/_index.md#memory-usage).
## Options

View File

@ -24,9 +24,10 @@ candidates for:
- Correspondents
- Concerned person or things
- A date
- Tags
It will propose a few candidates and sets the most likely one to your
item.
For tags, it sets all that it thinks apply. For the others, it
proposes a few candidates and sets the most likely one on your item.
This might be wrong, so it is recommended to curate the results.
However, very often the correct one is either set or within the

View File

@ -443,9 +443,10 @@ nlp processing as follows:
- mode `regexonly`: this doesn't load any statistical models and is
therefore much lighter on memory (depending on the address book
size, of course). It will use the address book to create regex rules
and match them against your document.
- mode = disabled: this disables nlp processing altogether. Then only
the classifier is run (unless disabled).
and match them against your document. Memory usage then doesn't
depend on the document language.
- mode `disabled`: this disables nlp processing. Then only the
classifier is run (unless disabled).
Note that modes `full` and `basic` are only relevant for the languages
where models are available. For all other languages, it is effectively

Binary file not shown. After: 142 KiB

Binary file not shown. After: 152 KiB

Binary file not shown. After: 187 KiB

Binary file not shown. After: 188 KiB

Binary file not shown. After: 196 KiB

Binary file not shown. After: 184 KiB

Binary file not shown. Before: 184 KiB

View File

@ -21,19 +21,22 @@ multiple e-mail accounts you want to import periodically.
# Details
Creating a task requires the following information:
## General
{{ figure(file="scanmailbox-detail.png") }}
{{ figure(file="scanmailbox-detail-01.png") }}
You can enable or disable this task. A disabled task will not run
periodically. You can still choose to run it manually if you click the
`Start Once` button.
## E-Mail Settings
Then you need to specify which [IMAP
connection](@/docs/webapp/emailsettings.md#imap-settings) to use.
## Processing
{{ figure(file="scanmailbox-detail-02.png") }}
A list of folders is required. Docspell will only look into these
folders. You can specify multiple folders. The "Inbox" folder is a
special folder, which will usually appear translated in your web-mail
@ -43,30 +46,20 @@ mails in your inbox. Any other folder is usually case-sensitive
except the INBOX folder). Type in a folder name and click the add
button on the right.
The next two settings tell docspell what to do once a mail has been
submitted to docspell. It can be moved into another folder in your
mail account. This moves it out of the way for the next run. You can
also choose to delete the mail, but *note that it will really be
deleted and not moved to your trash folder*. If both options are off,
nothing happens with that mail, it simply stays (and could be re-read
on the next run).
Be careful when mails are neither moved nor deleted after processing.
They could be selected anew in the next run, meaning that the job can
not progress, because it filters out the same mails all the time. You
can however, simply schedule the task in an interval >= the `Received
Since Hours` setting.
## Filtering
The following properties allow to filter mails that are imported.
Then the field *Received Since Hours* defines how many hours to go
back and look for mails. Usually there are many mails in your inbox
and importing them all at once is not feasible or desirable. It can
work together with the *Schedule* field below. For example, you could
run this task every 6 hours and read mails from 8 hours back.
run this task every 6 hours and read mails from 8 hours back. This
setting is used to query the mail server.
## Additional Filter
{{ figure(file="scanmailbox-detail-03.png") }}
The following properties allow you to filter those downloaded mails
that should be imported.
The *File Filter* can be specified as a glob to only import mail
attachments based on their file name. For example, a value of `*.pdf`
@ -82,10 +75,38 @@ pattern. For example, if your scanner mails to you with a certain
subject like _"Scanned Document 214"_, you could include those via a
`Scanned Document*` pattern.
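To get a feel for how such glob patterns select mails and attachments, here is a small Python sketch. It uses Python's `fnmatch` module as a stand-in. Docspell's own glob matching may differ in details such as case sensitivity, and the sample subjects and file names are made up for this example.

```python
from fnmatch import fnmatchcase
from typing import List

# Made-up sample data, for illustration only.
subjects = ["Scanned Document 214", "Newsletter January", "Scanned Document 215"]
attachments = ["invoice.pdf", "logo.png", "letter.PDF"]


def select(names: List[str], pattern: str) -> List[str]:
    """Keep only the names that match the glob pattern (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


print(select(subjects, "Scanned Document*"))
print(select(attachments, "*.pdf"))
```

In this sketch `letter.PDF` is not selected by `*.pdf`, because `fnmatchcase` compares case-sensitively. Whether docspell treats case the same way is not stated here, so it is worth testing a pattern against a few real mails.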
## Post Processing
{{ figure(file="scanmailbox-detail-04.png") }}
The next settings tell docspell what to do once a mail has been read
by docspell. It can be moved into another folder in your mail account.
This moves it out of the way for the next run. You can also choose to
delete the mail, but *note that it will really be deleted and not
moved to your trash folder*. If both options are off, nothing happens
with that mail, it simply stays (and could be re-read on the next
run).
Be careful when mails are neither moved nor deleted after processing.
They could be selected again in the next run, meaning that the job
cannot progress, because it keeps filtering out the same mails. You
can, however, simply schedule the task at an interval >= the `Received
Since Hours` setting.
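The relation between the schedule interval and the look-back window can be checked with a one-line comparison. The Python sketch below is a simplified model: the function name `windows_overlap` is hypothetical, and it ignores how long a run takes and any clock differences between docspell and the mail server.

```python
def windows_overlap(interval_hours: float, received_since_hours: float) -> bool:
    """True if a mail can fall into the look-back window of two consecutive runs."""
    return received_since_hours > interval_hours


# Running every 6 hours while looking 8 hours back: windows overlap by 2 hours,
# so mails that stay in the folder can be selected again.
print(windows_overlap(interval_hours=6, received_since_hours=8))

# Running every 8 hours while looking 8 hours back: no overlap in this model.
print(windows_overlap(interval_hours=8, received_since_hours=8))
```

An overlap is harmless when mails are moved or deleted after processing. It only matters when they stay in the folder.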
By default, post-processing is only applied to mails that have been
*submitted to docspell*. Some mails may have been skipped due to
subject filtering. If you also want these skipped mails to be affected
by post-processing, enable the *Apply post-processing to all fetched
mails* option.
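The rule described above boils down to a simple decision per fetched mail. The following Python sketch only restates that rule; the function and parameter names are hypothetical and do not mirror docspell's code.

```python
def should_post_process(submitted: bool, apply_to_all_fetched: bool) -> bool:
    """Decide whether a fetched mail is moved or deleted after the run.

    submitted: the mail passed the filters and was submitted to docspell.
    apply_to_all_fetched: the "apply post-processing to all fetched mails" flag.
    """
    return submitted or apply_to_all_fetched


# A mail skipped by the subject filter stays untouched by default ...
print(should_post_process(submitted=False, apply_to_all_fetched=False))
# ... and is post-processed once the flag is enabled.
print(should_post_process(submitted=False, apply_to_all_fetched=True))
```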
## Metadata
The last properties allow to specify some metadata that are
automatically attached to the items being created.
{{ figure(file="scanmailbox-detail-05.png") }}
These properties allow you to specify some metadata that is
automatically attached to the items being created.
Every item in docspell has a direction value (incoming or outgoing).
If you know that all mails you want to import have a specific
@ -104,17 +125,26 @@ resulting items.
The *Tags* setting can be used to associate a fixed number of tags to
all items that are imported from this mail task.
The last field is the *Schedule* which defines when and how often this
task should run. The syntax is similar to a date-time string, like
`2019-09-15 12:32`, where each part is a pattern to also match multiple
values. The ui tries to help a little by displaying the next two
date-times this task would execute. A more in depth help is available
[here](https://github.com/eikek/calev#what-are-calendar-events). For
example, to execute the task every monday at noon, you would write:
`Mon *-*-* 12:00`. A date-time part can match all values (`*`), a list
of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long lists may
be written in a shorter way using a repetition value. It is written
like this: `1/7` which is the same as a list with `1` and all
The *Language* setting is applied when processing the mails. If not
set, the default language of the collective is used.
## Schedule
{{ figure(file="scanmailbox-detail-06.png") }}
Finally, the *Schedule* defines when and how often this task should
run. The syntax is similar to a date-time string, like
`2019-09-15 12:32`, where each part is a pattern that can also match
multiple values. The UI tries to help a little by displaying the next
two date-times at which this task would execute. More in-depth help is
available [here](https://github.com/eikek/calev#what-are-calendar-events).
For example, to execute the task every Monday at noon, you would
write: `Mon *-*-* 12:00`. A date-time part can match all values (`*`),
a list of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long
lists may be written in a shorter way using a repetition value. It is
written like this: `1/7` which is the same as a list with `1` and all
multiples of `7` added to it. In other words, it matches `1`, `1+7`,
`1+7+7`, `1+7+7+7` and so on.
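To make the pattern forms concrete, the following Python sketch expands a single date-time part into the values it matches. It is not the parser used by docspell or calev. The function name `expand_part` and the explicit `lo`/`hi` bounds are assumptions for this example, and it only covers the forms described above (`*`, lists, ranges and repetitions).

```python
from typing import List


def expand_part(part: str, lo: int, hi: int) -> List[int]:
    """Expand one schedule part into the sorted values it matches in [lo, hi]."""
    if part == "*":
        # Matches every value of this part.
        return list(range(lo, hi + 1))
    values = set()
    for item in part.split(","):
        if "/" in item:
            # Repetition: start value plus all multiples of the step added to it.
            start, step = (int(x) for x in item.split("/"))
            values.update(range(start, hi + 1, step))
        elif ".." in item:
            # Inclusive range, e.g. 1..9.
            first, last = (int(x) for x in item.split(".."))
            values.update(range(first, last + 1))
        else:
            # Single value.
            values.add(int(item))
    return sorted(v for v in values if lo <= v <= hi)


# Day-of-month part with bounds 1..31.
print(expand_part("1/7", 1, 31))        # [1, 8, 15, 22, 29]
print(expand_part("1,5,12,19", 1, 31))  # [1, 5, 12, 19]
print(expand_part("1..9", 1, 31))       # [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The bounds are passed explicitly because each part of a date-time has its own valid range, for example `0` to `23` for hours.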