Update documentation

Eike Kettner 2021-01-25 08:50:46 +01:00
parent e9a4f904c9
commit 946204e809
16 changed files with 154 additions and 93 deletions


@ -6,9 +6,9 @@
This release comes with major improvements to the text analysis This release comes with major improvements to the text analysis
module. It is now much more configurable, has improved results and can module. It is now much more configurable, has improved results and can
learn tags from all categories. Additionally, more languages have been learn tags from all categories. Additionally, more languages for
added (and it's now easier to add more, please open an issue if you want document processing have been added and it's now easier to add more.
more languages). Please open an issue if you want more languages to be included.
- text analysis improvements (#263, #570) - text analysis improvements (#263, #570)
- docspell can now learn from all your tag categories - docspell can now learn from all your tag categories
@ -23,15 +23,15 @@ more languages).
- Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish, - Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish,
Norwegian, Swedish, Russian, Romanian Norwegian, Swedish, Russian, Romanian
- languages have different support for text-analysis, but there is - languages have different support for text-analysis, but there is
some basic support for all, there is extended support for English, some basic support for all
German and French through [Stanford - there is extended support for English, German and French through
CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp models [Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp
- if you want more languages, please open an issue. models (as before)
- scan mailbox change (#576) - scan mailbox change (#576)
- The change from last version (#551) has been moved behind a flag - The change from last version (#551) has been moved behind a flag
in the "scan mailbox settings". Please review your scan mailbox in the "scan mailbox settings". Please review your scan mailbox
settings. tasks in your user settings.
- The scan mailbox settings form view has been changed to tab-style, - The scan mailbox settings form view has been organized into tabs,
as it grew too large for a single form. as it grew too large for a single form.
- nix tools package fixed (#584) - nix tools package fixed (#584)
- If you are using docspell tools package for nix, it has now been - If you are using docspell tools package for nix, it has now been
@ -50,22 +50,23 @@ more languages).
- This was a bug introduced by the last release. Tag categories - This was a bug introduced by the last release. Tag categories
can now be spelled upper- or lower-case. In 0.18.0 you had to can now be spelled upper- or lower-case. In 0.18.0 you had to
spell them lowercase, otherwise the search doesn't work. spell them lowercase, otherwise the search doesn't work.
- adds a workaround for mails that don't specify the used charset (#591) - adds a workaround for mails that don't specify their used charset (#591)
### Breaking Changes ### Breaking Changes
- The joex configuration changed around text analysis. If you had some - The joex configuration changed around text analysis. If you had some
custom settings there, please review these wrt the new default custom settings there, please review these wrt the new default
config. config.
- The tools package renamed the scripts to be better distinguishable, - When using the nix package manager: the tools package renamed the
since they all end up in `$PATH`. They are now prefixed by `ds-`. scripts to be better distinguishable, since they all end up in
`$PATH`. They are now prefixed by `ds-`.
- The path of the consumedir script changed in the consumedir docker - The path of the consumedir script changed in the consumedir docker
image image
- The settings of the scan-mailbox task has been extended by another - The settings of the scan-mailbox task has been extended by another
flag. It controls when to apply the post-processing (moving or flag. It controls when to apply the post-processing (moving or
deleting). If you were relying that all mails (even those excluded deleting). If you were relying that all mails (even those excluded
by a subject filter) were moved away, you need to check the by a subject filter) were moved away, you need to check your
settings. scan-mailbox task settings.
### REST Api Changes ### REST Api Changes
@ -82,6 +83,9 @@ more languages).
moved inside `text-anlysis`. Please have a look at the new moved inside `text-anlysis`. Please have a look at the new
[default config](https://docspell.org/docs/configure/#joex) if you [default config](https://docspell.org/docs/configure/#joex) if you
changed something there. changed something there.
- The `regex-ner` section has changed: the `enabled` flag has been
removed. You can now limit the number of entries to apply using
`max-entries`; a value of `0` disables it.
## v0.18.0 ## v0.18.0


@ -341,13 +341,59 @@ will degrade to `regexonly` for these.
The mode `disabled` skips NLP processing completely. This has least The mode `disabled` skips NLP processing completely. This has least
impact in memory consumption, obviously, but then only the classifier impact in memory consumption, obviously, but then only the classifier
is used to find metadata. is used to find metadata (unless it is disabled, too).
You might want to try different modes and see what combination suits You might want to try different modes and see what combination suits
best your usage pattern and machine running joex. If a powerful best your usage pattern and machine running joex. If a powerful
machine is used, simply leave the defaults. When running on an older machine is used, simply leave the defaults. When running on a
raspberry pi, for example, you might need to adjust things. raspberry pi, for example, you might need to adjust things.
### Memory Usage
The memory requirements for the joex component depend on the document
language and the enabled features for text-analysis. The `nlp.mode`
setting has significant impact, especially when your documents are in
German. Here are some rough numbers on jvm heap usage (the same file
was used for all tries):
<table class="table is-hoverable is-striped">
<thead>
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
</thead>
<tfoot>
</tfoot>
<tbody>
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
</tbody>
</table>
Note that these are only rough numbers and they show the maximum used
heap memory while processing a file.
When using `mode=full`, a heap setting of at least `-Xmx1400M` is
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
recommended.
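As a rough illustration of the recommendation above, the following Python sketch looks up the suggested heap flag for an `nlp.mode` and the headroom it leaves over the measured maximum. The names `suggest_heap` and `headroom_mb` are hypothetical helpers and not part of docspell; only the numbers are taken from the table and text above.

```python
from typing import Optional

# Maximum observed heap in MB, copied from the table above (rough numbers).
OBSERVED_HEAP_MB = {
    "full": {"English": 420, "German": 950, "French": 490},
    "basic": {"English": 170, "German": 380, "French": 390},
}

# Recommended minimum heap per mode, as stated in the text above.
RECOMMENDED_XMX_MB = {"full": 1400, "basic": 500}


def suggest_heap(mode: str) -> Optional[str]:
    """Return the recommended JVM heap flag, or None if the text gives none."""
    mb = RECOMMENDED_XMX_MB.get(mode)
    return f"-Xmx{mb}M" if mb is not None else None


def headroom_mb(mode: str, language: str) -> int:
    """Difference between the recommended heap and the observed maximum."""
    return RECOMMENDED_XMX_MB[mode] - OBSERVED_HEAP_MB[mode][language]
```

For the modes `regexonly` and `disabled`, the text gives no concrete number, so the sketch returns `None` for them.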
Other languages can't use these two modes, and so don't require this
amount of memory (but don't have as good results). Then you can go
with less heap. For these languages, the nlp mode is the same as
`regexonly`.
Training the classifier is also memory intensive; how much memory it
needs depends solely on the size and number of documents it is trained
on. However, training only runs periodically, for example every two
weeks. When classifying new documents, memory requirements are lower,
since the model already exists.
More details about these modes can be found
[here](@/docs/joex/file-processing.md#text-analysis).
The restserver component is very lightweight, here you can use
defaults.
# File Format # File Format
The format of the configuration files can be The format of the configuration files can be


@ -42,6 +42,8 @@ directory and uploads all incoming files to Docspell. The script can
watch directories recursively and can skip files already uploaded, so watch directories recursively and can skip files already uploaded, so
you can organize the files as you want in there (rename, move etc). you can organize the files as you want in there (rename, move etc).
This can be used multiple times on different machines, if desired.
The scanner should support 300dpi for better results. Docspell The scanner should support 300dpi for better results. Docspell
converts the files into PDF adding a text layer of image-only files. converts the files into PDF adding a text layer of image-only files.


@ -25,3 +25,8 @@ To get started, here are some quick links:
user provided [notes and unraid user provided [notes and unraid
templates](https://github.com/vakilando/unraid-docker-templates) templates](https://github.com/vakilando/unraid-docker-templates)
which can get you started. Thanks for providing these! which can get you started. Thanks for providing these!
Every [component](@/docs/intro/_index.md#components) (restserver, joex,
consumedir) can run on different machines and multiple times. Most of
the time, running everything on one machine is sufficient. For
simplicity, the docker-compose setup reflects this variant.


@ -27,7 +27,7 @@ result in long processing times for OCR and text analysis. The board
should provide 4G of RAM (like the current RPi4), especially if also a should provide 4G of RAM (like the current RPi4), especially if also a
database and solr are running next to it. The memory required by joex database and solr are running next to it. The memory required by joex
depends on the config and document language. Please pick a value that depends on the config and document language. Please pick a value that
suits your setup from [here](@/docs/install/running.md#memory-usage). suits your setup from [here](@/docs/configure/_index.md#memory-usage).
For boards like the RPi, it might be necessary to use For boards like the RPi, it might be necessary to use
`nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the `nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
joex pool size to 1. joex pool size to 1.


@ -45,42 +45,14 @@ when opened up to the outside, it is recommended to lock this down.
{% end %} {% end %}
## Memory Usage ## Memory
The memory requirements for the joex component depends on the document Using the options below you can define how much memory the JVM process
language and the configuration for [file is able to use. This might be necessary to adapt depending on the
processing](@/docs/configure/_index.md#file-processing). The usage scenario and configured text analysis features.
`nlp.mode` setting has significant impact, especially when your
documents are in German. Here are some rough numbers on jvm heap usage
(the same small jpeg file was used for all tries):
<table class="table is-hoverable is-striped">
<thead>
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
</thead>
<tfoot>
</tfoot>
<tbody>
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
</tbody>
</table>
When using `mode=full`, a heap setting of at least `-Xmx1400M` is
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
recommended.
Other languages can't use these two modes, and so don't require this
amount of memory (but don't have as good results). Then you can go
with less heap.
More details about these modes can be found
[here](@/docs/joex/file-processing.md#text-analysis).
The restserver component is very lightweight, here you can use
defaults.
Please have a look at the corresponding [configuration
section](@/docs/configure/_index.md#memory-usage).
## Options ## Options


@ -24,9 +24,10 @@ candidates for:
- Correspondents - Correspondents
- Concerned person or things - Concerned person or things
- A date - A date
- Tags
It will propose a few candidates and sets the most likely one to your For tags, it sets all that it thinks do apply. For the others, it will
item. propose a few candidates and sets the most likely one to your item.
This might be wrong, so it is recommended to curate the results. This might be wrong, so it is recommended to curate the results.
However, very often the correct one is either set or within the However, very often the correct one is either set or within the


@ -443,9 +443,10 @@ nlp processing as follows:
- mode `regexonly`: this doesn't load any statistical models and is - mode `regexonly`: this doesn't load any statistical models and is
therefore much lighter on memory (depending on the address book therefore much lighter on memory (depending on the address book
size, of course). It will use the address book to create regex rules size, of course). It will use the address book to create regex rules
and match them against your document. and match them against your document. Memory usage then doesn't
- mode = disabled: this disables nlp processing altogether. Then only depend on the document language.
the classifier is run (unless disabled). - mode `disabled`: this disables nlp processing. Then only the
classifier is run (unless disabled).
Note that mode `full` and `basic` is only relevant for the languages Note that mode `full` and `basic` is only relevant for the languages
where models are available. For all other languages, it is effectively where models are available. For all other languages, it is effectively

Binary files not shown: six images added (142 KiB, 152 KiB, 187 KiB, 188 KiB, 196 KiB, 184 KiB) and one image removed (184 KiB).


@ -21,19 +21,22 @@ multiple e-mail accounts you want to import periodically.
# Details # Details
Creating a task requires the following information: ## General
{{ figure(file="scanmailbox-detail.png") }} {{ figure(file="scanmailbox-detail-01.png") }}
You can enable or disable this task. A disabled task will not run You can enable or disable this task. A disabled task will not run
periodically. You can still choose to run it manually if you click the periodically. You can still choose to run it manually if you click the
`Start Once` button. `Start Once` button.
## E-Mail Settings
Then you need to specify which [IMAP Then you need to specify which [IMAP
connection](@/docs/webapp/emailsettings.md#imap-settings) to use. connection](@/docs/webapp/emailsettings.md#imap-settings) to use.
## Processing
{{ figure(file="scanmailbox-detail-02.png") }}
A list of folders is required. Docspell will only look into these A list of folders is required. Docspell will only look into these
folders. You can specify multiple folders. The "Inbox" folder is a folders. You can specify multiple folders. The "Inbox" folder is a
special folder, which will usually appear translated in your web-mail special folder, which will usually appear translated in your web-mail
@ -43,30 +46,20 @@ mails in your inbox. Any other folder is usually case-sensitive
except the INBOX folder). Type in a folder name and click the add except the INBOX folder). Type in a folder name and click the add
button on the right. button on the right.
The next two settings tell docspell what to do once a mail has been
submitted to docspell. It can be moved into another folder in your
mail account. This moves it out of the way for the next run. You can
also choose to delete the mail, but *note that it will really be
deleted and not moved to your trash folder*. If both options are off,
nothing happens with that mail, it simply stays (and could be re-read
on the next run).
Be careful when mails are neither moved nor deleted after processing.
They could be selected anew in the next run, meaning that the job can
not progress, because it filters out the same mails all the time. You
can however, simply schedule the task in an interval >= the `Received
Since Hours` setting.
## Filtering
The following properties allow to filter mails that are imported.
Then the field *Received Since Hours* defines how many hours to go Then the field *Received Since Hours* defines how many hours to go
back and look for mails. Usually there are many mails in your inbox back and look for mails. Usually there are many mails in your inbox
and importing them all at once is not feasible or desirable. It can and importing them all at once is not feasible or desirable. It can
work together with the *Schedule* field below. For example, you could work together with the *Schedule* field below. For example, you could
run this task all 6 hours and read mails from 8 hours back. run this task all 6 hours and read mails from 8 hours back. This
setting is used to query the mail server.
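To make the interplay of *Received Since Hours* and the schedule concrete, here is a small Python sketch of the 6-hour/8-hour example. It is only an illustration of the arithmetic; the function names are hypothetical and docspell's actual query logic may differ.

```python
from datetime import datetime, timedelta


def received_since_cutoff(now: datetime, received_since_hours: int) -> datetime:
    """Oldest receive time that is still considered in this run."""
    return now - timedelta(hours=received_since_hours)


def overlap_hours(schedule_interval_hours: int, received_since_hours: int) -> int:
    """Hours by which consecutive runs overlap (negative means a gap)."""
    return received_since_hours - schedule_interval_hours


# A run at noon that looks 8 hours back considers mails received since 04:00.
now = datetime(2021, 1, 25, 12, 0)
cutoff = received_since_cutoff(now, 8)
```

With a 6-hour schedule and 8 hours of look-back, consecutive runs overlap by 2 hours, so mails arriving around a run boundary are not missed. A negative overlap would mean some mails are never looked at.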
## Additional Filter
{{ figure(file="scanmailbox-detail-03.png") }}
The following properties allow to filter those downloaded mails that
should be imported.
The *File Filter* can be specified as a glob to only import mail The *File Filter* can be specified as a glob to only import mail
attachments based on their file name. For example, a value of `*.pdf` attachments based on their file name. For example, a value of `*.pdf`
@ -82,10 +75,38 @@ pattern. For example, if your scanner mails to you with a certain
subject like _"Scanned Document 214"_, you could include those via a subject like _"Scanned Document 214"_, you could include those via a
`Scanned Document*` pattern. `Scanned Document*` pattern.
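A minimal sketch of how such glob patterns behave, using Python's `fnmatch` module as a stand-in. Docspell's own glob matching may differ in details such as case sensitivity, and the sample file names and subjects are made up for illustration.

```python
from fnmatch import fnmatchcase


def matches(pattern: str, value: str) -> bool:
    """Case-sensitive glob match of value against pattern."""
    return fnmatchcase(value, pattern)


# File filter: keep only attachments whose name matches the glob.
attachments = ["invoice.pdf", "photo.jpg", "letter.PDF"]
pdf_only = [name for name in attachments if matches("*.pdf", name)]

# Subject filter: keep only mails whose subject matches the glob.
subjects = ["Scanned Document 214", "Newsletter"]
scanned = [s for s in subjects if matches("Scanned Document*", s)]
```

In this sketch `letter.PDF` is not matched, because `fnmatchcase` compares case-sensitively. Whether docspell treats case the same way is an assumption you should verify against your own mails.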
## Post Processing
{{ figure(file="scanmailbox-detail-04.png") }}
The next settings tell docspell what to do once a mail has been read
by docspell. It can be moved into another folder in your mail account.
This moves it out of the way for the next run. You can also choose to
delete the mail, but *note that it will really be deleted and not
moved to your trash folder*. If both options are off, nothing happens
with that mail, it simply stays (and could be re-read on the next
run).
Be careful when mails are neither moved nor deleted after processing.
They could be selected anew in the next run, meaning that the job
cannot progress, because it filters out the same mails all the time.
You can, however, simply schedule the task in an interval >= the `Received
Since Hours` setting.
By default, post-processing is only applied to mails that have been
*submitted to docspell*. Some mails may have been skipped due to
subject filtering. If you also want these skipped mails to be affected
by post-processing, enable the *Apply post-processing to all fetched
mails* option.
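The rule described above can be written down as a small decision function. This is a hedged Python sketch of the documented behavior, not docspell's implementation; the function and parameter names are hypothetical.

```python
def should_post_process(submitted: bool, apply_to_all_fetched: bool) -> bool:
    """Decide whether a fetched mail is moved or deleted after the run.

    submitted: the mail passed the filters and was submitted to docspell.
    apply_to_all_fetched: the "apply post-processing to all fetched mails" flag.
    """
    return submitted or apply_to_all_fetched
```

With the flag off, a mail that was skipped by the subject filter stays where it is. With the flag on, it is moved or deleted like the submitted ones.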
## Metadata ## Metadata
The last properties allow to specify some metadata that are {{ figure(file="scanmailbox-detail-05.png") }}
automatically attached to the items being created.
These properties allow to specify some metadata that are automatically
attached to the items being created.
Every item in docspell has a direction value (incoming or outgoing). Every item in docspell has a direction value (incoming or outgoing).
If you know that all mails you want to import have a specific If you know that all mails you want to import have a specific
@ -104,17 +125,26 @@ resulting items.
The *Tags* setting can be used to associate a fixed number of tags to The *Tags* setting can be used to associate a fixed number of tags to
all items that are imported from this mail task. all items that are imported from this mail task.
The last field is the *Schedule* which defines when and how often this The *Language* setting is applied when processing the mails. If not
task should run. The syntax is similar to a date-time string, like set, the default language of the collective is used.
`2019-09-15 12:32`, where each part is a pattern to also match multiple
values. The ui tries to help a little by displaying the next two
date-times this task would execute. A more in depth help is available ## Schedule
[here](https://github.com/eikek/calev#what-are-calendar-events). For
example, to execute the task every monday at noon, you would write: {{ figure(file="scanmailbox-detail-06.png") }}
`Mon *-*-* 12:00`. A date-time part can match all values (`*`), a list
of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long lists may At last the *Schedule* defines when and how often this task should
be written in a shorter way using a repetition value. It is written run. The syntax is similar to a date-time string, like `2019-09-15
like this: `1/7` which is the same as a list with `1` and all 12:32`, where each part is a pattern to also match multiple values. The
ui tries to help a little by displaying the next two date-times this
task would execute. A more in depth help is available
[here](https://github.com/eikek/calev#what-are-calendar-events).
For example, to execute the task every monday at noon, you would
write: `Mon *-*-* 12:00`. A date-time part can match all values (`*`),
a list of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long
lists may be written in a shorter way using a repetition value. It is
written like this: `1/7` which is the same as a list with `1` and all
multiples of `7` added to it. In other words, it matches `1`, `1+7`, multiples of `7` added to it. In other words, it matches `1`, `1+7`,
`1+7+7`, `1+7+7+7` and so on. `1+7+7`, `1+7+7+7` and so on.
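The pattern forms described above (`*`, a list, a range and a repetition) can be illustrated with a short Python sketch that expands one date-time part into the values it matches. This is not the calev implementation; `expand_part` is a hypothetical helper. Two assumptions go beyond the text: the caller supplies the bounds, and ranges or repetitions may appear as list items.

```python
from typing import List


def expand_part(part: str, lo: int, hi: int) -> List[int]:
    """Expand one date-time part into the values it matches within [lo, hi]."""
    if part == "*":
        return list(range(lo, hi + 1))
    values = set()
    for item in part.split(","):
        if "/" in item:
            # Repetition: start value plus all multiples of the step.
            start, step = (int(x) for x in item.split("/"))
            values.update(range(start, hi + 1, step))
        elif ".." in item:
            # Inclusive range, e.g. 1..9.
            first, last = (int(x) for x in item.split(".."))
            values.update(range(first, last + 1))
        else:
            values.add(int(item))
    return sorted(v for v in values if lo <= v <= hi)
```

For a day-of-month part with bounds 1 to 31, `expand_part("1/7", 1, 31)` yields `1`, `8`, `15`, `22` and `29`, which is the `1`, `1+7`, `1+7+7` sequence described above.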