Update documentation
32
Changelog.md
@ -6,9 +6,9 @@

This release comes with major improvements to the text analysis
module. It is now much more configurable, has improved results and can
learn tags from all categories. Additionally, more languages have been
added (and it's now easier to add more, please open an issue if you want
more languages).
learn tags from all categories. Additionally, more languages for
document processing have been added and it's now easier to add more.
Please open an issue if you want more languages to be included.

- text analysis improvements (#263, #570)
- docspell can now learn from all your tag categories
@ -23,15 +23,15 @@ more languages).
- Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish,
Norwegian, Swedish, Russian, Romanian
- languages have different support for text-analysis, but there is
some basic support for all, there is extended support for English,
German and French through [Stanford
CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp models
- if you want more languages, please open an issue.
some basic support for all
- there is extended support for English, German and French through
[Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp
models (as before)
- scan mailbox change (#576)
- The change from last version (#551) has been moved behind a flag
in the "scan mailbox settings". Please review your scan mailbox
settings.
- The scan mailbox settings form view has been changed to tab-style,
tasks in your user settings.
- The scan mailbox settings form view has been organized into tabs,
as it grew too large for a single form.
- nix tools package fixed (#584)
- If you are using docspell tools package for nix, it has now been
@ -50,22 +50,23 @@ more languages).
- This was a bug introduced by the last release. When tag categories
can now be spelled upper- or lower-case. In 0.18.0 you had to
spell them lowercase, otherwise the search doesn't work.
- adds a workaround for mails that don't specify the used charset (#591)
- adds a workaround for mails that don't specify their used charset (#591)

### Breaking Changes

- The joex configuration changed around text analysis. If you had some
custom settings there, please review these wrt the new default
config.
- The tools package renamed the scripts to be better distinguishable,
since they all end up in `$PATH`. They are now prefixed by `ds-`.
- When using the nix package manager: the tools package renamed the
scripts to be better distinguishable, since they all end up in
`$PATH`. They are now prefixed by `ds-`.
- The path of the consumedir script changed in the consumedir docker
image
- The settings of the scan-mailbox task have been extended by another
flag. It controls when to apply the post-processing (moving or
deleting). If you were relying that all mails (even those excluded
by a subject filter) were moved away, you need to check the
settings.
by a subject filter) were moved away, you need to check your
scan-mailbox task settings.

### REST Api Changes

@ -82,6 +83,9 @@ more languages).
moved inside `text-anlysis`. Please have a look at the new
[default config](https://docspell.org/docs/configure/#joex) if you
changed something there.
- The `regex-ner` section has changed: the `enabled` flag has been
removed, you can now limit the number of entries using
`max-entries` to apply and `0` means to disable it.


## v0.18.0
@ -341,13 +341,59 @@ will degrade to `regexonly` for these.

The mode `disabled` skips NLP processing completely. This has the least
impact on memory consumption, obviously, but then only the classifier
is used to find metadata.
is used to find metadata (unless it is disabled, too).

You might want to try different modes and see what combination best
suits your usage pattern and machine running joex. If a powerful
machine is used, simply leave the defaults. When running on an older
machine is used, simply leave the defaults. When running on a
raspberry pi, for example, you might need to adjust things.

### Memory Usage

The memory requirements for the joex component depend on the document
language and the enabled features for text-analysis. The `nlp.mode`
setting has significant impact, especially when your documents are in
German. Here are some rough numbers on jvm heap usage (the same file
was used for all tries):

<table class="table is-hoverable is-striped">
<thead>
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
</thead>
<tfoot>
</tfoot>
<tbody>
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
</tbody>
</table>

Note that these are only rough numbers and they show the maximum used
heap memory while processing a file.

When using `mode=full`, a heap setting of at least `-Xmx1400M` is
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
recommended.

Other languages can't use these two modes, and so don't require this
amount of memory (but don't have as good results). Then you can go
with less heap. For these languages, the nlp mode is the same as
`regexonly`.
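
The heap recommendations above can be summarized in a small lookup. This is only an illustrative sketch, not part of Docspell: the function name `recommended_heap` and the language codes are assumptions made for the example. Only the two `-Xmx` values and the fallback to `regexonly` for other languages come from the text.

```python
from typing import Optional

# Languages with statistical models according to the text (English, German,
# French). The three-letter codes are an assumption made for this example.
LANGS_WITH_MODELS = {"eng", "deu", "fra"}

# Minimum heap recommendations quoted in the text above.
RECOMMENDED_XMX = {"full": "-Xmx1400M", "basic": "-Xmx500M"}


def recommended_heap(language: str, mode: str) -> Optional[str]:
    """Return the suggested minimum -Xmx flag, or None if the text gives no number."""
    if language not in LANGS_WITH_MODELS:
        # Other languages effectively run as `regexonly`, whatever mode is set.
        return None
    # `regexonly` and `disabled` have no documented recommendation either.
    return RECOMMENDED_XMX.get(mode)
```

A `None` result means the text gives no concrete number for that combination, so the heap can probably be set lower than for `full` or `basic`.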

Training the classifier is also memory intensive, which solely depends
on the size and number of documents that are being trained. However,
training the classifier is done periodically and can happen maybe
every two weeks. When classifying new documents, memory requirements
are lower, since the model already exists.

More details about these modes can be found
[here](@/docs/joex/file-processing.md#text-analysis).


The restserver component is very lightweight, here you can use
defaults.


# File Format

The format of the configuration files can be
@ -42,6 +42,8 @@ directory and uploads all incoming files to Docspell. The script can
watch directories recursively and can skip files already uploaded, so
you can organize the files as you want in there (rename, move etc).

This can be used multiple times on different machines, if desired.

The scanner should support 300dpi for better results. Docspell
converts the files into PDF adding a text layer of image-only files.

@ -25,3 +25,8 @@ To get started, here are some quick links:
user provided [notes and unraid
templates](https://github.com/vakilando/unraid-docker-templates)
which can get you started. Thanks for providing these!

Every [component](@docs/intro/_index.md#components) (restserver, joex,
consumedir) can run on different machines and multiple times. Most of
the time running all on one machine is sufficient and also for
simplicity, the docker-compose setup reflects this variant.
@ -27,7 +27,7 @@ result in long processing times for OCR and text analysis. The board
should provide 4G of RAM (like the current RPi4), especially if also a
database and solr are running next to it. The memory required by joex
depends on the config and document language. Please pick a value that
suits your setup from [here](@/docs/install/running.md#memory-usage).
suits your setup from [here](@/docs/configure/_index.md#memory-usage).
For boards like the RPi, it might be necessary to use
`nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
joex pool size to 1.
@ -45,42 +45,14 @@ when opened up to the outside, it is recommended to lock this down.

{% end %}

## Memory Usage
## Memory

The memory requirements for the joex component depend on the document
language and the configuration for [file
processing](@/docs/configure/_index.md#file-processing). The
`nlp.mode` setting has significant impact, especially when your
documents are in German. Here are some rough numbers on jvm heap usage
(the same small jpeg file was used for all tries):

<table class="table is-hoverable is-striped">
<thead>
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
</thead>
<tfoot>
</tfoot>
<tbody>
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
</tbody>
</table>

When using `mode=full`, a heap setting of at least `-Xmx1400M` is
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
recommended.

Other languages can't use these two modes, and so don't require this
amount of memory (but don't have as good results). Then you can go
with less heap.

More details about these modes can be found
[here](@/docs/joex/file-processing.md#text-analysis).


The restserver component is very lightweight, here you can use
defaults.
Using the options below you can define how much memory the JVM process
is able to use. This might be necessary to adapt depending on the
usage scenario and configured text analysis features.

Please have a look at the corresponding [configuration
section](@/docs/configure/_index.md#memory-usage).

## Options

@ -24,9 +24,10 @@ candidates for:
- Correspondents
- Concerned person or things
- A date
- Tags

It will propose a few candidates and sets the most likely one to your
item.
For tags, it sets all that it thinks do apply. For the others, it will
propose a few candidates and sets the most likely one to your item.

This might be wrong, so it is recommended to curate the results.
However, very often the correct one is either set or within the
@ -443,9 +443,10 @@ nlp processing as follows:
- mode `regexonly`: this doesn't load any statistical models and is
therefore much lighter on memory (depending on the address book
size, of course). It will use the address book to create regex rules
and match them against your document.
- mode = disabled: this disables nlp processing altogether. Then only
the classifier is run (unless disabled).
and match them against your document. Memory usage then doesn't
depend on the document language.
- mode `disabled`: this disables nlp processing. Then only the
classifier is run (unless disabled).

Note that mode `full` and `basic` is only relevant for the languages
where models are available. For all other languages, it is effectively
BIN
website/site/content/docs/webapp/scanmailbox-detail-01.png
Normal file
After Width: | Height: | Size: 142 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-02.png
Normal file
After Width: | Height: | Size: 152 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-03.png
Normal file
After Width: | Height: | Size: 187 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-04.png
Normal file
After Width: | Height: | Size: 188 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-05.png
Normal file
After Width: | Height: | Size: 196 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-06.png
Normal file
After Width: | Height: | Size: 184 KiB |
Before Width: | Height: | Size: 184 KiB |
@ -21,19 +21,22 @@ multiple e-mail accounts you want to import periodically.

# Details

Creating a task requires the following information:
## General

{{ figure(file="scanmailbox-detail.png") }}
{{ figure(file="scanmailbox-detail-01.png") }}

You can enable or disable this task. A disabled task will not run
periodically. You can still choose to run it manually if you click the
`Start Once` button.

## E-Mail Settings

Then you need to specify which [IMAP
connection](@/docs/webapp/emailsettings.md#imap-settings) to use.


## Processing

{{ figure(file="scanmailbox-detail-02.png") }}

A list of folders is required. Docspell will only look into these
folders. You can specify multiple folders. The "Inbox" folder is a
special folder, which will usually appear translated in your web-mail
@ -43,30 +46,20 @@ mails in your inbox. Any other folder is usually case-sensitive
except the INBOX folder). Type in a folder name and click the add
button on the right.

The next two settings tell docspell what to do once a mail has been
submitted to docspell. It can be moved into another folder in your
mail account. This moves it out of the way for the next run. You can
also choose to delete the mail, but *note that it will really be
deleted and not moved to your trash folder*. If both options are off,
nothing happens with that mail, it simply stays (and could be re-read
on the next run).

Be careful when mails are neither moved nor deleted after processing.
They could be selected anew in the next run, meaning that the job can
not progress, because it filters out the same mails all the time. You
can however, simply schedule the task in an interval >= the `Received
Since Hours` setting.


## Filtering

The following properties allow to filter mails that are imported.

Then the field *Received Since Hours* defines how many hours to go
back and look for mails. Usually there are many mails in your inbox
and importing them all at once is not feasible or desirable. It can
work together with the *Schedule* field below. For example, you could
run this task every 6 hours and read mails from 8 hours back.
run this task every 6 hours and read mails from 8 hours back. This
setting is used to query the mail server.


## Additional Filter

{{ figure(file="scanmailbox-detail-03.png") }}

The following properties allow to filter those downloaded mails that
should be imported.

The *File Filter* can be specified as a glob to only import mail
attachments based on their file name. For example, a value of `*.pdf`
@ -82,10 +75,38 @@ pattern. For example, if your scanner mails to you with a certain
subject like _"Scanned Document 214"_, you could include those via a
`Scanned Document*` pattern.
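
To get a feeling for how such glob patterns behave, here is a minimal sketch using Python's standard `fnmatch` module. It only illustrates glob matching in general. Docspell's own matcher may differ in details such as case sensitivity, and the helper name `glob_matches` is made up for this example.

```python
from fnmatch import fnmatchcase


def glob_matches(pattern: str, value: str) -> bool:
    """Case-sensitive glob match where `*` stands for any run of characters."""
    return fnmatchcase(value, pattern)


# Subject filter example from the text
print(glob_matches("Scanned Document*", "Scanned Document 214"))
# File filter example from the text
print(glob_matches("*.pdf", "invoice.pdf"))
print(glob_matches("*.pdf", "photo.jpg"))
```

The first two calls match, the last one does not, since `photo.jpg` does not end in `.pdf`.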

## Post Processing

{{ figure(file="scanmailbox-detail-04.png") }}

The next settings tell docspell what to do once a mail has been read
by docspell. It can be moved into another folder in your mail account.
This moves it out of the way for the next run. You can also choose to
delete the mail, but *note that it will really be deleted and not
moved to your trash folder*. If both options are off, nothing happens
with that mail, it simply stays (and could be re-read on the next
run).

Be careful when mails are neither moved nor deleted after processing.
They could be selected anew in the next run, meaning that the job can
not progress, because it filters out the same mails all the time. You
can however, simply schedule the task in an interval >= the `Received
Since Hours` setting.

By default, post-processing is only applied to mails that have been
*submitted to docspell*. Some mails may have been skipped due to subject
filtering. If you also want these skipped mails to be affected by
post-processing, enable the *Apply post-processing to all fetched
mails*.



## Metadata

The last properties allow to specify some metadata that are
automatically attached to the items being created.
{{ figure(file="scanmailbox-detail-05.png") }}

These properties allow to specify some metadata that are automatically
attached to the items being created.

Every item in docspell has a direction value (incoming or outgoing).
If you know that all mails you want to import have a specific
@ -104,17 +125,26 @@ resulting items.
The *Tags* setting can be used to associate a fixed number of tags to
all items that are imported from this mail task.

The last field is the *Schedule* which defines when and how often this
task should run. The syntax is similar to a date-time string, like
`2019-09-15 12:32`, where each part is a pattern to also match multiple
values. The ui tries to help a little by displaying the next two
date-times this task would execute. A more in depth help is available
[here](https://github.com/eikek/calev#what-are-calendar-events). For
example, to execute the task every Monday at noon, you would write:
`Mon *-*-* 12:00`. A date-time part can match all values (`*`), a list
of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long lists may
be written in a shorter way using a repetition value. It is written
like this: `1/7` which is the same as a list with `1` and all
The *Language* setting is applied when processing the mails. If not
set, the default language of the collective is used.


## Schedule

{{ figure(file="scanmailbox-detail-06.png") }}

At last the *Schedule* defines when and how often this task should
run. The syntax is similar to a date-time string, like `2019-09-15
12:32`, where each part is a pattern to also match multiple values. The
ui tries to help a little by displaying the next two date-times this
task would execute. A more in depth help is available
[here](https://github.com/eikek/calev#what-are-calendar-events).

For example, to execute the task every Monday at noon, you would
write: `Mon *-*-* 12:00`. A date-time part can match all values (`*`),
a list of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long
lists may be written in a shorter way using a repetition value. It is
written like this: `1/7` which is the same as a list with `1` and all
multiples of `7` added to it. In other words, it matches `1`, `1+7`,
`1+7+7`, `1+7+7+7` and so on.
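
The matching rules for a single numeric part can be sketched in a few lines of Python. This is not the calev parser, only an illustration of the list, range and repetition forms described above. The function name `expand_part` and the explicit `lo`/`hi` bounds (for example `1` and `31` for a day of month) are assumptions made for this example.

```python
from typing import List


def expand_part(part: str, lo: int, hi: int) -> List[int]:
    """Expand one numeric schedule part into the values it matches within [lo, hi]."""
    if part == "*":
        # `*` matches every value of the field
        return list(range(lo, hi + 1))
    values = set()
    for item in part.split(","):
        if "/" in item:
            # repetition value: start, start+step, start+step+step, ...
            start, step = (int(x) for x in item.split("/"))
            values.update(range(start, hi + 1, step))
        elif ".." in item:
            # inclusive range, e.g. 1..9
            first, last = (int(x) for x in item.split(".."))
            values.update(range(first, last + 1))
        else:
            # single value
            values.add(int(item))
    return sorted(v for v in values if lo <= v <= hi)


print(expand_part("1/7", 1, 31))  # [1, 8, 15, 22, 29]
```

For a day-of-month field, `1/7` therefore stands for the 1st, 8th, 15th, 22nd and 29th.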