Update documentation
32
Changelog.md
@ -6,9 +6,9 @@
|
|||||||
|
|
||||||
This release comes with major improvements to the text analysis
|
This release comes with major improvements to the text analysis
|
||||||
module. It is now much more configurable, has improved results and can
|
module. It is now much more configurable, has improved results and can
|
||||||
learn tags from all categories. Additionally, more languages have been
|
learn tags from all categories. Additionally, more languages for
|
||||||
added (and it's now easier to add more, please open an issue if want
|
document processing have been added and it's now easier to add more.
|
||||||
more languages).
|
Please open an issue if you want more languages to be included.
|
||||||
|
|
||||||
- text analysis improvements (#263, #570)
|
- text analysis improvements (#263, #570)
|
||||||
- docspell can now learn from all your tag categories
|
- docspell can now learn from all your tag categories
|
||||||
@ -23,15 +23,15 @@ more languages).
|
|||||||
- Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish,
|
- Adds: Spanish, Italian, Portuguese, Czech, Dutch, Danish, Finnish,
|
||||||
Norwegian, Swedish, Russian, Romanian
|
Norwegian, Swedish, Russian, Romanian
|
||||||
- languages have different support for text-analysis, but there is
|
- languages have different support for text-analysis, but there is
|
||||||
some basic support for all, there is extended support for English,
|
some basic support for all
|
||||||
German and French through [Stanford
|
- there is extended support for English, German and French through
|
||||||
CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp models
|
[Stanford CoreNLP](https://stanfordnlp.github.io/CoreNLP/) nlp
|
||||||
- if you want more languages, please open an issue.
|
models (as before)
|
||||||
- scan mailbox change (#576)
|
- scan mailbox change (#576)
|
||||||
- The change from last version (#551) has been moved behind a flag
|
- The change from last version (#551) has been moved behind a flag
|
||||||
in the "scan mailbox settings". Please review your scan mailbox
|
in the "scan mailbox settings". Please review your scan mailbox
|
||||||
settings.
|
tasks in your user settings.
|
||||||
- The scan mailbox settings form view has been changed to tab-style,
|
- The scan mailbox settings form view has been organized into tabs,
|
||||||
as it grew too large for a single form.
|
as it grew too large for a single form.
|
||||||
- nix tools package fixed (#584)
|
- nix tools package fixed (#584)
|
||||||
- If you are using docspell tools package for nix, it has now been
|
- If you are using docspell tools package for nix, it has now been
|
||||||
@ -50,22 +50,23 @@ more languages).
|
|||||||
- This was a bug introduced by the last release. Tag categories
|
- This was a bug introduced by the last release. Tag categories
|
||||||
can now be spelled upper- or lower-case. In 0.18.0 you had to
|
can now be spelled upper- or lower-case. In 0.18.0 you had to
|
||||||
spell them lowercase, otherwise the search doesn't work.
|
spell them lowercase, otherwise the search doesn't work.
|
||||||
- adds a workaround for mails that don't specify the used charset (#591)
|
- adds a workaround for mails that don't specify their used charset (#591)
|
||||||
|
|
||||||
### Breaking Changes
|
### Breaking Changes
|
||||||
|
|
||||||
- The joex configuration changed around text analysis. If you had some
|
- The joex configuration changed around text analysis. If you had some
|
||||||
custom settings there, please review these wrt the new default
|
custom settings there, please review these wrt the new default
|
||||||
config.
|
config.
|
||||||
- The tools package renamed the scripts to be better distinguishable,
|
- When using the nix package manager: the tools package renamed the
|
||||||
since they all end up in `$PATH`. They are now prefixed by `ds-`.
|
scripts to be better distinguishable, since they all end up in
|
||||||
|
`$PATH`. They are now prefixed by `ds-`.
|
||||||
- The path of the consumedir script changed in the consumedir docker
|
- The path of the consumedir script changed in the consumedir docker
|
||||||
image
|
image
|
||||||
- The settings of the scan-mailbox task have been extended by another
|
- The settings of the scan-mailbox task have been extended by another
|
||||||
flag. It controls when to apply the post-processing (moving or
|
flag. It controls when to apply the post-processing (moving or
|
||||||
deleting). If you were relying on all mails (even those excluded
|
deleting). If you were relying on all mails (even those excluded
|
||||||
by a subject filter) being moved away, you need to check the
|
by a subject filter) being moved away, you need to check your
|
||||||
settings.
|
scan-mailbox task settings.
|
||||||
|
|
||||||
### REST Api Changes
|
### REST Api Changes
|
||||||
|
|
||||||
@ -82,6 +83,9 @@ more languages).
|
|||||||
moved inside `text-analysis`. Please have a look at the new
|
moved inside `text-analysis`. Please have a look at the new
|
||||||
[default config](https://docspell.org/docs/configure/#joex) if you
|
[default config](https://docspell.org/docs/configure/#joex) if you
|
||||||
changed something there.
|
changed something there.
|
||||||
|
- The `regex-ner` section has changed: the `enabled` flag has been
|
||||||
|
removed. You can now limit the number of entries to apply using
|
||||||
|
`max-entries`; a value of `0` disables it.
|
||||||
|
|
||||||
|
|
||||||
## v0.18.0
|
## v0.18.0
|
||||||
|
@ -341,13 +341,59 @@ will degrade to `regexonly` for these.
|
|||||||
|
|
||||||
The mode `disabled` skips NLP processing completely. This has the least
|
The mode `disabled` skips NLP processing completely. This has the least
|
||||||
impact on memory consumption, obviously, but then only the classifier
|
impact on memory consumption, obviously, but then only the classifier
|
||||||
is used to find metadata.
|
is used to find metadata (unless it is disabled, too).
|
||||||
|
|
||||||
You might want to try different modes and see what combination suits
|
You might want to try different modes and see what combination suits
|
||||||
best your usage pattern and machine running joex. If a powerful
|
best your usage pattern and machine running joex. If a powerful
|
||||||
machine is used, simply leave the defaults. When running on an older
|
machine is used, simply leave the defaults. When running on a
|
||||||
raspberry pi, for example, you might need to adjust things.
|
raspberry pi, for example, you might need to adjust things.
|
||||||
|
|
||||||
|
### Memory Usage
|
||||||
|
|
||||||
|
The memory requirements for the joex component depend on the document
|
||||||
|
language and the enabled features for text-analysis. The `nlp.mode`
|
||||||
|
setting has significant impact, especially when your documents are in
|
||||||
|
German. Here are some rough numbers on jvm heap usage (the same file
|
||||||
|
was used for all tries):
|
||||||
|
|
||||||
|
<table class="table is-hoverable is-striped">
|
||||||
|
<thead>
|
||||||
|
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
|
||||||
|
</thead>
|
||||||
|
<tfoot>
|
||||||
|
</tfoot>
|
||||||
|
<tbody>
|
||||||
|
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
|
||||||
|
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
|
||||||
|
</tbody>
|
||||||
|
</table>
|
||||||
|
|
||||||
|
Note that these are only rough numbers and they show the maximum used
|
||||||
|
heap memory while processing a file.
|
||||||
|
|
||||||
|
When using `mode=full`, a heap setting of at least `-Xmx1400M` is
|
||||||
|
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
|
||||||
|
recommended.
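
As a rough illustration only, the two heap recommendations above can be kept in a small lookup. This Python sketch is not part of docspell; the name `recommended_heap_flag` is hypothetical, and the values are simply the ones quoted in the text. Modes for which the text gives no number return `None`.

```python
from typing import Optional

# Minimum heap flags quoted in the text above, keyed by nlp.mode.
# Other modes need less heap, but the text gives no concrete number for them.
RECOMMENDED_HEAP = {"full": "-Xmx1400M", "basic": "-Xmx500M"}


def recommended_heap_flag(nlp_mode: str) -> Optional[str]:
    """Return the suggested minimum JVM heap flag, or None if none is documented."""
    return RECOMMENDED_HEAP.get(nlp_mode)


if __name__ == "__main__":
    for mode in ("full", "basic", "regexonly"):
        print(mode, recommended_heap_flag(mode))
```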
|
||||||
|
|
||||||
|
Other languages can't use these two modes, and so don't require this
|
||||||
|
amount of memory (but don't have as good results). Then you can go
|
||||||
|
with less heap. For these languages, the nlp mode is the same as
|
||||||
|
`regexonly`.
|
||||||
|
|
||||||
|
Training the classifier is also memory intensive, which solely depends
|
||||||
|
on the size and number of documents that are being trained. However,
|
||||||
|
training the classifier is done periodically and can happen maybe
|
||||||
|
every two weeks. When classifying new documents, memory requirements
|
||||||
|
are lower, since the model already exists.
|
||||||
|
|
||||||
|
More details about these modes can be found
|
||||||
|
[here](@/docs/joex/file-processing.md#text-analysis).
|
||||||
|
|
||||||
|
|
||||||
|
The restserver component is very lightweight, here you can use
|
||||||
|
defaults.
|
||||||
|
|
||||||
|
|
||||||
# File Format
|
# File Format
|
||||||
|
|
||||||
The format of the configuration files can be
|
The format of the configuration files can be
|
||||||
|
@ -42,6 +42,8 @@ directory and uploads all incoming files to Docspell. The script can
|
|||||||
watch directories recursively and can skip files already uploaded, so
|
watch directories recursively and can skip files already uploaded, so
|
||||||
you can organize the files as you want in there (rename, move etc).
|
you can organize the files as you want in there (rename, move etc).
|
||||||
|
|
||||||
|
This can be used multiple times on different machines, if desired.
|
||||||
|
|
||||||
The scanner should support 300dpi for better results. Docspell
|
The scanner should support 300dpi for better results. Docspell
|
||||||
converts the files into PDF adding a text layer of image-only files.
|
converts the files into PDF adding a text layer of image-only files.
|
||||||
|
|
||||||
|
@ -25,3 +25,8 @@ To get started, here are some quick links:
|
|||||||
user provided [notes and unraid
|
user provided [notes and unraid
|
||||||
templates](https://github.com/vakilando/unraid-docker-templates)
|
templates](https://github.com/vakilando/unraid-docker-templates)
|
||||||
which can get you started. Thanks for providing these!
|
which can get you started. Thanks for providing these!
|
||||||
|
|
||||||
|
Every [component](@/docs/intro/_index.md#components) (restserver, joex,
|
||||||
|
consumedir) can run on different machines and multiple times. Most of
|
||||||
|
the time, running everything on one machine is sufficient; for
|
||||||
|
simplicity, the docker-compose setup reflects this variant.
|
||||||
|
@ -27,7 +27,7 @@ result in long processing times for OCR and text analysis. The board
|
|||||||
should provide 4G of RAM (like the current RPi4), especially if also a
|
should provide 4G of RAM (like the current RPi4), especially if also a
|
||||||
database and solr are running next to it. The memory required by joex
|
database and solr are running next to it. The memory required by joex
|
||||||
depends on the config and document language. Please pick a value that
|
depends on the config and document language. Please pick a value that
|
||||||
suits your setup from [here](@/docs/install/running.md#memory-usage).
|
suits your setup from [here](@/docs/configure/_index.md#memory-usage).
|
||||||
For boards like the RPi, it might be necessary to use
|
For boards like the RPi, it might be necessary to use
|
||||||
`nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
|
`nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
|
||||||
joex pool size to 1.
|
joex pool size to 1.
|
||||||
|
@ -45,42 +45,14 @@ when opened up to the outside, it is recommended to lock this down.
|
|||||||
|
|
||||||
{% end %}
|
{% end %}
|
||||||
|
|
||||||
## Memory Usage
|
## Memory
|
||||||
|
|
||||||
The memory requirements for the joex component depends on the document
|
Using the options below you can define how much memory the JVM process
|
||||||
language and the configuration for [file
|
is able to use. This might be necessary to adapt depending on the
|
||||||
processing](@/docs/configure/_index.md#file-processing). The
|
usage scenario and configured text analysis features.
|
||||||
`nlp.mode` setting has significant impact, especially when your
|
|
||||||
documents are in German. Here are some rough numbers on jvm heap usage
|
|
||||||
(the same small jpeg file was used for all tries):
|
|
||||||
|
|
||||||
<table class="table is-hoverable is-striped">
|
|
||||||
<thead>
|
|
||||||
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
|
|
||||||
</thead>
|
|
||||||
<tfoot>
|
|
||||||
</tfoot>
|
|
||||||
<tbody>
|
|
||||||
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
|
|
||||||
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
|
|
||||||
</tbody>
|
|
||||||
</table>
|
|
||||||
|
|
||||||
When using `mode=full`, a heap setting of at least `-Xmx1400M` is
|
|
||||||
recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
|
|
||||||
recommended.
|
|
||||||
|
|
||||||
Other languages can't use these two modes, and so don't require this
|
|
||||||
amount of memory (but don't have as good results). Then you can go
|
|
||||||
with less heap.
|
|
||||||
|
|
||||||
More details about these modes can be found
|
|
||||||
[here](@/docs/joex/file-processing.md#text-analysis).
|
|
||||||
|
|
||||||
|
|
||||||
The restserver component is very lightweight, here you can use
|
|
||||||
defaults.
|
|
||||||
|
|
||||||
|
Please have a look at the corresponding [configuration
|
||||||
|
section](@/docs/configure/_index.md#memory-usage).
|
||||||
|
|
||||||
## Options
|
## Options
|
||||||
|
|
||||||
|
@ -24,9 +24,10 @@ candidates for:
|
|||||||
- Correspondents
|
- Correspondents
|
||||||
- Concerned person or things
|
- Concerned person or things
|
||||||
- A date
|
- A date
|
||||||
|
- Tags
|
||||||
|
|
||||||
It will propose a few candidates and sets the most likely one to your
|
For tags, it sets all that it thinks apply. For the others, it will
|
||||||
item.
|
propose a few candidates and sets the most likely one to your item.
|
||||||
|
|
||||||
This might be wrong, so it is recommended to curate the results.
|
This might be wrong, so it is recommended to curate the results.
|
||||||
However, very often the correct one is either set or within the
|
However, very often the correct one is either set or within the
|
||||||
|
@ -443,9 +443,10 @@ nlp processing as follows:
|
|||||||
- mode `regexonly`: this doesn't load any statistical models and is
|
- mode `regexonly`: this doesn't load any statistical models and is
|
||||||
therefore much lighter on memory (depending on the address book
|
therefore much lighter on memory (depending on the address book
|
||||||
size, of course). It will use the address book to create regex rules
|
size, of course). It will use the address book to create regex rules
|
||||||
and match them against your document.
|
and match them against your document. Memory usage then doesn't
|
||||||
- mode = disabled: this disables nlp processing altogether. Then only
|
depend on the document language.
|
||||||
the classifier is run (unless disabled).
|
- mode `disabled`: this disables nlp processing. Then only the
|
||||||
|
classifier is run (unless disabled).
|
||||||
|
|
||||||
Note that mode `full` and `basic` is only relevant for the languages
|
Note that mode `full` and `basic` is only relevant for the languages
|
||||||
where models are available. For all other languages, it is effectively
|
where models are available. For all other languages, it is effectively
|
||||||
|
BIN
website/site/content/docs/webapp/scanmailbox-detail-01.png
Normal file
After Width: | Height: | Size: 142 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-02.png
Normal file
After Width: | Height: | Size: 152 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-03.png
Normal file
After Width: | Height: | Size: 187 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-04.png
Normal file
After Width: | Height: | Size: 188 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-05.png
Normal file
After Width: | Height: | Size: 196 KiB |
BIN
website/site/content/docs/webapp/scanmailbox-detail-06.png
Normal file
After Width: | Height: | Size: 184 KiB |
Before Width: | Height: | Size: 184 KiB |
@ -21,19 +21,22 @@ multiple e-mail accounts you want to import periodically.
|
|||||||
|
|
||||||
# Details
|
# Details
|
||||||
|
|
||||||
Creating a task requires the following information:
|
## General
|
||||||
|
|
||||||
{{ figure(file="scanmailbox-detail.png") }}
|
{{ figure(file="scanmailbox-detail-01.png") }}
|
||||||
|
|
||||||
You can enable or disable this task. A disabled task will not run
|
You can enable or disable this task. A disabled task will not run
|
||||||
periodically. You can still choose to run it manually if you click the
|
periodically. You can still choose to run it manually if you click the
|
||||||
`Start Once` button.
|
`Start Once` button.
|
||||||
|
|
||||||
## E-Mail Settings
|
|
||||||
|
|
||||||
Then you need to specify which [IMAP
|
Then you need to specify which [IMAP
|
||||||
connection](@/docs/webapp/emailsettings.md#imap-settings) to use.
|
connection](@/docs/webapp/emailsettings.md#imap-settings) to use.
|
||||||
|
|
||||||
|
|
||||||
|
## Processing
|
||||||
|
|
||||||
|
{{ figure(file="scanmailbox-detail-02.png") }}
|
||||||
|
|
||||||
A list of folders is required. Docspell will only look into these
|
A list of folders is required. Docspell will only look into these
|
||||||
folders. You can specify multiple folders. The "Inbox" folder is a
|
folders. You can specify multiple folders. The "Inbox" folder is a
|
||||||
special folder, which will usually appear translated in your web-mail
|
special folder, which will usually appear translated in your web-mail
|
||||||
@ -43,30 +46,20 @@ mails in your inbox. Any other folder is usually case-sensitive
|
|||||||
except the INBOX folder). Type in a folder name and click the add
|
except the INBOX folder). Type in a folder name and click the add
|
||||||
button on the right.
|
button on the right.
|
||||||
|
|
||||||
The next two settings tell docspell what to do once a mail has been
|
|
||||||
submitted to docspell. It can be moved into another folder in your
|
|
||||||
mail account. This moves it out of the way for the next run. You can
|
|
||||||
also choose to delete the mail, but *note that it will really be
|
|
||||||
deleted and not moved to your trash folder*. If both options are off,
|
|
||||||
nothing happens with that mail, it simply stays (and could be re-read
|
|
||||||
on the next run).
|
|
||||||
|
|
||||||
Be careful when mails are neither moved nor deleted after processing.
|
|
||||||
They could be selected anew in the next run, meaning that the job can
|
|
||||||
not progress, because it filters out the same mails all the time. You
|
|
||||||
can however, simply schedule the task in an interval >= the `Received
|
|
||||||
Since Hours` setting.
|
|
||||||
|
|
||||||
|
|
||||||
## Filtering
|
|
||||||
|
|
||||||
The following properties allow to filter mails that are imported.
|
|
||||||
|
|
||||||
Then the field *Received Since Hours* defines how many hours to go
|
Then the field *Received Since Hours* defines how many hours to go
|
||||||
back and look for mails. Usually there are many mails in your inbox
|
back and look for mails. Usually there are many mails in your inbox
|
||||||
and importing them all at once is not feasible or desirable. It can
|
and importing them all at once is not feasible or desirable. It can
|
||||||
work together with the *Schedule* field below. For example, you could
|
work together with the *Schedule* field below. For example, you could
|
||||||
run this task all 6 hours and read mails from 8 hours back.
|
run this task every 6 hours and read mails from 8 hours back. This
|
||||||
|
setting is used to query the mail server.
|
||||||
|
|
||||||
|
|
||||||
|
## Additional Filter
|
||||||
|
|
||||||
|
{{ figure(file="scanmailbox-detail-03.png") }}
|
||||||
|
|
||||||
|
The following properties let you filter the downloaded mails that
|
||||||
|
should be imported.
|
||||||
|
|
||||||
The *File Filter* can be specified as a glob to only import mail
|
The *File Filter* can be specified as a glob to only import mail
|
||||||
attachments based on their file name. For example, a value of `*.pdf`
|
attachments based on their file name. For example, a value of `*.pdf`
|
||||||
@ -82,10 +75,38 @@ pattern. For example, if your scanner mails to you with a certain
|
|||||||
subject like _"Scanned Document 214"_, you could include those via a
|
subject like _"Scanned Document 214"_, you could include those via a
|
||||||
`Scanned Document*` pattern.
|
`Scanned Document*` pattern.
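
To get a feel for how such a pattern selects subjects, here is a minimal Python sketch using the standard library's `fnmatch` module. It is an assumption that docspell's glob matching behaves the same way (for example regarding case sensitivity); the helper name `subject_matches` is hypothetical.

```python
from fnmatch import fnmatchcase


def subject_matches(subject: str, pattern: str) -> bool:
    """Case-sensitive glob match of a mail subject against a pattern."""
    return fnmatchcase(subject, pattern)


if __name__ == "__main__":
    # The scanner subject from the example above matches the pattern.
    print(subject_matches("Scanned Document 214", "Scanned Document*"))
    # An unrelated subject does not.
    print(subject_matches("Invoice 214", "Scanned Document*"))
```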
|
||||||
|
|
||||||
|
## Post Processing
|
||||||
|
|
||||||
|
{{ figure(file="scanmailbox-detail-04.png") }}
|
||||||
|
|
||||||
|
The next settings tell docspell what to do once a mail has been read
|
||||||
|
by docspell. It can be moved into another folder in your mail account.
|
||||||
|
This moves it out of the way for the next run. You can also choose to
|
||||||
|
delete the mail, but *note that it will really be deleted and not
|
||||||
|
moved to your trash folder*. If both options are off, nothing happens
|
||||||
|
with that mail, it simply stays (and could be re-read on the next
|
||||||
|
run).
|
||||||
|
|
||||||
|
Be careful when mails are neither moved nor deleted after processing.
|
||||||
|
They could be selected anew in the next run, meaning that the job can
|
||||||
|
not progress, because it filters out the same mails all the time. You
|
||||||
|
can however, simply schedule the task in an interval >= the `Received
|
||||||
|
Since Hours` setting.
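
The relation between the schedule interval and *Received Since Hours* is simple arithmetic. The Python sketch below is only an illustration with a hypothetical helper name; it computes how many hours of mail would be looked at again on the next run when mails are neither moved nor deleted.

```python
def reread_overlap_hours(schedule_interval_hours: int, received_since_hours: int) -> int:
    """Hours of mail that the next run selects again.

    The result is 0 when the interval is >= the look-back window, which is
    the situation recommended in the text above.
    """
    return max(0, received_since_hours - schedule_interval_hours)


if __name__ == "__main__":
    # Running every 6 hours and looking 8 hours back re-selects 2 hours of mail.
    print(reread_overlap_hours(6, 8))
    # Running every 8 hours with an 8 hour look-back re-selects nothing.
    print(reread_overlap_hours(8, 8))
```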
|
||||||
|
|
||||||
|
By default, post-processing is only applied to mails that have been
|
||||||
|
*submitted to docspell*. Some mails may have been skipped due to subject
|
||||||
|
filtering. If you also want these skipped mails to be affected by
|
||||||
|
post-processing, enable the *Apply post-processing to all fetched
|
||||||
|
mails*.
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Metadata
|
## Metadata
|
||||||
|
|
||||||
The last properties allow to specify some metadata that are
|
{{ figure(file="scanmailbox-detail-05.png") }}
|
||||||
automatically attached to the items being created.
|
|
||||||
|
These properties let you specify some metadata that are automatically
|
||||||
|
attached to the items being created.
|
||||||
|
|
||||||
Every item in docspell has a direction value (incoming or outgoing).
|
Every item in docspell has a direction value (incoming or outgoing).
|
||||||
If you know that all mails you want to import have a specific
|
If you know that all mails you want to import have a specific
|
||||||
@ -104,17 +125,26 @@ resulting items.
|
|||||||
The *Tags* setting can be used to associate a fixed number of tags to
|
The *Tags* setting can be used to associate a fixed number of tags to
|
||||||
all items that are imported from this mail task.
|
all items that are imported from this mail task.
|
||||||
|
|
||||||
The last field is the *Schedule* which defines when and how often this
|
The *Language* setting is applied when processing the mails. If not
|
||||||
task should run. The syntax is similiar to a date-time string, like
|
set, the default language of the collective is used.
|
||||||
`2019-09-15 12:32`, where each part is a pattern to also match multple
|
|
||||||
values. The ui tries to help a little by displaying the next two
|
|
||||||
date-times this task would execute. A more in depth help is available
|
## Schedule
|
||||||
[here](https://github.com/eikek/calev#what-are-calendar-events). For
|
|
||||||
example, to execute the task every monday at noon, you would write:
|
{{ figure(file="scanmailbox-detail-06.png") }}
|
||||||
`Mon *-*-* 12:00`. A date-time part can match all values (`*`), a list
|
|
||||||
of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long lists may
|
Finally, the *Schedule* defines when and how often this task should
|
||||||
be written in a shorter way using a repetition value. It is written
|
run. The syntax is similar to a date-time string, like `2019-09-15
|
||||||
like this: `1/7` which is the same as a list with `1` and all
|
12:32`, where each part is a pattern to also match multiple values. The
|
||||||
|
ui tries to help a little by displaying the next two date-times this
|
||||||
|
task would execute. A more in depth help is available
|
||||||
|
[here](https://github.com/eikek/calev#what-are-calendar-events).
|
||||||
|
|
||||||
|
For example, to execute the task every monday at noon, you would
|
||||||
|
write: `Mon *-*-* 12:00`. A date-time part can match all values (`*`),
|
||||||
|
a list of values (e.g. `1,5,12,19`) or a range (e.g. `1..9`). Long
|
||||||
|
lists may be written in a shorter way using a repetition value. It is
|
||||||
|
written like this: `1/7` which is the same as a list with `1` and all
|
||||||
multiples of `7` added to it. In other words, it matches `1`, `1+7`,
|
multiples of `7` added to it. In other words, it matches `1`, `1+7`,
|
||||||
`1+7+7`, `1+7+7+7` and so on.
|
`1+7+7`, `1+7+7+7` and so on.
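
The repetition value can be expanded by hand. The following Python sketch is not the calev implementation; it only mirrors the description above. The function name `expand_repetition` and the `upper` bound (for example `31` for a day-of-month part) are assumptions made for illustration.

```python
from typing import List


def expand_repetition(expr: str, upper: int) -> List[int]:
    """Expand a repetition value like "1/7" into its matching values up to `upper`."""
    start_text, step_text = expr.split("/")
    start, step = int(start_text), int(step_text)
    if step <= 0:
        raise ValueError("the repetition step must be positive")
    # start, start+step, start+step+step, ... up to and including `upper`
    return list(range(start, upper + 1, step))


if __name__ == "__main__":
    # For a day-of-month part (assumed upper bound 31): 1, 1+7, 1+7+7, ...
    print(expand_repetition("1/7", 31))  # [1, 8, 15, 22, 29]
```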
|
||||||
|
|
||||||
|