Update documentation

This commit is contained in:
Eike Kettner 2021-01-20 21:35:54 +01:00
parent 85ddc61d9d
commit a6c31be22f
6 changed files with 206 additions and 93 deletions

View File

@@ -286,16 +286,13 @@ docspell.joex {
# 4. disabled - doesn't use any stanford-nlp feature
#
# The full and basic variants rely on pre-built language models
# that are available for only a few languages. Memory usage
# varies among the languages. So joex should run with -Xmx1400M
# at least when using mode=full.
#
# The basic variant does a quite good job for German and
# English. It might be worse for French, always depending on the
# type of text that is analysed. Joex should run with about 500M
# heap; here again German uses the most.
#
# The regexonly variant doesn't depend on a language. It roughly
@@ -349,25 +346,23 @@ docspell.joex {
# Settings for doing document classification.
#
# This works by learning from existing documents. This requires a
# statistical model that is computed from all existing documents.
# This process is run periodically as configured by the
# collective. It may require more memory, depending on the amount
# of data.
#
# It utilises this NLP library: https://nlp.stanford.edu/.
classification {
  # Whether to enable classification globally. Each collective can
  # enable/disable auto-tagging. The classifier is also used for
  # finding correspondents and concerned entities, if enabled
  # here.
  enabled = true

  # If concerned with memory consumption, this restricts the
  # number of items to consider. More are better for training. A
  # negative value or zero means to train on all items.
  item-count = 0

  # These settings are used to configure the classifier. If

View File

@@ -796,7 +796,7 @@ in {
Memory usage varies greatly among the languages. German has
quite large models that require about 1G heap. So joex should
run with -Xmx1400M at least when using mode=full.

The basic variant does a quite good job for German and
English. It might be worse for French, always depending on the

View File

@@ -20,6 +20,9 @@ The configuration of both components uses separate namespaces. The
configuration for the REST server is below `docspell.server`, while
the one for joex is below `docspell.joex`.

You can therefore use two separate config files or one single file
containing both namespaces.

## JDBC

This configures the connection to the database. This has to be
@@ -281,6 +284,56 @@ just some minutes, the web application obtains new ones
periodically. So a short time is recommended.

## File Processing

Files are processed by the joex component, so all the respective
configuration is in its config only.

File processing involves several stages; detailed information can be
found [here](@/docs/joex/file-processing.md#text-analysis).

The configuration allows you to define the external tools and to set
some limits to control memory usage. The sections are:

- `docspell.joex.extraction`
- `docspell.joex.text-analysis`
- `docspell.joex.convert`

Options to external commands can use variables that are replaced with
values at runtime. Variables are enclosed in double braces `{{…}}`.
Please see the default configuration for more details.
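For illustration, an external-command entry using such variables could look like the following fragment. This is a sketch only: the tool name, arguments and variable names here are made up for the example; the real entries and the variables they support are in the default configuration.

```
docspell.joex {
  convert {
    # hypothetical external tool definition, for illustration only
    example-tool {
      command = {
        program = "example-convert"
        # {{infile}} and {{outfile}} are illustrative variable names,
        # replaced with real paths at runtime
        args = [ "{{infile}}", "--out", "{{outfile}}" ]
        timeout = "2 minutes"
      }
    }
  }
}
```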
### `text-analysis.nlp.mode`

This setting defines which NLP mode to use. It defaults to `full`,
which requires more memory for certain languages (with the advantage
of better results). Other values are `basic`, `regexonly` and
`disabled`. The modes `full` and `basic` use pre-defined language
models for processing documents in German, English and French. These
require some amount of memory (see below).

The mode `basic` is the "light" variant of `full`. It doesn't use all
NLP features, which makes memory consumption much lower, but comes
with the compromise of less accurate results.

The mode `regexonly` doesn't use pre-defined language models, even if
available. It checks your address book against a document to find
metadata, which means it is language independent. Also, when using
`full` or `basic` with languages for which no pre-defined models
exist, processing degrades to `regexonly` for these.

The mode `disabled` skips NLP processing completely. This has the
least impact on memory consumption, obviously, but then only the
classifier is used to find metadata.

You might want to try different modes and see which combination suits
your usage pattern and the machine running joex best. On a powerful
machine, simply leave the defaults. When running on an older raspberry
pi, for example, you might need to adjust things. The corresponding
sections in the [joex default config](#joex) and the [file
processing](@/docs/joex/file-processing.md#text-analysis) page provide
more details.
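As a minimal illustration, selecting a mode could look like this config fragment. It is a sketch: the nesting follows the section name `docspell.joex.text-analysis` mentioned above; consult the default configuration for the exact path.

```
docspell.joex {
  text-analysis {
    nlp {
      # one of: full, basic, regexonly, disabled
      mode = basic
    }
  }
}
```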
# File Format

The format of the configuration files can be

View File

@@ -25,19 +25,18 @@ work is done by the joex components.
Running the joex component on the Raspberry Pi is possible, but will
result in long processing times for OCR and text analysis. The board
should provide 4G of RAM (like the current RPi4), especially if also a
database and solr are running next to it. The memory required by joex
depends on the config and document language. Please pick a value that
suits your setup from [here](@/docs/install/running.md#memory-usage).

For boards like the RPi, it might be necessary to use
`nlp.mode=basic` rather than `nlp.mode=full`. You should also set the
joex pool size to 1.

An example: on this [UP
board](https://up-board.org/up/specifications/) with an Intel Atom
x5-Z8350 CPU (@1.44Ghz) and 4G RAM, a scanned (300dpi, in German) pdf
file with 6 pages took *3:20 min* to process. This board also runs
SOLR and a postgresql database.

The same file was processed in 55s on a qemu virtual machine on my i7
notebook, using 1 CPU and 4G RAM (and identical config for joex). The

View File

@@ -35,6 +35,42 @@ You should be able to create a new account and sign in. Check the
[configuration page](@/docs/configure/_index.md) to further customize
docspell.
## Memory Usage

The memory requirements for the joex component depend on the document
language and the configuration for [file
processing](@/docs/configure/_index.md#file-processing). The
`nlp.mode` setting has significant impact, especially when your
documents are in German. Here are some rough numbers on jvm heap usage
(the same small jpeg file was used for all runs):

<table class="table is-hoverable is-striped">
<thead>
<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
</thead>
<tbody>
<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
</tbody>
</table>

When using `mode=full`, a heap setting of at least `-Xmx1400M` is
recommended. For `mode=basic`, a heap setting of at least `-Xmx500M`
is recommended.

Other languages can't use these two modes and therefore don't require
this amount of memory (but results aren't as good). You can then go
with less heap.

More details about these modes can be found
[here](@/docs/joex/file-processing.md#text-analysis).

The restserver component is very lightweight; you can use the
defaults.
## Options

@@ -65,10 +101,10 @@ $ ./docspell-restserver*/bin/docspell-restserver -h

gives an overview of supported options.

It is recommended to run joex with the G1GC enabled. If you use java8,
you need to add an option to use G1GC (`-XX:+UseG1GC`); for java11
this is not necessary (but doesn't hurt either). This could look like
this:

```
./docspell-joex-{{version()}}/bin/docspell-joex -J-Xmx1596M -J-XX:+UseG1GC -- /path/to/joex.conf
```

View File

@@ -331,91 +331,121 @@ images for a collective. There is also a bash script provided in the

# Text Analysis

Finally, the extracted text is analysed to find possible metadata that
can be attached to the new item. There are two different approaches
provided.

The basic idea is that instead of *you defining textual rules* to
apply tags and other things, these rules *are found for you*, based on
what you have provided so far.

Docspell relies on the [Stanford NLP
Library](https://nlp.stanford.edu/software/) for its AI features.
Among other things, it provides a classifier and NER annotators. The
latter is also a classifier that associates labels with terms in a
text: it finds out whether some term is probably an organization, a
person etc. It tries to "understand" the structure of the text, like
verbs, nouns and their relations.

The two approaches are sketched below. Both have advantages and
disadvantages and they are used together by default. However,
depending on the document language, not all approaches are possible.
They also have different memory footprints, and you might want to
disable some features when running on low-end machines.
## Classification

If you enabled classification in the config file, a model is trained
periodically from a collective's files. Very roughly speaking, this
model contains the essence of "patterns" in the text that are likely
related to a tag, a correspondent etc.

When a new document arrives, this model is asked which metadata (tag,
correspondent, etc.) is likely to apply.

Training the model is a rather resource-intensive process, but using
an existing model is quite cheap and fast. A model is trained
periodically; the schedule can be defined in your collective settings.

For tags, you can define the tag categories that should be trained (or
that should not be trained). Docspell assigns one tag from all tags in
a category to a new document.

Note that tags that cannot be derived from the text alone should
probably be excluded from learning. For example, if you tag all your
items with `Done` at some point, the classifier may falsely associate
patterns with this tag and tag your new documents with `Done`.

The admin can also limit the number of documents to train with in the
config file, to control the memory footprint when training.

Classification is used in Docspell both for guessing tags and for
finding correspondent and concerned entities. For correspondent and
concerned entities, the NLP approach is used first and the classifier
results then fill missing values.
## Natural Language Processing

NLP is the other approach and works a bit differently. Here,
algorithms are used that derive language properties from the given
text, for example which terms are nouns, organization or person names
etc. This also requires a statistical model, but this time for a whole
language. These are also provided by [Stanford
NLP](https://nlp.stanford.edu/software/), but not for all languages.
So whether this can be used depends on the document language. Models
currently exist for German, English and French.

[Stanford NLP](https://nlp.stanford.edu/software/) also allows running
custom rules against a text. This can be used as a fallback for terms
where the statistical model didn't succeed, but it can also be used by
itself. Docspell derives these rules from your address book, so it can
find terms in the document text that match your organization and
person names. This does not depend on the document language.

By default, Docspell does both: it first uses the statistical language
model (if available for the given language) and then runs the
address-book derived rules as a last step on so far unclassified
terms. This gives the best results. If more than one candidate is
found, the "most likely" one is set on the item and the others are
stored as suggestions.

The statistical model approach generally works very well, i.e. for
large address books. Normally, a document contains only very few
organization or person names. So it is more efficient to check these
few against your (probably large) address book than to test hundreds
of company names against a single document. It can also find things
*not* in your address book (but this is unused in Docspell currently).
However, it might not detect everything, or there may be no
statistical model for your language. Then the address book is used to
automatically create rules that are run against the document.

Both ways require memory; how much depends on the size of your address
book and on the size of the language models (they vary for each
language).
In the config file, you can specify different modes of operation for
nlp processing as follows:

- mode `full`: creates the complete nlp pipeline, requiring the most
  amount of memory but providing the best results. I'd recommend to
  run joex with a heap size of at least 1.4G (for English only, it can
  be lower than that).
- mode `basic`: only loads the NER tagger. This doesn't work as well
  as the complete pipeline, because some NLP steps are simply skipped.
  But it already gives quite good results and uses less memory. I'd
  recommend to run joex with at least 500m heap in this mode.
- mode `regexonly`: doesn't load any statistical models and is
  therefore much lighter on memory (depending on the address book
  size, of course). It will use the address book to create regex rules
  and match them against your document.
- mode `disabled`: disables nlp processing altogether. Then only the
  classifier is run (unless disabled).

Note that modes `full` and `basic` are only relevant for the languages
where models are available. For all other languages, they are
effectively the same as `regexonly`.
The config file allows specifying a limit for the text to analyse in
general. Large texts result in higher memory consumption. By default,
the first 10'000 characters are taken into account.

Then, for the `regexonly` mode, you can restrict the number of address

@@ -424,7 +454,7 @@ book entries that are used to create the rule set via

footprint.

The setting `clear-stanford-nlp-interval` allows defining an idle time
after which the language models are cleared from memory. This allows
memory to be reclaimed by the OS. The timer starts after the last file
has been processed. If you can afford it, it is recommended to disable
it by setting it to `0`.
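Putting these limits together, a corresponding fragment of the joex config might look like the following sketch. Only `clear-stanford-nlp-interval` is named above; the `max-length` key, the exact nesting and the duration value format are assumptions here — check the default configuration for the real keys.

```
docspell.joex {
  text-analysis {
    # assumed key: analyse only the first 10'000 characters of text
    max-length = 10000

    nlp {
      # one of: full, basic, regexonly, disabled
      mode = full

      # clear loaded language models after this idle time;
      # set to 0 to keep them in memory (value format assumed)
      clear-stanford-nlp-interval = "15 minutes"
    }
  }
}
```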