Mirror of https://github.com/TheAnachronism/docspell.git
Synced 2025-06-05 22:55:58 +00:00

Update documentation

This commit is contained in:
parent 85ddc61d9d
commit a6c31be22f
@@ -286,16 +286,13 @@ docspell.joex {
 # 4. disabled - doesn't use any stanford-nlp feature
 #
 # The full and basic variants rely on pre-built language models
-# that are available for only 3 lanugages at the moment: German,
-# English and French.
-#
-# Memory usage varies greatly among the languages. German has
-# quite large models, that require about 1G heap. So joex should
-# run with -Xmx1500M at least when using mode=full.
+# that are available for only a few languages. Memory usage
+# varies among the languages. So joex should run with -Xmx1400M
+# at least when using mode=full.
 #
 # The basic variant does a quite good job for German and
 # English. It might be worse for French, always depending on the
-# type of text that is analysed. Joex should run with about 600M
+# type of text that is analysed. Joex should run with about 500M
 # heap, here again language German uses the most.
 #
 # The regexonly variant doesn't depend on a language. It roughly
@@ -349,25 +346,23 @@ docspell.joex {
 
 # Settings for doing document classification.
 #
-# This works by learning from existing documents. A collective can
-# specify a tag category and the system will try to predict a tag
-# from this category for new incoming documents.
-#
-# This requires a satstical model that is computed from all
-# existing documents. This process is run periodically as
-# configured by the collective. It may require a lot of memory,
-# depending on the amount of data.
+# This works by learning from existing documents. This requires a
+# statistical model that is computed from all existing documents.
+# This process is run periodically as configured by the
+# collective. It may require more memory, depending on the amount
+# of data.
 #
 # It utilises this NLP library: https://nlp.stanford.edu/.
 classification {
   # Whether to enable classification globally. Each collective can
-  # decide to disable it. If it is disabled here, no collective
-  # can use classification.
+  # enable/disable auto-tagging. The classifier is also used for
+  # finding correspondents and concerned entities, if enabled
+  # here.
   enabled = true
 
   # If concerned with memory consumption, this restricts the
   # number of items to consider. More are better for training. A
-  # negative value or zero means no train on all items.
+  # negative value or zero means to train on all items.
   item-count = 0
 
   # These settings are used to configure the classifier. If
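For orientation, the classification settings touched by this hunk would sit in the joex config roughly as follows. This is only a sketch: the `enabled` and `item-count` keys come from the hunk above, while the surrounding `text-analysis` nesting is assumed from the section names used elsewhere in these docs.

```
docspell.joex {
  text-analysis {
    classification {
      # globally enable/disable classification
      enabled = true
      # 0 or a negative value means: train on all items
      item-count = 0
    }
  }
}
```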
@@ -796,7 +796,7 @@ in {
 
 Memory usage varies greatly among the languages. German has
 quite large models, that require about 1G heap. So joex should
-run with -Xmx1500M at least when using mode=full.
+run with -Xmx1400M at least when using mode=full.
 
 The basic variant does a quite good job for German and
 English. It might be worse for French, always depending on the
@@ -20,6 +20,9 @@ The configuration of both components uses separate namespaces. The
 configuration for the REST server is below `docspell.server`, while
 the one for joex is below `docspell.joex`.
 
+You can therefore use two separate config files or one single file
+containing both namespaces.
+
 ## JDBC
 
 This configures the connection to the database. This has to be
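As the added text notes, both namespaces can live in one file. A minimal single-file sketch could look like this; the `base-url` values are placeholders, only the two namespace names are taken from the text above:

```
# one config file serving both components
docspell.server {
  base-url = "http://localhost:7880"
}
docspell.joex {
  base-url = "http://localhost:7878"
}
```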
@@ -281,6 +284,56 @@ just some minutes, the web application obtains new ones
 periodically. So a short time is recommended.
 
 
+## File Processing
+
+Files are being processed by the joex component. So all the respective
+configuration is in this config only.
+
+File processing involves several stages, detailed information can be
+found [here](@/docs/joex/file-processing.md#text-analysis).
+
+Configuration allows to define the external tools and set some
+limitations to control memory usage. The sections are:
+
+- docspell.joex.extraction
+- docspell.joex.text-analysis
+- docspell.joex.convert
+
+Options to external commands can use variables that are replaced by
+values at runtime. Variables are enclosed in double braces `{{…}}`.
+Please see the default configuration for more details.
+
+### `text-analysis.nlp.mode`
+
+This setting defines which NLP mode to use. It defaults to `full`,
+which requires more memory for certain languages (with the advantage
+of better results). Other values are `basic`, `regexonly` and
+`disabled`. The modes `full` and `basic` use pre-defined language
+models for processing documents of the languages German, English and
+French. These require some amount of memory (see below).
+
+The mode `basic` is like the "light" variant to `full`. It doesn't use
+all NLP features, which makes memory consumption much lower, but comes
+with the compromise of less accurate results.
+
+The mode `regexonly` doesn't use pre-defined language models, even if
+available. It checks your address book against a document to find
+metadata. That means, it is language independent. Also, when using
+`full` or `basic` with languages where no pre-defined models exist, it
+will degrade to `regexonly` for these.
+
+The mode `disabled` skips NLP processing completely. This has the
+least impact on memory consumption, obviously, but then only the
+classifier is used to find metadata.
+
+You might want to try different modes and see what combination suits
+your usage pattern and the machine running joex best. If a powerful
+machine is used, simply leave the defaults. When running on an older
+raspberry pi, for example, you might need to adjust things. The
+corresponding sections in [joex default config](#joex) and the [file
+processing](@/docs/joex/file-processing.md#text-analysis) page provide more
+details.
+
 # File Format
 
 The format of the configuration files can be
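Set concretely, the mode described in the section above would be configured like this. A sketch: the `mode` key and its values come from the text, the nesting follows the `docspell.joex.text-analysis` section name listed above:

```
docspell.joex {
  text-analysis {
    nlp {
      # one of: full, basic, regexonly, disabled
      mode = basic
    }
  }
}
```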
@@ -25,19 +25,18 @@ work is done by the joex components.
 Running the joex component on the Raspberry Pi is possible, but will
 result in long processing times for OCR and text analysis. The board
 should provide 4G of RAM (like the current RPi4), especially if also a
-database and solr are running next to it. I recommend to give joex a
-heap of 1.5G (`-J-Xmx1536M`). You should also set the joex pool size
-to 1.
-
-When joex processes the first file, some models are built loaded into
-memory which can take a while. Subsequent processing times are faster
-then.
+database and solr are running next to it. The memory required by joex
+depends on the config and document language. Please pick a value that
+suits your setup from [here](@/docs/install/running.md#memory-usage).
+For boards like the RPi, it might be necessary to use
+`nlp.mode=basic`, rather than `nlp.mode=full`. You should also set the
+joex pool size to 1.
 
 An example: on this [UP
 board](https://up-board.org/up/specifications/) with an Intel Atom
-x5-Z8350 CPU (@1.44Ghz) and 4G RAM, a scanned (300dpi) pdf file with 6
-pages took *3:20 min* to process. This board also runs the SOLR and a
-postgresql database.
+x5-Z8350 CPU (@1.44Ghz) and 4G RAM, a scanned (300dpi, in German) pdf
+file with 6 pages took *3:20 min* to process. This board also runs the
+SOLR and a postgresql database.
 
 The same file was processed in 55s on a qemu virtual machine on my i7
 notebook, using 1 CPU and 4G RAM (and identical config for joex). The
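For a Raspberry-Pi-class board, the two adjustments mentioned above (basic NLP mode and a pool size of 1) could be sketched in the joex config like this. Note the `scheduler.pool-size` key path is an assumption here, only the two recommendations themselves come from the text:

```
docspell.joex {
  # cheaper NLP variant for low-memory boards
  text-analysis.nlp.mode = basic
  # process only one job at a time (assumed key path)
  scheduler.pool-size = 1
}
```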
@@ -35,6 +35,42 @@ You should be able to create a new account and sign in. Check the
 [configuration page](@/docs/configure/_index.md) to further customize
 docspell.
 
+## Memory Usage
+
+The memory requirements for the joex component depend on the document
+language and the configuration for [file
+processing](@/docs/configure/_index.md#file-processing). The
+`nlp.mode` setting has significant impact, especially when your
+documents are in German. Here are some rough numbers on jvm heap usage
+(the same small jpeg file was used for all tries):
+
+<table class="table is-hoverable is-striped">
+<thead>
+<tr><th>nlp.mode</th><th>English</th><th>German</th><th>French</th></tr>
+</thead>
+<tfoot>
+</tfoot>
+<tbody>
+<tr><td>full</td><td>420M</td><td>950M</td><td>490M</td></tr>
+<tr><td>basic</td><td>170M</td><td>380M</td><td>390M</td></tr>
+</tbody>
+</table>
+
+When using `mode=full`, a heap setting of at least `-Xmx1400M` is
+recommended. For `mode=basic` a heap setting of at least `-Xmx500M` is
+recommended.
+
+Other languages can't use these two modes, and so don't require this
+amount of memory (but don't have as good results). Then you can go
+with less heap.
+
+More details about these modes can be found
+[here](@/docs/joex/file-processing.md#text-analysis).
+
+The restserver component is very lightweight, here you can use
+defaults.
+
 ## Options
 
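When writing a start script, the recommendations above can be captured in a small helper. This is only an illustration: the `full` and `basic` values mirror the heap recommendations from the text, while the fallback for `regexonly`/`disabled` is a guessed, smaller value.

```shell
# map an nlp.mode value to a recommended joex heap flag
heap_for_mode() {
  case "$1" in
    full)  echo "-J-Xmx1400M" ;;  # at least 1400M for mode=full
    basic) echo "-J-Xmx500M"  ;;  # at least 500M for mode=basic
    *)     echo "-J-Xmx300M"  ;;  # regexonly/disabled: assumed smaller value
  esac
}

heap_for_mode full
```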
@@ -65,10 +101,10 @@ $ ./docspell-restserver*/bin/docspell-restserver -h
 
 gives an overview of supported options.
 
-It is recommended to run joex with 1.5G heap space or more and with
-the G1GC enabled. If you use java8, you need to add an option to use
-G1GC, for java11 this is not necessary (but doesn't hurt either). This
-could look like this:
+It is recommended to run joex with the G1GC enabled. If you use java8,
+you need to add an option to use G1GC (`-XX:+UseG1GC`), for java11
+this is not necessary (but doesn't hurt either). This could look like
+this:
 
 ```
 ./docspell-joex-{{version()}}/bin/docspell-joex -J-Xmx1596M -J-XX:+UseG1GC -- /path/to/joex.conf
@@ -331,91 +331,121 @@ images for a collective. There is also a bash script provided in the
 
 # Text Analysis
 
-This uses the extracted text to find what could be attached to the new
-item. There are multiple things provided.
+Finally, the extracted text is analysed to find possible metadata that
+can be attached to the new item. There are two different approaches
+provided.
 
-Docspell depends on the [Stanford NLP
+The basic idea here is, that instead of *you defining textual rules* to
+apply tags and other things, these rules *are found for you* based on
+what you have provided so far.
+
+Docspell relies on the [Stanford NLP
 Library](https://nlp.stanford.edu/software/) for its AI features.
-Among other things they provide a classifier (used for guessing tags)
-and NER annotators. The latter is also a classifier, that associates a
-label to terms in a text. It finds out whether some term is probably
-an organization, a person etc. This is then used to find matches in
-your address book.
-
-When docspell finds several possible candidates for a match, it will
-show the first few to you. If then the first was not the correct one,
-it can usually be fixed by a single click, because it is among the
-suggestions.
+Among other things they provide a classifier and NER annotators. The
+latter is also a classifier, that associates a label to terms in a
+text. It finds out whether some term is probably an organization, a
+person etc. It tries to “understand” the structure of the text, like
+verbs, nouns and their relation.
+
+The two approaches used are sketched below. They both have advantages
+and disadvantages and are by default used together. However, depending
+on the document languages, not all approaches are possible. They also
+have different memory footprints, and you might want to disable some
+features if running on low-end machines.
 
 ## Classification
 
 If you enabled classification in the config file, a model is trained
-periodically from your files. This is used to guess a tag for the item
-for new documents.
-
-You can tell docspell how many documents it should use for training.
-Sometimes (when moving?), documents may change and you only like to
-base next guesses on the documents of last year only. This can be
-found in the collective settings.
-
-The admin can also limit the number of documents to train with,
-because it affects memory usage.
+periodically from a collective's files. Very roughly speaking… this
+model contains the essence of "patterns" in the text that are likely
+related to a tag, a correspondent etc.
+
+When a new document arrives, this model is used to ask for what
+metadata (tag, correspondent, etc) it thinks is likely to apply here.
+
+Training the model is a rather resource intensive process, but using
+an existing model is quite cheap and fast. A model is trained
+periodically, the schedule can be defined in your collective settings.
+For tags, you can define the tag categories that should be trained (or
+that should not be trained). Docspell assigns one tag from all tags in
+a category to a new document.
+
+Note that tags that can not be derived from the text only, should
+probably be excluded from learning. For example, if you tag all your
+items with `Done` at some point, it may falsely learn patterns to this
+tag and tag your new documents with `Done`.
+
+The admin can also limit the number of documents to train with in the
+config file to control the memory footprint when training.
+
+Classification is used in Docspell once for guessing tags and also for
+finding correspondent and concerned entities. For correspondent and
+concerned entities, the NLP approach is used first and the classifier
+results then fill missing values.
 
 
 ## Natural Language Processing
 
-NLP is used to find out which terms in a text may be a company or
-person that is then used to find metadata in your address book. It can
-also uses your complete address book to match terms in the text. So
-there are two ways: using a statistical model, terms in a text are
-identified as organization or person etc. This information is then
-used to search your address book. Second, regexp rules are derived
-from the address book and run against the text. By default, both are
-applied, where the rules are run as the last step to identify missing
-terms.
-
-The statistical model approach is good, i.e. for large address books.
-Normally, a document contains only very few organizations or person
-names. So it is much more efficient to check these against your
-address book (in contrast to the other way around). It can also find
-things *not* in your address book. However, it might not detect all or
-there are no statistical models for your language. Then the address
-book is used to automatically create rules that are run against the
-document.
-
-These statistical models are provided by [Stanford
-NLP](https://nlp.stanford.edu/software/) and are currently available
-for German, English and French. All other languages can use the rule
-approach. The statistcal models, however, require quite some memory –
-depending on the size of the models which varies between languages.
-English has a lower memory footprint than German, for example. If you
-have a very large address book, the rule approach may also use a lot
-memory.
+NLP is the other approach that works a bit differently. In this
+approach, algorithms are used that find language properties from the
+given text, for example which terms are nouns, organization or person
+names etc. This also requires a statistical model, but this time for a
+whole language. These are also provided by [Stanford
+NLP](https://nlp.stanford.edu/software/), but not for all languages.
+So whether this can be used depends on the document language. Models
+exist for German, English and French currently.
+
+Then [Stanford NLP](https://nlp.stanford.edu/software/) also allows to
+run custom rules against a text. This can be used as a fallback for
+terms where the statistical model didn't succeed. But it can also be
+used by itself. Docspell derives these rules from your address book,
+so it can find terms in the document text that match your organization
+and person names. This does not depend on the document language.
+
+By default, Docspell does both: it first uses the statistical language
+model (if available for the given language) and then runs the
+address-book derived rules as a last step on so far unclassified
+terms. This allows for the best results. If more than one candidate is
+found, the "most likely" one is set on the item and others are stored
+as suggestions.
+
+The statistical model approach works generally very well, i.e. for
+large address books. Normally, a document contains only very few
+organizations or person names. So it is more efficient to check these
+few against your (probably large) address book; in contrast to testing
+hundreds of company names against a single document. It can also find
+things *not* in your address book (but this is unused in Docspell
+currently). However, it might not detect all or there are no
+statistical models for your language. Then the address book is used to
+automatically create rules that are run against the document.
+
+Both ways require memory, it depends on the size of your address book
+and on the size of the language models (they vary for each language).
 In the config file, you can specify different modes of operation for
 nlp processing as follows:
 
 - mode `full`: creates the complete nlp pipeline, requiring the most
   amount of memory, providing the best results. I'd recommend to run
-  joex with a heap size of a least 1.5G (for English only, it can be
+  joex with a heap size of at least 1.4G (for English only, it can be
   lower than that).
 - mode `basic`: it only loads the NER tagger. This doesn't work as
-  well as the complete pipeline, because some steps are simply
-  skipped. But it gives quite good results and uses less memory. I'd
-  recommend to run joex with at least 600m heap in this mode.
+  well as the complete pipeline, because some NLP steps are simply
+  skipped. But it gives quite good results already and uses less
+  memory. I'd recommend to run joex with at least 500m heap in this
+  mode.
 - mode `regexonly`: this doesn't load any statistical models and is
-  therefore very memory efficient (depending on the address book size,
-  of course). It will use the address book to create regex rules and
-  match them against your document. It doesn't depend on a language,
-  so this is available for all languages.
-- mode = disabled: this disables nlp processing altogether
+  therefore much lighter on memory (depending on the address book
+  size, of course). It will use the address book to create regex rules
+  and match them against your document.
+- mode = disabled: this disables nlp processing altogether. Then only
+  the classifier is run (unless disabled).
 
 Note that mode `full` and `basic` is only relevant for the languages
 where models are available. For all other languages, it is effectively
 the same as `regexonly`.
 
-The config file allows some settings. You can specify a limit for
-texts. Large texts result in higher memory consumption. By default,
+The config file allows to specify a limit for texts to analyse in
+general. Large texts result in higher memory consumption. By default,
 the first 10'000 characters are taken into account.
 
 Then, for the `regexonly` mode, you can restrict the number of address
@@ -424,7 +454,7 @@ book entries that are used to create the rule set via
 footprint.
 
 The setting `clear-stanford-nlp-interval` allows to define an idle
-time after which the model files are cleared from memory. This allows
-memory to be reclaimed by the OS. The timer starts after the last file
-has been processed. If you can afford it, it is recommended to disable
-it by setting it to `0`.
+time after which the language models are cleared from memory. This
+allows memory to be reclaimed by the OS. The timer starts after the
+last file has been processed. If you can afford it, it is recommended
+to disable it by setting it to `0`.
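In the config, the setting from this hunk would appear roughly like this. A sketch only: the key name and the meaning of `0` come from the text above, while the nesting under `text-analysis` and the duration syntax are assumptions:

```
docspell.joex {
  text-analysis {
    # clear language models after this idle time;
    # set to 0 to keep them in memory (recommended if you can afford it)
    clear-stanford-nlp-interval = "15 minutes"
  }
}
```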