Update documentation

This commit is contained in:
Eike Kettner 2021-01-18 00:59:37 +01:00
parent 3f75af0807
commit c5778880d9
2 changed files with 97 additions and 15 deletions

This uses the extracted text to find what could be attached to the new
item. There are multiple things provided.
Docspell depends on the [Stanford NLP
Library](https://nlp.stanford.edu/software/) for its AI features.
Among other things, it provides a classifier (used for guessing tags)
and NER annotators. The latter is also a classifier that associates a
label with terms in a text: it finds out whether a term is probably an
organization, a person etc. This is then used to find matches in your
address book.
When docspell finds several possible candidates for a match, it will
show the first few to you. If the first one is not correct, this can
usually be fixed with a single click, because the right one is among
the suggestions.
## Classification
If you enable classification in the config file, a model is trained
periodically from your files. It is then used to guess a tag for new
documents.
You can tell docspell how many documents it should use for training.
Sometimes (for example, after moving) documents change, and you may
want to base future guesses only on the documents of the last year.
This setting can be found in the collective settings.
The admin can also limit the number of documents to train with,
because training affects memory usage.
## Natural Language Processing
NLP is used to find out which terms in a text may be a company or a
person; these are then used to find metadata in your address book.
There are two ways: first, using a statistical model, terms in a text
are identified as organizations, persons etc., and this information is
then used to search your address book. Second, regex rules are derived
from the address book and run against the text. By default, both are
applied, with the rules run as the last step to identify terms the
model missed.
This requires loading language model files into memory, which takes
quite a lot of it. Also, the number of supported languages is much
more restricted than for tesseract: currently English, German and
French are supported.
The statistical model approach is good, e.g. for large address books.
Normally, a document contains only very few organization or person
names, so it is much more efficient to check these against your
address book than the other way around. It can also find things *not*
in your address book. However, it might not detect everything, or
there may be no statistical model for your language. In that case, the
address book is used to automatically create rules that are run
against the document.
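To illustrate the rule approach, here is a minimal sketch of deriving
regex rules from an address book and running them against a document's
text. This is only an illustration of the idea, not docspell's actual
implementation:

```python
import re

def build_rules(address_book):
    """Compile one case-insensitive regex per address book entry."""
    return {
        name: re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        for name in address_book
    }

def find_candidates(text, rules):
    """Return the address book entries whose rule matches the text."""
    return [name for name, rx in rules.items() if rx.search(text)]

rules = build_rules(["Acme Corp", "Jane Doe"])
print(find_candidates("Invoice from ACME Corp, attn. Jane Doe", rules))
# → ['Acme Corp', 'Jane Doe']
```

Since the rules are derived from the address book, this approach only
finds entries you already have; the statistical model, in contrast,
can flag names it has never seen.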
Another feature that is planned, but not yet provided, is to propose
new companies/people that are not yet in your address book.
These statistical models are provided by [Stanford
NLP](https://nlp.stanford.edu/software/) and are currently available
for German, English and French. All other languages can use the rule
approach. The statistical models, however, require quite some memory,
depending on the size of the model, which varies between languages.
English has a lower memory footprint than German, for example. If you
have a very large address book, the rule approach may also use a lot
of memory.
In the config file, you can specify different modes of operation for
nlp processing as follows:
- mode `full`: creates the complete NLP pipeline, requiring the most
  memory but providing the best results. I'd recommend running joex
  with a heap size of at least 1.5G (for English only, it can be lower
  than that).
- mode `basic`: only loads the NER tagger. This doesn't work as well
  as the complete pipeline, because some steps are simply skipped, but
  it gives quite good results and uses less memory. I'd recommend
  running joex with at least 600m heap in this mode.
- mode `regexonly`: doesn't load any statistical models and is
  therefore very memory efficient (depending on the address book size,
  of course). It uses the address book to create regex rules and
  matches them against your document. It doesn't depend on a language,
  so this is available for all languages.
- mode `disabled`: disables NLP processing altogether.
Note that modes `full` and `basic` are only relevant for the languages
where models are available. For all other languages, they behave the
same as `regexonly`.
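As a sketch, the mode could be set in the joex configuration like this
(the exact key path is an assumption and may differ between versions):

```
docspell.joex {
  text-analysis {
    nlp {
      # one of: full, basic, regexonly, disabled
      mode = basic
    }
  }
}
```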
The config file allows a few more settings. You can specify a limit
for text length, since large texts result in higher memory
consumption. By default, the first 10'000 characters are taken into
account.
Then, for the `regexonly` mode, you can restrict the number of address
book entries that are used to create the rule set via
`regex-ner.max-entries`. This may be useful to reduce memory
footprint.
The setting `clear-stanford-nlp-interval` allows defining an idle time
after which the model files are cleared from memory. This allows the
memory to be reclaimed by the OS. The timer starts after the last file
has been processed. If you can afford it, it is recommended to disable
this by setting it to `0`.
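Putting these settings together, a sketch of the relevant joex config
section could look as follows (the `max-length` key name and the exact
nesting are assumptions; `regex-ner.max-entries` and
`clear-stanford-nlp-interval` are the names used above):

```
docspell.joex {
  text-analysis {
    # only the first 10'000 characters are analyzed (assumed key name)
    max-length = 10000

    # clear loaded model files after 15 minutes idle; 0 disables clearing
    clear-stanford-nlp-interval = "15 minutes"

    regex-ner {
      # limit the address book entries used to build regex rules
      max-entries = 1000
    }
  }
}
```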

# Document Language
An important setting is the language of your documents. It helps OCR
and text analysis. You can select between various languages. The
language can also be specified with each [upload
request](@/docs/api/upload.md).
Go to the *Collective Settings* page and click *Document
Language*. This will set the language for all your documents. It is
not (yet) possible to specify it when uploading.
The language has effects in several areas: text extraction, fulltext
search and text analysis. When extracting text from images, tesseract
(the external tool used for this) can yield better results if the
language is known. Also, solr (the fulltext search tool) can optimize
its index for the given language, which results in a better fulltext
search experience. The features of text analysis strongly depend on
the language. Docspell uses the [Stanford NLP
Library](https://nlp.stanford.edu/software/) for its great machine
learning algorithms. Some of them, like certain NLP features, are only
available for some languages, namely German, English and French. The
reason is that the required statistical models are not available for
other languages. However, docspell can still run other algorithms for
those languages, like classification and custom rules based on the
address book.
More information about file processing and text analysis can be found
[here](@/docs/joex/file-processing.md#text-analysis).