Mirror of https://github.com/TheAnachronism/docspell.git,
synced 2025-04-04 10:29:34 +00:00

Update documentation

This commit is contained in:
parent 3f75af0807
commit c5778880d9
@@ -334,33 +334,97 @@ images for a collective. There is also a bash script provided in the
This uses the extracted text to find what could be attached to the new
item. There are multiple things provided.

Docspell depends on the [Stanford NLP
Library](https://nlp.stanford.edu/software/) for its AI features.
Among other things, it provides a classifier (used for guessing tags)
and NER annotators. The latter is also a classifier that associates a
label to terms in a text. It finds out whether some term is probably
an organization, a person etc. This is then used to find matches in
your address book.

When docspell finds several possible candidates for a match, it will
show the first few to you. If the first one was not correct, this can
usually be fixed with a single click, because the right one is among
the suggestions.

## Classification

If you enabled classification in the config file, a model is trained
periodically from your files. It is used to guess a tag for new
documents.

You can tell docspell how many documents it should use for training.
Sometimes (for example, after moving) your documents change, and you
may want to base future guesses only on the documents of the last
year. This can be configured in the collective settings.

The admin can also limit the number of documents to train with,
because it affects memory usage.

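For illustration, such a limit might be expressed in the joex config
roughly like this. The key names and nesting here are an assumption,
not taken from the text; consult the default joex configuration for
the authoritative names.

```conf
# Hypothetical sketch of a joex text-analysis section.
docspell.joex.text-analysis {
  classification {
    # Assumed key: maximum number of documents considered when
    # training the classifier; lower values reduce memory usage.
    item-count = 600
  }
}
```
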
## Natural Language Processing

NLP is used to find out which terms in a text may be a company or a
person, which is then used to find metadata in your address book. It
also uses your complete address book to match terms in the text. So
there are two ways: first, using a statistical model, terms in a text
are identified as organization, person etc. This information is then
used to search your address book. Second, regexp rules are derived
from the address book and run against the text. By default, both are
applied, where the rules are run as the last step to identify missing
terms.

The statistical model approach works well, for example, for large
address books. Normally, a document contains only very few
organization or person names, so it is much more efficient to check
these against your address book than the other way around. It can
also find things *not* in your address book. However, it might not
detect all terms, or there may be no statistical model for your
language. Then the address book is used to automatically create rules
that are run against the document.

These statistical models are provided by [Stanford
NLP](https://nlp.stanford.edu/software/) and are currently available
for German, English and French. All other languages can use the rule
approach. The statistical models, however, require quite some memory,
depending on the size of the models, which varies between languages.
English has a lower memory footprint than German, for example. If you
have a very large address book, the rule approach may also use a lot
of memory.

In the config file, you can specify different modes of operation for
nlp processing as follows:

- mode `full`: creates the complete nlp pipeline, requiring the most
  memory but providing the best results. I'd recommend running joex
  with a heap size of at least 1.5G (for English only, it can be
  lower than that).
- mode `basic`: only loads the NER tagger. This doesn't work as well
  as the complete pipeline, because some steps are simply skipped.
  But it gives quite good results and uses less memory. I'd recommend
  running joex with at least 600m heap in this mode.
- mode `regexonly`: doesn't load any statistical models and is
  therefore very memory efficient (depending on the address book
  size, of course). It will use the address book to create regex
  rules and match them against your document. It doesn't depend on a
  language, so it is available for all languages.
- mode `disabled`: disables nlp processing altogether.

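As an illustration, the mode could be set in the joex configuration
like this. The surrounding key path is an assumption based on the
setting names in this section; check the default joex config for the
exact structure.

```conf
# Hypothetical sketch; key nesting is assumed.
docspell.joex.text-analysis {
  nlp {
    # One of: full, basic, regexonly, disabled (see the list above).
    # `basic` trades some accuracy for a smaller heap.
    mode = basic
  }
}
```
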
Note that the modes `full` and `basic` are only relevant for the
languages where models are available. For all other languages, they
are effectively the same as `regexonly`.

The config file allows some settings. You can specify a limit for
texts. Large texts result in higher memory consumption. By default,
the first 10'000 characters are taken into account.

Then, for the `regexonly` mode, you can restrict the number of
address book entries that are used to create the rule set via
`regex-ner.max-entries`. This may be useful to reduce the memory
footprint.

The setting `clear-stanford-nlp-interval` lets you define an idle
time after which the model files are cleared from memory. This allows
the memory to be reclaimed by the OS. The timer starts after the last
file has been processed. If you can afford it, it is recommended to
disable it by setting it to `0`.

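Putting the memory-related settings together, a sketch of the
relevant config section could look like the following. Only the
setting names `regex-ner.max-entries` and
`clear-stanford-nlp-interval` and the 10'000 character default come
from the text above; the key nesting and the other values are
assumptions.

```conf
# Hypothetical sketch; verify key paths against the default config.
docspell.joex.text-analysis {
  # Only the first n characters of the extracted text are analyzed.
  max-length = 10000

  # Limit the address book entries used to build regex rules.
  regex-ner.max-entries = 1000

  # Clear loaded model files from memory after this idle time;
  # set to 0 to keep them loaded (recommended if memory allows).
  clear-stanford-nlp-interval = "15 minutes"
}
```
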
@@ -130,10 +130,28 @@ page](@/docs/webapp/customfields.md) for more information.
# Document Language

An important setting is the language of your documents. This helps
OCR and text analysis. You can select between various languages. The
language can also be specified with each [upload
request](@/docs/api/upload.md).

Go to the *Collective Settings* page and click *Document
Language*. This will set the language for all your documents, unless
a different language is given with an upload request.

The language has effects in several areas: text extraction, fulltext
search and text analysis. When extracting text from images, tesseract
(the external tool used for this) can yield better results if the
language is known. Also, solr (the fulltext search tool) can optimize
its index given the language, which results in a better fulltext
search experience. The features of text analysis strongly depend on
the language. Docspell uses the [Stanford NLP
Library](https://nlp.stanford.edu/software/) for its great machine
learning algorithms. Some of them, like certain NLP features, are
only available for some languages, namely German, English and French.
The reason is that the required statistical models are not available
for other languages. However, docspell can still run other algorithms
for the other languages, like classification and custom rules based
on the address book.

More information about file processing and text analysis can be found
[here](@/docs/joex/file-processing.md#text-analysis).