Update documentation

2025-10-18 05:41:51 +00:00 · 2021-01-18 00:59:37 +01:00
parent 3f75af0807
commit c5778880d9
2 changed files with 97 additions and 15 deletions
--- a/website/site/content/docs/joex/file-processing.md
+++ b/website/site/content/docs/joex/file-processing.md
@@ -334,33 +334,97 @@ images for a collective. There is also a bash script provided in the
 This uses the extracted text to find what could be attached to the new
 item. There are multiple things provided.

+Docspell depends on the [Stanford NLP
+Library](https://nlp.stanford.edu/software/) for its AI features.
+Among other things, it provides a classifier (used for guessing tags)
+and NER annotators. The latter are also classifiers that assign a
+label to terms in a text, indicating whether a term is probably an
+organization, a person, etc. These labels are then used to find
+matches in your address book.
+
+When docspell finds several possible candidates for a match, it will
+show the first few to you. If the first one is not correct, this can
+usually be fixed with a single click, because the correct one is
+among the suggestions.

 ## Classification

 If you enabled classification in the config file, a model is trained
-periodically from your files. This is now used to guess a tag for the
-item.
+periodically from your files. This model is used to guess a tag for
+new items.
+
+You can tell docspell how many documents it should use for training.
+Sometimes (for example after moving), your documents change and you
+want to base the next guesses only on the documents of the last year.
+This setting can be found in the collective settings.
+
+The admin can also limit the number of documents to train with,
+because it affects memory usage.


 ## Natural Language Processing

-NLP is used to find out which terms in the text may be a company or
-person that is later used to find metadata to attach to. It also uses
-your address book to match terms in the text.
+NLP is used to find out which terms in a text may be a company or
+person; these are then used to find metadata in your address book. It
+can also use your complete address book to match terms in the text. So
+there are two ways. First, using a statistical model, terms in a text
+are identified as an organization, a person, etc. This information is
+then used to search your address book. Second, regexp rules are
+derived from the address book and run against the text. By default,
+both are applied, where the rules are run as the last step to identify
+missing terms.

-This requires to load language model files in memory, which is quite a
-lot. Also, the number of languages is much more restricted than for
-tesseract. Currently English, German and French are supported.
+The statistical model approach is good, e.g., for large address books.
+Normally, a document contains only very few organization or person
+names. So it is much more efficient to check these against your
+address book (in contrast to the other way around). It can also find
+things *not* in your address book. However, it might not detect all
+terms, or there may be no statistical model for your language. Then
+the address book is used to automatically create rules that are run
+against the document.

-Another feature that is planned, but not yet provided is to propose
-new companies/people you don't have yet in your address book.
+These statistical models are provided by [Stanford
+NLP](https://nlp.stanford.edu/software/) and are currently available
+for German, English and French. All other languages can use the rule
+approach. The statistical models, however, require quite some memory,
+depending on the size of the models, which varies between languages.
+English has a lower memory footprint than German, for example. If you
+have a very large address book, the rule approach may also use a lot
+of memory.
+
+In the config file, you can specify different modes of operation for
+nlp processing as follows:
+
+- mode `full`: creates the complete nlp pipeline, requiring the most
+  memory and providing the best results. I'd recommend running joex
+  with a heap size of at least 1.5G (for English only, it can be
+  lower than that).
+- mode `basic`: only loads the NER tagger. This doesn't work as
+  well as the complete pipeline, because some steps are simply
+  skipped. But it gives quite good results and uses less memory. I'd
+  recommend running joex with at least 600m heap in this mode.
+- mode `regexonly`: doesn't load any statistical models and is
+  therefore very memory efficient (depending on the address book size,
+  of course). It will use the address book to create regex rules and
+  match them against your document. It doesn't depend on a language,
+  so this is available for all languages.
+- mode `disabled`: disables nlp processing altogether.
+
+Note that modes `full` and `basic` are only relevant for the languages
+where models are available. For all other languages, they are
+effectively the same as `regexonly`.

 The config file allows some settings. You can specify a limit for
 texts. Large texts result in higher memory consumption. By default,
 the first 10'000 characters are taken into account.

+Then, for the `regexonly` mode, you can restrict the number of address
+book entries that are used to create the rule set via
+`regex-ner.max-entries`. This may be useful to reduce the memory
+footprint.
+
 The setting `clear-stanford-nlp-interval` allows to define an idle
 time after which the model files are cleared from memory. This allows
-to be reclaimed by the OS. The timer starts after the last file has
-been processed. If you can afford it, it is recommended to disable it
-by setting it to `0`.
+memory to be reclaimed by the OS. The timer starts after the last file
+has been processed. If you can afford it, it is recommended to disable
+it by setting it to `0`.
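
The rule approach documented in the hunk above (regex rules derived from the address book and run against the first part of the text) can be illustrated with a small Python sketch. This is not Docspell's implementation, which runs on the JVM. The address book entries, the function names `build_rules` and `find_matches`, and the parameters `max_entries` and `max_length` are all assumptions made for illustration; they only mirror the ideas behind `regex-ner.max-entries` and the text length limit.

```python
import re

# Hypothetical address book; names and kinds are made up for this sketch.
ADDRESS_BOOK = [
    ("Acme Corp", "organization"),
    ("Jane Doe", "person"),
    ("Stadtwerke Musterstadt", "organization"),
]


def build_rules(entries, max_entries=1000):
    """Turn address book entries into (compiled regex, name, kind) rules.

    max_entries mirrors the idea of limiting how many entries are used:
    fewer entries mean fewer rules and a smaller memory footprint.
    """
    rules = []
    for name, kind in entries[:max_entries]:
        # Match the words of a name separated by any whitespace,
        # ignoring case, and only at word boundaries.
        words = [re.escape(word) for word in name.split()]
        pattern = r"\b" + r"\s+".join(words) + r"\b"
        rules.append((re.compile(pattern, re.IGNORECASE), name, kind))
    return rules


def find_matches(text, rules, max_length=10_000):
    """Run all rules against the first max_length characters of text."""
    head = text[:max_length]
    return [(name, kind) for regex, name, kind in rules if regex.search(head)]


if __name__ == "__main__":
    sample = "Invoice from ACME   Corp, attention: jane doe."
    print(find_matches(sample, build_rules(ADDRESS_BOOK)))
```

Every address book entry becomes one compiled pattern, which is why a very large address book can make this approach memory hungry and why capping the number of entries helps.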
--- a/website/site/content/docs/webapp/metadata.md
+++ b/website/site/content/docs/webapp/metadata.md
@@ -130,10 +130,28 @@ page](@/docs/webapp/customfields.md) for more information.
 # Document Language

 An important setting is the language of your documents. This helps OCR
-and text analysis. You can select between English, German and French
-currently. The language can also specified with each [upload
+and text analysis. You can select between various languages. The
+language can also be specified with each [upload
 request](@/docs/api/upload.md).

 Go to the *Collective Settings* page and click *Document
 Language*. This will set the lanugage for all your documents. It is
 not (yet) possible to specify it when uploading.
+
+The language affects several areas: text extraction, fulltext search
+and text analysis. When extracting text from images, tesseract (the
+external tool used for this) can yield better results if the language
+is known. Also, solr (the fulltext search tool) can optimize its index
+given the language, which results in a better fulltext search
+experience. The features of text analysis strongly depend on the
+language. Docspell uses the [Stanford NLP
+Library](https://nlp.stanford.edu/software/) for its great machine
+learning algorithms. Some of them, like certain NLP features, are only
+available for some languages – namely German, English and French. The
+reason is that the required statistical models are not available for
+other languages. However, docspell can still run other algorithms for
+the other languages, like classification and custom rules based on the
+address book.
+
+More information about file processing and text analysis can be found
+[here](@/docs/joex/file-processing.md#text-analysis).
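
How the document language interacts with the nlp modes can be sketched in Python as follows. This is only an illustration of the documented behaviour, not Docspell's code. The function name `effective_mode` and the two-letter language codes are assumptions; the text only states that statistical models exist for German, English and French.

```python
# Assumption: languages are identified by two-letter codes. The text only
# says that models are available for German, English and French.
LANGUAGES_WITH_MODELS = {"de", "en", "fr"}

MODES = ("full", "basic", "regexonly", "disabled")


def effective_mode(configured_mode, language):
    """Return the mode that is effectively used for a given language.

    `full` and `basic` need statistical models. For languages without
    models they behave like `regexonly`, as described in the docs.
    """
    if configured_mode not in MODES:
        raise ValueError(f"unknown nlp mode: {configured_mode}")
    needs_model = configured_mode in ("full", "basic")
    if needs_model and language not in LANGUAGES_WITH_MODELS:
        return "regexonly"
    return configured_mode


if __name__ == "__main__":
    for lang in ("en", "it"):
        print(lang, effective_mode("full", lang))
```

With these assumptions, a collective set to Italian still gets the address book rules even when the configured mode is `full`.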