Update documentation for text analysis

2025-09-15 21:46:53 +00:00 · 2021-01-21 20:06:53 +01:00
parent 9957c3267e
commit 021ac568ae
2 changed files with 32 additions and 13 deletions
--- a/website/site/content/docs/configure/_index.md
+++ b/website/site/content/docs/configure/_index.md
@@ -290,7 +290,8 @@ Files are being processed by the joex component. So all the respective
 configuration is in this config only.
 File processing involves several stages, detailed information can be
-found [here](@/docs/joex/file-processing.md#text-analysis).
+found [here](@/docs/joex/file-processing.md#text-analysis) and in the
+corresponding sections in [joex default config](#joex).
 Configuration allows to define the external tools and set some
 limitations to control memory usage. The sections are:
@@ -301,9 +302,25 @@ limitations to control memory usage. The sections are:
 Options to external commands can use variables that are replaced by
 values at runtime. Variables are enclosed in double braces `{{…}}`.
-Please see the default configuration for more details.
+Please see the default configuration for what variables exist per
+command.
-### `text-analysis.nlp.mode`
+### Classification
+In `text-analysis.classification` you can define how many documents at
+most should be used for learning. The default settings should work
+well for most cases. However, it always depends on the amount of data
+and the machine that runs joex. For example, by default the documents
+to learn from are limited to 600 (`classification.item-count`) and
+every text is cut after 8000 characters (`text-analysis.max-length`).
+This is fine if *most* of your documents are small and only a few are
+near 8000 characters. But if *all* your documents are very large, you
+probably need to either assign more heap memory or go down with the
+limits.
+Classification can be disabled, too, for when it's not needed.
+### NLP
 This setting defines which NLP mode to use. It defaults to `full`,
 which requires more memory for certain languages (with the advantage
@@ -329,10 +346,7 @@ is used to find metadata.
 You might want to try different modes and see what combination suits
 best your usage pattern and machine running joex. If a powerful
 machine is used, simply leave the defaults. When running on an older
-raspberry pi, for example, you might need to adjust things. The
+raspberry pi, for example, you might need to adjust things.
-corresponding sections in [joex default config](#joex) and the [file
-processing](@/docs/joex/file-processing.md#text-analysis) page provide more
-details.
 # File Format
--- a/website/site/content/docs/joex/file-processing.md
+++ b/website/site/content/docs/joex/file-processing.md
@@ -363,12 +363,17 @@ related to a tag, a correspondent etc.
 When a new document arrives, this model is used to ask for what
 metadata (tag, correspondent, etc) it thinks is likely to apply here.
-Training the model is a rather resource intensive process, but using
-an existing model is quite cheap and fast. A model is trained
-periodically, the schedule can be defined in your collective settings.
-For tags, you can define the tag categories that should be trained (or
-that should not be trained). Docspell assigns one tag from all tags in
-a category to a new document.
+Training the model is a rather resource intensive process. How much
+memory is needed, depends on the number of documents to learn from and
+the size of text to consider. Both can be limited in the config file.
+The default values might require a heap of 1.4G if you have many and
+large documents. The maximum text length is about 8000 characters, if
+*all* your documents would be that large, adjusting these values might
+be necessary. But using an existing model is quite cheap and fast. A
+model is trained periodically, the schedule can be defined in your
+collective settings. For tags, you can define the tag categories that
+should be trained (or that should not be trained). Docspell assigns
+one tag from all tags in a category to a new document.
 Note that tags that can not be derived from the text only, should
 probably be excluded from learning. For example, if you tag all your