Update documentation for text analysis

2025-06-23 02:48:26 +00:00 · 2021-01-21 20:06:53 +01:00
parent 9957c3267e
commit 021ac568ae
2 changed files with 32 additions and 13 deletions
--- a/website/site/content/docs/joex/file-processing.md
+++ b/website/site/content/docs/joex/file-processing.md
@@ -363,12 +363,17 @@ related to a tag, a corrpesondent etc.
 When a new document arrives, this model is used to ask for what
 metadata (tag, correspondent, etc) it thinks is likely to apply here.

-Training the model is a rather resource intensive process, but using
-an existing model is quite cheap and fast. A model is trained
-periodically, the schedule can be defined in your collective settings.
-For tags, you can define the tag categories that should be trained (or
-that should not be trained). Docspell assigns one tag from all tags in
-a category to a new document.
+Training the model is a rather resource-intensive process. How much
+memory is needed depends on the number of documents to learn from and
+the size of text to consider. Both can be limited in the config file.
+The default values might require a heap of 1.4G if you have many
+large documents. The maximum text length is about 8000 characters; if
+*all* your documents are that large, adjusting these values might be
+necessary. Using an existing model, however, is quite cheap and fast.
+A model is trained periodically; the schedule can be defined in your
+collective settings. For tags, you can define the tag categories that
+should be trained (or that should not be trained). Docspell assigns
+one tag from all tags in a category to a new document.

 Note that tags that can not be derived from the text only, should
 probably be excluded from learning. For example, if you tag all your