mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-06-23 02:48:26 +00:00
Update documentation for text analysis
@@ -363,12 +363,17 @@ related to a tag, a corrpesondent etc.
 When a new document arrives, this model is used to ask for what
 metadata (tag, correspondent, etc) it thinks is likely to apply here.
 
-Training the model is a rather resource intensive process, but using
-an existing model is quite cheap and fast. A model is trained
-periodically, the schedule can be defined in your collective settings.
-For tags, you can define the tag categories that should be trained (or
-that should not be trained). Docspell assigns one tag from all tags in
-a category to a new document.
+Training the model is a rather resource intensive process. How much
+memory is needed, depends on the number of documents to learn from and
+the size of text to consider. Both can be limited in the config file.
+The default values might require a heap of 1.4G if you have many and
+large documents. The maximum text length is about 8000 characters, if
+*all* your documents would be that large, adjusting these values might
+be necessary. But using an existing model is quite cheap and fast. A
+model is trained periodically, the schedule can be defined in your
+collective settings. For tags, you can define the tag categories that
+should be trained (or that should not be trained). Docspell assigns
+one tag from all tags in a category to a new document.
 
 Note that tags that can not be derived from the text only, should
 probably be excluded from learning. For example, if you tag all your