mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-06-21 01:48:26 +00:00
Make the text length limit optional
@@ -312,9 +312,9 @@ most should be used for learning. The default settings should work
 well for most cases. However, it always depends on the amount of data
 and the machine that runs joex. For example, by default the documents
 to learn from are limited to 600 (`classification.item-count`) and
-every text is cut after 8000 characters (`text-analysis.max-length`).
+every text is cut after 5000 characters (`text-analysis.max-length`).
 This is fine if *most* of your documents are small and only a few are
-near 8000 characters). But if *all* your documents are very large, you
+near 5000 characters). But if *all* your documents are very large, you
 probably need to either assign more heap memory or go down with the
 limits.
 
@@ -367,13 +367,15 @@ Training the model is a rather resource intensive process. How much
 memory is needed, depends on the number of documents to learn from and
 the size of text to consider. Both can be limited in the config file.
 The default values might require a heap of 1.4G if you have many and
-large documents. The maximum text length is about 8000 characters, if
+large documents. The maximum text length is set to 5000 characters. If
 *all* your documents would be that large, adjusting these values might
-be necessary. But using an existing model is quite cheap and fast. A
-model is trained periodically, the schedule can be defined in your
-collective settings. For tags, you can define the tag categories that
-should be trained (or that should not be trained). Docspell assigns
-one tag from all tags in a category to a new document.
+be necessary. A model is trained periodically, the schedule can be
+defined in your collective settings. Although learning is resource
+intensive, using an existing model is quite cheap and fast.
+
+For tags, you can define the tag categories that should be trained (or
+that should not be trained). Docspell assigns one tag (or none) from
+all tags in a category to a new document.
 
 Note that tags that can not be derived from the text only, should
 probably be excluded from learning. For example, if you tag all your
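
The two limits named in the changed documentation could be set in the joex configuration roughly as follows. This is a minimal sketch: only the key names `classification.item-count` and `text-analysis.max-length` and the values 600 and 5000 come from the text of the diff. The enclosing `docspell.joex` path and the nesting of `classification` under `text-analysis` are assumptions and should be checked against the default config file that ships with joex.

```conf
# Sketch only: the surrounding structure is assumed, not taken from the diff.
docspell.joex {
  text-analysis {
    # Every text is cut after this many characters.
    max-length = 5000

    classification {
      # Maximum number of documents to learn from.
      item-count = 600
    }
  }
}
```

If all documents are very large, lowering either value reduces the memory needed for training; the alternative described in the text is to assign more heap memory to joex.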