mirror of https://github.com/TheAnachronism/docspell.git, synced 2025-11-03 18:00:11 +00:00

Make the text length limit optional
@@ -312,9 +312,9 @@ most should be used for learning. The default settings should work
 well for most cases. However, it always depends on the amount of data
 and the machine that runs joex. For example, by default the documents
 to learn from are limited to 600 (`classification.item-count`) and
-every text is cut after 8000 characters (`text-analysis.max-length`).
+every text is cut after 5000 characters (`text-analysis.max-length`).
 This is fine if *most* of your documents are small and only a few are
-near 8000 characters). But if *all* your documents are very large, you
+near 5000 characters). But if *all* your documents are very large, you
 probably need to either assign more heap memory or go down with the
 limits.
 
@@ -367,13 +367,15 @@ Training the model is a rather resource intensive process. How much
 memory is needed, depends on the number of documents to learn from and
 the size of text to consider. Both can be limited in the config file.
 The default values might require a heap of 1.4G if you have many and
-large documents. The maximum text length is about 8000 characters, if
+large documents. The maximum text length is set to 5000 characters. If
 *all* your documents would be that large, adjusting these values might
-be necessary. But using an existing model is quite cheap and fast. A
-model is trained periodically, the schedule can be defined in your
-collective settings. For tags, you can define the tag categories that
-should be trained (or that should not be trained). Docspell assigns
-one tag from all tags in a category to a new document.
+be necessary. A model is trained periodically, the schedule can be
+defined in your collective settings. Although learning is resource
+intensive, using an existing model is quite cheap and fast.
+
+For tags, you can define the tag categories that should be trained (or
+that should not be trained). Docspell assigns one tag (or none) from
+all tags in a category to a new document.
 
 Note that tags that can not be derived from the text only, should
 probably be excluded from learning. For example, if you tag all your
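
Reviewer note: the documentation changed here describes two limits, the number of documents to learn from (`classification.item-count`) and the per-text cut-off (`text-analysis.max-length`). The sketch below is not Docspell code. `cut_text` and `max_training_chars` are hypothetical helper names, and treating `None` or a non-positive value as "no limit" is only an assumption about what an optional limit could look like, since this diff touches documentation only. It illustrates how the two settings together bound the amount of text a training run can see.

```python
from typing import Optional


def cut_text(text: str, max_length: Optional[int]) -> str:
    """Cut text after max_length characters.

    Assumption for this sketch: None or a value <= 0 disables the limit.
    """
    if max_length is None or max_length <= 0:
        return text
    return text[:max_length]


def max_training_chars(item_count: int, max_length: int) -> int:
    """Upper bound on characters seen in training if every text hits the limit."""
    return item_count * max_length
```

With the values named in the docs (600 documents, 5000 characters each) the bound is 3,000,000 characters, compared with 4,800,000 for the previously documented 8000. These are character counts, not heap bytes, so they only indicate how the worst case scales when either limit is raised or removed.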