mirror of https://github.com/TheAnachronism/docspell.git
synced 2025-06-06 15:15:58 +00:00
Update documentation for text analysis
parent 9957c3267e
commit 021ac568ae
@@ -290,7 +290,8 @@ Files are being processed by the joex component. So all the respective
 configuration is in this config only.
 
 File processing involves several stages, detailed information can be
-found [here](@/docs/joex/file-processing.md#text-analysis).
+found [here](@/docs/joex/file-processing.md#text-analysis) and in the
+corresponding sections in [joex default config](#joex).
 
 Configuration allows to define the external tools and set some
 limitations to control memory usage. The sections are:
@@ -301,9 +302,25 @@ limitations to control memory usage. The sections are:
 
 Options to external commands can use variables that are replaced by
 values at runtime. Variables are enclosed in double braces `{{…}}`.
-Please see the default configuration for more details.
+Please see the default configuration for what variables exist per
+command.
 
-### `text-analysis.nlp.mode`
+### Classification
+
+In `text-analysis.classification` you can define how many documents at
+most should be used for learning. The default settings should work
+well for most cases. However, it always depends on the amount of data
+and the machine that runs joex. For example, by default the documents
+to learn from are limited to 600 (`classification.item-count`) and
+every text is cut after 8000 characters (`text-analysis.max-length`).
+This is fine if *most* of your documents are small and only a few are
+near 8000 characters). But if *all* your documents are very large, you
+probably need to either assign more heap memory or go down with the
+limits.
+
+Classification can be disabled, too, for when it's not needed.
+
+### NLP
 
 This setting defines which NLP mode to use. It defaults to `full`,
 which requires more memory for certain languages (with the advantage
@@ -329,10 +346,7 @@ is used to find metadata.
 You might want to try different modes and see what combination suits
 best your usage pattern and machine running joex. If a powerful
 machine is used, simply leave the defaults. When running on an older
-raspberry pi, for example, you might need to adjust things. The
-corresponding sections in [joex default config](#joex) and the [file
-processing](@/docs/joex/file-processing.md#text-analysis) page provide more
-details.
+raspberry pi, for example, you might need to adjust things.
 
 # File Format
 
@@ -363,12 +363,17 @@ related to a tag, a corrpesondent etc.
 When a new document arrives, this model is used to ask for what
 metadata (tag, correspondent, etc) it thinks is likely to apply here.
 
-Training the model is a rather resource intensive process, but using
-an existing model is quite cheap and fast. A model is trained
-periodically, the schedule can be defined in your collective settings.
-For tags, you can define the tag categories that should be trained (or
-that should not be trained). Docspell assigns one tag from all tags in
-a category to a new document.
+Training the model is a rather resource intensive process. How much
+memory is needed, depends on the number of documents to learn from and
+the size of text to consider. Both can be limited in the config file.
+The default values might require a heap of 1.4G if you have many and
+large documents. The maximum text length is about 8000 characters, if
+*all* your documents would be that large, adjusting these values might
+be necessary. But using an existing model is quite cheap and fast. A
+model is trained periodically, the schedule can be defined in your
+collective settings. For tags, you can define the tag categories that
+should be trained (or that should not be trained). Docspell assigns
+one tag from all tags in a category to a new document.
 
 Note that tags that can not be derived from the text only, should
 probably be excluded from learning. For example, if you tag all your