docspell/file-processing.md at 7c2a57966ba1f1f0be416f89e28baeb5e961886f

mirror of https://github.com/TheAnachronism/docspell.git synced 2024-11-13 02:31:10 +00:00

eikek 084dcef996 Move default config into separate file

2022-05-21 17:00:27 +02:00


+++
title = "File Processing"
insert_anchor_links = "right"
description = "Describes the configuration file and shows all default settings."
weight = 40
template = "docs.html"
+++

File Processing

Files are processed by the joex component, so all related configuration lives in the joex configuration only.

File processing involves several stages; detailed information can be found here and in the corresponding sections of the joex default config.

The configuration lets you define the external tools and set limits to control memory usage. The relevant sections are:

docspell.joex.extraction
docspell.joex.text-analysis
docspell.joex.convert

Options to external commands can use variables that are replaced by values at runtime. Variables are enclosed in double braces {{…}}. Please see the default configuration for what variables exist per command.
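To illustrate the substitution idea, here is a minimal Python sketch. It is not Docspell's implementation: the helper `render_args` and the variable names `lang` and `infile` are made up for this example, and leaving unknown variables untouched is an assumption. The default configuration remains the reference for which variables each command really supports.

```python
import re

# Matches {{name}}, allowing optional spaces inside the braces.
# Hypothetical pattern for illustration only.
_VAR = re.compile(r"\{\{\s*([A-Za-z0-9_-]+)\s*\}\}")


def render_args(args, values):
    """Replace every {{name}} in each argument with values[name].

    Variables without a value are left as they are (an assumption
    made for this sketch).
    """
    def repl(match):
        return str(values.get(match.group(1), match.group(0)))

    return [_VAR.sub(repl, arg) for arg in args]


# Made-up argument list and variable names
args = ["--lang", "{{lang}}", "{{infile}}", "out-{{lang}}.txt"]
rendered = render_args(args, {"lang": "deu", "infile": "/tmp/scan.pdf"})
print(rendered)
```

A variable may appear several times and inside a longer argument, which is why the sketch substitutes per argument rather than comparing whole strings.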

Classification

In text-analysis.classification you can define the maximum number of documents used for learning. The default settings should work well for most cases. However, the right values always depend on the amount of data and the machine that runs joex. For example, by default the documents to learn from are limited to 600 (classification.item-count) and every text is cut after 5000 characters (text-analysis.max-length). This is fine if most of your documents are small and only a few are near 5000 characters. But if all your documents are very large, you probably need to either assign more heap memory or lower the limits.

Classification can also be disabled when it is not needed.
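The effect of the two limits can be estimated with simple arithmetic. The Python sketch below only counts characters of text that can enter training; it says nothing about the actual heap usage, which depends on the classifier. The function names are hypothetical, and taking the first `item_count` documents is an assumption about how documents are selected.

```python
def max_training_chars(item_count, max_length):
    """Upper bound on the characters used for learning."""
    return item_count * max_length


def training_chars(doc_lengths, item_count, max_length):
    """Characters used for a concrete set of document lengths.

    At most item_count documents are taken, and each one is cut
    after max_length characters.
    """
    selected = doc_lengths[:item_count]
    return sum(min(length, max_length) for length in selected)


# Defaults quoted in the text: 600 documents, 5000 characters each
default_bound = max_training_chars(600, 5000)
print(default_bound)  # 3000000

# Mostly small documents, only a few near the cut-off
small = training_chars([800] * 590 + [5000] * 10, 600, 5000)
print(small)  # 522000

# All documents very large: every text hits the cut-off
large = training_chars([20000] * 600, 600, 5000)
print(large)  # 3000000
```

With mostly small documents the text volume stays far below the bound, while a collection of very large documents always reaches it. That is the case where more heap or lower limits are needed.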

NLP

The nlp.mode setting defines which NLP mode to use. It defaults to full, which requires more memory for certain languages (with the advantage of better results). Other values are basic, regexonly and disabled. The modes full and basic use pre-defined language models for processing documents in German, English, French and Spanish. These require some amount of memory (see below).

The mode basic is a "light" variant of full. It doesn't use all NLP features, which makes memory consumption much lower, at the cost of less accurate results.

The mode regexonly doesn't use pre-defined language models, even if they are available. It checks your address book against a document to find metadata, which makes it language independent. Also, when full or basic is used with languages for which no pre-defined models exist, processing degrades to regexonly for these languages.
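The idea behind matching an address book against a document can be sketched in a few lines of Python. This is a deliberately simplified illustration and not how Docspell implements regexonly: the function `find_known_names`, the sample names and the case-insensitive whole-word matching are all assumptions made for the example.

```python
import re


def find_known_names(text, address_book):
    """Return the address book entries that occur in the text.

    Matching is case-insensitive and requires that the entry is not
    directly surrounded by word characters.
    """
    found = []
    for name in address_book:
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(name)
    return found


# Made-up document text and address book
text = "Invoice from ACME Corp. to Jane Doe, dated 2022-05-21."
book = ["Acme Corp.", "Jane Doe", "John Smith"]
matches = find_known_names(text, book)
print(matches)  # ['Acme Corp.', 'Jane Doe']
```

Because the lookup relies only on known names and not on a language model, the same approach works regardless of the document language.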

The mode disabled skips NLP processing completely. This has the least impact on memory consumption, but then only the classifier is used to find metadata (unless it is disabled, too).
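The relation between the configured mode and the document language can be summarised as a small decision function. The Python sketch below only restates the rules from this section. The function name and the use of English language names as keys are assumptions for illustration, since the real configuration may identify languages differently.

```python
# Languages with pre-defined models, as listed in this section
MODEL_LANGUAGES = {"German", "English", "French", "Spanish"}
MODES = {"full", "basic", "regexonly", "disabled"}


def effective_mode(configured, language):
    """Return the mode that is effectively applied to a document."""
    if configured not in MODES:
        raise ValueError(f"unknown nlp mode: {configured}")
    # full and basic need a language model; without one they
    # degrade to regexonly
    if configured in ("full", "basic") and language not in MODEL_LANGUAGES:
        return "regexonly"
    return configured


print(effective_mode("full", "German"))       # full
print(effective_mode("basic", "Italian"))     # regexonly
print(effective_mode("disabled", "Italian"))  # disabled
```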

You might want to try different modes and see which combination best suits your usage pattern and the machine running joex. On a powerful machine, simply keep the defaults. When running on a Raspberry Pi, for example, you might need to adjust things.

Memory Usage

The memory requirements of the joex component depend on the document language and the features enabled for text-analysis. The nlp.mode setting has a significant impact, especially when your documents are in German. Here are some rough numbers on JVM heap usage (the same file was used for all tries):

nlp.mode	English	German	French
full	420M	950M	490M
basic	170M	380M	390M

Note that these are only rough numbers and they show the maximum used heap memory while processing a file.

When using mode=full, a heap setting of at least -Xmx1400M is recommended. For mode=basic a heap setting of at least -Xmx500M is recommended.
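The recommendations above can be kept in a small lookup, for example in a deployment script. The Python sketch below is only an illustration: the names are hypothetical, the numbers are copied from this section, and how the flag is passed to the joex JVM is deliberately left out because it depends on how joex is started.

```python
# Recommended minimum heap in megabytes, taken from the text above
RECOMMENDED_MIN_HEAP_MB = {"full": 1400, "basic": 500}

# Rough measured maximum heap usage in megabytes, from the table above
MEASURED_MAX_HEAP_MB = {
    "full": {"English": 420, "German": 950, "French": 490},
    "basic": {"English": 170, "German": 380, "French": 390},
}


def jvm_heap_flag(mode):
    """Return the -Xmx flag recommended for a mode.

    Returns None for modes without a stated recommendation.
    """
    megabytes = RECOMMENDED_MIN_HEAP_MB.get(mode)
    return None if megabytes is None else f"-Xmx{megabytes}M"


def headroom_mb(mode):
    """Recommended heap minus the largest measured usage for the mode."""
    return RECOMMENDED_MIN_HEAP_MB[mode] - max(MEASURED_MAX_HEAP_MB[mode].values())


print(jvm_heap_flag("full"))   # -Xmx1400M
print(jvm_heap_flag("basic"))  # -Xmx500M
print(headroom_mb("full"))     # 450
print(headroom_mb("basic"))    # 110
```

The headroom shows that the recommended values sit above the largest measured number for each mode, which leaves room for other work such as training the classifier.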

Other languages can't use these two modes, so they don't require this amount of memory (but the results are not as good). In that case you can use less heap. For these languages, the NLP mode is effectively regexonly.

Training the classifier is also memory intensive; how much memory it needs depends solely on the size and number of documents used for training. However, training only runs periodically, perhaps every two weeks. Classifying new documents requires less memory, since the model already exists.

More details about these modes can be found here.

The restserver component is very lightweight; the defaults are sufficient here.
