diff --git a/website/site/content/docs/joex/file-processing.md b/website/site/content/docs/joex/file-processing.md
index 68811bca..7c0f7610 100644
--- a/website/site/content/docs/joex/file-processing.md
+++ b/website/site/content/docs/joex/file-processing.md
@@ -334,33 +334,97 @@ images for a collective. There is also a bash script provided in the
 This uses the extracted text to find what could be attached to the new
 item. There are multiple things provided.
 
+Docspell depends on the [Stanford NLP
+Library](https://nlp.stanford.edu/software/) for its AI features.
+Among other things they provide a classifier (used for guessing tags)
+and NER annotators. The latter is also a classifier that associates a
+label with terms in a text. It finds out whether some term is probably
+an organization, a person etc. This is then used to find matches in
+your address book.
+
+When docspell finds several possible candidates for a match, it will
+show the first few to you. If the first was not the correct one,
+it can usually be fixed by a single click, because it is among the
+suggestions.
 
 ## Classification
 
 If you enabled classification in the config file, a model is trained
-periodically from your files. This is now used to guess a tag for the
-item.
+periodically from your files. This is used to guess a tag for the item
+for new documents.
+
+You can tell docspell how many documents it should use for training.
+Sometimes (when moving?), documents may change and you would like to
+base the next guesses on the documents of the last year only. This can be
+found in the collective settings.
+
+The admin can also limit the number of documents to train with,
+because it affects memory usage.
 
 ## Natural Language Processing
 
-NLP is used to find out which terms in the text may be a company or
-person that is later used to find metadata to attach to. It also uses
-your address book to match terms in the text.
+NLP is used to find out which terms in a text may be a company or
+person that is then used to find metadata in your address book. It can
+also use your complete address book to match terms in the text. So
+there are two ways: first, using a statistical model, terms in a text
+are identified as organization or person etc. This information is then
+used to search your address book. Second, regexp rules are derived
+from the address book and run against the text. By default, both are
+applied, where the rules are run as the last step to identify missing
+terms.
 
-This requires to load language model files in memory, which is quite a
-lot. Also, the number of languages is much more restricted than for
-tesseract. Currently English, German and French are supported.
+The statistical model approach is good, e.g. for large address books.
+Normally, a document contains only very few organizations or person
+names. So it is much more efficient to check these against your
+address book (in contrast to the other way around). It can also find
+things *not* in your address book. However, it might not detect all, or
+there are no statistical models for your language. Then the address
+book is used to automatically create rules that are run against the
+document.
 
-Another feature that is planned, but not yet provided is to propose
-new companies/people you don't have yet in your address book.
+These statistical models are provided by [Stanford
+NLP](https://nlp.stanford.edu/software/) and are currently available
+for German, English and French. All other languages can use the rule
+approach. The statistical models, however, require quite some memory –
+depending on the size of the models which varies between languages.
+English has a lower memory footprint than German, for example. If you
+have a very large address book, the rule approach may also use a lot
+of memory.
+
+In the config file, you can specify different modes of operation for
+nlp processing as follows:
+
+- mode `full`: creates the complete nlp pipeline, requiring the most
+  memory and providing the best results. I'd recommend to run
+  joex with a heap size of at least 1.5G (for English only, it can be
+  lower than that).
+- mode `basic`: it only loads the NER tagger. This doesn't work as
+  well as the complete pipeline, because some steps are simply
+  skipped. But it gives quite good results and uses less memory. I'd
+  recommend to run joex with at least 600m heap in this mode.
+- mode `regexonly`: this doesn't load any statistical models and is
+  therefore very memory efficient (depending on the address book size,
+  of course). It will use the address book to create regex rules and
+  match them against your document. It doesn't depend on a language,
+  so this is available for all languages.
+- mode `disabled`: this disables nlp processing altogether
+
+Note that modes `full` and `basic` are only relevant for the languages
+where models are available. For all other languages, they are effectively
+the same as `regexonly`.
 
 The config file allows some settings. You can specify a limit for
 texts. Large texts result in higher memory consumption. By default,
 the first 10'000 characters are taken into account.
 
+Then, for the `regexonly` mode, you can restrict the number of address
+book entries that are used to create the rule set via
+`regex-ner.max-entries`. This may be useful to reduce memory
+footprint.
+
 The setting `clear-stanford-nlp-interval` allows to define an idle
 time after which the model files are cleared from memory. This allows
-to be reclaimed by the OS. The timer starts after the last file has
-been processed. If you can afford it, it is recommended to disable it
-by setting it to `0`.
+memory to be reclaimed by the OS. The timer starts after the last file
+has been processed. If you can afford it, it is recommended to disable
+it by setting it to `0`.
diff --git a/website/site/content/docs/webapp/metadata.md b/website/site/content/docs/webapp/metadata.md
index c6f03498..fb096641 100644
--- a/website/site/content/docs/webapp/metadata.md
+++ b/website/site/content/docs/webapp/metadata.md
@@ -130,10 +130,28 @@ page](@/docs/webapp/customfields.md) for more information.
 # Document Language
 
 An important setting is the language of your documents. This helps OCR
-and text analysis. You can select between English, German and French
-currently. The language can also specified with each [upload
+and text analysis. You can select between various languages. The
+language can also be specified with each [upload
 request](@/docs/api/upload.md).
 
 Go to the *Collective Settings* page and click *Document Language*.
 This will set the lanugage for all your documents. It is not (yet)
 possible to specify it when uploading.
+
+The language has effects in several areas: text extraction, fulltext
+search and text analysis. When extracting text from images, tesseract
+(the external tool used for this) can yield better results if the
+language is known. Also, solr (the fulltext search tool) can optimize
+its index given the language, which results in a better fulltext search
+experience. The features of text analysis strongly depend on the
+language. Docspell uses the [Stanford NLP
+Library](https://nlp.stanford.edu/software/) for its great machine
+learning algorithms. Some of them, like certain NLP features, are only
+available for some languages – namely German, English and French. The
+reason is that the required statistical models are not available for
+other languages. However, docspell can still run other algorithms for
+the other languages, like classification and custom rules based on the
+address book.
+
+More information about file processing and text analysis can be found
+[here](@/docs/joex/file-processing.md#text-analysis).
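
Note on the `regexonly` mode described in the file-processing.md changes: the idea of deriving rules from the address book and running them against the text can be illustrated with a minimal Python sketch. This is not Docspell's implementation (which builds on the Stanford NLP library). The function names `build_rules` and `find_matches` and the parameters `max_entries` and `max_chars` are hypothetical; they only mirror the documented `regex-ner.max-entries` setting and the default text limit of 10'000 characters.

```python
import re

def build_rules(address_book, max_entries=None):
    """Compile one case-insensitive, word-bounded pattern per address book entry.

    max_entries (hypothetical) caps how many entries become rules, which
    bounds the size of the rule set and therefore its memory use.
    """
    entries = address_book if max_entries is None else address_book[:max_entries]
    return [
        (name, label, re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE))
        for name, label in entries
    ]

def find_matches(text, rules, max_chars=10_000):
    """Return (name, label) for every rule that matches within the first max_chars characters."""
    head = text[:max_chars]
    return [(name, label) for name, label, pattern in rules if pattern.search(head)]

# Illustrative data only.
address_book = [
    ("Acme Corp", "organization"),
    ("Jane Doe", "person"),
    ("Globex", "organization"),
]
rules = build_rules(address_book, max_entries=2)
matches = find_matches("Invoice from ACME Corp, attention: Jane Doe. Cc: Globex.", rules)
print(matches)
```

With `max_entries=2` the third entry never becomes a rule, so "Globex" is not reported even though it occurs in the text. This is the trade-off the setting implies: a smaller rule set uses less memory but can miss entries.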