Updating Stanford CoreNLP to 4.3.2; adding more languages

Models for Spanish exist and have now been added. Hungarian has also been added to the list of supported languages (mainly for Tesseract; there are no NLP models for it).
2025-08-05 02:24:52 +00:00 · 2021-11-20 14:31:39 +01:00
parent 20fc9955ba
commit 501c6f2988
18 changed files with 162 additions and 40 deletions
--- a/website/site/content/docs/configure/_index.md
+++ b/website/site/content/docs/configure/_index.md
@@ -486,8 +486,8 @@ This setting defines which NLP mode to use. It defaults to `full`,
 which requires more memory for certain languages (with the advantage
 of better results). Other values are `basic`, `regexonly` and
 `disabled`. The modes `full` and `basic` use pre-defined lanugage
-models for procesing documents of languaes German, English and French.
-These require some amount of memory (see below).
+models for procesing documents of languaes German, English, French and
+Spanish. These require some amount of memory (see below).

 The mode `basic` is like the "light" variant to `full`. It doesn't use
 all NLP features, which makes memory consumption much lower, but comes
--- a/website/site/content/docs/joex/file-processing.md
+++ b/website/site/content/docs/joex/file-processing.md
@@ -8,10 +8,10 @@ mktoc = true
 +++

 When uploading a file, it is only saved to the database together with
-the given meta information. The file is not visible in the ui yet.
-Then joex takes the next such file (or files in case you uploaded
-many) and starts processing it. When processing finished, the item and
-its files will show up in the ui.
+the given meta information as a "job". The file is not visible in the
+ui yet. Then joex takes the next such job and starts processing it.
+When processing finished, the item and its files will show up in the
+ui.

 If an error occurs during processing, the item will be created
 anyways, so you can see it. Depending on the error, some information
@@ -400,7 +400,7 @@ names etc. This also requires a statistical model, but this time for a
 whole language. These are also provided by [Stanford
 NLP](https://nlp.stanford.edu/software/), but not for all languages.
 So whether this can be used depends on the document language. Models
-exist for German, English and French currently.
+exist for German, English, French and Spanish currently.

 Then [Stanford NLP](https://nlp.stanford.edu/software/) also allows to
 run custom rules against a text. This can be used as a fallback for
--- a/website/site/content/docs/webapp/metadata.md
+++ b/website/site/content/docs/webapp/metadata.md
@@ -147,11 +147,11 @@ experience. The features of text analysis strongly depend on the
 language. Docspell uses the [Stanford NLP
 Library](https://nlp.stanford.edu/software/) for its great machine
 learning algorithms. Some of them, like certain NLP features, are only
-available for some languages – namely German, English and French. The
-reason is that the required statistical models are not available for
-other languages. However, docspell can still run other algorithms for
-the other languages, like classification and custom rules based on the
-address book.
+available for some languages – namely German, English, French and
+Spanish. The reason is that the required statistical models are not
+available for other languages. However, docspell can still run other
+algorithms for the other languages, like classification and custom
+rules based on the address book.

 More information about file processing and text analysis can be found
 [here](@/docs/joex/file-processing.md#text-analysis).