Merge pull request #581 from eikek/text-analysis-improvements

Text analysis improvements
This commit is contained in:
mergify[bot] 2021-01-21 22:01:50 +00:00 committed by GitHub
commit df5f9e8c51
104 changed files with 3385 additions and 714 deletions

View File

@ -24,4 +24,4 @@ before_script:
- export TZ=Europe/Berlin
script:
- sbt ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"
- sbt -J-XX:+UseG1GC ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"

View File

@ -17,6 +17,9 @@ If you don't like to sign up to github/matrix or like to reach me
personally, you can send a mail to `info [at] docspell.org` or contact me
on matrix via `@eikek:matrix.org`.
If you find a feature request already filed, you can vote on it. I
tend to prefer the most-voted requests over those without much attention.
## Documentation

View File

@ -9,25 +9,28 @@
# Docspell
Docspell is a personal document organizer. You'll need a scanner to
convert your papers into files. Docspell can then assist in
organizing the resulting mess :wink:.
convert your papers into files. Docspell can then assist in organizing
the resulting mess :wink:. It is targeted at home use, i.e. families
and households, and also at (smaller) groups/companies.
You can associate tags, set correspondents, specify what a document
is concerned with, a name, a date and much more. If your documents
are associated with such metadata, you should be able to quickly find
them later using the search feature. But adding this manually to each
document is a tedious task. Docspell can help you by suggesting
correspondents, guessing tags or finding dates using machine learning
techniques. This makes adding metadata to your documents a lot easier.
You can associate tags, set correspondents and lots of other
predefined and custom metadata. If your documents are associated with
such metadata, you can quickly find them later using the search
feature. But adding this manually is a tedious task. Docspell can help
by suggesting correspondents, guessing tags or finding dates using
machine learning. It can learn metadata from existing documents and
find things using NLP. This makes adding metadata to your documents a
lot easier. For machine learning, it relies on the free (GPL)
[Stanford Core NLP library](https://github.com/stanfordnlp/CoreNLP).
Docspell also runs OCR (if needed) on your documents, can provide
fulltext search and has great e-mail integration. Everything is
accessible via a REST/HTTP api. A mobile friendly SPA web application
is provided as the user interface and an [Android
app](https://github.com/docspell/android-client) for conveniently
uploading files from your phone/tablet. The [feature
overview](https://docspell.org/#feature-selection) has a more complete
list.
is the default user interface. An [Android
app](https://github.com/docspell/android-client) exists for
conveniently uploading files from your phone/tablet. The [feature
overview](https://docspell.org/#feature-selection) lists more
features.
## Impressions

View File

@ -131,7 +131,8 @@ val openapiScalaSettings = Seq(
case "ident" =>
field => field.copy(typeDef = TypeDef("Ident", Imports("docspell.common.Ident")))
case "accountid" =>
field => field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
field =>
field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
case "collectivestate" =>
field =>
field.copy(typeDef =
@ -190,6 +191,9 @@ val openapiScalaSettings = Seq(
field.copy(typeDef =
TypeDef("CustomFieldType", Imports("docspell.common.CustomFieldType"))
)
case "listtype" =>
field =>
field.copy(typeDef = TypeDef("ListType", Imports("docspell.common.ListType")))
}))
)

View File

@ -15,6 +15,17 @@ RUN apk add --no-cache openjdk11-jre \
tesseract-ocr \
tesseract-ocr-data-deu \
tesseract-ocr-data-fra \
tesseract-ocr-data-ita \
tesseract-ocr-data-spa \
tesseract-ocr-data-por \
tesseract-ocr-data-ces \
tesseract-ocr-data-nld \
tesseract-ocr-data-dan \
tesseract-ocr-data-fin \
tesseract-ocr-data-nor \
tesseract-ocr-data-swe \
tesseract-ocr-data-rus \
tesseract-ocr-data-ron \
unpaper \
wkhtmltopdf \
libreoffice \

View File

@ -0,0 +1,7 @@
package docspell.analysis
import java.nio.file.Path
import docspell.common._
case class NlpSettings(lang: Language, highRecall: Boolean, regexNer: Option[Path])

View File

@ -1,29 +1,30 @@
package docspell.analysis
import cats.Applicative
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.{StanfordTextClassifier, TextClassifier}
import docspell.analysis.contact.Contact
import docspell.analysis.date.DateFind
import docspell.analysis.nlp.PipelineCache
import docspell.analysis.nlp.StanfordNerClassifier
import docspell.analysis.nlp.StanfordNerSettings
import docspell.analysis.nlp.StanfordTextClassifier
import docspell.analysis.nlp.TextClassifier
import docspell.analysis.nlp._
import docspell.common._
import org.log4s.getLogger
trait TextAnalyser[F[_]] {
def annotate(
logger: Logger[F],
settings: StanfordNerSettings,
settings: NlpSettings,
cacheKey: Ident,
text: String
): F[TextAnalyser.Result]
def classifier(blocker: Blocker)(implicit CS: ContextShift[F]): TextClassifier[F]
def classifier: TextClassifier[F]
}
object TextAnalyser {
private[this] val logger = getLogger
case class Result(labels: Vector[NerLabel], dates: Vector[NerDateLabel]) {
@ -31,31 +32,30 @@ object TextAnalyser {
labels ++ dates.map(dl => dl.label.copy(label = dl.date.toString))
}
def create[F[_]: Concurrent: Timer](
cfg: TextAnalysisConfig
def create[F[_]: Concurrent: Timer: ContextShift](
cfg: TextAnalysisConfig,
blocker: Blocker
): Resource[F, TextAnalyser[F]] =
Resource
.liftF(PipelineCache[F](cfg.clearStanfordPipelineInterval))
.map(cache =>
.liftF(Nlp(cfg.nlpConfig))
.map(stanfordNer =>
new TextAnalyser[F] {
def annotate(
logger: Logger[F],
settings: StanfordNerSettings,
settings: NlpSettings,
cacheKey: Ident,
text: String
): F[TextAnalyser.Result] =
for {
input <- textLimit(logger, text)
tags0 <- stanfordNer(cacheKey, settings, input)
tags0 <- stanfordNer(Nlp.Input(cacheKey, settings, logger, input))
tags1 <- contactNer(input)
dates <- dateNer(settings.lang, input)
list = tags0 ++ tags1
spans = NerLabelSpan.build(list)
} yield Result(spans ++ list, dates)
def classifier(blocker: Blocker)(implicit
CS: ContextShift[F]
): TextClassifier[F] =
def classifier: TextClassifier[F] =
new StanfordTextClassifier[F](cfg.classifier, blocker)
private def textLimit(logger: Logger[F], text: String): F[String] =
@ -66,10 +66,6 @@ object TextAnalyser {
s" Analysing only first ${cfg.maxLength} characters."
) *> text.take(cfg.maxLength).pure[F]
private def stanfordNer(key: Ident, settings: StanfordNerSettings, text: String)
: F[Vector[NerLabel]] =
StanfordNerClassifier.nerAnnotate[F](key.id, cache)(settings, text)
private def contactNer(text: String): F[Vector[NerLabel]] =
Sync[F].delay {
Contact.annotate(text)
@ -82,4 +78,36 @@ object TextAnalyser {
}
)
/** Provides the nlp pipeline based on the configuration. */
private object Nlp {
def apply[F[_]: Concurrent: Timer: BracketThrow](
cfg: TextAnalysisConfig.NlpConfig
): F[Input[F] => F[Vector[NerLabel]]] =
cfg.mode match {
case NlpMode.Disabled =>
Logger.log4s(logger).info("NLP is disabled as defined in config.") *>
Applicative[F].pure(_ => Vector.empty[NerLabel].pure[F])
case _ =>
PipelineCache(cfg.clearInterval)(
Annotator[F](cfg.mode),
Annotator.clearCaches[F]
)
.map(annotate[F])
}
final case class Input[F[_]](
key: Ident,
settings: NlpSettings,
logger: Logger[F],
text: String
)
def annotate[F[_]: BracketThrow](
cache: PipelineCache[F]
)(input: Input[F]): F[Vector[NerLabel]] =
cache
.obtain(input.key.id, input.settings)
.use(ann => ann.nerAnnotate(input.logger)(input.text))
}
}
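
To make the new entry point concrete, here is a small usage sketch (not part of this change); the object name, collective id and sample text are illustrative, and the config, blocker and logger are assumed to come from the application wiring:

import cats.effect.{Blocker, ContextShift, IO, Timer}
import docspell.analysis.{NlpSettings, TextAnalyser, TextAnalysisConfig}
import docspell.common._

object TextAnalyserExample {
  // cfg and blocker come from the application wiring; the collective id
  // and sample text are illustrative placeholders.
  def annotateOnce(cfg: TextAnalysisConfig, blocker: Blocker, logger: Logger[IO])(implicit
      CS: ContextShift[IO],
      T: Timer[IO]
  ): IO[TextAnalyser.Result] =
    TextAnalyser.create[IO](cfg, blocker).use { analyser =>
      val settings = NlpSettings(Language.English, highRecall = false, regexNer = None)
      analyser.annotate(
        logger,
        settings,
        Ident.unsafe("collective-1"),
        "Invoice from Acme Corp, dated January 12, 2021."
      )
    }
}

Note that annotate now takes the generic NlpSettings; the Stanford-specific settings are derived internally by the Nlp helper.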

View File

@ -1,10 +1,16 @@
package docspell.analysis
import docspell.analysis.nlp.TextClassifierConfig
import docspell.analysis.TextAnalysisConfig.NlpConfig
import docspell.analysis.classifier.TextClassifierConfig
import docspell.common._
case class TextAnalysisConfig(
maxLength: Int,
clearStanfordPipelineInterval: Duration,
nlpConfig: NlpConfig,
classifier: TextClassifierConfig
)
object TextAnalysisConfig {
case class NlpConfig(clearInterval: Duration, mode: NlpMode)
}

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import java.nio.file.Path

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import java.nio.file.Path
@ -7,8 +7,11 @@ import cats.effect.concurrent.Ref
import cats.implicits._
import fs2.Stream
import docspell.analysis.nlp.TextClassifier._
import docspell.analysis.classifier
import docspell.analysis.classifier.TextClassifier._
import docspell.analysis.nlp.Properties
import docspell.common._
import docspell.common.syntax.FileSyntax._
import edu.stanford.nlp.classify.ColumnDataClassifier
@ -26,7 +29,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
.use { dir =>
for {
rawData <- writeDataFile(blocker, dir, data)
_ <- logger.info(s"Learning from ${rawData.count} items.")
_ <- logger.debug(s"Learning from ${rawData.count} items.")
trainData <- splitData(logger, rawData)
scores <- cfg.classifierConfigs.traverse(m => train(logger, trainData, m))
sorted = scores.sortBy(-_.score)
@ -43,7 +46,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
case Some(text) =>
Sync[F].delay {
val cls = ColumnDataClassifier.getClassifier(
model.model.normalize().toAbsolutePath().toString()
model.model.normalize().toAbsolutePath.toString
)
val cat = cls.classOf(cls.makeDatumFromLine("\t\t" + normalisedText(text)))
Option(cat)
@ -65,7 +68,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
val cdc = new ColumnDataClassifier(Properties.fromMap(amendProps(in, props)))
cdc.trainClassifier(in.train.toString())
val score = cdc.testClassifier(in.test.toString())
TrainResult(score.first(), ClassifierModel(in.modelFile))
TrainResult(score.first(), classifier.ClassifierModel(in.modelFile))
}
_ <- logger.debug(s"Trained with result $res")
} yield res
@ -136,9 +139,9 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
props: Map[String, String]
): Map[String, String] =
prepend("2.", props) ++ Map(
"trainFile" -> trainData.train.normalize().toAbsolutePath().toString(),
"testFile" -> trainData.test.normalize().toAbsolutePath().toString(),
"serializeTo" -> trainData.modelFile.normalize().toAbsolutePath().toString()
"trainFile" -> trainData.train.absolutePathAsString,
"testFile" -> trainData.test.absolutePathAsString,
"serializeTo" -> trainData.modelFile.absolutePathAsString
).toList
case class RawData(count: Long, file: Path)

View File

@ -1,9 +1,9 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import cats.data.Kleisli
import fs2.Stream
import docspell.analysis.nlp.TextClassifier.Data
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
trait TextClassifier[F[_]] {

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import java.nio.file.Path

View File

@ -41,23 +41,41 @@ object DateFind {
}
object SimpleDate {
val p0 = (readYear >> readMonth >> readDay).map { case ((y, m), d) =>
List(SimpleDate(y, m, d))
def pattern0(lang: Language) = (readYear >> readMonth(lang) >> readDay).map {
case ((y, m), d) =>
List(SimpleDate(y, m, d))
}
val p1 = (readDay >> readMonth >> readYear).map { case ((d, m), y) =>
List(SimpleDate(y, m, d))
def pattern1(lang: Language) = (readDay >> readMonth(lang) >> readYear).map {
case ((d, m), y) =>
List(SimpleDate(y, m, d))
}
val p2 = (readMonth >> readDay >> readYear).map { case ((m, d), y) =>
List(SimpleDate(y, m, d))
def pattern2(lang: Language) = (readMonth(lang) >> readDay >> readYear).map {
case ((m, d), y) =>
List(SimpleDate(y, m, d))
}
// ymd , ydm, dmy , dym, myd, mdy
def fromParts(parts: List[Word], lang: Language): List[SimpleDate] = {
val ymd = pattern0(lang)
val dmy = pattern1(lang)
val mdy = pattern2(lang)
// most is from wikipedia
val p = lang match {
case Language.English =>
p2.alt(p1).map(t => t._1 ++ t._2).or(p2).or(p0).or(p1)
case Language.German => p1.or(p0).or(p2)
case Language.French => p1.or(p0).or(p2)
mdy.alt(dmy).map(t => t._1 ++ t._2).or(mdy).or(ymd).or(dmy)
case Language.German => dmy.or(ymd).or(mdy)
case Language.French => dmy.or(ymd).or(mdy)
case Language.Italian => dmy.or(ymd).or(mdy)
case Language.Spanish => dmy.or(ymd).or(mdy)
case Language.Czech => dmy.or(ymd).or(mdy)
case Language.Danish => dmy.or(ymd).or(mdy)
case Language.Finnish => dmy.or(ymd).or(mdy)
case Language.Norwegian => dmy.or(ymd).or(mdy)
case Language.Portuguese => dmy.or(ymd).or(mdy)
case Language.Romanian => dmy.or(ymd).or(mdy)
case Language.Russian => dmy.or(ymd).or(mdy)
case Language.Swedish => ymd.or(dmy).or(mdy)
case Language.Dutch => dmy.or(ymd).or(mdy)
}
p.read(parts) match {
case Result.Success(sds, _) =>
@ -76,9 +94,11 @@ object DateFind {
}
)
def readMonth: Reader[Int] =
def readMonth(lang: Language): Reader[Int] =
Reader.readFirst(w =>
Some(months.indexWhere(_.contains(w.value))).filter(_ >= 0).map(_ + 1)
Some(MonthName.getAll(lang).indexWhere(_.contains(w.value)))
.filter(_ >= 0)
.map(_ + 1)
)
def readDay: Reader[Int] =
@ -150,20 +170,5 @@ object DateFind {
Failure
}
}
private val months = List(
List("jan", "january", "januar", "01"),
List("feb", "february", "februar", "02"),
List("mar", "march", "märz", "marz", "03"),
List("apr", "april", "04"),
List("may", "mai", "05"),
List("jun", "june", "juni", "06"),
List("jul", "july", "juli", "07"),
List("aug", "august", "08"),
List("sep", "september", "09"),
List("oct", "october", "oktober", "10"),
List("nov", "november", "11"),
List("dec", "december", "dezember", "12")
)
}
}

View File

@ -0,0 +1,270 @@
package docspell.analysis.date
import docspell.common.Language
object MonthName {
def getAll(lang: Language): List[List[String]] =
merge(numbers, forLang(lang))
private def merge(n0: List[List[String]], ns: List[List[String]]*): List[List[String]] =
ns.foldLeft(n0) { (res, el) =>
res.zip(el).map({ case (a, b) => a ++ b })
}
private def forLang(lang: Language): List[List[String]] =
lang match {
case Language.English =>
english
case Language.German =>
german
case Language.French =>
french
case Language.Italian =>
italian
case Language.Spanish =>
spanish
case Language.Swedish =>
swedish
case Language.Norwegian =>
norwegian
case Language.Dutch =>
dutch
case Language.Czech =>
czech
case Language.Danish =>
danish
case Language.Portuguese =>
portuguese
case Language.Romanian =>
romanian
case Language.Finnish =>
finnish
case Language.Russian =>
russian
}
private val numbers = List(
List("01"),
List("02"),
List("03"),
List("04"),
List("05"),
List("06"),
List("07"),
List("08"),
List("09"),
List("10"),
List("11"),
List("12")
)
private val english = List(
List("jan", "january"),
List("feb", "february"),
List("mar", "march"),
List("apr", "april"),
List("may"),
List("jun", "june"),
List("jul", "july"),
List("aug", "august"),
List("sept", "september"),
List("oct", "october"),
List("nov", "november"),
List("dec", "december")
)
private val german = List(
List("jan", "januar"),
List("feb", "februar"),
List("märz"),
List("apr", "april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dez", "dezember")
)
private val french = List(
List("janv", "janvier"),
List("févr", "fevr", "février", "fevrier"),
List("mars"),
List("avril"),
List("mai"),
List("juin"),
List("juil", "juillet"),
List("aout", "août"),
List("sept", "septembre"),
List("oct", "octobre"),
List("nov", "novembre"),
List("dec", "déc", "décembre", "decembre")
)
private val italian = List(
List("genn", "gennaio"),
List("febbr", "febbraio"),
List("mar", "marzo"),
List("apr", "aprile"),
List("magg", "maggio"),
List("giugno"),
List("luglio"),
List("ag", "agosto"),
List("sett", "settembre"),
List("ott", "ottobre"),
List("nov", "novembre"),
List("dic", "dicembre")
)
private val spanish = List(
List("ene", "enero"),
List("feb", "febrero"),
List("mar", "marzo"),
List("abr", "abril"),
List("may", "mayo"),
List("jun"),
List("jul"),
List("ago", "agosto"),
List("sep", "septiembre"),
List("oct", "octubre"),
List("nov", "noviembre"),
List("dic", "diciembre")
)
private val swedish = List(
List("jan", "januari"),
List("febr", "februari"),
List("mars"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "augusti"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val norwegian = List(
List("jan", "januar"),
List("febr", "februar"),
List("mars"),
List("april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("des", "desember")
)
private val czech = List(
List("led", "leden"),
List("un", "ún", "únor", "unor"),
List("brez", "březen", "brezen"),
List("dub", "duben"),
List("kvet", "květen"),
List("cerv", "červen"),
List("cerven", "červenec"),
List("srp", "srpen"),
List("zari", "září"),
List("ríj", "rij", "říjen"),
List("list", "listopad"),
List("pros", "prosinec")
)
private val romanian = List(
List("ian", "ianuarie"),
List("feb", "februarie"),
List("mar", "martie"),
List("apr", "aprilie"),
List("mai"),
List("iunie"),
List("iulie"),
List("aug", "august"),
List("sept", "septembrie"),
List("oct", "octombrie"),
List("noem", "nov", "noiembrie"),
List("dec", "decembrie")
)
private val danish = List(
List("jan", "januar"),
List("febr", "februar"),
List("marts"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val portuguese = List(
List("jan", "janeiro"),
List("fev", "fevereiro"),
List("março", "marco"),
List("abril"),
List("maio"),
List("junho"),
List("julho"),
List("agosto"),
List("set", "setembro"),
List("out", "outubro"),
List("nov", "novembro"),
List("dez", "dezembro")
)
private val finnish = List(
List("tammikuu"),
List("helmikuu"),
List("maaliskuu"),
List("huhtikuu"),
List("toukokuu"),
List("kesäkuu"),
List("heinäkuu"),
List("elokuu"),
List("syyskuu"),
List("lokakuu"),
List("marraskuu"),
List("joulukuu")
)
private val russian = List(
List("январь"),
List("февраль"),
List("март"),
List("апрель"),
List("май"),
List("июнь"),
List("июль"),
List("август"),
List("сентябрь"),
List("октябрь"),
List("ноябрь"),
List("декабрь")
)
private val dutch = List(
List("jan", "januari"),
List("feb", "februari"),
List("maart"),
List("apr", "april"),
List("mei"),
List("juni"),
List("juli"),
List("aug", "augustus"),
List("sept", "september"),
List("okt", "oct", "oktober"),
List("nov", "november"),
List("dec", "december")
)
}
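
A tiny sketch showing what getAll returns; the numeric spelling is merged in front of each month's localized names (the object name is illustrative):

import docspell.analysis.date.MonthName
import docspell.common.Language

object MonthNameExample extends App {
  val german = MonthName.getAll(Language.German)
  // one inner list per month, numeric form first, then localized spellings
  assert(german.size == 12)
  assert(german.head == List("01", "jan", "januar"))
}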

View File

@ -0,0 +1,98 @@
package docspell.analysis.nlp
import cats.effect.Sync
import cats.implicits._
import cats.{Applicative, FlatMap}
import docspell.analysis.NlpSettings
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
/** Analyses a text to mark certain parts with a `NerLabel`. */
trait Annotator[F[_]] { self =>
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]]
def ++(next: Annotator[F])(implicit F: FlatMap[F]): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
for {
n0 <- self.nerAnnotate(logger)(text)
n1 <- next.nerAnnotate(logger)(text)
} yield (n0 ++ n1).distinct
}
}
object Annotator {
/** Creates an annotator according to the given `mode` and `settings`.
*
* The following modes are supported:
*
* - disabled: it returns a no-op annotator that always gives an empty list
* - full: the complete stanford pipeline is used
* - basic: only the ner classifier is used
*
* Additionally, if a regexNer file is specified, the regexner annotator is
* also run. When the full pipeline is used, this is already included.
*/
def apply[F[_]: Sync](mode: NlpMode)(settings: NlpSettings): Annotator[F] =
mode match {
case NlpMode.Disabled =>
Annotator.none[F]
case NlpMode.Full =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(ss) =>
Annotator.pipeline(StanfordNerAnnotator.makePipeline(ss))
case None =>
Annotator.none[F]
}
case NlpMode.Basic =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(StanfordNerSettings.Full(lang, _, Some(file))) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang)) ++
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case Some(StanfordNerSettings.Full(lang, _, None)) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang))
case Some(StanfordNerSettings.RegexOnly(file)) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
case NlpMode.RegexOnly =>
settings.regexNer match {
case Some(file) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
}
def none[F[_]: Applicative]: Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
logger.debug("Running empty annotator. NLP not supported.") *>
Vector.empty[NerLabel].pure[F]
}
def basic[F[_]: Sync](ann: BasicCRFAnnotator.Annotator): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(
BasicCRFAnnotator.nerAnnotate(ann)(text)
)
}
def pipeline[F[_]: Sync](cp: StanfordCoreNLP): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(StanfordNerAnnotator.nerAnnotate(cp, text))
}
def clearCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
StanfordCoreNLP.clearAnnotatorPool()
BasicCRFAnnotator.Cache.clearCache()
}
}
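
For illustration, a hedged sketch of using the Annotator API directly; the object name, mode choice and sample sentence are assumptions, not part of this change:

import cats.effect.{ExitCode, IO, IOApp}
import docspell.analysis.NlpSettings
import docspell.analysis.nlp.Annotator
import docspell.common._
import org.log4s.getLogger

object AnnotatorExample extends IOApp {
  def run(args: List[String]): IO[ExitCode] = {
    val logger   = Logger.log4s[IO](getLogger)
    val settings = NlpSettings(Language.English, highRecall = false, regexNer = None)
    // Basic mode only loads the CRF ner model, which needs much less memory
    // than the full stanford pipeline.
    val annotator = Annotator[IO](NlpMode.Basic)(settings)
    annotator
      .nerAnnotate(logger)("Derek Jeter works for the Old Sticky Pancake Company in Treesville.")
      .flatMap(labels => IO(labels.foreach(println)))
      .map(_ => ExitCode.Success)
  }
}

With these settings, basic mode loads only the English CRF model; disabled mode would return the no-op annotator.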

View File

@ -0,0 +1,94 @@
package docspell.analysis.nlp
import java.net.URL
import java.util.concurrent.atomic.AtomicReference
import java.util.zip.GZIPInputStream
import scala.jdk.CollectionConverters._
import scala.util.Using
import docspell.common.Language.NLPLanguage
import docspell.common._
import edu.stanford.nlp.ie.AbstractSequenceClassifier
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.{CoreAnnotations, CoreLabel}
import org.log4s.getLogger
/** This uses only the CRFClassifier, without building a full analysis
* pipeline. The ner-classifier cannot use results from POS-tagging
* etc. and is therefore not as good as the [[StanfordNerAnnotator]],
* but it uses less memory while still giving decent results.
*/
object BasicCRFAnnotator {
private[this] val logger = getLogger
// assert correct resource names
List(Language.French, Language.German, Language.English).foreach(classifierResource)
type Annotator = AbstractSequenceClassifier[CoreLabel]
def nerAnnotate(nerClassifier: Annotator)(text: String): Vector[NerLabel] =
nerClassifier
.classify(text)
.asScala
.flatMap(a => a.asScala)
.collect(Function.unlift { label =>
val tag = label.get(classOf[CoreAnnotations.AnswerAnnotation])
NerTag
.fromString(Option(tag).getOrElse(""))
.toOption
.map(t => NerLabel(label.word(), t, label.beginPosition(), label.endPosition()))
})
.toVector
def makeAnnotator(lang: NLPLanguage): Annotator = {
logger.info(s"Creating ${lang.name} Stanford NLP NER-only classifier...")
val ner = classifierResource(lang)
Using(new GZIPInputStream(ner.openStream())) { in =>
CRFClassifier.getClassifier(in).asInstanceOf[Annotator]
}.fold(throw _, identity)
}
private def classifierResource(lang: NLPLanguage): URL = {
def check(name: String): URL =
Option(getClass.getResource(name)) match {
case None =>
sys.error(s"NER model resource '$name' not found for language ${lang.name}")
case Some(url) => url
}
check(lang match {
case Language.French =>
"/edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz"
case Language.German =>
"/edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz"
case Language.English =>
"/edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
})
}
final class Cache {
private[this] lazy val germanNerClassifier = makeAnnotator(Language.German)
private[this] lazy val englishNerClassifier = makeAnnotator(Language.English)
private[this] lazy val frenchNerClassifier = makeAnnotator(Language.French)
def forLang(language: NLPLanguage): Annotator =
language match {
case Language.French => frenchNerClassifier
case Language.German => germanNerClassifier
case Language.English => englishNerClassifier
}
}
object Cache {
private[this] val cacheRef = new AtomicReference[Cache](new Cache)
def getAnnotator(language: NLPLanguage): Annotator =
cacheRef.get().forLang(language)
def clearCache(): Unit =
cacheRef.set(new Cache)
}
}
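
The same classifier can also be used directly via the cache; a minimal sketch (the object name and sample sentence are illustrative):

import docspell.analysis.nlp.BasicCRFAnnotator
import docspell.common.Language

object BasicCRFExample extends App {
  // getAnnotator lazily loads and caches the CRF model for the given language
  val ner    = BasicCRFAnnotator.Cache.getAnnotator(Language.English)
  val labels = BasicCRFAnnotator.nerAnnotate(ner)("Derek Jeter lives in Treesville.")
  labels.foreach(println)
  BasicCRFAnnotator.Cache.clearCache()
}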

View File

@ -7,9 +7,9 @@ import cats.effect._
import cats.effect.concurrent.Ref
import cats.implicits._
import docspell.analysis.NlpSettings
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import org.log4s.getLogger
/** Creating the StanfordCoreNLP pipeline is quite expensive as it
@ -21,46 +21,45 @@ import org.log4s.getLogger
*/
trait PipelineCache[F[_]] {
def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP]
def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]]
}
object PipelineCache {
private[this] val logger = getLogger
def none[F[_]: Applicative]: PipelineCache[F] =
new PipelineCache[F] {
def obtain(
ignored: String,
settings: StanfordNerSettings
): Resource[F, StanfordCoreNLP] =
Resource.liftF(makeClassifier(settings).pure[F])
}
def apply[F[_]: Concurrent: Timer](clearInterval: Duration): F[PipelineCache[F]] =
def apply[F[_]: Concurrent: Timer](clearInterval: Duration)(
creator: NlpSettings => Annotator[F],
release: F[Unit]
): F[PipelineCache[F]] =
for {
data <- Ref.of(Map.empty[String, Entry])
cacheClear <- CacheClearing.create(data, clearInterval)
} yield new Impl[F](data, cacheClear)
data <- Ref.of(Map.empty[String, Entry[Annotator[F]]])
cacheClear <- CacheClearing.create(data, clearInterval, release)
_ <- Logger.log4s(logger).info("Creating nlp pipeline cache")
} yield new Impl[F](data, creator, cacheClear)
final private class Impl[F[_]: Sync](
data: Ref[F, Map[String, Entry]],
data: Ref[F, Map[String, Entry[Annotator[F]]]],
creator: NlpSettings => Annotator[F],
cacheClear: CacheClearing[F]
) extends PipelineCache[F] {
def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP] =
def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]] =
for {
_ <- cacheClear.withCache
id <- Resource.liftF(makeSettingsId(settings))
nlp <- Resource.liftF(data.modify(cache => getOrCreate(key, id, cache, settings)))
_ <- cacheClear.withCache
id <- Resource.liftF(makeSettingsId(settings))
nlp <- Resource.liftF(
data.modify(cache => getOrCreate(key, id, cache, settings, creator))
)
} yield nlp
private def getOrCreate(
key: String,
id: String,
cache: Map[String, Entry],
settings: StanfordNerSettings
): (Map[String, Entry], StanfordCoreNLP) =
cache: Map[String, Entry[Annotator[F]]],
settings: NlpSettings,
creator: NlpSettings => Annotator[F]
): (Map[String, Entry[Annotator[F]]], Annotator[F]) =
cache.get(key) match {
case Some(entry) =>
if (entry.id == id) (cache, entry.value)
@ -68,18 +67,18 @@ object PipelineCache {
logger.info(
s"StanfordNLP settings changed for key $key. Creating new classifier"
)
val nlp = makeClassifier(settings)
val nlp = creator(settings)
val e = Entry(id, nlp)
(cache.updated(key, e), nlp)
}
case None =>
val nlp = makeClassifier(settings)
val nlp = creator(settings)
val e = Entry(id, nlp)
(cache.updated(key, e), nlp)
}
private def makeSettingsId(settings: StanfordNerSettings): F[String] = {
private def makeSettingsId(settings: NlpSettings): F[String] = {
val base = settings.copy(regexNer = None).toString
val size: F[Long] =
settings.regexNer match {
@ -104,9 +103,10 @@ object PipelineCache {
Resource.pure[F, Unit](())
}
def create[F[_]: Concurrent: Timer](
data: Ref[F, Map[String, Entry]],
interval: Duration
def create[F[_]: Concurrent: Timer, A](
data: Ref[F, Map[String, Entry[A]]],
interval: Duration,
release: F[Unit]
): F[CacheClearing[F]] =
for {
counter <- Ref.of(0L)
@ -121,16 +121,23 @@ object PipelineCache {
log
.info(s"Clearing StanfordNLP cache after $interval idle time")
.map(_ =>
new CacheClearingImpl[F](data, counter, cleaning, interval.toScala)
new CacheClearingImpl[F, A](
data,
counter,
cleaning,
interval.toScala,
release
)
)
} yield result
}
final private class CacheClearingImpl[F[_]](
data: Ref[F, Map[String, Entry]],
final private class CacheClearingImpl[F[_], A](
data: Ref[F, Map[String, Entry[A]]],
counter: Ref[F, Long],
cleaningFiber: Ref[F, Option[Fiber[F, Unit]]],
clearInterval: FiniteDuration
clearInterval: FiniteDuration,
release: F[Unit]
)(implicit T: Timer[F], F: Concurrent[F])
extends CacheClearing[F] {
private[this] val log = Logger.log4s[F](logger)
@ -158,17 +165,10 @@ object PipelineCache {
def clearAll: F[Unit] =
log.info("Clearing stanford nlp cache now!") *>
data.set(Map.empty) *> Sync[F].delay {
// turns out that everything is cached in a static map
StanfordCoreNLP.clearAnnotatorPool()
data.set(Map.empty) *> release *> Sync[F].delay {
System.gc();
}
}
private def makeClassifier(settings: StanfordNerSettings): StanfordCoreNLP = {
logger.info(s"Creating ${settings.lang.name} Stanford NLP NER classifier...")
new StanfordCoreNLP(Properties.forSettings(settings))
}
private case class Entry(id: String, value: StanfordCoreNLP)
private case class Entry[A](id: String, value: A)
}
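
A sketch of wiring such a cache, mirroring what the Nlp helper in TextAnalyser does; the basic mode here is an illustrative choice:

import cats.effect.{Concurrent, Timer}
import docspell.analysis.NlpSettings
import docspell.analysis.nlp.{Annotator, PipelineCache}
import docspell.common._

object PipelineCacheExample {
  // Build a cache that creates a basic-mode Annotator per cache key and clears
  // the static stanford/CRF caches after the given idle interval.
  def makeCache[F[_]: Concurrent: Timer](clearAfter: Duration): F[PipelineCache[F]] =
    PipelineCache(clearAfter)(
      Annotator[F](NlpMode.Basic), // creator: NlpSettings => Annotator[F]
      Annotator.clearCaches[F]     // release action run when the cache is cleared
    )
}

Callers then obtain an annotator as a Resource via cache.obtain(key, settings); a new annotator is created only when the settings for that key change.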

View File

@ -1,9 +1,11 @@
package docspell.analysis.nlp
import java.nio.file.Path
import java.util.{Properties => JProps}
import docspell.analysis.nlp.Properties.Implicits._
import docspell.common._
import docspell.common.syntax.FileSyntax._
object Properties {
@ -17,18 +19,21 @@ object Properties {
p
}
def forSettings(settings: StanfordNerSettings): JProps = {
val regexNerFile = settings.regexNer
.map(p => p.normalize().toAbsolutePath().toString())
settings.lang match {
case Language.German =>
Properties.nerGerman(regexNerFile, settings.highRecall)
case Language.English =>
Properties.nerEnglish(regexNerFile)
case Language.French =>
Properties.nerFrench(regexNerFile, settings.highRecall)
def forSettings(settings: StanfordNerSettings): JProps =
settings match {
case StanfordNerSettings.Full(lang, highRecall, regexNer) =>
val regexNerFile = regexNer.map(p => p.absolutePathAsString)
lang match {
case Language.German =>
Properties.nerGerman(regexNerFile, highRecall)
case Language.English =>
Properties.nerEnglish(regexNerFile)
case Language.French =>
Properties.nerFrench(regexNerFile, highRecall)
}
case StanfordNerSettings.RegexOnly(path) =>
Properties.regexNerOnly(path)
}
}
def nerGerman(regexNerMappingFile: Option[String], highRecall: Boolean): JProps =
Properties(
@ -76,6 +81,11 @@ object Properties {
"ner.model" -> "edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
).withRegexNer(regexNerMappingFile).withHighRecall(highRecall)
def regexNerOnly(regexNerMappingFile: Path): JProps =
Properties(
"annotators" -> "tokenize,ssplit"
).withRegexNer(Some(regexNerMappingFile.absolutePathAsString))
object Implicits {
implicit final class JPropsOps(val p: JProps) extends AnyVal {

View File

@ -0,0 +1,52 @@
package docspell.analysis.nlp
import java.nio.file.Path
import scala.jdk.CollectionConverters._
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import org.log4s.getLogger
object StanfordNerAnnotator {
private[this] val logger = getLogger
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
def makePipeline(settings: StanfordNerSettings): StanfordCoreNLP =
settings match {
case s: StanfordNerSettings.Full =>
logger.info(s"Creating ${s.lang.name} Stanford NLP NER classifier...")
new StanfordCoreNLP(Properties.forSettings(settings))
case StanfordNerSettings.RegexOnly(path) =>
logger.info(s"Creating regexNer-only Stanford NLP NER classifier...")
regexNerPipeline(path)
}
def regexNerPipeline(regexNerFile: Path): StanfordCoreNLP =
new StanfordCoreNLP(Properties.regexNerOnly(regexNerFile))
def clearPipelineCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
// turns out that everything is cached in a static map
StanfordCoreNLP.clearAnnotatorPool()
}
}

View File

@ -1,39 +0,0 @@
package docspell.analysis.nlp
import scala.jdk.CollectionConverters._
import cats.Applicative
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
object StanfordNerClassifier {
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate[F[_]: BracketThrow](
cacheKey: String,
cache: PipelineCache[F]
)(settings: StanfordNerSettings, text: String): F[Vector[NerLabel]] =
cache
.obtain(cacheKey, settings)
.use(crf => Applicative[F].pure(runClassifier(crf, text)))
def runClassifier(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
}

View File

@ -2,25 +2,41 @@ package docspell.analysis.nlp
import java.nio.file.Path
import docspell.common._
import docspell.analysis.NlpSettings
import docspell.common.Language.NLPLanguage
/** Settings for configuring the stanford NER pipeline.
*
* The language is mandatory, only the provided ones are supported.
* The `highRecall` only applies for non-English languages. For
* non-English languages the english classifier is run as second
* classifier and if `highRecall` is true, then it will be used to
* tag untagged tokens. This may lead to a lot of false positives,
* but since English is omnipresent in other languages, too it
* depends on the use case for whether this is useful or not.
*
* The `regexNer` allows to specify a text file as described here:
* https://nlp.stanford.edu/software/regexner.html. This will be used
* as a last step to tag untagged tokens using the provided list of
* regexps.
*/
case class StanfordNerSettings(
lang: Language,
highRecall: Boolean,
regexNer: Option[Path]
)
sealed trait StanfordNerSettings
object StanfordNerSettings {
/** Settings for configuring the stanford NER pipeline.
*
* The language is mandatory; only the provided ones are supported.
* The `highRecall` option only applies to non-English languages. For
* those, the English classifier is run as a second classifier, and if
* `highRecall` is true, it is used to tag untagged tokens. This may
* lead to a lot of false positives, but since English is omnipresent
* in other languages too, whether this is useful depends on the use
* case.
*
* The `regexNer` option allows specifying a text file as described here:
* https://nlp.stanford.edu/software/regexner.html. It will be used
* as a last step to tag untagged tokens using the provided list of
* regexps.
*/
case class Full(
lang: NLPLanguage,
highRecall: Boolean,
regexNer: Option[Path]
) extends StanfordNerSettings
/** Not all languages are supported with predefined statistical models. This allows providing regexps only.
*/
case class RegexOnly(regexNerFile: Path) extends StanfordNerSettings
def fromNlpSettings(ns: NlpSettings): Option[StanfordNerSettings] =
NLPLanguage.all
.find(nl => nl == ns.lang)
.map(nl => Full(nl, ns.highRecall, ns.regexNer))
.orElse(ns.regexNer.map(nrf => RegexOnly(nrf)))
}
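
A small sketch of how fromNlpSettings resolves the generic settings; the object name and file path are illustrative assumptions:

import java.nio.file.Paths
import docspell.analysis.NlpSettings
import docspell.analysis.nlp.StanfordNerSettings
import docspell.common.Language

object NerSettingsExample extends App {
  // a language with a statistical model maps to Full settings …
  val de = NlpSettings(Language.German, highRecall = false, regexNer = None)
  assert(StanfordNerSettings.fromNlpSettings(de).exists(_.isInstanceOf[StanfordNerSettings.Full]))

  // … a language without one falls back to RegexOnly, if a regexNer file is configured
  val regexFile = Paths.get("/tmp/regex.txt") // illustrative path
  val it = NlpSettings(Language.Italian, highRecall = false, regexNer = Some(regexFile))
  assert(StanfordNerSettings.fromNlpSettings(it).contains(StanfordNerSettings.RegexOnly(regexFile)))
}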

View File

@ -0,0 +1,12 @@
package docspell.analysis
object Env {
def isCI = bool("CI")
def bool(key: String): Boolean =
string(key).contains("true")
def string(key: String): Option[String] =
Option(System.getenv(key)).filter(_.nonEmpty)
}

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import minitest._
import cats.effect._

View File

@ -1,19 +1,22 @@
package docspell.analysis.nlp
import docspell.analysis.Env
import docspell.common.Language.NLPLanguage
import minitest.SimpleTestSuite
import docspell.files.TestFiles
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
object TextAnalyserSuite extends SimpleTestSuite {
lazy val germanClassifier =
new StanfordCoreNLP(Properties.nerGerman(None, false))
lazy val englishClassifier =
new StanfordCoreNLP(Properties.nerEnglish(None))
object BaseCRFAnnotatorSuite extends SimpleTestSuite {
def annotate(language: NLPLanguage): String => Vector[NerLabel] =
BasicCRFAnnotator.nerAnnotate(BasicCRFAnnotator.Cache.getAnnotator(language))
test("find english ner labels") {
val labels =
StanfordNerClassifier.runClassifier(englishClassifier, TestFiles.letterENText)
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels = annotate(Language.English)(TestFiles.letterENText)
val expect = Vector(
NerLabel("Derek", NerTag.Person, 0, 5),
NerLabel("Jeter", NerTag.Person, 6, 11),
@ -45,11 +48,15 @@ object TextAnalyserSuite extends SimpleTestSuite {
NerLabel("Jeter", NerTag.Person, 1123, 1128)
)
assertEquals(labels, expect)
BasicCRFAnnotator.Cache.clearCache()
}
test("find german ner labels") {
val labels =
StanfordNerClassifier.runClassifier(germanClassifier, TestFiles.letterDEText)
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels = annotate(Language.German)(TestFiles.letterDEText)
val expect = Vector(
NerLabel("Max", NerTag.Person, 0, 3),
NerLabel("Mustermann", NerTag.Person, 4, 14),
@ -65,5 +72,6 @@ object TextAnalyserSuite extends SimpleTestSuite {
NerLabel("Mustermann", NerTag.Person, 509, 519)
)
assertEquals(labels, expect)
BasicCRFAnnotator.Cache.clearCache()
}
}

View File

@ -0,0 +1,120 @@
package docspell.analysis.nlp
import java.nio.file.Paths
import cats.effect.IO
import docspell.analysis.Env
import minitest.SimpleTestSuite
import docspell.files.TestFiles
import docspell.common._
import docspell.common.syntax.FileSyntax._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
object StanfordNerAnnotatorSuite extends SimpleTestSuite {
lazy val germanClassifier =
new StanfordCoreNLP(Properties.nerGerman(None, false))
lazy val englishClassifier =
new StanfordCoreNLP(Properties.nerEnglish(None))
test("find english ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(englishClassifier, TestFiles.letterENText)
val expect = Vector(
NerLabel("Derek", NerTag.Person, 0, 5),
NerLabel("Jeter", NerTag.Person, 6, 11),
NerLabel("Elm", NerTag.Misc, 17, 20),
NerLabel("Ave.", NerTag.Misc, 21, 25),
NerLabel("Treesville", NerTag.Misc, 27, 37),
NerLabel("Derek", NerTag.Person, 68, 73),
NerLabel("Jeter", NerTag.Person, 74, 79),
NerLabel("Elm", NerTag.Misc, 85, 88),
NerLabel("Ave.", NerTag.Misc, 89, 93),
NerLabel("Treesville", NerTag.Person, 95, 105),
NerLabel("Leaf", NerTag.Organization, 144, 148),
NerLabel("Chief", NerTag.Organization, 150, 155),
NerLabel("of", NerTag.Organization, 156, 158),
NerLabel("Syrup", NerTag.Organization, 159, 164),
NerLabel("Production", NerTag.Organization, 165, 175),
NerLabel("Old", NerTag.Organization, 176, 179),
NerLabel("Sticky", NerTag.Organization, 180, 186),
NerLabel("Pancake", NerTag.Organization, 187, 194),
NerLabel("Company", NerTag.Organization, 195, 202),
NerLabel("Maple", NerTag.Organization, 207, 212),
NerLabel("Lane", NerTag.Organization, 213, 217),
NerLabel("Forest", NerTag.Organization, 219, 225),
NerLabel("Hemptown", NerTag.Location, 239, 247),
NerLabel("Leaf", NerTag.Person, 276, 280),
NerLabel("Little", NerTag.Misc, 347, 353),
NerLabel("League", NerTag.Misc, 354, 360),
NerLabel("Derek", NerTag.Person, 1117, 1122),
NerLabel("Jeter", NerTag.Person, 1123, 1128)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("find german ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(germanClassifier, TestFiles.letterDEText)
val expect = Vector(
NerLabel("Max", NerTag.Person, 0, 3),
NerLabel("Mustermann", NerTag.Person, 4, 14),
NerLabel("Lilienweg", NerTag.Person, 16, 25),
NerLabel("Max", NerTag.Person, 77, 80),
NerLabel("Mustermann", NerTag.Person, 81, 91),
NerLabel("Lilienweg", NerTag.Location, 93, 102),
NerLabel("EasyCare", NerTag.Organization, 124, 132),
NerLabel("AG", NerTag.Organization, 133, 135),
NerLabel("Ackerweg", NerTag.Location, 158, 166),
NerLabel("Nebendorf", NerTag.Location, 184, 193),
NerLabel("Max", NerTag.Person, 505, 508),
NerLabel("Mustermann", NerTag.Person, 509, 519)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("regexner-only annotator") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val regexNerContent =
s"""(?i)volantino ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)volantino${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)andrea rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)andrea${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|""".stripMargin
File
.withTempDir[IO](Paths.get("target"), "test-regex-ner")
.use { dir =>
for {
out <- File.writeString[IO](dir / "regex.txt", regexNerContent)
ann = StanfordNerAnnotator.makePipeline(StanfordNerSettings.RegexOnly(out))
labels = StanfordNerAnnotator.nerAnnotate(ann, "Hello Andrea Rossi, can you.")
_ <- IO(
assertEquals(
labels,
Vector(
NerLabel("Andrea", NerTag.Person, 6, 12),
NerLabel("Rossi", NerTag.Person, 13, 18)
)
)
)
} yield ()
}
.unsafeRunSync()
StanfordCoreNLP.clearAnnotatorPool()
}
}

View File

@ -591,7 +591,7 @@ object OItem {
for {
itemIds <- store.transact(RItem.filterItems(items, collective))
results <- itemIds.traverse(item => deleteItem(item, collective))
n = results.fold(0)(_ + _)
n = results.sum
} yield n
def getProposals(item: Ident, collective: Ident): F[MetaProposalList] =

View File

@ -1,5 +1,7 @@
package docspell.common
import cats.data.NonEmptyList
import io.circe.{Decoder, Encoder}
sealed trait Language { self: Product =>
@ -11,28 +13,107 @@ sealed trait Language { self: Product =>
def iso3: String
val allowsNLP: Boolean = false
private[common] def allNames =
Set(name, iso3, iso2)
}
object Language {
sealed trait NLPLanguage extends Language with Product {
override val allowsNLP = true
}
object NLPLanguage {
val all: NonEmptyList[NLPLanguage] = NonEmptyList.of(German, English, French)
}
case object German extends Language {
case object German extends NLPLanguage {
val iso2 = "de"
val iso3 = "deu"
}
case object English extends Language {
case object English extends NLPLanguage {
val iso2 = "en"
val iso3 = "eng"
}
case object French extends Language {
case object French extends NLPLanguage {
val iso2 = "fr"
val iso3 = "fra"
}
val all: List[Language] = List(German, English, French)
case object Italian extends Language {
val iso2 = "it"
val iso3 = "ita"
}
case object Spanish extends Language {
val iso2 = "es"
val iso3 = "spa"
}
case object Portuguese extends Language {
val iso2 = "pt"
val iso3 = "por"
}
case object Czech extends Language {
val iso2 = "cs"
val iso3 = "ces"
}
case object Danish extends Language {
val iso2 = "da"
val iso3 = "dan"
}
case object Finnish extends Language {
val iso2 = "fi"
val iso3 = "fin"
}
case object Norwegian extends Language {
val iso2 = "no"
val iso3 = "nor"
}
case object Swedish extends Language {
val iso2 = "sv"
val iso3 = "swe"
}
case object Russian extends Language {
val iso2 = "ru"
val iso3 = "rus"
}
case object Romanian extends Language {
val iso2 = "ro"
val iso3 = "ron"
}
case object Dutch extends Language {
val iso2 = "nl"
val iso3 = "nld"
}
val all: List[Language] =
List(
German,
English,
French,
Italian,
Spanish,
Dutch,
Portuguese,
Czech,
Danish,
Finnish,
Norwegian,
Swedish,
Russian,
Romanian
)
def fromString(str: String): Either[String, Language] = {
val lang = str.toLowerCase

View File

@ -0,0 +1,33 @@
package docspell.common
import cats.data.NonEmptyList
import io.circe.{Decoder, Encoder}
sealed trait ListType { self: Product =>
def name: String =
productPrefix.toLowerCase
}
object ListType {
case object Whitelist extends ListType
val whitelist: ListType = Whitelist
case object Blacklist extends ListType
val blacklist: ListType = Blacklist
val all: NonEmptyList[ListType] = NonEmptyList.of(Whitelist, Blacklist)
def fromString(name: String): Either[String, ListType] =
all.find(_.name.equalsIgnoreCase(name)).toRight(s"Unknown list type: $name")
def unsafeFromString(name: String): ListType =
fromString(name).fold(sys.error, identity)
implicit val jsonEncoder: Encoder[ListType] =
Encoder.encodeString.contramap(_.name)
implicit val jsonDecoder: Decoder[ListType] =
Decoder.decodeString.emap(fromString)
}
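
A tiny usage sketch of the new type, round-tripping through fromString and the circe codec (the object name is illustrative):

import docspell.common.ListType
import io.circe.syntax._

object ListTypeExample extends App {
  // fromString is case-insensitive
  assert(ListType.fromString("Whitelist") == Right(ListType.whitelist))
  // the json codec uses the lowercase name
  assert(ListType.whitelist.asJson.noSpaces == "\"whitelist\"")
}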

View File

@ -87,7 +87,7 @@ object MetaProposal {
}
}
/** Merges candidates with same `IdRef' values and concatenates their
/** Merges candidates with same `IdRef` values and concatenates their
* respective labels. The candidate order is preserved.
*/
def flatten(s: NonEmptyList[Candidate]): NonEmptyList[Candidate] = {

View File

@ -45,6 +45,19 @@ case class MetaProposalList private (proposals: List[MetaProposal]) {
def sortByWeights: MetaProposalList =
change(_.sortByWeight)
def insertSecond(ml: MetaProposalList): MetaProposalList =
MetaProposalList.flatten0(
Seq(this, ml),
(map, next) =>
map.get(next.proposalType) match {
case Some(MetaProposal(mt, values)) =>
val cand = NonEmptyList(values.head, next.values.toList ++ values.tail)
map.updated(next.proposalType, MetaProposal(mt, MetaProposal.flatten(cand)))
case None =>
map.updated(next.proposalType, next)
}
)
}
object MetaProposalList {
@ -74,20 +87,25 @@ object MetaProposalList {
* is preserved and candidates of proposals are appended as given
* by the order of the given `seq'.
*/
def flatten(ml: Seq[MetaProposalList]): MetaProposalList = {
val init: Map[MetaProposalType, MetaProposal] = Map.empty
def updateMap(
map: Map[MetaProposalType, MetaProposal],
mp: MetaProposal
): Map[MetaProposalType, MetaProposal] =
map.get(mp.proposalType) match {
case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
case None => map.updated(mp.proposalType, mp)
}
val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(updateMap))
def flatten(ml: Seq[MetaProposalList]): MetaProposalList =
flatten0(
ml,
(map, mp) =>
map.get(mp.proposalType) match {
case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
case None => map.updated(mp.proposalType, mp)
}
)
private def flatten0(
ml: Seq[MetaProposalList],
merge: (
Map[MetaProposalType, MetaProposal],
MetaProposal
) => Map[MetaProposalType, MetaProposal]
): MetaProposalList = {
val init = Map.empty[MetaProposalType, MetaProposal]
val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(merge))
fromMap(merged)
}

View File

@ -0,0 +1,25 @@
package docspell.common
sealed trait NlpMode { self: Product =>
def name: String =
self.productPrefix
}
object NlpMode {
case object Full extends NlpMode
case object Basic extends NlpMode
case object RegexOnly extends NlpMode
case object Disabled extends NlpMode
def fromString(name: String): Either[String, NlpMode] =
name.toLowerCase match {
case "full" => Right(Full)
case "basic" => Right(Basic)
case "regexonly" => Right(RegexOnly)
case "disabled" => Right(Disabled)
case _ => Left(s"Unknown nlp-mode: $name")
}
def unsafeFromString(name: String): NlpMode =
fromString(name).fold(sys.error, identity)
}

View File

@ -44,6 +44,9 @@ object Implicits {
implicit val priorityReader: ConfigReader[Priority] =
ConfigReader[String].emap(reason(Priority.fromString))
implicit val nlpModeReader: ConfigReader[NlpMode] =
ConfigReader[String].emap(reason(NlpMode.fromString))
def reason[A: ClassTag](
f: String => Either[String, A]
): String => Either[FailureReason, A] =

View File

@ -0,0 +1,20 @@
package docspell.common.syntax
import java.nio.file.Path
trait FileSyntax {
implicit final class PathOps(p: Path) {
def absolutePath: Path =
p.normalize().toAbsolutePath
def absolutePathAsString: String =
absolutePath.toString
def /(next: String): Path =
p.resolve(next)
}
}
object FileSyntax extends FileSyntax
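
A short sketch of the new path syntax in use; the paths are illustrative:

import java.nio.file.Paths
import docspell.common.syntax.FileSyntax._

object FileSyntaxExample extends App {
  val dir = Paths.get("target") / "test-regex-ner" // resolve a child path
  println(dir.absolutePathAsString)                // normalized absolute path as a string
}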

View File

@ -2,6 +2,11 @@ package docspell.common
package object syntax {
object all extends EitherSyntax with StreamSyntax with StringSyntax with LoggerSyntax
object all
extends EitherSyntax
with StreamSyntax
with StringSyntax
with LoggerSyntax
with FileSyntax
}

View File

@ -68,4 +68,35 @@ object MetaProposalListTest extends SimpleTestSuite {
assertEquals(candidates.head, cand1)
assertEquals(candidates.tail.head, cand2)
}
test("insert second") {
val cand1 = Candidate(IdRef(Ident.unsafe("123"), "name"), Set.empty)
val cand2 = Candidate(IdRef(Ident.unsafe("456"), "name"), Set.empty)
val cand3 = Candidate(IdRef(Ident.unsafe("789"), "name"), Set.empty)
val cand4 = Candidate(IdRef(Ident.unsafe("abc"), "name"), Set.empty)
val cand5 = Candidate(IdRef(Ident.unsafe("def"), "name"), Set.empty)
val mpl1 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3))
)
val mpl2 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand4)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand5))
)
val result = mpl1.insertSecond(mpl2)
assertEquals(
result,
MetaProposalList(
List(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand4, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3, cand5))
)
)
)
}
}

View File

@ -0,0 +1,13 @@
Pontremoli, 9 aprile 2013
Spettabile Villa Albicocca
Via Francigena, 9
55100 Pontetetto (LU)
Oggetto: Prenotazione
Gentile Direttore,
Vorrei prenotare una camera matrimoniale …….
In attesa di una Sua pronta risposta, La saluto cordialmente

View File

@ -1,5 +1,8 @@
package docspell.ftsclient
import cats.Functor
import cats.implicits._
import docspell.common._
final case class FtsMigration[F[_]](
@ -7,7 +10,13 @@ final case class FtsMigration[F[_]](
engine: Ident,
description: String,
task: F[FtsMigration.Result]
)
) {
def changeResult(f: FtsMigration.Result => FtsMigration.Result)(implicit
F: Functor[F]
): FtsMigration[F] =
copy(task = task.map(f))
}
object FtsMigration {

View File

@ -21,22 +21,19 @@ object Field {
val discriminator = Field("discriminator")
val attachmentName = Field("attachmentName")
val content = Field("content")
val content_de = Field("content_de")
val content_en = Field("content_en")
val content_fr = Field("content_fr")
val content_de = contentField(Language.German)
val content_en = contentField(Language.English)
val content_fr = contentField(Language.French)
val itemName = Field("itemName")
val itemNotes = Field("itemNotes")
val folderId = Field("folder")
val contentLangFields = Language.all
.map(contentField)
def contentField(lang: Language): Field =
lang match {
case Language.German =>
Field.content_de
case Language.English =>
Field.content_en
case Language.French =>
Field.content_fr
}
if (lang == Language.Czech) Field(s"content_cz")
else Field(s"content_${lang.iso2}")
implicit val jsonEncoder: Encoder[Field] =
Encoder.encodeString.contramap(_.name)

View File

@ -37,13 +37,10 @@ object SolrQuery {
cfg,
List(
Field.content,
Field.content_de,
Field.content_en,
Field.content_fr,
Field.itemName,
Field.itemNotes,
Field.attachmentName
),
) ++ Field.contentLangFields,
List(
Field.id,
Field.itemId,

View File

@ -56,21 +56,51 @@ object SolrSetup {
5,
solrEngine,
"Add content_fr field",
addContentFrField.map(_ => FtsMigration.Result.workDone)
addContentField(Language.French).map(_ => FtsMigration.Result.workDone)
),
FtsMigration[F](
6,
solrEngine,
"Index all from database",
FtsMigration.Result.indexAll.pure[F]
),
FtsMigration[F](
7,
solrEngine,
"Add content_it field",
addContentField(Language.Italian).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
8,
solrEngine,
"Add content_es field",
addContentField(Language.Spanish).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
9,
solrEngine,
"Add more content fields",
addMoreContentFields.map(_ => FtsMigration.Result.reIndexAll)
)
)
def addFolderField: F[Unit] =
addStringField(Field.folderId)
def addContentFrField: F[Unit] =
addTextField(Some(Language.French))(Field.content_fr)
def addMoreContentFields: F[Unit] = {
val remain = List[Language](
Language.Norwegian,
Language.Romanian,
Language.Swedish,
Language.Finnish,
Language.Danish,
Language.Czech,
Language.Dutch,
Language.Portuguese,
Language.Russian
)
remain.traverse(addContentField).map(_ => ())
}
def setupCoreSchema: F[Unit] = {
val cmds0 =
@ -90,13 +120,15 @@ object SolrSetup {
)
.traverse(addTextField(None))
val cntLang = Language.all.traverse {
val cntLang = List(Language.German, Language.English, Language.French).traverse {
case l @ Language.German =>
addTextField(l.some)(Field.content_de)
case l @ Language.English =>
addTextField(l.some)(Field.content_en)
case l @ Language.French =>
addTextField(l.some)(Field.content_fr)
case _ =>
().pure[F]
}
cmds0 *> cmds1 *> cntLang *> ().pure[F]
@ -111,20 +143,17 @@ object SolrSetup {
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.string(field)))
private def addContentField(lang: Language): F[Unit] =
addTextField(Some(lang))(Field.contentField(lang))
private def addTextField(lang: Option[Language])(field: Field): F[Unit] =
lang match {
case None =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.text(field)))
case Some(Language.German) =>
run(AddField.command(AddField.textGeneral(field)))
case Some(lang) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textDE(field)))
case Some(Language.English) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textEN(field)))
case Some(Language.French) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textFR(field)))
run(AddField.command(AddField.textLang(field, lang)))
}
}
}
@ -150,17 +179,12 @@ object SolrSetup {
def string(field: Field): AddField =
AddField(field, "string", true, true, false)
def text(field: Field): AddField =
def textGeneral(field: Field): AddField =
AddField(field, "text_general", true, true, false)
def textDE(field: Field): AddField =
AddField(field, "text_de", true, true, false)
def textEN(field: Field): AddField =
AddField(field, "text_en", true, true, false)
def textFR(field: Field): AddField =
AddField(field, "text_fr", true, true, false)
def textLang(field: Field, lang: Language): AddField =
if (lang == Language.Czech) AddField(field, s"text_cz", true, true, false)
else AddField(field, s"text_${lang.iso2}", true, true, false)
}
case class DeleteField(name: Field)

View File

@ -269,62 +269,101 @@ docspell.joex {
# All text to analyse must fit into RAM. A large document may take
# too much heap. Also, most important information is at the
# beginning of a document, so in most cases the first two pages
# should suffice. Default is 10000, which are about 2-3 pages
# (just a rough guess, of course).
max-length = 10000
# should suffice. Default is 8000, which is about 2-3 pages (just
# a rough guess, of course).
max-length = 8000
# A working directory for the analyser to store temporary/working
# files.
working-dir = ${java.io.tmpdir}"/docspell-analysis"
# The StanfordCoreNLP library caches language models which
# requires quite some amount of memory. Setting this interval to a
# positive duration, the cache is cleared after this amount of
# idle time. Set it to 0 to disable it if you have enough memory,
# processing will be faster.
clear-stanford-nlp-interval = "15 minutes"
regex-ner {
# Whether to enable custom NER annotation. This uses the address
# book of a collective as input for NER tagging (to automatically
# find correspondent and concerned entities). If the address book
# is large, this can be quite memory intensive and also makes text
# analysis slower. But it greatly improves accuracy. If this is
# false, NER tagging uses only statistical models (that also work
# quite well).
nlp {
# The mode for configuring NLP models:
#
# This setting might be moved to the collective settings in the
# future.
enabled = true
# 1. full - builds the complete pipeline
# 2. basic - builds only the ner annotator
# 3. regexonly - matches each entry in your address book via regexps
# 4. disabled - doesn't use any stanford-nlp feature
#
# The full and basic variants rely on pre-built language models
# that are available for only a few languages. Memory usage
# varies among the languages. So joex should run with -Xmx1400M
# at least when using mode=full.
#
# The basic variant does quite a good job for German and
# English. It might be worse for French, always depending on the
# type of text that is analysed. Joex should run with about 500M
# heap; here again German uses the most.
#
# The regexonly variant doesn't depend on a language. It roughly
# works by converting all entries in your address book into
# regexps and matches each one against the text. This can get
# memory intensive, too, when the address book grows large. This
# is included in the full and basic variants by default, but can
# be used independently by setting mode=regexonly.
#
# When mode=disabled, then the whole nlp pipeline is disabled,
# and you won't get any suggestions. Only what the classifier
# returns (if enabled).
mode = full
# The NER annotation uses a file of patterns that is derived from
# a collective's address book. This is the time this file will be
# kept until a check for a state change is done.
file-cache-time = "1 minute"
# The StanfordCoreNLP library caches language models which
# require quite some amount of memory. If this interval is set to
# a positive duration, the cache is cleared after this amount of
# idle time. Set it to 0 to disable it; if you have enough memory,
# processing will be faster.
#
# This only has an effect if mode != disabled.
clear-interval = "15 minutes"
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
regex-ner {
# Whether to enable custom NER annotation. This uses the
# address book of a collective as input for NER tagging (to
# automatically find correspondent and concerned entities). If
# the address book is large, this can be quite memory
# intensive and also makes text analysis much slower. But it
# improves accuracy and can be used independently of the
# language. If this is set to 0, it is effectively disabled
# and NER tagging uses only statistical models (that also work
# quite well, but are restricted to the languages mentioned
# above).
#
# Note, this is only relevant if nlp-config.mode is not
# "disabled".
max-entries = 1000
# The NER annotation uses a file of patterns that is derived
# from a collective's address book. This is the time this data
# will be kept until a check for a state change is done.
file-cache-time = "1 minute"
}
}
# Settings for doing document classification.
#
# This works by learning from existing documents. A collective can
# specify a tag category and the system will try to predict a tag
# from this category for new incoming documents.
#
# This requires a statistical model that is computed from all
# existing documents. This process is run periodically as
# configured by the collective. It may require a lot of memory,
# depending on the amount of data.
# This works by learning from existing documents. This requires a
# statistical model that is computed from all existing documents.
# This process is run periodically as configured by the
# collective. It may require more memory, depending on the amount
# of data.
#
# It utilises this NLP library: https://nlp.stanford.edu/.
classification {
# Whether to enable classification globally. Each collective can
# decide to disable it. If it is disabled here, no collective
# can use classification.
# enable/disable auto-tagging. The classifier is also used for
# finding correspondents and concerned entities, if enabled
# here.
enabled = true
# If concerned with memory consumption, this restricts the
# number of items to consider. More are better for training. A
# negative value or zero means no train on all items.
item-count = 0
# negative value or zero means to train on all items.
item-count = 600
# These settings are used to configure the classifier. If
# multiple are given, they are all tried and the "best" is
@ -477,13 +516,6 @@ docspell.joex {
}
}
# General config for processing documents
processing {
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
}
# The same section is also present in the rest-server config. It is
# used when submitting files into the job queue for processing.
#
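Note: the mode names described in the config above correspond to an enumeration in the code (the NlpMode type referenced further down in Config.scala). The following standalone sketch, not the actual docspell implementation, shows how such a setting might be parsed from the config string:

// Standalone sketch (not the real docspell.common.NlpMode): parsing and
// interpreting the four mode values from the nlp config section above.
sealed trait NlpMode
object NlpMode {
  case object Full      extends NlpMode // complete Stanford NLP pipeline
  case object Basic     extends NlpMode // only the NER annotator
  case object RegexOnly extends NlpMode // address-book regex matching only
  case object Disabled  extends NlpMode // no stanford-nlp features at all

  def fromString(s: String): Either[String, NlpMode] =
    s.toLowerCase match {
      case "full"      => Right(Full)
      case "basic"     => Right(Basic)
      case "regexonly" => Right(RegexOnly)
      case "disabled"  => Right(Disabled)
      case other       => Left(s"Unknown nlp mode: $other")
    }
}

object NlpModeExample extends App {
  // "full" and "basic" rely on pre-built language models, "regexonly" uses
  // only the address-book patterns, "disabled" skips NLP entirely.
  assert(NlpMode.fromString("regexonly") == Right(NlpMode.RegexOnly))
  assert(NlpMode.fromString("bogus").isLeft)
}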

View File

@ -5,7 +5,7 @@ import java.nio.file.Path
import cats.data.NonEmptyList
import docspell.analysis.TextAnalysisConfig
import docspell.analysis.nlp.TextClassifierConfig
import docspell.analysis.classifier.TextClassifierConfig
import docspell.backend.Config.Files
import docspell.common._
import docspell.convert.ConvertConfig
@ -31,8 +31,7 @@ case class Config(
sendMail: MailSendConfig,
files: Files,
mailDebug: Boolean,
fullTextSearch: Config.FullTextSearch,
processing: Config.Processing
fullTextSearch: Config.FullTextSearch
)
object Config {
@ -55,20 +54,17 @@ object Config {
final case class Migration(indexAllChunk: Int)
}
case class Processing(maxDueDateYears: Int)
case class TextAnalysis(
maxLength: Int,
workingDir: Path,
clearStanfordNlpInterval: Duration,
regexNer: RegexNer,
nlp: NlpConfig,
classification: Classification
) {
def textAnalysisConfig: TextAnalysisConfig =
TextAnalysisConfig(
maxLength,
clearStanfordNlpInterval,
TextAnalysisConfig.NlpConfig(nlp.clearInterval, nlp.mode),
TextClassifierConfig(
workingDir,
NonEmptyList
@ -78,14 +74,30 @@ object Config {
)
def regexNerFileConfig: RegexNerFile.Config =
RegexNerFile.Config(regexNer.enabled, workingDir, regexNer.fileCacheTime)
RegexNerFile.Config(
nlp.regexNer.maxEntries,
workingDir,
nlp.regexNer.fileCacheTime
)
}
case class RegexNer(enabled: Boolean, fileCacheTime: Duration)
case class NlpConfig(
mode: NlpMode,
clearInterval: Duration,
maxDueDateYears: Int,
regexNer: RegexNer
)
case class RegexNer(maxEntries: Int, fileCacheTime: Duration)
case class Classification(
enabled: Boolean,
itemCount: Int,
classifiers: List[Map[String, String]]
)
) {
def itemCountOrWhenLower(other: Int): Int =
if (itemCount <= 0 || (itemCount > other && other > 0)) other
else itemCount
}
}
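Note: itemCountOrWhenLower picks the effective training set size from the global item-count limit and a collective's own setting; the smaller positive value wins and a non-positive value means no restriction. A standalone sketch of the same logic with assumed example values:

// Standalone sketch of the itemCountOrWhenLower logic shown above, with a
// few worked examples (the concrete numbers are assumed, not from the source).
object ItemCountExample extends App {
  def itemCountOrWhenLower(itemCount: Int, other: Int): Int =
    if (itemCount <= 0 || (itemCount > other && other > 0)) other
    else itemCount

  assert(itemCountOrWhenLower(600, 200) == 200) // collective limit is lower, use it
  assert(itemCountOrWhenLower(600, 0) == 600)   // collective says "all", global cap applies
  assert(itemCountOrWhenLower(0, 200) == 200)   // no global cap, collective limit applies
  assert(itemCountOrWhenLower(0, 0) == 0)       // neither side restricts, train on all items
}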

View File

@ -97,7 +97,7 @@ object JoexAppImpl {
upload <- OUpload(store, queue, cfg.files, joex)
fts <- createFtsClient(cfg)(httpClient)
itemOps <- OItem(store, fts, queue, joex)
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig)
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig, blocker)
regexNer <- RegexNerFile(cfg.textAnalysis.regexNerFileConfig, blocker, store)
javaEmil =
JavaMailEmil(blocker, Settings.defaultSettings.copy(debug = cfg.mailDebug))
@ -169,7 +169,7 @@ object JoexAppImpl {
.withTask(
JobTask.json(
LearnClassifierArgs.taskName,
LearnClassifierTask[F](cfg.textAnalysis, blocker, analyser),
LearnClassifierTask[F](cfg.textAnalysis, analyser),
LearnClassifierTask.onCancel[F]
)
)

View File

@ -29,7 +29,7 @@ trait RegexNerFile[F[_]] {
object RegexNerFile {
private[this] val logger = getLogger
case class Config(enabled: Boolean, directory: Path, minTime: Duration)
case class Config(maxEntries: Int, directory: Path, minTime: Duration)
def apply[F[_]: Concurrent: ContextShift](
cfg: Config,
@ -49,7 +49,7 @@ object RegexNerFile {
) extends RegexNerFile[F] {
def makeFile(collective: Ident): F[Option[Path]] =
if (cfg.enabled) doMakeFile(collective)
if (cfg.maxEntries > 0) doMakeFile(collective)
else (None: Option[Path]).pure[F]
def doMakeFile(collective: Ident): F[Option[Path]] =
@ -127,7 +127,7 @@ object RegexNerFile {
for {
_ <- logger.finfo(s"Generating custom NER file for collective '${collective.id}'")
names <- store.transact(QCollective.allNames(collective))
names <- store.transact(QCollective.allNames(collective, cfg.maxEntries))
nerFile = NerFile(collective, lastUpdate, now)
_ <- update(nerFile, NerFile.mkNerConfig(names))
} yield nerFile

View File

@ -14,16 +14,26 @@ object FtsWork {
def apply[F[_]](f: FtsContext[F] => F[Unit]): FtsWork[F] =
Kleisli(f)
def allInitializeTasks[F[_]: Monad]: FtsWork[F] =
FtsWork[F](_ => ().pure[F]).tap[FtsContext[F]].flatMap { ctx =>
NonEmptyList.fromList(ctx.fts.initialize.map(fm => from[F](fm.task))) match {
/** Runs all migration tasks unconditionally and inserts all data as the last step. */
def reInitializeTasks[F[_]: Monad]: FtsWork[F] =
FtsWork { ctx =>
val migrations =
ctx.fts.initialize.map(fm => fm.changeResult(_ => FtsMigration.Result.workDone))
NonEmptyList.fromList(migrations) match {
case Some(nel) =>
nel.reduce(semigroup[F])
nel
.map(fm => from[F](fm.task))
.append(insertAll[F](None))
.reduce(semigroup[F])
.run(ctx)
case None =>
FtsWork[F](_ => ().pure[F])
().pure[F]
}
}
/**
*/
def from[F[_]: FlatMap: Applicative](t: F[FtsMigration.Result]): FtsWork[F] =
Kleisli.liftF(t).flatMap(transformResult[F])
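Note: reInitializeTasks composes the per-engine migration tasks with the FtsWork semigroup and appends an insert-all step. The following standalone sketch, using plain functions instead of the real Kleisli-based FtsWork and FtsContext types, illustrates that sequential composition:

// Standalone sketch of how FtsWork values compose: each unit of work runs
// against a shared context, and the semigroup runs one after the other.
// Names here are illustrative only, not the real FtsContext/FtsWork types.
import cats.data.NonEmptyList

object FtsWorkSketch extends App {
  type Ctx  = StringBuilder  // stands in for FtsContext[F]
  type Work = Ctx => Unit    // stands in for Kleisli[F, FtsContext[F], Unit]

  def step(name: String): Work = ctx => { ctx.append(name).append(";"); () }

  // like FtsWork.semigroup: run the first work, then the second
  def both(a: Work, b: Work): Work = ctx => { a(ctx); b(ctx) }

  // reInitializeTasks: run all migrations unconditionally, then insert all data
  val migrations = NonEmptyList.of(step("migration-1"), step("migration-2"))
  val all        = migrations.append(step("insert-all")).reduceLeft(both)

  val ctx = new StringBuilder
  all(ctx)
  assert(ctx.toString == "migration-1;migration-2;insert-all;")
}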

View File

@ -11,6 +11,11 @@ import docspell.joex.Config
import docspell.store.records.RFtsMigration
import docspell.store.{AddResult, Store}
/** Migrating the index from the previous version to this version.
*
* The sql database stores the outcome of a migration task. If this
* task has already been applied, it is skipped.
*/
case class Migration[F[_]](
version: Int,
engine: Ident,

View File

@ -46,6 +46,6 @@ object ReIndexTask {
FtsWork.log[F](_.info("Clearing data failed. Continue re-indexing."))
) ++
FtsWork.log[F](_.info("Running index initialize")) ++
FtsWork.allInitializeTasks[F]
FtsWork.reInitializeTasks[F]
})
}

View File

@ -4,6 +4,9 @@ import cats.data.Kleisli
package object fts {
/** Some work that must be done to advance the schema of the fulltext
* index.
*/
type FtsWork[F[_]] = Kleisli[F, FtsContext[F], Unit]
}

View File

@ -0,0 +1,66 @@
package docspell.joex.learn
import cats.data.NonEmptyList
import cats.implicits._
import docspell.common.Ident
import docspell.store.records.{RClassifierModel, RClassifierSetting}
import doobie._
final class ClassifierName(val name: String) extends AnyVal
object ClassifierName {
def apply(name: String): ClassifierName =
new ClassifierName(name)
private val categoryPrefix = "tagcategory-"
def tagCategory(cat: String): ClassifierName =
apply(s"${categoryPrefix}${cat}")
val concernedPerson: ClassifierName =
apply("concernedperson")
val concernedEquip: ClassifierName =
apply("concernedequip")
val correspondentOrg: ClassifierName =
apply("correspondentorg")
val correspondentPerson: ClassifierName =
apply("correspondentperson")
def findTagClassifiers[F[_]](coll: Ident): ConnectionIO[List[ClassifierName]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
} yield categories.map(tagCategory)
def findTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
models <- NonEmptyList.fromList(categories) match {
case Some(nel) =>
RClassifierModel.findAllByName(coll, nel.map(tagCategory).map(_.name))
case None =>
List.empty[RClassifierModel].pure[ConnectionIO]
}
} yield models
def findOrphanTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
cats <- RClassifierSetting.getActiveCategories(coll)
allModels = RClassifierModel.findAllByQuery(coll, s"${categoryPrefix}%")
result <- NonEmptyList.fromList(cats) match {
case Some(nel) =>
allModels.flatMap(all =>
RClassifierModel
.findAllByName(coll, nel.map(tagCategory).map(_.name))
.map(active => all.diff(active))
)
case None =>
allModels
}
} yield result
}
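Note: classifier models are stored per collective and keyed by name; tag models use the "tagcategory-" prefix so that models for removed categories can be found by diffing all stored tag models against the active ones. A standalone sketch of that computation with assumed example data:

// Standalone sketch of the naming scheme and the orphan-model computation,
// using plain collections instead of the database queries above.
object ClassifierNameSketch extends App {
  def tagCategory(cat: String): String = s"tagcategory-$cat"

  // all model names currently stored for a collective (assumed example data)
  val storedModels     = Set("tagcategory-invoice", "tagcategory-contract", "correspondentorg")
  // tag categories that are still active in the collective settings
  val activeCategories = List("invoice")

  // like findOrphanTagModels: tag models whose category is no longer active
  val orphans = storedModels.filter(_.startsWith("tagcategory-")) --
    activeCategories.map(tagCategory).toSet

  assert(orphans == Set("tagcategory-contract"))
}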

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import java.nio.file.Path
import cats.data.OptionT
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.{ClassifierModel, TextClassifier}
import docspell.common._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.RangeDef
object Classify {
def apply[F[_]: Sync: ContextShift](
blocker: Blocker,
logger: Logger[F],
workingDir: Path,
store: Store[F],
classifier: TextClassifier[F],
coll: Ident,
text: String
)(cname: ClassifierName): F[Option[String]] =
(for {
_ <- OptionT.liftF(logger.info(s"Guessing label for ${cname.name}"))
model <- OptionT(store.transact(RClassifierModel.findByName(coll, cname.name)))
.flatTapNone(logger.debug("No classifier model found."))
modelData =
store.bitpeace
.get(model.fileId.id)
.unNoneTerminate
.through(store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, blocker))
.compile
.drain
.flatMap(_ => classifier.classify(logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
.flatTapNone(logger.debug("Guessed: <none>"))
_ <- OptionT.liftF(logger.debug(s"Guessed: ${cls}"))
} yield cls).value
}

View File

@ -1,26 +1,19 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.data.OptionT
import cats.effect._
import cats.implicits._
import fs2.{Pipe, Stream}
import docspell.analysis.TextAnalyser
import docspell.analysis.nlp.ClassifierModel
import docspell.analysis.nlp.TextClassifier.Data
import docspell.backend.ops.OCollective
import docspell.common._
import docspell.joex.Config
import docspell.joex.scheduler._
import docspell.store.queries.QItem
import docspell.store.records.RClassifierSetting
import bitpeace.MimetypeHint
import docspell.store.records.{RClassifierModel, RClassifierSetting}
object LearnClassifierTask {
val noClass = "__NONE__"
val pageSep = " --n-- "
val noClass = "__NONE__"
type Args = LearnClassifierArgs
@ -29,83 +22,86 @@ object LearnClassifierTask {
def apply[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
blocker: Blocker,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
learnTags(cfg, analyser)
.flatMap(_ => learnItemEntities(cfg, analyser))
.flatMap(_ => Task(_ => Sync[F].delay(System.gc())))
private def learnItemEntities[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
Task { ctx =>
(for {
sett <- findActiveSettings[F](ctx, cfg)
data = selectItems(
ctx,
math.min(cfg.classification.itemCount, sett.itemCount).toLong,
sett.category.getOrElse("")
)
_ <- OptionT.liftF(
analyser
.classifier(blocker)
.trainClassifier[Unit](ctx.logger, data)(Kleisli(handleModel(ctx, blocker)))
)
} yield ())
.getOrElseF(logInactiveWarning(ctx.logger))
if (cfg.classification.enabled)
LearnItemEntities
.learnAll(
analyser,
ctx.args.collective,
cfg.classification.itemCount,
cfg.maxLength
)
.run(ctx)
else ().pure[F]
}
private def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
blocker: Blocker
)(trainedModel: ClassifierModel): F[Unit] =
private def learnTags[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
Task { ctx =>
val learnTags =
for {
sett <- findActiveSettings[F](ctx, cfg)
maxItems = cfg.classification.itemCountOrWhenLower(sett.itemCount)
_ <- OptionT.liftF(
LearnTags
.learnAllTagCategories(analyser)(
ctx.args.collective,
maxItems,
cfg.maxLength
)
.run(ctx)
)
} yield ()
// learn classifier models from active tag categories
learnTags.getOrElseF(logInactiveWarning(ctx.logger)) *>
// delete classifier model files for categories that have been removed
clearObsoleteTagModels(ctx) *>
// when tags are deleted, categories may get removed. fix the json array
ctx.store
.transact(RClassifierSetting.fixCategoryList(ctx.args.collective))
.map(_ => ())
}
private def clearObsoleteTagModels[F[_]: Sync](ctx: Context[F, Args]): F[Unit] =
for {
oldFile <- ctx.store.transact(
RClassifierSetting.findById(ctx.args.collective).map(_.flatMap(_.fileId))
list <- ctx.store.transact(
ClassifierName.findOrphanTagModels(ctx.args.collective)
)
_ <- ctx.logger.info("Storing new trained model")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
ctx.store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- ctx.store.transact(
RClassifierSetting.updateFile(ctx.args.collective, Ident.unsafe(newFile.id))
_ <- ctx.logger.info(
s"Found ${list.size} obsolete model files that are deleted now."
)
_ <- ctx.logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
ctx.logger.debug(s"Deleting old model file ${fid.id}") *>
ctx.store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
n <- ctx.store.transact(RClassifierModel.deleteAll(list.map(_.id)))
_ <- list
.map(_.fileId.id)
.traverse(id => ctx.store.bitpeace.delete(id).compile.drain)
_ <- ctx.logger.debug(s"Deleted $n model files.")
} yield ()
private def selectItems[F[_]](
ctx: Context[F, Args],
max: Long,
category: String
): Stream[F, Data] = {
val connStream =
for {
item <- QItem.findAllNewesFirst(ctx.args.collective, 10).through(restrictTo(max))
tt <- Stream.eval(
QItem.resolveTextAndTag(ctx.args.collective, item, category, pageSep)
)
} yield Data(tt.tag.map(_.name).getOrElse(noClass), item.id, tt.text.trim)
ctx.store.transact(connStream.filter(_.text.nonEmpty))
}
private def restrictTo[F[_], A](max: Long): Pipe[F, A, A] =
if (max <= 0) identity
else _.take(max)
private def findActiveSettings[F[_]: Sync](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): OptionT[F, OCollective.Classifier] =
if (cfg.classification.enabled)
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.collective)))
.filter(_.enabled)
.filter(_.category.nonEmpty)
.filter(_.autoTagEnabled)
.map(OCollective.Classifier.fromRecord)
else
OptionT.none
private def logInactiveWarning[F[_]: Sync](logger: Logger[F]): F[Unit] =
logger.warn(
"Classification is disabled. Check joex config and the collective settings."
"Auto-tagging is disabled. Check joex config and the collective settings."
)
}

View File

@ -0,0 +1,79 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import fs2.Stream
import docspell.analysis.TextAnalyser
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler._
object LearnItemEntities {
def learnAll[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learnCorrOrg(analyser, collective, maxItems, maxTextLen)
.flatMap(_ => learnCorrPerson[F, A](analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcPerson(analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcEquip(analyser, collective, maxItems, maxTextLen))
def learnCorrOrg[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentOrg,
ctx => SelectItems.forCorrOrg(ctx.store, collective, maxItems, maxTextLen)
)
def learnCorrPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentPerson,
ctx => SelectItems.forCorrPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedPerson,
ctx => SelectItems.forConcPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcEquip[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedEquip,
ctx => SelectItems.forConcEquip(ctx.store, collective, maxItems, maxTextLen)
)
private def learn[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident
)(cname: ClassifierName, data: Context[F, _] => Stream[F, Data]): Task[F, A, Unit] =
Task { ctx =>
ctx.logger.info(s"Learn classifier ${cname.name}") *>
analyser.classifier.trainClassifier(ctx.logger, data(ctx))(
Kleisli(StoreClassifierModel.handleModel(ctx, collective, cname))
)
}
}

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import docspell.analysis.TextAnalyser
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.records.RClassifierSetting
object LearnTags {
def learnTagCategory[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
)(
category: String
): Task[F, A, Unit] =
Task { ctx =>
val data = SelectItems.forCategory(ctx, collective)(maxItems, category, maxTextLen)
ctx.logger.info(s"Learn classifier for tag category: $category") *>
analyser.classifier.trainClassifier(ctx.logger, data)(
Kleisli(
StoreClassifierModel.handleModel(
ctx,
collective,
ClassifierName.tagCategory(category)
)
)
)
}
def learnAllTagCategories[F[_]: Sync: ContextShift, A](analyser: TextAnalyser[F])(
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
Task { ctx =>
for {
cats <- ctx.store.transact(RClassifierSetting.getActiveCategories(collective))
task = learnTagCategory[F, A](analyser, collective, maxItems, maxTextLen) _
_ <- cats.map(task).traverse(_.run(ctx))
} yield ()
}
}

View File

@ -0,0 +1,109 @@
package docspell.joex.learn
import fs2.{Pipe, Stream}
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler.Context
import docspell.store.Store
import docspell.store.qb.Batch
import docspell.store.queries.{QItem, TextAndTag}
import doobie._
object SelectItems {
val pageSep = LearnClassifierTask.pageSep
val noClass = LearnClassifierTask.noClass
def forCategory[F[_]](ctx: Context[F, _], collective: Ident)(
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] =
forCategory(ctx.store, collective, maxItems, category, maxTextLen)
def forCategory[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndTag(collective, item, category, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrOrg[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrOrg(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcEquip[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcEquip(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
private def allItems(collective: Ident, max: Int): Stream[ConnectionIO, Ident] = {
val limit = if (max <= 0) Batch.all else Batch.limit(max)
QItem.findAllNewesFirst(collective, 10, limit)
}
private def mkData[F[_]]: Pipe[F, TextAndTag, Data] =
_.map(tt => Data(tt.tag.map(_.name).getOrElse(noClass), tt.itemId.id, tt.text.trim))
.filter(_.text.nonEmpty)
}
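Note: mkData turns each (text, optional tag) row into a training sample, substituting the special __NONE__ class when an item carries no tag of the category and dropping rows without text. A standalone sketch, with Data as a simplified stand-in for TextClassifier.Data and assumed example rows:

// Standalone sketch of the mkData step above.
object SelectItemsSketch extends App {
  final case class Data(cls: String, ref: String, text: String)
  val noClass = "__NONE__"

  // (itemId, optional tag name, extracted text) -- assumed example rows
  val rows = List(
    ("item-1", Some("invoice"), "Total amount due ..."),
    ("item-2", None,            "Some other letter"),
    ("item-3", Some("invoice"), "   ")
  )

  val data = rows
    .map { case (id, tag, text) => Data(tag.getOrElse(noClass), id, text.trim) }
    .filter(_.text.nonEmpty)

  // item-3 is dropped because its text is empty after trimming
  assert(data.map(_.cls) == List("invoice", noClass))
}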

View File

@ -0,0 +1,53 @@
package docspell.joex.learn
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.ClassifierModel
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.MimetypeHint
object StoreClassifierModel {
def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, _],
collective: Ident,
modelName: ClassifierName
)(
trainedModel: ClassifierModel
): F[Unit] =
handleModel(ctx.store, ctx.blocker, ctx.logger)(collective, modelName, trainedModel)
def handleModel[F[_]: Sync: ContextShift](
store: Store[F],
blocker: Blocker,
logger: Logger[F]
)(
collective: Ident,
modelName: ClassifierName,
trainedModel: ClassifierModel
): F[Unit] =
for {
oldFile <- store.transact(
RClassifierModel.findByName(collective, modelName.name).map(_.map(_.fileId))
)
_ <- logger.debug(s"Storing new trained model for: ${modelName.name}")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- store.transact(
RClassifierModel.updateFile(collective, modelName.name, Ident.unsafe(newFile.id))
)
_ <- logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
logger.debug(s"Deleting old model file ${fid.id}") *>
store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
} yield ()
}

View File

@ -78,7 +78,14 @@ object AttachmentPageCount {
s"No attachmentmeta record exists for ${ra.id.id}. Creating new."
) *> ctx.store.transact(
RAttachmentMeta.insert(
RAttachmentMeta(ra.id, None, Nil, MetaProposalList.empty, md.pageCount.some)
RAttachmentMeta(
ra.id,
None,
Nil,
MetaProposalList.empty,
md.pageCount.some,
None
)
)
)
else 0.pure[F]

View File

@ -108,7 +108,18 @@ object ConvertPdf {
ctx.logger.info(s"Conversion to pdf+txt successful. Saving file.") *>
storePDF(ctx, cfg, ra, pdf)
.flatMap(r =>
txt.map(t => (r, item.changeMeta(ra.id, _.setContentIfEmpty(t.some)).some))
txt.map(t =>
(
r,
item
.changeMeta(
ra.id,
ctx.args.meta.language,
_.setContentIfEmpty(t.some)
)
.some
)
)
)
case ConversionResult.UnsupportedFormat(mt) =>

View File

@ -107,6 +107,8 @@ object CreateItem {
Vector.empty,
fm.map(a => a.id -> a.fileId).toMap,
MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil
)
}
@ -166,6 +168,8 @@ object CreateItem {
Vector.empty,
origMap,
MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil
)
)

View File

@ -42,7 +42,7 @@ object ExtractArchive {
archive: Option[RAttachmentArchive]
): Task[F, ProcessItemArgs, (Option[RAttachmentArchive], ItemData)] =
singlePass(item, archive).flatMap { t =>
if (t._1 == None) Task.pure(t)
if (t._1.isEmpty) Task.pure(t)
else multiPass(t._2, t._1)
}

View File

@ -17,24 +17,92 @@ import docspell.store.records._
* by looking up values from NER in the user's address book.
*/
object FindProposal {
type Args = ProcessItemArgs
def apply[F[_]: Sync](
cfg: Config.Processing
)(data: ItemData): Task[F, ProcessItemArgs, ItemData] =
cfg: Config.TextAnalysis
)(data: ItemData): Task[F, Args, ItemData] =
Task { ctx =>
val rmas = data.metas.map(rm => rm.copy(nerlabels = removeDuplicates(rm.nerlabels)))
ctx.logger.info("Starting find-proposal") *>
rmas
for {
_ <- ctx.logger.info("Starting find-proposal")
rmv <- rmas
.traverse(rm =>
processAttachment(cfg, rm, data.findDates(rm), ctx)
.map(ml => rm.copy(proposals = ml))
)
.map(rmv => data.copy(metas = rmv))
clp <- lookupClassifierProposals(ctx, data.classifyProposals)
} yield data.copy(metas = rmv, classifyProposals = clp)
}
def lookupClassifierProposals[F[_]: Sync](
ctx: Context[F, Args],
mpList: MetaProposalList
): F[MetaProposalList] = {
val coll = ctx.args.meta.collective
def lookup(mp: MetaProposal): F[Option[IdRef]] =
mp.proposalType match {
case MetaProposalType.CorrOrg =>
ctx.store
.transact(
ROrganization
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier organization for $mp: $oref")
)
case MetaProposalType.CorrPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, false)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier corr-person for $mp: $oref")
)
case MetaProposalType.ConcPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, true)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-person for $mp: $oref")
)
case MetaProposalType.ConcEquip =>
ctx.store
.transact(
REquipment
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-equip for $mp: $oref")
)
case MetaProposalType.DocDate =>
(None: Option[IdRef]).pure[F]
case MetaProposalType.DueDate =>
(None: Option[IdRef]).pure[F]
}
def updateRef(mp: MetaProposal)(idRef: Option[IdRef]): Option[MetaProposal] =
idRef // this proposal contains a single value only, since coming from classifier
.map(ref => mp.copy(values = mp.values.map(_.copy(ref = ref))))
ctx.logger.debug(s"Looking up classifier results: ${mpList.proposals}") *>
mpList.proposals
.traverse(mp => lookup(mp).map(updateRef(mp)))
.map(_.flatten)
.map(MetaProposalList.apply)
}
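Note: the classifier only yields a name, so lookupClassifierProposals resolves that name against the collective's address book to obtain a real entity id; proposals that no longer match anything are dropped. A standalone sketch of that resolution step, with assumed names and ids and a simple contains check instead of the SQL findLike query:

// Standalone sketch of resolving a classifier label against the address book.
object LookupSketch extends App {
  final case class IdRef(id: String, name: String)

  val orgs = List(IdRef("org-1", "ACME Corp"), IdRef("org-2", "Big Bank"))

  def findLike(name: String): Option[IdRef] =
    orgs.find(_.name.toLowerCase.contains(name.toLowerCase))

  def resolve(label: String): Option[IdRef] = findLike(label)

  assert(resolve("acme") == Some(IdRef("org-1", "ACME Corp")))
  assert(resolve("unknown company").isEmpty) // such a proposal would be dropped
}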
def processAttachment[F[_]: Sync](
cfg: Config.Processing,
cfg: Config.TextAnalysis,
rm: RAttachmentMeta,
rd: Vector[NerDateLabel],
ctx: Context[F, ProcessItemArgs]
@ -46,11 +114,11 @@ object FindProposal {
}
def makeDateProposal[F[_]: Sync](
cfg: Config.Processing,
cfg: Config.TextAnalysis,
dates: Vector[NerDateLabel]
): F[MetaProposalList] =
Timestamp.current[F].map { now =>
val maxFuture = now.plus(Duration.years(cfg.maxDueDateYears.toLong))
val maxFuture = now.plus(Duration.years(cfg.nlp.maxDueDateYears.toLong))
val latestFirst = dates
.filter(_.date.isBefore(maxFuture.toUtcDate))
.sortWith((l1, l2) => l1.date.isAfter(l2.date))

View File

@ -15,6 +15,9 @@ import docspell.store.records.{RAttachment, RAttachmentMeta, RItem}
* containing the source or origin file
* @param givenMeta meta data for this item that was not "guessed"
* from an attachment but given and thus is always correct
* @param classifyProposals these are proposals that were obtained by
* a trained classifier. There are no ner-tags; it only provides a
* single label
*/
case class ItemData(
item: RItem,
@ -23,7 +26,11 @@ case class ItemData(
dateLabels: Vector[AttachmentDates],
originFile: Map[Ident, Ident], // maps RAttachment.id -> FileMeta.id
givenMeta: MetaProposalList, // given meta data not associated to a specific attachment
tags: List[String] // a list of tags (names or ids) attached to the item if they exist
// a list of tags (names or ids) attached to the item if they exist
tags: List[String],
// proposals obtained from the classifier
classifyProposals: MetaProposalList,
classifyTags: List[String]
) {
def findMeta(attachId: Ident): Option[RAttachmentMeta] =
@ -32,8 +39,12 @@ case class ItemData(
def findDates(rm: RAttachmentMeta): Vector[NerDateLabel] =
dateLabels.find(m => m.rm.id == rm.id).map(_.dates).getOrElse(Vector.empty)
def mapMeta(attachId: Ident, f: RAttachmentMeta => RAttachmentMeta): ItemData = {
val item = changeMeta(attachId, f)
def mapMeta(
attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta
): ItemData = {
val item = changeMeta(attachId, lang, f)
val next = metas.map(a => if (a.id == attachId) item else a)
copy(metas = next)
}
@ -43,13 +54,14 @@ case class ItemData(
def changeMeta(
attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta
): RAttachmentMeta =
f(findOrCreate(attachId))
f(findOrCreate(attachId, lang))
def findOrCreate(attachId: Ident): RAttachmentMeta =
def findOrCreate(attachId: Ident, lang: Language): RAttachmentMeta =
metas.find(_.id == attachId).getOrElse {
RAttachmentMeta.empty(attachId)
RAttachmentMeta.empty(attachId, lang)
}
}
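Note: classifyProposals are kept separate from the NER-based proposals and only fill in proposal types for which NER found nothing (see the fillEmptyFrom call in LinkProposal below). A standalone sketch of that merge, using a plain Map instead of MetaProposalList:

// Standalone sketch of the "fill empty from" idea: NER-based proposals win,
// classifier proposals only fill proposal types that are still missing.
object FillEmptySketch extends App {
  type ProposalType = String

  def fillEmptyFrom(
      ner: Map[ProposalType, String],
      classifier: Map[ProposalType, String]
  ): Map[ProposalType, String] =
    classifier ++ ner // keys present in `ner` override the classifier ones

  val ner        = Map("corr-org" -> "ACME Corp (from NER)")
  val classified = Map("corr-org" -> "ACME Corp (classifier)", "conc-person" -> "Jane Doe")

  val merged = fillEmptyFrom(ner, classified)
  assert(merged("corr-org") == "ACME Corp (from NER)") // NER result kept
  assert(merged("conc-person") == "Jane Doe")          // gap filled by the classifier
}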

View File

@ -24,6 +24,7 @@ object LinkProposal {
.flatten(data.metas.map(_.proposals))
.filter(_.proposalType != MetaProposalType.DocDate)
.sortByWeights
.fillEmptyFrom(data.classifyProposals)
ctx.logger.info(s"Starting linking proposals") *>
MetaProposalType.all

View File

@ -41,7 +41,7 @@ object ProcessItem {
regexNer: RegexNerFile[F]
)(item: ItemData): Task[F, ProcessItemArgs, ItemData] =
TextAnalysis[F](cfg.textAnalysis, analyser, regexNer)(item)
.flatMap(FindProposal[F](cfg.processing))
.flatMap(FindProposal[F](cfg.textAnalysis))
.flatMap(EvalProposals[F])
.flatMap(SaveProposals[F])

View File

@ -65,6 +65,8 @@ object ReProcessItem {
Vector.empty,
asrcMap.view.mapValues(_.fileId).toMap,
MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil
)).getOrElseF(
Sync[F].raiseError(new Exception(s"Item not found: ${ctx.args.itemId.id}"))

View File

@ -4,21 +4,51 @@ import cats.effect.Sync
import cats.implicits._
import docspell.common._
import docspell.joex.scheduler.Task
import docspell.joex.scheduler.{Context, Task}
import docspell.store.AddResult
import docspell.store.records._
/** Saves the proposals in the database
*/
object SaveProposals {
type Args = ProcessItemArgs
def apply[F[_]: Sync](data: ItemData): Task[F, ProcessItemArgs, ItemData] =
def apply[F[_]: Sync](data: ItemData): Task[F, Args, ItemData] =
Task { ctx =>
ctx.logger.info("Storing proposals") *>
data.metas
for {
_ <- ctx.logger.info("Storing proposals")
_ <- data.metas
.traverse(rm =>
ctx.logger.debug(s"Storing attachment proposals: ${rm.proposals}") *>
ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
ctx.logger.debug(
s"Storing attachment proposals: ${rm.proposals}"
) *> ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
)
.map(_ => data)
_ <-
if (data.classifyProposals.isEmpty && data.classifyTags.isEmpty) 0.pure[F]
else saveItemProposal(ctx, data)
} yield data
}
def saveItemProposal[F[_]: Sync](ctx: Context[F, Args], data: ItemData): F[Unit] = {
def upsert(v: RItemProposal): F[Int] =
ctx.store.add(RItemProposal.insert(v), RItemProposal.exists(v.itemId)).flatMap {
case AddResult.Success => 1.pure[F]
case AddResult.EntityExists(_) =>
ctx.store.transact(RItemProposal.update(v))
case AddResult.Failure(ex) =>
ctx.logger.warn(s"Could not store item proposals: ${ex.getMessage}") *> 0
.pure[F]
}
for {
_ <- ctx.logger.debug(s"Storing classifier proposals: ${data.classifyProposals}")
tags <- ctx.store.transact(
RTag.findAllByNameOrId(data.classifyTags, ctx.args.meta.collective)
)
tagRefs = tags.map(t => IdRef(t.tagId, t.name))
now <- Timestamp.current[F]
value = RItemProposal(data.item.id, data.classifyProposals, tagRefs.toList, now)
_ <- upsert(value)
} yield ()
}
}
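Note: item-level classifier proposals are upserted: an insert is attempted first, an existing row is updated instead, and failures are only logged. A standalone sketch of that pattern, simulated with an in-memory map instead of the AddResult/database machinery:

// Standalone sketch of the insert-or-update pattern used for RItemProposal.
object UpsertSketch extends App {
  final case class ItemProposal(itemId: String, payload: String)

  val table = scala.collection.mutable.Map.empty[String, ItemProposal]

  def insert(v: ItemProposal): Either[String, Unit] =
    if (table.contains(v.itemId)) Left("entity exists")
    else { table.update(v.itemId, v); Right(()) }

  def upsert(v: ItemProposal): Unit =
    insert(v) match {
      case Right(()) => ()                                // AddResult.Success
      case Left(_) if table.contains(v.itemId) =>
        table.update(v.itemId, v)                         // AddResult.EntityExists -> update
      case Left(err) =>
        println(s"Could not store item proposals: $err")  // AddResult.Failure -> only log
    }

  upsert(ItemProposal("item-1", "v1"))
  upsert(ItemProposal("item-1", "v2"))
  assert(table("item-1").payload == "v2")
}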

View File

@ -45,7 +45,8 @@ object SetGivenData {
Task { ctx =>
val itemId = data.item.id
val collective = ctx.args.meta.collective
val tags = (ctx.args.meta.tags.getOrElse(Nil) ++ data.tags).distinct
val tags =
(ctx.args.meta.tags.getOrElse(Nil) ++ data.tags ++ data.classifyTags).distinct
for {
_ <- ctx.logger.info(s"Set tags from given data: ${tags}")
e <- ops.linkTags(itemId, tags, collective).attempt

View File

@ -1,24 +1,20 @@
package docspell.joex.process
import cats.data.OptionT
import cats.Traverse
import cats.effect._
import cats.implicits._
import docspell.analysis.TextAnalyser
import docspell.analysis.nlp.ClassifierModel
import docspell.analysis.nlp.StanfordNerSettings
import docspell.analysis.nlp.TextClassifier
import docspell.analysis.classifier.TextClassifier
import docspell.analysis.{NlpSettings, TextAnalyser}
import docspell.common.MetaProposal.Candidate
import docspell.common._
import docspell.joex.Config
import docspell.joex.analysis.RegexNerFile
import docspell.joex.learn.LearnClassifierTask
import docspell.joex.learn.{ClassifierName, Classify, LearnClassifierTask}
import docspell.joex.process.ItemData.AttachmentDates
import docspell.joex.scheduler.Context
import docspell.joex.scheduler.Task
import docspell.store.records.RAttachmentMeta
import docspell.store.records.RClassifierSetting
import bitpeace.RangeDef
import docspell.store.records.{RAttachmentMeta, RClassifierSetting}
object TextAnalysis {
type Args = ProcessItemArgs
@ -41,13 +37,27 @@ object TextAnalysis {
_ <- t.traverse(m =>
ctx.store.transact(RAttachmentMeta.updateLabels(m._1.id, m._1.nerlabels))
)
v = t.toVector
autoTagEnabled <- getActiveAutoTag(ctx, cfg)
tag <-
if (autoTagEnabled) predictTags(ctx, cfg, item.metas, analyser.classifier)
else List.empty[String].pure[F]
classProposals <-
if (cfg.classification.enabled)
predictItemEntities(ctx, cfg, item.metas, analyser.classifier)
else MetaProposalList.empty.pure[F]
e <- s
_ <- ctx.logger.info(s"Text-Analysis finished in ${e.formatExact}")
v = t.toVector
tag <- predictTag(ctx, cfg, item.metas, analyser.classifier(ctx.blocker)).value
} yield item
.copy(metas = v.map(_._1), dateLabels = v.map(_._2))
.appendTags(tag.toSeq)
.copy(
metas = v.map(_._1),
dateLabels = v.map(_._2),
classifyProposals = classProposals,
classifyTags = tag
)
}
def annotateAttachment[F[_]: Sync](
@ -55,7 +65,7 @@ object TextAnalysis {
analyser: TextAnalyser[F],
nerFile: RegexNerFile[F]
)(rm: RAttachmentMeta): F[(RAttachmentMeta, AttachmentDates)] = {
val settings = StanfordNerSettings(ctx.args.meta.language, false, None)
val settings = NlpSettings(ctx.args.meta.language, false, None)
for {
customNer <- nerFile.makeFile(ctx.args.meta.collective)
sett = settings.copy(regexNer = customNer)
@ -68,44 +78,84 @@ object TextAnalysis {
} yield (rm.copy(nerlabels = labels.all.toList), AttachmentDates(rm, labels.dates))
}
def predictTag[F[_]: Sync: ContextShift](
def predictTags[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis,
metas: Vector[RAttachmentMeta],
classifier: TextClassifier[F]
): OptionT[F, String] =
for {
model <- findActiveModel(ctx, cfg)
_ <- OptionT.liftF(ctx.logger.info(s"Guessing tag …"))
text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
modelData =
ctx.store.bitpeace
.get(model.id)
.unNoneTerminate
.through(ctx.store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(cfg.workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, ctx.blocker))
.compile
.drain
.flatMap(_ => classifier.classify(ctx.logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
_ <- OptionT.liftF(ctx.logger.debug(s"Guessed tag: ${cls}"))
} yield cls
): F[List[String]] = {
val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
val classifyWith: ClassifierName => F[Option[String]] =
makeClassify(ctx, cfg, classifier)(text)
private def findActiveModel[F[_]: Sync](
for {
names <- ctx.store.transact(
ClassifierName.findTagClassifiers(ctx.args.meta.collective)
)
_ <- ctx.logger.debug(s"Guessing tags for ${names.size} categories")
tags <- names.traverse(classifyWith)
} yield tags.flatten
}
def predictItemEntities[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): OptionT[F, Ident] =
(if (cfg.classification.enabled)
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.meta.collective)))
.filter(_.enabled)
.mapFilter(_.fileId)
else
OptionT.none[F, Ident]).orElse(
OptionT.liftF(ctx.logger.info("Classification is disabled.")) *> OptionT
.none[F, Ident]
cfg: Config.TextAnalysis,
metas: Vector[RAttachmentMeta],
classifier: TextClassifier[F]
): F[MetaProposalList] = {
val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
def classifyWith(
cname: ClassifierName,
mtype: MetaProposalType
): F[Option[MetaProposal]] =
for {
label <- makeClassify(ctx, cfg, classifier)(text).apply(cname)
} yield label.map(str =>
MetaProposal(mtype, Candidate(IdRef(Ident.unsafe(""), str), Set.empty))
)
Traverse[List]
.sequence(
List(
classifyWith(ClassifierName.correspondentOrg, MetaProposalType.CorrOrg),
classifyWith(ClassifierName.correspondentPerson, MetaProposalType.CorrPerson),
classifyWith(ClassifierName.concernedPerson, MetaProposalType.ConcPerson),
classifyWith(ClassifierName.concernedEquip, MetaProposalType.ConcEquip)
)
)
.map(_.flatten)
.map(MetaProposalList.apply)
}
private def makeClassify[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis,
classifier: TextClassifier[F]
)(text: String): ClassifierName => F[Option[String]] =
Classify[F](
ctx.blocker,
ctx.logger,
cfg.workingDir,
ctx.store,
classifier,
ctx.args.meta.collective,
text
)
private def getActiveAutoTag[F[_]: Sync](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): F[Boolean] =
if (cfg.classification.enabled)
ctx.store
.transact(RClassifierSetting.findById(ctx.args.meta.collective))
.map(_.exists(_.autoTagEnabled))
.flatTap(enabled =>
if (enabled) ().pure[F]
else ctx.logger.info("Classification is disabled. Check config or settings.")
)
else
ctx.logger.info("Classification is disabled.") *> false.pure[F]
}

View File

@ -46,10 +46,14 @@ object TextExtraction {
)
_ <- fts.indexData(ctx.logger, (idxItem +: txt.map(_.td)).toSeq: _*)
dur <- start
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}")
extractedTags = txt.flatMap(_.tags).distinct.toList
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}.")
_ <-
if (extractedTags.isEmpty) ().pure[F]
else ctx.logger.debug(s"Found tags in file: $extractedTags")
} yield item
.copy(metas = txt.map(_.am))
.appendTags(txt.flatMap(_.tags).distinct.toList)
.appendTags(extractedTags)
}
// -- helpers
@ -78,7 +82,7 @@ object TextExtraction {
pair._2
)
val rm = item.findOrCreate(ra.id)
val rm = item.findOrCreate(ra.id, lang)
rm.content match {
case Some(_) =>
ctx.logger.info("TextExtraction skipped, since text is already available.") *>
@ -102,6 +106,7 @@ object TextExtraction {
res <- extractTextFallback(ctx, cfg, ra, lang)(fids)
meta = item.changeMeta(
ra.id,
lang,
rm =>
rm.setContentIfEmpty(
res.map(_.appendPdfMetaToText.text.trim).filter(_.nonEmpty)

View File

@ -9,7 +9,7 @@ servers:
description: Current host
paths:
/api/info:
/api/info/version:
get:
tags: [ Api Info ]
summary: Get basic information about this software.

View File

@ -4850,14 +4850,11 @@ components:
description: |
Settings for learning a document classifier.
required:
- enabled
- schedule
- itemCount
- categoryList
- listType
properties:
enabled:
type: boolean
category:
type: string
itemCount:
type: integer
format: int32
@ -4867,6 +4864,16 @@ components:
schedule:
type: string
format: calevent
categoryList:
type: array
items:
type: string
listType:
type: string
format: listtype
enum:
- blacklist
- whitelist
SourceList:
description: |

View File

@ -6,7 +6,7 @@ import cats.implicits._
import docspell.backend.BackendApp
import docspell.backend.auth.AuthToken
import docspell.backend.ops.OCollective
import docspell.common.MakePreviewArgs
import docspell.common.{ListType, MakePreviewArgs}
import docspell.restapi.model._
import docspell.restserver.conv.Conversions
import docspell.restserver.http4s._
@ -44,10 +44,10 @@ object CollectiveRoutes {
settings.integrationEnabled,
Some(
OCollective.Classifier(
settings.classifier.enabled,
settings.classifier.schedule,
settings.classifier.itemCount,
settings.classifier.category
settings.classifier.categoryList,
settings.classifier.listType
)
)
)
@ -65,12 +65,12 @@ object CollectiveRoutes {
c.language,
c.integrationEnabled,
ClassifierSetting(
c.classifier.map(_.enabled).getOrElse(false),
c.classifier.flatMap(_.category),
c.classifier.map(_.itemCount).getOrElse(0),
c.classifier
.map(_.schedule)
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00"))
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00")),
c.classifier.map(_.categories).getOrElse(Nil),
c.classifier.map(_.listType).getOrElse(ListType.whitelist)
)
)
)

View File

@ -0,0 +1,35 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
update "attachmentmeta"
set "language" = 'deu'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'deu'
);
update "attachmentmeta"
set "language" = 'eng'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'eng'
);
update "attachmentmeta"
set "language" = 'fra'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'fra'
);

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select random_uuid() as "id", "cid", concat('tagcategory-', "category") as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -0,0 +1,14 @@
ALTER TABLE `attachmentmeta`
ADD COLUMN (`language` varchar(254));
update `attachmentmeta` `m`
inner join (
select `m`.`attachid`, `c`.`doclang`
from `attachmentmeta` m
inner join `attachment` a on `a`.`attachid` = `m`.`attachid`
inner join `item` i on `a`.`itemid` = `i`.`itemid`
inner join `collective` c on `c`.`cid` = `i`.`cid`
) as `c`
set `m`.`language` = `c`.`doclang`
where `m`.`attachid` = `c`.`attachid` and `m`.`language` is null;

View File

@ -0,0 +1,48 @@
CREATE TABLE `classifier_model`(
`id` varchar(254) not null primary key,
`cid` varchar(254) not null,
`name` varchar(254) not null,
`file_id` varchar(254) not null,
`created` timestamp not null,
foreign key (`cid`) references `collective`(`cid`),
foreign key (`file_id`) references `filemeta`(`id`),
unique (`cid`, `name`)
);
insert into `classifier_model`
select md5(rand()) as id, `cid`,concat('tagcategory-', `category`) as `name`, `file_id`, `created`
from `classifier_setting`
where `file_id` is not null;
alter table `classifier_setting`
add column (`categories` mediumtext);
alter table `classifier_setting`
add column (`category_list_type` varchar(254));
update `classifier_setting`
set `category_list_type` = 'whitelist';
update `classifier_setting`
set `categories` = concat('["', `category`, '"]')
where category is not null;
update `classifier_setting`
set `categories` = '[]'
where category is null;
alter table `classifier_setting`
drop column `category`;
-- mariadb requires dropping the constraint manually when dropping a column
alter table `classifier_setting`
drop constraint `classifier_setting_ibfk_2`;
alter table `classifier_setting`
drop column `file_id`;
ALTER TABLE `classifier_setting`
MODIFY `categories` mediumtext NOT NULL;
ALTER TABLE `classifier_setting`
MODIFY `category_list_type` varchar(254) NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE `item_proposal` (
`itemid` varchar(254) not null primary key,
`classifier_proposals` mediumtext not null,
`classifier_tags` mediumtext not null,
`created` timestamp not null,
foreign key (`itemid`) references `item`(`itemid`)
);

View File

@ -0,0 +1,15 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
with
"attachlang" as (
select "m"."attachid", "m"."language", "c"."doclang"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
)
update "attachmentmeta" as "m"
set "language" = "c"."doclang"
from "attachlang" c
where "m"."attachid" = "c"."attachid" and "m"."language" is null;

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select md5(random()::text) as id, "cid",'tagcategory-' || "category" as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -86,6 +86,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaItemProposalList: Meta[MetaProposalList] =
jsonMeta[MetaProposalList]
implicit val metaIdRef: Meta[List[IdRef]] =
jsonMeta[List[IdRef]]
implicit val metaLanguage: Meta[Language] =
Meta[String].imap(Language.unsafe)(_.iso3)
@ -97,6 +100,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaCustomFieldType: Meta[CustomFieldType] =
Meta[String].timap(CustomFieldType.unsafe)(_.name)
implicit val metaListType: Meta[ListType] =
Meta[String].timap(ListType.unsafeFromString)(_.name)
}
object DoobieMeta extends DoobieMeta {

View File

@ -1,5 +1,7 @@
package docspell.store.qb
import cats.data.NonEmptyList
sealed trait DBFunction {}
object DBFunction {
@ -31,6 +33,8 @@ object DBFunction {
case class Sum(expr: SelectExpr) extends DBFunction
case class Concat(exprs: NonEmptyList[SelectExpr]) extends DBFunction
sealed trait Operator
object Operator {
case object Plus extends Operator

View File

@ -98,6 +98,9 @@ trait DSL extends DoobieMeta {
def substring(expr: SelectExpr, start: Int, length: Int): DBFunction =
DBFunction.Substring(expr, start, length)
def concat(expr: SelectExpr, exprs: SelectExpr*): DBFunction =
DBFunction.Concat(Nel.of(expr, exprs: _*))
def lit[A](value: A)(implicit P: Put[A]): SelectExpr.SelectLit[A] =
SelectExpr.SelectLit(value, None)

View File

@ -32,6 +32,10 @@ object DBFunctionBuilder extends CommonBuilder {
case DBFunction.Substring(expr, start, len) =>
sql"SUBSTRING(" ++ SelectExprBuilder.build(expr) ++ fr" FROM $start FOR $len)"
case DBFunction.Concat(exprs) =>
val inner = exprs.map(SelectExprBuilder.build).toList.reduce(_ ++ comma ++ _)
sql"CONCAT(" ++ inner ++ sql")"
case DBFunction.Calc(op, left, right) =>
SelectExprBuilder.build(left) ++
buildOperator(op) ++

View File

@ -21,6 +21,7 @@ object QAttachment {
private val item = RItem.as("i")
private val am = RAttachmentMeta.as("am")
private val c = RCollective.as("c")
private val im = RItemProposal.as("im")
def deletePreview[F[_]: Sync](store: Store[F])(attachId: Ident): F[Int] = {
val findPreview =
@ -118,17 +119,27 @@ object QAttachment {
} yield ns.sum
def getMetaProposals(itemId: Ident, coll: Ident): ConnectionIO[MetaProposalList] = {
val q = Select(
am.proposals.s,
val qa = Select(
select(am.proposals),
from(am)
.innerJoin(a, a.id === am.id)
.innerJoin(item, a.itemId === item.id),
a.itemId === itemId && item.cid === coll
).build
val qi = Select(
select(im.classifyProposals),
from(im)
.innerJoin(item, item.id === im.itemId),
item.cid === coll && im.itemId === itemId
).build
for {
ml <- q.query[MetaProposalList].to[Vector]
} yield MetaProposalList.flatten(ml)
mla <- qa.query[MetaProposalList].to[Vector]
mli <- qi.query[MetaProposalList].to[Vector]
} yield MetaProposalList
.flatten(mla)
.insertSecond(MetaProposalList.flatten(mli))
}
def getAttachmentMeta(
@ -160,7 +171,15 @@ object QAttachment {
chunkSize: Int
): Stream[ConnectionIO, ContentAndName] =
Select(
select(a.id, a.itemId, item.cid, item.folder, c.language, a.name, am.content),
select(
a.id.s,
a.itemId.s,
item.cid.s,
item.folder.s,
coalesce(am.language.s, c.language.s).s,
a.name.s,
am.content.s
),
from(a)
.innerJoin(am, am.id === a.id)
.innerJoin(item, item.id === a.itemId)

View File

@ -1,10 +1,8 @@
package docspell.store.queries
import cats.data.OptionT
import fs2.Stream
import docspell.common.ContactKind
import docspell.common.{Direction, Ident}
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import docspell.store.records._
@ -17,6 +15,7 @@ object QCollective {
private val t = RTag.as("t")
private val ro = ROrganization.as("o")
private val rp = RPerson.as("p")
private val re = REquipment.as("e")
private val rc = RContact.as("c")
private val i = RItem.as("i")
@ -25,13 +24,37 @@ object QCollective {
val empty = Names(Vector.empty, Vector.empty, Vector.empty)
}
def allNames(collective: Ident): ConnectionIO[Names] =
(for {
orgs <- OptionT.liftF(ROrganization.findAllRef(collective, None, _.name))
pers <- OptionT.liftF(RPerson.findAllRef(collective, None, _.name))
equp <- OptionT.liftF(REquipment.findAll(collective, None, _.name))
} yield Names(orgs.map(_.name), pers.map(_.name), equp.map(_.name)))
.getOrElse(Names.empty)
def allNames(collective: Ident, maxEntries: Int): ConnectionIO[Names] = {
val created = Column[Timestamp]("created", TableDef(""))
union(
Select(
select(ro.name.s, lit(1).as("kind"), ro.created.as(created)),
from(ro),
ro.cid === collective
),
Select(
select(rp.name.s, lit(2).as("kind"), rp.created.as(created)),
from(rp),
rp.cid === collective
),
Select(
select(re.name.s, lit(3).as("kind"), re.created.as(created)),
from(re),
re.cid === collective
)
).orderBy(created.desc)
.limit(Batch.limit(maxEntries))
.build
.query[(String, Int)]
.streamWithChunkSize(maxEntries)
.fold(Names.empty) { case (names, (name, kind)) =>
if (kind == 1) names.copy(org = names.org :+ name)
else if (kind == 2) names.copy(pers = names.pers :+ name)
else names.copy(equip = names.equip :+ name)
}
.compile
.lastOrError
}
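Note: allNames now fetches organization, person and equipment names in a single UNION query tagged with a kind column (1, 2, 3), newest first and capped at maxEntries, and folds the resulting stream back into the three name lists. A standalone sketch of that fold with assumed rows:

// Standalone sketch of the fold at the end of allNames above.
object AllNamesSketch extends App {
  final case class Names(org: Vector[String], pers: Vector[String], equip: Vector[String])
  val empty = Names(Vector.empty, Vector.empty, Vector.empty)

  // (name, kind) pairs as returned by the UNION query -- assumed example rows
  val rows = List(("ACME Corp", 1), ("Jane Doe", 2), ("Printer X", 3), ("Big Bank", 1))

  val names = rows.foldLeft(empty) { case (acc, (name, kind)) =>
    if (kind == 1) acc.copy(org = acc.org :+ name)
    else if (kind == 2) acc.copy(pers = acc.pers :+ name)
    else acc.copy(equip = acc.equip :+ name)
  }

  assert(names.org == Vector("ACME Corp", "Big Bank"))
  assert(names.pers == Vector("Jane Doe"))
  assert(names.equip == Vector("Printer X"))
}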
case class InsightData(
incoming: Int,

View File

@ -441,8 +441,9 @@ object QItem {
tn <- store.transact(RTagItem.deleteItemTags(itemId))
mn <- store.transact(RSentMail.deleteByItem(itemId))
cf <- store.transact(RCustomFieldValue.deleteByItem(itemId))
im <- store.transact(RItemProposal.deleteByItem(itemId))
n <- store.transact(RItem.deleteByIdAndCollective(itemId, collective))
} yield tn + rn + n + mn + cf
} yield tn + rn + n + mn + cf + im
private def findByFileIdsQuery(
fileMetaIds: Nel[Ident],
@ -543,11 +544,13 @@ object QItem {
def findAllNewesFirst(
collective: Ident,
chunkSize: Int
chunkSize: Int,
limit: Batch
): Stream[ConnectionIO, Ident] = {
val i = RItem.as("i")
Select(i.id.s, from(i), i.cid === collective && i.state === ItemState.confirmed)
.orderBy(i.created.desc)
.limit(limit)
.build
.query[Ident]
.streamWithChunkSize(chunkSize)
@ -557,6 +560,7 @@ object QItem {
collective: Ident,
itemId: Ident,
tagCategory: String,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] = {
val tags = TableDef("tags").as("tt")
@ -564,7 +568,7 @@ object QItem {
val tagsTid = Column[Ident]("tid", tags)
val tagsName = Column[String]("tname", tags)
val q =
readTextAndTag(collective, itemId, pageSep) {
withCte(
tags -> Select(
select(ti.itemId.as(tagsItem), tag.tid.as(tagsTid), tag.name.as(tagsName)),
@ -574,25 +578,98 @@ object QItem {
)
)(
Select(
select(m.content, tagsTid, tagsName),
select(substring(m.content.s, 0, maxLen).s, tagsTid.s, tagsName.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, a.id === m.id)
.leftJoin(tags, tagsItem === i.id),
i.id === itemId && i.cid === collective && m.content.isNotNull && m.content <> ""
)
).build
)
}
}
def resolveTextAndCorrOrg(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, org.oid.s, org.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(org, org.oid === i.corrOrg),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndCorrPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.corrPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.concPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcEquip(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, equip.eid.s, equip.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(equip, equip.eid === i.concEquipment),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
private def readTextAndTag(collective: Ident, itemId: Ident, pageSep: String)(
q: Select
): ConnectionIO[TextAndTag] =
for {
_ <- logger.ftrace[ConnectionIO](
s"query: $q (${itemId.id}, ${collective.id}, ${tagCategory})"
s"query: $q (${itemId.id}, ${collective.id})"
)
texts <- q.query[(String, Option[TextAndTag.TagName])].to[List]
texts <- q.build.query[(String, Option[TextAndTag.TagName])].to[List]
_ <- logger.ftrace[ConnectionIO](
s"Got ${texts.size} text and tag entries for item ${itemId.id}"
)
tag = texts.headOption.flatMap(_._2)
txt = texts.map(_._1).mkString(pageSep)
} yield TextAndTag(itemId, txt, tag)
}
}

View File

@ -15,7 +15,8 @@ case class RAttachmentMeta(
content: Option[String],
nerlabels: List[NerLabel],
proposals: MetaProposalList,
pages: Option[Int]
pages: Option[Int],
language: Option[Language]
) {
def setContentIfEmpty(txt: Option[String]): RAttachmentMeta =
@ -27,8 +28,8 @@ case class RAttachmentMeta(
}
object RAttachmentMeta {
def empty(attachId: Ident) =
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None)
def empty(attachId: Ident, lang: Language) =
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None, Some(lang))
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "attachmentmeta"
@ -38,7 +39,16 @@ object RAttachmentMeta {
val nerlabels = Column[List[NerLabel]]("nerlabels", this)
val proposals = Column[MetaProposalList]("itemproposals", this)
val pages = Column[Int]("page_count", this)
val all = NonEmptyList.of[Column[_]](id, content, nerlabels, proposals, pages)
val language = Column[Language]("language", this)
val all =
NonEmptyList.of[Column[_]](
id,
content,
nerlabels,
proposals,
pages,
language
)
}
val T = Table(None)
@ -49,7 +59,7 @@ object RAttachmentMeta {
DML.insert(
T,
T.all,
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages}"
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages},${v.language}"
)
def exists(attachId: Ident): ConnectionIO[Boolean] =
@ -90,13 +100,14 @@ object RAttachmentMeta {
)
)
def updateProposals(mid: Ident, plist: MetaProposalList): ConnectionIO[Int] =
def updateProposals(
mid: Ident,
plist: MetaProposalList
): ConnectionIO[Int] =
DML.update(
T,
T.id === mid,
DML.set(
T.proposals.setTo(plist)
)
DML.set(T.proposals.setTo(plist))
)
def updatePageCount(mid: Ident, pageCount: Option[Int]): ConnectionIO[Int] =

View File

@ -0,0 +1,102 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.effect._
import cats.implicits._
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
final case class RClassifierModel(
id: Ident,
cid: Ident,
name: String,
fileId: Ident,
created: Timestamp
) {}
object RClassifierModel {
def createNew[F[_]: Sync](
cid: Ident,
name: String,
fileId: Ident
): F[RClassifierModel] =
for {
id <- Ident.randomId[F]
now <- Timestamp.current[F]
} yield RClassifierModel(id, cid, name, fileId, now)
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_model"
val id = Column[Ident]("id", this)
val cid = Column[Ident]("cid", this)
val name = Column[String]("name", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](id, cid, name, fileId, created)
}
def as(alias: String): Table =
Table(Some(alias))
val T = Table(None)
def insert(v: RClassifierModel): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.id},${v.cid},${v.name},${v.fileId},${v.created}"
)
def updateFile(coll: Ident, name: String, fid: Ident): ConnectionIO[Int] =
for {
now <- Timestamp.current[ConnectionIO]
n <- DML.update(
T,
T.cid === coll && T.name === name,
DML.set(T.fileId.setTo(fid), T.created.setTo(now))
)
k <-
if (n == 0) createNew[ConnectionIO](coll, name, fid).flatMap(insert)
else 0.pure[ConnectionIO]
} yield n + k
def deleteById(id: Ident): ConnectionIO[Int] =
DML.delete(T, T.id === id)
def deleteAll(ids: List[Ident]): ConnectionIO[Int] =
NonEmptyList.fromList(ids) match {
case Some(nel) =>
DML.delete(T, T.id.in(nel))
case None =>
0.pure[ConnectionIO]
}
def findByName(cid: Ident, name: String): ConnectionIO[Option[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name === name).build
.query[RClassifierModel]
.option
def findAllByName(
cid: Ident,
names: NonEmptyList[String]
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.in(names)).build
.query[RClassifierModel]
.to[List]
def findAllByQuery(
cid: Ident,
nameQuery: String
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.like(nameQuery)).build
.query[RClassifierModel]
.to[List]
}

View File

@ -1,6 +1,6 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.data.{NonEmptyList, OptionT}
import cats.implicits._
import docspell.common._
@ -13,27 +13,38 @@ import doobie.implicits._
case class RClassifierSetting(
cid: Ident,
enabled: Boolean,
schedule: CalEvent,
category: String,
itemCount: Int,
fileId: Option[Ident],
created: Timestamp
) {}
created: Timestamp,
categoryList: List[String],
listType: ListType
) {
def autoTagEnabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categoryList.nonEmpty
}
}
object RClassifierSetting {
// the categoryList is stored as a json array
implicit val stringListMeta: Meta[List[String]] =
jsonMeta[List[String]]
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_setting"
val cid = Column[Ident]("cid", this)
val enabled = Column[Boolean]("enabled", this)
val schedule = Column[CalEvent]("schedule", this)
val category = Column[String]("category", this)
val itemCount = Column[Int]("item_count", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this)
val cid = Column[Ident]("cid", this)
val schedule = Column[CalEvent]("schedule", this)
val itemCount = Column[Int]("item_count", this)
val created = Column[Timestamp]("created", this)
val categories = Column[List[String]]("categories", this)
val listType = Column[ListType]("category_list_type", this)
val all = NonEmptyList
.of[Column[_]](cid, enabled, schedule, category, itemCount, fileId, created)
.of[Column[_]](cid, schedule, itemCount, created, categories, listType)
}
val T = Table(None)
@ -44,35 +55,19 @@ object RClassifierSetting {
DML.insert(
T,
T.all,
fr"${v.cid},${v.enabled},${v.schedule},${v.category},${v.itemCount},${v.fileId},${v.created}"
fr"${v.cid},${v.schedule},${v.itemCount},${v.created},${v.categoryList},${v.listType}"
)
def updateAll(v: RClassifierSetting): ConnectionIO[Int] =
DML.update(
T,
T.cid === v.cid,
DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule),
T.category.setTo(v.category),
T.itemCount.setTo(v.itemCount),
T.fileId.setTo(v.fileId)
)
)
def updateFile(coll: Ident, fid: Ident): ConnectionIO[Int] =
DML.update(T, T.cid === coll, DML.set(T.fileId.setTo(fid)))
def updateSettings(v: RClassifierSetting): ConnectionIO[Int] =
def update(v: RClassifierSetting): ConnectionIO[Int] =
for {
n1 <- DML.update(
T,
T.cid === v.cid,
DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule),
T.itemCount.setTo(v.itemCount),
T.category.setTo(v.category)
T.categories.setTo(v.categoryList),
T.listType.setTo(v.listType)
)
)
n2 <- if (n1 <= 0) insert(v) else 0.pure[ConnectionIO]
@ -86,27 +81,62 @@ object RClassifierSetting {
def delete(coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.cid === coll)
/** Finds tag categories that exist and match the classifier setting.
* If the setting contains a blacklist, those categories are removed from
* the existing categories. If it is a whitelist, the intersection is
* returned.
*/
def getActiveCategories(coll: Ident): ConnectionIO[List[String]] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
res = sett.listType match {
case ListType.Blacklist =>
cats.diff(sett.categoryList)
case ListType.Whitelist =>
sett.categoryList.intersect(cats)
}
} yield res).getOrElse(Nil)
/** Checks the json array of tag categories and removes those that are not present anymore. */
def fixCategoryList(coll: Ident): ConnectionIO[Int] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
fixed = sett.categoryList.intersect(cats)
n <- OptionT.liftF(
if (fixed == sett.categoryList) 0.pure[ConnectionIO]
else DML.update(T, T.cid === coll, DML.set(T.categories.setTo(fixed)))
)
} yield n).getOrElse(0)
case class Classifier(
enabled: Boolean,
schedule: CalEvent,
itemCount: Int,
category: Option[String]
categories: List[String],
listType: ListType
) {
def enabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categories.nonEmpty
}
def toRecord(coll: Ident, created: Timestamp): RClassifierSetting =
RClassifierSetting(
coll,
enabled,
schedule,
category.getOrElse(""),
itemCount,
None,
created
created,
categories,
listType
)
}
object Classifier {
def fromRecord(r: RClassifierSetting): Classifier =
Classifier(r.enabled, r.schedule, r.itemCount, r.category.some)
Classifier(r.schedule, r.itemCount, r.categoryList, r.listType)
}
}

View File

@ -1,6 +1,6 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.data.{NonEmptyList, OptionT}
import fs2.Stream
import docspell.common._
@ -73,13 +73,24 @@ object RCollective {
.map(now => settings.classifier.map(_.toRecord(cid, now)))
n2 <- cls match {
case Some(cr) =>
RClassifierSetting.updateSettings(cr)
RClassifierSetting.update(cr)
case None =>
RClassifierSetting.delete(cid)
}
} yield n1 + n2
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
// this hides categories that have been deleted in the meantime
// they are finally removed from the json array once the learn classifier task is run
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] =
(for {
sett <- OptionT(getRawSettings(coll))
prev <- OptionT.fromOption[ConnectionIO](sett.classifier)
cats <- OptionT.liftF(RTag.listCategories(coll))
next = prev.copy(categories = prev.categories.intersect(cats))
} yield sett.copy(classifier = Some(next))).value
private def getRawSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
import RClassifierSetting.stringListMeta
val c = RCollective.as("c")
val cs = RClassifierSetting.as("cs")
@ -87,10 +98,10 @@ object RCollective {
select(
c.language.s,
c.integration.s,
cs.enabled.s,
cs.schedule.s,
cs.itemCount.s,
cs.category.s
cs.categories.s,
cs.listType.s
),
from(c).leftJoin(cs, cs.cid === c.id),
c.id === coll

View File

@ -0,0 +1,60 @@
package docspell.store.records
import cats.data.NonEmptyList
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
case class RItemProposal(
itemId: Ident,
classifyProposals: MetaProposalList,
classifyTags: List[IdRef],
created: Timestamp
)
object RItemProposal {
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "item_proposal"
val itemId = Column[Ident]("itemid", this)
val classifyProposals = Column[MetaProposalList]("classifier_proposals", this)
val classifyTags = Column[List[IdRef]]("classifier_tags", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](itemId, classifyProposals, classifyTags, created)
}
val T = Table(None)
def as(alias: String): Table =
Table(Some(alias))
def insert(v: RItemProposal): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.itemId},${v.classifyProposals},${v.classifyTags},${v.created}"
)
def update(v: RItemProposal): ConnectionIO[Int] =
DML.update(
T,
T.itemId === v.itemId,
DML.set(
T.classifyProposals.setTo(v.classifyProposals),
T.classifyTags.setTo(v.classifyTags)
)
)
def deleteByItem(itemId: Ident): ConnectionIO[Int] =
DML.delete(T, T.itemId === itemId)
def exists(itemId: Ident): ConnectionIO[Boolean] =
Select(select(countAll), from(T), T.itemId === itemId).build
.query[Int]
.unique
.map(_ > 0)
}

View File

@ -148,6 +148,13 @@ object RTag {
).orderBy(T.name.asc).build.query[RTag].to[List]
}
def listCategories(coll: Ident): ConnectionIO[List[String]] =
Select(
T.category.s,
from(T),
T.cid === coll && T.category.isNotNull
).distinct.build.query[String].to[List]
def delete(tagId: Ident, coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.tid === tagId && T.cid === coll)
}

View File

@ -11,35 +11,38 @@ import Api
import Api.Model.ClassifierSetting exposing (ClassifierSetting)
import Api.Model.TagList exposing (TagList)
import Comp.CalEventInput
import Comp.Dropdown
import Comp.FixedDropdown
import Comp.IntField
import Data.CalEvent exposing (CalEvent)
import Data.Flags exposing (Flags)
import Data.ListType exposing (ListType)
import Data.UiSettings exposing (UiSettings)
import Data.Validated exposing (Validated(..))
import Html exposing (..)
import Html.Attributes exposing (..)
import Html.Events exposing (onCheck)
import Http
import Markdown
import Util.Tag
type alias Model =
{ enabled : Bool
, categoryModel : Comp.FixedDropdown.Model String
, category : Maybe String
, scheduleModel : Comp.CalEventInput.Model
{ scheduleModel : Comp.CalEventInput.Model
, schedule : Validated CalEvent
, itemCountModel : Comp.IntField.Model
, itemCount : Maybe Int
, categoryListModel : Comp.Dropdown.Model String
, categoryListType : ListType
, categoryListTypeModel : Comp.FixedDropdown.Model ListType
}
type Msg
= GetTagsResp (Result Http.Error TagList)
| ScheduleMsg Comp.CalEventInput.Msg
| ToggleEnabled
| CategoryMsg (Comp.FixedDropdown.Msg String)
= ScheduleMsg Comp.CalEventInput.Msg
| ItemCountMsg Comp.IntField.Msg
| GetTagsResp (Result Http.Error TagList)
| CategoryListMsg (Comp.Dropdown.Msg String)
| CategoryListTypeMsg (Comp.FixedDropdown.Msg ListType)
init : Flags -> ClassifierSetting -> ( Model, Cmd Msg )
@ -52,13 +55,36 @@ init flags sett =
( cem, cec ) =
Comp.CalEventInput.init flags newSchedule
in
( { enabled = sett.enabled
, categoryModel = Comp.FixedDropdown.initString []
, category = sett.category
, scheduleModel = cem
( { scheduleModel = cem
, schedule = Data.Validated.Unknown newSchedule
, itemCountModel = Comp.IntField.init (Just 0) Nothing True "Item Count"
, itemCount = Just sett.itemCount
, categoryListModel =
let
mkOption s =
{ value = s, text = s, additional = "" }
minit =
Comp.Dropdown.makeModel
{ multiple = True
, searchable = \n -> n > 0
, makeOption = mkOption
, labelColor = \_ -> \_ -> "grey "
, placeholder = "Choose categories "
}
lm =
Comp.Dropdown.SetSelection sett.categoryList
( m_, _ ) =
Comp.Dropdown.update lm minit
in
m_
, categoryListType =
Data.ListType.fromString sett.listType
|> Maybe.withDefault Data.ListType.Whitelist
, categoryListTypeModel =
Comp.FixedDropdown.initMap Data.ListType.label Data.ListType.all
}
, Cmd.batch
[ Api.getTags flags "" GetTagsResp
@ -71,11 +97,11 @@ getSettings : Model -> Validated ClassifierSetting
getSettings model =
Data.Validated.map
(\sch ->
{ enabled = model.enabled
, category = model.category
, schedule =
{ schedule =
Data.CalEvent.makeEvent sch
, itemCount = Maybe.withDefault 0 model.itemCount
, listType = Data.ListType.toString model.categoryListType
, categoryList = Comp.Dropdown.getSelected model.categoryListModel
}
)
model.schedule
@ -89,18 +115,11 @@ update flags msg model =
categories =
Util.Tag.getCategories tl.items
|> List.sort
in
( { model
| categoryModel = Comp.FixedDropdown.initString categories
, category =
if model.category == Nothing then
List.head categories
else
model.category
}
, Cmd.none
)
lm =
Comp.Dropdown.SetOptions categories
in
update flags (CategoryListMsg lm) model
GetTagsResp (Err _) ->
( model, Cmd.none )
@ -121,28 +140,6 @@ update flags msg model =
, Cmd.map ScheduleMsg cc
)
ToggleEnabled ->
( { model | enabled = not model.enabled }
, Cmd.none
)
CategoryMsg lmsg ->
let
( mm, ma ) =
Comp.FixedDropdown.update lmsg model.categoryModel
in
( { model
| categoryModel = mm
, category =
if ma == Nothing then
model.category
else
ma
}
, Cmd.none
)
ItemCountMsg lmsg ->
let
( im, iv ) =
@ -155,39 +152,68 @@ update flags msg model =
, Cmd.none
)
CategoryListMsg lm ->
let
( m_, cmd_ ) =
Comp.Dropdown.update lm model.categoryListModel
in
( { model | categoryListModel = m_ }
, Cmd.map CategoryListMsg cmd_
)
view : Model -> Html Msg
view model =
CategoryListTypeMsg lm ->
let
( m_, sel ) =
Comp.FixedDropdown.update lm model.categoryListTypeModel
newListType =
Maybe.withDefault model.categoryListType sel
in
( { model
| categoryListTypeModel = m_
, categoryListType = newListType
}
, Cmd.none
)
view : UiSettings -> Model -> Html Msg
view settings model =
let
catListTypeItem =
Comp.FixedDropdown.Item
model.categoryListType
(Data.ListType.label model.categoryListType)
in
div []
[ div
[ class "field"
]
[ div [ class "ui checkbox" ]
[ input
[ type_ "checkbox"
, onCheck (\_ -> ToggleEnabled)
, checked model.enabled
]
[]
, label [] [ text "Enable classification" ]
, span [ class "small-info" ]
[ text "Disable document classification if not needed."
]
]
]
, div [ class "ui basic segment" ]
[ text "Document classification tries to predict a tag for new incoming documents. This "
, text "works by learning from existing documents in order to find common patterns within "
, text "the text. The more documents you have correctly tagged, the better. Learning is done "
, text "periodically based on a schedule and you need to specify a tag-group that should "
, text "be used for learning."
[ Markdown.toHtml [ class "ui basic segment" ]
"""
Auto-tagging works by learning from existing documents. The more
documents you have correctly tagged, the better. Learning is done
periodically based on a schedule. You can specify tag-groups that
should either be used (whitelist) or not used (blacklist) for
learning.
Use an empty whitelist to disable auto-tagging.
"""
, div [ class "field" ]
[ label [] [ text "Is the following a blacklist or whitelist?" ]
, Html.map CategoryListTypeMsg
(Comp.FixedDropdown.view (Just catListTypeItem) model.categoryListTypeModel)
]
, div [ class "field" ]
[ label [] [ text "Category" ]
, Html.map CategoryMsg
(Comp.FixedDropdown.viewString model.category
model.categoryModel
)
[ label []
[ case model.categoryListType of
Data.ListType.Whitelist ->
text "Include tag categories for learning"
Data.ListType.Blacklist ->
text "Exclude tag categories from learning"
]
, Html.map CategoryListMsg
(Comp.Dropdown.view settings model.categoryListModel)
]
, Html.map ItemCountMsg
(Comp.IntField.viewWithInfo

View File

@ -280,7 +280,7 @@ view flags settings model =
, ( "invisible hidden", not flags.config.showClassificationSettings )
]
]
[ text "Document Classifier"
[ text "Auto-Tagging"
]
, div
[ classList
@ -289,13 +289,10 @@ view flags settings model =
]
]
[ Html.map ClassifierSettingMsg
(Comp.ClassifierSettingsForm.view model.classifierModel)
(Comp.ClassifierSettingsForm.view settings model.classifierModel)
, div [ class "ui vertical segment" ]
[ button
[ classList
[ ( "ui small secondary basic button", True )
, ( "disabled", not model.classifierModel.enabled )
]
[ class "ui small secondary basic button"
, title "Starts a task to train a classifier"
, onClick StartClassifierTask
]

View File

@ -958,7 +958,6 @@ renderSuggestions model mkName idnames tagger =
]
, div [ class "menu" ] <|
(idnames
|> List.take 5
|> List.map (\p -> a [ class "item", href "#", onClick (tagger p) ] [ text (mkName p) ])
)
]
@ -969,7 +968,7 @@ renderOrgSuggestions : Model -> Html Msg
renderOrgSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.corrOrg)
(List.take 6 model.itemProposals.corrOrg)
SetCorrOrgSuggestion
@ -977,7 +976,7 @@ renderCorrPersonSuggestions : Model -> Html Msg
renderCorrPersonSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.corrPerson)
(List.take 6 model.itemProposals.corrPerson)
SetCorrPersonSuggestion
@ -985,7 +984,7 @@ renderConcPersonSuggestions : Model -> Html Msg
renderConcPersonSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.concPerson)
(List.take 6 model.itemProposals.concPerson)
SetConcPersonSuggestion
@ -993,7 +992,7 @@ renderConcEquipSuggestions : Model -> Html Msg
renderConcEquipSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.concEquipment)
(List.take 6 model.itemProposals.concEquipment)
SetConcEquipSuggestion
@ -1001,7 +1000,7 @@ renderItemDateSuggestions : Model -> Html Msg
renderItemDateSuggestions model =
renderSuggestions model
Util.Time.formatDate
(List.take 5 model.itemProposals.itemDate)
(List.take 6 model.itemProposals.itemDate)
SetItemDateSuggestion
@ -1009,7 +1008,7 @@ renderDueDateSuggestions : Model -> Html Msg
renderDueDateSuggestions model =
renderSuggestions model
Util.Time.formatDate
(List.take 5 model.itemProposals.dueDate)
(List.take 6 model.itemProposals.dueDate)
SetDueDateSuggestion

View File

@ -11,6 +11,17 @@ type Language
= German
| English
| French
| Italian
| Spanish
| Portuguese
| Czech
| Danish
| Finnish
| Norwegian
| Swedish
| Russian
| Romanian
| Dutch
fromString : String -> Maybe Language
@ -24,6 +35,39 @@ fromString str =
else if str == "fra" || str == "fr" || str == "french" then
Just French
else if str == "ita" || str == "it" || str == "italian" then
Just Italian
else if str == "spa" || str == "es" || str == "spanish" then
Just Spanish
else if str == "por" || str == "pt" || str == "portuguese" then
Just Portuguese
else if str == "ces" || str == "cs" || str == "czech" then
Just Czech
else if str == "dan" || str == "da" || str == "danish" then
Just Danish
else if str == "nld" || str == "nd" || str == "dutch" then
Just Dutch
else if str == "fin" || str == "fi" || str == "finnish" then
Just Finnish
else if str == "nor" || str == "no" || str == "norwegian" then
Just Norwegian
else if str == "swe" || str == "sv" || str == "swedish" then
Just Swedish
else if str == "rus" || str == "ru" || str == "russian" then
Just Russian
else if str == "ron" || str == "ro" || str == "romanian" then
Just Romanian
else
Nothing
@ -40,6 +84,39 @@ toIso3 lang =
French ->
"fra"
Italian ->
"ita"
Spanish ->
"spa"
Portuguese ->
"por"
Czech ->
"ces"
Danish ->
"dan"
Finnish ->
"fin"
Norwegian ->
"nor"
Swedish ->
"swe"
Russian ->
"rus"
Romanian ->
"ron"
Dutch ->
"nld"
toName : Language -> String
toName lang =
@ -53,7 +130,54 @@ toName lang =
French ->
"French"
Italian ->
"Italian"
Spanish ->
"Spanish"
Portuguese ->
"Portuguese"
Czech ->
"Czech"
Danish ->
"Danish"
Finnish ->
"Finnish"
Norwegian ->
"Norwegian"
Swedish ->
"Swedish"
Russian ->
"Russian"
Romanian ->
"Romanian"
Dutch ->
"Dutch"
all : List Language
all =
[ German, English, French ]
[ German
, English
, French
, Italian
, Spanish
, Portuguese
, Czech
, Dutch
, Danish
, Finnish
, Norwegian
, Swedish
, Russian
, Romanian
]

View File

@ -0,0 +1,50 @@
module Data.ListType exposing
( ListType(..)
, all
, fromString
, label
, toString
)
type ListType
= Blacklist
| Whitelist
all : List ListType
all =
[ Blacklist, Whitelist ]
toString : ListType -> String
toString lt =
case lt of
Blacklist ->
"blacklist"
Whitelist ->
"whitelist"
label : ListType -> String
label lt =
case lt of
Blacklist ->
"Blacklist"
Whitelist ->
"Whitelist"
fromString : String -> Maybe ListType
fromString str =
case String.toLower str of
"blacklist" ->
Just Blacklist
"whitelist" ->
Just Whitelist
_ ->
Nothing

View File

@ -98,9 +98,13 @@ let
};
text-analysis = {
max-length = 10000;
regex-ner = {
enabled = true;
file-cache-time = "1 minute";
nlp = {
mode = "full";
clear-interval = "15 minutes";
regex-ner = {
max-entries = 1000;
file-cache-time = "1 minute";
};
};
classification = {
enabled = true;
@ -118,7 +122,6 @@ let
];
};
working-dir = "/tmp/docspell-analysis";
clear-stanford-nlp-interval = "15 minutes";
};
processing = {
max-due-date-years = 10;
@ -772,47 +775,96 @@ in {
files.
'';
};
clear-stanford-nlp-interval = mkOption {
type = types.str;
default = defaults.text-analysis.clear-stanford-nlp-interval;
description = ''
Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled.
'';
};
regex-ner = mkOption {
nlp = mkOption {
type = types.submodule({
options = {
enabled = mkOption {
type = types.bool;
default = defaults.text-analysis.regex-ner.enabled;
mode = mkOption {
type = types.str;
default = defaults.text-analysis.nlp.mode;
description = ''
Whether to enable custom NER annotation. This uses the address
book of a collective as input for NER tagging (to automatically
find correspondent and concerned entities). If the address book
is large, this can be quite memory intensive and also makes text
analysis slower. But it greatly improves accuracy. If this is
false, NER tagging uses only statistical models (that also work
quite well).
The mode for configuring NLP models:
This setting might be moved to the collective settings in the
future.
1. full - builds the complete pipeline
2. basic - builds only the ner annotator
3. regexonly - matches each entry in your address book via regexps
4. disabled - doesn't use any stanford-nlp feature
The full and basic variants rely on pre-built language models
that are available for only 3 languages at the moment: German,
English and French.
Memory usage varies greatly among the languages. German has
quite large models that require about 1G of heap. So joex should
run with -Xmx1400M at least when using mode=full.
The basic variant does quite a good job for German and
English. It might be worse for French, always depending on the
type of text that is analysed. Joex should run with about 600M
heap; here again German uses the most.
The regexonly variant doesn't depend on a language. It roughly
works by converting all entries in your address book into
regexps and matching each one against the text. This can get
memory intensive, too, when the address book grows large. This
is included in full and basic by default, but can be used
independently by setting mode=regexonly.
When mode=disabled, the whole NLP pipeline is disabled and you
won't get any suggestions beyond what the classifier returns
(if enabled).
'';
};
file-cache-time = mkOption {
clear-interval = mkOption {
type = types.str;
default = defaults.text-analysis.ner-file-cache-time;
default = defaults.text-analysis.nlp.clear-interval;
description = ''
The NER annotation uses a file of patterns that is derived from
a collective's address book. This is how long this
file will be kept until a check for a state change is done.
Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled.
'';
};
regex-ner = mkOption {
type = types.submodule({
options = {
enabled = mkOption {
type = types.int;
default = defaults.text-analysis.regex-ner.max-entries;
description = ''
Whether to enable custom NER annotation. This uses the
address book of a collective as input for NER tagging (to
automatically find correspondent and concerned entities). If
the address book is large, this can be quite memory
intensive and also makes text analysis much slower. But it
improves accuracy and can be used independent of the
language. If this is set to 0, it is effectively disabled
and NER tagging uses only statistical models (that also work
quite well, but are restricted to the languages mentioned
above).
Note, this is only relevant if nlp-config.mode is not
"disabled".
'';
};
file-cache-time = mkOption {
type = types.str;
default = defaults.text-analysis.ner-file-cache-time;
description = ''
The NER annotation uses a file of patterns that is derived from
a collective's address book. This is how long this
file will be kept until a check for a state change is done.
'';
};
};
});
default = defaults.text-analysis.nlp.regex-ner;
description = "";
};
};
});
default = defaults.text-analysis.regex-ner;
description = "";
default = defaults.text-analysis.nlp;
description = "Configure NLP";
};
classification = mkOption {

View File

@ -20,6 +20,9 @@ The configuration of both components uses separate namespaces. The
configuration for the REST server is below `docspell.server`, while
the one for joex is below `docspell.joex`.
You can therefore use two separate config files or one single file
containing both namespaces.
## JDBC
This configures the connection to the database. This has to be
@ -281,6 +284,70 @@ just some minutes, the web application obtains new ones
periodically. So a short time is recommended.
## File Processing
Files are processed by the joex component, so all of the
respective configuration is in the joex config only.
File processing involves several stages; detailed information can be
found [here](@/docs/joex/file-processing.md#text-analysis) and in the
corresponding sections of the [joex default config](#joex).
The configuration allows defining the external tools and setting some
limits to control memory usage. The sections are:
- `docspell.joex.extraction`
- `docspell.joex.text-analysis`
- `docspell.joex.convert`
Options to external commands can use variables that are replaced by
values at runtime. Variables are enclosed in double braces `{{…}}`.
Please see the default configuration for what variables exist per
command.
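
To make the substitution concrete, here is a minimal sketch of how such
placeholders could be replaced. It is illustrative only: the variable
names `infile` and `outfile` are made up for this example, they are not
necessarily what a given command supports; the real names per command
are listed in the default configuration.

```scala
// Minimal sketch of {{…}} placeholder substitution; not the actual
// docspell code. Variable names used here are hypothetical examples.
object CommandVars {
  def replace(template: String, vars: Map[String, String]): String =
    vars.foldLeft(template) { case (cmd, (name, value)) =>
      cmd.replace(s"{{$name}}", value)
    }
}

// CommandVars.replace("convert {{infile}} {{outfile}}",
//   Map("infile" -> "in.png", "outfile" -> "out.pdf"))
// yields "convert in.png out.pdf"
```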
### Classification
In `text-analysis.classification` you can define how many documents at
most should be used for learning. The default settings should work
well for most cases. However, it always depends on the amount of data
and the machine that runs joex. For example, by default the documents
to learn from are limited to 600 (`classification.item-count`) and
every text is cut after 8000 characters (`text-analysis.max-length`).
This is fine if *most* of your documents are small and only a few are
near 8000 characters. But if *all* your documents are very large, you
probably need to either assign more heap memory or lower the
limits.
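
As a rough sketch of how these two limits interact (the type and method
names below are made up for illustration and are not the actual
docspell internals): only the newest `item-count` items are considered,
and each text is cut to `max-length` characters before it is handed to
the classifier.

```scala
// Illustrative sketch only, not the real implementation.
object ClassifyLimitsSketch {
  final case class ClassifyLimits(itemCount: Int, maxLength: Int)

  def selectTrainingTexts(
      textsNewestFirst: LazyList[String],
      limits: ClassifyLimits
  ): List[String] =
    textsNewestFirst
      .take(limits.itemCount)        // at most `item-count` items are used
      .map(_.take(limits.maxLength)) // each text is cut after `max-length` chars
      .toList
}
```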
Classification can also be disabled if it is not needed.
### NLP
This setting defines which NLP mode to use. It defaults to `full`,
which requires more memory for certain languages (with the advantage
of better results). Other values are `basic`, `regexonly` and
`disabled`. The modes `full` and `basic` use pre-defined language
models for processing documents in German, English and French.
These require some amount of memory (see below).
The mode `basic` is a "light" variant of `full`. It doesn't use
all NLP features, which makes memory consumption much lower, but it
comes at the cost of less accurate results.
The mode `regexonly` doesn't use pre-defined language models, even if
available. It checks your address book against a document to find
metadata. That means it is language independent. Also, when using
`full` or `basic` with languages where no pre-defined models exist, it
will degrade to `regexonly` for these.
The mode `disabled` skips NLP processing completely. This has the
lowest impact on memory consumption, obviously, but then only the
classifier is used to find metadata.
You might want to try different modes and see what combination best
suits your usage pattern and the machine running joex. If a powerful
machine is used, simply leave the defaults. When running on an older
Raspberry Pi, for example, you might need to adjust things.
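
The fallback behaviour described above can be pictured in a small
sketch. This is simplified and not the actual docspell implementation;
the mode names mirror the documented values, and the ISO-3 language
codes are only meant to stand for the three languages mentioned above.

```scala
// Simplified sketch of the mode/language fallback, not the real code.
object NlpModeSketch {
  sealed trait NlpMode
  case object Full      extends NlpMode
  case object Basic     extends NlpMode
  case object RegexOnly extends NlpMode
  case object Disabled  extends NlpMode

  // languages with pre-defined models, per the text above (illustrative codes)
  val modelLanguages: Set[String] = Set("deu", "eng", "fra")

  def effectiveMode(configured: NlpMode, language: String): NlpMode =
    configured match {
      case Full | Basic if !modelLanguages.contains(language) =>
        RegexOnly // degrade when no pre-defined model exists for the language
      case other =>
        other
    }
}
```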
# File Format
The format of the configuration files can be

Some files were not shown because too many files have changed in this diff.