Merge pull request #581 from eikek/text-analysis-improvements

Text analysis improvements
commit df5f9e8c51 by mergify[bot], 2021-01-21 22:01:50 +00:00 (committed by GitHub)
104 changed files with 3385 additions and 714 deletions

@@ -24,4 +24,4 @@ before_script:
 - export TZ=Europe/Berlin
 script:
-- sbt ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"
+- sbt -J-XX:+UseG1GC ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"

@@ -17,6 +17,9 @@ If you don't like to sign up to github/matrix or like to reach me
 personally, you can make a mail to `info [at] docspell.org` or on
 matrix, via `@eikek:matrix.org`.
 
+If you find a feature request already filed, you can vote on it. I
+tend to prefer the most voted requests over those without much
+attention.
+
 ## Documentation

@@ -9,25 +9,28 @@
 # Docspell
 
 Docspell is a personal document organizer. You'll need a scanner to
-convert your papers into files. Docspell can then assist in
-organizing the resulting mess :wink:.
+convert your papers into files. Docspell can then assist in organizing
+the resulting mess :wink:. It is targeted at home use, i.e. families
+and households, and also at (smaller) groups/companies.
 
-You can associate tags, set correspondends, what a document is
-concerned with, a name, a date and much more. If your documents are
-associated with such meta data, you should be able to quickly find
-them later using the search feature. But adding this manually to each
-document is a tedious task. Docspell can help you by suggesting
-correspondents, guessing tags or finding dates using machine learning
-techniques. This makes adding metadata to your documents a lot easier.
+You can associate tags, set correspondents and lots of other
+predefined and custom metadata. If your documents are associated with
+such meta data, you can quickly find them later using the search
+feature. But adding this manually is a tedious task. Docspell can help
+by suggesting correspondents, guessing tags or finding dates using
+machine learning. It can learn metadata from existing documents and
+find things using NLP. This makes adding metadata to your documents a
+lot easier. For machine learning, it relies on the free (GPL)
+[Stanford Core NLP library](https://github.com/stanfordnlp/CoreNLP).
 
 Docspell also runs OCR (if needed) on your documents, can provide
 fulltext search and has great e-mail integration. Everything is
 accessible via a REST/HTTP api. A mobile friendly SPA web application
-is provided as the user interface and an [Android
-app](https://github.com/docspell/android-client) for conveniently
-uploading files from your phone/tablet. The [feature
-overview](https://docspell.org/#feature-selection) has a more complete
-list.
+is the default user interface. An [Android
+app](https://github.com/docspell/android-client) exists for
+conveniently uploading files from your phone/tablet. The [feature
+overview](https://docspell.org/#feature-selection) lists some more
+points.
 
 ## Impressions

@@ -131,7 +131,8 @@ val openapiScalaSettings = Seq(
         case "ident" =>
           field => field.copy(typeDef = TypeDef("Ident", Imports("docspell.common.Ident")))
         case "accountid" =>
-          field => field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
+          field =>
+            field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
         case "collectivestate" =>
           field =>
             field.copy(typeDef =
@@ -190,6 +191,9 @@ val openapiScalaSettings = Seq(
           field.copy(typeDef =
             TypeDef("CustomFieldType", Imports("docspell.common.CustomFieldType"))
           )
+        case "listtype" =>
+          field =>
+            field.copy(typeDef = TypeDef("ListType", Imports("docspell.common.ListType")))
     }))
 )

@@ -15,6 +15,17 @@ RUN apk add --no-cache openjdk11-jre \
     tesseract-ocr \
     tesseract-ocr-data-deu \
     tesseract-ocr-data-fra \
+    tesseract-ocr-data-ita \
+    tesseract-ocr-data-spa \
+    tesseract-ocr-data-por \
+    tesseract-ocr-data-ces \
+    tesseract-ocr-data-nld \
+    tesseract-ocr-data-dan \
+    tesseract-ocr-data-fin \
+    tesseract-ocr-data-nor \
+    tesseract-ocr-data-swe \
+    tesseract-ocr-data-rus \
+    tesseract-ocr-data-ron \
     unpaper \
     wkhtmltopdf \
     libreoffice \

@@ -0,0 +1,7 @@
package docspell.analysis
import java.nio.file.Path
import docspell.common._
case class NlpSettings(lang: Language, highRecall: Boolean, regexNer: Option[Path])
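Note: `NlpSettings` is the new settings type threaded through the analyser in place of `StanfordNerSettings`. A minimal construction sketch (the concrete values and the file path are illustrative, not from this commit):

import java.nio.file.Paths
import docspell.common._

// Illustrative only: German, favour recall, with an optional
// regexner mapping file.
val settings = NlpSettings(
  lang = Language.German,
  highRecall = true,
  regexNer = Some(Paths.get("/opt/docspell/regexner.txt"))
)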

@@ -1,29 +1,30 @@
 package docspell.analysis
 
+import cats.Applicative
 import cats.effect._
 import cats.implicits._
 
+import docspell.analysis.classifier.{StanfordTextClassifier, TextClassifier}
 import docspell.analysis.contact.Contact
 import docspell.analysis.date.DateFind
-import docspell.analysis.nlp.PipelineCache
-import docspell.analysis.nlp.StanfordNerClassifier
-import docspell.analysis.nlp.StanfordNerSettings
-import docspell.analysis.nlp.StanfordTextClassifier
-import docspell.analysis.nlp.TextClassifier
+import docspell.analysis.nlp._
 import docspell.common._
+import org.log4s.getLogger
 
 trait TextAnalyser[F[_]] {
 
   def annotate(
       logger: Logger[F],
-      settings: StanfordNerSettings,
+      settings: NlpSettings,
       cacheKey: Ident,
       text: String
   ): F[TextAnalyser.Result]
 
-  def classifier(blocker: Blocker)(implicit CS: ContextShift[F]): TextClassifier[F]
+  def classifier: TextClassifier[F]
 }
 object TextAnalyser {
+  private[this] val logger = getLogger
 
   case class Result(labels: Vector[NerLabel], dates: Vector[NerDateLabel]) {
@@ -31,31 +32,30 @@ object TextAnalyser {
       labels ++ dates.map(dl => dl.label.copy(label = dl.date.toString))
   }
 
-  def create[F[_]: Concurrent: Timer](
-      cfg: TextAnalysisConfig
+  def create[F[_]: Concurrent: Timer: ContextShift](
+      cfg: TextAnalysisConfig,
+      blocker: Blocker
   ): Resource[F, TextAnalyser[F]] =
     Resource
-      .liftF(PipelineCache[F](cfg.clearStanfordPipelineInterval))
-      .map(cache =>
+      .liftF(Nlp(cfg.nlpConfig))
+      .map(stanfordNer =>
        new TextAnalyser[F] {
          def annotate(
              logger: Logger[F],
-              settings: StanfordNerSettings,
+              settings: NlpSettings,
              cacheKey: Ident,
              text: String
          ): F[TextAnalyser.Result] =
            for {
              input <- textLimit(logger, text)
-              tags0 <- stanfordNer(cacheKey, settings, input)
+              tags0 <- stanfordNer(Nlp.Input(cacheKey, settings, logger, input))
              tags1 <- contactNer(input)
              dates <- dateNer(settings.lang, input)
              list  = tags0 ++ tags1
              spans = NerLabelSpan.build(list)
            } yield Result(spans ++ list, dates)
 
-          def classifier(blocker: Blocker)(implicit
-              CS: ContextShift[F]
-          ): TextClassifier[F] =
+          def classifier: TextClassifier[F] =
            new StanfordTextClassifier[F](cfg.classifier, blocker)
 
          private def textLimit(logger: Logger[F], text: String): F[String] =
@@ -66,10 +66,6 @@ object TextAnalyser {
                s" Analysing only first ${cfg.maxLength} characters."
              ) *> text.take(cfg.maxLength).pure[F]
 
-          private def stanfordNer(key: Ident, settings: StanfordNerSettings, text: String)
-              : F[Vector[NerLabel]] =
-            StanfordNerClassifier.nerAnnotate[F](key.id, cache)(settings, text)
-
          private def contactNer(text: String): F[Vector[NerLabel]] =
            Sync[F].delay {
              Contact.annotate(text)
@@ -82,4 +78,36 @@ object TextAnalyser {
          }
        )
 
+  /** Provides the nlp pipeline based on the configuration. */
+  private object Nlp {
+    def apply[F[_]: Concurrent: Timer: BracketThrow](
+        cfg: TextAnalysisConfig.NlpConfig
+    ): F[Input[F] => F[Vector[NerLabel]]] =
+      cfg.mode match {
+        case NlpMode.Disabled =>
+          Logger.log4s(logger).info("NLP is disabled as defined in config.") *>
+            Applicative[F].pure(_ => Vector.empty[NerLabel].pure[F])
+        case _ =>
+          PipelineCache(cfg.clearInterval)(
+            Annotator[F](cfg.mode),
+            Annotator.clearCaches[F]
+          )
+            .map(annotate[F])
+      }
+
+    final case class Input[F[_]](
+        key: Ident,
+        settings: NlpSettings,
+        logger: Logger[F],
+        text: String
+    )
+
+    def annotate[F[_]: BracketThrow](
+        cache: PipelineCache[F]
+    )(input: Input[F]): F[Vector[NerLabel]] =
+      cache
+        .obtain(input.key.id, input.settings)
+        .use(ann => ann.nerAnnotate(input.logger)(input.text))
+  }
 }
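Note: since `classifier` no longer takes a blocker, the blocker moves to `create`. A rough usage sketch under these assumptions (`cfg`, `blocker` and `logger` are supplied by the caller; the sample text and key are made up):

import cats.effect._
import docspell.common._

// Sketch: acquire the analyser resource once, then annotate a text.
def example(cfg: TextAnalysisConfig, blocker: Blocker, logger: Logger[IO])(implicit
    CS: ContextShift[IO],
    T: Timer[IO]
): IO[TextAnalyser.Result] =
  TextAnalyser
    .create[IO](cfg, blocker)
    .use(
      _.annotate(
        logger,
        NlpSettings(Language.German, highRecall = false, regexNer = None),
        Ident.unsafe("collective-1"),
        "Max Mustermann schrieb am 12.03.2021 ..."
      )
    )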

@@ -1,10 +1,16 @@
 package docspell.analysis
 
-import docspell.analysis.nlp.TextClassifierConfig
+import docspell.analysis.TextAnalysisConfig.NlpConfig
+import docspell.analysis.classifier.TextClassifierConfig
 import docspell.common._
 
 case class TextAnalysisConfig(
     maxLength: Int,
-    clearStanfordPipelineInterval: Duration,
+    nlpConfig: NlpConfig,
     classifier: TextClassifierConfig
 )
+
+object TextAnalysisConfig {
+
+  case class NlpConfig(clearInterval: Duration, mode: NlpMode)
+}

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import java.nio.file.Path

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import java.nio.file.Path
@@ -7,8 +7,11 @@ import cats.effect.concurrent.Ref
 import cats.implicits._
 import fs2.Stream
 
-import docspell.analysis.nlp.TextClassifier._
+import docspell.analysis.classifier
+import docspell.analysis.classifier.TextClassifier._
+import docspell.analysis.nlp.Properties
 import docspell.common._
+import docspell.common.syntax.FileSyntax._
 import edu.stanford.nlp.classify.ColumnDataClassifier
@@ -26,7 +29,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
       .use { dir =>
         for {
           rawData   <- writeDataFile(blocker, dir, data)
-          _         <- logger.info(s"Learning from ${rawData.count} items.")
+          _         <- logger.debug(s"Learning from ${rawData.count} items.")
           trainData <- splitData(logger, rawData)
           scores    <- cfg.classifierConfigs.traverse(m => train(logger, trainData, m))
           sorted = scores.sortBy(-_.score)
@@ -43,7 +46,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
       case Some(text) =>
         Sync[F].delay {
           val cls = ColumnDataClassifier.getClassifier(
-            model.model.normalize().toAbsolutePath().toString()
+            model.model.normalize().toAbsolutePath.toString
           )
           val cat = cls.classOf(cls.makeDatumFromLine("\t\t" + normalisedText(text)))
           Option(cat)
@@ -65,7 +68,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
         val cdc = new ColumnDataClassifier(Properties.fromMap(amendProps(in, props)))
         cdc.trainClassifier(in.train.toString())
         val score = cdc.testClassifier(in.test.toString())
-        TrainResult(score.first(), ClassifierModel(in.modelFile))
+        TrainResult(score.first(), classifier.ClassifierModel(in.modelFile))
       }
       _ <- logger.debug(s"Trained with result $res")
     } yield res
@@ -136,9 +139,9 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
       props: Map[String, String]
   ): Map[String, String] =
     prepend("2.", props) ++ Map(
-      "trainFile"   -> trainData.train.normalize().toAbsolutePath().toString(),
-      "testFile"    -> trainData.test.normalize().toAbsolutePath().toString(),
-      "serializeTo" -> trainData.modelFile.normalize().toAbsolutePath().toString()
+      "trainFile"   -> trainData.train.absolutePathAsString,
+      "testFile"    -> trainData.test.absolutePathAsString,
+      "serializeTo" -> trainData.modelFile.absolutePathAsString
     ).toList
 
 case class RawData(count: Long, file: Path)

@@ -1,9 +1,9 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import cats.data.Kleisli
 import fs2.Stream
 
-import docspell.analysis.nlp.TextClassifier.Data
+import docspell.analysis.classifier.TextClassifier.Data
 import docspell.common._
 
 trait TextClassifier[F[_]] {

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import java.nio.file.Path

@@ -41,23 +41,41 @@ object DateFind {
   }
 
   object SimpleDate {
-    val p0 = (readYear >> readMonth >> readDay).map { case ((y, m), d) =>
-      List(SimpleDate(y, m, d))
+    def pattern0(lang: Language) = (readYear >> readMonth(lang) >> readDay).map {
+      case ((y, m), d) =>
+        List(SimpleDate(y, m, d))
     }
-    val p1 = (readDay >> readMonth >> readYear).map { case ((d, m), y) =>
-      List(SimpleDate(y, m, d))
+    def pattern1(lang: Language) = (readDay >> readMonth(lang) >> readYear).map {
+      case ((d, m), y) =>
+        List(SimpleDate(y, m, d))
     }
-    val p2 = (readMonth >> readDay >> readYear).map { case ((m, d), y) =>
-      List(SimpleDate(y, m, d))
+    def pattern2(lang: Language) = (readMonth(lang) >> readDay >> readYear).map {
+      case ((m, d), y) =>
+        List(SimpleDate(y, m, d))
     }
 
     // ymd , ydm, dmy , dym, myd, mdy
     def fromParts(parts: List[Word], lang: Language): List[SimpleDate] = {
+      val ymd = pattern0(lang)
+      val dmy = pattern1(lang)
+      val mdy = pattern2(lang)
+      // most is from wikipedia
       val p = lang match {
         case Language.English =>
-          p2.alt(p1).map(t => t._1 ++ t._2).or(p2).or(p0).or(p1)
-        case Language.German => p1.or(p0).or(p2)
-        case Language.French => p1.or(p0).or(p2)
+          mdy.alt(dmy).map(t => t._1 ++ t._2).or(mdy).or(ymd).or(dmy)
+        case Language.German     => dmy.or(ymd).or(mdy)
+        case Language.French     => dmy.or(ymd).or(mdy)
+        case Language.Italian    => dmy.or(ymd).or(mdy)
+        case Language.Spanish    => dmy.or(ymd).or(mdy)
+        case Language.Czech      => dmy.or(ymd).or(mdy)
+        case Language.Danish     => dmy.or(ymd).or(mdy)
+        case Language.Finnish    => dmy.or(ymd).or(mdy)
+        case Language.Norwegian  => dmy.or(ymd).or(mdy)
+        case Language.Portuguese => dmy.or(ymd).or(mdy)
+        case Language.Romanian   => dmy.or(ymd).or(mdy)
+        case Language.Russian    => dmy.or(ymd).or(mdy)
+        case Language.Swedish    => ymd.or(dmy).or(mdy)
+        case Language.Dutch      => dmy.or(ymd).or(mdy)
       }
       p.read(parts) match {
         case Result.Success(sds, _) =>
@@ -76,9 +94,11 @@ object DateFind {
         }
       )
 
-    def readMonth: Reader[Int] =
+    def readMonth(lang: Language): Reader[Int] =
       Reader.readFirst(w =>
-        Some(months.indexWhere(_.contains(w.value))).filter(_ >= 0).map(_ + 1)
+        Some(MonthName.getAll(lang).indexWhere(_.contains(w.value)))
+          .filter(_ >= 0)
+          .map(_ + 1)
       )
 
     def readDay: Reader[Int] =
@@ -150,20 +170,5 @@ object DateFind {
           Failure
       }
     }
-
-    private val months = List(
-      List("jan", "january", "januar", "01"),
-      List("feb", "february", "februar", "02"),
-      List("mar", "march", "märz", "marz", "03"),
-      List("apr", "april", "04"),
-      List("may", "mai", "05"),
-      List("jun", "june", "juni", "06"),
-      List("jul", "july", "juli", "07"),
-      List("aug", "august", "08"),
-      List("sep", "september", "09"),
-      List("oct", "october", "oktober", "10"),
-      List("nov", "november", "11"),
-      List("dec", "december", "dezember", "12")
-    )
   }
 }

@@ -0,0 +1,270 @@
package docspell.analysis.date
import docspell.common.Language
object MonthName {
def getAll(lang: Language): List[List[String]] =
merge(numbers, forLang(lang))
private def merge(n0: List[List[String]], ns: List[List[String]]*): List[List[String]] =
ns.foldLeft(n0) { (res, el) =>
res.zip(el).map({ case (a, b) => a ++ b })
}
private def forLang(lang: Language): List[List[String]] =
lang match {
case Language.English =>
english
case Language.German =>
german
case Language.French =>
french
case Language.Italian =>
italian
case Language.Spanish =>
spanish
case Language.Swedish =>
swedish
case Language.Norwegian =>
norwegian
case Language.Dutch =>
dutch
case Language.Czech =>
czech
case Language.Danish =>
danish
case Language.Portuguese =>
portuguese
case Language.Romanian =>
romanian
case Language.Finnish =>
finnish
case Language.Russian =>
russian
}
private val numbers = List(
List("01"),
List("02"),
List("03"),
List("04"),
List("05"),
List("06"),
List("07"),
List("08"),
List("09"),
List("10"),
List("11"),
List("12")
)
private val english = List(
List("jan", "january"),
List("feb", "february"),
List("mar", "march"),
List("apr", "april"),
List("may"),
List("jun", "june"),
List("jul", "july"),
List("aug", "august"),
List("sept", "september"),
List("oct", "october"),
List("nov", "november"),
List("dec", "december")
)
private val german = List(
List("jan", "januar"),
List("feb", "februar"),
List("märz"),
List("apr", "april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dez", "dezember")
)
private val french = List(
List("janv", "janvier"),
List("févr", "fevr", "février", "fevrier"),
List("mars"),
List("avril"),
List("mai"),
List("juin"),
List("juil", "juillet"),
List("aout", "août"),
List("sept", "septembre"),
List("oct", "octobre"),
List("nov", "novembre"),
List("dec", "déc", "décembre", "decembre")
)
private val italian = List(
List("genn", "gennaio"),
List("febbr", "febbraio"),
List("mar", "marzo"),
List("apr", "aprile"),
List("magg", "maggio"),
List("giugno"),
List("luglio"),
List("ag", "agosto"),
List("sett", "settembre"),
List("ott", "ottobre"),
List("nov", "novembre"),
List("dic", "dicembre")
)
private val spanish = List(
List("ene", "enero"),
List("feb", "febrero"),
List("mar", "marzo"),
List("abr", "abril"),
List("may", "mayo"),
List("jun"),
List("jul"),
List("ago", "agosto"),
List("sep", "septiembre"),
List("oct", "octubre"),
List("nov", "noviembre"),
List("dic", "diciembre")
)
private val swedish = List(
List("jan", "januari"),
List("febr", "februari"),
List("mars"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "augusti"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val norwegian = List(
List("jan", "januar"),
List("febr", "februar"),
List("mars"),
List("april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("des", "desember")
)
private val czech = List(
List("led", "leden"),
List("un", "ún", "únor", "unor"),
List("brez", "březen", "brezen"),
List("dub", "duben"),
List("kvet", "květen"),
List("cerv", "červen"),
List("cerven", "červenec"),
List("srp", "srpen"),
List("zari", "září"),
List("ríj", "rij", "říjen"),
List("list", "listopad"),
List("pros", "prosinec")
)
private val romanian = List(
List("ian", "ianuarie"),
List("feb", "februarie"),
List("mar", "martie"),
List("apr", "aprilie"),
List("mai"),
List("iunie"),
List("iulie"),
List("aug", "august"),
List("sept", "septembrie"),
List("oct", "octombrie"),
List("noem", "nov", "noiembrie"),
List("dec", "decembrie")
)
private val danish = List(
List("jan", "januar"),
List("febr", "februar"),
List("marts"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val portuguese = List(
List("jan", "janeiro"),
List("fev", "fevereiro"),
List("março", "marco"),
List("abril"),
List("maio"),
List("junho"),
List("julho"),
List("agosto"),
List("set", "setembro"),
List("out", "outubro"),
List("nov", "novembro"),
List("dez", "dezembro")
)
private val finnish = List(
List("tammikuu"),
List("helmikuu"),
List("maaliskuu"),
List("huhtikuu"),
List("toukokuu"),
List("kesäkuu"),
List("heinäkuu"),
List("elokuu"),
List("syyskuu"),
List("lokakuu"),
List("marraskuu"),
List("joulukuu")
)
private val russian = List(
List("январь"),
List("февраль"),
List("март"),
List("апрель"),
List("май"),
List("июнь"),
List("июль"),
List("август"),
List("сентябрь"),
List("октябрь"),
List("ноябрь"),
List("декабрь")
)
private val dutch = List(
List("jan", "januari"),
List("feb", "februari"),
List("maart"),
List("apr", "april"),
List("mei"),
List("juni"),
List("juli"),
List("aug", "augustus"),
List("sept", "september"),
List("okt", "oct", "oktober"),
List("nov", "november"),
List("dec", "december")
)
}
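Note: a quick illustration of how `getAll` is consumed by the date reader above; the month number is the list position plus one, and the numeric spellings are merged into every language's lists:

import docspell.common.Language

// "märz" sits in the third merged German entry, so the reader
// yields month number 3; "03" would match the same entry.
val german = MonthName.getAll(Language.German)
val march  = german.indexWhere(_.contains("märz")) + 1 // == 3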

@@ -0,0 +1,98 @@
package docspell.analysis.nlp
import cats.effect.Sync
import cats.implicits._
import cats.{Applicative, FlatMap}
import docspell.analysis.NlpSettings
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
/** Analyses a text to mark certain parts with a `NerLabel`. */
trait Annotator[F[_]] { self =>
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]]
def ++(next: Annotator[F])(implicit F: FlatMap[F]): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
for {
n0 <- self.nerAnnotate(logger)(text)
n1 <- next.nerAnnotate(logger)(text)
} yield (n0 ++ n1).distinct
}
}
object Annotator {
/** Creates an annotator according to the given `mode` and `settings`.
*
* There are the following ways:
*
* - disabled: it returns a no-op annotator that always gives an empty list
* - full: the complete stanford pipeline is used
* - basic: only the ner classifier is used
*
* Additionally, if there is a regexNer-file specified, the regexner annotator is
* also run. In case the full pipeline is used, this is already included.
*/
def apply[F[_]: Sync](mode: NlpMode)(settings: NlpSettings): Annotator[F] =
mode match {
case NlpMode.Disabled =>
Annotator.none[F]
case NlpMode.Full =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(ss) =>
Annotator.pipeline(StanfordNerAnnotator.makePipeline(ss))
case None =>
Annotator.none[F]
}
case NlpMode.Basic =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(StanfordNerSettings.Full(lang, _, Some(file))) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang)) ++
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case Some(StanfordNerSettings.Full(lang, _, None)) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang))
case Some(StanfordNerSettings.RegexOnly(file)) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
case NlpMode.RegexOnly =>
settings.regexNer match {
case Some(file) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
}
def none[F[_]: Applicative]: Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
logger.debug("Running empty annotator. NLP not supported.") *>
Vector.empty[NerLabel].pure[F]
}
def basic[F[_]: Sync](ann: BasicCRFAnnotator.Annotator): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(
BasicCRFAnnotator.nerAnnotate(ann)(text)
)
}
def pipeline[F[_]: Sync](cp: StanfordCoreNLP): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(StanfordNerAnnotator.nerAnnotate(cp, text))
}
def clearCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
StanfordCoreNLP.clearAnnotatorPool()
BasicCRFAnnotator.Cache.clearCache()
}
}
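Note: a small composition sketch, assuming a supported NLP language is available; `++` runs both annotators and de-duplicates the concatenated labels. In practice the second part would be a regexner pipeline, as in `Annotator.apply` above:

import cats.effect.IO
import docspell.common._

// Basic CRF NER for German, extended by a no-op annotator purely to
// show how the combinator composes.
val basic: Annotator[IO] =
  Annotator.basic[IO](BasicCRFAnnotator.Cache.getAnnotator(Language.German))

val composed: Annotator[IO] = basic ++ Annotator.none[IO]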

@@ -0,0 +1,94 @@
package docspell.analysis.nlp
import java.net.URL
import java.util.concurrent.atomic.AtomicReference
import java.util.zip.GZIPInputStream
import scala.jdk.CollectionConverters._
import scala.util.Using
import docspell.common.Language.NLPLanguage
import docspell.common._
import edu.stanford.nlp.ie.AbstractSequenceClassifier
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.{CoreAnnotations, CoreLabel}
import org.log4s.getLogger
/** This is only using the CRFClassifier without building an analysis
* pipeline. The ner-classifier cannot use results from POS-tagging
* etc. and is therefore not as good as the [[StanfordNerAnnotator]].
* But it uses less memory, while still being not bad.
*/
object BasicCRFAnnotator {
private[this] val logger = getLogger
// assert correct resource names
List(Language.French, Language.German, Language.English).foreach(classifierResource)
type Annotator = AbstractSequenceClassifier[CoreLabel]
def nerAnnotate(nerClassifier: Annotator)(text: String): Vector[NerLabel] =
nerClassifier
.classify(text)
.asScala
.flatMap(a => a.asScala)
.collect(Function.unlift { label =>
val tag = label.get(classOf[CoreAnnotations.AnswerAnnotation])
NerTag
.fromString(Option(tag).getOrElse(""))
.toOption
.map(t => NerLabel(label.word(), t, label.beginPosition(), label.endPosition()))
})
.toVector
def makeAnnotator(lang: NLPLanguage): Annotator = {
logger.info(s"Creating ${lang.name} Stanford NLP NER-only classifier...")
val ner = classifierResource(lang)
Using(new GZIPInputStream(ner.openStream())) { in =>
CRFClassifier.getClassifier(in).asInstanceOf[Annotator]
}.fold(throw _, identity)
}
private def classifierResource(lang: NLPLanguage): URL = {
def check(name: String): URL =
Option(getClass.getResource(name)) match {
case None =>
sys.error(s"NER model resource '$name' not found for language ${lang.name}")
case Some(url) => url
}
check(lang match {
case Language.French =>
"/edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz"
case Language.German =>
"/edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz"
case Language.English =>
"/edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
})
}
final class Cache {
private[this] lazy val germanNerClassifier = makeAnnotator(Language.German)
private[this] lazy val englishNerClassifier = makeAnnotator(Language.English)
private[this] lazy val frenchNerClassifier = makeAnnotator(Language.French)
def forLang(language: NLPLanguage): Annotator =
language match {
case Language.French => frenchNerClassifier
case Language.German => germanNerClassifier
case Language.English => englishNerClassifier
}
}
object Cache {
private[this] val cacheRef = new AtomicReference[Cache](new Cache)
def getAnnotator(language: NLPLanguage): Annotator =
cacheRef.get().forLang(language)
def clearCache(): Unit =
cacheRef.set(new Cache)
}
}
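Note: a minimal usage sketch; the cache loads each CRF model lazily and once, and `clearCache` drops all of them (requires the NER model resources on the classpath, and the text is a made-up sample):

import docspell.common._

// Annotate a snippet with the English CRF model.
val ner = BasicCRFAnnotator.Cache.getAnnotator(Language.English)
val labels: Vector[NerLabel] =
  BasicCRFAnnotator.nerAnnotate(ner)("Derek Jeter, 24 Elm Ave., Treesville")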

@@ -7,9 +7,9 @@ import cats.effect._
 import cats.effect.concurrent.Ref
 import cats.implicits._
 
+import docspell.analysis.NlpSettings
 import docspell.common._
-import edu.stanford.nlp.pipeline.StanfordCoreNLP
 import org.log4s.getLogger
 
 /** Creating the StanfordCoreNLP pipeline is quite expensive as it
@@ -21,46 +21,45 @@ import org.log4s.getLogger
   */
 trait PipelineCache[F[_]] {
 
-  def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP]
+  def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]]
 }
 
 object PipelineCache {
   private[this] val logger = getLogger
 
-  def none[F[_]: Applicative]: PipelineCache[F] =
-    new PipelineCache[F] {
-      def obtain(
-          ignored: String,
-          settings: StanfordNerSettings
-      ): Resource[F, StanfordCoreNLP] =
-        Resource.liftF(makeClassifier(settings).pure[F])
-    }
-
-  def apply[F[_]: Concurrent: Timer](clearInterval: Duration): F[PipelineCache[F]] =
+  def apply[F[_]: Concurrent: Timer](clearInterval: Duration)(
+      creator: NlpSettings => Annotator[F],
+      release: F[Unit]
+  ): F[PipelineCache[F]] =
     for {
-      data       <- Ref.of(Map.empty[String, Entry])
-      cacheClear <- CacheClearing.create(data, clearInterval)
-    } yield new Impl[F](data, cacheClear)
+      data       <- Ref.of(Map.empty[String, Entry[Annotator[F]]])
+      cacheClear <- CacheClearing.create(data, clearInterval, release)
+      _          <- Logger.log4s(logger).info("Creating nlp pipeline cache")
+    } yield new Impl[F](data, creator, cacheClear)
 
   final private class Impl[F[_]: Sync](
-      data: Ref[F, Map[String, Entry]],
+      data: Ref[F, Map[String, Entry[Annotator[F]]]],
+      creator: NlpSettings => Annotator[F],
       cacheClear: CacheClearing[F]
   ) extends PipelineCache[F] {
 
-    def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP] =
+    def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]] =
       for {
         _   <- cacheClear.withCache
         id  <- Resource.liftF(makeSettingsId(settings))
-        nlp <- Resource.liftF(data.modify(cache => getOrCreate(key, id, cache, settings)))
+        nlp <- Resource.liftF(
+          data.modify(cache => getOrCreate(key, id, cache, settings, creator))
+        )
       } yield nlp
 
     private def getOrCreate(
         key: String,
         id: String,
-        cache: Map[String, Entry],
-        settings: StanfordNerSettings
-    ): (Map[String, Entry], StanfordCoreNLP) =
+        cache: Map[String, Entry[Annotator[F]]],
+        settings: NlpSettings,
+        creator: NlpSettings => Annotator[F]
+    ): (Map[String, Entry[Annotator[F]]], Annotator[F]) =
       cache.get(key) match {
         case Some(entry) =>
           if (entry.id == id) (cache, entry.value)
@@ -68,18 +67,18 @@ object PipelineCache {
             logger.info(
               s"StanfordNLP settings changed for key $key. Creating new classifier"
             )
-            val nlp = makeClassifier(settings)
+            val nlp = creator(settings)
             val e   = Entry(id, nlp)
             (cache.updated(key, e), nlp)
           }
         case None =>
-          val nlp = makeClassifier(settings)
+          val nlp = creator(settings)
          val e   = Entry(id, nlp)
          (cache.updated(key, e), nlp)
      }
 
-    private def makeSettingsId(settings: StanfordNerSettings): F[String] = {
+    private def makeSettingsId(settings: NlpSettings): F[String] = {
       val base = settings.copy(regexNer = None).toString
       val size: F[Long] =
         settings.regexNer match {
@@ -104,9 +103,10 @@ object PipelineCache {
         Resource.pure[F, Unit](())
     }
 
-    def create[F[_]: Concurrent: Timer](
-        data: Ref[F, Map[String, Entry]],
-        interval: Duration
+    def create[F[_]: Concurrent: Timer, A](
+        data: Ref[F, Map[String, Entry[A]]],
+        interval: Duration,
+        release: F[Unit]
     ): F[CacheClearing[F]] =
       for {
         counter <- Ref.of(0L)
@@ -121,16 +121,23 @@ object PipelineCache {
             log
               .info(s"Clearing StanfordNLP cache after $interval idle time")
               .map(_ =>
-                new CacheClearingImpl[F](data, counter, cleaning, interval.toScala)
+                new CacheClearingImpl[F, A](
+                  data,
+                  counter,
+                  cleaning,
+                  interval.toScala,
+                  release
+                )
               )
       } yield result
   }
 
-  final private class CacheClearingImpl[F[_]](
-      data: Ref[F, Map[String, Entry]],
+  final private class CacheClearingImpl[F[_], A](
+      data: Ref[F, Map[String, Entry[A]]],
       counter: Ref[F, Long],
       cleaningFiber: Ref[F, Option[Fiber[F, Unit]]],
-      clearInterval: FiniteDuration
+      clearInterval: FiniteDuration,
+      release: F[Unit]
   )(implicit T: Timer[F], F: Concurrent[F])
       extends CacheClearing[F] {
     private[this] val log = Logger.log4s[F](logger)
@@ -158,17 +165,10 @@ object PipelineCache {
     def clearAll: F[Unit] =
       log.info("Clearing stanford nlp cache now!") *>
-        data.set(Map.empty) *> Sync[F].delay {
-          // turns out that everything is cached in a static map
-          StanfordCoreNLP.clearAnnotatorPool()
+        data.set(Map.empty) *> release *> Sync[F].delay {
           System.gc();
         }
   }
 
-  private def makeClassifier(settings: StanfordNerSettings): StanfordCoreNLP = {
-    logger.info(s"Creating ${settings.lang.name} Stanford NLP NER classifier...")
-    new StanfordCoreNLP(Properties.forSettings(settings))
-  }
-
-  private case class Entry(id: String, value: StanfordCoreNLP)
+  private case class Entry[A](id: String, value: A)
 }

@@ -1,9 +1,11 @@
 package docspell.analysis.nlp
 
+import java.nio.file.Path
 import java.util.{Properties => JProps}
 
 import docspell.analysis.nlp.Properties.Implicits._
 import docspell.common._
+import docspell.common.syntax.FileSyntax._
 
 object Properties {
@@ -17,17 +19,20 @@ object Properties {
     p
   }
 
-  def forSettings(settings: StanfordNerSettings): JProps = {
-    val regexNerFile = settings.regexNer
-      .map(p => p.normalize().toAbsolutePath().toString())
-    settings.lang match {
-      case Language.German =>
-        Properties.nerGerman(regexNerFile, settings.highRecall)
-      case Language.English =>
-        Properties.nerEnglish(regexNerFile)
-      case Language.French =>
-        Properties.nerFrench(regexNerFile, settings.highRecall)
-    }
-  }
+  def forSettings(settings: StanfordNerSettings): JProps =
+    settings match {
+      case StanfordNerSettings.Full(lang, highRecall, regexNer) =>
+        val regexNerFile = regexNer.map(p => p.absolutePathAsString)
+        lang match {
+          case Language.German =>
+            Properties.nerGerman(regexNerFile, highRecall)
+          case Language.English =>
+            Properties.nerEnglish(regexNerFile)
+          case Language.French =>
+            Properties.nerFrench(regexNerFile, highRecall)
+        }
+      case StanfordNerSettings.RegexOnly(path) =>
+        Properties.regexNerOnly(path)
+    }
 
   def nerGerman(regexNerMappingFile: Option[String], highRecall: Boolean): JProps =
@@ -76,6 +81,11 @@ object Properties {
       "ner.model" -> "edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
     ).withRegexNer(regexNerMappingFile).withHighRecall(highRecall)
 
+  def regexNerOnly(regexNerMappingFile: Path): JProps =
+    Properties(
+      "annotators" -> "tokenize,ssplit"
+    ).withRegexNer(Some(regexNerMappingFile.absolutePathAsString))
+
   object Implicits {
     implicit final class JPropsOps(val p: JProps) extends AnyVal {

@@ -0,0 +1,52 @@
package docspell.analysis.nlp
import java.nio.file.Path
import scala.jdk.CollectionConverters._
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import org.log4s.getLogger
object StanfordNerAnnotator {
private[this] val logger = getLogger
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
def makePipeline(settings: StanfordNerSettings): StanfordCoreNLP =
settings match {
case s: StanfordNerSettings.Full =>
logger.info(s"Creating ${s.lang.name} Stanford NLP NER classifier...")
new StanfordCoreNLP(Properties.forSettings(settings))
case StanfordNerSettings.RegexOnly(path) =>
logger.info(s"Creating regexNer-only Stanford NLP NER classifier...")
regexNerPipeline(path)
}
def regexNerPipeline(regexNerFile: Path): StanfordCoreNLP =
new StanfordCoreNLP(Properties.regexNerOnly(regexNerFile))
def clearPipelineCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
// turns out that everything is cached in a static map
StanfordCoreNLP.clearAnnotatorPool()
}
}
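Note: a short sketch of building and using a full pipeline (the settings and the text are sample values, not from this commit; the Stanford models must be on the classpath):

import docspell.common._

// Full German pipeline without a regexner file; annotate one line.
val pipeline = StanfordNerAnnotator.makePipeline(
  StanfordNerSettings.Full(Language.German, highRecall = false, regexNer = None)
)
val labels = StanfordNerAnnotator.nerAnnotate(pipeline, "Max Mustermann, Lilienweg 21")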

@@ -1,39 +0,0 @@
package docspell.analysis.nlp
import scala.jdk.CollectionConverters._
import cats.Applicative
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
object StanfordNerClassifier {
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate[F[_]: BracketThrow](
cacheKey: String,
cache: PipelineCache[F]
)(settings: StanfordNerSettings, text: String): F[Vector[NerLabel]] =
cache
.obtain(cacheKey, settings)
.use(crf => Applicative[F].pure(runClassifier(crf, text)))
def runClassifier(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
}

@@ -2,7 +2,12 @@ package docspell.analysis.nlp
 
 import java.nio.file.Path
 
-import docspell.common._
+import docspell.analysis.NlpSettings
+import docspell.common.Language.NLPLanguage
+
+sealed trait StanfordNerSettings
+
+object StanfordNerSettings {
 
 /** Settings for configuring the stanford NER pipeline.
   *
@@ -19,8 +24,19 @@ import docspell.common._
   * as a last step to tag untagged tokens using the provided list of
   * regexps.
   */
-case class StanfordNerSettings(
-    lang: Language,
-    highRecall: Boolean,
-    regexNer: Option[Path]
-)
+  case class Full(
+      lang: NLPLanguage,
+      highRecall: Boolean,
+      regexNer: Option[Path]
+  ) extends StanfordNerSettings
+
+  /** Not all languages are supported with predefined statistical models.
+    * This allows to provide regexps only.
+    */
+  case class RegexOnly(regexNerFile: Path) extends StanfordNerSettings
+
+  def fromNlpSettings(ns: NlpSettings): Option[StanfordNerSettings] =
+    NLPLanguage.all
+      .find(nl => nl == ns.lang)
+      .map(nl => Full(nl, ns.highRecall, ns.regexNer))
+      .orElse(ns.regexNer.map(nrf => RegexOnly(nrf)))
+}

@@ -0,0 +1,12 @@
package docspell.analysis
object Env {
def isCI = bool("CI")
def bool(key: String): Boolean =
string(key).contains("true")
def string(key: String): Option[String] =
Option(System.getenv(key)).filter(_.nonEmpty)
}

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import minitest._
 import cats.effect._

@@ -1,19 +1,22 @@
 package docspell.analysis.nlp
 
+import docspell.analysis.Env
+import docspell.common.Language.NLPLanguage
 import minitest.SimpleTestSuite
 import docspell.files.TestFiles
 import docspell.common._
-import edu.stanford.nlp.pipeline.StanfordCoreNLP
 
-object TextAnalyserSuite extends SimpleTestSuite {
-  lazy val germanClassifier =
-    new StanfordCoreNLP(Properties.nerGerman(None, false))
-  lazy val englishClassifier =
-    new StanfordCoreNLP(Properties.nerEnglish(None))
+object BaseCRFAnnotatorSuite extends SimpleTestSuite {
+
+  def annotate(language: NLPLanguage): String => Vector[NerLabel] =
+    BasicCRFAnnotator.nerAnnotate(BasicCRFAnnotator.Cache.getAnnotator(language))
 
   test("find english ner labels") {
-    val labels =
-      StanfordNerClassifier.runClassifier(englishClassifier, TestFiles.letterENText)
+    if (Env.isCI) {
+      ignore("Test ignored on travis.")
+    }
+
+    val labels = annotate(Language.English)(TestFiles.letterENText)
     val expect = Vector(
       NerLabel("Derek", NerTag.Person, 0, 5),
       NerLabel("Jeter", NerTag.Person, 6, 11),
@@ -45,11 +48,15 @@ object TextAnalyserSuite extends SimpleTestSuite {
       NerLabel("Jeter", NerTag.Person, 1123, 1128)
     )
     assertEquals(labels, expect)
+    BasicCRFAnnotator.Cache.clearCache()
   }
 
   test("find german ner labels") {
-    val labels =
-      StanfordNerClassifier.runClassifier(germanClassifier, TestFiles.letterDEText)
+    if (Env.isCI) {
+      ignore("Test ignored on travis.")
+    }
+
+    val labels = annotate(Language.German)(TestFiles.letterDEText)
     val expect = Vector(
       NerLabel("Max", NerTag.Person, 0, 3),
      NerLabel("Mustermann", NerTag.Person, 4, 14),
@@ -65,5 +72,6 @@ object TextAnalyserSuite extends SimpleTestSuite {
       NerLabel("Mustermann", NerTag.Person, 509, 519)
     )
     assertEquals(labels, expect)
+    BasicCRFAnnotator.Cache.clearCache()
   }
 }

@@ -0,0 +1,120 @@
package docspell.analysis.nlp
import java.nio.file.Paths
import cats.effect.IO
import docspell.analysis.Env
import minitest.SimpleTestSuite
import docspell.files.TestFiles
import docspell.common._
import docspell.common.syntax.FileSyntax._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
object StanfordNerAnnotatorSuite extends SimpleTestSuite {
lazy val germanClassifier =
new StanfordCoreNLP(Properties.nerGerman(None, false))
lazy val englishClassifier =
new StanfordCoreNLP(Properties.nerEnglish(None))
test("find english ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(englishClassifier, TestFiles.letterENText)
val expect = Vector(
NerLabel("Derek", NerTag.Person, 0, 5),
NerLabel("Jeter", NerTag.Person, 6, 11),
NerLabel("Elm", NerTag.Misc, 17, 20),
NerLabel("Ave.", NerTag.Misc, 21, 25),
NerLabel("Treesville", NerTag.Misc, 27, 37),
NerLabel("Derek", NerTag.Person, 68, 73),
NerLabel("Jeter", NerTag.Person, 74, 79),
NerLabel("Elm", NerTag.Misc, 85, 88),
NerLabel("Ave.", NerTag.Misc, 89, 93),
NerLabel("Treesville", NerTag.Person, 95, 105),
NerLabel("Leaf", NerTag.Organization, 144, 148),
NerLabel("Chief", NerTag.Organization, 150, 155),
NerLabel("of", NerTag.Organization, 156, 158),
NerLabel("Syrup", NerTag.Organization, 159, 164),
NerLabel("Production", NerTag.Organization, 165, 175),
NerLabel("Old", NerTag.Organization, 176, 179),
NerLabel("Sticky", NerTag.Organization, 180, 186),
NerLabel("Pancake", NerTag.Organization, 187, 194),
NerLabel("Company", NerTag.Organization, 195, 202),
NerLabel("Maple", NerTag.Organization, 207, 212),
NerLabel("Lane", NerTag.Organization, 213, 217),
NerLabel("Forest", NerTag.Organization, 219, 225),
NerLabel("Hemptown", NerTag.Location, 239, 247),
NerLabel("Leaf", NerTag.Person, 276, 280),
NerLabel("Little", NerTag.Misc, 347, 353),
NerLabel("League", NerTag.Misc, 354, 360),
NerLabel("Derek", NerTag.Person, 1117, 1122),
NerLabel("Jeter", NerTag.Person, 1123, 1128)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("find german ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(germanClassifier, TestFiles.letterDEText)
val expect = Vector(
NerLabel("Max", NerTag.Person, 0, 3),
NerLabel("Mustermann", NerTag.Person, 4, 14),
NerLabel("Lilienweg", NerTag.Person, 16, 25),
NerLabel("Max", NerTag.Person, 77, 80),
NerLabel("Mustermann", NerTag.Person, 81, 91),
NerLabel("Lilienweg", NerTag.Location, 93, 102),
NerLabel("EasyCare", NerTag.Organization, 124, 132),
NerLabel("AG", NerTag.Organization, 133, 135),
NerLabel("Ackerweg", NerTag.Location, 158, 166),
NerLabel("Nebendorf", NerTag.Location, 184, 193),
NerLabel("Max", NerTag.Person, 505, 508),
NerLabel("Mustermann", NerTag.Person, 509, 519)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("regexner-only annotator") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val regexNerContent =
s"""(?i)volantino ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)volantino${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)andrea rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)andrea${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|""".stripMargin
File
.withTempDir[IO](Paths.get("target"), "test-regex-ner")
.use { dir =>
for {
out <- File.writeString[IO](dir / "regex.txt", regexNerContent)
ann = StanfordNerAnnotator.makePipeline(StanfordNerSettings.RegexOnly(out))
labels = StanfordNerAnnotator.nerAnnotate(ann, "Hello Andrea Rossi, can you.")
_ <- IO(
assertEquals(
labels,
Vector(
NerLabel("Andrea", NerTag.Person, 6, 12),
NerLabel("Rossi", NerTag.Person, 13, 18)
)
)
)
} yield ()
}
.unsafeRunSync()
StanfordCoreNLP.clearAnnotatorPool()
}
}

@@ -591,7 +591,7 @@ object OItem {
         for {
           itemIds <- store.transact(RItem.filterItems(items, collective))
           results <- itemIds.traverse(item => deleteItem(item, collective))
-          n = results.fold(0)(_ + _)
+          n = results.sum
         } yield n
 
       def getProposals(item: Ident, collective: Ident): F[MetaProposalList] =

@@ -1,5 +1,7 @@
 package docspell.common
 
+import cats.data.NonEmptyList
+
 import io.circe.{Decoder, Encoder}
 
 sealed trait Language { self: Product =>
@@ -11,28 +13,107 @@ sealed trait Language { self: Product =>
 
   def iso3: String
 
+  val allowsNLP: Boolean = false
+
   private[common] def allNames =
     Set(name, iso3, iso2)
 }
 
 object Language {
+  sealed trait NLPLanguage extends Language with Product {
+    override val allowsNLP = true
+  }
+  object NLPLanguage {
+    val all: NonEmptyList[NLPLanguage] = NonEmptyList.of(German, English, French)
+  }
 
-  case object German extends Language {
+  case object German extends NLPLanguage {
     val iso2 = "de"
     val iso3 = "deu"
   }
 
-  case object English extends Language {
+  case object English extends NLPLanguage {
     val iso2 = "en"
     val iso3 = "eng"
   }
 
-  case object French extends Language {
+  case object French extends NLPLanguage {
     val iso2 = "fr"
     val iso3 = "fra"
   }
 
-  val all: List[Language] = List(German, English, French)
+  case object Italian extends Language {
+    val iso2 = "it"
+    val iso3 = "ita"
+  }
+
+  case object Spanish extends Language {
+    val iso2 = "es"
+    val iso3 = "spa"
+  }
+
+  case object Portuguese extends Language {
+    val iso2 = "pt"
+    val iso3 = "por"
+  }
+
+  case object Czech extends Language {
+    val iso2 = "cs"
+    val iso3 = "ces"
+  }
+
+  case object Danish extends Language {
+    val iso2 = "da"
+    val iso3 = "dan"
+  }
+
+  case object Finnish extends Language {
+    val iso2 = "fi"
+    val iso3 = "fin"
+  }
+
+  case object Norwegian extends Language {
+    val iso2 = "no"
+    val iso3 = "nor"
+  }
+
+  case object Swedish extends Language {
+    val iso2 = "sv"
+    val iso3 = "swe"
+  }
+
+  case object Russian extends Language {
+    val iso2 = "ru"
+    val iso3 = "rus"
+  }
+
+  case object Romanian extends Language {
+    val iso2 = "ro"
+    val iso3 = "ron"
+  }
+
+  case object Dutch extends Language {
+    val iso2 = "nl"
+    val iso3 = "nld"
+  }
+
+  val all: List[Language] =
+    List(
+      German,
+      English,
+      French,
+      Italian,
+      Spanish,
+      Dutch,
+      Portuguese,
+      Czech,
+      Danish,
+      Finnish,
+      Norwegian,
+      Swedish,
+      Russian,
+      Romanian
+    )
 
 def fromString(str: String): Either[String, Language] = {
   val lang = str.toLowerCase

@@ -0,0 +1,33 @@
package docspell.common
import cats.data.NonEmptyList
import io.circe.{Decoder, Encoder}
sealed trait ListType { self: Product =>
def name: String =
productPrefix.toLowerCase
}
object ListType {
case object Whitelist extends ListType
val whitelist: ListType = Whitelist
case object Blacklist extends ListType
val blacklist: ListType = Blacklist
val all: NonEmptyList[ListType] = NonEmptyList.of(Whitelist, Blacklist)
def fromString(name: String): Either[String, ListType] =
all.find(_.name.equalsIgnoreCase(name)).toRight(s"Unknown list type: $name")
def unsafeFromString(name: String): ListType =
fromString(name).fold(sys.error, identity)
implicit val jsonEncoder: Encoder[ListType] =
Encoder.encodeString.contramap(_.name)
implicit val jsonDecoder: Decoder[ListType] =
Decoder.decodeString.emap(fromString)
}
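Note: string and JSON forms both use the lowercase name, and parsing is case-insensitive; a quick sketch:

import io.circe.syntax._

assert(ListType.fromString("whitelist") == Right(ListType.Whitelist))
assert(ListType.unsafeFromString("BLACKLIST") == ListType.Blacklist)
assert(ListType.whitelist.asJson.noSpaces == "\"whitelist\"")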

@@ -87,7 +87,7 @@ object MetaProposal {
     }
   }
 
-  /** Merges candidates with same `IdRef' values and concatenates their
+  /** Merges candidates with same `IdRef` values and concatenates their
     * respective labels. The candidate order is preserved.
     */
   def flatten(s: NonEmptyList[Candidate]): NonEmptyList[Candidate] = {

@@ -45,6 +45,19 @@ case class MetaProposalList private (proposals: List[MetaProposal]) {
   def sortByWeights: MetaProposalList =
     change(_.sortByWeight)
 
+  def insertSecond(ml: MetaProposalList): MetaProposalList =
+    MetaProposalList.flatten0(
+      Seq(this, ml),
+      (map, next) =>
+        map.get(next.proposalType) match {
+          case Some(MetaProposal(mt, values)) =>
+            val cand = NonEmptyList(values.head, next.values.toList ++ values.tail)
+            map.updated(next.proposalType, MetaProposal(mt, MetaProposal.flatten(cand)))
+          case None =>
+            map.updated(next.proposalType, next)
+        }
+    )
 }
 
 object MetaProposalList {
@@ -74,20 +87,25 @@ object MetaProposalList {
    * is preserved and candidates of proposals are appended as given
    * by the order of the given `seq'.
    */
-  def flatten(ml: Seq[MetaProposalList]): MetaProposalList = {
-    val init: Map[MetaProposalType, MetaProposal] = Map.empty
-
-    def updateMap(
-        map: Map[MetaProposalType, MetaProposal],
-        mp: MetaProposal
-    ): Map[MetaProposalType, MetaProposal] =
-      map.get(mp.proposalType) match {
-        case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
-        case None      => map.updated(mp.proposalType, mp)
-      }
-
-    val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(updateMap))
+  def flatten(ml: Seq[MetaProposalList]): MetaProposalList =
+    flatten0(
+      ml,
+      (map, mp) =>
+        map.get(mp.proposalType) match {
+          case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
+          case None      => map.updated(mp.proposalType, mp)
+        }
+    )
+
+  private def flatten0(
+      ml: Seq[MetaProposalList],
+      merge: (
+          Map[MetaProposalType, MetaProposal],
+          MetaProposal
+      ) => Map[MetaProposalType, MetaProposal]
+  ): MetaProposalList = {
+    val init   = Map.empty[MetaProposalType, MetaProposal]
+    val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(merge))
     fromMap(merged)
   }

@@ -0,0 +1,25 @@
package docspell.common
sealed trait NlpMode { self: Product =>
def name: String =
self.productPrefix
}
object NlpMode {
case object Full extends NlpMode
case object Basic extends NlpMode
case object RegexOnly extends NlpMode
case object Disabled extends NlpMode
def fromString(name: String): Either[String, NlpMode] =
name.toLowerCase match {
case "full" => Right(Full)
case "basic" => Right(Basic)
case "regexonly" => Right(RegexOnly)
case "disabled" => Right(Disabled)
case _ => Left(s"Unknown nlp-mode: $name")
}
def unsafeFromString(name: String): NlpMode =
fromString(name).fold(sys.error, identity)
}
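Note: parsing lowercases its input first, so mixed case is accepted; a quick sketch:

assert(NlpMode.fromString("Full") == Right(NlpMode.Full))
assert(NlpMode.fromString("RegexOnly") == Right(NlpMode.RegexOnly))
assert(NlpMode.fromString("bogus") == Left("Unknown nlp-mode: bogus"))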

@@ -44,6 +44,9 @@ object Implicits {
   implicit val priorityReader: ConfigReader[Priority] =
     ConfigReader[String].emap(reason(Priority.fromString))
 
+  implicit val nlpModeReader: ConfigReader[NlpMode] =
+    ConfigReader[String].emap(reason(NlpMode.fromString))
+
   def reason[A: ClassTag](
       f: String => Either[String, A]
   ): String => Either[FailureReason, A] =

@@ -0,0 +1,20 @@
package docspell.common.syntax
import java.nio.file.Path
trait FileSyntax {
implicit final class PathOps(p: Path) {
def absolutePath: Path =
p.normalize().toAbsolutePath
def absolutePathAsString: String =
absolutePath.toString
def /(next: String): Path =
p.resolve(next)
}
}
object FileSyntax extends FileSyntax
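Note: a small sketch of the new path syntax; `/` resolves a child path and `absolutePathAsString` normalizes before rendering:

import java.nio.file.Paths
import docspell.common.syntax.FileSyntax._

// Chained resolution plus string rendering of the absolute path.
val model    = Paths.get("target") / "classifier" / "model.ser.gz"
val asString = model.absolutePathAsString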

@@ -2,6 +2,11 @@ package docspell.common
 
 package object syntax {
 
-  object all extends EitherSyntax with StreamSyntax with StringSyntax with LoggerSyntax
+  object all
+      extends EitherSyntax
+      with StreamSyntax
+      with StringSyntax
+      with LoggerSyntax
+      with FileSyntax
 }

@@ -68,4 +68,35 @@ object MetaProposalListTest extends SimpleTestSuite {
     assertEquals(candidates.head, cand1)
     assertEquals(candidates.tail.head, cand2)
   }
test("insert second") {
val cand1 = Candidate(IdRef(Ident.unsafe("123"), "name"), Set.empty)
val cand2 = Candidate(IdRef(Ident.unsafe("456"), "name"), Set.empty)
val cand3 = Candidate(IdRef(Ident.unsafe("789"), "name"), Set.empty)
val cand4 = Candidate(IdRef(Ident.unsafe("abc"), "name"), Set.empty)
val cand5 = Candidate(IdRef(Ident.unsafe("def"), "name"), Set.empty)
val mpl1 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3))
)
val mpl2 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand4)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand5))
)
val result = mpl1.insertSecond(mpl2)
assertEquals(
result,
MetaProposalList(
List(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand4, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3, cand5))
)
)
)
}
}

@@ -0,0 +1,13 @@
Pontremoli, 9 aprile 2013
Spettabile Villa Albicocca
Via Francigena, 9
55100 Pontetetto (LU)
Oggetto: Prenotazione
Gentile Direttore,
Vorrei prenotare una camera matrimoniale …….
In attesa di una Sua pronta risposta, La saluto cordialmente

View File

@ -1,5 +1,8 @@
package docspell.ftsclient package docspell.ftsclient
import cats.Functor
import cats.implicits._
import docspell.common._ import docspell.common._
final case class FtsMigration[F[_]]( final case class FtsMigration[F[_]](
@ -7,7 +10,13 @@ final case class FtsMigration[F[_]](
engine: Ident, engine: Ident,
description: String, description: String,
task: F[FtsMigration.Result] task: F[FtsMigration.Result]
) ) {
def changeResult(f: FtsMigration.Result => FtsMigration.Result)(implicit
F: Functor[F]
): FtsMigration[F] =
copy(task = task.map(f))
}
object FtsMigration { object FtsMigration {
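The new changeResult combinator is what reInitializeTasks (further below) uses to override each migration's outcome. A hedged sketch with cats-effect IO; the migration value itself is hypothetical:
import cats.effect.IO
import docspell.common.Ident
import docspell.ftsclient.FtsMigration
object ChangeResultDemo {
  val m: FtsMigration[IO] =
    FtsMigration[IO](
      7,
      Ident.unsafe("solr"),
      "Add content_it field",
      IO.pure(FtsMigration.Result.reIndexAll)
    )
  // force the outcome to workDone; the underlying effect is unchanged
  val silenced: FtsMigration[IO] = m.changeResult(_ => FtsMigration.Result.workDone)
}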

View File

@ -21,22 +21,19 @@ object Field {
val discriminator = Field("discriminator") val discriminator = Field("discriminator")
val attachmentName = Field("attachmentName") val attachmentName = Field("attachmentName")
val content = Field("content") val content = Field("content")
val content_de = Field("content_de") val content_de = contentField(Language.German)
val content_en = Field("content_en") val content_en = contentField(Language.English)
val content_fr = Field("content_fr") val content_fr = contentField(Language.French)
val itemName = Field("itemName") val itemName = Field("itemName")
val itemNotes = Field("itemNotes") val itemNotes = Field("itemNotes")
val folderId = Field("folder") val folderId = Field("folder")
val contentLangFields = Language.all
.map(contentField)
def contentField(lang: Language): Field = def contentField(lang: Language): Field =
lang match { if (lang == Language.Czech) Field(s"content_cz")
case Language.German => else Field(s"content_${lang.iso2}")
Field.content_de
case Language.English =>
Field.content_en
case Language.French =>
Field.content_fr
}
implicit val jsonEncoder: Encoder[Field] = implicit val jsonEncoder: Encoder[Field] =
Encoder.encodeString.contramap(_.name) Encoder.encodeString.contramap(_.name)
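The mapping now follows the language's iso2 code, with Czech special-cased to match Solr's text_cz type. A quick sketch of the expected values (assuming Field is a case class in the docspell.ftssolr package):
import docspell.common.Language
import docspell.ftssolr.Field
object FieldNamesDemo extends App {
  assert(Field.contentField(Language.Italian) == Field("content_it"))
  assert(Field.contentField(Language.German) == Field("content_de"))
  // the single exception:
  assert(Field.contentField(Language.Czech) == Field("content_cz"))
}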

View File

@ -37,13 +37,10 @@ object SolrQuery {
cfg, cfg,
List( List(
Field.content, Field.content,
Field.content_de,
Field.content_en,
Field.content_fr,
Field.itemName, Field.itemName,
Field.itemNotes, Field.itemNotes,
Field.attachmentName Field.attachmentName
), ) ++ Field.contentLangFields,
List( List(
Field.id, Field.id,
Field.itemId, Field.itemId,

View File

@ -56,21 +56,51 @@ object SolrSetup {
5, 5,
solrEngine, solrEngine,
"Add content_fr field", "Add content_fr field",
addContentFrField.map(_ => FtsMigration.Result.workDone) addContentField(Language.French).map(_ => FtsMigration.Result.workDone)
), ),
FtsMigration[F]( FtsMigration[F](
6, 6,
solrEngine, solrEngine,
"Index all from database", "Index all from database",
FtsMigration.Result.indexAll.pure[F] FtsMigration.Result.indexAll.pure[F]
),
FtsMigration[F](
7,
solrEngine,
"Add content_it field",
addContentField(Language.Italian).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
8,
solrEngine,
"Add content_es field",
addContentField(Language.Spanish).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
9,
solrEngine,
"Add more content fields",
addMoreContentFields.map(_ => FtsMigration.Result.reIndexAll)
) )
) )
def addFolderField: F[Unit] = def addFolderField: F[Unit] =
addStringField(Field.folderId) addStringField(Field.folderId)
def addContentFrField: F[Unit] = def addMoreContentFields: F[Unit] = {
addTextField(Some(Language.French))(Field.content_fr) val remain = List[Language](
Language.Norwegian,
Language.Romanian,
Language.Swedish,
Language.Finnish,
Language.Danish,
Language.Czech,
Language.Dutch,
Language.Portuguese,
Language.Russian
)
remain.traverse(addContentField).map(_ => ())
}
def setupCoreSchema: F[Unit] = { def setupCoreSchema: F[Unit] = {
val cmds0 = val cmds0 =
@ -90,13 +120,15 @@ object SolrSetup {
) )
.traverse(addTextField(None)) .traverse(addTextField(None))
val cntLang = Language.all.traverse { val cntLang = List(Language.German, Language.English, Language.French).traverse {
case l @ Language.German => case l @ Language.German =>
addTextField(l.some)(Field.content_de) addTextField(l.some)(Field.content_de)
case l @ Language.English => case l @ Language.English =>
addTextField(l.some)(Field.content_en) addTextField(l.some)(Field.content_en)
case l @ Language.French => case l @ Language.French =>
addTextField(l.some)(Field.content_fr) addTextField(l.some)(Field.content_fr)
case _ =>
().pure[F]
} }
cmds0 *> cmds1 *> cntLang *> ().pure[F] cmds0 *> cmds1 *> cntLang *> ().pure[F]
@ -111,20 +143,17 @@ object SolrSetup {
run(DeleteField.command(DeleteField(field))).attempt *> run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.string(field))) run(AddField.command(AddField.string(field)))
private def addContentField(lang: Language): F[Unit] =
addTextField(Some(lang))(Field.contentField(lang))
private def addTextField(lang: Option[Language])(field: Field): F[Unit] = private def addTextField(lang: Option[Language])(field: Field): F[Unit] =
lang match { lang match {
case None => case None =>
run(DeleteField.command(DeleteField(field))).attempt *> run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.text(field))) run(AddField.command(AddField.textGeneral(field)))
case Some(Language.German) => case Some(lang) =>
run(DeleteField.command(DeleteField(field))).attempt *> run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textDE(field))) run(AddField.command(AddField.textLang(field, lang)))
case Some(Language.English) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textEN(field)))
case Some(Language.French) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textFR(field)))
} }
} }
} }
@ -150,17 +179,12 @@ object SolrSetup {
def string(field: Field): AddField = def string(field: Field): AddField =
AddField(field, "string", true, true, false) AddField(field, "string", true, true, false)
def text(field: Field): AddField = def textGeneral(field: Field): AddField =
AddField(field, "text_general", true, true, false) AddField(field, "text_general", true, true, false)
def textDE(field: Field): AddField = def textLang(field: Field, lang: Language): AddField =
AddField(field, "text_de", true, true, false) if (lang == Language.Czech) AddField(field, s"text_cz", true, true, false)
else AddField(field, s"text_${lang.iso2}", true, true, false)
def textEN(field: Field): AddField =
AddField(field, "text_en", true, true, false)
def textFR(field: Field): AddField =
AddField(field, "text_fr", true, true, false)
} }
case class DeleteField(name: Field) case class DeleteField(name: Field)
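Analogous to contentField above, textLang derives the Solr field type from the iso2 code. An illustrative sketch (AddField is nested inside SolrSetup, so this is shown as comments only):
// AddField.textLang(Field("content_sv"), Language.Swedish)
//   == AddField(Field("content_sv"), "text_sv", true, true, false)
// AddField.textLang(Field("content_cz"), Language.Czech)
//   == AddField(Field("content_cz"), "text_cz", true, true, false)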

View File

@ -269,62 +269,101 @@ docspell.joex {
# All text to analyse must fit into RAM. A large document may take # All text to analyse must fit into RAM. A large document may take
# too much heap. Also, most important information is at the # too much heap. Also, most important information is at the
# beginning of a document, so in most cases the first two pages # beginning of a document, so in most cases the first two pages
# should suffice. Default is 10000, which is about 2-3 pages # should suffice. Default is 8000, which is about 2-3 pages (just
# (just a rough guess, of course). # a rough guess, of course).
max-length = 10000 max-length = 8000
# A working directory for the analyser to store temporary/working # A working directory for the analyser to store temporary/working
# files. # files.
working-dir = ${java.io.tmpdir}"/docspell-analysis" working-dir = ${java.io.tmpdir}"/docspell-analysis"
nlp {
# The mode for configuring NLP models:
#
# 1. full builds the complete pipeline
# 2. basic - builds only the ner annotator
# 3. regexonly - matches each entry in your address book via regexps
# 4. disabled - doesn't use any stanford-nlp feature
#
# The full and basic variants rely on pre-built language models
# that are available for only a few languages. Memory usage
# varies among the languages. So joex should run with at least
# -Xmx1400M when using mode=full.
#
# The basic variant does quite a good job for German and
# English. It might be worse for French, always depending on the
# type of text that is analysed. Joex should run with about 500M
# heap; here again, German uses the most.
#
# The regexonly variant doesn't depend on a language. It roughly
# works by converting all entries in your address book into
# regexps and matches each one against the text. This can get
# memory intensive, too, when the address book grows large. This
# is included in the full and basic modes by default, but can be
# used independently by setting mode=regexonly.
#
# When mode=disabled, the whole nlp pipeline is disabled and you
# won't get any suggestions, except those the classifier returns
# (if enabled).
mode = full
# The StanfordCoreNLP library caches language models which # The StanfordCoreNLP library caches language models which
# requires quite some amount of memory. Setting this interval to a # requires quite some amount of memory. Setting this interval to a
# positive duration, the cache is cleared after this amount of # positive duration, the cache is cleared after this amount of
# idle time. Set it to 0 to disable it if you have enough memory, # idle time. Set it to 0 to disable it if you have enough memory,
# processing will be faster. # processing will be faster.
clear-stanford-nlp-interval = "15 minutes" #
# This only has an effect if mode != disabled.
clear-interval = "15 minutes"
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
regex-ner { regex-ner {
# Whether to enable custom NER annotation. This uses the address # Whether to enable custom NER annotation. This uses the
# book of a collective as input for NER tagging (to automatically # address book of a collective as input for NER tagging (to
# find correspondent and concerned entities). If the address book # automatically find correspondent and concerned entities). If
# is large, this can be quite memory intensive and also makes text # the address book is large, this can be quite memory
# analysis slower. But it greatly improves accuracy. If this is # intensive and also makes text analysis much slower. But it
# false, NER tagging uses only statistical models (that also work # improves accuracy and can be used independently of the
# quite well). # language. If this is set to 0, it is effectively disabled
# and NER tagging uses only statistical models (that also work
# quite well, but are restricted to the languages mentioned
# above).
# #
# This setting might be moved to the collective settings in the # Note, this is only relevant if nlp.mode is not
# future. # "disabled".
enabled = true max-entries = 1000
# The NER annotation uses a file of patterns that is derived from # The NER annotation uses a file of patterns that is derived
# a collective's address book. This is how long this # from a collective's address book. This is how long
# file will be kept until a check for a state change is done. # this data will be kept until a check for a state change
# is done.
file-cache-time = "1 minute" file-cache-time = "1 minute"
} }
}
# Settings for doing document classification. # Settings for doing document classification.
# #
# This works by learning from existing documents. A collective can # This works by learning from existing documents. This requires a
# specify a tag category and the system will try to predict a tag # statistical model that is computed from all existing documents.
# from this category for new incoming documents. # This process is run periodically as configured by the
# # collective. It may require more memory, depending on the amount
# This requires a statistical model that is computed from all # of data.
# existing documents. This process is run periodically as
# configured by the collective. It may require a lot of memory,
# depending on the amount of data.
# #
# It utilises this NLP library: https://nlp.stanford.edu/. # It utilises this NLP library: https://nlp.stanford.edu/.
classification { classification {
# Whether to enable classification globally. Each collective can # Whether to enable classification globally. Each collective can
# decide to disable it. If it is disabled here, no collective # enable/disable auto-tagging. The classifier is also used for
# can use classification. # finding correspondents and concerned entities, if enabled
# here.
enabled = true enabled = true
# If concerned with memory consumption, this restricts the # If concerned with memory consumption, this restricts the
# number of items to consider. More are better for training. A # number of items to consider. More are better for training. A
# negative value or zero means no train on all items. # negative value or zero means to train on all items.
item-count = 0 item-count = 600
# These settings are used to configure the classifier. If # These settings are used to configure the classifier. If
# multiple are given, they are all tried and the "best" is # multiple are given, they are all tried and the "best" is
@ -477,13 +516,6 @@ docspell.joex {
} }
} }
# General config for processing documents
processing {
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
}
# The same section is also present in the rest-server config. It is # The same section is also present in the rest-server config. It is
# used when submitting files into the job queue for processing. # used when submitting files into the job queue for processing.
# #

View File

@ -5,7 +5,7 @@ import java.nio.file.Path
import cats.data.NonEmptyList import cats.data.NonEmptyList
import docspell.analysis.TextAnalysisConfig import docspell.analysis.TextAnalysisConfig
import docspell.analysis.nlp.TextClassifierConfig import docspell.analysis.classifier.TextClassifierConfig
import docspell.backend.Config.Files import docspell.backend.Config.Files
import docspell.common._ import docspell.common._
import docspell.convert.ConvertConfig import docspell.convert.ConvertConfig
@ -31,8 +31,7 @@ case class Config(
sendMail: MailSendConfig, sendMail: MailSendConfig,
files: Files, files: Files,
mailDebug: Boolean, mailDebug: Boolean,
fullTextSearch: Config.FullTextSearch, fullTextSearch: Config.FullTextSearch
processing: Config.Processing
) )
object Config { object Config {
@ -55,20 +54,17 @@ object Config {
final case class Migration(indexAllChunk: Int) final case class Migration(indexAllChunk: Int)
} }
case class Processing(maxDueDateYears: Int)
case class TextAnalysis( case class TextAnalysis(
maxLength: Int, maxLength: Int,
workingDir: Path, workingDir: Path,
clearStanfordNlpInterval: Duration, nlp: NlpConfig,
regexNer: RegexNer,
classification: Classification classification: Classification
) { ) {
def textAnalysisConfig: TextAnalysisConfig = def textAnalysisConfig: TextAnalysisConfig =
TextAnalysisConfig( TextAnalysisConfig(
maxLength, maxLength,
clearStanfordNlpInterval, TextAnalysisConfig.NlpConfig(nlp.clearInterval, nlp.mode),
TextClassifierConfig( TextClassifierConfig(
workingDir, workingDir,
NonEmptyList NonEmptyList
@ -78,14 +74,30 @@ object Config {
) )
def regexNerFileConfig: RegexNerFile.Config = def regexNerFileConfig: RegexNerFile.Config =
RegexNerFile.Config(regexNer.enabled, workingDir, regexNer.fileCacheTime) RegexNerFile.Config(
nlp.regexNer.maxEntries,
workingDir,
nlp.regexNer.fileCacheTime
)
} }
case class RegexNer(enabled: Boolean, fileCacheTime: Duration) case class NlpConfig(
mode: NlpMode,
clearInterval: Duration,
maxDueDateYears: Int,
regexNer: RegexNer
)
case class RegexNer(maxEntries: Int, fileCacheTime: Duration)
case class Classification( case class Classification(
enabled: Boolean, enabled: Boolean,
itemCount: Int, itemCount: Int,
classifiers: List[Map[String, String]] classifiers: List[Map[String, String]]
) ) {
def itemCountOrWhenLower(other: Int): Int =
if (itemCount <= 0 || (itemCount > other && other > 0)) other
else itemCount
}
} }
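The new itemCountOrWhenLower helper picks the effective training limit: non-positive values mean "no limit", otherwise the lower positive bound wins. A sketch of the behavior (field values are made up):
import docspell.joex.Config
object ItemCountDemo extends App {
  val c = Config.Classification(enabled = true, itemCount = 600, classifiers = Nil)
  assert(c.itemCountOrWhenLower(0) == 600)   // other <= 0: keep own count
  assert(c.itemCountOrWhenLower(200) == 200) // lower positive bound wins
  assert(c.copy(itemCount = 0).itemCountOrWhenLower(200) == 200) // 0 defers to other
}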

View File

@ -97,7 +97,7 @@ object JoexAppImpl {
upload <- OUpload(store, queue, cfg.files, joex) upload <- OUpload(store, queue, cfg.files, joex)
fts <- createFtsClient(cfg)(httpClient) fts <- createFtsClient(cfg)(httpClient)
itemOps <- OItem(store, fts, queue, joex) itemOps <- OItem(store, fts, queue, joex)
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig) analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig, blocker)
regexNer <- RegexNerFile(cfg.textAnalysis.regexNerFileConfig, blocker, store) regexNer <- RegexNerFile(cfg.textAnalysis.regexNerFileConfig, blocker, store)
javaEmil = javaEmil =
JavaMailEmil(blocker, Settings.defaultSettings.copy(debug = cfg.mailDebug)) JavaMailEmil(blocker, Settings.defaultSettings.copy(debug = cfg.mailDebug))
@ -169,7 +169,7 @@ object JoexAppImpl {
.withTask( .withTask(
JobTask.json( JobTask.json(
LearnClassifierArgs.taskName, LearnClassifierArgs.taskName,
LearnClassifierTask[F](cfg.textAnalysis, blocker, analyser), LearnClassifierTask[F](cfg.textAnalysis, analyser),
LearnClassifierTask.onCancel[F] LearnClassifierTask.onCancel[F]
) )
) )

View File

@ -29,7 +29,7 @@ trait RegexNerFile[F[_]] {
object RegexNerFile { object RegexNerFile {
private[this] val logger = getLogger private[this] val logger = getLogger
case class Config(enabled: Boolean, directory: Path, minTime: Duration) case class Config(maxEntries: Int, directory: Path, minTime: Duration)
def apply[F[_]: Concurrent: ContextShift]( def apply[F[_]: Concurrent: ContextShift](
cfg: Config, cfg: Config,
@ -49,7 +49,7 @@ object RegexNerFile {
) extends RegexNerFile[F] { ) extends RegexNerFile[F] {
def makeFile(collective: Ident): F[Option[Path]] = def makeFile(collective: Ident): F[Option[Path]] =
if (cfg.enabled) doMakeFile(collective) if (cfg.maxEntries > 0) doMakeFile(collective)
else (None: Option[Path]).pure[F] else (None: Option[Path]).pure[F]
def doMakeFile(collective: Ident): F[Option[Path]] = def doMakeFile(collective: Ident): F[Option[Path]] =
@ -127,7 +127,7 @@ object RegexNerFile {
for { for {
_ <- logger.finfo(s"Generating custom NER file for collective '${collective.id}'") _ <- logger.finfo(s"Generating custom NER file for collective '${collective.id}'")
names <- store.transact(QCollective.allNames(collective)) names <- store.transact(QCollective.allNames(collective, cfg.maxEntries))
nerFile = NerFile(collective, lastUpdate, now) nerFile = NerFile(collective, lastUpdate, now)
_ <- update(nerFile, NerFile.mkNerConfig(names)) _ <- update(nerFile, NerFile.mkNerConfig(names))
} yield nerFile } yield nerFile

View File

@ -14,16 +14,26 @@ object FtsWork {
def apply[F[_]](f: FtsContext[F] => F[Unit]): FtsWork[F] = def apply[F[_]](f: FtsContext[F] => F[Unit]): FtsWork[F] =
Kleisli(f) Kleisli(f)
def allInitializeTasks[F[_]: Monad]: FtsWork[F] = /** Runs all migration tasks unconditionally and inserts all data as last step. */
FtsWork[F](_ => ().pure[F]).tap[FtsContext[F]].flatMap { ctx => def reInitializeTasks[F[_]: Monad]: FtsWork[F] =
NonEmptyList.fromList(ctx.fts.initialize.map(fm => from[F](fm.task))) match { FtsWork { ctx =>
val migrations =
ctx.fts.initialize.map(fm => fm.changeResult(_ => FtsMigration.Result.workDone))
NonEmptyList.fromList(migrations) match {
case Some(nel) => case Some(nel) =>
nel.reduce(semigroup[F]) nel
.map(fm => from[F](fm.task))
.append(insertAll[F](None))
.reduce(semigroup[F])
.run(ctx)
case None => case None =>
FtsWork[F](_ => ().pure[F]) ().pure[F]
} }
} }
/** Creates an `FtsWork` from a single migration task, interpreting
* the task's result via `transformResult`.
*/
def from[F[_]: FlatMap: Applicative](t: F[FtsMigration.Result]): FtsWork[F] = def from[F[_]: FlatMap: Applicative](t: F[FtsMigration.Result]): FtsWork[F] =
Kleisli.liftF(t).flatMap(transformResult[F]) Kleisli.liftF(t).flatMap(transformResult[F])

View File

@ -11,6 +11,11 @@ import docspell.joex.Config
import docspell.store.records.RFtsMigration import docspell.store.records.RFtsMigration
import docspell.store.{AddResult, Store} import docspell.store.{AddResult, Store}
/** Migrating the index from the previous version to this version.
*
* The SQL database stores the outcome of a migration task. If this
* task has already been applied, it is skipped.
*/
case class Migration[F[_]]( case class Migration[F[_]](
version: Int, version: Int,
engine: Ident, engine: Ident,

View File

@ -46,6 +46,6 @@ object ReIndexTask {
FtsWork.log[F](_.info("Clearing data failed. Continue re-indexing.")) FtsWork.log[F](_.info("Clearing data failed. Continue re-indexing."))
) ++ ) ++
FtsWork.log[F](_.info("Running index initialize")) ++ FtsWork.log[F](_.info("Running index initialize")) ++
FtsWork.allInitializeTasks[F] FtsWork.reInitializeTasks[F]
}) })
} }

View File

@ -4,6 +4,9 @@ import cats.data.Kleisli
package object fts { package object fts {
/** Some work that must be done to advance the schema of the fulltext
* index.
*/
type FtsWork[F[_]] = Kleisli[F, FtsContext[F], Unit] type FtsWork[F[_]] = Kleisli[F, FtsContext[F], Unit]
} }

View File

@ -0,0 +1,66 @@
package docspell.joex.learn
import cats.data.NonEmptyList
import cats.implicits._
import docspell.common.Ident
import docspell.store.records.{RClassifierModel, RClassifierSetting}
import doobie._
final class ClassifierName(val name: String) extends AnyVal
object ClassifierName {
def apply(name: String): ClassifierName =
new ClassifierName(name)
private val categoryPrefix = "tagcategory-"
def tagCategory(cat: String): ClassifierName =
apply(s"${categoryPrefix}${cat}")
val concernedPerson: ClassifierName =
apply("concernedperson")
val concernedEquip: ClassifierName =
apply("concernedequip")
val correspondentOrg: ClassifierName =
apply("correspondentorg")
val correspondentPerson: ClassifierName =
apply("correspondentperson")
def findTagClassifiers[F[_]](coll: Ident): ConnectionIO[List[ClassifierName]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
} yield categories.map(tagCategory)
def findTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
models <- NonEmptyList.fromList(categories) match {
case Some(nel) =>
RClassifierModel.findAllByName(coll, nel.map(tagCategory).map(_.name))
case None =>
List.empty[RClassifierModel].pure[ConnectionIO]
}
} yield models
def findOrphanTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
cats <- RClassifierSetting.getActiveCategories(coll)
allModels = RClassifierModel.findAllByQuery(coll, s"${categoryPrefix}%")
result <- NonEmptyList.fromList(cats) match {
case Some(nel) =>
allModels.flatMap(all =>
RClassifierModel
.findAllByName(coll, nel.map(tagCategory).map(_.name))
.map(active => all.diff(active))
)
case None =>
allModels
}
} yield result
}
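Tag models are keyed by a prefixed name per category, which is also what findOrphanTagModels matches via its LIKE pattern; the entity models use fixed names. A tiny sketch:
import docspell.joex.learn.ClassifierName
object ClassifierNameDemo extends App {
  assert(ClassifierName.tagCategory("invoice").name == "tagcategory-invoice")
  assert(ClassifierName.correspondentOrg.name == "correspondentorg")
}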

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import java.nio.file.Path
import cats.data.OptionT
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.{ClassifierModel, TextClassifier}
import docspell.common._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.RangeDef
object Classify {
def apply[F[_]: Sync: ContextShift](
blocker: Blocker,
logger: Logger[F],
workingDir: Path,
store: Store[F],
classifier: TextClassifier[F],
coll: Ident,
text: String
)(cname: ClassifierName): F[Option[String]] =
(for {
_ <- OptionT.liftF(logger.info(s"Guessing label for ${cname.name}"))
model <- OptionT(store.transact(RClassifierModel.findByName(coll, cname.name)))
.flatTapNone(logger.debug("No classifier model found."))
modelData =
store.bitpeace
.get(model.fileId.id)
.unNoneTerminate
.through(store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, blocker))
.compile
.drain
.flatMap(_ => classifier.classify(logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
.flatTapNone(logger.debug("Guessed: <none>"))
_ <- OptionT.liftF(logger.debug(s"Guessed: ${cls}"))
} yield cls).value
}

View File

@ -1,26 +1,19 @@
package docspell.joex.learn package docspell.joex.learn
import cats.data.Kleisli
import cats.data.OptionT import cats.data.OptionT
import cats.effect._ import cats.effect._
import cats.implicits._ import cats.implicits._
import fs2.{Pipe, Stream}
import docspell.analysis.TextAnalyser import docspell.analysis.TextAnalyser
import docspell.analysis.nlp.ClassifierModel
import docspell.analysis.nlp.TextClassifier.Data
import docspell.backend.ops.OCollective import docspell.backend.ops.OCollective
import docspell.common._ import docspell.common._
import docspell.joex.Config import docspell.joex.Config
import docspell.joex.scheduler._ import docspell.joex.scheduler._
import docspell.store.queries.QItem import docspell.store.records.{RClassifierModel, RClassifierSetting}
import docspell.store.records.RClassifierSetting
import bitpeace.MimetypeHint
object LearnClassifierTask { object LearnClassifierTask {
val noClass = "__NONE__"
val pageSep = " --n-- " val pageSep = " --n-- "
val noClass = "__NONE__"
type Args = LearnClassifierArgs type Args = LearnClassifierArgs
@ -29,68 +22,72 @@ object LearnClassifierTask {
def apply[F[_]: Sync: ContextShift]( def apply[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis, cfg: Config.TextAnalysis,
blocker: Blocker, analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
learnTags(cfg, analyser)
.flatMap(_ => learnItemEntities(cfg, analyser))
.flatMap(_ => Task(_ => Sync[F].delay(System.gc())))
private def learnItemEntities[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F] analyser: TextAnalyser[F]
): Task[F, Args, Unit] = ): Task[F, Args, Unit] =
Task { ctx => Task { ctx =>
(for { if (cfg.classification.enabled)
LearnItemEntities
.learnAll(
analyser,
ctx.args.collective,
cfg.classification.itemCount,
cfg.maxLength
)
.run(ctx)
else ().pure[F]
}
private def learnTags[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
Task { ctx =>
val learnTags =
for {
sett <- findActiveSettings[F](ctx, cfg) sett <- findActiveSettings[F](ctx, cfg)
data = selectItems( maxItems = cfg.classification.itemCountOrWhenLower(sett.itemCount)
ctx,
math.min(cfg.classification.itemCount, sett.itemCount).toLong,
sett.category.getOrElse("")
)
_ <- OptionT.liftF( _ <- OptionT.liftF(
analyser LearnTags
.classifier(blocker) .learnAllTagCategories(analyser)(
.trainClassifier[Unit](ctx.logger, data)(Kleisli(handleModel(ctx, blocker))) ctx.args.collective,
maxItems,
cfg.maxLength
) )
} yield ()) .run(ctx)
.getOrElseF(logInactiveWarning(ctx.logger))
}
private def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
blocker: Blocker
)(trainedModel: ClassifierModel): F[Unit] =
for {
oldFile <- ctx.store.transact(
RClassifierSetting.findById(ctx.args.collective).map(_.flatMap(_.fileId))
) )
_ <- ctx.logger.info("Storing new trained model")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
ctx.store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- ctx.store.transact(
RClassifierSetting.updateFile(ctx.args.collective, Ident.unsafe(newFile.id))
)
_ <- ctx.logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
ctx.logger.debug(s"Deleting old model file ${fid.id}") *>
ctx.store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
} yield () } yield ()
// learn classifier models from active tag categories
private def selectItems[F[_]]( learnTags.getOrElseF(logInactiveWarning(ctx.logger)) *>
ctx: Context[F, Args], // delete classifier model files for categories that have been removed
max: Long, clearObsoleteTagModels(ctx) *>
category: String // when tags are deleted, categories may get removed. fix the json array
): Stream[F, Data] = { ctx.store
val connStream = .transact(RClassifierSetting.fixCategoryList(ctx.args.collective))
for { .map(_ => ())
item <- QItem.findAllNewesFirst(ctx.args.collective, 10).through(restrictTo(max))
tt <- Stream.eval(
QItem.resolveTextAndTag(ctx.args.collective, item, category, pageSep)
)
} yield Data(tt.tag.map(_.name).getOrElse(noClass), item.id, tt.text.trim)
ctx.store.transact(connStream.filter(_.text.nonEmpty))
} }
private def restrictTo[F[_], A](max: Long): Pipe[F, A, A] = private def clearObsoleteTagModels[F[_]: Sync](ctx: Context[F, Args]): F[Unit] =
if (max <= 0) identity for {
else _.take(max) list <- ctx.store.transact(
ClassifierName.findOrphanTagModels(ctx.args.collective)
)
_ <- ctx.logger.info(
s"Found ${list.size} obsolete model files that are deleted now."
)
n <- ctx.store.transact(RClassifierModel.deleteAll(list.map(_.id)))
_ <- list
.map(_.fileId.id)
.traverse(id => ctx.store.bitpeace.delete(id).compile.drain)
_ <- ctx.logger.debug(s"Deleted $n model files.")
} yield ()
private def findActiveSettings[F[_]: Sync]( private def findActiveSettings[F[_]: Sync](
ctx: Context[F, Args], ctx: Context[F, Args],
@ -98,14 +95,13 @@ object LearnClassifierTask {
): OptionT[F, OCollective.Classifier] = ): OptionT[F, OCollective.Classifier] =
if (cfg.classification.enabled) if (cfg.classification.enabled)
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.collective))) OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.collective)))
.filter(_.enabled) .filter(_.autoTagEnabled)
.filter(_.category.nonEmpty)
.map(OCollective.Classifier.fromRecord) .map(OCollective.Classifier.fromRecord)
else else
OptionT.none OptionT.none
private def logInactiveWarning[F[_]: Sync](logger: Logger[F]): F[Unit] = private def logInactiveWarning[F[_]: Sync](logger: Logger[F]): F[Unit] =
logger.warn( logger.warn(
"Classification is disabled. Check joex config and the collective settings." "Auto-tagging is disabled. Check joex config and the collective settings."
) )
} }

View File

@ -0,0 +1,79 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import fs2.Stream
import docspell.analysis.TextAnalyser
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler._
object LearnItemEntities {
def learnAll[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learnCorrOrg(analyser, collective, maxItems, maxTextLen)
.flatMap(_ => learnCorrPerson[F, A](analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcPerson(analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcEquip(analyser, collective, maxItems, maxTextLen))
def learnCorrOrg[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentOrg,
ctx => SelectItems.forCorrOrg(ctx.store, collective, maxItems, maxTextLen)
)
def learnCorrPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentPerson,
ctx => SelectItems.forCorrPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedPerson,
ctx => SelectItems.forConcPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcEquip[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedEquip,
ctx => SelectItems.forConcEquip(ctx.store, collective, maxItems, maxTextLen)
)
private def learn[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident
)(cname: ClassifierName, data: Context[F, _] => Stream[F, Data]): Task[F, A, Unit] =
Task { ctx =>
ctx.logger.info(s"Learn classifier ${cname.name}") *>
analyser.classifier.trainClassifier(ctx.logger, data(ctx))(
Kleisli(StoreClassifierModel.handleModel(ctx, collective, cname))
)
}
}

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import docspell.analysis.TextAnalyser
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.records.RClassifierSetting
object LearnTags {
def learnTagCategory[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
)(
category: String
): Task[F, A, Unit] =
Task { ctx =>
val data = SelectItems.forCategory(ctx, collective)(maxItems, category, maxTextLen)
ctx.logger.info(s"Learn classifier for tag category: $category") *>
analyser.classifier.trainClassifier(ctx.logger, data)(
Kleisli(
StoreClassifierModel.handleModel(
ctx,
collective,
ClassifierName.tagCategory(category)
)
)
)
}
def learnAllTagCategories[F[_]: Sync: ContextShift, A](analyser: TextAnalyser[F])(
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
Task { ctx =>
for {
cats <- ctx.store.transact(RClassifierSetting.getActiveCategories(collective))
task = learnTagCategory[F, A](analyser, collective, maxItems, maxTextLen) _
_ <- cats.map(task).traverse(_.run(ctx))
} yield ()
}
}

View File

@ -0,0 +1,109 @@
package docspell.joex.learn
import fs2.{Pipe, Stream}
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler.Context
import docspell.store.Store
import docspell.store.qb.Batch
import docspell.store.queries.{QItem, TextAndTag}
import doobie._
object SelectItems {
val pageSep = LearnClassifierTask.pageSep
val noClass = LearnClassifierTask.noClass
def forCategory[F[_]](ctx: Context[F, _], collective: Ident)(
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] =
forCategory(ctx.store, collective, maxItems, category, maxTextLen)
def forCategory[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndTag(collective, item, category, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrOrg[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrOrg(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcEquip[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcEquip(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
private def allItems(collective: Ident, max: Int): Stream[ConnectionIO, Ident] = {
val limit = if (max <= 0) Batch.all else Batch.limit(max)
QItem.findAllNewesFirst(collective, 10, limit)
}
private def mkData[F[_]]: Pipe[F, TextAndTag, Data] =
_.map(tt => Data(tt.tag.map(_.name).getOrElse(noClass), tt.itemId.id, tt.text.trim))
.filter(_.text.nonEmpty)
}

View File

@ -0,0 +1,53 @@
package docspell.joex.learn
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.ClassifierModel
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.MimetypeHint
object StoreClassifierModel {
def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, _],
collective: Ident,
modelName: ClassifierName
)(
trainedModel: ClassifierModel
): F[Unit] =
handleModel(ctx.store, ctx.blocker, ctx.logger)(collective, modelName, trainedModel)
def handleModel[F[_]: Sync: ContextShift](
store: Store[F],
blocker: Blocker,
logger: Logger[F]
)(
collective: Ident,
modelName: ClassifierName,
trainedModel: ClassifierModel
): F[Unit] =
for {
oldFile <- store.transact(
RClassifierModel.findByName(collective, modelName.name).map(_.map(_.fileId))
)
_ <- logger.debug(s"Storing new trained model for: ${modelName.name}")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- store.transact(
RClassifierModel.updateFile(collective, modelName.name, Ident.unsafe(newFile.id))
)
_ <- logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
logger.debug(s"Deleting old model file ${fid.id}") *>
store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
} yield ()
}

View File

@ -78,7 +78,14 @@ object AttachmentPageCount {
s"No attachmentmeta record exists for ${ra.id.id}. Creating new." s"No attachmentmeta record exists for ${ra.id.id}. Creating new."
) *> ctx.store.transact( ) *> ctx.store.transact(
RAttachmentMeta.insert( RAttachmentMeta.insert(
RAttachmentMeta(ra.id, None, Nil, MetaProposalList.empty, md.pageCount.some) RAttachmentMeta(
ra.id,
None,
Nil,
MetaProposalList.empty,
md.pageCount.some,
None
)
) )
) )
else 0.pure[F] else 0.pure[F]

View File

@ -108,7 +108,18 @@ object ConvertPdf {
ctx.logger.info(s"Conversion to pdf+txt successful. Saving file.") *> ctx.logger.info(s"Conversion to pdf+txt successful. Saving file.") *>
storePDF(ctx, cfg, ra, pdf) storePDF(ctx, cfg, ra, pdf)
.flatMap(r => .flatMap(r =>
txt.map(t => (r, item.changeMeta(ra.id, _.setContentIfEmpty(t.some)).some)) txt.map(t =>
(
r,
item
.changeMeta(
ra.id,
ctx.args.meta.language,
_.setContentIfEmpty(t.some)
)
.some
)
)
) )
case ConversionResult.UnsupportedFormat(mt) => case ConversionResult.UnsupportedFormat(mt) =>

View File

@ -107,6 +107,8 @@ object CreateItem {
Vector.empty, Vector.empty,
fm.map(a => a.id -> a.fileId).toMap, fm.map(a => a.id -> a.fileId).toMap,
MetaProposalList.empty, MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil Nil
) )
} }
@ -166,6 +168,8 @@ object CreateItem {
Vector.empty, Vector.empty,
origMap, origMap,
MetaProposalList.empty, MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil Nil
) )
) )

View File

@ -42,7 +42,7 @@ object ExtractArchive {
archive: Option[RAttachmentArchive] archive: Option[RAttachmentArchive]
): Task[F, ProcessItemArgs, (Option[RAttachmentArchive], ItemData)] = ): Task[F, ProcessItemArgs, (Option[RAttachmentArchive], ItemData)] =
singlePass(item, archive).flatMap { t => singlePass(item, archive).flatMap { t =>
if (t._1 == None) Task.pure(t) if (t._1.isEmpty) Task.pure(t)
else multiPass(t._2, t._1) else multiPass(t._2, t._1)
} }

View File

@ -17,24 +17,92 @@ import docspell.store.records._
* by looking up values from NER in the user's address book. * by looking up values from NER in the user's address book.
*/ */
object FindProposal { object FindProposal {
type Args = ProcessItemArgs
def apply[F[_]: Sync]( def apply[F[_]: Sync](
cfg: Config.Processing cfg: Config.TextAnalysis
)(data: ItemData): Task[F, ProcessItemArgs, ItemData] = )(data: ItemData): Task[F, Args, ItemData] =
Task { ctx => Task { ctx =>
val rmas = data.metas.map(rm => rm.copy(nerlabels = removeDuplicates(rm.nerlabels))) val rmas = data.metas.map(rm => rm.copy(nerlabels = removeDuplicates(rm.nerlabels)))
for {
ctx.logger.info("Starting find-proposal") *> _ <- ctx.logger.info("Starting find-proposal")
rmas rmv <- rmas
.traverse(rm => .traverse(rm =>
processAttachment(cfg, rm, data.findDates(rm), ctx) processAttachment(cfg, rm, data.findDates(rm), ctx)
.map(ml => rm.copy(proposals = ml)) .map(ml => rm.copy(proposals = ml))
) )
.map(rmv => data.copy(metas = rmv)) clp <- lookupClassifierProposals(ctx, data.classifyProposals)
} yield data.copy(metas = rmv, classifyProposals = clp)
}
def lookupClassifierProposals[F[_]: Sync](
ctx: Context[F, Args],
mpList: MetaProposalList
): F[MetaProposalList] = {
val coll = ctx.args.meta.collective
def lookup(mp: MetaProposal): F[Option[IdRef]] =
mp.proposalType match {
case MetaProposalType.CorrOrg =>
ctx.store
.transact(
ROrganization
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier organization for $mp: $oref")
)
case MetaProposalType.CorrPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, false)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier corr-person for $mp: $oref")
)
case MetaProposalType.ConcPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, true)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-person for $mp: $oref")
)
case MetaProposalType.ConcEquip =>
ctx.store
.transact(
REquipment
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-equip for $mp: $oref")
)
case MetaProposalType.DocDate =>
(None: Option[IdRef]).pure[F]
case MetaProposalType.DueDate =>
(None: Option[IdRef]).pure[F]
}
def updateRef(mp: MetaProposal)(idRef: Option[IdRef]): Option[MetaProposal] =
idRef // this proposal contains a single value only, since it comes from the classifier
.map(ref => mp.copy(values = mp.values.map(_.copy(ref = ref))))
ctx.logger.debug(s"Looking up classifier results: ${mpList.proposals}") *>
mpList.proposals
.traverse(mp => lookup(mp).map(updateRef(mp)))
.map(_.flatten)
.map(MetaProposalList.apply)
} }
def processAttachment[F[_]: Sync]( def processAttachment[F[_]: Sync](
cfg: Config.Processing, cfg: Config.TextAnalysis,
rm: RAttachmentMeta, rm: RAttachmentMeta,
rd: Vector[NerDateLabel], rd: Vector[NerDateLabel],
ctx: Context[F, ProcessItemArgs] ctx: Context[F, ProcessItemArgs]
@ -46,11 +114,11 @@ object FindProposal {
} }
def makeDateProposal[F[_]: Sync]( def makeDateProposal[F[_]: Sync](
cfg: Config.Processing, cfg: Config.TextAnalysis,
dates: Vector[NerDateLabel] dates: Vector[NerDateLabel]
): F[MetaProposalList] = ): F[MetaProposalList] =
Timestamp.current[F].map { now => Timestamp.current[F].map { now =>
val maxFuture = now.plus(Duration.years(cfg.maxDueDateYears.toLong)) val maxFuture = now.plus(Duration.years(cfg.nlp.maxDueDateYears.toLong))
val latestFirst = dates val latestFirst = dates
.filter(_.date.isBefore(maxFuture.toUtcDate)) .filter(_.date.isBefore(maxFuture.toUtcDate))
.sortWith((l1, l2) => l1.date.isAfter(l2.date)) .sortWith((l1, l2) => l1.date.isAfter(l2.date))

View File

@ -15,6 +15,9 @@ import docspell.store.records.{RAttachment, RAttachmentMeta, RItem}
* containng the source or origin file * containng the source or origin file
* @param givenMeta meta data to this item that was not "guessed" * @param givenMeta meta data to this item that was not "guessed"
* from an attachment but given and thus is always correct * from an attachment but given and thus is always correct
* @param classifyProposals proposals that were obtained from a
* trained classifier. There are no NER tags; the classifier only
* provides a single label
*/ */
case class ItemData( case class ItemData(
item: RItem, item: RItem,
@ -23,7 +26,11 @@ case class ItemData(
dateLabels: Vector[AttachmentDates], dateLabels: Vector[AttachmentDates],
originFile: Map[Ident, Ident], // maps RAttachment.id -> FileMeta.id originFile: Map[Ident, Ident], // maps RAttachment.id -> FileMeta.id
givenMeta: MetaProposalList, // given meta data not associated to a specific attachment givenMeta: MetaProposalList, // given meta data not associated to a specific attachment
tags: List[String] // a list of tags (names or ids) attached to the item if they exist // a list of tags (names or ids) attached to the item if they exist
tags: List[String],
// proposals obtained from the classifier
classifyProposals: MetaProposalList,
classifyTags: List[String]
) { ) {
def findMeta(attachId: Ident): Option[RAttachmentMeta] = def findMeta(attachId: Ident): Option[RAttachmentMeta] =
@ -32,8 +39,12 @@ case class ItemData(
def findDates(rm: RAttachmentMeta): Vector[NerDateLabel] = def findDates(rm: RAttachmentMeta): Vector[NerDateLabel] =
dateLabels.find(m => m.rm.id == rm.id).map(_.dates).getOrElse(Vector.empty) dateLabels.find(m => m.rm.id == rm.id).map(_.dates).getOrElse(Vector.empty)
def mapMeta(attachId: Ident, f: RAttachmentMeta => RAttachmentMeta): ItemData = { def mapMeta(
val item = changeMeta(attachId, f) attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta
): ItemData = {
val item = changeMeta(attachId, lang, f)
val next = metas.map(a => if (a.id == attachId) item else a) val next = metas.map(a => if (a.id == attachId) item else a)
copy(metas = next) copy(metas = next)
} }
@ -43,13 +54,14 @@ case class ItemData(
def changeMeta( def changeMeta(
attachId: Ident, attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta f: RAttachmentMeta => RAttachmentMeta
): RAttachmentMeta = ): RAttachmentMeta =
f(findOrCreate(attachId)) f(findOrCreate(attachId, lang))
def findOrCreate(attachId: Ident): RAttachmentMeta = def findOrCreate(attachId: Ident, lang: Language): RAttachmentMeta =
metas.find(_.id == attachId).getOrElse { metas.find(_.id == attachId).getOrElse {
RAttachmentMeta.empty(attachId) RAttachmentMeta.empty(attachId, lang)
} }
} }

View File

@ -24,6 +24,7 @@ object LinkProposal {
.flatten(data.metas.map(_.proposals)) .flatten(data.metas.map(_.proposals))
.filter(_.proposalType != MetaProposalType.DocDate) .filter(_.proposalType != MetaProposalType.DocDate)
.sortByWeights .sortByWeights
.fillEmptyFrom(data.classifyProposals)
ctx.logger.info(s"Starting linking proposals") *> ctx.logger.info(s"Starting linking proposals") *>
MetaProposalType.all MetaProposalType.all

View File

@ -41,7 +41,7 @@ object ProcessItem {
regexNer: RegexNerFile[F] regexNer: RegexNerFile[F]
)(item: ItemData): Task[F, ProcessItemArgs, ItemData] = )(item: ItemData): Task[F, ProcessItemArgs, ItemData] =
TextAnalysis[F](cfg.textAnalysis, analyser, regexNer)(item) TextAnalysis[F](cfg.textAnalysis, analyser, regexNer)(item)
.flatMap(FindProposal[F](cfg.processing)) .flatMap(FindProposal[F](cfg.textAnalysis))
.flatMap(EvalProposals[F]) .flatMap(EvalProposals[F])
.flatMap(SaveProposals[F]) .flatMap(SaveProposals[F])

View File

@ -65,6 +65,8 @@ object ReProcessItem {
Vector.empty, Vector.empty,
asrcMap.view.mapValues(_.fileId).toMap, asrcMap.view.mapValues(_.fileId).toMap,
MetaProposalList.empty, MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil Nil
)).getOrElseF( )).getOrElseF(
Sync[F].raiseError(new Exception(s"Item not found: ${ctx.args.itemId.id}")) Sync[F].raiseError(new Exception(s"Item not found: ${ctx.args.itemId.id}"))

View File

@ -4,21 +4,51 @@ import cats.effect.Sync
import cats.implicits._ import cats.implicits._
import docspell.common._ import docspell.common._
import docspell.joex.scheduler.Task import docspell.joex.scheduler.{Context, Task}
import docspell.store.AddResult
import docspell.store.records._ import docspell.store.records._
/** Saves the proposals in the database /** Saves the proposals in the database
*/ */
object SaveProposals { object SaveProposals {
type Args = ProcessItemArgs
def apply[F[_]: Sync](data: ItemData): Task[F, ProcessItemArgs, ItemData] = def apply[F[_]: Sync](data: ItemData): Task[F, Args, ItemData] =
Task { ctx => Task { ctx =>
ctx.logger.info("Storing proposals") *> for {
data.metas _ <- ctx.logger.info("Storing proposals")
_ <- data.metas
.traverse(rm => .traverse(rm =>
ctx.logger.debug(s"Storing attachment proposals: ${rm.proposals}") *> ctx.logger.debug(
ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals)) s"Storing attachment proposals: ${rm.proposals}"
) *> ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
) )
.map(_ => data) _ <-
if (data.classifyProposals.isEmpty && data.classifyTags.isEmpty) 0.pure[F]
else saveItemProposal(ctx, data)
} yield data
}
def saveItemProposal[F[_]: Sync](ctx: Context[F, Args], data: ItemData): F[Unit] = {
def upsert(v: RItemProposal): F[Int] =
ctx.store.add(RItemProposal.insert(v), RItemProposal.exists(v.itemId)).flatMap {
case AddResult.Success => 1.pure[F]
case AddResult.EntityExists(_) =>
ctx.store.transact(RItemProposal.update(v))
case AddResult.Failure(ex) =>
ctx.logger.warn(s"Could not store item proposals: ${ex.getMessage}") *> 0
.pure[F]
}
for {
_ <- ctx.logger.debug(s"Storing classifier proposals: ${data.classifyProposals}")
tags <- ctx.store.transact(
RTag.findAllByNameOrId(data.classifyTags, ctx.args.meta.collective)
)
tagRefs = tags.map(t => IdRef(t.tagId, t.name))
now <- Timestamp.current[F]
value = RItemProposal(data.item.id, data.classifyProposals, tagRefs.toList, now)
_ <- upsert(value)
} yield ()
} }
} }

View File

@ -45,7 +45,8 @@ object SetGivenData {
Task { ctx => Task { ctx =>
val itemId = data.item.id val itemId = data.item.id
val collective = ctx.args.meta.collective val collective = ctx.args.meta.collective
val tags = (ctx.args.meta.tags.getOrElse(Nil) ++ data.tags).distinct val tags =
(ctx.args.meta.tags.getOrElse(Nil) ++ data.tags ++ data.classifyTags).distinct
for { for {
_ <- ctx.logger.info(s"Set tags from given data: ${tags}") _ <- ctx.logger.info(s"Set tags from given data: ${tags}")
e <- ops.linkTags(itemId, tags, collective).attempt e <- ops.linkTags(itemId, tags, collective).attempt

View File

@ -1,24 +1,20 @@
package docspell.joex.process package docspell.joex.process
import cats.data.OptionT import cats.Traverse
import cats.effect._ import cats.effect._
import cats.implicits._ import cats.implicits._
import docspell.analysis.TextAnalyser import docspell.analysis.classifier.TextClassifier
import docspell.analysis.nlp.ClassifierModel import docspell.analysis.{NlpSettings, TextAnalyser}
import docspell.analysis.nlp.StanfordNerSettings import docspell.common.MetaProposal.Candidate
import docspell.analysis.nlp.TextClassifier
import docspell.common._ import docspell.common._
import docspell.joex.Config import docspell.joex.Config
import docspell.joex.analysis.RegexNerFile import docspell.joex.analysis.RegexNerFile
import docspell.joex.learn.LearnClassifierTask import docspell.joex.learn.{ClassifierName, Classify, LearnClassifierTask}
import docspell.joex.process.ItemData.AttachmentDates import docspell.joex.process.ItemData.AttachmentDates
import docspell.joex.scheduler.Context import docspell.joex.scheduler.Context
import docspell.joex.scheduler.Task import docspell.joex.scheduler.Task
import docspell.store.records.RAttachmentMeta import docspell.store.records.{RAttachmentMeta, RClassifierSetting}
import docspell.store.records.RClassifierSetting
import bitpeace.RangeDef
object TextAnalysis { object TextAnalysis {
type Args = ProcessItemArgs type Args = ProcessItemArgs
@ -41,13 +37,27 @@ object TextAnalysis {
_ <- t.traverse(m => _ <- t.traverse(m =>
ctx.store.transact(RAttachmentMeta.updateLabels(m._1.id, m._1.nerlabels)) ctx.store.transact(RAttachmentMeta.updateLabels(m._1.id, m._1.nerlabels))
) )
v = t.toVector
autoTagEnabled <- getActiveAutoTag(ctx, cfg)
tag <-
if (autoTagEnabled) predictTags(ctx, cfg, item.metas, analyser.classifier)
else List.empty[String].pure[F]
classProposals <-
if (cfg.classification.enabled)
predictItemEntities(ctx, cfg, item.metas, analyser.classifier)
else MetaProposalList.empty.pure[F]
e <- s e <- s
_ <- ctx.logger.info(s"Text-Analysis finished in ${e.formatExact}") _ <- ctx.logger.info(s"Text-Analysis finished in ${e.formatExact}")
v = t.toVector
tag <- predictTag(ctx, cfg, item.metas, analyser.classifier(ctx.blocker)).value
} yield item } yield item
.copy(metas = v.map(_._1), dateLabels = v.map(_._2)) .copy(
.appendTags(tag.toSeq) metas = v.map(_._1),
dateLabels = v.map(_._2),
classifyProposals = classProposals,
classifyTags = tag
)
} }
def annotateAttachment[F[_]: Sync]( def annotateAttachment[F[_]: Sync](
@ -55,7 +65,7 @@ object TextAnalysis {
analyser: TextAnalyser[F], analyser: TextAnalyser[F],
nerFile: RegexNerFile[F] nerFile: RegexNerFile[F]
)(rm: RAttachmentMeta): F[(RAttachmentMeta, AttachmentDates)] = { )(rm: RAttachmentMeta): F[(RAttachmentMeta, AttachmentDates)] = {
val settings = StanfordNerSettings(ctx.args.meta.language, false, None) val settings = NlpSettings(ctx.args.meta.language, false, None)
for { for {
customNer <- nerFile.makeFile(ctx.args.meta.collective) customNer <- nerFile.makeFile(ctx.args.meta.collective)
sett = settings.copy(regexNer = customNer) sett = settings.copy(regexNer = customNer)
@ -68,44 +78,84 @@ object TextAnalysis {
} yield (rm.copy(nerlabels = labels.all.toList), AttachmentDates(rm, labels.dates)) } yield (rm.copy(nerlabels = labels.all.toList), AttachmentDates(rm, labels.dates))
} }
def predictTag[F[_]: Sync: ContextShift]( def predictTags[F[_]: Sync: ContextShift](
ctx: Context[F, Args], ctx: Context[F, Args],
cfg: Config.TextAnalysis, cfg: Config.TextAnalysis,
metas: Vector[RAttachmentMeta], metas: Vector[RAttachmentMeta],
classifier: TextClassifier[F] classifier: TextClassifier[F]
): OptionT[F, String] = ): F[List[String]] = {
for { val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
model <- findActiveModel(ctx, cfg) val classifyWith: ClassifierName => F[Option[String]] =
_ <- OptionT.liftF(ctx.logger.info(s"Guessing tag …")) makeClassify(ctx, cfg, classifier)(text)
text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
modelData =
ctx.store.bitpeace
.get(model.id)
.unNoneTerminate
.through(ctx.store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(cfg.workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, ctx.blocker))
.compile
.drain
.flatMap(_ => classifier.classify(ctx.logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
_ <- OptionT.liftF(ctx.logger.debug(s"Guessed tag: ${cls}"))
} yield cls
private def findActiveModel[F[_]: Sync]( for {
names <- ctx.store.transact(
ClassifierName.findTagClassifiers(ctx.args.meta.collective)
)
_ <- ctx.logger.debug(s"Guessing tags for ${names.size} categories")
tags <- names.traverse(classifyWith)
} yield tags.flatten
}
def predictItemEntities[F[_]: Sync: ContextShift](
ctx: Context[F, Args], ctx: Context[F, Args],
cfg: Config.TextAnalysis cfg: Config.TextAnalysis,
): OptionT[F, Ident] = metas: Vector[RAttachmentMeta],
(if (cfg.classification.enabled) classifier: TextClassifier[F]
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.meta.collective))) ): F[MetaProposalList] = {
.filter(_.enabled) val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
.mapFilter(_.fileId)
else def classifyWith(
OptionT.none[F, Ident]).orElse( cname: ClassifierName,
OptionT.liftF(ctx.logger.info("Classification is disabled.")) *> OptionT mtype: MetaProposalType
.none[F, Ident] ): F[Option[MetaProposal]] =
for {
label <- makeClassify(ctx, cfg, classifier)(text).apply(cname)
} yield label.map(str =>
MetaProposal(mtype, Candidate(IdRef(Ident.unsafe(""), str), Set.empty))
) )
Traverse[List]
.sequence(
List(
classifyWith(ClassifierName.correspondentOrg, MetaProposalType.CorrOrg),
classifyWith(ClassifierName.correspondentPerson, MetaProposalType.CorrPerson),
classifyWith(ClassifierName.concernedPerson, MetaProposalType.ConcPerson),
classifyWith(ClassifierName.concernedEquip, MetaProposalType.ConcEquip)
)
)
.map(_.flatten)
.map(MetaProposalList.apply)
}
private def makeClassify[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis,
classifier: TextClassifier[F]
)(text: String): ClassifierName => F[Option[String]] =
Classify[F](
ctx.blocker,
ctx.logger,
cfg.workingDir,
ctx.store,
classifier,
ctx.args.meta.collective,
text
)
private def getActiveAutoTag[F[_]: Sync](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): F[Boolean] =
if (cfg.classification.enabled)
ctx.store
.transact(RClassifierSetting.findById(ctx.args.meta.collective))
.map(_.exists(_.autoTagEnabled))
.flatTap(enabled =>
if (enabled) ().pure[F]
else ctx.logger.info("Classification is disabled. Check config or settings.")
)
else
ctx.logger.info("Classification is disabled.") *> false.pure[F]
} }
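For orientation, the prediction flow above boils down to the following (sketch in comments; names as defined in this file):
// val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
// predictTags:         one makeClassify call per active tag category
// predictItemEntities: one makeClassify call per fixed ClassifierName,
//                      each label wrapped into a MetaProposal with an
//                      empty Ident; FindProposal.lookupClassifierProposals
//                      later swaps the empty ref for the real entity id.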

View File

@ -46,10 +46,14 @@ object TextExtraction {
) )
_ <- fts.indexData(ctx.logger, (idxItem +: txt.map(_.td)).toSeq: _*) _ <- fts.indexData(ctx.logger, (idxItem +: txt.map(_.td)).toSeq: _*)
dur <- start dur <- start
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}") extractedTags = txt.flatMap(_.tags).distinct.toList
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}.")
_ <-
if (extractedTags.isEmpty) ().pure[F]
else ctx.logger.debug(s"Found tags in file: $extractedTags")
} yield item } yield item
.copy(metas = txt.map(_.am)) .copy(metas = txt.map(_.am))
.appendTags(txt.flatMap(_.tags).distinct.toList) .appendTags(extractedTags)
} }
// -- helpers // -- helpers
@ -78,7 +82,7 @@ object TextExtraction {
pair._2 pair._2
) )
val rm = item.findOrCreate(ra.id) val rm = item.findOrCreate(ra.id, lang)
rm.content match { rm.content match {
case Some(_) => case Some(_) =>
ctx.logger.info("TextExtraction skipped, since text is already available.") *> ctx.logger.info("TextExtraction skipped, since text is already available.") *>
@ -102,6 +106,7 @@ object TextExtraction {
res <- extractTextFallback(ctx, cfg, ra, lang)(fids) res <- extractTextFallback(ctx, cfg, ra, lang)(fids)
meta = item.changeMeta( meta = item.changeMeta(
ra.id, ra.id,
lang,
rm => rm =>
rm.setContentIfEmpty( rm.setContentIfEmpty(
res.map(_.appendPdfMetaToText.text.trim).filter(_.nonEmpty) res.map(_.appendPdfMetaToText.text.trim).filter(_.nonEmpty)

View File

@ -9,7 +9,7 @@ servers:
description: Current host description: Current host
paths: paths:
/api/info: /api/info/version:
get: get:
tags: [ Api Info ] tags: [ Api Info ]
summary: Get basic information about this software. summary: Get basic information about this software.

View File

@ -4850,14 +4850,11 @@ components:
description: | description: |
Settings for learning a document classifier. Settings for learning a document classifier.
required: required:
- enabled
- schedule - schedule
- itemCount - itemCount
- categoryList
- listType
properties: properties:
enabled:
type: boolean
category:
type: string
itemCount: itemCount:
type: integer type: integer
format: int32 format: int32
@ -4867,6 +4864,16 @@ components:
schedule: schedule:
type: string type: string
format: calevent format: calevent
categoryList:
type: array
items:
type: string
listType:
type: string
format: listtype
enum:
- blacklist
- whitelist
SourceList: SourceList:
description: | description: |

View File

@ -6,7 +6,7 @@ import cats.implicits._
import docspell.backend.BackendApp import docspell.backend.BackendApp
import docspell.backend.auth.AuthToken import docspell.backend.auth.AuthToken
import docspell.backend.ops.OCollective import docspell.backend.ops.OCollective
import docspell.common.MakePreviewArgs import docspell.common.{ListType, MakePreviewArgs}
import docspell.restapi.model._ import docspell.restapi.model._
import docspell.restserver.conv.Conversions import docspell.restserver.conv.Conversions
import docspell.restserver.http4s._ import docspell.restserver.http4s._
@ -44,10 +44,10 @@ object CollectiveRoutes {
settings.integrationEnabled, settings.integrationEnabled,
Some( Some(
OCollective.Classifier( OCollective.Classifier(
settings.classifier.enabled,
settings.classifier.schedule, settings.classifier.schedule,
settings.classifier.itemCount, settings.classifier.itemCount,
settings.classifier.category settings.classifier.categoryList,
settings.classifier.listType
) )
) )
) )
@ -65,12 +65,12 @@ object CollectiveRoutes {
c.language, c.language,
c.integrationEnabled, c.integrationEnabled,
ClassifierSetting( ClassifierSetting(
c.classifier.map(_.enabled).getOrElse(false),
c.classifier.flatMap(_.category),
c.classifier.map(_.itemCount).getOrElse(0), c.classifier.map(_.itemCount).getOrElse(0),
c.classifier c.classifier
.map(_.schedule) .map(_.schedule)
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00")) .getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00")),
c.classifier.map(_.categories).getOrElse(Nil),
c.classifier.map(_.listType).getOrElse(ListType.whitelist)
) )
) )
) )

View File

@ -0,0 +1,35 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
update "attachmentmeta"
set "language" = 'deu'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'deu'
);
update "attachmentmeta"
set "language" = 'eng'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'eng'
);
update "attachmentmeta"
set "language" = 'fra'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'fra'
);

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select random_uuid() as "id", "cid", concat('tagcategory-', "category") as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -0,0 +1,14 @@
ALTER TABLE `attachmentmeta`
ADD COLUMN (`language` varchar(254));
update `attachmentmeta` `m`
inner join (
select `m`.`attachid`, `c`.`doclang`
from `attachmentmeta` m
inner join `attachment` a on `a`.`attachid` = `m`.`attachid`
inner join `item` i on `a`.`itemid` = `i`.`itemid`
inner join `collective` c on `c`.`cid` = `i`.`cid`
) as `c`
set `m`.`language` = `c`.`doclang`
where `m`.`attachid` = `c`.`attachid` and `m`.`language` is null;

View File

@ -0,0 +1,48 @@
CREATE TABLE `classifier_model`(
`id` varchar(254) not null primary key,
`cid` varchar(254) not null,
`name` varchar(254) not null,
`file_id` varchar(254) not null,
`created` timestamp not null,
foreign key (`cid`) references `collective`(`cid`),
foreign key (`file_id`) references `filemeta`(`id`),
unique (`cid`, `name`)
);
insert into `classifier_model`
select md5(rand()) as id, `cid`,concat('tagcategory-', `category`) as `name`, `file_id`, `created`
from `classifier_setting`
where `file_id` is not null;
alter table `classifier_setting`
add column (`categories` mediumtext);
alter table `classifier_setting`
add column (`category_list_type` varchar(254));
update `classifier_setting`
set `category_list_type` = 'whitelist';
update `classifier_setting`
set `categories` = concat('["', `category`, '"]')
where category is not null;
update `classifier_setting`
set `categories` = '[]'
where category is null;
alter table `classifier_setting`
drop column `category`;
-- mariadb requires dropping the constraint manually when dropping a column
alter table `classifier_setting`
drop constraint `classifier_setting_ibfk_2`;
alter table `classifier_setting`
drop column `file_id`;
ALTER TABLE `classifier_setting`
MODIFY `categories` mediumtext NOT NULL;
ALTER TABLE `classifier_setting`
MODIFY `category_list_type` varchar(254) NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE `item_proposal` (
`itemid` varchar(254) not null primary key,
`classifier_proposals` mediumtext not null,
`classifier_tags` mediumtext not null,
`created` timestamp not null,
foreign key (`itemid`) references `item`(`itemid`)
);

View File

@ -0,0 +1,15 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
with
"attachlang" as (
select "m"."attachid", "m"."language", "c"."doclang"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
)
update "attachmentmeta" as "m"
set "language" = "c"."doclang"
from "attachlang" c
where "m"."attachid" = "c"."attachid" and "m"."language" is null;

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select md5(random()::text) as id, "cid",'tagcategory-' || "category" as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -86,6 +86,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaItemProposalList: Meta[MetaProposalList] = implicit val metaItemProposalList: Meta[MetaProposalList] =
jsonMeta[MetaProposalList] jsonMeta[MetaProposalList]
implicit val metaIdRef: Meta[List[IdRef]] =
jsonMeta[List[IdRef]]
implicit val metaLanguage: Meta[Language] = implicit val metaLanguage: Meta[Language] =
Meta[String].imap(Language.unsafe)(_.iso3) Meta[String].imap(Language.unsafe)(_.iso3)
@ -97,6 +100,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaCustomFieldType: Meta[CustomFieldType] = implicit val metaCustomFieldType: Meta[CustomFieldType] =
Meta[String].timap(CustomFieldType.unsafe)(_.name) Meta[String].timap(CustomFieldType.unsafe)(_.name)
implicit val metaListType: Meta[ListType] =
Meta[String].timap(ListType.unsafeFromString)(_.name)
} }
object DoobieMeta extends DoobieMeta { object DoobieMeta extends DoobieMeta {

View File

@ -1,5 +1,7 @@
package docspell.store.qb package docspell.store.qb
import cats.data.NonEmptyList
sealed trait DBFunction {} sealed trait DBFunction {}
object DBFunction { object DBFunction {
@ -31,6 +33,8 @@ object DBFunction {
case class Sum(expr: SelectExpr) extends DBFunction case class Sum(expr: SelectExpr) extends DBFunction
case class Concat(exprs: NonEmptyList[SelectExpr]) extends DBFunction
sealed trait Operator sealed trait Operator
object Operator { object Operator {
case object Plus extends Operator case object Plus extends Operator

View File

@ -98,6 +98,9 @@ trait DSL extends DoobieMeta {
def substring(expr: SelectExpr, start: Int, length: Int): DBFunction = def substring(expr: SelectExpr, start: Int, length: Int): DBFunction =
DBFunction.Substring(expr, start, length) DBFunction.Substring(expr, start, length)
def concat(expr: SelectExpr, exprs: SelectExpr*): DBFunction =
DBFunction.Concat(Nel.of(expr, exprs: _*))
def lit[A](value: A)(implicit P: Put[A]): SelectExpr.SelectLit[A] = def lit[A](value: A)(implicit P: Put[A]): SelectExpr.SelectLit[A] =
SelectExpr.SelectLit(value, None) SelectExpr.SelectLit(value, None)

View File

@ -32,6 +32,10 @@ object DBFunctionBuilder extends CommonBuilder {
case DBFunction.Substring(expr, start, len) => case DBFunction.Substring(expr, start, len) =>
sql"SUBSTRING(" ++ SelectExprBuilder.build(expr) ++ fr" FROM $start FOR $len)" sql"SUBSTRING(" ++ SelectExprBuilder.build(expr) ++ fr" FROM $start FOR $len)"
case DBFunction.Concat(exprs) =>
val inner = exprs.map(SelectExprBuilder.build).toList.reduce(_ ++ comma ++ _)
sql"CONCAT(" ++ inner ++ sql")"
case DBFunction.Calc(op, left, right) => case DBFunction.Calc(op, left, right) =>
SelectExprBuilder.build(left) ++ SelectExprBuilder.build(left) ++
buildOperator(op) ++ buildOperator(op) ++

View File

@ -21,6 +21,7 @@ object QAttachment {
private val item = RItem.as("i") private val item = RItem.as("i")
private val am = RAttachmentMeta.as("am") private val am = RAttachmentMeta.as("am")
private val c = RCollective.as("c") private val c = RCollective.as("c")
private val im = RItemProposal.as("im")
def deletePreview[F[_]: Sync](store: Store[F])(attachId: Ident): F[Int] = { def deletePreview[F[_]: Sync](store: Store[F])(attachId: Ident): F[Int] = {
val findPreview = val findPreview =
@ -118,17 +119,27 @@ object QAttachment {
} yield ns.sum } yield ns.sum
def getMetaProposals(itemId: Ident, coll: Ident): ConnectionIO[MetaProposalList] = { def getMetaProposals(itemId: Ident, coll: Ident): ConnectionIO[MetaProposalList] = {
val q = Select( val qa = Select(
am.proposals.s, select(am.proposals),
from(am) from(am)
.innerJoin(a, a.id === am.id) .innerJoin(a, a.id === am.id)
.innerJoin(item, a.itemId === item.id), .innerJoin(item, a.itemId === item.id),
a.itemId === itemId && item.cid === coll a.itemId === itemId && item.cid === coll
).build ).build
val qi = Select(
select(im.classifyProposals),
from(im)
.innerJoin(item, item.id === im.itemId),
item.cid === coll && im.itemId === itemId
).build
for { for {
ml <- q.query[MetaProposalList].to[Vector] mla <- qa.query[MetaProposalList].to[Vector]
} yield MetaProposalList.flatten(ml) mli <- qi.query[MetaProposalList].to[Vector]
} yield MetaProposalList
.flatten(mla)
.insertSecond(MetaProposalList.flatten(mli))
} }
def getAttachmentMeta( def getAttachmentMeta(
@ -160,7 +171,15 @@ object QAttachment {
chunkSize: Int chunkSize: Int
): Stream[ConnectionIO, ContentAndName] = ): Stream[ConnectionIO, ContentAndName] =
Select( Select(
select(a.id, a.itemId, item.cid, item.folder, c.language, a.name, am.content), select(
a.id.s,
a.itemId.s,
item.cid.s,
item.folder.s,
coalesce(am.language.s, c.language.s).s,
a.name.s,
am.content.s
),
from(a) from(a)
.innerJoin(am, am.id === a.id) .innerJoin(am, am.id === a.id)
.innerJoin(item, item.id === a.itemId) .innerJoin(item, item.id === a.itemId)

View File

@ -1,10 +1,8 @@
package docspell.store.queries package docspell.store.queries
import cats.data.OptionT
import fs2.Stream import fs2.Stream
import docspell.common.ContactKind import docspell.common._
import docspell.common.{Direction, Ident}
import docspell.store.qb.DSL._ import docspell.store.qb.DSL._
import docspell.store.qb._ import docspell.store.qb._
import docspell.store.records._ import docspell.store.records._
@ -17,6 +15,7 @@ object QCollective {
private val t = RTag.as("t") private val t = RTag.as("t")
private val ro = ROrganization.as("o") private val ro = ROrganization.as("o")
private val rp = RPerson.as("p") private val rp = RPerson.as("p")
private val re = REquipment.as("e")
private val rc = RContact.as("c") private val rc = RContact.as("c")
private val i = RItem.as("i") private val i = RItem.as("i")
@ -25,13 +24,37 @@ object QCollective {
val empty = Names(Vector.empty, Vector.empty, Vector.empty) val empty = Names(Vector.empty, Vector.empty, Vector.empty)
} }
def allNames(collective: Ident): ConnectionIO[Names] = def allNames(collective: Ident, maxEntries: Int): ConnectionIO[Names] = {
(for { val created = Column[Timestamp]("created", TableDef(""))
orgs <- OptionT.liftF(ROrganization.findAllRef(collective, None, _.name)) union(
pers <- OptionT.liftF(RPerson.findAllRef(collective, None, _.name)) Select(
equp <- OptionT.liftF(REquipment.findAll(collective, None, _.name)) select(ro.name.s, lit(1).as("kind"), ro.created.as(created)),
} yield Names(orgs.map(_.name), pers.map(_.name), equp.map(_.name))) from(ro),
.getOrElse(Names.empty) ro.cid === collective
),
Select(
select(rp.name.s, lit(2).as("kind"), rp.created.as(created)),
from(rp),
rp.cid === collective
),
Select(
select(re.name.s, lit(3).as("kind"), re.created.as(created)),
from(re),
re.cid === collective
)
).orderBy(created.desc)
.limit(Batch.limit(maxEntries))
.build
.query[(String, Int)]
.streamWithChunkSize(maxEntries)
.fold(Names.empty) { case (names, (name, kind)) =>
if (kind == 1) names.copy(org = names.org :+ name)
else if (kind == 2) names.copy(pers = names.pers :+ name)
else names.copy(equip = names.equip :+ name)
}
.compile
.lastOrError
}
case class InsightData( case class InsightData(
incoming: Int, incoming: Int,

View File

@ -441,8 +441,9 @@ object QItem {
tn <- store.transact(RTagItem.deleteItemTags(itemId)) tn <- store.transact(RTagItem.deleteItemTags(itemId))
mn <- store.transact(RSentMail.deleteByItem(itemId)) mn <- store.transact(RSentMail.deleteByItem(itemId))
cf <- store.transact(RCustomFieldValue.deleteByItem(itemId)) cf <- store.transact(RCustomFieldValue.deleteByItem(itemId))
im <- store.transact(RItemProposal.deleteByItem(itemId))
n <- store.transact(RItem.deleteByIdAndCollective(itemId, collective)) n <- store.transact(RItem.deleteByIdAndCollective(itemId, collective))
} yield tn + rn + n + mn + cf } yield tn + rn + n + mn + cf + im
private def findByFileIdsQuery( private def findByFileIdsQuery(
fileMetaIds: Nel[Ident], fileMetaIds: Nel[Ident],
@ -543,11 +544,13 @@ object QItem {
def findAllNewesFirst( def findAllNewesFirst(
collective: Ident, collective: Ident,
chunkSize: Int chunkSize: Int,
limit: Batch
): Stream[ConnectionIO, Ident] = { ): Stream[ConnectionIO, Ident] = {
val i = RItem.as("i") val i = RItem.as("i")
Select(i.id.s, from(i), i.cid === collective && i.state === ItemState.confirmed) Select(i.id.s, from(i), i.cid === collective && i.state === ItemState.confirmed)
.orderBy(i.created.desc) .orderBy(i.created.desc)
.limit(limit)
.build .build
.query[Ident] .query[Ident]
.streamWithChunkSize(chunkSize) .streamWithChunkSize(chunkSize)
@ -557,6 +560,7 @@ object QItem {
collective: Ident, collective: Ident,
itemId: Ident, itemId: Ident,
tagCategory: String, tagCategory: String,
maxLen: Int,
pageSep: String pageSep: String
): ConnectionIO[TextAndTag] = { ): ConnectionIO[TextAndTag] = {
val tags = TableDef("tags").as("tt") val tags = TableDef("tags").as("tt")
@ -564,7 +568,7 @@ object QItem {
val tagsTid = Column[Ident]("tid", tags) val tagsTid = Column[Ident]("tid", tags)
val tagsName = Column[String]("tname", tags) val tagsName = Column[String]("tname", tags)
val q = readTextAndTag(collective, itemId, pageSep) {
withCte( withCte(
tags -> Select( tags -> Select(
select(ti.itemId.as(tagsItem), tag.tid.as(tagsTid), tag.name.as(tagsName)), select(ti.itemId.as(tagsItem), tag.tid.as(tagsTid), tag.name.as(tagsName)),
@ -574,25 +578,98 @@ object QItem {
) )
)( )(
Select( Select(
select(m.content, tagsTid, tagsName), select(substring(m.content.s, 0, maxLen).s, tagsTid.s, tagsName.s),
from(i) from(i)
.innerJoin(a, a.itemId === i.id) .innerJoin(a, a.itemId === i.id)
.innerJoin(m, a.id === m.id) .innerJoin(m, a.id === m.id)
.leftJoin(tags, tagsItem === i.id), .leftJoin(tags, tagsItem === i.id),
i.id === itemId && i.cid === collective && m.content.isNotNull && m.content <> "" i.id === itemId && i.cid === collective && m.content.isNotNull && m.content <> ""
) )
).build )
}
}
def resolveTextAndCorrOrg(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, org.oid.s, org.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(org, org.oid === i.corrOrg),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndCorrPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.corrPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.concPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcEquip(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, equip.eid.s, equip.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(equip, equip.eid === i.concEquipment),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
private def readTextAndTag(collective: Ident, itemId: Ident, pageSep: String)(
q: Select
): ConnectionIO[TextAndTag] =
for { for {
_ <- logger.ftrace[ConnectionIO]( _ <- logger.ftrace[ConnectionIO](
s"query: $q (${itemId.id}, ${collective.id}, ${tagCategory})" s"query: $q (${itemId.id}, ${collective.id})"
) )
texts <- q.query[(String, Option[TextAndTag.TagName])].to[List] texts <- q.build.query[(String, Option[TextAndTag.TagName])].to[List]
_ <- logger.ftrace[ConnectionIO]( _ <- logger.ftrace[ConnectionIO](
s"Got ${texts.size} text and tag entries for item ${itemId.id}" s"Got ${texts.size} text and tag entries for item ${itemId.id}"
) )
tag = texts.headOption.flatMap(_._2) tag = texts.headOption.flatMap(_._2)
txt = texts.map(_._1).mkString(pageSep) txt = texts.map(_._1).mkString(pageSep)
} yield TextAndTag(itemId, txt, tag) } yield TextAndTag(itemId, txt, tag)
}
} }

View File

@ -15,7 +15,8 @@ case class RAttachmentMeta(
content: Option[String], content: Option[String],
nerlabels: List[NerLabel], nerlabels: List[NerLabel],
proposals: MetaProposalList, proposals: MetaProposalList,
pages: Option[Int] pages: Option[Int],
language: Option[Language]
) { ) {
def setContentIfEmpty(txt: Option[String]): RAttachmentMeta = def setContentIfEmpty(txt: Option[String]): RAttachmentMeta =
@ -27,8 +28,8 @@ case class RAttachmentMeta(
} }
object RAttachmentMeta { object RAttachmentMeta {
def empty(attachId: Ident) = def empty(attachId: Ident, lang: Language) =
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None) RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None, Some(lang))
final case class Table(alias: Option[String]) extends TableDef { final case class Table(alias: Option[String]) extends TableDef {
val tableName = "attachmentmeta" val tableName = "attachmentmeta"
@ -38,7 +39,16 @@ object RAttachmentMeta {
val nerlabels = Column[List[NerLabel]]("nerlabels", this) val nerlabels = Column[List[NerLabel]]("nerlabels", this)
val proposals = Column[MetaProposalList]("itemproposals", this) val proposals = Column[MetaProposalList]("itemproposals", this)
val pages = Column[Int]("page_count", this) val pages = Column[Int]("page_count", this)
val all = NonEmptyList.of[Column[_]](id, content, nerlabels, proposals, pages) val language = Column[Language]("language", this)
val all =
NonEmptyList.of[Column[_]](
id,
content,
nerlabels,
proposals,
pages,
language
)
} }
val T = Table(None) val T = Table(None)
@ -49,7 +59,7 @@ object RAttachmentMeta {
DML.insert( DML.insert(
T, T,
T.all, T.all,
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages}" fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages},${v.language}"
) )
def exists(attachId: Ident): ConnectionIO[Boolean] = def exists(attachId: Ident): ConnectionIO[Boolean] =
@ -90,13 +100,14 @@ object RAttachmentMeta {
) )
) )
def updateProposals(mid: Ident, plist: MetaProposalList): ConnectionIO[Int] = def updateProposals(
mid: Ident,
plist: MetaProposalList
): ConnectionIO[Int] =
DML.update( DML.update(
T, T,
T.id === mid, T.id === mid,
DML.set( DML.set(T.proposals.setTo(plist))
T.proposals.setTo(plist)
)
) )
def updatePageCount(mid: Ident, pageCount: Option[Int]): ConnectionIO[Int] = def updatePageCount(mid: Ident, pageCount: Option[Int]): ConnectionIO[Int] =

View File

@ -0,0 +1,102 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.effect._
import cats.implicits._
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
final case class RClassifierModel(
id: Ident,
cid: Ident,
name: String,
fileId: Ident,
created: Timestamp
) {}
object RClassifierModel {
def createNew[F[_]: Sync](
cid: Ident,
name: String,
fileId: Ident
): F[RClassifierModel] =
for {
id <- Ident.randomId[F]
now <- Timestamp.current[F]
} yield RClassifierModel(id, cid, name, fileId, now)
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_model"
val id = Column[Ident]("id", this)
val cid = Column[Ident]("cid", this)
val name = Column[String]("name", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](id, cid, name, fileId, created)
}
def as(alias: String): Table =
Table(Some(alias))
val T = Table(None)
def insert(v: RClassifierModel): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.id},${v.cid},${v.name},${v.fileId},${v.created}"
)
def updateFile(coll: Ident, name: String, fid: Ident): ConnectionIO[Int] =
for {
now <- Timestamp.current[ConnectionIO]
n <- DML.update(
T,
T.cid === coll && T.name === name,
DML.set(T.fileId.setTo(fid), T.created.setTo(now))
)
k <-
if (n == 0) createNew[ConnectionIO](coll, name, fid).flatMap(insert)
else 0.pure[ConnectionIO]
} yield n + k
def deleteById(id: Ident): ConnectionIO[Int] =
DML.delete(T, T.id === id)
def deleteAll(ids: List[Ident]): ConnectionIO[Int] =
NonEmptyList.fromList(ids) match {
case Some(nel) =>
DML.delete(T, T.id.in(nel))
case None =>
0.pure[ConnectionIO]
}
def findByName(cid: Ident, name: String): ConnectionIO[Option[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name === name).build
.query[RClassifierModel]
.option
def findAllByName(
cid: Ident,
names: NonEmptyList[String]
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.in(names)).build
.query[RClassifierModel]
.to[List]
def findAllByQuery(
cid: Ident,
nameQuery: String
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.like(nameQuery)).build
.query[RClassifierModel]
.to[List]
}

View File

@ -1,6 +1,6 @@
package docspell.store.records package docspell.store.records
import cats.data.NonEmptyList import cats.data.{NonEmptyList, OptionT}
import cats.implicits._ import cats.implicits._
import docspell.common._ import docspell.common._
@ -13,27 +13,38 @@ import doobie.implicits._
case class RClassifierSetting( case class RClassifierSetting(
cid: Ident, cid: Ident,
enabled: Boolean,
schedule: CalEvent, schedule: CalEvent,
category: String,
itemCount: Int, itemCount: Int,
fileId: Option[Ident], created: Timestamp,
created: Timestamp categoryList: List[String],
) {} listType: ListType
) {
def autoTagEnabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categoryList.nonEmpty
}
}
object RClassifierSetting { object RClassifierSetting {
// the categoryList is stored as a json array
implicit val stringListMeta: Meta[List[String]] =
jsonMeta[List[String]]
final case class Table(alias: Option[String]) extends TableDef { final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_setting" val tableName = "classifier_setting"
val cid = Column[Ident]("cid", this) val cid = Column[Ident]("cid", this)
val enabled = Column[Boolean]("enabled", this)
val schedule = Column[CalEvent]("schedule", this) val schedule = Column[CalEvent]("schedule", this)
val category = Column[String]("category", this)
val itemCount = Column[Int]("item_count", this) val itemCount = Column[Int]("item_count", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this) val created = Column[Timestamp]("created", this)
val categories = Column[List[String]]("categories", this)
val listType = Column[ListType]("category_list_type", this)
val all = NonEmptyList val all = NonEmptyList
.of[Column[_]](cid, enabled, schedule, category, itemCount, fileId, created) .of[Column[_]](cid, schedule, itemCount, created, categories, listType)
} }
val T = Table(None) val T = Table(None)
@ -44,35 +55,19 @@ object RClassifierSetting {
DML.insert( DML.insert(
T, T,
T.all, T.all,
fr"${v.cid},${v.enabled},${v.schedule},${v.category},${v.itemCount},${v.fileId},${v.created}" fr"${v.cid},${v.schedule},${v.itemCount},${v.created},${v.categoryList},${v.listType}"
) )
def updateAll(v: RClassifierSetting): ConnectionIO[Int] = def update(v: RClassifierSetting): ConnectionIO[Int] =
DML.update(
T,
T.cid === v.cid,
DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule),
T.category.setTo(v.category),
T.itemCount.setTo(v.itemCount),
T.fileId.setTo(v.fileId)
)
)
def updateFile(coll: Ident, fid: Ident): ConnectionIO[Int] =
DML.update(T, T.cid === coll, DML.set(T.fileId.setTo(fid)))
def updateSettings(v: RClassifierSetting): ConnectionIO[Int] =
for { for {
n1 <- DML.update( n1 <- DML.update(
T, T,
T.cid === v.cid, T.cid === v.cid,
DML.set( DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule), T.schedule.setTo(v.schedule),
T.itemCount.setTo(v.itemCount), T.itemCount.setTo(v.itemCount),
T.category.setTo(v.category) T.categories.setTo(v.categoryList),
T.listType.setTo(v.listType)
) )
) )
n2 <- if (n1 <= 0) insert(v) else 0.pure[ConnectionIO] n2 <- if (n1 <= 0) insert(v) else 0.pure[ConnectionIO]
@ -86,27 +81,62 @@ object RClassifierSetting {
def delete(coll: Ident): ConnectionIO[Int] = def delete(coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.cid === coll) DML.delete(T, T.cid === coll)
/** Finds tag categories that exist and match the classifier setting.
* If the setting contains a blacklist, those categories are removed
* from the existing ones. If it is a whitelist, the intersection is
* returned.
*/
def getActiveCategories(coll: Ident): ConnectionIO[List[String]] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
res = sett.listType match {
case ListType.Blacklist =>
cats.diff(sett.categoryList)
case ListType.Whitelist =>
sett.categoryList.intersect(cats)
}
} yield res).getOrElse(Nil)
/** Checks the json array of tag categories and removes those that are not present anymore. */
def fixCategoryList(coll: Ident): ConnectionIO[Int] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
fixed = sett.categoryList.intersect(cats)
n <- OptionT.liftF(
if (fixed == sett.categoryList) 0.pure[ConnectionIO]
else DML.update(T, T.cid === coll, DML.set(T.categories.setTo(fixed)))
)
} yield n).getOrElse(0)
case class Classifier( case class Classifier(
enabled: Boolean,
schedule: CalEvent, schedule: CalEvent,
itemCount: Int, itemCount: Int,
category: Option[String] categories: List[String],
listType: ListType
) { ) {
def enabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categories.nonEmpty
}
def toRecord(coll: Ident, created: Timestamp): RClassifierSetting = def toRecord(coll: Ident, created: Timestamp): RClassifierSetting =
RClassifierSetting( RClassifierSetting(
coll, coll,
enabled,
schedule, schedule,
category.getOrElse(""),
itemCount, itemCount,
None, created,
created categories,
listType
) )
} }
object Classifier { object Classifier {
def fromRecord(r: RClassifierSetting): Classifier = def fromRecord(r: RClassifierSetting): Classifier =
Classifier(r.enabled, r.schedule, r.itemCount, r.category.some) Classifier(r.schedule, r.itemCount, r.categoryList, r.listType)
} }
} }

View File

@ -1,6 +1,6 @@
package docspell.store.records package docspell.store.records
import cats.data.NonEmptyList import cats.data.{NonEmptyList, OptionT}
import fs2.Stream import fs2.Stream
import docspell.common._ import docspell.common._
@ -73,13 +73,24 @@ object RCollective {
.map(now => settings.classifier.map(_.toRecord(cid, now))) .map(now => settings.classifier.map(_.toRecord(cid, now)))
n2 <- cls match { n2 <- cls match {
case Some(cr) => case Some(cr) =>
RClassifierSetting.updateSettings(cr) RClassifierSetting.update(cr)
case None => case None =>
RClassifierSetting.delete(cid) RClassifierSetting.delete(cid)
} }
} yield n1 + n2 } yield n1 + n2
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] = { // this hides categories that have been deleted in the meantime
// they are finally removed from the json array once the learn classifier task is run
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] =
(for {
sett <- OptionT(getRawSettings(coll))
prev <- OptionT.fromOption[ConnectionIO](sett.classifier)
cats <- OptionT.liftF(RTag.listCategories(coll))
next = prev.copy(categories = prev.categories.intersect(cats))
} yield sett.copy(classifier = Some(next))).value
private def getRawSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
import RClassifierSetting.stringListMeta
val c = RCollective.as("c") val c = RCollective.as("c")
val cs = RClassifierSetting.as("cs") val cs = RClassifierSetting.as("cs")
@ -87,10 +98,10 @@ object RCollective {
select( select(
c.language.s, c.language.s,
c.integration.s, c.integration.s,
cs.enabled.s,
cs.schedule.s, cs.schedule.s,
cs.itemCount.s, cs.itemCount.s,
cs.category.s cs.categories.s,
cs.listType.s
), ),
from(c).leftJoin(cs, cs.cid === c.id), from(c).leftJoin(cs, cs.cid === c.id),
c.id === coll c.id === coll

View File

@ -0,0 +1,60 @@
package docspell.store.records
import cats.data.NonEmptyList
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
case class RItemProposal(
itemId: Ident,
classifyProposals: MetaProposalList,
classifyTags: List[IdRef],
created: Timestamp
)
object RItemProposal {
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "item_proposal"
val itemId = Column[Ident]("itemid", this)
val classifyProposals = Column[MetaProposalList]("classifier_proposals", this)
val classifyTags = Column[List[IdRef]]("classifier_tags", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](itemId, classifyProposals, classifyTags, created)
}
val T = Table(None)
def as(alias: String): Table =
Table(Some(alias))
def insert(v: RItemProposal): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.itemId},${v.classifyProposals},${v.classifyTags},${v.created}"
)
def update(v: RItemProposal): ConnectionIO[Int] =
DML.update(
T,
T.itemId === v.itemId,
DML.set(
T.classifyProposals.setTo(v.classifyProposals),
T.classifyTags.setTo(v.classifyTags)
)
)
def deleteByItem(itemId: Ident): ConnectionIO[Int] =
DML.delete(T, T.itemId === itemId)
def exists(itemId: Ident): ConnectionIO[Boolean] =
Select(select(countAll), from(T), T.itemId === itemId).build
.query[Int]
.unique
.map(_ > 0)
}

View File

@ -148,6 +148,13 @@ object RTag {
).orderBy(T.name.asc).build.query[RTag].to[List] ).orderBy(T.name.asc).build.query[RTag].to[List]
} }
def listCategories(coll: Ident): ConnectionIO[List[String]] =
Select(
T.category.s,
from(T),
T.cid === coll && T.category.isNotNull
).distinct.build.query[String].to[List]
def delete(tagId: Ident, coll: Ident): ConnectionIO[Int] = def delete(tagId: Ident, coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.tid === tagId && T.cid === coll) DML.delete(T, T.tid === tagId && T.cid === coll)
} }

View File

@ -11,35 +11,38 @@ import Api
import Api.Model.ClassifierSetting exposing (ClassifierSetting) import Api.Model.ClassifierSetting exposing (ClassifierSetting)
import Api.Model.TagList exposing (TagList) import Api.Model.TagList exposing (TagList)
import Comp.CalEventInput import Comp.CalEventInput
import Comp.Dropdown
import Comp.FixedDropdown import Comp.FixedDropdown
import Comp.IntField import Comp.IntField
import Data.CalEvent exposing (CalEvent) import Data.CalEvent exposing (CalEvent)
import Data.Flags exposing (Flags) import Data.Flags exposing (Flags)
import Data.ListType exposing (ListType)
import Data.UiSettings exposing (UiSettings)
import Data.Validated exposing (Validated(..)) import Data.Validated exposing (Validated(..))
import Html exposing (..) import Html exposing (..)
import Html.Attributes exposing (..) import Html.Attributes exposing (..)
import Html.Events exposing (onCheck)
import Http import Http
import Markdown
import Util.Tag import Util.Tag
type alias Model = type alias Model =
{ enabled : Bool { scheduleModel : Comp.CalEventInput.Model
, categoryModel : Comp.FixedDropdown.Model String
, category : Maybe String
, scheduleModel : Comp.CalEventInput.Model
, schedule : Validated CalEvent , schedule : Validated CalEvent
, itemCountModel : Comp.IntField.Model , itemCountModel : Comp.IntField.Model
, itemCount : Maybe Int , itemCount : Maybe Int
, categoryListModel : Comp.Dropdown.Model String
, categoryListType : ListType
, categoryListTypeModel : Comp.FixedDropdown.Model ListType
} }
type Msg type Msg
= GetTagsResp (Result Http.Error TagList) = ScheduleMsg Comp.CalEventInput.Msg
| ScheduleMsg Comp.CalEventInput.Msg
| ToggleEnabled
| CategoryMsg (Comp.FixedDropdown.Msg String)
| ItemCountMsg Comp.IntField.Msg | ItemCountMsg Comp.IntField.Msg
| GetTagsResp (Result Http.Error TagList)
| CategoryListMsg (Comp.Dropdown.Msg String)
| CategoryListTypeMsg (Comp.FixedDropdown.Msg ListType)
init : Flags -> ClassifierSetting -> ( Model, Cmd Msg ) init : Flags -> ClassifierSetting -> ( Model, Cmd Msg )
@ -52,13 +55,36 @@ init flags sett =
( cem, cec ) = ( cem, cec ) =
Comp.CalEventInput.init flags newSchedule Comp.CalEventInput.init flags newSchedule
in in
( { enabled = sett.enabled ( { scheduleModel = cem
, categoryModel = Comp.FixedDropdown.initString []
, category = sett.category
, scheduleModel = cem
, schedule = Data.Validated.Unknown newSchedule , schedule = Data.Validated.Unknown newSchedule
, itemCountModel = Comp.IntField.init (Just 0) Nothing True "Item Count" , itemCountModel = Comp.IntField.init (Just 0) Nothing True "Item Count"
, itemCount = Just sett.itemCount , itemCount = Just sett.itemCount
, categoryListModel =
let
mkOption s =
{ value = s, text = s, additional = "" }
minit =
Comp.Dropdown.makeModel
{ multiple = True
, searchable = \n -> n > 0
, makeOption = mkOption
, labelColor = \_ -> \_ -> "grey "
, placeholder = "Choose categories "
}
lm =
Comp.Dropdown.SetSelection sett.categoryList
( m_, _ ) =
Comp.Dropdown.update lm minit
in
m_
, categoryListType =
Data.ListType.fromString sett.listType
|> Maybe.withDefault Data.ListType.Whitelist
, categoryListTypeModel =
Comp.FixedDropdown.initMap Data.ListType.label Data.ListType.all
} }
, Cmd.batch , Cmd.batch
[ Api.getTags flags "" GetTagsResp [ Api.getTags flags "" GetTagsResp
@ -71,11 +97,11 @@ getSettings : Model -> Validated ClassifierSetting
getSettings model = getSettings model =
Data.Validated.map Data.Validated.map
(\sch -> (\sch ->
{ enabled = model.enabled { schedule =
, category = model.category
, schedule =
Data.CalEvent.makeEvent sch Data.CalEvent.makeEvent sch
, itemCount = Maybe.withDefault 0 model.itemCount , itemCount = Maybe.withDefault 0 model.itemCount
, listType = Data.ListType.toString model.categoryListType
, categoryList = Comp.Dropdown.getSelected model.categoryListModel
} }
) )
model.schedule model.schedule
@ -89,18 +115,11 @@ update flags msg model =
categories = categories =
Util.Tag.getCategories tl.items Util.Tag.getCategories tl.items
|> List.sort |> List.sort
in
( { model
| categoryModel = Comp.FixedDropdown.initString categories
, category =
if model.category == Nothing then
List.head categories
else lm =
model.category Comp.Dropdown.SetOptions categories
} in
, Cmd.none update flags (CategoryListMsg lm) model
)
GetTagsResp (Err _) -> GetTagsResp (Err _) ->
( model, Cmd.none ) ( model, Cmd.none )
@ -121,28 +140,6 @@ update flags msg model =
, Cmd.map ScheduleMsg cc , Cmd.map ScheduleMsg cc
) )
ToggleEnabled ->
( { model | enabled = not model.enabled }
, Cmd.none
)
CategoryMsg lmsg ->
let
( mm, ma ) =
Comp.FixedDropdown.update lmsg model.categoryModel
in
( { model
| categoryModel = mm
, category =
if ma == Nothing then
model.category
else
ma
}
, Cmd.none
)
ItemCountMsg lmsg -> ItemCountMsg lmsg ->
let let
( im, iv ) = ( im, iv ) =
@ -155,39 +152,68 @@ update flags msg model =
, Cmd.none , Cmd.none
) )
CategoryListMsg lm ->
let
( m_, cmd_ ) =
Comp.Dropdown.update lm model.categoryListModel
in
( { model | categoryListModel = m_ }
, Cmd.map CategoryListMsg cmd_
)
view : Model -> Html Msg CategoryListTypeMsg lm ->
view model = let
( m_, sel ) =
Comp.FixedDropdown.update lm model.categoryListTypeModel
newListType =
Maybe.withDefault model.categoryListType sel
in
( { model
| categoryListTypeModel = m_
, categoryListType = newListType
}
, Cmd.none
)
view : UiSettings -> Model -> Html Msg
view settings model =
let
catListTypeItem =
Comp.FixedDropdown.Item
model.categoryListType
(Data.ListType.label model.categoryListType)
in
div [] div []
[ div [ Markdown.toHtml [ class "ui basic segment" ]
[ class "field" """
]
[ div [ class "ui checkbox" ] Auto-tagging works by learning from existing documents. The more
[ input documents you have correctly tagged, the better. Learning is done
[ type_ "checkbox" periodically based on a schedule. You can specify tag-groups that
, onCheck (\_ -> ToggleEnabled) should either be used (whitelist) or not used (blacklist) for
, checked model.enabled learning.
]
[] Use an empty whitelist to disable auto tagging.
, label [] [ text "Enable classification" ]
, span [ class "small-info" ] """
[ text "Disable document classification if not needed." , div [ class "field" ]
] [ label [] [ text "Is the following a blacklist or whitelist?" ]
] , Html.map CategoryListTypeMsg
] (Comp.FixedDropdown.view (Just catListTypeItem) model.categoryListTypeModel)
, div [ class "ui basic segment" ]
[ text "Document classification tries to predict a tag for new incoming documents. This "
, text "works by learning from existing documents in order to find common patterns within "
, text "the text. The more documents you have correctly tagged, the better. Learning is done "
, text "periodically based on a schedule and you need to specify a tag-group that should "
, text "be used for learning."
] ]
, div [ class "field" ] , div [ class "field" ]
[ label [] [ text "Category" ] [ label []
, Html.map CategoryMsg [ case model.categoryListType of
(Comp.FixedDropdown.viewString model.category Data.ListType.Whitelist ->
model.categoryModel text "Include tag categories for learning"
)
Data.ListType.Blacklist ->
text "Exclude tag categories from learning"
]
, Html.map CategoryListMsg
(Comp.Dropdown.view settings model.categoryListModel)
] ]
, Html.map ItemCountMsg , Html.map ItemCountMsg
(Comp.IntField.viewWithInfo (Comp.IntField.viewWithInfo

View File

@ -280,7 +280,7 @@ view flags settings model =
, ( "invisible hidden", not flags.config.showClassificationSettings ) , ( "invisible hidden", not flags.config.showClassificationSettings )
] ]
] ]
[ text "Document Classifier" [ text "Auto-Tagging"
] ]
, div , div
[ classList [ classList
@ -289,13 +289,10 @@ view flags settings model =
] ]
] ]
[ Html.map ClassifierSettingMsg [ Html.map ClassifierSettingMsg
(Comp.ClassifierSettingsForm.view model.classifierModel) (Comp.ClassifierSettingsForm.view settings model.classifierModel)
, div [ class "ui vertical segment" ] , div [ class "ui vertical segment" ]
[ button [ button
[ classList [ class "ui small secondary basic button"
[ ( "ui small secondary basic button", True )
, ( "disabled", not model.classifierModel.enabled )
]
, title "Starts a task to train a classifier" , title "Starts a task to train a classifier"
, onClick StartClassifierTask , onClick StartClassifierTask
] ]

View File

@ -958,7 +958,6 @@ renderSuggestions model mkName idnames tagger =
] ]
, div [ class "menu" ] <| , div [ class "menu" ] <|
(idnames (idnames
|> List.take 5
|> List.map (\p -> a [ class "item", href "#", onClick (tagger p) ] [ text (mkName p) ]) |> List.map (\p -> a [ class "item", href "#", onClick (tagger p) ] [ text (mkName p) ])
) )
] ]
@ -969,7 +968,7 @@ renderOrgSuggestions : Model -> Html Msg
renderOrgSuggestions model = renderOrgSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.corrOrg) (List.take 6 model.itemProposals.corrOrg)
SetCorrOrgSuggestion SetCorrOrgSuggestion
@ -977,7 +976,7 @@ renderCorrPersonSuggestions : Model -> Html Msg
renderCorrPersonSuggestions model = renderCorrPersonSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.corrPerson) (List.take 6 model.itemProposals.corrPerson)
SetCorrPersonSuggestion SetCorrPersonSuggestion
@ -985,7 +984,7 @@ renderConcPersonSuggestions : Model -> Html Msg
renderConcPersonSuggestions model = renderConcPersonSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.concPerson) (List.take 6 model.itemProposals.concPerson)
SetConcPersonSuggestion SetConcPersonSuggestion
@ -993,7 +992,7 @@ renderConcEquipSuggestions : Model -> Html Msg
renderConcEquipSuggestions model = renderConcEquipSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.concEquipment) (List.take 6 model.itemProposals.concEquipment)
SetConcEquipSuggestion SetConcEquipSuggestion
@ -1001,7 +1000,7 @@ renderItemDateSuggestions : Model -> Html Msg
renderItemDateSuggestions model = renderItemDateSuggestions model =
renderSuggestions model renderSuggestions model
Util.Time.formatDate Util.Time.formatDate
(List.take 5 model.itemProposals.itemDate) (List.take 6 model.itemProposals.itemDate)
SetItemDateSuggestion SetItemDateSuggestion
@ -1009,7 +1008,7 @@ renderDueDateSuggestions : Model -> Html Msg
renderDueDateSuggestions model = renderDueDateSuggestions model =
renderSuggestions model renderSuggestions model
Util.Time.formatDate Util.Time.formatDate
(List.take 5 model.itemProposals.dueDate) (List.take 6 model.itemProposals.dueDate)
SetDueDateSuggestion SetDueDateSuggestion

View File

@ -11,6 +11,17 @@ type Language
= German = German
| English | English
| French | French
| Italian
| Spanish
| Portuguese
| Czech
| Danish
| Finnish
| Norwegian
| Swedish
| Russian
| Romanian
| Dutch
fromString : String -> Maybe Language fromString : String -> Maybe Language
@ -24,6 +35,39 @@ fromString str =
else if str == "fra" || str == "fr" || str == "french" then else if str == "fra" || str == "fr" || str == "french" then
Just French Just French
else if str == "ita" || str == "it" || str == "italian" then
Just Italian
else if str == "spa" || str == "es" || str == "spanish" then
Just Spanish
else if str == "por" || str == "pt" || str == "portuguese" then
Just Portuguese
else if str == "ces" || str == "cs" || str == "czech" then
Just Czech
else if str == "dan" || str == "da" || str == "danish" then
Just Danish
else if str == "nld" || str == "nd" || str == "dutch" then
Just Dutch
else if str == "fin" || str == "fi" || str == "finnish" then
Just Finnish
else if str == "nor" || str == "no" || str == "norwegian" then
Just Norwegian
else if str == "swe" || str == "sv" || str == "swedish" then
Just Swedish
else if str == "rus" || str == "ru" || str == "russian" then
Just Russian
else if str == "ron" || str == "ro" || str == "romanian" then
Just Romanian
else else
Nothing Nothing
@ -40,6 +84,39 @@ toIso3 lang =
French -> French ->
"fra" "fra"
Italian ->
"ita"
Spanish ->
"spa"
Portuguese ->
"por"
Czech ->
"ces"
Danish ->
"dan"
Finnish ->
"fin"
Norwegian ->
"nor"
Swedish ->
"swe"
Russian ->
"rus"
Romanian ->
"ron"
Dutch ->
"nld"
toName : Language -> String toName : Language -> String
toName lang = toName lang =
@ -53,7 +130,54 @@ toName lang =
French -> French ->
"French" "French"
Italian ->
"Italian"
Spanish ->
"Spanish"
Portuguese ->
"Portuguese"
Czech ->
"Czech"
Danish ->
"Danish"
Finnish ->
"Finnish"
Norwegian ->
"Norwegian"
Swedish ->
"Swedish"
Russian ->
"Russian"
Romanian ->
"Romanian"
Dutch ->
"Dutch"
all : List Language all : List Language
all = all =
[ German, English, French ] [ German
, English
, French
, Italian
, Spanish
, Portuguese
, Czech
, Dutch
, Danish
, Finnish
, Norwegian
, Swedish
, Russian
, Romanian
]

View File

@ -0,0 +1,50 @@
module Data.ListType exposing
( ListType(..)
, all
, fromString
, label
, toString
)
type ListType
= Blacklist
| Whitelist
all : List ListType
all =
[ Blacklist, Whitelist ]
toString : ListType -> String
toString lt =
case lt of
Blacklist ->
"blacklist"
Whitelist ->
"whitelist"
label : ListType -> String
label lt =
case lt of
Blacklist ->
"Blacklist"
Whitelist ->
"Whitelist"
fromString : String -> Maybe ListType
fromString str =
case String.toLower str of
"blacklist" ->
Just Blacklist
"whitelist" ->
Just Whitelist
_ ->
Nothing

View File

@ -98,10 +98,14 @@ let
}; };
text-analysis = { text-analysis = {
max-length = 10000; max-length = 10000;
nlp = {
mode = "full";
clear-interval = "15 minutes";
regex-ner = { regex-ner = {
enabled = true; max-entries = 1000;
file-cache-time = "1 minute"; file-cache-time = "1 minute";
}; };
};
classification = { classification = {
enabled = true; enabled = true;
item-count = 0; item-count = 0;
@ -118,7 +122,6 @@ let
]; ];
}; };
working-dir = "/tmp/docspell-analysis"; working-dir = "/tmp/docspell-analysis";
clear-stanford-nlp-interval = "15 minutes";
}; };
processing = { processing = {
max-due-date-years = 10; max-due-date-years = 10;
@ -772,9 +775,50 @@ in {
files. files.
''; '';
}; };
clear-stanford-nlp-interval = mkOption {
nlp = mkOption {
type = types.submodule({
options = {
mode = mkOption {
type = types.str; type = types.str;
default = defaults.text-analysis.clear-stanford-nlp-interval; default = defaults.text-analysis.nlp.mode;
description = ''
The mode for configuring NLP models:
1. full - builds the complete pipeline
2. basic - builds only the ner annotator
3. regexonly - matches each entry in your address book via regexps
4. disabled - doesn't use any stanford-nlp feature
The full and basic variants rely on pre-built language models
that are available for only 3 languages at the moment: German,
English and French.
Memory usage varies greatly among the languages. German has
quite large models that require about 1G heap. So joex should
run with at least -Xmx1400M when using mode=full.
The basic variant does quite a good job for German and
English. It might be worse for French, depending on the
type of text that is analysed. Joex should run with about 600M
heap; here again German uses the most.
The regexonly variant doesn't depend on a language. It roughly
works by converting all entries in your address book into
regexps and matching each one against the text. This can get
memory intensive, too, when the address book grows large. This
is included in full and basic by default, but can be used
independently by setting mode=regexonly.
When mode=disabled, the whole nlp pipeline is disabled and you
won't get any suggestions; only what the classifier returns (if
enabled).
'';
};
clear-interval = mkOption {
type = types.str;
default = defaults.text-analysis.nlp.clear-interval;
description = '' description = ''
Idle time after which the NLP caches are cleared to free Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled. memory. If <= 0 clearing the cache is disabled.
@ -785,19 +829,22 @@ in {
type = types.submodule({ type = types.submodule({
options = { options = {
enabled = mkOption { enabled = mkOption {
type = types.bool; type = types.int;
default = defaults.text-analysis.regex-ner.enabled; default = defaults.text-analysis.regex-ner.max-entries;
description = '' description = ''
Whether to enable custom NER annotation. This uses the address Whether to enable custom NER annotation. This uses the
book of a collective as input for NER tagging (to automatically address book of a collective as input for NER tagging (to
find correspondent and concerned entities). If the address book automatically find correspondent and concerned entities). If
is large, this can be quite memory intensive and also makes text the address book is large, this can be quite memory
analysis slower. But it greatly improves accuracy. If this is intensive and also makes text analysis much slower. But it
false, NER tagging uses only statistical models (that also work improves accuracy and can be used independently of the
quite well). language. If this is set to 0, it is effectively disabled
and NER tagging uses only statistical models (that also work
quite well, but are restricted to the languages mentioned
above).
This setting might be moved to the collective settings in the Note, this is only relevant if nlp-config.mode is not
future. "disabled".
''; '';
}; };
file-cache-time = mkOption { file-cache-time = mkOption {
@ -811,9 +858,14 @@ in {
}; };
}; };
}); });
default = defaults.text-analysis.regex-ner; default = defaults.text-analysis.nlp.regex-ner;
description = ""; description = "";
}; };
};
});
default = defaults.text-analysis.nlp;
description = "Configure NLP";
};
classification = mkOption { classification = mkOption {
type = types.submodule({ type = types.submodule({

View File

@ -20,6 +20,9 @@ The configuration of both components uses separate namespaces. The
configuration for the REST server is below `docspell.server`, while configuration for the REST server is below `docspell.server`, while
the one for joex is below `docspell.joex`. the one for joex is below `docspell.joex`.
You can therefore use two separate config files or one single file
containing both namespaces.
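For example, a single file can carry both namespaces (a sketch only,
the bodies are elided):

```
docspell.server {
  # settings for the REST server
}
docspell.joex {
  # settings for the job executor (joex)
}
```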
## JDBC ## JDBC
This configures the connection to the database. This has to be This configures the connection to the database. This has to be
@ -281,6 +284,70 @@ just some minutes, the web application obtains new ones
periodically. So a short time is recommended. periodically. So a short time is recommended.
## File Processing
Files are processed by the joex component, so all the respective
configuration is in its config only.
File processing involves several stages; detailed information can be
found [here](@/docs/joex/file-processing.md#text-analysis) and in the
corresponding sections in [joex default config](#joex).
The configuration allows you to define the external tools and to set
some limits to control memory usage. The sections are:
- `docspell.joex.extraction`
- `docspell.joex.text-analysis`
- `docspell.joex.convert`
Options to external commands can use variables that are replaced by
values at runtime. Variables are enclosed in double braces `{{…}}`.
Please see the default configuration for what variables exist per
command.
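As an illustrative sketch, a command definition could look like the
following; the exact config path and the `{{file}}`/`{{lang}}`
variables shown here are assumptions, so check the default
configuration for what each command actually supports:

```
command = {
  program = "tesseract"
  # variables in double braces are replaced at runtime
  args = [ "{{file}}", "stdout", "-l", "{{lang}}" ]
}
```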
### Classification
In `text-analysis.classification` you can define how many documents at
most should be used for learning. The default settings should work
well for most cases. However, it always depends on the amount of data
and the machine that runs joex. For example, by default the documents
to learn from are limited to 600 (`classification.item-count`) and
every text is cut after 8000 characters (`text-analysis.max-length`).
This is fine if *most* of your documents are small and only a few are
near 8000 characters. But if *all* your documents are very large, you
probably need to either assign more heap memory or lower the limits.
Classification can also be disabled when it is not needed.
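A minimal sketch of the keys involved, using the numbers mentioned
above (not a recommendation):

```
docspell.joex.text-analysis {
  max-length = 8000
  classification {
    enabled = true
    item-count = 600
  }
}
```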
### NLP
This setting defines which NLP mode to use. It defaults to `full`,
which requires more memory for certain languages (with the advantage
of better results). Other values are `basic`, `regexonly` and
`disabled`. The modes `full` and `basic` use pre-defined language
models for processing documents in German, English and French.
These require some amount of memory (see below).
The mode `basic` is the "light" variant of `full`. It doesn't use
all NLP features, which makes memory consumption much lower, but comes
with the compromise of less accurate results.
The mode `regexonly` doesn't use pre-defined language models, even if
available. It checks your address book against a document to find
metadata. That means it is language-independent. Also, when using
`full` or `basic` with languages for which no pre-defined models
exist, it will degrade to `regexonly` for these.
The mode `disabled` skips NLP processing completely. This has the
least impact on memory consumption, obviously, but then only the classifier
is used to find metadata.
You might want to try different modes and see what combination best
suits your usage pattern and the machine running joex. If a powerful
machine is used, simply leave the defaults. When running on an older
Raspberry Pi, for example, you might need to adjust things.
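As a sketch, a memory-constrained joex could run the lighter
pipeline; the keys mirror the defaults shown in the joex config:

```
docspell.joex.text-analysis.nlp {
  mode = "basic"  # one of: full, basic, regexonly, disabled
  clear-interval = "15 minutes"
  regex-ner.max-entries = 1000
}
```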
# File Format # File Format
The format of the configuration files can be The format of the configuration files can be

Some files were not shown because too many files have changed in this diff.