Merge pull request #581 from eikek/text-analysis-improvements

Text analysis improvements
commit df5f9e8c51 by mergify[bot], 2021-01-21 22:01:50 +00:00 (committed by GitHub)
104 changed files with 3385 additions and 714 deletions

@@ -24,4 +24,4 @@ before_script:
 - export TZ=Europe/Berlin
 script:
-- sbt ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"
+- sbt -J-XX:+UseG1GC ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"

@@ -17,6 +17,9 @@ If you don't like to sign up to github/matrix or like to reach me
 personally, you can make a mail to `info [at] docspell.org` or on
 matrix, via `@eikek:matrix.org`.
 
+If you find a feature request already filed, you can vote on it. I
+tend to prefer the most voted requests over those without much
+attention.
+
 ## Documentation

@@ -9,25 +9,28 @@
 # Docspell
 
 Docspell is a personal document organizer. You'll need a scanner to
-convert your papers into files. Docspell can then assist in
-organizing the resulting mess :wink:.
+convert your papers into files. Docspell can then assist in organizing
+the resulting mess :wink:. It is targeted at home use, i.e. families
+and households, and also at (smaller) groups/companies.
 
-You can associate tags, set correspondends, what a document is
-concerned with, a name, a date and much more. If your documents are
-associated with such meta data, you should be able to quickly find
-them later using the search feature. But adding this manually to each
-document is a tedious task. Docspell can help you by suggesting
-correspondents, guessing tags or finding dates using machine learning
-techniques. This makes adding metadata to your documents a lot easier.
+You can associate tags, set correspondents and lots of other
+predefined and custom metadata. If your documents are associated with
+such meta data, you can quickly find them later using the search
+feature. But adding this manually is a tedious task. Docspell can help
+by suggesting correspondents, guessing tags or finding dates using
+machine learning. It can learn metadata from existing documents and
+find things using NLP. This makes adding metadata to your documents a
+lot easier. For machine learning, it relies on the free (GPL)
+[Stanford Core NLP library](https://github.com/stanfordnlp/CoreNLP).
 
 Docspell also runs OCR (if needed) on your documents, can provide
 fulltext search and has great e-mail integration. Everything is
 accessible via a REST/HTTP api. A mobile friendly SPA web application
-is provided as the user interface and an [Android
-app](https://github.com/docspell/android-client) for conveniently
-uploading files from your phone/tablet. The [feature
-overview](https://docspell.org/#feature-selection) has a more complete
-list.
+is the default user interface. An [Android
+app](https://github.com/docspell/android-client) exists for
+conveniently uploading files from your phone/tablet. The [feature
+overview](https://docspell.org/#feature-selection) lists some more
+points.
 
 ## Impressions

@@ -131,7 +131,8 @@ val openapiScalaSettings = Seq(
         case "ident" =>
           field => field.copy(typeDef = TypeDef("Ident", Imports("docspell.common.Ident")))
         case "accountid" =>
-          field => field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
+          field =>
+            field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
         case "collectivestate" =>
           field =>
             field.copy(typeDef =
@@ -190,6 +191,9 @@ val openapiScalaSettings = Seq(
           field.copy(typeDef =
             TypeDef("CustomFieldType", Imports("docspell.common.CustomFieldType"))
           )
+        case "listtype" =>
+          field =>
+            field.copy(typeDef = TypeDef("ListType", Imports("docspell.common.ListType")))
     }))
 )

@@ -15,6 +15,17 @@ RUN apk add --no-cache openjdk11-jre \
     tesseract-ocr \
     tesseract-ocr-data-deu \
     tesseract-ocr-data-fra \
+    tesseract-ocr-data-ita \
+    tesseract-ocr-data-spa \
+    tesseract-ocr-data-por \
+    tesseract-ocr-data-ces \
+    tesseract-ocr-data-nld \
+    tesseract-ocr-data-dan \
+    tesseract-ocr-data-fin \
+    tesseract-ocr-data-nor \
+    tesseract-ocr-data-swe \
+    tesseract-ocr-data-rus \
+    tesseract-ocr-data-ron \
     unpaper \
     wkhtmltopdf \
     libreoffice \

@@ -0,0 +1,7 @@
package docspell.analysis
import java.nio.file.Path
import docspell.common._
case class NlpSettings(lang: Language, highRecall: Boolean, regexNer: Option[Path])
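Note: `NlpSettings` is the new settings type threaded through the analyser in place of `StanfordNerSettings`. A minimal construction sketch (the concrete values and the file path are illustrative, not from this commit):

import java.nio.file.Paths
import docspell.common._

// Illustrative only: German, favour recall, with an optional
// regexner mapping file.
val settings = NlpSettings(
  lang = Language.German,
  highRecall = true,
  regexNer = Some(Paths.get("/opt/docspell/regexner.txt"))
)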

@@ -1,29 +1,30 @@
 package docspell.analysis
 
+import cats.Applicative
 import cats.effect._
 import cats.implicits._
 
+import docspell.analysis.classifier.{StanfordTextClassifier, TextClassifier}
 import docspell.analysis.contact.Contact
 import docspell.analysis.date.DateFind
-import docspell.analysis.nlp.PipelineCache
-import docspell.analysis.nlp.StanfordNerClassifier
-import docspell.analysis.nlp.StanfordNerSettings
-import docspell.analysis.nlp.StanfordTextClassifier
-import docspell.analysis.nlp.TextClassifier
+import docspell.analysis.nlp._
 import docspell.common._
+import org.log4s.getLogger
 
 trait TextAnalyser[F[_]] {
 
   def annotate(
       logger: Logger[F],
-      settings: StanfordNerSettings,
+      settings: NlpSettings,
       cacheKey: Ident,
       text: String
   ): F[TextAnalyser.Result]
 
-  def classifier(blocker: Blocker)(implicit CS: ContextShift[F]): TextClassifier[F]
+  def classifier: TextClassifier[F]
 }
 object TextAnalyser {
+  private[this] val logger = getLogger
 
   case class Result(labels: Vector[NerLabel], dates: Vector[NerDateLabel]) {
@@ -31,31 +32,30 @@ object TextAnalyser {
       labels ++ dates.map(dl => dl.label.copy(label = dl.date.toString))
   }
 
-  def create[F[_]: Concurrent: Timer](
-      cfg: TextAnalysisConfig
+  def create[F[_]: Concurrent: Timer: ContextShift](
+      cfg: TextAnalysisConfig,
+      blocker: Blocker
   ): Resource[F, TextAnalyser[F]] =
     Resource
-      .liftF(PipelineCache[F](cfg.clearStanfordPipelineInterval))
-      .map(cache =>
+      .liftF(Nlp(cfg.nlpConfig))
+      .map(stanfordNer =>
        new TextAnalyser[F] {
          def annotate(
              logger: Logger[F],
-              settings: StanfordNerSettings,
+              settings: NlpSettings,
              cacheKey: Ident,
              text: String
          ): F[TextAnalyser.Result] =
            for {
              input <- textLimit(logger, text)
-              tags0 <- stanfordNer(cacheKey, settings, input)
+              tags0 <- stanfordNer(Nlp.Input(cacheKey, settings, logger, input))
              tags1 <- contactNer(input)
              dates <- dateNer(settings.lang, input)
              list  = tags0 ++ tags1
              spans = NerLabelSpan.build(list)
            } yield Result(spans ++ list, dates)
 
-          def classifier(blocker: Blocker)(implicit
-              CS: ContextShift[F]
-          ): TextClassifier[F] =
+          def classifier: TextClassifier[F] =
            new StanfordTextClassifier[F](cfg.classifier, blocker)
 
          private def textLimit(logger: Logger[F], text: String): F[String] =
@@ -66,10 +66,6 @@ object TextAnalyser {
                s" Analysing only first ${cfg.maxLength} characters."
              ) *> text.take(cfg.maxLength).pure[F]
 
-          private def stanfordNer(key: Ident, settings: StanfordNerSettings, text: String)
-              : F[Vector[NerLabel]] =
-            StanfordNerClassifier.nerAnnotate[F](key.id, cache)(settings, text)
-
          private def contactNer(text: String): F[Vector[NerLabel]] =
            Sync[F].delay {
              Contact.annotate(text)
@@ -82,4 +78,36 @@ object TextAnalyser {
          }
        )
 
+  /** Provides the nlp pipeline based on the configuration. */
+  private object Nlp {
+    def apply[F[_]: Concurrent: Timer: BracketThrow](
+        cfg: TextAnalysisConfig.NlpConfig
+    ): F[Input[F] => F[Vector[NerLabel]]] =
+      cfg.mode match {
+        case NlpMode.Disabled =>
+          Logger.log4s(logger).info("NLP is disabled as defined in config.") *>
+            Applicative[F].pure(_ => Vector.empty[NerLabel].pure[F])
+        case _ =>
+          PipelineCache(cfg.clearInterval)(
+            Annotator[F](cfg.mode),
+            Annotator.clearCaches[F]
+          )
+            .map(annotate[F])
+      }
+
+    final case class Input[F[_]](
+        key: Ident,
+        settings: NlpSettings,
+        logger: Logger[F],
+        text: String
+    )
+
+    def annotate[F[_]: BracketThrow](
+        cache: PipelineCache[F]
+    )(input: Input[F]): F[Vector[NerLabel]] =
+      cache
+        .obtain(input.key.id, input.settings)
+        .use(ann => ann.nerAnnotate(input.logger)(input.text))
+  }
 }
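Note: since `classifier` no longer takes a blocker, the blocker moves to `create`. A rough usage sketch under these assumptions (`cfg`, `blocker` and `logger` are supplied by the caller; the sample text and key are made up):

import cats.effect._
import docspell.common._

// Sketch: acquire the analyser resource once, then annotate a text.
def example(cfg: TextAnalysisConfig, blocker: Blocker, logger: Logger[IO])(implicit
    CS: ContextShift[IO],
    T: Timer[IO]
): IO[TextAnalyser.Result] =
  TextAnalyser
    .create[IO](cfg, blocker)
    .use(
      _.annotate(
        logger,
        NlpSettings(Language.German, highRecall = false, regexNer = None),
        Ident.unsafe("collective-1"),
        "Max Mustermann schrieb am 12.03.2021 ..."
      )
    )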

@@ -1,10 +1,16 @@
 package docspell.analysis
 
-import docspell.analysis.nlp.TextClassifierConfig
+import docspell.analysis.TextAnalysisConfig.NlpConfig
+import docspell.analysis.classifier.TextClassifierConfig
 import docspell.common._
 
 case class TextAnalysisConfig(
     maxLength: Int,
-    clearStanfordPipelineInterval: Duration,
+    nlpConfig: NlpConfig,
     classifier: TextClassifierConfig
 )
+
+object TextAnalysisConfig {
+
+  case class NlpConfig(clearInterval: Duration, mode: NlpMode)
+}

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import java.nio.file.Path

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import java.nio.file.Path
@@ -7,8 +7,11 @@ import cats.effect.concurrent.Ref
 import cats.implicits._
 import fs2.Stream
 
-import docspell.analysis.nlp.TextClassifier._
+import docspell.analysis.classifier
+import docspell.analysis.classifier.TextClassifier._
+import docspell.analysis.nlp.Properties
 import docspell.common._
+import docspell.common.syntax.FileSyntax._
 import edu.stanford.nlp.classify.ColumnDataClassifier
@@ -26,7 +29,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
       .use { dir =>
         for {
           rawData   <- writeDataFile(blocker, dir, data)
-          _         <- logger.info(s"Learning from ${rawData.count} items.")
+          _         <- logger.debug(s"Learning from ${rawData.count} items.")
           trainData <- splitData(logger, rawData)
           scores    <- cfg.classifierConfigs.traverse(m => train(logger, trainData, m))
           sorted = scores.sortBy(-_.score)
@@ -43,7 +46,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
       case Some(text) =>
         Sync[F].delay {
           val cls = ColumnDataClassifier.getClassifier(
-            model.model.normalize().toAbsolutePath().toString()
+            model.model.normalize().toAbsolutePath.toString
           )
           val cat = cls.classOf(cls.makeDatumFromLine("\t\t" + normalisedText(text)))
           Option(cat)
@@ -65,7 +68,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
         val cdc = new ColumnDataClassifier(Properties.fromMap(amendProps(in, props)))
         cdc.trainClassifier(in.train.toString())
         val score = cdc.testClassifier(in.test.toString())
-        TrainResult(score.first(), ClassifierModel(in.modelFile))
+        TrainResult(score.first(), classifier.ClassifierModel(in.modelFile))
       }
       _ <- logger.debug(s"Trained with result $res")
     } yield res
@@ -136,9 +139,9 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
       props: Map[String, String]
   ): Map[String, String] =
     prepend("2.", props) ++ Map(
-      "trainFile"   -> trainData.train.normalize().toAbsolutePath().toString(),
-      "testFile"    -> trainData.test.normalize().toAbsolutePath().toString(),
-      "serializeTo" -> trainData.modelFile.normalize().toAbsolutePath().toString()
+      "trainFile"   -> trainData.train.absolutePathAsString,
+      "testFile"    -> trainData.test.absolutePathAsString,
+      "serializeTo" -> trainData.modelFile.absolutePathAsString
     ).toList
 
 case class RawData(count: Long, file: Path)

@@ -1,9 +1,9 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import cats.data.Kleisli
 import fs2.Stream
 
-import docspell.analysis.nlp.TextClassifier.Data
+import docspell.analysis.classifier.TextClassifier.Data
 import docspell.common._
 
 trait TextClassifier[F[_]] {

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import java.nio.file.Path

@@ -41,23 +41,41 @@ object DateFind {
   }
 
   object SimpleDate {
-    val p0 = (readYear >> readMonth >> readDay).map { case ((y, m), d) =>
-      List(SimpleDate(y, m, d))
+    def pattern0(lang: Language) = (readYear >> readMonth(lang) >> readDay).map {
+      case ((y, m), d) =>
+        List(SimpleDate(y, m, d))
     }
-    val p1 = (readDay >> readMonth >> readYear).map { case ((d, m), y) =>
-      List(SimpleDate(y, m, d))
+    def pattern1(lang: Language) = (readDay >> readMonth(lang) >> readYear).map {
+      case ((d, m), y) =>
+        List(SimpleDate(y, m, d))
     }
-    val p2 = (readMonth >> readDay >> readYear).map { case ((m, d), y) =>
-      List(SimpleDate(y, m, d))
+    def pattern2(lang: Language) = (readMonth(lang) >> readDay >> readYear).map {
+      case ((m, d), y) =>
+        List(SimpleDate(y, m, d))
     }
 
     // ymd , ydm, dmy , dym, myd, mdy
     def fromParts(parts: List[Word], lang: Language): List[SimpleDate] = {
+      val ymd = pattern0(lang)
+      val dmy = pattern1(lang)
+      val mdy = pattern2(lang)
+      // most is from wikipedia
       val p = lang match {
         case Language.English =>
-          p2.alt(p1).map(t => t._1 ++ t._2).or(p2).or(p0).or(p1)
-        case Language.German => p1.or(p0).or(p2)
-        case Language.French => p1.or(p0).or(p2)
+          mdy.alt(dmy).map(t => t._1 ++ t._2).or(mdy).or(ymd).or(dmy)
+        case Language.German     => dmy.or(ymd).or(mdy)
+        case Language.French     => dmy.or(ymd).or(mdy)
+        case Language.Italian    => dmy.or(ymd).or(mdy)
+        case Language.Spanish    => dmy.or(ymd).or(mdy)
+        case Language.Czech      => dmy.or(ymd).or(mdy)
+        case Language.Danish     => dmy.or(ymd).or(mdy)
+        case Language.Finnish    => dmy.or(ymd).or(mdy)
+        case Language.Norwegian  => dmy.or(ymd).or(mdy)
+        case Language.Portuguese => dmy.or(ymd).or(mdy)
+        case Language.Romanian   => dmy.or(ymd).or(mdy)
+        case Language.Russian    => dmy.or(ymd).or(mdy)
+        case Language.Swedish    => ymd.or(dmy).or(mdy)
+        case Language.Dutch      => dmy.or(ymd).or(mdy)
       }
       p.read(parts) match {
         case Result.Success(sds, _) =>
@@ -76,9 +94,11 @@ object DateFind {
         }
       )
 
-    def readMonth: Reader[Int] =
+    def readMonth(lang: Language): Reader[Int] =
       Reader.readFirst(w =>
-        Some(months.indexWhere(_.contains(w.value))).filter(_ >= 0).map(_ + 1)
+        Some(MonthName.getAll(lang).indexWhere(_.contains(w.value)))
+          .filter(_ >= 0)
+          .map(_ + 1)
       )
 
     def readDay: Reader[Int] =
@@ -150,20 +170,5 @@ object DateFind {
           Failure
       }
     }
-
-    private val months = List(
-      List("jan", "january", "januar", "01"),
-      List("feb", "february", "februar", "02"),
-      List("mar", "march", "märz", "marz", "03"),
-      List("apr", "april", "04"),
-      List("may", "mai", "05"),
-      List("jun", "june", "juni", "06"),
-      List("jul", "july", "juli", "07"),
-      List("aug", "august", "08"),
-      List("sep", "september", "09"),
-      List("oct", "october", "oktober", "10"),
-      List("nov", "november", "11"),
-      List("dec", "december", "dezember", "12")
-    )
   }
 }

@@ -0,0 +1,270 @@
package docspell.analysis.date
import docspell.common.Language
object MonthName {
def getAll(lang: Language): List[List[String]] =
merge(numbers, forLang(lang))
private def merge(n0: List[List[String]], ns: List[List[String]]*): List[List[String]] =
ns.foldLeft(n0) { (res, el) =>
res.zip(el).map({ case (a, b) => a ++ b })
}
private def forLang(lang: Language): List[List[String]] =
lang match {
case Language.English =>
english
case Language.German =>
german
case Language.French =>
french
case Language.Italian =>
italian
case Language.Spanish =>
spanish
case Language.Swedish =>
swedish
case Language.Norwegian =>
norwegian
case Language.Dutch =>
dutch
case Language.Czech =>
czech
case Language.Danish =>
danish
case Language.Portuguese =>
portuguese
case Language.Romanian =>
romanian
case Language.Finnish =>
finnish
case Language.Russian =>
russian
}
private val numbers = List(
List("01"),
List("02"),
List("03"),
List("04"),
List("05"),
List("06"),
List("07"),
List("08"),
List("09"),
List("10"),
List("11"),
List("12")
)
private val english = List(
List("jan", "january"),
List("feb", "february"),
List("mar", "march"),
List("apr", "april"),
List("may"),
List("jun", "june"),
List("jul", "july"),
List("aug", "august"),
List("sept", "september"),
List("oct", "october"),
List("nov", "november"),
List("dec", "december")
)
private val german = List(
List("jan", "januar"),
List("feb", "februar"),
List("märz"),
List("apr", "april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dez", "dezember")
)
private val french = List(
List("janv", "janvier"),
List("févr", "fevr", "février", "fevrier"),
List("mars"),
List("avril"),
List("mai"),
List("juin"),
List("juil", "juillet"),
List("aout", "août"),
List("sept", "septembre"),
List("oct", "octobre"),
List("nov", "novembre"),
List("dec", "déc", "décembre", "decembre")
)
private val italian = List(
List("genn", "gennaio"),
List("febbr", "febbraio"),
List("mar", "marzo"),
List("apr", "aprile"),
List("magg", "maggio"),
List("giugno"),
List("luglio"),
List("ag", "agosto"),
List("sett", "settembre"),
List("ott", "ottobre"),
List("nov", "novembre"),
List("dic", "dicembre")
)
private val spanish = List(
List("ene", "enero"),
List("feb", "febrero"),
List("mar", "marzo"),
List("abr", "abril"),
List("may", "mayo"),
List("jun"),
List("jul"),
List("ago", "agosto"),
List("sep", "septiembre"),
List("oct", "octubre"),
List("nov", "noviembre"),
List("dic", "diciembre")
)
private val swedish = List(
List("jan", "januari"),
List("febr", "februari"),
List("mars"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "augusti"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val norwegian = List(
List("jan", "januar"),
List("febr", "februar"),
List("mars"),
List("april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("des", "desember")
)
private val czech = List(
List("led", "leden"),
List("un", "ún", "únor", "unor"),
List("brez", "březen", "brezen"),
List("dub", "duben"),
List("kvet", "květen"),
List("cerv", "červen"),
List("cerven", "červenec"),
List("srp", "srpen"),
List("zari", "září"),
List("ríj", "rij", "říjen"),
List("list", "listopad"),
List("pros", "prosinec")
)
private val romanian = List(
List("ian", "ianuarie"),
List("feb", "februarie"),
List("mar", "martie"),
List("apr", "aprilie"),
List("mai"),
List("iunie"),
List("iulie"),
List("aug", "august"),
List("sept", "septembrie"),
List("oct", "octombrie"),
List("noem", "nov", "noiembrie"),
List("dec", "decembrie")
)
private val danish = List(
List("jan", "januar"),
List("febr", "februar"),
List("marts"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val portuguese = List(
List("jan", "janeiro"),
List("fev", "fevereiro"),
List("março", "marco"),
List("abril"),
List("maio"),
List("junho"),
List("julho"),
List("agosto"),
List("set", "setembro"),
List("out", "outubro"),
List("nov", "novembro"),
List("dez", "dezembro")
)
private val finnish = List(
List("tammikuu"),
List("helmikuu"),
List("maaliskuu"),
List("huhtikuu"),
List("toukokuu"),
List("kesäkuu"),
List("heinäkuu"),
List("elokuu"),
List("syyskuu"),
List("lokakuu"),
List("marraskuu"),
List("joulukuu")
)
private val russian = List(
List("январь"),
List("февраль"),
List("март"),
List("апрель"),
List("май"),
List("июнь"),
List("июль"),
List("август"),
List("сентябрь"),
List("октябрь"),
List("ноябрь"),
List("декабрь")
)
private val dutch = List(
List("jan", "januari"),
List("feb", "februari"),
List("maart"),
List("apr", "april"),
List("mei"),
List("juni"),
List("juli"),
List("aug", "augustus"),
List("sept", "september"),
List("okt", "oct", "oktober"),
List("nov", "november"),
List("dec", "december")
)
}
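Note: a quick illustration of how `getAll` is consumed by the date reader above; the month number is the list position plus one, and the numeric spellings are merged into every language's lists:

import docspell.common.Language

// "märz" sits in the third merged German entry, so the reader
// yields month number 3; "03" would match the same entry.
val german = MonthName.getAll(Language.German)
val march  = german.indexWhere(_.contains("märz")) + 1 // == 3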

@@ -0,0 +1,98 @@
package docspell.analysis.nlp
import cats.effect.Sync
import cats.implicits._
import cats.{Applicative, FlatMap}
import docspell.analysis.NlpSettings
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
/** Analyses a text to mark certain parts with a `NerLabel`. */
trait Annotator[F[_]] { self =>
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]]
def ++(next: Annotator[F])(implicit F: FlatMap[F]): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
for {
n0 <- self.nerAnnotate(logger)(text)
n1 <- next.nerAnnotate(logger)(text)
} yield (n0 ++ n1).distinct
}
}
object Annotator {
/** Creates an annotator according to the given `mode` and `settings`.
*
* There are the following ways:
*
* - disabled: it returns a no-op annotator that always gives an empty list
* - full: the complete stanford pipeline is used
* - basic: only the ner classifier is used
*
* Additionally, if there is a regexNer-file specified, the regexner annotator is
* also run. In case the full pipeline is used, this is already included.
*/
def apply[F[_]: Sync](mode: NlpMode)(settings: NlpSettings): Annotator[F] =
mode match {
case NlpMode.Disabled =>
Annotator.none[F]
case NlpMode.Full =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(ss) =>
Annotator.pipeline(StanfordNerAnnotator.makePipeline(ss))
case None =>
Annotator.none[F]
}
case NlpMode.Basic =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(StanfordNerSettings.Full(lang, _, Some(file))) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang)) ++
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case Some(StanfordNerSettings.Full(lang, _, None)) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang))
case Some(StanfordNerSettings.RegexOnly(file)) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
case NlpMode.RegexOnly =>
settings.regexNer match {
case Some(file) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
}
def none[F[_]: Applicative]: Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
logger.debug("Running empty annotator. NLP not supported.") *>
Vector.empty[NerLabel].pure[F]
}
def basic[F[_]: Sync](ann: BasicCRFAnnotator.Annotator): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(
BasicCRFAnnotator.nerAnnotate(ann)(text)
)
}
def pipeline[F[_]: Sync](cp: StanfordCoreNLP): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(StanfordNerAnnotator.nerAnnotate(cp, text))
}
def clearCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
StanfordCoreNLP.clearAnnotatorPool()
BasicCRFAnnotator.Cache.clearCache()
}
}
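Note: a small composition sketch, assuming a supported NLP language is available; `++` runs both annotators and de-duplicates the concatenated labels. In practice the second part would be a regexner pipeline, as in `Annotator.apply` above:

import cats.effect.IO
import docspell.common._

// Basic CRF NER for German, extended by a no-op annotator purely to
// show how the combinator composes.
val basic: Annotator[IO] =
  Annotator.basic[IO](BasicCRFAnnotator.Cache.getAnnotator(Language.German))

val composed: Annotator[IO] = basic ++ Annotator.none[IO]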

@@ -0,0 +1,94 @@
package docspell.analysis.nlp
import java.net.URL
import java.util.concurrent.atomic.AtomicReference
import java.util.zip.GZIPInputStream
import scala.jdk.CollectionConverters._
import scala.util.Using
import docspell.common.Language.NLPLanguage
import docspell.common._
import edu.stanford.nlp.ie.AbstractSequenceClassifier
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.{CoreAnnotations, CoreLabel}
import org.log4s.getLogger
/** This is only using the CRFClassifier without building an analysis
* pipeline. The ner-classifier cannot use results from POS-tagging
* etc. and is therefore not as good as the [[StanfordNerAnnotator]].
* But it uses less memory, while still being not bad.
*/
object BasicCRFAnnotator {
private[this] val logger = getLogger
// assert correct resource names
List(Language.French, Language.German, Language.English).foreach(classifierResource)
type Annotator = AbstractSequenceClassifier[CoreLabel]
def nerAnnotate(nerClassifier: Annotator)(text: String): Vector[NerLabel] =
nerClassifier
.classify(text)
.asScala
.flatMap(a => a.asScala)
.collect(Function.unlift { label =>
val tag = label.get(classOf[CoreAnnotations.AnswerAnnotation])
NerTag
.fromString(Option(tag).getOrElse(""))
.toOption
.map(t => NerLabel(label.word(), t, label.beginPosition(), label.endPosition()))
})
.toVector
def makeAnnotator(lang: NLPLanguage): Annotator = {
logger.info(s"Creating ${lang.name} Stanford NLP NER-only classifier...")
val ner = classifierResource(lang)
Using(new GZIPInputStream(ner.openStream())) { in =>
CRFClassifier.getClassifier(in).asInstanceOf[Annotator]
}.fold(throw _, identity)
}
private def classifierResource(lang: NLPLanguage): URL = {
def check(name: String): URL =
Option(getClass.getResource(name)) match {
case None =>
sys.error(s"NER model resource '$name' not found for language ${lang.name}")
case Some(url) => url
}
check(lang match {
case Language.French =>
"/edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz"
case Language.German =>
"/edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz"
case Language.English =>
"/edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
})
}
final class Cache {
private[this] lazy val germanNerClassifier = makeAnnotator(Language.German)
private[this] lazy val englishNerClassifier = makeAnnotator(Language.English)
private[this] lazy val frenchNerClassifier = makeAnnotator(Language.French)
def forLang(language: NLPLanguage): Annotator =
language match {
case Language.French => frenchNerClassifier
case Language.German => germanNerClassifier
case Language.English => englishNerClassifier
}
}
object Cache {
private[this] val cacheRef = new AtomicReference[Cache](new Cache)
def getAnnotator(language: NLPLanguage): Annotator =
cacheRef.get().forLang(language)
def clearCache(): Unit =
cacheRef.set(new Cache)
}
}
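Note: a minimal usage sketch; the cache loads each CRF model lazily and once, and `clearCache` drops all of them (requires the NER model resources on the classpath, and the text is a made-up sample):

import docspell.common._

// Annotate a snippet with the English CRF model.
val ner = BasicCRFAnnotator.Cache.getAnnotator(Language.English)
val labels: Vector[NerLabel] =
  BasicCRFAnnotator.nerAnnotate(ner)("Derek Jeter, 24 Elm Ave., Treesville")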

@@ -7,9 +7,9 @@ import cats.effect._
 import cats.effect.concurrent.Ref
 import cats.implicits._
 
+import docspell.analysis.NlpSettings
 import docspell.common._
-import edu.stanford.nlp.pipeline.StanfordCoreNLP
 import org.log4s.getLogger
 
 /** Creating the StanfordCoreNLP pipeline is quite expensive as it
@@ -21,46 +21,45 @@ import org.log4s.getLogger
   */
 trait PipelineCache[F[_]] {
 
-  def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP]
+  def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]]
 }
 
 object PipelineCache {
   private[this] val logger = getLogger
 
-  def none[F[_]: Applicative]: PipelineCache[F] =
-    new PipelineCache[F] {
-      def obtain(
-          ignored: String,
-          settings: StanfordNerSettings
-      ): Resource[F, StanfordCoreNLP] =
-        Resource.liftF(makeClassifier(settings).pure[F])
-    }
-
-  def apply[F[_]: Concurrent: Timer](clearInterval: Duration): F[PipelineCache[F]] =
+  def apply[F[_]: Concurrent: Timer](clearInterval: Duration)(
+      creator: NlpSettings => Annotator[F],
+      release: F[Unit]
+  ): F[PipelineCache[F]] =
     for {
-      data       <- Ref.of(Map.empty[String, Entry])
-      cacheClear <- CacheClearing.create(data, clearInterval)
-    } yield new Impl[F](data, cacheClear)
+      data       <- Ref.of(Map.empty[String, Entry[Annotator[F]]])
+      cacheClear <- CacheClearing.create(data, clearInterval, release)
+      _          <- Logger.log4s(logger).info("Creating nlp pipeline cache")
+    } yield new Impl[F](data, creator, cacheClear)
 
   final private class Impl[F[_]: Sync](
-      data: Ref[F, Map[String, Entry]],
+      data: Ref[F, Map[String, Entry[Annotator[F]]]],
+      creator: NlpSettings => Annotator[F],
       cacheClear: CacheClearing[F]
   ) extends PipelineCache[F] {
 
-    def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP] =
+    def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]] =
       for {
         _   <- cacheClear.withCache
         id  <- Resource.liftF(makeSettingsId(settings))
-        nlp <- Resource.liftF(data.modify(cache => getOrCreate(key, id, cache, settings)))
+        nlp <- Resource.liftF(
+          data.modify(cache => getOrCreate(key, id, cache, settings, creator))
+        )
       } yield nlp
 
     private def getOrCreate(
         key: String,
         id: String,
-        cache: Map[String, Entry],
-        settings: StanfordNerSettings
-    ): (Map[String, Entry], StanfordCoreNLP) =
+        cache: Map[String, Entry[Annotator[F]]],
+        settings: NlpSettings,
+        creator: NlpSettings => Annotator[F]
+    ): (Map[String, Entry[Annotator[F]]], Annotator[F]) =
       cache.get(key) match {
         case Some(entry) =>
           if (entry.id == id) (cache, entry.value)
@@ -68,18 +67,18 @@ object PipelineCache {
             logger.info(
               s"StanfordNLP settings changed for key $key. Creating new classifier"
             )
-            val nlp = makeClassifier(settings)
+            val nlp = creator(settings)
             val e   = Entry(id, nlp)
             (cache.updated(key, e), nlp)
           }
         case None =>
-          val nlp = makeClassifier(settings)
+          val nlp = creator(settings)
          val e   = Entry(id, nlp)
          (cache.updated(key, e), nlp)
      }
 
-    private def makeSettingsId(settings: StanfordNerSettings): F[String] = {
+    private def makeSettingsId(settings: NlpSettings): F[String] = {
       val base = settings.copy(regexNer = None).toString
       val size: F[Long] =
         settings.regexNer match {
@@ -104,9 +103,10 @@ object PipelineCache {
         Resource.pure[F, Unit](())
     }
 
-    def create[F[_]: Concurrent: Timer](
-        data: Ref[F, Map[String, Entry]],
-        interval: Duration
+    def create[F[_]: Concurrent: Timer, A](
+        data: Ref[F, Map[String, Entry[A]]],
+        interval: Duration,
+        release: F[Unit]
     ): F[CacheClearing[F]] =
       for {
         counter <- Ref.of(0L)
@@ -121,16 +121,23 @@ object PipelineCache {
             log
               .info(s"Clearing StanfordNLP cache after $interval idle time")
               .map(_ =>
-                new CacheClearingImpl[F](data, counter, cleaning, interval.toScala)
+                new CacheClearingImpl[F, A](
+                  data,
+                  counter,
+                  cleaning,
+                  interval.toScala,
+                  release
+                )
               )
       } yield result
   }
 
-  final private class CacheClearingImpl[F[_]](
-      data: Ref[F, Map[String, Entry]],
+  final private class CacheClearingImpl[F[_], A](
+      data: Ref[F, Map[String, Entry[A]]],
       counter: Ref[F, Long],
       cleaningFiber: Ref[F, Option[Fiber[F, Unit]]],
-      clearInterval: FiniteDuration
+      clearInterval: FiniteDuration,
+      release: F[Unit]
   )(implicit T: Timer[F], F: Concurrent[F])
       extends CacheClearing[F] {
     private[this] val log = Logger.log4s[F](logger)
@@ -158,17 +165,10 @@ object PipelineCache {
     def clearAll: F[Unit] =
       log.info("Clearing stanford nlp cache now!") *>
-        data.set(Map.empty) *> Sync[F].delay {
-          // turns out that everything is cached in a static map
-          StanfordCoreNLP.clearAnnotatorPool()
+        data.set(Map.empty) *> release *> Sync[F].delay {
           System.gc();
         }
   }
 
-  private def makeClassifier(settings: StanfordNerSettings): StanfordCoreNLP = {
-    logger.info(s"Creating ${settings.lang.name} Stanford NLP NER classifier...")
-    new StanfordCoreNLP(Properties.forSettings(settings))
-  }
-
-  private case class Entry(id: String, value: StanfordCoreNLP)
+  private case class Entry[A](id: String, value: A)
 }

@@ -1,9 +1,11 @@
 package docspell.analysis.nlp
 
+import java.nio.file.Path
 import java.util.{Properties => JProps}
 
 import docspell.analysis.nlp.Properties.Implicits._
 import docspell.common._
+import docspell.common.syntax.FileSyntax._
 
 object Properties {
@@ -17,17 +19,20 @@ object Properties {
     p
   }
 
-  def forSettings(settings: StanfordNerSettings): JProps = {
-    val regexNerFile = settings.regexNer
-      .map(p => p.normalize().toAbsolutePath().toString())
-    settings.lang match {
-      case Language.German =>
-        Properties.nerGerman(regexNerFile, settings.highRecall)
-      case Language.English =>
-        Properties.nerEnglish(regexNerFile)
-      case Language.French =>
-        Properties.nerFrench(regexNerFile, settings.highRecall)
-    }
-  }
+  def forSettings(settings: StanfordNerSettings): JProps =
+    settings match {
+      case StanfordNerSettings.Full(lang, highRecall, regexNer) =>
+        val regexNerFile = regexNer.map(p => p.absolutePathAsString)
+        lang match {
+          case Language.German =>
+            Properties.nerGerman(regexNerFile, highRecall)
+          case Language.English =>
+            Properties.nerEnglish(regexNerFile)
+          case Language.French =>
+            Properties.nerFrench(regexNerFile, highRecall)
+        }
+      case StanfordNerSettings.RegexOnly(path) =>
+        Properties.regexNerOnly(path)
+    }
 
   def nerGerman(regexNerMappingFile: Option[String], highRecall: Boolean): JProps =
@@ -76,6 +81,11 @@ object Properties {
       "ner.model" -> "edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
     ).withRegexNer(regexNerMappingFile).withHighRecall(highRecall)
 
+  def regexNerOnly(regexNerMappingFile: Path): JProps =
+    Properties(
+      "annotators" -> "tokenize,ssplit"
+    ).withRegexNer(Some(regexNerMappingFile.absolutePathAsString))
+
   object Implicits {
     implicit final class JPropsOps(val p: JProps) extends AnyVal {

@@ -0,0 +1,52 @@
package docspell.analysis.nlp
import java.nio.file.Path
import scala.jdk.CollectionConverters._
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import org.log4s.getLogger
object StanfordNerAnnotator {
private[this] val logger = getLogger
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
def makePipeline(settings: StanfordNerSettings): StanfordCoreNLP =
settings match {
case s: StanfordNerSettings.Full =>
logger.info(s"Creating ${s.lang.name} Stanford NLP NER classifier...")
new StanfordCoreNLP(Properties.forSettings(settings))
case StanfordNerSettings.RegexOnly(path) =>
logger.info(s"Creating regexNer-only Stanford NLP NER classifier...")
regexNerPipeline(path)
}
def regexNerPipeline(regexNerFile: Path): StanfordCoreNLP =
new StanfordCoreNLP(Properties.regexNerOnly(regexNerFile))
def clearPipelineCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
// turns out that everything is cached in a static map
StanfordCoreNLP.clearAnnotatorPool()
}
}
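Note: a short sketch of building and using a full pipeline (the settings and the text are sample values, not from this commit; the Stanford models must be on the classpath):

import docspell.common._

// Full German pipeline without a regexner file; annotate one line.
val pipeline = StanfordNerAnnotator.makePipeline(
  StanfordNerSettings.Full(Language.German, highRecall = false, regexNer = None)
)
val labels = StanfordNerAnnotator.nerAnnotate(pipeline, "Max Mustermann, Lilienweg 21")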

@@ -1,39 +0,0 @@
package docspell.analysis.nlp
import scala.jdk.CollectionConverters._
import cats.Applicative
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
object StanfordNerClassifier {
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate[F[_]: BracketThrow](
cacheKey: String,
cache: PipelineCache[F]
)(settings: StanfordNerSettings, text: String): F[Vector[NerLabel]] =
cache
.obtain(cacheKey, settings)
.use(crf => Applicative[F].pure(runClassifier(crf, text)))
def runClassifier(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
}

@@ -2,7 +2,12 @@ package docspell.analysis.nlp
 
 import java.nio.file.Path
 
-import docspell.common._
+import docspell.analysis.NlpSettings
+import docspell.common.Language.NLPLanguage
+
+sealed trait StanfordNerSettings
+
+object StanfordNerSettings {
 
 /** Settings for configuring the stanford NER pipeline.
   *
@@ -19,8 +24,19 @@ import docspell.common._
   * as a last step to tag untagged tokens using the provided list of
   * regexps.
   */
-case class StanfordNerSettings(
-    lang: Language,
-    highRecall: Boolean,
-    regexNer: Option[Path]
-)
+  case class Full(
+      lang: NLPLanguage,
+      highRecall: Boolean,
+      regexNer: Option[Path]
+  ) extends StanfordNerSettings
+
+  /** Not all languages are supported with predefined statistical models.
+    * This allows to provide regexps only.
+    */
+  case class RegexOnly(regexNerFile: Path) extends StanfordNerSettings
+
+  def fromNlpSettings(ns: NlpSettings): Option[StanfordNerSettings] =
+    NLPLanguage.all
+      .find(nl => nl == ns.lang)
+      .map(nl => Full(nl, ns.highRecall, ns.regexNer))
+      .orElse(ns.regexNer.map(nrf => RegexOnly(nrf)))
+}

@@ -0,0 +1,12 @@
package docspell.analysis
object Env {
def isCI = bool("CI")
def bool(key: String): Boolean =
string(key).contains("true")
def string(key: String): Option[String] =
Option(System.getenv(key)).filter(_.nonEmpty)
}

@@ -1,4 +1,4 @@
-package docspell.analysis.nlp
+package docspell.analysis.classifier
 
 import minitest._
 import cats.effect._

@@ -1,19 +1,22 @@
 package docspell.analysis.nlp
 
+import docspell.analysis.Env
+import docspell.common.Language.NLPLanguage
 import minitest.SimpleTestSuite
 import docspell.files.TestFiles
 import docspell.common._
-import edu.stanford.nlp.pipeline.StanfordCoreNLP
 
-object TextAnalyserSuite extends SimpleTestSuite {
-  lazy val germanClassifier =
-    new StanfordCoreNLP(Properties.nerGerman(None, false))
-  lazy val englishClassifier =
-    new StanfordCoreNLP(Properties.nerEnglish(None))
+object BaseCRFAnnotatorSuite extends SimpleTestSuite {
+
+  def annotate(language: NLPLanguage): String => Vector[NerLabel] =
+    BasicCRFAnnotator.nerAnnotate(BasicCRFAnnotator.Cache.getAnnotator(language))
 
   test("find english ner labels") {
-    val labels =
-      StanfordNerClassifier.runClassifier(englishClassifier, TestFiles.letterENText)
+    if (Env.isCI) {
+      ignore("Test ignored on travis.")
+    }
+
+    val labels = annotate(Language.English)(TestFiles.letterENText)
     val expect = Vector(
       NerLabel("Derek", NerTag.Person, 0, 5),
       NerLabel("Jeter", NerTag.Person, 6, 11),
@@ -45,11 +48,15 @@ object TextAnalyserSuite extends SimpleTestSuite {
       NerLabel("Jeter", NerTag.Person, 1123, 1128)
     )
     assertEquals(labels, expect)
+    BasicCRFAnnotator.Cache.clearCache()
   }
 
   test("find german ner labels") {
-    val labels =
-      StanfordNerClassifier.runClassifier(germanClassifier, TestFiles.letterDEText)
+    if (Env.isCI) {
+      ignore("Test ignored on travis.")
+    }
+
+    val labels = annotate(Language.German)(TestFiles.letterDEText)
     val expect = Vector(
       NerLabel("Max", NerTag.Person, 0, 3),
      NerLabel("Mustermann", NerTag.Person, 4, 14),
@@ -65,5 +72,6 @@ object TextAnalyserSuite extends SimpleTestSuite {
       NerLabel("Mustermann", NerTag.Person, 509, 519)
     )
     assertEquals(labels, expect)
+    BasicCRFAnnotator.Cache.clearCache()
   }
 }

@@ -0,0 +1,120 @@
package docspell.analysis.nlp
import java.nio.file.Paths
import cats.effect.IO
import docspell.analysis.Env
import minitest.SimpleTestSuite
import docspell.files.TestFiles
import docspell.common._
import docspell.common.syntax.FileSyntax._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
object StanfordNerAnnotatorSuite extends SimpleTestSuite {
lazy val germanClassifier =
new StanfordCoreNLP(Properties.nerGerman(None, false))
lazy val englishClassifier =
new StanfordCoreNLP(Properties.nerEnglish(None))
test("find english ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(englishClassifier, TestFiles.letterENText)
val expect = Vector(
NerLabel("Derek", NerTag.Person, 0, 5),
NerLabel("Jeter", NerTag.Person, 6, 11),
NerLabel("Elm", NerTag.Misc, 17, 20),
NerLabel("Ave.", NerTag.Misc, 21, 25),
NerLabel("Treesville", NerTag.Misc, 27, 37),
NerLabel("Derek", NerTag.Person, 68, 73),
NerLabel("Jeter", NerTag.Person, 74, 79),
NerLabel("Elm", NerTag.Misc, 85, 88),
NerLabel("Ave.", NerTag.Misc, 89, 93),
NerLabel("Treesville", NerTag.Person, 95, 105),
NerLabel("Leaf", NerTag.Organization, 144, 148),
NerLabel("Chief", NerTag.Organization, 150, 155),
NerLabel("of", NerTag.Organization, 156, 158),
NerLabel("Syrup", NerTag.Organization, 159, 164),
NerLabel("Production", NerTag.Organization, 165, 175),
NerLabel("Old", NerTag.Organization, 176, 179),
NerLabel("Sticky", NerTag.Organization, 180, 186),
NerLabel("Pancake", NerTag.Organization, 187, 194),
NerLabel("Company", NerTag.Organization, 195, 202),
NerLabel("Maple", NerTag.Organization, 207, 212),
NerLabel("Lane", NerTag.Organization, 213, 217),
NerLabel("Forest", NerTag.Organization, 219, 225),
NerLabel("Hemptown", NerTag.Location, 239, 247),
NerLabel("Leaf", NerTag.Person, 276, 280),
NerLabel("Little", NerTag.Misc, 347, 353),
NerLabel("League", NerTag.Misc, 354, 360),
NerLabel("Derek", NerTag.Person, 1117, 1122),
NerLabel("Jeter", NerTag.Person, 1123, 1128)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("find german ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(germanClassifier, TestFiles.letterDEText)
val expect = Vector(
NerLabel("Max", NerTag.Person, 0, 3),
NerLabel("Mustermann", NerTag.Person, 4, 14),
NerLabel("Lilienweg", NerTag.Person, 16, 25),
NerLabel("Max", NerTag.Person, 77, 80),
NerLabel("Mustermann", NerTag.Person, 81, 91),
NerLabel("Lilienweg", NerTag.Location, 93, 102),
NerLabel("EasyCare", NerTag.Organization, 124, 132),
NerLabel("AG", NerTag.Organization, 133, 135),
NerLabel("Ackerweg", NerTag.Location, 158, 166),
NerLabel("Nebendorf", NerTag.Location, 184, 193),
NerLabel("Max", NerTag.Person, 505, 508),
NerLabel("Mustermann", NerTag.Person, 509, 519)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("regexner-only annotator") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val regexNerContent =
s"""(?i)volantino ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)volantino${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)andrea rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)andrea${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|""".stripMargin
File
.withTempDir[IO](Paths.get("target"), "test-regex-ner")
.use { dir =>
for {
out <- File.writeString[IO](dir / "regex.txt", regexNerContent)
ann = StanfordNerAnnotator.makePipeline(StanfordNerSettings.RegexOnly(out))
labels = StanfordNerAnnotator.nerAnnotate(ann, "Hello Andrea Rossi, can you.")
_ <- IO(
assertEquals(
labels,
Vector(
NerLabel("Andrea", NerTag.Person, 6, 12),
NerLabel("Rossi", NerTag.Person, 13, 18)
)
)
)
} yield ()
}
.unsafeRunSync()
StanfordCoreNLP.clearAnnotatorPool()
}
}

@@ -591,7 +591,7 @@ object OItem {
         for {
           itemIds <- store.transact(RItem.filterItems(items, collective))
           results <- itemIds.traverse(item => deleteItem(item, collective))
-          n = results.fold(0)(_ + _)
+          n = results.sum
         } yield n
 
       def getProposals(item: Ident, collective: Ident): F[MetaProposalList] =

@@ -1,5 +1,7 @@
 package docspell.common
 
+import cats.data.NonEmptyList
+
 import io.circe.{Decoder, Encoder}
 
 sealed trait Language { self: Product =>
@@ -11,28 +13,107 @@ sealed trait Language { self: Product =>
 
   def iso3: String
 
+  val allowsNLP: Boolean = false
+
   private[common] def allNames =
     Set(name, iso3, iso2)
 }
 
 object Language {
+  sealed trait NLPLanguage extends Language with Product {
+    override val allowsNLP = true
+  }
+  object NLPLanguage {
+    val all: NonEmptyList[NLPLanguage] = NonEmptyList.of(German, English, French)
+  }
 
-  case object German extends Language {
+  case object German extends NLPLanguage {
     val iso2 = "de"
     val iso3 = "deu"
   }
 
-  case object English extends Language {
+  case object English extends NLPLanguage {
     val iso2 = "en"
     val iso3 = "eng"
   }
 
-  case object French extends Language {
+  case object French extends NLPLanguage {
     val iso2 = "fr"
     val iso3 = "fra"
   }
 
-  val all: List[Language] = List(German, English, French)
+  case object Italian extends Language {
+    val iso2 = "it"
+    val iso3 = "ita"
+  }
+
+  case object Spanish extends Language {
+    val iso2 = "es"
+    val iso3 = "spa"
+  }
+
+  case object Portuguese extends Language {
+    val iso2 = "pt"
+    val iso3 = "por"
+  }
+
+  case object Czech extends Language {
+    val iso2 = "cs"
+    val iso3 = "ces"
+  }
+
+  case object Danish extends Language {
+    val iso2 = "da"
+    val iso3 = "dan"
+  }
+
+  case object Finnish extends Language {
+    val iso2 = "fi"
+    val iso3 = "fin"
+  }
+
+  case object Norwegian extends Language {
+    val iso2 = "no"
+    val iso3 = "nor"
+  }
+
+  case object Swedish extends Language {
+    val iso2 = "sv"
+    val iso3 = "swe"
+  }
+
+  case object Russian extends Language {
+    val iso2 = "ru"
+    val iso3 = "rus"
+  }
+
+  case object Romanian extends Language {
+    val iso2 = "ro"
+    val iso3 = "ron"
+  }
+
+  case object Dutch extends Language {
+    val iso2 = "nl"
+    val iso3 = "nld"
+  }
+
+  val all: List[Language] =
+    List(
+      German,
+      English,
+      French,
+      Italian,
+      Spanish,
+      Dutch,
+      Portuguese,
+      Czech,
+      Danish,
+      Finnish,
+      Norwegian,
+      Swedish,
+      Russian,
+      Romanian
+    )
 
 def fromString(str: String): Either[String, Language] = {
   val lang = str.toLowerCase

@@ -0,0 +1,33 @@
package docspell.common
import cats.data.NonEmptyList
import io.circe.{Decoder, Encoder}
sealed trait ListType { self: Product =>
def name: String =
productPrefix.toLowerCase
}
object ListType {
case object Whitelist extends ListType
val whitelist: ListType = Whitelist
case object Blacklist extends ListType
val blacklist: ListType = Blacklist
val all: NonEmptyList[ListType] = NonEmptyList.of(Whitelist, Blacklist)
def fromString(name: String): Either[String, ListType] =
all.find(_.name.equalsIgnoreCase(name)).toRight(s"Unknown list type: $name")
def unsafeFromString(name: String): ListType =
fromString(name).fold(sys.error, identity)
implicit val jsonEncoder: Encoder[ListType] =
Encoder.encodeString.contramap(_.name)
implicit val jsonDecoder: Decoder[ListType] =
Decoder.decodeString.emap(fromString)
}
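Note: string and JSON forms both use the lowercase name, and parsing is case-insensitive; a quick sketch:

import io.circe.syntax._

assert(ListType.fromString("whitelist") == Right(ListType.Whitelist))
assert(ListType.unsafeFromString("BLACKLIST") == ListType.Blacklist)
assert(ListType.whitelist.asJson.noSpaces == "\"whitelist\"")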

@@ -87,7 +87,7 @@ object MetaProposal {
     }
   }
 
-  /** Merges candidates with same `IdRef' values and concatenates their
+  /** Merges candidates with same `IdRef` values and concatenates their
     * respective labels. The candidate order is preserved.
     */
   def flatten(s: NonEmptyList[Candidate]): NonEmptyList[Candidate] = {

@@ -45,6 +45,19 @@ case class MetaProposalList private (proposals: List[MetaProposal]) {
   def sortByWeights: MetaProposalList =
     change(_.sortByWeight)
 
+  def insertSecond(ml: MetaProposalList): MetaProposalList =
+    MetaProposalList.flatten0(
+      Seq(this, ml),
+      (map, next) =>
+        map.get(next.proposalType) match {
+          case Some(MetaProposal(mt, values)) =>
+            val cand = NonEmptyList(values.head, next.values.toList ++ values.tail)
+            map.updated(next.proposalType, MetaProposal(mt, MetaProposal.flatten(cand)))
+          case None =>
+            map.updated(next.proposalType, next)
+        }
+    )
 }
 
 object MetaProposalList {
@@ -74,20 +87,25 @@ object MetaProposalList {
    * is preserved and candidates of proposals are appended as given
    * by the order of the given `seq'.
    */
-  def flatten(ml: Seq[MetaProposalList]): MetaProposalList = {
-    val init: Map[MetaProposalType, MetaProposal] = Map.empty
-
-    def updateMap(
-        map: Map[MetaProposalType, MetaProposal],
-        mp: MetaProposal
-    ): Map[MetaProposalType, MetaProposal] =
-      map.get(mp.proposalType) match {
-        case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
-        case None      => map.updated(mp.proposalType, mp)
-      }
-
-    val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(updateMap))
+  def flatten(ml: Seq[MetaProposalList]): MetaProposalList =
+    flatten0(
+      ml,
+      (map, mp) =>
+        map.get(mp.proposalType) match {
+          case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
+          case None      => map.updated(mp.proposalType, mp)
+        }
+    )
+
+  private def flatten0(
+      ml: Seq[MetaProposalList],
+      merge: (
+          Map[MetaProposalType, MetaProposal],
+          MetaProposal
+      ) => Map[MetaProposalType, MetaProposal]
+  ): MetaProposalList = {
+    val init   = Map.empty[MetaProposalType, MetaProposal]
+    val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(merge))
     fromMap(merged)
   }

@@ -0,0 +1,25 @@
package docspell.common
sealed trait NlpMode { self: Product =>
def name: String =
self.productPrefix
}
object NlpMode {
case object Full extends NlpMode
case object Basic extends NlpMode
case object RegexOnly extends NlpMode
case object Disabled extends NlpMode
def fromString(name: String): Either[String, NlpMode] =
name.toLowerCase match {
case "full" => Right(Full)
case "basic" => Right(Basic)
case "regexonly" => Right(RegexOnly)
case "disabled" => Right(Disabled)
case _ => Left(s"Unknown nlp-mode: $name")
}
def unsafeFromString(name: String): NlpMode =
fromString(name).fold(sys.error, identity)
}
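Note: parsing lowercases its input first, so mixed case is accepted; a quick sketch:

assert(NlpMode.fromString("Full") == Right(NlpMode.Full))
assert(NlpMode.fromString("RegexOnly") == Right(NlpMode.RegexOnly))
assert(NlpMode.fromString("bogus") == Left("Unknown nlp-mode: bogus"))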

@@ -44,6 +44,9 @@ object Implicits {
   implicit val priorityReader: ConfigReader[Priority] =
     ConfigReader[String].emap(reason(Priority.fromString))
 
+  implicit val nlpModeReader: ConfigReader[NlpMode] =
+    ConfigReader[String].emap(reason(NlpMode.fromString))
+
   def reason[A: ClassTag](
       f: String => Either[String, A]
   ): String => Either[FailureReason, A] =

@@ -0,0 +1,20 @@
package docspell.common.syntax
import java.nio.file.Path
trait FileSyntax {
implicit final class PathOps(p: Path) {
def absolutePath: Path =
p.normalize().toAbsolutePath
def absolutePathAsString: String =
absolutePath.toString
def /(next: String): Path =
p.resolve(next)
}
}
object FileSyntax extends FileSyntax
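Note: a small sketch of the new path syntax; `/` resolves a child path and `absolutePathAsString` normalizes before rendering:

import java.nio.file.Paths
import docspell.common.syntax.FileSyntax._

// Chained resolution plus string rendering of the absolute path.
val model    = Paths.get("target") / "classifier" / "model.ser.gz"
val asString = model.absolutePathAsString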

@@ -2,6 +2,11 @@ package docspell.common
 
 package object syntax {
 
-  object all extends EitherSyntax with StreamSyntax with StringSyntax with LoggerSyntax
+  object all
+      extends EitherSyntax
+      with StreamSyntax
+      with StringSyntax
+      with LoggerSyntax
+      with FileSyntax
 }

@@ -68,4 +68,35 @@ object MetaProposalListTest extends SimpleTestSuite {
     assertEquals(candidates.head, cand1)
     assertEquals(candidates.tail.head, cand2)
   }
test("insert second") {
val cand1 = Candidate(IdRef(Ident.unsafe("123"), "name"), Set.empty)
val cand2 = Candidate(IdRef(Ident.unsafe("456"), "name"), Set.empty)
val cand3 = Candidate(IdRef(Ident.unsafe("789"), "name"), Set.empty)
val cand4 = Candidate(IdRef(Ident.unsafe("abc"), "name"), Set.empty)
val cand5 = Candidate(IdRef(Ident.unsafe("def"), "name"), Set.empty)
val mpl1 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3))
)
val mpl2 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand4)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand5))
)
val result = mpl1.insertSecond(mpl2)
assertEquals(
result,
MetaProposalList(
List(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand4, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3, cand5))
)
)
)
}
}

@@ -0,0 +1,13 @@
Pontremoli, 9 aprile 2013
Spettabile Villa Albicocca
Via Francigena, 9
55100 Pontetetto (LU)
Oggetto: Prenotazione
Gentile Direttore,
Vorrei prenotare una camera matrimoniale …….
In attesa di una Sua pronta risposta, La saluto cordialmente

View File

@ -1,5 +1,8 @@
package docspell.ftsclient package docspell.ftsclient
import cats.Functor
import cats.implicits._
import docspell.common._ import docspell.common._
final case class FtsMigration[F[_]]( final case class FtsMigration[F[_]](
@ -7,7 +10,13 @@ final case class FtsMigration[F[_]](
engine: Ident, engine: Ident,
description: String, description: String,
task: F[FtsMigration.Result] task: F[FtsMigration.Result]
) ) {
def changeResult(f: FtsMigration.Result => FtsMigration.Result)(implicit
F: Functor[F]
): FtsMigration[F] =
copy(task = task.map(f))
}
object FtsMigration { object FtsMigration {
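The new changeResult combinator is what reInitializeTasks (further below) uses to override each migration's outcome. A hedged sketch with cats-effect IO; the migration value itself is hypothetical:
import cats.effect.IO
import docspell.common.Ident
import docspell.ftsclient.FtsMigration
object ChangeResultDemo {
  val m: FtsMigration[IO] =
    FtsMigration[IO](
      7,
      Ident.unsafe("solr"),
      "Add content_it field",
      IO.pure(FtsMigration.Result.reIndexAll)
    )
  // force the outcome to workDone; the underlying effect is unchanged
  val silenced: FtsMigration[IO] = m.changeResult(_ => FtsMigration.Result.workDone)
}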

View File

@ -21,22 +21,19 @@ object Field {
val discriminator = Field("discriminator") val discriminator = Field("discriminator")
val attachmentName = Field("attachmentName") val attachmentName = Field("attachmentName")
val content = Field("content") val content = Field("content")
val content_de = Field("content_de") val content_de = contentField(Language.German)
val content_en = Field("content_en") val content_en = contentField(Language.English)
val content_fr = Field("content_fr") val content_fr = contentField(Language.French)
val itemName = Field("itemName") val itemName = Field("itemName")
val itemNotes = Field("itemNotes") val itemNotes = Field("itemNotes")
val folderId = Field("folder") val folderId = Field("folder")
val contentLangFields = Language.all
.map(contentField)
def contentField(lang: Language): Field = def contentField(lang: Language): Field =
lang match { if (lang == Language.Czech) Field(s"content_cz")
case Language.German => else Field(s"content_${lang.iso2}")
Field.content_de
case Language.English =>
Field.content_en
case Language.French =>
Field.content_fr
}
implicit val jsonEncoder: Encoder[Field] = implicit val jsonEncoder: Encoder[Field] =
Encoder.encodeString.contramap(_.name) Encoder.encodeString.contramap(_.name)
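The mapping now follows the language's iso2 code, with Czech special-cased to match Solr's text_cz type. A quick sketch of the expected values (assuming Field is a case class in the docspell.ftssolr package):
import docspell.common.Language
import docspell.ftssolr.Field
object FieldNamesDemo extends App {
  assert(Field.contentField(Language.Italian) == Field("content_it"))
  assert(Field.contentField(Language.German) == Field("content_de"))
  // the single exception:
  assert(Field.contentField(Language.Czech) == Field("content_cz"))
}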

View File

@ -37,13 +37,10 @@ object SolrQuery {
cfg, cfg,
List( List(
Field.content, Field.content,
Field.content_de,
Field.content_en,
Field.content_fr,
Field.itemName, Field.itemName,
Field.itemNotes, Field.itemNotes,
Field.attachmentName Field.attachmentName
), ) ++ Field.contentLangFields,
List( List(
Field.id, Field.id,
Field.itemId, Field.itemId,

View File

@ -56,21 +56,51 @@ object SolrSetup {
5, 5,
solrEngine, solrEngine,
"Add content_fr field", "Add content_fr field",
addContentFrField.map(_ => FtsMigration.Result.workDone) addContentField(Language.French).map(_ => FtsMigration.Result.workDone)
), ),
FtsMigration[F]( FtsMigration[F](
6, 6,
solrEngine, solrEngine,
"Index all from database", "Index all from database",
FtsMigration.Result.indexAll.pure[F] FtsMigration.Result.indexAll.pure[F]
),
FtsMigration[F](
7,
solrEngine,
"Add content_it field",
addContentField(Language.Italian).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
8,
solrEngine,
"Add content_es field",
addContentField(Language.Spanish).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
9,
solrEngine,
"Add more content fields",
addMoreContentFields.map(_ => FtsMigration.Result.reIndexAll)
) )
) )
def addFolderField: F[Unit] = def addFolderField: F[Unit] =
addStringField(Field.folderId) addStringField(Field.folderId)
def addContentFrField: F[Unit] = def addMoreContentFields: F[Unit] = {
addTextField(Some(Language.French))(Field.content_fr) val remain = List[Language](
Language.Norwegian,
Language.Romanian,
Language.Swedish,
Language.Finnish,
Language.Danish,
Language.Czech,
Language.Dutch,
Language.Portuguese,
Language.Russian
)
remain.traverse(addContentField).map(_ => ())
}
def setupCoreSchema: F[Unit] = { def setupCoreSchema: F[Unit] = {
val cmds0 = val cmds0 =
@ -90,13 +120,15 @@ object SolrSetup {
) )
.traverse(addTextField(None)) .traverse(addTextField(None))
val cntLang = Language.all.traverse { val cntLang = List(Language.German, Language.English, Language.French).traverse {
case l @ Language.German => case l @ Language.German =>
addTextField(l.some)(Field.content_de) addTextField(l.some)(Field.content_de)
case l @ Language.English => case l @ Language.English =>
addTextField(l.some)(Field.content_en) addTextField(l.some)(Field.content_en)
case l @ Language.French => case l @ Language.French =>
addTextField(l.some)(Field.content_fr) addTextField(l.some)(Field.content_fr)
case _ =>
().pure[F]
} }
cmds0 *> cmds1 *> cntLang *> ().pure[F] cmds0 *> cmds1 *> cntLang *> ().pure[F]
@ -111,20 +143,17 @@ object SolrSetup {
run(DeleteField.command(DeleteField(field))).attempt *> run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.string(field))) run(AddField.command(AddField.string(field)))
private def addContentField(lang: Language): F[Unit] =
addTextField(Some(lang))(Field.contentField(lang))
private def addTextField(lang: Option[Language])(field: Field): F[Unit] = private def addTextField(lang: Option[Language])(field: Field): F[Unit] =
lang match { lang match {
case None => case None =>
run(DeleteField.command(DeleteField(field))).attempt *> run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.text(field))) run(AddField.command(AddField.textGeneral(field)))
case Some(Language.German) => case Some(lang) =>
run(DeleteField.command(DeleteField(field))).attempt *> run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textDE(field))) run(AddField.command(AddField.textLang(field, lang)))
case Some(Language.English) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textEN(field)))
case Some(Language.French) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textFR(field)))
} }
} }
} }
@ -150,17 +179,12 @@ object SolrSetup {
def string(field: Field): AddField = def string(field: Field): AddField =
AddField(field, "string", true, true, false) AddField(field, "string", true, true, false)
def text(field: Field): AddField = def textGeneral(field: Field): AddField =
AddField(field, "text_general", true, true, false) AddField(field, "text_general", true, true, false)
def textDE(field: Field): AddField = def textLang(field: Field, lang: Language): AddField =
AddField(field, "text_de", true, true, false) if (lang == Language.Czech) AddField(field, s"text_cz", true, true, false)
else AddField(field, s"text_${lang.iso2}", true, true, false)
def textEN(field: Field): AddField =
AddField(field, "text_en", true, true, false)
def textFR(field: Field): AddField =
AddField(field, "text_fr", true, true, false)
} }
case class DeleteField(name: Field) case class DeleteField(name: Field)
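Analogous to contentField above, textLang derives the Solr field type from the iso2 code. An illustrative sketch (AddField is nested inside SolrSetup, so this is shown as comments only):
// AddField.textLang(Field("content_sv"), Language.Swedish)
//   == AddField(Field("content_sv"), "text_sv", true, true, false)
// AddField.textLang(Field("content_cz"), Language.Czech)
//   == AddField(Field("content_cz"), "text_cz", true, true, false)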

View File

@ -269,62 +269,101 @@ docspell.joex {
# All text to analyse must fit into RAM. A large document may take # All text to analyse must fit into RAM. A large document may take
# too much heap. Also, most important information is at the # too much heap. Also, most important information is at the
# beginning of a document, so in most cases the first two pages # beginning of a document, so in most cases the first two pages
# should suffice. Default is 10000, which is about 2-3 pages # should suffice. Default is 8000, which is about 2-3 pages (just
# (just a rough guess, of course). # a rough guess, of course).
max-length = 10000 max-length = 8000
# A working directory for the analyser to store temporary/working # A working directory for the analyser to store temporary/working
# files. # files.
working-dir = ${java.io.tmpdir}"/docspell-analysis" working-dir = ${java.io.tmpdir}"/docspell-analysis"
nlp {
# The mode for configuring NLP models:
#
# 1. full builds the complete pipeline
# 2. basic - builds only the ner annotator
# 3. regexonly - matches each entry in your address book via regexps
# 4. disabled - doesn't use any stanford-nlp feature
#
# The full and basic variants rely on pre-built language models
# that are available for only a few languages. Memory usage
# varies among the languages. So joex should run with at least
# -Xmx1400M when using mode=full.
#
# The basic variant does quite a good job for German and
# English. It might be worse for French, always depending on the
# type of text that is analysed. Joex should run with about 500M
# heap; here again, German uses the most.
#
# The regexonly variant doesn't depend on a language. It roughly
# works by converting all entries in your address book into
# regexps and matches each one against the text. This can get
# memory intensive, too, when the address book grows large. This
# is included in the full and basic modes by default, but can be
# used independently by setting mode=regexonly.
#
# When mode=disabled, the whole nlp pipeline is disabled and you
# won't get any suggestions, except those the classifier returns
# (if enabled).
mode = full
# The StanfordCoreNLP library caches language models which # The StanfordCoreNLP library caches language models which
# requires quite some amount of memory. Setting this interval to a # requires quite some amount of memory. Setting this interval to a
# positive duration, the cache is cleared after this amount of # positive duration, the cache is cleared after this amount of
# idle time. Set it to 0 to disable it if you have enough memory, # idle time. Set it to 0 to disable it if you have enough memory,
# processing will be faster. # processing will be faster.
clear-stanford-nlp-interval = "15 minutes" #
# This only has an effect if mode != disabled.
clear-interval = "15 minutes"
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
regex-ner { regex-ner {
# Whether to enable custom NER annotation. This uses the address # Whether to enable custom NER annotation. This uses the
# book of a collective as input for NER tagging (to automatically # address book of a collective as input for NER tagging (to
# find correspondent and concerned entities). If the address book # automatically find correspondent and concerned entities). If
# is large, this can be quite memory intensive and also makes text # the address book is large, this can be quite memory
# analysis slower. But it greatly improves accuracy. If this is # intensive and also makes text analysis much slower. But it
# false, NER tagging uses only statistical models (that also work # improves accuracy and can be used independently of the
# quite well). # language. If this is set to 0, it is effectively disabled
# and NER tagging uses only statistical models (that also work
# quite well, but are restricted to the languages mentioned
# above).
# #
# This setting might be moved to the collective settings in the # Note, this is only relevant if nlp.mode is not
# future. # "disabled".
enabled = true max-entries = 1000
# The NER annotation uses a file of patterns that is derived from # The NER annotation uses a file of patterns that is derived
# a collective's address book. This is how long this # from a collective's address book. This is how long
# file will be kept until a check for a state change is done. # this data will be kept until a check for a state change
# is done.
file-cache-time = "1 minute" file-cache-time = "1 minute"
} }
}
# Settings for doing document classification. # Settings for doing document classification.
# #
# This works by learning from existing documents. A collective can # This works by learning from existing documents. This requires a
# specify a tag category and the system will try to predict a tag # statistical model that is computed from all existing documents.
# from this category for new incoming documents. # This process is run periodically as configured by the
# # collective. It may require more memory, depending on the amount
# This requires a statistical model that is computed from all # of data.
# existing documents. This process is run periodically as
# configured by the collective. It may require a lot of memory,
# depending on the amount of data.
# #
# It utilises this NLP library: https://nlp.stanford.edu/. # It utilises this NLP library: https://nlp.stanford.edu/.
classification { classification {
# Whether to enable classification globally. Each collective can # Whether to enable classification globally. Each collective can
# decide to disable it. If it is disabled here, no collective # enable/disable auto-tagging. The classifier is also used for
# can use classification. # finding correspondents and concerned entities, if enabled
# here.
enabled = true enabled = true
# If concerned with memory consumption, this restricts the # If concerned with memory consumption, this restricts the
# number of items to consider. More are better for training. A # number of items to consider. More are better for training. A
# negative value or zero means no train on all items. # negative value or zero means to train on all items.
item-count = 0 item-count = 600
# These settings are used to configure the classifier. If # These settings are used to configure the classifier. If
# multiple are given, they are all tried and the "best" is # multiple are given, they are all tried and the "best" is
@ -477,13 +516,6 @@ docspell.joex {
} }
} }
# General config for processing documents
processing {
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
}
# The same section is also present in the rest-server config. It is # The same section is also present in the rest-server config. It is
# used when submitting files into the job queue for processing. # used when submitting files into the job queue for processing.
# #

View File

@ -5,7 +5,7 @@ import java.nio.file.Path
import cats.data.NonEmptyList import cats.data.NonEmptyList
import docspell.analysis.TextAnalysisConfig import docspell.analysis.TextAnalysisConfig
import docspell.analysis.nlp.TextClassifierConfig import docspell.analysis.classifier.TextClassifierConfig
import docspell.backend.Config.Files import docspell.backend.Config.Files
import docspell.common._ import docspell.common._
import docspell.convert.ConvertConfig import docspell.convert.ConvertConfig
@ -31,8 +31,7 @@ case class Config(
sendMail: MailSendConfig, sendMail: MailSendConfig,
files: Files, files: Files,
mailDebug: Boolean, mailDebug: Boolean,
fullTextSearch: Config.FullTextSearch, fullTextSearch: Config.FullTextSearch
processing: Config.Processing
) )
object Config { object Config {
@ -55,20 +54,17 @@ object Config {
final case class Migration(indexAllChunk: Int) final case class Migration(indexAllChunk: Int)
} }
case class Processing(maxDueDateYears: Int)
case class TextAnalysis( case class TextAnalysis(
maxLength: Int, maxLength: Int,
workingDir: Path, workingDir: Path,
clearStanfordNlpInterval: Duration, nlp: NlpConfig,
regexNer: RegexNer,
classification: Classification classification: Classification
) { ) {
def textAnalysisConfig: TextAnalysisConfig = def textAnalysisConfig: TextAnalysisConfig =
TextAnalysisConfig( TextAnalysisConfig(
maxLength, maxLength,
clearStanfordNlpInterval, TextAnalysisConfig.NlpConfig(nlp.clearInterval, nlp.mode),
TextClassifierConfig( TextClassifierConfig(
workingDir, workingDir,
NonEmptyList NonEmptyList
@ -78,14 +74,30 @@ object Config {
) )
def regexNerFileConfig: RegexNerFile.Config = def regexNerFileConfig: RegexNerFile.Config =
RegexNerFile.Config(regexNer.enabled, workingDir, regexNer.fileCacheTime) RegexNerFile.Config(
nlp.regexNer.maxEntries,
workingDir,
nlp.regexNer.fileCacheTime
)
} }
case class RegexNer(enabled: Boolean, fileCacheTime: Duration) case class NlpConfig(
mode: NlpMode,
clearInterval: Duration,
maxDueDateYears: Int,
regexNer: RegexNer
)
case class RegexNer(maxEntries: Int, fileCacheTime: Duration)
case class Classification( case class Classification(
enabled: Boolean, enabled: Boolean,
itemCount: Int, itemCount: Int,
classifiers: List[Map[String, String]] classifiers: List[Map[String, String]]
) ) {
def itemCountOrWhenLower(other: Int): Int =
if (itemCount <= 0 || (itemCount > other && other > 0)) other
else itemCount
}
} }
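The new itemCountOrWhenLower helper picks the effective training limit: non-positive values mean "no limit", otherwise the lower positive bound wins. A sketch of the behavior (field values are made up):
import docspell.joex.Config
object ItemCountDemo extends App {
  val c = Config.Classification(enabled = true, itemCount = 600, classifiers = Nil)
  assert(c.itemCountOrWhenLower(0) == 600)   // other <= 0: keep own count
  assert(c.itemCountOrWhenLower(200) == 200) // lower positive bound wins
  assert(c.copy(itemCount = 0).itemCountOrWhenLower(200) == 200) // 0 defers to other
}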

View File

@ -97,7 +97,7 @@ object JoexAppImpl {
upload <- OUpload(store, queue, cfg.files, joex) upload <- OUpload(store, queue, cfg.files, joex)
fts <- createFtsClient(cfg)(httpClient) fts <- createFtsClient(cfg)(httpClient)
itemOps <- OItem(store, fts, queue, joex) itemOps <- OItem(store, fts, queue, joex)
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig) analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig, blocker)
regexNer <- RegexNerFile(cfg.textAnalysis.regexNerFileConfig, blocker, store) regexNer <- RegexNerFile(cfg.textAnalysis.regexNerFileConfig, blocker, store)
javaEmil = javaEmil =
JavaMailEmil(blocker, Settings.defaultSettings.copy(debug = cfg.mailDebug)) JavaMailEmil(blocker, Settings.defaultSettings.copy(debug = cfg.mailDebug))
@ -169,7 +169,7 @@ object JoexAppImpl {
.withTask( .withTask(
JobTask.json( JobTask.json(
LearnClassifierArgs.taskName, LearnClassifierArgs.taskName,
LearnClassifierTask[F](cfg.textAnalysis, blocker, analyser), LearnClassifierTask[F](cfg.textAnalysis, analyser),
LearnClassifierTask.onCancel[F] LearnClassifierTask.onCancel[F]
) )
) )

View File

@ -29,7 +29,7 @@ trait RegexNerFile[F[_]] {
object RegexNerFile { object RegexNerFile {
private[this] val logger = getLogger private[this] val logger = getLogger
case class Config(enabled: Boolean, directory: Path, minTime: Duration) case class Config(maxEntries: Int, directory: Path, minTime: Duration)
def apply[F[_]: Concurrent: ContextShift]( def apply[F[_]: Concurrent: ContextShift](
cfg: Config, cfg: Config,
@ -49,7 +49,7 @@ object RegexNerFile {
) extends RegexNerFile[F] { ) extends RegexNerFile[F] {
def makeFile(collective: Ident): F[Option[Path]] = def makeFile(collective: Ident): F[Option[Path]] =
if (cfg.enabled) doMakeFile(collective) if (cfg.maxEntries > 0) doMakeFile(collective)
else (None: Option[Path]).pure[F] else (None: Option[Path]).pure[F]
def doMakeFile(collective: Ident): F[Option[Path]] = def doMakeFile(collective: Ident): F[Option[Path]] =
@ -127,7 +127,7 @@ object RegexNerFile {
for { for {
_ <- logger.finfo(s"Generating custom NER file for collective '${collective.id}'") _ <- logger.finfo(s"Generating custom NER file for collective '${collective.id}'")
names <- store.transact(QCollective.allNames(collective)) names <- store.transact(QCollective.allNames(collective, cfg.maxEntries))
nerFile = NerFile(collective, lastUpdate, now) nerFile = NerFile(collective, lastUpdate, now)
_ <- update(nerFile, NerFile.mkNerConfig(names)) _ <- update(nerFile, NerFile.mkNerConfig(names))
} yield nerFile } yield nerFile

View File

@ -14,16 +14,26 @@ object FtsWork {
def apply[F[_]](f: FtsContext[F] => F[Unit]): FtsWork[F] = def apply[F[_]](f: FtsContext[F] => F[Unit]): FtsWork[F] =
Kleisli(f) Kleisli(f)
def allInitializeTasks[F[_]: Monad]: FtsWork[F] = /** Runs all migration tasks unconditionally and inserts all data as last step. */
FtsWork[F](_ => ().pure[F]).tap[FtsContext[F]].flatMap { ctx => def reInitializeTasks[F[_]: Monad]: FtsWork[F] =
NonEmptyList.fromList(ctx.fts.initialize.map(fm => from[F](fm.task))) match { FtsWork { ctx =>
val migrations =
ctx.fts.initialize.map(fm => fm.changeResult(_ => FtsMigration.Result.workDone))
NonEmptyList.fromList(migrations) match {
case Some(nel) => case Some(nel) =>
nel.reduce(semigroup[F]) nel
.map(fm => from[F](fm.task))
.append(insertAll[F](None))
.reduce(semigroup[F])
.run(ctx)
case None => case None =>
FtsWork[F](_ => ().pure[F]) ().pure[F]
} }
} }
/** Creates an `FtsWork` from a single migration task, interpreting
* the task's result via `transformResult`.
*/
def from[F[_]: FlatMap: Applicative](t: F[FtsMigration.Result]): FtsWork[F] = def from[F[_]: FlatMap: Applicative](t: F[FtsMigration.Result]): FtsWork[F] =
Kleisli.liftF(t).flatMap(transformResult[F]) Kleisli.liftF(t).flatMap(transformResult[F])

View File

@ -11,6 +11,11 @@ import docspell.joex.Config
import docspell.store.records.RFtsMigration import docspell.store.records.RFtsMigration
import docspell.store.{AddResult, Store} import docspell.store.{AddResult, Store}
/** Migrating the index from the previous version to this version.
*
* The SQL database stores the outcome of a migration task. If this
* task has already been applied, it is skipped.
*/
case class Migration[F[_]]( case class Migration[F[_]](
version: Int, version: Int,
engine: Ident, engine: Ident,

View File

@ -46,6 +46,6 @@ object ReIndexTask {
FtsWork.log[F](_.info("Clearing data failed. Continue re-indexing.")) FtsWork.log[F](_.info("Clearing data failed. Continue re-indexing."))
) ++ ) ++
FtsWork.log[F](_.info("Running index initialize")) ++ FtsWork.log[F](_.info("Running index initialize")) ++
FtsWork.allInitializeTasks[F] FtsWork.reInitializeTasks[F]
}) })
} }

View File

@ -4,6 +4,9 @@ import cats.data.Kleisli
package object fts { package object fts {
/** Some work that must be done to advance the schema of the fulltext
* index.
*/
type FtsWork[F[_]] = Kleisli[F, FtsContext[F], Unit] type FtsWork[F[_]] = Kleisli[F, FtsContext[F], Unit]
} }

View File

@ -0,0 +1,66 @@
package docspell.joex.learn
import cats.data.NonEmptyList
import cats.implicits._
import docspell.common.Ident
import docspell.store.records.{RClassifierModel, RClassifierSetting}
import doobie._
final class ClassifierName(val name: String) extends AnyVal
object ClassifierName {
def apply(name: String): ClassifierName =
new ClassifierName(name)
private val categoryPrefix = "tagcategory-"
def tagCategory(cat: String): ClassifierName =
apply(s"${categoryPrefix}${cat}")
val concernedPerson: ClassifierName =
apply("concernedperson")
val concernedEquip: ClassifierName =
apply("concernedequip")
val correspondentOrg: ClassifierName =
apply("correspondentorg")
val correspondentPerson: ClassifierName =
apply("correspondentperson")
def findTagClassifiers[F[_]](coll: Ident): ConnectionIO[List[ClassifierName]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
} yield categories.map(tagCategory)
def findTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
models <- NonEmptyList.fromList(categories) match {
case Some(nel) =>
RClassifierModel.findAllByName(coll, nel.map(tagCategory).map(_.name))
case None =>
List.empty[RClassifierModel].pure[ConnectionIO]
}
} yield models
def findOrphanTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
cats <- RClassifierSetting.getActiveCategories(coll)
allModels = RClassifierModel.findAllByQuery(coll, s"${categoryPrefix}%")
result <- NonEmptyList.fromList(cats) match {
case Some(nel) =>
allModels.flatMap(all =>
RClassifierModel
.findAllByName(coll, nel.map(tagCategory).map(_.name))
.map(active => all.diff(active))
)
case None =>
allModels
}
} yield result
}
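Tag models are keyed by a prefixed name per category, which is also what findOrphanTagModels matches via its LIKE pattern; the entity models use fixed names. A tiny sketch:
import docspell.joex.learn.ClassifierName
object ClassifierNameDemo extends App {
  assert(ClassifierName.tagCategory("invoice").name == "tagcategory-invoice")
  assert(ClassifierName.correspondentOrg.name == "correspondentorg")
}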

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import java.nio.file.Path
import cats.data.OptionT
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.{ClassifierModel, TextClassifier}
import docspell.common._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.RangeDef
object Classify {
def apply[F[_]: Sync: ContextShift](
blocker: Blocker,
logger: Logger[F],
workingDir: Path,
store: Store[F],
classifier: TextClassifier[F],
coll: Ident,
text: String
)(cname: ClassifierName): F[Option[String]] =
(for {
_ <- OptionT.liftF(logger.info(s"Guessing label for ${cname.name}"))
model <- OptionT(store.transact(RClassifierModel.findByName(coll, cname.name)))
.flatTapNone(logger.debug("No classifier model found."))
modelData =
store.bitpeace
.get(model.fileId.id)
.unNoneTerminate
.through(store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, blocker))
.compile
.drain
.flatMap(_ => classifier.classify(logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
.flatTapNone(logger.debug("Guessed: <none>"))
_ <- OptionT.liftF(logger.debug(s"Guessed: ${cls}"))
} yield cls).value
}

View File

@ -1,26 +1,19 @@
package docspell.joex.learn package docspell.joex.learn
import cats.data.Kleisli
import cats.data.OptionT import cats.data.OptionT
import cats.effect._ import cats.effect._
import cats.implicits._ import cats.implicits._
import fs2.{Pipe, Stream}
import docspell.analysis.TextAnalyser import docspell.analysis.TextAnalyser
import docspell.analysis.nlp.ClassifierModel
import docspell.analysis.nlp.TextClassifier.Data
import docspell.backend.ops.OCollective import docspell.backend.ops.OCollective
import docspell.common._ import docspell.common._
import docspell.joex.Config import docspell.joex.Config
import docspell.joex.scheduler._ import docspell.joex.scheduler._
import docspell.store.queries.QItem import docspell.store.records.{RClassifierModel, RClassifierSetting}
import docspell.store.records.RClassifierSetting
import bitpeace.MimetypeHint
object LearnClassifierTask { object LearnClassifierTask {
val noClass = "__NONE__"
val pageSep = " --n-- " val pageSep = " --n-- "
val noClass = "__NONE__"
type Args = LearnClassifierArgs type Args = LearnClassifierArgs
@ -29,68 +22,72 @@ object LearnClassifierTask {
def apply[F[_]: Sync: ContextShift]( def apply[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis, cfg: Config.TextAnalysis,
blocker: Blocker, analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
learnTags(cfg, analyser)
.flatMap(_ => learnItemEntities(cfg, analyser))
.flatMap(_ => Task(_ => Sync[F].delay(System.gc())))
private def learnItemEntities[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F] analyser: TextAnalyser[F]
): Task[F, Args, Unit] = ): Task[F, Args, Unit] =
Task { ctx => Task { ctx =>
(for { if (cfg.classification.enabled)
LearnItemEntities
.learnAll(
analyser,
ctx.args.collective,
cfg.classification.itemCount,
cfg.maxLength
)
.run(ctx)
else ().pure[F]
}
private def learnTags[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
Task { ctx =>
val learnTags =
for {
sett <- findActiveSettings[F](ctx, cfg) sett <- findActiveSettings[F](ctx, cfg)
data = selectItems( maxItems = cfg.classification.itemCountOrWhenLower(sett.itemCount)
ctx,
math.min(cfg.classification.itemCount, sett.itemCount).toLong,
sett.category.getOrElse("")
)
_ <- OptionT.liftF( _ <- OptionT.liftF(
analyser LearnTags
.classifier(blocker) .learnAllTagCategories(analyser)(
.trainClassifier[Unit](ctx.logger, data)(Kleisli(handleModel(ctx, blocker))) ctx.args.collective,
maxItems,
cfg.maxLength
) )
} yield ()) .run(ctx)
.getOrElseF(logInactiveWarning(ctx.logger))
}
private def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
blocker: Blocker
)(trainedModel: ClassifierModel): F[Unit] =
for {
oldFile <- ctx.store.transact(
RClassifierSetting.findById(ctx.args.collective).map(_.flatMap(_.fileId))
) )
_ <- ctx.logger.info("Storing new trained model")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
ctx.store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- ctx.store.transact(
RClassifierSetting.updateFile(ctx.args.collective, Ident.unsafe(newFile.id))
)
_ <- ctx.logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
ctx.logger.debug(s"Deleting old model file ${fid.id}") *>
ctx.store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
} yield () } yield ()
// learn classifier models from active tag categories
private def selectItems[F[_]]( learnTags.getOrElseF(logInactiveWarning(ctx.logger)) *>
ctx: Context[F, Args], // delete classifier model files for categories that have been removed
max: Long, clearObsoleteTagModels(ctx) *>
category: String // when tags are deleted, categories may get removed. fix the json array
): Stream[F, Data] = { ctx.store
val connStream = .transact(RClassifierSetting.fixCategoryList(ctx.args.collective))
for { .map(_ => ())
item <- QItem.findAllNewesFirst(ctx.args.collective, 10).through(restrictTo(max))
tt <- Stream.eval(
QItem.resolveTextAndTag(ctx.args.collective, item, category, pageSep)
)
} yield Data(tt.tag.map(_.name).getOrElse(noClass), item.id, tt.text.trim)
ctx.store.transact(connStream.filter(_.text.nonEmpty))
} }
private def restrictTo[F[_], A](max: Long): Pipe[F, A, A] = private def clearObsoleteTagModels[F[_]: Sync](ctx: Context[F, Args]): F[Unit] =
if (max <= 0) identity for {
else _.take(max) list <- ctx.store.transact(
ClassifierName.findOrphanTagModels(ctx.args.collective)
)
_ <- ctx.logger.info(
s"Found ${list.size} obsolete model files that are deleted now."
)
n <- ctx.store.transact(RClassifierModel.deleteAll(list.map(_.id)))
_ <- list
.map(_.fileId.id)
.traverse(id => ctx.store.bitpeace.delete(id).compile.drain)
_ <- ctx.logger.debug(s"Deleted $n model files.")
} yield ()
private def findActiveSettings[F[_]: Sync]( private def findActiveSettings[F[_]: Sync](
ctx: Context[F, Args], ctx: Context[F, Args],
@ -98,14 +95,13 @@ object LearnClassifierTask {
): OptionT[F, OCollective.Classifier] = ): OptionT[F, OCollective.Classifier] =
if (cfg.classification.enabled) if (cfg.classification.enabled)
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.collective))) OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.collective)))
.filter(_.enabled) .filter(_.autoTagEnabled)
.filter(_.category.nonEmpty)
.map(OCollective.Classifier.fromRecord) .map(OCollective.Classifier.fromRecord)
else else
OptionT.none OptionT.none
private def logInactiveWarning[F[_]: Sync](logger: Logger[F]): F[Unit] = private def logInactiveWarning[F[_]: Sync](logger: Logger[F]): F[Unit] =
logger.warn( logger.warn(
"Classification is disabled. Check joex config and the collective settings." "Auto-tagging is disabled. Check joex config and the collective settings."
) )
} }

View File

@ -0,0 +1,79 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import fs2.Stream
import docspell.analysis.TextAnalyser
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler._
object LearnItemEntities {
def learnAll[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learnCorrOrg(analyser, collective, maxItems, maxTextLen)
.flatMap(_ => learnCorrPerson[F, A](analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcPerson(analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcEquip(analyser, collective, maxItems, maxTextLen))
def learnCorrOrg[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentOrg,
ctx => SelectItems.forCorrOrg(ctx.store, collective, maxItems, maxTextLen)
)
def learnCorrPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentPerson,
ctx => SelectItems.forCorrPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedPerson,
ctx => SelectItems.forConcPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcEquip[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedEquip,
ctx => SelectItems.forConcEquip(ctx.store, collective, maxItems, maxTextLen)
)
private def learn[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident
)(cname: ClassifierName, data: Context[F, _] => Stream[F, Data]): Task[F, A, Unit] =
Task { ctx =>
ctx.logger.info(s"Learn classifier ${cname.name}") *>
analyser.classifier.trainClassifier(ctx.logger, data(ctx))(
Kleisli(StoreClassifierModel.handleModel(ctx, collective, cname))
)
}
}

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import docspell.analysis.TextAnalyser
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.records.RClassifierSetting
object LearnTags {
def learnTagCategory[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
)(
category: String
): Task[F, A, Unit] =
Task { ctx =>
val data = SelectItems.forCategory(ctx, collective)(maxItems, category, maxTextLen)
ctx.logger.info(s"Learn classifier for tag category: $category") *>
analyser.classifier.trainClassifier(ctx.logger, data)(
Kleisli(
StoreClassifierModel.handleModel(
ctx,
collective,
ClassifierName.tagCategory(category)
)
)
)
}
def learnAllTagCategories[F[_]: Sync: ContextShift, A](analyser: TextAnalyser[F])(
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
Task { ctx =>
for {
cats <- ctx.store.transact(RClassifierSetting.getActiveCategories(collective))
task = learnTagCategory[F, A](analyser, collective, maxItems, maxTextLen) _
_ <- cats.map(task).traverse(_.run(ctx))
} yield ()
}
}

View File

@ -0,0 +1,109 @@
package docspell.joex.learn
import fs2.{Pipe, Stream}
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler.Context
import docspell.store.Store
import docspell.store.qb.Batch
import docspell.store.queries.{QItem, TextAndTag}
import doobie._
object SelectItems {
val pageSep = LearnClassifierTask.pageSep
val noClass = LearnClassifierTask.noClass
def forCategory[F[_]](ctx: Context[F, _], collective: Ident)(
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] =
forCategory(ctx.store, collective, maxItems, category, maxTextLen)
def forCategory[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndTag(collective, item, category, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrOrg[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrOrg(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcEquip[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcEquip(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
private def allItems(collective: Ident, max: Int): Stream[ConnectionIO, Ident] = {
val limit = if (max <= 0) Batch.all else Batch.limit(max)
QItem.findAllNewesFirst(collective, 10, limit)
}
private def mkData[F[_]]: Pipe[F, TextAndTag, Data] =
_.map(tt => Data(tt.tag.map(_.name).getOrElse(noClass), tt.itemId.id, tt.text.trim))
.filter(_.text.nonEmpty)
}

View File

@ -0,0 +1,53 @@
package docspell.joex.learn
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.ClassifierModel
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.MimetypeHint
object StoreClassifierModel {
def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, _],
collective: Ident,
modelName: ClassifierName
)(
trainedModel: ClassifierModel
): F[Unit] =
handleModel(ctx.store, ctx.blocker, ctx.logger)(collective, modelName, trainedModel)
def handleModel[F[_]: Sync: ContextShift](
store: Store[F],
blocker: Blocker,
logger: Logger[F]
)(
collective: Ident,
modelName: ClassifierName,
trainedModel: ClassifierModel
): F[Unit] =
for {
oldFile <- store.transact(
RClassifierModel.findByName(collective, modelName.name).map(_.map(_.fileId))
)
_ <- logger.debug(s"Storing new trained model for: ${modelName.name}")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- store.transact(
RClassifierModel.updateFile(collective, modelName.name, Ident.unsafe(newFile.id))
)
_ <- logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
logger.debug(s"Deleting old model file ${fid.id}") *>
store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
} yield ()
}

View File

@ -78,7 +78,14 @@ object AttachmentPageCount {
s"No attachmentmeta record exists for ${ra.id.id}. Creating new." s"No attachmentmeta record exists for ${ra.id.id}. Creating new."
) *> ctx.store.transact( ) *> ctx.store.transact(
RAttachmentMeta.insert( RAttachmentMeta.insert(
RAttachmentMeta(ra.id, None, Nil, MetaProposalList.empty, md.pageCount.some) RAttachmentMeta(
ra.id,
None,
Nil,
MetaProposalList.empty,
md.pageCount.some,
None
)
) )
) )
else 0.pure[F] else 0.pure[F]

View File

@ -108,7 +108,18 @@ object ConvertPdf {
ctx.logger.info(s"Conversion to pdf+txt successful. Saving file.") *> ctx.logger.info(s"Conversion to pdf+txt successful. Saving file.") *>
storePDF(ctx, cfg, ra, pdf) storePDF(ctx, cfg, ra, pdf)
.flatMap(r => .flatMap(r =>
txt.map(t => (r, item.changeMeta(ra.id, _.setContentIfEmpty(t.some)).some)) txt.map(t =>
(
r,
item
.changeMeta(
ra.id,
ctx.args.meta.language,
_.setContentIfEmpty(t.some)
)
.some
)
)
) )
case ConversionResult.UnsupportedFormat(mt) => case ConversionResult.UnsupportedFormat(mt) =>

View File

@ -107,6 +107,8 @@ object CreateItem {
Vector.empty, Vector.empty,
fm.map(a => a.id -> a.fileId).toMap, fm.map(a => a.id -> a.fileId).toMap,
MetaProposalList.empty, MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil Nil
) )
} }
@ -166,6 +168,8 @@ object CreateItem {
Vector.empty, Vector.empty,
origMap, origMap,
MetaProposalList.empty, MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil Nil
) )
) )

View File

@ -42,7 +42,7 @@ object ExtractArchive {
archive: Option[RAttachmentArchive] archive: Option[RAttachmentArchive]
): Task[F, ProcessItemArgs, (Option[RAttachmentArchive], ItemData)] = ): Task[F, ProcessItemArgs, (Option[RAttachmentArchive], ItemData)] =
singlePass(item, archive).flatMap { t => singlePass(item, archive).flatMap { t =>
if (t._1 == None) Task.pure(t) if (t._1.isEmpty) Task.pure(t)
else multiPass(t._2, t._1) else multiPass(t._2, t._1)
} }

View File

@ -17,24 +17,92 @@ import docspell.store.records._
* by looking up values from NER in the user's address book. * by looking up values from NER in the user's address book.
*/ */
object FindProposal { object FindProposal {
type Args = ProcessItemArgs
def apply[F[_]: Sync]( def apply[F[_]: Sync](
cfg: Config.Processing cfg: Config.TextAnalysis
)(data: ItemData): Task[F, ProcessItemArgs, ItemData] = )(data: ItemData): Task[F, Args, ItemData] =
Task { ctx => Task { ctx =>
val rmas = data.metas.map(rm => rm.copy(nerlabels = removeDuplicates(rm.nerlabels))) val rmas = data.metas.map(rm => rm.copy(nerlabels = removeDuplicates(rm.nerlabels)))
for {
ctx.logger.info("Starting find-proposal") *> _ <- ctx.logger.info("Starting find-proposal")
rmas rmv <- rmas
.traverse(rm => .traverse(rm =>
processAttachment(cfg, rm, data.findDates(rm), ctx) processAttachment(cfg, rm, data.findDates(rm), ctx)
.map(ml => rm.copy(proposals = ml)) .map(ml => rm.copy(proposals = ml))
) )
.map(rmv => data.copy(metas = rmv)) clp <- lookupClassifierProposals(ctx, data.classifyProposals)
} yield data.copy(metas = rmv, classifyProposals = clp)
}
def lookupClassifierProposals[F[_]: Sync](
ctx: Context[F, Args],
mpList: MetaProposalList
): F[MetaProposalList] = {
val coll = ctx.args.meta.collective
def lookup(mp: MetaProposal): F[Option[IdRef]] =
mp.proposalType match {
case MetaProposalType.CorrOrg =>
ctx.store
.transact(
ROrganization
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier organization for $mp: $oref")
)
case MetaProposalType.CorrPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, false)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier corr-person for $mp: $oref")
)
case MetaProposalType.ConcPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, true)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-person for $mp: $oref")
)
case MetaProposalType.ConcEquip =>
ctx.store
.transact(
REquipment
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-equip for $mp: $oref")
)
case MetaProposalType.DocDate =>
(None: Option[IdRef]).pure[F]
case MetaProposalType.DueDate =>
(None: Option[IdRef]).pure[F]
}
def updateRef(mp: MetaProposal)(idRef: Option[IdRef]): Option[MetaProposal] =
idRef // this proposal contains a single value only, since it comes from the classifier
.map(ref => mp.copy(values = mp.values.map(_.copy(ref = ref))))
ctx.logger.debug(s"Looking up classifier results: ${mpList.proposals}") *>
mpList.proposals
.traverse(mp => lookup(mp).map(updateRef(mp)))
.map(_.flatten)
.map(MetaProposalList.apply)
} }
def processAttachment[F[_]: Sync]( def processAttachment[F[_]: Sync](
cfg: Config.Processing, cfg: Config.TextAnalysis,
rm: RAttachmentMeta, rm: RAttachmentMeta,
rd: Vector[NerDateLabel], rd: Vector[NerDateLabel],
ctx: Context[F, ProcessItemArgs] ctx: Context[F, ProcessItemArgs]
@ -46,11 +114,11 @@ object FindProposal {
} }
def makeDateProposal[F[_]: Sync]( def makeDateProposal[F[_]: Sync](
cfg: Config.Processing, cfg: Config.TextAnalysis,
dates: Vector[NerDateLabel] dates: Vector[NerDateLabel]
): F[MetaProposalList] = ): F[MetaProposalList] =
Timestamp.current[F].map { now => Timestamp.current[F].map { now =>
val maxFuture = now.plus(Duration.years(cfg.maxDueDateYears.toLong)) val maxFuture = now.plus(Duration.years(cfg.nlp.maxDueDateYears.toLong))
val latestFirst = dates val latestFirst = dates
.filter(_.date.isBefore(maxFuture.toUtcDate)) .filter(_.date.isBefore(maxFuture.toUtcDate))
.sortWith((l1, l2) => l1.date.isAfter(l2.date)) .sortWith((l1, l2) => l1.date.isAfter(l2.date))

View File

@ -15,6 +15,9 @@ import docspell.store.records.{RAttachment, RAttachmentMeta, RItem}
* containng the source or origin file * containng the source or origin file
* @param givenMeta meta data to this item that was not "guessed" * @param givenMeta meta data to this item that was not "guessed"
* from an attachment but given and thus is always correct * from an attachment but given and thus is always correct
* @param classifyProposals proposals that were obtained from a
* trained classifier. There are no NER tags; the classifier only
* provides a single label
*/ */
case class ItemData( case class ItemData(
item: RItem, item: RItem,
@ -23,7 +26,11 @@ case class ItemData(
dateLabels: Vector[AttachmentDates], dateLabels: Vector[AttachmentDates],
originFile: Map[Ident, Ident], // maps RAttachment.id -> FileMeta.id originFile: Map[Ident, Ident], // maps RAttachment.id -> FileMeta.id
givenMeta: MetaProposalList, // given meta data not associated to a specific attachment givenMeta: MetaProposalList, // given meta data not associated to a specific attachment
tags: List[String] // a list of tags (names or ids) attached to the item if they exist // a list of tags (names or ids) attached to the item if they exist
tags: List[String],
// proposals obtained from the classifier
classifyProposals: MetaProposalList,
classifyTags: List[String]
) { ) {
def findMeta(attachId: Ident): Option[RAttachmentMeta] = def findMeta(attachId: Ident): Option[RAttachmentMeta] =
@ -32,8 +39,12 @@ case class ItemData(
def findDates(rm: RAttachmentMeta): Vector[NerDateLabel] = def findDates(rm: RAttachmentMeta): Vector[NerDateLabel] =
dateLabels.find(m => m.rm.id == rm.id).map(_.dates).getOrElse(Vector.empty) dateLabels.find(m => m.rm.id == rm.id).map(_.dates).getOrElse(Vector.empty)
def mapMeta(attachId: Ident, f: RAttachmentMeta => RAttachmentMeta): ItemData = { def mapMeta(
val item = changeMeta(attachId, f) attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta
): ItemData = {
val item = changeMeta(attachId, lang, f)
val next = metas.map(a => if (a.id == attachId) item else a) val next = metas.map(a => if (a.id == attachId) item else a)
copy(metas = next) copy(metas = next)
} }
@ -43,13 +54,14 @@ case class ItemData(
def changeMeta( def changeMeta(
attachId: Ident, attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta f: RAttachmentMeta => RAttachmentMeta
): RAttachmentMeta = ): RAttachmentMeta =
f(findOrCreate(attachId)) f(findOrCreate(attachId, lang))
def findOrCreate(attachId: Ident): RAttachmentMeta = def findOrCreate(attachId: Ident, lang: Language): RAttachmentMeta =
metas.find(_.id == attachId).getOrElse { metas.find(_.id == attachId).getOrElse {
RAttachmentMeta.empty(attachId) RAttachmentMeta.empty(attachId, lang)
} }
} }

View File

@ -24,6 +24,7 @@ object LinkProposal {
.flatten(data.metas.map(_.proposals)) .flatten(data.metas.map(_.proposals))
.filter(_.proposalType != MetaProposalType.DocDate) .filter(_.proposalType != MetaProposalType.DocDate)
.sortByWeights .sortByWeights
.fillEmptyFrom(data.classifyProposals)
ctx.logger.info(s"Starting linking proposals") *> ctx.logger.info(s"Starting linking proposals") *>
MetaProposalType.all MetaProposalType.all

View File

@ -41,7 +41,7 @@ object ProcessItem {
regexNer: RegexNerFile[F] regexNer: RegexNerFile[F]
)(item: ItemData): Task[F, ProcessItemArgs, ItemData] = )(item: ItemData): Task[F, ProcessItemArgs, ItemData] =
TextAnalysis[F](cfg.textAnalysis, analyser, regexNer)(item) TextAnalysis[F](cfg.textAnalysis, analyser, regexNer)(item)
.flatMap(FindProposal[F](cfg.processing)) .flatMap(FindProposal[F](cfg.textAnalysis))
.flatMap(EvalProposals[F]) .flatMap(EvalProposals[F])
.flatMap(SaveProposals[F]) .flatMap(SaveProposals[F])

View File

@ -65,6 +65,8 @@ object ReProcessItem {
Vector.empty, Vector.empty,
asrcMap.view.mapValues(_.fileId).toMap, asrcMap.view.mapValues(_.fileId).toMap,
MetaProposalList.empty, MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil Nil
)).getOrElseF( )).getOrElseF(
Sync[F].raiseError(new Exception(s"Item not found: ${ctx.args.itemId.id}")) Sync[F].raiseError(new Exception(s"Item not found: ${ctx.args.itemId.id}"))

View File

@ -4,21 +4,51 @@ import cats.effect.Sync
import cats.implicits._ import cats.implicits._
import docspell.common._ import docspell.common._
import docspell.joex.scheduler.Task import docspell.joex.scheduler.{Context, Task}
import docspell.store.AddResult
import docspell.store.records._ import docspell.store.records._
/** Saves the proposals in the database /** Saves the proposals in the database
*/ */
object SaveProposals { object SaveProposals {
type Args = ProcessItemArgs
def apply[F[_]: Sync](data: ItemData): Task[F, ProcessItemArgs, ItemData] = def apply[F[_]: Sync](data: ItemData): Task[F, Args, ItemData] =
Task { ctx => Task { ctx =>
ctx.logger.info("Storing proposals") *> for {
data.metas _ <- ctx.logger.info("Storing proposals")
_ <- data.metas
.traverse(rm => .traverse(rm =>
ctx.logger.debug(s"Storing attachment proposals: ${rm.proposals}") *> ctx.logger.debug(
ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals)) s"Storing attachment proposals: ${rm.proposals}"
) *> ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
) )
.map(_ => data) _ <-
if (data.classifyProposals.isEmpty && data.classifyTags.isEmpty) 0.pure[F]
else saveItemProposal(ctx, data)
} yield data
}
def saveItemProposal[F[_]: Sync](ctx: Context[F, Args], data: ItemData): F[Unit] = {
def upsert(v: RItemProposal): F[Int] =
ctx.store.add(RItemProposal.insert(v), RItemProposal.exists(v.itemId)).flatMap {
case AddResult.Success => 1.pure[F]
case AddResult.EntityExists(_) =>
ctx.store.transact(RItemProposal.update(v))
case AddResult.Failure(ex) =>
ctx.logger.warn(s"Could not store item proposals: ${ex.getMessage}") *> 0
.pure[F]
}
for {
_ <- ctx.logger.debug(s"Storing classifier proposals: ${data.classifyProposals}")
tags <- ctx.store.transact(
RTag.findAllByNameOrId(data.classifyTags, ctx.args.meta.collective)
)
tagRefs = tags.map(t => IdRef(t.tagId, t.name))
now <- Timestamp.current[F]
value = RItemProposal(data.item.id, data.classifyProposals, tagRefs.toList, now)
_ <- upsert(value)
} yield ()
} }
} }

View File

@ -45,7 +45,8 @@ object SetGivenData {
Task { ctx => Task { ctx =>
val itemId = data.item.id val itemId = data.item.id
val collective = ctx.args.meta.collective val collective = ctx.args.meta.collective
val tags = (ctx.args.meta.tags.getOrElse(Nil) ++ data.tags).distinct val tags =
(ctx.args.meta.tags.getOrElse(Nil) ++ data.tags ++ data.classifyTags).distinct
for { for {
_ <- ctx.logger.info(s"Set tags from given data: ${tags}") _ <- ctx.logger.info(s"Set tags from given data: ${tags}")
e <- ops.linkTags(itemId, tags, collective).attempt e <- ops.linkTags(itemId, tags, collective).attempt

View File

@ -1,24 +1,20 @@
package docspell.joex.process package docspell.joex.process
import cats.data.OptionT import cats.Traverse
import cats.effect._ import cats.effect._
import cats.implicits._ import cats.implicits._
import docspell.analysis.TextAnalyser import docspell.analysis.classifier.TextClassifier
import docspell.analysis.nlp.ClassifierModel import docspell.analysis.{NlpSettings, TextAnalyser}
import docspell.analysis.nlp.StanfordNerSettings import docspell.common.MetaProposal.Candidate
import docspell.analysis.nlp.TextClassifier
import docspell.common._ import docspell.common._
import docspell.joex.Config import docspell.joex.Config
import docspell.joex.analysis.RegexNerFile import docspell.joex.analysis.RegexNerFile
import docspell.joex.learn.LearnClassifierTask import docspell.joex.learn.{ClassifierName, Classify, LearnClassifierTask}
import docspell.joex.process.ItemData.AttachmentDates import docspell.joex.process.ItemData.AttachmentDates
import docspell.joex.scheduler.Context import docspell.joex.scheduler.Context
import docspell.joex.scheduler.Task import docspell.joex.scheduler.Task
import docspell.store.records.RAttachmentMeta import docspell.store.records.{RAttachmentMeta, RClassifierSetting}
import docspell.store.records.RClassifierSetting
import bitpeace.RangeDef
object TextAnalysis { object TextAnalysis {
type Args = ProcessItemArgs type Args = ProcessItemArgs
@ -41,13 +37,27 @@ object TextAnalysis {
_ <- t.traverse(m => _ <- t.traverse(m =>
ctx.store.transact(RAttachmentMeta.updateLabels(m._1.id, m._1.nerlabels)) ctx.store.transact(RAttachmentMeta.updateLabels(m._1.id, m._1.nerlabels))
) )
v = t.toVector
autoTagEnabled <- getActiveAutoTag(ctx, cfg)
tag <-
if (autoTagEnabled) predictTags(ctx, cfg, item.metas, analyser.classifier)
else List.empty[String].pure[F]
classProposals <-
if (cfg.classification.enabled)
predictItemEntities(ctx, cfg, item.metas, analyser.classifier)
else MetaProposalList.empty.pure[F]
e <- s e <- s
_ <- ctx.logger.info(s"Text-Analysis finished in ${e.formatExact}") _ <- ctx.logger.info(s"Text-Analysis finished in ${e.formatExact}")
v = t.toVector
tag <- predictTag(ctx, cfg, item.metas, analyser.classifier(ctx.blocker)).value
} yield item } yield item
.copy(metas = v.map(_._1), dateLabels = v.map(_._2)) .copy(
.appendTags(tag.toSeq) metas = v.map(_._1),
dateLabels = v.map(_._2),
classifyProposals = classProposals,
classifyTags = tag
)
} }
def annotateAttachment[F[_]: Sync]( def annotateAttachment[F[_]: Sync](
@ -55,7 +65,7 @@ object TextAnalysis {
analyser: TextAnalyser[F], analyser: TextAnalyser[F],
nerFile: RegexNerFile[F] nerFile: RegexNerFile[F]
)(rm: RAttachmentMeta): F[(RAttachmentMeta, AttachmentDates)] = { )(rm: RAttachmentMeta): F[(RAttachmentMeta, AttachmentDates)] = {
val settings = StanfordNerSettings(ctx.args.meta.language, false, None) val settings = NlpSettings(ctx.args.meta.language, false, None)
for { for {
customNer <- nerFile.makeFile(ctx.args.meta.collective) customNer <- nerFile.makeFile(ctx.args.meta.collective)
sett = settings.copy(regexNer = customNer) sett = settings.copy(regexNer = customNer)
@ -68,44 +78,84 @@ object TextAnalysis {
} yield (rm.copy(nerlabels = labels.all.toList), AttachmentDates(rm, labels.dates)) } yield (rm.copy(nerlabels = labels.all.toList), AttachmentDates(rm, labels.dates))
} }
def predictTag[F[_]: Sync: ContextShift]( def predictTags[F[_]: Sync: ContextShift](
ctx: Context[F, Args], ctx: Context[F, Args],
cfg: Config.TextAnalysis, cfg: Config.TextAnalysis,
metas: Vector[RAttachmentMeta], metas: Vector[RAttachmentMeta],
classifier: TextClassifier[F] classifier: TextClassifier[F]
): OptionT[F, String] = ): F[List[String]] = {
for { val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
model <- findActiveModel(ctx, cfg) val classifyWith: ClassifierName => F[Option[String]] =
_ <- OptionT.liftF(ctx.logger.info(s"Guessing tag …")) makeClassify(ctx, cfg, classifier)(text)
text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
modelData =
ctx.store.bitpeace
.get(model.id)
.unNoneTerminate
.through(ctx.store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(cfg.workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, ctx.blocker))
.compile
.drain
.flatMap(_ => classifier.classify(ctx.logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
_ <- OptionT.liftF(ctx.logger.debug(s"Guessed tag: ${cls}"))
} yield cls
private def findActiveModel[F[_]: Sync]( for {
names <- ctx.store.transact(
ClassifierName.findTagClassifiers(ctx.args.meta.collective)
)
_ <- ctx.logger.debug(s"Guessing tags for ${names.size} categories")
tags <- names.traverse(classifyWith)
} yield tags.flatten
}
def predictItemEntities[F[_]: Sync: ContextShift](
ctx: Context[F, Args], ctx: Context[F, Args],
cfg: Config.TextAnalysis cfg: Config.TextAnalysis,
): OptionT[F, Ident] = metas: Vector[RAttachmentMeta],
(if (cfg.classification.enabled) classifier: TextClassifier[F]
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.meta.collective))) ): F[MetaProposalList] = {
.filter(_.enabled) val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
.mapFilter(_.fileId)
else def classifyWith(
OptionT.none[F, Ident]).orElse( cname: ClassifierName,
OptionT.liftF(ctx.logger.info("Classification is disabled.")) *> OptionT mtype: MetaProposalType
.none[F, Ident] ): F[Option[MetaProposal]] =
for {
label <- makeClassify(ctx, cfg, classifier)(text).apply(cname)
} yield label.map(str =>
MetaProposal(mtype, Candidate(IdRef(Ident.unsafe(""), str), Set.empty))
) )
Traverse[List]
.sequence(
List(
classifyWith(ClassifierName.correspondentOrg, MetaProposalType.CorrOrg),
classifyWith(ClassifierName.correspondentPerson, MetaProposalType.CorrPerson),
classifyWith(ClassifierName.concernedPerson, MetaProposalType.ConcPerson),
classifyWith(ClassifierName.concernedEquip, MetaProposalType.ConcEquip)
)
)
.map(_.flatten)
.map(MetaProposalList.apply)
}
private def makeClassify[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis,
classifier: TextClassifier[F]
)(text: String): ClassifierName => F[Option[String]] =
Classify[F](
ctx.blocker,
ctx.logger,
cfg.workingDir,
ctx.store,
classifier,
ctx.args.meta.collective,
text
)
private def getActiveAutoTag[F[_]: Sync](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): F[Boolean] =
if (cfg.classification.enabled)
ctx.store
.transact(RClassifierSetting.findById(ctx.args.meta.collective))
.map(_.exists(_.autoTagEnabled))
.flatTap(enabled =>
if (enabled) ().pure[F]
else ctx.logger.info("Classification is disabled. Check config or settings.")
)
else
ctx.logger.info("Classification is disabled.") *> false.pure[F]
} }
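For orientation, the prediction flow above boils down to the following (sketch in comments; names as defined in this file):
// val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
// predictTags:         one makeClassify call per active tag category
// predictItemEntities: one makeClassify call per fixed ClassifierName,
//                      each label wrapped into a MetaProposal with an
//                      empty Ident; FindProposal.lookupClassifierProposals
//                      later swaps the empty ref for the real entity id.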

View File

@ -46,10 +46,14 @@ object TextExtraction {
) )
_ <- fts.indexData(ctx.logger, (idxItem +: txt.map(_.td)).toSeq: _*) _ <- fts.indexData(ctx.logger, (idxItem +: txt.map(_.td)).toSeq: _*)
dur <- start dur <- start
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}") extractedTags = txt.flatMap(_.tags).distinct.toList
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}.")
_ <-
if (extractedTags.isEmpty) ().pure[F]
else ctx.logger.debug(s"Found tags in file: $extractedTags")
} yield item } yield item
.copy(metas = txt.map(_.am)) .copy(metas = txt.map(_.am))
.appendTags(txt.flatMap(_.tags).distinct.toList) .appendTags(extractedTags)
} }
// -- helpers // -- helpers
@ -78,7 +82,7 @@ object TextExtraction {
pair._2 pair._2
) )
val rm = item.findOrCreate(ra.id) val rm = item.findOrCreate(ra.id, lang)
rm.content match { rm.content match {
case Some(_) => case Some(_) =>
ctx.logger.info("TextExtraction skipped, since text is already available.") *> ctx.logger.info("TextExtraction skipped, since text is already available.") *>
@ -102,6 +106,7 @@ object TextExtraction {
res <- extractTextFallback(ctx, cfg, ra, lang)(fids) res <- extractTextFallback(ctx, cfg, ra, lang)(fids)
meta = item.changeMeta( meta = item.changeMeta(
ra.id, ra.id,
lang,
rm => rm =>
rm.setContentIfEmpty( rm.setContentIfEmpty(
res.map(_.appendPdfMetaToText.text.trim).filter(_.nonEmpty) res.map(_.appendPdfMetaToText.text.trim).filter(_.nonEmpty)

View File

@ -9,7 +9,7 @@ servers:
description: Current host description: Current host
paths: paths:
/api/info: /api/info/version:
get: get:
tags: [ Api Info ] tags: [ Api Info ]
summary: Get basic information about this software. summary: Get basic information about this software.

View File

@ -4850,14 +4850,11 @@ components:
description: | description: |
Settings for learning a document classifier. Settings for learning a document classifier.
required: required:
- enabled
- schedule - schedule
- itemCount - itemCount
- categoryList
- listType
properties: properties:
enabled:
type: boolean
category:
type: string
itemCount: itemCount:
type: integer type: integer
format: int32 format: int32
@ -4867,6 +4864,16 @@ components:
schedule: schedule:
type: string type: string
format: calevent format: calevent
categoryList:
type: array
items:
type: string
listType:
type: string
format: listtype
enum:
- blacklist
- whitelist
SourceList: SourceList:
description: | description: |

View File

@ -6,7 +6,7 @@ import cats.implicits._
import docspell.backend.BackendApp import docspell.backend.BackendApp
import docspell.backend.auth.AuthToken import docspell.backend.auth.AuthToken
import docspell.backend.ops.OCollective import docspell.backend.ops.OCollective
import docspell.common.MakePreviewArgs import docspell.common.{ListType, MakePreviewArgs}
import docspell.restapi.model._ import docspell.restapi.model._
import docspell.restserver.conv.Conversions import docspell.restserver.conv.Conversions
import docspell.restserver.http4s._ import docspell.restserver.http4s._
@ -44,10 +44,10 @@ object CollectiveRoutes {
settings.integrationEnabled, settings.integrationEnabled,
Some( Some(
OCollective.Classifier( OCollective.Classifier(
settings.classifier.enabled,
settings.classifier.schedule, settings.classifier.schedule,
settings.classifier.itemCount, settings.classifier.itemCount,
settings.classifier.category settings.classifier.categoryList,
settings.classifier.listType
) )
) )
) )
@ -65,12 +65,12 @@ object CollectiveRoutes {
c.language, c.language,
c.integrationEnabled, c.integrationEnabled,
ClassifierSetting( ClassifierSetting(
c.classifier.map(_.enabled).getOrElse(false),
c.classifier.flatMap(_.category),
c.classifier.map(_.itemCount).getOrElse(0), c.classifier.map(_.itemCount).getOrElse(0),
c.classifier c.classifier
.map(_.schedule) .map(_.schedule)
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00")) .getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00")),
c.classifier.map(_.categories).getOrElse(Nil),
c.classifier.map(_.listType).getOrElse(ListType.whitelist)
) )
) )
) )

View File

@ -0,0 +1,35 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
update "attachmentmeta"
set "language" = 'deu'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'deu'
);
update "attachmentmeta"
set "language" = 'eng'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'eng'
);
update "attachmentmeta"
set "language" = 'fra'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'fra'
);

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select random_uuid() as "id", "cid", concat('tagcategory-', "category") as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -0,0 +1,14 @@
ALTER TABLE `attachmentmeta`
ADD COLUMN (`language` varchar(254));
update `attachmentmeta` `m`
inner join (
select `m`.`attachid`, `c`.`doclang`
from `attachmentmeta` m
inner join `attachment` a on `a`.`attachid` = `m`.`attachid`
inner join `item` i on `a`.`itemid` = `i`.`itemid`
inner join `collective` c on `c`.`cid` = `i`.`cid`
) as `c`
set `m`.`language` = `c`.`doclang`
where `m`.`attachid` = `c`.`attachid` and `m`.`language` is null;

View File

@ -0,0 +1,48 @@
CREATE TABLE `classifier_model`(
`id` varchar(254) not null primary key,
`cid` varchar(254) not null,
`name` varchar(254) not null,
`file_id` varchar(254) not null,
`created` timestamp not null,
foreign key (`cid`) references `collective`(`cid`),
foreign key (`file_id`) references `filemeta`(`id`),
unique (`cid`, `name`)
);
insert into `classifier_model`
select md5(rand()) as id, `cid`,concat('tagcategory-', `category`) as `name`, `file_id`, `created`
from `classifier_setting`
where `file_id` is not null;
alter table `classifier_setting`
add column (`categories` mediumtext);
alter table `classifier_setting`
add column (`category_list_type` varchar(254));
update `classifier_setting`
set `category_list_type` = 'whitelist';
update `classifier_setting`
set `categories` = concat('["', `category`, '"]')
where category is not null;
update `classifier_setting`
set `categories` = '[]'
where category is null;
alter table `classifier_setting`
drop column `category`;
-- mariadb requires dropping the constraint manually when dropping a column
alter table `classifier_setting`
drop constraint `classifier_setting_ibfk_2`;
alter table `classifier_setting`
drop column `file_id`;
ALTER TABLE `classifier_setting`
MODIFY `categories` mediumtext NOT NULL;
ALTER TABLE `classifier_setting`
MODIFY `category_list_type` varchar(254) NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE `item_proposal` (
`itemid` varchar(254) not null primary key,
`classifier_proposals` mediumtext not null,
`classifier_tags` mediumtext not null,
`created` timestamp not null,
foreign key (`itemid`) references `item`(`itemid`)
);

View File

@ -0,0 +1,15 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
with
"attachlang" as (
select "m"."attachid", "m"."language", "c"."doclang"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
)
update "attachmentmeta" as "m"
set "language" = "c"."doclang"
from "attachlang" c
where "m"."attachid" = "c"."attachid" and "m"."language" is null;

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select md5(random()::text) as id, "cid",'tagcategory-' || "category" as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -86,6 +86,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaItemProposalList: Meta[MetaProposalList] = implicit val metaItemProposalList: Meta[MetaProposalList] =
jsonMeta[MetaProposalList] jsonMeta[MetaProposalList]
implicit val metaIdRef: Meta[List[IdRef]] =
jsonMeta[List[IdRef]]
implicit val metaLanguage: Meta[Language] = implicit val metaLanguage: Meta[Language] =
Meta[String].imap(Language.unsafe)(_.iso3) Meta[String].imap(Language.unsafe)(_.iso3)
@ -97,6 +100,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaCustomFieldType: Meta[CustomFieldType] = implicit val metaCustomFieldType: Meta[CustomFieldType] =
Meta[String].timap(CustomFieldType.unsafe)(_.name) Meta[String].timap(CustomFieldType.unsafe)(_.name)
implicit val metaListType: Meta[ListType] =
Meta[String].timap(ListType.unsafeFromString)(_.name)
} }
object DoobieMeta extends DoobieMeta { object DoobieMeta extends DoobieMeta {

View File

@ -1,5 +1,7 @@
package docspell.store.qb package docspell.store.qb
import cats.data.NonEmptyList
sealed trait DBFunction {} sealed trait DBFunction {}
object DBFunction { object DBFunction {
@ -31,6 +33,8 @@ object DBFunction {
case class Sum(expr: SelectExpr) extends DBFunction case class Sum(expr: SelectExpr) extends DBFunction
case class Concat(exprs: NonEmptyList[SelectExpr]) extends DBFunction
sealed trait Operator sealed trait Operator
object Operator { object Operator {
case object Plus extends Operator case object Plus extends Operator

View File

@ -98,6 +98,9 @@ trait DSL extends DoobieMeta {
def substring(expr: SelectExpr, start: Int, length: Int): DBFunction = def substring(expr: SelectExpr, start: Int, length: Int): DBFunction =
DBFunction.Substring(expr, start, length) DBFunction.Substring(expr, start, length)
def concat(expr: SelectExpr, exprs: SelectExpr*): DBFunction =
DBFunction.Concat(Nel.of(expr, exprs: _*))
def lit[A](value: A)(implicit P: Put[A]): SelectExpr.SelectLit[A] = def lit[A](value: A)(implicit P: Put[A]): SelectExpr.SelectLit[A] =
SelectExpr.SelectLit(value, None) SelectExpr.SelectLit(value, None)

View File

@ -32,6 +32,10 @@ object DBFunctionBuilder extends CommonBuilder {
case DBFunction.Substring(expr, start, len) => case DBFunction.Substring(expr, start, len) =>
sql"SUBSTRING(" ++ SelectExprBuilder.build(expr) ++ fr" FROM $start FOR $len)" sql"SUBSTRING(" ++ SelectExprBuilder.build(expr) ++ fr" FROM $start FOR $len)"
case DBFunction.Concat(exprs) =>
val inner = exprs.map(SelectExprBuilder.build).toList.reduce(_ ++ comma ++ _)
sql"CONCAT(" ++ inner ++ sql")"
case DBFunction.Calc(op, left, right) => case DBFunction.Calc(op, left, right) =>
SelectExprBuilder.build(left) ++ SelectExprBuilder.build(left) ++
buildOperator(op) ++ buildOperator(op) ++

View File

@ -21,6 +21,7 @@ object QAttachment {
private val item = RItem.as("i") private val item = RItem.as("i")
private val am = RAttachmentMeta.as("am") private val am = RAttachmentMeta.as("am")
private val c = RCollective.as("c") private val c = RCollective.as("c")
private val im = RItemProposal.as("im")
def deletePreview[F[_]: Sync](store: Store[F])(attachId: Ident): F[Int] = { def deletePreview[F[_]: Sync](store: Store[F])(attachId: Ident): F[Int] = {
val findPreview = val findPreview =
@ -118,17 +119,27 @@ object QAttachment {
} yield ns.sum } yield ns.sum
def getMetaProposals(itemId: Ident, coll: Ident): ConnectionIO[MetaProposalList] = { def getMetaProposals(itemId: Ident, coll: Ident): ConnectionIO[MetaProposalList] = {
val q = Select( val qa = Select(
am.proposals.s, select(am.proposals),
from(am) from(am)
.innerJoin(a, a.id === am.id) .innerJoin(a, a.id === am.id)
.innerJoin(item, a.itemId === item.id), .innerJoin(item, a.itemId === item.id),
a.itemId === itemId && item.cid === coll a.itemId === itemId && item.cid === coll
).build ).build
val qi = Select(
select(im.classifyProposals),
from(im)
.innerJoin(item, item.id === im.itemId),
item.cid === coll && im.itemId === itemId
).build
for { for {
ml <- q.query[MetaProposalList].to[Vector] mla <- qa.query[MetaProposalList].to[Vector]
} yield MetaProposalList.flatten(ml) mli <- qi.query[MetaProposalList].to[Vector]
} yield MetaProposalList
.flatten(mla)
.insertSecond(MetaProposalList.flatten(mli))
} }
def getAttachmentMeta( def getAttachmentMeta(
@ -160,7 +171,15 @@ object QAttachment {
chunkSize: Int chunkSize: Int
): Stream[ConnectionIO, ContentAndName] = ): Stream[ConnectionIO, ContentAndName] =
Select( Select(
select(a.id, a.itemId, item.cid, item.folder, c.language, a.name, am.content), select(
a.id.s,
a.itemId.s,
item.cid.s,
item.folder.s,
coalesce(am.language.s, c.language.s).s,
a.name.s,
am.content.s
),
from(a) from(a)
.innerJoin(am, am.id === a.id) .innerJoin(am, am.id === a.id)
.innerJoin(item, item.id === a.itemId) .innerJoin(item, item.id === a.itemId)

View File

@ -1,10 +1,8 @@
package docspell.store.queries package docspell.store.queries
import cats.data.OptionT
import fs2.Stream import fs2.Stream
import docspell.common.ContactKind import docspell.common._
import docspell.common.{Direction, Ident}
import docspell.store.qb.DSL._ import docspell.store.qb.DSL._
import docspell.store.qb._ import docspell.store.qb._
import docspell.store.records._ import docspell.store.records._
@ -17,6 +15,7 @@ object QCollective {
private val t = RTag.as("t") private val t = RTag.as("t")
private val ro = ROrganization.as("o") private val ro = ROrganization.as("o")
private val rp = RPerson.as("p") private val rp = RPerson.as("p")
private val re = REquipment.as("e")
private val rc = RContact.as("c") private val rc = RContact.as("c")
private val i = RItem.as("i") private val i = RItem.as("i")
@ -25,13 +24,37 @@ object QCollective {
val empty = Names(Vector.empty, Vector.empty, Vector.empty) val empty = Names(Vector.empty, Vector.empty, Vector.empty)
} }
def allNames(collective: Ident): ConnectionIO[Names] = def allNames(collective: Ident, maxEntries: Int): ConnectionIO[Names] = {
(for { val created = Column[Timestamp]("created", TableDef(""))
orgs <- OptionT.liftF(ROrganization.findAllRef(collective, None, _.name)) union(
pers <- OptionT.liftF(RPerson.findAllRef(collective, None, _.name)) Select(
equp <- OptionT.liftF(REquipment.findAll(collective, None, _.name)) select(ro.name.s, lit(1).as("kind"), ro.created.as(created)),
} yield Names(orgs.map(_.name), pers.map(_.name), equp.map(_.name))) from(ro),
.getOrElse(Names.empty) ro.cid === collective
),
Select(
select(rp.name.s, lit(2).as("kind"), rp.created.as(created)),
from(rp),
rp.cid === collective
),
Select(
select(re.name.s, lit(3).as("kind"), re.created.as(created)),
from(re),
re.cid === collective
)
).orderBy(created.desc)
.limit(Batch.limit(maxEntries))
.build
.query[(String, Int)]
.streamWithChunkSize(maxEntries)
.fold(Names.empty) { case (names, (name, kind)) =>
if (kind == 1) names.copy(org = names.org :+ name)
else if (kind == 2) names.copy(pers = names.pers :+ name)
else names.copy(equip = names.equip :+ name)
}
.compile
.lastOrError
}
case class InsightData( case class InsightData(
incoming: Int, incoming: Int,

View File

@ -441,8 +441,9 @@ object QItem {
tn <- store.transact(RTagItem.deleteItemTags(itemId)) tn <- store.transact(RTagItem.deleteItemTags(itemId))
mn <- store.transact(RSentMail.deleteByItem(itemId)) mn <- store.transact(RSentMail.deleteByItem(itemId))
cf <- store.transact(RCustomFieldValue.deleteByItem(itemId)) cf <- store.transact(RCustomFieldValue.deleteByItem(itemId))
im <- store.transact(RItemProposal.deleteByItem(itemId))
n <- store.transact(RItem.deleteByIdAndCollective(itemId, collective)) n <- store.transact(RItem.deleteByIdAndCollective(itemId, collective))
} yield tn + rn + n + mn + cf } yield tn + rn + n + mn + cf + im
private def findByFileIdsQuery( private def findByFileIdsQuery(
fileMetaIds: Nel[Ident], fileMetaIds: Nel[Ident],
@ -543,11 +544,13 @@ object QItem {
def findAllNewesFirst( def findAllNewesFirst(
collective: Ident, collective: Ident,
chunkSize: Int chunkSize: Int,
limit: Batch
): Stream[ConnectionIO, Ident] = { ): Stream[ConnectionIO, Ident] = {
val i = RItem.as("i") val i = RItem.as("i")
Select(i.id.s, from(i), i.cid === collective && i.state === ItemState.confirmed) Select(i.id.s, from(i), i.cid === collective && i.state === ItemState.confirmed)
.orderBy(i.created.desc) .orderBy(i.created.desc)
.limit(limit)
.build .build
.query[Ident] .query[Ident]
.streamWithChunkSize(chunkSize) .streamWithChunkSize(chunkSize)
@ -557,6 +560,7 @@ object QItem {
collective: Ident, collective: Ident,
itemId: Ident, itemId: Ident,
tagCategory: String, tagCategory: String,
maxLen: Int,
pageSep: String pageSep: String
): ConnectionIO[TextAndTag] = { ): ConnectionIO[TextAndTag] = {
val tags = TableDef("tags").as("tt") val tags = TableDef("tags").as("tt")
@ -564,7 +568,7 @@ object QItem {
val tagsTid = Column[Ident]("tid", tags) val tagsTid = Column[Ident]("tid", tags)
val tagsName = Column[String]("tname", tags) val tagsName = Column[String]("tname", tags)
val q = readTextAndTag(collective, itemId, pageSep) {
withCte( withCte(
tags -> Select( tags -> Select(
select(ti.itemId.as(tagsItem), tag.tid.as(tagsTid), tag.name.as(tagsName)), select(ti.itemId.as(tagsItem), tag.tid.as(tagsTid), tag.name.as(tagsName)),
@ -574,25 +578,98 @@ object QItem {
) )
)( )(
Select( Select(
select(m.content, tagsTid, tagsName), select(substring(m.content.s, 0, maxLen).s, tagsTid.s, tagsName.s),
from(i) from(i)
.innerJoin(a, a.itemId === i.id) .innerJoin(a, a.itemId === i.id)
.innerJoin(m, a.id === m.id) .innerJoin(m, a.id === m.id)
.leftJoin(tags, tagsItem === i.id), .leftJoin(tags, tagsItem === i.id),
i.id === itemId && i.cid === collective && m.content.isNotNull && m.content <> "" i.id === itemId && i.cid === collective && m.content.isNotNull && m.content <> ""
) )
).build )
}
}
def resolveTextAndCorrOrg(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, org.oid.s, org.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(org, org.oid === i.corrOrg),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndCorrPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.corrPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.concPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcEquip(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, equip.eid.s, equip.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(equip, equip.eid === i.concEquipment),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
private def readTextAndTag(collective: Ident, itemId: Ident, pageSep: String)(
q: Select
): ConnectionIO[TextAndTag] =
for { for {
_ <- logger.ftrace[ConnectionIO]( _ <- logger.ftrace[ConnectionIO](
s"query: $q (${itemId.id}, ${collective.id}, ${tagCategory})" s"query: $q (${itemId.id}, ${collective.id})"
) )
texts <- q.query[(String, Option[TextAndTag.TagName])].to[List] texts <- q.build.query[(String, Option[TextAndTag.TagName])].to[List]
_ <- logger.ftrace[ConnectionIO]( _ <- logger.ftrace[ConnectionIO](
s"Got ${texts.size} text and tag entries for item ${itemId.id}" s"Got ${texts.size} text and tag entries for item ${itemId.id}"
) )
tag = texts.headOption.flatMap(_._2) tag = texts.headOption.flatMap(_._2)
txt = texts.map(_._1).mkString(pageSep) txt = texts.map(_._1).mkString(pageSep)
} yield TextAndTag(itemId, txt, tag) } yield TextAndTag(itemId, txt, tag)
}
} }

View File

@ -15,7 +15,8 @@ case class RAttachmentMeta(
content: Option[String], content: Option[String],
nerlabels: List[NerLabel], nerlabels: List[NerLabel],
proposals: MetaProposalList, proposals: MetaProposalList,
pages: Option[Int] pages: Option[Int],
language: Option[Language]
) { ) {
def setContentIfEmpty(txt: Option[String]): RAttachmentMeta = def setContentIfEmpty(txt: Option[String]): RAttachmentMeta =
@ -27,8 +28,8 @@ case class RAttachmentMeta(
} }
object RAttachmentMeta { object RAttachmentMeta {
def empty(attachId: Ident) = def empty(attachId: Ident, lang: Language) =
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None) RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None, Some(lang))
final case class Table(alias: Option[String]) extends TableDef { final case class Table(alias: Option[String]) extends TableDef {
val tableName = "attachmentmeta" val tableName = "attachmentmeta"
@ -38,7 +39,16 @@ object RAttachmentMeta {
val nerlabels = Column[List[NerLabel]]("nerlabels", this) val nerlabels = Column[List[NerLabel]]("nerlabels", this)
val proposals = Column[MetaProposalList]("itemproposals", this) val proposals = Column[MetaProposalList]("itemproposals", this)
val pages = Column[Int]("page_count", this) val pages = Column[Int]("page_count", this)
val all = NonEmptyList.of[Column[_]](id, content, nerlabels, proposals, pages) val language = Column[Language]("language", this)
val all =
NonEmptyList.of[Column[_]](
id,
content,
nerlabels,
proposals,
pages,
language
)
} }
val T = Table(None) val T = Table(None)
@ -49,7 +59,7 @@ object RAttachmentMeta {
DML.insert( DML.insert(
T, T,
T.all, T.all,
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages}" fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages},${v.language}"
) )
def exists(attachId: Ident): ConnectionIO[Boolean] = def exists(attachId: Ident): ConnectionIO[Boolean] =
@ -90,13 +100,14 @@ object RAttachmentMeta {
) )
) )
def updateProposals(mid: Ident, plist: MetaProposalList): ConnectionIO[Int] = def updateProposals(
mid: Ident,
plist: MetaProposalList
): ConnectionIO[Int] =
DML.update( DML.update(
T, T,
T.id === mid, T.id === mid,
DML.set( DML.set(T.proposals.setTo(plist))
T.proposals.setTo(plist)
)
) )
def updatePageCount(mid: Ident, pageCount: Option[Int]): ConnectionIO[Int] = def updatePageCount(mid: Ident, pageCount: Option[Int]): ConnectionIO[Int] =

View File

@ -0,0 +1,102 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.effect._
import cats.implicits._
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
final case class RClassifierModel(
id: Ident,
cid: Ident,
name: String,
fileId: Ident,
created: Timestamp
) {}
object RClassifierModel {
def createNew[F[_]: Sync](
cid: Ident,
name: String,
fileId: Ident
): F[RClassifierModel] =
for {
id <- Ident.randomId[F]
now <- Timestamp.current[F]
} yield RClassifierModel(id, cid, name, fileId, now)
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_model"
val id = Column[Ident]("id", this)
val cid = Column[Ident]("cid", this)
val name = Column[String]("name", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](id, cid, name, fileId, created)
}
def as(alias: String): Table =
Table(Some(alias))
val T = Table(None)
def insert(v: RClassifierModel): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.id},${v.cid},${v.name},${v.fileId},${v.created}"
)
def updateFile(coll: Ident, name: String, fid: Ident): ConnectionIO[Int] =
for {
now <- Timestamp.current[ConnectionIO]
n <- DML.update(
T,
T.cid === coll && T.name === name,
DML.set(T.fileId.setTo(fid), T.created.setTo(now))
)
k <-
if (n == 0) createNew[ConnectionIO](coll, name, fid).flatMap(insert)
else 0.pure[ConnectionIO]
} yield n + k
def deleteById(id: Ident): ConnectionIO[Int] =
DML.delete(T, T.id === id)
def deleteAll(ids: List[Ident]): ConnectionIO[Int] =
NonEmptyList.fromList(ids) match {
case Some(nel) =>
DML.delete(T, T.id.in(nel))
case None =>
0.pure[ConnectionIO]
}
def findByName(cid: Ident, name: String): ConnectionIO[Option[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name === name).build
.query[RClassifierModel]
.option
def findAllByName(
cid: Ident,
names: NonEmptyList[String]
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.in(names)).build
.query[RClassifierModel]
.to[List]
def findAllByQuery(
cid: Ident,
nameQuery: String
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.like(nameQuery)).build
.query[RClassifierModel]
.to[List]
}

View File

@ -1,6 +1,6 @@
package docspell.store.records package docspell.store.records
import cats.data.NonEmptyList import cats.data.{NonEmptyList, OptionT}
import cats.implicits._ import cats.implicits._
import docspell.common._ import docspell.common._
@ -13,27 +13,38 @@ import doobie.implicits._
case class RClassifierSetting( case class RClassifierSetting(
cid: Ident, cid: Ident,
enabled: Boolean,
schedule: CalEvent, schedule: CalEvent,
category: String,
itemCount: Int, itemCount: Int,
fileId: Option[Ident], created: Timestamp,
created: Timestamp categoryList: List[String],
) {} listType: ListType
) {
def autoTagEnabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categoryList.nonEmpty
}
}
object RClassifierSetting { object RClassifierSetting {
// the categoryList is stored as a json array
implicit val stringListMeta: Meta[List[String]] =
jsonMeta[List[String]]
final case class Table(alias: Option[String]) extends TableDef { final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_setting" val tableName = "classifier_setting"
val cid = Column[Ident]("cid", this) val cid = Column[Ident]("cid", this)
val enabled = Column[Boolean]("enabled", this)
val schedule = Column[CalEvent]("schedule", this) val schedule = Column[CalEvent]("schedule", this)
val category = Column[String]("category", this)
val itemCount = Column[Int]("item_count", this) val itemCount = Column[Int]("item_count", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this) val created = Column[Timestamp]("created", this)
val categories = Column[List[String]]("categories", this)
val listType = Column[ListType]("category_list_type", this)
val all = NonEmptyList val all = NonEmptyList
.of[Column[_]](cid, enabled, schedule, category, itemCount, fileId, created) .of[Column[_]](cid, schedule, itemCount, created, categories, listType)
} }
val T = Table(None) val T = Table(None)
@ -44,35 +55,19 @@ object RClassifierSetting {
DML.insert( DML.insert(
T, T,
T.all, T.all,
fr"${v.cid},${v.enabled},${v.schedule},${v.category},${v.itemCount},${v.fileId},${v.created}" fr"${v.cid},${v.schedule},${v.itemCount},${v.created},${v.categoryList},${v.listType}"
) )
def updateAll(v: RClassifierSetting): ConnectionIO[Int] = def update(v: RClassifierSetting): ConnectionIO[Int] =
DML.update(
T,
T.cid === v.cid,
DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule),
T.category.setTo(v.category),
T.itemCount.setTo(v.itemCount),
T.fileId.setTo(v.fileId)
)
)
def updateFile(coll: Ident, fid: Ident): ConnectionIO[Int] =
DML.update(T, T.cid === coll, DML.set(T.fileId.setTo(fid)))
def updateSettings(v: RClassifierSetting): ConnectionIO[Int] =
for { for {
n1 <- DML.update( n1 <- DML.update(
T, T,
T.cid === v.cid, T.cid === v.cid,
DML.set( DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule), T.schedule.setTo(v.schedule),
T.itemCount.setTo(v.itemCount), T.itemCount.setTo(v.itemCount),
T.category.setTo(v.category) T.categories.setTo(v.categoryList),
T.listType.setTo(v.listType)
) )
) )
n2 <- if (n1 <= 0) insert(v) else 0.pure[ConnectionIO] n2 <- if (n1 <= 0) insert(v) else 0.pure[ConnectionIO]
@ -86,27 +81,62 @@ object RClassifierSetting {
def delete(coll: Ident): ConnectionIO[Int] = def delete(coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.cid === coll) DML.delete(T, T.cid === coll)
/** Finds tag categories that exist and match the classifier setting.
* If the setting contains a blacklist, those categories are removed
* from the existing ones. If it is a whitelist, the intersection is
* returned.
*/
def getActiveCategories(coll: Ident): ConnectionIO[List[String]] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
res = sett.listType match {
case ListType.Blacklist =>
cats.diff(sett.categoryList)
case ListType.Whitelist =>
sett.categoryList.intersect(cats)
}
} yield res).getOrElse(Nil)
/** Checks the json array of tag categories and removes those that are not present anymore. */
def fixCategoryList(coll: Ident): ConnectionIO[Int] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
fixed = sett.categoryList.intersect(cats)
n <- OptionT.liftF(
if (fixed == sett.categoryList) 0.pure[ConnectionIO]
else DML.update(T, T.cid === coll, DML.set(T.categories.setTo(fixed)))
)
} yield n).getOrElse(0)
case class Classifier( case class Classifier(
enabled: Boolean,
schedule: CalEvent, schedule: CalEvent,
itemCount: Int, itemCount: Int,
category: Option[String] categories: List[String],
listType: ListType
) { ) {
def enabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categories.nonEmpty
}
def toRecord(coll: Ident, created: Timestamp): RClassifierSetting = def toRecord(coll: Ident, created: Timestamp): RClassifierSetting =
RClassifierSetting( RClassifierSetting(
coll, coll,
enabled,
schedule, schedule,
category.getOrElse(""),
itemCount, itemCount,
None, created,
created categories,
listType
) )
} }
object Classifier { object Classifier {
def fromRecord(r: RClassifierSetting): Classifier = def fromRecord(r: RClassifierSetting): Classifier =
Classifier(r.enabled, r.schedule, r.itemCount, r.category.some) Classifier(r.schedule, r.itemCount, r.categoryList, r.listType)
} }
} }

View File

@ -1,6 +1,6 @@
package docspell.store.records package docspell.store.records
import cats.data.NonEmptyList import cats.data.{NonEmptyList, OptionT}
import fs2.Stream import fs2.Stream
import docspell.common._ import docspell.common._
@ -73,13 +73,24 @@ object RCollective {
.map(now => settings.classifier.map(_.toRecord(cid, now))) .map(now => settings.classifier.map(_.toRecord(cid, now)))
n2 <- cls match { n2 <- cls match {
case Some(cr) => case Some(cr) =>
RClassifierSetting.updateSettings(cr) RClassifierSetting.update(cr)
case None => case None =>
RClassifierSetting.delete(cid) RClassifierSetting.delete(cid)
} }
} yield n1 + n2 } yield n1 + n2
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] = { // this hides categories that have been deleted in the meantime
// they are finally removed from the json array once the learn classifier task is run
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] =
(for {
sett <- OptionT(getRawSettings(coll))
prev <- OptionT.fromOption[ConnectionIO](sett.classifier)
cats <- OptionT.liftF(RTag.listCategories(coll))
next = prev.copy(categories = prev.categories.intersect(cats))
} yield sett.copy(classifier = Some(next))).value
private def getRawSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
import RClassifierSetting.stringListMeta
val c = RCollective.as("c") val c = RCollective.as("c")
val cs = RClassifierSetting.as("cs") val cs = RClassifierSetting.as("cs")
@ -87,10 +98,10 @@ object RCollective {
select( select(
c.language.s, c.language.s,
c.integration.s, c.integration.s,
cs.enabled.s,
cs.schedule.s, cs.schedule.s,
cs.itemCount.s, cs.itemCount.s,
cs.category.s cs.categories.s,
cs.listType.s
), ),
from(c).leftJoin(cs, cs.cid === c.id), from(c).leftJoin(cs, cs.cid === c.id),
c.id === coll c.id === coll

View File

@ -0,0 +1,60 @@
package docspell.store.records
import cats.data.NonEmptyList
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
case class RItemProposal(
itemId: Ident,
classifyProposals: MetaProposalList,
classifyTags: List[IdRef],
created: Timestamp
)
object RItemProposal {
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "item_proposal"
val itemId = Column[Ident]("itemid", this)
val classifyProposals = Column[MetaProposalList]("classifier_proposals", this)
val classifyTags = Column[List[IdRef]]("classifier_tags", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](itemId, classifyProposals, classifyTags, created)
}
val T = Table(None)
def as(alias: String): Table =
Table(Some(alias))
def insert(v: RItemProposal): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.itemId},${v.classifyProposals},${v.classifyTags},${v.created}"
)
def update(v: RItemProposal): ConnectionIO[Int] =
DML.update(
T,
T.itemId === v.itemId,
DML.set(
T.classifyProposals.setTo(v.classifyProposals),
T.classifyTags.setTo(v.classifyTags)
)
)
def deleteByItem(itemId: Ident): ConnectionIO[Int] =
DML.delete(T, T.itemId === itemId)
def exists(itemId: Ident): ConnectionIO[Boolean] =
Select(select(countAll), from(T), T.itemId === itemId).build
.query[Int]
.unique
.map(_ > 0)
}

View File

@ -148,6 +148,13 @@ object RTag {
).orderBy(T.name.asc).build.query[RTag].to[List] ).orderBy(T.name.asc).build.query[RTag].to[List]
} }
def listCategories(coll: Ident): ConnectionIO[List[String]] =
Select(
T.category.s,
from(T),
T.cid === coll && T.category.isNotNull
).distinct.build.query[String].to[List]
def delete(tagId: Ident, coll: Ident): ConnectionIO[Int] = def delete(tagId: Ident, coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.tid === tagId && T.cid === coll) DML.delete(T, T.tid === tagId && T.cid === coll)
} }

View File

@ -11,35 +11,38 @@ import Api
import Api.Model.ClassifierSetting exposing (ClassifierSetting) import Api.Model.ClassifierSetting exposing (ClassifierSetting)
import Api.Model.TagList exposing (TagList) import Api.Model.TagList exposing (TagList)
import Comp.CalEventInput import Comp.CalEventInput
import Comp.Dropdown
import Comp.FixedDropdown import Comp.FixedDropdown
import Comp.IntField import Comp.IntField
import Data.CalEvent exposing (CalEvent) import Data.CalEvent exposing (CalEvent)
import Data.Flags exposing (Flags) import Data.Flags exposing (Flags)
import Data.ListType exposing (ListType)
import Data.UiSettings exposing (UiSettings)
import Data.Validated exposing (Validated(..)) import Data.Validated exposing (Validated(..))
import Html exposing (..) import Html exposing (..)
import Html.Attributes exposing (..) import Html.Attributes exposing (..)
import Html.Events exposing (onCheck)
import Http import Http
import Markdown
import Util.Tag import Util.Tag
type alias Model = type alias Model =
{ enabled : Bool { scheduleModel : Comp.CalEventInput.Model
, categoryModel : Comp.FixedDropdown.Model String
, category : Maybe String
, scheduleModel : Comp.CalEventInput.Model
, schedule : Validated CalEvent , schedule : Validated CalEvent
, itemCountModel : Comp.IntField.Model , itemCountModel : Comp.IntField.Model
, itemCount : Maybe Int , itemCount : Maybe Int
, categoryListModel : Comp.Dropdown.Model String
, categoryListType : ListType
, categoryListTypeModel : Comp.FixedDropdown.Model ListType
} }
type Msg type Msg
= GetTagsResp (Result Http.Error TagList) = ScheduleMsg Comp.CalEventInput.Msg
| ScheduleMsg Comp.CalEventInput.Msg
| ToggleEnabled
| CategoryMsg (Comp.FixedDropdown.Msg String)
| ItemCountMsg Comp.IntField.Msg | ItemCountMsg Comp.IntField.Msg
| GetTagsResp (Result Http.Error TagList)
| CategoryListMsg (Comp.Dropdown.Msg String)
| CategoryListTypeMsg (Comp.FixedDropdown.Msg ListType)
init : Flags -> ClassifierSetting -> ( Model, Cmd Msg ) init : Flags -> ClassifierSetting -> ( Model, Cmd Msg )
@ -52,13 +55,36 @@ init flags sett =
( cem, cec ) = ( cem, cec ) =
Comp.CalEventInput.init flags newSchedule Comp.CalEventInput.init flags newSchedule
in in
( { enabled = sett.enabled ( { scheduleModel = cem
, categoryModel = Comp.FixedDropdown.initString []
, category = sett.category
, scheduleModel = cem
, schedule = Data.Validated.Unknown newSchedule , schedule = Data.Validated.Unknown newSchedule
, itemCountModel = Comp.IntField.init (Just 0) Nothing True "Item Count" , itemCountModel = Comp.IntField.init (Just 0) Nothing True "Item Count"
, itemCount = Just sett.itemCount , itemCount = Just sett.itemCount
, categoryListModel =
let
mkOption s =
{ value = s, text = s, additional = "" }
minit =
Comp.Dropdown.makeModel
{ multiple = True
, searchable = \n -> n > 0
, makeOption = mkOption
, labelColor = \_ -> \_ -> "grey "
, placeholder = "Choose categories "
}
lm =
Comp.Dropdown.SetSelection sett.categoryList
( m_, _ ) =
Comp.Dropdown.update lm minit
in
m_
, categoryListType =
Data.ListType.fromString sett.listType
|> Maybe.withDefault Data.ListType.Whitelist
, categoryListTypeModel =
Comp.FixedDropdown.initMap Data.ListType.label Data.ListType.all
} }
, Cmd.batch , Cmd.batch
[ Api.getTags flags "" GetTagsResp [ Api.getTags flags "" GetTagsResp
@ -71,11 +97,11 @@ getSettings : Model -> Validated ClassifierSetting
getSettings model = getSettings model =
Data.Validated.map Data.Validated.map
(\sch -> (\sch ->
{ enabled = model.enabled { schedule =
, category = model.category
, schedule =
Data.CalEvent.makeEvent sch Data.CalEvent.makeEvent sch
, itemCount = Maybe.withDefault 0 model.itemCount , itemCount = Maybe.withDefault 0 model.itemCount
, listType = Data.ListType.toString model.categoryListType
, categoryList = Comp.Dropdown.getSelected model.categoryListModel
} }
) )
model.schedule model.schedule
@ -89,18 +115,11 @@ update flags msg model =
categories = categories =
Util.Tag.getCategories tl.items Util.Tag.getCategories tl.items
|> List.sort |> List.sort
in
( { model
| categoryModel = Comp.FixedDropdown.initString categories
, category =
if model.category == Nothing then
List.head categories
else lm =
model.category Comp.Dropdown.SetOptions categories
} in
, Cmd.none update flags (CategoryListMsg lm) model
)
GetTagsResp (Err _) -> GetTagsResp (Err _) ->
( model, Cmd.none ) ( model, Cmd.none )
@ -121,28 +140,6 @@ update flags msg model =
, Cmd.map ScheduleMsg cc , Cmd.map ScheduleMsg cc
) )
ToggleEnabled ->
( { model | enabled = not model.enabled }
, Cmd.none
)
CategoryMsg lmsg ->
let
( mm, ma ) =
Comp.FixedDropdown.update lmsg model.categoryModel
in
( { model
| categoryModel = mm
, category =
if ma == Nothing then
model.category
else
ma
}
, Cmd.none
)
ItemCountMsg lmsg -> ItemCountMsg lmsg ->
let let
( im, iv ) = ( im, iv ) =
@ -155,39 +152,68 @@ update flags msg model =
, Cmd.none , Cmd.none
) )
CategoryListMsg lm ->
let
( m_, cmd_ ) =
Comp.Dropdown.update lm model.categoryListModel
in
( { model | categoryListModel = m_ }
, Cmd.map CategoryListMsg cmd_
)
view : Model -> Html Msg CategoryListTypeMsg lm ->
view model = let
( m_, sel ) =
Comp.FixedDropdown.update lm model.categoryListTypeModel
newListType =
Maybe.withDefault model.categoryListType sel
in
( { model
| categoryListTypeModel = m_
, categoryListType = newListType
}
, Cmd.none
)
view : UiSettings -> Model -> Html Msg
view settings model =
let
catListTypeItem =
Comp.FixedDropdown.Item
model.categoryListType
(Data.ListType.label model.categoryListType)
in
div [] div []
[ div [ Markdown.toHtml [ class "ui basic segment" ]
[ class "field" """
]
[ div [ class "ui checkbox" ] Auto-tagging works by learning from existing documents. The more
[ input documents you have correctly tagged, the better. Learning is done
[ type_ "checkbox" periodically based on a schedule. You can specify tag-groups that
, onCheck (\_ -> ToggleEnabled) should either be used (whitelist) or not used (blacklist) for
, checked model.enabled learning.
]
[] Use an empty whitelist to disable auto tagging.
, label [] [ text "Enable classification" ]
, span [ class "small-info" ] """
[ text "Disable document classification if not needed." , div [ class "field" ]
] [ label [] [ text "Is the following a blacklist or whitelist?" ]
] , Html.map CategoryListTypeMsg
] (Comp.FixedDropdown.view (Just catListTypeItem) model.categoryListTypeModel)
, div [ class "ui basic segment" ]
[ text "Document classification tries to predict a tag for new incoming documents. This "
, text "works by learning from existing documents in order to find common patterns within "
, text "the text. The more documents you have correctly tagged, the better. Learning is done "
, text "periodically based on a schedule and you need to specify a tag-group that should "
, text "be used for learning."
] ]
, div [ class "field" ] , div [ class "field" ]
[ label [] [ text "Category" ] [ label []
, Html.map CategoryMsg [ case model.categoryListType of
(Comp.FixedDropdown.viewString model.category Data.ListType.Whitelist ->
model.categoryModel text "Include tag categories for learning"
)
Data.ListType.Blacklist ->
text "Exclude tag categories from learning"
]
, Html.map CategoryListMsg
(Comp.Dropdown.view settings model.categoryListModel)
] ]
, Html.map ItemCountMsg , Html.map ItemCountMsg
(Comp.IntField.viewWithInfo (Comp.IntField.viewWithInfo

View File

@ -280,7 +280,7 @@ view flags settings model =
, ( "invisible hidden", not flags.config.showClassificationSettings ) , ( "invisible hidden", not flags.config.showClassificationSettings )
] ]
] ]
[ text "Document Classifier" [ text "Auto-Tagging"
] ]
, div , div
[ classList [ classList
@ -289,13 +289,10 @@ view flags settings model =
] ]
] ]
[ Html.map ClassifierSettingMsg [ Html.map ClassifierSettingMsg
(Comp.ClassifierSettingsForm.view model.classifierModel) (Comp.ClassifierSettingsForm.view settings model.classifierModel)
, div [ class "ui vertical segment" ] , div [ class "ui vertical segment" ]
[ button [ button
[ classList [ class "ui small secondary basic button"
[ ( "ui small secondary basic button", True )
, ( "disabled", not model.classifierModel.enabled )
]
, title "Starts a task to train a classifier" , title "Starts a task to train a classifier"
, onClick StartClassifierTask , onClick StartClassifierTask
] ]

View File

@ -958,7 +958,6 @@ renderSuggestions model mkName idnames tagger =
] ]
, div [ class "menu" ] <| , div [ class "menu" ] <|
(idnames (idnames
|> List.take 5
|> List.map (\p -> a [ class "item", href "#", onClick (tagger p) ] [ text (mkName p) ]) |> List.map (\p -> a [ class "item", href "#", onClick (tagger p) ] [ text (mkName p) ])
) )
] ]
@ -969,7 +968,7 @@ renderOrgSuggestions : Model -> Html Msg
renderOrgSuggestions model = renderOrgSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.corrOrg) (List.take 6 model.itemProposals.corrOrg)
SetCorrOrgSuggestion SetCorrOrgSuggestion
@ -977,7 +976,7 @@ renderCorrPersonSuggestions : Model -> Html Msg
renderCorrPersonSuggestions model = renderCorrPersonSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.corrPerson) (List.take 6 model.itemProposals.corrPerson)
SetCorrPersonSuggestion SetCorrPersonSuggestion
@ -985,7 +984,7 @@ renderConcPersonSuggestions : Model -> Html Msg
renderConcPersonSuggestions model = renderConcPersonSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.concPerson) (List.take 6 model.itemProposals.concPerson)
SetConcPersonSuggestion SetConcPersonSuggestion
@ -993,7 +992,7 @@ renderConcEquipSuggestions : Model -> Html Msg
renderConcEquipSuggestions model = renderConcEquipSuggestions model =
renderSuggestions model renderSuggestions model
.name .name
(List.take 5 model.itemProposals.concEquipment) (List.take 6 model.itemProposals.concEquipment)
SetConcEquipSuggestion SetConcEquipSuggestion
@ -1001,7 +1000,7 @@ renderItemDateSuggestions : Model -> Html Msg
renderItemDateSuggestions model = renderItemDateSuggestions model =
renderSuggestions model renderSuggestions model
Util.Time.formatDate Util.Time.formatDate
(List.take 5 model.itemProposals.itemDate) (List.take 6 model.itemProposals.itemDate)
SetItemDateSuggestion SetItemDateSuggestion
@ -1009,7 +1008,7 @@ renderDueDateSuggestions : Model -> Html Msg
renderDueDateSuggestions model = renderDueDateSuggestions model =
renderSuggestions model renderSuggestions model
Util.Time.formatDate Util.Time.formatDate
(List.take 5 model.itemProposals.dueDate) (List.take 6 model.itemProposals.dueDate)
SetDueDateSuggestion SetDueDateSuggestion

View File

@ -11,6 +11,17 @@ type Language
= German = German
| English | English
| French | French
| Italian
| Spanish
| Portuguese
| Czech
| Danish
| Finnish
| Norwegian
| Swedish
| Russian
| Romanian
| Dutch
fromString : String -> Maybe Language fromString : String -> Maybe Language
@ -24,6 +35,39 @@ fromString str =
else if str == "fra" || str == "fr" || str == "french" then else if str == "fra" || str == "fr" || str == "french" then
Just French Just French
else if str == "ita" || str == "it" || str == "italian" then
Just Italian
else if str == "spa" || str == "es" || str == "spanish" then
Just Spanish
else if str == "por" || str == "pt" || str == "portuguese" then
Just Portuguese
else if str == "ces" || str == "cs" || str == "czech" then
Just Czech
else if str == "dan" || str == "da" || str == "danish" then
Just Danish
else if str == "nld" || str == "nd" || str == "dutch" then
Just Dutch
else if str == "fin" || str == "fi" || str == "finnish" then
Just Finnish
else if str == "nor" || str == "no" || str == "norwegian" then
Just Norwegian
else if str == "swe" || str == "sv" || str == "swedish" then
Just Swedish
else if str == "rus" || str == "ru" || str == "russian" then
Just Russian
else if str == "ron" || str == "ro" || str == "romanian" then
Just Romanian
else else
Nothing Nothing
@ -40,6 +84,39 @@ toIso3 lang =
French -> French ->
"fra" "fra"
Italian ->
"ita"
Spanish ->
"spa"
Portuguese ->
"por"
Czech ->
"ces"
Danish ->
"dan"
Finnish ->
"fin"
Norwegian ->
"nor"
Swedish ->
"swe"
Russian ->
"rus"
Romanian ->
"ron"
Dutch ->
"nld"
toName : Language -> String toName : Language -> String
toName lang = toName lang =
@ -53,7 +130,54 @@ toName lang =
French -> French ->
"French" "French"
Italian ->
"Italian"
Spanish ->
"Spanish"
Portuguese ->
"Portuguese"
Czech ->
"Czech"
Danish ->
"Danish"
Finnish ->
"Finnish"
Norwegian ->
"Norwegian"
Swedish ->
"Swedish"
Russian ->
"Russian"
Romanian ->
"Romanian"
Dutch ->
"Dutch"
all : List Language all : List Language
all = all =
[ German, English, French ] [ German
, English
, French
, Italian
, Spanish
, Portuguese
, Czech
, Dutch
, Danish
, Finnish
, Norwegian
, Swedish
, Russian
, Romanian
]

View File

@ -0,0 +1,50 @@
module Data.ListType exposing
( ListType(..)
, all
, fromString
, label
, toString
)
type ListType
= Blacklist
| Whitelist
all : List ListType
all =
[ Blacklist, Whitelist ]
toString : ListType -> String
toString lt =
case lt of
Blacklist ->
"blacklist"
Whitelist ->
"whitelist"
label : ListType -> String
label lt =
case lt of
Blacklist ->
"Blacklist"
Whitelist ->
"Whitelist"
fromString : String -> Maybe ListType
fromString str =
case String.toLower str of
"blacklist" ->
Just Blacklist
"whitelist" ->
Just Whitelist
_ ->
Nothing

View File

@ -98,10 +98,14 @@ let
}; };
text-analysis = { text-analysis = {
max-length = 10000; max-length = 10000;
nlp = {
mode = "full";
clear-interval = "15 minutes";
regex-ner = { regex-ner = {
enabled = true; max-entries = 1000;
file-cache-time = "1 minute"; file-cache-time = "1 minute";
}; };
};
classification = { classification = {
enabled = true; enabled = true;
item-count = 0; item-count = 0;
@ -118,7 +122,6 @@ let
]; ];
}; };
working-dir = "/tmp/docspell-analysis"; working-dir = "/tmp/docspell-analysis";
clear-stanford-nlp-interval = "15 minutes";
}; };
processing = { processing = {
max-due-date-years = 10; max-due-date-years = 10;
@ -772,9 +775,50 @@ in {
files. files.
''; '';
}; };
clear-stanford-nlp-interval = mkOption {
nlp = mkOption {
type = types.submodule({
options = {
mode = mkOption {
type = types.str; type = types.str;
default = defaults.text-analysis.clear-stanford-nlp-interval; default = defaults.text-analysis.nlp.mode;
description = ''
The mode for configuring NLP models:
1. full - builds the complete pipeline
2. basic - builds only the ner annotator
3. regexonly - matches each entry in your address book via regexps
4. disabled - doesn't use any stanford-nlp feature
The full and basic variants rely on pre-built language models
that are available for only 3 languages at the moment: German,
English and French.
Memory usage varies greatly among the languages. German has
quite large models that require about 1G heap. So joex should
run with at least -Xmx1400M when using mode=full.
The basic variant does quite a good job for German and
English. It might be worse for French, depending on the
type of text that is analysed. Joex should run with about 600M
heap; here again German uses the most.
The regexonly variant doesn't depend on a language. It roughly
works by converting all entries in your address book into
regexps and matching each one against the text. This can get
memory intensive, too, when the address book grows large. This
is included in full and basic by default, but can be used
independently by setting mode=regexonly.
When mode=disabled, the whole nlp pipeline is disabled and you
won't get any suggestions; only what the classifier returns (if
enabled).
'';
};
clear-interval = mkOption {
type = types.str;
default = defaults.text-analysis.nlp.clear-interval;
description = '' description = ''
Idle time after which the NLP caches are cleared to free Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled. memory. If <= 0 clearing the cache is disabled.
@ -785,19 +829,22 @@ in {
type = types.submodule({ type = types.submodule({
options = { options = {
enabled = mkOption { enabled = mkOption {
type = types.bool; type = types.int;
default = defaults.text-analysis.regex-ner.enabled; default = defaults.text-analysis.regex-ner.max-entries;
description = '' description = ''
Whether to enable custom NER annotation. This uses the address Whether to enable custom NER annotation. This uses the
book of a collective as input for NER tagging (to automatically address book of a collective as input for NER tagging (to
find correspondent and concerned entities). If the address book automatically find correspondent and concerned entities). If
is large, this can be quite memory intensive and also makes text the address book is large, this can be quite memory
analysis slower. But it greatly improves accuracy. If this is intensive and also makes text analysis much slower. But it
false, NER tagging uses only statistical models (that also work improves accuracy and can be used independently of the
quite well). language. If this is set to 0, it is effectively disabled
and NER tagging uses only statistical models (that also work
quite well, but are restricted to the languages mentioned
above).
This setting might be moved to the collective settings in the Note, this is only relevant if nlp-config.mode is not
future. "disabled".
''; '';
}; };
file-cache-time = mkOption { file-cache-time = mkOption {
@ -811,9 +858,14 @@ in {
}; };
}; };
}); });
default = defaults.text-analysis.regex-ner; default = defaults.text-analysis.nlp.regex-ner;
description = ""; description = "";
}; };
};
});
default = defaults.text-analysis.nlp;
description = "Configure NLP";
};
classification = mkOption { classification = mkOption {
type = types.submodule({ type = types.submodule({

View File

@ -20,6 +20,9 @@ The configuration of both components uses separate namespaces. The
configuration for the REST server is below `docspell.server`, while configuration for the REST server is below `docspell.server`, while
the one for joex is below `docspell.joex`. the one for joex is below `docspell.joex`.
You can therefore use two separate config files or one single file
containing both namespaces.
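For example, a single file can carry both namespaces (a sketch only,
the bodies are elided):

```
docspell.server {
  # settings for the REST server
}
docspell.joex {
  # settings for the job executor (joex)
}
```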
## JDBC ## JDBC
This configures the connection to the database. This has to be This configures the connection to the database. This has to be
@ -281,6 +284,70 @@ just some minutes, the web application obtains new ones
periodically. So a short time is recommended. periodically. So a short time is recommended.
## File Processing
Files are processed by the joex component, so all the respective
configuration is in its config only.
File processing involves several stages; detailed information can be
found [here](@/docs/joex/file-processing.md#text-analysis) and in the
corresponding sections in [joex default config](#joex).
The configuration allows you to define the external tools and to set
some limits to control memory usage. The sections are:
- `docspell.joex.extraction`
- `docspell.joex.text-analysis`
- `docspell.joex.convert`
Options to external commands can use variables that are replaced by
values at runtime. Variables are enclosed in double braces `{{…}}`.
Please see the default configuration for what variables exist per
command.
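As an illustrative sketch, a command definition could look like the
following; the exact config path and the `{{file}}`/`{{lang}}`
variables shown here are assumptions, so check the default
configuration for what each command actually supports:

```
command = {
  program = "tesseract"
  # variables in double braces are replaced at runtime
  args = [ "{{file}}", "stdout", "-l", "{{lang}}" ]
}
```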
### Classification
In `text-analysis.classification` you can define how many documents at
most should be used for learning. The default settings should work
well for most cases. However, it always depends on the amount of data
and the machine that runs joex. For example, by default the documents
to learn from are limited to 600 (`classification.item-count`) and
every text is cut after 8000 characters (`text-analysis.max-length`).
This is fine if *most* of your documents are small and only a few are
near 8000 characters. But if *all* your documents are very large, you
probably need to either assign more heap memory or lower the limits.
Classification can also be disabled when it is not needed.
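A minimal sketch of the keys involved, using the numbers mentioned
above (not a recommendation):

```
docspell.joex.text-analysis {
  max-length = 8000
  classification {
    enabled = true
    item-count = 600
  }
}
```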
### NLP
This setting defines which NLP mode to use. It defaults to `full`,
which requires more memory for certain languages (with the advantage
of better results). Other values are `basic`, `regexonly` and
`disabled`. The modes `full` and `basic` use pre-defined language
models for processing documents in German, English and French.
These require some amount of memory (see below).
The mode `basic` is the "light" variant of `full`. It doesn't use
all NLP features, which makes memory consumption much lower, but comes
with the compromise of less accurate results.
The mode `regexonly` doesn't use pre-defined language models, even if
available. It checks your address book against a document to find
metadata. That means it is language-independent. Also, when using
`full` or `basic` with languages for which no pre-defined models
exist, it will degrade to `regexonly` for these.
The mode `disabled` skips NLP processing completely. This has the
least impact on memory consumption, obviously, but then only the classifier
is used to find metadata.
You might want to try different modes and see what combination best
suits your usage pattern and the machine running joex. If a powerful
machine is used, simply leave the defaults. When running on an older
Raspberry Pi, for example, you might need to adjust things.
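As a sketch, a memory-constrained joex could run the lighter
pipeline; the keys mirror the defaults shown in the joex config:

```
docspell.joex.text-analysis.nlp {
  mode = "basic"  # one of: full, basic, regexonly, disabled
  clear-interval = "15 minutes"
  regex-ner.max-entries = 1000
}
```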
# File Format # File Format
The format of the configuration files can be The format of the configuration files can be

Some files were not shown because too many files have changed in this diff.