Merge pull request #581 from eikek/text-analysis-improvements

Text analysis improvements

Commit df5f9e8c51
@ -24,4 +24,4 @@ before_script:
  - export TZ=Europe/Berlin

script:
  - sbt ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"
  - sbt -J-XX:+UseG1GC ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"
@ -17,6 +17,9 @@ If you don't like to sign up to github/matrix or like to reach me
personally, you can send a mail to `info [at] docspell.org` or reach me on
matrix via `@eikek:matrix.org`.

If you find a feature request already filed, you can vote on it. I
tend to prefer the most voted requests over those without much attention.

## Documentation
README.md (31 lines changed)
@ -9,25 +9,28 @@

# Docspell

Docspell is a personal document organizer. You'll need a scanner to
convert your papers into files. Docspell can then assist in
organizing the resulting mess :wink:.
convert your papers into files. Docspell can then assist in organizing
the resulting mess :wink:. It is targeted at home use, i.e. families
and households and also for (smaller) groups/companies.

You can associate tags, set correspondents, what a document is
concerned with, a name, a date and much more. If your documents are
associated with such meta data, you should be able to quickly find
them later using the search feature. But adding this manually to each
document is a tedious task. Docspell can help you by suggesting
correspondents, guessing tags or finding dates using machine learning
techniques. This makes adding metadata to your documents a lot easier.
You can associate tags, set correspondents and lots of other
predefined and custom metadata. If your documents are associated with
such meta data, you can quickly find them later using the search
feature. But adding this manually is a tedious task. Docspell can help
by suggesting correspondents, guessing tags or finding dates using
machine learning. It can learn metadata from existing documents and
find things using NLP. This makes adding metadata to your documents a
lot easier. For machine learning, it relies on the free (GPL)
[Stanford Core NLP library](https://github.com/stanfordnlp/CoreNLP).

Docspell also runs OCR (if needed) on your documents, can provide
fulltext search and has great e-mail integration. Everything is
accessible via a REST/HTTP api. A mobile friendly SPA web application
is provided as the user interface and an [Android
app](https://github.com/docspell/android-client) for conveniently
uploading files from your phone/tablet. The [feature
overview](https://docspell.org/#feature-selection) has a more complete
list.
is the default user interface. An [Android
app](https://github.com/docspell/android-client) exists for
conveniently uploading files from your phone/tablet. The [feature
overview](https://docspell.org/#feature-selection) lists some more
points.

## Impressions
@ -131,7 +131,8 @@ val openapiScalaSettings = Seq(
case "ident" =>
  field => field.copy(typeDef = TypeDef("Ident", Imports("docspell.common.Ident")))
case "accountid" =>
  field => field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
  field =>
    field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
case "collectivestate" =>
  field =>
    field.copy(typeDef =
@ -190,6 +191,9 @@ val openapiScalaSettings = Seq(
  field.copy(typeDef =
    TypeDef("CustomFieldType", Imports("docspell.common.CustomFieldType"))
  )
case "listtype" =>
  field =>
    field.copy(typeDef = TypeDef("ListType", Imports("docspell.common.ListType")))
}))
)
@ -15,6 +15,17 @@ RUN apk add --no-cache openjdk11-jre \
    tesseract-ocr \
    tesseract-ocr-data-deu \
    tesseract-ocr-data-fra \
    tesseract-ocr-data-ita \
    tesseract-ocr-data-spa \
    tesseract-ocr-data-por \
    tesseract-ocr-data-ces \
    tesseract-ocr-data-nld \
    tesseract-ocr-data-dan \
    tesseract-ocr-data-fin \
    tesseract-ocr-data-nor \
    tesseract-ocr-data-swe \
    tesseract-ocr-data-rus \
    tesseract-ocr-data-ron \
    unpaper \
    wkhtmltopdf \
    libreoffice \
@ -0,0 +1,7 @@
package docspell.analysis

import java.nio.file.Path

import docspell.common._

case class NlpSettings(lang: Language, highRecall: Boolean, regexNer: Option[Path])
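A quick illustration of how such a value might be filled; the concrete values, in particular the regexner file path, are made-up assumptions for the sketch and not project defaults:

import java.nio.file.Paths

import docspell.common._

// Hypothetical settings: German NLP, no high-recall mode, plus an
// optional regexner mapping file (the path is an assumption).
val settings = NlpSettings(
  lang = Language.German,
  highRecall = false,
  regexNer = Some(Paths.get("/var/docspell/regexner.txt"))
)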
@ -1,29 +1,30 @@
|
||||
package docspell.analysis
|
||||
|
||||
import cats.Applicative
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.analysis.classifier.{StanfordTextClassifier, TextClassifier}
|
||||
import docspell.analysis.contact.Contact
|
||||
import docspell.analysis.date.DateFind
|
||||
import docspell.analysis.nlp.PipelineCache
|
||||
import docspell.analysis.nlp.StanfordNerClassifier
|
||||
import docspell.analysis.nlp.StanfordNerSettings
|
||||
import docspell.analysis.nlp.StanfordTextClassifier
|
||||
import docspell.analysis.nlp.TextClassifier
|
||||
import docspell.analysis.nlp._
|
||||
import docspell.common._
|
||||
|
||||
import org.log4s.getLogger
|
||||
|
||||
trait TextAnalyser[F[_]] {
|
||||
|
||||
def annotate(
|
||||
logger: Logger[F],
|
||||
settings: StanfordNerSettings,
|
||||
settings: NlpSettings,
|
||||
cacheKey: Ident,
|
||||
text: String
|
||||
): F[TextAnalyser.Result]
|
||||
|
||||
def classifier(blocker: Blocker)(implicit CS: ContextShift[F]): TextClassifier[F]
|
||||
def classifier: TextClassifier[F]
|
||||
}
|
||||
object TextAnalyser {
|
||||
private[this] val logger = getLogger
|
||||
|
||||
case class Result(labels: Vector[NerLabel], dates: Vector[NerDateLabel]) {
|
||||
|
||||
@ -31,31 +32,30 @@ object TextAnalyser {
|
||||
labels ++ dates.map(dl => dl.label.copy(label = dl.date.toString))
|
||||
}
|
||||
|
||||
def create[F[_]: Concurrent: Timer](
|
||||
cfg: TextAnalysisConfig
|
||||
def create[F[_]: Concurrent: Timer: ContextShift](
|
||||
cfg: TextAnalysisConfig,
|
||||
blocker: Blocker
|
||||
): Resource[F, TextAnalyser[F]] =
|
||||
Resource
|
||||
.liftF(PipelineCache[F](cfg.clearStanfordPipelineInterval))
|
||||
.map(cache =>
|
||||
.liftF(Nlp(cfg.nlpConfig))
|
||||
.map(stanfordNer =>
|
||||
new TextAnalyser[F] {
|
||||
def annotate(
|
||||
logger: Logger[F],
|
||||
settings: StanfordNerSettings,
|
||||
settings: NlpSettings,
|
||||
cacheKey: Ident,
|
||||
text: String
|
||||
): F[TextAnalyser.Result] =
|
||||
for {
|
||||
input <- textLimit(logger, text)
|
||||
tags0 <- stanfordNer(cacheKey, settings, input)
|
||||
tags0 <- stanfordNer(Nlp.Input(cacheKey, settings, logger, input))
|
||||
tags1 <- contactNer(input)
|
||||
dates <- dateNer(settings.lang, input)
|
||||
list = tags0 ++ tags1
|
||||
spans = NerLabelSpan.build(list)
|
||||
} yield Result(spans ++ list, dates)
|
||||
|
||||
def classifier(blocker: Blocker)(implicit
|
||||
CS: ContextShift[F]
|
||||
): TextClassifier[F] =
|
||||
def classifier: TextClassifier[F] =
|
||||
new StanfordTextClassifier[F](cfg.classifier, blocker)
|
||||
|
||||
private def textLimit(logger: Logger[F], text: String): F[String] =
|
||||
@ -66,10 +66,6 @@ object TextAnalyser {
|
||||
s" Analysing only first ${cfg.maxLength} characters."
|
||||
) *> text.take(cfg.maxLength).pure[F]
|
||||
|
||||
private def stanfordNer(key: Ident, settings: StanfordNerSettings, text: String)
|
||||
: F[Vector[NerLabel]] =
|
||||
StanfordNerClassifier.nerAnnotate[F](key.id, cache)(settings, text)
|
||||
|
||||
private def contactNer(text: String): F[Vector[NerLabel]] =
|
||||
Sync[F].delay {
|
||||
Contact.annotate(text)
|
||||
@ -82,4 +78,36 @@ object TextAnalyser {
|
||||
}
|
||||
)
|
||||
|
||||
/** Provides the nlp pipeline based on the configuration. */
|
||||
private object Nlp {
|
||||
def apply[F[_]: Concurrent: Timer: BracketThrow](
|
||||
cfg: TextAnalysisConfig.NlpConfig
|
||||
): F[Input[F] => F[Vector[NerLabel]]] =
|
||||
cfg.mode match {
|
||||
case NlpMode.Disabled =>
|
||||
Logger.log4s(logger).info("NLP is disabled as defined in config.") *>
|
||||
Applicative[F].pure(_ => Vector.empty[NerLabel].pure[F])
|
||||
case _ =>
|
||||
PipelineCache(cfg.clearInterval)(
|
||||
Annotator[F](cfg.mode),
|
||||
Annotator.clearCaches[F]
|
||||
)
|
||||
.map(annotate[F])
|
||||
}
|
||||
|
||||
final case class Input[F[_]](
|
||||
key: Ident,
|
||||
settings: NlpSettings,
|
||||
logger: Logger[F],
|
||||
text: String
|
||||
)
|
||||
|
||||
def annotate[F[_]: BracketThrow](
|
||||
cache: PipelineCache[F]
|
||||
)(input: Input[F]): F[Vector[NerLabel]] =
|
||||
cache
|
||||
.obtain(input.key.id, input.settings)
|
||||
.use(ann => ann.nerAnnotate(input.logger)(input.text))
|
||||
|
||||
}
|
||||
}
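For orientation, a sketch of how a caller drives the analyser after this change; the passed-in values (config, blocker, logger, settings, cache key, text) are assumed to come from the calling job and are not defined here:

import cats.effect._

import docspell.analysis._
import docspell.common._

// Create the analyser as a Resource and run NER + date finding on one text.
// The NlpSettings and cache key would normally come from collective settings.
def analyse[F[_]: Concurrent: Timer: ContextShift](
    cfg: TextAnalysisConfig,
    blocker: Blocker,
    logger: Logger[F],
    settings: NlpSettings,
    cacheKey: Ident,
    text: String
): F[TextAnalyser.Result] =
  TextAnalyser.create[F](cfg, blocker).use(_.annotate(logger, settings, cacheKey, text))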
|
||||
|
@ -1,10 +1,16 @@
package docspell.analysis

import docspell.analysis.nlp.TextClassifierConfig
import docspell.analysis.TextAnalysisConfig.NlpConfig
import docspell.analysis.classifier.TextClassifierConfig
import docspell.common._

case class TextAnalysisConfig(
  maxLength: Int,
  clearStanfordPipelineInterval: Duration,
  nlpConfig: NlpConfig,
  classifier: TextClassifierConfig
)

object TextAnalysisConfig {

  case class NlpConfig(clearInterval: Duration, mode: NlpMode)
}
@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier

import java.nio.file.Path

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier

import java.nio.file.Path
@ -7,8 +7,11 @@ import cats.effect.concurrent.Ref
|
||||
import cats.implicits._
|
||||
import fs2.Stream
|
||||
|
||||
import docspell.analysis.nlp.TextClassifier._
|
||||
import docspell.analysis.classifier
|
||||
import docspell.analysis.classifier.TextClassifier._
|
||||
import docspell.analysis.nlp.Properties
|
||||
import docspell.common._
|
||||
import docspell.common.syntax.FileSyntax._
|
||||
|
||||
import edu.stanford.nlp.classify.ColumnDataClassifier
|
||||
|
||||
@ -26,7 +29,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
|
||||
.use { dir =>
|
||||
for {
|
||||
rawData <- writeDataFile(blocker, dir, data)
|
||||
_ <- logger.info(s"Learning from ${rawData.count} items.")
|
||||
_ <- logger.debug(s"Learning from ${rawData.count} items.")
|
||||
trainData <- splitData(logger, rawData)
|
||||
scores <- cfg.classifierConfigs.traverse(m => train(logger, trainData, m))
|
||||
sorted = scores.sortBy(-_.score)
|
||||
@ -43,7 +46,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
|
||||
case Some(text) =>
|
||||
Sync[F].delay {
|
||||
val cls = ColumnDataClassifier.getClassifier(
|
||||
model.model.normalize().toAbsolutePath().toString()
|
||||
model.model.normalize().toAbsolutePath.toString
|
||||
)
|
||||
val cat = cls.classOf(cls.makeDatumFromLine("\t\t" + normalisedText(text)))
|
||||
Option(cat)
|
||||
@ -65,7 +68,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
|
||||
val cdc = new ColumnDataClassifier(Properties.fromMap(amendProps(in, props)))
|
||||
cdc.trainClassifier(in.train.toString())
|
||||
val score = cdc.testClassifier(in.test.toString())
|
||||
TrainResult(score.first(), ClassifierModel(in.modelFile))
|
||||
TrainResult(score.first(), classifier.ClassifierModel(in.modelFile))
|
||||
}
|
||||
_ <- logger.debug(s"Trained with result $res")
|
||||
} yield res
|
||||
@ -136,9 +139,9 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
|
||||
props: Map[String, String]
|
||||
): Map[String, String] =
|
||||
prepend("2.", props) ++ Map(
|
||||
"trainFile" -> trainData.train.normalize().toAbsolutePath().toString(),
|
||||
"testFile" -> trainData.test.normalize().toAbsolutePath().toString(),
|
||||
"serializeTo" -> trainData.modelFile.normalize().toAbsolutePath().toString()
|
||||
"trainFile" -> trainData.train.absolutePathAsString,
|
||||
"testFile" -> trainData.test.absolutePathAsString,
|
||||
"serializeTo" -> trainData.modelFile.absolutePathAsString
|
||||
).toList
|
||||
|
||||
case class RawData(count: Long, file: Path)
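The `absolutePathAsString` extension used above replaces the repeated `normalize().toAbsolutePath.toString` chains. A small sketch of the assumed equivalence (the file name is made up; the exact definition lives in `docspell.common.syntax.FileSyntax`):

import java.nio.file.Paths

import docspell.common.syntax.FileSyntax._

val model = Paths.get("target", "model.ser.gz")
// Assumed to be the same string the old chain produced:
val asString = model.absolutePathAsString
// model.normalize().toAbsolutePath.toString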
|
@ -1,9 +1,9 @@
|
||||
package docspell.analysis.nlp
|
||||
package docspell.analysis.classifier
|
||||
|
||||
import cats.data.Kleisli
|
||||
import fs2.Stream
|
||||
|
||||
import docspell.analysis.nlp.TextClassifier.Data
|
||||
import docspell.analysis.classifier.TextClassifier.Data
|
||||
import docspell.common._
|
||||
|
||||
trait TextClassifier[F[_]] {
|
@ -1,4 +1,4 @@
|
||||
package docspell.analysis.nlp
|
||||
package docspell.analysis.classifier
|
||||
|
||||
import java.nio.file.Path
|
||||
|
@ -41,23 +41,41 @@ object DateFind {
|
||||
}
|
||||
|
||||
object SimpleDate {
|
||||
val p0 = (readYear >> readMonth >> readDay).map { case ((y, m), d) =>
|
||||
List(SimpleDate(y, m, d))
|
||||
def pattern0(lang: Language) = (readYear >> readMonth(lang) >> readDay).map {
|
||||
case ((y, m), d) =>
|
||||
List(SimpleDate(y, m, d))
|
||||
}
|
||||
val p1 = (readDay >> readMonth >> readYear).map { case ((d, m), y) =>
|
||||
List(SimpleDate(y, m, d))
|
||||
def pattern1(lang: Language) = (readDay >> readMonth(lang) >> readYear).map {
|
||||
case ((d, m), y) =>
|
||||
List(SimpleDate(y, m, d))
|
||||
}
|
||||
val p2 = (readMonth >> readDay >> readYear).map { case ((m, d), y) =>
|
||||
List(SimpleDate(y, m, d))
|
||||
def pattern2(lang: Language) = (readMonth(lang) >> readDay >> readYear).map {
|
||||
case ((m, d), y) =>
|
||||
List(SimpleDate(y, m, d))
|
||||
}
|
||||
|
||||
// ymd ✔, ydm, dmy ✔, dym, myd, mdy ✔
|
||||
def fromParts(parts: List[Word], lang: Language): List[SimpleDate] = {
|
||||
val ymd = pattern0(lang)
|
||||
val dmy = pattern1(lang)
|
||||
val mdy = pattern2(lang)
|
||||
// most is from wikipedia…
|
||||
val p = lang match {
|
||||
case Language.English =>
|
||||
p2.alt(p1).map(t => t._1 ++ t._2).or(p2).or(p0).or(p1)
|
||||
case Language.German => p1.or(p0).or(p2)
|
||||
case Language.French => p1.or(p0).or(p2)
|
||||
mdy.alt(dmy).map(t => t._1 ++ t._2).or(mdy).or(ymd).or(dmy)
|
||||
case Language.German => dmy.or(ymd).or(mdy)
|
||||
case Language.French => dmy.or(ymd).or(mdy)
|
||||
case Language.Italian => dmy.or(ymd).or(mdy)
|
||||
case Language.Spanish => dmy.or(ymd).or(mdy)
|
||||
case Language.Czech => dmy.or(ymd).or(mdy)
|
||||
case Language.Danish => dmy.or(ymd).or(mdy)
|
||||
case Language.Finnish => dmy.or(ymd).or(mdy)
|
||||
case Language.Norwegian => dmy.or(ymd).or(mdy)
|
||||
case Language.Portuguese => dmy.or(ymd).or(mdy)
|
||||
case Language.Romanian => dmy.or(ymd).or(mdy)
|
||||
case Language.Russian => dmy.or(ymd).or(mdy)
|
||||
case Language.Swedish => ymd.or(dmy).or(mdy)
|
||||
case Language.Dutch => dmy.or(ymd).or(mdy)
|
||||
}
|
||||
p.read(parts) match {
|
||||
case Result.Success(sds, _) =>
|
||||
@ -76,9 +94,11 @@ object DateFind {
|
||||
}
|
||||
)
|
||||
|
||||
def readMonth: Reader[Int] =
|
||||
def readMonth(lang: Language): Reader[Int] =
|
||||
Reader.readFirst(w =>
|
||||
Some(months.indexWhere(_.contains(w.value))).filter(_ >= 0).map(_ + 1)
|
||||
Some(MonthName.getAll(lang).indexWhere(_.contains(w.value)))
|
||||
.filter(_ >= 0)
|
||||
.map(_ + 1)
|
||||
)
|
||||
|
||||
def readDay: Reader[Int] =
|
||||
@ -150,20 +170,5 @@ object DateFind {
|
||||
Failure
|
||||
}
|
||||
}
|
||||
|
||||
private val months = List(
|
||||
List("jan", "january", "januar", "01"),
|
||||
List("feb", "february", "februar", "02"),
|
||||
List("mar", "march", "märz", "marz", "03"),
|
||||
List("apr", "april", "04"),
|
||||
List("may", "mai", "05"),
|
||||
List("jun", "june", "juni", "06"),
|
||||
List("jul", "july", "juli", "07"),
|
||||
List("aug", "august", "08"),
|
||||
List("sep", "september", "09"),
|
||||
List("oct", "october", "oktober", "10"),
|
||||
List("nov", "november", "11"),
|
||||
List("dec", "december", "dezember", "12")
|
||||
)
|
||||
}
|
||||
}
|
||||
|
@ -0,0 +1,270 @@
|
||||
package docspell.analysis.date
|
||||
|
||||
import docspell.common.Language
|
||||
|
||||
object MonthName {
|
||||
|
||||
def getAll(lang: Language): List[List[String]] =
|
||||
merge(numbers, forLang(lang))
|
||||
|
||||
private def merge(n0: List[List[String]], ns: List[List[String]]*): List[List[String]] =
|
||||
ns.foldLeft(n0) { (res, el) =>
|
||||
res.zip(el).map({ case (a, b) => a ++ b })
|
||||
}
|
||||
|
||||
private def forLang(lang: Language): List[List[String]] =
|
||||
lang match {
|
||||
case Language.English =>
|
||||
english
|
||||
case Language.German =>
|
||||
german
|
||||
case Language.French =>
|
||||
french
|
||||
case Language.Italian =>
|
||||
italian
|
||||
case Language.Spanish =>
|
||||
spanish
|
||||
case Language.Swedish =>
|
||||
swedish
|
||||
case Language.Norwegian =>
|
||||
norwegian
|
||||
case Language.Dutch =>
|
||||
dutch
|
||||
case Language.Czech =>
|
||||
czech
|
||||
case Language.Danish =>
|
||||
danish
|
||||
case Language.Portuguese =>
|
||||
portuguese
|
||||
case Language.Romanian =>
|
||||
romanian
|
||||
case Language.Finnish =>
|
||||
finnish
|
||||
case Language.Russian =>
|
||||
russian
|
||||
}
|
||||
|
||||
private val numbers = List(
|
||||
List("01"),
|
||||
List("02"),
|
||||
List("03"),
|
||||
List("04"),
|
||||
List("05"),
|
||||
List("06"),
|
||||
List("07"),
|
||||
List("08"),
|
||||
List("09"),
|
||||
List("10"),
|
||||
List("11"),
|
||||
List("12")
|
||||
)
|
||||
|
||||
private val english = List(
|
||||
List("jan", "january"),
|
||||
List("feb", "february"),
|
||||
List("mar", "march"),
|
||||
List("apr", "april"),
|
||||
List("may"),
|
||||
List("jun", "june"),
|
||||
List("jul", "july"),
|
||||
List("aug", "august"),
|
||||
List("sept", "september"),
|
||||
List("oct", "october"),
|
||||
List("nov", "november"),
|
||||
List("dec", "december")
|
||||
)
|
||||
|
||||
private val german = List(
|
||||
List("jan", "januar"),
|
||||
List("feb", "februar"),
|
||||
List("märz"),
|
||||
List("apr", "april"),
|
||||
List("mai"),
|
||||
List("juni"),
|
||||
List("juli"),
|
||||
List("aug", "august"),
|
||||
List("sept", "september"),
|
||||
List("okt", "oktober"),
|
||||
List("nov", "november"),
|
||||
List("dez", "dezember")
|
||||
)
|
||||
|
||||
private val french = List(
|
||||
List("janv", "janvier"),
|
||||
List("févr", "fevr", "février", "fevrier"),
|
||||
List("mars"),
|
||||
List("avril"),
|
||||
List("mai"),
|
||||
List("juin"),
|
||||
List("juil", "juillet"),
|
||||
List("aout", "août"),
|
||||
List("sept", "septembre"),
|
||||
List("oct", "octobre"),
|
||||
List("nov", "novembre"),
|
||||
List("dec", "déc", "décembre", "decembre")
|
||||
)
|
||||
|
||||
private val italian = List(
|
||||
List("genn", "gennaio"),
|
||||
List("febbr", "febbraio"),
|
||||
List("mar", "marzo"),
|
||||
List("apr", "aprile"),
|
||||
List("magg", "maggio"),
|
||||
List("giugno"),
|
||||
List("luglio"),
|
||||
List("ag", "agosto"),
|
||||
List("sett", "settembre"),
|
||||
List("ott", "ottobre"),
|
||||
List("nov", "novembre"),
|
||||
List("dic", "dicembre")
|
||||
)
|
||||
|
||||
private val spanish = List(
|
||||
List("ene", "enero"),
|
||||
List("feb", "febrero"),
|
||||
List("mar", "marzo"),
|
||||
List("abr", "abril"),
|
||||
List("may", "mayo"),
|
||||
List("jun"),
|
||||
List("jul"),
|
||||
List("ago", "agosto"),
|
||||
List("sep", "septiembre"),
|
||||
List("oct", "octubre"),
|
||||
List("nov", "noviembre"),
|
||||
List("dic", "diciembre")
|
||||
)
|
||||
|
||||
private val swedish = List(
|
||||
List("jan", "januari"),
|
||||
List("febr", "februari"),
|
||||
List("mars"),
|
||||
List("april"),
|
||||
List("maj"),
|
||||
List("juni"),
|
||||
List("juli"),
|
||||
List("aug", "augusti"),
|
||||
List("sept", "september"),
|
||||
List("okt", "oktober"),
|
||||
List("nov", "november"),
|
||||
List("dec", "december")
|
||||
)
|
||||
private val norwegian = List(
|
||||
List("jan", "januar"),
|
||||
List("febr", "februar"),
|
||||
List("mars"),
|
||||
List("april"),
|
||||
List("mai"),
|
||||
List("juni"),
|
||||
List("juli"),
|
||||
List("aug", "august"),
|
||||
List("sept", "september"),
|
||||
List("okt", "oktober"),
|
||||
List("nov", "november"),
|
||||
List("des", "desember")
|
||||
)
|
||||
|
||||
private val czech = List(
|
||||
List("led", "leden"),
|
||||
List("un", "ún", "únor", "unor"),
|
||||
List("brez", "březen", "brezen"),
|
||||
List("dub", "duben"),
|
||||
List("kvet", "květen"),
|
||||
List("cerv", "červen"),
|
||||
List("cerven", "červenec"),
|
||||
List("srp", "srpen"),
|
||||
List("zari", "září"),
|
||||
List("ríj", "rij", "říjen"),
|
||||
List("list", "listopad"),
|
||||
List("pros", "prosinec")
|
||||
)
|
||||
|
||||
private val romanian = List(
|
||||
List("ian", "ianuarie"),
|
||||
List("feb", "februarie"),
|
||||
List("mar", "martie"),
|
||||
List("apr", "aprilie"),
|
||||
List("mai"),
|
||||
List("iunie"),
|
||||
List("iulie"),
|
||||
List("aug", "august"),
|
||||
List("sept", "septembrie"),
|
||||
List("oct", "octombrie"),
|
||||
List("noem", "nov", "noiembrie"),
|
||||
List("dec", "decembrie")
|
||||
)
|
||||
|
||||
private val danish = List(
|
||||
List("jan", "januar"),
|
||||
List("febr", "februar"),
|
||||
List("marts"),
|
||||
List("april"),
|
||||
List("maj"),
|
||||
List("juni"),
|
||||
List("juli"),
|
||||
List("aug", "august"),
|
||||
List("sept", "september"),
|
||||
List("okt", "oktober"),
|
||||
List("nov", "november"),
|
||||
List("dec", "december")
|
||||
)
|
||||
|
||||
private val portuguese = List(
|
||||
List("jan", "janeiro"),
|
||||
List("fev", "fevereiro"),
|
||||
List("março", "marco"),
|
||||
List("abril"),
|
||||
List("maio"),
|
||||
List("junho"),
|
||||
List("julho"),
|
||||
List("agosto"),
|
||||
List("set", "setembro"),
|
||||
List("out", "outubro"),
|
||||
List("nov", "novembro"),
|
||||
List("dez", "dezembro")
|
||||
)
|
||||
|
||||
private val finnish = List(
|
||||
List("tammikuu"),
|
||||
List("helmikuu"),
|
||||
List("maaliskuu"),
|
||||
List("huhtikuu"),
|
||||
List("toukokuu"),
|
||||
List("kesäkuu"),
|
||||
List("heinäkuu"),
|
||||
List("elokuu"),
|
||||
List("syyskuu"),
|
||||
List("lokakuu"),
|
||||
List("marraskuu"),
|
||||
List("joulukuu")
|
||||
)
|
||||
|
||||
private val russian = List(
|
||||
List("январь"),
|
||||
List("февраль"),
|
||||
List("март"),
|
||||
List("апрель"),
|
||||
List("май"),
|
||||
List("июнь"),
|
||||
List("июль"),
|
||||
List("август"),
|
||||
List("сентябрь"),
|
||||
List("октябрь"),
|
||||
List("ноябрь"),
|
||||
List("декабрь")
|
||||
)
|
||||
|
||||
private val dutch = List(
|
||||
List("jan", "januari"),
|
||||
List("feb", "februari"),
|
||||
List("maart"),
|
||||
List("apr", "april"),
|
||||
List("mei"),
|
||||
List("juni"),
|
||||
List("juli"),
|
||||
List("aug", "augustus"),
|
||||
List("sept", "september"),
|
||||
List("okt", "oct", "oktober"),
|
||||
List("nov", "november"),
|
||||
List("dec", "december")
|
||||
)
|
||||
}
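Since `merge` zips the numeric forms with the language-specific names, `getAll` always contains both digits and spellings per month; a small sketch of what that looks like (the shown values follow directly from the lists above):

import docspell.common.Language

val german = MonthName.getAll(Language.German)
// german.head == List("01", "jan", "januar")
// german(2)   == List("03", "märz")
// The numeric forms mean a date like "03.2021" still matches even when
// no month name is spelled out.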
|
@ -0,0 +1,98 @@
|
||||
package docspell.analysis.nlp
|
||||
|
||||
import cats.effect.Sync
|
||||
import cats.implicits._
|
||||
import cats.{Applicative, FlatMap}
|
||||
|
||||
import docspell.analysis.NlpSettings
|
||||
import docspell.common._
|
||||
|
||||
import edu.stanford.nlp.pipeline.StanfordCoreNLP
|
||||
|
||||
/** Analyses a text to mark certain parts with a `NerLabel`. */
|
||||
trait Annotator[F[_]] { self =>
|
||||
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]]
|
||||
|
||||
def ++(next: Annotator[F])(implicit F: FlatMap[F]): Annotator[F] =
|
||||
new Annotator[F] {
|
||||
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
|
||||
for {
|
||||
n0 <- self.nerAnnotate(logger)(text)
|
||||
n1 <- next.nerAnnotate(logger)(text)
|
||||
} yield (n0 ++ n1).distinct
|
||||
}
|
||||
}
|
||||
|
||||
object Annotator {
|
||||
|
||||
/** Creates an annotator according to the given `mode` and `settings`.
|
||||
*
|
||||
* There are the following ways:
|
||||
*
|
||||
* - disabled: it returns a no-op annotator that always gives an empty list
|
||||
* - full: the complete stanford pipeline is used
|
||||
* - basic: only the ner classifier is used
|
||||
*
|
||||
* Additionally, if there is a regexNer-file specified, the regexner annotator is
|
||||
* also run. In case the full pipeline is used, this is already included.
|
||||
*/
|
||||
def apply[F[_]: Sync](mode: NlpMode)(settings: NlpSettings): Annotator[F] =
|
||||
mode match {
|
||||
case NlpMode.Disabled =>
|
||||
Annotator.none[F]
|
||||
case NlpMode.Full =>
|
||||
StanfordNerSettings.fromNlpSettings(settings) match {
|
||||
case Some(ss) =>
|
||||
Annotator.pipeline(StanfordNerAnnotator.makePipeline(ss))
|
||||
case None =>
|
||||
Annotator.none[F]
|
||||
}
|
||||
case NlpMode.Basic =>
|
||||
StanfordNerSettings.fromNlpSettings(settings) match {
|
||||
case Some(StanfordNerSettings.Full(lang, _, Some(file))) =>
|
||||
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang)) ++
|
||||
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
|
||||
case Some(StanfordNerSettings.Full(lang, _, None)) =>
|
||||
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang))
|
||||
case Some(StanfordNerSettings.RegexOnly(file)) =>
|
||||
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
|
||||
case None =>
|
||||
Annotator.none[F]
|
||||
}
|
||||
case NlpMode.RegexOnly =>
|
||||
settings.regexNer match {
|
||||
case Some(file) =>
|
||||
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
|
||||
case None =>
|
||||
Annotator.none[F]
|
||||
}
|
||||
}
|
||||
|
||||
def none[F[_]: Applicative]: Annotator[F] =
|
||||
new Annotator[F] {
|
||||
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
|
||||
logger.debug("Running empty annotator. NLP not supported.") *>
|
||||
Vector.empty[NerLabel].pure[F]
|
||||
}
|
||||
|
||||
def basic[F[_]: Sync](ann: BasicCRFAnnotator.Annotator): Annotator[F] =
|
||||
new Annotator[F] {
|
||||
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
|
||||
Sync[F].delay(
|
||||
BasicCRFAnnotator.nerAnnotate(ann)(text)
|
||||
)
|
||||
}
|
||||
|
||||
def pipeline[F[_]: Sync](cp: StanfordCoreNLP): Annotator[F] =
|
||||
new Annotator[F] {
|
||||
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
|
||||
Sync[F].delay(StanfordNerAnnotator.nerAnnotate(cp, text))
|
||||
|
||||
}
|
||||
|
||||
def clearCaches[F[_]: Sync]: F[Unit] =
|
||||
Sync[F].delay {
|
||||
StanfordCoreNLP.clearAnnotatorPool()
|
||||
BasicCRFAnnotator.Cache.clearCache()
|
||||
}
|
||||
}
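For example, the `NlpMode.Basic` branch above composes two annotators with `++`; the same composition can be written directly. This is a sketch only, with `regexFile` as an assumed path value:

import java.nio.file.Path

import cats.effect.Sync

import docspell.common.Language.NLPLanguage

// CRF-only NER plus a regexner pass; `++` runs both and de-duplicates
// the combined labels.
def basicWithRegex[F[_]: Sync](lang: NLPLanguage, regexFile: Path): Annotator[F] =
  Annotator.basic[F](BasicCRFAnnotator.Cache.getAnnotator(lang)) ++
    Annotator.pipeline[F](StanfordNerAnnotator.regexNerPipeline(regexFile))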
|
@ -0,0 +1,94 @@
|
||||
package docspell.analysis.nlp
|
||||
|
||||
import java.net.URL
|
||||
import java.util.concurrent.atomic.AtomicReference
|
||||
import java.util.zip.GZIPInputStream
|
||||
|
||||
import scala.jdk.CollectionConverters._
|
||||
import scala.util.Using
|
||||
|
||||
import docspell.common.Language.NLPLanguage
|
||||
import docspell.common._
|
||||
|
||||
import edu.stanford.nlp.ie.AbstractSequenceClassifier
|
||||
import edu.stanford.nlp.ie.crf.CRFClassifier
|
||||
import edu.stanford.nlp.ling.{CoreAnnotations, CoreLabel}
|
||||
import org.log4s.getLogger
|
||||
|
||||
/** This is only using the CRFClassifier without building an analysis
|
||||
* pipeline. The ner-classifier cannot use results from POS-tagging
|
||||
* etc. and is therefore not as good as the [[StanfordNerAnnotator]].
|
||||
* But it uses less memory, while still being not bad.
|
||||
*/
|
||||
object BasicCRFAnnotator {
|
||||
private[this] val logger = getLogger
|
||||
|
||||
// assert correct resource names
|
||||
List(Language.French, Language.German, Language.English).foreach(classifierResource)
|
||||
|
||||
type Annotator = AbstractSequenceClassifier[CoreLabel]
|
||||
|
||||
def nerAnnotate(nerClassifier: Annotator)(text: String): Vector[NerLabel] =
|
||||
nerClassifier
|
||||
.classify(text)
|
||||
.asScala
|
||||
.flatMap(a => a.asScala)
|
||||
.collect(Function.unlift { label =>
|
||||
val tag = label.get(classOf[CoreAnnotations.AnswerAnnotation])
|
||||
NerTag
|
||||
.fromString(Option(tag).getOrElse(""))
|
||||
.toOption
|
||||
.map(t => NerLabel(label.word(), t, label.beginPosition(), label.endPosition()))
|
||||
})
|
||||
.toVector
|
||||
|
||||
def makeAnnotator(lang: NLPLanguage): Annotator = {
|
||||
logger.info(s"Creating ${lang.name} Stanford NLP NER-only classifier...")
|
||||
val ner = classifierResource(lang)
|
||||
Using(new GZIPInputStream(ner.openStream())) { in =>
|
||||
CRFClassifier.getClassifier(in).asInstanceOf[Annotator]
|
||||
}.fold(throw _, identity)
|
||||
}
|
||||
|
||||
private def classifierResource(lang: NLPLanguage): URL = {
|
||||
def check(name: String): URL =
|
||||
Option(getClass.getResource(name)) match {
|
||||
case None =>
|
||||
sys.error(s"NER model resource '$name' not found for language ${lang.name}")
|
||||
case Some(url) => url
|
||||
}
|
||||
|
||||
check(lang match {
|
||||
case Language.French =>
|
||||
"/edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz"
|
||||
case Language.German =>
|
||||
"/edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz"
|
||||
case Language.English =>
|
||||
"/edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
|
||||
})
|
||||
}
|
||||
|
||||
final class Cache {
|
||||
private[this] lazy val germanNerClassifier = makeAnnotator(Language.German)
|
||||
private[this] lazy val englishNerClassifier = makeAnnotator(Language.English)
|
||||
private[this] lazy val frenchNerClassifier = makeAnnotator(Language.French)
|
||||
|
||||
def forLang(language: NLPLanguage): Annotator =
|
||||
language match {
|
||||
case Language.French => frenchNerClassifier
|
||||
case Language.German => germanNerClassifier
|
||||
case Language.English => englishNerClassifier
|
||||
}
|
||||
}
|
||||
|
||||
object Cache {
|
||||
|
||||
private[this] val cacheRef = new AtomicReference[Cache](new Cache)
|
||||
|
||||
def getAnnotator(language: NLPLanguage): Annotator =
|
||||
cacheRef.get().forLang(language)
|
||||
|
||||
def clearCache(): Unit =
|
||||
cacheRef.set(new Cache)
|
||||
}
|
||||
}
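A usage sketch of the CRF-only path; the sample sentence is made up, and `Language.English` is one of the three languages with a bundled model:

import docspell.common._

val ann = BasicCRFAnnotator.Cache.getAnnotator(Language.English)
val labels: Vector[NerLabel] =
  BasicCRFAnnotator.nerAnnotate(ann)("Derek Jeter visited Treesville last week.")
// Each NerLabel carries the token, its NerTag and the begin/end character offsets.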
|
@ -7,9 +7,9 @@ import cats.effect._
|
||||
import cats.effect.concurrent.Ref
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.analysis.NlpSettings
|
||||
import docspell.common._
|
||||
|
||||
import edu.stanford.nlp.pipeline.StanfordCoreNLP
|
||||
import org.log4s.getLogger
|
||||
|
||||
/** Creating the StanfordCoreNLP pipeline is quite expensive as it
|
||||
@ -21,46 +21,45 @@ import org.log4s.getLogger
|
||||
*/
|
||||
trait PipelineCache[F[_]] {
|
||||
|
||||
def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP]
|
||||
def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]]
|
||||
|
||||
}
|
||||
|
||||
object PipelineCache {
|
||||
private[this] val logger = getLogger
|
||||
|
||||
def none[F[_]: Applicative]: PipelineCache[F] =
|
||||
new PipelineCache[F] {
|
||||
def obtain(
|
||||
ignored: String,
|
||||
settings: StanfordNerSettings
|
||||
): Resource[F, StanfordCoreNLP] =
|
||||
Resource.liftF(makeClassifier(settings).pure[F])
|
||||
}
|
||||
|
||||
def apply[F[_]: Concurrent: Timer](clearInterval: Duration): F[PipelineCache[F]] =
|
||||
def apply[F[_]: Concurrent: Timer](clearInterval: Duration)(
|
||||
creator: NlpSettings => Annotator[F],
|
||||
release: F[Unit]
|
||||
): F[PipelineCache[F]] =
|
||||
for {
|
||||
data <- Ref.of(Map.empty[String, Entry])
|
||||
cacheClear <- CacheClearing.create(data, clearInterval)
|
||||
} yield new Impl[F](data, cacheClear)
|
||||
data <- Ref.of(Map.empty[String, Entry[Annotator[F]]])
|
||||
cacheClear <- CacheClearing.create(data, clearInterval, release)
|
||||
_ <- Logger.log4s(logger).info("Creating nlp pipeline cache")
|
||||
} yield new Impl[F](data, creator, cacheClear)
|
||||
|
||||
final private class Impl[F[_]: Sync](
|
||||
data: Ref[F, Map[String, Entry]],
|
||||
data: Ref[F, Map[String, Entry[Annotator[F]]]],
|
||||
creator: NlpSettings => Annotator[F],
|
||||
cacheClear: CacheClearing[F]
|
||||
) extends PipelineCache[F] {
|
||||
|
||||
def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP] =
|
||||
def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]] =
|
||||
for {
|
||||
_ <- cacheClear.withCache
|
||||
id <- Resource.liftF(makeSettingsId(settings))
|
||||
nlp <- Resource.liftF(data.modify(cache => getOrCreate(key, id, cache, settings)))
|
||||
_ <- cacheClear.withCache
|
||||
id <- Resource.liftF(makeSettingsId(settings))
|
||||
nlp <- Resource.liftF(
|
||||
data.modify(cache => getOrCreate(key, id, cache, settings, creator))
|
||||
)
|
||||
} yield nlp
|
||||
|
||||
private def getOrCreate(
|
||||
key: String,
|
||||
id: String,
|
||||
cache: Map[String, Entry],
|
||||
settings: StanfordNerSettings
|
||||
): (Map[String, Entry], StanfordCoreNLP) =
|
||||
cache: Map[String, Entry[Annotator[F]]],
|
||||
settings: NlpSettings,
|
||||
creator: NlpSettings => Annotator[F]
|
||||
): (Map[String, Entry[Annotator[F]]], Annotator[F]) =
|
||||
cache.get(key) match {
|
||||
case Some(entry) =>
|
||||
if (entry.id == id) (cache, entry.value)
|
||||
@ -68,18 +67,18 @@ object PipelineCache {
|
||||
logger.info(
|
||||
s"StanfordNLP settings changed for key $key. Creating new classifier"
|
||||
)
|
||||
val nlp = makeClassifier(settings)
|
||||
val nlp = creator(settings)
|
||||
val e = Entry(id, nlp)
|
||||
(cache.updated(key, e), nlp)
|
||||
}
|
||||
|
||||
case None =>
|
||||
val nlp = makeClassifier(settings)
|
||||
val nlp = creator(settings)
|
||||
val e = Entry(id, nlp)
|
||||
(cache.updated(key, e), nlp)
|
||||
}
|
||||
|
||||
private def makeSettingsId(settings: StanfordNerSettings): F[String] = {
|
||||
private def makeSettingsId(settings: NlpSettings): F[String] = {
|
||||
val base = settings.copy(regexNer = None).toString
|
||||
val size: F[Long] =
|
||||
settings.regexNer match {
|
||||
@ -104,9 +103,10 @@ object PipelineCache {
|
||||
Resource.pure[F, Unit](())
|
||||
}
|
||||
|
||||
def create[F[_]: Concurrent: Timer](
|
||||
data: Ref[F, Map[String, Entry]],
|
||||
interval: Duration
|
||||
def create[F[_]: Concurrent: Timer, A](
|
||||
data: Ref[F, Map[String, Entry[A]]],
|
||||
interval: Duration,
|
||||
release: F[Unit]
|
||||
): F[CacheClearing[F]] =
|
||||
for {
|
||||
counter <- Ref.of(0L)
|
||||
@ -121,16 +121,23 @@ object PipelineCache {
|
||||
log
|
||||
.info(s"Clearing StanfordNLP cache after $interval idle time")
|
||||
.map(_ =>
|
||||
new CacheClearingImpl[F](data, counter, cleaning, interval.toScala)
|
||||
new CacheClearingImpl[F, A](
|
||||
data,
|
||||
counter,
|
||||
cleaning,
|
||||
interval.toScala,
|
||||
release
|
||||
)
|
||||
)
|
||||
} yield result
|
||||
}
|
||||
|
||||
final private class CacheClearingImpl[F[_]](
|
||||
data: Ref[F, Map[String, Entry]],
|
||||
final private class CacheClearingImpl[F[_], A](
|
||||
data: Ref[F, Map[String, Entry[A]]],
|
||||
counter: Ref[F, Long],
|
||||
cleaningFiber: Ref[F, Option[Fiber[F, Unit]]],
|
||||
clearInterval: FiniteDuration
|
||||
clearInterval: FiniteDuration,
|
||||
release: F[Unit]
|
||||
)(implicit T: Timer[F], F: Concurrent[F])
|
||||
extends CacheClearing[F] {
|
||||
private[this] val log = Logger.log4s[F](logger)
|
||||
@ -158,17 +165,10 @@ object PipelineCache {
|
||||
|
||||
def clearAll: F[Unit] =
|
||||
log.info("Clearing stanford nlp cache now!") *>
|
||||
data.set(Map.empty) *> Sync[F].delay {
|
||||
// turns out that everything is cached in a static map
|
||||
StanfordCoreNLP.clearAnnotatorPool()
|
||||
data.set(Map.empty) *> release *> Sync[F].delay {
|
||||
System.gc();
|
||||
}
|
||||
}
|
||||
|
||||
private def makeClassifier(settings: StanfordNerSettings): StanfordCoreNLP = {
|
||||
logger.info(s"Creating ${settings.lang.name} Stanford NLP NER classifier...")
|
||||
new StanfordCoreNLP(Properties.forSettings(settings))
|
||||
}
|
||||
|
||||
private case class Entry(id: String, value: StanfordCoreNLP)
|
||||
private case class Entry[A](id: String, value: A)
|
||||
}
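This is how the non-disabled modes wire the cache together; the sketch below is a condensed restatement of the `Nlp.apply` branch in `TextAnalyser`, not additional API:

import cats.effect._

import docspell.analysis.nlp._
import docspell.common._

// The creator turns settings into an Annotator; the release action clears
// the static Stanford/CRF caches once the idle interval has expired.
def cachedAnnotators[F[_]: Concurrent: Timer](
    clearInterval: Duration,
    mode: NlpMode
): F[PipelineCache[F]] =
  PipelineCache(clearInterval)(Annotator[F](mode), Annotator.clearCaches[F])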
|
||||
|
@ -1,9 +1,11 @@
|
||||
package docspell.analysis.nlp
|
||||
|
||||
import java.nio.file.Path
|
||||
import java.util.{Properties => JProps}
|
||||
|
||||
import docspell.analysis.nlp.Properties.Implicits._
|
||||
import docspell.common._
|
||||
import docspell.common.syntax.FileSyntax._
|
||||
|
||||
object Properties {
|
||||
|
||||
@ -17,18 +19,21 @@ object Properties {
|
||||
p
|
||||
}
|
||||
|
||||
def forSettings(settings: StanfordNerSettings): JProps = {
|
||||
val regexNerFile = settings.regexNer
|
||||
.map(p => p.normalize().toAbsolutePath().toString())
|
||||
settings.lang match {
|
||||
case Language.German =>
|
||||
Properties.nerGerman(regexNerFile, settings.highRecall)
|
||||
case Language.English =>
|
||||
Properties.nerEnglish(regexNerFile)
|
||||
case Language.French =>
|
||||
Properties.nerFrench(regexNerFile, settings.highRecall)
|
||||
def forSettings(settings: StanfordNerSettings): JProps =
|
||||
settings match {
|
||||
case StanfordNerSettings.Full(lang, highRecall, regexNer) =>
|
||||
val regexNerFile = regexNer.map(p => p.absolutePathAsString)
|
||||
lang match {
|
||||
case Language.German =>
|
||||
Properties.nerGerman(regexNerFile, highRecall)
|
||||
case Language.English =>
|
||||
Properties.nerEnglish(regexNerFile)
|
||||
case Language.French =>
|
||||
Properties.nerFrench(regexNerFile, highRecall)
|
||||
}
|
||||
case StanfordNerSettings.RegexOnly(path) =>
|
||||
Properties.regexNerOnly(path)
|
||||
}
|
||||
}
|
||||
|
||||
def nerGerman(regexNerMappingFile: Option[String], highRecall: Boolean): JProps =
|
||||
Properties(
|
||||
@ -76,6 +81,11 @@ object Properties {
|
||||
"ner.model" -> "edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
|
||||
).withRegexNer(regexNerMappingFile).withHighRecall(highRecall)
|
||||
|
||||
def regexNerOnly(regexNerMappingFile: Path): JProps =
|
||||
Properties(
|
||||
"annotators" -> "tokenize,ssplit"
|
||||
).withRegexNer(Some(regexNerMappingFile.absolutePathAsString))
|
||||
|
||||
object Implicits {
|
||||
implicit final class JPropsOps(val p: JProps) extends AnyVal {
|
||||
|
||||
|
@ -0,0 +1,52 @@
|
||||
package docspell.analysis.nlp
|
||||
|
||||
import java.nio.file.Path
|
||||
|
||||
import scala.jdk.CollectionConverters._
|
||||
|
||||
import cats.effect._
|
||||
|
||||
import docspell.common._
|
||||
|
||||
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
|
||||
import org.log4s.getLogger
|
||||
|
||||
object StanfordNerAnnotator {
|
||||
private[this] val logger = getLogger
|
||||
|
||||
/** Runs named entity recognition on the given `text`.
|
||||
*
|
||||
* This uses the classifier pipeline from stanford-nlp, see
|
||||
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
|
||||
* classifiers is quite expensive, it involves loading large model
|
||||
* files. The classifiers are thread-safe and so they are cached.
|
||||
* The `cacheKey` defines the "slot" where classifiers are stored
|
||||
* and retrieved. If for a given `cacheKey` the `settings` change,
|
||||
* a new classifier must be created. It will then replace the
|
||||
* previous one.
|
||||
*/
|
||||
def nerAnnotate(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
|
||||
val doc = new CoreDocument(text)
|
||||
nerClassifier.annotate(doc)
|
||||
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
|
||||
}
|
||||
|
||||
def makePipeline(settings: StanfordNerSettings): StanfordCoreNLP =
|
||||
settings match {
|
||||
case s: StanfordNerSettings.Full =>
|
||||
logger.info(s"Creating ${s.lang.name} Stanford NLP NER classifier...")
|
||||
new StanfordCoreNLP(Properties.forSettings(settings))
|
||||
case StanfordNerSettings.RegexOnly(path) =>
|
||||
logger.info(s"Creating regexNer-only Stanford NLP NER classifier...")
|
||||
regexNerPipeline(path)
|
||||
}
|
||||
|
||||
def regexNerPipeline(regexNerFile: Path): StanfordCoreNLP =
|
||||
new StanfordCoreNLP(Properties.regexNerOnly(regexNerFile))
|
||||
|
||||
def clearPipelineCaches[F[_]: Sync]: F[Unit] =
|
||||
Sync[F].delay {
|
||||
// turns out that everything is cached in a static map
|
||||
StanfordCoreNLP.clearAnnotatorPool()
|
||||
}
|
||||
}
|
@ -1,39 +0,0 @@
|
||||
package docspell.analysis.nlp
|
||||
|
||||
import scala.jdk.CollectionConverters._
|
||||
|
||||
import cats.Applicative
|
||||
import cats.effect._
|
||||
|
||||
import docspell.common._
|
||||
|
||||
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
|
||||
|
||||
object StanfordNerClassifier {
|
||||
|
||||
/** Runs named entity recognition on the given `text`.
|
||||
*
|
||||
* This uses the classifier pipeline from stanford-nlp, see
|
||||
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
|
||||
* classifiers is quite expensive, it involves loading large model
|
||||
* files. The classifiers are thread-safe and so they are cached.
|
||||
* The `cacheKey` defines the "slot" where classifiers are stored
|
||||
* and retrieved. If for a given `cacheKey` the `settings` change,
|
||||
* a new classifier must be created. It will then replace the
|
||||
* previous one.
|
||||
*/
|
||||
def nerAnnotate[F[_]: BracketThrow](
|
||||
cacheKey: String,
|
||||
cache: PipelineCache[F]
|
||||
)(settings: StanfordNerSettings, text: String): F[Vector[NerLabel]] =
|
||||
cache
|
||||
.obtain(cacheKey, settings)
|
||||
.use(crf => Applicative[F].pure(runClassifier(crf, text)))
|
||||
|
||||
def runClassifier(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
|
||||
val doc = new CoreDocument(text)
|
||||
nerClassifier.annotate(doc)
|
||||
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
|
||||
}
|
||||
|
||||
}
|
@ -2,25 +2,41 @@ package docspell.analysis.nlp
|
||||
|
||||
import java.nio.file.Path
|
||||
|
||||
import docspell.common._
|
||||
import docspell.analysis.NlpSettings
|
||||
import docspell.common.Language.NLPLanguage
|
||||
|
||||
/** Settings for configuring the stanford NER pipeline.
|
||||
*
|
||||
* The language is mandatory, only the provided ones are supported.
|
||||
* The `highRecall` only applies for non-English languages. For
|
||||
* non-English languages the english classifier is run as second
|
||||
* classifier and if `highRecall` is true, then it will be used to
|
||||
* tag untagged tokens. This may lead to a lot of false positives,
|
||||
* but since English is omnipresent in other languages, too it
|
||||
* depends on the use case for whether this is useful or not.
|
||||
*
|
||||
* The `regexNer` allows to specify a text file as described here:
|
||||
* https://nlp.stanford.edu/software/regexner.html. This will be used
|
||||
* as a last step to tag untagged tokens using the provided list of
|
||||
* regexps.
|
||||
*/
|
||||
case class StanfordNerSettings(
|
||||
lang: Language,
|
||||
highRecall: Boolean,
|
||||
regexNer: Option[Path]
|
||||
)
|
||||
sealed trait StanfordNerSettings
|
||||
|
||||
object StanfordNerSettings {
|
||||
|
||||
/** Settings for configuring the stanford NER pipeline.
|
||||
*
|
||||
* The language is mandatory, only the provided ones are supported.
|
||||
* The `highRecall` only applies for non-English languages. For
|
||||
* non-English languages the english classifier is run as second
|
||||
* classifier and if `highRecall` is true, then it will be used to
|
||||
* tag untagged tokens. This may lead to a lot of false positives,
|
||||
* but since English is omnipresent in other languages, too it
|
||||
* depends on the use case for whether this is useful or not.
|
||||
*
|
||||
* The `regexNer` allows to specify a text file as described here:
|
||||
* https://nlp.stanford.edu/software/regexner.html. This will be used
|
||||
* as a last step to tag untagged tokens using the provided list of
|
||||
* regexps.
|
||||
*/
|
||||
case class Full(
|
||||
lang: NLPLanguage,
|
||||
highRecall: Boolean,
|
||||
regexNer: Option[Path]
|
||||
) extends StanfordNerSettings
|
||||
|
||||
/** Not all languages are supported with predefined statistical models. This allows to provide regexps only.
|
||||
*/
|
||||
case class RegexOnly(regexNerFile: Path) extends StanfordNerSettings
|
||||
|
||||
def fromNlpSettings(ns: NlpSettings): Option[StanfordNerSettings] =
|
||||
NLPLanguage.all
|
||||
.find(nl => nl == ns.lang)
|
||||
.map(nl => Full(nl, ns.highRecall, ns.regexNer))
|
||||
.orElse(ns.regexNer.map(nrf => RegexOnly(nrf)))
|
||||
}
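Two hedged examples of the fallback logic (the file name is made up): a language with full NLP support maps to `Full`, while an unsupported language still gets a `RegexOnly` pipeline if a mapping file is configured, and yields `None` otherwise:

import java.nio.file.Paths

import docspell.analysis.NlpSettings
import docspell.common._

val german  = NlpSettings(Language.German, false, None)
val italian = NlpSettings(Language.Italian, false, Some(Paths.get("ner-rules.txt")))

StanfordNerSettings.fromNlpSettings(german)  // Some(Full(Language.German, false, None))
StanfordNerSettings.fromNlpSettings(italian) // Some(RegexOnly(...)) – regex rules only
StanfordNerSettings.fromNlpSettings(NlpSettings(Language.Italian, false, None)) // None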
|
||||
|
modules/analysis/src/test/scala/docspell/analysis/Env.scala (new file, 12 lines)
@ -0,0 +1,12 @@
package docspell.analysis

object Env {

  def isCI = bool("CI")

  def bool(key: String): Boolean =
    string(key).contains("true")

  def string(key: String): Option[String] =
    Option(System.getenv(key)).filter(_.nonEmpty)
}
@ -1,4 +1,4 @@
|
||||
package docspell.analysis.nlp
|
||||
package docspell.analysis.classifier
|
||||
|
||||
import minitest._
|
||||
import cats.effect._
|
@ -1,19 +1,22 @@
|
||||
package docspell.analysis.nlp
|
||||
|
||||
import docspell.analysis.Env
|
||||
import docspell.common.Language.NLPLanguage
|
||||
import minitest.SimpleTestSuite
|
||||
import docspell.files.TestFiles
|
||||
import docspell.common._
|
||||
import edu.stanford.nlp.pipeline.StanfordCoreNLP
|
||||
|
||||
object TextAnalyserSuite extends SimpleTestSuite {
|
||||
lazy val germanClassifier =
|
||||
new StanfordCoreNLP(Properties.nerGerman(None, false))
|
||||
lazy val englishClassifier =
|
||||
new StanfordCoreNLP(Properties.nerEnglish(None))
|
||||
object BaseCRFAnnotatorSuite extends SimpleTestSuite {
|
||||
|
||||
def annotate(language: NLPLanguage): String => Vector[NerLabel] =
|
||||
BasicCRFAnnotator.nerAnnotate(BasicCRFAnnotator.Cache.getAnnotator(language))
|
||||
|
||||
test("find english ner labels") {
|
||||
val labels =
|
||||
StanfordNerClassifier.runClassifier(englishClassifier, TestFiles.letterENText)
|
||||
if (Env.isCI) {
|
||||
ignore("Test ignored on travis.")
|
||||
}
|
||||
|
||||
val labels = annotate(Language.English)(TestFiles.letterENText)
|
||||
val expect = Vector(
|
||||
NerLabel("Derek", NerTag.Person, 0, 5),
|
||||
NerLabel("Jeter", NerTag.Person, 6, 11),
|
||||
@ -45,11 +48,15 @@ object TextAnalyserSuite extends SimpleTestSuite {
|
||||
NerLabel("Jeter", NerTag.Person, 1123, 1128)
|
||||
)
|
||||
assertEquals(labels, expect)
|
||||
BasicCRFAnnotator.Cache.clearCache()
|
||||
}
|
||||
|
||||
test("find german ner labels") {
|
||||
val labels =
|
||||
StanfordNerClassifier.runClassifier(germanClassifier, TestFiles.letterDEText)
|
||||
if (Env.isCI) {
|
||||
ignore("Test ignored on travis.")
|
||||
}
|
||||
|
||||
val labels = annotate(Language.German)(TestFiles.letterDEText)
|
||||
val expect = Vector(
|
||||
NerLabel("Max", NerTag.Person, 0, 3),
|
||||
NerLabel("Mustermann", NerTag.Person, 4, 14),
|
||||
@ -65,5 +72,6 @@ object TextAnalyserSuite extends SimpleTestSuite {
|
||||
NerLabel("Mustermann", NerTag.Person, 509, 519)
|
||||
)
|
||||
assertEquals(labels, expect)
|
||||
BasicCRFAnnotator.Cache.clearCache()
|
||||
}
|
||||
}
|
@ -0,0 +1,120 @@
|
||||
package docspell.analysis.nlp
|
||||
|
||||
import java.nio.file.Paths
|
||||
|
||||
import cats.effect.IO
|
||||
import docspell.analysis.Env
|
||||
import minitest.SimpleTestSuite
|
||||
import docspell.files.TestFiles
|
||||
import docspell.common._
|
||||
import docspell.common.syntax.FileSyntax._
|
||||
import edu.stanford.nlp.pipeline.StanfordCoreNLP
|
||||
|
||||
object StanfordNerAnnotatorSuite extends SimpleTestSuite {
|
||||
lazy val germanClassifier =
|
||||
new StanfordCoreNLP(Properties.nerGerman(None, false))
|
||||
lazy val englishClassifier =
|
||||
new StanfordCoreNLP(Properties.nerEnglish(None))
|
||||
|
||||
test("find english ner labels") {
|
||||
if (Env.isCI) {
|
||||
ignore("Test ignored on travis.")
|
||||
}
|
||||
|
||||
val labels =
|
||||
StanfordNerAnnotator.nerAnnotate(englishClassifier, TestFiles.letterENText)
|
||||
val expect = Vector(
|
||||
NerLabel("Derek", NerTag.Person, 0, 5),
|
||||
NerLabel("Jeter", NerTag.Person, 6, 11),
|
||||
NerLabel("Elm", NerTag.Misc, 17, 20),
|
||||
NerLabel("Ave.", NerTag.Misc, 21, 25),
|
||||
NerLabel("Treesville", NerTag.Misc, 27, 37),
|
||||
NerLabel("Derek", NerTag.Person, 68, 73),
|
||||
NerLabel("Jeter", NerTag.Person, 74, 79),
|
||||
NerLabel("Elm", NerTag.Misc, 85, 88),
|
||||
NerLabel("Ave.", NerTag.Misc, 89, 93),
|
||||
NerLabel("Treesville", NerTag.Person, 95, 105),
|
||||
NerLabel("Leaf", NerTag.Organization, 144, 148),
|
||||
NerLabel("Chief", NerTag.Organization, 150, 155),
|
||||
NerLabel("of", NerTag.Organization, 156, 158),
|
||||
NerLabel("Syrup", NerTag.Organization, 159, 164),
|
||||
NerLabel("Production", NerTag.Organization, 165, 175),
|
||||
NerLabel("Old", NerTag.Organization, 176, 179),
|
||||
NerLabel("Sticky", NerTag.Organization, 180, 186),
|
||||
NerLabel("Pancake", NerTag.Organization, 187, 194),
|
||||
NerLabel("Company", NerTag.Organization, 195, 202),
|
||||
NerLabel("Maple", NerTag.Organization, 207, 212),
|
||||
NerLabel("Lane", NerTag.Organization, 213, 217),
|
||||
NerLabel("Forest", NerTag.Organization, 219, 225),
|
||||
NerLabel("Hemptown", NerTag.Location, 239, 247),
|
||||
NerLabel("Leaf", NerTag.Person, 276, 280),
|
||||
NerLabel("Little", NerTag.Misc, 347, 353),
|
||||
NerLabel("League", NerTag.Misc, 354, 360),
|
||||
NerLabel("Derek", NerTag.Person, 1117, 1122),
|
||||
NerLabel("Jeter", NerTag.Person, 1123, 1128)
|
||||
)
|
||||
assertEquals(labels, expect)
|
||||
StanfordCoreNLP.clearAnnotatorPool()
|
||||
}
|
||||
|
||||
test("find german ner labels") {
|
||||
if (Env.isCI) {
|
||||
ignore("Test ignored on travis.")
|
||||
}
|
||||
|
||||
val labels =
|
||||
StanfordNerAnnotator.nerAnnotate(germanClassifier, TestFiles.letterDEText)
|
||||
val expect = Vector(
|
||||
NerLabel("Max", NerTag.Person, 0, 3),
|
||||
NerLabel("Mustermann", NerTag.Person, 4, 14),
|
||||
NerLabel("Lilienweg", NerTag.Person, 16, 25),
|
||||
NerLabel("Max", NerTag.Person, 77, 80),
|
||||
NerLabel("Mustermann", NerTag.Person, 81, 91),
|
||||
NerLabel("Lilienweg", NerTag.Location, 93, 102),
|
||||
NerLabel("EasyCare", NerTag.Organization, 124, 132),
|
||||
NerLabel("AG", NerTag.Organization, 133, 135),
|
||||
NerLabel("Ackerweg", NerTag.Location, 158, 166),
|
||||
NerLabel("Nebendorf", NerTag.Location, 184, 193),
|
||||
NerLabel("Max", NerTag.Person, 505, 508),
|
||||
NerLabel("Mustermann", NerTag.Person, 509, 519)
|
||||
)
|
||||
assertEquals(labels, expect)
|
||||
StanfordCoreNLP.clearAnnotatorPool()
|
||||
}
|
||||
|
||||
test("regexner-only annotator") {
|
||||
if (Env.isCI) {
|
||||
ignore("Test ignored on travis.")
|
||||
}
|
||||
|
||||
val regexNerContent =
|
||||
s"""(?i)volantino ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|
||||
|(?i)volantino${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|
||||
|(?i)ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|
||||
|(?i)andrea rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|
||||
|(?i)andrea${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|
||||
|(?i)rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|
||||
|""".stripMargin
|
||||
|
||||
File
|
||||
.withTempDir[IO](Paths.get("target"), "test-regex-ner")
|
||||
.use { dir =>
|
||||
for {
|
||||
out <- File.writeString[IO](dir / "regex.txt", regexNerContent)
|
||||
ann = StanfordNerAnnotator.makePipeline(StanfordNerSettings.RegexOnly(out))
|
||||
labels = StanfordNerAnnotator.nerAnnotate(ann, "Hello Andrea Rossi, can you.")
|
||||
_ <- IO(
|
||||
assertEquals(
|
||||
labels,
|
||||
Vector(
|
||||
NerLabel("Andrea", NerTag.Person, 6, 12),
|
||||
NerLabel("Rossi", NerTag.Person, 13, 18)
|
||||
)
|
||||
)
|
||||
)
|
||||
} yield ()
|
||||
}
|
||||
.unsafeRunSync()
|
||||
StanfordCoreNLP.clearAnnotatorPool()
|
||||
}
|
||||
}
|
@ -591,7 +591,7 @@ object OItem {
for {
  itemIds <- store.transact(RItem.filterItems(items, collective))
  results <- itemIds.traverse(item => deleteItem(item, collective))
  n = results.fold(0)(_ + _)
  n = results.sum
} yield n

def getProposals(item: Ident, collective: Ident): F[MetaProposalList] =
@ -1,5 +1,7 @@
|
||||
package docspell.common
|
||||
|
||||
import cats.data.NonEmptyList
|
||||
|
||||
import io.circe.{Decoder, Encoder}
|
||||
|
||||
sealed trait Language { self: Product =>
|
||||
@ -11,28 +13,107 @@ sealed trait Language { self: Product =>
|
||||
|
||||
def iso3: String
|
||||
|
||||
val allowsNLP: Boolean = false
|
||||
|
||||
private[common] def allNames =
|
||||
Set(name, iso3, iso2)
|
||||
}
|
||||
|
||||
object Language {
|
||||
sealed trait NLPLanguage extends Language with Product {
|
||||
override val allowsNLP = true
|
||||
}
|
||||
object NLPLanguage {
|
||||
val all: NonEmptyList[NLPLanguage] = NonEmptyList.of(German, English, French)
|
||||
}
|
||||
|
||||
case object German extends Language {
|
||||
case object German extends NLPLanguage {
|
||||
val iso2 = "de"
|
||||
val iso3 = "deu"
|
||||
}
|
||||
|
||||
case object English extends Language {
|
||||
case object English extends NLPLanguage {
|
||||
val iso2 = "en"
|
||||
val iso3 = "eng"
|
||||
}
|
||||
|
||||
case object French extends Language {
|
||||
case object French extends NLPLanguage {
|
||||
val iso2 = "fr"
|
||||
val iso3 = "fra"
|
||||
}
|
||||
|
||||
val all: List[Language] = List(German, English, French)
|
||||
case object Italian extends Language {
|
||||
val iso2 = "it"
|
||||
val iso3 = "ita"
|
||||
}
|
||||
|
||||
case object Spanish extends Language {
|
||||
val iso2 = "es"
|
||||
val iso3 = "spa"
|
||||
}
|
||||
|
||||
case object Portuguese extends Language {
|
||||
val iso2 = "pt"
|
||||
val iso3 = "por"
|
||||
}
|
||||
|
||||
case object Czech extends Language {
|
||||
val iso2 = "cs"
|
||||
val iso3 = "ces"
|
||||
}
|
||||
|
||||
case object Danish extends Language {
|
||||
val iso2 = "da"
|
||||
val iso3 = "dan"
|
||||
}
|
||||
|
||||
case object Finnish extends Language {
|
||||
val iso2 = "fi"
|
||||
val iso3 = "fin"
|
||||
}
|
||||
|
||||
case object Norwegian extends Language {
|
||||
val iso2 = "no"
|
||||
val iso3 = "nor"
|
||||
}
|
||||
|
||||
case object Swedish extends Language {
|
||||
val iso2 = "sv"
|
||||
val iso3 = "swe"
|
||||
}
|
||||
|
||||
case object Russian extends Language {
|
||||
val iso2 = "ru"
|
||||
val iso3 = "rus"
|
||||
}
|
||||
|
||||
case object Romanian extends Language {
|
||||
val iso2 = "ro"
|
||||
val iso3 = "ron"
|
||||
}
|
||||
|
||||
case object Dutch extends Language {
|
||||
val iso2 = "nl"
|
||||
val iso3 = "nld"
|
||||
}
|
||||
|
||||
val all: List[Language] =
|
||||
List(
|
||||
German,
|
||||
English,
|
||||
French,
|
||||
Italian,
|
||||
Spanish,
|
||||
Dutch,
|
||||
Portuguese,
|
||||
Czech,
|
||||
Danish,
|
||||
Finnish,
|
||||
Norwegian,
|
||||
Swedish,
|
||||
Russian,
|
||||
Romanian
|
||||
)
|
||||
|
||||
def fromString(str: String): Either[String, Language] = {
|
||||
val lang = str.toLowerCase
|
||||
|
modules/common/src/main/scala/docspell/common/ListType.scala (new file, 33 lines)
@ -0,0 +1,33 @@
package docspell.common

import cats.data.NonEmptyList

import io.circe.{Decoder, Encoder}

sealed trait ListType { self: Product =>
  def name: String =
    productPrefix.toLowerCase
}

object ListType {

  case object Whitelist extends ListType
  val whitelist: ListType = Whitelist

  case object Blacklist extends ListType
  val blacklist: ListType = Blacklist

  val all: NonEmptyList[ListType] = NonEmptyList.of(Whitelist, Blacklist)

  def fromString(name: String): Either[String, ListType] =
    all.find(_.name.equalsIgnoreCase(name)).toRight(s"Unknown list type: $name")

  def unsafeFromString(name: String): ListType =
    fromString(name).fold(sys.error, identity)

  implicit val jsonEncoder: Encoder[ListType] =
    Encoder.encodeString.contramap(_.name)

  implicit val jsonDecoder: Decoder[ListType] =
    Decoder.decodeString.emap(fromString)
}
@ -87,7 +87,7 @@ object MetaProposal {
|
||||
}
|
||||
}
|
||||
|
||||
/** Merges candidates with same `IdRef' values and concatenates their
|
||||
/** Merges candidates with same `IdRef` values and concatenates their
|
||||
* respective labels. The candidate order is preserved.
|
||||
*/
|
||||
def flatten(s: NonEmptyList[Candidate]): NonEmptyList[Candidate] = {
|
||||
|
@ -45,6 +45,19 @@ case class MetaProposalList private (proposals: List[MetaProposal]) {
|
||||
|
||||
def sortByWeights: MetaProposalList =
|
||||
change(_.sortByWeight)
|
||||
|
||||
def insertSecond(ml: MetaProposalList): MetaProposalList =
|
||||
MetaProposalList.flatten0(
|
||||
Seq(this, ml),
|
||||
(map, next) =>
|
||||
map.get(next.proposalType) match {
|
||||
case Some(MetaProposal(mt, values)) =>
|
||||
val cand = NonEmptyList(values.head, next.values.toList ++ values.tail)
|
||||
map.updated(next.proposalType, MetaProposal(mt, MetaProposal.flatten(cand)))
|
||||
case None =>
|
||||
map.updated(next.proposalType, next)
|
||||
}
|
||||
)
|
||||
}
|
||||
|
||||
object MetaProposalList {
|
||||
@ -74,20 +87,25 @@ object MetaProposalList {
|
||||
* is preserved and candidates of proposals are appended as given
|
||||
* by the order of the given `seq'.
|
||||
*/
|
||||
def flatten(ml: Seq[MetaProposalList]): MetaProposalList = {
|
||||
val init: Map[MetaProposalType, MetaProposal] = Map.empty
|
||||
|
||||
def updateMap(
|
||||
map: Map[MetaProposalType, MetaProposal],
|
||||
mp: MetaProposal
|
||||
): Map[MetaProposalType, MetaProposal] =
|
||||
map.get(mp.proposalType) match {
|
||||
case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
|
||||
case None => map.updated(mp.proposalType, mp)
|
||||
}
|
||||
|
||||
val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(updateMap))
|
||||
def flatten(ml: Seq[MetaProposalList]): MetaProposalList =
|
||||
flatten0(
|
||||
ml,
|
||||
(map, mp) =>
|
||||
map.get(mp.proposalType) match {
|
||||
case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
|
||||
case None => map.updated(mp.proposalType, mp)
|
||||
}
|
||||
)
|
||||
|
||||
private def flatten0(
|
||||
ml: Seq[MetaProposalList],
|
||||
merge: (
|
||||
Map[MetaProposalType, MetaProposal],
|
||||
MetaProposal
|
||||
) => Map[MetaProposalType, MetaProposal]
|
||||
): MetaProposalList = {
|
||||
val init = Map.empty[MetaProposalType, MetaProposal]
|
||||
val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(merge))
|
||||
fromMap(merged)
|
||||
}
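
To see what insertSecond does compared to flatten, consider two lists that both propose a correspondent organization (this mirrors the new unit test further below): flatten appends the second list's candidates at the end, while insertSecond keeps the current best candidate first and slots the second list's candidates right behind it:

mpl1.insertSecond(mpl2)
// CorrOrg    -> cand1, cand4, cand2
// ConcPerson -> cand3, cand5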
modules/common/src/main/scala/docspell/common/NlpMode.scala (new file, 25 lines)
@ -0,0 +1,25 @@
package docspell.common

sealed trait NlpMode { self: Product =>

  def name: String =
    self.productPrefix
}
object NlpMode {
  case object Full extends NlpMode
  case object Basic extends NlpMode
  case object RegexOnly extends NlpMode
  case object Disabled extends NlpMode

  def fromString(name: String): Either[String, NlpMode] =
    name.toLowerCase match {
      case "full"      => Right(Full)
      case "basic"     => Right(Basic)
      case "regexonly" => Right(RegexOnly)
      case "disabled"  => Right(Disabled)
      case _           => Left(s"Unknown nlp-mode: $name")
    }

  def unsafeFromString(name: String): NlpMode =
    fromString(name).fold(sys.error, identity)
}
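
The string mapping is case-insensitive and unknown values surface as a Left; the pureconfig reader added further below simply delegates to fromString. A quick sketch:

NlpMode.fromString("Full")               // Right(Full)
NlpMode.fromString("regexonly")          // Right(RegexOnly)
NlpMode.fromString("none")               // Left("Unknown nlp-mode: none")
NlpMode.unsafeFromString("basic").name   // "Basic"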
@ -44,6 +44,9 @@ object Implicits {
|
||||
implicit val priorityReader: ConfigReader[Priority] =
|
||||
ConfigReader[String].emap(reason(Priority.fromString))
|
||||
|
||||
implicit val nlpModeReader: ConfigReader[NlpMode] =
|
||||
ConfigReader[String].emap(reason(NlpMode.fromString))
|
||||
|
||||
def reason[A: ClassTag](
|
||||
f: String => Either[String, A]
|
||||
): String => Either[FailureReason, A] =
|
||||
|
@ -0,0 +1,20 @@
package docspell.common.syntax

import java.nio.file.Path

trait FileSyntax {

  implicit final class PathOps(p: Path) {

    def absolutePath: Path =
      p.normalize().toAbsolutePath

    def absolutePathAsString: String =
      absolutePath.toString

    def /(next: String): Path =
      p.resolve(next)
  }
}

object FileSyntax extends FileSyntax
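
A small usage sketch for the new path syntax (the path is illustrative only):

import java.nio.file.Paths
import docspell.common.syntax.all._

val dir  = Paths.get("/tmp/docspell-analysis")
val file = dir / "model.ser.gz"   // resolves a child path
file.absolutePathAsString         // normalized absolute path as a String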
|
@ -2,6 +2,11 @@ package docspell.common
|
||||
|
||||
package object syntax {
|
||||
|
||||
object all extends EitherSyntax with StreamSyntax with StringSyntax with LoggerSyntax
|
||||
object all
|
||||
extends EitherSyntax
|
||||
with StreamSyntax
|
||||
with StringSyntax
|
||||
with LoggerSyntax
|
||||
with FileSyntax
|
||||
|
||||
}
|
||||
|
@ -68,4 +68,35 @@ object MetaProposalListTest extends SimpleTestSuite {
|
||||
assertEquals(candidates.head, cand1)
|
||||
assertEquals(candidates.tail.head, cand2)
|
||||
}
|
||||
|
||||
test("insert second") {
|
||||
val cand1 = Candidate(IdRef(Ident.unsafe("123"), "name"), Set.empty)
|
||||
val cand2 = Candidate(IdRef(Ident.unsafe("456"), "name"), Set.empty)
|
||||
val cand3 = Candidate(IdRef(Ident.unsafe("789"), "name"), Set.empty)
|
||||
val cand4 = Candidate(IdRef(Ident.unsafe("abc"), "name"), Set.empty)
|
||||
val cand5 = Candidate(IdRef(Ident.unsafe("def"), "name"), Set.empty)
|
||||
|
||||
val mpl1 = MetaProposalList
|
||||
.of(
|
||||
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand2)),
|
||||
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3))
|
||||
)
|
||||
|
||||
val mpl2 = MetaProposalList
|
||||
.of(
|
||||
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand4)),
|
||||
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand5))
|
||||
)
|
||||
|
||||
val result = mpl1.insertSecond(mpl2)
|
||||
assertEquals(
|
||||
result,
|
||||
MetaProposalList(
|
||||
List(
|
||||
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand4, cand2)),
|
||||
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3, cand5))
|
||||
)
|
||||
)
|
||||
)
|
||||
}
|
||||
}
|
||||
|
modules/files/src/test/resources/examples/letter-ita.txt (new file, 13 lines)
@ -0,0 +1,13 @@
|
||||
Pontremoli, 9 aprile 2013
|
||||
|
||||
Spettabile Villa Albicocca
|
||||
Via Francigena, 9
|
||||
55100 Pontetetto (LU)
|
||||
|
||||
Oggetto: Prenotazione
|
||||
|
||||
Gentile Direttore,
|
||||
|
||||
Vorrei prenotare una camera matrimoniale …….
|
||||
|
||||
In attesa di una Sua pronta risposta, La saluto cordialmente
|
@ -1,5 +1,8 @@
|
||||
package docspell.ftsclient
|
||||
|
||||
import cats.Functor
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.common._
|
||||
|
||||
final case class FtsMigration[F[_]](
|
||||
@ -7,7 +10,13 @@ final case class FtsMigration[F[_]](
|
||||
engine: Ident,
|
||||
description: String,
|
||||
task: F[FtsMigration.Result]
|
||||
)
|
||||
) {
|
||||
|
||||
def changeResult(f: FtsMigration.Result => FtsMigration.Result)(implicit
|
||||
F: Functor[F]
|
||||
): FtsMigration[F] =
|
||||
copy(task = task.map(f))
|
||||
}
|
||||
|
||||
object FtsMigration {
|
||||
|
||||
|
@ -21,22 +21,19 @@ object Field {
|
||||
val discriminator = Field("discriminator")
|
||||
val attachmentName = Field("attachmentName")
|
||||
val content = Field("content")
|
||||
val content_de = Field("content_de")
|
||||
val content_en = Field("content_en")
|
||||
val content_fr = Field("content_fr")
|
||||
val content_de = contentField(Language.German)
|
||||
val content_en = contentField(Language.English)
|
||||
val content_fr = contentField(Language.French)
|
||||
val itemName = Field("itemName")
|
||||
val itemNotes = Field("itemNotes")
|
||||
val folderId = Field("folder")
|
||||
|
||||
val contentLangFields = Language.all
|
||||
.map(contentField)
|
||||
|
||||
def contentField(lang: Language): Field =
|
||||
lang match {
|
||||
case Language.German =>
|
||||
Field.content_de
|
||||
case Language.English =>
|
||||
Field.content_en
|
||||
case Language.French =>
|
||||
Field.content_fr
|
||||
}
|
||||
if (lang == Language.Czech) Field(s"content_cz")
|
||||
else Field(s"content_${lang.iso2}")
|
||||
|
||||
implicit val jsonEncoder: Encoder[Field] =
|
||||
Encoder.encodeString.contramap(_.name)
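
The per-language Solr content fields are now derived from the two-letter ISO code instead of being enumerated by hand; Czech is the one special case because the matching Solr text type is text_cz rather than text_cs (see the textLang change further below). For example:

Field.contentField(Language.German).name  // "content_de"
Field.contentField(Language.Italian).name // "content_it"
Field.contentField(Language.Czech).name   // "content_cz"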
|
||||
|
@ -37,13 +37,10 @@ object SolrQuery {
|
||||
cfg,
|
||||
List(
|
||||
Field.content,
|
||||
Field.content_de,
|
||||
Field.content_en,
|
||||
Field.content_fr,
|
||||
Field.itemName,
|
||||
Field.itemNotes,
|
||||
Field.attachmentName
|
||||
),
|
||||
) ++ Field.contentLangFields,
|
||||
List(
|
||||
Field.id,
|
||||
Field.itemId,
|
||||
|
@ -56,21 +56,51 @@ object SolrSetup {
|
||||
5,
|
||||
solrEngine,
|
||||
"Add content_fr field",
|
||||
addContentFrField.map(_ => FtsMigration.Result.workDone)
|
||||
addContentField(Language.French).map(_ => FtsMigration.Result.workDone)
|
||||
),
|
||||
FtsMigration[F](
|
||||
6,
|
||||
solrEngine,
|
||||
"Index all from database",
|
||||
FtsMigration.Result.indexAll.pure[F]
|
||||
),
|
||||
FtsMigration[F](
|
||||
7,
|
||||
solrEngine,
|
||||
"Add content_it field",
|
||||
addContentField(Language.Italian).map(_ => FtsMigration.Result.reIndexAll)
|
||||
),
|
||||
FtsMigration[F](
|
||||
8,
|
||||
solrEngine,
|
||||
"Add content_es field",
|
||||
addContentField(Language.Spanish).map(_ => FtsMigration.Result.reIndexAll)
|
||||
),
|
||||
FtsMigration[F](
|
||||
9,
|
||||
solrEngine,
|
||||
"Add more content fields",
|
||||
addMoreContentFields.map(_ => FtsMigration.Result.reIndexAll)
|
||||
)
|
||||
)
|
||||
|
||||
def addFolderField: F[Unit] =
|
||||
addStringField(Field.folderId)
|
||||
|
||||
def addContentFrField: F[Unit] =
|
||||
addTextField(Some(Language.French))(Field.content_fr)
|
||||
def addMoreContentFields: F[Unit] = {
|
||||
val remain = List[Language](
|
||||
Language.Norwegian,
|
||||
Language.Romanian,
|
||||
Language.Swedish,
|
||||
Language.Finnish,
|
||||
Language.Danish,
|
||||
Language.Czech,
|
||||
Language.Dutch,
|
||||
Language.Portuguese,
|
||||
Language.Russian
|
||||
)
|
||||
remain.traverse(addContentField).map(_ => ())
|
||||
}
|
||||
|
||||
def setupCoreSchema: F[Unit] = {
|
||||
val cmds0 =
|
||||
@ -90,13 +120,15 @@ object SolrSetup {
|
||||
)
|
||||
.traverse(addTextField(None))
|
||||
|
||||
val cntLang = Language.all.traverse {
|
||||
val cntLang = List(Language.German, Language.English, Language.French).traverse {
|
||||
case l @ Language.German =>
|
||||
addTextField(l.some)(Field.content_de)
|
||||
case l @ Language.English =>
|
||||
addTextField(l.some)(Field.content_en)
|
||||
case l @ Language.French =>
|
||||
addTextField(l.some)(Field.content_fr)
|
||||
case _ =>
|
||||
().pure[F]
|
||||
}
|
||||
|
||||
cmds0 *> cmds1 *> cntLang *> ().pure[F]
|
||||
@ -111,20 +143,17 @@ object SolrSetup {
|
||||
run(DeleteField.command(DeleteField(field))).attempt *>
|
||||
run(AddField.command(AddField.string(field)))
|
||||
|
||||
private def addContentField(lang: Language): F[Unit] =
|
||||
addTextField(Some(lang))(Field.contentField(lang))
|
||||
|
||||
private def addTextField(lang: Option[Language])(field: Field): F[Unit] =
|
||||
lang match {
|
||||
case None =>
|
||||
run(DeleteField.command(DeleteField(field))).attempt *>
|
||||
run(AddField.command(AddField.text(field)))
|
||||
case Some(Language.German) =>
|
||||
run(AddField.command(AddField.textGeneral(field)))
|
||||
case Some(lang) =>
|
||||
run(DeleteField.command(DeleteField(field))).attempt *>
|
||||
run(AddField.command(AddField.textDE(field)))
|
||||
case Some(Language.English) =>
|
||||
run(DeleteField.command(DeleteField(field))).attempt *>
|
||||
run(AddField.command(AddField.textEN(field)))
|
||||
case Some(Language.French) =>
|
||||
run(DeleteField.command(DeleteField(field))).attempt *>
|
||||
run(AddField.command(AddField.textFR(field)))
|
||||
run(AddField.command(AddField.textLang(field, lang)))
|
||||
}
|
||||
}
|
||||
}
|
||||
@ -150,17 +179,12 @@ object SolrSetup {
|
||||
def string(field: Field): AddField =
|
||||
AddField(field, "string", true, true, false)
|
||||
|
||||
def text(field: Field): AddField =
|
||||
def textGeneral(field: Field): AddField =
|
||||
AddField(field, "text_general", true, true, false)
|
||||
|
||||
def textDE(field: Field): AddField =
|
||||
AddField(field, "text_de", true, true, false)
|
||||
|
||||
def textEN(field: Field): AddField =
|
||||
AddField(field, "text_en", true, true, false)
|
||||
|
||||
def textFR(field: Field): AddField =
|
||||
AddField(field, "text_fr", true, true, false)
|
||||
def textLang(field: Field, lang: Language): AddField =
|
||||
if (lang == Language.Czech) AddField(field, s"text_cz", true, true, false)
|
||||
else AddField(field, s"text_${lang.iso2}", true, true, false)
|
||||
}
|
||||
|
||||
case class DeleteField(name: Field)
|
||||
|
@ -269,62 +269,101 @@ docspell.joex {
|
||||
# All text to analyse must fit into RAM. A large document may take
|
||||
# too much heap. Also, most important information is at the
|
||||
# beginning of a document, so in most cases the first two pages
|
||||
# should suffice. Default is 10000, which are about 2-3 pages
|
||||
# (just a rough guess, of course).
|
||||
max-length = 10000
|
||||
# should suffice. Default is 8000, which is about 2-3 pages (just
|
||||
# a rough guess, of course).
|
||||
max-length = 8000
|
||||
|
||||
# A working directory for the analyser to store temporary/working
|
||||
# files.
|
||||
working-dir = ${java.io.tmpdir}"/docspell-analysis"
|
||||
|
||||
# The StanfordCoreNLP library caches language models which
|
||||
# requires quite some amount of memory. Setting this interval to a
|
||||
# positive duration, the cache is cleared after this amount of
|
||||
# idle time. Set it to 0 to disable it if you have enough memory,
|
||||
# processing will be faster.
|
||||
clear-stanford-nlp-interval = "15 minutes"
|
||||
|
||||
regex-ner {
|
||||
# Whether to enable custom NER annotation. This uses the address
|
||||
# book of a collective as input for NER tagging (to automatically
|
||||
# find correspondent and concerned entities). If the address book
|
||||
# is large, this can be quite memory intensive and also makes text
|
||||
# analysis slower. But it greatly improves accuracy. If this is
|
||||
# false, NER tagging uses only statistical models (that also work
|
||||
# quite well).
|
||||
nlp {
|
||||
# The mode for configuring NLP models:
|
||||
#
|
||||
# This setting might be moved to the collective settings in the
|
||||
# future.
|
||||
enabled = true
|
||||
# 1. full – builds the complete pipeline
|
||||
# 2. basic - builds only the ner annotator
|
||||
# 3. regexonly - matches each entry in your address book via regexps
|
||||
# 4. disabled - doesn't use any stanford-nlp feature
|
||||
#
|
||||
# The full and basic variants rely on pre-build language models
|
||||
# that are available for only a few languages. Memory usage
|
||||
# varies among the languages. So joex should run with -Xmx1400M
|
||||
# at least when using mode=full.
|
||||
#
|
||||
# The basic variant does a quite good job for German and
|
||||
# English. It might be worse for French, always depending on the
|
||||
# type of text that is analysed. Joex should run with about 500M
|
||||
# heap; here again, the German language uses the most.
|
||||
#
|
||||
# The regexonly variant doesn't depend on a language. It roughly
|
||||
# works by converting all entries in your addressbook into
|
||||
# regexps and matches each one against the text. This can get
|
||||
# memory intensive, too, when the addressbook grows large. This
|
||||
# is included in the full and basic by default, but can be used
|
||||
# independently by setting mode=regexonly.
|
||||
#
|
||||
# When mode=disabled, then the whole nlp pipeline is disabled,
|
||||
# and you won't get any suggestions. Only what the classifier
|
||||
# returns (if enabled).
|
||||
mode = full
|
||||
|
||||
# The NER annotation uses a file of patterns that is derived from
|
||||
# a collective's address book. This is how long this file will be
# kept until a check for a state change is done.
|
||||
file-cache-time = "1 minute"
|
||||
# The StanfordCoreNLP library caches language models which
|
||||
# requires quite some amount of memory. Setting this interval to a
|
||||
# positive duration, the cache is cleared after this amount of
|
||||
# idle time. Set it to 0 to disable it if you have enough memory,
|
||||
# processing will be faster.
|
||||
#
|
||||
# This has only any effect, if mode != disabled.
|
||||
clear-interval = "15 minutes"
|
||||
|
||||
# Restricts proposals for due dates. Only dates earlier than this
|
||||
# number of years in the future are considered.
|
||||
max-due-date-years = 10
|
||||
|
||||
regex-ner {
|
||||
# Whether to enable custom NER annotation. This uses the
|
||||
# address book of a collective as input for NER tagging (to
|
||||
# automatically find correspondent and concerned entities). If
|
||||
# the address book is large, this can be quite memory
|
||||
# intensive and also makes text analysis much slower. But it
|
||||
# improves accuracy and can be used independent of the
|
||||
# language. If this is set to 0, it is effectively disabled
|
||||
# and NER tagging uses only statistical models (that also work
|
||||
# quite well, but are restricted to the languages mentioned
|
||||
# above).
|
||||
#
|
||||
# Note, this is only relevant if nlp-config.mode is not
|
||||
# "disabled".
|
||||
max-entries = 1000
|
||||
|
||||
# The NER annotation uses a file of patterns that is derived
|
||||
# from a collective's address book. This is how long this data
# will be kept until a check for a state change is done.
|
||||
file-cache-time = "1 minute"
|
||||
}
|
||||
}
|
||||
|
||||
# Settings for doing document classification.
|
||||
#
|
||||
# This works by learning from existing documents. A collective can
|
||||
# specify a tag category and the system will try to predict a tag
|
||||
# from this category for new incoming documents.
|
||||
#
|
||||
# This requires a satstical model that is computed from all
|
||||
# existing documents. This process is run periodically as
|
||||
# configured by the collective. It may require a lot of memory,
|
||||
# depending on the amount of data.
|
||||
# This works by learning from existing documents. This requires a
|
||||
# statistical model that is computed from all existing documents.
|
||||
# This process is run periodically as configured by the
|
||||
# collective. It may require more memory, depending on the amount
|
||||
# of data.
|
||||
#
|
||||
# It utilises this NLP library: https://nlp.stanford.edu/.
|
||||
classification {
|
||||
# Whether to enable classification globally. Each collective can
|
||||
# decide to disable it. If it is disabled here, no collective
|
||||
# can use classification.
|
||||
# enable/disable auto-tagging. The classifier is also used for
|
||||
# finding correspondents and concerned entities, if enabled
|
||||
# here.
|
||||
enabled = true
|
||||
|
||||
# If concerned with memory consumption, this restricts the
|
||||
# number of items to consider. More are better for training. A
|
||||
# negative value or zero means no train on all items.
|
||||
item-count = 0
|
||||
# negative value or zero means to train on all items.
|
||||
item-count = 600
|
||||
|
||||
# These settings are used to configure the classifier. If
|
||||
# multiple are given, they are all tried and the "best" is
|
||||
@ -477,13 +516,6 @@ docspell.joex {
|
||||
}
|
||||
}
|
||||
|
||||
# General config for processing documents
|
||||
processing {
|
||||
# Restricts proposals for due dates. Only dates earlier than this
|
||||
# number of years in the future are considered.
|
||||
max-due-date-years = 10
|
||||
}
|
||||
|
||||
# The same section is also present in the rest-server config. It is
|
||||
# used when submitting files into the job queue for processing.
|
||||
#
|
||||
|
@ -5,7 +5,7 @@ import java.nio.file.Path
|
||||
import cats.data.NonEmptyList
|
||||
|
||||
import docspell.analysis.TextAnalysisConfig
|
||||
import docspell.analysis.nlp.TextClassifierConfig
|
||||
import docspell.analysis.classifier.TextClassifierConfig
|
||||
import docspell.backend.Config.Files
|
||||
import docspell.common._
|
||||
import docspell.convert.ConvertConfig
|
||||
@ -31,8 +31,7 @@ case class Config(
|
||||
sendMail: MailSendConfig,
|
||||
files: Files,
|
||||
mailDebug: Boolean,
|
||||
fullTextSearch: Config.FullTextSearch,
|
||||
processing: Config.Processing
|
||||
fullTextSearch: Config.FullTextSearch
|
||||
)
|
||||
|
||||
object Config {
|
||||
@ -55,20 +54,17 @@ object Config {
|
||||
final case class Migration(indexAllChunk: Int)
|
||||
}
|
||||
|
||||
case class Processing(maxDueDateYears: Int)
|
||||
|
||||
case class TextAnalysis(
|
||||
maxLength: Int,
|
||||
workingDir: Path,
|
||||
clearStanfordNlpInterval: Duration,
|
||||
regexNer: RegexNer,
|
||||
nlp: NlpConfig,
|
||||
classification: Classification
|
||||
) {
|
||||
|
||||
def textAnalysisConfig: TextAnalysisConfig =
|
||||
TextAnalysisConfig(
|
||||
maxLength,
|
||||
clearStanfordNlpInterval,
|
||||
TextAnalysisConfig.NlpConfig(nlp.clearInterval, nlp.mode),
|
||||
TextClassifierConfig(
|
||||
workingDir,
|
||||
NonEmptyList
|
||||
@ -78,14 +74,30 @@ object Config {
|
||||
)
|
||||
|
||||
def regexNerFileConfig: RegexNerFile.Config =
|
||||
RegexNerFile.Config(regexNer.enabled, workingDir, regexNer.fileCacheTime)
|
||||
RegexNerFile.Config(
|
||||
nlp.regexNer.maxEntries,
|
||||
workingDir,
|
||||
nlp.regexNer.fileCacheTime
|
||||
)
|
||||
}
|
||||
|
||||
case class RegexNer(enabled: Boolean, fileCacheTime: Duration)
|
||||
case class NlpConfig(
|
||||
mode: NlpMode,
|
||||
clearInterval: Duration,
|
||||
maxDueDateYears: Int,
|
||||
regexNer: RegexNer
|
||||
)
|
||||
|
||||
case class RegexNer(maxEntries: Int, fileCacheTime: Duration)
|
||||
|
||||
case class Classification(
|
||||
enabled: Boolean,
|
||||
itemCount: Int,
|
||||
classifiers: List[Map[String, String]]
|
||||
)
|
||||
) {
|
||||
|
||||
def itemCountOrWhenLower(other: Int): Int =
|
||||
if (itemCount <= 0 || (itemCount > other && other > 0)) other
|
||||
else itemCount
|
||||
}
|
||||
}
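
The Classification.itemCountOrWhenLower method above picks whichever of the two limits is stricter, where a non-positive value means "no limit" on that side. A few example evaluations, assuming itemCount = 600 as in the config above:

val c = Config.Classification(enabled = true, itemCount = 600, classifiers = Nil)
c.itemCountOrWhenLower(200)  // 200: the collective's lower limit wins
c.itemCountOrWhenLower(1000) // 600: the global limit is lower
c.itemCountOrWhenLower(0)    // 600: the collective sets no limit, keep the global one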
|
||||
|
@ -97,7 +97,7 @@ object JoexAppImpl {
|
||||
upload <- OUpload(store, queue, cfg.files, joex)
|
||||
fts <- createFtsClient(cfg)(httpClient)
|
||||
itemOps <- OItem(store, fts, queue, joex)
|
||||
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig)
|
||||
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig, blocker)
|
||||
regexNer <- RegexNerFile(cfg.textAnalysis.regexNerFileConfig, blocker, store)
|
||||
javaEmil =
|
||||
JavaMailEmil(blocker, Settings.defaultSettings.copy(debug = cfg.mailDebug))
|
||||
@ -169,7 +169,7 @@ object JoexAppImpl {
|
||||
.withTask(
|
||||
JobTask.json(
|
||||
LearnClassifierArgs.taskName,
|
||||
LearnClassifierTask[F](cfg.textAnalysis, blocker, analyser),
|
||||
LearnClassifierTask[F](cfg.textAnalysis, analyser),
|
||||
LearnClassifierTask.onCancel[F]
|
||||
)
|
||||
)
|
||||
|
@ -29,7 +29,7 @@ trait RegexNerFile[F[_]] {
|
||||
object RegexNerFile {
|
||||
private[this] val logger = getLogger
|
||||
|
||||
case class Config(enabled: Boolean, directory: Path, minTime: Duration)
|
||||
case class Config(maxEntries: Int, directory: Path, minTime: Duration)
|
||||
|
||||
def apply[F[_]: Concurrent: ContextShift](
|
||||
cfg: Config,
|
||||
@ -49,7 +49,7 @@ object RegexNerFile {
|
||||
) extends RegexNerFile[F] {
|
||||
|
||||
def makeFile(collective: Ident): F[Option[Path]] =
|
||||
if (cfg.enabled) doMakeFile(collective)
|
||||
if (cfg.maxEntries > 0) doMakeFile(collective)
|
||||
else (None: Option[Path]).pure[F]
|
||||
|
||||
def doMakeFile(collective: Ident): F[Option[Path]] =
|
||||
@ -127,7 +127,7 @@ object RegexNerFile {
|
||||
|
||||
for {
|
||||
_ <- logger.finfo(s"Generating custom NER file for collective '${collective.id}'")
|
||||
names <- store.transact(QCollective.allNames(collective))
|
||||
names <- store.transact(QCollective.allNames(collective, cfg.maxEntries))
|
||||
nerFile = NerFile(collective, lastUpdate, now)
|
||||
_ <- update(nerFile, NerFile.mkNerConfig(names))
|
||||
} yield nerFile
|
||||
|
@ -14,16 +14,26 @@ object FtsWork {
|
||||
def apply[F[_]](f: FtsContext[F] => F[Unit]): FtsWork[F] =
|
||||
Kleisli(f)
|
||||
|
||||
def allInitializeTasks[F[_]: Monad]: FtsWork[F] =
|
||||
FtsWork[F](_ => ().pure[F]).tap[FtsContext[F]].flatMap { ctx =>
|
||||
NonEmptyList.fromList(ctx.fts.initialize.map(fm => from[F](fm.task))) match {
|
||||
/** Runs all migration tasks unconditionally and inserts all data as last step. */
|
||||
def reInitializeTasks[F[_]: Monad]: FtsWork[F] =
|
||||
FtsWork { ctx =>
|
||||
val migrations =
|
||||
ctx.fts.initialize.map(fm => fm.changeResult(_ => FtsMigration.Result.workDone))
|
||||
|
||||
NonEmptyList.fromList(migrations) match {
|
||||
case Some(nel) =>
|
||||
nel.reduce(semigroup[F])
|
||||
nel
|
||||
.map(fm => from[F](fm.task))
|
||||
.append(insertAll[F](None))
|
||||
.reduce(semigroup[F])
|
||||
.run(ctx)
|
||||
case None =>
|
||||
FtsWork[F](_ => ().pure[F])
|
||||
().pure[F]
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
*/
|
||||
def from[F[_]: FlatMap: Applicative](t: F[FtsMigration.Result]): FtsWork[F] =
|
||||
Kleisli.liftF(t).flatMap(transformResult[F])
|
||||
|
||||
|
@ -11,6 +11,11 @@ import docspell.joex.Config
|
||||
import docspell.store.records.RFtsMigration
|
||||
import docspell.store.{AddResult, Store}
|
||||
|
||||
/** Migrating the index from the previous version to this version.
|
||||
*
|
||||
* The sql database stores the outcome of a migration task. If this
|
||||
* task has already been applied, it is skipped.
|
||||
*/
|
||||
case class Migration[F[_]](
|
||||
version: Int,
|
||||
engine: Ident,
|
||||
|
@ -46,6 +46,6 @@ object ReIndexTask {
|
||||
FtsWork.log[F](_.info("Clearing data failed. Continue re-indexing."))
|
||||
) ++
|
||||
FtsWork.log[F](_.info("Running index initialize")) ++
|
||||
FtsWork.allInitializeTasks[F]
|
||||
FtsWork.reInitializeTasks[F]
|
||||
})
|
||||
}
|
||||
|
@ -4,6 +4,9 @@ import cats.data.Kleisli
|
||||
|
||||
package object fts {
|
||||
|
||||
/** Some work that must be done to advance the schema of the fulltext
|
||||
* index.
|
||||
*/
|
||||
type FtsWork[F[_]] = Kleisli[F, FtsContext[F], Unit]
|
||||
|
||||
}
|
||||
|
@ -0,0 +1,66 @@
|
||||
package docspell.joex.learn
|
||||
|
||||
import cats.data.NonEmptyList
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.common.Ident
|
||||
import docspell.store.records.{RClassifierModel, RClassifierSetting}
|
||||
|
||||
import doobie._
|
||||
|
||||
final class ClassifierName(val name: String) extends AnyVal
|
||||
|
||||
object ClassifierName {
|
||||
def apply(name: String): ClassifierName =
|
||||
new ClassifierName(name)
|
||||
|
||||
private val categoryPrefix = "tagcategory-"
|
||||
|
||||
def tagCategory(cat: String): ClassifierName =
|
||||
apply(s"${categoryPrefix}${cat}")
|
||||
|
||||
val concernedPerson: ClassifierName =
|
||||
apply("concernedperson")
|
||||
|
||||
val concernedEquip: ClassifierName =
|
||||
apply("concernedequip")
|
||||
|
||||
val correspondentOrg: ClassifierName =
|
||||
apply("correspondentorg")
|
||||
|
||||
val correspondentPerson: ClassifierName =
|
||||
apply("correspondentperson")
|
||||
|
||||
def findTagClassifiers[F[_]](coll: Ident): ConnectionIO[List[ClassifierName]] =
|
||||
for {
|
||||
categories <- RClassifierSetting.getActiveCategories(coll)
|
||||
} yield categories.map(tagCategory)
|
||||
|
||||
def findTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
|
||||
for {
|
||||
categories <- RClassifierSetting.getActiveCategories(coll)
|
||||
models <- NonEmptyList.fromList(categories) match {
|
||||
case Some(nel) =>
|
||||
RClassifierModel.findAllByName(coll, nel.map(tagCategory).map(_.name))
|
||||
case None =>
|
||||
List.empty[RClassifierModel].pure[ConnectionIO]
|
||||
}
|
||||
} yield models
|
||||
|
||||
def findOrphanTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
|
||||
for {
|
||||
cats <- RClassifierSetting.getActiveCategories(coll)
|
||||
allModels = RClassifierModel.findAllByQuery(coll, s"${categoryPrefix}%")
|
||||
result <- NonEmptyList.fromList(cats) match {
|
||||
case Some(nel) =>
|
||||
allModels.flatMap(all =>
|
||||
RClassifierModel
|
||||
.findAllByName(coll, nel.map(tagCategory).map(_.name))
|
||||
.map(active => all.diff(active))
|
||||
)
|
||||
case None =>
|
||||
allModels
|
||||
}
|
||||
} yield result
|
||||
|
||||
}
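
Since models are stored under plain string names, this naming scheme is the only link between a stored model and what it classifies. A small sketch (the category name is invented):

ClassifierName.tagCategory("doctype").name // "tagcategory-doctype"
ClassifierName.correspondentOrg.name       // "correspondentorg"
// findOrphanTagModels matches "tagcategory-%" and keeps only models whose
// category is still active for the collective; the rest are deleted later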
|
@ -0,0 +1,48 @@
|
||||
package docspell.joex.learn
|
||||
|
||||
import java.nio.file.Path
|
||||
|
||||
import cats.data.OptionT
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.analysis.classifier.{ClassifierModel, TextClassifier}
|
||||
import docspell.common._
|
||||
import docspell.store.Store
|
||||
import docspell.store.records.RClassifierModel
|
||||
|
||||
import bitpeace.RangeDef
|
||||
|
||||
object Classify {
|
||||
|
||||
def apply[F[_]: Sync: ContextShift](
|
||||
blocker: Blocker,
|
||||
logger: Logger[F],
|
||||
workingDir: Path,
|
||||
store: Store[F],
|
||||
classifier: TextClassifier[F],
|
||||
coll: Ident,
|
||||
text: String
|
||||
)(cname: ClassifierName): F[Option[String]] =
|
||||
(for {
|
||||
_ <- OptionT.liftF(logger.info(s"Guessing label for ${cname.name} …"))
|
||||
model <- OptionT(store.transact(RClassifierModel.findByName(coll, cname.name)))
|
||||
.flatTapNone(logger.debug("No classifier model found."))
|
||||
modelData =
|
||||
store.bitpeace
|
||||
.get(model.fileId.id)
|
||||
.unNoneTerminate
|
||||
.through(store.bitpeace.fetchData2(RangeDef.all))
|
||||
cls <- OptionT(File.withTempDir(workingDir, "classify").use { dir =>
|
||||
val modelFile = dir.resolve("model.ser.gz")
|
||||
modelData
|
||||
.through(fs2.io.file.writeAll(modelFile, blocker))
|
||||
.compile
|
||||
.drain
|
||||
.flatMap(_ => classifier.classify(logger, ClassifierModel(modelFile), text))
|
||||
}).filter(_ != LearnClassifierTask.noClass)
|
||||
.flatTapNone(logger.debug("Guessed: <none>"))
|
||||
_ <- OptionT.liftF(logger.debug(s"Guessed: ${cls}"))
|
||||
} yield cls).value
|
||||
|
||||
}
|
@ -1,26 +1,19 @@
|
||||
package docspell.joex.learn
|
||||
|
||||
import cats.data.Kleisli
|
||||
import cats.data.OptionT
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
import fs2.{Pipe, Stream}
|
||||
|
||||
import docspell.analysis.TextAnalyser
|
||||
import docspell.analysis.nlp.ClassifierModel
|
||||
import docspell.analysis.nlp.TextClassifier.Data
|
||||
import docspell.backend.ops.OCollective
|
||||
import docspell.common._
|
||||
import docspell.joex.Config
|
||||
import docspell.joex.scheduler._
|
||||
import docspell.store.queries.QItem
|
||||
import docspell.store.records.RClassifierSetting
|
||||
|
||||
import bitpeace.MimetypeHint
|
||||
import docspell.store.records.{RClassifierModel, RClassifierSetting}
|
||||
|
||||
object LearnClassifierTask {
|
||||
val noClass = "__NONE__"
|
||||
val pageSep = " --n-- "
|
||||
val noClass = "__NONE__"
|
||||
|
||||
type Args = LearnClassifierArgs
|
||||
|
||||
@ -29,83 +22,86 @@ object LearnClassifierTask {
|
||||
|
||||
def apply[F[_]: Sync: ContextShift](
|
||||
cfg: Config.TextAnalysis,
|
||||
blocker: Blocker,
|
||||
analyser: TextAnalyser[F]
|
||||
): Task[F, Args, Unit] =
|
||||
learnTags(cfg, analyser)
|
||||
.flatMap(_ => learnItemEntities(cfg, analyser))
|
||||
.flatMap(_ => Task(_ => Sync[F].delay(System.gc())))
|
||||
|
||||
private def learnItemEntities[F[_]: Sync: ContextShift](
|
||||
cfg: Config.TextAnalysis,
|
||||
analyser: TextAnalyser[F]
|
||||
): Task[F, Args, Unit] =
|
||||
Task { ctx =>
|
||||
(for {
|
||||
sett <- findActiveSettings[F](ctx, cfg)
|
||||
data = selectItems(
|
||||
ctx,
|
||||
math.min(cfg.classification.itemCount, sett.itemCount).toLong,
|
||||
sett.category.getOrElse("")
|
||||
)
|
||||
_ <- OptionT.liftF(
|
||||
analyser
|
||||
.classifier(blocker)
|
||||
.trainClassifier[Unit](ctx.logger, data)(Kleisli(handleModel(ctx, blocker)))
|
||||
)
|
||||
} yield ())
|
||||
.getOrElseF(logInactiveWarning(ctx.logger))
|
||||
if (cfg.classification.enabled)
|
||||
LearnItemEntities
|
||||
.learnAll(
|
||||
analyser,
|
||||
ctx.args.collective,
|
||||
cfg.classification.itemCount,
|
||||
cfg.maxLength
|
||||
)
|
||||
.run(ctx)
|
||||
else ().pure[F]
|
||||
}
|
||||
|
||||
private def handleModel[F[_]: Sync: ContextShift](
|
||||
ctx: Context[F, Args],
|
||||
blocker: Blocker
|
||||
)(trainedModel: ClassifierModel): F[Unit] =
|
||||
private def learnTags[F[_]: Sync: ContextShift](
|
||||
cfg: Config.TextAnalysis,
|
||||
analyser: TextAnalyser[F]
|
||||
): Task[F, Args, Unit] =
|
||||
Task { ctx =>
|
||||
val learnTags =
|
||||
for {
|
||||
sett <- findActiveSettings[F](ctx, cfg)
|
||||
maxItems = cfg.classification.itemCountOrWhenLower(sett.itemCount)
|
||||
_ <- OptionT.liftF(
|
||||
LearnTags
|
||||
.learnAllTagCategories(analyser)(
|
||||
ctx.args.collective,
|
||||
maxItems,
|
||||
cfg.maxLength
|
||||
)
|
||||
.run(ctx)
|
||||
)
|
||||
} yield ()
|
||||
// learn classifier models from active tag categories
|
||||
learnTags.getOrElseF(logInactiveWarning(ctx.logger)) *>
|
||||
// delete classifier model files for categories that have been removed
|
||||
clearObsoleteTagModels(ctx) *>
|
||||
// when tags are deleted, categories may get removed. fix the json array
|
||||
ctx.store
|
||||
.transact(RClassifierSetting.fixCategoryList(ctx.args.collective))
|
||||
.map(_ => ())
|
||||
}
|
||||
|
||||
private def clearObsoleteTagModels[F[_]: Sync](ctx: Context[F, Args]): F[Unit] =
|
||||
for {
|
||||
oldFile <- ctx.store.transact(
|
||||
RClassifierSetting.findById(ctx.args.collective).map(_.flatMap(_.fileId))
|
||||
list <- ctx.store.transact(
|
||||
ClassifierName.findOrphanTagModels(ctx.args.collective)
|
||||
)
|
||||
_ <- ctx.logger.info("Storing new trained model")
|
||||
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
|
||||
newFile <-
|
||||
ctx.store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
|
||||
_ <- ctx.store.transact(
|
||||
RClassifierSetting.updateFile(ctx.args.collective, Ident.unsafe(newFile.id))
|
||||
_ <- ctx.logger.info(
|
||||
s"Found ${list.size} obsolete model files that are deleted now."
|
||||
)
|
||||
_ <- ctx.logger.debug(s"New model stored at file ${newFile.id}")
|
||||
_ <- oldFile match {
|
||||
case Some(fid) =>
|
||||
ctx.logger.debug(s"Deleting old model file ${fid.id}") *>
|
||||
ctx.store.bitpeace.delete(fid.id).compile.drain
|
||||
case None => ().pure[F]
|
||||
}
|
||||
n <- ctx.store.transact(RClassifierModel.deleteAll(list.map(_.id)))
|
||||
_ <- list
|
||||
.map(_.fileId.id)
|
||||
.traverse(id => ctx.store.bitpeace.delete(id).compile.drain)
|
||||
_ <- ctx.logger.debug(s"Deleted $n model files.")
|
||||
} yield ()
|
||||
|
||||
private def selectItems[F[_]](
|
||||
ctx: Context[F, Args],
|
||||
max: Long,
|
||||
category: String
|
||||
): Stream[F, Data] = {
|
||||
val connStream =
|
||||
for {
|
||||
item <- QItem.findAllNewesFirst(ctx.args.collective, 10).through(restrictTo(max))
|
||||
tt <- Stream.eval(
|
||||
QItem.resolveTextAndTag(ctx.args.collective, item, category, pageSep)
|
||||
)
|
||||
} yield Data(tt.tag.map(_.name).getOrElse(noClass), item.id, tt.text.trim)
|
||||
ctx.store.transact(connStream.filter(_.text.nonEmpty))
|
||||
}
|
||||
|
||||
private def restrictTo[F[_], A](max: Long): Pipe[F, A, A] =
|
||||
if (max <= 0) identity
|
||||
else _.take(max)
|
||||
|
||||
private def findActiveSettings[F[_]: Sync](
|
||||
ctx: Context[F, Args],
|
||||
cfg: Config.TextAnalysis
|
||||
): OptionT[F, OCollective.Classifier] =
|
||||
if (cfg.classification.enabled)
|
||||
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.collective)))
|
||||
.filter(_.enabled)
|
||||
.filter(_.category.nonEmpty)
|
||||
.filter(_.autoTagEnabled)
|
||||
.map(OCollective.Classifier.fromRecord)
|
||||
else
|
||||
OptionT.none
|
||||
|
||||
private def logInactiveWarning[F[_]: Sync](logger: Logger[F]): F[Unit] =
|
||||
logger.warn(
|
||||
"Classification is disabled. Check joex config and the collective settings."
|
||||
"Auto-tagging is disabled. Check joex config and the collective settings."
|
||||
)
|
||||
}
|
||||
|
@ -0,0 +1,79 @@
|
||||
package docspell.joex.learn
|
||||
|
||||
import cats.data.Kleisli
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
import fs2.Stream
|
||||
|
||||
import docspell.analysis.TextAnalyser
|
||||
import docspell.analysis.classifier.TextClassifier.Data
|
||||
import docspell.common._
|
||||
import docspell.joex.scheduler._
|
||||
|
||||
object LearnItemEntities {
|
||||
def learnAll[F[_]: Sync: ContextShift, A](
|
||||
analyser: TextAnalyser[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Task[F, A, Unit] =
|
||||
learnCorrOrg(analyser, collective, maxItems, maxTextLen)
|
||||
.flatMap(_ => learnCorrPerson[F, A](analyser, collective, maxItems, maxTextLen))
|
||||
.flatMap(_ => learnConcPerson(analyser, collective, maxItems, maxTextLen))
|
||||
.flatMap(_ => learnConcEquip(analyser, collective, maxItems, maxTextLen))
|
||||
|
||||
def learnCorrOrg[F[_]: Sync: ContextShift, A](
|
||||
analyser: TextAnalyser[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Task[F, A, Unit] =
|
||||
learn(analyser, collective)(
|
||||
ClassifierName.correspondentOrg,
|
||||
ctx => SelectItems.forCorrOrg(ctx.store, collective, maxItems, maxTextLen)
|
||||
)
|
||||
|
||||
def learnCorrPerson[F[_]: Sync: ContextShift, A](
|
||||
analyser: TextAnalyser[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Task[F, A, Unit] =
|
||||
learn(analyser, collective)(
|
||||
ClassifierName.correspondentPerson,
|
||||
ctx => SelectItems.forCorrPerson(ctx.store, collective, maxItems, maxTextLen)
|
||||
)
|
||||
|
||||
def learnConcPerson[F[_]: Sync: ContextShift, A](
|
||||
analyser: TextAnalyser[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Task[F, A, Unit] =
|
||||
learn(analyser, collective)(
|
||||
ClassifierName.concernedPerson,
|
||||
ctx => SelectItems.forConcPerson(ctx.store, collective, maxItems, maxTextLen)
|
||||
)
|
||||
|
||||
def learnConcEquip[F[_]: Sync: ContextShift, A](
|
||||
analyser: TextAnalyser[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Task[F, A, Unit] =
|
||||
learn(analyser, collective)(
|
||||
ClassifierName.concernedEquip,
|
||||
ctx => SelectItems.forConcEquip(ctx.store, collective, maxItems, maxTextLen)
|
||||
)
|
||||
|
||||
private def learn[F[_]: Sync: ContextShift, A](
|
||||
analyser: TextAnalyser[F],
|
||||
collective: Ident
|
||||
)(cname: ClassifierName, data: Context[F, _] => Stream[F, Data]): Task[F, A, Unit] =
|
||||
Task { ctx =>
|
||||
ctx.logger.info(s"Learn classifier ${cname.name}") *>
|
||||
analyser.classifier.trainClassifier(ctx.logger, data(ctx))(
|
||||
Kleisli(StoreClassifierModel.handleModel(ctx, collective, cname))
|
||||
)
|
||||
}
|
||||
}
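
For orientation, this is roughly how LearnClassifierTask (shown above) invokes it when classification is enabled; all values come from the joex config and the job context:

LearnItemEntities
  .learnAll(analyser, ctx.args.collective, cfg.classification.itemCount, cfg.maxLength)
  .run(ctx)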
|
@ -0,0 +1,48 @@
|
||||
package docspell.joex.learn
|
||||
|
||||
import cats.data.Kleisli
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.analysis.TextAnalyser
|
||||
import docspell.common._
|
||||
import docspell.joex.scheduler._
|
||||
import docspell.store.records.RClassifierSetting
|
||||
|
||||
object LearnTags {
|
||||
|
||||
def learnTagCategory[F[_]: Sync: ContextShift, A](
|
||||
analyser: TextAnalyser[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
)(
|
||||
category: String
|
||||
): Task[F, A, Unit] =
|
||||
Task { ctx =>
|
||||
val data = SelectItems.forCategory(ctx, collective)(maxItems, category, maxTextLen)
|
||||
ctx.logger.info(s"Learn classifier for tag category: $category") *>
|
||||
analyser.classifier.trainClassifier(ctx.logger, data)(
|
||||
Kleisli(
|
||||
StoreClassifierModel.handleModel(
|
||||
ctx,
|
||||
collective,
|
||||
ClassifierName.tagCategory(category)
|
||||
)
|
||||
)
|
||||
)
|
||||
}
|
||||
|
||||
def learnAllTagCategories[F[_]: Sync: ContextShift, A](analyser: TextAnalyser[F])(
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Task[F, A, Unit] =
|
||||
Task { ctx =>
|
||||
for {
|
||||
cats <- ctx.store.transact(RClassifierSetting.getActiveCategories(collective))
|
||||
task = learnTagCategory[F, A](analyser, collective, maxItems, maxTextLen) _
|
||||
_ <- cats.map(task).traverse(_.run(ctx))
|
||||
} yield ()
|
||||
}
|
||||
}
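
The tag-category counterpart is driven the same way from LearnClassifierTask, with the item limit resolved via itemCountOrWhenLower against the collective's own setting:

LearnTags
  .learnAllTagCategories(analyser)(ctx.args.collective, maxItems, cfg.maxLength)
  .run(ctx)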
|
@ -0,0 +1,109 @@
|
||||
package docspell.joex.learn
|
||||
|
||||
import fs2.{Pipe, Stream}
|
||||
|
||||
import docspell.analysis.classifier.TextClassifier.Data
|
||||
import docspell.common._
|
||||
import docspell.joex.scheduler.Context
|
||||
import docspell.store.Store
|
||||
import docspell.store.qb.Batch
|
||||
import docspell.store.queries.{QItem, TextAndTag}
|
||||
|
||||
import doobie._
|
||||
|
||||
object SelectItems {
|
||||
val pageSep = LearnClassifierTask.pageSep
|
||||
val noClass = LearnClassifierTask.noClass
|
||||
|
||||
def forCategory[F[_]](ctx: Context[F, _], collective: Ident)(
|
||||
maxItems: Int,
|
||||
category: String,
|
||||
maxTextLen: Int
|
||||
): Stream[F, Data] =
|
||||
forCategory(ctx.store, collective, maxItems, category, maxTextLen)
|
||||
|
||||
def forCategory[F[_]](
|
||||
store: Store[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
category: String,
|
||||
maxTextLen: Int
|
||||
): Stream[F, Data] = {
|
||||
val connStream =
|
||||
allItems(collective, maxItems)
|
||||
.evalMap(item =>
|
||||
QItem.resolveTextAndTag(collective, item, category, maxTextLen, pageSep)
|
||||
)
|
||||
.through(mkData)
|
||||
store.transact(connStream)
|
||||
}
|
||||
|
||||
def forCorrOrg[F[_]](
|
||||
store: Store[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Stream[F, Data] = {
|
||||
val connStream =
|
||||
allItems(collective, maxItems)
|
||||
.evalMap(item =>
|
||||
QItem.resolveTextAndCorrOrg(collective, item, maxTextLen, pageSep)
|
||||
)
|
||||
.through(mkData)
|
||||
store.transact(connStream)
|
||||
}
|
||||
|
||||
def forCorrPerson[F[_]](
|
||||
store: Store[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Stream[F, Data] = {
|
||||
val connStream =
|
||||
allItems(collective, maxItems)
|
||||
.evalMap(item =>
|
||||
QItem.resolveTextAndCorrPerson(collective, item, maxTextLen, pageSep)
|
||||
)
|
||||
.through(mkData)
|
||||
store.transact(connStream)
|
||||
}
|
||||
|
||||
def forConcPerson[F[_]](
|
||||
store: Store[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Stream[F, Data] = {
|
||||
val connStream =
|
||||
allItems(collective, maxItems)
|
||||
.evalMap(item =>
|
||||
QItem.resolveTextAndConcPerson(collective, item, maxTextLen, pageSep)
|
||||
)
|
||||
.through(mkData)
|
||||
store.transact(connStream)
|
||||
}
|
||||
|
||||
def forConcEquip[F[_]](
|
||||
store: Store[F],
|
||||
collective: Ident,
|
||||
maxItems: Int,
|
||||
maxTextLen: Int
|
||||
): Stream[F, Data] = {
|
||||
val connStream =
|
||||
allItems(collective, maxItems)
|
||||
.evalMap(item =>
|
||||
QItem.resolveTextAndConcEquip(collective, item, maxTextLen, pageSep)
|
||||
)
|
||||
.through(mkData)
|
||||
store.transact(connStream)
|
||||
}
|
||||
|
||||
private def allItems(collective: Ident, max: Int): Stream[ConnectionIO, Ident] = {
|
||||
val limit = if (max <= 0) Batch.all else Batch.limit(max)
|
||||
QItem.findAllNewesFirst(collective, 10, limit)
|
||||
}
|
||||
|
||||
private def mkData[F[_]]: Pipe[F, TextAndTag, Data] =
|
||||
_.map(tt => Data(tt.tag.map(_.name).getOrElse(noClass), tt.itemId.id, tt.text.trim))
|
||||
.filter(_.text.nonEmpty)
|
||||
}
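
All selectors follow the same shape: stream the newest items of the collective (bounded by maxItems via Batch), resolve the text plus the target label, and drop items without text. A hedged usage sketch for one tag category (the category name is invented):

// maxItems = 600, category = "doctype", maxTextLen = 8000
val data = SelectItems.forCategory(store, collective, 600, "doctype", 8000)
// each Data row carries (tag name or noClass, item id, concatenated page text)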
|
@ -0,0 +1,53 @@
|
||||
package docspell.joex.learn
|
||||
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.analysis.classifier.ClassifierModel
|
||||
import docspell.common._
|
||||
import docspell.joex.scheduler._
|
||||
import docspell.store.Store
|
||||
import docspell.store.records.RClassifierModel
|
||||
|
||||
import bitpeace.MimetypeHint
|
||||
|
||||
object StoreClassifierModel {
|
||||
|
||||
def handleModel[F[_]: Sync: ContextShift](
|
||||
ctx: Context[F, _],
|
||||
collective: Ident,
|
||||
modelName: ClassifierName
|
||||
)(
|
||||
trainedModel: ClassifierModel
|
||||
): F[Unit] =
|
||||
handleModel(ctx.store, ctx.blocker, ctx.logger)(collective, modelName, trainedModel)
|
||||
|
||||
def handleModel[F[_]: Sync: ContextShift](
|
||||
store: Store[F],
|
||||
blocker: Blocker,
|
||||
logger: Logger[F]
|
||||
)(
|
||||
collective: Ident,
|
||||
modelName: ClassifierName,
|
||||
trainedModel: ClassifierModel
|
||||
): F[Unit] =
|
||||
for {
|
||||
oldFile <- store.transact(
|
||||
RClassifierModel.findByName(collective, modelName.name).map(_.map(_.fileId))
|
||||
)
|
||||
_ <- logger.debug(s"Storing new trained model for: ${modelName.name}")
|
||||
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
|
||||
newFile <-
|
||||
store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
|
||||
_ <- store.transact(
|
||||
RClassifierModel.updateFile(collective, modelName.name, Ident.unsafe(newFile.id))
|
||||
)
|
||||
_ <- logger.debug(s"New model stored at file ${newFile.id}")
|
||||
_ <- oldFile match {
|
||||
case Some(fid) =>
|
||||
logger.debug(s"Deleting old model file ${fid.id}") *>
|
||||
store.bitpeace.delete(fid.id).compile.drain
|
||||
case None => ().pure[F]
|
||||
}
|
||||
} yield ()
|
||||
}
|
@ -78,7 +78,14 @@ object AttachmentPageCount {
|
||||
s"No attachmentmeta record exists for ${ra.id.id}. Creating new."
|
||||
) *> ctx.store.transact(
|
||||
RAttachmentMeta.insert(
|
||||
RAttachmentMeta(ra.id, None, Nil, MetaProposalList.empty, md.pageCount.some)
|
||||
RAttachmentMeta(
|
||||
ra.id,
|
||||
None,
|
||||
Nil,
|
||||
MetaProposalList.empty,
|
||||
md.pageCount.some,
|
||||
None
|
||||
)
|
||||
)
|
||||
)
|
||||
else 0.pure[F]
|
||||
|
@ -108,7 +108,18 @@ object ConvertPdf {
|
||||
ctx.logger.info(s"Conversion to pdf+txt successful. Saving file.") *>
|
||||
storePDF(ctx, cfg, ra, pdf)
|
||||
.flatMap(r =>
|
||||
txt.map(t => (r, item.changeMeta(ra.id, _.setContentIfEmpty(t.some)).some))
|
||||
txt.map(t =>
|
||||
(
|
||||
r,
|
||||
item
|
||||
.changeMeta(
|
||||
ra.id,
|
||||
ctx.args.meta.language,
|
||||
_.setContentIfEmpty(t.some)
|
||||
)
|
||||
.some
|
||||
)
|
||||
)
|
||||
)
|
||||
|
||||
case ConversionResult.UnsupportedFormat(mt) =>
|
||||
|
@ -107,6 +107,8 @@ object CreateItem {
|
||||
Vector.empty,
|
||||
fm.map(a => a.id -> a.fileId).toMap,
|
||||
MetaProposalList.empty,
|
||||
Nil,
|
||||
MetaProposalList.empty,
|
||||
Nil
|
||||
)
|
||||
}
|
||||
@ -166,6 +168,8 @@ object CreateItem {
|
||||
Vector.empty,
|
||||
origMap,
|
||||
MetaProposalList.empty,
|
||||
Nil,
|
||||
MetaProposalList.empty,
|
||||
Nil
|
||||
)
|
||||
)
|
||||
|
@ -42,7 +42,7 @@ object ExtractArchive {
|
||||
archive: Option[RAttachmentArchive]
|
||||
): Task[F, ProcessItemArgs, (Option[RAttachmentArchive], ItemData)] =
|
||||
singlePass(item, archive).flatMap { t =>
|
||||
if (t._1 == None) Task.pure(t)
|
||||
if (t._1.isEmpty) Task.pure(t)
|
||||
else multiPass(t._2, t._1)
|
||||
}
|
||||
|
||||
|
@ -17,24 +17,92 @@ import docspell.store.records._
|
||||
* by looking up values from NER in the user's address book.
|
||||
*/
|
||||
object FindProposal {
|
||||
type Args = ProcessItemArgs
|
||||
|
||||
def apply[F[_]: Sync](
|
||||
cfg: Config.Processing
|
||||
)(data: ItemData): Task[F, ProcessItemArgs, ItemData] =
|
||||
cfg: Config.TextAnalysis
|
||||
)(data: ItemData): Task[F, Args, ItemData] =
|
||||
Task { ctx =>
|
||||
val rmas = data.metas.map(rm => rm.copy(nerlabels = removeDuplicates(rm.nerlabels)))
|
||||
|
||||
ctx.logger.info("Starting find-proposal") *>
|
||||
rmas
|
||||
for {
|
||||
_ <- ctx.logger.info("Starting find-proposal")
|
||||
rmv <- rmas
|
||||
.traverse(rm =>
|
||||
processAttachment(cfg, rm, data.findDates(rm), ctx)
|
||||
.map(ml => rm.copy(proposals = ml))
|
||||
)
|
||||
.map(rmv => data.copy(metas = rmv))
|
||||
clp <- lookupClassifierProposals(ctx, data.classifyProposals)
|
||||
} yield data.copy(metas = rmv, classifyProposals = clp)
|
||||
}
|
||||
|
||||
def lookupClassifierProposals[F[_]: Sync](
|
||||
ctx: Context[F, Args],
|
||||
mpList: MetaProposalList
|
||||
): F[MetaProposalList] = {
|
||||
val coll = ctx.args.meta.collective
|
||||
|
||||
def lookup(mp: MetaProposal): F[Option[IdRef]] =
|
||||
mp.proposalType match {
|
||||
case MetaProposalType.CorrOrg =>
|
||||
ctx.store
|
||||
.transact(
|
||||
ROrganization
|
||||
.findLike(coll, mp.values.head.ref.name.toLowerCase)
|
||||
.map(_.headOption)
|
||||
)
|
||||
.flatTap(oref =>
|
||||
ctx.logger.debug(s"Found classifier organization for $mp: $oref")
|
||||
)
|
||||
case MetaProposalType.CorrPerson =>
|
||||
ctx.store
|
||||
.transact(
|
||||
RPerson
|
||||
.findLike(coll, mp.values.head.ref.name.toLowerCase, false)
|
||||
.map(_.headOption)
|
||||
)
|
||||
.flatTap(oref =>
|
||||
ctx.logger.debug(s"Found classifier corr-person for $mp: $oref")
|
||||
)
|
||||
case MetaProposalType.ConcPerson =>
|
||||
ctx.store
|
||||
.transact(
|
||||
RPerson
|
||||
.findLike(coll, mp.values.head.ref.name.toLowerCase, true)
|
||||
.map(_.headOption)
|
||||
)
|
||||
.flatTap(oref =>
|
||||
ctx.logger.debug(s"Found classifier conc-person for $mp: $oref")
|
||||
)
|
||||
case MetaProposalType.ConcEquip =>
|
||||
ctx.store
|
||||
.transact(
|
||||
REquipment
|
||||
.findLike(coll, mp.values.head.ref.name.toLowerCase)
|
||||
.map(_.headOption)
|
||||
)
|
||||
.flatTap(oref =>
|
||||
ctx.logger.debug(s"Found classifier conc-equip for $mp: $oref")
|
||||
)
|
||||
case MetaProposalType.DocDate =>
|
||||
(None: Option[IdRef]).pure[F]
|
||||
|
||||
case MetaProposalType.DueDate =>
|
||||
(None: Option[IdRef]).pure[F]
|
||||
}
|
||||
|
||||
def updateRef(mp: MetaProposal)(idRef: Option[IdRef]): Option[MetaProposal] =
|
||||
idRef // this proposal contains a single value only, since coming from classifier
|
||||
.map(ref => mp.copy(values = mp.values.map(_.copy(ref = ref))))
|
||||
|
||||
ctx.logger.debug(s"Looking up classifier results: ${mpList.proposals}") *>
|
||||
mpList.proposals
|
||||
.traverse(mp => lookup(mp).map(updateRef(mp)))
|
||||
.map(_.flatten)
|
||||
.map(MetaProposalList.apply)
|
||||
}
|
||||
|
||||
def processAttachment[F[_]: Sync](
|
||||
cfg: Config.Processing,
|
||||
cfg: Config.TextAnalysis,
|
||||
rm: RAttachmentMeta,
|
||||
rd: Vector[NerDateLabel],
|
||||
ctx: Context[F, ProcessItemArgs]
|
||||
@ -46,11 +114,11 @@ object FindProposal {
|
||||
}
|
||||
|
||||
def makeDateProposal[F[_]: Sync](
|
||||
cfg: Config.Processing,
|
||||
cfg: Config.TextAnalysis,
|
||||
dates: Vector[NerDateLabel]
|
||||
): F[MetaProposalList] =
|
||||
Timestamp.current[F].map { now =>
|
||||
val maxFuture = now.plus(Duration.years(cfg.maxDueDateYears.toLong))
|
||||
val maxFuture = now.plus(Duration.years(cfg.nlp.maxDueDateYears.toLong))
|
||||
val latestFirst = dates
|
||||
.filter(_.date.isBefore(maxFuture.toUtcDate))
|
||||
.sortWith((l1, l2) => l1.date.isAfter(l2.date))
|
||||
|
@ -15,6 +15,9 @@ import docspell.store.records.{RAttachment, RAttachmentMeta, RItem}
|
||||
* containing the source or origin file
|
||||
* @param givenMeta meta data to this item that was not "guessed"
|
||||
* from an attachment but given and thus is always correct
|
||||
* @param classifyProposals these are proposals that were obtained by
|
||||
* a trained classifier. There are no ner-tags; it will only provide a
|
||||
* single label
|
||||
*/
|
||||
case class ItemData(
|
||||
item: RItem,
|
||||
@ -23,7 +26,11 @@ case class ItemData(
|
||||
dateLabels: Vector[AttachmentDates],
|
||||
originFile: Map[Ident, Ident], // maps RAttachment.id -> FileMeta.id
|
||||
givenMeta: MetaProposalList, // given meta data not associated to a specific attachment
|
||||
tags: List[String] // a list of tags (names or ids) attached to the item if they exist
|
||||
// a list of tags (names or ids) attached to the item if they exist
|
||||
tags: List[String],
|
||||
// proposals obtained from the classifier
|
||||
classifyProposals: MetaProposalList,
|
||||
classifyTags: List[String]
|
||||
) {
|
||||
|
||||
def findMeta(attachId: Ident): Option[RAttachmentMeta] =
|
||||
@ -32,8 +39,12 @@ case class ItemData(
|
||||
def findDates(rm: RAttachmentMeta): Vector[NerDateLabel] =
|
||||
dateLabels.find(m => m.rm.id == rm.id).map(_.dates).getOrElse(Vector.empty)
|
||||
|
||||
def mapMeta(attachId: Ident, f: RAttachmentMeta => RAttachmentMeta): ItemData = {
|
||||
val item = changeMeta(attachId, f)
|
||||
def mapMeta(
|
||||
attachId: Ident,
|
||||
lang: Language,
|
||||
f: RAttachmentMeta => RAttachmentMeta
|
||||
): ItemData = {
|
||||
val item = changeMeta(attachId, lang, f)
|
||||
val next = metas.map(a => if (a.id == attachId) item else a)
|
||||
copy(metas = next)
|
||||
}
|
||||
@ -43,13 +54,14 @@ case class ItemData(
|
||||
|
||||
def changeMeta(
|
||||
attachId: Ident,
|
||||
lang: Language,
|
||||
f: RAttachmentMeta => RAttachmentMeta
|
||||
): RAttachmentMeta =
|
||||
f(findOrCreate(attachId))
|
||||
f(findOrCreate(attachId, lang))
|
||||
|
||||
def findOrCreate(attachId: Ident): RAttachmentMeta =
|
||||
def findOrCreate(attachId: Ident, lang: Language): RAttachmentMeta =
|
||||
metas.find(_.id == attachId).getOrElse {
|
||||
RAttachmentMeta.empty(attachId)
|
||||
RAttachmentMeta.empty(attachId, lang)
|
||||
}
|
||||
|
||||
}
|
||||
|
@ -24,6 +24,7 @@ object LinkProposal {
|
||||
.flatten(data.metas.map(_.proposals))
|
||||
.filter(_.proposalType != MetaProposalType.DocDate)
|
||||
.sortByWeights
|
||||
.fillEmptyFrom(data.classifyProposals)
|
||||
|
||||
ctx.logger.info(s"Starting linking proposals") *>
|
||||
MetaProposalType.all
|
||||
|
@ -41,7 +41,7 @@ object ProcessItem {
|
||||
regexNer: RegexNerFile[F]
|
||||
)(item: ItemData): Task[F, ProcessItemArgs, ItemData] =
|
||||
TextAnalysis[F](cfg.textAnalysis, analyser, regexNer)(item)
|
||||
.flatMap(FindProposal[F](cfg.processing))
|
||||
.flatMap(FindProposal[F](cfg.textAnalysis))
|
||||
.flatMap(EvalProposals[F])
|
||||
.flatMap(SaveProposals[F])
|
||||
|
||||
|
@ -65,6 +65,8 @@ object ReProcessItem {
|
||||
Vector.empty,
|
||||
asrcMap.view.mapValues(_.fileId).toMap,
|
||||
MetaProposalList.empty,
|
||||
Nil,
|
||||
MetaProposalList.empty,
|
||||
Nil
|
||||
)).getOrElseF(
|
||||
Sync[F].raiseError(new Exception(s"Item not found: ${ctx.args.itemId.id}"))
|
||||
|
@ -4,21 +4,51 @@ import cats.effect.Sync
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.common._
|
||||
import docspell.joex.scheduler.Task
|
||||
import docspell.joex.scheduler.{Context, Task}
|
||||
import docspell.store.AddResult
|
||||
import docspell.store.records._
|
||||
|
||||
/** Saves the proposals in the database
|
||||
*/
|
||||
object SaveProposals {
|
||||
type Args = ProcessItemArgs
|
||||
|
||||
def apply[F[_]: Sync](data: ItemData): Task[F, ProcessItemArgs, ItemData] =
|
||||
def apply[F[_]: Sync](data: ItemData): Task[F, Args, ItemData] =
|
||||
Task { ctx =>
|
||||
ctx.logger.info("Storing proposals") *>
|
||||
data.metas
|
||||
for {
|
||||
_ <- ctx.logger.info("Storing proposals")
|
||||
_ <- data.metas
|
||||
.traverse(rm =>
|
||||
ctx.logger.debug(s"Storing attachment proposals: ${rm.proposals}") *>
|
||||
ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
|
||||
ctx.logger.debug(
|
||||
s"Storing attachment proposals: ${rm.proposals}"
|
||||
) *> ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
|
||||
)
|
||||
.map(_ => data)
|
||||
_ <-
|
||||
if (data.classifyProposals.isEmpty && data.classifyTags.isEmpty) 0.pure[F]
|
||||
else saveItemProposal(ctx, data)
|
||||
} yield data
|
||||
}
|
||||
|
||||
def saveItemProposal[F[_]: Sync](ctx: Context[F, Args], data: ItemData): F[Unit] = {
|
||||
def upsert(v: RItemProposal): F[Int] =
|
||||
ctx.store.add(RItemProposal.insert(v), RItemProposal.exists(v.itemId)).flatMap {
|
||||
case AddResult.Success => 1.pure[F]
|
||||
case AddResult.EntityExists(_) =>
|
||||
ctx.store.transact(RItemProposal.update(v))
|
||||
case AddResult.Failure(ex) =>
|
||||
ctx.logger.warn(s"Could not store item proposals: ${ex.getMessage}") *> 0
|
||||
.pure[F]
|
||||
}
|
||||
|
||||
for {
|
||||
_ <- ctx.logger.debug(s"Storing classifier proposals: ${data.classifyProposals}")
|
||||
tags <- ctx.store.transact(
|
||||
RTag.findAllByNameOrId(data.classifyTags, ctx.args.meta.collective)
|
||||
)
|
||||
tagRefs = tags.map(t => IdRef(t.tagId, t.name))
|
||||
now <- Timestamp.current[F]
|
||||
value = RItemProposal(data.item.id, data.classifyProposals, tagRefs.toList, now)
|
||||
_ <- upsert(value)
|
||||
} yield ()
|
||||
}
|
||||
}
|
||||
|
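The new saveItemProposal uses an insert-first upsert: try the insert, fall back to an update when the row already exists, and only warn on failure. A hedged standalone sketch of that flow, with docspell's AddResult modelled as a plain ADT and logging omitted:

sealed trait AddResult
case object Success extends AddResult
final case class EntityExists(msg: String) extends AddResult
final case class Failure(ex: Throwable) extends AddResult

// returns the number of affected rows, mirroring the match in saveItemProposal
def upsert(tryInsert: () => AddResult, update: () => Int): Int =
  tryInsert() match {
    case Success         => 1
    case EntityExists(_) => update()
    case Failure(_)      => 0 // the real code logs a warning here
  }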
@ -45,7 +45,8 @@ object SetGivenData {
|
||||
Task { ctx =>
|
||||
val itemId = data.item.id
|
||||
val collective = ctx.args.meta.collective
|
||||
val tags = (ctx.args.meta.tags.getOrElse(Nil) ++ data.tags).distinct
|
||||
val tags =
|
||||
(ctx.args.meta.tags.getOrElse(Nil) ++ data.tags ++ data.classifyTags).distinct
|
||||
for {
|
||||
_ <- ctx.logger.info(s"Set tags from given data: ${tags}")
|
||||
e <- ops.linkTags(itemId, tags, collective).attempt
|
||||
|
@ -1,24 +1,20 @@
|
||||
package docspell.joex.process
|
||||
|
||||
import cats.data.OptionT
|
||||
import cats.Traverse
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.analysis.TextAnalyser
|
||||
import docspell.analysis.nlp.ClassifierModel
|
||||
import docspell.analysis.nlp.StanfordNerSettings
|
||||
import docspell.analysis.nlp.TextClassifier
|
||||
import docspell.analysis.classifier.TextClassifier
|
||||
import docspell.analysis.{NlpSettings, TextAnalyser}
|
||||
import docspell.common.MetaProposal.Candidate
|
||||
import docspell.common._
|
||||
import docspell.joex.Config
|
||||
import docspell.joex.analysis.RegexNerFile
|
||||
import docspell.joex.learn.LearnClassifierTask
|
||||
import docspell.joex.learn.{ClassifierName, Classify, LearnClassifierTask}
|
||||
import docspell.joex.process.ItemData.AttachmentDates
|
||||
import docspell.joex.scheduler.Context
|
||||
import docspell.joex.scheduler.Task
|
||||
import docspell.store.records.RAttachmentMeta
|
||||
import docspell.store.records.RClassifierSetting
|
||||
|
||||
import bitpeace.RangeDef
|
||||
import docspell.store.records.{RAttachmentMeta, RClassifierSetting}
|
||||
|
||||
object TextAnalysis {
|
||||
type Args = ProcessItemArgs
|
||||
@ -41,13 +37,27 @@ object TextAnalysis {
|
||||
_ <- t.traverse(m =>
|
||||
ctx.store.transact(RAttachmentMeta.updateLabels(m._1.id, m._1.nerlabels))
|
||||
)
|
||||
|
||||
v = t.toVector
|
||||
autoTagEnabled <- getActiveAutoTag(ctx, cfg)
|
||||
tag <-
|
||||
if (autoTagEnabled) predictTags(ctx, cfg, item.metas, analyser.classifier)
|
||||
else List.empty[String].pure[F]
|
||||
|
||||
classProposals <-
|
||||
if (cfg.classification.enabled)
|
||||
predictItemEntities(ctx, cfg, item.metas, analyser.classifier)
|
||||
else MetaProposalList.empty.pure[F]
|
||||
|
||||
e <- s
|
||||
_ <- ctx.logger.info(s"Text-Analysis finished in ${e.formatExact}")
|
||||
v = t.toVector
|
||||
tag <- predictTag(ctx, cfg, item.metas, analyser.classifier(ctx.blocker)).value
|
||||
} yield item
|
||||
.copy(metas = v.map(_._1), dateLabels = v.map(_._2))
|
||||
.appendTags(tag.toSeq)
|
||||
.copy(
|
||||
metas = v.map(_._1),
|
||||
dateLabels = v.map(_._2),
|
||||
classifyProposals = classProposals,
|
||||
classifyTags = tag
|
||||
)
|
||||
}
|
||||
|
||||
def annotateAttachment[F[_]: Sync](
|
||||
@ -55,7 +65,7 @@ object TextAnalysis {
|
||||
analyser: TextAnalyser[F],
|
||||
nerFile: RegexNerFile[F]
|
||||
)(rm: RAttachmentMeta): F[(RAttachmentMeta, AttachmentDates)] = {
|
||||
val settings = StanfordNerSettings(ctx.args.meta.language, false, None)
|
||||
val settings = NlpSettings(ctx.args.meta.language, false, None)
|
||||
for {
|
||||
customNer <- nerFile.makeFile(ctx.args.meta.collective)
|
||||
sett = settings.copy(regexNer = customNer)
|
||||
@ -68,44 +78,84 @@ object TextAnalysis {
|
||||
} yield (rm.copy(nerlabels = labels.all.toList), AttachmentDates(rm, labels.dates))
|
||||
}
|
||||
|
||||
def predictTag[F[_]: Sync: ContextShift](
|
||||
def predictTags[F[_]: Sync: ContextShift](
|
||||
ctx: Context[F, Args],
|
||||
cfg: Config.TextAnalysis,
|
||||
metas: Vector[RAttachmentMeta],
|
||||
classifier: TextClassifier[F]
|
||||
): OptionT[F, String] =
|
||||
for {
|
||||
model <- findActiveModel(ctx, cfg)
|
||||
_ <- OptionT.liftF(ctx.logger.info(s"Guessing tag …"))
|
||||
text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
|
||||
modelData =
|
||||
ctx.store.bitpeace
|
||||
.get(model.id)
|
||||
.unNoneTerminate
|
||||
.through(ctx.store.bitpeace.fetchData2(RangeDef.all))
|
||||
cls <- OptionT(File.withTempDir(cfg.workingDir, "classify").use { dir =>
|
||||
val modelFile = dir.resolve("model.ser.gz")
|
||||
modelData
|
||||
.through(fs2.io.file.writeAll(modelFile, ctx.blocker))
|
||||
.compile
|
||||
.drain
|
||||
.flatMap(_ => classifier.classify(ctx.logger, ClassifierModel(modelFile), text))
|
||||
}).filter(_ != LearnClassifierTask.noClass)
|
||||
_ <- OptionT.liftF(ctx.logger.debug(s"Guessed tag: ${cls}"))
|
||||
} yield cls
|
||||
): F[List[String]] = {
|
||||
val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
|
||||
val classifyWith: ClassifierName => F[Option[String]] =
|
||||
makeClassify(ctx, cfg, classifier)(text)
|
||||
|
||||
private def findActiveModel[F[_]: Sync](
|
||||
for {
|
||||
names <- ctx.store.transact(
|
||||
ClassifierName.findTagClassifiers(ctx.args.meta.collective)
|
||||
)
|
||||
_ <- ctx.logger.debug(s"Guessing tags for ${names.size} categories")
|
||||
tags <- names.traverse(classifyWith)
|
||||
} yield tags.flatten
|
||||
}
|
||||
|
||||
def predictItemEntities[F[_]: Sync: ContextShift](
|
||||
ctx: Context[F, Args],
|
||||
cfg: Config.TextAnalysis
|
||||
): OptionT[F, Ident] =
|
||||
(if (cfg.classification.enabled)
|
||||
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.meta.collective)))
|
||||
.filter(_.enabled)
|
||||
.mapFilter(_.fileId)
|
||||
else
|
||||
OptionT.none[F, Ident]).orElse(
|
||||
OptionT.liftF(ctx.logger.info("Classification is disabled.")) *> OptionT
|
||||
.none[F, Ident]
|
||||
cfg: Config.TextAnalysis,
|
||||
metas: Vector[RAttachmentMeta],
|
||||
classifier: TextClassifier[F]
|
||||
): F[MetaProposalList] = {
|
||||
val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
|
||||
|
||||
def classifyWith(
|
||||
cname: ClassifierName,
|
||||
mtype: MetaProposalType
|
||||
): F[Option[MetaProposal]] =
|
||||
for {
|
||||
label <- makeClassify(ctx, cfg, classifier)(text).apply(cname)
|
||||
} yield label.map(str =>
|
||||
MetaProposal(mtype, Candidate(IdRef(Ident.unsafe(""), str), Set.empty))
|
||||
)
|
||||
|
||||
Traverse[List]
|
||||
.sequence(
|
||||
List(
|
||||
classifyWith(ClassifierName.correspondentOrg, MetaProposalType.CorrOrg),
|
||||
classifyWith(ClassifierName.correspondentPerson, MetaProposalType.CorrPerson),
|
||||
classifyWith(ClassifierName.concernedPerson, MetaProposalType.ConcPerson),
|
||||
classifyWith(ClassifierName.concernedEquip, MetaProposalType.ConcEquip)
|
||||
)
|
||||
)
|
||||
.map(_.flatten)
|
||||
.map(MetaProposalList.apply)
|
||||
}
|
||||
|
||||
private def makeClassify[F[_]: Sync: ContextShift](
|
||||
ctx: Context[F, Args],
|
||||
cfg: Config.TextAnalysis,
|
||||
classifier: TextClassifier[F]
|
||||
)(text: String): ClassifierName => F[Option[String]] =
|
||||
Classify[F](
|
||||
ctx.blocker,
|
||||
ctx.logger,
|
||||
cfg.workingDir,
|
||||
ctx.store,
|
||||
classifier,
|
||||
ctx.args.meta.collective,
|
||||
text
|
||||
)
|
||||
|
||||
private def getActiveAutoTag[F[_]: Sync](
|
||||
ctx: Context[F, Args],
|
||||
cfg: Config.TextAnalysis
|
||||
): F[Boolean] =
|
||||
if (cfg.classification.enabled)
|
||||
ctx.store
|
||||
.transact(RClassifierSetting.findById(ctx.args.meta.collective))
|
||||
.map(_.exists(_.autoTagEnabled))
|
||||
.flatTap(enabled =>
|
||||
if (enabled) ().pure[F]
|
||||
else ctx.logger.info("Classification is disabled. Check config or settings.")
|
||||
)
|
||||
else
|
||||
ctx.logger.info("Classification is disabled.") *> false.pure[F]
|
||||
|
||||
}
|
||||
|
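predictTags and predictItemEntities both fan one classifier call out per tag category and keep only the successful labels. A simplified, standalone sketch of that shape (plain strings stand in for ClassifierName, MetaProposalType and MetaProposal; the real code runs in F and loads a stored model per category):

// one prediction per category, dropping categories the classifier has no label for
def predictTags(categories: List[String], classify: String => Option[String]): List[String] =
  categories.flatMap(classify(_))

// pair each classifier name with the proposal type it should produce
def predictEntities(
    classify: String => Option[String],
    mapping: List[(String, String)]
): List[(String, String)] =
  mapping.flatMap { case (category, proposalType) =>
    classify(category).map(label => proposalType -> label)
  }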
@ -46,10 +46,14 @@ object TextExtraction {
|
||||
)
|
||||
_ <- fts.indexData(ctx.logger, (idxItem +: txt.map(_.td)).toSeq: _*)
|
||||
dur <- start
|
||||
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}")
|
||||
extractedTags = txt.flatMap(_.tags).distinct.toList
|
||||
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}.")
|
||||
_ <-
|
||||
if (extractedTags.isEmpty) ().pure[F]
|
||||
else ctx.logger.debug(s"Found tags in file: $extractedTags")
|
||||
} yield item
|
||||
.copy(metas = txt.map(_.am))
|
||||
.appendTags(txt.flatMap(_.tags).distinct.toList)
|
||||
.appendTags(extractedTags)
|
||||
}
|
||||
|
||||
// -- helpers
|
||||
@ -78,7 +82,7 @@ object TextExtraction {
|
||||
pair._2
|
||||
)
|
||||
|
||||
val rm = item.findOrCreate(ra.id)
|
||||
val rm = item.findOrCreate(ra.id, lang)
|
||||
rm.content match {
|
||||
case Some(_) =>
|
||||
ctx.logger.info("TextExtraction skipped, since text is already available.") *>
|
||||
@ -102,6 +106,7 @@ object TextExtraction {
|
||||
res <- extractTextFallback(ctx, cfg, ra, lang)(fids)
|
||||
meta = item.changeMeta(
|
||||
ra.id,
|
||||
lang,
|
||||
rm =>
|
||||
rm.setContentIfEmpty(
|
||||
res.map(_.appendPdfMetaToText.text.trim).filter(_.nonEmpty)
|
||||
|
@ -9,7 +9,7 @@ servers:
|
||||
description: Current host
|
||||
|
||||
paths:
|
||||
/api/info:
|
||||
/api/info/version:
|
||||
get:
|
||||
tags: [ Api Info ]
|
||||
summary: Get basic information about this software.
|
||||
|
@ -4850,14 +4850,11 @@ components:
|
||||
description: |
|
||||
Settings for learning a document classifier.
|
||||
required:
|
||||
- enabled
|
||||
- schedule
|
||||
- itemCount
|
||||
- categoryList
|
||||
- listType
|
||||
properties:
|
||||
enabled:
|
||||
type: boolean
|
||||
category:
|
||||
type: string
|
||||
itemCount:
|
||||
type: integer
|
||||
format: int32
|
||||
@ -4867,6 +4864,16 @@ components:
|
||||
schedule:
|
||||
type: string
|
||||
format: calevent
|
||||
categoryList:
|
||||
type: array
|
||||
items:
|
||||
type: string
|
||||
listType:
|
||||
type: string
|
||||
format: listtype
|
||||
enum:
|
||||
- blacklist
|
||||
- whitelist
|
||||
|
||||
SourceList:
|
||||
description: |
|
||||
|
@ -6,7 +6,7 @@ import cats.implicits._
|
||||
import docspell.backend.BackendApp
|
||||
import docspell.backend.auth.AuthToken
|
||||
import docspell.backend.ops.OCollective
|
||||
import docspell.common.MakePreviewArgs
|
||||
import docspell.common.{ListType, MakePreviewArgs}
|
||||
import docspell.restapi.model._
|
||||
import docspell.restserver.conv.Conversions
|
||||
import docspell.restserver.http4s._
|
||||
@ -44,10 +44,10 @@ object CollectiveRoutes {
|
||||
settings.integrationEnabled,
|
||||
Some(
|
||||
OCollective.Classifier(
|
||||
settings.classifier.enabled,
|
||||
settings.classifier.schedule,
|
||||
settings.classifier.itemCount,
|
||||
settings.classifier.category
|
||||
settings.classifier.categoryList,
|
||||
settings.classifier.listType
|
||||
)
|
||||
)
|
||||
)
|
||||
@ -65,12 +65,12 @@ object CollectiveRoutes {
|
||||
c.language,
|
||||
c.integrationEnabled,
|
||||
ClassifierSetting(
|
||||
c.classifier.map(_.enabled).getOrElse(false),
|
||||
c.classifier.flatMap(_.category),
|
||||
c.classifier.map(_.itemCount).getOrElse(0),
|
||||
c.classifier
|
||||
.map(_.schedule)
|
||||
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00"))
|
||||
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00")),
|
||||
c.classifier.map(_.categories).getOrElse(Nil),
|
||||
c.classifier.map(_.listType).getOrElse(ListType.whitelist)
|
||||
)
|
||||
)
|
||||
)
|
||||
|
@ -0,0 +1,35 @@
|
||||
ALTER TABLE "attachmentmeta"
|
||||
ADD COLUMN "language" varchar(254);
|
||||
|
||||
update "attachmentmeta"
|
||||
set "language" = 'deu'
|
||||
where "attachid" in (
|
||||
select "m"."attachid"
|
||||
from "attachmentmeta" m
|
||||
inner join "attachment" a on "a"."attachid" = "m"."attachid"
|
||||
inner join "item" i on "a"."itemid" = "i"."itemid"
|
||||
inner join "collective" c on "c"."cid" = "i"."cid"
|
||||
where "c"."doclang" = 'deu'
|
||||
);
|
||||
|
||||
update "attachmentmeta"
|
||||
set "language" = 'eng'
|
||||
where "attachid" in (
|
||||
select "m"."attachid"
|
||||
from "attachmentmeta" m
|
||||
inner join "attachment" a on "a"."attachid" = "m"."attachid"
|
||||
inner join "item" i on "a"."itemid" = "i"."itemid"
|
||||
inner join "collective" c on "c"."cid" = "i"."cid"
|
||||
where "c"."doclang" = 'eng'
|
||||
);
|
||||
|
||||
update "attachmentmeta"
|
||||
set "language" = 'fra'
|
||||
where "attachid" in (
|
||||
select "m"."attachid"
|
||||
from "attachmentmeta" m
|
||||
inner join "attachment" a on "a"."attachid" = "m"."attachid"
|
||||
inner join "item" i on "a"."itemid" = "i"."itemid"
|
||||
inner join "collective" c on "c"."cid" = "i"."cid"
|
||||
where "c"."doclang" = 'fra'
|
||||
);
|
@ -0,0 +1,44 @@
|
||||
CREATE TABLE "classifier_model"(
|
||||
"id" varchar(254) not null primary key,
|
||||
"cid" varchar(254) not null,
|
||||
"name" varchar(254) not null,
|
||||
"file_id" varchar(254) not null,
|
||||
"created" timestamp not null,
|
||||
foreign key ("cid") references "collective"("cid"),
|
||||
foreign key ("file_id") references "filemeta"("id"),
|
||||
unique ("cid", "name")
|
||||
);
|
||||
|
||||
insert into "classifier_model"
|
||||
select random_uuid() as "id", "cid", concat('tagcategory-', "category") as "name", "file_id", "created"
|
||||
from "classifier_setting"
|
||||
where "file_id" is not null;
|
||||
|
||||
alter table "classifier_setting"
|
||||
add column "categories" text;
|
||||
|
||||
alter table "classifier_setting"
|
||||
add column "category_list_type" varchar(254);
|
||||
|
||||
update "classifier_setting"
|
||||
set "category_list_type" = 'whitelist';
|
||||
|
||||
update "classifier_setting"
|
||||
set "categories" = concat('["', "category", '"]')
|
||||
where category is not null;
|
||||
|
||||
update "classifier_setting"
|
||||
set "categories" = '[]'
|
||||
where category is null;
|
||||
|
||||
alter table "classifier_setting"
|
||||
drop column "category";
|
||||
|
||||
alter table "classifier_setting"
|
||||
drop column "file_id";
|
||||
|
||||
ALTER TABLE "classifier_setting"
|
||||
ALTER COLUMN "categories" SET NOT NULL;
|
||||
|
||||
ALTER TABLE "classifier_setting"
|
||||
ALTER COLUMN "category_list_type" SET NOT NULL;
|
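The migration derives one classifier_model row per tag category, naming it with a "tagcategory-" prefix. A small illustrative helper for that naming convention (the function name is made up; only the prefix is taken from the inserted rows):

def tagCategoryModelName(category: String): String =
  s"tagcategory-$category"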
@ -0,0 +1,7 @@
|
||||
CREATE TABLE "item_proposal" (
|
||||
"itemid" varchar(254) not null primary key,
|
||||
"classifier_proposals" text not null,
|
||||
"classifier_tags" text not null,
|
||||
"created" timestamp not null,
|
||||
foreign key ("itemid") references "item"("itemid")
|
||||
);
|
@ -0,0 +1,14 @@
|
||||
ALTER TABLE `attachmentmeta`
|
||||
ADD COLUMN (`language` varchar(254));
|
||||
|
||||
update `attachmentmeta` `m`
|
||||
inner join (
|
||||
select `m`.`attachid`, `c`.`doclang`
|
||||
from `attachmentmeta` m
|
||||
inner join `attachment` a on `a`.`attachid` = `m`.`attachid`
|
||||
inner join `item` i on `a`.`itemid` = `i`.`itemid`
|
||||
inner join `collective` c on `c`.`cid` = `i`.`cid`
|
||||
) as `c`
|
||||
set `m`.`language` = `c`.`doclang`
|
||||
where `m`.`attachid` = `c`.`attachid` and `m`.`language` is null;
|
||||
|
@ -0,0 +1,48 @@
|
||||
CREATE TABLE `classifier_model`(
|
||||
`id` varchar(254) not null primary key,
|
||||
`cid` varchar(254) not null,
|
||||
`name` varchar(254) not null,
|
||||
`file_id` varchar(254) not null,
|
||||
`created` timestamp not null,
|
||||
foreign key (`cid`) references `collective`(`cid`),
|
||||
foreign key (`file_id`) references `filemeta`(`id`),
|
||||
unique (`cid`, `name`)
|
||||
);
|
||||
|
||||
insert into `classifier_model`
|
||||
select md5(rand()) as id, `cid`,concat('tagcategory-', `category`) as `name`, `file_id`, `created`
|
||||
from `classifier_setting`
|
||||
where `file_id` is not null;
|
||||
|
||||
alter table `classifier_setting`
|
||||
add column (`categories` mediumtext);
|
||||
|
||||
alter table `classifier_setting`
|
||||
add column (`category_list_type` varchar(254));
|
||||
|
||||
update `classifier_setting`
|
||||
set `category_list_type` = 'whitelist';
|
||||
|
||||
update `classifier_setting`
|
||||
set `categories` = concat('["', `category`, '"]')
|
||||
where category is not null;
|
||||
|
||||
update `classifier_setting`
|
||||
set `categories` = '[]'
|
||||
where category is null;
|
||||
|
||||
alter table `classifier_setting`
|
||||
drop column `category`;
|
||||
|
||||
-- mariadb requires dropping the constraint manually when dropping a column
|
||||
alter table `classifier_setting`
|
||||
drop constraint `classifier_setting_ibfk_2`;
|
||||
|
||||
alter table `classifier_setting`
|
||||
drop column `file_id`;
|
||||
|
||||
ALTER TABLE `classifier_setting`
|
||||
MODIFY `categories` mediumtext NOT NULL;
|
||||
|
||||
ALTER TABLE `classifier_setting`
|
||||
MODIFY `category_list_type` varchar(254) NOT NULL;
|
@ -0,0 +1,7 @@
|
||||
CREATE TABLE `item_proposal` (
|
||||
`itemid` varchar(254) not null primary key,
|
||||
`classifier_proposals` mediumtext not null,
|
||||
`classifier_tags` mediumtext not null,
|
||||
`created` timestamp not null,
|
||||
foreign key (`itemid`) references `item`(`itemid`)
|
||||
);
|
@ -0,0 +1,15 @@
|
||||
ALTER TABLE "attachmentmeta"
|
||||
ADD COLUMN "language" varchar(254);
|
||||
|
||||
with
|
||||
"attachlang" as (
|
||||
select "m"."attachid", "m"."language", "c"."doclang"
|
||||
from "attachmentmeta" m
|
||||
inner join "attachment" a on "a"."attachid" = "m"."attachid"
|
||||
inner join "item" i on "a"."itemid" = "i"."itemid"
|
||||
inner join "collective" c on "c"."cid" = "i"."cid"
|
||||
)
|
||||
update "attachmentmeta" as "m"
|
||||
set "language" = "c"."doclang"
|
||||
from "attachlang" c
|
||||
where "m"."attachid" = "c"."attachid" and "m"."language" is null;
|
@ -0,0 +1,44 @@
|
||||
CREATE TABLE "classifier_model"(
|
||||
"id" varchar(254) not null primary key,
|
||||
"cid" varchar(254) not null,
|
||||
"name" varchar(254) not null,
|
||||
"file_id" varchar(254) not null,
|
||||
"created" timestamp not null,
|
||||
foreign key ("cid") references "collective"("cid"),
|
||||
foreign key ("file_id") references "filemeta"("id"),
|
||||
unique ("cid", "name")
|
||||
);
|
||||
|
||||
insert into "classifier_model"
|
||||
select md5(random()::text) as id, "cid",'tagcategory-' || "category" as "name", "file_id", "created"
|
||||
from "classifier_setting"
|
||||
where "file_id" is not null;
|
||||
|
||||
alter table "classifier_setting"
|
||||
add column "categories" text;
|
||||
|
||||
alter table "classifier_setting"
|
||||
add column "category_list_type" varchar(254);
|
||||
|
||||
update "classifier_setting"
|
||||
set "category_list_type" = 'whitelist';
|
||||
|
||||
update "classifier_setting"
|
||||
set "categories" = concat('["', "category", '"]')
|
||||
where category is not null;
|
||||
|
||||
update "classifier_setting"
|
||||
set "categories" = '[]'
|
||||
where category is null;
|
||||
|
||||
alter table "classifier_setting"
|
||||
drop column "category";
|
||||
|
||||
alter table "classifier_setting"
|
||||
drop column "file_id";
|
||||
|
||||
ALTER TABLE "classifier_setting"
|
||||
ALTER COLUMN "categories" SET NOT NULL;
|
||||
|
||||
ALTER TABLE "classifier_setting"
|
||||
ALTER COLUMN "category_list_type" SET NOT NULL;
|
@ -0,0 +1,7 @@
|
||||
CREATE TABLE "item_proposal" (
|
||||
"itemid" varchar(254) not null primary key,
|
||||
"classifier_proposals" text not null,
|
||||
"classifier_tags" text not null,
|
||||
"created" timestamp not null,
|
||||
foreign key ("itemid") references "item"("itemid")
|
||||
);
|
@ -86,6 +86,9 @@ trait DoobieMeta extends EmilDoobieMeta {
|
||||
implicit val metaItemProposalList: Meta[MetaProposalList] =
|
||||
jsonMeta[MetaProposalList]
|
||||
|
||||
implicit val metaIdRef: Meta[List[IdRef]] =
|
||||
jsonMeta[List[IdRef]]
|
||||
|
||||
implicit val metaLanguage: Meta[Language] =
|
||||
Meta[String].imap(Language.unsafe)(_.iso3)
|
||||
|
||||
@ -97,6 +100,9 @@ trait DoobieMeta extends EmilDoobieMeta {
|
||||
|
||||
implicit val metaCustomFieldType: Meta[CustomFieldType] =
|
||||
Meta[String].timap(CustomFieldType.unsafe)(_.name)
|
||||
|
||||
implicit val metaListType: Meta[ListType] =
|
||||
Meta[String].timap(ListType.unsafeFromString)(_.name)
|
||||
}
|
||||
|
||||
object DoobieMeta extends DoobieMeta {
|
||||
|
@ -1,5 +1,7 @@
|
||||
package docspell.store.qb
|
||||
|
||||
import cats.data.NonEmptyList
|
||||
|
||||
sealed trait DBFunction {}
|
||||
|
||||
object DBFunction {
|
||||
@ -31,6 +33,8 @@ object DBFunction {
|
||||
|
||||
case class Sum(expr: SelectExpr) extends DBFunction
|
||||
|
||||
case class Concat(exprs: NonEmptyList[SelectExpr]) extends DBFunction
|
||||
|
||||
sealed trait Operator
|
||||
object Operator {
|
||||
case object Plus extends Operator
|
||||
|
@ -98,6 +98,9 @@ trait DSL extends DoobieMeta {
|
||||
def substring(expr: SelectExpr, start: Int, length: Int): DBFunction =
|
||||
DBFunction.Substring(expr, start, length)
|
||||
|
||||
def concat(expr: SelectExpr, exprs: SelectExpr*): DBFunction =
|
||||
DBFunction.Concat(Nel.of(expr, exprs: _*))
|
||||
|
||||
def lit[A](value: A)(implicit P: Put[A]): SelectExpr.SelectLit[A] =
|
||||
SelectExpr.SelectLit(value, None)
|
||||
|
||||
|
@ -32,6 +32,10 @@ object DBFunctionBuilder extends CommonBuilder {
|
||||
case DBFunction.Substring(expr, start, len) =>
|
||||
sql"SUBSTRING(" ++ SelectExprBuilder.build(expr) ++ fr" FROM $start FOR $len)"
|
||||
|
||||
case DBFunction.Concat(exprs) =>
|
||||
val inner = exprs.map(SelectExprBuilder.build).toList.reduce(_ ++ comma ++ _)
|
||||
sql"CONCAT(" ++ inner ++ sql")"
|
||||
|
||||
case DBFunction.Calc(op, left, right) =>
|
||||
SelectExprBuilder.build(left) ++
|
||||
buildOperator(op) ++
|
||||
|
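The CONCAT rendering above joins the built sub-expressions with commas inside a single function call. The same reduce-with-separator idea as a standalone sketch, with plain strings in place of doobie fragments:

def renderConcat(exprs: List[String]): String =
  exprs.mkString("CONCAT(", ", ", ")")

// renderConcat(List("a.name", "'-'", "a.id")) == "CONCAT(a.name, '-', a.id)"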
@ -21,6 +21,7 @@ object QAttachment {
|
||||
private val item = RItem.as("i")
|
||||
private val am = RAttachmentMeta.as("am")
|
||||
private val c = RCollective.as("c")
|
||||
private val im = RItemProposal.as("im")
|
||||
|
||||
def deletePreview[F[_]: Sync](store: Store[F])(attachId: Ident): F[Int] = {
|
||||
val findPreview =
|
||||
@ -118,17 +119,27 @@ object QAttachment {
|
||||
} yield ns.sum
|
||||
|
||||
def getMetaProposals(itemId: Ident, coll: Ident): ConnectionIO[MetaProposalList] = {
|
||||
val q = Select(
|
||||
am.proposals.s,
|
||||
val qa = Select(
|
||||
select(am.proposals),
|
||||
from(am)
|
||||
.innerJoin(a, a.id === am.id)
|
||||
.innerJoin(item, a.itemId === item.id),
|
||||
a.itemId === itemId && item.cid === coll
|
||||
).build
|
||||
|
||||
val qi = Select(
|
||||
select(im.classifyProposals),
|
||||
from(im)
|
||||
.innerJoin(item, item.id === im.itemId),
|
||||
item.cid === coll && im.itemId === itemId
|
||||
).build
|
||||
|
||||
for {
|
||||
ml <- q.query[MetaProposalList].to[Vector]
|
||||
} yield MetaProposalList.flatten(ml)
|
||||
mla <- qa.query[MetaProposalList].to[Vector]
|
||||
mli <- qi.query[MetaProposalList].to[Vector]
|
||||
} yield MetaProposalList
|
||||
.flatten(mla)
|
||||
.insertSecond(MetaProposalList.flatten(mli))
|
||||
}
|
||||
|
||||
def getAttachmentMeta(
|
||||
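getMetaProposals now reads both the per-attachment NLP proposals and the classifier proposals stored on the item, merging them so the classifier output only fills in what the NLP proposals left empty (assumed reading of insertSecond). With Maps keyed by proposal type, that merge looks like:

// later entries win on ++, so putting `primary` second keeps the NLP proposals and
// lets classifier proposals only fill the missing keys
def mergeProposals[A](primary: Map[String, A], classifier: Map[String, A]): Map[String, A] =
  classifier ++ primary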
@ -160,7 +171,15 @@ object QAttachment {
|
||||
chunkSize: Int
|
||||
): Stream[ConnectionIO, ContentAndName] =
|
||||
Select(
|
||||
select(a.id, a.itemId, item.cid, item.folder, c.language, a.name, am.content),
|
||||
select(
|
||||
a.id.s,
|
||||
a.itemId.s,
|
||||
item.cid.s,
|
||||
item.folder.s,
|
||||
coalesce(am.language.s, c.language.s).s,
|
||||
a.name.s,
|
||||
am.content.s
|
||||
),
|
||||
from(a)
|
||||
.innerJoin(am, am.id === a.id)
|
||||
.innerJoin(item, item.id === a.itemId)
|
||||
|
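coalesce(am.language, c.language) makes the per-attachment language win and falls back to the collective's document language for rows that have none. The equivalent rule in plain Scala:

def effectiveLanguage(attachmentLang: Option[String], collectiveLang: String): String =
  attachmentLang.getOrElse(collectiveLang)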
@ -1,10 +1,8 @@
|
||||
package docspell.store.queries
|
||||
|
||||
import cats.data.OptionT
|
||||
import fs2.Stream
|
||||
|
||||
import docspell.common.ContactKind
|
||||
import docspell.common.{Direction, Ident}
|
||||
import docspell.common._
|
||||
import docspell.store.qb.DSL._
|
||||
import docspell.store.qb._
|
||||
import docspell.store.records._
|
||||
@ -17,6 +15,7 @@ object QCollective {
|
||||
private val t = RTag.as("t")
|
||||
private val ro = ROrganization.as("o")
|
||||
private val rp = RPerson.as("p")
|
||||
private val re = REquipment.as("e")
|
||||
private val rc = RContact.as("c")
|
||||
private val i = RItem.as("i")
|
||||
|
||||
@ -25,13 +24,37 @@ object QCollective {
|
||||
val empty = Names(Vector.empty, Vector.empty, Vector.empty)
|
||||
}
|
||||
|
||||
def allNames(collective: Ident): ConnectionIO[Names] =
|
||||
(for {
|
||||
orgs <- OptionT.liftF(ROrganization.findAllRef(collective, None, _.name))
|
||||
pers <- OptionT.liftF(RPerson.findAllRef(collective, None, _.name))
|
||||
equp <- OptionT.liftF(REquipment.findAll(collective, None, _.name))
|
||||
} yield Names(orgs.map(_.name), pers.map(_.name), equp.map(_.name)))
|
||||
.getOrElse(Names.empty)
|
||||
def allNames(collective: Ident, maxEntries: Int): ConnectionIO[Names] = {
|
||||
val created = Column[Timestamp]("created", TableDef(""))
|
||||
union(
|
||||
Select(
|
||||
select(ro.name.s, lit(1).as("kind"), ro.created.as(created)),
|
||||
from(ro),
|
||||
ro.cid === collective
|
||||
),
|
||||
Select(
|
||||
select(rp.name.s, lit(2).as("kind"), rp.created.as(created)),
|
||||
from(rp),
|
||||
rp.cid === collective
|
||||
),
|
||||
Select(
|
||||
select(re.name.s, lit(3).as("kind"), re.created.as(created)),
|
||||
from(re),
|
||||
re.cid === collective
|
||||
)
|
||||
).orderBy(created.desc)
|
||||
.limit(Batch.limit(maxEntries))
|
||||
.build
|
||||
.query[(String, Int)]
|
||||
.streamWithChunkSize(maxEntries)
|
||||
.fold(Names.empty) { case (names, (name, kind)) =>
|
||||
if (kind == 1) names.copy(org = names.org :+ name)
|
||||
else if (kind == 2) names.copy(pers = names.pers :+ name)
|
||||
else names.copy(equip = names.equip :+ name)
|
||||
}
|
||||
.compile
|
||||
.lastOrError
|
||||
}
|
||||
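The rewritten allNames tags each unioned sub-select with a literal kind column (1 = organization, 2 = person, 3 = equipment) so a single ordered, limited stream can be folded back into the three name lists. A standalone sketch of that fold over already-fetched rows:

final case class Names(org: Vector[String], pers: Vector[String], equip: Vector[String])

def collectNames(rows: List[(String, Int)]): Names =
  rows.foldLeft(Names(Vector.empty, Vector.empty, Vector.empty)) {
    case (n, (name, 1)) => n.copy(org = n.org :+ name)
    case (n, (name, 2)) => n.copy(pers = n.pers :+ name)
    case (n, (name, _)) => n.copy(equip = n.equip :+ name)
  }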
|
||||
case class InsightData(
|
||||
incoming: Int,
|
||||
|
@ -441,8 +441,9 @@ object QItem {
|
||||
tn <- store.transact(RTagItem.deleteItemTags(itemId))
|
||||
mn <- store.transact(RSentMail.deleteByItem(itemId))
|
||||
cf <- store.transact(RCustomFieldValue.deleteByItem(itemId))
|
||||
im <- store.transact(RItemProposal.deleteByItem(itemId))
|
||||
n <- store.transact(RItem.deleteByIdAndCollective(itemId, collective))
|
||||
} yield tn + rn + n + mn + cf
|
||||
} yield tn + rn + n + mn + cf + im
|
||||
|
||||
private def findByFileIdsQuery(
|
||||
fileMetaIds: Nel[Ident],
|
||||
@ -543,11 +544,13 @@ object QItem {
|
||||
|
||||
def findAllNewesFirst(
|
||||
collective: Ident,
|
||||
chunkSize: Int
|
||||
chunkSize: Int,
|
||||
limit: Batch
|
||||
): Stream[ConnectionIO, Ident] = {
|
||||
val i = RItem.as("i")
|
||||
Select(i.id.s, from(i), i.cid === collective && i.state === ItemState.confirmed)
|
||||
.orderBy(i.created.desc)
|
||||
.limit(limit)
|
||||
.build
|
||||
.query[Ident]
|
||||
.streamWithChunkSize(chunkSize)
|
||||
@ -557,6 +560,7 @@ object QItem {
|
||||
collective: Ident,
|
||||
itemId: Ident,
|
||||
tagCategory: String,
|
||||
maxLen: Int,
|
||||
pageSep: String
|
||||
): ConnectionIO[TextAndTag] = {
|
||||
val tags = TableDef("tags").as("tt")
|
||||
@ -564,7 +568,7 @@ object QItem {
|
||||
val tagsTid = Column[Ident]("tid", tags)
|
||||
val tagsName = Column[String]("tname", tags)
|
||||
|
||||
val q =
|
||||
readTextAndTag(collective, itemId, pageSep) {
|
||||
withCte(
|
||||
tags -> Select(
|
||||
select(ti.itemId.as(tagsItem), tag.tid.as(tagsTid), tag.name.as(tagsName)),
|
||||
@ -574,25 +578,98 @@ object QItem {
|
||||
)
|
||||
)(
|
||||
Select(
|
||||
select(m.content, tagsTid, tagsName),
|
||||
select(substring(m.content.s, 0, maxLen).s, tagsTid.s, tagsName.s),
|
||||
from(i)
|
||||
.innerJoin(a, a.itemId === i.id)
|
||||
.innerJoin(m, a.id === m.id)
|
||||
.leftJoin(tags, tagsItem === i.id),
|
||||
i.id === itemId && i.cid === collective && m.content.isNotNull && m.content <> ""
|
||||
)
|
||||
).build
|
||||
)
|
||||
}
|
||||
}
|
||||
|
||||
def resolveTextAndCorrOrg(
|
||||
collective: Ident,
|
||||
itemId: Ident,
|
||||
maxLen: Int,
|
||||
pageSep: String
|
||||
): ConnectionIO[TextAndTag] =
|
||||
readTextAndTag(collective, itemId, pageSep) {
|
||||
Select(
|
||||
select(substring(m.content.s, 0, maxLen).s, org.oid.s, org.name.s),
|
||||
from(i)
|
||||
.innerJoin(a, a.itemId === i.id)
|
||||
.innerJoin(m, m.id === a.id)
|
||||
.leftJoin(org, org.oid === i.corrOrg),
|
||||
i.id === itemId && m.content.isNotNull && m.content <> ""
|
||||
)
|
||||
}
|
||||
|
||||
def resolveTextAndCorrPerson(
|
||||
collective: Ident,
|
||||
itemId: Ident,
|
||||
maxLen: Int,
|
||||
pageSep: String
|
||||
): ConnectionIO[TextAndTag] =
|
||||
readTextAndTag(collective, itemId, pageSep) {
|
||||
Select(
|
||||
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
|
||||
from(i)
|
||||
.innerJoin(a, a.itemId === i.id)
|
||||
.innerJoin(m, m.id === a.id)
|
||||
.leftJoin(pers0, pers0.pid === i.corrPerson),
|
||||
i.id === itemId && m.content.isNotNull && m.content <> ""
|
||||
)
|
||||
}
|
||||
|
||||
def resolveTextAndConcPerson(
|
||||
collective: Ident,
|
||||
itemId: Ident,
|
||||
maxLen: Int,
|
||||
pageSep: String
|
||||
): ConnectionIO[TextAndTag] =
|
||||
readTextAndTag(collective, itemId, pageSep) {
|
||||
Select(
|
||||
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
|
||||
from(i)
|
||||
.innerJoin(a, a.itemId === i.id)
|
||||
.innerJoin(m, m.id === a.id)
|
||||
.leftJoin(pers0, pers0.pid === i.concPerson),
|
||||
i.id === itemId && m.content.isNotNull && m.content <> ""
|
||||
)
|
||||
}
|
||||
|
||||
def resolveTextAndConcEquip(
|
||||
collective: Ident,
|
||||
itemId: Ident,
|
||||
maxLen: Int,
|
||||
pageSep: String
|
||||
): ConnectionIO[TextAndTag] =
|
||||
readTextAndTag(collective, itemId, pageSep) {
|
||||
Select(
|
||||
select(substring(m.content.s, 0, maxLen).s, equip.eid.s, equip.name.s),
|
||||
from(i)
|
||||
.innerJoin(a, a.itemId === i.id)
|
||||
.innerJoin(m, m.id === a.id)
|
||||
.leftJoin(equip, equip.eid === i.concEquipment),
|
||||
i.id === itemId && m.content.isNotNull && m.content <> ""
|
||||
)
|
||||
}
|
||||
|
||||
private def readTextAndTag(collective: Ident, itemId: Ident, pageSep: String)(
|
||||
q: Select
|
||||
): ConnectionIO[TextAndTag] =
|
||||
for {
|
||||
_ <- logger.ftrace[ConnectionIO](
|
||||
s"query: $q (${itemId.id}, ${collective.id}, ${tagCategory})"
|
||||
s"query: $q (${itemId.id}, ${collective.id})"
|
||||
)
|
||||
texts <- q.query[(String, Option[TextAndTag.TagName])].to[List]
|
||||
texts <- q.build.query[(String, Option[TextAndTag.TagName])].to[List]
|
||||
_ <- logger.ftrace[ConnectionIO](
|
||||
s"Got ${texts.size} text and tag entries for item ${itemId.id}"
|
||||
)
|
||||
tag = texts.headOption.flatMap(_._2)
|
||||
txt = texts.map(_._1).mkString(pageSep)
|
||||
} yield TextAndTag(itemId, txt, tag)
|
||||
}
|
||||
|
||||
}
|
||||
|
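All the resolveTextAnd* variants feed into readTextAndTag, which joins the truncated page texts with the page separator and takes the entity from the first row, if any. The row handling, reduced to plain values:

// rows are (pageText, optional entity); the query already truncates the text via substring
def textAndTag(rows: List[(String, Option[String])], pageSep: String): (String, Option[String]) =
  (rows.map(_._1).mkString(pageSep), rows.headOption.flatMap(_._2))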
@ -15,7 +15,8 @@ case class RAttachmentMeta(
|
||||
content: Option[String],
|
||||
nerlabels: List[NerLabel],
|
||||
proposals: MetaProposalList,
|
||||
pages: Option[Int]
|
||||
pages: Option[Int],
|
||||
language: Option[Language]
|
||||
) {
|
||||
|
||||
def setContentIfEmpty(txt: Option[String]): RAttachmentMeta =
|
||||
@ -27,8 +28,8 @@ case class RAttachmentMeta(
|
||||
}
|
||||
|
||||
object RAttachmentMeta {
|
||||
def empty(attachId: Ident) =
|
||||
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None)
|
||||
def empty(attachId: Ident, lang: Language) =
|
||||
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None, Some(lang))
|
||||
|
||||
final case class Table(alias: Option[String]) extends TableDef {
|
||||
val tableName = "attachmentmeta"
|
||||
@ -38,7 +39,16 @@ object RAttachmentMeta {
|
||||
val nerlabels = Column[List[NerLabel]]("nerlabels", this)
|
||||
val proposals = Column[MetaProposalList]("itemproposals", this)
|
||||
val pages = Column[Int]("page_count", this)
|
||||
val all = NonEmptyList.of[Column[_]](id, content, nerlabels, proposals, pages)
|
||||
val language = Column[Language]("language", this)
|
||||
val all =
|
||||
NonEmptyList.of[Column[_]](
|
||||
id,
|
||||
content,
|
||||
nerlabels,
|
||||
proposals,
|
||||
pages,
|
||||
language
|
||||
)
|
||||
}
|
||||
|
||||
val T = Table(None)
|
||||
@ -49,7 +59,7 @@ object RAttachmentMeta {
|
||||
DML.insert(
|
||||
T,
|
||||
T.all,
|
||||
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages}"
|
||||
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages},${v.language}"
|
||||
)
|
||||
|
||||
def exists(attachId: Ident): ConnectionIO[Boolean] =
|
||||
@ -90,13 +100,14 @@ object RAttachmentMeta {
|
||||
)
|
||||
)
|
||||
|
||||
def updateProposals(mid: Ident, plist: MetaProposalList): ConnectionIO[Int] =
|
||||
def updateProposals(
|
||||
mid: Ident,
|
||||
plist: MetaProposalList
|
||||
): ConnectionIO[Int] =
|
||||
DML.update(
|
||||
T,
|
||||
T.id === mid,
|
||||
DML.set(
|
||||
T.proposals.setTo(plist)
|
||||
)
|
||||
DML.set(T.proposals.setTo(plist))
|
||||
)
|
||||
|
||||
def updatePageCount(mid: Ident, pageCount: Option[Int]): ConnectionIO[Int] =
|
||||
|
@ -0,0 +1,102 @@
|
||||
package docspell.store.records
|
||||
|
||||
import cats.data.NonEmptyList
|
||||
import cats.effect._
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.common._
|
||||
import docspell.store.qb.DSL._
|
||||
import docspell.store.qb._
|
||||
|
||||
import doobie._
|
||||
import doobie.implicits._
|
||||
|
||||
final case class RClassifierModel(
|
||||
id: Ident,
|
||||
cid: Ident,
|
||||
name: String,
|
||||
fileId: Ident,
|
||||
created: Timestamp
|
||||
) {}
|
||||
|
||||
object RClassifierModel {
|
||||
|
||||
def createNew[F[_]: Sync](
|
||||
cid: Ident,
|
||||
name: String,
|
||||
fileId: Ident
|
||||
): F[RClassifierModel] =
|
||||
for {
|
||||
id <- Ident.randomId[F]
|
||||
now <- Timestamp.current[F]
|
||||
} yield RClassifierModel(id, cid, name, fileId, now)
|
||||
|
||||
final case class Table(alias: Option[String]) extends TableDef {
|
||||
val tableName = "classifier_model"
|
||||
|
||||
val id = Column[Ident]("id", this)
|
||||
val cid = Column[Ident]("cid", this)
|
||||
val name = Column[String]("name", this)
|
||||
val fileId = Column[Ident]("file_id", this)
|
||||
val created = Column[Timestamp]("created", this)
|
||||
|
||||
val all = NonEmptyList.of[Column[_]](id, cid, name, fileId, created)
|
||||
}
|
||||
|
||||
def as(alias: String): Table =
|
||||
Table(Some(alias))
|
||||
|
||||
val T = Table(None)
|
||||
|
||||
def insert(v: RClassifierModel): ConnectionIO[Int] =
|
||||
DML.insert(
|
||||
T,
|
||||
T.all,
|
||||
fr"${v.id},${v.cid},${v.name},${v.fileId},${v.created}"
|
||||
)
|
||||
|
||||
def updateFile(coll: Ident, name: String, fid: Ident): ConnectionIO[Int] =
|
||||
for {
|
||||
now <- Timestamp.current[ConnectionIO]
|
||||
n <- DML.update(
|
||||
T,
|
||||
T.cid === coll && T.name === name,
|
||||
DML.set(T.fileId.setTo(fid), T.created.setTo(now))
|
||||
)
|
||||
k <-
|
||||
if (n == 0) createNew[ConnectionIO](coll, name, fid).flatMap(insert)
|
||||
else 0.pure[ConnectionIO]
|
||||
} yield n + k
|
||||
|
||||
def deleteById(id: Ident): ConnectionIO[Int] =
|
||||
DML.delete(T, T.id === id)
|
||||
|
||||
def deleteAll(ids: List[Ident]): ConnectionIO[Int] =
|
||||
NonEmptyList.fromList(ids) match {
|
||||
case Some(nel) =>
|
||||
DML.delete(T, T.id.in(nel))
|
||||
case None =>
|
||||
0.pure[ConnectionIO]
|
||||
}
|
||||
|
||||
def findByName(cid: Ident, name: String): ConnectionIO[Option[RClassifierModel]] =
|
||||
Select(select(T.all), from(T), T.cid === cid && T.name === name).build
|
||||
.query[RClassifierModel]
|
||||
.option
|
||||
|
||||
def findAllByName(
|
||||
cid: Ident,
|
||||
names: NonEmptyList[String]
|
||||
): ConnectionIO[List[RClassifierModel]] =
|
||||
Select(select(T.all), from(T), T.cid === cid && T.name.in(names)).build
|
||||
.query[RClassifierModel]
|
||||
.to[List]
|
||||
|
||||
def findAllByQuery(
|
||||
cid: Ident,
|
||||
nameQuery: String
|
||||
): ConnectionIO[List[RClassifierModel]] =
|
||||
Select(select(T.all), from(T), T.cid === cid && T.name.like(nameQuery)).build
|
||||
.query[RClassifierModel]
|
||||
.to[List]
|
||||
}
|
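updateFile above follows an update-first upsert: attempt the update and insert a fresh row only when nothing was updated, reporting the combined row count. Reduced to plain functions:

def updateOrInsert(update: () => Int, insert: () => Int): Int = {
  val n = update()
  if (n == 0) n + insert() else n
}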
@ -1,6 +1,6 @@
|
||||
package docspell.store.records
|
||||
|
||||
import cats.data.NonEmptyList
|
||||
import cats.data.{NonEmptyList, OptionT}
|
||||
import cats.implicits._
|
||||
|
||||
import docspell.common._
|
||||
@ -13,27 +13,38 @@ import doobie.implicits._
|
||||
|
||||
case class RClassifierSetting(
|
||||
cid: Ident,
|
||||
enabled: Boolean,
|
||||
schedule: CalEvent,
|
||||
category: String,
|
||||
itemCount: Int,
|
||||
fileId: Option[Ident],
|
||||
created: Timestamp
|
||||
) {}
|
||||
created: Timestamp,
|
||||
categoryList: List[String],
|
||||
listType: ListType
|
||||
) {
|
||||
|
||||
def autoTagEnabled: Boolean =
|
||||
listType match {
|
||||
case ListType.Blacklist =>
|
||||
true
|
||||
case ListType.Whitelist =>
|
||||
categoryList.nonEmpty
|
||||
}
|
||||
}
|
||||
|
||||
object RClassifierSetting {
|
||||
// the categoryList is stored as a json array
|
||||
implicit val stringListMeta: Meta[List[String]] =
|
||||
jsonMeta[List[String]]
|
||||
|
||||
final case class Table(alias: Option[String]) extends TableDef {
|
||||
val tableName = "classifier_setting"
|
||||
|
||||
val cid = Column[Ident]("cid", this)
|
||||
val enabled = Column[Boolean]("enabled", this)
|
||||
val schedule = Column[CalEvent]("schedule", this)
|
||||
val category = Column[String]("category", this)
|
||||
val itemCount = Column[Int]("item_count", this)
|
||||
val fileId = Column[Ident]("file_id", this)
|
||||
val created = Column[Timestamp]("created", this)
|
||||
val cid = Column[Ident]("cid", this)
|
||||
val schedule = Column[CalEvent]("schedule", this)
|
||||
val itemCount = Column[Int]("item_count", this)
|
||||
val created = Column[Timestamp]("created", this)
|
||||
val categories = Column[List[String]]("categories", this)
|
||||
val listType = Column[ListType]("category_list_type", this)
|
||||
val all = NonEmptyList
|
||||
.of[Column[_]](cid, enabled, schedule, category, itemCount, fileId, created)
|
||||
.of[Column[_]](cid, schedule, itemCount, created, categories, listType)
|
||||
}
|
||||
|
||||
val T = Table(None)
|
||||
@ -44,35 +55,19 @@ object RClassifierSetting {
|
||||
DML.insert(
|
||||
T,
|
||||
T.all,
|
||||
fr"${v.cid},${v.enabled},${v.schedule},${v.category},${v.itemCount},${v.fileId},${v.created}"
|
||||
fr"${v.cid},${v.schedule},${v.itemCount},${v.created},${v.categoryList},${v.listType}"
|
||||
)
|
||||
|
||||
def updateAll(v: RClassifierSetting): ConnectionIO[Int] =
|
||||
DML.update(
|
||||
T,
|
||||
T.cid === v.cid,
|
||||
DML.set(
|
||||
T.enabled.setTo(v.enabled),
|
||||
T.schedule.setTo(v.schedule),
|
||||
T.category.setTo(v.category),
|
||||
T.itemCount.setTo(v.itemCount),
|
||||
T.fileId.setTo(v.fileId)
|
||||
)
|
||||
)
|
||||
|
||||
def updateFile(coll: Ident, fid: Ident): ConnectionIO[Int] =
|
||||
DML.update(T, T.cid === coll, DML.set(T.fileId.setTo(fid)))
|
||||
|
||||
def updateSettings(v: RClassifierSetting): ConnectionIO[Int] =
|
||||
def update(v: RClassifierSetting): ConnectionIO[Int] =
|
||||
for {
|
||||
n1 <- DML.update(
|
||||
T,
|
||||
T.cid === v.cid,
|
||||
DML.set(
|
||||
T.enabled.setTo(v.enabled),
|
||||
T.schedule.setTo(v.schedule),
|
||||
T.itemCount.setTo(v.itemCount),
|
||||
T.category.setTo(v.category)
|
||||
T.categories.setTo(v.categoryList),
|
||||
T.listType.setTo(v.listType)
|
||||
)
|
||||
)
|
||||
n2 <- if (n1 <= 0) insert(v) else 0.pure[ConnectionIO]
|
||||
@ -86,27 +81,62 @@ object RClassifierSetting {
|
||||
def delete(coll: Ident): ConnectionIO[Int] =
|
||||
DML.delete(T, T.cid === coll)
|
||||
|
||||
/** Finds tag categories that exist and match the classifier setting.
|
||||
* If the setting contains a blacklist, the blacklisted categories are
|
||||
* removed from the existing categories. If it is a whitelist, the
|
||||
* intersection is returned.
|
||||
*/
|
||||
def getActiveCategories(coll: Ident): ConnectionIO[List[String]] =
|
||||
(for {
|
||||
sett <- OptionT(findById(coll))
|
||||
cats <- OptionT.liftF(RTag.listCategories(coll))
|
||||
res = sett.listType match {
|
||||
case ListType.Blacklist =>
|
||||
cats.diff(sett.categoryList)
|
||||
case ListType.Whitelist =>
|
||||
sett.categoryList.intersect(cats)
|
||||
}
|
||||
} yield res).getOrElse(Nil)
|
||||
|
||||
/** Checks the json array of tag categories and removes those that are not present anymore. */
|
||||
def fixCategoryList(coll: Ident): ConnectionIO[Int] =
|
||||
(for {
|
||||
sett <- OptionT(findById(coll))
|
||||
cats <- OptionT.liftF(RTag.listCategories(coll))
|
||||
fixed = sett.categoryList.intersect(cats)
|
||||
n <- OptionT.liftF(
|
||||
if (fixed == sett.categoryList) 0.pure[ConnectionIO]
|
||||
else DML.update(T, T.cid === coll, DML.set(T.categories.setTo(fixed)))
|
||||
)
|
||||
} yield n).getOrElse(0)
|
||||
|
||||
case class Classifier(
|
||||
enabled: Boolean,
|
||||
schedule: CalEvent,
|
||||
itemCount: Int,
|
||||
category: Option[String]
|
||||
categories: List[String],
|
||||
listType: ListType
|
||||
) {
|
||||
def enabled: Boolean =
|
||||
listType match {
|
||||
case ListType.Blacklist =>
|
||||
true
|
||||
case ListType.Whitelist =>
|
||||
categories.nonEmpty
|
||||
}
|
||||
|
||||
def toRecord(coll: Ident, created: Timestamp): RClassifierSetting =
|
||||
RClassifierSetting(
|
||||
coll,
|
||||
enabled,
|
||||
schedule,
|
||||
category.getOrElse(""),
|
||||
itemCount,
|
||||
None,
|
||||
created
|
||||
created,
|
||||
categories,
|
||||
listType
|
||||
)
|
||||
}
|
||||
object Classifier {
|
||||
def fromRecord(r: RClassifierSetting): Classifier =
|
||||
Classifier(r.enabled, r.schedule, r.itemCount, r.category.some)
|
||||
Classifier(r.schedule, r.itemCount, r.categoryList, r.listType)
|
||||
}
|
||||
|
||||
}
|
||||
|
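The whitelist/blacklist semantics introduced here drive both whether auto-tagging runs at all (autoTagEnabled and Classifier.enabled) and which tag categories are learned (getActiveCategories). A standalone sketch of both rules, assuming the reading that a blacklist is always active while a whitelist needs at least one category:

sealed trait ListType
case object Blacklist extends ListType
case object Whitelist extends ListType

def autoTagEnabled(listType: ListType, categories: List[String]): Boolean =
  listType match {
    case Blacklist => true
    case Whitelist => categories.nonEmpty
  }

def activeCategories(existing: List[String], listed: List[String], listType: ListType): List[String] =
  listType match {
    case Blacklist => existing.diff(listed)      // learn everything except the listed categories
    case Whitelist => listed.intersect(existing) // learn only listed categories that still exist
  }

// an empty whitelist therefore disables auto-tagging, matching the settings help text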
@ -1,6 +1,6 @@
|
||||
package docspell.store.records
|
||||
|
||||
import cats.data.NonEmptyList
|
||||
import cats.data.{NonEmptyList, OptionT}
|
||||
import fs2.Stream
|
||||
|
||||
import docspell.common._
|
||||
@ -73,13 +73,24 @@ object RCollective {
|
||||
.map(now => settings.classifier.map(_.toRecord(cid, now)))
|
||||
n2 <- cls match {
|
||||
case Some(cr) =>
|
||||
RClassifierSetting.updateSettings(cr)
|
||||
RClassifierSetting.update(cr)
|
||||
case None =>
|
||||
RClassifierSetting.delete(cid)
|
||||
}
|
||||
} yield n1 + n2
|
||||
|
||||
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
|
||||
// this hides categories that have been deleted in the meantime
|
||||
// they are finally removed from the json array once the learn classifier task is run
|
||||
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] =
|
||||
(for {
|
||||
sett <- OptionT(getRawSettings(coll))
|
||||
prev <- OptionT.fromOption[ConnectionIO](sett.classifier)
|
||||
cats <- OptionT.liftF(RTag.listCategories(coll))
|
||||
next = prev.copy(categories = prev.categories.intersect(cats))
|
||||
} yield sett.copy(classifier = Some(next))).value
|
||||
|
||||
private def getRawSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
|
||||
import RClassifierSetting.stringListMeta
|
||||
val c = RCollective.as("c")
|
||||
val cs = RClassifierSetting.as("cs")
|
||||
|
||||
@ -87,10 +98,10 @@ object RCollective {
|
||||
select(
|
||||
c.language.s,
|
||||
c.integration.s,
|
||||
cs.enabled.s,
|
||||
cs.schedule.s,
|
||||
cs.itemCount.s,
|
||||
cs.category.s
|
||||
cs.categories.s,
|
||||
cs.listType.s
|
||||
),
|
||||
from(c).leftJoin(cs, cs.cid === c.id),
|
||||
c.id === coll
|
||||
|
@ -0,0 +1,60 @@
|
||||
package docspell.store.records
|
||||
|
||||
import cats.data.NonEmptyList
|
||||
|
||||
import docspell.common._
|
||||
import docspell.store.qb.DSL._
|
||||
import docspell.store.qb._
|
||||
|
||||
import doobie._
|
||||
import doobie.implicits._
|
||||
|
||||
case class RItemProposal(
|
||||
itemId: Ident,
|
||||
classifyProposals: MetaProposalList,
|
||||
classifyTags: List[IdRef],
|
||||
created: Timestamp
|
||||
)
|
||||
|
||||
object RItemProposal {
|
||||
final case class Table(alias: Option[String]) extends TableDef {
|
||||
val tableName = "item_proposal"
|
||||
|
||||
val itemId = Column[Ident]("itemid", this)
|
||||
val classifyProposals = Column[MetaProposalList]("classifier_proposals", this)
|
||||
val classifyTags = Column[List[IdRef]]("classifier_tags", this)
|
||||
val created = Column[Timestamp]("created", this)
|
||||
val all = NonEmptyList.of[Column[_]](itemId, classifyProposals, classifyTags, created)
|
||||
}
|
||||
|
||||
val T = Table(None)
|
||||
def as(alias: String): Table =
|
||||
Table(Some(alias))
|
||||
|
||||
def insert(v: RItemProposal): ConnectionIO[Int] =
|
||||
DML.insert(
|
||||
T,
|
||||
T.all,
|
||||
fr"${v.itemId},${v.classifyProposals},${v.classifyTags},${v.created}"
|
||||
)
|
||||
|
||||
def update(v: RItemProposal): ConnectionIO[Int] =
|
||||
DML.update(
|
||||
T,
|
||||
T.itemId === v.itemId,
|
||||
DML.set(
|
||||
T.classifyProposals.setTo(v.classifyProposals),
|
||||
T.classifyTags.setTo(v.classifyTags)
|
||||
)
|
||||
)
|
||||
|
||||
def deleteByItem(itemId: Ident): ConnectionIO[Int] =
|
||||
DML.delete(T, T.itemId === itemId)
|
||||
|
||||
def exists(itemId: Ident): ConnectionIO[Boolean] =
|
||||
Select(select(countAll), from(T), T.itemId === itemId).build
|
||||
.query[Int]
|
||||
.unique
|
||||
.map(_ > 0)
|
||||
|
||||
}
|
@ -148,6 +148,13 @@ object RTag {
|
||||
).orderBy(T.name.asc).build.query[RTag].to[List]
|
||||
}
|
||||
|
||||
def listCategories(coll: Ident): ConnectionIO[List[String]] =
|
||||
Select(
|
||||
T.category.s,
|
||||
from(T),
|
||||
T.cid === coll && T.category.isNotNull
|
||||
).distinct.build.query[String].to[List]
|
||||
|
||||
def delete(tagId: Ident, coll: Ident): ConnectionIO[Int] =
|
||||
DML.delete(T, T.tid === tagId && T.cid === coll)
|
||||
}
|
||||
|
@ -11,35 +11,38 @@ import Api
|
||||
import Api.Model.ClassifierSetting exposing (ClassifierSetting)
|
||||
import Api.Model.TagList exposing (TagList)
|
||||
import Comp.CalEventInput
|
||||
import Comp.Dropdown
|
||||
import Comp.FixedDropdown
|
||||
import Comp.IntField
|
||||
import Data.CalEvent exposing (CalEvent)
|
||||
import Data.Flags exposing (Flags)
|
||||
import Data.ListType exposing (ListType)
|
||||
import Data.UiSettings exposing (UiSettings)
|
||||
import Data.Validated exposing (Validated(..))
|
||||
import Html exposing (..)
|
||||
import Html.Attributes exposing (..)
|
||||
import Html.Events exposing (onCheck)
|
||||
import Http
|
||||
import Markdown
|
||||
import Util.Tag
|
||||
|
||||
|
||||
type alias Model =
|
||||
{ enabled : Bool
|
||||
, categoryModel : Comp.FixedDropdown.Model String
|
||||
, category : Maybe String
|
||||
, scheduleModel : Comp.CalEventInput.Model
|
||||
{ scheduleModel : Comp.CalEventInput.Model
|
||||
, schedule : Validated CalEvent
|
||||
, itemCountModel : Comp.IntField.Model
|
||||
, itemCount : Maybe Int
|
||||
, categoryListModel : Comp.Dropdown.Model String
|
||||
, categoryListType : ListType
|
||||
, categoryListTypeModel : Comp.FixedDropdown.Model ListType
|
||||
}
|
||||
|
||||
|
||||
type Msg
|
||||
= GetTagsResp (Result Http.Error TagList)
|
||||
| ScheduleMsg Comp.CalEventInput.Msg
|
||||
| ToggleEnabled
|
||||
| CategoryMsg (Comp.FixedDropdown.Msg String)
|
||||
= ScheduleMsg Comp.CalEventInput.Msg
|
||||
| ItemCountMsg Comp.IntField.Msg
|
||||
| GetTagsResp (Result Http.Error TagList)
|
||||
| CategoryListMsg (Comp.Dropdown.Msg String)
|
||||
| CategoryListTypeMsg (Comp.FixedDropdown.Msg ListType)
|
||||
|
||||
|
||||
init : Flags -> ClassifierSetting -> ( Model, Cmd Msg )
|
||||
@ -52,13 +55,36 @@ init flags sett =
|
||||
( cem, cec ) =
|
||||
Comp.CalEventInput.init flags newSchedule
|
||||
in
|
||||
( { enabled = sett.enabled
|
||||
, categoryModel = Comp.FixedDropdown.initString []
|
||||
, category = sett.category
|
||||
, scheduleModel = cem
|
||||
( { scheduleModel = cem
|
||||
, schedule = Data.Validated.Unknown newSchedule
|
||||
, itemCountModel = Comp.IntField.init (Just 0) Nothing True "Item Count"
|
||||
, itemCount = Just sett.itemCount
|
||||
, categoryListModel =
|
||||
let
|
||||
mkOption s =
|
||||
{ value = s, text = s, additional = "" }
|
||||
|
||||
minit =
|
||||
Comp.Dropdown.makeModel
|
||||
{ multiple = True
|
||||
, searchable = \n -> n > 0
|
||||
, makeOption = mkOption
|
||||
, labelColor = \_ -> \_ -> "grey "
|
||||
, placeholder = "Choose categories …"
|
||||
}
|
||||
|
||||
lm =
|
||||
Comp.Dropdown.SetSelection sett.categoryList
|
||||
|
||||
( m_, _ ) =
|
||||
Comp.Dropdown.update lm minit
|
||||
in
|
||||
m_
|
||||
, categoryListType =
|
||||
Data.ListType.fromString sett.listType
|
||||
|> Maybe.withDefault Data.ListType.Whitelist
|
||||
, categoryListTypeModel =
|
||||
Comp.FixedDropdown.initMap Data.ListType.label Data.ListType.all
|
||||
}
|
||||
, Cmd.batch
|
||||
[ Api.getTags flags "" GetTagsResp
|
||||
@ -71,11 +97,11 @@ getSettings : Model -> Validated ClassifierSetting
|
||||
getSettings model =
|
||||
Data.Validated.map
|
||||
(\sch ->
|
||||
{ enabled = model.enabled
|
||||
, category = model.category
|
||||
, schedule =
|
||||
{ schedule =
|
||||
Data.CalEvent.makeEvent sch
|
||||
, itemCount = Maybe.withDefault 0 model.itemCount
|
||||
, listType = Data.ListType.toString model.categoryListType
|
||||
, categoryList = Comp.Dropdown.getSelected model.categoryListModel
|
||||
}
|
||||
)
|
||||
model.schedule
|
||||
@ -89,18 +115,11 @@ update flags msg model =
|
||||
categories =
|
||||
Util.Tag.getCategories tl.items
|
||||
|> List.sort
|
||||
in
|
||||
( { model
|
||||
| categoryModel = Comp.FixedDropdown.initString categories
|
||||
, category =
|
||||
if model.category == Nothing then
|
||||
List.head categories
|
||||
|
||||
else
|
||||
model.category
|
||||
}
|
||||
, Cmd.none
|
||||
)
|
||||
lm =
|
||||
Comp.Dropdown.SetOptions categories
|
||||
in
|
||||
update flags (CategoryListMsg lm) model
|
||||
|
||||
GetTagsResp (Err _) ->
|
||||
( model, Cmd.none )
|
||||
@ -121,28 +140,6 @@ update flags msg model =
|
||||
, Cmd.map ScheduleMsg cc
|
||||
)
|
||||
|
||||
ToggleEnabled ->
|
||||
( { model | enabled = not model.enabled }
|
||||
, Cmd.none
|
||||
)
|
||||
|
||||
CategoryMsg lmsg ->
|
||||
let
|
||||
( mm, ma ) =
|
||||
Comp.FixedDropdown.update lmsg model.categoryModel
|
||||
in
|
||||
( { model
|
||||
| categoryModel = mm
|
||||
, category =
|
||||
if ma == Nothing then
|
||||
model.category
|
||||
|
||||
else
|
||||
ma
|
||||
}
|
||||
, Cmd.none
|
||||
)
|
||||
|
||||
ItemCountMsg lmsg ->
|
||||
let
|
||||
( im, iv ) =
|
||||
@ -155,39 +152,68 @@ update flags msg model =
|
||||
, Cmd.none
|
||||
)
|
||||
|
||||
CategoryListMsg lm ->
|
||||
let
|
||||
( m_, cmd_ ) =
|
||||
Comp.Dropdown.update lm model.categoryListModel
|
||||
in
|
||||
( { model | categoryListModel = m_ }
|
||||
, Cmd.map CategoryListMsg cmd_
|
||||
)
|
||||
|
||||
view : Model -> Html Msg
|
||||
view model =
|
||||
CategoryListTypeMsg lm ->
|
||||
let
|
||||
( m_, sel ) =
|
||||
Comp.FixedDropdown.update lm model.categoryListTypeModel
|
||||
|
||||
newListType =
|
||||
Maybe.withDefault model.categoryListType sel
|
||||
in
|
||||
( { model
|
||||
| categoryListTypeModel = m_
|
||||
, categoryListType = newListType
|
||||
}
|
||||
, Cmd.none
|
||||
)
|
||||
|
||||
|
||||
view : UiSettings -> Model -> Html Msg
|
||||
view settings model =
|
||||
let
|
||||
catListTypeItem =
|
||||
Comp.FixedDropdown.Item
|
||||
model.categoryListType
|
||||
(Data.ListType.label model.categoryListType)
|
||||
in
|
||||
div []
|
||||
[ div
|
||||
[ class "field"
|
||||
]
|
||||
[ div [ class "ui checkbox" ]
|
||||
[ input
|
||||
[ type_ "checkbox"
|
||||
, onCheck (\_ -> ToggleEnabled)
|
||||
, checked model.enabled
|
||||
]
|
||||
[]
|
||||
, label [] [ text "Enable classification" ]
|
||||
, span [ class "small-info" ]
|
||||
[ text "Disable document classification if not needed."
|
||||
]
|
||||
]
|
||||
]
|
||||
, div [ class "ui basic segment" ]
|
||||
[ text "Document classification tries to predict a tag for new incoming documents. This "
|
||||
, text "works by learning from existing documents in order to find common patterns within "
|
||||
, text "the text. The more documents you have correctly tagged, the better. Learning is done "
|
||||
, text "periodically based on a schedule and you need to specify a tag-group that should "
|
||||
, text "be used for learning."
|
||||
[ Markdown.toHtml [ class "ui basic segment" ]
|
||||
"""
|
||||
|
||||
Auto-tagging works by learning from existing documents. The more
|
||||
documents you have correctly tagged, the better. Learning is done
|
||||
periodically based on a schedule. You can specify tag-groups that
|
||||
should either be used (whitelist) or not used (blacklist) for
|
||||
learning.
|
||||
|
||||
Use an empty whitelist to disable auto tagging.
|
||||
|
||||
"""
|
||||
, div [ class "field" ]
|
||||
[ label [] [ text "Is the following a blacklist or whitelist?" ]
|
||||
, Html.map CategoryListTypeMsg
|
||||
(Comp.FixedDropdown.view (Just catListTypeItem) model.categoryListTypeModel)
|
||||
]
|
||||
, div [ class "field" ]
|
||||
[ label [] [ text "Category" ]
|
||||
, Html.map CategoryMsg
|
||||
(Comp.FixedDropdown.viewString model.category
|
||||
model.categoryModel
|
||||
)
|
||||
[ label []
|
||||
[ case model.categoryListType of
|
||||
Data.ListType.Whitelist ->
|
||||
text "Include tag categories for learning"
|
||||
|
||||
Data.ListType.Blacklist ->
|
||||
text "Exclude tag categories from learning"
|
||||
]
|
||||
, Html.map CategoryListMsg
|
||||
(Comp.Dropdown.view settings model.categoryListModel)
|
||||
]
|
||||
, Html.map ItemCountMsg
|
||||
(Comp.IntField.viewWithInfo
|
||||
|
@ -280,7 +280,7 @@ view flags settings model =
|
||||
, ( "invisible hidden", not flags.config.showClassificationSettings )
|
||||
]
|
||||
]
|
||||
[ text "Document Classifier"
|
||||
[ text "Auto-Tagging"
|
||||
]
|
||||
, div
|
||||
[ classList
|
||||
@ -289,13 +289,10 @@ view flags settings model =
|
||||
]
|
||||
]
|
||||
[ Html.map ClassifierSettingMsg
|
||||
(Comp.ClassifierSettingsForm.view model.classifierModel)
|
||||
(Comp.ClassifierSettingsForm.view settings model.classifierModel)
|
||||
, div [ class "ui vertical segment" ]
|
||||
[ button
|
||||
[ classList
|
||||
[ ( "ui small secondary basic button", True )
|
||||
, ( "disabled", not model.classifierModel.enabled )
|
||||
]
|
||||
[ class "ui small secondary basic button"
|
||||
, title "Starts a task to train a classifier"
|
||||
, onClick StartClassifierTask
|
||||
]
|
||||
|
@ -958,7 +958,6 @@ renderSuggestions model mkName idnames tagger =
|
||||
]
|
||||
, div [ class "menu" ] <|
|
||||
(idnames
|
||||
|> List.take 5
|
||||
|> List.map (\p -> a [ class "item", href "#", onClick (tagger p) ] [ text (mkName p) ])
|
||||
)
|
||||
]
|
||||
@ -969,7 +968,7 @@ renderOrgSuggestions : Model -> Html Msg
|
||||
renderOrgSuggestions model =
|
||||
renderSuggestions model
|
||||
.name
|
||||
(List.take 5 model.itemProposals.corrOrg)
|
||||
(List.take 6 model.itemProposals.corrOrg)
|
||||
SetCorrOrgSuggestion
|
||||
|
||||
|
||||
@ -977,7 +976,7 @@ renderCorrPersonSuggestions : Model -> Html Msg
|
||||
renderCorrPersonSuggestions model =
|
||||
renderSuggestions model
|
||||
.name
|
||||
(List.take 5 model.itemProposals.corrPerson)
|
||||
(List.take 6 model.itemProposals.corrPerson)
|
||||
SetCorrPersonSuggestion
|
||||
|
||||
|
||||
@ -985,7 +984,7 @@ renderConcPersonSuggestions : Model -> Html Msg
|
||||
renderConcPersonSuggestions model =
|
||||
renderSuggestions model
|
||||
.name
|
||||
(List.take 5 model.itemProposals.concPerson)
|
||||
(List.take 6 model.itemProposals.concPerson)
|
||||
SetConcPersonSuggestion
|
||||
|
||||
|
||||
@ -993,7 +992,7 @@ renderConcEquipSuggestions : Model -> Html Msg
|
||||
renderConcEquipSuggestions model =
|
||||
renderSuggestions model
|
||||
.name
|
||||
(List.take 5 model.itemProposals.concEquipment)
|
||||
(List.take 6 model.itemProposals.concEquipment)
|
||||
SetConcEquipSuggestion
|
||||
|
||||
|
||||
@ -1001,7 +1000,7 @@ renderItemDateSuggestions : Model -> Html Msg
|
||||
renderItemDateSuggestions model =
|
||||
renderSuggestions model
|
||||
Util.Time.formatDate
|
||||
(List.take 5 model.itemProposals.itemDate)
|
||||
(List.take 6 model.itemProposals.itemDate)
|
||||
SetItemDateSuggestion
|
||||
|
||||
|
||||
@ -1009,7 +1008,7 @@ renderDueDateSuggestions : Model -> Html Msg
|
||||
renderDueDateSuggestions model =
|
||||
renderSuggestions model
|
||||
Util.Time.formatDate
|
||||
(List.take 5 model.itemProposals.dueDate)
|
||||
(List.take 6 model.itemProposals.dueDate)
|
||||
SetDueDateSuggestion
|
||||
|
||||
@ -11,6 +11,17 @@ type Language
= German
| English
| French
| Italian
| Spanish
| Portuguese
| Czech
| Danish
| Finnish
| Norwegian
| Swedish
| Russian
| Romanian
| Dutch


fromString : String -> Maybe Language
@ -24,6 +35,39 @@ fromString str =
else if str == "fra" || str == "fr" || str == "french" then
Just French

else if str == "ita" || str == "it" || str == "italian" then
Just Italian

else if str == "spa" || str == "es" || str == "spanish" then
Just Spanish

else if str == "por" || str == "pt" || str == "portuguese" then
Just Portuguese

else if str == "ces" || str == "cs" || str == "czech" then
Just Czech

else if str == "dan" || str == "da" || str == "danish" then
Just Danish

else if str == "nld" || str == "nd" || str == "dutch" then
Just Dutch

else if str == "fin" || str == "fi" || str == "finnish" then
Just Finnish

else if str == "nor" || str == "no" || str == "norwegian" then
Just Norwegian

else if str == "swe" || str == "sv" || str == "swedish" then
Just Swedish

else if str == "rus" || str == "ru" || str == "russian" then
Just Russian

else if str == "ron" || str == "ro" || str == "romanian" then
Just Romanian

else
Nothing

@ -40,6 +84,39 @@ toIso3 lang =
French ->
"fra"

Italian ->
"ita"

Spanish ->
"spa"

Portuguese ->
"por"

Czech ->
"ces"

Danish ->
"dan"

Finnish ->
"fin"

Norwegian ->
"nor"

Swedish ->
"swe"

Russian ->
"rus"

Romanian ->
"ron"

Dutch ->
"nld"


toName : Language -> String
toName lang =
@ -53,7 +130,54 @@ toName lang =
French ->
"French"

Italian ->
"Italian"

Spanish ->
"Spanish"

Portuguese ->
"Portuguese"

Czech ->
"Czech"

Danish ->
"Danish"

Finnish ->
"Finnish"

Norwegian ->
"Norwegian"

Swedish ->
"Swedish"

Russian ->
"Russian"

Romanian ->
"Romanian"

Dutch ->
"Dutch"


all : List Language
all =
[ German, English, French ]
[ German
, English
, French
, Italian
, Spanish
, Portuguese
, Czech
, Dutch
, Danish
, Finnish
, Norwegian
, Swedish
, Russian
, Romanian
]

modules/webapp/src/main/elm/Data/ListType.elm (new file, 50 lines)
@ -0,0 +1,50 @@
module Data.ListType exposing
( ListType(..)
, all
, fromString
, label
, toString
)


type ListType
= Blacklist
| Whitelist


all : List ListType
all =
[ Blacklist, Whitelist ]


toString : ListType -> String
toString lt =
case lt of
Blacklist ->
"blacklist"

Whitelist ->
"whitelist"


label : ListType -> String
label lt =
case lt of
Blacklist ->
"Blacklist"

Whitelist ->
"Whitelist"


fromString : String -> Maybe ListType
fromString str =
case String.toLower str of
"blacklist" ->
Just Blacklist

"whitelist" ->
Just Whitelist

_ ->
Nothing

@ -98,9 +98,13 @@ let
};
text-analysis = {
max-length = 10000;
regex-ner = {
enabled = true;
file-cache-time = "1 minute";
nlp = {
mode = "full";
clear-interval = "15 minutes";
regex-ner = {
max-entries = 1000;
file-cache-time = "1 minute";
};
};
classification = {
enabled = true;
@ -118,7 +122,6 @@ let
];
};
working-dir = "/tmp/docspell-analysis";
clear-stanford-nlp-interval = "15 minutes";
};
processing = {
max-due-date-years = 10;

@ -772,47 +775,96 @@ in {
files.
'';
};
clear-stanford-nlp-interval = mkOption {
type = types.str;
default = defaults.text-analysis.clear-stanford-nlp-interval;
description = ''
Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled.
'';
};

regex-ner = mkOption {
nlp = mkOption {
type = types.submodule({
options = {
enabled = mkOption {
type = types.bool;
default = defaults.text-analysis.regex-ner.enabled;
mode = mkOption {
type = types.str;
default = defaults.text-analysis.nlp.mode;
description = ''
Whether to enable custom NER annotation. This uses the address
book of a collective as input for NER tagging (to automatically
find correspondent and concerned entities). If the address book
is large, this can be quite memory intensive and also makes text
analysis slower. But it greatly improves accuracy. If this is
false, NER tagging uses only statistical models (that also work
quite well).
The mode for configuring NLP models:

This setting might be moved to the collective settings in the
future.
1. full - builds the complete pipeline
2. basic - builds only the ner annotator
3. regexonly - matches each entry in your address book via regexps
4. disabled - doesn't use any stanford-nlp feature

The full and basic variants rely on pre-built language models
that are available for only three languages at the moment: German,
English and French.

Memory usage varies greatly among the languages. German has
quite large models that require about 1G heap. So joex should
run with -Xmx1400M at least when using mode=full.

The basic variant does a quite good job for German and
English. It might be worse for French, always depending on the
type of text that is analysed. Joex should run with about 600M
heap; here again the language German uses the most.

The regexonly variant doesn't depend on a language. It roughly
works by converting all entries in your address book into
regexps and matches each one against the text. This can get
memory intensive, too, when the address book grows large. This
is included in full and basic by default, but can be used
independently by setting mode=regexner.

When mode=disabled, the whole NLP pipeline is disabled,
and you won't get any suggestions. Only what the classifier
returns (if enabled).
'';
};
file-cache-time = mkOption {

clear-interval = mkOption {
type = types.str;
default = defaults.text-analysis.ner-file-cache-time;
default = defaults.text-analysis.nlp.clear-interval;
description = ''
The NER annotation uses a file of patterns that is derived from
a collective's address book. This is the time for how long this
file will be kept until a check for a state change is done.
Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled.
'';
};

regex-ner = mkOption {
type = types.submodule({
options = {
enabled = mkOption {
type = types.int;
default = defaults.text-analysis.regex-ner.max-entries;
description = ''
Whether to enable custom NER annotation. This uses the
address book of a collective as input for NER tagging (to
automatically find correspondent and concerned entities). If
the address book is large, this can be quite memory
intensive and also makes text analysis much slower. But it
improves accuracy and can be used independent of the
language. If this is set to 0, it is effectively disabled
and NER tagging uses only statistical models (that also work
quite well, but are restricted to the languages mentioned
above).

Note, this is only relevant if nlp-config.mode is not
"disabled".
'';
};
file-cache-time = mkOption {
type = types.str;
default = defaults.text-analysis.ner-file-cache-time;
description = ''
The NER annotation uses a file of patterns that is derived from
a collective's address book. This is the time for how long this
file will be kept until a check for a state change is done.
'';
};
};
});
default = defaults.text-analysis.nlp.regex-ner;
description = "";
};
};
});
default = defaults.text-analysis.regex-ner;
description = "";
default = defaults.text-analysis.nlp;
description = "Configure NLP";
};

classification = mkOption {
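The NixOS options above mirror keys in the joex configuration itself. As a rough, hedged sketch (assuming the HOCON key path `docspell.joex.text-analysis` follows the Nix defaults shown at the top of this file's diff), the regex-ner part corresponds to:

```
docspell.joex.text-analysis.nlp.regex-ner {
  # setting max-entries to 0 effectively disables the
  # address-book based NER tagging described above
  max-entries = 1000
  file-cache-time = "1 minute"
}
```

The trade-off is the one stated in the option text: more entries can improve suggestions, but cost memory and processing time.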
@ -20,6 +20,9 @@ The configuration of both components uses separate namespaces. The
configuration for the REST server is below `docspell.server`, while
the one for joex is below `docspell.joex`.

You can therefore use two separate config files or one single file
containing both namespaces.
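For example, a single file carrying both namespaces could be laid out like this (a sketch only; the concrete settings are covered in the following sections):

```
docspell.server {
  # REST server settings go here
}

docspell.joex {
  # job executor (joex) settings go here
}
```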
## JDBC

This configures the connection to the database. This has to be
@ -281,6 +284,70 @@ just some minutes, the web application obtains new ones
periodically. So a short time is recommended.


## File Processing

Files are processed by the joex component, so all of the respective
configuration lives in its config only.

File processing involves several stages; detailed information can be
found [here](@/docs/joex/file-processing.md#text-analysis) and in the
corresponding sections of the [joex default config](#joex).

The configuration allows defining the external tools and setting some
limitations to control memory usage. The sections are:

- `docspell.joex.extraction`
- `docspell.joex.text-analysis`
- `docspell.joex.convert`

Options to external commands can use variables that are replaced by
values at runtime. Variables are enclosed in double braces `{{…}}`.
Please see the default configuration for what variables exist per
command. A purely illustrative sketch of such a command definition follows below.
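In the following sketch the tool name `some-tool`, the keys and the `{{infile}}`/`{{outfile}}` variables are placeholders, not actual Docspell settings; the real command sections and their variables are listed in the default configuration:

```
# hypothetical example only
docspell.joex.convert.some-tool = {
  command = {
    program = "/usr/bin/some-tool"
    args = [ "--output", "{{outfile}}", "{{infile}}" ]
  }
}
```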
### Classification

In `text-analysis.classification` you can define how many documents at
most should be used for learning. The default settings should work
well for most cases. However, it always depends on the amount of data
and the machine that runs joex. For example, by default the documents
to learn from are limited to 600 (`classification.item-count`) and
every text is cut after 8000 characters (`text-analysis.max-length`).
This is fine if *most* of your documents are small and only a few are
near 8000 characters. But if *all* your documents are very large, you
probably need to either assign more heap memory or lower these
limits.

Classification can be disabled, too, for when it's not needed.
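A minimal sketch of these settings in the joex config (only the keys named above are shown; the exact nesting is assumed to follow the defaults, and the values are just the ones mentioned in the text):

```
docspell.joex.text-analysis {
  max-length = 8000
  classification {
    enabled = true
    item-count = 600
  }
}
```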
### NLP

This setting defines which NLP mode to use. It defaults to `full`,
which requires more memory for certain languages (with the advantage
of better results). Other values are `basic`, `regexonly` and
`disabled`. The modes `full` and `basic` use pre-defined language
models for processing documents in the languages German, English and
French. These require some amount of memory (see below).

The mode `basic` is like the "light" variant of `full`. It doesn't use
all NLP features, which makes memory consumption much lower, but comes
with the compromise of less accurate results.

The mode `regexonly` doesn't use pre-defined language models, even if
available. It checks your address book against a document to find
metadata. That means it is language independent. Also, when using
`full` or `basic` with languages for which no pre-defined models
exist, it will degrade to `regexonly` for those.

The mode `disabled` skips NLP processing completely. This has the
least impact on memory consumption, obviously, but then only the
classifier is used to find metadata.

You might want to try different modes and see what combination suits
your usage pattern and the machine running joex best. If a powerful
machine is used, simply leave the defaults. When running on an older
Raspberry Pi, for example, you might need to adjust things.
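For example, switching joex to the lighter variant could look like this (a sketch; `mode` and `clear-interval` are the key names from the defaults shown earlier, the values are just examples):

```
docspell.joex.text-analysis.nlp {
  # one of: full, basic, regexonly, disabled
  mode = "basic"
  # idle time after which cached NLP models are cleared; <= 0 disables clearing
  clear-interval = "15 minutes"
}
```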
# File Format

The format of the configuration files can be