Merge pull request #581 from eikek/text-analysis-improvements

Text analysis improvements
This commit is contained in:
mergify[bot] 2021-01-21 22:01:50 +00:00 committed by GitHub
commit df5f9e8c51
104 changed files with 3385 additions and 714 deletions

View File

@ -24,4 +24,4 @@ before_script:
- export TZ=Europe/Berlin
script:
- sbt ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"
- sbt -J-XX:+UseG1GC ++$TRAVIS_SCALA_VERSION ";project root ;scalafmtCheckAll ;make ;test"

View File

@ -17,6 +17,9 @@ If you don't like to sign up to github/matrix or like to reach me
personally, you can send a mail to `info [at] docspell.org` or contact me
on matrix via `@eikek:matrix.org`.
If you find a feature request already filed, you can vote on it. I
tend to prefer the most-voted requests over those without much attention.
## Documentation

View File

@ -9,25 +9,28 @@
# Docspell
Docspell is a personal document organizer. You'll need a scanner to
convert your papers into files. Docspell can then assist in
organizing the resulting mess :wink:.
convert your papers into files. Docspell can then assist in organizing
the resulting mess :wink:. It is targeted at home use, i.e. families
and households, and also at (smaller) groups/companies.
You can associate tags, set correspondents, specify what a document
is concerned with, a name, a date and much more. If your documents
are associated with such metadata, you should be able to quickly find
them later using the search feature. But adding this manually to each
document is a tedious task. Docspell can help you by suggesting
correspondents, guessing tags or finding dates using machine learning
techniques. This makes adding metadata to your documents a lot easier.
You can associate tags, set correspondents and lots of other
predefined and custom metadata. If your documents are associated with
such metadata, you can quickly find them later using the search
feature. But adding this manually is a tedious task. Docspell can help
by suggesting correspondents, guessing tags or finding dates using
machine learning. It can learn metadata from existing documents and
find things using NLP. This makes adding metadata to your documents a
lot easier. For machine learning, it relies on the free (GPL)
[Stanford Core NLP library](https://github.com/stanfordnlp/CoreNLP).
Docspell also runs OCR (if needed) on your documents, can provide
fulltext search and has great e-mail integration. Everything is
accessible via a REST/HTTP api. A mobile friendly SPA web application
is provided as the user interface and an [Android
app](https://github.com/docspell/android-client) for conveniently
uploading files from your phone/tablet. The [feature
overview](https://docspell.org/#feature-selection) has a more complete
list.
is the default user interface. An [Android
app](https://github.com/docspell/android-client) exists for
conveniently uploading files from your phone/tablet. The [feature
overview](https://docspell.org/#feature-selection) lists more
features.
## Impressions

View File

@ -131,7 +131,8 @@ val openapiScalaSettings = Seq(
case "ident" =>
field => field.copy(typeDef = TypeDef("Ident", Imports("docspell.common.Ident")))
case "accountid" =>
field => field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
field =>
field.copy(typeDef = TypeDef("AccountId", Imports("docspell.common.AccountId")))
case "collectivestate" =>
field =>
field.copy(typeDef =
@ -190,6 +191,9 @@ val openapiScalaSettings = Seq(
field.copy(typeDef =
TypeDef("CustomFieldType", Imports("docspell.common.CustomFieldType"))
)
case "listtype" =>
field =>
field.copy(typeDef = TypeDef("ListType", Imports("docspell.common.ListType")))
}))
)

View File

@ -15,6 +15,17 @@ RUN apk add --no-cache openjdk11-jre \
tesseract-ocr \
tesseract-ocr-data-deu \
tesseract-ocr-data-fra \
tesseract-ocr-data-ita \
tesseract-ocr-data-spa \
tesseract-ocr-data-por \
tesseract-ocr-data-ces \
tesseract-ocr-data-nld \
tesseract-ocr-data-dan \
tesseract-ocr-data-fin \
tesseract-ocr-data-nor \
tesseract-ocr-data-swe \
tesseract-ocr-data-rus \
tesseract-ocr-data-ron \
unpaper \
wkhtmltopdf \
libreoffice \

View File

@ -0,0 +1,7 @@
package docspell.analysis
import java.nio.file.Path
import docspell.common._
case class NlpSettings(lang: Language, highRecall: Boolean, regexNer: Option[Path])

View File

@ -1,29 +1,30 @@
package docspell.analysis
import cats.Applicative
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.{StanfordTextClassifier, TextClassifier}
import docspell.analysis.contact.Contact
import docspell.analysis.date.DateFind
import docspell.analysis.nlp.PipelineCache
import docspell.analysis.nlp.StanfordNerClassifier
import docspell.analysis.nlp.StanfordNerSettings
import docspell.analysis.nlp.StanfordTextClassifier
import docspell.analysis.nlp.TextClassifier
import docspell.analysis.nlp._
import docspell.common._
import org.log4s.getLogger
trait TextAnalyser[F[_]] {
def annotate(
logger: Logger[F],
settings: StanfordNerSettings,
settings: NlpSettings,
cacheKey: Ident,
text: String
): F[TextAnalyser.Result]
def classifier(blocker: Blocker)(implicit CS: ContextShift[F]): TextClassifier[F]
def classifier: TextClassifier[F]
}
object TextAnalyser {
private[this] val logger = getLogger
case class Result(labels: Vector[NerLabel], dates: Vector[NerDateLabel]) {
@ -31,31 +32,30 @@ object TextAnalyser {
labels ++ dates.map(dl => dl.label.copy(label = dl.date.toString))
}
def create[F[_]: Concurrent: Timer](
cfg: TextAnalysisConfig
def create[F[_]: Concurrent: Timer: ContextShift](
cfg: TextAnalysisConfig,
blocker: Blocker
): Resource[F, TextAnalyser[F]] =
Resource
.liftF(PipelineCache[F](cfg.clearStanfordPipelineInterval))
.map(cache =>
.liftF(Nlp(cfg.nlpConfig))
.map(stanfordNer =>
new TextAnalyser[F] {
def annotate(
logger: Logger[F],
settings: StanfordNerSettings,
settings: NlpSettings,
cacheKey: Ident,
text: String
): F[TextAnalyser.Result] =
for {
input <- textLimit(logger, text)
tags0 <- stanfordNer(cacheKey, settings, input)
tags0 <- stanfordNer(Nlp.Input(cacheKey, settings, logger, input))
tags1 <- contactNer(input)
dates <- dateNer(settings.lang, input)
list = tags0 ++ tags1
spans = NerLabelSpan.build(list)
} yield Result(spans ++ list, dates)
def classifier(blocker: Blocker)(implicit
CS: ContextShift[F]
): TextClassifier[F] =
def classifier: TextClassifier[F] =
new StanfordTextClassifier[F](cfg.classifier, blocker)
private def textLimit(logger: Logger[F], text: String): F[String] =
@ -66,10 +66,6 @@ object TextAnalyser {
s" Analysing only first ${cfg.maxLength} characters."
) *> text.take(cfg.maxLength).pure[F]
private def stanfordNer(key: Ident, settings: StanfordNerSettings, text: String)
: F[Vector[NerLabel]] =
StanfordNerClassifier.nerAnnotate[F](key.id, cache)(settings, text)
private def contactNer(text: String): F[Vector[NerLabel]] =
Sync[F].delay {
Contact.annotate(text)
@ -82,4 +78,36 @@ object TextAnalyser {
}
)
/** Provides the nlp pipeline based on the configuration. */
private object Nlp {
def apply[F[_]: Concurrent: Timer: BracketThrow](
cfg: TextAnalysisConfig.NlpConfig
): F[Input[F] => F[Vector[NerLabel]]] =
cfg.mode match {
case NlpMode.Disabled =>
Logger.log4s(logger).info("NLP is disabled as defined in config.") *>
Applicative[F].pure(_ => Vector.empty[NerLabel].pure[F])
case _ =>
PipelineCache(cfg.clearInterval)(
Annotator[F](cfg.mode),
Annotator.clearCaches[F]
)
.map(annotate[F])
}
final case class Input[F[_]](
key: Ident,
settings: NlpSettings,
logger: Logger[F],
text: String
)
def annotate[F[_]: BracketThrow](
cache: PipelineCache[F]
)(input: Input[F]): F[Vector[NerLabel]] =
cache
.obtain(input.key.id, input.settings)
.use(ann => ann.nerAnnotate(input.logger)(input.text))
}
}
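
To make the new entry point concrete, here is a small usage sketch (not part of this change); the object name, collective id and sample text are illustrative, and the config, blocker and logger are assumed to come from the application wiring:

import cats.effect.{Blocker, ContextShift, IO, Timer}
import docspell.analysis.{NlpSettings, TextAnalyser, TextAnalysisConfig}
import docspell.common._

object TextAnalyserExample {
  // cfg and blocker come from the application wiring; the collective id
  // and sample text are illustrative placeholders.
  def annotateOnce(cfg: TextAnalysisConfig, blocker: Blocker, logger: Logger[IO])(implicit
      CS: ContextShift[IO],
      T: Timer[IO]
  ): IO[TextAnalyser.Result] =
    TextAnalyser.create[IO](cfg, blocker).use { analyser =>
      val settings = NlpSettings(Language.English, highRecall = false, regexNer = None)
      analyser.annotate(
        logger,
        settings,
        Ident.unsafe("collective-1"),
        "Invoice from Acme Corp, dated January 12, 2021."
      )
    }
}

Note that annotate now takes the generic NlpSettings; the Stanford-specific settings are derived internally by the Nlp helper.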

View File

@ -1,10 +1,16 @@
package docspell.analysis
import docspell.analysis.nlp.TextClassifierConfig
import docspell.analysis.TextAnalysisConfig.NlpConfig
import docspell.analysis.classifier.TextClassifierConfig
import docspell.common._
case class TextAnalysisConfig(
maxLength: Int,
clearStanfordPipelineInterval: Duration,
nlpConfig: NlpConfig,
classifier: TextClassifierConfig
)
object TextAnalysisConfig {
case class NlpConfig(clearInterval: Duration, mode: NlpMode)
}

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import java.nio.file.Path

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import java.nio.file.Path
@ -7,8 +7,11 @@ import cats.effect.concurrent.Ref
import cats.implicits._
import fs2.Stream
import docspell.analysis.nlp.TextClassifier._
import docspell.analysis.classifier
import docspell.analysis.classifier.TextClassifier._
import docspell.analysis.nlp.Properties
import docspell.common._
import docspell.common.syntax.FileSyntax._
import edu.stanford.nlp.classify.ColumnDataClassifier
@ -26,7 +29,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
.use { dir =>
for {
rawData <- writeDataFile(blocker, dir, data)
_ <- logger.info(s"Learning from ${rawData.count} items.")
_ <- logger.debug(s"Learning from ${rawData.count} items.")
trainData <- splitData(logger, rawData)
scores <- cfg.classifierConfigs.traverse(m => train(logger, trainData, m))
sorted = scores.sortBy(-_.score)
@ -43,7 +46,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
case Some(text) =>
Sync[F].delay {
val cls = ColumnDataClassifier.getClassifier(
model.model.normalize().toAbsolutePath().toString()
model.model.normalize().toAbsolutePath.toString
)
val cat = cls.classOf(cls.makeDatumFromLine("\t\t" + normalisedText(text)))
Option(cat)
@ -65,7 +68,7 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
val cdc = new ColumnDataClassifier(Properties.fromMap(amendProps(in, props)))
cdc.trainClassifier(in.train.toString())
val score = cdc.testClassifier(in.test.toString())
TrainResult(score.first(), ClassifierModel(in.modelFile))
TrainResult(score.first(), classifier.ClassifierModel(in.modelFile))
}
_ <- logger.debug(s"Trained with result $res")
} yield res
@ -136,9 +139,9 @@ final class StanfordTextClassifier[F[_]: Sync: ContextShift](
props: Map[String, String]
): Map[String, String] =
prepend("2.", props) ++ Map(
"trainFile" -> trainData.train.normalize().toAbsolutePath().toString(),
"testFile" -> trainData.test.normalize().toAbsolutePath().toString(),
"serializeTo" -> trainData.modelFile.normalize().toAbsolutePath().toString()
"trainFile" -> trainData.train.absolutePathAsString,
"testFile" -> trainData.test.absolutePathAsString,
"serializeTo" -> trainData.modelFile.absolutePathAsString
).toList
case class RawData(count: Long, file: Path)

View File

@ -1,9 +1,9 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import cats.data.Kleisli
import fs2.Stream
import docspell.analysis.nlp.TextClassifier.Data
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
trait TextClassifier[F[_]] {

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import java.nio.file.Path

View File

@ -41,23 +41,41 @@ object DateFind {
}
object SimpleDate {
val p0 = (readYear >> readMonth >> readDay).map { case ((y, m), d) =>
List(SimpleDate(y, m, d))
def pattern0(lang: Language) = (readYear >> readMonth(lang) >> readDay).map {
case ((y, m), d) =>
List(SimpleDate(y, m, d))
}
val p1 = (readDay >> readMonth >> readYear).map { case ((d, m), y) =>
List(SimpleDate(y, m, d))
def pattern1(lang: Language) = (readDay >> readMonth(lang) >> readYear).map {
case ((d, m), y) =>
List(SimpleDate(y, m, d))
}
val p2 = (readMonth >> readDay >> readYear).map { case ((m, d), y) =>
List(SimpleDate(y, m, d))
def pattern2(lang: Language) = (readMonth(lang) >> readDay >> readYear).map {
case ((m, d), y) =>
List(SimpleDate(y, m, d))
}
// ymd , ydm, dmy , dym, myd, mdy
def fromParts(parts: List[Word], lang: Language): List[SimpleDate] = {
val ymd = pattern0(lang)
val dmy = pattern1(lang)
val mdy = pattern2(lang)
// most is from wikipedia
val p = lang match {
case Language.English =>
p2.alt(p1).map(t => t._1 ++ t._2).or(p2).or(p0).or(p1)
case Language.German => p1.or(p0).or(p2)
case Language.French => p1.or(p0).or(p2)
mdy.alt(dmy).map(t => t._1 ++ t._2).or(mdy).or(ymd).or(dmy)
case Language.German => dmy.or(ymd).or(mdy)
case Language.French => dmy.or(ymd).or(mdy)
case Language.Italian => dmy.or(ymd).or(mdy)
case Language.Spanish => dmy.or(ymd).or(mdy)
case Language.Czech => dmy.or(ymd).or(mdy)
case Language.Danish => dmy.or(ymd).or(mdy)
case Language.Finnish => dmy.or(ymd).or(mdy)
case Language.Norwegian => dmy.or(ymd).or(mdy)
case Language.Portuguese => dmy.or(ymd).or(mdy)
case Language.Romanian => dmy.or(ymd).or(mdy)
case Language.Russian => dmy.or(ymd).or(mdy)
case Language.Swedish => ymd.or(dmy).or(mdy)
case Language.Dutch => dmy.or(ymd).or(mdy)
}
p.read(parts) match {
case Result.Success(sds, _) =>
@ -76,9 +94,11 @@ object DateFind {
}
)
def readMonth: Reader[Int] =
def readMonth(lang: Language): Reader[Int] =
Reader.readFirst(w =>
Some(months.indexWhere(_.contains(w.value))).filter(_ >= 0).map(_ + 1)
Some(MonthName.getAll(lang).indexWhere(_.contains(w.value)))
.filter(_ >= 0)
.map(_ + 1)
)
def readDay: Reader[Int] =
@ -150,20 +170,5 @@ object DateFind {
Failure
}
}
private val months = List(
List("jan", "january", "januar", "01"),
List("feb", "february", "februar", "02"),
List("mar", "march", "märz", "marz", "03"),
List("apr", "april", "04"),
List("may", "mai", "05"),
List("jun", "june", "juni", "06"),
List("jul", "july", "juli", "07"),
List("aug", "august", "08"),
List("sep", "september", "09"),
List("oct", "october", "oktober", "10"),
List("nov", "november", "11"),
List("dec", "december", "dezember", "12")
)
}
}

View File

@ -0,0 +1,270 @@
package docspell.analysis.date
import docspell.common.Language
object MonthName {
def getAll(lang: Language): List[List[String]] =
merge(numbers, forLang(lang))
private def merge(n0: List[List[String]], ns: List[List[String]]*): List[List[String]] =
ns.foldLeft(n0) { (res, el) =>
res.zip(el).map({ case (a, b) => a ++ b })
}
private def forLang(lang: Language): List[List[String]] =
lang match {
case Language.English =>
english
case Language.German =>
german
case Language.French =>
french
case Language.Italian =>
italian
case Language.Spanish =>
spanish
case Language.Swedish =>
swedish
case Language.Norwegian =>
norwegian
case Language.Dutch =>
dutch
case Language.Czech =>
czech
case Language.Danish =>
danish
case Language.Portuguese =>
portuguese
case Language.Romanian =>
romanian
case Language.Finnish =>
finnish
case Language.Russian =>
russian
}
private val numbers = List(
List("01"),
List("02"),
List("03"),
List("04"),
List("05"),
List("06"),
List("07"),
List("08"),
List("09"),
List("10"),
List("11"),
List("12")
)
private val english = List(
List("jan", "january"),
List("feb", "february"),
List("mar", "march"),
List("apr", "april"),
List("may"),
List("jun", "june"),
List("jul", "july"),
List("aug", "august"),
List("sept", "september"),
List("oct", "october"),
List("nov", "november"),
List("dec", "december")
)
private val german = List(
List("jan", "januar"),
List("feb", "februar"),
List("märz"),
List("apr", "april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dez", "dezember")
)
private val french = List(
List("janv", "janvier"),
List("févr", "fevr", "février", "fevrier"),
List("mars"),
List("avril"),
List("mai"),
List("juin"),
List("juil", "juillet"),
List("aout", "août"),
List("sept", "septembre"),
List("oct", "octobre"),
List("nov", "novembre"),
List("dec", "déc", "décembre", "decembre")
)
private val italian = List(
List("genn", "gennaio"),
List("febbr", "febbraio"),
List("mar", "marzo"),
List("apr", "aprile"),
List("magg", "maggio"),
List("giugno"),
List("luglio"),
List("ag", "agosto"),
List("sett", "settembre"),
List("ott", "ottobre"),
List("nov", "novembre"),
List("dic", "dicembre")
)
private val spanish = List(
List("ene", "enero"),
List("feb", "febrero"),
List("mar", "marzo"),
List("abr", "abril"),
List("may", "mayo"),
List("jun"),
List("jul"),
List("ago", "agosto"),
List("sep", "septiembre"),
List("oct", "octubre"),
List("nov", "noviembre"),
List("dic", "diciembre")
)
private val swedish = List(
List("jan", "januari"),
List("febr", "februari"),
List("mars"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "augusti"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val norwegian = List(
List("jan", "januar"),
List("febr", "februar"),
List("mars"),
List("april"),
List("mai"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("des", "desember")
)
private val czech = List(
List("led", "leden"),
List("un", "ún", "únor", "unor"),
List("brez", "březen", "brezen"),
List("dub", "duben"),
List("kvet", "květen"),
List("cerv", "červen"),
List("cerven", "červenec"),
List("srp", "srpen"),
List("zari", "září"),
List("ríj", "rij", "říjen"),
List("list", "listopad"),
List("pros", "prosinec")
)
private val romanian = List(
List("ian", "ianuarie"),
List("feb", "februarie"),
List("mar", "martie"),
List("apr", "aprilie"),
List("mai"),
List("iunie"),
List("iulie"),
List("aug", "august"),
List("sept", "septembrie"),
List("oct", "octombrie"),
List("noem", "nov", "noiembrie"),
List("dec", "decembrie")
)
private val danish = List(
List("jan", "januar"),
List("febr", "februar"),
List("marts"),
List("april"),
List("maj"),
List("juni"),
List("juli"),
List("aug", "august"),
List("sept", "september"),
List("okt", "oktober"),
List("nov", "november"),
List("dec", "december")
)
private val portuguese = List(
List("jan", "janeiro"),
List("fev", "fevereiro"),
List("março", "marco"),
List("abril"),
List("maio"),
List("junho"),
List("julho"),
List("agosto"),
List("set", "setembro"),
List("out", "outubro"),
List("nov", "novembro"),
List("dez", "dezembro")
)
private val finnish = List(
List("tammikuu"),
List("helmikuu"),
List("maaliskuu"),
List("huhtikuu"),
List("toukokuu"),
List("kesäkuu"),
List("heinäkuu"),
List("elokuu"),
List("syyskuu"),
List("lokakuu"),
List("marraskuu"),
List("joulukuu")
)
private val russian = List(
List("январь"),
List("февраль"),
List("март"),
List("апрель"),
List("май"),
List("июнь"),
List("июль"),
List("август"),
List("сентябрь"),
List("октябрь"),
List("ноябрь"),
List("декабрь")
)
private val dutch = List(
List("jan", "januari"),
List("feb", "februari"),
List("maart"),
List("apr", "april"),
List("mei"),
List("juni"),
List("juli"),
List("aug", "augustus"),
List("sept", "september"),
List("okt", "oct", "oktober"),
List("nov", "november"),
List("dec", "december")
)
}
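
A tiny sketch showing what getAll returns; the numeric spelling is merged in front of each month's localized names (the object name is illustrative):

import docspell.analysis.date.MonthName
import docspell.common.Language

object MonthNameExample extends App {
  val german = MonthName.getAll(Language.German)
  // one inner list per month, numeric form first, then localized spellings
  assert(german.size == 12)
  assert(german.head == List("01", "jan", "januar"))
}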

View File

@ -0,0 +1,98 @@
package docspell.analysis.nlp
import cats.effect.Sync
import cats.implicits._
import cats.{Applicative, FlatMap}
import docspell.analysis.NlpSettings
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
/** Analyses a text to mark certain parts with a `NerLabel`. */
trait Annotator[F[_]] { self =>
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]]
def ++(next: Annotator[F])(implicit F: FlatMap[F]): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
for {
n0 <- self.nerAnnotate(logger)(text)
n1 <- next.nerAnnotate(logger)(text)
} yield (n0 ++ n1).distinct
}
}
object Annotator {
/** Creates an annotator according to the given `mode` and `settings`.
*
* The following modes are supported:
*
* - disabled: it returns a no-op annotator that always gives an empty list
* - full: the complete stanford pipeline is used
* - basic: only the ner classifier is used
*
* Additionally, if a regexNer file is specified, the regexner annotator is
* also run. When the full pipeline is used, this is already included.
*/
def apply[F[_]: Sync](mode: NlpMode)(settings: NlpSettings): Annotator[F] =
mode match {
case NlpMode.Disabled =>
Annotator.none[F]
case NlpMode.Full =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(ss) =>
Annotator.pipeline(StanfordNerAnnotator.makePipeline(ss))
case None =>
Annotator.none[F]
}
case NlpMode.Basic =>
StanfordNerSettings.fromNlpSettings(settings) match {
case Some(StanfordNerSettings.Full(lang, _, Some(file))) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang)) ++
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case Some(StanfordNerSettings.Full(lang, _, None)) =>
Annotator.basic(BasicCRFAnnotator.Cache.getAnnotator(lang))
case Some(StanfordNerSettings.RegexOnly(file)) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
case NlpMode.RegexOnly =>
settings.regexNer match {
case Some(file) =>
Annotator.pipeline(StanfordNerAnnotator.regexNerPipeline(file))
case None =>
Annotator.none[F]
}
}
def none[F[_]: Applicative]: Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
logger.debug("Running empty annotator. NLP not supported.") *>
Vector.empty[NerLabel].pure[F]
}
def basic[F[_]: Sync](ann: BasicCRFAnnotator.Annotator): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(
BasicCRFAnnotator.nerAnnotate(ann)(text)
)
}
def pipeline[F[_]: Sync](cp: StanfordCoreNLP): Annotator[F] =
new Annotator[F] {
def nerAnnotate(logger: Logger[F])(text: String): F[Vector[NerLabel]] =
Sync[F].delay(StanfordNerAnnotator.nerAnnotate(cp, text))
}
def clearCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
StanfordCoreNLP.clearAnnotatorPool()
BasicCRFAnnotator.Cache.clearCache()
}
}
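
For illustration, a hedged sketch of using the Annotator API directly; the object name, mode choice and sample sentence are assumptions, not part of this change:

import cats.effect.{ExitCode, IO, IOApp}
import docspell.analysis.NlpSettings
import docspell.analysis.nlp.Annotator
import docspell.common._
import org.log4s.getLogger

object AnnotatorExample extends IOApp {
  def run(args: List[String]): IO[ExitCode] = {
    val logger   = Logger.log4s[IO](getLogger)
    val settings = NlpSettings(Language.English, highRecall = false, regexNer = None)
    // Basic mode only loads the CRF ner model, which needs much less memory
    // than the full stanford pipeline.
    val annotator = Annotator[IO](NlpMode.Basic)(settings)
    annotator
      .nerAnnotate(logger)("Derek Jeter works for the Old Sticky Pancake Company in Treesville.")
      .flatMap(labels => IO(labels.foreach(println)))
      .map(_ => ExitCode.Success)
  }
}

With these settings, basic mode loads only the English CRF model; disabled mode would return the no-op annotator.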

View File

@ -0,0 +1,94 @@
package docspell.analysis.nlp
import java.net.URL
import java.util.concurrent.atomic.AtomicReference
import java.util.zip.GZIPInputStream
import scala.jdk.CollectionConverters._
import scala.util.Using
import docspell.common.Language.NLPLanguage
import docspell.common._
import edu.stanford.nlp.ie.AbstractSequenceClassifier
import edu.stanford.nlp.ie.crf.CRFClassifier
import edu.stanford.nlp.ling.{CoreAnnotations, CoreLabel}
import org.log4s.getLogger
/** This uses only the CRFClassifier, without building a full analysis
* pipeline. The ner-classifier cannot use results from POS-tagging
* etc. and is therefore not as good as the [[StanfordNerAnnotator]],
* but it uses less memory while still giving decent results.
*/
object BasicCRFAnnotator {
private[this] val logger = getLogger
// assert correct resource names
List(Language.French, Language.German, Language.English).foreach(classifierResource)
type Annotator = AbstractSequenceClassifier[CoreLabel]
def nerAnnotate(nerClassifier: Annotator)(text: String): Vector[NerLabel] =
nerClassifier
.classify(text)
.asScala
.flatMap(a => a.asScala)
.collect(Function.unlift { label =>
val tag = label.get(classOf[CoreAnnotations.AnswerAnnotation])
NerTag
.fromString(Option(tag).getOrElse(""))
.toOption
.map(t => NerLabel(label.word(), t, label.beginPosition(), label.endPosition()))
})
.toVector
def makeAnnotator(lang: NLPLanguage): Annotator = {
logger.info(s"Creating ${lang.name} Stanford NLP NER-only classifier...")
val ner = classifierResource(lang)
Using(new GZIPInputStream(ner.openStream())) { in =>
CRFClassifier.getClassifier(in).asInstanceOf[Annotator]
}.fold(throw _, identity)
}
private def classifierResource(lang: NLPLanguage): URL = {
def check(name: String): URL =
Option(getClass.getResource(name)) match {
case None =>
sys.error(s"NER model resource '$name' not found for language ${lang.name}")
case Some(url) => url
}
check(lang match {
case Language.French =>
"/edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz"
case Language.German =>
"/edu/stanford/nlp/models/ner/german.distsim.crf.ser.gz"
case Language.English =>
"/edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
})
}
final class Cache {
private[this] lazy val germanNerClassifier = makeAnnotator(Language.German)
private[this] lazy val englishNerClassifier = makeAnnotator(Language.English)
private[this] lazy val frenchNerClassifier = makeAnnotator(Language.French)
def forLang(language: NLPLanguage): Annotator =
language match {
case Language.French => frenchNerClassifier
case Language.German => germanNerClassifier
case Language.English => englishNerClassifier
}
}
object Cache {
private[this] val cacheRef = new AtomicReference[Cache](new Cache)
def getAnnotator(language: NLPLanguage): Annotator =
cacheRef.get().forLang(language)
def clearCache(): Unit =
cacheRef.set(new Cache)
}
}
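
The same classifier can also be used directly via the cache; a minimal sketch (the object name and sample sentence are illustrative):

import docspell.analysis.nlp.BasicCRFAnnotator
import docspell.common.Language

object BasicCRFExample extends App {
  // getAnnotator lazily loads and caches the CRF model for the given language
  val ner    = BasicCRFAnnotator.Cache.getAnnotator(Language.English)
  val labels = BasicCRFAnnotator.nerAnnotate(ner)("Derek Jeter lives in Treesville.")
  labels.foreach(println)
  BasicCRFAnnotator.Cache.clearCache()
}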

View File

@ -7,9 +7,9 @@ import cats.effect._
import cats.effect.concurrent.Ref
import cats.implicits._
import docspell.analysis.NlpSettings
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
import org.log4s.getLogger
/** Creating the StanfordCoreNLP pipeline is quite expensive as it
@ -21,46 +21,45 @@ import org.log4s.getLogger
*/
trait PipelineCache[F[_]] {
def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP]
def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]]
}
object PipelineCache {
private[this] val logger = getLogger
def none[F[_]: Applicative]: PipelineCache[F] =
new PipelineCache[F] {
def obtain(
ignored: String,
settings: StanfordNerSettings
): Resource[F, StanfordCoreNLP] =
Resource.liftF(makeClassifier(settings).pure[F])
}
def apply[F[_]: Concurrent: Timer](clearInterval: Duration): F[PipelineCache[F]] =
def apply[F[_]: Concurrent: Timer](clearInterval: Duration)(
creator: NlpSettings => Annotator[F],
release: F[Unit]
): F[PipelineCache[F]] =
for {
data <- Ref.of(Map.empty[String, Entry])
cacheClear <- CacheClearing.create(data, clearInterval)
} yield new Impl[F](data, cacheClear)
data <- Ref.of(Map.empty[String, Entry[Annotator[F]]])
cacheClear <- CacheClearing.create(data, clearInterval, release)
_ <- Logger.log4s(logger).info("Creating nlp pipeline cache")
} yield new Impl[F](data, creator, cacheClear)
final private class Impl[F[_]: Sync](
data: Ref[F, Map[String, Entry]],
data: Ref[F, Map[String, Entry[Annotator[F]]]],
creator: NlpSettings => Annotator[F],
cacheClear: CacheClearing[F]
) extends PipelineCache[F] {
def obtain(key: String, settings: StanfordNerSettings): Resource[F, StanfordCoreNLP] =
def obtain(key: String, settings: NlpSettings): Resource[F, Annotator[F]] =
for {
_ <- cacheClear.withCache
id <- Resource.liftF(makeSettingsId(settings))
nlp <- Resource.liftF(data.modify(cache => getOrCreate(key, id, cache, settings)))
_ <- cacheClear.withCache
id <- Resource.liftF(makeSettingsId(settings))
nlp <- Resource.liftF(
data.modify(cache => getOrCreate(key, id, cache, settings, creator))
)
} yield nlp
private def getOrCreate(
key: String,
id: String,
cache: Map[String, Entry],
settings: StanfordNerSettings
): (Map[String, Entry], StanfordCoreNLP) =
cache: Map[String, Entry[Annotator[F]]],
settings: NlpSettings,
creator: NlpSettings => Annotator[F]
): (Map[String, Entry[Annotator[F]]], Annotator[F]) =
cache.get(key) match {
case Some(entry) =>
if (entry.id == id) (cache, entry.value)
@ -68,18 +67,18 @@ object PipelineCache {
logger.info(
s"StanfordNLP settings changed for key $key. Creating new classifier"
)
val nlp = makeClassifier(settings)
val nlp = creator(settings)
val e = Entry(id, nlp)
(cache.updated(key, e), nlp)
}
case None =>
val nlp = makeClassifier(settings)
val nlp = creator(settings)
val e = Entry(id, nlp)
(cache.updated(key, e), nlp)
}
private def makeSettingsId(settings: StanfordNerSettings): F[String] = {
private def makeSettingsId(settings: NlpSettings): F[String] = {
val base = settings.copy(regexNer = None).toString
val size: F[Long] =
settings.regexNer match {
@ -104,9 +103,10 @@ object PipelineCache {
Resource.pure[F, Unit](())
}
def create[F[_]: Concurrent: Timer](
data: Ref[F, Map[String, Entry]],
interval: Duration
def create[F[_]: Concurrent: Timer, A](
data: Ref[F, Map[String, Entry[A]]],
interval: Duration,
release: F[Unit]
): F[CacheClearing[F]] =
for {
counter <- Ref.of(0L)
@ -121,16 +121,23 @@ object PipelineCache {
log
.info(s"Clearing StanfordNLP cache after $interval idle time")
.map(_ =>
new CacheClearingImpl[F](data, counter, cleaning, interval.toScala)
new CacheClearingImpl[F, A](
data,
counter,
cleaning,
interval.toScala,
release
)
)
} yield result
}
final private class CacheClearingImpl[F[_]](
data: Ref[F, Map[String, Entry]],
final private class CacheClearingImpl[F[_], A](
data: Ref[F, Map[String, Entry[A]]],
counter: Ref[F, Long],
cleaningFiber: Ref[F, Option[Fiber[F, Unit]]],
clearInterval: FiniteDuration
clearInterval: FiniteDuration,
release: F[Unit]
)(implicit T: Timer[F], F: Concurrent[F])
extends CacheClearing[F] {
private[this] val log = Logger.log4s[F](logger)
@ -158,17 +165,10 @@ object PipelineCache {
def clearAll: F[Unit] =
log.info("Clearing stanford nlp cache now!") *>
data.set(Map.empty) *> Sync[F].delay {
// turns out that everything is cached in a static map
StanfordCoreNLP.clearAnnotatorPool()
data.set(Map.empty) *> release *> Sync[F].delay {
System.gc();
}
}
private def makeClassifier(settings: StanfordNerSettings): StanfordCoreNLP = {
logger.info(s"Creating ${settings.lang.name} Stanford NLP NER classifier...")
new StanfordCoreNLP(Properties.forSettings(settings))
}
private case class Entry(id: String, value: StanfordCoreNLP)
private case class Entry[A](id: String, value: A)
}
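
A sketch of wiring such a cache, mirroring what the Nlp helper in TextAnalyser does; the basic mode here is an illustrative choice:

import cats.effect.{Concurrent, Timer}
import docspell.analysis.NlpSettings
import docspell.analysis.nlp.{Annotator, PipelineCache}
import docspell.common._

object PipelineCacheExample {
  // Build a cache that creates a basic-mode Annotator per cache key and clears
  // the static stanford/CRF caches after the given idle interval.
  def makeCache[F[_]: Concurrent: Timer](clearAfter: Duration): F[PipelineCache[F]] =
    PipelineCache(clearAfter)(
      Annotator[F](NlpMode.Basic), // creator: NlpSettings => Annotator[F]
      Annotator.clearCaches[F]     // release action run when the cache is cleared
    )
}

Callers then obtain an annotator as a Resource via cache.obtain(key, settings); a new annotator is created only when the settings for that key change.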

View File

@ -1,9 +1,11 @@
package docspell.analysis.nlp
import java.nio.file.Path
import java.util.{Properties => JProps}
import docspell.analysis.nlp.Properties.Implicits._
import docspell.common._
import docspell.common.syntax.FileSyntax._
object Properties {
@ -17,18 +19,21 @@ object Properties {
p
}
def forSettings(settings: StanfordNerSettings): JProps = {
val regexNerFile = settings.regexNer
.map(p => p.normalize().toAbsolutePath().toString())
settings.lang match {
case Language.German =>
Properties.nerGerman(regexNerFile, settings.highRecall)
case Language.English =>
Properties.nerEnglish(regexNerFile)
case Language.French =>
Properties.nerFrench(regexNerFile, settings.highRecall)
def forSettings(settings: StanfordNerSettings): JProps =
settings match {
case StanfordNerSettings.Full(lang, highRecall, regexNer) =>
val regexNerFile = regexNer.map(p => p.absolutePathAsString)
lang match {
case Language.German =>
Properties.nerGerman(regexNerFile, highRecall)
case Language.English =>
Properties.nerEnglish(regexNerFile)
case Language.French =>
Properties.nerFrench(regexNerFile, highRecall)
}
case StanfordNerSettings.RegexOnly(path) =>
Properties.regexNerOnly(path)
}
}
def nerGerman(regexNerMappingFile: Option[String], highRecall: Boolean): JProps =
Properties(
@ -76,6 +81,11 @@ object Properties {
"ner.model" -> "edu/stanford/nlp/models/ner/french-wikiner-4class.crf.ser.gz,edu/stanford/nlp/models/ner/english.conll.4class.distsim.crf.ser.gz"
).withRegexNer(regexNerMappingFile).withHighRecall(highRecall)
def regexNerOnly(regexNerMappingFile: Path): JProps =
Properties(
"annotators" -> "tokenize,ssplit"
).withRegexNer(Some(regexNerMappingFile.absolutePathAsString))
object Implicits {
implicit final class JPropsOps(val p: JProps) extends AnyVal {

View File

@ -0,0 +1,52 @@
package docspell.analysis.nlp
import java.nio.file.Path
import scala.jdk.CollectionConverters._
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
import org.log4s.getLogger
object StanfordNerAnnotator {
private[this] val logger = getLogger
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
def makePipeline(settings: StanfordNerSettings): StanfordCoreNLP =
settings match {
case s: StanfordNerSettings.Full =>
logger.info(s"Creating ${s.lang.name} Stanford NLP NER classifier...")
new StanfordCoreNLP(Properties.forSettings(settings))
case StanfordNerSettings.RegexOnly(path) =>
logger.info(s"Creating regexNer-only Stanford NLP NER classifier...")
regexNerPipeline(path)
}
def regexNerPipeline(regexNerFile: Path): StanfordCoreNLP =
new StanfordCoreNLP(Properties.regexNerOnly(regexNerFile))
def clearPipelineCaches[F[_]: Sync]: F[Unit] =
Sync[F].delay {
// turns out that everything is cached in a static map
StanfordCoreNLP.clearAnnotatorPool()
}
}

View File

@ -1,39 +0,0 @@
package docspell.analysis.nlp
import scala.jdk.CollectionConverters._
import cats.Applicative
import cats.effect._
import docspell.common._
import edu.stanford.nlp.pipeline.{CoreDocument, StanfordCoreNLP}
object StanfordNerClassifier {
/** Runs named entity recognition on the given `text`.
*
* This uses the classifier pipeline from stanford-nlp, see
* https://nlp.stanford.edu/software/CRF-NER.html. Creating these
* classifiers is quite expensive, it involves loading large model
* files. The classifiers are thread-safe and so they are cached.
* The `cacheKey` defines the "slot" where classifiers are stored
* and retrieved. If for a given `cacheKey` the `settings` change,
* a new classifier must be created. It will then replace the
* previous one.
*/
def nerAnnotate[F[_]: BracketThrow](
cacheKey: String,
cache: PipelineCache[F]
)(settings: StanfordNerSettings, text: String): F[Vector[NerLabel]] =
cache
.obtain(cacheKey, settings)
.use(crf => Applicative[F].pure(runClassifier(crf, text)))
def runClassifier(nerClassifier: StanfordCoreNLP, text: String): Vector[NerLabel] = {
val doc = new CoreDocument(text)
nerClassifier.annotate(doc)
doc.tokens().asScala.collect(Function.unlift(LabelConverter.toNerLabel)).toVector
}
}

View File

@ -2,25 +2,41 @@ package docspell.analysis.nlp
import java.nio.file.Path
import docspell.common._
import docspell.analysis.NlpSettings
import docspell.common.Language.NLPLanguage
/** Settings for configuring the stanford NER pipeline.
*
* The language is mandatory, only the provided ones are supported.
* The `highRecall` only applies for non-English languages. For
* non-English languages the english classifier is run as second
* classifier and if `highRecall` is true, then it will be used to
* tag untagged tokens. This may lead to a lot of false positives,
* but since English is omnipresent in other languages, too it
* depends on the use case for whether this is useful or not.
*
* The `regexNer` allows to specify a text file as described here:
* https://nlp.stanford.edu/software/regexner.html. This will be used
* as a last step to tag untagged tokens using the provided list of
* regexps.
*/
case class StanfordNerSettings(
lang: Language,
highRecall: Boolean,
regexNer: Option[Path]
)
sealed trait StanfordNerSettings
object StanfordNerSettings {
/** Settings for configuring the stanford NER pipeline.
*
* The language is mandatory; only the provided ones are supported.
* The `highRecall` option only applies to non-English languages. For
* those, the English classifier is run as a second classifier, and if
* `highRecall` is true, it is used to tag untagged tokens. This may
* lead to a lot of false positives, but since English is omnipresent
* in other languages too, whether this is useful depends on the use
* case.
*
* The `regexNer` option allows specifying a text file as described here:
* https://nlp.stanford.edu/software/regexner.html. It will be used
* as a last step to tag untagged tokens using the provided list of
* regexps.
*/
case class Full(
lang: NLPLanguage,
highRecall: Boolean,
regexNer: Option[Path]
) extends StanfordNerSettings
/** Not all languages are supported with predefined statistical models. This allows providing regexps only.
*/
case class RegexOnly(regexNerFile: Path) extends StanfordNerSettings
def fromNlpSettings(ns: NlpSettings): Option[StanfordNerSettings] =
NLPLanguage.all
.find(nl => nl == ns.lang)
.map(nl => Full(nl, ns.highRecall, ns.regexNer))
.orElse(ns.regexNer.map(nrf => RegexOnly(nrf)))
}
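
A small sketch of how fromNlpSettings resolves the generic settings; the object name and file path are illustrative assumptions:

import java.nio.file.Paths
import docspell.analysis.NlpSettings
import docspell.analysis.nlp.StanfordNerSettings
import docspell.common.Language

object NerSettingsExample extends App {
  // a language with a statistical model maps to Full settings …
  val de = NlpSettings(Language.German, highRecall = false, regexNer = None)
  assert(StanfordNerSettings.fromNlpSettings(de).exists(_.isInstanceOf[StanfordNerSettings.Full]))

  // … a language without one falls back to RegexOnly, if a regexNer file is configured
  val regexFile = Paths.get("/tmp/regex.txt") // illustrative path
  val it = NlpSettings(Language.Italian, highRecall = false, regexNer = Some(regexFile))
  assert(StanfordNerSettings.fromNlpSettings(it).contains(StanfordNerSettings.RegexOnly(regexFile)))
}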

View File

@ -0,0 +1,12 @@
package docspell.analysis
object Env {
def isCI = bool("CI")
def bool(key: String): Boolean =
string(key).contains("true")
def string(key: String): Option[String] =
Option(System.getenv(key)).filter(_.nonEmpty)
}

View File

@ -1,4 +1,4 @@
package docspell.analysis.nlp
package docspell.analysis.classifier
import minitest._
import cats.effect._

View File

@ -1,19 +1,22 @@
package docspell.analysis.nlp
import docspell.analysis.Env
import docspell.common.Language.NLPLanguage
import minitest.SimpleTestSuite
import docspell.files.TestFiles
import docspell.common._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
object TextAnalyserSuite extends SimpleTestSuite {
lazy val germanClassifier =
new StanfordCoreNLP(Properties.nerGerman(None, false))
lazy val englishClassifier =
new StanfordCoreNLP(Properties.nerEnglish(None))
object BaseCRFAnnotatorSuite extends SimpleTestSuite {
def annotate(language: NLPLanguage): String => Vector[NerLabel] =
BasicCRFAnnotator.nerAnnotate(BasicCRFAnnotator.Cache.getAnnotator(language))
test("find english ner labels") {
val labels =
StanfordNerClassifier.runClassifier(englishClassifier, TestFiles.letterENText)
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels = annotate(Language.English)(TestFiles.letterENText)
val expect = Vector(
NerLabel("Derek", NerTag.Person, 0, 5),
NerLabel("Jeter", NerTag.Person, 6, 11),
@ -45,11 +48,15 @@ object TextAnalyserSuite extends SimpleTestSuite {
NerLabel("Jeter", NerTag.Person, 1123, 1128)
)
assertEquals(labels, expect)
BasicCRFAnnotator.Cache.clearCache()
}
test("find german ner labels") {
val labels =
StanfordNerClassifier.runClassifier(germanClassifier, TestFiles.letterDEText)
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels = annotate(Language.German)(TestFiles.letterDEText)
val expect = Vector(
NerLabel("Max", NerTag.Person, 0, 3),
NerLabel("Mustermann", NerTag.Person, 4, 14),
@ -65,5 +72,6 @@ object TextAnalyserSuite extends SimpleTestSuite {
NerLabel("Mustermann", NerTag.Person, 509, 519)
)
assertEquals(labels, expect)
BasicCRFAnnotator.Cache.clearCache()
}
}

View File

@ -0,0 +1,120 @@
package docspell.analysis.nlp
import java.nio.file.Paths
import cats.effect.IO
import docspell.analysis.Env
import minitest.SimpleTestSuite
import docspell.files.TestFiles
import docspell.common._
import docspell.common.syntax.FileSyntax._
import edu.stanford.nlp.pipeline.StanfordCoreNLP
object StanfordNerAnnotatorSuite extends SimpleTestSuite {
lazy val germanClassifier =
new StanfordCoreNLP(Properties.nerGerman(None, false))
lazy val englishClassifier =
new StanfordCoreNLP(Properties.nerEnglish(None))
test("find english ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(englishClassifier, TestFiles.letterENText)
val expect = Vector(
NerLabel("Derek", NerTag.Person, 0, 5),
NerLabel("Jeter", NerTag.Person, 6, 11),
NerLabel("Elm", NerTag.Misc, 17, 20),
NerLabel("Ave.", NerTag.Misc, 21, 25),
NerLabel("Treesville", NerTag.Misc, 27, 37),
NerLabel("Derek", NerTag.Person, 68, 73),
NerLabel("Jeter", NerTag.Person, 74, 79),
NerLabel("Elm", NerTag.Misc, 85, 88),
NerLabel("Ave.", NerTag.Misc, 89, 93),
NerLabel("Treesville", NerTag.Person, 95, 105),
NerLabel("Leaf", NerTag.Organization, 144, 148),
NerLabel("Chief", NerTag.Organization, 150, 155),
NerLabel("of", NerTag.Organization, 156, 158),
NerLabel("Syrup", NerTag.Organization, 159, 164),
NerLabel("Production", NerTag.Organization, 165, 175),
NerLabel("Old", NerTag.Organization, 176, 179),
NerLabel("Sticky", NerTag.Organization, 180, 186),
NerLabel("Pancake", NerTag.Organization, 187, 194),
NerLabel("Company", NerTag.Organization, 195, 202),
NerLabel("Maple", NerTag.Organization, 207, 212),
NerLabel("Lane", NerTag.Organization, 213, 217),
NerLabel("Forest", NerTag.Organization, 219, 225),
NerLabel("Hemptown", NerTag.Location, 239, 247),
NerLabel("Leaf", NerTag.Person, 276, 280),
NerLabel("Little", NerTag.Misc, 347, 353),
NerLabel("League", NerTag.Misc, 354, 360),
NerLabel("Derek", NerTag.Person, 1117, 1122),
NerLabel("Jeter", NerTag.Person, 1123, 1128)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("find german ner labels") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val labels =
StanfordNerAnnotator.nerAnnotate(germanClassifier, TestFiles.letterDEText)
val expect = Vector(
NerLabel("Max", NerTag.Person, 0, 3),
NerLabel("Mustermann", NerTag.Person, 4, 14),
NerLabel("Lilienweg", NerTag.Person, 16, 25),
NerLabel("Max", NerTag.Person, 77, 80),
NerLabel("Mustermann", NerTag.Person, 81, 91),
NerLabel("Lilienweg", NerTag.Location, 93, 102),
NerLabel("EasyCare", NerTag.Organization, 124, 132),
NerLabel("AG", NerTag.Organization, 133, 135),
NerLabel("Ackerweg", NerTag.Location, 158, 166),
NerLabel("Nebendorf", NerTag.Location, 184, 193),
NerLabel("Max", NerTag.Person, 505, 508),
NerLabel("Mustermann", NerTag.Person, 509, 519)
)
assertEquals(labels, expect)
StanfordCoreNLP.clearAnnotatorPool()
}
test("regexner-only annotator") {
if (Env.isCI) {
ignore("Test ignored on travis.")
}
val regexNerContent =
s"""(?i)volantino ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)volantino${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)ag${"\t"}ORGANIZATION${"\t"}LOCATION,PERSON,MISC${"\t"}3
|(?i)andrea rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)andrea${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|(?i)rossi${"\t"}PERSON${"\t"}LOCATION,MISC${"\t"}2
|""".stripMargin
File
.withTempDir[IO](Paths.get("target"), "test-regex-ner")
.use { dir =>
for {
out <- File.writeString[IO](dir / "regex.txt", regexNerContent)
ann = StanfordNerAnnotator.makePipeline(StanfordNerSettings.RegexOnly(out))
labels = StanfordNerAnnotator.nerAnnotate(ann, "Hello Andrea Rossi, can you.")
_ <- IO(
assertEquals(
labels,
Vector(
NerLabel("Andrea", NerTag.Person, 6, 12),
NerLabel("Rossi", NerTag.Person, 13, 18)
)
)
)
} yield ()
}
.unsafeRunSync()
StanfordCoreNLP.clearAnnotatorPool()
}
}

View File

@ -591,7 +591,7 @@ object OItem {
for {
itemIds <- store.transact(RItem.filterItems(items, collective))
results <- itemIds.traverse(item => deleteItem(item, collective))
n = results.fold(0)(_ + _)
n = results.sum
} yield n
def getProposals(item: Ident, collective: Ident): F[MetaProposalList] =

View File

@ -1,5 +1,7 @@
package docspell.common
import cats.data.NonEmptyList
import io.circe.{Decoder, Encoder}
sealed trait Language { self: Product =>
@ -11,28 +13,107 @@ sealed trait Language { self: Product =>
def iso3: String
val allowsNLP: Boolean = false
private[common] def allNames =
Set(name, iso3, iso2)
}
object Language {
sealed trait NLPLanguage extends Language with Product {
override val allowsNLP = true
}
object NLPLanguage {
val all: NonEmptyList[NLPLanguage] = NonEmptyList.of(German, English, French)
}
case object German extends Language {
case object German extends NLPLanguage {
val iso2 = "de"
val iso3 = "deu"
}
case object English extends Language {
case object English extends NLPLanguage {
val iso2 = "en"
val iso3 = "eng"
}
case object French extends Language {
case object French extends NLPLanguage {
val iso2 = "fr"
val iso3 = "fra"
}
val all: List[Language] = List(German, English, French)
case object Italian extends Language {
val iso2 = "it"
val iso3 = "ita"
}
case object Spanish extends Language {
val iso2 = "es"
val iso3 = "spa"
}
case object Portuguese extends Language {
val iso2 = "pt"
val iso3 = "por"
}
case object Czech extends Language {
val iso2 = "cs"
val iso3 = "ces"
}
case object Danish extends Language {
val iso2 = "da"
val iso3 = "dan"
}
case object Finnish extends Language {
val iso2 = "fi"
val iso3 = "fin"
}
case object Norwegian extends Language {
val iso2 = "no"
val iso3 = "nor"
}
case object Swedish extends Language {
val iso2 = "sv"
val iso3 = "swe"
}
case object Russian extends Language {
val iso2 = "ru"
val iso3 = "rus"
}
case object Romanian extends Language {
val iso2 = "ro"
val iso3 = "ron"
}
case object Dutch extends Language {
val iso2 = "nl"
val iso3 = "nld"
}
val all: List[Language] =
List(
German,
English,
French,
Italian,
Spanish,
Dutch,
Portuguese,
Czech,
Danish,
Finnish,
Norwegian,
Swedish,
Russian,
Romanian
)
def fromString(str: String): Either[String, Language] = {
val lang = str.toLowerCase

View File

@ -0,0 +1,33 @@
package docspell.common
import cats.data.NonEmptyList
import io.circe.{Decoder, Encoder}
sealed trait ListType { self: Product =>
def name: String =
productPrefix.toLowerCase
}
object ListType {
case object Whitelist extends ListType
val whitelist: ListType = Whitelist
case object Blacklist extends ListType
val blacklist: ListType = Blacklist
val all: NonEmptyList[ListType] = NonEmptyList.of(Whitelist, Blacklist)
def fromString(name: String): Either[String, ListType] =
all.find(_.name.equalsIgnoreCase(name)).toRight(s"Unknown list type: $name")
def unsafeFromString(name: String): ListType =
fromString(name).fold(sys.error, identity)
implicit val jsonEncoder: Encoder[ListType] =
Encoder.encodeString.contramap(_.name)
implicit val jsonDecoder: Decoder[ListType] =
Decoder.decodeString.emap(fromString)
}
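
A tiny usage sketch of the new type, round-tripping through fromString and the circe codec (the object name is illustrative):

import docspell.common.ListType
import io.circe.syntax._

object ListTypeExample extends App {
  // fromString is case-insensitive
  assert(ListType.fromString("Whitelist") == Right(ListType.whitelist))
  // the json codec uses the lowercase name
  assert(ListType.whitelist.asJson.noSpaces == "\"whitelist\"")
}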

View File

@ -87,7 +87,7 @@ object MetaProposal {
}
}
/** Merges candidates with same `IdRef' values and concatenates their
/** Merges candidates with same `IdRef` values and concatenates their
* respective labels. The candidate order is preserved.
*/
def flatten(s: NonEmptyList[Candidate]): NonEmptyList[Candidate] = {

View File

@ -45,6 +45,19 @@ case class MetaProposalList private (proposals: List[MetaProposal]) {
def sortByWeights: MetaProposalList =
change(_.sortByWeight)
def insertSecond(ml: MetaProposalList): MetaProposalList =
MetaProposalList.flatten0(
Seq(this, ml),
(map, next) =>
map.get(next.proposalType) match {
case Some(MetaProposal(mt, values)) =>
val cand = NonEmptyList(values.head, next.values.toList ++ values.tail)
map.updated(next.proposalType, MetaProposal(mt, MetaProposal.flatten(cand)))
case None =>
map.updated(next.proposalType, next)
}
)
}
object MetaProposalList {
@ -74,20 +87,25 @@ object MetaProposalList {
* is preserved and candidates of proposals are appended as given
* by the order of the given `seq'.
*/
def flatten(ml: Seq[MetaProposalList]): MetaProposalList = {
val init: Map[MetaProposalType, MetaProposal] = Map.empty
def updateMap(
map: Map[MetaProposalType, MetaProposal],
mp: MetaProposal
): Map[MetaProposalType, MetaProposal] =
map.get(mp.proposalType) match {
case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
case None => map.updated(mp.proposalType, mp)
}
val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(updateMap))
def flatten(ml: Seq[MetaProposalList]): MetaProposalList =
flatten0(
ml,
(map, mp) =>
map.get(mp.proposalType) match {
case Some(mp0) => map.updated(mp.proposalType, mp0.addIdRef(mp.values.toList))
case None => map.updated(mp.proposalType, mp)
}
)
private def flatten0(
ml: Seq[MetaProposalList],
merge: (
Map[MetaProposalType, MetaProposal],
MetaProposal
) => Map[MetaProposalType, MetaProposal]
): MetaProposalList = {
val init = Map.empty[MetaProposalType, MetaProposal]
val merged = ml.foldLeft(init)((map, el) => el.proposals.foldLeft(map)(merge))
fromMap(merged)
}

View File

@ -0,0 +1,25 @@
package docspell.common
sealed trait NlpMode { self: Product =>
def name: String =
self.productPrefix
}
object NlpMode {
case object Full extends NlpMode
case object Basic extends NlpMode
case object RegexOnly extends NlpMode
case object Disabled extends NlpMode
def fromString(name: String): Either[String, NlpMode] =
name.toLowerCase match {
case "full" => Right(Full)
case "basic" => Right(Basic)
case "regexonly" => Right(RegexOnly)
case "disabled" => Right(Disabled)
case _ => Left(s"Unknown nlp-mode: $name")
}
def unsafeFromString(name: String): NlpMode =
fromString(name).fold(sys.error, identity)
}

View File

@ -44,6 +44,9 @@ object Implicits {
implicit val priorityReader: ConfigReader[Priority] =
ConfigReader[String].emap(reason(Priority.fromString))
implicit val nlpModeReader: ConfigReader[NlpMode] =
ConfigReader[String].emap(reason(NlpMode.fromString))
def reason[A: ClassTag](
f: String => Either[String, A]
): String => Either[FailureReason, A] =

View File

@ -0,0 +1,20 @@
package docspell.common.syntax
import java.nio.file.Path
trait FileSyntax {
implicit final class PathOps(p: Path) {
def absolutePath: Path =
p.normalize().toAbsolutePath
def absolutePathAsString: String =
absolutePath.toString
def /(next: String): Path =
p.resolve(next)
}
}
object FileSyntax extends FileSyntax
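
A short sketch of the new path syntax in use; the paths are illustrative:

import java.nio.file.Paths
import docspell.common.syntax.FileSyntax._

object FileSyntaxExample extends App {
  val dir = Paths.get("target") / "test-regex-ner" // resolve a child path
  println(dir.absolutePathAsString)                // normalized absolute path as a string
}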

View File

@ -2,6 +2,11 @@ package docspell.common
package object syntax {
object all extends EitherSyntax with StreamSyntax with StringSyntax with LoggerSyntax
object all
extends EitherSyntax
with StreamSyntax
with StringSyntax
with LoggerSyntax
with FileSyntax
}

View File

@ -68,4 +68,35 @@ object MetaProposalListTest extends SimpleTestSuite {
assertEquals(candidates.head, cand1)
assertEquals(candidates.tail.head, cand2)
}
test("insert second") {
val cand1 = Candidate(IdRef(Ident.unsafe("123"), "name"), Set.empty)
val cand2 = Candidate(IdRef(Ident.unsafe("456"), "name"), Set.empty)
val cand3 = Candidate(IdRef(Ident.unsafe("789"), "name"), Set.empty)
val cand4 = Candidate(IdRef(Ident.unsafe("abc"), "name"), Set.empty)
val cand5 = Candidate(IdRef(Ident.unsafe("def"), "name"), Set.empty)
val mpl1 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3))
)
val mpl2 = MetaProposalList
.of(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand4)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand5))
)
val result = mpl1.insertSecond(mpl2)
assertEquals(
result,
MetaProposalList(
List(
MetaProposal(MetaProposalType.CorrOrg, NonEmptyList.of(cand1, cand4, cand2)),
MetaProposal(MetaProposalType.ConcPerson, NonEmptyList.of(cand3, cand5))
)
)
)
}
}

View File

@ -0,0 +1,13 @@
Pontremoli, 9 aprile 2013
Spettabile Villa Albicocca
Via Francigena, 9
55100 Pontetetto (LU)
Oggetto: Prenotazione
Gentile Direttore,
Vorrei prenotare una camera matrimoniale …….
In attesa di una Sua pronta risposta, La saluto cordialmente

View File

@ -1,5 +1,8 @@
package docspell.ftsclient
import cats.Functor
import cats.implicits._
import docspell.common._
final case class FtsMigration[F[_]](
@ -7,7 +10,13 @@ final case class FtsMigration[F[_]](
engine: Ident,
description: String,
task: F[FtsMigration.Result]
)
) {
def changeResult(f: FtsMigration.Result => FtsMigration.Result)(implicit
F: Functor[F]
): FtsMigration[F] =
copy(task = task.map(f))
}
object FtsMigration {

View File

@ -21,22 +21,19 @@ object Field {
val discriminator = Field("discriminator")
val attachmentName = Field("attachmentName")
val content = Field("content")
val content_de = Field("content_de")
val content_en = Field("content_en")
val content_fr = Field("content_fr")
val content_de = contentField(Language.German)
val content_en = contentField(Language.English)
val content_fr = contentField(Language.French)
val itemName = Field("itemName")
val itemNotes = Field("itemNotes")
val folderId = Field("folder")
val contentLangFields = Language.all
.map(contentField)
def contentField(lang: Language): Field =
lang match {
case Language.German =>
Field.content_de
case Language.English =>
Field.content_en
case Language.French =>
Field.content_fr
}
if (lang == Language.Czech) Field(s"content_cz")
else Field(s"content_${lang.iso2}")
implicit val jsonEncoder: Encoder[Field] =
Encoder.encodeString.contramap(_.name)

View File

@ -37,13 +37,10 @@ object SolrQuery {
cfg,
List(
Field.content,
Field.content_de,
Field.content_en,
Field.content_fr,
Field.itemName,
Field.itemNotes,
Field.attachmentName
),
) ++ Field.contentLangFields,
List(
Field.id,
Field.itemId,

View File

@ -56,21 +56,51 @@ object SolrSetup {
5,
solrEngine,
"Add content_fr field",
addContentFrField.map(_ => FtsMigration.Result.workDone)
addContentField(Language.French).map(_ => FtsMigration.Result.workDone)
),
FtsMigration[F](
6,
solrEngine,
"Index all from database",
FtsMigration.Result.indexAll.pure[F]
),
FtsMigration[F](
7,
solrEngine,
"Add content_it field",
addContentField(Language.Italian).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
8,
solrEngine,
"Add content_es field",
addContentField(Language.Spanish).map(_ => FtsMigration.Result.reIndexAll)
),
FtsMigration[F](
9,
solrEngine,
"Add more content fields",
addMoreContentFields.map(_ => FtsMigration.Result.reIndexAll)
)
)
def addFolderField: F[Unit] =
addStringField(Field.folderId)
def addContentFrField: F[Unit] =
addTextField(Some(Language.French))(Field.content_fr)
def addMoreContentFields: F[Unit] = {
val remain = List[Language](
Language.Norwegian,
Language.Romanian,
Language.Swedish,
Language.Finnish,
Language.Danish,
Language.Czech,
Language.Dutch,
Language.Portuguese,
Language.Russian
)
remain.traverse(addContentField).map(_ => ())
}
def setupCoreSchema: F[Unit] = {
val cmds0 =
@ -90,13 +120,15 @@ object SolrSetup {
)
.traverse(addTextField(None))
val cntLang = Language.all.traverse {
val cntLang = List(Language.German, Language.English, Language.French).traverse {
case l @ Language.German =>
addTextField(l.some)(Field.content_de)
case l @ Language.English =>
addTextField(l.some)(Field.content_en)
case l @ Language.French =>
addTextField(l.some)(Field.content_fr)
case _ =>
().pure[F]
}
cmds0 *> cmds1 *> cntLang *> ().pure[F]
@ -111,20 +143,17 @@ object SolrSetup {
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.string(field)))
private def addContentField(lang: Language): F[Unit] =
addTextField(Some(lang))(Field.contentField(lang))
private def addTextField(lang: Option[Language])(field: Field): F[Unit] =
lang match {
case None =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.text(field)))
case Some(Language.German) =>
run(AddField.command(AddField.textGeneral(field)))
case Some(lang) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textDE(field)))
case Some(Language.English) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textEN(field)))
case Some(Language.French) =>
run(DeleteField.command(DeleteField(field))).attempt *>
run(AddField.command(AddField.textFR(field)))
run(AddField.command(AddField.textLang(field, lang)))
}
}
}
@ -150,17 +179,12 @@ object SolrSetup {
def string(field: Field): AddField =
AddField(field, "string", true, true, false)
def text(field: Field): AddField =
def textGeneral(field: Field): AddField =
AddField(field, "text_general", true, true, false)
def textDE(field: Field): AddField =
AddField(field, "text_de", true, true, false)
def textEN(field: Field): AddField =
AddField(field, "text_en", true, true, false)
def textFR(field: Field): AddField =
AddField(field, "text_fr", true, true, false)
def textLang(field: Field, lang: Language): AddField =
if (lang == Language.Czech) AddField(field, s"text_cz", true, true, false)
else AddField(field, s"text_${lang.iso2}", true, true, false)
}
case class DeleteField(name: Field)

View File

@ -269,62 +269,101 @@ docspell.joex {
# All text to analyse must fit into RAM. A large document may take
# too much heap. Also, most important information is at the
# beginning of a document, so in most cases the first two pages
# should suffice. Default is 10000, which are about 2-3 pages
# (just a rough guess, of course).
max-length = 10000
# should suffice. Default is 8000, which is about 2-3 pages (just
# a rough guess, of course).
max-length = 8000
# A working directory for the analyser to store temporary/working
# files.
working-dir = ${java.io.tmpdir}"/docspell-analysis"
# The StanfordCoreNLP library caches language models which
# requires quite some amount of memory. Setting this interval to a
# positive duration, the cache is cleared after this amount of
# idle time. Set it to 0 to disable it if you have enough memory,
# processing will be faster.
clear-stanford-nlp-interval = "15 minutes"
regex-ner {
# Whether to enable custom NER annotation. This uses the address
# book of a collective as input for NER tagging (to automatically
# find correspondent and concerned entities). If the address book
# is large, this can be quite memory intensive and also makes text
# analysis slower. But it greatly improves accuracy. If this is
# false, NER tagging uses only statistical models (that also work
# quite well).
nlp {
# The mode for configuring NLP models:
#
# This setting might be moved to the collective settings in the
# future.
enabled = true
# 1. full - builds the complete pipeline
# 2. basic - builds only the ner annotator
# 3. regexonly - matches each entry in your address book via regexps
# 4. disabled - doesn't use any stanford-nlp feature
#
# The full and basic variants rely on pre-built language models
# that are available for only a few languages. Memory usage
# varies among the languages. So joex should run with -Xmx1400M
# at least when using mode=full.
#
# The basic variant does quite a good job for German and
# English. It might be worse for French, always depending on the
# type of text that is analysed. Joex should run with about 500M
# heap; here again German uses the most.
#
# The regexonly variant doesn't depend on a language. It roughly
# works by converting all entries in your address book into
# regexps and matches each one against the text. This can get
# memory intensive, too, when the address book grows large. This
# is included in the full and basic variants by default, but can
# be used independently by setting mode=regexonly.
#
# When mode=disabled, then the whole nlp pipeline is disabled,
# and you won't get any suggestions. Only what the classifier
# returns (if enabled).
mode = full
# The NER annotation uses a file of patterns that is derived from
# a collective's address book. This is the time this file will be
# kept until a check for a state change is done.
file-cache-time = "1 minute"
# The StanfordCoreNLP library caches language models which
# require quite some amount of memory. If this interval is set to
# a positive duration, the cache is cleared after this amount of
# idle time. Set it to 0 to disable it; if you have enough memory,
# processing will be faster.
#
# This only has an effect if mode != disabled.
clear-interval = "15 minutes"
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
regex-ner {
# Whether to enable custom NER annotation. This uses the
# address book of a collective as input for NER tagging (to
# automatically find correspondent and concerned entities). If
# the address book is large, this can be quite memory
# intensive and also makes text analysis much slower. But it
# improves accuracy and can be used independently of the
# language. If this is set to 0, it is effectively disabled
# and NER tagging uses only statistical models (that also work
# quite well, but are restricted to the languages mentioned
# above).
#
# Note, this is only relevant if nlp-config.mode is not
# "disabled".
max-entries = 1000
# The NER annotation uses a file of patterns that is derived
# from a collective's address book. This is the time this data
# will be kept until a check for a state change is done.
file-cache-time = "1 minute"
}
}
# Settings for doing document classification.
#
# This works by learning from existing documents. A collective can
# specify a tag category and the system will try to predict a tag
# from this category for new incoming documents.
#
# This requires a statistical model that is computed from all
# existing documents. This process is run periodically as
# configured by the collective. It may require a lot of memory,
# depending on the amount of data.
# This works by learning from existing documents. This requires a
# statistical model that is computed from all existing documents.
# This process is run periodically as configured by the
# collective. It may require more memory, depending on the amount
# of data.
#
# It utilises this NLP library: https://nlp.stanford.edu/.
classification {
# Whether to enable classification globally. Each collective can
# decide to disable it. If it is disabled here, no collective
# can use classification.
# enable/disable auto-tagging. The classifier is also used for
# finding correspondents and concerned entities, if enabled
# here.
enabled = true
# If concerned with memory consumption, this restricts the
# number of items to consider. More are better for training. A
# negative value or zero means no train on all items.
item-count = 0
# negative value or zero means to train on all items.
item-count = 600
# These settings are used to configure the classifier. If
# multiple are given, they are all tried and the "best" is
@ -477,13 +516,6 @@ docspell.joex {
}
}
# General config for processing documents
processing {
# Restricts proposals for due dates. Only dates earlier than this
# number of years in the future are considered.
max-due-date-years = 10
}
# The same section is also present in the rest-server config. It is
# used when submitting files into the job queue for processing.
#
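Note: the mode names described in the config above correspond to an enumeration in the code (the NlpMode type referenced further down in Config.scala). The following standalone sketch, not the actual docspell implementation, shows how such a setting might be parsed from the config string:

// Standalone sketch (not the real docspell.common.NlpMode): parsing and
// interpreting the four mode values from the nlp config section above.
sealed trait NlpMode
object NlpMode {
  case object Full      extends NlpMode // complete Stanford NLP pipeline
  case object Basic     extends NlpMode // only the NER annotator
  case object RegexOnly extends NlpMode // address-book regex matching only
  case object Disabled  extends NlpMode // no stanford-nlp features at all

  def fromString(s: String): Either[String, NlpMode] =
    s.toLowerCase match {
      case "full"      => Right(Full)
      case "basic"     => Right(Basic)
      case "regexonly" => Right(RegexOnly)
      case "disabled"  => Right(Disabled)
      case other       => Left(s"Unknown nlp mode: $other")
    }
}

object NlpModeExample extends App {
  // "full" and "basic" rely on pre-built language models, "regexonly" uses
  // only the address-book patterns, "disabled" skips NLP entirely.
  assert(NlpMode.fromString("regexonly") == Right(NlpMode.RegexOnly))
  assert(NlpMode.fromString("bogus").isLeft)
}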

View File

@ -5,7 +5,7 @@ import java.nio.file.Path
import cats.data.NonEmptyList
import docspell.analysis.TextAnalysisConfig
import docspell.analysis.nlp.TextClassifierConfig
import docspell.analysis.classifier.TextClassifierConfig
import docspell.backend.Config.Files
import docspell.common._
import docspell.convert.ConvertConfig
@ -31,8 +31,7 @@ case class Config(
sendMail: MailSendConfig,
files: Files,
mailDebug: Boolean,
fullTextSearch: Config.FullTextSearch,
processing: Config.Processing
fullTextSearch: Config.FullTextSearch
)
object Config {
@ -55,20 +54,17 @@ object Config {
final case class Migration(indexAllChunk: Int)
}
case class Processing(maxDueDateYears: Int)
case class TextAnalysis(
maxLength: Int,
workingDir: Path,
clearStanfordNlpInterval: Duration,
regexNer: RegexNer,
nlp: NlpConfig,
classification: Classification
) {
def textAnalysisConfig: TextAnalysisConfig =
TextAnalysisConfig(
maxLength,
clearStanfordNlpInterval,
TextAnalysisConfig.NlpConfig(nlp.clearInterval, nlp.mode),
TextClassifierConfig(
workingDir,
NonEmptyList
@ -78,14 +74,30 @@ object Config {
)
def regexNerFileConfig: RegexNerFile.Config =
RegexNerFile.Config(regexNer.enabled, workingDir, regexNer.fileCacheTime)
RegexNerFile.Config(
nlp.regexNer.maxEntries,
workingDir,
nlp.regexNer.fileCacheTime
)
}
case class RegexNer(enabled: Boolean, fileCacheTime: Duration)
case class NlpConfig(
mode: NlpMode,
clearInterval: Duration,
maxDueDateYears: Int,
regexNer: RegexNer
)
case class RegexNer(maxEntries: Int, fileCacheTime: Duration)
case class Classification(
enabled: Boolean,
itemCount: Int,
classifiers: List[Map[String, String]]
)
) {
def itemCountOrWhenLower(other: Int): Int =
if (itemCount <= 0 || (itemCount > other && other > 0)) other
else itemCount
}
}
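Note: itemCountOrWhenLower picks the effective training set size from the global item-count limit and a collective's own setting; the smaller positive value wins and a non-positive value means no restriction. A standalone sketch of the same logic with assumed example values:

// Standalone sketch of the itemCountOrWhenLower logic shown above, with a
// few worked examples (the concrete numbers are assumed, not from the source).
object ItemCountExample extends App {
  def itemCountOrWhenLower(itemCount: Int, other: Int): Int =
    if (itemCount <= 0 || (itemCount > other && other > 0)) other
    else itemCount

  assert(itemCountOrWhenLower(600, 200) == 200) // collective limit is lower, use it
  assert(itemCountOrWhenLower(600, 0) == 600)   // collective says "all", global cap applies
  assert(itemCountOrWhenLower(0, 200) == 200)   // no global cap, collective limit applies
  assert(itemCountOrWhenLower(0, 0) == 0)       // neither side restricts, train on all items
}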

View File

@ -97,7 +97,7 @@ object JoexAppImpl {
upload <- OUpload(store, queue, cfg.files, joex)
fts <- createFtsClient(cfg)(httpClient)
itemOps <- OItem(store, fts, queue, joex)
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig)
analyser <- TextAnalyser.create[F](cfg.textAnalysis.textAnalysisConfig, blocker)
regexNer <- RegexNerFile(cfg.textAnalysis.regexNerFileConfig, blocker, store)
javaEmil =
JavaMailEmil(blocker, Settings.defaultSettings.copy(debug = cfg.mailDebug))
@ -169,7 +169,7 @@ object JoexAppImpl {
.withTask(
JobTask.json(
LearnClassifierArgs.taskName,
LearnClassifierTask[F](cfg.textAnalysis, blocker, analyser),
LearnClassifierTask[F](cfg.textAnalysis, analyser),
LearnClassifierTask.onCancel[F]
)
)

View File

@ -29,7 +29,7 @@ trait RegexNerFile[F[_]] {
object RegexNerFile {
private[this] val logger = getLogger
case class Config(enabled: Boolean, directory: Path, minTime: Duration)
case class Config(maxEntries: Int, directory: Path, minTime: Duration)
def apply[F[_]: Concurrent: ContextShift](
cfg: Config,
@ -49,7 +49,7 @@ object RegexNerFile {
) extends RegexNerFile[F] {
def makeFile(collective: Ident): F[Option[Path]] =
if (cfg.enabled) doMakeFile(collective)
if (cfg.maxEntries > 0) doMakeFile(collective)
else (None: Option[Path]).pure[F]
def doMakeFile(collective: Ident): F[Option[Path]] =
@ -127,7 +127,7 @@ object RegexNerFile {
for {
_ <- logger.finfo(s"Generating custom NER file for collective '${collective.id}'")
names <- store.transact(QCollective.allNames(collective))
names <- store.transact(QCollective.allNames(collective, cfg.maxEntries))
nerFile = NerFile(collective, lastUpdate, now)
_ <- update(nerFile, NerFile.mkNerConfig(names))
} yield nerFile

View File

@ -14,16 +14,26 @@ object FtsWork {
def apply[F[_]](f: FtsContext[F] => F[Unit]): FtsWork[F] =
Kleisli(f)
def allInitializeTasks[F[_]: Monad]: FtsWork[F] =
FtsWork[F](_ => ().pure[F]).tap[FtsContext[F]].flatMap { ctx =>
NonEmptyList.fromList(ctx.fts.initialize.map(fm => from[F](fm.task))) match {
/** Runs all migration tasks unconditionally and inserts all data as the last step. */
def reInitializeTasks[F[_]: Monad]: FtsWork[F] =
FtsWork { ctx =>
val migrations =
ctx.fts.initialize.map(fm => fm.changeResult(_ => FtsMigration.Result.workDone))
NonEmptyList.fromList(migrations) match {
case Some(nel) =>
nel.reduce(semigroup[F])
nel
.map(fm => from[F](fm.task))
.append(insertAll[F](None))
.reduce(semigroup[F])
.run(ctx)
case None =>
FtsWork[F](_ => ().pure[F])
().pure[F]
}
}
/**
*/
def from[F[_]: FlatMap: Applicative](t: F[FtsMigration.Result]): FtsWork[F] =
Kleisli.liftF(t).flatMap(transformResult[F])
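Note: reInitializeTasks composes the per-engine migration tasks with the FtsWork semigroup and appends an insert-all step. The following standalone sketch, using plain functions instead of the real Kleisli-based FtsWork and FtsContext types, illustrates that sequential composition:

// Standalone sketch of how FtsWork values compose: each unit of work runs
// against a shared context, and the semigroup runs one after the other.
// Names here are illustrative only, not the real FtsContext/FtsWork types.
import cats.data.NonEmptyList

object FtsWorkSketch extends App {
  type Ctx  = StringBuilder  // stands in for FtsContext[F]
  type Work = Ctx => Unit    // stands in for Kleisli[F, FtsContext[F], Unit]

  def step(name: String): Work = ctx => { ctx.append(name).append(";"); () }

  // like FtsWork.semigroup: run the first work, then the second
  def both(a: Work, b: Work): Work = ctx => { a(ctx); b(ctx) }

  // reInitializeTasks: run all migrations unconditionally, then insert all data
  val migrations = NonEmptyList.of(step("migration-1"), step("migration-2"))
  val all        = migrations.append(step("insert-all")).reduceLeft(both)

  val ctx = new StringBuilder
  all(ctx)
  assert(ctx.toString == "migration-1;migration-2;insert-all;")
}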

View File

@ -11,6 +11,11 @@ import docspell.joex.Config
import docspell.store.records.RFtsMigration
import docspell.store.{AddResult, Store}
/** Migrating the index from the previous version to this version.
*
* The sql database stores the outcome of a migration task. If this
* task has already been applied, it is skipped.
*/
case class Migration[F[_]](
version: Int,
engine: Ident,

View File

@ -46,6 +46,6 @@ object ReIndexTask {
FtsWork.log[F](_.info("Clearing data failed. Continue re-indexing."))
) ++
FtsWork.log[F](_.info("Running index initialize")) ++
FtsWork.allInitializeTasks[F]
FtsWork.reInitializeTasks[F]
})
}

View File

@ -4,6 +4,9 @@ import cats.data.Kleisli
package object fts {
/** Some work that must be done to advance the schema of the fulltext
* index.
*/
type FtsWork[F[_]] = Kleisli[F, FtsContext[F], Unit]
}

View File

@ -0,0 +1,66 @@
package docspell.joex.learn
import cats.data.NonEmptyList
import cats.implicits._
import docspell.common.Ident
import docspell.store.records.{RClassifierModel, RClassifierSetting}
import doobie._
final class ClassifierName(val name: String) extends AnyVal
object ClassifierName {
def apply(name: String): ClassifierName =
new ClassifierName(name)
private val categoryPrefix = "tagcategory-"
def tagCategory(cat: String): ClassifierName =
apply(s"${categoryPrefix}${cat}")
val concernedPerson: ClassifierName =
apply("concernedperson")
val concernedEquip: ClassifierName =
apply("concernedequip")
val correspondentOrg: ClassifierName =
apply("correspondentorg")
val correspondentPerson: ClassifierName =
apply("correspondentperson")
def findTagClassifiers[F[_]](coll: Ident): ConnectionIO[List[ClassifierName]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
} yield categories.map(tagCategory)
def findTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
categories <- RClassifierSetting.getActiveCategories(coll)
models <- NonEmptyList.fromList(categories) match {
case Some(nel) =>
RClassifierModel.findAllByName(coll, nel.map(tagCategory).map(_.name))
case None =>
List.empty[RClassifierModel].pure[ConnectionIO]
}
} yield models
def findOrphanTagModels[F[_]](coll: Ident): ConnectionIO[List[RClassifierModel]] =
for {
cats <- RClassifierSetting.getActiveCategories(coll)
allModels = RClassifierModel.findAllByQuery(coll, s"${categoryPrefix}%")
result <- NonEmptyList.fromList(cats) match {
case Some(nel) =>
allModels.flatMap(all =>
RClassifierModel
.findAllByName(coll, nel.map(tagCategory).map(_.name))
.map(active => all.diff(active))
)
case None =>
allModels
}
} yield result
}
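Note: classifier models are stored per collective and keyed by name; tag models use the "tagcategory-" prefix so that models for removed categories can be found by diffing all stored tag models against the active ones. A standalone sketch of that computation with assumed example data:

// Standalone sketch of the naming scheme and the orphan-model computation,
// using plain collections instead of the database queries above.
object ClassifierNameSketch extends App {
  def tagCategory(cat: String): String = s"tagcategory-$cat"

  // all model names currently stored for a collective (assumed example data)
  val storedModels     = Set("tagcategory-invoice", "tagcategory-contract", "correspondentorg")
  // tag categories that are still active in the collective settings
  val activeCategories = List("invoice")

  // like findOrphanTagModels: tag models whose category is no longer active
  val orphans = storedModels.filter(_.startsWith("tagcategory-")) --
    activeCategories.map(tagCategory).toSet

  assert(orphans == Set("tagcategory-contract"))
}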

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import java.nio.file.Path
import cats.data.OptionT
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.{ClassifierModel, TextClassifier}
import docspell.common._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.RangeDef
object Classify {
def apply[F[_]: Sync: ContextShift](
blocker: Blocker,
logger: Logger[F],
workingDir: Path,
store: Store[F],
classifier: TextClassifier[F],
coll: Ident,
text: String
)(cname: ClassifierName): F[Option[String]] =
(for {
_ <- OptionT.liftF(logger.info(s"Guessing label for ${cname.name}"))
model <- OptionT(store.transact(RClassifierModel.findByName(coll, cname.name)))
.flatTapNone(logger.debug("No classifier model found."))
modelData =
store.bitpeace
.get(model.fileId.id)
.unNoneTerminate
.through(store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, blocker))
.compile
.drain
.flatMap(_ => classifier.classify(logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
.flatTapNone(logger.debug("Guessed: <none>"))
_ <- OptionT.liftF(logger.debug(s"Guessed: ${cls}"))
} yield cls).value
}

View File

@ -1,26 +1,19 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.data.OptionT
import cats.effect._
import cats.implicits._
import fs2.{Pipe, Stream}
import docspell.analysis.TextAnalyser
import docspell.analysis.nlp.ClassifierModel
import docspell.analysis.nlp.TextClassifier.Data
import docspell.backend.ops.OCollective
import docspell.common._
import docspell.joex.Config
import docspell.joex.scheduler._
import docspell.store.queries.QItem
import docspell.store.records.RClassifierSetting
import bitpeace.MimetypeHint
import docspell.store.records.{RClassifierModel, RClassifierSetting}
object LearnClassifierTask {
val noClass = "__NONE__"
val pageSep = " --n-- "
val noClass = "__NONE__"
type Args = LearnClassifierArgs
@ -29,83 +22,86 @@ object LearnClassifierTask {
def apply[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
blocker: Blocker,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
learnTags(cfg, analyser)
.flatMap(_ => learnItemEntities(cfg, analyser))
.flatMap(_ => Task(_ => Sync[F].delay(System.gc())))
private def learnItemEntities[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
Task { ctx =>
(for {
sett <- findActiveSettings[F](ctx, cfg)
data = selectItems(
ctx,
math.min(cfg.classification.itemCount, sett.itemCount).toLong,
sett.category.getOrElse("")
)
_ <- OptionT.liftF(
analyser
.classifier(blocker)
.trainClassifier[Unit](ctx.logger, data)(Kleisli(handleModel(ctx, blocker)))
)
} yield ())
.getOrElseF(logInactiveWarning(ctx.logger))
if (cfg.classification.enabled)
LearnItemEntities
.learnAll(
analyser,
ctx.args.collective,
cfg.classification.itemCount,
cfg.maxLength
)
.run(ctx)
else ().pure[F]
}
private def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
blocker: Blocker
)(trainedModel: ClassifierModel): F[Unit] =
private def learnTags[F[_]: Sync: ContextShift](
cfg: Config.TextAnalysis,
analyser: TextAnalyser[F]
): Task[F, Args, Unit] =
Task { ctx =>
val learnTags =
for {
sett <- findActiveSettings[F](ctx, cfg)
maxItems = cfg.classification.itemCountOrWhenLower(sett.itemCount)
_ <- OptionT.liftF(
LearnTags
.learnAllTagCategories(analyser)(
ctx.args.collective,
maxItems,
cfg.maxLength
)
.run(ctx)
)
} yield ()
// learn classifier models from active tag categories
learnTags.getOrElseF(logInactiveWarning(ctx.logger)) *>
// delete classifier model files for categories that have been removed
clearObsoleteTagModels(ctx) *>
// when tags are deleted, categories may get removed. fix the json array
ctx.store
.transact(RClassifierSetting.fixCategoryList(ctx.args.collective))
.map(_ => ())
}
private def clearObsoleteTagModels[F[_]: Sync](ctx: Context[F, Args]): F[Unit] =
for {
oldFile <- ctx.store.transact(
RClassifierSetting.findById(ctx.args.collective).map(_.flatMap(_.fileId))
list <- ctx.store.transact(
ClassifierName.findOrphanTagModels(ctx.args.collective)
)
_ <- ctx.logger.info("Storing new trained model")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
ctx.store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- ctx.store.transact(
RClassifierSetting.updateFile(ctx.args.collective, Ident.unsafe(newFile.id))
_ <- ctx.logger.info(
s"Found ${list.size} obsolete model files that are deleted now."
)
_ <- ctx.logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
ctx.logger.debug(s"Deleting old model file ${fid.id}") *>
ctx.store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
n <- ctx.store.transact(RClassifierModel.deleteAll(list.map(_.id)))
_ <- list
.map(_.fileId.id)
.traverse(id => ctx.store.bitpeace.delete(id).compile.drain)
_ <- ctx.logger.debug(s"Deleted $n model files.")
} yield ()
private def selectItems[F[_]](
ctx: Context[F, Args],
max: Long,
category: String
): Stream[F, Data] = {
val connStream =
for {
item <- QItem.findAllNewesFirst(ctx.args.collective, 10).through(restrictTo(max))
tt <- Stream.eval(
QItem.resolveTextAndTag(ctx.args.collective, item, category, pageSep)
)
} yield Data(tt.tag.map(_.name).getOrElse(noClass), item.id, tt.text.trim)
ctx.store.transact(connStream.filter(_.text.nonEmpty))
}
private def restrictTo[F[_], A](max: Long): Pipe[F, A, A] =
if (max <= 0) identity
else _.take(max)
private def findActiveSettings[F[_]: Sync](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): OptionT[F, OCollective.Classifier] =
if (cfg.classification.enabled)
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.collective)))
.filter(_.enabled)
.filter(_.category.nonEmpty)
.filter(_.autoTagEnabled)
.map(OCollective.Classifier.fromRecord)
else
OptionT.none
private def logInactiveWarning[F[_]: Sync](logger: Logger[F]): F[Unit] =
logger.warn(
"Classification is disabled. Check joex config and the collective settings."
"Auto-tagging is disabled. Check joex config and the collective settings."
)
}

View File

@ -0,0 +1,79 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import fs2.Stream
import docspell.analysis.TextAnalyser
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler._
object LearnItemEntities {
def learnAll[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learnCorrOrg(analyser, collective, maxItems, maxTextLen)
.flatMap(_ => learnCorrPerson[F, A](analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcPerson(analyser, collective, maxItems, maxTextLen))
.flatMap(_ => learnConcEquip(analyser, collective, maxItems, maxTextLen))
def learnCorrOrg[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentOrg,
ctx => SelectItems.forCorrOrg(ctx.store, collective, maxItems, maxTextLen)
)
def learnCorrPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.correspondentPerson,
ctx => SelectItems.forCorrPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcPerson[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedPerson,
ctx => SelectItems.forConcPerson(ctx.store, collective, maxItems, maxTextLen)
)
def learnConcEquip[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
learn(analyser, collective)(
ClassifierName.concernedEquip,
ctx => SelectItems.forConcEquip(ctx.store, collective, maxItems, maxTextLen)
)
private def learn[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident
)(cname: ClassifierName, data: Context[F, _] => Stream[F, Data]): Task[F, A, Unit] =
Task { ctx =>
ctx.logger.info(s"Learn classifier ${cname.name}") *>
analyser.classifier.trainClassifier(ctx.logger, data(ctx))(
Kleisli(StoreClassifierModel.handleModel(ctx, collective, cname))
)
}
}

View File

@ -0,0 +1,48 @@
package docspell.joex.learn
import cats.data.Kleisli
import cats.effect._
import cats.implicits._
import docspell.analysis.TextAnalyser
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.records.RClassifierSetting
object LearnTags {
def learnTagCategory[F[_]: Sync: ContextShift, A](
analyser: TextAnalyser[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
)(
category: String
): Task[F, A, Unit] =
Task { ctx =>
val data = SelectItems.forCategory(ctx, collective)(maxItems, category, maxTextLen)
ctx.logger.info(s"Learn classifier for tag category: $category") *>
analyser.classifier.trainClassifier(ctx.logger, data)(
Kleisli(
StoreClassifierModel.handleModel(
ctx,
collective,
ClassifierName.tagCategory(category)
)
)
)
}
def learnAllTagCategories[F[_]: Sync: ContextShift, A](analyser: TextAnalyser[F])(
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Task[F, A, Unit] =
Task { ctx =>
for {
cats <- ctx.store.transact(RClassifierSetting.getActiveCategories(collective))
task = learnTagCategory[F, A](analyser, collective, maxItems, maxTextLen) _
_ <- cats.map(task).traverse(_.run(ctx))
} yield ()
}
}

View File

@ -0,0 +1,109 @@
package docspell.joex.learn
import fs2.{Pipe, Stream}
import docspell.analysis.classifier.TextClassifier.Data
import docspell.common._
import docspell.joex.scheduler.Context
import docspell.store.Store
import docspell.store.qb.Batch
import docspell.store.queries.{QItem, TextAndTag}
import doobie._
object SelectItems {
val pageSep = LearnClassifierTask.pageSep
val noClass = LearnClassifierTask.noClass
def forCategory[F[_]](ctx: Context[F, _], collective: Ident)(
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] =
forCategory(ctx.store, collective, maxItems, category, maxTextLen)
def forCategory[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
category: String,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndTag(collective, item, category, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrOrg[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrOrg(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forCorrPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndCorrPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcPerson[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcPerson(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
def forConcEquip[F[_]](
store: Store[F],
collective: Ident,
maxItems: Int,
maxTextLen: Int
): Stream[F, Data] = {
val connStream =
allItems(collective, maxItems)
.evalMap(item =>
QItem.resolveTextAndConcEquip(collective, item, maxTextLen, pageSep)
)
.through(mkData)
store.transact(connStream)
}
private def allItems(collective: Ident, max: Int): Stream[ConnectionIO, Ident] = {
val limit = if (max <= 0) Batch.all else Batch.limit(max)
QItem.findAllNewesFirst(collective, 10, limit)
}
private def mkData[F[_]]: Pipe[F, TextAndTag, Data] =
_.map(tt => Data(tt.tag.map(_.name).getOrElse(noClass), tt.itemId.id, tt.text.trim))
.filter(_.text.nonEmpty)
}
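Note: mkData turns each (text, optional tag) row into a training sample, substituting the special __NONE__ class when an item carries no tag of the category and dropping rows without text. A standalone sketch, with Data as a simplified stand-in for TextClassifier.Data and assumed example rows:

// Standalone sketch of the mkData step above.
object SelectItemsSketch extends App {
  final case class Data(cls: String, ref: String, text: String)
  val noClass = "__NONE__"

  // (itemId, optional tag name, extracted text) -- assumed example rows
  val rows = List(
    ("item-1", Some("invoice"), "Total amount due ..."),
    ("item-2", None,            "Some other letter"),
    ("item-3", Some("invoice"), "   ")
  )

  val data = rows
    .map { case (id, tag, text) => Data(tag.getOrElse(noClass), id, text.trim) }
    .filter(_.text.nonEmpty)

  // item-3 is dropped because its text is empty after trimming
  assert(data.map(_.cls) == List("invoice", noClass))
}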

View File

@ -0,0 +1,53 @@
package docspell.joex.learn
import cats.effect._
import cats.implicits._
import docspell.analysis.classifier.ClassifierModel
import docspell.common._
import docspell.joex.scheduler._
import docspell.store.Store
import docspell.store.records.RClassifierModel
import bitpeace.MimetypeHint
object StoreClassifierModel {
def handleModel[F[_]: Sync: ContextShift](
ctx: Context[F, _],
collective: Ident,
modelName: ClassifierName
)(
trainedModel: ClassifierModel
): F[Unit] =
handleModel(ctx.store, ctx.blocker, ctx.logger)(collective, modelName, trainedModel)
def handleModel[F[_]: Sync: ContextShift](
store: Store[F],
blocker: Blocker,
logger: Logger[F]
)(
collective: Ident,
modelName: ClassifierName,
trainedModel: ClassifierModel
): F[Unit] =
for {
oldFile <- store.transact(
RClassifierModel.findByName(collective, modelName.name).map(_.map(_.fileId))
)
_ <- logger.debug(s"Storing new trained model for: ${modelName.name}")
fileData = fs2.io.file.readAll(trainedModel.model, blocker, 4096)
newFile <-
store.bitpeace.saveNew(fileData, 4096, MimetypeHint.none).compile.lastOrError
_ <- store.transact(
RClassifierModel.updateFile(collective, modelName.name, Ident.unsafe(newFile.id))
)
_ <- logger.debug(s"New model stored at file ${newFile.id}")
_ <- oldFile match {
case Some(fid) =>
logger.debug(s"Deleting old model file ${fid.id}") *>
store.bitpeace.delete(fid.id).compile.drain
case None => ().pure[F]
}
} yield ()
}

View File

@ -78,7 +78,14 @@ object AttachmentPageCount {
s"No attachmentmeta record exists for ${ra.id.id}. Creating new."
) *> ctx.store.transact(
RAttachmentMeta.insert(
RAttachmentMeta(ra.id, None, Nil, MetaProposalList.empty, md.pageCount.some)
RAttachmentMeta(
ra.id,
None,
Nil,
MetaProposalList.empty,
md.pageCount.some,
None
)
)
)
else 0.pure[F]

View File

@ -108,7 +108,18 @@ object ConvertPdf {
ctx.logger.info(s"Conversion to pdf+txt successful. Saving file.") *>
storePDF(ctx, cfg, ra, pdf)
.flatMap(r =>
txt.map(t => (r, item.changeMeta(ra.id, _.setContentIfEmpty(t.some)).some))
txt.map(t =>
(
r,
item
.changeMeta(
ra.id,
ctx.args.meta.language,
_.setContentIfEmpty(t.some)
)
.some
)
)
)
case ConversionResult.UnsupportedFormat(mt) =>

View File

@ -107,6 +107,8 @@ object CreateItem {
Vector.empty,
fm.map(a => a.id -> a.fileId).toMap,
MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil
)
}
@ -166,6 +168,8 @@ object CreateItem {
Vector.empty,
origMap,
MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil
)
)

View File

@ -42,7 +42,7 @@ object ExtractArchive {
archive: Option[RAttachmentArchive]
): Task[F, ProcessItemArgs, (Option[RAttachmentArchive], ItemData)] =
singlePass(item, archive).flatMap { t =>
if (t._1 == None) Task.pure(t)
if (t._1.isEmpty) Task.pure(t)
else multiPass(t._2, t._1)
}

View File

@ -17,24 +17,92 @@ import docspell.store.records._
* by looking up values from NER in the user's address book.
*/
object FindProposal {
type Args = ProcessItemArgs
def apply[F[_]: Sync](
cfg: Config.Processing
)(data: ItemData): Task[F, ProcessItemArgs, ItemData] =
cfg: Config.TextAnalysis
)(data: ItemData): Task[F, Args, ItemData] =
Task { ctx =>
val rmas = data.metas.map(rm => rm.copy(nerlabels = removeDuplicates(rm.nerlabels)))
ctx.logger.info("Starting find-proposal") *>
rmas
for {
_ <- ctx.logger.info("Starting find-proposal")
rmv <- rmas
.traverse(rm =>
processAttachment(cfg, rm, data.findDates(rm), ctx)
.map(ml => rm.copy(proposals = ml))
)
.map(rmv => data.copy(metas = rmv))
clp <- lookupClassifierProposals(ctx, data.classifyProposals)
} yield data.copy(metas = rmv, classifyProposals = clp)
}
def lookupClassifierProposals[F[_]: Sync](
ctx: Context[F, Args],
mpList: MetaProposalList
): F[MetaProposalList] = {
val coll = ctx.args.meta.collective
def lookup(mp: MetaProposal): F[Option[IdRef]] =
mp.proposalType match {
case MetaProposalType.CorrOrg =>
ctx.store
.transact(
ROrganization
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier organization for $mp: $oref")
)
case MetaProposalType.CorrPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, false)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier corr-person for $mp: $oref")
)
case MetaProposalType.ConcPerson =>
ctx.store
.transact(
RPerson
.findLike(coll, mp.values.head.ref.name.toLowerCase, true)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-person for $mp: $oref")
)
case MetaProposalType.ConcEquip =>
ctx.store
.transact(
REquipment
.findLike(coll, mp.values.head.ref.name.toLowerCase)
.map(_.headOption)
)
.flatTap(oref =>
ctx.logger.debug(s"Found classifier conc-equip for $mp: $oref")
)
case MetaProposalType.DocDate =>
(None: Option[IdRef]).pure[F]
case MetaProposalType.DueDate =>
(None: Option[IdRef]).pure[F]
}
def updateRef(mp: MetaProposal)(idRef: Option[IdRef]): Option[MetaProposal] =
idRef // this proposal contains a single value only, since coming from classifier
.map(ref => mp.copy(values = mp.values.map(_.copy(ref = ref))))
ctx.logger.debug(s"Looking up classifier results: ${mpList.proposals}") *>
mpList.proposals
.traverse(mp => lookup(mp).map(updateRef(mp)))
.map(_.flatten)
.map(MetaProposalList.apply)
}
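Note: the classifier only yields a name, so lookupClassifierProposals resolves that name against the collective's address book to obtain a real entity id; proposals that no longer match anything are dropped. A standalone sketch of that resolution step, with assumed names and ids and a simple contains check instead of the SQL findLike query:

// Standalone sketch of resolving a classifier label against the address book.
object LookupSketch extends App {
  final case class IdRef(id: String, name: String)

  val orgs = List(IdRef("org-1", "ACME Corp"), IdRef("org-2", "Big Bank"))

  def findLike(name: String): Option[IdRef] =
    orgs.find(_.name.toLowerCase.contains(name.toLowerCase))

  def resolve(label: String): Option[IdRef] = findLike(label)

  assert(resolve("acme") == Some(IdRef("org-1", "ACME Corp")))
  assert(resolve("unknown company").isEmpty) // such a proposal would be dropped
}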
def processAttachment[F[_]: Sync](
cfg: Config.Processing,
cfg: Config.TextAnalysis,
rm: RAttachmentMeta,
rd: Vector[NerDateLabel],
ctx: Context[F, ProcessItemArgs]
@ -46,11 +114,11 @@ object FindProposal {
}
def makeDateProposal[F[_]: Sync](
cfg: Config.Processing,
cfg: Config.TextAnalysis,
dates: Vector[NerDateLabel]
): F[MetaProposalList] =
Timestamp.current[F].map { now =>
val maxFuture = now.plus(Duration.years(cfg.maxDueDateYears.toLong))
val maxFuture = now.plus(Duration.years(cfg.nlp.maxDueDateYears.toLong))
val latestFirst = dates
.filter(_.date.isBefore(maxFuture.toUtcDate))
.sortWith((l1, l2) => l1.date.isAfter(l2.date))

View File

@ -15,6 +15,9 @@ import docspell.store.records.{RAttachment, RAttachmentMeta, RItem}
* containing the source or origin file
* @param givenMeta meta data for this item that was not "guessed"
* from an attachment but given and thus is always correct
* @param classifyProposals these are proposals that were obtained by
* a trained classifier. There are no ner-tags; it only provides a
* single label
*/
case class ItemData(
item: RItem,
@ -23,7 +26,11 @@ case class ItemData(
dateLabels: Vector[AttachmentDates],
originFile: Map[Ident, Ident], // maps RAttachment.id -> FileMeta.id
givenMeta: MetaProposalList, // given meta data not associated to a specific attachment
tags: List[String] // a list of tags (names or ids) attached to the item if they exist
// a list of tags (names or ids) attached to the item if they exist
tags: List[String],
// proposals obtained from the classifier
classifyProposals: MetaProposalList,
classifyTags: List[String]
) {
def findMeta(attachId: Ident): Option[RAttachmentMeta] =
@ -32,8 +39,12 @@ case class ItemData(
def findDates(rm: RAttachmentMeta): Vector[NerDateLabel] =
dateLabels.find(m => m.rm.id == rm.id).map(_.dates).getOrElse(Vector.empty)
def mapMeta(attachId: Ident, f: RAttachmentMeta => RAttachmentMeta): ItemData = {
val item = changeMeta(attachId, f)
def mapMeta(
attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta
): ItemData = {
val item = changeMeta(attachId, lang, f)
val next = metas.map(a => if (a.id == attachId) item else a)
copy(metas = next)
}
@ -43,13 +54,14 @@ case class ItemData(
def changeMeta(
attachId: Ident,
lang: Language,
f: RAttachmentMeta => RAttachmentMeta
): RAttachmentMeta =
f(findOrCreate(attachId))
f(findOrCreate(attachId, lang))
def findOrCreate(attachId: Ident): RAttachmentMeta =
def findOrCreate(attachId: Ident, lang: Language): RAttachmentMeta =
metas.find(_.id == attachId).getOrElse {
RAttachmentMeta.empty(attachId)
RAttachmentMeta.empty(attachId, lang)
}
}
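Note: classifyProposals are kept separate from the NER-based proposals and only fill in proposal types for which NER found nothing (see the fillEmptyFrom call in LinkProposal below). A standalone sketch of that merge, using a plain Map instead of MetaProposalList:

// Standalone sketch of the "fill empty from" idea: NER-based proposals win,
// classifier proposals only fill proposal types that are still missing.
object FillEmptySketch extends App {
  type ProposalType = String

  def fillEmptyFrom(
      ner: Map[ProposalType, String],
      classifier: Map[ProposalType, String]
  ): Map[ProposalType, String] =
    classifier ++ ner // keys present in `ner` override the classifier ones

  val ner        = Map("corr-org" -> "ACME Corp (from NER)")
  val classified = Map("corr-org" -> "ACME Corp (classifier)", "conc-person" -> "Jane Doe")

  val merged = fillEmptyFrom(ner, classified)
  assert(merged("corr-org") == "ACME Corp (from NER)") // NER result kept
  assert(merged("conc-person") == "Jane Doe")          // gap filled by the classifier
}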

View File

@ -24,6 +24,7 @@ object LinkProposal {
.flatten(data.metas.map(_.proposals))
.filter(_.proposalType != MetaProposalType.DocDate)
.sortByWeights
.fillEmptyFrom(data.classifyProposals)
ctx.logger.info(s"Starting linking proposals") *>
MetaProposalType.all

View File

@ -41,7 +41,7 @@ object ProcessItem {
regexNer: RegexNerFile[F]
)(item: ItemData): Task[F, ProcessItemArgs, ItemData] =
TextAnalysis[F](cfg.textAnalysis, analyser, regexNer)(item)
.flatMap(FindProposal[F](cfg.processing))
.flatMap(FindProposal[F](cfg.textAnalysis))
.flatMap(EvalProposals[F])
.flatMap(SaveProposals[F])

View File

@ -65,6 +65,8 @@ object ReProcessItem {
Vector.empty,
asrcMap.view.mapValues(_.fileId).toMap,
MetaProposalList.empty,
Nil,
MetaProposalList.empty,
Nil
)).getOrElseF(
Sync[F].raiseError(new Exception(s"Item not found: ${ctx.args.itemId.id}"))

View File

@ -4,21 +4,51 @@ import cats.effect.Sync
import cats.implicits._
import docspell.common._
import docspell.joex.scheduler.Task
import docspell.joex.scheduler.{Context, Task}
import docspell.store.AddResult
import docspell.store.records._
/** Saves the proposals in the database
*/
object SaveProposals {
type Args = ProcessItemArgs
def apply[F[_]: Sync](data: ItemData): Task[F, ProcessItemArgs, ItemData] =
def apply[F[_]: Sync](data: ItemData): Task[F, Args, ItemData] =
Task { ctx =>
ctx.logger.info("Storing proposals") *>
data.metas
for {
_ <- ctx.logger.info("Storing proposals")
_ <- data.metas
.traverse(rm =>
ctx.logger.debug(s"Storing attachment proposals: ${rm.proposals}") *>
ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
ctx.logger.debug(
s"Storing attachment proposals: ${rm.proposals}"
) *> ctx.store.transact(RAttachmentMeta.updateProposals(rm.id, rm.proposals))
)
.map(_ => data)
_ <-
if (data.classifyProposals.isEmpty && data.classifyTags.isEmpty) 0.pure[F]
else saveItemProposal(ctx, data)
} yield data
}
def saveItemProposal[F[_]: Sync](ctx: Context[F, Args], data: ItemData): F[Unit] = {
def upsert(v: RItemProposal): F[Int] =
ctx.store.add(RItemProposal.insert(v), RItemProposal.exists(v.itemId)).flatMap {
case AddResult.Success => 1.pure[F]
case AddResult.EntityExists(_) =>
ctx.store.transact(RItemProposal.update(v))
case AddResult.Failure(ex) =>
ctx.logger.warn(s"Could not store item proposals: ${ex.getMessage}") *> 0
.pure[F]
}
for {
_ <- ctx.logger.debug(s"Storing classifier proposals: ${data.classifyProposals}")
tags <- ctx.store.transact(
RTag.findAllByNameOrId(data.classifyTags, ctx.args.meta.collective)
)
tagRefs = tags.map(t => IdRef(t.tagId, t.name))
now <- Timestamp.current[F]
value = RItemProposal(data.item.id, data.classifyProposals, tagRefs.toList, now)
_ <- upsert(value)
} yield ()
}
}
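Note: item-level classifier proposals are upserted: an insert is attempted first, an existing row is updated instead, and failures are only logged. A standalone sketch of that pattern, simulated with an in-memory map instead of the AddResult/database machinery:

// Standalone sketch of the insert-or-update pattern used for RItemProposal.
object UpsertSketch extends App {
  final case class ItemProposal(itemId: String, payload: String)

  val table = scala.collection.mutable.Map.empty[String, ItemProposal]

  def insert(v: ItemProposal): Either[String, Unit] =
    if (table.contains(v.itemId)) Left("entity exists")
    else { table.update(v.itemId, v); Right(()) }

  def upsert(v: ItemProposal): Unit =
    insert(v) match {
      case Right(()) => ()                                // AddResult.Success
      case Left(_) if table.contains(v.itemId) =>
        table.update(v.itemId, v)                         // AddResult.EntityExists -> update
      case Left(err) =>
        println(s"Could not store item proposals: $err")  // AddResult.Failure -> only log
    }

  upsert(ItemProposal("item-1", "v1"))
  upsert(ItemProposal("item-1", "v2"))
  assert(table("item-1").payload == "v2")
}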

View File

@ -45,7 +45,8 @@ object SetGivenData {
Task { ctx =>
val itemId = data.item.id
val collective = ctx.args.meta.collective
val tags = (ctx.args.meta.tags.getOrElse(Nil) ++ data.tags).distinct
val tags =
(ctx.args.meta.tags.getOrElse(Nil) ++ data.tags ++ data.classifyTags).distinct
for {
_ <- ctx.logger.info(s"Set tags from given data: ${tags}")
e <- ops.linkTags(itemId, tags, collective).attempt

View File

@ -1,24 +1,20 @@
package docspell.joex.process
import cats.data.OptionT
import cats.Traverse
import cats.effect._
import cats.implicits._
import docspell.analysis.TextAnalyser
import docspell.analysis.nlp.ClassifierModel
import docspell.analysis.nlp.StanfordNerSettings
import docspell.analysis.nlp.TextClassifier
import docspell.analysis.classifier.TextClassifier
import docspell.analysis.{NlpSettings, TextAnalyser}
import docspell.common.MetaProposal.Candidate
import docspell.common._
import docspell.joex.Config
import docspell.joex.analysis.RegexNerFile
import docspell.joex.learn.LearnClassifierTask
import docspell.joex.learn.{ClassifierName, Classify, LearnClassifierTask}
import docspell.joex.process.ItemData.AttachmentDates
import docspell.joex.scheduler.Context
import docspell.joex.scheduler.Task
import docspell.store.records.RAttachmentMeta
import docspell.store.records.RClassifierSetting
import bitpeace.RangeDef
import docspell.store.records.{RAttachmentMeta, RClassifierSetting}
object TextAnalysis {
type Args = ProcessItemArgs
@ -41,13 +37,27 @@ object TextAnalysis {
_ <- t.traverse(m =>
ctx.store.transact(RAttachmentMeta.updateLabels(m._1.id, m._1.nerlabels))
)
v = t.toVector
autoTagEnabled <- getActiveAutoTag(ctx, cfg)
tag <-
if (autoTagEnabled) predictTags(ctx, cfg, item.metas, analyser.classifier)
else List.empty[String].pure[F]
classProposals <-
if (cfg.classification.enabled)
predictItemEntities(ctx, cfg, item.metas, analyser.classifier)
else MetaProposalList.empty.pure[F]
e <- s
_ <- ctx.logger.info(s"Text-Analysis finished in ${e.formatExact}")
v = t.toVector
tag <- predictTag(ctx, cfg, item.metas, analyser.classifier(ctx.blocker)).value
} yield item
.copy(metas = v.map(_._1), dateLabels = v.map(_._2))
.appendTags(tag.toSeq)
.copy(
metas = v.map(_._1),
dateLabels = v.map(_._2),
classifyProposals = classProposals,
classifyTags = tag
)
}
def annotateAttachment[F[_]: Sync](
@ -55,7 +65,7 @@ object TextAnalysis {
analyser: TextAnalyser[F],
nerFile: RegexNerFile[F]
)(rm: RAttachmentMeta): F[(RAttachmentMeta, AttachmentDates)] = {
val settings = StanfordNerSettings(ctx.args.meta.language, false, None)
val settings = NlpSettings(ctx.args.meta.language, false, None)
for {
customNer <- nerFile.makeFile(ctx.args.meta.collective)
sett = settings.copy(regexNer = customNer)
@ -68,44 +78,84 @@ object TextAnalysis {
} yield (rm.copy(nerlabels = labels.all.toList), AttachmentDates(rm, labels.dates))
}
def predictTag[F[_]: Sync: ContextShift](
def predictTags[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis,
metas: Vector[RAttachmentMeta],
classifier: TextClassifier[F]
): OptionT[F, String] =
for {
model <- findActiveModel(ctx, cfg)
_ <- OptionT.liftF(ctx.logger.info(s"Guessing tag …"))
text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
modelData =
ctx.store.bitpeace
.get(model.id)
.unNoneTerminate
.through(ctx.store.bitpeace.fetchData2(RangeDef.all))
cls <- OptionT(File.withTempDir(cfg.workingDir, "classify").use { dir =>
val modelFile = dir.resolve("model.ser.gz")
modelData
.through(fs2.io.file.writeAll(modelFile, ctx.blocker))
.compile
.drain
.flatMap(_ => classifier.classify(ctx.logger, ClassifierModel(modelFile), text))
}).filter(_ != LearnClassifierTask.noClass)
_ <- OptionT.liftF(ctx.logger.debug(s"Guessed tag: ${cls}"))
} yield cls
): F[List[String]] = {
val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
val classifyWith: ClassifierName => F[Option[String]] =
makeClassify(ctx, cfg, classifier)(text)
private def findActiveModel[F[_]: Sync](
for {
names <- ctx.store.transact(
ClassifierName.findTagClassifiers(ctx.args.meta.collective)
)
_ <- ctx.logger.debug(s"Guessing tags for ${names.size} categories")
tags <- names.traverse(classifyWith)
} yield tags.flatten
}
def predictItemEntities[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): OptionT[F, Ident] =
(if (cfg.classification.enabled)
OptionT(ctx.store.transact(RClassifierSetting.findById(ctx.args.meta.collective)))
.filter(_.enabled)
.mapFilter(_.fileId)
else
OptionT.none[F, Ident]).orElse(
OptionT.liftF(ctx.logger.info("Classification is disabled.")) *> OptionT
.none[F, Ident]
cfg: Config.TextAnalysis,
metas: Vector[RAttachmentMeta],
classifier: TextClassifier[F]
): F[MetaProposalList] = {
val text = metas.flatMap(_.content).mkString(LearnClassifierTask.pageSep)
def classifyWith(
cname: ClassifierName,
mtype: MetaProposalType
): F[Option[MetaProposal]] =
for {
label <- makeClassify(ctx, cfg, classifier)(text).apply(cname)
} yield label.map(str =>
MetaProposal(mtype, Candidate(IdRef(Ident.unsafe(""), str), Set.empty))
)
Traverse[List]
.sequence(
List(
classifyWith(ClassifierName.correspondentOrg, MetaProposalType.CorrOrg),
classifyWith(ClassifierName.correspondentPerson, MetaProposalType.CorrPerson),
classifyWith(ClassifierName.concernedPerson, MetaProposalType.ConcPerson),
classifyWith(ClassifierName.concernedEquip, MetaProposalType.ConcEquip)
)
)
.map(_.flatten)
.map(MetaProposalList.apply)
}
private def makeClassify[F[_]: Sync: ContextShift](
ctx: Context[F, Args],
cfg: Config.TextAnalysis,
classifier: TextClassifier[F]
)(text: String): ClassifierName => F[Option[String]] =
Classify[F](
ctx.blocker,
ctx.logger,
cfg.workingDir,
ctx.store,
classifier,
ctx.args.meta.collective,
text
)
private def getActiveAutoTag[F[_]: Sync](
ctx: Context[F, Args],
cfg: Config.TextAnalysis
): F[Boolean] =
if (cfg.classification.enabled)
ctx.store
.transact(RClassifierSetting.findById(ctx.args.meta.collective))
.map(_.exists(_.autoTagEnabled))
.flatTap(enabled =>
if (enabled) ().pure[F]
else ctx.logger.info("Classification is disabled. Check config or settings.")
)
else
ctx.logger.info("Classification is disabled.") *> false.pure[F]
}

View File

@ -46,10 +46,14 @@ object TextExtraction {
)
_ <- fts.indexData(ctx.logger, (idxItem +: txt.map(_.td)).toSeq: _*)
dur <- start
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}")
extractedTags = txt.flatMap(_.tags).distinct.toList
_ <- ctx.logger.info(s"Text extraction finished in ${dur.formatExact}.")
_ <-
if (extractedTags.isEmpty) ().pure[F]
else ctx.logger.debug(s"Found tags in file: $extractedTags")
} yield item
.copy(metas = txt.map(_.am))
.appendTags(txt.flatMap(_.tags).distinct.toList)
.appendTags(extractedTags)
}
// -- helpers
@ -78,7 +82,7 @@ object TextExtraction {
pair._2
)
val rm = item.findOrCreate(ra.id)
val rm = item.findOrCreate(ra.id, lang)
rm.content match {
case Some(_) =>
ctx.logger.info("TextExtraction skipped, since text is already available.") *>
@ -102,6 +106,7 @@ object TextExtraction {
res <- extractTextFallback(ctx, cfg, ra, lang)(fids)
meta = item.changeMeta(
ra.id,
lang,
rm =>
rm.setContentIfEmpty(
res.map(_.appendPdfMetaToText.text.trim).filter(_.nonEmpty)

View File

@ -9,7 +9,7 @@ servers:
description: Current host
paths:
/api/info:
/api/info/version:
get:
tags: [ Api Info ]
summary: Get basic information about this software.

View File

@ -4850,14 +4850,11 @@ components:
description: |
Settings for learning a document classifier.
required:
- enabled
- schedule
- itemCount
- categoryList
- listType
properties:
enabled:
type: boolean
category:
type: string
itemCount:
type: integer
format: int32
@ -4867,6 +4864,16 @@ components:
schedule:
type: string
format: calevent
categoryList:
type: array
items:
type: string
listType:
type: string
format: listtype
enum:
- blacklist
- whitelist
SourceList:
description: |

View File

@ -6,7 +6,7 @@ import cats.implicits._
import docspell.backend.BackendApp
import docspell.backend.auth.AuthToken
import docspell.backend.ops.OCollective
import docspell.common.MakePreviewArgs
import docspell.common.{ListType, MakePreviewArgs}
import docspell.restapi.model._
import docspell.restserver.conv.Conversions
import docspell.restserver.http4s._
@ -44,10 +44,10 @@ object CollectiveRoutes {
settings.integrationEnabled,
Some(
OCollective.Classifier(
settings.classifier.enabled,
settings.classifier.schedule,
settings.classifier.itemCount,
settings.classifier.category
settings.classifier.categoryList,
settings.classifier.listType
)
)
)
@ -65,12 +65,12 @@ object CollectiveRoutes {
c.language,
c.integrationEnabled,
ClassifierSetting(
c.classifier.map(_.enabled).getOrElse(false),
c.classifier.flatMap(_.category),
c.classifier.map(_.itemCount).getOrElse(0),
c.classifier
.map(_.schedule)
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00"))
.getOrElse(CalEvent.unsafe("*-1/3-01 01:00:00")),
c.classifier.map(_.categories).getOrElse(Nil),
c.classifier.map(_.listType).getOrElse(ListType.whitelist)
)
)
)

View File

@ -0,0 +1,35 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
update "attachmentmeta"
set "language" = 'deu'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'deu'
);
update "attachmentmeta"
set "language" = 'eng'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'eng'
);
update "attachmentmeta"
set "language" = 'fra'
where "attachid" in (
select "m"."attachid"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
where "c"."doclang" = 'fra'
);

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select random_uuid() as "id", "cid", concat('tagcategory-', "category") as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -0,0 +1,14 @@
ALTER TABLE `attachmentmeta`
ADD COLUMN (`language` varchar(254));
update `attachmentmeta` `m`
inner join (
select `m`.`attachid`, `c`.`doclang`
from `attachmentmeta` m
inner join `attachment` a on `a`.`attachid` = `m`.`attachid`
inner join `item` i on `a`.`itemid` = `i`.`itemid`
inner join `collective` c on `c`.`cid` = `i`.`cid`
) as `c`
set `m`.`language` = `c`.`doclang`
where `m`.`attachid` = `c`.`attachid` and `m`.`language` is null;

View File

@ -0,0 +1,48 @@
CREATE TABLE `classifier_model`(
`id` varchar(254) not null primary key,
`cid` varchar(254) not null,
`name` varchar(254) not null,
`file_id` varchar(254) not null,
`created` timestamp not null,
foreign key (`cid`) references `collective`(`cid`),
foreign key (`file_id`) references `filemeta`(`id`),
unique (`cid`, `name`)
);
insert into `classifier_model`
select md5(rand()) as id, `cid`,concat('tagcategory-', `category`) as `name`, `file_id`, `created`
from `classifier_setting`
where `file_id` is not null;
alter table `classifier_setting`
add column (`categories` mediumtext);
alter table `classifier_setting`
add column (`category_list_type` varchar(254));
update `classifier_setting`
set `category_list_type` = 'whitelist';
update `classifier_setting`
set `categories` = concat('["', `category`, '"]')
where category is not null;
update `classifier_setting`
set `categories` = '[]'
where category is null;
alter table `classifier_setting`
drop column `category`;
-- mariadb requires dropping the constraint manually when dropping a column
alter table `classifier_setting`
drop constraint `classifier_setting_ibfk_2`;
alter table `classifier_setting`
drop column `file_id`;
ALTER TABLE `classifier_setting`
MODIFY `categories` mediumtext NOT NULL;
ALTER TABLE `classifier_setting`
MODIFY `category_list_type` varchar(254) NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE `item_proposal` (
`itemid` varchar(254) not null primary key,
`classifier_proposals` mediumtext not null,
`classifier_tags` mediumtext not null,
`created` timestamp not null,
foreign key (`itemid`) references `item`(`itemid`)
);

View File

@ -0,0 +1,15 @@
ALTER TABLE "attachmentmeta"
ADD COLUMN "language" varchar(254);
with
"attachlang" as (
select "m"."attachid", "m"."language", "c"."doclang"
from "attachmentmeta" m
inner join "attachment" a on "a"."attachid" = "m"."attachid"
inner join "item" i on "a"."itemid" = "i"."itemid"
inner join "collective" c on "c"."cid" = "i"."cid"
)
update "attachmentmeta" as "m"
set "language" = "c"."doclang"
from "attachlang" c
where "m"."attachid" = "c"."attachid" and "m"."language" is null;

View File

@ -0,0 +1,44 @@
CREATE TABLE "classifier_model"(
"id" varchar(254) not null primary key,
"cid" varchar(254) not null,
"name" varchar(254) not null,
"file_id" varchar(254) not null,
"created" timestamp not null,
foreign key ("cid") references "collective"("cid"),
foreign key ("file_id") references "filemeta"("id"),
unique ("cid", "name")
);
insert into "classifier_model"
select md5(random()::text) as id, "cid",'tagcategory-' || "category" as "name", "file_id", "created"
from "classifier_setting"
where "file_id" is not null;
alter table "classifier_setting"
add column "categories" text;
alter table "classifier_setting"
add column "category_list_type" varchar(254);
update "classifier_setting"
set "category_list_type" = 'whitelist';
update "classifier_setting"
set "categories" = concat('["', "category", '"]')
where category is not null;
update "classifier_setting"
set "categories" = '[]'
where category is null;
alter table "classifier_setting"
drop column "category";
alter table "classifier_setting"
drop column "file_id";
ALTER TABLE "classifier_setting"
ALTER COLUMN "categories" SET NOT NULL;
ALTER TABLE "classifier_setting"
ALTER COLUMN "category_list_type" SET NOT NULL;

View File

@ -0,0 +1,7 @@
CREATE TABLE "item_proposal" (
"itemid" varchar(254) not null primary key,
"classifier_proposals" text not null,
"classifier_tags" text not null,
"created" timestamp not null,
foreign key ("itemid") references "item"("itemid")
);

View File

@ -86,6 +86,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaItemProposalList: Meta[MetaProposalList] =
jsonMeta[MetaProposalList]
implicit val metaIdRef: Meta[List[IdRef]] =
jsonMeta[List[IdRef]]
implicit val metaLanguage: Meta[Language] =
Meta[String].imap(Language.unsafe)(_.iso3)
@ -97,6 +100,9 @@ trait DoobieMeta extends EmilDoobieMeta {
implicit val metaCustomFieldType: Meta[CustomFieldType] =
Meta[String].timap(CustomFieldType.unsafe)(_.name)
implicit val metaListType: Meta[ListType] =
Meta[String].timap(ListType.unsafeFromString)(_.name)
}
object DoobieMeta extends DoobieMeta {

View File

@ -1,5 +1,7 @@
package docspell.store.qb
import cats.data.NonEmptyList
sealed trait DBFunction {}
object DBFunction {
@ -31,6 +33,8 @@ object DBFunction {
case class Sum(expr: SelectExpr) extends DBFunction
case class Concat(exprs: NonEmptyList[SelectExpr]) extends DBFunction
sealed trait Operator
object Operator {
case object Plus extends Operator

View File

@ -98,6 +98,9 @@ trait DSL extends DoobieMeta {
def substring(expr: SelectExpr, start: Int, length: Int): DBFunction =
DBFunction.Substring(expr, start, length)
def concat(expr: SelectExpr, exprs: SelectExpr*): DBFunction =
DBFunction.Concat(Nel.of(expr, exprs: _*))
def lit[A](value: A)(implicit P: Put[A]): SelectExpr.SelectLit[A] =
SelectExpr.SelectLit(value, None)

View File

@ -32,6 +32,10 @@ object DBFunctionBuilder extends CommonBuilder {
case DBFunction.Substring(expr, start, len) =>
sql"SUBSTRING(" ++ SelectExprBuilder.build(expr) ++ fr" FROM $start FOR $len)"
case DBFunction.Concat(exprs) =>
val inner = exprs.map(SelectExprBuilder.build).toList.reduce(_ ++ comma ++ _)
sql"CONCAT(" ++ inner ++ sql")"
case DBFunction.Calc(op, left, right) =>
SelectExprBuilder.build(left) ++
buildOperator(op) ++

View File

@ -21,6 +21,7 @@ object QAttachment {
private val item = RItem.as("i")
private val am = RAttachmentMeta.as("am")
private val c = RCollective.as("c")
private val im = RItemProposal.as("im")
def deletePreview[F[_]: Sync](store: Store[F])(attachId: Ident): F[Int] = {
val findPreview =
@ -118,17 +119,27 @@ object QAttachment {
} yield ns.sum
def getMetaProposals(itemId: Ident, coll: Ident): ConnectionIO[MetaProposalList] = {
val q = Select(
am.proposals.s,
val qa = Select(
select(am.proposals),
from(am)
.innerJoin(a, a.id === am.id)
.innerJoin(item, a.itemId === item.id),
a.itemId === itemId && item.cid === coll
).build
val qi = Select(
select(im.classifyProposals),
from(im)
.innerJoin(item, item.id === im.itemId),
item.cid === coll && im.itemId === itemId
).build
for {
ml <- q.query[MetaProposalList].to[Vector]
} yield MetaProposalList.flatten(ml)
mla <- qa.query[MetaProposalList].to[Vector]
mli <- qi.query[MetaProposalList].to[Vector]
} yield MetaProposalList
.flatten(mla)
.insertSecond(MetaProposalList.flatten(mli))
}
def getAttachmentMeta(
@ -160,7 +171,15 @@ object QAttachment {
chunkSize: Int
): Stream[ConnectionIO, ContentAndName] =
Select(
select(a.id, a.itemId, item.cid, item.folder, c.language, a.name, am.content),
select(
a.id.s,
a.itemId.s,
item.cid.s,
item.folder.s,
coalesce(am.language.s, c.language.s).s,
a.name.s,
am.content.s
),
from(a)
.innerJoin(am, am.id === a.id)
.innerJoin(item, item.id === a.itemId)

View File

@ -1,10 +1,8 @@
package docspell.store.queries
import cats.data.OptionT
import fs2.Stream
import docspell.common.ContactKind
import docspell.common.{Direction, Ident}
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import docspell.store.records._
@ -17,6 +15,7 @@ object QCollective {
private val t = RTag.as("t")
private val ro = ROrganization.as("o")
private val rp = RPerson.as("p")
private val re = REquipment.as("e")
private val rc = RContact.as("c")
private val i = RItem.as("i")
@ -25,13 +24,37 @@ object QCollective {
val empty = Names(Vector.empty, Vector.empty, Vector.empty)
}
def allNames(collective: Ident): ConnectionIO[Names] =
(for {
orgs <- OptionT.liftF(ROrganization.findAllRef(collective, None, _.name))
pers <- OptionT.liftF(RPerson.findAllRef(collective, None, _.name))
equp <- OptionT.liftF(REquipment.findAll(collective, None, _.name))
} yield Names(orgs.map(_.name), pers.map(_.name), equp.map(_.name)))
.getOrElse(Names.empty)
def allNames(collective: Ident, maxEntries: Int): ConnectionIO[Names] = {
val created = Column[Timestamp]("created", TableDef(""))
union(
Select(
select(ro.name.s, lit(1).as("kind"), ro.created.as(created)),
from(ro),
ro.cid === collective
),
Select(
select(rp.name.s, lit(2).as("kind"), rp.created.as(created)),
from(rp),
rp.cid === collective
),
Select(
select(re.name.s, lit(3).as("kind"), re.created.as(created)),
from(re),
re.cid === collective
)
).orderBy(created.desc)
.limit(Batch.limit(maxEntries))
.build
.query[(String, Int)]
.streamWithChunkSize(maxEntries)
.fold(Names.empty) { case (names, (name, kind)) =>
if (kind == 1) names.copy(org = names.org :+ name)
else if (kind == 2) names.copy(pers = names.pers :+ name)
else names.copy(equip = names.equip :+ name)
}
.compile
.lastOrError
}
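Note: allNames now fetches organization, person and equipment names in a single UNION query tagged with a kind column (1, 2, 3), newest first and capped at maxEntries, and folds the resulting stream back into the three name lists. A standalone sketch of that fold with assumed rows:

// Standalone sketch of the fold at the end of allNames above.
object AllNamesSketch extends App {
  final case class Names(org: Vector[String], pers: Vector[String], equip: Vector[String])
  val empty = Names(Vector.empty, Vector.empty, Vector.empty)

  // (name, kind) pairs as returned by the UNION query -- assumed example rows
  val rows = List(("ACME Corp", 1), ("Jane Doe", 2), ("Printer X", 3), ("Big Bank", 1))

  val names = rows.foldLeft(empty) { case (acc, (name, kind)) =>
    if (kind == 1) acc.copy(org = acc.org :+ name)
    else if (kind == 2) acc.copy(pers = acc.pers :+ name)
    else acc.copy(equip = acc.equip :+ name)
  }

  assert(names.org == Vector("ACME Corp", "Big Bank"))
  assert(names.pers == Vector("Jane Doe"))
  assert(names.equip == Vector("Printer X"))
}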
case class InsightData(
incoming: Int,

View File

@ -441,8 +441,9 @@ object QItem {
tn <- store.transact(RTagItem.deleteItemTags(itemId))
mn <- store.transact(RSentMail.deleteByItem(itemId))
cf <- store.transact(RCustomFieldValue.deleteByItem(itemId))
im <- store.transact(RItemProposal.deleteByItem(itemId))
n <- store.transact(RItem.deleteByIdAndCollective(itemId, collective))
} yield tn + rn + n + mn + cf
} yield tn + rn + n + mn + cf + im
private def findByFileIdsQuery(
fileMetaIds: Nel[Ident],
@ -543,11 +544,13 @@ object QItem {
def findAllNewesFirst(
collective: Ident,
chunkSize: Int
chunkSize: Int,
limit: Batch
): Stream[ConnectionIO, Ident] = {
val i = RItem.as("i")
Select(i.id.s, from(i), i.cid === collective && i.state === ItemState.confirmed)
.orderBy(i.created.desc)
.limit(limit)
.build
.query[Ident]
.streamWithChunkSize(chunkSize)
@ -557,6 +560,7 @@ object QItem {
collective: Ident,
itemId: Ident,
tagCategory: String,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] = {
val tags = TableDef("tags").as("tt")
@ -564,7 +568,7 @@ object QItem {
val tagsTid = Column[Ident]("tid", tags)
val tagsName = Column[String]("tname", tags)
val q =
readTextAndTag(collective, itemId, pageSep) {
withCte(
tags -> Select(
select(ti.itemId.as(tagsItem), tag.tid.as(tagsTid), tag.name.as(tagsName)),
@ -574,25 +578,98 @@ object QItem {
)
)(
Select(
select(m.content, tagsTid, tagsName),
select(substring(m.content.s, 0, maxLen).s, tagsTid.s, tagsName.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, a.id === m.id)
.leftJoin(tags, tagsItem === i.id),
i.id === itemId && i.cid === collective && m.content.isNotNull && m.content <> ""
)
).build
)
}
}
def resolveTextAndCorrOrg(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, org.oid.s, org.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(org, org.oid === i.corrOrg),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndCorrPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.corrPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcPerson(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, pers0.pid.s, pers0.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(pers0, pers0.pid === i.concPerson),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
def resolveTextAndConcEquip(
collective: Ident,
itemId: Ident,
maxLen: Int,
pageSep: String
): ConnectionIO[TextAndTag] =
readTextAndTag(collective, itemId, pageSep) {
Select(
select(substring(m.content.s, 0, maxLen).s, equip.eid.s, equip.name.s),
from(i)
.innerJoin(a, a.itemId === i.id)
.innerJoin(m, m.id === a.id)
.leftJoin(equip, equip.eid === i.concEquipment),
i.id === itemId && m.content.isNotNull && m.content <> ""
)
}
private def readTextAndTag(collective: Ident, itemId: Ident, pageSep: String)(
q: Select
): ConnectionIO[TextAndTag] =
for {
_ <- logger.ftrace[ConnectionIO](
s"query: $q (${itemId.id}, ${collective.id}, ${tagCategory})"
s"query: $q (${itemId.id}, ${collective.id})"
)
texts <- q.query[(String, Option[TextAndTag.TagName])].to[List]
texts <- q.build.query[(String, Option[TextAndTag.TagName])].to[List]
_ <- logger.ftrace[ConnectionIO](
s"Got ${texts.size} text and tag entries for item ${itemId.id}"
)
tag = texts.headOption.flatMap(_._2)
txt = texts.map(_._1).mkString(pageSep)
} yield TextAndTag(itemId, txt, tag)
}
}

View File

@ -15,7 +15,8 @@ case class RAttachmentMeta(
content: Option[String],
nerlabels: List[NerLabel],
proposals: MetaProposalList,
pages: Option[Int]
pages: Option[Int],
language: Option[Language]
) {
def setContentIfEmpty(txt: Option[String]): RAttachmentMeta =
@ -27,8 +28,8 @@ case class RAttachmentMeta(
}
object RAttachmentMeta {
def empty(attachId: Ident) =
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None)
def empty(attachId: Ident, lang: Language) =
RAttachmentMeta(attachId, None, Nil, MetaProposalList.empty, None, Some(lang))
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "attachmentmeta"
@ -38,7 +39,16 @@ object RAttachmentMeta {
val nerlabels = Column[List[NerLabel]]("nerlabels", this)
val proposals = Column[MetaProposalList]("itemproposals", this)
val pages = Column[Int]("page_count", this)
val all = NonEmptyList.of[Column[_]](id, content, nerlabels, proposals, pages)
val language = Column[Language]("language", this)
val all =
NonEmptyList.of[Column[_]](
id,
content,
nerlabels,
proposals,
pages,
language
)
}
val T = Table(None)
@ -49,7 +59,7 @@ object RAttachmentMeta {
DML.insert(
T,
T.all,
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages}"
fr"${v.id},${v.content},${v.nerlabels},${v.proposals},${v.pages},${v.language}"
)
def exists(attachId: Ident): ConnectionIO[Boolean] =
@ -90,13 +100,14 @@ object RAttachmentMeta {
)
)
def updateProposals(mid: Ident, plist: MetaProposalList): ConnectionIO[Int] =
def updateProposals(
mid: Ident,
plist: MetaProposalList
): ConnectionIO[Int] =
DML.update(
T,
T.id === mid,
DML.set(
T.proposals.setTo(plist)
)
DML.set(T.proposals.setTo(plist))
)
def updatePageCount(mid: Ident, pageCount: Option[Int]): ConnectionIO[Int] =

View File

@ -0,0 +1,102 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.effect._
import cats.implicits._
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
final case class RClassifierModel(
id: Ident,
cid: Ident,
name: String,
fileId: Ident,
created: Timestamp
) {}
object RClassifierModel {
def createNew[F[_]: Sync](
cid: Ident,
name: String,
fileId: Ident
): F[RClassifierModel] =
for {
id <- Ident.randomId[F]
now <- Timestamp.current[F]
} yield RClassifierModel(id, cid, name, fileId, now)
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_model"
val id = Column[Ident]("id", this)
val cid = Column[Ident]("cid", this)
val name = Column[String]("name", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](id, cid, name, fileId, created)
}
def as(alias: String): Table =
Table(Some(alias))
val T = Table(None)
def insert(v: RClassifierModel): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.id},${v.cid},${v.name},${v.fileId},${v.created}"
)
def updateFile(coll: Ident, name: String, fid: Ident): ConnectionIO[Int] =
for {
now <- Timestamp.current[ConnectionIO]
n <- DML.update(
T,
T.cid === coll && T.name === name,
DML.set(T.fileId.setTo(fid), T.created.setTo(now))
)
k <-
if (n == 0) createNew[ConnectionIO](coll, name, fid).flatMap(insert)
else 0.pure[ConnectionIO]
} yield n + k
def deleteById(id: Ident): ConnectionIO[Int] =
DML.delete(T, T.id === id)
def deleteAll(ids: List[Ident]): ConnectionIO[Int] =
NonEmptyList.fromList(ids) match {
case Some(nel) =>
DML.delete(T, T.id.in(nel))
case None =>
0.pure[ConnectionIO]
}
def findByName(cid: Ident, name: String): ConnectionIO[Option[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name === name).build
.query[RClassifierModel]
.option
def findAllByName(
cid: Ident,
names: NonEmptyList[String]
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.in(names)).build
.query[RClassifierModel]
.to[List]
def findAllByQuery(
cid: Ident,
nameQuery: String
): ConnectionIO[List[RClassifierModel]] =
Select(select(T.all), from(T), T.cid === cid && T.name.like(nameQuery)).build
.query[RClassifierModel]
.to[List]
}

View File

@ -1,6 +1,6 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.data.{NonEmptyList, OptionT}
import cats.implicits._
import docspell.common._
@ -13,27 +13,38 @@ import doobie.implicits._
case class RClassifierSetting(
cid: Ident,
enabled: Boolean,
schedule: CalEvent,
category: String,
itemCount: Int,
fileId: Option[Ident],
created: Timestamp
) {}
created: Timestamp,
categoryList: List[String],
listType: ListType
) {
def autoTagEnabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categoryList.nonEmpty
}
}
object RClassifierSetting {
// the categoryList is stored as a json array
implicit val stringListMeta: Meta[List[String]] =
jsonMeta[List[String]]
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "classifier_setting"
val cid = Column[Ident]("cid", this)
val enabled = Column[Boolean]("enabled", this)
val schedule = Column[CalEvent]("schedule", this)
val category = Column[String]("category", this)
val itemCount = Column[Int]("item_count", this)
val fileId = Column[Ident]("file_id", this)
val created = Column[Timestamp]("created", this)
val cid = Column[Ident]("cid", this)
val schedule = Column[CalEvent]("schedule", this)
val itemCount = Column[Int]("item_count", this)
val created = Column[Timestamp]("created", this)
val categories = Column[List[String]]("categories", this)
val listType = Column[ListType]("category_list_type", this)
val all = NonEmptyList
.of[Column[_]](cid, enabled, schedule, category, itemCount, fileId, created)
.of[Column[_]](cid, schedule, itemCount, created, categories, listType)
}
val T = Table(None)
@ -44,35 +55,19 @@ object RClassifierSetting {
DML.insert(
T,
T.all,
fr"${v.cid},${v.enabled},${v.schedule},${v.category},${v.itemCount},${v.fileId},${v.created}"
fr"${v.cid},${v.schedule},${v.itemCount},${v.created},${v.categoryList},${v.listType}"
)
def updateAll(v: RClassifierSetting): ConnectionIO[Int] =
DML.update(
T,
T.cid === v.cid,
DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule),
T.category.setTo(v.category),
T.itemCount.setTo(v.itemCount),
T.fileId.setTo(v.fileId)
)
)
def updateFile(coll: Ident, fid: Ident): ConnectionIO[Int] =
DML.update(T, T.cid === coll, DML.set(T.fileId.setTo(fid)))
def updateSettings(v: RClassifierSetting): ConnectionIO[Int] =
def update(v: RClassifierSetting): ConnectionIO[Int] =
for {
n1 <- DML.update(
T,
T.cid === v.cid,
DML.set(
T.enabled.setTo(v.enabled),
T.schedule.setTo(v.schedule),
T.itemCount.setTo(v.itemCount),
T.category.setTo(v.category)
T.categories.setTo(v.categoryList),
T.listType.setTo(v.listType)
)
)
n2 <- if (n1 <= 0) insert(v) else 0.pure[ConnectionIO]
@ -86,27 +81,62 @@ object RClassifierSetting {
def delete(coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.cid === coll)
/** Finds tag categories that exist and match the classifier setting.
* If the setting contains a blacklist, those categories are removed from
* the existing categories. If it is a whitelist, the intersection is
* returned.
*/
def getActiveCategories(coll: Ident): ConnectionIO[List[String]] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
res = sett.listType match {
case ListType.Blacklist =>
cats.diff(sett.categoryList)
case ListType.Whitelist =>
sett.categoryList.intersect(cats)
}
} yield res).getOrElse(Nil)
/** Checks the json array of tag categories and removes those that are not present anymore. */
def fixCategoryList(coll: Ident): ConnectionIO[Int] =
(for {
sett <- OptionT(findById(coll))
cats <- OptionT.liftF(RTag.listCategories(coll))
fixed = sett.categoryList.intersect(cats)
n <- OptionT.liftF(
if (fixed == sett.categoryList) 0.pure[ConnectionIO]
else DML.update(T, T.cid === coll, DML.set(T.categories.setTo(fixed)))
)
} yield n).getOrElse(0)
case class Classifier(
enabled: Boolean,
schedule: CalEvent,
itemCount: Int,
category: Option[String]
categories: List[String],
listType: ListType
) {
def enabled: Boolean =
listType match {
case ListType.Blacklist =>
true
case ListType.Whitelist =>
categories.nonEmpty
}
def toRecord(coll: Ident, created: Timestamp): RClassifierSetting =
RClassifierSetting(
coll,
enabled,
schedule,
category.getOrElse(""),
itemCount,
None,
created
created,
categories,
listType
)
}
object Classifier {
def fromRecord(r: RClassifierSetting): Classifier =
Classifier(r.enabled, r.schedule, r.itemCount, r.category.some)
Classifier(r.schedule, r.itemCount, r.categoryList, r.listType)
}
}

View File

@ -1,6 +1,6 @@
package docspell.store.records
import cats.data.NonEmptyList
import cats.data.{NonEmptyList, OptionT}
import fs2.Stream
import docspell.common._
@ -73,13 +73,24 @@ object RCollective {
.map(now => settings.classifier.map(_.toRecord(cid, now)))
n2 <- cls match {
case Some(cr) =>
RClassifierSetting.updateSettings(cr)
RClassifierSetting.update(cr)
case None =>
RClassifierSetting.delete(cid)
}
} yield n1 + n2
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
// this hides categories that have been deleted in the meantime
// they are finally removed from the json array once the learn classifier task is run
def getSettings(coll: Ident): ConnectionIO[Option[Settings]] =
(for {
sett <- OptionT(getRawSettings(coll))
prev <- OptionT.fromOption[ConnectionIO](sett.classifier)
cats <- OptionT.liftF(RTag.listCategories(coll))
next = prev.copy(categories = prev.categories.intersect(cats))
} yield sett.copy(classifier = Some(next))).value
private def getRawSettings(coll: Ident): ConnectionIO[Option[Settings]] = {
import RClassifierSetting.stringListMeta
val c = RCollective.as("c")
val cs = RClassifierSetting.as("cs")
@ -87,10 +98,10 @@ object RCollective {
select(
c.language.s,
c.integration.s,
cs.enabled.s,
cs.schedule.s,
cs.itemCount.s,
cs.category.s
cs.categories.s,
cs.listType.s
),
from(c).leftJoin(cs, cs.cid === c.id),
c.id === coll

View File

@ -0,0 +1,60 @@
package docspell.store.records
import cats.data.NonEmptyList
import docspell.common._
import docspell.store.qb.DSL._
import docspell.store.qb._
import doobie._
import doobie.implicits._
case class RItemProposal(
itemId: Ident,
classifyProposals: MetaProposalList,
classifyTags: List[IdRef],
created: Timestamp
)
object RItemProposal {
final case class Table(alias: Option[String]) extends TableDef {
val tableName = "item_proposal"
val itemId = Column[Ident]("itemid", this)
val classifyProposals = Column[MetaProposalList]("classifier_proposals", this)
val classifyTags = Column[List[IdRef]]("classifier_tags", this)
val created = Column[Timestamp]("created", this)
val all = NonEmptyList.of[Column[_]](itemId, classifyProposals, classifyTags, created)
}
val T = Table(None)
def as(alias: String): Table =
Table(Some(alias))
def insert(v: RItemProposal): ConnectionIO[Int] =
DML.insert(
T,
T.all,
fr"${v.itemId},${v.classifyProposals},${v.classifyTags},${v.created}"
)
def update(v: RItemProposal): ConnectionIO[Int] =
DML.update(
T,
T.itemId === v.itemId,
DML.set(
T.classifyProposals.setTo(v.classifyProposals),
T.classifyTags.setTo(v.classifyTags)
)
)
def deleteByItem(itemId: Ident): ConnectionIO[Int] =
DML.delete(T, T.itemId === itemId)
def exists(itemId: Ident): ConnectionIO[Boolean] =
Select(select(countAll), from(T), T.itemId === itemId).build
.query[Int]
.unique
.map(_ > 0)
}

View File

@ -148,6 +148,13 @@ object RTag {
).orderBy(T.name.asc).build.query[RTag].to[List]
}
def listCategories(coll: Ident): ConnectionIO[List[String]] =
Select(
T.category.s,
from(T),
T.cid === coll && T.category.isNotNull
).distinct.build.query[String].to[List]
def delete(tagId: Ident, coll: Ident): ConnectionIO[Int] =
DML.delete(T, T.tid === tagId && T.cid === coll)
}

View File

@ -11,35 +11,38 @@ import Api
import Api.Model.ClassifierSetting exposing (ClassifierSetting)
import Api.Model.TagList exposing (TagList)
import Comp.CalEventInput
import Comp.Dropdown
import Comp.FixedDropdown
import Comp.IntField
import Data.CalEvent exposing (CalEvent)
import Data.Flags exposing (Flags)
import Data.ListType exposing (ListType)
import Data.UiSettings exposing (UiSettings)
import Data.Validated exposing (Validated(..))
import Html exposing (..)
import Html.Attributes exposing (..)
import Html.Events exposing (onCheck)
import Http
import Markdown
import Util.Tag
type alias Model =
{ enabled : Bool
, categoryModel : Comp.FixedDropdown.Model String
, category : Maybe String
, scheduleModel : Comp.CalEventInput.Model
{ scheduleModel : Comp.CalEventInput.Model
, schedule : Validated CalEvent
, itemCountModel : Comp.IntField.Model
, itemCount : Maybe Int
, categoryListModel : Comp.Dropdown.Model String
, categoryListType : ListType
, categoryListTypeModel : Comp.FixedDropdown.Model ListType
}
type Msg
= GetTagsResp (Result Http.Error TagList)
| ScheduleMsg Comp.CalEventInput.Msg
| ToggleEnabled
| CategoryMsg (Comp.FixedDropdown.Msg String)
= ScheduleMsg Comp.CalEventInput.Msg
| ItemCountMsg Comp.IntField.Msg
| GetTagsResp (Result Http.Error TagList)
| CategoryListMsg (Comp.Dropdown.Msg String)
| CategoryListTypeMsg (Comp.FixedDropdown.Msg ListType)
init : Flags -> ClassifierSetting -> ( Model, Cmd Msg )
@ -52,13 +55,36 @@ init flags sett =
( cem, cec ) =
Comp.CalEventInput.init flags newSchedule
in
( { enabled = sett.enabled
, categoryModel = Comp.FixedDropdown.initString []
, category = sett.category
, scheduleModel = cem
( { scheduleModel = cem
, schedule = Data.Validated.Unknown newSchedule
, itemCountModel = Comp.IntField.init (Just 0) Nothing True "Item Count"
, itemCount = Just sett.itemCount
, categoryListModel =
let
mkOption s =
{ value = s, text = s, additional = "" }
minit =
Comp.Dropdown.makeModel
{ multiple = True
, searchable = \n -> n > 0
, makeOption = mkOption
, labelColor = \_ -> \_ -> "grey "
, placeholder = "Choose categories "
}
lm =
Comp.Dropdown.SetSelection sett.categoryList
( m_, _ ) =
Comp.Dropdown.update lm minit
in
m_
, categoryListType =
Data.ListType.fromString sett.listType
|> Maybe.withDefault Data.ListType.Whitelist
, categoryListTypeModel =
Comp.FixedDropdown.initMap Data.ListType.label Data.ListType.all
}
, Cmd.batch
[ Api.getTags flags "" GetTagsResp
@ -71,11 +97,11 @@ getSettings : Model -> Validated ClassifierSetting
getSettings model =
Data.Validated.map
(\sch ->
{ enabled = model.enabled
, category = model.category
, schedule =
{ schedule =
Data.CalEvent.makeEvent sch
, itemCount = Maybe.withDefault 0 model.itemCount
, listType = Data.ListType.toString model.categoryListType
, categoryList = Comp.Dropdown.getSelected model.categoryListModel
}
)
model.schedule
@ -89,18 +115,11 @@ update flags msg model =
categories =
Util.Tag.getCategories tl.items
|> List.sort
in
( { model
| categoryModel = Comp.FixedDropdown.initString categories
, category =
if model.category == Nothing then
List.head categories
else
model.category
}
, Cmd.none
)
lm =
Comp.Dropdown.SetOptions categories
in
update flags (CategoryListMsg lm) model
GetTagsResp (Err _) ->
( model, Cmd.none )
@ -121,28 +140,6 @@ update flags msg model =
, Cmd.map ScheduleMsg cc
)
ToggleEnabled ->
( { model | enabled = not model.enabled }
, Cmd.none
)
CategoryMsg lmsg ->
let
( mm, ma ) =
Comp.FixedDropdown.update lmsg model.categoryModel
in
( { model
| categoryModel = mm
, category =
if ma == Nothing then
model.category
else
ma
}
, Cmd.none
)
ItemCountMsg lmsg ->
let
( im, iv ) =
@ -155,39 +152,68 @@ update flags msg model =
, Cmd.none
)
CategoryListMsg lm ->
let
( m_, cmd_ ) =
Comp.Dropdown.update lm model.categoryListModel
in
( { model | categoryListModel = m_ }
, Cmd.map CategoryListMsg cmd_
)
view : Model -> Html Msg
view model =
CategoryListTypeMsg lm ->
let
( m_, sel ) =
Comp.FixedDropdown.update lm model.categoryListTypeModel
newListType =
Maybe.withDefault model.categoryListType sel
in
( { model
| categoryListTypeModel = m_
, categoryListType = newListType
}
, Cmd.none
)
view : UiSettings -> Model -> Html Msg
view settings model =
let
catListTypeItem =
Comp.FixedDropdown.Item
model.categoryListType
(Data.ListType.label model.categoryListType)
in
div []
[ div
[ class "field"
]
[ div [ class "ui checkbox" ]
[ input
[ type_ "checkbox"
, onCheck (\_ -> ToggleEnabled)
, checked model.enabled
]
[]
, label [] [ text "Enable classification" ]
, span [ class "small-info" ]
[ text "Disable document classification if not needed."
]
]
]
, div [ class "ui basic segment" ]
[ text "Document classification tries to predict a tag for new incoming documents. This "
, text "works by learning from existing documents in order to find common patterns within "
, text "the text. The more documents you have correctly tagged, the better. Learning is done "
, text "periodically based on a schedule and you need to specify a tag-group that should "
, text "be used for learning."
[ Markdown.toHtml [ class "ui basic segment" ]
"""
Auto-tagging works by learning from existing documents. The more
documents you have correctly tagged, the better. Learning is done
periodically based on a schedule. You can specify tag-groups that
should either be used (whitelist) or not used (blacklist) for
learning.
Use an empty whitelist to disable auto-tagging.
"""
, div [ class "field" ]
[ label [] [ text "Is the following a blacklist or whitelist?" ]
, Html.map CategoryListTypeMsg
(Comp.FixedDropdown.view (Just catListTypeItem) model.categoryListTypeModel)
]
, div [ class "field" ]
[ label [] [ text "Category" ]
, Html.map CategoryMsg
(Comp.FixedDropdown.viewString model.category
model.categoryModel
)
[ label []
[ case model.categoryListType of
Data.ListType.Whitelist ->
text "Include tag categories for learning"
Data.ListType.Blacklist ->
text "Exclude tag categories from learning"
]
, Html.map CategoryListMsg
(Comp.Dropdown.view settings model.categoryListModel)
]
, Html.map ItemCountMsg
(Comp.IntField.viewWithInfo

View File

@ -280,7 +280,7 @@ view flags settings model =
, ( "invisible hidden", not flags.config.showClassificationSettings )
]
]
[ text "Document Classifier"
[ text "Auto-Tagging"
]
, div
[ classList
@ -289,13 +289,10 @@ view flags settings model =
]
]
[ Html.map ClassifierSettingMsg
(Comp.ClassifierSettingsForm.view model.classifierModel)
(Comp.ClassifierSettingsForm.view settings model.classifierModel)
, div [ class "ui vertical segment" ]
[ button
[ classList
[ ( "ui small secondary basic button", True )
, ( "disabled", not model.classifierModel.enabled )
]
[ class "ui small secondary basic button"
, title "Starts a task to train a classifier"
, onClick StartClassifierTask
]

View File

@ -958,7 +958,6 @@ renderSuggestions model mkName idnames tagger =
]
, div [ class "menu" ] <|
(idnames
|> List.take 5
|> List.map (\p -> a [ class "item", href "#", onClick (tagger p) ] [ text (mkName p) ])
)
]
@ -969,7 +968,7 @@ renderOrgSuggestions : Model -> Html Msg
renderOrgSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.corrOrg)
(List.take 6 model.itemProposals.corrOrg)
SetCorrOrgSuggestion
@ -977,7 +976,7 @@ renderCorrPersonSuggestions : Model -> Html Msg
renderCorrPersonSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.corrPerson)
(List.take 6 model.itemProposals.corrPerson)
SetCorrPersonSuggestion
@ -985,7 +984,7 @@ renderConcPersonSuggestions : Model -> Html Msg
renderConcPersonSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.concPerson)
(List.take 6 model.itemProposals.concPerson)
SetConcPersonSuggestion
@ -993,7 +992,7 @@ renderConcEquipSuggestions : Model -> Html Msg
renderConcEquipSuggestions model =
renderSuggestions model
.name
(List.take 5 model.itemProposals.concEquipment)
(List.take 6 model.itemProposals.concEquipment)
SetConcEquipSuggestion
@ -1001,7 +1000,7 @@ renderItemDateSuggestions : Model -> Html Msg
renderItemDateSuggestions model =
renderSuggestions model
Util.Time.formatDate
(List.take 5 model.itemProposals.itemDate)
(List.take 6 model.itemProposals.itemDate)
SetItemDateSuggestion
@ -1009,7 +1008,7 @@ renderDueDateSuggestions : Model -> Html Msg
renderDueDateSuggestions model =
renderSuggestions model
Util.Time.formatDate
(List.take 5 model.itemProposals.dueDate)
(List.take 6 model.itemProposals.dueDate)
SetDueDateSuggestion

View File

@ -11,6 +11,17 @@ type Language
= German
| English
| French
| Italian
| Spanish
| Portuguese
| Czech
| Danish
| Finnish
| Norwegian
| Swedish
| Russian
| Romanian
| Dutch
fromString : String -> Maybe Language
@ -24,6 +35,39 @@ fromString str =
else if str == "fra" || str == "fr" || str == "french" then
Just French
else if str == "ita" || str == "it" || str == "italian" then
Just Italian
else if str == "spa" || str == "es" || str == "spanish" then
Just Spanish
else if str == "por" || str == "pt" || str == "portuguese" then
Just Portuguese
else if str == "ces" || str == "cs" || str == "czech" then
Just Czech
else if str == "dan" || str == "da" || str == "danish" then
Just Danish
else if str == "nld" || str == "nd" || str == "dutch" then
Just Dutch
else if str == "fin" || str == "fi" || str == "finnish" then
Just Finnish
else if str == "nor" || str == "no" || str == "norwegian" then
Just Norwegian
else if str == "swe" || str == "sv" || str == "swedish" then
Just Swedish
else if str == "rus" || str == "ru" || str == "russian" then
Just Russian
else if str == "ron" || str == "ro" || str == "romanian" then
Just Romanian
else
Nothing
@ -40,6 +84,39 @@ toIso3 lang =
French ->
"fra"
Italian ->
"ita"
Spanish ->
"spa"
Portuguese ->
"por"
Czech ->
"ces"
Danish ->
"dan"
Finnish ->
"fin"
Norwegian ->
"nor"
Swedish ->
"swe"
Russian ->
"rus"
Romanian ->
"ron"
Dutch ->
"nld"
toName : Language -> String
toName lang =
@ -53,7 +130,54 @@ toName lang =
French ->
"French"
Italian ->
"Italian"
Spanish ->
"Spanish"
Portuguese ->
"Portuguese"
Czech ->
"Czech"
Danish ->
"Danish"
Finnish ->
"Finnish"
Norwegian ->
"Norwegian"
Swedish ->
"Swedish"
Russian ->
"Russian"
Romanian ->
"Romanian"
Dutch ->
"Dutch"
all : List Language
all =
[ German, English, French ]
[ German
, English
, French
, Italian
, Spanish
, Portuguese
, Czech
, Dutch
, Danish
, Finnish
, Norwegian
, Swedish
, Russian
, Romanian
]

View File

@ -0,0 +1,50 @@
module Data.ListType exposing
( ListType(..)
, all
, fromString
, label
, toString
)
type ListType
= Blacklist
| Whitelist
all : List ListType
all =
[ Blacklist, Whitelist ]
toString : ListType -> String
toString lt =
case lt of
Blacklist ->
"blacklist"
Whitelist ->
"whitelist"
label : ListType -> String
label lt =
case lt of
Blacklist ->
"Blacklist"
Whitelist ->
"Whitelist"
fromString : String -> Maybe ListType
fromString str =
case String.toLower str of
"blacklist" ->
Just Blacklist
"whitelist" ->
Just Whitelist
_ ->
Nothing

View File

@ -98,9 +98,13 @@ let
};
text-analysis = {
max-length = 10000;
regex-ner = {
enabled = true;
file-cache-time = "1 minute";
nlp = {
mode = "full";
clear-interval = "15 minutes";
regex-ner = {
max-entries = 1000;
file-cache-time = "1 minute";
};
};
classification = {
enabled = true;
@ -118,7 +122,6 @@ let
];
};
working-dir = "/tmp/docspell-analysis";
clear-stanford-nlp-interval = "15 minutes";
};
processing = {
max-due-date-years = 10;
@ -772,47 +775,96 @@ in {
files.
'';
};
clear-stanford-nlp-interval = mkOption {
type = types.str;
default = defaults.text-analysis.clear-stanford-nlp-interval;
description = ''
Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled.
'';
};
regex-ner = mkOption {
nlp = mkOption {
type = types.submodule({
options = {
enabled = mkOption {
type = types.bool;
default = defaults.text-analysis.regex-ner.enabled;
mode = mkOption {
type = types.str;
default = defaults.text-analysis.nlp.mode;
description = ''
Whether to enable custom NER annotation. This uses the address
book of a collective as input for NER tagging (to automatically
find correspondent and concerned entities). If the address book
is large, this can be quite memory intensive and also makes text
analysis slower. But it greatly improves accuracy. If this is
false, NER tagging uses only statistical models (that also work
quite well).
The mode for configuring NLP models:
This setting might be moved to the collective settings in the
future.
1. full - builds the complete pipeline
2. basic - builds only the ner annotator
3. regexonly - matches each entry in your address book via regexps
4. disabled - doesn't use any stanford-nlp feature
The full and basic variants rely on pre-built language models
that are available for only 3 languages at the moment: German,
English and French.
Memory usage varies greatly among the languages. German has
quite large models that require about 1G of heap. So joex should
run with -Xmx1400M at least when using mode=full.
The basic variant does quite a good job for German and
English. It might be worse for French, always depending on the
type of text that is analysed. Joex should run with about 600M
heap; here again German uses the most.
The regexonly variant doesn't depend on a language. It roughly
works by converting all entries in your address book into
regexps and matching each one against the text. This can get
memory intensive, too, when the address book grows large. This
is included in full and basic by default, but can be used
independently by setting mode=regexonly.
When mode=disabled, the whole NLP pipeline is disabled and you
won't get any suggestions beyond what the classifier returns
(if enabled).
'';
};
file-cache-time = mkOption {
clear-interval = mkOption {
type = types.str;
default = defaults.text-analysis.ner-file-cache-time;
default = defaults.text-analysis.nlp.clear-interval;
description = ''
The NER annotation uses a file of patterns that is derived from
a collective's address book. This is how long this
file will be kept until a check for a state change is done.
Idle time after which the NLP caches are cleared to free
memory. If <= 0 clearing the cache is disabled.
'';
};
regex-ner = mkOption {
type = types.submodule({
options = {
enabled = mkOption {
type = types.int;
default = defaults.text-analysis.regex-ner.max-entries;
description = ''
Whether to enable custom NER annotation. This uses the
address book of a collective as input for NER tagging (to
automatically find correspondent and concerned entities). If
the address book is large, this can be quite memory
intensive and also makes text analysis much slower. But it
improves accuracy and can be used independent of the
language. If this is set to 0, it is effectively disabled
and NER tagging uses only statistical models (that also work
quite well, but are restricted to the languages mentioned
above).
Note, this is only relevant if nlp-config.mode is not
"disabled".
'';
};
file-cache-time = mkOption {
type = types.str;
default = defaults.text-analysis.ner-file-cache-time;
description = ''
The NER annotation uses a file of patterns that is derived from
a collective's address book. This is how long this
file will be kept until a check for a state change is done.
'';
};
};
});
default = defaults.text-analysis.nlp.regex-ner;
description = "";
};
};
});
default = defaults.text-analysis.regex-ner;
description = "";
default = defaults.text-analysis.nlp;
description = "Configure NLP";
};
classification = mkOption {

View File

@ -20,6 +20,9 @@ The configuration of both components uses separate namespaces. The
configuration for the REST server is below `docspell.server`, while
the one for joex is below `docspell.joex`.
You can therefore use two separate config files or one single file
containing both namespaces.
## JDBC
This configures the connection to the database. This has to be
@ -281,6 +284,70 @@ just some minutes, the web application obtains new ones
periodically. So a short time is recommended.
## File Processing
Files are processed by the joex component, so all of the
respective configuration is in the joex config only.
File processing involves several stages; detailed information can be
found [here](@/docs/joex/file-processing.md#text-analysis) and in the
corresponding sections of the [joex default config](#joex).
The configuration allows defining the external tools and setting some
limits to control memory usage. The sections are:
- `docspell.joex.extraction`
- `docspell.joex.text-analysis`
- `docspell.joex.convert`
Options to external commands can use variables that are replaced by
values at runtime. Variables are enclosed in double braces `{{…}}`.
Please see the default configuration for what variables exist per
command.
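
To make the substitution concrete, here is a minimal sketch of how such
placeholders could be replaced. It is illustrative only: the variable
names `infile` and `outfile` are made up for this example, they are not
necessarily what a given command supports; the real names per command
are listed in the default configuration.

```scala
// Minimal sketch of {{…}} placeholder substitution; not the actual
// docspell code. Variable names used here are hypothetical examples.
object CommandVars {
  def replace(template: String, vars: Map[String, String]): String =
    vars.foldLeft(template) { case (cmd, (name, value)) =>
      cmd.replace(s"{{$name}}", value)
    }
}

// CommandVars.replace("convert {{infile}} {{outfile}}",
//   Map("infile" -> "in.png", "outfile" -> "out.pdf"))
// yields "convert in.png out.pdf"
```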
### Classification
In `text-analysis.classification` you can define how many documents at
most should be used for learning. The default settings should work
well for most cases. However, it always depends on the amount of data
and the machine that runs joex. For example, by default the documents
to learn from are limited to 600 (`classification.item-count`) and
every text is cut after 8000 characters (`text-analysis.max-length`).
This is fine if *most* of your documents are small and only a few are
near 8000 characters. But if *all* your documents are very large, you
probably need to either assign more heap memory or lower the
limits.
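
As a rough sketch of how these two limits interact (the type and method
names below are made up for illustration and are not the actual
docspell internals): only the newest `item-count` items are considered,
and each text is cut to `max-length` characters before it is handed to
the classifier.

```scala
// Illustrative sketch only, not the real implementation.
object ClassifyLimitsSketch {
  final case class ClassifyLimits(itemCount: Int, maxLength: Int)

  def selectTrainingTexts(
      textsNewestFirst: LazyList[String],
      limits: ClassifyLimits
  ): List[String] =
    textsNewestFirst
      .take(limits.itemCount)        // at most `item-count` items are used
      .map(_.take(limits.maxLength)) // each text is cut after `max-length` chars
      .toList
}
```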
Classification can also be disabled if it is not needed.
### NLP
This setting defines which NLP mode to use. It defaults to `full`,
which requires more memory for certain languages (with the advantage
of better results). Other values are `basic`, `regexonly` and
`disabled`. The modes `full` and `basic` use pre-defined language
models for processing documents in German, English and French.
These require some amount of memory (see below).
The mode `basic` is a "light" variant of `full`. It doesn't use
all NLP features, which makes memory consumption much lower, but it
comes at the cost of less accurate results.
The mode `regexonly` doesn't use pre-defined language models, even if
available. It checks your address book against a document to find
metadata. That means it is language independent. Also, when using
`full` or `basic` with languages where no pre-defined models exist, it
will degrade to `regexonly` for these.
The mode `disabled` skips NLP processing completely. This has the
lowest impact on memory consumption, obviously, but then only the
classifier is used to find metadata.
You might want to try different modes and see what combination best
suits your usage pattern and the machine running joex. If a powerful
machine is used, simply leave the defaults. When running on an older
Raspberry Pi, for example, you might need to adjust things.
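
The fallback behaviour described above can be pictured in a small
sketch. This is simplified and not the actual docspell implementation;
the mode names mirror the documented values, and the ISO-3 language
codes are only meant to stand for the three languages mentioned above.

```scala
// Simplified sketch of the mode/language fallback, not the real code.
object NlpModeSketch {
  sealed trait NlpMode
  case object Full      extends NlpMode
  case object Basic     extends NlpMode
  case object RegexOnly extends NlpMode
  case object Disabled  extends NlpMode

  // languages with pre-defined models, per the text above (illustrative codes)
  val modelLanguages: Set[String] = Set("deu", "eng", "fra")

  def effectiveMode(configured: NlpMode, language: String): NlpMode =
    configured match {
      case Full | Basic if !modelLanguages.contains(language) =>
        RegexOnly // degrade when no pre-defined model exists for the language
      case other =>
        other
    }
}
```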
# File Format
The format of the configuration files can be

Some files were not shown because too many files have changed in this diff.