Update documentation

2025-08-01 13:04:52 +00:00 · 2020-09-02 00:18:55 +02:00
parent 4309bd8dfd
commit 145c308461
2 changed files with 21 additions and 6 deletions
--- a/website/elm/Feature.elm
+++ b/website/elm/Feature.elm
@@ -67,7 +67,7 @@ Text is extracted from all files. For scanned documents/images, OCR is used by u
    , { image = "img/analyze-feature.png"
      , header = "Text Analysis"
      , description = """
-The extracted text is analyzed and is used to find properties that can be annotated to your documents automatically.
+The extracted text is analyzed using ML techniques to find properties that can be annotated to your documents automatically.
 """
      }
    , { image = "img/filetype-feature.svg"
--- a/website/site/content/docs/webapp/metadata.md
+++ b/website/site/content/docs/webapp/metadata.md
@@ -33,11 +33,26 @@ workflows, a tag category *state* may exist that includes tags like
 "assignment" semantics. Docspell doesn't propose any workflow, but it
 can help to implement some.

-The tags are *not* taken into account when creating suggestions from
-analyzed text yet. However, PDF files may contain metadata itself and
-if there is a metadata *keywords* list, these keywords are matched
-against the tags in the database. If they match, the item is tagged
-automatically.
+Docspell can try to predict a tag for new incoming documents
+automatically based on your existing data. This requires training an
+algorithm. There are some caveats: the more correctly tagged data you
+have, the better the results. So it won't work well for roughly the
+first 100 documents. The tags must also relate to a pattern in the
+document text. Tags like *todo* or *waiting* probably won't work. But
+typical "document type" tags, like *invoice* and *receipt*, are a
+good fit! That is why you need to provide a tag category, so that
+only sensible tags are learned. The algorithm goes through all your
+items and learns patterns in the text that relate to the given tags.
+This training step can be run periodically, as specified in your
+collective settings, so that Docspell keeps learning from your
+already tagged data! More information about the algorithm can be
+found in the config, where it is possible to fine-tune this
+process.
+
+Items can also be tagged automatically when an input PDF file
+contains a list of keywords in its metadata section (this only
+applies to PDF files). These keywords are then matched against the
+tags in the database. If they match, the item is tagged with them.


 ## Organization and Person