Add constraints from config to classifier training

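Training the classifier on large or numerous documents can cause out-of-memory (OOM) errors. The defaults in reference.conf now impose limits: max-length drops from 10000 to 8000, and item-count changes from 0 (train on all items) to 600.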
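
As a rough illustration of what the two settings mean, here is a minimal Scala sketch. The object and function names (`TrainingLimits`, `limitText`, `limitItems`) are hypothetical and do not come from the Docspell code base. The only behaviour taken from the config comments is that a zero or negative `item-count` means "train on all items". How a non-positive `max-length` is handled is not stated in this excerpt, so the sketch simply rejects it.

```scala
// Hypothetical helpers for illustration only; not the actual Docspell implementation.
object TrainingLimits {

  /** Keep only the first `maxLength` characters of a document's text.
    * Assumption: `maxLength` is positive. The excerpt does not say how
    * other values are treated, so this sketch refuses them.
    */
  def limitText(text: String, maxLength: Int): String = {
    require(maxLength > 0, "maxLength must be positive in this sketch")
    text.take(maxLength)
  }

  /** Restrict the number of items used for training.
    * A negative value or zero means "use all items", as the config comment states.
    */
  def limitItems[A](items: Seq[A], itemCount: Int): Seq[A] =
    if (itemCount <= 0) items else items.take(itemCount)
}
```

With the new defaults, a call such as `TrainingLimits.limitItems(allItems, 600)` would keep at most 600 items, and `TrainingLimits.limitText(text, 8000)` would keep at most the first 8000 characters of each document.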
2025-06-22 02:18:26 +00:00 · 2021-01-21 17:46:39 +01:00
parent 363cf5aef0
commit 9957c3267e
7 changed files with 87 additions and 50 deletions
--- a/modules/joex/src/main/resources/reference.conf
+++ b/modules/joex/src/main/resources/reference.conf
@@ -269,9 +269,9 @@ docspell.joex {
    # All text to analyse must fit into RAM. A large document may take
    # too much heap. Also, most important information is at the
    # beginning of a document, so in most cases the first two pages
-    # should suffice. Default is 10000, which are about 2-3 pages
-    # (just a rough guess, of course).
-    max-length = 10000
+    # should suffice. Default is 8000, which are about 2-3 pages (just
+    # a rough guess, of course).
+    max-length = 8000

    # A working directory for the analyser to store temporary/working
    # files.
@@ -363,7 +363,7 @@ docspell.joex {
      # If concerned with memory consumption, this restricts the
      # number of items to consider. More are better for training. A
      # negative value or zero means to train on all items.
-      item-count = 0
+      item-count = 600

      # These settings are used to configure the classifier. If
      # multiple are given, they are all tried and the "best" is