Commit Graph

38 Commits

Author SHA1 Message Date
61d5585e68 Add Ukrainian language 2022-11-09 22:24:32 +01:00
c0feb13f63 Add Estonian language
Closes: #1646
2022-11-01 01:00:16 +01:00
5ec311c331 Add Polish to processing languages
SOLR doesn't support Polish out of the box; plugins are required for
it. The language has been added with basic support only. For better
results, a manual setup of SOLR is required.

Closes: #1345
2022-05-21 14:41:16 +02:00
9d69401fea Add Lithuanian to processing languages
SOLR doesn't support Lithuanian; it can perhaps be added via plugins,
which would then require a manual setup of SOLR. The language has been
added with basic support.

Closes: #1540
2022-05-21 14:36:01 +02:00
7fdd78ad06 Experiment with addons
Addons allow executing external programs in some context inside
docspell. Currently it is possible to run them after processing files.
Addons are provided as URLs to zip files.
2022-05-15 23:46:43 +02:00
9eb9497675 Fix logging in tests 2022-02-19 23:33:01 +01:00
e483a97de7 Adapt to new logging API 2022-02-19 21:41:38 +01:00
501c6f2988 Updating stanford corenlp to 4.3.2; adding more languages
Models for Spanish are available and have now been added. Also, the
Hungarian language has been added to the list of supported
languages (mainly for tesseract; there are no nlp models).
2021-11-20 14:31:39 +01:00
9013f2de5b Update scalafmt settings 2021-09-22 17:23:24 +02:00
9785db0683 Change license header of all files 2021-09-21 22:35:38 +02:00
e4fecefaea Reformat with scalafmt 3.0.0 2021-08-19 08:50:30 +02:00
1901fe1a8c Adapt to deprecated APIs from fs2; use fs2.Path 2021-08-07 17:51:56 +02:00
4af8dd0950 Preprocess Japanese texts to find dates
Not very efficient, but should work to find the position of dates in
Japanese text.
2021-07-29 01:35:15 +02:00
e8348e2809 Remove excessive spaces 2021-07-29 02:08:48 +03:00
1095a7d56f Add another Japanese test 2021-07-29 01:13:22 +03:00
119a4ffdc9 Update Japanese tests with more sensible data 2021-07-29 01:08:48 +03:00
f994d4b248 Add Japanese document language 2021-07-28 20:05:48 +02:00
8e5c88fd32 Add copyright header to source files 2021-07-04 10:57:53 +02:00
bd791b4593 Upgrade code base to CE3 2021-06-22 22:53:34 +02:00
e1bbc2edf5 Apply autoformat 2021-04-10 16:31:58 +02:00
6a63694a3e Convert unit tests to munit 2021-03-10 19:48:56 +01:00
9991ad5fcc Add Latvian language 2021-03-09 00:23:17 +01:00
c7d4c77e6d Allow more suggestions for date variants in English 2021-02-26 00:35:17 +01:00
ff121d462c Disable memory intensive tests on travis 2021-01-18 17:41:40 +01:00
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language Italian
Improves and reorganizes how nlp pipelines are set up. Users can now
choose from many options, depending on their hardware and usage
scenario.

This is the basis for using more languages without depending on what
stanford-nlp supports. Support for such languages then involves text
extraction and simple regex-ner processing.
2021-01-18 17:41:40 +01:00
54a09861c4 Use model cache with basic annotator 2021-01-17 22:56:33 +01:00
4462ebae0f Resurrect the basic ner classifier 2021-01-17 22:56:33 +01:00
a699e87304 Separate ner from classification 2021-01-17 22:56:33 +01:00
75986c461f Fix ner date label boundary reporting 2021-01-10 09:10:39 +01:00
fb05e997ab Provide multiple date suggestions for English
Issue: #561
2021-01-10 09:02:26 +01:00
53c8d3031d Skip invalid dates found in texts
Fixes: #298
2020-10-02 22:37:15 +02:00
c658677032 Autoformat 2020-09-09 00:29:32 +02:00
0c97b4ef76 Initial impl of a text classifier based on stanford-nlp 2020-09-02 18:28:14 +02:00
96d2f948f2 Use collective's addressbook to configure regexner 2020-08-24 14:40:52 +02:00
fdb46da26d Add French language and upgrade stanford-nlp to 4.0.0 2020-08-23 17:48:42 +02:00
9656ba62f4 scalafmtAll 2020-03-26 18:26:00 +01:00
8143a4edcc Adding extraction primitives 2020-02-16 21:37:26 +01:00
851ee7ef0f Reorganize processing code
Use separate modules for

- text extraction
- conversion to pdf
- text analysis
2020-02-15 21:25:25 +01:00