Eike Kettner
6a63694a3e
Convert unit tests to munit
2021-03-10 19:48:56 +01:00
Eike Kettner
9991ad5fcc
Add Latvian language
2021-03-09 00:23:17 +01:00
Eike Kettner
e6d9ce2c37
Remove obsolete type capabilities
...
These are now detected by the new scala compiler and lead to compile
errors.
2021-03-01 00:16:30 +01:00
Eike Kettner
c7d4c77e6d
Allow more suggestions for date variants in English
2021-02-26 00:35:17 +01:00
Eike Kettner
c7e850116f
Make the text length limit optional
2021-01-22 23:06:50 +01:00
Eike Kettner
249f9e6e2a
Extend guessing tags to all tag categories
2021-01-18 21:51:45 +01:00
Eike Kettner
3f75af0807
Add 9 more languages to the list of document languages
2021-01-18 17:41:40 +01:00
Eike Kettner
26dff18ae0
Add Spanish as an example
...
Adding a new language without nlp now only requires filling in these
pieces:
- define a list of month names to support date recognition
- add it to joex' dockerfile to be available for tesseract
- update the solr migration/field definitions
- update the elm file so it shows up on the client
2021-01-18 17:41:40 +01:00
Eike Kettner
ff121d462c
Disable memory intensive tests on travis
2021-01-18 17:41:40 +01:00
Eike Kettner
f01646aeb5
Reorganize nlp pipeline and add nlp-unsupported language Italian
...
Improves and reorganizes how nlp pipelines are set up. Now users can
choose from many options, depending on their hardware and usage
scenario.
This is the basis for using more languages without depending on what
stanford-nlp supports. Support then involves text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
Eike Kettner
aa937797be
Choose nlp mode in config file
2021-01-17 22:56:33 +01:00
Eike Kettner
54a09861c4
Use model cache with basic annotator
2021-01-17 22:56:33 +01:00
Eike Kettner
a77f67d73a
Make pipeline cache generic to be used with BasicCRFAnnotator
2021-01-17 22:56:33 +01:00
Eike Kettner
4462ebae0f
Resurrect the basic ner classifier
2021-01-17 22:56:33 +01:00
Eike Kettner
a699e87304
Separate ner from classification
2021-01-17 22:56:33 +01:00
Eike Kettner
f02f15e5bd
Move blocker into constructor of text analyser
2021-01-17 22:56:33 +01:00
Eike Kettner
b2b8ad625a
scalafmt
2021-01-17 20:11:58 +01:00
Eike Kettner
75986c461f
Fix ner date label boundary reporting
2021-01-10 09:10:39 +01:00
Eike Kettner
fb05e997ab
Provide multiple date suggestions for English
...
Issue: #561
2021-01-10 09:02:26 +01:00
Eike Kettner
716252721c
Fix cache clearing
...
It must be cancelled when obtaining a pipeline.
2021-01-07 23:31:01 +01:00
Eike Kettner
a670bbb6c2
Make idle interval when clearing nlp cache configurable
2021-01-06 23:03:00 +01:00
Eike Kettner
73a9572835
Poc for clearing stanford pipeline after some idle time
2021-01-05 23:56:20 +01:00
Tammo van Lessen
e9347176bd
Fixes an off-by-one classic to also accept dates in January
2020-11-28 00:43:35 +01:00
Eike Kettner
cf6e63785d
Fix potential index-out-of-bounds error in classifier
...
The stanford library expects a non-empty text.
2020-11-09 00:04:51 +01:00
Eike Kettner
3f697f51aa
Autoformat
2020-10-06 23:31:09 +02:00
Eike Kettner
53c8d3031d
Skip invalid dates found in texts
...
Fixes: #298
2020-10-02 22:37:15 +02:00
Eike Kettner
c658677032
Autoformat
2020-09-09 00:29:32 +02:00
Eike Kettner
97757876d5
Fix formatting
2020-09-08 00:47:42 +02:00
Eike Kettner
c9bd57592b
Don't use test data if there is just one config
...
If classifier models cannot be compared, there is no reason to test.
2020-09-07 20:02:50 +02:00
Eike Kettner
316b490008
Implement learning a text classifier from collective data
2020-09-02 18:28:14 +02:00
Eike Kettner
0c97b4ef76
Initial impl of a text classifier based on stanford-nlp
2020-09-02 18:28:14 +02:00
Eike Kettner
96d2f948f2
Use collective's addressbook to configure regexner
2020-08-24 14:40:52 +02:00
Eike Kettner
8628a0a8b3
Allow configuring stanford-ner and cache based on collective
2020-08-24 10:55:59 +02:00
Eike Kettner
fdb46da26d
Add French language and upgrade stanford-nlp to 4.0.0
2020-08-23 17:48:42 +02:00
Eike Kettner
347a029af8
Scalafix organize-imports
2020-06-28 21:20:47 +02:00
Eike Kettner
897d91475e
Update scalafmt-core to 2.6.0
2020-06-17 19:53:56 +02:00
Eike Kettner
075b665c68
Add some more tlds to look for
2020-05-24 11:48:49 +02:00
Eike Kettner
5e6ce1737c
Change recognizing dates with short years
...
Short years are now added to the current century (2000) such that date
strings like 12/26/11 result in 12/26/2011 and not 12/26/1911.
2020-05-17 11:58:51 +02:00
Eike Kettner
c41cdeefec
Update scalafmt to 2.5.1 + scalafmtAll
2020-05-04 23:53:57 +02:00
Eike Kettner
6a1297fc95
Add a limit for text analysis
2020-03-27 22:54:49 +01:00
Eike Kettner
9656ba62f4
scalafmtAll
2020-03-26 18:26:00 +01:00
Eike Kettner
2f87065b2e
sbt scalafmtAll
2020-02-25 20:55:00 +01:00
Eike Kettner
8143a4edcc
Adding extraction primitives
2020-02-16 21:37:26 +01:00
Eike Kettner
851ee7ef0f
Reorganize processing code
...
Use separate modules for
- text extraction
- conversion to pdf
- text analysis
2020-02-15 21:25:25 +01:00