46 Commits

SHA1 Message Date
e1bbc2edf5 Apply autoformat 2021-04-10 16:31:58 +02:00
144ea852bf Update fs2-core, fs2-io to 2.5.4 2021-03-31 21:10:42 +02:00
6a63694a3e Convert unit tests to munit 2021-03-10 19:48:56 +01:00
9991ad5fcc Add Latvian language 2021-03-09 00:23:17 +01:00
e6d9ce2c37 Remove obsolete type capabilities
These are now detected by the new Scala compiler and lead to compile
errors.
2021-03-01 00:16:30 +01:00
c7d4c77e6d Allow more suggestions for date variants in English 2021-02-26 00:35:17 +01:00
c7e850116f Make the text length limit optional 2021-01-22 23:06:50 +01:00
249f9e6e2a Extend guessing tags to all tag categories 2021-01-18 21:51:45 +01:00
3f75af0807 Add 9 more languages to the list of document languages 2021-01-18 17:41:40 +01:00
26dff18ae0 Add Spanish as an example
Adding a new language without nlp support now only requires filling in
these pieces:

- define a list of month names to support date recognition
- add it to joex' dockerfile to be available for tesseract
- update the solr migration/field definitions
- update the elm file so it shows up on the client
2021-01-18 17:41:40 +01:00
ff121d462c Disable memory intensive tests on travis 2021-01-18 17:41:40 +01:00
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language Italian
Improves and reorganizes how nlp pipelines are set up. Users can now
choose from many options, depending on their hardware and usage
scenario.

This is the basis for supporting more languages without depending on what
stanford-nlp supports. Support for such languages then involves only text
extraction and simple regex-ner processing.
2021-01-18 17:41:40 +01:00
aa937797be Choose nlp mode in config file 2021-01-17 22:56:33 +01:00
54a09861c4 Use model cache with basic annotator 2021-01-17 22:56:33 +01:00
a77f67d73a Make pipeline cache generic to be used with BasicCRFAnnotator 2021-01-17 22:56:33 +01:00
4462ebae0f Resurrect the basic ner classifier 2021-01-17 22:56:33 +01:00
a699e87304 Separate ner from classification 2021-01-17 22:56:33 +01:00
f02f15e5bd Move blocker into constructor of text analyser 2021-01-17 22:56:33 +01:00
b2b8ad625a scalafmt 2021-01-17 20:11:58 +01:00
75986c461f Fix ner date label boundary reporting 2021-01-10 09:10:39 +01:00
fb05e997ab Provide multiple date suggestions for English
Issue: #561
2021-01-10 09:02:26 +01:00
716252721c Fix cache clearing
It must be cancelled when obtaining a pipeline.
2021-01-07 23:31:01 +01:00
a670bbb6c2 Make idle interval when clearing nlp cache configurable 2021-01-06 23:03:00 +01:00
73a9572835 Poc for clearing stanford pipeline after some idle time 2021-01-05 23:56:20 +01:00
e9347176bd Fix a classic off-by-one error to also accept dates in January 2020-11-28 00:43:35 +01:00
cf6e63785d Fix potential index-out-of-bounds error in classifier
The stanford library expects a non-empty text.
2020-11-09 00:04:51 +01:00
3f697f51aa Autoformat 2020-10-06 23:31:09 +02:00
53c8d3031d Skip invalid dates found in texts
Fixes: #298
2020-10-02 22:37:15 +02:00
c658677032 Autoformat 2020-09-09 00:29:32 +02:00
97757876d5 Fix formatting 2020-09-08 00:47:42 +02:00
c9bd57592b Don't use test data if there is just one config
If classifier models cannot be compared, there is no reason to test.
2020-09-07 20:02:50 +02:00
316b490008 Implement learning a text classifier from collective data 2020-09-02 18:28:14 +02:00
0c97b4ef76 Initial impl of a text classifier based on stanford-nlp 2020-09-02 18:28:14 +02:00
96d2f948f2 Use collective's addressbook to configure regexner 2020-08-24 14:40:52 +02:00
8628a0a8b3 Allow configuring stanford-ner and cache based on collective 2020-08-24 10:55:59 +02:00
fdb46da26d Add French language and upgrade stanford-nlp to 4.0.0 2020-08-23 17:48:42 +02:00
347a029af8 Scalafix organize-imports 2020-06-28 21:20:47 +02:00
897d91475e Update scalafmt-core to 2.6.0 2020-06-17 19:53:56 +02:00
075b665c68 Add some more tlds to look for 2020-05-24 11:48:49 +02:00
5e6ce1737c Change recognizing dates with short years
Short years are now added to the current century (2000) so that date
strings like 12/26/11 result in 12/26/2011 and not 12/26/1911.
2020-05-17 11:58:51 +02:00
c41cdeefec Update scalafmt to 2.5.1 + scalafmtAll 2020-05-04 23:53:57 +02:00
6a1297fc95 Add a limit for text analysis 2020-03-27 22:54:49 +01:00
9656ba62f4 scalafmtAll 2020-03-26 18:26:00 +01:00
2f87065b2e sbt scalafmtAll 2020-02-25 20:55:00 +01:00
8143a4edcc Adding extraction primitives 2020-02-16 21:37:26 +01:00
851ee7ef0f Reorganize processing code
Use separate modules for

- text extraction
- conversion to pdf
- text analysis
2020-02-15 21:25:25 +01:00
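
The short-year rule described in commit 5e6ce1737c can be illustrated with a minimal sketch. This is not the repository's code: the object name `ShortYear` and the helpers `expandYear` and `parseUsDate` are hypothetical, and the sketch assumes US-style `MM/dd/yy` input and a fixed base of 2000, as the commit message states.

```scala
import java.time.LocalDate
import scala.util.Try

// Hypothetical illustration only; names are not taken from the repository.
object ShortYear {

  // Two-digit years are added to the current century (2000);
  // any other value is returned unchanged.
  def expandYear(year: Int): Int =
    if (year >= 0 && year < 100) 2000 + year else year

  // Parses "MM/dd/yy" or "MM/dd/yyyy". Returns None for malformed
  // input or impossible dates instead of throwing.
  def parseUsDate(s: String): Option[LocalDate] =
    s.split('/') match {
      case Array(m, d, y) =>
        Try(LocalDate.of(expandYear(y.toInt), m.toInt, d.toInt)).toOption
      case _ =>
        None
    }
}
```

With this sketch, `ShortYear.parseUsDate("12/26/11")` yields the date 2011-12-26, while an impossible date such as `13/40/11` yields `None`, which mirrors the "skip invalid dates" behaviour mentioned in commit 53c8d3031d.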