Commit Graph

48 Commits

Author SHA1 Message Date
9d69401fea Add Lithuanian to processing languages
Solr doesn't support Lithuanian out of the box; it can perhaps be added
via plugins, which then requires a manual setup of Solr. The language
has been added with basic support.

Closes: #1540
2022-05-21 14:36:01 +02:00
029335e607 Working PoC of PostgreSQL-based full-text search backend 2022-03-21 11:04:26 +01:00
e483a97de7 Adapt to new logging api 2022-02-19 21:41:38 +01:00
501c6f2988 Updating stanford corenlp to 4.3.2; adding more languages
Models for Spanish are available and have been added now. The
Hungarian language has also been added to the list of supported
languages (mainly for tesseract, no nlp models).
2021-11-20 14:31:39 +01:00
9013f2de5b Update scalafmt settings 2021-09-22 17:23:24 +02:00
9785db0683 Change license header of all files 2021-09-21 22:35:38 +02:00
637f11d0f6 Fix solr setup by adding a text_he field
This field is used for Hebrew language. Solr doesn't support it out of
the box. The new field type is just a very basic field using the
standard tokenizer and lowercase filter. It is very likely not
providing good results. Hebrew is really difficult and it requires at
least installing plugins for solr - this is out of scope for docspell.
Users can set up their solr however they like and run a re-index
afterwards.
2021-08-28 00:10:36 +02:00
589c41003f Add hebrew document language 2021-08-24 01:19:42 +03:00
e4fecefaea Reformat with scalafmt 3.0.0 2021-08-19 08:50:30 +02:00
c59d4f8a6d Add the japanese content field to solr
This is a follow up on #961. It was forgotten when the japanese
language was added.
2021-07-29 22:22:34 +02:00
8e5c88fd32 Add copyright header to source files 2021-07-04 10:57:53 +02:00
bd791b4593 Upgrade code base to CE3 2021-06-22 22:53:34 +02:00
ac7d00c28f Refactor re-index task 2021-06-07 21:17:29 +02:00
5205ee0623 Store solr migration state in a solr document 2021-06-07 17:53:37 +02:00
ebaa31898e Add missing solr migration for new language field 2021-03-12 00:16:00 +01:00
3f75af0807 Add 9 more languages to the list of document languages 2021-01-18 17:41:40 +01:00
94bb18c152 Refactor solr language fields 2021-01-18 17:41:40 +01:00
26dff18ae0 Add spanish as an example
Adding a new language without nlp now only requires filling in these
pieces:

- define a list of month names to support date recognition
- add it to joex' dockerfile to be available for tesseract
- update the solr migration/field definitions
- update the elm file so it shows up on the client
2021-01-18 17:41:40 +01:00
360cad3304 Refactoring solr/fts migration
When re-indexing everything, skip populating the index in intermediate
steps and do it as the very last step.

Parameterize adding new fields by their language.
2021-01-18 17:41:40 +01:00
f01646aeb5 Reorganize nlp pipeline and add nlp-unsupported language italian
Improves and reorganizes how nlp pipelines are set up. Now users can
choose from many options, depending on their hardware and usage
scenario.

This is the base to use more languages without depending on what
stanford-nlp supports. Support then involves text extraction and
simple regex-ner processing.
2021-01-18 17:41:40 +01:00
9c82f186d0 Add missing solr migration for french 2020-09-09 21:39:23 +02:00
fdb46da26d Add french language and upgrade stanford-nlp to 4.0.0 2020-08-23 17:48:42 +02:00
259526a088 Organize imports 2020-07-12 13:51:52 +02:00
22fa1dba13 Apply folder restriction to fulltext only search
And update index when folder changes.
2020-07-12 13:50:45 +02:00
aeba4ba913 Refactor full-text migrations and add folder to solr schema 2020-07-12 13:50:14 +02:00
347a029af8 Scalafix organize-imports 2020-06-28 21:20:47 +02:00
dc8f1a0387 Fix global re-index task to re-create the schema
Otherwise new instances could not be re-indexed.
2020-06-25 23:02:06 +02:00
0ba1736bc8 Remove items/attachments from index on delete 2020-06-25 00:00:10 +02:00
14213c4c27 Allow some solr query options in the config file 2020-06-24 23:37:20 +02:00
532caed84c Consistent logging of request/responses to solr
Using a middleware. Also add missing changesets for mariadb.
2020-06-24 21:25:46 +02:00
47697a8056 Set some logs to trace 2020-06-24 01:16:13 +02:00
7d7460b1c9 Cleanup + hiding false errors from log 2020-06-24 00:23:22 +02:00
d5c9923a6d Add a route that only searches the full-text index
It returns the results in the same order as received from the index to
preserve the relevance ordering.
2020-06-24 00:03:17 +02:00
e06a3f8fdd ScalafmtAll 2020-06-23 00:18:59 +02:00
ffbb16db45 Transport highlighting information to the client 2020-06-23 00:17:29 +02:00
a58ffd11e1 Return attachment-name from index 2020-06-22 21:28:26 +02:00
3d82e03a8a Remove solr query from debug log 2020-06-21 22:29:45 +02:00
cfe5aa8894 Use no-op fts-client if disabled + push this flag to the webui 2020-06-21 21:06:08 +02:00
14ea4091c4 Renaming things 2020-06-21 13:15:02 +02:00
9acea8307d Update full-text index when changing data 2020-06-21 00:33:39 +02:00
383614f908 Allow updating single fields in solr 2020-06-20 23:37:47 +02:00
1f4ff0d4c4 Add language to schema, extend fts-client 2020-06-20 22:44:47 +02:00
3576c45d1a First basic working solr search 2020-06-20 02:18:49 +02:00
2a0bf24088 Setup solr schema and index all data using a system task
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.
2020-06-19 21:37:22 +02:00
1f4220eccb Index existing data in solr 2020-06-19 00:43:35 +02:00
60c079f664 Add task to index current database state 2020-06-18 22:38:45 +02:00
522daaf57e Introducing fts client into codebase 2020-06-17 23:20:46 +02:00
c7f598e3b0 Initial module setup 2020-06-17 23:20:46 +02:00
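
Note on 2a0bf24088: the commit message describes a startup task whose steps (set the schema, then index all data) are memorized so that they are not executed again on subsequent starts. The following is a minimal sketch of that idea in Python, not the project's actual code. The `run_once` helper, the step names, and the in-memory `state` dict are hypothetical stand-ins; a later commit (5205ee0623) says the real migration state is stored in a solr document.

```python
from typing import Callable, Dict, List, Tuple

# A step is a (name, action) pair; the name is the key under which
# completion is recorded. All names here are illustrative assumptions.
Step = Tuple[str, Callable[[], None]]


def run_once(steps: List[Step], done: Dict[str, bool]) -> List[str]:
    """Run every step not yet recorded in `done`, in order.

    Returns the names of the steps that were actually executed.
    """
    executed: List[str] = []
    for name, action in steps:
        if done.get(name):
            continue  # already memorized from an earlier start
        action()
        done[name] = True  # memorize only after the step succeeded
        executed.append(name)
    return executed


# Simulate two application starts sharing the same persisted state.
log: List[str] = []
steps: List[Step] = [
    ("setup-schema", lambda: log.append("schema")),
    ("index-all", lambda: log.append("index")),
]
state: Dict[str, bool] = {}
first = run_once(steps, state)   # both steps run
second = run_once(steps, state)  # nothing runs again
```

Recording a step only after its action returns means a step that fails is retried on the next start, which is one plausible reading of "memorized" in the commit message.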