SOLR doesn't support polish out of the box. Plugins are required for
polish. The language has been added only with basic support. For
better results, a manual setup of solr is required.
Closes: #1345
SOLR doesn't support Lithuanian, maybe it can be added via plugins. A
manual setup of solr is required then. It has been added with basic
support.
Closes: #1540
There are models for Spanish, that have been added now. Also the
Hungarian language has been added to the list of supported
languages (for tesseract mainly, no nlp models)
This field is used for Hebrew language. Solr doesn't support it out of
the box. The new field type is just a very basic field using the
standard tokenizer and lowercase filter. It is very likely not
providing good results. Hebrew is really difficult and it requires at
least installing plugins for solr - this is out of scope for docspell.
Users can setup their solr however they like and run a re-index
afterwards.
Adding a new language without nlp requires now only to fill out the
pieces:
- define a list of month names to support date recognition
- add it to joex' dockerfile to be available for tesseract
- update the solr migration/field definitions
- update the elm file so it shows up on the client
When re-indexing everything, skip intermediate populating the index
and do this as the very last step.
Parameterize adding new fields by their language.
Improves and reorganizes how nlp pipelines are setup. Now users can
choose from many options, depending on their hardware and usage
scenario.
This is the base to use more languages without depending on what
stanford-nlp supports. Support then is involves to text extraction and
simple regex-ner processing.
The task runs on application start. It sets the schema using solr's
schema api and then indexes all data in the database. Each step is
memorized so that it is not executed again on subsequent starts.