+++
title = "Adding new language"
weight = 30
+++
Adding a new language for document processing
The following commits and issues show how other languages were added:
- Add Lithuanian and PR
- Add Polish and PR
- Add Spanish language
- Add Latvian language and PR
- Add Japanese language and PR, had some corrections: 1, Issue
- Add Hebrew language
Some older commits may be a bit out of date, but still show the relevant things to do. These are:
- add it to `Language.scala`, create a new `case object` and add it to the `all` list (then fix compile errors)
- define a list of month names to support date recognition and update
`DateFind.scala` to recognize date patterns for that language. Add some tests to `DateFindTest`.
- add it to joex' dockerfile to be available for tesseract
- update the solr migration/field definitions in SolrSetup. Create a new solr migration that adds the content field for the new language - it is a copy&paste from other similar changes.
- update `FtsRepository` for the PostgreSQL fulltext search variant: if not sure, use `simple` here
- update the elm file so it shows up on the client. This also requires
adding translations in `Messages.Data.Language`
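The first two steps above can be sketched roughly as follows. This is a simplified illustration, not the actual content of `Language.scala` or `DateFind.scala`: the trait members, the `SpanishDates` object and the single date pattern are assumptions made for the example, and the real files are structured differently.

```scala
import java.time.LocalDate
import scala.util.Try

// Simplified, hypothetical model of a language ADT.
sealed trait Language { self: Product =>
  def iso2: String
  def iso3: String
  // Derived from the case object's name, e.g. "spanish".
  final def name: String = productPrefix.toLowerCase
}

object Language {
  case object German extends Language { val iso2 = "de"; val iso3 = "deu" }
  case object English extends Language { val iso2 = "en"; val iso3 = "eng" }
  // Step 1a: the new language is a new case object ...
  case object Spanish extends Language { val iso2 = "es"; val iso3 = "spa" }

  // Step 1b: ... and it must also be appended to the `all` list.
  val all: List[Language] = List(German, English, Spanish)

  def fromString(s: String): Either[String, Language] = {
    val key = s.toLowerCase
    all
      .find(l => l.iso2 == key || l.iso3 == key || l.name == key)
      .toRight(s"Unsupported language: $s")
  }
}

// Step 2 (illustration only): month names plus one date pattern.
object SpanishDates {
  // Position in the list + 1 is the month number.
  val months: List[String] = List(
    "enero", "febrero", "marzo", "abril", "mayo", "junio",
    "julio", "agosto", "septiembre", "octubre", "noviembre", "diciembre"
  )

  // Matches "3 de marzo de 2021" as well as "3 marzo 2021".
  private val pattern =
    """(\d{1,2})\s+(?:de\s+)?(\p{L}+)\s+(?:de\s+)?(\d{4})""".r

  def find(text: String): List[LocalDate] =
    pattern
      .findAllMatchIn(text.toLowerCase)
      .flatMap { m =>
        val month = months.indexOf(m.group(2)) + 1
        if (month == 0) None // word is not a known month name
        else Try(LocalDate.of(m.group(3).toInt, month, m.group(1).toInt)).toOption
      }
      .toList
}
```

With these definitions, `SpanishDates.find("Madrid, 3 de marzo de 2021")` yields the single date 2021-03-03. A test added to `DateFindTest` would check something similar against the real implementation.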
Test
Check if everything is fine with sbt Test/compile. After the project
compiles without errors, run sbt fix to apply formatting fixes.
It would be good to start up docspell and check the new language a bit, including whether fulltext search is working.
Sometimes, SOLR doesn't support a language. In this case the migration needs to first add the new field type. There are examples for Lithuanian and Hebrew in the code.
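In that case the migration sends two schema changes in order: first the field type, then the content field that uses it. Below is a minimal sketch of the two request bodies for Solr's Schema API. The names `SolrSchemaSketch`, `text_xx` and `content_xx` are hypothetical, and the analyzer is deliberately generic. The existing Lithuanian and Hebrew migrations in `SolrSetup` remain the reference for how docspell actually does this.

```scala
// Hypothetical helper; only the JSON keys and the solr.* class names
// come from Solr itself, everything else is made up for illustration.
object SolrSchemaSketch {

  // Body for a Schema API request that registers a new text field type.
  def addFieldType(typeName: String): String =
    s"""{"add-field-type": {
       |  "name": "$typeName",
       |  "class": "solr.TextField",
       |  "analyzer": {
       |    "tokenizer": {"class": "solr.StandardTokenizerFactory"},
       |    "filters": [{"class": "solr.LowerCaseFilterFactory"}]
       |  }
       |}}""".stripMargin

  // Body for a Schema API request that adds the content field using that type.
  def addContentField(fieldName: String, typeName: String): String =
    s"""{"add-field": {"name": "$fieldName", "type": "$typeName", "stored": true, "indexed": true}}"""
}
```

The order matters: the `add-field` request refers to the type by name, so Solr rejects it if the type does not exist yet.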
For the docker image, you can run
PLATFORMS=linux/amd64 ./build.sh 0.36.0-SNAPSHOT
in the docker/dockerfile directory to build the docker image (just
choose some version, it doesn't matter).
Non-NLP only
Note that this is without support for NLP. Including support for NLP means that the Stanford NLP library needs to provide models for the language, and these must be included in the build and tested a bit.
Opening issues on GitHub
You can also open an issue on GitHub requesting support for a language. I kindly ask you to include all necessary information, like in this issue. I know that I can dig it out from websites, but it would be nice to have everything ready. It is also better to learn some details from a local person, like which date patterns are more likely to appear than others.