From 81f7e4e322978dcac76e5656fe5981a96acee55c Mon Sep 17 00:00:00 2001
From: eikek
Date: Sat, 21 May 2022 14:11:51 +0200
Subject: [PATCH] Add a quick guide for adding new languages

Closes: #942
---
 website/site/content/docs/dev/add-language.md | 70 +++++++++++++++++++
 website/site/content/docs/dev/development.md  |  6 --
 2 files changed, 70 insertions(+), 6 deletions(-)
 create mode 100644 website/site/content/docs/dev/add-language.md

diff --git a/website/site/content/docs/dev/add-language.md b/website/site/content/docs/dev/add-language.md
new file mode 100644
index 00000000..67a2b135
--- /dev/null
+++ b/website/site/content/docs/dev/add-language.md
@@ -0,0 +1,70 @@
++++
+title = "Adding new language"
+weight = 30
++++
+
+# Adding a new language for document processing
+
+There are existing commits and issues to look at:
+
+- [Add Lithuanian](https://github.com/eikek/docspell/issues/1540) and [PR](https://github.com/eikek/docspell/pull/1559/commits/9d69401fea8ff07330c8a9116bd0d987827317c9)
+- [Add Polish](https://github.com/eikek/docspell/issues/1345) and [PR](https://github.com/eikek/docspell/pull/1559/commits/1228937574ec52b36d5d77925c5fcdb1f536220c)
+- [Add Spanish language](https://github.com/eikek/docspell/commit/26dff18ae0d32ce2b32b4d11ce381ada0e99314f)
+- [Add Latvian language](https://github.com/eikek/docspell/issues/679) and [PR](https://github.com/eikek/docspell/pull/694/commits/9991ad5fcc43ccefe011a6cc4d01bdae4bcd4573)
+- [Add Japanese language](https://github.com/eikek/docspell/issues/948) and [PR](https://github.com/eikek/docspell/pull/961/commits/f994d4b2488e64668ee064676f8c6469d9ccc1be), had some corrections: [1](https://github.com/eikek/docspell/commit/c59d4f8a6d021ec4b01a92320c211248503f16a5), [Issue](https://github.com/eikek/docspell/issues/973)
+- [Add Hebrew language](https://github.com/eikek/docspell/pull/1027)
+
+Some older commits may be a bit out of date, but still show the
+relevant things to do. These are:
+
+- add it to `Language.scala`: create a new `case object` and add it to
+  the `all` list (then fix compile errors)
+- define a list of month names to support date recognition and update
+  `DateFind.scala` to recognize date patterns for that language. Add
+  some tests to `DateFindTest`.
+- add it to the joex Dockerfile so it is available to tesseract
+- update the Solr migration/field definitions in `SolrSetup`. Create a
+  new Solr migration that adds the content field for the new
+  language; it is a copy and paste from other similar changes.
+- update `FtsRepository` for the PostgreSQL fulltext search variant;
+  if unsure, use `simple` here
+- update the Elm file so it shows up on the client. This also requires
+  adding translations in `Messages.Data.Language`
+
+## Test
+
+Check if everything is fine with `sbt Test/compile`. After the project
+compiles without errors, run `sbt fix` to apply formatting fixes.
+
+It would be good to start up Docspell and check the new language a bit,
+including whether fulltext search is working.
+
+Sometimes, Solr doesn't support a language. In this case the migration
+needs to first add the new *field type*. There are examples for
+Lithuanian and Hebrew in the code.
+
+For the docker image, you can run
+
+```bash
+PLATFORMS=linux/amd64 ./build.sh 0.36.0-SNAPSHOT
+```
+
+in the `docker/dockerfile` directory to build the docker image (just
+choose some version, it doesn't matter).
+
+## Non-NLP only
+
+Note that this is without support for NLP. Including support for NLP
+means that the [Stanford NLP](https://github.com/stanfordnlp/CoreNLP)
+library needs to provide models for it and these must be included in
+the build and tested a bit.
+
+## Opening issues on GitHub
+
+You can also open an issue on GitHub requesting support for a language.
+I kindly ask you to include all necessary information, like in
+[this](https://github.com/eikek/docspell/issues/1540) issue. I know
+that I can dig it out from websites, but it would be nice to have
+everything ready. It is also better to learn some details from a local
+person, like which date patterns are more likely to appear than
+others.
diff --git a/website/site/content/docs/dev/development.md b/website/site/content/docs/dev/development.md
index 09d3df9a..3720403c 100644
--- a/website/site/content/docs/dev/development.md
+++ b/website/site/content/docs/dev/development.md
@@ -206,9 +206,3 @@ publishing the release. However, for the nightly releases, this
 doesn't matter - everything must be automated here obviously. I also
 wanted the docker images to be built from the exact same artifacts
 that have been released at github (in contrast to being built again).
-
-
-# Background Info
-
-There is a list of [ADRs](@/docs/dev/adr/_index.md) containing
-internal/background info for various topics.