tenpai
e731d822dc
Add Japanese Vertical Support Branch for Tesseract and OCRmyPDF OCR (#2505)
...
* Add Japanese Vertical Support
* Adds Japanese Vertical mappings to default configuration.
2024-04-16 20:24:57 +02:00
eikek
924aaf720e
Fix compile warnings after Scala update
2024-03-03 18:43:54 +01:00
eikek
b02a5c21fa
Merge pull request #2208 from mprasil/add-slovak-language-support
...
Add support for Slovak language
2023-12-19 23:13:11 +01:00
Miroslav Prasil
dcafd7b062
Apply sbt fix
2023-12-17 11:52:54 +00:00
Rehan Mahmood
2a39b2f6a6
Updated the following dependencies, as they need code changes to work properly:
...
- Scala
- fs2
- http4s
2023-10-31 14:24:00 -04:00
Miroslav Prasil
8826712259
Add support for Slovak language
...
Just the basic support was added following examples for other languages.
2023-08-03 14:26:19 +01:00
xshadowlegendx
4b01a399e4
add Khmer lang month name
2023-03-16 23:52:14 +07:00
xshadowlegendx
54deaf2cd7
specify Khmer lang date pattern
2023-03-16 23:52:03 +07:00
GooRoo
61d5585e68
Add Ukrainian language
2022-11-09 22:24:32 +01:00
eikek
c0feb13f63
Add Estonian language
...
Closes: #1646
2022-11-01 01:00:16 +01:00
eikek
2c9e012c96
Fix url parsing with trailing slash
...
Refs: #1545
2022-07-07 15:22:26 +02:00
eikek
5ec311c331
Add Polish to processing languages
...
Solr doesn't support Polish out of the box; plugins are required for
it. The language has been added with basic support only. For
better results, a manual setup of Solr is required.
Closes: #1345
2022-05-21 14:41:16 +02:00
eikek
9d69401fea
Add Lithuanian to processing languages
...
Solr doesn't support Lithuanian out of the box. It may be possible to
add it via plugins, which would then require a manual setup of Solr.
For now, it has been added with basic support.
Closes: #1540
2022-05-21 14:36:01 +02:00
eikek
7fdd78ad06
Experiment with addons
...
Addons allow executing external programs in some context inside
docspell. Currently, they can be run after files are processed.
Addons are provided as URLs to zip files.
2022-05-15 23:46:43 +02:00
eikek
9eb9497675
Fix logging in tests
2022-02-19 23:33:01 +01:00
eikek
e483a97de7
Adapt to new logging API
2022-02-19 21:41:38 +01:00
Scala Steward
652e85ccea
Reformat with scalafmt 3.3.1
2022-01-02 00:50:55 +01:00
eikek
c21b2cdd29
Update scalafmt to 3.0.8
2021-12-11 22:46:55 +01:00
eikek
501c6f2988
Update Stanford CoreNLP to 4.3.2; add more languages
...
Models for Spanish are available and have now been added. The
Hungarian language has also been added to the list of supported
languages (mainly for tesseract; no NLP models).
2021-11-20 14:31:39 +01:00
eikek
9013f2de5b
Update scalafmt settings
2021-09-22 17:23:24 +02:00
eikek
9785db0683
Change license header of all files
2021-09-21 22:35:38 +02:00
wallace
589c41003f
Add Hebrew document language
2021-08-24 01:19:42 +03:00
Scala Steward
e4fecefaea
Reformat with scalafmt 3.0.0
2021-08-19 08:50:30 +02:00
eikek
1901fe1a8c
Adapt to deprecated APIs from fs2; use fs2.Path
2021-08-07 17:51:56 +02:00
eikek
4af8dd0950
Preprocess Japanese texts to find dates
...
Not very efficient, but should work to find the position of dates in
Japanese text.
2021-07-29 01:35:15 +02:00
wallace
e8348e2809
Remove excessive spaces
2021-07-29 02:08:48 +03:00
wallace11
1095a7d56f
Add another Japanese test
2021-07-29 01:13:22 +03:00
wallace11
119a4ffdc9
Update Japanese tests with more sensible data
2021-07-29 01:08:48 +03:00
eikek
f994d4b248
Add Japanese document language
2021-07-28 20:05:48 +02:00
eikek
8e5c88fd32
Add copyright header to source files
2021-07-04 10:57:53 +02:00
eikek
bd791b4593
Upgrade code base to CE3
2021-06-22 22:53:34 +02:00
Eike Kettner
e1bbc2edf5
Apply autoformat
2021-04-10 16:31:58 +02:00
Scala Steward
144ea852bf
Update fs2-core, fs2-io to 2.5.4
2021-03-31 21:10:42 +02:00
Eike Kettner
6a63694a3e
Convert unit tests to munit
2021-03-10 19:48:56 +01:00
Eike Kettner
9991ad5fcc
Add Latvian language
2021-03-09 00:23:17 +01:00
Eike Kettner
e6d9ce2c37
Remove obsolete type capabilities
...
These are now detected by the new Scala compiler and lead to compile
errors.
2021-03-01 00:16:30 +01:00
Eike Kettner
c7d4c77e6d
Allow more suggestions for date variants in English
2021-02-26 00:35:17 +01:00
Eike Kettner
c7e850116f
Make the text length limit optional
2021-01-22 23:06:50 +01:00
Eike Kettner
249f9e6e2a
Extend guessing tags to all tag categories
2021-01-18 21:51:45 +01:00
Eike Kettner
3f75af0807
Add 9 more languages to the list of document languages
2021-01-18 17:41:40 +01:00
Eike Kettner
26dff18ae0
Add Spanish as an example
...
Adding a new language without NLP now only requires filling in these
pieces:
- define a list of month names to support date recognition
- add it to joex's Dockerfile so it is available for tesseract
- update the Solr migration/field definitions
- update the Elm file so it shows up on the client
2021-01-18 17:41:40 +01:00
Eike Kettner
ff121d462c
Disable memory intensive tests on travis
2021-01-18 17:41:40 +01:00
Eike Kettner
f01646aeb5
Reorganize NLP pipeline and add NLP-unsupported language Italian
...
Improves and reorganizes how NLP pipelines are set up. Users can now
choose from many options, depending on their hardware and usage
scenario.
This is the basis for using more languages without depending on what
stanford-nlp supports. Support is then limited to text extraction and
simple regex-NER processing.
2021-01-18 17:41:40 +01:00
Eike Kettner
aa937797be
Choose nlp mode in config file
2021-01-17 22:56:33 +01:00
Eike Kettner
54a09861c4
Use model cache with basic annotator
2021-01-17 22:56:33 +01:00
Eike Kettner
a77f67d73a
Make pipeline cache generic to be used with BasicCRFAnnotator
2021-01-17 22:56:33 +01:00
Eike Kettner
4462ebae0f
Resurrect the basic ner classifier
2021-01-17 22:56:33 +01:00
Eike Kettner
a699e87304
Separate ner from classification
2021-01-17 22:56:33 +01:00
Eike Kettner
f02f15e5bd
Move blocker into constructor of text analyser
2021-01-17 22:56:33 +01:00
Eike Kettner
b2b8ad625a
scalafmt
2021-01-17 20:11:58 +01:00