docspell

mirror of https://github.com/TheAnachronism/docspell.git synced 2025-06-22 10:28:27 +00:00

Author	SHA1	Message	Date
Scala Steward	e4fecefaea	Reformat with scalafmt 3.0.0	2021-08-19 08:50:30 +02:00
eikek	1901fe1a8c	Adopt deprecated APIs from fs2; use fs2.Path	2021-08-07 17:51:56 +02:00
Scala Steward	558007235b	Update tika-core to 2.0.0. Include new ODF parser from tika-2.0.0.	2021-07-25 13:08:18 +02:00
eikek	8e5c88fd32	Add copyright header to source files	2021-07-04 10:57:53 +02:00
eikek	bd791b4593	Upgrade code base to CE3	2021-06-22 22:53:34 +02:00
Eike Kettner	e1bbc2edf5	Apply autoformat	2021-04-10 16:31:58 +02:00
Eike Kettner	6a63694a3e	Convert unit tests to munit	2021-03-10 19:48:56 +01:00
Eike Kettner	f01646aeb5	Reorganize nlp pipeline and add nlp-unsupported language italian. Improves and reorganizes how nlp pipelines are set up. Now users can choose from many options, depending on their hardware and usage scenario. This is the base to use more languages without depending on what stanford-nlp supports. Support then involves text extraction and simple regex-ner processing.	2021-01-18 17:41:40 +01:00
Eike Kettner	4fd6e02ec0	Improve glob and filter archive entries	2020-11-11 21:01:23 +01:00
Eike Kettner	e26d7129e7	Add fix for mariadb text columns. The `text` data type can only store up to 64kb data. The `mediumtext` up to 16M and `longtext` up to 4G. Issue: #297	2020-10-02 16:50:51 +02:00
Eike Kettner	c658677032	Autoformat	2020-09-09 00:29:32 +02:00
Eike Kettner	0599176ae8	Update scala to 2.13.3	2020-08-01 01:03:43 +02:00
Eike Kettner	da68405f9b	Extract meta data from pdfs using pdfbox	2020-07-18 23:04:46 +02:00
Eike Kettner	347a029af8	Scalafix organize-imports	2020-06-28 21:20:47 +02:00
Eike Kettner	c41cdeefec	Update scalafmt to 2.5.1 + scalafmtAll	2020-05-04 23:53:57 +02:00
Eike Kettner	9656ba62f4	scalafmtAll	2020-03-26 18:26:00 +01:00
Eike Kettner	cf7ccd572c	Improve handling encodings. Html and text files are not fixed to be UTF-8. The encoding is now detected, which may not work for all files. Default/fallback will be utf-8. There is still a problem with mails that contain html parts not in utf8 encoding. The mail text is always returned as a string and the original encoding is lost. Then the html is stored using utf-8 bytes, but wkhtmltopdf reads it using latin1. It seems that the `--encoding` setting doesn't override encoding provided by the document.	2020-03-23 22:51:28 +01:00
Eike Kettner	6b1156182c	Add support for eml (rfc822 email) files	2020-03-19 22:42:40 +01:00
Eike Kettner	4ed7a137f7	Add support for archive files. Each attachment is now first extracted into potentially multiple ones, if it is recognized as an archive. This is the first step in processing. The original archive file is also stored and the resulting attachments are associated to their original archive. First support is implemented for zip files.	2020-03-19 22:42:27 +01:00
Eike Kettner	2f87065b2e	sbt scalafmtAll	2020-02-25 20:55:00 +01:00
Eike Kettner	9b1349734e	Convert some files to pdf	2020-02-19 02:03:10 +01:00
Eike Kettner	bd605b8c94	Add first drafts for converting	2020-02-18 01:31:22 +01:00
Eike Kettner	e0682464b5	Configure pdf extraction; move Logger and DataType to common	2020-02-17 14:01:36 +01:00
Eike Kettner	3d615181e0	Early draft for text extraction	2020-02-17 01:57:22 +01:00
Eike Kettner	8143a4edcc	Adding extraction primitives	2020-02-16 21:37:26 +01:00
Eike Kettner	3deba44282	Rename example files	2020-02-15 12:52:24 +01:00
Eike Kettner	1309c8b7fa	Move mimetype detection to docspell-files	2020-02-14 22:06:18 +01:00
Eike Kettner	5c3d2b2e28	Rename example-files to files	2020-02-14 11:14:09 +01:00

28 Commits