docspell

mirror of https://github.com/TheAnachronism/docspell.git synced 2025-06-22 02:18:26 +00:00

Author	SHA1	Message	Date
eikek	8269a73a83	Extend config for external commands (#2536). Allows configuring external commands and providing different arguments based on runtime values, such as the language. It extends the current config of a command with an `arg-mappings` section. An example for ocrmypdf: ```conf ocrmypdf = { enabled = true command = { program = "ocrmypdf" ### new arg-mappings arg-mappings = { "mylang" = { value = "{{lang}}" mappings = [ { matches = "deu" args = [ "-l", "deu", "--pdf-renderer", "sandwich" ] }, { matches = ".*" args = [ "-l", "{{lang}}" ] } ] } } #### end new arg-mappings args = [ ### will be replaced with corresponding args from "mylang" mapping "{{mylang}}", "--skip-text", "--deskew", "-j", "1", "{{infile}}", "{{outfile}}" ] timeout = "5 minutes" } working-dir = ${java.io.tmpdir}"/docspell-convert" } ``` The whole section is first processed to replace all `{{…}}` patterns with corresponding values. Then `arg-mappings` is examined, and the first match (value == matches) in its `mappings` array is used to replace the mapping's name in the arguments to the command.	2024-03-08 21:34:42 +01:00
eikek	fe4a300b0e	Update pdfbox to 3.0.0	2023-11-06 00:06:49 +01:00
Rehan Mahmood	2a39b2f6a6	Updated the following dependencies, which need code changes to work properly: - Scala - fs2 - http4s	2023-10-31 14:24:00 -04:00
eikek	d413b16b03	Allow to always use OCR extracted text Fixes: #1628	2022-07-07 17:58:03 +02:00
eikek	7fdd78ad06	Experiment with addons Addons allow to execute external programs in some context inside docspell. Currently it is possible to run them after processing files. Addons are provided by URLs to zip files.	2022-05-15 23:46:43 +02:00
eikek	9eb9497675	Fix logging in tests	2022-02-19 23:33:01 +01:00
eikek	e483a97de7	Adapt to new logging API	2022-02-19 21:41:38 +01:00
eikek	bc1ec90b6e	Allow subsampling when generating preview images This cuts down considerably when high-dpi images are provided in pdfs. The test file, scanned with 600dpi resulting in a 5.4M pdf file contains a 9900x13800 image. This image is loaded into memory in order to scale it down by PDFBox. This easily results in out of memory errors (this image requires already ~400M). With subsampling the size is reduced at most by a factor of 8. Still recommended to avoid large dpi image-only scans for text based documents or increase the heap size for joex.	2022-01-13 00:04:50 +01:00
eikek	c21b2cdd29	Update scalafmt to 3.0.8	2021-12-11 22:46:55 +01:00
eikek	9013f2de5b	Update scalafmt settings	2021-09-22 17:23:24 +02:00
eikek	9785db0683	Change license header of all files	2021-09-21 22:35:38 +02:00
Scala Steward	e4fecefaea	Reformat with scalafmt 3.0.0	2021-08-19 08:50:30 +02:00
eikek	1901fe1a8c	Adopt deprecated APIs from fs2; use fs2.Path	2021-08-07 17:51:56 +02:00
Scala Steward	558007235b	Update tika-core to 2.0.0 Include new ODF parser from tika-2.0.0	2021-07-25 13:08:18 +02:00
eikek	8e5c88fd32	Add copyright header to source files	2021-07-04 10:57:53 +02:00
eikek	02b8078f01	Use fs2 Files api	2021-06-22 23:17:32 +02:00
eikek	bd791b4593	Upgrade code base to CE3	2021-06-22 22:53:34 +02:00
Eike Kettner	e1bbc2edf5	Apply autoformat	2021-04-10 16:31:58 +02:00
Eike Kettner	8c6ad8fc4e	This test fails only on my CI	2021-03-13 16:57:08 +01:00
Eike Kettner	6a63694a3e	Convert unit tests to munit	2021-03-10 19:48:56 +01:00
Eike Kettner	cfa36a5270	Fix preview png tests Outcome was checked manually.	2021-03-01 00:33:57 +01:00
Eike Kettner	a77f34b7ba	Add a processing step to retrieve page counts	2020-11-09 11:08:24 +01:00
Eike Kettner	f4e50c5229	Provide endpoints to submit tasks to re-generate previews The scaling factor can be given in the config file. When this changes, images can be regenerated via POSTing to certain endpoints. It is possible to regenerate just one attachment preview or all within a collective.	2020-11-09 09:00:02 +01:00
Eike Kettner	350a271b22	Add simple pdf page preview function	2020-11-08 01:25:14 +01:00
Eike Kettner	c658677032	Autoformat	2020-09-09 00:29:32 +02:00
Eike Kettner	cec4948710	Add pdf meta data to extracted text to add it to full-text index	2020-07-19 01:07:49 +02:00
Eike Kettner	209c068436	Use keywords in pdfs to search for existing tags During processing, keywords stored in PDF metadata are used to look them up in the tag database and associate any existing tags to the item. See #175	2020-07-19 00:28:04 +02:00
Eike Kettner	da68405f9b	Extract meta data from pdfs using pdfbox	2020-07-18 23:04:46 +02:00
Eike Kettner	347a029af8	Scalafix organize-imports	2020-06-28 21:20:47 +02:00
Eike Kettner	2e88207ff1	Post process all extracted text Removes 0 bytes and leading/trailing whitespace	2020-05-25 13:56:06 +02:00
Eike Kettner	ee394eae86	Try to streamline the different impls for `MimeType`	2020-05-25 09:24:24 +02:00
Eike Kettner	c41cdeefec	Update scalafmt to 2.5.1 + scalafmtAll	2020-05-04 23:53:57 +02:00
Eike Kettner	9656ba62f4	scalafmtAll	2020-03-26 18:26:00 +01:00
Eike Kettner	cf7ccd572c	Improve handling of encodings. HTML and text files are not fixed to UTF-8. The encoding is now detected, which may not work for all files. The default/fallback is UTF-8. There is still a problem with mails that contain HTML parts not in UTF-8 encoding. The mail text is always returned as a string and the original encoding is lost. Then the HTML is stored using UTF-8 bytes, but wkhtmltopdf reads it using latin1. It seems that the `--encoding` setting doesn't override the encoding provided by the document.	2020-03-23 22:51:28 +01:00
Eike Kettner	2f87065b2e	sbt scalafmtAll	2020-02-25 20:55:00 +01:00
Eike Kettner	97305d27ff	Integrate support for more files into processing and upload The restriction that only pdf files can be uploaded is removed. All files can now be uploaded. The processing may not process all. It is still possible to restrict file uploads by types via a configuration.	2020-02-19 23:27:00 +01:00
Eike Kettner	9b1349734e	Convert some files to pdf	2020-02-19 02:03:10 +01:00
Eike Kettner	5869e2ee6e	Streamline extern-conv stdin/infile	2020-02-18 12:43:47 +01:00
Eike Kettner	0dcc00836b	Make logger configurable in system commands	2020-02-18 12:02:43 +01:00
Eike Kettner	e0682464b5	Configure pdf extraction; move Logger and DataType to common	2020-02-17 14:01:36 +01:00
Eike Kettner	3d615181e0	Early draft for text extraction	2020-02-17 01:57:22 +01:00
Eike Kettner	8143a4edcc	Adding extraction primitives	2020-02-16 21:37:26 +01:00
Eike Kettner	851ee7ef0f	Reorganize processing code Use separate modules for - text extraction - conversion to pdf - text analysis	2020-02-15 21:25:25 +01:00

43 Commits
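
The two-step argument resolution described in commit 8269a73a83 (replace `{{…}}` patterns first, then swap each `arg-mappings` name for the args of its first matching entry) can be sketched roughly as follows. This is an illustrative Python sketch, not docspell's actual Scala implementation: the function names `substitute` and `resolve_args` are hypothetical, and treating `matches` as a regular expression that must match the whole value is an assumption based on the `".*"` entry in the example config.

```python
import re

# Matches a {{name}} placeholder (pattern syntax taken from the commit message).
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def substitute(text, values):
    """Replace {{key}} with values[key]; unknown keys are left untouched."""
    return PLACEHOLDER.sub(lambda m: values.get(m.group(1), m.group(0)), text)


def resolve_args(args, arg_mappings, values):
    """Hypothetical helper: resolve command args as the commit describes."""
    # Step 1: replace runtime placeholders such as {{lang}} or {{infile}}.
    args = [substitute(a, values) for a in args]

    # For each mapping name, pick the args of the first entry that matches.
    resolved = {}
    for name, spec in arg_mappings.items():
        value = substitute(spec["value"], values)
        for mapping in spec["mappings"]:
            # Assumption: `matches` is a regex applied to the whole value.
            if re.fullmatch(mapping["matches"], value):
                resolved[name] = [substitute(a, values) for a in mapping["args"]]
                break

    # Step 2: splice the selected args in place of the mapping's own name.
    out = []
    for arg in args:
        m = PLACEHOLDER.fullmatch(arg)
        if m and m.group(1) in resolved:
            out.extend(resolved[m.group(1)])
        else:
            out.append(arg)
    return out


# Data mirroring the ocrmypdf example from the commit message.
ARG_MAPPINGS = {
    "mylang": {
        "value": "{{lang}}",
        "mappings": [
            {"matches": "deu", "args": ["-l", "deu", "--pdf-renderer", "sandwich"]},
            {"matches": ".*", "args": ["-l", "{{lang}}"]},
        ],
    }
}
ARGS = ["{{mylang}}", "--skip-text", "--deskew", "-j", "1", "{{infile}}", "{{outfile}}"]

if __name__ == "__main__":
    runtime = {"lang": "deu", "infile": "in.pdf", "outfile": "out.pdf"}
    print(resolve_args(ARGS, ARG_MAPPINGS, runtime))
```

With `lang` set to `deu` the first entry is selected and the sandwich renderer flags are added; any other language falls through to the `.*` entry and only receives `-l <lang>`.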