docspell

mirror of https://github.com/TheAnachronism/docspell.git synced 2024-11-13 02:31:10 +00:00

Author	SHA1	Message	Date
Eike Kettner	3473cbb773	Use collective data with NER annotation	2020-08-25 20:40:44 +02:00
Eike Kettner	96d2f948f2	Use collective's addressbook to configure regexner	2020-08-24 14:40:52 +02:00
Eike Kettner	8628a0a8b3	Allow configuring stanford-ner and cache based on collective	2020-08-24 10:55:59 +02:00
Eike Kettner	3986487f11	Add api docs and cleanup	2020-08-13 21:22:54 +02:00
Eike Kettner	41ea071555	Add a task to convert all pdfs that have not been converted	2020-08-13 01:06:13 +02:00
Eike Kettner	07e9a9767e	Add a task to re-process files of an item	2020-08-12 22:29:56 +02:00
Eike Kettner	09d74b7e80	Return item notes with search results. To keep the response from growing very large, an admin can define a limit on how much to return.	2020-08-05 00:09:37 +02:00
Eike Kettner	45b0deeced	Print solr url on start This is useful info to see which url has been selected, same as db connection.	2020-08-01 15:59:14 +02:00
Eike Kettner	1fc57fc2b2	Set default value for min-text-len to 500. This value is used to decide whether to try OCR or not. If the text is shorter than this value, OCR is run and both results are compared. It was set to 10, which is just one or two words. Since docspell deals with documents, that value is too low.	2020-08-01 15:46:00 +02:00
Eike Kettner	cec4948710	Add pdf meta data to extracted text to add it to full-text index	2020-07-19 01:07:49 +02:00
Eike Kettner	209c068436	Use keywords in pdfs to search for existing tags During processing, keywords stored in PDF metadata are used to look them up in the tag database and associate any existing tags to the item. See #175	2020-07-19 00:28:04 +02:00
Eike Kettner	bd20165d1a	Use given folder-id when adding initial fts docs	2020-07-18 23:04:01 +02:00
Eike Kettner	3d49ceaab5	Use ocrmypdf tool to create pdf/a during conversion - Use another external tool to convert pdf to pdf which also adds the extracted text as another layer into the pdf - Although not used, the external conversion routine will now check for an existing text file that is named as the pdf file with extension `.txt`. If present it is included in the conversion result and will be used as the extracted text. - text extraction for pdf files happens now on the converted file, because it may already contain the text from the conversion step and thus avoids running OCR twice. - All errors during conversion are not fatal; processing continues without a converted file.	2020-07-18 17:19:29 +02:00
Eike Kettner	5b01c93711	Add a folder-id to item processing. This allows defining a folder when uploading files. All generated items are associated with this folder on creation.	2020-07-14 23:18:39 +02:00
Eike Kettner	259526a088	Organize imports	2020-07-12 13:51:52 +02:00
Eike Kettner	22fa1dba13	Apply folder restriction to fulltext-only search and update index when folder changes.	2020-07-12 13:50:45 +02:00
Eike Kettner	aeba4ba913	Refactor full-text migrations and add folder to solr schema	2020-07-12 13:50:14 +02:00
Eike Kettner	e387b5513f	Remove items in non-member folders from sql search results	2020-07-11 22:25:56 +02:00
Eike Kettner	752a94a9e2	Implement space operations	2020-07-11 01:30:28 +02:00
Eike Kettner	347a029af8	Scalafix organize-imports	2020-06-28 21:20:47 +02:00
Eike Kettner	41c0f70d3b	Fix cancelling jobs A request to cancel a job was not processed correctly. The cancelling routine of a task must run, regardless of the (non-final) state. Now it works like this: if a job is currently running, it is interrupted and its cancel routine is invoked. It then enters "cancelled" state. If it is stuck, it is loaded and only its cancel routine is run. If it is in a final state or waiting, it is removed from the queue.	2020-06-26 23:08:27 +02:00
Eike Kettner	d79ae6233a	Restrict proposals for due date Avoid dates too far in the future.	2020-06-26 16:58:17 +02:00
Eike Kettner	91da3b149e	Reducing default retries to 2. Many errors cannot be recovered from by retrying. There is currently no way to distinguish these states, so it is now set to a lower value to avoid long wait times until an item arrives.	2020-06-25 23:57:01 +02:00
Eike Kettner	dc8f1a0387	Fix global re-index task to re-create the schema Otherwise new instances could not be re-indexed.	2020-06-25 23:02:06 +02:00
Eike Kettner	14213c4c27	Allow some solr query options in the config file	2020-06-24 23:37:20 +02:00
Eike Kettner	532caed84c	Consistent logging of request/responses to solr Using a middleware. Also add missing changesets for mariadb.	2020-06-24 21:25:46 +02:00
Eike Kettner	47697a8056	Set some logs to trace	2020-06-24 01:16:13 +02:00
Eike Kettner	e06a3f8fdd	ScalafmtAll	2020-06-23 00:18:59 +02:00
Eike Kettner	ffbb16db45	Transport highlighting information to the client	2020-06-23 00:17:29 +02:00
Eike Kettner	cfe5aa8894	Use no-op fts-client if disabled + push this flag to the webui	2020-06-21 21:06:08 +02:00
Eike Kettner	0d8b03fc61	Add backend operations for re-creating the full-text index	2020-06-21 15:46:51 +02:00
Eike Kettner	14ea4091c4	Renaming things	2020-06-21 13:15:02 +02:00
Eike Kettner	2f6e531c45	Refactoring index migration task	2020-06-21 01:37:23 +02:00
Eike Kettner	1f4ff0d4c4	Add language to schema, extend fts-client	2020-06-20 22:44:47 +02:00
Eike Kettner	2a0bf24088	Setup solr schema and index all data using a system task The task runs on application start. It sets the schema using solr's schema api and then indexes all data in the database. Each step is memorized so that it is not executed again on subsequent starts.	2020-06-19 21:37:22 +02:00
Eike Kettner	60c079f664	Add task to index current database state	2020-06-18 22:38:45 +02:00
Eike Kettner	146d1b0562	Make data to index more flexible and extensible	2020-06-17 23:20:46 +02:00
Eike Kettner	522daaf57e	Introducing fts client into codebase	2020-06-17 23:20:46 +02:00
Eike Kettner	897d91475e	Update scalafmt-core to 2.6.0	2020-06-17 19:53:56 +02:00
Eike Kettner	7a3d2e4dc6	Extract `OItemSearch` from `OItem`	2020-06-15 23:13:48 +02:00
Eike Kettner	e5b90eff34	Allow client to load items in batches	2020-06-06 11:05:15 +02:00
Eike Kettner	4b0eb650f2	Rename package to avoid name clashes	2020-05-25 16:22:09 +02:00
Eike Kettner	56624515a5	ScalafmtAll	2020-05-25 13:56:06 +02:00
Eike Kettner	ee394eae86	Try to streamline the different impls for `MimeType`	2020-05-25 09:24:24 +02:00
Eike Kettner	4694433e38	Fix attachment positions. It worked for new items, because the implicit offset was 0. When adding archives to existing items, there are already attachments and the new attachments are added to the end. This won't work if files are added concurrently, because there is no quick and reliable way to determine the offset then.	2020-05-24 15:13:30 +02:00
Eike Kettner	1dde43e092	Only process attachments in task arguments When files are added to an item, the attachments already present must not be "re-processed".	2020-05-24 13:29:38 +02:00
Eike Kettner	4e49c78e72	Change some log levels of item processing task	2020-05-24 12:54:35 +02:00
Eike Kettner	f4949446e3	Allow to specify an item id to amend files to existing items	2020-05-23 20:15:55 +02:00
Eike Kettner	25d089da6c	Update state and proposals only on invalid items Invalid items are those that are not ready, and not shown to the user. When changing metadata, it should only be changed, if the item was not already shown to the user.	2020-05-23 15:46:24 +02:00
Eike Kettner	855d4eefa8	Set progress in a linear way between each step	2020-05-23 15:33:58 +02:00
Eike Kettner	d9782582d8	Use `max-mails` setting with higher priority The `mail-chunk-size` is set to its configured value or `max-mails` whichever is lower.	2020-05-20 22:44:29 +02:00
Eike Kettner	c0259dba7e	Allow to enable debug flag for javamail	2020-05-20 22:15:25 +02:00
Eike Kettner	2858d6b853	Notify job executors at the end of the task	2020-05-20 19:44:45 +02:00
Eike Kettner	31a1abf395	Add server limits to importing mails task	2020-05-20 17:52:38 +02:00
Eike Kettner	f2d67dc816	Initial impl of import from mailbox user task	2020-05-20 17:52:38 +02:00
Eike Kettner	852455c610	Add upload operation to task arguments	2020-05-20 17:52:38 +02:00
Eike Kettner	a4be63fd77	Add stub for scan-mailbox task	2020-05-20 17:52:38 +02:00
Eike Kettner	d65c1e0d36	Use date from e-mails to set item date	2020-05-17 11:58:51 +02:00
Eike Kettner	3e10e2175a	Sort by weights better and save them	2020-05-17 11:58:51 +02:00
Scala Steward	5d6658770e	Update emil-common, emil-doobie, ... to 0.6.0	2020-05-17 11:55:53 +02:00
Eike Kettner	6747a86fea	Simplify jsoup sanitizer to reuse from emil	2020-05-14 23:56:08 +02:00
Eike Kettner	9c882e1be9	Fix package name	2020-05-10 21:03:12 +02:00
Eike Kettner	bd5066740d	Joex depends on backend module The job executor depends on backend module, since it may control the application via user tasks. The `ONode` can now be moved from the store module into the backend module.	2020-05-10 21:03:12 +02:00
Eike Kettner	c41cdeefec	Update scalafmt to 2.5.1 + scalafmtAll	2020-05-04 23:53:57 +02:00
Eike Kettner	0a1b3fcf95	Set list-id header for notification mails	2020-04-30 21:23:56 +02:00
Eike Kettner	75a66ecb86	Update http4s to 0.21.4	2020-04-29 01:05:13 +02:00
Eike Kettner	fa10fe3fae	Update scala to 2.13.2	2020-04-24 22:24:31 +02:00
Eike Kettner	315ea63f44	Improve notify mail template	2020-04-23 23:17:34 +02:00
Eike Kettner	84e0ebf1a2	Add a flag for restricting overdue items	2020-04-23 21:37:03 +02:00
Eike Kettner	d52efdfcf0	Improve mail template	2020-04-22 23:41:09 +02:00
Eike Kettner	ffc1cdee51	Sort due items by their earliest due date	2020-04-22 22:21:28 +02:00
Eike Kettner	e1f9ae2629	Include links to items into mail template	2020-04-22 21:53:25 +02:00
Eike Kettner	2723d6b43b	Implement notify-due-items task	2020-04-22 21:08:45 +02:00
Eike Kettner	ad772c0c25	Server-side stub impl for notify-due-items	2020-04-22 21:08:45 +02:00
Eike Kettner	1206105f0b	Fix several bugs with handling e-mail files - When converting from html->pdf, the wkhtmltopdf program exits with errors if the document contains invalid links. The content is now cleaned before being handed to wkhtmltopdf. - Update emil library which fixes a bug when reading mails without explicit transfer encoding (8bit) - Add an info header to converted mails	2020-04-07 22:38:25 +02:00
Eike Kettner	6a1297fc95	Add a limit for text analysis	2020-03-27 22:54:49 +01:00
Eike Kettner	9656ba62f4	scalafmtAll	2020-03-26 18:26:00 +01:00
Eike Kettner	09ea724c13	Store message-id of eml files	2020-03-25 22:00:51 +01:00
Eike Kettner	e305b46708	Extract tnef attachments and fix incomplete html The wkhtmltopdf requires the content encoding set correctly in the document.	2020-03-24 23:40:29 +01:00
Eike Kettner	0b80572664	Fix encodings for mails with non-utf8 html parts	2020-03-24 23:40:29 +01:00
Eike Kettner	cf7ccd572c	Improve handling encodings Html and text files are not fixed to be UTF-8. The encoding is now detected, which may not work for all files. Default/fallback will be utf-8. There is still a problem with mails that contain html parts not in utf8 encoding. The mail text is always returned as a string and the original encoding is lost. Then the html is stored using utf-8 bytes, but wkhtmltopdf reads it using latin1. It seems that the `--encoding` setting doesn't override encoding provided by the document.	2020-03-23 22:51:28 +01:00
Eike Kettner	cba466ed47	Set item due date candidate After processing, set the due date of an item to the first candidate. The earliest due date is considered best match.	2020-03-20 22:39:09 +01:00
Eike Kettner	6b1156182c	Add support for eml (rfc822 email) files	2020-03-19 22:42:40 +01:00
Eike Kettner	4ed7a137f7	Add support for archive files Each attachment is now first extracted into potentially multiple ones, if it is recognized as an archive. This is the first step in processing. The original archive file is also stored and the resulting attachments are associated to their original archive. First support is implemented for zip files.	2020-03-19 22:42:27 +01:00
Eike Kettner	f0449dd2ce	Properly initialize thread pools	2020-03-17 22:37:12 +01:00
Eike Kettner	00ca6b5697	Improve text analysis - Search for consecutive labels - Sort list of candidates by a weight - Search for organizations using person labels	2020-03-17 22:34:50 +01:00
Eike Kettner	718e44a21c	Add cleanup jobs task	2020-03-09 20:24:00 +01:00
Eike Kettner	854a596da3	Integrate periodic tasks The first use case for periodic task is the cleanup of expired invitation keys. This is part of a house-keeping periodic task.	2020-03-08 22:49:49 +01:00
Eike Kettner	616c333fa5	Implement storage routines for periodic scheduler	2020-03-08 13:56:23 +01:00
Eike Kettner	1e598bd902	Sketch a scheduler for running periodic tasks Periodic tasks are special in that they are usually kept around and started based on a schedule. A new component checks periodic tasks and submits them in the queue once they are due. In order to avoid duplicate periodic jobs, the tracker of a job is used to store the periodic job id. Each time a periodic task is due, it is first checked if there is a job running (or queued) for this task.	2020-03-08 12:55:03 +01:00
Eike Kettner	2f87065b2e	sbt scalafmtAll	2020-02-25 20:55:00 +01:00
Eike Kettner	ec419c7bfd	Adapt nix modules to new config	2020-02-22 12:40:56 +01:00
Eike Kettner	3f316ab4d0	Update config file doc	2020-02-20 21:10:00 +01:00
Eike Kettner	97305d27ff	Integrate support for more files into processing and upload The restriction that only pdf files can be uploaded is removed. All files can now be uploaded. The processing may not process all. It is still possible to restrict file uploads by types via a configuration.	2020-02-19 23:27:00 +01:00
Eike Kettner	0dcc00836b	Make logger configurable in system commands	2020-02-18 12:02:43 +01:00
Eike Kettner	bd605b8c94	Add first drafts for converting	2020-02-18 01:31:22 +01:00
Eike Kettner	e0682464b5	Configure pdf extraction; move Logger and DataType to common	2020-02-17 14:01:36 +01:00
Eike Kettner	3d615181e0	Early draft for text extraction	2020-02-17 01:57:22 +01:00
Eike Kettner	851ee7ef0f	Reorganize processing code Use separate modules for - text extraction - conversion to pdf - text analysis	2020-02-15 21:25:25 +01:00
Eike Kettner	ce22b727b1	Add new convert module and sketch its integration	2020-02-11 00:33:52 +01:00
Eike Kettner	3be90d64d5	Move `SystemCommand` to common module	2020-02-10 22:23:06 +01:00
Eike Kettner	ba3865ef5e	Starting to support more file types. First, files are converted to PDF for archiving. It is also easier to create a preview. This is done via the `ConvertPdf` processing task (which is not yet implemented). Text extraction then tries first with the original file. If that fails, OCR is done on the (potentially) converted pdf file. To not lose information of the original file, it is saved using the table `attachment_source`. If the original file is already a pdf, or the conversion did not succeed, the `attachment` and `attachment_source` record point to the same file.	2020-02-10 12:42:45 +01:00
Eike Kettner	fc3e22e399	Apply scalafmt to all files	2019-12-30 21:44:13 +01:00
Eike Kettner	2ad1586d00	Set stricter compile options and fix cookie data	2019-09-28 22:17:45 +02:00
Eike Kettner	831cd8b655	Initial version. Features: - Upload PDF files and have them analyzed - Manage meta data and items - See processing in webapp	2019-09-21 22:02:36 +02:00
Eike Kettner	6154e6a387	Initial application stub	2019-09-21 14:54:03 +02:00

306 Commits