mirror of
https://github.com/TheAnachronism/docspell.git
synced 2025-06-22 02:18:26 +00:00
Adding extraction primitives
@@ -112,7 +112,7 @@ If conversion is not supported for the input file, it is skipped. If
conversion fails, the error is propagated so that the retry mechanism
can handle it.

-### What types?
+#### What types?

Which file types should be supported? As a first step, all major
office documents, common images, plain text (i.e. markdown) and html
@@ -123,6 +123,12 @@ There is always the preference to use jvm internal libraries in order
to be more platform independent and to reduce external dependencies.
But this is not always possible (like doing OCR).

<div class="thumbnail">
<img src="./img/process-files.png" title="Overview processing files">
</div>

#### Conversion

- Office documents (`doc`, `docx`, `xls`, `xlsx`, `odt`, `ods`):
  unoconv (see [ADR 9](0009_convert_office_docs))
- HTML (`html`): wkhtmltopdf (see [ADR 7](0007_convert_html_files))
@@ -130,9 +136,19 @@ But this is not always possible (like doing OCR).
- Images (`jpg`, `png`, `tif`): Tesseract (see [ADR
  10](0010_convert_image_files))

#### Text Extraction

- Office documents (`doc`, `docx`, `xls`, `xlsx`): Apache POI
- Office documents (`odt`, `ods`): Apache Tika (including the sources)
- HTML: not supported, extract text from converted PDF
- Images (`jpg`, `png`, `tif`): Tesseract
- Text/Markdown: n.a.
- PDF: Apache PDFBox or Tesseract

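
Read as a lookup table, the text extraction choices above map a file extension to an extractor. The following Java sketch only illustrates that mapping. `ExtractorChoice`, its `Extractor` enum and `forExtension` are hypothetical names, not docspell code, and the `txt`, `md`, `html` and `pdf` extensions are assumptions, since the list names those types without spelling out extensions.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Illustrative only: all names here are hypothetical and not taken from docspell.
final class ExtractorChoice {

    enum Extractor { POI, TIKA, TESSERACT, PDFBOX_OR_TESSERACT, VIA_CONVERTED_PDF, NOT_APPLICABLE }

    // Mirrors the "Text Extraction" list; keys are lower-case file extensions.
    private static final Map<String, Extractor> BY_EXTENSION = Map.ofEntries(
        Map.entry("doc", Extractor.POI),
        Map.entry("docx", Extractor.POI),
        Map.entry("xls", Extractor.POI),
        Map.entry("xlsx", Extractor.POI),
        Map.entry("odt", Extractor.TIKA),
        Map.entry("ods", Extractor.TIKA),
        Map.entry("html", Extractor.VIA_CONVERTED_PDF),  // assumed extension
        Map.entry("jpg", Extractor.TESSERACT),
        Map.entry("png", Extractor.TESSERACT),
        Map.entry("tif", Extractor.TESSERACT),
        Map.entry("txt", Extractor.NOT_APPLICABLE),      // assumed extension
        Map.entry("md", Extractor.NOT_APPLICABLE),       // assumed extension
        Map.entry("pdf", Extractor.PDFBOX_OR_TESSERACT)  // assumed extension
    );

    // Returns the extractor for an extension, or empty for unsupported types.
    static Optional<Extractor> forExtension(String extension) {
        return Optional.ofNullable(BY_EXTENSION.get(extension.toLowerCase(Locale.ROOT)));
    }
}
```

Returning an empty `Optional` for unknown extensions keeps "unsupported" distinct from "no extraction needed", which the list marks as n.a. for text and markdown.
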
## Links

* [Convert HTML Files](0007_convert_html_files)
* [Convert Plain Text](0008_convert_plain_text)
* [Convert Office Documents](0009_convert_office_docs)
* [Convert Image Files](0010_convert_image_files)
* [Extract Text from Files](0011_extract_text)
77
modules/microsite/docs/dev/adr/0011_extract_text.md
Normal file
@@ -0,0 +1,77 @@
---
layout: docs
title: Extract Text from Files
---

# Extract Text from Files

## Context and Problem Statement

With support for more file types there must be a way to extract text
from all of them. It is better to extract text from the source files,
rather than extracting the text from the converted pdf file.

There are multiple options and multiple file types. Again, the top
priority is to use a java/scala library to reduce external
dependencies.

## Considered Options

### MS Office Documents

There is only one library I know of: [Apache
POI](https://poi.apache.org/). It supports `doc(x)` and `xls(x)`.
However, it doesn't support open-document format (odt and ods).

### OpenDocument Format

There are two libraries:

- [Apache Tika Parser](https://tika.apache.org/)
- [ODFToolkit](https://github.com/tdf/odftoolkit)

*Tika:* The tika-parsers package contains an opendocument parser for
extracting text. But it has a huge dependency tree, since it is a
super-package containing a parser for almost every common file type.

*ODF Toolkit:* This depends on [Apache Jena](https://jena.apache.org)
and also pulls in quite a few dependencies (though not as many as
tika-parsers). It is not too bad, since it is a library for
manipulating opendocument files. But all I need is to extract
text. I created tests that extracted text from my odt/ods files. It
worked at first sight, but running the tests in a loop resulted in
strange null pointer exceptions (it only worked on the first run).

### Richtext

Richtext is supported by the jdk (using `RTFEditorKit` from
swing).

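
As a minimal sketch of the JDK route, the class that ships with swing for this purpose is `javax.swing.text.rtf.RTFEditorKit`. The example below reads an RTF stream into a swing document model and returns its plain text. The class name `RtfText`, the `extract` helper and the sample RTF string are made up for illustration and are not docspell code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

// Illustrative helper; the name is hypothetical.
final class RtfText {

    // Parses the RTF stream with the JDK editor kit and returns the text content.
    static String extract(InputStream rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(rtf, doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException("unexpected document offset", e);
        }
    }

    public static void main(String[] args) {
        // A tiny hand-written RTF document used only as sample input.
        String sample = "{\\rtf1\\ansi{\\fonttbl{\\f0 Arial;}}\\f0 Hello World\\par}";
        InputStream in = new ByteArrayInputStream(sample.getBytes(StandardCharsets.US_ASCII));
        System.out.println(extract(in).trim());
    }
}
```

No external dependency is needed for this, which matches the stated preference for jvm internal libraries.
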
### PDF

For "image" PDF files, tesseract is used. For "text" PDF files, the
library [Apache PDFBox](https://pdfbox.apache.org) can be used.

There is also [iText](https://github.com/itext/itext7) with an AGPL
license.

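
The text does not say how a "text" PDF is told apart from an "image" PDF. One possible approach, shown here purely as an assumption, is to try the embedded text first and fall back to OCR when too little comes back. In this Java sketch the two extractors are passed in as plain functions standing in for PDFBox and tesseract. `PdfTextStrategy`, `extract` and the `minChars` threshold are hypothetical and not docspell code.

```java
import java.util.function.Function;

// Illustrative only: names and the length heuristic are assumptions.
final class PdfTextStrategy {

    // Uses the embedded text if it has at least minChars non-blank characters,
    // otherwise treats the file as an "image" PDF and runs the OCR function.
    static String extract(byte[] pdf,
                          Function<byte[], String> textLayer,
                          Function<byte[], String> ocr,
                          int minChars) {
        String text = textLayer.apply(pdf);
        if (text != null && text.trim().length() >= minChars) {
            return text;
        }
        return ocr.apply(pdf);
    }
}
```

Passing the extractors as functions keeps the decision logic testable without either library on the classpath.
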
### Images

For images and "image" PDF files, there is already tesseract in place.

### HTML

HTML must be converted into a PDF file before text can be extracted.

### Text/Markdown

These files can be used as-is, obviously.


## Decision Outcome

- MS Office files: POI library
- Open Document files: Tika, but integrating only the few source files that
  make up the open document parser. Due to its huge dependency tree,
  the library itself is not added as a dependency.
- PDF: Apache PDFBox. I know this library better than iText.
BIN
modules/microsite/docs/dev/adr/img/process-files.png
Normal file
Binary file not shown.
After: Size: 49 KiB
43
modules/microsite/docs/dev/adr/process-files.puml
Normal file
@@ -0,0 +1,43 @@
@startuml
scale 1200 width
title: Processing Files
skinparam monochrome true
skinparam backgroundColor white
skinparam rectangle {
  roundCorner<<Input>> 25
  roundCorner<<Output>> 5
}
rectangle Input <<Input>> {
  file "html"
  file "plaintext"
  file "image"
  file "msoffice"
  file "rtf"
  file "odf"
  file "pdf"
}

node toBoth [
  PDF + TXT
]
node toPdf [
  PDF
]
node toTxt [
  TXT
]

image --> toBoth:<tesseract>
html --> toPdf:<wkhtmltopdf>
toPdf --> toTxt:[pdfbox]
plaintext --> html:[flexmark]
msoffice --> toPdf:<unoconv>
msoffice --> toTxt:[poi]
rtf --> toTxt:[jdk]
rtf --> toPdf:<unoconv>
odf --> toTxt:[tika]
odf --> toPdf:<unoconv>
pdf --> toTxt:<tesseract>
pdf --> toTxt:[pdfbox]
plaintext -> toTxt:[identity]
@enduml