Adding extraction primitives

Eike Kettner
2020-02-16 21:37:26 +01:00
parent 851ee7ef0f
commit 8143a4edcc
46 changed files with 2731 additions and 83 deletions


@ -112,7 +112,7 @@ If conversion is not supported for the input file, it is skipped. If
conversion fails, the error is propagated to let the retry mechanism
take care.
### What types?
#### What types?
Which file types should be supported? As a first step, all major
office documents, common images, plain text (e.g. markdown) and html
@ -123,6 +123,12 @@ There is always the preference to use jvm internal libraries in order
to be more platform independent and to reduce external dependencies.
But this is not always possible (like doing OCR).
<div class="thumbnail">
<img src="./img/process-files.png" title="Overview processing files">
</div>
#### Conversion
- Office documents (`doc`, `docx`, `xls`, `xlsx`, `odt`, `ods`):
unoconv (see [ADR 9](0009_convert_office_docs))
- HTML (`html`): wkhtmltopdf (see [ADR 7](0007_convert_html_files))
@ -130,9 +136,19 @@ But this is not always possible (like doing OCR).
- Images (`jpg`, `png`, `tif`): Tesseract (see [ADR
10](0010_convert_image_files))
#### Text Extraction
- Office documents (`doc`, `docx`, `xls`, `xlsx`): Apache POI
- Office documents (`odt`, `ods`): Apache Tika (including the sources)
- HTML: not supported, extract text from converted PDF
- Images (`jpg`, `png`, `tif`): Tesseract
- Text/Markdown: n.a.
- PDF: Apache PDFBox or Tesseract
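
The list above can be read as a lookup from file extension to extraction approach. The following is a minimal sketch of that lookup only, not the project's implementation: the class name `ExtractorChoice`, the string labels, and the `txt`/`md` extensions are assumptions made for illustration.

```java
import java.util.Locale;
import java.util.Map;

final class ExtractorChoice {
    // Extension -> extraction approach, transcribed from the list above.
    // The labels are hypothetical names, not identifiers from the code base.
    private static final Map<String, String> BY_EXTENSION = Map.ofEntries(
        Map.entry("doc", "poi"),
        Map.entry("docx", "poi"),
        Map.entry("xls", "poi"),
        Map.entry("xlsx", "poi"),
        Map.entry("odt", "tika"),
        Map.entry("ods", "tika"),
        Map.entry("jpg", "tesseract"),
        Map.entry("png", "tesseract"),
        Map.entry("tif", "tesseract"),
        Map.entry("pdf", "pdfbox-or-tesseract"),
        Map.entry("html", "via-converted-pdf"),
        Map.entry("txt", "as-is"),   // assumed extension for plain text
        Map.entry("md", "as-is")     // assumed extension for markdown
    );

    /** Returns the approach for a file name, or "unsupported" if unknown. */
    static String forFile(String fileName) {
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        return BY_EXTENSION.getOrDefault(ext, "unsupported");
    }

    public static void main(String[] args) {
        String[] names = {"report.DOCX", "sheet.ods", "scan.tif", "notes.md", "archive.zip"};
        for (String name : names) {
            System.out.println(name + " -> " + forFile(name));
        }
    }
}
```

A real implementation would more likely dispatch on the detected MIME type than on the file extension; the extension is used here only to keep the sketch short.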
## Links
* [Convert HTML Files](0007_convert_html_files)
* [Convert Plain Text](0008_convert_plain_text)
* [Convert Office Documents](0009_convert_office_docs)
* [Convert Image Files](0010_convert_image_files)
* [Extract Text from Files](0011_extract_text)


@ -0,0 +1,77 @@
---
layout: docs
title: Extract Text from Files
---
# Extract Text from Files
## Context and Problem Statement
With support for more file types, there must be a way to extract text
from all of them. Extracting text from the source file is preferable
to extracting it from the converted PDF file.
There are multiple options and multiple file types. Again, the
priority is to use a Java/Scala library to reduce external
dependencies.
## Considered Options
### MS Office Documents
I know of only one library: [Apache
POI](https://poi.apache.org/). It supports `doc(x)` and `xls(x)`.
However, it doesn't support the OpenDocument format (`odt` and `ods`).
### OpenDocument Format
There are two libraries:
- [Apache Tika Parser](https://tika.apache.org/)
- [ODFToolkit](https://github.com/tdf/odftoolkit)
*Tika:* The tika-parsers package contains an OpenDocument parser for
extracting text. But it has a huge dependency tree, since it is a
super-package containing a parser for almost every common file type.
*ODF Toolkit:* This depends on [Apache Jena](https://jena.apache.org)
and also pulls in quite a few dependencies (though not as many as
tika-parsers). That is understandable, since it is a library for
manipulating OpenDocument files, but all I need is to extract
text. I created tests that extracted text from my odt/ods files. It
worked at first sight, but running the tests in a loop resulted in
strange null pointer exceptions (only the first run worked).
### Richtext
Richtext is supported by the JDK (using `RTFEditorKit` from
swing).
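
As a rough illustration of the JDK route, the sketch below reads RTF markup with `javax.swing.text.rtf.RTFEditorKit` and returns the plain text of the resulting document. The class name `RtfText`, the `extract` helper, and the sample input are assumptions made for this example, not code from the commit.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import javax.swing.text.BadLocationException;
import javax.swing.text.Document;
import javax.swing.text.rtf.RTFEditorKit;

final class RtfText {
    /** Parses the given RTF bytes and returns the document's plain text. */
    static String extract(byte[] rtf) {
        RTFEditorKit kit = new RTFEditorKit();
        // The default document of a styled editor kit is a styled document,
        // which the kit needs in order to parse the input as RTF.
        Document doc = kit.createDefaultDocument();
        try {
            kit.read(new ByteArrayInputStream(rtf), doc, 0);
            return doc.getText(0, doc.getLength());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (BadLocationException e) {
            throw new IllegalStateException("unexpected document offset", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample input: one paragraph with a bold word.
        String rtf = "{\\rtf1\\ansi{\\fonttbl\\f0\\fswiss Helvetica;}"
            + "\\f0\\pard Hello from {\\b RTF}.\\par}";
        System.out.println(extract(rtf.getBytes(StandardCharsets.US_ASCII)).trim());
    }
}
```

No Swing component is created here; only the editor kit and its document model are used.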
### PDF
For "image" PDF files, tesseract is used. For "text" PDF files, the
library [Apache PDFBox](https://pdfbox.apache.org) can be used.
There is also [iText](https://github.com/itext/itext7) with an AGPL
license.
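
One way to combine the two PDF paths is to try the text layer first and fall back to OCR when nothing comes back. This is only a sketch under that assumption: the text does not say how "image" and "text" PDFs are told apart, and `PdfTextStrategy`, `textLayer` and `ocr` are hypothetical stand-ins for PDFBox and tesseract, which are not called here.

```java
import java.util.function.Function;

final class PdfTextStrategy {
    /**
     * Returns the text from the PDF's text layer, or the OCR result
     * when the text layer yields nothing but whitespace.
     */
    static String extract(byte[] pdf,
                          Function<byte[], String> textLayer,
                          Function<byte[], String> ocr) {
        String text = textLayer.apply(pdf);
        if (text != null && !text.trim().isEmpty()) {
            return text;
        }
        // Assumption: an empty text layer means a scanned ("image") PDF.
        return ocr.apply(pdf);
    }

    public static void main(String[] args) {
        byte[] dummy = new byte[0];
        // Stand-in extractors: the first simulates a "text" PDF,
        // the second a scanned PDF without a text layer.
        System.out.println(extract(dummy, p -> "from text layer", p -> "from ocr"));
        System.out.println(extract(dummy, p -> "   ", p -> "from ocr"));
    }
}
```

Passing the extractors in as functions keeps the decision logic testable without either external tool being installed.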
### Images
For images and "image" PDF files, there is already tesseract in place.
### HTML
HTML must be converted into a PDF file before text can be extracted.
### Text/Markdown
These files can be used as-is, obviously.
## Decision Outcome
- MS Office files: POI library
- Open Document files: Tika, but only by integrating the few source files
that make up the OpenDocument parser. The library itself is not added
because of its huge dependency tree.
- PDF: Apache PDFBox. I know this library better than iText.

Binary file not shown (size: 49 KiB).


@ -0,0 +1,43 @@
@startuml
scale 1200 width
title: Processing Files
skinparam monochrome true
skinparam backgroundColor white
skinparam rectangle {
roundCorner<<Input>> 25
roundCorner<<Output>> 5
}
rectangle Input <<Input>> {
file "html"
file "plaintext"
file "image"
file "msoffice"
file "rtf"
file "odf"
file "pdf"
}
node toBoth [
PDF + TXT
]
node toPdf [
PDF
]
node toTxt [
TXT
]
image --> toBoth:<tesseract>
html --> toPdf:<wkhtmltopdf>
toPdf --> toTxt:[pdfbox]
plaintext --> html:[flexmark]
msoffice --> toPdf:<unoconv>
msoffice --> toTxt:[poi]
rtf --> toTxt:[jdk]
rtf --> toPdf:<unoconv>
odf --> toTxt:[tika]
odf --> toPdf:<unoconv>
pdf --> toTxt:<tesseract>
pdf --> toTxt:[pdfbox]
plaintext -> toTxt:[identity]
@enduml