Here is a detailed table of Apache PDFBox features, with a focus on extracting text and its format, along with whether it supports these functionalities directly.
Feature Category | Functionality | Supported by PDFBox | Description |
---|---|---|---|
Text Extraction | Extract plain text | âś… Yes | Use PDFTextStripper to extract the plain text from the PDF. |
Extract text with font and style | âś… Partial | PDFTextStripper can retrieve font names, sizes, and positions but does not directly support advanced formatting like bold/italic. | |
Extract text color | âś… Partial | With custom parsing of PDF graphics operators, text color can be retrieved. | |
Extract text alignment | âś… Partial | Requires analyzing the text position relative to the page layout. | |
Extract text with line breaks | âś… Yes | Preserves line breaks and text flow based on PDF structure. | |
Extract text from a specific region | âś… Yes | Use PDFTextStripperByArea to define and extract text from a bounding box. | |
Extract Unicode text | âś… Yes | Full support for Unicode text, including multi-language PDFs. | |
Font and Style Extraction | Extract font names | âś… Yes | PDFont objects provide font names for text. |
Extract font sizes | âś… Yes | Use PDFont and text transformation matrices to calculate font size. | |
Extract bold/italic formatting | ❌ No | PDFBox does not directly support detecting bold/italic styles; you can infer this from font names or weights. | |
Graphical and Layout Analysis | Extract text positions (coordinates) | âś… Yes | Text coordinates can be retrieved using TextPosition objects. |
Extract text order and reading flow | âś… Partial | Reading order depends on the logical structure in the PDF. It may require custom processing for complex layouts. | |
Extract paragraphs and headers | ❌ No | Requires custom logic as PDFBox does not inherently understand semantic structures like paragraphs or headers. | |
Extract page-level text and layout | âś… Yes | Extract text along with its position on each page. | |
Metadata and Extras | Extract metadata about fonts | âś… Yes | PDFontDescriptor provides detailed metadata about fonts (e.g., embedded fonts, styles). |
Extract text with hyperlinks | âś… Partial | Hyperlinks can be extracted by analyzing annotations and their associated text. |
Does PDFBox Support Extracting Text Along with Format?
- Partially Supported:
- PDFBox can extract basic text formatting like font size, font name, and positions using
PDFTextStripper
andTextPosition
. - Advanced formatting (e.g., bold, italic) and semantic structures (e.g., paragraphs, headers) are not directly supported but can be inferred with custom logic.
- PDFBox can extract basic text formatting like font size, font name, and positions using
How to Extract Text Along with Format?
Here’s how you can extract text and basic formatting using PDFBox
:
Step 1: Use PDFTextStripper
This extracts the text while preserving positions.
Step 2: Override writeString()
Override this method to capture text details like font, size, and position.
Example Code:
import org.apache.pdfbox.pdmodel.PDDocument
import org.apache.pdfbox.text.PDFTextStripper
import org.apache.pdfbox.text.TextPosition
class TextWithFormatExtractor : PDFTextStripper() {
override fun writeString(text: String, textPositions: List<TextPosition>) {
for (textPosition in textPositions) {
println(
"Text: ${textPosition.unicode}, " +
"Font: ${textPosition.font.name}, " +
"Font Size: ${textPosition.fontSizeInPt}, " +
"X: ${textPosition.xDirAdj}, Y: ${textPosition.yDirAdj}"
)
}
}
}
fun extractTextWithFormat(pdfPath: String) {
val document = PDDocument.load(File(pdfPath))
val stripper = TextWithFormatExtractor()
stripper.setSortByPosition(true)
stripper.getText(document)
document.close()
}
Limitations
-
Complex Formatting:
- Bold/italic styles are not inherently stored in the text but depend on font characteristics (e.g., names or weights).
- You may need to analyze font names like
Times-Bold
to detect bold styles.
-
Semantic Structures:
- PDFs do not store paragraphs or headers as structured data; they are treated as a collection of text objects.
Next Steps
- Enhance the extractor:
- Use custom logic to group text into paragraphs or detect headers based on font size and position.
- Use PDFBox Utilities:
- Combine
PDFTextStripper
with layout or graphics analysis for advanced formatting.
- Combine
Let me know if you’d like to implement a specific enhancement!