Package org.apache.pdfbox.util
Class PDFMarkedContentExtractor
java.lang.Object
org.apache.pdfbox.util.PDFStreamEngine
org.apache.pdfbox.util.PDFMarkedContentExtractor
This is a stream engine that extracts the marked content of a PDF.
- Version: $Revision$
- Author: koch
Field Summary
Fields
protected String outputEncoding
- The encoding that text will be written in (or null).
Constructor Summary
Constructors
PDFMarkedContentExtractor()
- Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(String encoding)
- Instantiate a new PDFMarkedContentExtractor object.
PDFMarkedContentExtractor(Properties props)
- Instantiate a new PDFMarkedContentExtractor object.
Method Summary
Methods
void beginMarkedContentSequence(COSName tag, COSDictionary properties)
void endMarkedContentSequence()
List<PDMarkedContent> getMarkedContents()
protected void processTextPosition(TextPosition text)
- This will process a TextPosition object and add the text to the list of characters on a page.
void xobject(PDXObject xobject)
Methods inherited from class org.apache.pdfbox.util.PDFStreamEngine
getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getTotalCharCnt, getValidCharCnt, getXObjects, inspectFontEncoding, isForceParsing, processEncodedText, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setForceParsing, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix
Field Details
outputEncoding
protected String outputEncoding
The encoding that text will be written in (or null).
Constructor Details
PDFMarkedContentExtractor
public PDFMarkedContentExtractor() throws IOException
Instantiate a new PDFMarkedContentExtractor object. This object will load properties from PDFMarkedContentExtractor.properties and will not convert the text to a more encoding-specific output.
- Throws:
IOException
- If there is an error loading the properties.
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(Properties props) throws IOException
Instantiate a new PDFMarkedContentExtractor object, loading all of the operator mappings from the properties object that is passed in. Does not convert the text to a more encoding-specific output.
- Parameters:
props
- The properties containing the mapping of operators to PDFOperator classes.
- Throws:
IOException
- If there is an error reading the properties.
PDFMarkedContentExtractor
public PDFMarkedContentExtractor(String encoding) throws IOException
Instantiate a new PDFMarkedContentExtractor object. This object will load properties from PDFMarkedContentExtractor.properties and will apply encoding-specific conversions to the output text.
- Parameters:
encoding
- The encoding that the output will be written in.
- Throws:
IOException
- If there is an error reading the properties.
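The constructor that accepts props expects a mapping of operators to PDFOperator classes. As a rough illustration of what reading such a mapping with java.util.Properties could look like, here is a minimal, self-contained sketch. The class OperatorMappingSketch, the method loadMappings, and the handler class names are hypothetical and are not part of PDFBox; only the operator names BMC and EMC (begin and end marked content) come from the PDF format itself. The real contents of PDFMarkedContentExtractor.properties are not reproduced here.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.Properties;
import java.util.TreeMap;

// Hypothetical helper for illustration only; not a PDFBox class.
final class OperatorMappingSketch {

    /** Parses "operator=className" lines into a map sorted by operator name. */
    static Map<String, String> loadMappings(String propertiesText) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(propertiesText));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail; rethrow unchecked.
            throw new UncheckedIOException(e);
        }
        Map<String, String> mappings = new TreeMap<>();
        for (String operator : props.stringPropertyNames()) {
            mappings.put(operator, props.getProperty(operator).trim());
        }
        return mappings;
    }

    public static void main(String[] args) {
        // Assumed example entries; the handler class names are made up.
        String text = "BMC=example.BeginMarkedContentHandler\n"
                    + "EMC=example.EndMarkedContentHandler\n";
        for (Map.Entry<String, String> entry : loadMappings(text).entrySet()) {
            System.out.println(entry.getKey() + " -> " + entry.getValue());
        }
    }
}
```

A Properties object built this way is the kind of argument the props constructor describes, although which operators and classes it must contain is determined by PDFBox, not by this sketch.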
Method Details
beginMarkedContentSequence
public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
endMarkedContentSequence
public void endMarkedContentSequence()
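Marked-content sequences in a content stream can nest, so the begin and end callbacks arrive as matched pairs. One plausible way to track that nesting is with a stack, sketched below. This is an illustration under stated assumptions, not PDFBox's implementation: Node, begin, end, and getRoots are hypothetical names, and a plain String stands in for the COSName tag.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical model of nested marked-content tracking; not a PDFBox class.
final class MarkedContentNestingSketch {

    /** Stand-in for one marked-content sequence and its nested children. */
    static final class Node {
        final String tag;
        final List<Node> children = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    private final List<Node> roots = new ArrayList<>();
    private final Deque<Node> open = new ArrayDeque<>();

    /** Opens a sequence: nest it under the innermost open one, or record it as a root. */
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            roots.add(node);
        } else {
            open.peek().children.add(node);
        }
        open.push(node);
    }

    /** Closes the innermost open sequence; an unmatched end is ignored. */
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    List<Node> getRoots() {
        return roots;
    }

    public static void main(String[] args) {
        MarkedContentNestingSketch sketch = new MarkedContentNestingSketch();
        sketch.begin("P");
        sketch.begin("Span");
        sketch.end();
        sketch.end();
        for (Node root : sketch.getRoots()) {
            System.out.println(root.tag + " with " + root.children.size() + " nested sequence(s)");
        }
    }
}
```

Ignoring an unmatched end is a design choice made for this sketch so that a malformed stream does not throw; a stricter variant could report it instead.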
xobject
processTextPosition
protected void processTextPosition(TextPosition text)
This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
- Overrides:
processTextPosition in class PDFStreamEngine
- Parameters:
text
- The text to process.
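The description of processTextPosition says it takes care of overlapping text, which typically arises when a PDF draws the same glyph twice at nearly the same position (for example, to simulate bold). A minimal sketch of one way to suppress such duplicates follows. It is an assumption-laden illustration, not the PDFBox algorithm: OverlapFilterSketch, Glyph, add, and the fixed tolerance are all hypothetical, and plain x/y floats stand in for a TextPosition.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical duplicate-glyph filter for illustration; not a PDFBox class.
final class OverlapFilterSketch {

    /** Stand-in for a positioned piece of text. */
    static final class Glyph {
        final String text;
        final float x;
        final float y;

        Glyph(String text, float x, float y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    private final Map<String, List<Glyph>> seenByText = new HashMap<>();
    private final List<Glyph> accepted = new ArrayList<>();
    private final float tolerance;

    OverlapFilterSketch(float tolerance) {
        this.tolerance = tolerance;
    }

    /** Returns false and drops the glyph if the same text already sits within tolerance. */
    boolean add(String text, float x, float y) {
        List<Glyph> sameText = seenByText.get(text);
        if (sameText == null) {
            sameText = new ArrayList<>();
            seenByText.put(text, sameText);
        }
        for (Glyph earlier : sameText) {
            if (Math.abs(earlier.x - x) < tolerance && Math.abs(earlier.y - y) < tolerance) {
                return false;
            }
        }
        Glyph glyph = new Glyph(text, x, y);
        sameText.add(glyph);
        accepted.add(glyph);
        return true;
    }

    /** Concatenates the accepted glyphs in the order they were added. */
    String acceptedText() {
        StringBuilder sb = new StringBuilder();
        for (Glyph glyph : accepted) {
            sb.append(glyph.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        OverlapFilterSketch filter = new OverlapFilterSketch(0.5f);
        filter.add("A", 10.0f, 20.0f);
        filter.add("A", 10.2f, 20.1f); // near-duplicate of the first glyph
        filter.add("B", 16.0f, 20.0f);
        System.out.println(filter.acceptedText());
    }
}
```

Grouping earlier glyphs by their text keeps each lookup limited to candidates that could actually be duplicates, so the filter does not compare every new glyph against every glyph on the page.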
getMarkedContents