Class PDFMarkedContentExtractor

java.lang.Object
org.apache.pdfbox.util.PDFStreamEngine
org.apache.pdfbox.util.PDFMarkedContentExtractor

public class PDFMarkedContentExtractor extends PDFStreamEngine
This is a stream engine that extracts the marked content of a PDF.
Author:
koch
  • Field Details

    • outputEncoding

      protected String outputEncoding
      The encoding that text will be written in, or null if no encoding is set.
  • Constructor Details

    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor() throws IOException
      Instantiates a new PDFMarkedContentExtractor object. This object loads its properties from PDFMarkedContentExtractor.properties and does not apply any encoding-specific conversion to the output text.
      Throws:
      IOException - If there is an error loading the properties.
    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor(Properties props) throws IOException
      Instantiates a new PDFMarkedContentExtractor object, loading all of the operator mappings from the properties object that is passed in. It does not apply any encoding-specific conversion to the output text.
      Parameters:
      props - The properties containing the mapping of operators to PDFOperator classes.
      Throws:
      IOException - If there is an error reading the properties.
    • PDFMarkedContentExtractor

      public PDFMarkedContentExtractor(String encoding) throws IOException
      Instantiates a new PDFMarkedContentExtractor object. This object loads its properties from PDFMarkedContentExtractor.properties and applies encoding-specific conversions to the output text.
      Parameters:
      encoding - The encoding that the output will be written in.
      Throws:
      IOException - If there is an error reading the properties.
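
The Properties-based constructor expects a mapping from content stream operators to handler class names. The following is a minimal sketch of how such a mapping might be assembled with java.util.Properties. BMC, BDC and EMC are the PDF marked-content operators; the handler class names (com.example.pdf...) and the class OperatorMappingSketch are hypothetical placeholders, not the entries of the bundled PDFMarkedContentExtractor.properties file.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Sketch only: builds an operator -> handler-class mapping of the kind
// the Properties constructor takes. All class names are hypothetical.
class OperatorMappingSketch {

    // Parses "operator=className" lines into a Properties object.
    static Properties loadMappings(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            // Reading from an in-memory string should not fail in practice.
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
            "# operator = handler class (hypothetical names)",
            "BMC=com.example.pdf.BeginMarkedContent",
            "BDC=com.example.pdf.BeginMarkedContentWithProperties",
            "EMC=com.example.pdf.EndMarkedContent");

        Properties props = loadMappings(text);
        for (String operator : props.stringPropertyNames()) {
            System.out.println(operator + " -> " + props.getProperty(operator));
        }

        // With PDFBox on the classpath, the mapping would then be passed on:
        // PDFMarkedContentExtractor extractor = new PDFMarkedContentExtractor(props);
    }
}
```

Passing an explicit mapping is mainly useful when the default operator handling should be replaced or narrowed; otherwise the no-argument constructor loads the bundled mapping.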
  • Method Details

    • beginMarkedContentSequence

      public void beginMarkedContentSequence(COSName tag, COSDictionary properties)
    • endMarkedContentSequence

      public void endMarkedContentSequence()
    • xobject

      public void xobject(PDXObject xobject)
    • processTextPosition

      protected void processTextPosition(TextPosition text)
      This will process a TextPosition object and add the text to the list of characters on a page. It takes care of overlapping text.
      Overrides:
      processTextPosition in class PDFStreamEngine
      Parameters:
      text - The text to process.
    • getMarkedContents

      public List<PDMarkedContent> getMarkedContents()
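
The documentation above does not describe how beginMarkedContentSequence, endMarkedContentSequence, processTextPosition and getMarkedContents interact. One plausible reading, sketched below under that assumption, is that begin and end calls are paired and may nest, so open sequences can be tracked on a stack while text is attached to the innermost open sequence. The class MarkedContentNestingSketch and its Node type are hypothetical stand-ins that use plain strings in place of COSName, TextPosition and PDMarkedContent; this is not the library's implementation.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Sketch only: models nested marked-content sequences with a stack.
class MarkedContentNestingSketch {

    // Hypothetical stand-in for PDMarkedContent: a tag plus ordered contents
    // (either String text or nested Node instances).
    static final class Node {
        final String tag;
        final List<Object> contents = new ArrayList<>();

        Node(String tag) {
            this.tag = tag;
        }
    }

    private final List<Node> topLevel = new ArrayList<>();
    private final Deque<Node> open = new ArrayDeque<>();

    // Opens a sequence; it becomes a child of the innermost open sequence,
    // or a top-level entry if nothing is open.
    void begin(String tag) {
        Node node = new Node(tag);
        if (open.isEmpty()) {
            topLevel.add(node);
        } else {
            open.peek().contents.add(node);
        }
        open.push(node);
    }

    // Closes the innermost open sequence; an unmatched end is ignored.
    void end() {
        if (!open.isEmpty()) {
            open.pop();
        }
    }

    // Attaches text to the innermost open sequence; text outside any
    // sequence is dropped in this sketch.
    void text(String s) {
        if (!open.isEmpty()) {
            open.peek().contents.add(s);
        }
    }

    List<Node> getMarkedContents() {
        return topLevel;
    }

    public static void main(String[] args) {
        MarkedContentNestingSketch sketch = new MarkedContentNestingSketch();
        sketch.begin("P");
        sketch.text("Hello ");
        sketch.begin("Span");
        sketch.text("world");
        sketch.end();
        sketch.end();

        Node paragraph = sketch.getMarkedContents().get(0);
        System.out.println(paragraph.tag + " holds " + paragraph.contents.size() + " items");
    }
}
```

A stack keeps each operation constant-time and mirrors the way marked-content operators bracket content in a PDF content stream.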