Class PDFParser

java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
org.apache.pdfbox.pdfparser.PDFParser
Direct Known Subclasses:
NonSequentialPDFParser

public class PDFParser extends BaseParser
This class handles the parsing of a PDF document.
Version:
$Revision: 1.53 $
Author:
Ben Litchfield
  • Field Details

    • isFDFDocment

      protected boolean isFDFDocment
    • xrefTrailerResolver

      protected XrefTrailerResolver xrefTrailerResolver
      Collects all xref/trailer objects and resolves them into a single object using the startxref reference.
  • Constructor Details

    • PDFParser

      public PDFParser(InputStream input) throws IOException
      Constructor.
      Parameters:
      input - The input stream that contains the PDF document.
      Throws:
      IOException - If there is an error initializing the stream.
    • PDFParser

      public PDFParser(InputStream input, RandomAccess rafi) throws IOException
      Constructor to allow control over RandomAccessFile.
      Parameters:
      input - The input stream that contains the PDF document.
      rafi - The RandomAccessFile to be used in the internal COSDocument.
      Throws:
      IOException - If there is an error initializing the stream.
    • PDFParser

      public PDFParser(InputStream input, RandomAccess rafi, boolean force) throws IOException
      Constructor to allow control over RandomAccessFile. Also lets the parser skip corrupt objects in an attempt to force parsing to complete.
      Parameters:
      input - The input stream that contains the PDF document.
      rafi - The RandomAccessFile to be used in the internal COSDocument.
      force - When true, the parser skips corrupt PDF objects and continues parsing at the next object in the file.
      Throws:
      IOException - If there is an error initializing the stream.
  • Method Details

    • setTempDirectory

      public void setTempDirectory(File tmpDir)
      Sets the directory in which PDFBox creates the temporary file that stores PDF document streams. By default this directory is the value of the system property java.io.tmpdir.
      Parameters:
      tmpDir - The directory in which to create the scratch files needed to store PDF document streams.
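      The default described above can be sketched in plain Java. This is only an illustration of the fallback rule, not PDFBox code: the class ScratchDirSketch and the helper resolveScratchDir are hypothetical names introduced here.

```java
import java.io.File;

class ScratchDirSketch {
    // Hypothetical helper (not part of PDFBox): chooses the scratch directory.
    static File resolveScratchDir(File tmpDir) {
        if (tmpDir != null) {
            // An explicitly configured directory takes precedence.
            return tmpDir;
        }
        // Otherwise fall back to the JVM-wide temporary directory.
        return new File(System.getProperty("java.io.tmpdir"));
    }
}
```

      Passing null models the case where setTempDirectory was never called.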
    • isContinueOnError

      protected boolean isContinueOnError(Exception e)
      Returns true if parsing should be continued. By default, forceParsing is returned. This can be overridden to add application-specific handling (for example, to stop parsing when the number of exceptions thrown exceeds a certain limit).
      Parameters:
      e - The exception, if available. Can be null if no exception is available.
      Returns:
      true if parsing could be continued, otherwise false
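      The "stop after too many exceptions" policy mentioned above might look like the following stand-alone sketch. ErrorBudgetSketch and maxErrors are hypothetical names, and the class deliberately does not extend PDFParser so that it compiles without PDFBox. In a real subclass the same check would presumably live in an override of isContinueOnError.

```java
class ErrorBudgetSketch {
    private final boolean forceParsing; // stands in for the parser's force flag
    private final int maxErrors;        // hypothetical limit, not a PDFBox setting
    private int errorCount;

    ErrorBudgetSketch(boolean forceParsing, int maxErrors) {
        this.forceParsing = forceParsing;
        this.maxErrors = maxErrors;
    }

    // Same shape as isContinueOnError(Exception): true means "keep parsing".
    boolean isContinueOnError(Exception e) {
        if (e != null) {
            errorCount++; // only real exceptions consume the budget
        }
        return forceParsing && errorCount <= maxErrors;
    }
}
```

      With a limit of 2, the first two exceptions are tolerated and the third stops parsing. With forceParsing set to false, the method never allows parsing to continue.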
    • parse

      public void parse() throws IOException
      This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.
      Throws:
      IOException - If there is an error reading from the stream or corrupt data is found.
    • parseHeader

      protected void parseHeader() throws IOException
      Throws:
      IOException
    • getDocument

      public COSDocument getDocument() throws IOException
      This will get the document that was parsed. parse() must be called before this is called. When you are done with this document you must call close() on it to release resources.
      Returns:
      The document that was parsed.
      Throws:
      IOException - If there is an error getting the document.
    • getPDDocument

      public PDDocument getPDDocument() throws IOException
      This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources.
      Returns:
      The document at the PD layer.
      Throws:
      IOException - If there is an error getting the document.
    • getFDFDocument

      public FDFDocument getFDFDocument() throws IOException
      This will get the FDF document that was parsed. When you are done with this document you must call close() on it to release resources.
      Returns:
      The document at the FDF layer.
      Throws:
      IOException - If there is an error getting the document.
    • parseStartXref

      protected boolean parseStartXref() throws IOException
      This will parse the startxref section from the stream. The startxref value is ignored.
      Returns:
      false on parsing error
      Throws:
      IOException - If an IO error occurs.
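      For orientation, a startxref section in a PDF file consists of the keyword startxref, the byte offset of the last cross-reference section, and the %%EOF marker. The sketch below reads that offset from a byte array. It illustrates the file layout only and is not how this class is implemented (as noted above, this method ignores the value). StartXrefSketch and findStartXref are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

class StartXrefSketch {
    // Returns the offset that follows the last "startxref" keyword, or -1 if absent.
    static long findStartXref(byte[] pdf) {
        // ISO-8859-1 maps every byte to one char, so indexes stay byte-accurate.
        String text = new String(pdf, StandardCharsets.ISO_8859_1);
        int keyword = text.lastIndexOf("startxref");
        if (keyword < 0) {
            return -1;
        }
        int pos = keyword + "startxref".length();
        // Skip the end-of-line and any other whitespace after the keyword.
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        int start = pos;
        while (pos < text.length() && text.charAt(pos) >= '0' && text.charAt(pos) <= '9') {
            pos++;
        }
        if (pos == start) {
            return -1; // keyword present but no digits follow
        }
        try {
            return Long.parseLong(text.substring(start, pos));
        } catch (NumberFormatException tooLarge) {
            return -1; // digit run does not fit in a long
        }
    }
}
```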
    • parseXrefTable

      protected boolean parseXrefTable(long startByteOffset) throws IOException
      This will parse the xref table from the stream and add it to the state. The XrefTable contents are ignored.
      Parameters:
      startByteOffset - the offset to start at
      Returns:
      false on parsing error
      Throws:
      IOException - If an IO error occurs.
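      As background, a classic xref table starts with the keyword xref, followed by subsections. Each subsection has a header line "first-object-number entry-count" and then one entry per object: a 10-digit byte offset, a 5-digit generation number, and n (in use) or f (free). The following sketch parses such a table from a string. It is a simplified illustration under those assumptions, not the PDFBox implementation. XrefTableSketch and parse are hypothetical names.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class XrefTableSketch {
    // Returns object number -> byte offset for in-use ("n") entries only.
    static Map<Integer, Long> parse(String table) {
        String[] lines = table.trim().split("\\R");
        if (!lines[0].trim().equals("xref")) {
            throw new IllegalArgumentException("missing xref keyword");
        }
        Map<Integer, Long> offsets = new LinkedHashMap<>();
        int i = 1;
        // Each subsection starts with "<first object number> <entry count>".
        while (i < lines.length && lines[i].trim().matches("\\d+\\s+\\d+")) {
            String[] header = lines[i].trim().split("\\s+");
            int first = Integer.parseInt(header[0]);
            int count = Integer.parseInt(header[1]);
            if (i + count >= lines.length) {
                throw new IllegalArgumentException("truncated xref subsection");
            }
            for (int k = 0; k < count; k++) {
                // Entry layout: offset, generation, then "n" or "f".
                String[] entry = lines[i + 1 + k].trim().split("\\s+");
                if (entry.length != 3) {
                    throw new IllegalArgumentException("malformed xref entry");
                }
                if (entry[2].equals("n")) {
                    offsets.put(first + k, Long.parseLong(entry[0]));
                }
            }
            i += count + 1; // move past the header and its entries
        }
        return offsets;
    }
}
```

      Parsing stops at the first line that is not a subsection header, which in a well-formed file is the trailer keyword.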
    • parseTrailer

      protected boolean parseTrailer() throws IOException
      This will parse the trailer from the stream and add it to the state.
      Returns:
      false on parsing error
      Throws:
      IOException - If an IO error occurs.
    • readVersionInTrailer

      protected void readVersionInTrailer(COSDictionary parsedTrailer)
      The document catalog can also have a /Version entry, which overrides the version specified in the header if, and only if, it is greater.
      Parameters:
      parsedTrailer - the parsed catalog in the trailer
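      The "overrides only if greater" rule can be sketched as a small comparison. This is an illustration that assumes both versions are "major.minor" strings such as 1.4 or 1.7. VersionSketch, effectiveVersion and compare are hypothetical names, not PDFBox methods.

```java
class VersionSketch {
    // Returns the version that applies: the catalog value wins only if it is greater.
    static String effectiveVersion(String headerVersion, String catalogVersion) {
        if (catalogVersion == null) {
            return headerVersion; // no /Version entry in the catalog
        }
        return compare(catalogVersion, headerVersion) > 0 ? catalogVersion : headerVersion;
    }

    // Compares two "major.minor" strings numerically (assumed format).
    static int compare(String a, String b) {
        String[] pa = a.split("\\.");
        String[] pb = b.split("\\.");
        int major = Integer.compare(Integer.parseInt(pa[0]), Integer.parseInt(pb[0]));
        if (major != 0) {
            return major;
        }
        return Integer.compare(Integer.parseInt(pa[1]), Integer.parseInt(pb[1]));
    }
}
```

      A header of 1.4 with a catalog /Version of 1.7 yields 1.7, while a header of 1.6 with a catalog /Version of 1.4 stays at 1.6.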
    • parseXrefStream

      public void parseXrefStream(COSStream stream, long objByteOffset) throws IOException
      Fills the XrefTrailerResolver with data from the given stream. The stream must be of type XRef.
      Parameters:
      stream - the stream to be read
      objByteOffset - the offset to start at
      Throws:
      IOException - if there is an error parsing the stream
    • parseXrefStream

      public void parseXrefStream(COSStream stream, long objByteOffset, boolean isStandalone) throws IOException
      Fills the XrefTrailerResolver with data from the given stream. The stream must be of type XRef.
      Parameters:
      stream - the stream to be read
      objByteOffset - the offset to start at
      isStandalone - should be set to true if the stream is not part of a hybrid xref table
      Throws:
      IOException - if there is an error parsing the stream
    • clearResources

      public void clearResources()
      Release all used resources.
      Overrides:
      clearResources in class BaseParser