Class NonSequentialPDFParser


public class NonSequentialPDFParser extends PDFParser
A PDFParser that first reads startxref and the xref tables to determine which objects are valid, and then parses only those objects. This makes it closer to a conforming parser than the sequential reading done by PDFParser. This class can be used as a PDFParser replacement. parse() must be called before the document or its page objects can be retrieved, e.g. via getPDDocument(). This class is a much enhanced version of the QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
  • Field Details

    • SYSPROP_PARSEMINIMAL

      public static final String SYSPROP_PARSEMINIMAL
    • SYSPROP_EOFLOOKUPRANGE

      public static final String SYSPROP_EOFLOOKUPRANGE
    • DEFAULT_TRAIL_BYTECOUNT

      protected static final int DEFAULT_TRAIL_BYTECOUNT
    • EOF_MARKER

      protected static final char[] EOF_MARKER
      EOF-marker.
    • STARTXREF_MARKER

      protected static final char[] STARTXREF_MARKER
      StartXRef-marker.
    • OBJ_MARKER

      protected static final char[] OBJ_MARKER
      obj-marker.
    • securityHandler

      protected SecurityHandler securityHandler
      The security handler.
    • TMP_FILE_PREFIX

      public static final String TMP_FILE_PREFIX
  • Constructor Details

    • NonSequentialPDFParser

      public NonSequentialPDFParser(String filename) throws IOException
      Constructs parser for given file using memory buffer.
      Parameters:
      filename - the filename of the pdf to be parsed
      Throws:
      IOException - If something went wrong.
    • NonSequentialPDFParser

      public NonSequentialPDFParser(File file, RandomAccess raBuf) throws IOException
      Constructs parser for given file using given buffer for temporary storage.
      Parameters:
      file - the pdf to be parsed
      raBuf - the buffer to be used for parsing
      Throws:
      IOException - If something went wrong.
    • NonSequentialPDFParser

      public NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword) throws IOException
      Constructs parser for given file using given buffer for temporary storage.
      Parameters:
      file - the pdf to be parsed
      raBuf - the buffer to be used for parsing
      decryptionPassword - password to be used for decryption
      Throws:
      IOException - If something went wrong.
    • NonSequentialPDFParser

      public NonSequentialPDFParser(InputStream input) throws IOException
      Constructor.
      Parameters:
      input - input stream representing the pdf.
      Throws:
      IOException - If something went wrong.
    • NonSequentialPDFParser

      public NonSequentialPDFParser(InputStream input, RandomAccess raBuf, String decryptionPassword) throws IOException
      Constructor.
      Parameters:
      input - input stream representing the pdf.
      raBuf - the buffer to be used for parsing
      decryptionPassword - password to be used for decryption.
      Throws:
      IOException - If something went wrong.
  • Method Details

    • setEOFLookupRange

      public void setEOFLookupRange(int byteCount)
      Sets how many trailing bytes of the PDF file are searched for the EOF marker and the 'startxref' marker. If not set, the default value DEFAULT_TRAIL_BYTECOUNT is used.

      The new value is checked to be at least 16. However, for practical use cases it should not be lower than 1000; even 2000 was found to be insufficient in some cases where trailing garbage such as HTML snippets followed the EOF marker.

      If the system property SYSPROP_EOFLOOKUPRANGE is defined, its value is applied on initialization but can be overridden later.

      Parameters:
      byteCount - number of trailing bytes
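The interplay of default, system property and setter described above can be sketched roughly as follows. This is not the PDFBox source: the class name, the property key `example.eofLookupRange` and the default of 2048 are placeholders chosen for illustration, and the sketch assumes that values below 16 are simply ignored, which the description does not spell out.

```java
/** Illustrative sketch only; names and constant values are hypothetical. */
final class EofLookupRangeSketch {

    // Placeholder key; the real key is the value of SYSPROP_EOFLOOKUPRANGE.
    static final String SYSPROP = "example.eofLookupRange";
    // Placeholder default; the real default is DEFAULT_TRAIL_BYTECOUNT.
    static final int DEFAULT_TRAIL_BYTECOUNT = 2048;
    // Minimum accepted value, as stated in the description.
    static final int MIN_TRAIL_BYTECOUNT = 16;

    private int readTrailBytes = DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // Apply the system property once, on initialization.
        String prop = System.getProperty(SYSPROP);
        if (prop != null) {
            try {
                setEOFLookupRange(Integer.parseInt(prop.trim()));
            } catch (NumberFormatException e) {
                // Assumption: an unparsable value leaves the default in place.
            }
        }
    }

    /** Can be called later to override the value taken from the property. */
    void setEOFLookupRange(int byteCount) {
        // Assumption: values below the minimum are ignored.
        if (byteCount >= MIN_TRAIL_BYTECOUNT) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

With this shape, a value passed on the command line via `-D` takes effect for every new instance, while a later call to the setter still wins for that one instance.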
    • initialParse

      protected void initialParse() throws IOException
      The initial parse first parses only the trailer, the startxref entry and all xref tables, so that a pointer (offset) to each of the PDF's objects is available. It can handle linearized PDFs, which have an xref at the end pointing to an xref at the beginning of the file. Finally, the root object is parsed.
      Throws:
      IOException - If something went wrong.
    • setPdfSource

      protected final void setPdfSource(long fileOffset) throws IOException
      Sets BaseParser.pdfSource to start next parsing at given file offset.
      Parameters:
      fileOffset - file offset
      Throws:
      IOException - If something went wrong.
    • releasePdfSourceInputStream

      protected final void releasePdfSourceInputStream() throws IOException
      Enable handling of alternative pdfSource implementation.
      Throws:
      IOException - If something went wrong.
    • getStartxrefOffset

      protected final long getStartxrefOffset() throws IOException
      Looks for and parses startxref. We first look for the last '%%EOF' marker (within the last DEFAULT_TRAIL_BYTECOUNT bytes, or the range set via setEOFLookupRange(int)) and then go back to find startxref.
      Returns:
      the offset of StartXref
      Throws:
      IOException - If something went wrong.
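The lookup described for getStartxrefOffset can be illustrated with a small stand-alone sketch. It works on an in-memory byte array rather than on a file, all class and method names are hypothetical, and it always fails when a marker is missing; how the actual parser reacts in lenient mode is not covered here.

```java
import java.io.IOException;
import java.util.Arrays;

/** Illustrative sketch only; class and method names are hypothetical. */
final class StartxrefSketch {

    static final char[] EOF_MARKER = "%%EOF".toCharArray();
    static final char[] STARTXREF_MARKER = "startxref".toCharArray();

    /** Returns the number following the last 'startxref' before the last '%%EOF'. */
    static long findStartxrefOffset(byte[] file, int trailByteCount) throws IOException {
        // Only the trailing bytes of the file are inspected.
        int tailLen = Math.max(0, Math.min(file.length, trailByteCount));
        byte[] tail = Arrays.copyOfRange(file, file.length - tailLen, file.length);

        int eofIdx = lastIndexOf(EOF_MARKER, tail, tail.length);
        if (eofIdx < 0) {
            throw new IOException("Missing end-of-file marker '%%EOF'");
        }
        // Go back from the EOF marker to find 'startxref'.
        int xrefIdx = lastIndexOf(STARTXREF_MARKER, tail, eofIdx);
        if (xrefIdx < 0) {
            throw new IOException("Missing 'startxref' marker");
        }
        int pos = xrefIdx + STARTXREF_MARKER.length;
        while (pos < eofIdx && isWhitespace(tail[pos])) {
            pos++;
        }
        int digitsStart = pos;
        long offset = 0;
        while (pos < eofIdx && tail[pos] >= '0' && tail[pos] <= '9') {
            offset = offset * 10 + (tail[pos] - '0');
            pos++;
        }
        if (pos == digitsStart) {
            throw new IOException("No offset found after 'startxref'");
        }
        return offset;
    }

    /** Last start offset of pattern that ends before endOff, or -1. */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        for (int start = Math.min(endOff, buf.length) - pattern.length; start >= 0; start--) {
            int i = 0;
            while (i < pattern.length && buf[start + i] == (byte) pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }

    static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }
}
```

The sketch also shows why the lookup range matters: if trailing garbage after '%%EOF' is longer than `trailByteCount`, the marker falls outside the inspected tail and the lookup fails.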
    • lastIndexOf

      protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
      Searches for the last occurrence of pattern within the buffer. The lookup starts just before endOff and goes back until offset 0.
      Parameters:
      pattern - pattern to search for
      buf - buffer to search pattern in
      endOff - offset (exclusive) where lookup starts at
      Returns:
      start offset of pattern within buffer or -1 if pattern could not be found
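A backward search with this contract could look like the following stand-alone sketch. It is a plain brute-force scan written for illustration, with a hypothetical class name, and is not taken from the PDFBox source.

```java
/** Illustrative sketch only; not the PDFBox implementation. */
final class BackwardSearchSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that ends
     * before endOff (exclusive), or -1 if there is no such occurrence.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        if (pattern.length == 0) {
            return -1; // assumption: an empty pattern is never "found"
        }
        // Highest start offset at which the whole pattern still fits before endOff.
        int start = Math.min(endOff, buf.length) - pattern.length;
        for (; start >= 0; start--) {
            boolean match = true;
            for (int i = 0; i < pattern.length; i++) {
                if (buf[start + i] != (byte) pattern[i]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return start;
            }
        }
        return -1;
    }
}
```

Passing the offset of a previously found marker as `endOff` restricts the next search to the bytes before that marker, which is how a caller can first locate '%%EOF' and then the 'startxref' preceding it.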
    • readPattern

      protected final void readPattern(char[] pattern) throws IOException
      Reads the given pattern from BaseParser.pdfSource, skipping whitespace at the start and end.
      Parameters:
      pattern - pattern to be skipped
      Throws:
      IOException - if pattern could not be read
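As a rough illustration of this read-and-skip behaviour, the sketch below does the same on a java.io.PushbackInputStream. The class name is hypothetical, and the set of bytes treated as whitespace follows the PDF white-space characters (NUL, TAB, LF, FF, CR, SPACE), which is an assumption about what the parser skips.

```java
import java.io.IOException;
import java.io.PushbackInputStream;

/** Illustrative sketch only; not the PDFBox implementation. */
final class ReadPatternSketch {

    /** Skips whitespace, reads exactly the given pattern, then skips whitespace again. */
    static void readPattern(PushbackInputStream in, char[] pattern) throws IOException {
        skipWhitespace(in);
        for (char expected : pattern) {
            int c = in.read();
            if (c != expected) {
                throw new IOException("Expected pattern '" + new String(pattern)
                        + "' but failed at character '" + expected + "'");
            }
        }
        skipWhitespace(in);
    }

    /** Consumes whitespace bytes and pushes back the first non-whitespace byte. */
    static void skipWhitespace(PushbackInputStream in) throws IOException {
        int c;
        while ((c = in.read()) != -1) {
            if (!isWhitespace(c)) {
                in.unread(c);
                return;
            }
        }
    }

    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }
}
```

After a successful call, the stream is positioned at the first non-whitespace byte following the pattern, so the caller can continue parsing the next token directly.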
    • parse

      public void parse() throws IOException
      This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.
      Overrides:
      parse in class PDFParser
      Throws:
      IOException - If there is an error reading from the stream or corrupt data is found.
    • getPdfFile

      protected File getPdfFile()
      Return the pdf file.
      Returns:
      the pdf file
    • isLenient

      public boolean isLenient()
      Returns true if the parser is lenient, meaning that the parser's auto-healing capabilities are used.
      Returns:
      true if parser is lenient
    • setLenient

      public void setLenient(boolean lenient) throws IllegalArgumentException
      Change the parser leniency flag. This method can only be called before the parsing of the file.
      Parameters:
      lenient - true to make the parser lenient, false otherwise
      Throws:
      IllegalArgumentException - if the method is called after parsing.
    • deleteTempFile

      protected void deleteTempFile()
      Removes the temporary file. A temporary file is created if this class is instantiated with an InputStream.
    • getSecurityHandler

      public SecurityHandler getSecurityHandler()
      Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
      Returns:
      the security handler.
    • getPDDocument

      public PDDocument getPDDocument() throws IOException
      This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources. Overriding the super method was necessary in order to set the security handler.
      Overrides:
      getPDDocument in class PDFParser
      Returns:
      The document at the PD layer.
      Throws:
      IOException - If there is an error getting the document.
    • getPageNumber

      public int getPageNumber() throws IOException
      Returns the number of pages in a document.
      Returns:
      the number of pages.
      Throws:
      IOException - if PAGES or other needed object is missing
    • getPage

      public PDPage getPage(int pageNr) throws IOException
      Returns the page requested with all the objects loaded into it.
      Parameters:
      pageNr - the page number, counted from 0
      Returns:
      the page with the given pagenumber.
      Throws:
      IOException - If something went wrong.
    • parseObjectDynamically

      protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws IOException
      This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
      Parameters:
      obj - object to be parsed (we only take object number and generation number for lookup start offset)
      requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
      Returns:
      the parsed object (which is also added to document object)
      Throws:
      IOException - If an IO error occurs.
    • parseObjectDynamically

      protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws IOException
      This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
      Parameters:
      objNr - object number of object to be parsed
      objGenNr - object generation number of object to be parsed
      requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
      Returns:
      the parsed object (which is also added to document object)
      Throws:
      IOException - If an IO error occurs.
    • decryptDictionary

      protected final void decryptDictionary(COSDictionary dict, long objNr, long objGenNr) throws IOException
      Decrypts given COSDictionary.
      Parameters:
      dict - the dictionary to be decrypted
      objNr - the object number
      objGenNr - the object generation number
      Throws:
      IOException - if something went wrong
    • decryptString

      protected final void decryptString(COSString str, long objNr, long objGenNr) throws IOException
      Decrypts given COSString.
      Parameters:
      str - the string to be decrypted
      objNr - the object number
      objGenNr - the object generation number
      Throws:
      IOException - if something went wrong
    • decrypt

      protected final void decrypt(COSBase pb, int objNr, int objGenNr) throws IOException
      Decrypts given object.
      Parameters:
      pb - the object to be decrypted
      objNr - the object number
      objGenNr - the object generation number
      Throws:
      IOException - if something went wrong
    • parseCOSStream

      protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) throws IOException
      This will read a COSStream from the input stream using the length attribute within the dictionary. If the length attribute is an indirect reference, it is first resolved to get the stream length. This means we copy the stream data without testing for 'endstream' or 'endobj', and thus it is no problem if these keywords occur within the stream. We require 'endstream' to be found after the stream data is read.
      Overrides:
      parseCOSStream in class BaseParser
      Parameters:
      dic - dictionary that goes with this stream.
      file - file to write the stream to when reading.
      Returns:
      parsed pdf stream.
      Throws:
      IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
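The length-based approach described for parseCOSStream can be illustrated with the following stand-alone sketch. It works on a byte array with an already resolved length value, uses hypothetical names, and assumes that only CR, LF and space may separate the data from 'endstream'; it is not the PDFBox implementation.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Illustrative sketch only; class and method names are hypothetical. */
final class LengthBasedStreamSketch {

    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    /**
     * Copies exactly length bytes starting at dataStart without inspecting them,
     * then requires the 'endstream' keyword to follow.
     */
    static byte[] readStreamData(byte[] src, int dataStart, int length) throws IOException {
        if (dataStart < 0 || length < 0 || length > src.length - dataStart) {
            throw new IOException("Stream too short for declared length " + length);
        }
        // The data is copied blindly, so keywords inside it do no harm.
        byte[] data = Arrays.copyOfRange(src, dataStart, dataStart + length);

        int pos = dataStart + length;
        // Assumption: optional end-of-line or space before the keyword.
        while (pos < src.length && (src[pos] == '\r' || src[pos] == '\n' || src[pos] == ' ')) {
            pos++;
        }
        if (pos + ENDSTREAM.length > src.length
                || !Arrays.equals(Arrays.copyOfRange(src, pos, pos + ENDSTREAM.length), ENDSTREAM)) {
            throw new IOException("Expected 'endstream' after " + length + " bytes of stream data");
        }
        return data;
    }
}
```

Because the data is taken by length alone, a payload that itself contains the text 'endstream' is read correctly, while a wrong length is detected by the keyword check that follows.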