Package org.apache.pdfbox.pdfparser
Class NonSequentialPDFParser
java.lang.Object
org.apache.pdfbox.pdfparser.BaseParser
org.apache.pdfbox.pdfparser.PDFParser
org.apache.pdfbox.pdfparser.NonSequentialPDFParser
PDFParser which first reads startxref and xref tables in order to know valid objects and parse only these objects. Thus it is closer to a conforming parser than the sequential reading of PDFParser.
This class can be used as a PDFParser replacement. First, parse() must be called before page objects can be retrieved, e.g. via getPDDocument().
This class is a much enhanced version of QuickParser presented in PDFBOX-1104 by Jeremy Villalobos.
-
Field Summary
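Fields (Modifier and Type, Field, Description):
protected static final int DEFAULT_TRAIL_BYTECOUNT
protected static final char[] EOF_MARKER - EOF-marker.
protected static final char[] OBJ_MARKER - obj-marker.
protected SecurityHandler securityHandler - The security handler.
protected static final char[] STARTXREF_MARKER - StartXRef-marker.
static final String SYSPROP_EOFLOOKUPRANGE
static final String SYSPROP_PARSEMINIMAL
static final String TMP_FILE_PREFIX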
Fields inherited from class org.apache.pdfbox.pdfparser.PDFParser
isFDFDocment, xrefTrailerResolver
Fields inherited from class org.apache.pdfbox.pdfparser.BaseParser
DEF, document, ENDOBJ, ENDSTREAM, forceParsing, pdfSource, PROP_PUSHBACK_SIZE
-
Constructor Summary
Constructors (Constructor, Description):
NonSequentialPDFParser(File file, RandomAccess raBuf) - Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword) - Constructs parser for given file using given buffer for temporary storage.
NonSequentialPDFParser(InputStream input) - Constructor.
NonSequentialPDFParser(InputStream input, RandomAccess raBuf, String decryptionPassword) - Constructor.
NonSequentialPDFParser(String filename) - Constructs parser for given file using memory buffer.
-
Method Summary
Methods (Modifier and Type, Method, Description):
protected final void decrypt - Decrypts given object.
protected final void decryptDictionary(COSDictionary dict, long objNr, long objGenNr)
protected final void decryptString(COSString str, long objNr, long objGenNr) - Decrypts given COSString.
protected void deleteTempFile() - Remove the temporary file.
PDPage getPage(int pageNr) - Returns the page requested with all the objects loaded into it.
int getPageNumber() - Returns the number of pages in a document.
PDDocument getPDDocument() - This will get the PD document that was parsed.
protected File getPdfFile() - Return the pdf file.
SecurityHandler getSecurityHandler() - Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
protected final long getStartxrefOffset() - Looks for and parses startxref.
protected void initialParse() - The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects.
boolean isLenient() - Return true if parser is lenient.
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff) - Searches last appearance of pattern within buffer.
void parse() - This will parse the stream and populate the COSDocument object.
protected COSStream parseCOSStream(COSDictionary dic, RandomAccess file) - This will read a COSStream from the input stream using length attribute within dictionary.
protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state.
protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) - This will parse the next object from the stream and add it to the local state.
protected final void readPattern(char[] pattern) - Reads given pattern from BaseParser.pdfSource.
protected final void releasePdfSourceInputStream() - Enable handling of alternative pdfSource implementation.
void setEOFLookupRange(int byteCount) - Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker.
void setLenient(boolean lenient) - Change the parser leniency flag.
protected final void setPdfSource(long fileOffset) - Sets BaseParser.pdfSource to start next parsing at given file offset.
Methods inherited from class org.apache.pdfbox.pdfparser.PDFParser
clearResources, getDocument, getFDFDocument, isContinueOnError, parseHeader, parseStartXref, parseTrailer, parseXrefStream, parseXrefStream, parseXrefTable, readVersionInTrailer, setTempDirectory
Methods inherited from class org.apache.pdfbox.pdfparser.BaseParser
isClosing, isClosing, isEndOfName, isEOL, isEOL, isWhitespace, isWhitespace, parseBoolean, parseCOSArray, parseCOSDictionary, parseCOSName, parseCOSString, parseCOSString, parseDirObject, readExpectedString, readGenerationNumber, readInt, readLine, readLong, readObjectNumber, readString, readString, readStringNumber, readUntilEndStream, setDocument, skipSpaces
-
Field Details
-
SYSPROP_PARSEMINIMAL
-
SYSPROP_EOFLOOKUPRANGE
-
DEFAULT_TRAIL_BYTECOUNT
protected static final int DEFAULT_TRAIL_BYTECOUNT
-
EOF_MARKER
protected static final char[] EOF_MARKER
EOF-marker.
-
STARTXREF_MARKER
protected static final char[] STARTXREF_MARKER
StartXRef-marker.
-
OBJ_MARKER
protected static final char[] OBJ_MARKER
obj-marker.
-
securityHandler
The security handler.
-
TMP_FILE_PREFIX
-
-
Constructor Details
-
NonSequentialPDFParser
Constructs parser for given file using memory buffer.
Parameters:
filename - the filename of the pdf to be parsed
Throws:
IOException - If something went wrong.
-
NonSequentialPDFParser
Constructs parser for given file using given buffer for temporary storage.
Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
Throws:
IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(File file, RandomAccess raBuf, String decryptionPassword) throws IOException
Constructs parser for given file using given buffer for temporary storage.
Parameters:
file - the pdf to be parsed
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption
Throws:
IOException - If something went wrong.
-
NonSequentialPDFParser
Constructor.
Parameters:
input - input stream representing the pdf.
Throws:
IOException - If something went wrong.
-
NonSequentialPDFParser
public NonSequentialPDFParser(InputStream input, RandomAccess raBuf, String decryptionPassword) throws IOException
Constructor.
Parameters:
input - input stream representing the pdf.
raBuf - the buffer to be used for parsing
decryptionPassword - password to be used for decryption.
Throws:
IOException - If something went wrong.
-
-
Method Details
-
setEOFLookupRange
public void setEOFLookupRange(int byteCount)
Sets how many trailing bytes of PDF file are searched for EOF marker and 'startxref' marker. If not set we use default value DEFAULT_TRAIL_BYTECOUNT.
We check that new value is at least 16. However, for practical use cases this value should not be lower than 1000; even 2000 was found to not be enough in some cases where some trailing garbage like HTML snippets followed the EOF marker.
In case system property SYSPROP_EOFLOOKUPRANGE is defined this value will be set on initialization but can be overwritten later.
Parameters:
byteCount - number of trailing bytes
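The range handling described above can be sketched with a small stand-in class. This is only an illustration under stated assumptions: the class name, the default value of 2048, and the property name `example.eofLookupRange` are hypothetical (this page states neither the real default nor the real property name), and silently ignoring values below 16 is an assumed behaviour.

```java
/** Hypothetical stand-in illustrating a configurable EOF lookup range. */
final class EofLookupRangeSketch {
    // Assumed default; the real DEFAULT_TRAIL_BYTECOUNT value is not stated on this page.
    static final int ASSUMED_DEFAULT_TRAIL_BYTECOUNT = 2048;
    // Hypothetical system property name, used only in this sketch.
    static final String ASSUMED_SYSPROP = "example.eofLookupRange";

    private int readTrailBytes = ASSUMED_DEFAULT_TRAIL_BYTECOUNT;

    EofLookupRangeSketch() {
        // A system property, if defined, sets the initial value.
        String prop = System.getProperty(ASSUMED_SYSPROP);
        if (prop != null) {
            try {
                setEOFLookupRange(Integer.parseInt(prop.trim()));
            } catch (NumberFormatException e) {
                // Assumption: an unparsable property value is ignored.
            }
        }
    }

    /** Accepts the new value only if it is at least 16 (smaller values are ignored here). */
    void setEOFLookupRange(int byteCount) {
        if (byteCount >= 16) {
            readTrailBytes = byteCount;
        }
    }

    int getEOFLookupRange() {
        return readTrailBytes;
    }
}
```

A later call to `setEOFLookupRange` overrides whatever the property supplied, which mirrors the "can be overwritten later" remark.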
-
initialParse
The initial parse will first parse only the trailer, the xrefstart and all xref tables to have a pointer (offset) to all the pdf's objects. It can handle linearized pdfs, which will have an xref at the end pointing to an xref at the beginning of the file. Finally, the root object is parsed.
Throws:
IOException - If something went wrong.
-
setPdfSource
Sets BaseParser.pdfSource to start next parsing at given file offset.
Parameters:
fileOffset - file offset
Throws:
IOException - If something went wrong.
-
releasePdfSourceInputStream
Enable handling of alternative pdfSource implementation.
Throws:
IOException - If something went wrong.
-
getStartxrefOffset
Looks for and parses startxref. We first look for the last '%%EOF' marker (within the last DEFAULT_TRAIL_BYTECOUNT bytes, or the range set via setEOFLookupRange(int)) and then go back to find startxref.
Returns:
the offset of StartXref
Throws:
IOException - If something went wrong.
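As a rough illustration of the lookup order described for getStartxrefOffset, the following sketch works on an in-memory byte array instead of a file. The class and method names are hypothetical, and it uses String searching for brevity; it is not the parser's own code.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;

/** Hypothetical stand-in that locates 'startxref' in an in-memory PDF. */
final class StartxrefSketch {

    /**
     * Returns the absolute offset of the last 'startxref' keyword that precedes
     * the last '%%EOF' marker found within the trailing lookupRange bytes.
     */
    static long findStartxrefOffset(byte[] pdf, int lookupRange) throws IOException {
        int trailLen = Math.max(0, Math.min(pdf.length, lookupRange));
        int trailStart = pdf.length - trailLen;
        // ISO-8859-1 maps each byte to exactly one char, so string indexes equal byte offsets.
        String trail = new String(pdf, trailStart, trailLen, StandardCharsets.ISO_8859_1);

        // Step 1: find the last EOF marker inside the trailing window.
        int eofOff = trail.lastIndexOf("%%EOF");
        if (eofOff < 0) {
            throw new IOException("Missing end of file marker '%%EOF'");
        }
        // Step 2: go back from the EOF marker to the closest preceding keyword.
        int keywordOff = trail.lastIndexOf("startxref", eofOff);
        if (keywordOff < 0) {
            throw new IOException("Missing 'startxref' marker");
        }
        return (long) trailStart + keywordOff;
    }

    public static void main(String[] args) throws IOException {
        String doc = "%PDF-1.4\n1 0 obj\n<<>>\nendobj\nstartxref\n9\n%%EOF\n<b>trailing garbage</b>";
        System.out.println(findStartxrefOffset(doc.getBytes(StandardCharsets.ISO_8859_1), 2048));
    }
}
```

With a lookup range shorter than the trailing garbage, the window no longer contains '%%EOF' and the sketch fails, which is the situation the setEOFLookupRange note about HTML snippets warns about.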
-
lastIndexOf
protected int lastIndexOf(char[] pattern, byte[] buf, int endOff)
Searches last appearance of pattern within buffer. The lookup starts before endOff and goes back until offset 0.
Parameters:
pattern - pattern to search for
buf - buffer to search pattern in
endOff - offset (exclusive) where lookup starts at
Returns:
start offset of pattern within buffer or -1 if pattern could not be found
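The contract above (exclusive end offset, backward search, -1 when nothing is found) can be illustrated with a straightforward standalone version. The class name is hypothetical and the loop is a plain sketch written for clarity, not the library's implementation.

```java
/** Hypothetical standalone illustration of a backward pattern search. */
final class LastIndexOfSketch {

    /**
     * Returns the start offset of the last occurrence of pattern that lies
     * entirely before endOff (exclusive), or -1 if there is none.
     */
    static int lastIndexOf(char[] pattern, byte[] buf, int endOff) {
        // The latest possible start leaves room for the whole pattern before endOff.
        for (int start = Math.min(endOff, buf.length) - pattern.length; start >= 0; start--) {
            int i = 0;
            // Mask the byte so values above 127 compare correctly with chars.
            while (i < pattern.length && (buf[start + i] & 0xFF) == pattern[i]) {
                i++;
            }
            if (i == pattern.length) {
                return start;
            }
        }
        return -1;
    }
}
```

Passing the offset returned by one call as `endOff` of the next call finds the previous occurrence, which is how a search for 'startxref' can be restricted to the bytes before '%%EOF'.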
-
readPattern
Reads given pattern from BaseParser.pdfSource, skipping whitespace at start and end.
Parameters:
pattern - pattern to be skipped
Throws:
IOException - if pattern could not be read
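A minimal sketch of this behaviour on a `java.io.PushbackInputStream` is shown below. The class name and the simplified whitespace test are assumptions made for the example (for instance, PDF comments are not skipped here), so it should be read as an illustration only.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

/** Hypothetical illustration of reading an expected keyword surrounded by whitespace. */
final class ReadPatternSketch {

    /** Simplified whitespace test: NUL, tab, LF, FF, CR and space. */
    static boolean isWhitespace(int c) {
        return c == 0 || c == 9 || c == 10 || c == 12 || c == 13 || c == 32;
    }

    /** Consumes whitespace and pushes back the first non-whitespace byte. */
    static void skipSpaces(PushbackInputStream in) throws IOException {
        int c = in.read();
        while (isWhitespace(c)) {
            c = in.read();
        }
        if (c != -1) {
            in.unread(c);
        }
    }

    /** Skips leading whitespace, requires the exact pattern, then skips trailing whitespace. */
    static void readPattern(PushbackInputStream in, char[] pattern) throws IOException {
        skipSpaces(in);
        for (char expected : pattern) {
            int c = in.read();
            if (c != expected) {
                throw new IOException("Expected pattern '" + new String(pattern) + "' but it was not found");
            }
        }
        skipSpaces(in);
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "  obj\n<< /Type /Catalog >>".getBytes(StandardCharsets.ISO_8859_1);
        PushbackInputStream in = new PushbackInputStream(new ByteArrayInputStream(data));
        readPattern(in, "obj".toCharArray());
        System.out.println((char) in.read()); // first byte after the pattern and whitespace
    }
}
```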
-
parse
This will parse the stream and populate the COSDocument object. This will close the stream when it is done parsing.
Overrides:
parse in class PDFParser
Throws:
IOException - If there is an error reading from the stream or corrupt data is found.
-
getPdfFile
Return the pdf file.
Returns:
the pdf file
-
isLenient
public boolean isLenient()
Return true if parser is lenient, meaning the auto-healing capabilities of the parser are used.
Returns:
true if parser is lenient
-
setLenient
Change the parser leniency flag. This method can only be called before the parsing of the file.
Parameters:
lenient - the new value of the leniency flag
Throws:
IllegalArgumentException - if the method is called after parsing.
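The call-order rule for the leniency flag can be pictured with a tiny stand-in class. Everything in it is hypothetical, including the class name, the `parsed` marker and the assumed default of `true`; it only illustrates the "set before parse, otherwise IllegalArgumentException" contract stated above.

```java
/** Hypothetical stand-in showing a flag that may only be changed before parsing. */
final class LenientFlagSketch {
    private boolean lenient = true; // assumed default, not stated on this page
    private boolean parsed = false;

    boolean isLenient() {
        return lenient;
    }

    void setLenient(boolean lenient) {
        if (parsed) {
            throw new IllegalArgumentException("Cannot change leniency after parsing");
        }
        this.lenient = lenient;
    }

    /** Stands in for the real parse step; here it only records that parsing happened. */
    void parse() {
        parsed = true;
    }
}
```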
-
deleteTempFile
protected void deleteTempFile()
Remove the temporary file. A temporary file is created if this class is instantiated with an InputStream.
-
getSecurityHandler
Returns security handler of the document or null if document is not encrypted or parse() wasn't called before.
Returns:
the security handler.
-
getPDDocument
This will get the PD document that was parsed. When you are done with this document you must call close() on it to release resources. Overriding the super method was necessary in order to set the security handler.
Overrides:
getPDDocument in class PDFParser
Returns:
The document at the PD layer.
Throws:
IOException - If there is an error getting the document.
-
getPageNumber
Returns the number of pages in a document.
Returns:
the number of pages.
Throws:
IOException - if PAGES or other needed object is missing
-
getPage
Returns the page requested with all the objects loaded into it.
Parameters:
pageNr - the page number, counted from 0 up to the number of pages.
Returns:
the page with the given page number.
Throws:
IOException - If something went wrong.
-
parseObjectDynamically
protected final COSBase parseObjectDynamically(COSObject obj, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
Parameters:
obj - object to be parsed (we only take object number and generation number for lookup start offset)
requireExistingNotCompressedObj - if true object to be parsed must not be contained within compressed stream
Returns:
the parsed object (which is also added to document object)
Throws:
IOException - If an IO error occurs.
-
parseObjectDynamically
protected COSBase parseObjectDynamically(int objNr, int objGenNr, boolean requireExistingNotCompressedObj) throws IOException
This will parse the next object from the stream and add it to the local state. This is taken from PDFParser and reduced to parsing an indirect object.
Parameters:
objNr - object number of object to be parsed
objGenNr - object generation number of object to be parsed
requireExistingNotCompressedObj - if true the object to be parsed must be defined in xref (comment: null objects may be missing from xref) and it must not be a compressed object within object stream (this is used to circumvent being stuck in a loop in a malicious PDF)
Returns:
the parsed object (which is also added to document object)
Throws:
IOException - If an IO error occurs.
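Why the requireExistingNotCompressedObj flag guards against loops can be illustrated with a simplified xref lookup. The sketch below is built on assumptions that are not taken from this page: the class name, the convention that a negative xref value means "stored in object stream number -value", ignoring generation numbers, and resolving the containing object stream with the flag set to true.

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical xref lookup showing the effect of requireExistingNotCompressedObj. */
final class XrefLookupSketch {
    // Assumed convention: value > 0 is a file offset,
    // value < 0 means "contained in object stream number -value".
    private final Map<Integer, Long> xref = new HashMap<>();

    void put(int objNr, long offsetOrNegativeObjStmNr) {
        xref.put(objNr, offsetOrNegativeObjStmNr);
    }

    /** Returns the file offset to read from, or -1 if the object is treated as a null object. */
    long locate(int objNr, boolean requireExistingNotCompressedObj) throws IOException {
        Long entry = xref.get(objNr);
        if (entry == null) {
            if (requireExistingNotCompressedObj) {
                throw new IOException("Object " + objNr + " must be defined in xref");
            }
            return -1; // null objects may be missing from xref
        }
        long value = entry;
        if (value < 0) {
            if (requireExistingNotCompressedObj) {
                throw new IOException("Object " + objNr + " must not be a compressed object");
            }
            // The containing object stream must itself exist and be uncompressed, so a
            // chain of streams "inside" each other is rejected instead of followed forever.
            return locate((int) (-value), true);
        }
        return value;
    }
}
```

In this sketch a malicious pair of entries that claim to live inside each other ends in an IOException after one hop rather than in endless recursion.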
-
decryptDictionary
protected final void decryptDictionary(COSDictionary dict, long objNr, long objGenNr) throws IOException
Parameters:
dict - the dictionary to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong
-
decryptString
Decrypts given COSString.
Parameters:
str - the string to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong
-
decrypt
Decrypts given object.
Parameters:
pb - the object to be decrypted
objNr - the object number
objGenNr - the object generation number
Throws:
IOException - if something went wrong
-
parseCOSStream
This will read a COSStream from the input stream using length attribute within dictionary. If length attribute is an indirect reference it is first resolved to get the stream length. This means we copy stream data without testing for 'endstream' or 'endobj' and thus it is no problem if these keywords occur within stream. We require 'endstream' to be found after stream data is read.
Overrides:
parseCOSStream in class BaseParser
Parameters:
dic - dictionary that goes with this stream.
file - file to write the stream to when reading.
Returns:
parsed pdf stream.
Throws:
IOException - if an error occurred reading the stream, like problems with reading length attribute, stream does not end with 'endstream' after data read, stream too short etc.
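The length-based reading strategy can be sketched on a byte array as follows. The class and method names are hypothetical, the length is assumed to be already resolved, and the whitespace handling before 'endstream' is a simplification made for the example.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical illustration of copying stream data by length and then requiring 'endstream'. */
final class LengthBasedStreamSketch {
    private static final byte[] ENDSTREAM = "endstream".getBytes(StandardCharsets.ISO_8859_1);

    static boolean isWhitespace(byte b) {
        return b == 0 || b == 9 || b == 10 || b == 12 || b == 13 || b == 32;
    }

    /**
     * @param pdf       the raw bytes
     * @param dataStart index of the first stream data byte
     * @param length    the already resolved value of the length attribute
     */
    static byte[] readStreamData(byte[] pdf, int dataStart, int length) throws IOException {
        if (dataStart < 0 || dataStart > pdf.length || length < 0 || length > pdf.length - dataStart) {
            throw new IOException("Stream too short for declared length " + length);
        }
        // Copy exactly 'length' bytes; the data itself is never scanned for keywords.
        byte[] data = Arrays.copyOfRange(pdf, dataStart, dataStart + length);

        int pos = dataStart + length;
        while (pos < pdf.length && isWhitespace(pdf[pos])) {
            pos++;
        }
        // After the data, the keyword 'endstream' is mandatory.
        if (ENDSTREAM.length > pdf.length - pos) {
            throw new IOException("Expected 'endstream' after stream data");
        }
        for (int i = 0; i < ENDSTREAM.length; i++) {
            if (pdf[pos + i] != ENDSTREAM[i]) {
                throw new IOException("Expected 'endstream' after stream data");
            }
        }
        return data;
    }
}
```

Because the copy is driven purely by the length, data that happens to contain the text 'endstream' or 'endobj' is read intact, while a wrong length is detected by the keyword check afterwards.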
-