Class PDFText2HTML


public class PDFText2HTML extends PDFTextStripper
Wrap stripped text in simple HTML, trying to form HTML paragraphs. Paragraphs broken by pages, columns, or figures are not mended.
Author:
jjb - http://www.johnjbarton.com
  • Constructor Details

    • PDFText2HTML

      public PDFText2HTML(String encoding) throws IOException
      Creates a stripper that wraps the extracted text in simple HTML, using the given character encoding.
      Parameters:
      encoding - The encoding to be used
      Throws:
      IOException - If there is an error during initialization.
  • Method Details

    • writeHeader

      protected void writeHeader() throws IOException
      Write the header to the output document. The header also includes the tag that defines the character encoding.
      Throws:
      IOException - If there is a problem writing out the header to the document.
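      As a rough illustration of a header that declares the character encoding, the sketch below builds one as a string. The class HtmlHeaderSketch, the method buildHeader, and the exact markup are assumptions made for this example; the markup that PDFText2HTML actually emits is not shown on this page.

```java
// Illustrative sketch only; not the PDFText2HTML implementation.
class HtmlHeaderSketch {

    // Hypothetical helper: returns a minimal HTML header whose meta tag
    // declares the given character encoding. The title is inserted as-is
    // (no escaping), so callers should pass an already-safe string.
    public static String buildHeader(String title, String encoding) {
        StringBuilder sb = new StringBuilder();
        sb.append("<html><head>\n");
        sb.append("<title>").append(title).append("</title>\n");
        sb.append("<meta http-equiv=\"Content-Type\" content=\"text/html; charset=")
          .append(encoding)
          .append("\">\n");
        sb.append("</head>\n<body>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(buildHeader("Example", "UTF-8"));
    }
}
```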
    • writePage

      protected void writePage() throws IOException
      This will print the text of the processed page to "output". It will estimate, based on the coordinates of the text, where newlines and word spacings should be placed. The text will be sorted only if that feature was enabled.
      Overrides:
      writePage in class PDFTextStripper
      Throws:
      IOException - If there is an error writing the text.
    • endDocument

      public void endDocument(PDDocument pdf) throws IOException
      This method is available for subclasses of this class. It will be called after processing of the document finishes.
      Overrides:
      endDocument in class PDFTextStripper
      Parameters:
      pdf - The PDF document that is being processed.
      Throws:
      IOException - If an IO error occurs.
    • getTitle

      protected String getTitle()
      This method will attempt to guess the title of the document using either the document properties or the first lines of text.
      Returns:
      The guessed title.
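      The fallback described above (document properties first, then the first lines of text) might look roughly like the following. TitleGuessSketch and guessTitle are hypothetical names, and the rule of taking the first non-blank line is an assumption; the heuristic used by the real method may differ.

```java
// Illustrative sketch only; the real heuristic in getTitle() may differ.
class TitleGuessSketch {

    // Hypothetical helper: prefer the title from the document properties,
    // otherwise fall back to the first non-blank line of the extracted text.
    public static String guessTitle(String metadataTitle, String extractedText) {
        if (metadataTitle != null && !metadataTitle.trim().isEmpty()) {
            return metadataTitle.trim();
        }
        if (extractedText != null) {
            for (String line : extractedText.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {
                    return trimmed;
                }
            }
        }
        // Nothing usable was found.
        return "";
    }

    public static void main(String[] args) {
        System.out.println(guessTitle(null, "\n  Annual Report\nSecond line"));
    }
}
```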
    • startArticle

      protected void startArticle(boolean isltr) throws IOException
      Write out the article separator (div tag) with proper text direction information.
      Overrides:
      startArticle in class PDFTextStripper
      Parameters:
      isltr - true if direction of text is left to right
      Throws:
      IOException - If there is an error writing to the stream.
    • endArticle

      protected void endArticle() throws IOException
      Write out the article separator.
      Overrides:
      endArticle in class PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.
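      The article separators written by startArticle and endArticle can be pictured as a matching pair of div tags, with the opening tag carrying the text direction. In the sketch below, ArticleDivSketch, openArticle, closeArticle, and the exact attribute spelling are assumptions for illustration, not the markup the class is documented to produce.

```java
// Illustrative sketch only; attribute spelling is an assumption.
class ArticleDivSketch {

    // Hypothetical helper: opening separator. Left-to-right text needs no
    // attribute in this sketch; right-to-left text gets an explicit dir.
    public static String openArticle(boolean isLtr) {
        return isLtr ? "<div>" : "<div dir=\"rtl\">";
    }

    // Hypothetical helper: closing separator matching openArticle.
    public static String closeArticle() {
        return "</div>";
    }

    public static void main(String[] args) {
        System.out.println(openArticle(false) + "text" + closeArticle());
    }
}
```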
    • writeString

      protected void writeString(String text, List<TextPosition> textPositions) throws IOException
      Write a string to the output stream, maintain font state, and escape some HTML characters. The font state is only preserved per word.
      Overrides:
      writeString in class PDFTextStripper
      Parameters:
      text - The text to write to the stream.
      textPositions - the corresponding text positions
      Throws:
      IOException - If there is an error writing to the stream.
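      One way to picture "maintaining font state" is a small tracker that emits opening and closing tags only when the requested style changes. The sketch below is an assumption-laden illustration: FontStateSketch, write, and clear are hypothetical names, and the choice of b and i tags is not taken from this page.

```java
// Illustrative sketch only; not the font handling used by PDFText2HTML.
class FontStateSketch {
    private boolean bold = false;
    private boolean italic = false;

    // Returns the text, preceded by whatever tags are needed to move from
    // the current style to the requested one. No tags are emitted when the
    // style is unchanged.
    public String write(String text, boolean wantBold, boolean wantItalic) {
        StringBuilder sb = new StringBuilder();
        if (bold != wantBold || italic != wantItalic) {
            sb.append(clear());              // close whatever is open
            if (wantBold) sb.append("<b>");
            if (wantItalic) sb.append("<i>");
            bold = wantBold;
            italic = wantItalic;
        }
        sb.append(text);
        return sb.toString();
    }

    // Closes any open tags in reverse order of opening and resets the state,
    // as would be needed before ending a paragraph.
    public String clear() {
        StringBuilder sb = new StringBuilder();
        if (italic) sb.append("</i>");
        if (bold) sb.append("</b>");
        italic = false;
        bold = false;
        return sb.toString();
    }

    public static void main(String[] args) {
        FontStateSketch state = new FontStateSketch();
        String out = state.write("Bold", true, false)
                + state.write(" plain", false, false)
                + state.clear();
        System.out.println(out);
    }
}
```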
    • writeString

      protected void writeString(String chars) throws IOException
      Write a string to the output stream and escape some HTML characters.
      Overrides:
      writeString in class PDFTextStripper
      Parameters:
      chars - String to be written to the stream
      Throws:
      IOException - If there is an error writing to the stream.
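      Escaping "some HTML characters" typically means replacing the characters that would otherwise be parsed as markup. The sketch below shows the general idea; HtmlEscapeSketch and escape are hypothetical names, and the set of characters handled here is an assumption rather than the exact set this method escapes.

```java
// Illustrative sketch only; the real method may escape a different set.
class HtmlEscapeSketch {

    // Hypothetical helper: replaces characters that have special meaning
    // in HTML with the corresponding character references.
    public static String escape(String chars) {
        StringBuilder sb = new StringBuilder(chars.length());
        for (int i = 0; i < chars.length(); i++) {
            char c = chars.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                default:  sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("if (a < b && b > c)"));
    }
}
```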
    • writeParagraphEnd

      protected void writeParagraphEnd() throws IOException
      Writes the paragraph end "</p>" to the output. Furthermore, it will also clear the font state. Write something (if defined) at the end of a paragraph.
      Overrides:
      writeParagraphEnd in class PDFTextStripper
      Throws:
      IOException - If there is an error writing to the stream.