Basic PDFBox Tutorial

PDFBox is an open source project written in Java.  It comes as a JAR file and therefore can be used in Java applications to create, manipulate and extract data from PDF (Portable Document Format) files. More details can be found at http://pdfbox.apache.org/

Quick Start:

  1. Download PDFBox at http://www.apache.org/dyn/closer.cgi/pdfbox/1.8.6/pdfbox-app-1.8.6.jar
  2. Add it to your CLASSPATH
  3. Download documentation at http://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/1.8.6/pdfbox-1.8.6-javadoc.jar (Look further below on how to extract jar files) or view the docs online at http://pdfbox.apache.org/docs/1.8.6/javadocs/.
  4. Within the documentation, start by looking at PDDocument located in the package org.apache.pdfbox.pdmodel.
  5. Start coding

In Detail:

  1. Installing PDFBox.
  2. FAQs.
  3. Programming with PDFBox.
Help!! If you are already using PDFBox and cannot find an answer to a problem, you can ask the wider PDFBox community (including the developers) through the official PDFBox mailing list. To join, visit http://pdfbox.apache.org/mailinglists.html, then post your query to the list. Note that replies are usually sent only to members of the mailing list, so if you have not joined, your query may be answered without the answer ever reaching you. Joining also lets you see solutions to problems others are facing. Send all queries and replies to the mailing list, not directly to the developers.

1. Installing PDFBox:

Version 1.8.6

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.8.6/pdfbox-app-1.8.6.jar

This will take you to a webpage suggesting a site for your download. Download using that link.

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry (the . stands for the current directory and ; is the entry separator on Windows). Here is what a sample CLASSPATH entry may look like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.8.6.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Further down is some sample code. Refer to the FAQs below if you have other problems.
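
To confirm that the CLASSPATH change took effect, a small check like the one below may help. It is only a sketch: ClassPathCheck and isAvailable are names I made up for illustration, and the only PDFBox detail it relies on is the class name org.apache.pdfbox.pdmodel.PDDocument mentioned in the Quick Start.

```java
/** Hypothetical helper (not part of PDFBox): reports whether a class can be found on the CLASSPATH. */
final class ClassPathCheck {

    /** Returns true if the named class is visible to this program's class loader. */
    static boolean isAvailable(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClassPathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String pdfBoxClass = "org.apache.pdfbox.pdmodel.PDDocument";
        // Show the class path the JVM is actually using, then test for PDFBox
        System.out.println("CLASSPATH in use: " + System.getProperty("java.class.path"));
        System.out.println(pdfBoxClass + " found: " + isAvailable(pdfBoxClass));
    }
}
```

If the last line prints false, the jar is most likely missing from the CLASSPATH or its path is misspelled.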


Version 1.8.5

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.8.5/pdfbox-app-1.8.5.jar

This will take you to a webpage suggesting a site for your download. Download using that link.

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what a sample CLASSPATH entry may look like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.8.5.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Further down this page is some sample code. Refer to the FAQs below if you have other problems.


(Go to the end of this page for instructions on installation of previous versions)


2. FAQs:

1. Where can I find more examples or sample code for PDFBox?

Source1 (The best. Look for the "PDFBox Cookbook" link on this webpage)
http://pdfbox.apache.org/userguide/faq.html  

Source2: Download the source code for the PDFBox version you are after from the following locations.

 
 
 
Note: If you have any problems with the above links you can look under http://archive.apache.org/dist/pdfbox/
 
Extract the files and look under the following locations

Version 1.8.6 - "folder where extracted\pdfbox-1.8.6\pdfbox\src\main\java\org\apache\pdfbox"

Version 1.8.5 - "folder where extracted\pdfbox-1.8.5\pdfbox\src\main\java\org\apache\pdfbox"
 
Version 1.8.4 - "folder where extracted\pdfbox-1.8.4\pdfbox\src\main\java\org\apache\pdfbox"

Version 1.8.3 - "folder where extracted\pdfbox-1.8.3\pdfbox\src\main\java\org\apache\pdfbox"
 
Version 1.8.0 - "folder where extracted\pdfbox-1.8.0\pdfbox\src\main\java\org\apache\pdfbox" (Note there are 22 files).

Version 1.6.0 - "folder where extracted\pdfbox-1.6.0\pdfbox\src\main\java\org\apache\pdfbox\examples"

Version 1.5.0 - "folder where extracted\pdfbox-1.5.0\pdfbox\src\main\java\org\apache\pdfbox\examples"
 

2. Where can I get the source code?




3. Where can I download the Java Doc for PDFBox?

You can view the API Docs online at http://pdfbox.apache.org/docs/1.8.6/javadocs/ or you can download them at the following links.



You can extract the documentation from the jar file and go through these files to understand PDFBox better. You can extract the contents of a jar file from the command prompt using the jar command that comes with Java, as in the example below (refer to this link on how to extract contents from a jar file). Be aware that this may overwrite some files; refer to the link above.
    jar xf pdfbox-1.5.0-javadoc.jar

4. Where can I get up-to-date information/help etc.?

Apart from doing your own research you can join the mailing list at PDFBox where you can ask questions. You can get to know answers to problems others are facing. It's worth joining. Please note that when you send a question to the mailing list, the reply is usually sent to only users of the mailing list - this means that if you have not joined the mailing list, you might be under the impression that your question was not answered when in fact it might have been answered and sent to all the users on the mailing list! To join the mailing list refer to http://pdfbox.apache.org/mailinglists.html

5. I have problems configuring my computer to compile or run Java programs. Are there any resources to solve this problem?

If you have problems configuring Java programs to compile or run on your computer, refer to this page.

6. I have included commons-logging-1.1.1.jar in my CLASSPATH but I am still getting compiler errors (applies to certain previous versions only. Look at the installation tips at the end of this page).

I have had problems setting up my CLASSPATH. First, you must prefix your CLASSPATH with .; otherwise you will face problems when you try to compile or run your program from the command prompt. Look at my CLASSPATH in the installation tips for version 1.0.0. Second, commons-logging-1.1.1.jar has to appear before pdfbox-1.0.0.jar. My Java code would not execute when it was the other way around. Here is a link that gives you a good idea of how to set your CLASSPATH. It refers to a different situation but gives you a basic idea of how to set CLASSPATH.
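
The ordering described above can be checked with a few lines of plain Java. This is a rough sketch, not anything provided by PDFBox: ClassPathOrder and its methods are hypothetical names, and it simply looks for entries whose file names start with "commons-logging" and "pdfbox".

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper: checks the order of jars in a CLASSPATH string. */
final class ClassPathOrder {

    /** Splits a CLASSPATH string into its entries, dropping empty ones. */
    static List<String> entries(String classPath, String separator) {
        List<String> result = new ArrayList<String>();
        for (String entry : classPath.split(Pattern.quote(separator))) {
            if (!entry.trim().isEmpty()) {
                result.add(entry.trim());
            }
        }
        return result;
    }

    /** Index of the first entry whose file name starts with jarPrefix, or -1 if none does. */
    static int indexOfJar(List<String> entries, String jarPrefix) {
        for (int i = 0; i < entries.size(); i++) {
            String e = entries.get(i);
            // Keep only the part after the last \ or / so folder names are ignored
            String name = e.substring(Math.max(e.lastIndexOf('\\'), e.lastIndexOf('/')) + 1);
            if (name.startsWith(jarPrefix)) {
                return i;
            }
        }
        return -1;
    }

    /** True if both jars are present and commons-logging comes before pdfbox. */
    static boolean loggingBeforePdfBox(String classPath, String separator) {
        List<String> e = entries(classPath, separator);
        int logging = indexOfJar(e, "commons-logging");
        int pdfbox = indexOfJar(e, "pdfbox");
        return logging >= 0 && pdfbox >= 0 && logging < pdfbox;
    }

    public static void main(String[] args) {
        String cp = System.getProperty("java.class.path");
        System.out.println(entries(cp, File.pathSeparator));
        System.out.println("commons-logging before pdfbox: "
                + loggingBeforePdfBox(cp, File.pathSeparator));
    }
}
```

Run it with the same CLASSPATH you use for your PDFBox programs; it prints the entries in the order Java sees them.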

7. Are there any conditions regarding which jars have to be added and which need not be (applies to version 1.1.0)?

In addition to pdfbox-1.1.0.jar, the following are mandatory:

commons-logging-1.1.1.jar
fontbox-1.1.0.jar
jempbox-1.1.0.jar

Take a look at http://pdfbox.apache.org/dependencies.html. This lists the optional and mandatory jar files.

3. Programming with PDFBox:

As my knowledge of PDFBox is limited, you may not get the answers that you are looking for, but I hope to keep updating this page as I learn.

Now that you have installed PDFBox you might wonder what to do next.
  1. Try running the sample code you have downloaded (refer to  FAQ 1 above).
  2. Next, try to get a basic understanding of the PDF specification. All PDF files are expected to follow this specification and knowing it helps. As you go through the PDF specification you will understand PDFBox better (as to why certain classes/processes exist). I am attempting to summarize the PDF specification at this link, but my effort is far from over and still in progress. Have a look at it and refer to further reliable resources mentioned on that page.
  3. Start looking at the PDFBox documentation. In the documentation that you extracted, start by looking at PDDocument located in the package org.apache.pdfbox.pdmodel.
  4. Finally, write lots of code.

Creating a new PDF file (just a blank page):

This section refers to this sample code.


If you have been through the FAQs and looked at the sample code you will have a basic idea of how PDFBox works. PDFBox considers the class PDDocument, found in the package org.apache.pdfbox.pdmodel, to be the equivalent of a PDF file. If you are opening or creating a PDF file you will be working with this class. Open the JavaDoc (refer to FAQ 3) and have a good look at the methods found in this class.

A PDF file generally consists of one or more pages. It would be safe to assume that all PDF files will have at least one page. So if you are creating a PDF file you will need at least one page. The class that represents a page is PDPage (again found under the same pdmodel package).

Here are the steps to create a Blank PDF file.
  1. Create a PDF file (using PDDocument)
  2. Add a page (using PDPage)
  3. Save the PDF file (using the save method in PDDocument) and
  4. Close it (using the close method in PDDocument again).
Look at the sample code. The code is very simple to understand and after looking at the code attempt to create your first blank PDF!!

Here is my code based on the sample code to create a Blank PDF file - bad programming style but gave me great satisfaction upon execution.

/**
 * @(#)BlankPDF.java
 *
 *
 * @author Stephen H
 * @version 1.00 2011/6/20
 */
import org.apache.pdfbox.pdmodel.*;
import java.io.*;

public class BlankPDF {
    public static void main(String[] args) {
        PDDocument doc = null;
        try {
            doc = new PDDocument();       // Step 1: create the PDF file
            doc.addPage(new PDPage());    // Step 2: add a page
            doc.save("Empty PDF.pdf");    // Step 3: save it
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            // Step 4: close the document, but only if it was actually created
            if (doc != null) {
                try {
                    doc.close();
                } catch (IOException e) {
                    System.out.println(e);
                }
            }
        }
    }
}


Creating a new PDF file (with text):

This section refers to this sample code.

If you read the previous section you will have a fair idea of how to create a PDF file with a blank page. In this section we will discuss how to create a PDF file with some text in it! However, I must admit that some of the classes are pretty new to me, and apart from how to use them I do not know much about them. Hopefully I will understand these classes better in the future; as I stressed earlier, they will make a lot more sense once I learn more from the PDF specification. Anyway, the process is very similar.
  1. Create a PDF file (using PDDocument)
  2. Add a page (using PDPage)
  3. Add contents to the page using PDPageContentStream and its methods mentioned below.
    1. Use beginText().
    2. Set the font and font size to use for your text using setFont().
    3. Specify the text location where the text is to be entered/typed using moveTextPositionByAmount().
    4. Enter the text using drawString().
    5. Use endText().
    6. Use close().
  4. Save the PDF file (using the save method in PDDocument) and
  5. Close it (using the close method in PDDocument again).
Note: Even though the steps involving setFont() and moveTextPositionByAmount() might seem to be optional, they are not. If you skip either or both of these steps you will be left with a PDF file that cannot be opened by the PDF reader. The error message that I got from Adobe Reader was "There was an error opening this document. The file is damaged and could not be repaired."

Refer to the sample code. Here is my code based on the sample code. My apologies for the bad programming style.

/**
 * @(#)PDFWithText.java
 *
 *
 * @author Stephen H
 * @version 1.00 2011/6/21
 */

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
import org.apache.pdfbox.pdmodel.font.PDType1Font;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class PDFWithText {
    public static void main(String[] args) {
        PDDocument doc = null;
        try {
            doc = new PDDocument();
            PDPage page = new PDPage();
            doc.addPage(page);
            PDFont font = PDType1Font.HELVETICA_BOLD;

            PDPageContentStream content = new PDPageContentStream(doc, page);
            content.beginText();
            content.setFont(font, 12);                    // font and font size
            content.moveTextPositionByAmount(100, 700);   // where the text starts
            content.drawString("Hello from www.printmyfolders.com");
            content.endText();
            content.close();                              // finish the page contents before saving

            doc.save("PDFWithText.pdf");
        } catch (Exception e) {
            System.out.println(e);
        } finally {
            // Close the document even if something above failed
            if (doc != null) {
                try {
                    doc.close();
                } catch (Exception e) {
                    System.out.println(e);
                }
            }
        }
    }
}
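
A note on the numbers passed to moveTextPositionByAmount(100, 700) above: PDF coordinates are measured in points (72 points to an inch) from the bottom-left corner of the page, so y grows upwards. The sketch below does that arithmetic in plain Java. PagePosition and its methods are names made up for this example, and the page height of 792 points assumes a US Letter page, which I believe is what new PDPage() gives you in 1.8.x; check the JavaDoc if your page size differs.

```java
/** Hypothetical helper for working out text positions in PDF points. */
final class PagePosition {

    static final float POINTS_PER_INCH = 72f;
    static final float LETTER_HEIGHT = 792f; // assumption: US Letter page, 11 inches tall

    /** Converts a length in inches to PDF points. */
    static float inchesToPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    /** y coordinate (counted from the bottom edge) of a spot a given distance below the top edge. */
    static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // Text starting 1.5 inches from the left edge and 1 inch below the top edge
        float x = inchesToPoints(1.5f);
        float y = yFromTop(LETTER_HEIGHT, inchesToPoints(1f));
        System.out.println("x = " + x + ", y = " + y);
    }
}
```

On such a page, the position (100, 700) used in the code above is roughly 1.4 inches from the left edge and 1.3 inches below the top edge.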


Working with an existing PDF file:

If you are looking at working with existing PDF files, the class that will mean a lot to you will be the PDDocument class again (found in the package org.apache.pdfbox.pdmodel).

The most important method is the static, overloaded method load. Its purpose is to read a PDF file into a PDDocument object. You first load the PDF file using this method and then manipulate it using the other methods of PDDocument. Here are some of the many overloaded load methods.

static PDDocument load(File file) //This will load a document from a file.
static PDDocument load(InputStream input) //This will load a document from an input stream.
static PDDocument load(URL url) //This will load a document from a url.

Refer to the JavaDoc for details of these and other overloaded methods.

You can now use these methods to get more information about the PDF file.

int getNumberOfPages() //Returns the number of pages in the PDF file.
boolean isEncrypted() //Lets you know if the PDF file is encrypted or not.
void print() //Sends the PDF document to a printer.
boolean removePage(int pageNumber) //Removes the page at the given index. As far as I can tell from the JavaDoc, the index here starts at 0.
void save(String fileName) //Saves the PDF file under the file name.
void silentPrint() //This will send the PDF to the default printer without prompting the user for any printer settings.
void close() //This will close the file.

Converting an image to a PDF file:
 
Here is the code to convert an image file to a PDF file.
 
/**
* PDFBoxTest.java
*
*
* @author Stephen H
* @version 1.00 2012/6/2
*/
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.pdmodel.edit.*;
import org.apache.pdfbox.pdmodel.graphics.xobject.*;
import java.io.*;

public class PDFBoxTest {

    // Note that this code works ONLY with jpg files
    public static void main(String[] args) {
        PDDocument doc = null;
        InputStream in = null;
        try {
            /* Step 1: Prepare the document. */
            doc = new PDDocument();
            PDPage page = new PDPage();
            doc.addPage(page);

            /* Step 2: Prepare the image.
             * PDJpeg is the class you use when dealing with jpg images.
             * You need to name the jpg file and the document to which it is to be added.
             * Note that if you complete these steps after creating the content stream, the PDF
             * file created will show an "Out of memory" error.
             */
            in = new FileInputStream("image.jpg");
            PDXObjectImage image = new PDJpeg(doc, in);

            /* Create a content stream, naming the document and the page in the document
             * where the content stream is to be added.
             * Note that this step has to be completed after the above two steps are complete.
             */
            PDPageContentStream content = new PDPageContentStream(doc, page);

            /* Step 3:
             * Add (draw) the image to the content stream at position (20, 20),
             * leaving the size of the image as it is.
             */
            content.drawImage(image, 20, 20);
            content.close();

            /* Step 4: Save the document as a pdf file under the given name. */
            doc.save("ImageNowPdf.pdf");
        } catch (Exception e) {
            System.out.println("Exception: " + e);
        } finally {
            // Release the image file and the document even if a step above failed
            try {
                if (in != null) in.close();
                if (doc != null) doc.close();
            } catch (IOException e) {
                System.out.println(e);
            }
        }
    }
}

 
Extracting text from a PDF file:

If you are interested in extracting text from a PDF file, you will need to learn about the PDFTextStripper class. This is found in the package org.apache.pdfbox.util.

You can use the constructor PDFTextStripper() to create a new object. Be aware that it throws an IOException. Here are some methods to extract text.

String getText(PDDocument doc) //This will return the text of a document. Remember it returns a 'String'
void writeText(PDDocument doc, Writer outputStream) //This will take a PDDocument and write the text of that document to the print writer.

As you will have realised, the PDF file from which the text is to be extracted must already have been loaded into a PDDocument object (refer to the previous sections).

You can also specify the pages that you want to extract.

public void setStartPage(int startPageValue) //Where startPageValue is the starting page. The first page of the PDF is 1, second page is 2 and so on.
public void setEndPage(int endPageValue) //Where endPageValue is the last page that you want to extract. The first page of the PDF is 1, second page is 2 and so on.
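
The sample code on this page asks for pages 3 to 5 without checking how many pages the document has. One way to guard against that is to trim the requested range to the page count before calling setStartPage() and setEndPage(). The helper below is only a sketch of that idea in plain Java; PageRange and clamp are hypothetical names, not part of PDFBox, and pageCount stands in for the value returned by getNumberOfPages().

```java
/** Hypothetical helper: fits a requested page range to the pages a document really has. */
final class PageRange {

    /**
     * Clamps a 1-based, inclusive page range to a document with pageCount pages.
     * Returns {start, end}, or null when no page of the document falls in the range.
     */
    static int[] clamp(int start, int end, int pageCount) {
        int s = Math.max(1, start);
        int e = Math.min(end, pageCount);
        return (s <= e) ? new int[] { s, e } : null;
    }

    public static void main(String[] args) {
        int pageCount = 4; // stand-in for pd.getNumberOfPages()
        int[] range = clamp(3, 5, pageCount);
        if (range == null) {
            System.out.println("Nothing to extract");
        } else {
            // These two values would be passed to setStartPage() and setEndPage()
            System.out.println("Extract pages " + range[0] + " to " + range[1]);
        }
    }
}
```

With a four-page document the program prints "Extract pages 3 to 4".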


Make sure you use the close() method from the PDDocument class to close the PDF file.

void close() //This will close the file.

Here is some sample code that covers most of what we have learnt.

import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;

public class PDFTest {

    public static void main(String[] args) {
        PDDocument pd = null;
        BufferedWriter wr = null;
        try {
            File input = new File("C:\\Invoice.pdf");  // The PDF file from where you would like to extract
            File output = new File("C:\\SampleText.txt"); // The text file where you are going to store the extracted data
            pd = PDDocument.load(input);
            System.out.println(pd.getNumberOfPages());
            System.out.println(pd.isEncrypted());
            pd.save("CopyOfInvoice.pdf"); // Creates a copy called "CopyOfInvoice.pdf"
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(3); // Start extracting from page 3
            stripper.setEndPage(5); // Extract till page 5
            wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
            stripper.writeText(pd, wr);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close both even if extraction failed; closing the writer also flushes it.
            try {
                if (wr != null) wr.close();
                if (pd != null) pd.close();
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

Here is some code to extract phone numbers from a PDF file. Try it out!! If you want to know more about the PDF specification, click here.
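
To give a rough idea of what such code involves: once getText() has handed you the text as a String, picking out phone numbers is ordinary Java string work. The sketch below uses a regular expression and is my own illustration, not the linked code. PhoneNumberFinder is a made-up name, and the pattern assumes numbers written as 555-123-4567 or (555) 987-6543, so adjust it to the formats in your own documents.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: finds phone-number-like strings in extracted text. */
final class PhoneNumberFinder {

    // Assumed format: 3-3-4 digits, separated by hyphen, dot or space,
    // with the first group optionally written in parentheses.
    private static final Pattern PHONE = Pattern.compile(
            "(?<!\\d)(?:\\(\\d{3}\\)\\s?|\\d{3}[-. ])\\d{3}[-. ]\\d{4}(?!\\d)");

    /** Returns every match in the text, in the order it appears. */
    static List<String> find(String text) {
        List<String> found = new ArrayList<String>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // In a real program this String would come from PDFTextStripper.getText(pd)
        String text = "Call 555-123-4567 or (555) 987-6543 about invoice 20110621.";
        System.out.println(find(text));
    }
}
```

The invoice number in the sample text is skipped because it has no separators between its digit groups.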

Installing previous versions of PDFBox:

Version 1.8.4

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.8.4/pdfbox-app-1.8.4.jar

This will take you to a webpage suggesting a site for your download. Download using that link.

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what a sample CLASSPATH entry may look like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.8.4.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Sample code appears earlier on this page. Refer to the FAQs above if you have other problems.


Version 1.8.3  

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.8.3/pdfbox-app-1.8.3.jar

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what a sample CLASSPATH entry may look like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.8.3.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Refer to the FAQs above if you have other problems.


Version 1.8.0

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.8.0/pdfbox-app-1.8.0.jar

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what a sample CLASSPATH entry may look like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.8.0.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Refer to the FAQs above if you have other problems.



Version 1.7.1

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.7.1/pdfbox-app-1.7.1.jar

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what a sample CLASSPATH entry may look like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.7.1.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Refer to the FAQs above if you have other problems.

Version 1.6.0 

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.6.0/pdfbox-app-1.6.0.jar

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.6.0.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Refer to the FAQs above if you have other problems.


Version 1.5.0

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.5.0/pdfbox-app-1.5.0.jar

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.5.0.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Refer to the FAQs above if you have other problems.


Version 1.4.0

1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.4.0/pdfbox-app-1.4.0.jar

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.4.0.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Refer to the FAQs above if you have other problems.


Version 1.3.1


1. Download this jar file http://www.apache.org/dyn/closer.cgi/pdfbox/1.3.1/pdfbox-app-1.3.1.jar

2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.

.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.3.1.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Refer to the FAQs above if you have other problems.


Version 1.0.0 (Refer to next section for Version 1.1.0)

1. Download the following file and extract its contents.

commons-logging-1.1.1.jar
found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0\external

fontbox-1.0.0.jar
found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0\external

jempbox-1.0.0.jar
found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0\external

pdfbox-1.0.0.jar
found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0

2. Add these jars to your CLASSPATH. On my computer, the CLASSPATH is

.;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\external\commons-logging-1.1.1.jar;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\external\fontbox-1.0.0.jar;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\external\jempbox-1.0.0.jar;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\pdfbox-1.0.0.jar;

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

3. You can now write your code. Make sure you import the necessary classes.

Version 1.1.0

1. Download the following files.


2. Extract the contents of commons-logging-1.1.1-bin.zip. The file that we need, "commons-logging-1.1.1.jar", is found under the folder path commons-logging-1.1.1-bin\commons-logging-1.1.1

3. Add the following jars to your CLASSPATH.

commons-logging-1.1.1.jar
pdfbox-1.1.0.jar
fontbox-1.1.0.jar
jempbox-1.1.0.jar

(Note: If you find it difficult to set the CLASSPATH refer to this page.)

4. You can now write your code. Make sure you import the necessary classes.

I love your feedback and suggestions. Please leave a comment below or contact me at steve@printmyfolders.com.
 
All the best!
Steve
