This tutorial has the following sections:
PDFBox is an open source Java project that enables the creation and manipulation of PDF (Portable Document Format) files in Java.

1. Installing PDFBox:

Version 1.8.0 (Go to the end of this page for instructions on installing previous versions.)

2. FAQs:

1. Where can I find more examples or sample code for PDFBox?

Source 1 (the best): http://pdfbox.apache.org/userguide/cookbook.html

Source 2: Download the source code for the PDFBox version you are after from the following locations. Note: If you have any problems with the above links you can look under http://archive.apache.org/dist/pdfbox/

Extract the files and look under the following location:
Ver 1.8.0 - "folder where extracted\pdfbox-1.8.0\pdfbox\src\main\java\org\apache\pdfbox" (Note there are 22 files).

2. Where can I get the source code?

Version 1.7.1 - http://www.apache.org/dyn/closer.cgi/pdfbox/1.7.1/pdfbox-1.7.1-src.zip
Version 1.6.0 - http://www.apache.org/dyn/closer.cgi/pdfbox/1.6.0/pdfbox-1.6.0-src.zip
Version 1.5.0 - http://www.apache.org/dyn/closer.cgi/pdfbox/1.5.0/pdfbox-1.5.0-src.zip

3. Where can I download the JavaDoc for PDFBox?

Version 1.8.0 - http://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/1.8.0/pdfbox-1.8.0-javadoc.jar
Version 1.7.1 - http://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/1.7.1/pdfbox-1.7.1-javadoc.jar
Version 1.6.0 - http://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/1.6.0/pdfbox-1.6.0-javadoc.jar
Version 1.5.0 - http://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/1.5.0/pdfbox-1.5.0-javadoc.jar
Version 1.4.0 - http://repo1.maven.org/maven2/org/apache/pdfbox/pdfbox/1.4.0/pdfbox-1.4.0-javadoc.jar
(Thanks to the PDFBox mailing list)

You can extract the documentation from the jar file and go through these files to understand PDFBox better. You can extract the contents of a jar file using the jar command from the command prompt, for example: jar xf pdfbox-1.8.0-javadoc.jar (Refer to this link on how to extract contents from a jar file.)

4. Where can I get up-to-date information/help etc.?

Apart from doing your own research you can join the PDFBox mailing list, where you can ask questions and see the answers to problems others are facing. It's worth joining. Please note that when you send a question to the mailing list, the reply is usually sent only to subscribers of the list. If you have not joined the mailing list, you might be under the impression that your question was not answered when in fact the answer was sent to everyone on the list! To join the mailing list refer to http://pdfbox.apache.org/mail-lists.html.

5. I have problems configuring my computer to compile or run Java programs. Are there any resources to solve this problem?

If you have problems configuring Java programs to compile or run on your computer, refer to this page. For CLASSPATH problems refer to this page.

6. I have included commons-logging-1.1.1.jar in my CLASSPATH but I am still getting compiler errors (applies to certain previous versions only; look at the installation tips at the end of this page).

I have had problems setting up my CLASSPATH too. First, you must prefix your CLASSPATH with .; (a dot followed by a semicolon). Otherwise you will face problems when you try to compile or run your program from the command prompt. Look at my CLASSPATH in the installation tips for version 1.0.0. Second, commons-logging-1.1.1.jar has to appear before pdfbox-1.0.0.jar. My Java code would not execute when it was the other way around. Here is a link that gives you a good idea of how to set your CLASSPATH. It refers to a different situation but gives you the basic idea.

7. Are there any conditions regarding which jars have to be added and which need not be (applies to version 1.1.0)?

In addition to pdfbox-1.1.0.jar, the following are mandatory:
commons-logging-1.1.1.jar
fontbox-1.1.0.jar
jempbox-1.1.0.jar
Take a look at http://pdfbox.apache.org/dependencies.html.
This lists the optional and mandatory jar files.

3. Programming with PDFBox:

As my knowledge of PDFBox is limited you may not get the answers that you are looking for, but I hope to keep updating this page as I learn.

Now that you have installed PDFBox you might wonder what to do next. Try running the sample code you have downloaded (refer to FAQ 1 above). Next, try to get a basic understanding of the PDF specification. All PDF files are supposed to follow a defined format, and understanding the specification helps you design and write better code. As you go through the PDF specification you will understand PDFBox better (for example, why certain classes and processes exist). I am attempting to summarize the PDF specification at this link, but that effort is still in progress. Have a look at it though, and refer to the further reliable resources mentioned on that page. Finally, write lots of code.

Creating a new PDF file (just a blank page):

This section refers to this sample code. If you have been through the FAQs and looked at the sample code you will have a basic idea of how PDFBox works. PDFBox treats the class PDDocument, found in the package org.apache.pdfbox.pdmodel, as the equivalent of a PDF file. If you are opening or creating a PDF file you will be working with this class. Open the JavaDoc (refer to FAQ 3) and have a good look at the methods found in this class.

A PDF file generally consists of one or more pages. It is safe to assume that every PDF file has at least one page, so if you are creating a PDF file you will need at least one page. The class that represents a page is PDPage (again found in the same pdmodel package). Here are the steps to create a Blank PDF file.
Look at the sample code. The code is very simple to understand. After looking at it, attempt to create your first blank PDF!! Here is my code, based on the sample code, to create a Blank PDF file. The programming style is not great, but it gave me great satisfaction upon execution.

    /**
     * @(#)BlankPDF.java
     *
     * @author Stephen H
     * @version 1.00 2011/6/20
     */
    import org.apache.pdfbox.pdmodel.*;
    import java.io.*;

    public class BlankPDF {
        public static void main(String[] args) {
            PDDocument doc = null;
            try {
                // Create the document and give it a single empty page.
                doc = new PDDocument();
                doc.addPage(new PDPage());
                doc.save("Empty PDF.pdf");
            } catch (Exception e) {
                System.out.println(e);
            } finally {
                // Close the document only if it was actually created.
                if (doc != null) {
                    try {
                        doc.close();
                    } catch (IOException ie) {
                        System.out.println(ie);
                    }
                }
            }
        }
    }

Creating a new PDF file (with text):

This section refers to this sample code. If you read the previous section you will have a fair idea of how to create a PDF file with a blank page. In this section we will discuss how to create a PDF file with some text in it! However, I must admit that some of the classes are pretty new to me, and apart from how we use them, I do not know much about them. Hopefully I will understand these classes better in the future. As I stressed earlier, they will make a lot more sense once I learn more from the PDF specification. Anyway, the process is very similar.
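Before the code, a quick sketch of where the text will land may help. PDF page coordinates start at the bottom-left corner of the page and are measured in points (1/72 of an inch), so a call such as moveTextPositionByAmount(100, 700) places text 100 points from the left edge and 700 points up from the bottom. The sketch below uses plain Java only (no PDFBox calls). The class PageCoordinates and its methods are hypothetical names made up for this illustration, and it assumes a US Letter page of 612 x 792 points.

```java
/**
 * Illustration only: converts "distance from the top-left corner" into
 * PDF coordinates. The class and method names are hypothetical, and the
 * page size is an assumption (US Letter, 612 x 792 points).
 */
public class PageCoordinates {
    public static final float POINTS_PER_INCH = 72f;
    public static final float LETTER_WIDTH = 612f;   // assumed page width in points
    public static final float LETTER_HEIGHT = 792f;  // assumed page height in points

    /** Converts a length in inches to PDF points. */
    public static float inchesToPoints(float inches) {
        return inches * POINTS_PER_INCH;
    }

    /** Converts a distance measured down from the top edge into a PDF y coordinate. */
    public static float yFromTop(float pageHeight, float distanceFromTop) {
        return pageHeight - distanceFromTop;
    }

    public static void main(String[] args) {
        // A point one inch in from the left edge and one inch down from the top edge.
        float x = inchesToPoints(1.0f);
        float y = yFromTop(LETTER_HEIGHT, inchesToPoints(1.0f));
        System.out.println("x = " + x + ", y = " + y);
    }
}
```

Under these assumptions, the position (100, 700) used in this section sits 92 points (roughly 1.28 inches) below the top edge of the page.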
Note: Even though the steps involving setFont() and moveTextPositionByAmount() might seem to be optional, they are not. If you skip either or both of these steps you will be left with a PDF file that cannot be opened by the PDF reader. The error message that I got from Adobe Reader was "There was an error opening this document. The file is damaged and could not be repaired."

Refer to the sample code. Here is my code based on the sample code. My apologies for the bad programming style.

    /**
     * @(#)PDFWithText.java
     *
     * @author Stephen H
     * @version 1.00 2011/6/21
     */
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.PDPage;
    import org.apache.pdfbox.pdmodel.edit.PDPageContentStream;
    import org.apache.pdfbox.pdmodel.font.PDType1Font;
    import org.apache.pdfbox.pdmodel.font.PDFont;

    public class PDFWithText {
        public static void main(String[] args) {
            PDDocument doc = null;
            PDPage page = null;
            try {
                // Create the document and add one page to it.
                doc = new PDDocument();
                page = new PDPage();
                doc.addPage(page);

                PDFont font = PDType1Font.HELVETICA_BOLD;

                // Write the text into the page's content stream.
                PDPageContentStream content = new PDPageContentStream(doc, page);
                content.beginText();
                content.setFont(font, 12);                  // required
                content.moveTextPositionByAmount(100, 700); // required
                content.drawString("Hello from www.printmyfolders.com");
                content.endText();
                content.close();

                doc.save("PDFWithText.pdf");
                doc.close();
            } catch (Exception e) {
                System.out.println(e);
            }
        }
    }

Working with an existing PDF file:

If you want to work with existing PDF files, the class that matters most is again PDDocument (found in the package org.apache.pdfbox.pdmodel). Its most important method is the static overloaded method load, which reads a PDF file and returns it as a PDDocument object. You first load the PDF file using this method and then manipulate the PDF using the other methods of PDDocument. Out of the many overloaded load methods, here are some.
    static PDDocument load(File file)          // This will load a document from a file.
    static PDDocument load(InputStream input)  // This will load a document from an input stream.
    static PDDocument load(URL url)            // This will load a document from a url.

Refer to the JavaDoc for details of these and other overloaded methods. Once the document is loaded, you can use the following methods to get more information about the PDF file or to act on it.

    int getNumberOfPages()              // Returns the number of pages in the PDF file.
    boolean isEncrypted()               // Lets you know if the PDF file is encrypted or not.
    void print()                        // Sends the PDF document to a printer.
    boolean removePage(int pageNumber)  // Removes the page referred to by the page number.
    void save(String fileName)          // Saves the PDF file under the file name.
    void silentPrint()                  // Sends the PDF to the default printer without prompting the user for any printer settings.

Converting an image to a PDF file:

Here is the code to convert an image file to a PDF file.

    /**
     * PDFBoxTest.java
     *
     * @author Stephen H
     * @version 1.00 2012/6/2
     */
    import org.apache.pdfbox.pdmodel.*;
    import org.apache.pdfbox.pdmodel.edit.*;
    import org.apache.pdfbox.pdmodel.graphics.xobject.*;
    import java.io.*;

    public class PDFBoxTest {
        // Note that this code works ONLY with jpg files
        public static void main(String[] args) {
            PDDocument doc = null;
            try {
                /* Step 1: Prepare the document. */
                doc = new PDDocument();
                PDPage page = new PDPage();
                doc.addPage(page);

                /* Step 2: Prepare the image.
                 * PDJpeg is the class you use when dealing with jpg images.
                 * You need to pass the document the image is added to and the jpg file.
                 * Note that if you do this after creating the content stream, the
                 * PDF file created will show an "Out of memory" error.
                 */
                PDXObjectImage image = new PDJpeg(doc, new FileInputStream("image.jpg"));

                /* Create a content stream for the document and the page in the
                 * document where the content is to be added.
                 * Note that this has to be done after the above two steps are complete.
                 */
                PDPageContentStream content = new PDPageContentStream(doc, page);

                /* Step 3: Add (draw) the image to the content stream, giving the
                 * position where it should be drawn and leaving its size as it is.
                 */
                content.drawImage(image, 20, 20);
                content.close();

                /* Step 4: Save the document as a pdf file under the given name,
                 * then close it.
                 */
                doc.save("ImageNowPdf.pdf");
                doc.close();
            } catch (Exception e) {
                // Print the actual exception so the cause is visible.
                System.out.println(e);
            }
        }
    }

Extracting text from a PDF file:
If you are interested in extracting text from a PDF file, you will need to learn about the PDFTextStripper class. It is found in the package org.apache.pdfbox.util. You can use the constructor PDFTextStripper() to create a new object. Be aware that it throws an IOException. Here are some methods to extract text.

    String getText(PDDocument doc)                      // Returns the text of a document as a String.
    void writeText(PDDocument doc, Writer outputStream) // Takes a PDDocument and writes the text of that document to the writer.

As you will have realised, the PDF file from which the text is to be extracted must already have been loaded into a PDDocument object (refer to the previous sections). Make sure you use the close() method of the PDDocument class to close the PDF file.

    void close() // This will close the file.

Here is some sample code that covers most of what we have learnt.

    import java.io.*;
    import org.apache.pdfbox.pdmodel.*;
    import org.apache.pdfbox.util.*;

    public class PDFTest {
        public static void main(String[] args) {
            PDDocument pd = null;
            BufferedWriter wr = null;
            try {
                File input = new File("C:\\Invoice.pdf");      // The PDF file from which you would like to extract text
                File output = new File("C:\\SampleText.txt");  // The text file where the extracted text is stored
                pd = PDDocument.load(input);
                System.out.println(pd.getNumberOfPages());
                System.out.println(pd.isEncrypted());
                pd.save("CopyOfInvoice.pdf");                  // Creates a copy called "CopyOfInvoice.pdf"
                PDFTextStripper stripper = new PDFTextStripper();
                wr = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(output)));
                stripper.writeText(pd, wr);
            } catch (Exception e) {
                e.printStackTrace();
            } finally {
                // Closing the writer also flushes it; close the document as well.
                try {
                    if (wr != null) {
                        wr.close();
                    }
                    if (pd != null) {
                        pd.close();
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }

Here is some code to extract phone numbers from a PDF file. Try it out!!

If you want to know more about the PDF specification, click here.

Installing previous versions of PDFBox:

Version 1.7.1
1. Download this jar file
3. You can now write your code. Refer to the FAQs above if you have other problems.

Version 1.6.0
1. Download this jar file. http://www.apache.org/dyn/closer.cgi/pdfbox/1.6.0/pdfbox-app-1.6.0.jar
2. Add this jar file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.
.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.6.0.jar
(Note: If you find it difficult to set the CLASSPATH refer to this page.)
3. You can now write your code. Refer to the FAQs above if you have other problems.

Version 1.5.0
1. Download this file. http://www.apache.org/dyn/closer.cgi/pdfbox/1.5.0/pdfbox-app-1.5.0.jar
2. Add the file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.
.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.5.0.jar
(Note: If you find it difficult to set the CLASSPATH refer to this page.)
3. You can now write your code. Refer to the FAQs above if you have other problems.

Version 1.4.0
1. Download this file. http://www.apache.org/dyn/closer.cgi/pdfbox/1.4.0/pdfbox-app-1.4.0.jar
2. Add the file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.
.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.4.0.jar
(Note: If you find it difficult to set the CLASSPATH refer to this page.)
3. You can now write your code. Refer to the FAQs above if you have other problems.

Version 1.3.1
1. Download this file. http://www.apache.org/dyn/closer.cgi/pdfbox/1.3.1/pdfbox-app-1.3.1.jar
2. Add the file with its full path to your CLASSPATH. Make sure your CLASSPATH has a .; added before your entry. Here is what my CLASSPATH entry looks like.
.;C:\Documents and Settings\user\My Documents\Downloads\pdfbox-app-1.3.1.jar
(Note: If you find it difficult to set the CLASSPATH refer to this page.)
3. You can now write your code. Refer to the FAQs above if you have other problems.

Version 1.0.0 (Refer to the next section for Version 1.1.0)
1. Download the following file and extract its contents.
2. Add the following jars to your CLASSPATH.
commons-logging-1.1.1.jar found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0\external
fontbox-1.0.0.jar found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0\external
jempbox-1.0.0.jar found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0\external
pdfbox-1.0.0.jar found under folder path pdfbox-1.0.0-bin\pdfbox-1.0.0
On my computer, the CLASSPATH is
.;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\external\commons-logging-1.1.1.jar;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\external\fontbox-1.0.0.jar;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\external\jempbox-1.0.0.jar;C:\Documents and Settings\Stephen\My Documents\pdfbox-1.0.0-bin\pdfbox-1.0.0\pdfbox-1.0.0.jar;
(Note: If you find it difficult to set the CLASSPATH refer to this page.)
3. You can now write your code. Make sure you import the necessary classes.

Version 1.1.0
1. Download the following files.
http://apache.mirror.aussiehq.net.au//commons/logging/binaries/commons-logging-1.1.1-bin.zip
http://www.apache.org/dyn/closer.cgi/pdfbox/1.1.0/pdfbox-1.1.0.jar
http://www.apache.org/dyn/closer.cgi/pdfbox/1.1.0/fontbox-1.1.0.jar
http://www.apache.org/dyn/closer.cgi/pdfbox/1.1.0/jempbox-1.1.0.jar
2. Extract the contents of commons-logging-1.1.1-bin.zip. The file that we need, "commons-logging-1.1.1.jar", is found under the folder path commons-logging-1.1.1-bin\commons-logging-1.1.1
3. Add the following jars to your CLASSPATH.
commons-logging-1.1.1.jar
pdfbox-1.1.0.jar
fontbox-1.1.0.jar
jempbox-1.1.0.jar
(Note: If you find it difficult to set the CLASSPATH refer to this page.)
4. You can now write your code. Make sure you import the necessary classes.

Feedback and suggestions welcome at steve@printmyfolders.com

All the best!
Steve