Extracting Phone Numbers from a PDF

Here is a simple Java program that uses Apache PDFBox to extract phone numbers from a PDF file.

We assume here that each phone number is written as 10 consecutive digits with whitespace on either side.
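
To make that assumption concrete, here is a small sketch that applies a 10-digit pattern to a plain string, with no PDF involved. The class name, method name, and sample text are made up for illustration. The pattern uses lookarounds, so the surrounding whitespace is checked but not included in the match.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of the PDF program.
public class TenDigitDemo {
    // Exactly 10 digits, bounded by whitespace or by the start/end of the text.
    private static final Pattern TEN_DIGITS =
            Pattern.compile("(?<!\\S)\\d{10}(?!\\S)");

    public static List<String> findTenDigitNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = TEN_DIGITS.matcher(text);
        while (m.find()) {
            // group() returns the text of the current match
            numbers.add(m.group());
        }
        return numbers;
    }

    public static void main(String[] args) {
        // The 11-digit number is skipped because it is not exactly 10 digits.
        String sample = "Call 5551234567 or 12345678901 now";
        System.out.println(findTenDigitNumbers(sample));
    }
}
```

With this sample, only 5551234567 is reported. A number glued to other text, such as "id:5551234567", would not be matched either.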

Here is the code:

// Import statements
import java.io.*;
import java.util.regex.*;
import org.apache.pdfbox.pdmodel.*;
// PDFBox 1.x package; in PDFBox 2.x, PDFTextStripper moved to org.apache.pdfbox.text
import org.apache.pdfbox.util.*;

public class PDFtest {
    public static void main(String[] args) {
        PDDocument pd = null;
        try {
            // PDF file from which the phone numbers are extracted
            File input = new File("C:\\invoice.pdf");
            pd = PDDocument.load(input);

            // Extract all the text from the PDF
            PDFTextStripper stripper = new PDFTextStripper();
            String text = stripper.getText(pd);

            // Regex, for those who do not know: the Pattern describes the format you are looking for.
            // In our example, we are looking for exactly 10 digits bounded by whitespace
            // (or by the start or end of the text). The lookarounds check the whitespace
            // without consuming it, so two numbers separated by a single space are both found.
            Pattern p = Pattern.compile("(?<!\\S)\\d{10}(?!\\S)");

            // The Matcher applies the pattern to the extracted text
            Matcher m = p.matcher(text);

            while (m.find()) {
                // group() returns the text of the current match
                System.out.println(m.group());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the document even if loading or extraction failed
            if (pd != null) {
                try {
                    pd.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
 
On my computer, this printed a list of phone numbers. You can modify the pattern to look for other numbers or data.
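
As one example of modifying the pattern, here is a sketch that also accepts numbers written with common separators, such as (555) 123-4567 or 555.987.6543, and reduces each match to its 10 digits. The class name, method name, and the set of accepted separators are my own assumptions for illustration; real documents may need a different pattern.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class showing one possible pattern variation.
public class FormattedPhoneDemo {
    // Optional parentheses around the first 3 digits, then 3 and 4 digits,
    // each group optionally separated by one space, dot, or hyphen.
    // The lookarounds keep the match from starting or ending inside a longer digit run.
    private static final Pattern PHONE = Pattern.compile(
            "(?<!\\d)\\(?\\d{3}\\)?[\\s.-]?\\d{3}[\\s.-]?\\d{4}(?!\\d)");

    public static List<String> findPhoneNumbers(String text) {
        List<String> numbers = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            // Strip everything that is not a digit so all results look alike
            numbers.add(m.group().replaceAll("\\D", ""));
        }
        return numbers;
    }

    public static void main(String[] args) {
        String sample = "Office: (555) 123-4567, cell 555.987.6543, fax 5550001111.";
        for (String number : findPhoneNumbers(sample)) {
            System.out.println(number);
        }
    }
}
```

To use it with the PDF program, you would pass the text returned by stripper.getText(pd) to findPhoneNumbers instead of the sample string.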

I would love your feedback and suggestions. Please leave a comment below or contact me at steve@printmyfolders.com.
