Extracting Phone Numbers from a PDF

Here is a simple program to extract phone numbers from a PDF file.

We assume here that the phone numbers are 10 digits long.

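Here is the code. It uses Apache PDFBox to pull the text out of the PDF and a regular expression to find the numbers: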

// Import statements (PDFBox 1.x package layout; in PDFBox 2.x,
// PDFTextStripper lives in org.apache.pdfbox.text instead)
import java.io.*;
import org.apache.pdfbox.pdmodel.*;
import org.apache.pdfbox.util.*;
import java.util.regex.*;

public class PDFtest {

    public static void main(String[] args) {
        PDDocument pd = null;
        try {
            // PDF file from which the phone numbers are extracted
            File input = new File("C:\\invoice.pdf");

            // StringBuilder to store the extracted text
            StringBuilder sb = new StringBuilder();
            pd = PDDocument.load(input);
            PDFTextStripper stripper = new PDFTextStripper();

            // Add text to the StringBuilder from the PDF
            sb.append(stripper.getText(pd));

            // Regex. For those who do not know: the Pattern is the format you are looking for.
            // In our example, we are looking for runs of exactly 10 digits with whitespace
            // (or the start/end of the text) on both sides. Lookarounds are used instead of \s
            // so the whitespace is not consumed; otherwise the second of two numbers
            // separated by a single space would be skipped.
            Pattern p = Pattern.compile("(?<!\\S)\\d{10}(?!\\S)");

            // Matcher refers to the actual text where the pattern will be found
            Matcher m = p.matcher(sb);

            while (m.find()) {
                // group() returns the text of the match that find() just located
                System.out.println(m.group());
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            // Close the document even if extraction failed
            if (pd != null) {
                try {
                    pd.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}

On my computer, this printed a list of phone numbers. You can modify the pattern to look for other numbers or data.
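
As one illustration of modifying the pattern, here is a minimal sketch that also accepts 10-digit numbers written with a dash, dot, or space between the groups (for example 555-123-4567). It runs on a plain string so it can be tried without PDFBox. The class name PhonePatternDemo, the method findPhoneNumbers, and the sample text are made up for this example, and the 3-3-4 grouping is an assumption about how the numbers are written in your PDF.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of the program above.
class PhonePatternDemo {

    // Three digits, optional separator, three digits, optional separator, four digits.
    // The lookarounds reject matches that sit inside a longer run of digits.
    private static final Pattern PHONE =
            Pattern.compile("(?<!\\d)\\d{3}[-. ]?\\d{3}[-. ]?\\d{4}(?!\\d)");

    // Returns every substring of text that matches the pattern, in order.
    static List<String> findPhoneNumbers(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = PHONE.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    public static void main(String[] args) {
        // Sample text standing in for what PDFTextStripper would return
        String text = "Call 555-123-4567 or 5551234567. Fax: 555.987.6543, ref 12345678901.";
        for (String number : findPhoneNumbers(text)) {
            System.out.println(number);
        }
    }
}
```

To use it with the program above, pass the text returned by stripper.getText(pd) to findPhoneNumbers instead of the sample string. The 11-digit reference number in the sample is skipped because the lookarounds require that no digit comes directly before or after the match.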

I would love your feedback and suggestions. Please leave a comment below or contact me at steve@printmyfolders.com.