Search PDFs With PHP, MySQL, and PdfToText

Being able to search a PDF is a very useful feature on any web site.  The problem is that there aren’t many languages that give you the tools to do so right out of the box.  PHP is no exception to this.  If you want to search PDF files you’ll need some third-party tools and a little bit of ingenuity.

Pre-requisites

Your server will need to have the following configuration.

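  • PHP (>=4.3 and <7.0; file_get_contents and mysql_real_escape_string arrived in 4.3, and the mysql_* functions used below were removed in PHP 7)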
  • MySQL (>=4)
  • Linux (Distro of your choice)

Step 1:  Download PdfToText

PdfToText (pdftotext) is a command-line program from the Xpdf project that quickly converts the contents of a PDF to text.  We’re going to use it just for that purpose.  You can download the file at http://www.foolabs.com/xpdf/download.html.  Once you have downloaded the file, go ahead and place it somewhere in your web site directory and extract it (on most Linux systems “tar -xzf [file]” will do the trick).  Once it’s extracted, you’ll see a program called “pdftotext”, which is what we’re after.

Step 2:  Convert the PDF to Text

As an astute reader, you’ve probably noticed by now that PdfToText is not a PHP file.  So how are we going to use it?  Well, we’re going to use the “backtick” operator (the ` character, which shares a key with ~ [tilde] on most US keyboards).

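function convert_to_text($pdf) {
     // Escape the path so spaces or shell metacharacters cannot alter the command.
     $pdf = escapeshellarg($pdf);
     // pdftotext writes the extracted text to temp.txt, so $output is normally empty.
     $output = `./pdftotext {$pdf} temp.txt`;
     return $output;
}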

The backtick operator executes a command on the command line, traps its output, and returns it to the caller.  It’s worth noting that the backtick operator only returns output from standard out, so anything the command writes to standard error (error messages, for example) is not captured.
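
To see the standard-out behavior for yourself, here is a small sketch.  It assumes a Unix-like shell where echo is available, and the echo commands are stand-ins rather than part of this tutorial’s code.  shell_exec() is the function form of the backtick operator.

```php
<?php
// The backtick operator is equivalent to calling shell_exec().
$stdout_only = shell_exec('echo converted');

// This command writes only to standard error, so nothing is captured.
$stderr_lost = shell_exec('echo oops 1>&2');

// Wrapping the command and adding 2>&1 redirects standard error into
// standard out, so the message is captured after all.
$stderr_kept = shell_exec('(echo oops 1>&2) 2>&1');
```

Adding 2>&1 to the end of the pdftotext command is one way to get its error messages back into PHP while you are debugging.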

This is probably the hardest part of this tutorial.  The user your web server runs as needs permission to execute pdftotext and to write to the directory where temp.txt is created, so you may run into permission or ownership problems.  Once the conversion works, you’re all set.
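
If the conversion fails silently, checking permissions from PHP can narrow the problem down.  The sketch below is one possible approach, not the only one.  The function names conversion_problems and convert_checked are made up for this example, and it assumes pdftotext exits with status 0 on success.

```php
<?php
// Hypothetical helper: lists permission problems that would stop the conversion.
function conversion_problems($binary, $pdf, $outputFile) {
    $problems = array();
    if (!is_executable($binary)) {
        $problems[] = "Cannot execute {$binary}";
    }
    if (!is_readable($pdf)) {
        $problems[] = "Cannot read {$pdf}";
    }
    if (!is_writable(dirname($outputFile))) {
        $problems[] = "Cannot write to " . dirname($outputFile);
    }
    return $problems;
}

// Hypothetical wrapper: converts only when the checks pass and reports
// a non-zero exit status instead of failing silently.
function convert_checked($binary, $pdf, $outputFile) {
    $problems = conversion_problems($binary, $pdf, $outputFile);
    if (count($problems) > 0) {
        return $problems;
    }
    $command = escapeshellarg($binary) . ' ' . escapeshellarg($pdf)
             . ' ' . escapeshellarg($outputFile) . ' 2>&1';
    exec($command, $lines, $status);
    if ($status !== 0) {
        return array("pdftotext exited with status {$status}: " . implode(' ', $lines));
    }
    return array();  // An empty array means the conversion succeeded.
}
```

Calling convert_checked('./pdftotext', 'manual.pdf', 'temp.txt') would return an empty array on success, or a list of messages describing what went wrong (manual.pdf is just an example file name).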

Step 3:  Read the Text

Now that the PDF has been converted to a text file, we need to get that information back into PHP.  To do that, we use the file_get_contents function.

function get_text() {
     $text = file_get_contents("temp.txt");
     return $text;
}
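
A fixed name like temp.txt can be overwritten if two conversions run at the same time.  Here is a minimal sketch of one way around that, assuming you pass a unique file name to the conversion step.  The function read_converted_text is a hypothetical variant of get_text, and file_put_contents stands in for pdftotext’s output.

```php
<?php
// Hypothetical variant of get_text() that takes the file name as a
// parameter, reports read failures, and removes the temporary file.
function read_converted_text($path) {
    if (!is_readable($path)) {
        return false;
    }
    $text = file_get_contents($path);
    if ($text !== false) {
        unlink($path);  // The text is in memory now, so the file can go.
    }
    return $text;
}

// Usage: tempnam() creates a uniquely named file, so simultaneous
// conversions do not overwrite each other's output.
$path = tempnam(sys_get_temp_dir(), 'pdf');
file_put_contents($path, "Text extracted from a PDF\n");  // Stand-in for pdftotext's output.
$text = read_converted_text($path);
```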

Step 4:  Store the Data

This part of the tutorial assumes three things:  1) that you have an open MySQL connection (mysql_real_escape_string needs one), 2) that you have a table named pdf_data, and 3) that the table has a column called pdf_contents that is full-text searchable, meaning it has a FULLTEXT index (if you need help setting this sort of thing up, leave a comment).

function store_data() {
     $text = mysql_real_escape_string(get_text());
     $query = "INSERT INTO pdf_data (pdf_contents) VALUES ('{$text}')";
     mysql_query($query);
}
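
For reference, here is a rough sketch of what the table and a safer insert might look like.  It is an illustration under assumptions, not the only way to do it:  the id column, the LONGTEXT type, and the MyISAM engine are my choices (only pdf_data and pdf_contents come from this tutorial), and the PDO version needs PHP 5.1 or later.  The names pdf_table_sql and store_text are made up for the example.

```php
<?php
// Assumed schema: the id column, column type, and engine are illustrative.
// The FULLTEXT index is what makes pdf_contents full-text searchable.
function pdf_table_sql() {
    return "CREATE TABLE pdf_data (
        id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
        pdf_contents LONGTEXT NOT NULL,
        FULLTEXT (pdf_contents)
    ) ENGINE=MyISAM";
}

// Hypothetical alternative to store_data(): a PDO prepared statement
// sends the text separately from the SQL, so no manual escaping is needed.
function store_text(PDO $db, $text) {
    $statement = $db->prepare("INSERT INTO pdf_data (pdf_contents) VALUES (?)");
    return $statement->execute(array($text));
}

// Usage (the DSN and credentials are placeholders):
// $db = new PDO('mysql:host=localhost;dbname=example', 'user', 'password');
// $db->exec(pdf_table_sql());
// store_text($db, get_text());
```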

Step 5:  Search the Data

The final step is actually searching the data.  To do that, we’ll use the full-text searching capability of MySQL.

function search_data($term) {
     $term = mysql_real_escape_string($term);
     $query = "SELECT * FROM pdf_data WHERE MATCH(pdf_contents) AGAINST ('$term')";
     $result = mysql_query($query);
     while($row = mysql_fetch_array($result)) {
          //Do stuff with returned data.
     }
}

Where the “Do stuff with returned data” comment is, you can do whatever you want with each row.  When MATCH() is used in a WHERE clause without an ORDER BY, MySQL returns the rows in order of relevance (descending).  The most relevant result will be first, followed by the second most relevant, the third most relevant, and so on.
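
If you want to see the relevance value itself, you can select it as a column.  The following is a sketch under a few assumptions:  it uses PDO (PHP 5.1 or later) rather than the mysql_* functions, it assumes the table has an id column (this tutorial only names pdf_contents), and search_sql and search_pdfs are hypothetical names.

```php
<?php
// Builds a query that returns each row's relevance as "score".
function search_sql() {
    return "SELECT id, MATCH(pdf_contents) AGAINST (?) AS score
            FROM pdf_data
            WHERE MATCH(pdf_contents) AGAINST (?)
            ORDER BY score DESC";
}

// Hypothetical alternative to search_data() using a prepared statement.
function search_pdfs(PDO $db, $term) {
    $statement = $db->prepare(search_sql());
    // The term is bound twice: once for the score, once for the filter.
    $statement->execute(array($term, $term));
    return $statement->fetchAll(PDO::FETCH_ASSOC);
}

// Usage, given an open PDO connection in $db:
// foreach (search_pdfs($db, 'invoice') as $row) {
//     echo $row['id'], ': ', $row['score'], "\n";
// }
```

The explicit ORDER BY makes the ordering visible in the query instead of relying on the default behavior.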

Other Notes

  • PdfToText may or may not be the best way to do this, but it is one of the simplest.  There are a handful of libraries out there for creating PDFs in PHP, but surprisingly few for something as common as reading a PDF.
  • There are binaries and source files available for PdfToText on their web site (see the download link in Step 1).
  • This tutorial could be expanded a lot.  If you have questions or requests, please ask!