Creating a simple search algorithm with PHP and MySQL

Search is an important feature on a website. When my few readers want to look for a particular passage on my blog, they use the search box. It used to be powered by Google Search, but I have since replaced it with my own home-brewed version, not because I can do better but because it was an interesting challenge.

If you are in a hurry and just want your site to be searchable, do what I did before: use Google.

<form action="/search.php" method="get">
    <input type="text" name="query" placeholder="Search term"/>
    <input type="submit" value="Search"/>
</form>

// In search.php file

$term = isset($_GET['query'])?$_GET['query']: '';
$term = urlencode($term);
$website = urlencode("www.yourwebsite.com");
$redirect = "https://www.google.com/search?q=site%3A{$website}+{$term}";
header("Location: $redirect");
exit;

What it does is pretty simple: it takes the term passed by the user and forwards it to the Google search page, limiting the results to our own domain with the site: operator in the search query. All your pages that are indexed by Google are now available through search. If you want to handle your search in-house, however, keep reading.
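To make the redirect concrete, here is a small sketch that builds the same kind of URL and prints it instead of redirecting. The function name buildSiteSearchUrl is my own invention for this example, and www.yourwebsite.com is the same placeholder domain used above.

```php
<?php
// Hypothetical helper: builds the Google URL that search.php redirects to.
function buildSiteSearchUrl(string $term, string $website): string
{
    // urlencode() turns spaces into "+" and escapes reserved characters like ":".
    $query = urlencode("site:{$website} {$term}");
    return "https://www.google.com/search?q={$query}";
}

echo buildSiteSearchUrl("php search", "www.yourwebsite.com"), PHP_EOL;
// https://www.google.com/search?q=site%3Awww.yourwebsite.com+php+search
```

Encoding the whole query in one go gives the same result as encoding the domain and the term separately, as long as both end up joined by a "+".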

Homemade Search Solution

Before we go any further, try using the search box on this blog. It uses the same process that I will describe below. If you feel that this is what you want then please continue reading.

This solution is aimed at small websites. I use LIKE with wildcards on both ends of the term, which means MySQL cannot use an index and has to scan every row. The solution will work fine for a blog or personal website that doesn't contain tons of data, but port it to a bigger website and it might become very slow. MySQL also offers Full-Text Search, which is not what we are doing here.

Note: If you have 5000 blog posts you are still fine. Computers are super fast.

We will take the structure of this blog as a reference. Each blog post has a title, a summary, the content, one or more categories, and a URL.

For every field that matches our search term, we will assign a score based on the importance of the match:

// the exact search term is found in the title
$scoreFullTitle = 6;

// a keyword is found in the title
$scoreTitleKeyword = 5;

// the exact search term is found in the summary
$scoreFullSummary = 5;

// a keyword is found in the summary
$scoreSummaryKeyword = 4;

// the exact search term is found in the content
$scoreFullDocument = 4;

// a keyword is found in the content
$scoreDocumentKeyword = 3;

// a keyword matches a category
$scoreCategoryKeyword = 2;

// a keyword is found in the URL
$scoreUrlKeyword = 1;

Before we get started, there are a few words that do not contribute much to a search and should be removed, for example "in", "it", "a", "the", "of". We will filter those out; feel free to add any word you think is irrelevant. We also want to limit the length of the query. We don't want a user to write a novel in the search field and crash our MySQL server.

// Remove unnecessary words from the search term and return the rest as an array
function filterSearchKeys($query){
    $query = trim(preg_replace("/\s+/", " ", $query));
    $words = array();
    // expand this list with your words (keep the entries lowercase).
    $list = array("in","it","a","the","of","or","i","you","he","me","us","they","she","to","but","that","this","those","then");
    foreach(explode(" ", $query) as $key){
        // skip empty strings and stop words, ignoring case
        if ($key === "" || in_array(strtolower($key), $list)){
            continue;
        }
        $words[] = $key;
        // keep at most 15 keywords
        if (count($words) >= 15){
            break;
        }
    }
    return $words;
}

// Limit the query to a maximum number of characters
function limitChars($query, $limit = 200){
    // mb_substr avoids cutting a multibyte character in half
    return mb_substr($query, 0, $limit);
}

Our helper functions can now limit the character count and filter out useless words. The way we will implement our algorithm is by giving a score every time we find a match. We will test each match with MySQL's IF() function and accumulate points as more words match. At the end, we can use that score to sort our results.
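Before moving to SQL, it may help to see the scoring idea on a plain PHP array. This is only a sketch: the function name scorePost and the array keys title, summary and content are assumptions made for the example, and the category and URL scores are left out to keep it short. The real search does the same accumulation inside the MySQL query.

```php
<?php
// Illustrative only: the article's weights applied to a PHP array
// instead of a MySQL row. Field names are assumptions for this sketch.
function scorePost(array $post, string $query, array $keywords): int
{
    $weights = [
        // field => [full-term score, per-keyword score]
        'title'   => [6, 5],
        'summary' => [5, 4],
        'content' => [4, 3],
    ];
    $score = 0;
    foreach ($weights as $field => [$fullScore, $keywordScore]) {
        $text = $post[$field] ?? '';
        // The full term only counts when the query has several keywords.
        if (count($keywords) > 1 && stripos($text, $query) !== false) {
            $score += $fullScore;
        }
        // Every keyword found in the field adds its own points.
        foreach ($keywords as $keyword) {
            if (stripos($text, $keyword) !== false) {
                $score += $keywordScore;
            }
        }
    }
    return $score;
}

$post = [
    'title'   => 'Simple search with PHP',
    'summary' => 'A home-brewed search box',
    'content' => 'We score every match and sort by relevance.',
];
echo scorePost($post, 'simple search', ['simple', 'search']), PHP_EOL;
// 20: title gives 6 + 5 + 5, summary gives 4, content gives 0
```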

Note: I will not be showing how to connect to a MySQL database. If you are having trouble connecting to the database efficiently, I recommend reading this short post about database abstraction.

Let's give our function a structure first. Note that I left placeholders so we can implement the sections separately.

function search($query){

    $query = trim($query);
    if (mb_strlen($query)===0){
        // no need for empty search right?
        return false; 
    }
    $query = limitChars($query);

    // Weighing scores
    $scoreFullTitle = 6;
    $scoreTitleKeyword = 5;
    $scoreFullSummary = 5;
    $scoreSummaryKeyword = 4;
    $scoreFullDocument = 4;
    $scoreDocumentKeyword = 3;
    $scoreCategoryKeyword = 2;
    $scoreUrlKeyword = 1;

    $keywords = filterSearchKeys($query);
    $escQuery = DB::escape($query); // see note above to get db object
    $titleSQL = array();
    $sumSQL = array();
    $docSQL = array();
    $categorySQL = array();
    $urlSQL = array();

    /** Matching full occurrences PLACE HOLDER **/

    /** Matching Keywords PLACE HOLDER **/

    $sql = "SELECT p.p_id,p.p_title,p.p_date_published,p.p_url,
            p.p_summary,p.p_content,p.thumbnail,
            (
                (-- Title score
                ".implode(" + ", $titleSQL)."
                )+
                (-- Summary
                ".implode(" + ", $sumSQL)." 
                )+
                (-- document
                ".implode(" + ", $docSQL)."
                )+
                (-- tag/category
                ".implode(" + ", $categorySQL)."
                )+
                (-- url
                ".implode(" + ", $urlSQL)."
                )
            ) as relevance
            FROM post p
            WHERE p.status = 'published'
            HAVING relevance > 0
            ORDER BY relevance DESC,p.page_views DESC
            LIMIT 25";
    $results = DB::query($sql);
    if (!$results){
        return false;
    }
    return $results;
}

In the query, all scores are summed into the relevance column, which we then use to sort the results.

Matching full occurrences

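The full-term match only makes sense when the query contains more than one keyword; with a single keyword, the keyword match in the next section already covers it. So we check the keyword count first, then add our conditions.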

if (count($keywords) > 1){
    $titleSQL[] = "if (p_title LIKE '%".$escQuery."%',{$scoreFullTitle},0)";
    $sumSQL[] = "if (p_summary LIKE '%".$escQuery."%',{$scoreFullSummary},0)";
    $docSQL[] = "if (p_content LIKE '%".$escQuery."%',{$scoreFullDocument},0)";
}

These are the matches with the highest scores. An article that contains the whole search term in its title, summary, or content has a better chance of appearing at the top.
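One thing to keep in mind with LIKE: % and _ are wildcards, and the usual string escaping (such as mysqli's real_escape_string) leaves them alone. A user who searches for "%" would therefore match every post. Here is a minimal sketch of how the wildcards could be neutralised first. The helper name escapeLikeWildcards is hypothetical, and it assumes MySQL's default LIKE escape character, the backslash.

```php
<?php
// Hypothetical helper: neutralise LIKE wildcards in user input.
// Apply it first, then run the result through DB::escape() as usual.
function escapeLikeWildcards(string $value): string
{
    // Prefix \, % and _ with a backslash, MySQL's default LIKE escape character.
    return addcslashes($value, '\\%_');
}

echo escapeLikeWildcards('100%_done'), PHP_EOL; // 100\%\_done
```

With this in place, a condition could be built from DB::escape(escapeLikeWildcards($key)) instead of DB::escape($key) alone.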

Matching keyword occurrences

We loop through all keywords and check if they match any of the fields. For the category match, I used a sub-query since a post can have multiple categories.

foreach($keywords as $key){
    $titleSQL[] = "if (p_title LIKE '%".DB::escape($key)."%',{$scoreTitleKeyword},0)";
    $sumSQL[] = "if (p_summary LIKE '%".DB::escape($key)."%',{$scoreSummaryKeyword},0)";
    $docSQL[] = "if (p_content LIKE '%".DB::escape($key)."%',{$scoreDocumentKeyword},0)";
    $urlSQL[] = "if (p_url LIKE '%".DB::escape($key)."%',{$scoreUrlKeyword},0)";
    $categorySQL[] = "if ((
    SELECT count(category.tag_id)
    FROM category
    JOIN post_category ON post_category.tag_id = category.tag_id
    WHERE post_category.post_id = p.post_id
    AND category.name = '".DB::escape($key)."'
                ) > 0,{$scoreCategoryKeyword},0)";
}

Also, as pointed out by a commenter below, we have to make sure that these variables are not empty arrays, or the query will fail.

// Just in case it's empty, add 0
if (empty($titleSQL)){
    $titleSQL[] = 0;
}
if (empty($sumSQL)){
    $sumSQL[] = 0;
}
if (empty($docSQL)){
    $docSQL[] = 0;
}
if (empty($urlSQL)){
    $urlSQL[] = 0;
}
if (empty($categorySQL)){
    $categorySQL[] = 0;
}
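Another way to avoid the empty-group problem is to build each group through one small function that falls back to 0 on its own. The sketch below is only an illustration: scoreExpression is a made-up name, and the $escape parameter stands in for DB::escape(). I pass addslashes here just so the example runs without a database connection; it is not a substitute for real escaping.

```php
<?php
// Hypothetical helper: build one parenthesised group of the relevance sum.
// $escape stands in for DB::escape(); addslashes is used below only so the
// sketch runs without a database connection.
function scoreExpression(string $column, array $keywords, int $score, callable $escape): string
{
    $parts = [];
    foreach ($keywords as $keyword) {
        $parts[] = "if ({$column} LIKE '%" . $escape($keyword) . "%',{$score},0)";
    }
    // An empty group would render as "()" and break the query, so default to 0.
    return $parts ? implode(' + ', $parts) : '0';
}

echo scoreExpression('p_title', ['php', 'search'], 5, 'addslashes'), PHP_EOL;
// if (p_title LIKE '%php%',5,0) + if (p_title LIKE '%search%',5,0)
echo scoreExpression('p_title', [], 5, 'addslashes'), PHP_EOL;
// 0
```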

At the end the queries are all concatenated and added together to determine the relevance of the post to the search term.


The full code:

// Remove unnecessary words from the search term and return the rest as an array
function filterSearchKeys($query){
    $query = trim(preg_replace("/\s+/", " ", $query));
    $words = array();
    // expand this list with your words (keep the entries lowercase).
    $list = array("in","it","a","the","of","or","i","you","he","me","us","they","she","to","but","that","this","those","then");
    foreach(explode(" ", $query) as $key){
        // skip empty strings and stop words, ignoring case
        if ($key === "" || in_array(strtolower($key), $list)){
            continue;
        }
        $words[] = $key;
        // keep at most 15 keywords
        if (count($words) >= 15){
            break;
        }
    }
    return $words;
}

// Limit the query to a maximum number of characters
function limitChars($query, $limit = 200){
    // mb_substr avoids cutting a multibyte character in half
    return mb_substr($query, 0, $limit);
}

function search($query){

    $query = trim($query);
    if (mb_strlen($query)===0){
        // no need for empty search right?
        return false; 
    }
    $query = limitChars($query);

    // Weighing scores
    $scoreFullTitle = 6;
    $scoreTitleKeyword = 5;
    $scoreFullSummary = 5;
    $scoreSummaryKeyword = 4;
    $scoreFullDocument = 4;
    $scoreDocumentKeyword = 3;
    $scoreCategoryKeyword = 2;
    $scoreUrlKeyword = 1;

    $keywords = filterSearchKeys($query);
    $escQuery = DB::escape($query); // see note above to get db object
    $titleSQL = array();
    $sumSQL = array();
    $docSQL = array();
    $categorySQL = array();
    $urlSQL = array();

    /** Matching full occurrences **/
    if (count($keywords) > 1){
        $titleSQL[] = "if (p_title LIKE '%".$escQuery."%',{$scoreFullTitle},0)";
        $sumSQL[] = "if (p_summary LIKE '%".$escQuery."%',{$scoreFullSummary},0)";
        $docSQL[] = "if (p_content LIKE '%".$escQuery."%',{$scoreFullDocument},0)";
    }

    /** Matching Keywords **/
    foreach($keywords as $key){
        $titleSQL[] = "if (p_title LIKE '%".DB::escape($key)."%',{$scoreTitleKeyword},0)";
        $sumSQL[] = "if (p_summary LIKE '%".DB::escape($key)."%',{$scoreSummaryKeyword},0)";
        $docSQL[] = "if (p_content LIKE '%".DB::escape($key)."%',{$scoreDocumentKeyword},0)";
        $urlSQL[] = "if (p_url LIKE '%".DB::escape($key)."%',{$scoreUrlKeyword},0)";
        $categorySQL[] = "if ((
        SELECT count(category.tag_id)
        FROM category
        JOIN post_category ON post_category.tag_id = category.tag_id
        WHERE post_category.post_id = p.post_id
        AND category.name = '".DB::escape($key)."'
                    ) > 0,{$scoreCategoryKeyword},0)";
    }

    // Just in case it's empty, add 0
    if (empty($titleSQL)){
        $titleSQL[] = 0;
    }
    if (empty($sumSQL)){
        $sumSQL[] = 0;
    }
    if (empty($docSQL)){
        $docSQL[] = 0;
    }
    if (empty($urlSQL)){
        $urlSQL[] = 0;
    }
    if (empty($categorySQL)){
        $categorySQL[] = 0;
    }

    $sql = "SELECT p.p_id,p.p_title,p.p_date_published,p.p_url,
            p.p_summary,p.p_content,p.thumbnail,
            (
                (-- Title score
                ".implode(" + ", $titleSQL)."
                )+
                (-- Summary
                ".implode(" + ", $sumSQL)." 
                )+
                (-- document
                ".implode(" + ", $docSQL)."
                )+
                (-- tag/category
                ".implode(" + ", $categorySQL)."
                )+
                (-- url
                ".implode(" + ", $urlSQL)."
                )
            ) as relevance
            FROM post p
            WHERE p.status = 'published'
            HAVING relevance > 0
            ORDER BY relevance DESC,p.page_views DESC
            LIMIT 25";
    $results = DB::query($sql);
    if (!$results){
        return false;
    }
    return $results;
}

Now your search.php file can look like this:

$term = isset($_GET['query'])?$_GET['query']: '';
$search_results = search($term);

if (!$search_results) {
    echo 'No results';
    exit;
}

// Print page with results here.
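For the printing part, here is one possible sketch. It assumes that DB::query() returns an array of associative arrays keyed by the column names from the SELECT (p_title, p_url, p_summary); your database layer may return something different. The function name renderResults and the markup are made up for the example.

```php
<?php
// Illustrative only: turn result rows into an HTML list. The keys p_title,
// p_url and p_summary come from the SELECT; the markup is an assumption.
function renderResults(array $rows): string
{
    $html = "<ul>\n";
    foreach ($rows as $row) {
        // Escape everything that came out of the database before printing it.
        $url     = htmlspecialchars($row['p_url'], ENT_QUOTES, 'UTF-8');
        $title   = htmlspecialchars($row['p_title'], ENT_QUOTES, 'UTF-8');
        $summary = htmlspecialchars($row['p_summary'], ENT_QUOTES, 'UTF-8');
        $html .= "  <li><a href=\"{$url}\">{$title}</a><p>{$summary}</p></li>\n";
    }
    return $html . "</ul>\n";
}

echo renderResults([
    ['p_url' => '/blog/search', 'p_title' => 'Search & PHP', 'p_summary' => 'A short summary'],
]);
```

If you also echo the search term back to the user, run it through htmlspecialchars the same way.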

We created a simple search algorithm that can handle a fair amount of content. I chose the score for each match arbitrarily, so feel free to tweak the values to something that works best for you. There is always room for improvement.

It is a good idea to track the search terms coming from your users; this way you can see whether most users search for the same thing. If there is a pattern, you can save a trip to the database and just cache the results using Memcached.
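Here is a rough sketch of what that caching could look like. The $cache argument is any object with get($key) and set($key, $value, $ttl) methods, which is the shape of PHP's Memcached class; ArrayCache is a stand-in I wrote so the example runs without a Memcached server. The names cachedSearch and ArrayCache, the key prefix, and the 300-second lifetime are all assumptions.

```php
<?php
// Stand-in cache so the sketch runs without a Memcached server.
class ArrayCache
{
    private $items = [];
    public function get($key) { return $this->items[$key] ?? false; }
    public function set($key, $value, $ttl = 0) { $this->items[$key] = $value; return true; }
}

// Hypothetical wrapper: look in the cache first, fall back to the real search.
function cachedSearch(string $term, $cache, callable $search, int $ttl = 300)
{
    // Normalise the term so "PHP  Search" and "php search" share one entry.
    $normalized = mb_strtolower(trim(preg_replace('/\s+/', ' ', $term)));
    $key = 'search:' . md5($normalized);

    $hit = $cache->get($key);
    if ($hit !== false) {
        return $hit;
    }
    $results = $search($normalized);
    $cache->set($key, $results, $ttl);
    return $results;
}

$calls = 0;
$fakeSearch = function ($q) use (&$calls) { $calls++; return ["result for {$q}"]; };
$cache = new ArrayCache();
cachedSearch('PHP  Search', $cache, $fakeSearch);
cachedSearch('php search', $cache, $fakeSearch);
echo $calls, PHP_EOL; // 1, the second call was served from the cache
```

Note that a search returning false (no results) looks like a cache miss here, so empty searches are simply run again each time.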

If you want to see this search algorithm in action, go ahead and look for an article using the search box at the top of the page. I have added extra features, like returning the part of the text where the match was found. Feel free to add features to yours.
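As an idea for such a feature, here is a minimal sketch of pulling out the text around the first match. This is not the code running on this blog; matchSnippet and its $radius parameter are names made up for the example.

```php
<?php
// Hypothetical helper, not the blog's actual implementation: return the text
// around the first occurrence of $keyword, or null when it does not occur.
function matchSnippet(string $text, string $keyword, int $radius = 30): ?string
{
    $pos = mb_stripos($text, $keyword);
    if ($pos === false) {
        return null;
    }
    // Take up to $radius characters on each side of the match.
    $start = max(0, $pos - $radius);
    $length = ($pos - $start) + mb_strlen($keyword) + $radius;
    $snippet = mb_substr($text, $start, $length);
    // Mark truncation on either side with an ellipsis.
    $prefix = $start > 0 ? '...' : '';
    $suffix = ($start + $length) < mb_strlen($text) ? '...' : '';
    return $prefix . $snippet . $suffix;
}

echo matchSnippet('Search is an important feature on a website.', 'important', 6), PHP_EOL;
// ...is an important featu...
```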


Comments

Zaryel :

Very nice tutorial ! It is a simple but still powerful algorithm. And your explanations are great.

Thank you.

Zaryel

Ibrahim :

Thanks Zaryel, I'm glad I could help :)

Ian Mustafa :

This example could be implemented on SQL with nested queries, as other example like MATCH() AGAINST() wouldn't. Great work Ibrahim!

Rob :

Thank you for this interesting approach.

The code fails to run if the method escape() is not defined, and in your database abstraction example it is not.

It isn't obvious to me what that should be.

Thanks.

Ibrahim :

Thanks for making this observation @Rob. I have updated the database abstraction post to include the escape function. Basically, it is just a wrapper around the MySQL escaping function.

public static function escape($string){
    $con = self::connect();
    return $con->real_escape_string($string);
}

adeem :

Great article, thank you!

Ivan Venediktov :

I have noticed that if you are searching for a single word, you get an error where implode is used.

Ibrahim :

Hi Ivan

Thank you for testing the code and finding a bug. It looks like I had fixed this issue in my own version and forgot to update the blog. Here are a few lines of code I added to make sure the variables are not empty arrays:

// Before assigning all the values to the $sql variable
// Just in case it's empty, add 0
if (empty($titleSQL)){
    $titleSQL[] = 0;
}
if (empty($sumSQL)){
    $sumSQL[] = 0;
}
if (empty($docSQL)){
    $docSQL[] = 0;
}
if (empty($urlSQL)){
    $urlSQL[] = 0;
}
if (empty($categorySQL)){
    $categorySQL[] = 0;
}

$sql = "SELECT p.p_id,p.p_title,p.p_date_published,p.p_url,
...

I have updated the code on the blog post.

Bilal :

This is so helpful. Is there a way to search based on a pattern? I really need it along with this for my Project.

Ibrahim :

@Bilal you can always add an extra filter to the code to parse the keywords. But what kind of pattern are you referring to?

Bilal :

@ibrahim Any pattern at all, which could filter the results even more. Pattern could be based on anything. Need it really fast :| Would really appreciate it!

Ibrahim :

@Bilal what is a pattern? If you want to write code that is going to be useful to you, you first have to identify what it is going to do for you.

A pattern by itself doesn't mean anything. But if you have a specific pattern you want to match, that is a start. I can't figure out what you need or write code for you, but I can definitely guide you in the right direction.

Bilal :

@ibrahim My project: It gather news from different sources around the world and plots it on to an interactive map. I used php as my main language. Now i need users to be able to search news which the crawler gathers and saves in an array. Your searching algorithm works on data stored in your mysql database right? I need to use all these features of your searching algorithm to work on the array which has all the news stored in it.

As far as what the pattern should be, i was hoping something like: user's search history should have an impact on the results being shown by a specific user. e.g: if all of his previous searches have been on football, the newer searches will find a link between the previous searches with the current search.

I hope i was able to convey my point. Only have a month to finish my project and i'm totally stuck. Can't seem to get done with the 2 problems mentioned above.

William :

What an excellent tutorial. Thank you very much for posting this.

Ibrahim :

Thanks @William
