Search is an important feature on a website. When my few readers want to look for a particular passage on my blog, they use the search box. It used to be powered by Google Search, but I have since replaced it with my own home-brewed version, not because I can do better but because it was an interesting challenge.

If you are in a hurry and just want your site to be searchable, do what I did before: use Google.

<form action="/search.php" method="get">
    <input type="text" name="query" placeholder="Search term"/>
    <input type="submit" value="Search"/>
</form>

// In search.php file

$term = isset($_GET['query'])?$_GET['query']: '';
$term = urlencode($term);
$website = urlencode("www.yourwebsite.com");
$redirect = "https://www.google.com/search?q=site%3A{$website}+{$term}";
header("Location: $redirect");
exit;

What it does is pretty simple. It takes the term passed by the user and forwards it to the Google search page, limiting the results to our domain with the site: operator in the search query. All your pages that are indexed by Google will now be available through search. If you want to handle search in house, however, keep reading.

Homemade Search Solution

Before we go any further, try using the search box on this blog. It uses the same process that I will describe below. If you feel that this is what you want then please continue reading.

This solution is meant for small websites. I use LIKE with wildcards on both ends, which means MySQL cannot use an index for the match and has to scan every row. This works fine for a blog or personal website that doesn't contain tons of data, but port it to a bigger website and it might become very slow. MySQL offers Full-Text Search for that case, which is not what we are doing here.

Note: If you have 5000 blog posts you are still fine. Computers are super fast.

We will take the structure of this blog as a reference. Each blog post has:

  • A title p_title
  • A url p_url
  • A summary p_summary
  • A post content p_content
  • And categories category.name

For every field that matches our search term, we will give it a score. The score is based on the importance of the match:

// the exact search term is found in the title
$scoreFullTitle = 6;

// a keyword is found in the title
$scoreTitleKeyword = 5;

// the exact search term is found in the summary
$scoreFullSummary = 5;

// a keyword is found in the summary
$scoreSummaryKeyword = 4;

// the exact search term is found in the content
$scoreFullDocument = 4;

// a keyword is found in the content
$scoreDocumentKeyword = 3;

// a keyword matches a category
$scoreCategoryKeyword = 2;

// a keyword is found in the url
$scoreUrlKeyword = 1;
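
To see how these weights add up, here is a minimal sketch that scores a single post in plain PHP rather than in SQL. The scorePost function and the sample post are made up for this illustration. stripos stands in for a case-insensitive LIKE '%term%', the full-phrase bonus is only applied when there is more than one keyword, and the category and URL weights are left out to keep it short.

```php
<?php
// Hypothetical helper: scores one post in PHP using the weights listed above.
// stripos() stands in for a case-insensitive LIKE '%term%' match.
function scorePost(array $post, string $query, array $keywords): int
{
    // column name, full-phrase score, per-keyword score
    $weights = array(
        array('p_title',   6, 5),
        array('p_summary', 5, 4),
        array('p_content', 4, 3),
    );
    $score = 0;
    foreach ($weights as list($field, $full, $partial)) {
        // The whole phrase only counts when there is more than one keyword.
        if (count($keywords) > 1 && stripos($post[$field], $query) !== false) {
            $score += $full;
        }
        // Every keyword found in the field adds its own points.
        foreach ($keywords as $key) {
            if (stripos($post[$field], $key) !== false) {
                $score += $partial;
            }
        }
    }
    return $score;
}

// Made-up sample post.
$post = array(
    'p_title'   => 'Building a PHP search engine',
    'p_summary' => 'A small search feature for a blog.',
    'p_content' => 'We score every match and sort by relevance.',
);

// Title: 6 (phrase) + 5 + 5 (keywords), summary: 4 ("search"), content: 0.
echo scorePost($post, 'php search', array('php', 'search')); // prints 20
```

A post that only mentioned "search" in its summary would score 4 for the same query, so it would rank well below this one.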

Before we get started, there are a few words that do not contribute much to a search and should be removed, for example "in", "it", "a", "the", "of". We will filter those out; feel free to add any word you think is irrelevant. We also want to limit the length of the query. We don't want a user to write a novel in the search field and crash our MySQL server.

// Remove unnecessary words from the search term and return the rest as an array
function filterSearchKeys($query){
    // collapse runs of whitespace into a single space
    $query = trim(preg_replace("/\s+/", " ", $query));
    $words = array();
    // expand this list with your words (keep them lowercase).
    $list = array("in","it","a","the","of","or","i","you","he","me","us","they","she","to","but","that","this","those","then");
    foreach(explode(" ", $query) as $key){
        // skip empty strings and stop words, ignoring case
        if ($key === "" || in_array(strtolower($key), $list)){
            continue;
        }
        $words[] = $key;
        // keep at most 15 keywords
        if (count($words) >= 15){
            break;
        }
    }
    return $words;
}

// limit the query to a maximum number of characters
function limitChars($query, $limit = 200){
    // mb_substr avoids cutting a multibyte character in half
    return mb_substr($query, 0, $limit);
}

Our helper functions can now limit the character count and filter out useless words. The way we will implement our algorithm is by giving a score every time we find a match. We will match words using MySQL's IF() function and accumulate points as we match more words. At the end we can use that score to sort our results.

Note: I will not be showing how to connect to a MySQL database. If you are having trouble connecting to the database efficiently, I recommend reading this short post about database abstraction.

Let's give our function a structure first. Note that I left placeholders so we can implement each section separately.

function search($query){

    $query = trim($query);
    if (mb_strlen($query)===0){
        // no need for empty search right?
        return false; 
    }
    $query = limitChars($query);

    // Weighing scores
    $scoreFullTitle = 6;
    $scoreTitleKeyword = 5;
    $scoreFullSummary = 5;
    $scoreSummaryKeyword = 4;
    $scoreFullDocument = 4;
    $scoreDocumentKeyword = 3;
    $scoreCategoryKeyword = 2;
    $scoreUrlKeyword = 1;

    $keywords = filterSearchKeys($query);
    $escQuery = DB::escape($query); // see note above to get db object
    $titleSQL = array();
    $sumSQL = array();
    $docSQL = array();
    $categorySQL = array();
    $urlSQL = array();

    /** Matching full occurrences PLACE HOLDER **/

    /** Matching Keywords PLACE HOLDER **/

    $sql = "SELECT p.p_id,p.p_title,p.p_date_published,p.p_url,
            p.p_summary,p.p_content,p.thumbnail,
            (
                (-- Title score
                ".implode(" + ", $titleSQL)."
                )+
                (-- Summary
                ".implode(" + ", $sumSQL)." 
                )+
                (-- document
                ".implode(" + ", $docSQL)."
                )+
                (-- tag/category
                ".implode(" + ", $categorySQL)."
                )+
                (-- url
                ".implode(" + ", $urlSQL)."
                )
            ) as relevance
            FROM post p
            WHERE p.status = 'published'
            HAVING relevance > 0
            ORDER BY relevance DESC,p.page_views DESC
            LIMIT 25";
    $results = DB::query($sql);
    if (!$results){
        return false;
    }
    return $results;
}

In the query, all scores are summed up as the relevance column, and we use it to sort the results.

Matching full occurrences

We only add the full-phrase match when there is more than one keyword; with a single keyword, the keyword match in the next step already covers it.

if (count($keywords) > 1){
    $titleSQL[] = "if (p_title LIKE '%".$escQuery."%',{$scoreFullTitle},0)";
    $sumSQL[] = "if (p_summary LIKE '%".$escQuery."%',{$scoreFullSummary},0)";
    $docSQL[] = "if (p_content LIKE '%".$escQuery."%',{$scoreFullDocument},0)";
}

These are the matches with the highest scores. An article that contains the exact search term has a higher chance of appearing at the top.

Matching keyword occurrences

We loop through all keywords and check if they match any of the fields. For the category match, I used a sub-query since a post can have multiple categories.

foreach($keywords as $key){
    $titleSQL[] = "if (p_title LIKE '%".DB::escape($key)."%',{$scoreTitleKeyword},0)";
    $sumSQL[] = "if (p_summary LIKE '%".DB::escape($key)."%',{$scoreSummaryKeyword},0)";
    $docSQL[] = "if (p_content LIKE '%".DB::escape($key)."%',{$scoreDocumentKeyword},0)";
    $urlSQL[] = "if (p_url LIKE '%".DB::escape($key)."%',{$scoreUrlKeyword},0)";
    $categorySQL[] = "if ((
    SELECT count(category.tag_id)
    FROM category
    JOIN post_category ON post_category.tag_id = category.tag_id
    WHERE post_category.post_id = p.p_id
    AND category.name = '".DB::escape($key)."'
                ) > 0,{$scoreCategoryKeyword},0)";
}
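
One thing the loop above does not handle: a % or _ typed by the user keeps its wildcard meaning inside LIKE. Assuming DB::escape behaves like mysqli::real_escape_string, which leaves those two characters alone, a small helper could neutralise them first. The name escapeLike is made up for this sketch.

```php
<?php
// Hypothetical helper: neutralise LIKE wildcards in user input.
// Backslash is MySQL's default LIKE escape character, so it is escaped too.
function escapeLike(string $term): string
{
    return addcslashes($term, '%_\\');
}

echo escapeLike('100%_done'); // prints 100\%\_done
```

In the loop it would be applied before the SQL escaping, for example DB::escape(escapeLike($key)), so that a search for "100%" only matches posts containing a literal percent sign.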

Also, as pointed out by a commenter below, we have to make sure that these variables are not empty arrays or the query will fail.

// Just in case it's empty, add 0
if (empty($titleSQL)){
    $titleSQL[] = 0;
}
if (empty($sumSQL)){
    $sumSQL[] = 0;
}
if (empty($docSQL)){
    $docSQL[] = 0;
}
if (empty($urlSQL)){
    $urlSQL[] = 0;
}
if (empty($categorySQL)){
    $categorySQL[] = 0;
}
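
An alternative to one empty() check per array is to fold the fallback into the place where the pieces are joined, so that no array can be missed. This is only a sketch, and the helper name sumScores is made up.

```php
<?php
// Hypothetical helper: join score fragments with " + ",
// falling back to "0" so the generated SQL never contains "()".
function sumScores(array $parts): string
{
    return empty($parts) ? '0' : implode(' + ', $parts);
}

echo sumScores(array());                           // prints 0
echo "\n";
echo sumScores(array('if (a,5,0)', 'if (b,4,0)')); // prints if (a,5,0) + if (b,4,0)
```

In the query, each implode(" + ", ...) call would then become a sumScores(...) call on the same array.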

At the end the queries are all concatenated and added together to determine the relevance of the post to the search term.


The full code:

// Remove unnecessary words from the search term and return the rest as an array
function filterSearchKeys($query){
    // collapse runs of whitespace into a single space
    $query = trim(preg_replace("/\s+/", " ", $query));
    $words = array();
    // expand this list with your words (keep them lowercase).
    $list = array("in","it","a","the","of","or","i","you","he","me","us","they","she","to","but","that","this","those","then");
    foreach(explode(" ", $query) as $key){
        // skip empty strings and stop words, ignoring case
        if ($key === "" || in_array(strtolower($key), $list)){
            continue;
        }
        $words[] = $key;
        // keep at most 15 keywords
        if (count($words) >= 15){
            break;
        }
    }
    return $words;
}

// limit the query to a maximum number of characters
function limitChars($query, $limit = 200){
    // mb_substr avoids cutting a multibyte character in half
    return mb_substr($query, 0, $limit);
}

function search($query){

    $query = trim($query);
    if (mb_strlen($query)===0){
        // no need for empty search right?
        return false; 
    }
    $query = limitChars($query);

    // Weighing scores
    $scoreFullTitle = 6;
    $scoreTitleKeyword = 5;
    $scoreFullSummary = 5;
    $scoreSummaryKeyword = 4;
    $scoreFullDocument = 4;
    $scoreDocumentKeyword = 3;
    $scoreCategoryKeyword = 2;
    $scoreUrlKeyword = 1;

    $keywords = filterSearchKeys($query);
    $escQuery = DB::escape($query); // see note above to get db object
    $titleSQL = array();
    $sumSQL = array();
    $docSQL = array();
    $categorySQL = array();
    $urlSQL = array();

    /** Matching full occurrences **/
    if (count($keywords) > 1){
        $titleSQL[] = "if (p_title LIKE '%".$escQuery."%',{$scoreFullTitle},0)";
        $sumSQL[] = "if (p_summary LIKE '%".$escQuery."%',{$scoreFullSummary},0)";
        $docSQL[] = "if (p_content LIKE '%".$escQuery."%',{$scoreFullDocument},0)";
    }

    /** Matching Keywords **/
    foreach($keywords as $key){
        $titleSQL[] = "if (p_title LIKE '%".DB::escape($key)."%',{$scoreTitleKeyword},0)";
        $sumSQL[] = "if (p_summary LIKE '%".DB::escape($key)."%',{$scoreSummaryKeyword},0)";
        $docSQL[] = "if (p_content LIKE '%".DB::escape($key)."%',{$scoreDocumentKeyword},0)";
        $urlSQL[] = "if (p_url LIKE '%".DB::escape($key)."%',{$scoreUrlKeyword},0)";
        $categorySQL[] = "if ((
        SELECT count(category.tag_id)
        FROM category
        JOIN post_category ON post_category.tag_id = category.tag_id
        WHERE post_category.post_id = p.p_id
        AND category.name = '".DB::escape($key)."'
                    ) > 0,{$scoreCategoryKeyword},0)";
    }

    // Just in case it's empty, add 0
    if (empty($titleSQL)){
        $titleSQL[] = 0;
    }
    if (empty($sumSQL)){
        $sumSQL[] = 0;
    }
    if (empty($docSQL)){
        $docSQL[] = 0;
    }
    if (empty($urlSQL)){
        $urlSQL[] = 0;
    }
    if (empty($categorySQL)){
        $categorySQL[] = 0;
    }

    $sql = "SELECT p.p_id,p.p_title,p.p_date_published,p.p_url,
            p.p_summary,p.p_content,p.thumbnail,
            (
                (-- Title score
                ".implode(" + ", $titleSQL)."
                )+
                (-- Summary
                ".implode(" + ", $sumSQL)." 
                )+
                (-- document
                ".implode(" + ", $docSQL)."
                )+
                (-- tag/category
                ".implode(" + ", $categorySQL)."
                )+
                (-- url
                ".implode(" + ", $urlSQL)."
                )
            ) as relevance
            FROM post p
            WHERE p.status = 'published'
            HAVING relevance > 0
            ORDER BY relevance DESC,p.page_views DESC
            LIMIT 25";
    $results = DB::query($sql);
    if (!$results){
        return false;
    }
    return $results;
}

Now your search.php file can look like this:

$term = isset($_GET['query'])?$_GET['query']: '';
$search_results = search($term);

if (!$search_results) {
    echo 'No results';
    exit;
}

// Print page with results here.
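
For the printing part, a minimal sketch could look like the following. It assumes DB::query returns an array of associative arrays keyed by column name, which may not match your database layer, and renderResults is a made-up name. The important bit is escaping everything before it goes into the page.

```php
<?php
// Hypothetical renderer. Assumes each row is an associative array
// with the p_title, p_url and p_summary columns selected by search().
function renderResults(array $rows): string
{
    $html = "<ul class=\"search-results\">\n";
    foreach ($rows as $row) {
        // Escape everything that comes out of the database before printing it.
        $url     = htmlspecialchars($row['p_url'], ENT_QUOTES, 'UTF-8');
        $title   = htmlspecialchars($row['p_title'], ENT_QUOTES, 'UTF-8');
        $summary = htmlspecialchars($row['p_summary'], ENT_QUOTES, 'UTF-8');
        $html .= "  <li><a href=\"{$url}\">{$title}</a><p>{$summary}</p></li>\n";
    }
    return $html . "</ul>\n";
}

// Made-up row to show the output shape.
echo renderResults(array(
    array('p_url' => '/blog/search', 'p_title' => 'Search & score', 'p_summary' => 'How it works'),
));
```

The search term itself deserves the same treatment if you echo it back to the user, for example in a "Results for ..." heading.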

We created a simple search algorithm that can handle a fair amount of content. I chose the score for each match arbitrarily, so feel free to tweak them to something that works best for you. There is always room for improvement.

It is a good idea to track the search terms coming from your users; this way you can see whether most users search for the same thing. If there is a pattern, you can save them a trip to the database and cache the results using Memcached.
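
Here is a rough sketch of the caching idea. To keep it runnable without the Memcached extension, a plain PHP array stands in for the cache; with the real extension the read and write would be Memcached::get() and Memcached::set() calls instead. The function names searchCacheKey and cachedSearch are made up, and $fakeSearch replaces the real search() so no database is needed.

```php
<?php
// Hypothetical cache key: normalise the term so "PHP  Search" and
// "php search" share one entry.
function searchCacheKey(string $query): string
{
    $normalized = strtolower(trim(preg_replace('/\s+/', ' ', $query)));
    return 'search:' . md5($normalized);
}

// $cache is a plain array standing in for Memcached.
function cachedSearch(string $query, callable $search, array &$cache)
{
    $key = searchCacheKey($query);
    if (!array_key_exists($key, $cache)) {
        $cache[$key] = $search($query); // only run the search on a miss
    }
    return $cache[$key];
}

// Stand-in for search() that counts how often it is called.
$calls = 0;
$cache = array();
$fakeSearch = function ($query) use (&$calls) {
    $calls++;
    return array('result for ' . $query);
};

cachedSearch('PHP  Search', $fakeSearch, $cache);
cachedSearch('php search', $fakeSearch, $cache);
echo $calls; // prints 1, the second lookup was served from the cache
```

With a real cache you would also set an expiration time, so that newly published posts show up in cached searches after a while.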

If you want to see this search algorithm in action, go ahead and try looking for an article using the search box at the top of the page. I have added extra features, like returning the part of the text where the match was found. Feel free to add features to yours.