CSSRule: Improving Search with Regular Expressions

Monday, January 03, 2011

This weekend my focus was on improving the search function and making it work smoothly with the existing categories. I want to add new sections to the categories sidebar later. The footer's slowly coming together. I put in a new background which has a gradient on top of it. If you see banding, let me know.

Originally I was using Blogger search and sometime later I switched to Google Blog search for my categories sidebar links. I wasn't happy with either one. Their searches were not robust and I didn't have much control over how the results were presented. The purpose of this post is to show you how you can write your own search and stop relying on Google or Blogger. Here's what my category search markup used to look like:

<li class='categories'>
<a href='http://cssbakery.blogspot.com/search/label/Modifying%20Blogger'
target='_blank'><div id='cat1'/></a>

<a href='http://cssbakery.blogspot.com/search/label/Modifying%20Blogger'
target='_blank'>Modifying Blogger with CSS/HTML</a>
</li>

Note that the %20 you see in the URLs is just a URL-encoded space character. In my Blogger template I actually typed "Modifying Blogger", and when I saved it, Blogger converted it to "Modifying%20Blogger". Anyway, here's what the same category links look like now with the new category search tool:

<a href='http://blogsearch.google.com/blogsearch?q=site%3Acssbakery.com+Blogger' >
Modifying Blogger with CSS/HTML</a>

I have some JavaScript that pulls out the "q=" part of the URL and uses it for my own custom search tool, so I actually ignore the blogsearch.google.com part of this URL. I still have the link point to Google as a backup, in case my JavaScript file hasn't loaded or the user has JavaScript disabled.
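A minimal sketch of that extraction step, assuming the link has the same shape as the one above. The function name extractSearchTerm is hypothetical, and dropping the "site:" restriction is an assumption about what a custom search would need; this is not the script that actually runs on the blog.

```javascript
// Hypothetical helper: pull the search term out of a category link's href.
function extractSearchTerm(href) {
  // searchParams decodes %3A to ":" and "+" to a space.
  const query = new URL(href).searchParams.get("q");
  if (query === null) {
    return ""; // no q= parameter in this URL
  }
  // Assumption: the "site:..." token only matters to Google, so drop it.
  return query
    .split(/\s+/)
    .filter((token) => token !== "" && !token.startsWith("site:"))
    .join(" ");
}

const href =
  "http://blogsearch.google.com/blogsearch?q=site%3Acssbakery.com+Blogger";
console.log(extractSearchTerm(href)); // Blogger
```

In the page, a click handler on each category link could pass the link's href to a function like this and cancel the default navigation, so the Google URL is only followed when the script is unavailable.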

At first my search algorithm was just looking for substring matches using the PHP substr_count($haystack,$needle) function. That function returns the number of occurrences of $needle in $haystack. The problem was that when I searched for a string like "id", it would also match inside longer words like "Holidays". I didn't want that, so instead of using substr_count(), I switched to preg_match_all($regex,$haystack,$matches).
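To see the problem concretely, here is a small JavaScript sketch of plain substring counting. The helper countSubstring is a hypothetical stand-in for substr_count(), written only for illustration; like substr_count() it counts non-overlapping occurrences, and returning 0 for an empty needle is a choice made for this sketch.

```javascript
// Hypothetical stand-in for PHP's substr_count(): count non-overlapping
// occurrences of needle in haystack, with no notion of word boundaries.
function countSubstring(haystack, needle) {
  if (needle === "") {
    return 0; // choice made for this sketch
  }
  let count = 0;
  let pos = haystack.indexOf(needle);
  while (pos !== -1) {
    count += 1;
    pos = haystack.indexOf(needle, pos + needle.length);
  }
  return count;
}

// "Holidays" contains the letters "id", so it is counted as a hit.
console.log(countSubstring("Holidays and the id attribute", "id")); // 2
console.log(countSubstring("Holidays", "id")); // 1
```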

With preg_match_all(), you pass in a regular expression, and it returns the number of matches found in $haystack (it also returns a set of matches via the 3rd parameter, but for my purposes I'm only interested in the number of matches). So the most difficult part of this is coming up with a good regular expression.

Regular expressions are used for pattern matching. They consist of strings written in a special syntax that specifies what to match. The regular expression is interpreted by a processing engine that uses it to find a set of matches within a body of text. PHP's preg functions use a Perl-compatible syntax that is common among Unix/Linux-based languages and tools. Certain characters have special meaning. The caret symbol (^) matches the beginning of the text and the dollar sign ($) matches the end; with the m (multiline) modifier they match the beginning and end of each line instead. So for example, the pattern ^abc$ would match a complete line that consisted of only the letters abc. Regular expressions can get very complex, but they are very powerful. You can read more about PHP's regular expressions here.
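The ^abc$ example can be tried out directly. This sketch uses JavaScript rather than PHP, since the anchor syntax is the same for these inputs; the variable names are only for illustration. Without the m flag the anchors refer to the whole string, and with it they refer to each line.

```javascript
const wholeText = /^abc$/; // ^ and $ anchor to the start and end of the string
const perLine = /^abc$/m; // with the m flag they anchor to each line instead

const checks = {
  exact: wholeText.test("abc"), // only the letters abc
  extraPrefix: wholeText.test("xabc"), // something before abc
  extraSuffix: wholeText.test("abcd"), // something after abc
  secondLineNoFlag: wholeText.test("xyz\nabc"), // abc alone on line 2
  secondLineWithFlag: perLine.test("xyz\nabc"), // same text, m flag on
};
console.log(checks);
// exact and secondLineWithFlag are true; the other three are false
```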

We want to match occurrences where the search term is surrounded by spaces, as in: "the id attribute", but we also want to match cases like these: "set the id." (the search term is followed by a period), "Id is important" (search term at beginning of sentence), and even plurals like: "set all the ids for these divs". It should also handle cases where the search term is followed immediately by some whitespace character such as a newline, or a tab.

Here is what I came up with that I think satisfies all those criteria:

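// preg_quote() escapes regex metacharacters (and the "/" delimiter) in the search term
$count = preg_match_all('/([^a-z]|^)'.preg_quote($needle, '/').'([^a-z]|$|(s[^a-z]))/',$haystack,$matches);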

The first part of the regular expression is ([^a-z]|^). This says the character immediately preceding the search term can be anything except the lowercase letters a through z, or the search term can sit at the very beginning of the text (^).

The second part of the expression is ([^a-z]|$|(s[^a-z])). This says the character following the search term can be anything except a through z, or the search term can be at the very end of the text ($), or it can be followed by an 's' character and then any character that is not a through z. (This last part handles the plural of the search term, so we match "ids" as well as "id", but not "idsay" for example.)
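The pattern can be checked against the cases listed earlier. The sketch below is a JavaScript port, not the PHP that runs on the blog, and it assumes the text and the search term have already been lowercased (the pattern only knows about a through z). The helper names are hypothetical. It also includes a lookaround variant for comparison, because the original pattern consumes the characters on either side of the term: two hits separated by a single space are counted once, and a plural at the very end of the text is missed.

```javascript
// JavaScript analog of PHP's preg_quote(): escape regex metacharacters.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Port of the pattern from the post; counts matches like preg_match_all().
function countWithPostPattern(haystack, needle) {
  const pattern = new RegExp(
    "([^a-z]|^)" + escapeRegExp(needle) + "([^a-z]|$|(s[^a-z]))",
    "g"
  );
  return Array.from(haystack.matchAll(pattern)).length;
}

// Variant (an assumption, not the post's code): lookarounds inspect the
// neighboring characters without consuming them.
function countWithLookarounds(haystack, needle) {
  const pattern = new RegExp(
    "(?<![a-z])" + escapeRegExp(needle) + "s?(?![a-z])",
    "g"
  );
  return Array.from(haystack.matchAll(pattern)).length;
}

const samples = [
  "the id attribute", // 1 and 1
  "set the id.", // 1 and 1
  "id is important", // 1 and 1
  "set all the ids for these divs", // 1 and 1
  "holidays", // 0 and 0
  "idsay", // 0 and 0
  "id id", // 1 and 2: the shared space is consumed by the first match
  "check the ids", // 0 and 1: plural at the very end of the text
];
for (const text of samples) {
  console.log(
    JSON.stringify(text),
    countWithPostPattern(text, "id"),
    countWithLookarounds(text, "id")
  );
}
```

PHP's PCRE engine accepts the same lookbehind and lookahead syntax, so the variant could be tried in preg_match_all() as well.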

<a href='http://blogsearch.google.com/blogsearch?q=site%3Acssbakery.com+Internet' 
target='_blank' title='Internet'>

<img alt='web' src='http://pics.cssrule.com/pics/webcake.jpg' title='Internet'/></a>

<a href='http://blogsearch.google.com/blogsearch?q=site%3Acssbakery.com+Internet' 
target='_blank' title='Internet'>The Internets</a>
</li>

Another thing that I do in my search algorithm is give more weight to posts where the search term appears in the title of the post or in its labels. Posts where the title or a label matches the search term always appear at the top of my search results. If the match is only in the text of the post, the post is still listed, but further down. In fact, I even put a separator in the output so you can see the title/label matches above the separator and the other matches below it. Within each group, the algorithm also ranks posts higher based on how many times the search term appears in the title, the labels, and the body of the post.
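Here is one way that ranking could be sketched in JavaScript. The post shape ({ title, labels, body }), the function names, and the rule of sorting each group by total match count are all assumptions made for illustration; the actual scoring used on the blog is not shown in this post. Text is lowercased before matching, and the matcher is a port of the pattern shown earlier.

```javascript
// Escape regex metacharacters in the search term.
function escapeRegExp(term) {
  return term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Count whole-word matches (port of the post's pattern, lowercased input).
function countMatches(text, term) {
  const pattern = new RegExp(
    "([^a-z]|^)" + escapeRegExp(term.toLowerCase()) + "([^a-z]|$|(s[^a-z]))",
    "g"
  );
  return Array.from(text.toLowerCase().matchAll(pattern)).length;
}

// Hypothetical ranking: title/label hits go above the separator, body-only
// hits go below it, and each group is sorted by total count, highest first.
function rankPosts(posts, term) {
  const scored = posts
    .map((post) => {
      const labelCount = post.labels.reduce(
        (sum, label) => sum + countMatches(label, term),
        0
      );
      const headCount = countMatches(post.title, term) + labelCount;
      const total = headCount + countMatches(post.body, term);
      return { title: post.title, headCount, total };
    })
    .filter((entry) => entry.total > 0);

  const byTotalDesc = (a, b) => b.total - a.total;
  const titles = (entries) => entries.sort(byTotalDesc).map((e) => e.title);
  return {
    aboveSeparator: titles(scored.filter((e) => e.headCount > 0)),
    belowSeparator: titles(scored.filter((e) => e.headCount === 0)),
  };
}

// Made-up sample posts, purely for demonstration.
const posts = [
  { title: "Styling the id attribute", labels: ["CSS"], body: "An id must be unique." },
  { title: "Holiday notes", labels: ["Misc"], body: "Set the id. Then set the id again." },
  { title: "Footer gradients", labels: ["Design"], body: "No match here." },
  { title: "Forms", labels: ["id"], body: "Give each id a label, and keep the id short; one id per field." },
];
console.log(rankPosts(posts, "id"));
```

With these sample posts, "Forms" and "Styling the id attribute" land above the separator, "Holiday notes" lands below it, and "Footer gradients" is left out because it has no matches.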

In the CSS, I styled an ordered list which comes with sequential numbers instead of bullets:

.searchresults ol {
  padding-left: 25px; 
}
.searchresults ol li {
  margin-top: 5px;
}
.searchresults ol li a, .searchresults ol {
  font-family: "Trebuchet MS",Verdana,Arial,Helvetica,sans-serif;
  font-size: 13px;
  color: #606060;
  font-weight: normal;
}
.searchresults ol li a:hover {
  color: #30AECE;
}
.searchresults {
  margin-top: 45px;
}
.searchresults div.divider {
  margin: 40px 0;
}
.searchresults h3 {
  color: #CC6600;
  font-size: 30px;
  font-weight: normal;
  font-family: "Trebuchet MS",Verdana,Arial,Helvetica,sans-serif;
}

Another minor thing I did was to have the page automatically scroll to the top when you click on a category search. I did this with the following JavaScript statement: