
Saturday, March 13, 2010

Tech Talk: Using KinoSearch to Build a Custom Search Engine

Hosted options such as Google Custom Search are fine for some websites, but when you want to offer customized search for a website that could include privileged information, they really are a non-starter. So, as a web developer, you have two viable alternatives.

The first is to use 'like' statements in a SQL query. Projects such as phpBB use this type of approach, since it is easy to understand, reflects updates to your data instantly, and is very quick to get going. There is a downside to this approach, though. If your search query is something like:
select plantName from plantInstance where plantName like '%$var%';

then you run into all sorts of issues with spelling, punctuation, grammar, and word order. And what if you want to search long fields, like TEXT/MEDIUMTEXT/LONGTEXT columns? Or offer boolean operators, such as AND/OR/NOT? This approach rapidly becomes a non-trivial amount of work.
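To make the word-order problem concrete, here is a minimal sketch in plain Perl. Both like_match and all_words_match are hypothetical helpers written for this illustration (no database involved): the first imitates a '%...%' substring test, the second splits the text into words and ignores their order.

```perl
use strict;
use warnings;

# Emulate SQL "col LIKE '%needle%'": a case-insensitive substring test.
sub like_match {
    my ( $text, $needle ) = @_;
    return index( lc $text, lc $needle ) >= 0 ? 1 : 0;
}

# Hypothetical alternative: split both sides into words and require
# every query word to appear somewhere in the text, in any order.
sub all_words_match {
    my ( $text, $query ) = @_;
    my %seen = map { ( $_ => 1 ) } ( lc($text) =~ /\w+/g );
    for my $word ( lc($query) =~ /\w+/g ) {
        return 0 unless $seen{$word};
    }
    return 1;
}

my $name = 'Japanese Maple, red-leaved';
printf "%-26s %d\n", 'LIKE, same word order:',  like_match( $name, 'japanese maple' );
printf "%-26s %d\n", 'LIKE, words swapped:',    like_match( $name, 'maple japanese' );
printf "%-26s %d\n", 'word match, swapped:',    all_words_match( $name, 'maple japanese' );
```

The substring test only succeeds when the user types the words in exactly the stored order, which is the kind of thing a real search engine handles for you.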

The alternative to this approach is to use an actual search engine such as Lucene, Plucene, or KinoSearch. There are plenty of others, but those are the three I looked at. I ended up choosing KinoSearch since its Perl implementation and libraries are full-featured, easy to use and understand, and fast as hell.

What a full-featured search library such as KinoSearch offers is quite exciting. It builds a custom inverted index of the data you specify, and automatically (well, almost) parses search queries with some awareness of the language: the analyzer lowercases, tokenizes, and stems words, so a search for 'flowers' can match 'flower'. It then hands back results ranked by how well each document matches the query. Yay, it's a real search engine!
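If 'inverted index' sounds mysterious, here is a toy version in plain Perl. This is only a sketch of the idea, not how KinoSearch works internally: there is no stemming and the ranking is a bare term count. The names tokenize, add_document and search, along with the sample documents, are made up for the example.

```perl
use strict;
use warnings;

# Toy inverted index: term => { doc id => number of occurrences }.
my %index;

# Lowercase the text and split it into words.
sub tokenize {
    my ($text) = @_;
    return map { lc $_ } ( $text =~ /\w+/g );
}

sub add_document {
    my ( $doc_id, $text ) = @_;
    $index{$_}{$doc_id}++ for tokenize($text);
    return;
}

# Score each document by how many times the query terms occur in it,
# then return document ids from best to worst match.
sub search {
    my ($query) = @_;
    my %score;
    for my $term ( tokenize($query) ) {
        my $postings = $index{$term} or next;
        $score{$_} += $postings->{$_} for keys %{$postings};
    }
    return sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score;
}

add_document( plant_1 => 'Red maple with red autumn leaves' );
add_document( plant_2 => 'Sugar maple, grown for syrup' );
add_document( note_7  => 'Water the ferns on Sunday' );

my @results = search('red maple');
print join( ', ', @results ), "\n";
```

The point is that a lookup touches only the entries for the query terms rather than scanning every row, and each hit comes back with a score you can sort on.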

The first task was to determine what an actual document meant in the context of Plantacious.com. I decided that the following were probably going to be relevant to a user's search:

  • Plants in the database (public data)
  • Plants that a user owns (private data)
  • Calendar entries (private data)
  • Photographs (could be private or public based on a flag)
  • Notes (private)
  • Comments (public)

Each of these 'documents' would use a slightly different query to generate the result set, and the 'document location' would be the dynamic URL for the data. So, to get started you load the KinoSearch modules:

use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;


Then you create an analyzer and the index objects:

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => '/path/to/new/index/kinoIndex.new',
    create   => 1,
    analyzer => $analyzer,
);

This creates an analyzer that will parse English text, and a brand new index directory named kinoIndex.new. Then you declare the fields you want to store in the index. In this case they are:

$invindexer->spec_field( name => 'pc_key' );
$invindexer->spec_field( name => 'pc_name' );
$invindexer->spec_field( name => 'pc_description' );
$invindexer->spec_field( name => 'pct_name' );
$invindexer->spec_field( name => 'pcst_name' );
$invindexer->spec_field( name => 'fertilizer_type' );
$invindexer->spec_field( name => 'leaf_type' );
$invindexer->spec_field( name => 'flower_color' );
$invindexer->spec_field( name => 'flower_shape' );
$invindexer->spec_field( name => 'flower_size' );
$invindexer->spec_field( name => 'fruit_shape' );
$invindexer->spec_field( name => 'fruit_size' );
$invindexer->spec_field( name => 'fruit_color' );
$invindexer->spec_field( name => 'url' );
$invindexer->spec_field( name => 'result_location' );
$invindexer->spec_field( name => 'user_key' );
$invindexer->spec_field( name => 'cal_entry_name' );
$invindexer->spec_field( name => 'cal_entry_description' );
$invindexer->spec_field( name => 'cal_schedule_date' );
$invindexer->spec_field( name => 'thumb_url' );
$invindexer->spec_field( name => 'img_url' );

Not every field is used for every document, but since I want a single, centralized search that can give results from a myriad of document types, they must all be declared.
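Since every declaration above has the same shape, the list could also be driven from an array. The following is a sketch, not the code Plantacious runs: declare_fields is a hypothetical helper, and RecordingIndexer is a stand-in object so the example runs without KinoSearch installed. The only KinoSearch call involved is spec_field( name => ... ), used exactly as above.

```perl
use strict;
use warnings;

# Field names copied from the spec_field calls above.
my @FIELDS = qw(
    pc_key pc_name pc_description pct_name pcst_name
    fertilizer_type leaf_type
    flower_color flower_shape flower_size
    fruit_shape fruit_size fruit_color
    url result_location user_key
    cal_entry_name cal_entry_description cal_schedule_date
    thumb_url img_url
);

# Hypothetical helper: call spec_field once per name on whatever
# indexer object is passed in, and report how many were declared.
sub declare_fields {
    my ($indexer) = @_;
    $indexer->spec_field( name => $_ ) for @FIELDS;
    return scalar @FIELDS;
}

# Stand-in indexer that just records the names it is given.
# In the real script you would pass $invindexer instead.
{
    package RecordingIndexer;

    sub new {
        my $class = shift;
        return bless { fields => [] }, $class;
    }

    sub spec_field {
        my ( $self, %args ) = @_;
        push @{ $self->{fields} }, $args{name};
        return;
    }
}

my $mock = RecordingIndexer->new;
print declare_fields($mock) . " fields declared\n";
```

Keeping the names in one array also gives the indexing loop and the search page a single list to agree on.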

Then you iterate through a result set and add each row to the index as a document. In the case of plants (this is a big one) it looks like this:


#Index plants
my $plants=$db->sql("select * from plantClass pc, plantClassType pct,
    plantClassSubType pcst, pictures pic where
    pc.pct_key=pct.pct_key and
    pc.pic_key=pic.pic_key and
    pc.pcst_key=pcst.pcst_key;");

for (my $i=0; $i<@{$plants}; $i++) {
    #One index document per plant row
    my $doc = $invindexer->new_doc;
    $doc->set_value( pc_key => "$plants->[$i]->{pc_key}" );
    $doc->set_value( pc_name => "$plants->[$i]->{pc_name}" );
    $doc->set_value( pc_description => "$plants->[$i]->{pc_description}" );
    $doc->set_value( pct_name => "$plants->[$i]->{pct_name}" );
    $doc->set_value( pcst_name => "$plants->[$i]->{pcst_name}" );
    $doc->set_value( fertilizer_type => "$plants->[$i]->{fertilizer_type}" );
    $doc->set_value( leaf_type => "$plants->[$i]->{leaf_type}" );
    $doc->set_value( flower_color => "$plants->[$i]->{flower_color}" );
    $doc->set_value( flower_shape => "$plants->[$i]->{flower_shape}" );
    $doc->set_value( flower_size => "$plants->[$i]->{flower_size}" );
    $doc->set_value( fruit_shape => "$plants->[$i]->{fruit_shape}" );
    $doc->set_value( fruit_size => "$plants->[$i]->{fruit_size}" );
    $doc->set_value( fruit_color => "$plants->[$i]->{fruit_color}" );
    $doc->set_value( url => "http://www.plantacious.com/garden/main.pl?a=getPlantClass&pc_key=$plants->[$i]->{pc_key}" );
    $doc->set_value( result_location => "Plants" );
    $doc->set_value( img_url => "$plants->[$i]->{url}" );
    $doc->set_value( thumb_url => "$plants->[$i]->{turl}" );
    #user_key 0 marks public data
    $doc->set_value( user_key => "0" );

    $invindexer->add_doc($doc);

    #(the loop body continues in the next snippet)

But because I also want to create static documents for Google to index (it helps with other search engines too, since they don't really like dynamic content very much), a 'sitemap' is also created using LWP. LWP allows the indexer to make a simple 'GET' request for the actual page, then persist it to an HTML file that Googlebot can slurp up. Not all documents need this (private information, for example, is skipped), but these are public pages, so making it easier for the big search engines to find this content is pretty critical.

#Only do this at noon on Sundays ($date is set elsewhere in the script)
if ($date->[0]->{myhour} eq '7-12') {
    my $plant_url = "http://www.plantacious.com/garden/main.pl?a=getPlantClass&pc_key=$plants->[$i]->{pc_key}&bot=1";
    my $ua  = LWP::UserAgent->new;    #needs "use LWP::UserAgent;" at the top of the script
    my $req = HTTP::Request->new(GET => $plant_url);
    my $res = $ua->request($req);
    warn $plant_url;

    #Skip the page if the request failed
    if ($res->is_success) {
        my $content = $res->content;

        #Fixup relative links
        $content =~ s/\?a=/http:\/\/www\.plantacious\.com\/garden\/main\.pl\?a=/g;

        #And write out the content to the sitemap
        my $file = "/var/www/html/garden/sitemap/plant_$plants->[$i]->{pc_key}.html";
        open(my $out, '>', $file) or die "Cannot write $file: $!";
        print {$out} $content;
        close($out);
    }
}
} #End of the for loop over $plants
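The link fixup and the file naming are the two fiddly bits of that snippet, so here they are pulled out as small functions you can test on their own. This is a sketch under assumptions: absolutize_links and sitemap_path are hypothetical helpers, the fixup is a narrower variant that only touches double-quoted href attributes (so links that are already absolute are left alone), and it assumes pc_key is always a plain integer.

```perl
use strict;
use warnings;

# Base URL taken from the URLs used elsewhere in this post.
my $BASE = 'http://www.plantacious.com/garden/main.pl';

# Rewrite relative links such as href="?a=getPlantClass" so they point
# at the full script URL. Assumes href values are double-quoted.
sub absolutize_links {
    my ($html) = @_;
    $html =~ s/href="\?a=/href="$BASE?a=/g;
    return $html;
}

# Build the static file name for one plant. The directory is a
# parameter so the sketch does not depend on /var/www existing.
# Assumption: pc_key is numeric; anything else is rejected so a
# strange key can never escape the sitemap directory.
sub sitemap_path {
    my ( $dir, $pc_key ) = @_;
    die "unexpected key: $pc_key" unless $pc_key =~ /\A\d+\z/;
    return "$dir/plant_$pc_key.html";
}

my $page = '<a href="?a=getPlantClass&pc_key=3">Maple</a>';
print absolutize_links($page), "\n";
print sitemap_path( '/tmp/sitemap', 3 ), "\n";
```

Anchoring the substitution on href=" means running it twice over the same page does no harm, which is not true of a bare ?a= match.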


There are a few other housekeeping items to take care of (most importantly, calling $invindexer->finish to write the index out), all well documented in the KinoSearch POD, but that is basically it for creating the index. Iterate through your 'documents' (webpages), add them to the index, clean up afterwards and you are done.

The second phase is actually searching the index. This is done with an analyzer and searcher object.

use KinoSearch::Searcher;
use KinoSearch::Analysis::PolyAnalyzer;

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );
my $searcher = KinoSearch::Searcher->new(
    invindex => '/path/to/index/kinoIndex',
    analyzer => $analyzer,
);

my $hits = $searcher->search( query => "$query" );
$hits->seek(0,20);


This executes the query and selects the first 20 results. The two arguments to seek are an offset and a count, so paginating is just a matter of changing the offset. Then, to iterate, you fetch each hit as a hashref:

use Data::Dumper;

while ( my $hit = $hits->fetch_hit_hashref ) {
    #This will dump out all the values from the matching record
    print Dumper($hit);
}
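For pagination, the arithmetic that feeds seek can be sketched as below. page_window is a hypothetical helper, and the defaults (page 1, 20 per page) are my assumptions for the example; the only thing it relies on from KinoSearch is that seek takes an offset followed by a count, as in the call above.

```perl
use strict;
use warnings;

# Hypothetical helper: given a 1-based page number and a page size,
# return the (offset, count) pair to hand to $hits->seek.
# Missing or malformed input falls back to page 1, 20 per page.
sub page_window {
    my ( $page, $per_page ) = @_;
    $per_page = 20
        unless defined $per_page && $per_page =~ /\A\d+\z/ && $per_page > 0;
    $page = 1
        unless defined $page && $page =~ /\A\d+\z/ && $page > 0;
    return ( ( $page - 1 ) * $per_page, $per_page );
}

my ( $offset, $count ) = page_window( 3, 20 );
print "seek($offset, $count)\n";

# In the search script this would become:
#   $hits->seek( page_window( $page_param, 20 ) );
```

Validating the page number here matters because it usually arrives straight from a query string.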


In the actual code there are checks to see what kind of page is returned and whether the user is allowed to see it, logic to decide whether displaying an image next to the result is appropriate, and so on. But it really is that easy. If you want to know more, check out KinoSearch on CPAN.

At Plantacious, when the search returns, say, a plant, it checks whether there is an associated photograph. If there isn't one, a second query looks for an appropriate image already in the database, and if one is found it is displayed. This is why search results sometimes show not quite the right plant at the moment (there just aren't enough photos yet), but it will get better and better as more images become available.
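As a rough idea of what the 'is the user allowed to see it' check can look like, here is a sketch that filters hit hashrefs on the user_key field. It is not the real Plantacious code. It assumes that user_key "0" marks a public document (as in the plant documents indexed earlier) and that any other value is the owner's key; visible_to and the sample hits, including the 'Notes' and 'Calendar' labels, are made up for the example. The per-photo public/private flag is not handled.

```perl
use strict;
use warnings;

# Assumption: user_key "0" means public, anything else is the key of
# the user who owns the document. Hits without a user_key are hidden.
sub visible_to {
    my ( $hit, $viewer_key ) = @_;
    my $owner = $hit->{user_key};
    return 0 unless defined $owner;
    return 1 if $owner eq '0';
    return ( defined $viewer_key && $owner eq $viewer_key ) ? 1 : 0;
}

# Sample hits shaped like the hashrefs fetch_hit_hashref hands back.
my @hits = (
    { result_location => 'Plants',   user_key => '0' },
    { result_location => 'Notes',    user_key => '17' },
    { result_location => 'Calendar', user_key => '42' },
);

# User 17 sees the public plant and their own note, nothing else.
my @shown = grep { visible_to( $_, '17' ) } @hits;
print join( ', ', map { $_->{result_location} } @shown ), "\n";
```

Filtering after the search is the simplest thing that works, though it does mean a page of 20 hits can shrink once private results are dropped.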

Cheers!

Creative Commons License
Cheese A Day by Jeremy Pickett is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License.
Based on a work at cheeseaday.blogspot.com.