Spellchecking with Sunspot and Solr

Aug 15, 2011

Setting up spellchecking in Solr is a little complicated - you have to set up the spellcheck component, define a dictionary, and add the component to the necessary search handlers. Even then, Sunspot doesn’t support spellchecking by default. Here I’ll explain how to set up a basic spellchecking system using the out-of-the-box solrconfig that comes with Sunspot, and give you some code I wrote that provides an interface between Sunspot and Solr’s spellchecking system.

Setting up spell checking

There are a couple of steps to this. First, decide what you want to spell check against. You might want to auto-correct all English words in a search, but more likely you want to help a user find something they might have misspelled, such as somebody’s name or a product name. To do that, pop open solrconfig.xml and find the searchComponent definition for “spellcheck.” Change the “field” from “name” to a field that is actually in your application. Fields set up by Sunspot are dynamic fields, so it’s probably the name you defined in your “searchable” block followed by a suffix indicating what kind of field it is. For example, let’s say you have a User model that looks like this:

class User < ActiveRecord::Base
  searchable do
    text :username
  end
  ...
end

Then in the “field” segment, you’d want to put in “username_text”, like so:

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
    <str name="queryAnalyzerFieldType">textSpell</str>
    <lst name="spellchecker">
      <str name="name">default</str>
      <str name="field">username_text</str>
      <str name="spellcheckIndexDir">./spellchecker</str>
      <str name="buildOnCommit">true</str>
    </lst>
    ...
</searchComponent>

Notice we added the “buildOnCommit” parameter, which will cause the dictionary to re-build when new records with the username field are committed. Unless you want to manually re-build the dictionary from time to time, you probably want this parameter or “buildOnOptimize”.

Finally, you’ll want to set your search handlers to use the spellcheck component so that it returns spelling suggestions along with your result set. Add “spellcheck” to the “last-components” array at the end of the “standard” and “dismax” search handlers, along with a default option of “spellcheck=true”. You probably also want to set “collate=true” as a default, which will generate a new search string made up of the top spellcheck suggestion for each word in the query. It should look something like this:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="echoParams">explicit</str>
    <float name="tie">0.01</float>
    <str name="mm">
      2&lt;-1 5&lt;-2 6&lt;90%
    </str>
    <int name="ps">100</int>
    <str name="q.alt">*:*</str>
    <str name="f.name.hl.fragsize">0</str>
    <str name="f.name.hl.alternateField">name_text</str>
    <str name="f.text.hl.fragmenter">regex</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

We actually use a few more options to specify returning only more popular results, and limiting the number of spelling suggestions. Our setup looks more like this:

<requestHandler name="dismax" class="solr.SearchHandler">
  <lst name="defaults">
    ...
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.extendedResults">false</str>
    <str name="spellcheck.count">3</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

For a complete list of what those options do, see the SpellCheckComponent description on the Solr Wiki.

Now, to construct the dictionary for the first time, go to http://yoursearchserver.com/solr/spell?q=*&spellcheck.build=true - it might take a bit to build the dictionary, but once that request completes it should be ready. Finally, perform a normal search through the Solr admin interface at http://yoursearchserver.com/solr/admin - preferably a misspelled version of a username. Notice the block at the end of the returned document? It should look something like this (I searched for “delll ultrasharp”):

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="delll">
      <int name="numFound">2</int>
      <int name="startOffset">0</int>
      <int name="endOffset">5</int>
      <arr name="suggestion">
        <str>Jello</str>
        <str>Tell</str>
      </arr>
    </lst>
    <lst name="ultrasharp">
      <int name="numFound">3</int>
      <int name="startOffset">6</int>
      <int name="endOffset">16</int>
      <arr name="suggestion">
        <str>ultraDice</str>
        <str>UltraBall</str>
        <str>UltraDeep</str>
      </arr>
    </lst>
    <str name="collation">Jello ultraDice</str>
  </lst>
</lst>

There are the spelling suggestions, and even the collated search string. But how do we access that from Sunspot in our Rails application?

Integrating with Sunspot

Sunspot doesn’t support spelling suggestions out of the box. So I wrote a really quick interface to the spellchecker and added it to the searching DSL. With the configuration changes above to make sure you’re indexing search terms, drop this code in something like ‘/lib/sunspot_spellcheck.rb’ and require it in an initializer.

module Sunspot
  module Query
    class Spellcheck < Connective::Conjunction
      attr_accessor :options

      def initialize(options = {})
        @options = options
      end

      def to_params
        options = {}
        @options.each do |key, val|
          options["spellcheck." + Sunspot::Util.method_case(key)] = val
        end
        { :spellcheck => true }.merge(options)
      end
    end
  end
end

module Sunspot
  module Query
    class CommonQuery
      def spellcheck options = {}
        @components << Spellcheck.new(options)
      end
    end
  end
end

module Sunspot
  module Search
    class AbstractSearch
      attr_accessor :solr_result

      def raw_suggestions
        ["spellcheck", "suggestions"].inject(@solr_result){|h,k| h && h[k]}
      end

      def suggestions
        suggestions = ["spellcheck", "suggestions"].inject(@solr_result){|h,k| h && h[k]}
        return nil unless suggestions.is_a?(Array)

        suggestions_hash = {}
        index = -1
        suggestions.each do |sug|
          index += 1
          next unless sug.is_a?(String)
          break unless suggestions.count > index + 1
          suggestions_hash[sug] = suggestions[index+1].try(:[], "suggestion") || suggestions[index+1]
        end
        suggestions_hash
      end

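      def all_suggestions
        # Flatten every term's suggestions into one array, skipping the collation string.
        (suggestions || {}).reject { |term, _| term == "collation" }.values.flatten
      end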

      def collation
        suggestions.try(:[], "collation")
      end
    end
  end
end

module Sunspot
  module DSL
    class StandardQuery
      def spellcheck options = {}
        @query.spellcheck(options)
      end
    end
  end
end

module Sunspot
  module Util
    class<<self
      def method_case(string_or_symbol)
        string = string_or_symbol.to_s
        first = true
        string.split('_').map! { |word| word = first ? word : word.capitalize; first = false; word }.join
      end
    end
  end
end
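To make the parsing in the “suggestions” method a bit easier to follow, here is a minimal standalone sketch of the same idea in plain Ruby, without Rails’ “try”. It assumes Solr hands the suggestions back as a flat array that alternates between a term and its data, which is the shape the code above checks for. The sample data and the “suggestions_from” helper are made up for illustration and are not part of Sunspot.

```ruby
# Hypothetical sample of the "spellcheck" part of a Solr response, shaped the
# way the suggestions method above expects: a flat array alternating between
# a term and its data, with the collation as the final pair.
SAMPLE_RESULT = {
  "spellcheck" => {
    "suggestions" => [
      "angr", { "numFound" => 3, "suggestion" => ["angry", "tanga", "bang"] },
      "brds", { "numFound" => 3, "suggestion" => ["birds", "words", "nerds"] },
      "collation", "angry birds"
    ]
  }
}

# Pair up the flat array. Hash entries contribute their "suggestion" list;
# anything else (such as the collation string) is kept as-is.
def suggestions_from(solr_result)
  flat = (solr_result["spellcheck"] || {})["suggestions"]
  return nil unless flat.is_a?(Array)

  flat.each_slice(2).each_with_object({}) do |(term, data), hash|
    hash[term] = data.is_a?(Hash) ? data["suggestion"] : data
  end
end

p suggestions_from(SAMPLE_RESULT)
```

Walking the array two items at a time avoids the manual index bookkeeping, at the cost of assuming the array always holds complete pairs.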

Now when you perform a search you can instruct it to return spelling suggestions like so:

@search = User.search do
  keywords params[:q]
  spellcheck
end

This will ensure the “spellcheck=true” param is passed into the Solr request. This should be unnecessary, since we put that in the defaults for the standard and dismax handlers in our solrconfig.xml above. However, there’s more: you can pass options to the spellchecker by passing a hash to the spellcheck method. That looks something like this:

@search = User.search do
  keywords params[:q]
  spellcheck :only_more_popular => true, :count => 5
end

Now the spellchecker will only return more popular suggestions, and five of them, regardless of the defaults set in solrconfig.xml. Handy, no?
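If you’re curious what those options turn into on the wire, here’s a rough standalone sketch of the conversion that “to_params” and “method_case” perform. The helper names “lower_camel_case” and “spellcheck_params” are my own for this example, and it uses string keys throughout where the real code mixes in a symbol, so treat it as an illustration rather than the code Sunspot runs.

```ruby
# Convert a snake_case option name such as :only_more_popular into the
# lowerCamelCase form Solr expects ("onlyMorePopular").
def lower_camel_case(name)
  first, *rest = name.to_s.split("_")
  ([first] + rest.map(&:capitalize)).join
end

# Build the Solr request parameters: spellcheck=true plus one
# "spellcheck.<option>" entry per option passed in.
def spellcheck_params(options = {})
  params = { "spellcheck" => true }
  options.each do |key, value|
    params["spellcheck.#{lower_camel_case(key)}"] = value
  end
  params
end

p spellcheck_params(:only_more_popular => true, :count => 5)
```

So the hash above should end up as “spellcheck.onlyMorePopular” and “spellcheck.count”, the same parameter names used in the handler defaults earlier.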

To access the spelling suggestions in the resulting search object, call the “suggestions” method. It returns a hash whose keys are the search terms and whose values are arrays of suggestions, plus a “collation” key holding the collated string when there is one. It will look something like this:

# after searching for "angr brds"
@search.suggestions
#=>{ "angr" => ["angry", "tanga","bang"], "brds" => ["birds", "words", "nerds"], "collation" => "angry birds"}

This way you can easily parse through the terms in the query and get at the suggestions. Also, if you just want the collation (so you could suggest an alternate search similar to Google’s “Did you mean?” feature), you can call the “collation” method, and if the search returned a collated suggestion, it will be returned as a string.

@search.collation
#=> "angry birds"
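For the “Did you mean?” case, a small helper along these lines might be enough. This is only a sketch: “did_you_mean” is a hypothetical name, and it assumes you hand it the user’s original query plus whatever “@search.collation” returned, which may be nil.

```ruby
# Return a "Did you mean?" message, or nil when there is nothing useful to
# offer (no collation, or a collation identical to what the user typed).
def did_you_mean(query, collation)
  return nil if collation.nil? || collation.strip.empty?
  return nil if collation.strip.downcase == query.to_s.strip.downcase

  "Did you mean: #{collation.strip}?"
end

puts did_you_mean("angr brds", "angry birds")
```

In a view you would probably link the collation to a new search rather than just printing it.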

We’re working on slicing and dicing dictionaries and different types of searches on our site to try and put the best near-matches in front of the user, but hopefully this will get you started. Please leave any questions / comments below!

Ruby QuickRef

Aug 9, 2011

Check out this website I found at zenspider.com

Ruby QuickRef has a list of all the Perl-style globals available in Ruby. Nice to know.

Airbnb traveler took everything

Jul 27, 2011

Three difficult days ago, I returned home from an exhausting week of business travel to an apartment that I no longer recognized. To an apartment that had been ransacked.

via ejroundtheworld.blogspot.com

Personally, I had always thought of Airbnb as a paid form of Couchsurfing, or a way to rent guest rooms or attached spaces, not a subletting service. I can't imagine letting someone live in my home unsupervised for any amount of time without a huge sublet contract. Especially someone I had never even met in person. Did this guy mail his keys out or something?

Airbnb is not insane, but this story sure is.

AbstractSingletonProxyFactoryBean

Jul 26, 2011

public abstract class AbstractSingletonProxyFactoryBean
extends ProxyConfig
implements FactoryBean, BeanClassLoaderAware, InitializingBean

Convenient proxy factory bean superclass for proxy factory beans that create only singletons.
Manages pre- and post-interceptors (references, rather than interceptor names, as in ProxyFactoryBean) and provides consistent interface management.

via static.springsource.org

Convenient.

Simple bijective function (base(n) encode/decode) — Gist

Jul 13, 2011

# Simple bijective function
#   Basically encodes any integer into a base(n) string,
#     where n is ALPHABET.length.
#   Based on pseudocode from http://stackoverflow.com/questions/742013/how-to-code-a-url-shortener/742047#742047

ALPHABET =
  "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789".split(//)
  # make your own alphabet using:
  # (('a'..'z').to_a + ('A'..'Z').to_a + (0..9).to_a).shuffle.join

def bijective_encode(i)
  # from http://refactormycode.com/codes/125-base-62-encoding
  # with only minor modification
  return ALPHABET[0] if i == 0
  s = ''
  base = ALPHABET.length
  while i > 0
    s << ALPHABET[i.modulo(base)]
    i /= base
  end
  s.reverse
end

def bijective_decode(s)
  # based on base2dec() in Tcl translation 
  # at http://rosettacode.org/wiki/Non-decimal_radices/Convert#Ruby
  i = 0
  base = ALPHABET.length
  s.each_char { |c| i = i * base + ALPHABET.index(c) }
  i
end

# Two little demos:

# Encoding ints, decoding them back:
num = 125
(num..(num+10)).each do |i|
  print i, " ", bijective_encode(i), " ", bijective_decode(bijective_encode(i)), "\n"
end

# Decoding string mentioned in original SO question:
puts bijective_decode("e9a")

via gist.github.com

Easy way to handle url / id shortening. Nifty!

John Eberly » Rails fuzzy searching with Sunspot gem

Jul 11, 2011

SEARCH FOR: ‘Jon Smath’ and get => ‘John Smith’

via blog.eberly.org

Phonetic tokenization looks like a much better solution for correcting typos than fuzzy searching based on Levenshtein distance.
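To get a feel for why phonetic matching catches this kind of typo, here is a small sketch using classic Soundex. Soundex is only a stand-in that I picked because it fits in a few lines; it is not necessarily the phonetic filter the linked post configures in Solr, and “soundex” and “sounds_alike?” are names made up for this example.

```ruby
# Letter-to-digit table for classic Soundex. Vowels, h, w and y have no code.
SOUNDEX_CODES = {
  "bfpv" => "1", "cgjkqsxz" => "2", "dt" => "3",
  "l" => "4", "mn" => "5", "r" => "6"
}.flat_map { |letters, digit| letters.chars.map { |c| [c, digit] } }.to_h

# Encode a single word as a four-character Soundex key, e.g. "Smith" -> "S530".
def soundex(word)
  letters = word.downcase.gsub(/[^a-z]/, "").chars
  return "" if letters.empty?

  code = letters.first.upcase
  previous = SOUNDEX_CODES[letters.first]
  letters.drop(1).each do |letter|
    digit = SOUNDEX_CODES[letter]
    if digit.nil?
      # Vowels break up runs of the same code; h and w do not.
      previous = nil unless "hw".include?(letter)
    else
      code << digit unless digit == previous
      previous = digit
    end
  end
  code[0, 4].ljust(4, "0")
end

# Two phrases "sound alike" here if their words share Soundex keys in order.
def sounds_alike?(a, b)
  a.split.map { |w| soundex(w) } == b.split.map { |w| soundex(w) }
end

puts sounds_alike?("Jon Smath", "John Smith")
```

Both spellings collapse to the same keys, so the match is found at index time instead of by computing edit distances at query time.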

The Problem with Google+

Jul 9, 2011

While everybody's been praising Google+ over its beautiful interface design, cute animations, and utterly reasonable privacy model, I've found myself not using it. I can think of two reasons for that - one, it's still bound to my ancient gmail account and not the Google Apps account I currently use, and two, it makes me think too much.

Making me think too much about my social networks could be a real blocker. In forcing every one of your connections to be classified into a specific group ("professional contacts", "friends", "bdsm buddies"), Google+ really makes you think about how well you know this contact, what you might want to share with them, and where the best place to file them is. Personally, I have a lot of overlap between multiple groups of friends, coworkers, and professional contacts.

The problem gets compounded when you have to decide who to share each item with. Since you have to choose who you want to share things with on every post, you have to think a lot about your circles and decide who would be interested in what, and who you're comfortable sharing things with. Or just mark everything 'public', in which case Google+ isn't much better than a public blog.

Finally, the problem in deciding who would be interested in a post I want to share is that I'm never sure who would be interested in what. When I post something to my tumblr / twitter / facebook, I'm usually surprised by who responds. If I post some code I wrote for a server monitoring script, I can guess who's going to be interested in that. But if I post some video game news I'm excited about or a music video I like, my coworkers are just as likely to be interested as my friends.

With Twitter, everything is public by default - you don't have to wonder if it's okay to post that racy dream you had about your boss last night. You know it's not. If you make your account private, then whenever you allow someone to follow you, you have to decide if you want to let them into your private life, but it's a binary decision. Yes or no. It's not complicated. Facebook is the same - either you let someone in or you don't. But Google+ encourages you to think harder, and personally I don't want to put that much effort into social networking.

Demeter: It’s not just a good idea. It’s the law. | Virtuous Code

Jul 6, 2011

For all classes C, and for all methods M attached to C, all objects to which M sends a message must be instances of classes associated with the following classes:

The argument classes of M (including C).

The instance variable classes of C.

(Objects created by M, or by functions or methods which M calls, and objects in global variables are considered as arguments of M.)

via avdi.org

Looks like a good idiom. Avdi's explanation is pretty solid, too.
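Here’s a tiny Ruby sketch of what the rule means in practice. The Order / Customer / Address classes are invented for this example and don’t come from Avdi’s post.

```ruby
# Hypothetical domain objects for illustration only.
Address = Struct.new(:city)

Customer = Struct.new(:name, :address) do
  # Customer talks to its own attribute (address), which Demeter allows.
  def city
    address.city
  end
end

class Order
  def initialize(customer)
    @customer = customer
  end

  # Demeter-friendly: Order only sends a message to its own instance
  # variable, instead of callers writing order.customer.address.city.
  def shipping_city
    @customer.city
  end
end

order = Order.new(Customer.new("Ada", Address.new("Portland")))
puts order.shipping_city
```

Callers only need to know about Order, so changing how a customer stores an address doesn’t ripple out to them.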

The brew command - GitHub

Jun 29, 2011

Use man brew to view the manpage.

Command Description

brew --cache Print path to Homebrew’s download cache (usually ~/Library/Caches/Homebrew)

brew --cellar Print path to Homebrew’s Cellar (usually /usr/local/Cellar)

brew --config Print system configuration info

brew --env Print Homebrew’s environment

brew --prefix Print path to Homebrew’s prefix (usually /usr/local)

brew --prefix [formula] Print where formula is installed

brew audit Audit all formulae for common code and style issues

brew cleanup [formula] Remove older versions from the Cellar for all (or specific) formulae¹

brew create [url] Generate formula for downloadable file at url, then open it in $BREW_EDITOR or $EDITOR²

brew create [tarball-url] --cache Generate formula (including MD5), then download the tarball

brew create --fink [formula] Open Fink’s search page in your browser, so you can see how they do formula

brew create --macports [formula] Open MacPorts’ search page in your browser, so you can see how they do formula

brew doctor Check your Homebrew installation for common issues

brew edit Open all of Homebrew for editing in TextMate

brew edit [formula] Open [formula] in $HOMEBREW_EDITOR or $EDITOR

brew fetch --force -v --HEAD [formula] Download source package for formula; for tarballs, also prints MD5 and SHA1 checksums

brew home Open Homebrew’s homepage in your browser

brew home [formula] Opens formula ’s homepage in your browser

brew info Print summary of installed packages

brew info [formula] Print info for formula (regardless of whether formula is installed)

brew info --github [formula] Open Github’s History page for formula in your browser

brew install [formula] Install formula

brew install --HEAD [formula] Install the HEAD version of formula (if its formula defines HEAD)

brew install --force --HEAD [formula] Install a newer HEAD version of formula (if its formula defines HEAD)

brew link [formula] Symlink all installed files for formula into the Homebrew prefix³

brew list [formula] List all installed files for formula (or all installed formulae with no arguments )

brew outdated List formulae that have an updated version available (brew install formula will install the newer version)

brew prune Remove dead symlinks from Homebrew’s prefix⁴

brew remove [formula] Uninstall formula

brew search List all available formula

brew search [formula] Search for formula in all available formulae

brew search /[formula]/ Search for /formula/ (as regex) in all available formulae

brew unlink [formula] Unsymlink formula from Homebrew’s prefix

brew update Update formulae and Homebrew itself

brew upgrade Install newer versions of outdated packages

You can update outdated packages with any of the following:

brew upgrade

brew install `brew outdated`

brew outdated | xargs brew install

¹ To delete a specific version, just go to the folder in the Cellar and rm -rf it; alternatively, drag it to the trash in Finder.

² Homebrew tries to guess the formula name and version. If it fails, you’ll have to make your own template. I suggest copying wget ’s.

³ Symlinking is automatically performed when installing formulae. It’s useful for DIY installation, or swapping out versions of a package you have multiple installs of.

⁴ This is generally not needed. However, it can be useful if you are doing DIY installations.

via github.com

This page was waaay too hard to find. Every page / readme / tutorial about homebrew is just how to install the damn thing, not how to use it.

Protip: Ubuntu / Jetty user has /bin/false shell

Jun 23, 2011

While setting up Jenkins to use our GitHub repo for continuous integration, I found I had to log in as Ubuntu’s “jetty” user to verify that an ssh key had been generated and was being used. However, because apt-get sets up the jetty user with a non-login shell, the default shell was /bin/false, which ends execution immediately. Therefore, I couldn’t do sudo su jetty.

Instead, I had to use sudo su -p jetty, so it would use my current user’s shell (/bin/bash) instead.

Learn something new every day.