So, I’ve been experimenting with Apache Nutch a little bit, and I decided to make a quick search app. Ultimately, it didn’t work out: I wanted to do more advanced parsing on the pages I was crawling than Nutch, Hounder, or any other crawler I found could really handle. But I made some fun stuff along the way.

I made a Rake task file to make running the Nutch crawl command easier. I also made a YAML file to organize the crawl parameters for each Rails environment. Here’s the content of each, in case anyone is curious.

lib/tasks/nutch.rake:

namespace :nutch do

  # Crawl settings for the current Rails environment (see config/nutch.yml)
  config = YAML.load_file("#{Rails.root}/config/nutch.yml")[Rails.env]
  urls_file = "#{Rails.root}/nutch/urls/*"
  nutch_home = "#{Rails.root}/nutch/"

  # Echo the command, then run it in a shell with stdout redirected to stderr
  def run(command)
    puts command
    system "#{command} >&2"
  end

  desc "tiny crawl to test configuration"
  task :test do
    run "NUTCH_HOME=#{nutch_home} #{config['bin']} crawl #{urls_file} -dir nutch/crawl -depth 2 -topN 10"
  end

  desc "basic nutch crawl"
  task :crawl do
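    # Pass the configured thread count too; it was defined in nutch.yml but never used
    run "NUTCH_HOME=#{nutch_home} #{config['bin']} crawl #{urls_file} -dir nutch/crawl -threads #{config['threads']} -depth #{config['depth']} -topN #{config['topN']}"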
  end
end
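
One thing to watch: the run helper above throws away the return value of system, so a failed crawl still looks like a successful rake task. Here is a rough sketch of a stricter variant. The names nutch_crawl_args and run_checked are made up for illustration, and it assumes you pass the urls directory itself rather than the urls/* glob, since no shell is involved to expand the glob.

```ruby
require "shellwords"

# Hypothetical helper: build the `nutch crawl` argument list as an array,
# so paths with spaces don't need manual quoting.
def nutch_crawl_args(bin, urls_dir, depth:, top_n:, dir: "nutch/crawl")
  [bin, "crawl", urls_dir, "-dir", dir, "-depth", depth.to_s, "-topN", top_n.to_s]
end

# Hypothetical helper: echo the command, run it without a shell with the
# given environment variables, send stdout to stderr, and raise on failure.
def run_checked(env, args)
  puts Shellwords.join(args)
  system(env, *args, out: :err) or raise "command failed: #{args.first}"
end

# Possible usage inside a task (not executed here):
#   args = nutch_crawl_args(config['bin'], "#{Rails.root}/nutch/urls",
#                           depth: config['depth'], top_n: config['topN'])
#   run_checked({ "NUTCH_HOME" => nutch_home }, args)
```

Because rake stops when a task raises, a broken Nutch install or a bad path would then surface right away instead of passing silently.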

config/nutch.yml:

development:
  bin: /Users/username/bin/nutch
  threads: 5
  depth: 5
  topN: 10

test:
  bin: /usr/bin/nutch
  threads: 5
  depth: 5
  topN: 50

production:
  bin: /usr/bin/nutch
  threads: 5
  depth: 5
  topN: 50
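
If an environment block is missing or a key is misspelled, the rake task will happily interpolate an empty string into the command. A minimal sketch of loading the settings with an early check might look like this. The nutch_config method and the REQUIRED_KEYS list are my own assumptions, not part of the files above, and it parses a YAML string so it can be tried outside of Rails.

```ruby
require "yaml"

# Assumed set of keys the crawl task cannot run without.
REQUIRED_KEYS = %w[bin depth topN].freeze

# Hypothetical helper: parse the YAML text, pick the block for one
# environment, and fail loudly if the block or a required key is missing.
def nutch_config(yaml_text, env)
  settings = YAML.safe_load(yaml_text).fetch(env) do
    raise KeyError, "no nutch settings for environment #{env}"
  end
  missing = REQUIRED_KEYS - settings.keys
  raise KeyError, "missing nutch settings: #{missing.join(', ')}" unless missing.empty?
  settings
end

# Possible usage in the rake file (not executed here):
#   config = nutch_config(File.read("#{Rails.root}/config/nutch.yml"), Rails.env)
```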

Anyway, maybe that will help somebody get started integrating Nutch into a Rails app. Or maybe not. Good luck! Ask me any questions in the comments.