So, I’ve been experimenting with Apache Nutch a little bit, and I decided to make a quick search app thingy. Ultimately, it didn’t work out: I wanted to do more advanced parsing on the pages I was crawling than Nutch, Hounder, or any other crawler I found was really capable of. I did make some fun stuff along the way, though.
I made a Rake task file to make running the Nutch crawl command easier. I also made a YAML file to organize the params for the crawl. Here’s the content of each, in case anyone is curious.
lib/tasks/nutch.rake :
namespace :nutch do
  # Per-environment settings (bin, depth, topN) from config/nutch.yml
  config = YAML.load_file("#{Rails.root}/config/nutch.yml")[Rails.env]
  urls_file = "#{Rails.root}/nutch/urls/*"
  nutch_home = "#{Rails.root}/nutch/"

  # Echo the command, then run it with its stdout redirected to stderr
  def run(command)
    puts command
    system "#{command} >&2"
  end

  desc "tiny crawl to test configuration"
  task :test do
    run "NUTCH_HOME=#{nutch_home} #{config['bin']} crawl #{urls_file} -dir nutch/crawl -depth 2 -topN 10"
  end

  desc "basic nutch crawl"
  task :crawl do
    run "NUTCH_HOME=#{nutch_home} #{config['bin']} crawl #{urls_file} -dir nutch/crawl -depth #{config['depth']} -topN #{config['topN']}"
  end
end
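One thing the run helper above doesn't do is look at whether the command actually succeeded. Here's a minimal sketch of a variant that does, assuming you'd want the rake task to stop on a failed crawl. The name run_checked is made up for illustration and isn't in my Rakefile.

```ruby
# Hypothetical variant of the run helper: same echo-and-run behavior,
# but raises if the command exits non-zero or cannot be started.
def run_checked(command)
  puts command
  # Kernel#system returns true on exit status 0, false on a non-zero
  # exit status, and nil if the command could not be executed.
  ok = system("#{command} >&2")
  raise "command failed: #{command}" unless ok
  ok
end
```

Swapping run for run_checked in the tasks would make rake nutch:crawl fail loudly instead of silently carrying on.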
config/nutch.yml :
development:
  bin: /Users/username/bin/nutch
  threads: 5
  depth: 5
  topN: 10

test:
  bin: /usr/bin/nutch
  threads: 5
  depth: 5
  topN: 50

production:
  bin: /usr/bin/nutch
  threads: 5
  depth: 5
  topN: 50
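To show how the YAML settings end up in the crawl command, here's a rough standalone sketch that works outside of Rails. The helper nutch_crawl_command, its keyword arguments, and the inline sample config are all assumptions made up for this example; the Rakefile just interpolates the values directly. It builds the same command, but shell-escapes each piece with Shellwords.

```ruby
require "yaml"
require "shellwords"

# Hypothetical helper (not part of the original Rakefile): builds the
# crawl command string from one environment's settings hash.
def nutch_crawl_command(settings, urls_dir:, nutch_home:, crawl_dir: "nutch/crawl")
  args = [
    settings.fetch("bin"), "crawl", urls_dir,
    "-dir", crawl_dir,
    "-depth", settings.fetch("depth").to_s,
    "-topN", settings.fetch("topN").to_s
  ]
  # Escape every argument so spaces or shell metacharacters in paths
  # cannot break the command line.
  "NUTCH_HOME=#{Shellwords.escape(nutch_home)} #{Shellwords.join(args)}"
end

# Inline copy of the development section, so the sketch runs on its own.
sample_config = <<~CONFIG
  development:
    bin: /Users/username/bin/nutch
    threads: 5
    depth: 5
    topN: 10
CONFIG

settings = YAML.safe_load(sample_config).fetch("development")
command = nutch_crawl_command(settings, urls_dir: "nutch/urls", nutch_home: "nutch/")
puts command
```

Using fetch instead of [] means a missing key in the YAML raises a KeyError right away, rather than quietly putting an empty value into the command.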
Anyway, maybe that will help somebody get started integrating Nutch into a Rails app. Or maybe not. Good luck! Ask me any questions in the comments.