a t e v a n s . c o m

(╯°□°)╯︵ <ǝlqɐʇ/>

So, I’ve been experimenting with Apache Nutch a little bit. I decided to make a quick search app thingy. Ultimately, it didn’t work out – I wanted to do more advanced parsing on the pages I was crawling than Nutch or Hounder or any other crawler I found was really capable of, but I made some fun stuff.

I made a Rake task file to make running the Nutch crawl command easier. I also made a YML file to organize the params for the crawl. Here’s the content of each, in case anyone is curious.

lib/tasks/nutch.rake :

namespace :nutch do

  config = YAML.load_file("#{Rails.root}/config/nutch.yml")[Rails.env]
  urls_file = "#{Rails.root}/nutch/urls/*"
  nutch_home = "#{Rails.root}/nutch/"

  def run(command)
    puts command
    system "#{command} >&2"
  end

  desc "tiny crawl to test configuration"
  task :test do
    run "NUTCH_HOME=#{nutch_home} #{config['bin']} crawl #{urls_file} -dir nutch/crawl -depth 2 -topN 10"
  end

  desc "basic nutch crawl"
  task :crawl do
    run "NUTCH_HOME=#{nutch_home} #{config['bin']} crawl #{urls_file} -dir nutch/crawl -depth #{config['depth']} -topN #{config['topN']}"
  end
end

config/nutch.yml :

development:
  bin: /Users/username/bin/nutch
  threads: 5
  depth: 5
  topN: 10

test:
  bin: /usr/bin/nutch
  threads: 5
  depth: 5
  topN: 50

production:
  bin: /usr/bin/nutch
  threads: 5
  depth: 5
  topN: 50
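The threads value in the YML file never makes it into the commands in the rake tasks above. Here is a minimal sketch of how the command could be built from one environment's settings, assuming the crawl command accepts a -threads option (Nutch 1.x's crawl does, as far as I know). The nutch_crawl_command helper and the inline sample config are made up for illustration, and the sketch passes the urls directory rather than the glob used above.

```ruby
require 'yaml'
require 'shellwords'

# Hypothetical helper: build the crawl command from one environment's settings.
# Assumption: `-threads` is a valid option for the crawl command.
def nutch_crawl_command(settings, urls_dir, crawl_dir)
  [
    settings.fetch('bin'), 'crawl', urls_dir,
    '-dir', crawl_dir,
    '-threads', settings.fetch('threads'),
    '-depth', settings.fetch('depth'),
    '-topN', settings.fetch('topN')
  ].map { |part| Shellwords.escape(part.to_s) }.join(' ')
end

# Inline stand-in for config/nutch.yml so the sketch runs on its own.
SAMPLE_NUTCH_CONFIG = <<~CONFIG
  development:
    bin: /Users/username/bin/nutch
    threads: 5
    depth: 5
    topN: 10
CONFIG

settings = YAML.load(SAMPLE_NUTCH_CONFIG).fetch('development')
puts nutch_crawl_command(settings, 'nutch/urls', 'nutch/crawl')
```

Escaping each part with Shellwords keeps a path with spaces in it from breaking the command line.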

Anyway, maybe that will help somebody get started integrating Nutch into a rails app. Or maybe not. Good luck! Ask me any questions in the comments.

So, we needed to clear out memcached. You don’t always want to do this when you deploy, but sometimes layout changes, stream changes, or other major application changes require flushing memcached on production. The logical solution is to put this in the same place as the other stuff you do when you deploy, or otherwise “sometimes.”

To do that, I set up a Rake task. Using Rake has a few advantages – you can use the Rails environment and the Rails configuration info, so you don’t have to make assumptions about where memcached is located or how it’s accessed (init.d vs service). Also, you can clear out application caches – pages rendered to the filesystem, caches stored in the db, or other weird stuff your app might do.

My rake task looks like this:

namespace :cache do
  desc 'Clear memcache'
  task :clear => :environment do
    Rails.cache.clear
    CACHE.flush
  end
end

Then, I set up a Capistrano action to run the rake task on our production servers. I put it in a shared.rb file so it applies to any environment in our deploy scheme – you might want to put it directly in your Capfile, or in the appropriate environment file in config/deploy if you use capistrano-ext. Here’s my cap action:

Capistrano::Configuration.instance(:must_exist).load do
  namespace :memcached do
    desc "Flush memcached"
    task :clear, :roles => [:cache], :only => {:memcached => true} do
      run "cd #{deploy_to}/current && /usr/bin/env rake cache:clear RAILS_ENV=#{stage}"
    end
  end
end

Remember: you have to cd into the directory of your deployed Rails app in order to use rake, rake must be installed on the target machine, and RAILS_ENV must be set to the appropriate environment. Since we’re using multistage, I used the stage variable, but you might want to hard-code production if your deployment setup is simpler.

Feel free to leave questions in the comments!

Short Version: The easiest way to invalidate a cached view that depends on a model whenever the model updates is to use the model’s updated_at time in the cache key. Rails does this automatically if you pass a model object into the cache() method for fragment caching. The cache never expires, but a new update will result in a cache miss and regeneration. Memcached will get rid of the old records for you when it runs low on memory.
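To make the idea concrete, here is a rough sketch outside of Rails. Everything in it is a stand-in: Post is a plain Struct rather than an ActiveRecord model, STORE is a Hash playing the part of memcached, and fragment_key only imitates the general shape of a model-based cache key (name, id, timestamp). It is not how Rails implements this internally.

```ruby
# Stand-in for an ActiveRecord model; only the fields the key needs.
Post = Struct.new(:id, :title, :updated_at)

# Hypothetical key builder: class name, id, and the updated_at timestamp.
def fragment_key(record)
  stamp = record.updated_at.utc.strftime('%Y%m%d%H%M%S')
  "#{record.class.name.downcase}s/#{record.id}-#{stamp}"
end

# A plain Hash stands in for memcached. Nothing is ever expired here.
STORE = {}

# Return the cached fragment, rendering (via the block) only on a miss.
def cached_fragment(record)
  STORE[fragment_key(record)] ||= yield
end

post = Post.new(1, 'Hello', Time.utc(2010, 11, 1, 12, 0, 0))
renders = 0

# Two reads of an unchanged post: the block runs once.
2.times { cached_fragment(post) { renders += 1; "<h2>#{post.title}</h2>" } }

# An update changes updated_at, so the key changes and the next read misses.
post.title = 'Hello again'
post.updated_at = Time.utc(2010, 11, 2, 9, 30, 0)
cached_fragment(post) { renders += 1; "<h2>#{post.title}</h2>" }

puts renders      # 2
puts STORE.keys   # the old key is still there; memcached would evict it eventually
```

The old entry is never deleted explicitly. It just stops being asked for, which is the whole trick.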

Rant Edition: So, I was working out how to do cache invalidation for models, and I find the Rails 2.3 conceptual model for invalidation flawed. You create sweepers, which are ostensibly observers that get called whenever a model object updates. Except they really don’t – you have to set them up with a controller object, and when the model callbacks fire, they call the cache invalidation method from that controller. The way you set them up is via the cache_sweeper method on… the controller. So really, sweepers get tied to one controller, since calling it from another controller would result in the cache invalidation methods being called on the wrong controller.

For example, let’s say we have a Dashboard controller which shows the most recent blog posts on a community site. Since it depends on the Post model, ideally you’d want to invalidate the Dashboard cache on the Post after_save callback. So you create a sweeper to listen on the Post object and invalidate the index cache on after_save. The problem is, the sweeper only gets loaded on the DashboardController; all the changes to Post happen in the PostsController. So maybe you can set up the sweeper to instantiate a DashboardController to invalidate the cache there, and then call the sweeper on the PostsController. But now wait, what about the BlogController? If you delete a blog, you get rid of all of its posts, so you have to put the sweeper there, too…

Essentially, you end up in the position of asking which controller actions will affect a model; and that’s not always obvious. It’s an ill-posed question to boot: the answer is “all of them.” You want the sweeper to fire every time the model is updated, no matter the controller. The real question you should be asking is, “What caches do I need to invalidate when this model object updates?”

I looked around a bit, but I didn’t see any clean-looking Railsy things that operate on this question. Maybe I missed something?

The workaround for the time being is as above; using the updated_at time as a cache key. Major thanks to RailsLab for providing a relatively clean answer to this problem.

So, theoretically, someone should be able to run your Rails migrations and get the structure of the current production database. I tried to do this today on a big ol’ existing codebase, and had to change a huge number of migrations just to get up to current. Here’s some stuff I found that was inappropriate and probably shouldn’t be done in migrations.

  1. Using model classes in migrations to create / update data, run SQL, etc. If you ever prune your models, rename or get rid of them, the old migrations will be broken.
  2. Handling stuff other than data in migrations. A few migrations had cron management in them – the ideal way to do that is to store your crontab in the app code, then copy / overwrite it in Capistrano when you deploy. That way Cron gets versioned along with everything else, and you don’t have to mess with it when you’re trying to get a database up and running.

Anything else that should not be done in migrations?

I went through the process of setting up a development Mac today. Before I could even get to the environment, I found I had a huge set of stuff to go through to make the computer respond to my commands in a way that’s familiar. Call it a computer Ikea nesting instinct. Here’s the basic rundown:

  1. Turn it on, user install, whatever normal issues are.
  2. Set up hot corner for screen saver, right command for Expose, right option for Desktop
  3. Quicksilver (set hotkey to Alt+Space)
  4. Chrome
  5. Setup Chrome sync so my bookmarks, history and auto-suggest are what I expect
  6. iTerm (Bookmarks –> manage profiles –> keyboard profiles –> global –> option key as +esc)
  7. TextMate (get haml bundle, alt+cmd+l to show line numbers, soft tabs 2)
  8. Zshell (install brew or macports if neither of them is installed… then zsh)
  9. oh-my-zsh — comment out auto_name_dirs in lib/directories.zsh
  10. Setup zsh plugins, create alias for gpom=“git push origin master” … could Dropbox this, but machines are kind of different, and it’s not a lot of config yet
  11. RVM
  12. Change shortcuts for any app with tabs to Alt+Lftarrow and Alt+Rtarrow (default for Chrome, Textmate, iTerm)
  13. Add Ctrl+s shortcut for Chrome “Find Next” item

I’m sure I’ll find more stuff in the next few days; I’ll keep updating the list. This is mostly for my reference, but maybe someone else will find it interesting.

Not exactly clean, not disastrous either. Mac laptop for development, windows box for cross-platform testing, super-comfy office chair + ottoman setup, and speakers for annoying the neighbors. I get a lot done here, even with the plethora of distractions at home.

Ran into an interesting problem today - let’s say we have a form that does more than one thing. Give it a setup something like this:

<form action="/" method="post">
  <input type="hidden" name="option_value" value="1" />
  <input name="search_options" type="text" />
  <input type="submit" name="commit" value="Search Options" />
  <br />

  <input type="hidden" name="choice_value" value="1" />
  <input name="search_choices" type="text" />
  <input type="submit" name="commit" value="Search Choices" />
  <br />

  <input name="some_text" type="text" />
  <input type="submit" name="commit" value="Submit Form" />
</form>

This is a pretty decent setup if you want to use those text boxes to search for valid choices for changing option_value or choice_value. Often, if you’re dealing with related records, you’ll have too many choices to throw a dropdown or select box at it, and you’ll want to have some kind of graceful degradation so you don’t have to rely completely on AJAX auto-completes.

The key point is, you want the form to do different things based on whether the user clicked ‘Search Options,’ ‘Search Choices,’ or ‘Submit Form.’ In my case, I wanted to keep (but not persist) the other values on the form, but display a set of options for the user to choose from if they searched for Options or Choices.

But what happens if they’re typing in search_options or search_choices and they hit enter? That’s an interesting problem.

In Chrome, here’s what happens: it takes the first type="submit" input on the form and submits with that value. So in this case, it will send commit="Search Options", even if they were typing in search_choices. Not really ideal, is it?

My solution was to implement a hidden field to see if they hit enter instead of clicking a real submit button. I initially tried with this markup:

<form action="/" method="post"> <input type="submit" name="commit" value="Enter" style="display: none" />

Of course, you’d want to hide that in your stylesheets, not inline css, but this is an example. Unfortunately, it doesn’t work. Turns out Chrome ignores submit tags set to display:none. So, I tried it with visibility:hidden, and that works. Using visibility:hidden means the element still takes up the same amount of space in the display model, it just isn’t accessible to the user. So you have to find somewhere clever to hide it. My final markup looked more like this:

<form action="/" method="post">
  <input name="some_text" type="text" />
  <input type="submit" name="commit" value="Enter" style="visibility: hidden" /><br />

  <input type="hidden" name="option_value" value="1" />
  <input name="search_options" type="text" />
  <input type="submit" name="commit" value="Search Options" />
  <br />

  <input type="hidden" name="choice_value" value="1" />
  <input name="search_choices" type="text" />
  <input type="submit" name="commit" value="Search Choices" />
  <br />
  <br />

  <input type="submit" name="commit" value="Submit Form" />
</form>

This way, you can check if commit="Enter", and if it is, you can check which of the search fields has content and conduct an appropriate search based on that. There is a case where it isn’t really possible to tell what the user intended: if they type something into both search_options and search_choices and then hit enter, you can’t be sure which one they meant to search, and there’s no indication of where the cursor was when the user hit enter. In that case, I defaulted to the search_options behavior - without javascript I can’t think of anything better to do.
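On the server side, the decision could look roughly like this. It's a sketch, not the code from my app: form_action is a made-up helper, params is a plain Hash standing in for the request parameters, and falling back to a normal submit when neither search box has content is an assumption on my part.

```ruby
# Hypothetical helper: work out what the user meant from the submitted params.
def form_action(params)
  case params['commit']
  when 'Search Options' then :search_options
  when 'Search Choices' then :search_choices
  when 'Enter'
    options_filled = !params['search_options'].to_s.strip.empty?
    choices_filled = !params['search_choices'].to_s.strip.empty?
    if options_filled
      :search_options   # also the tie-breaker when both boxes have content
    elsif choices_filled
      :search_choices
    else
      :submit           # assumption: Enter with empty search boxes is a plain submit
    end
  else
    :submit             # 'Submit Form' and anything unexpected
  end
end

puts form_action('commit' => 'Enter', 'search_choices' => 'red')
puts form_action('commit' => 'Enter', 'search_options' => 'a', 'search_choices' => 'b')
```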

I wonder if there’s a better solution for browsers? Maybe set a variable like _submitted_from with the name of the text field the user hit ‘enter’ on as the value? I guess there isn’t a variable they can create that isn’t used by some form somewhere on the internet, but it certainly seems worth knowing when you’re trying to divine user intention from stateless form submissions.

So, recently I went to work on a small project; I built a micro Flickr interface as an iPhone web app for a weekend project. I used the jQTouch javascript framework along with Sinatra to build it. If you read the O’Reilly file on iPhone web app development, jQTouch looks pretty good. I ran into some major show-stopping bugs pretty quickly, though.

For example, here is their basic template code:

<div id='home'> 
  <div class='toolbar'> 
    <h1>Home</h1> 
  </div> 
  <ul> 
    <li>Item 1</li> 
    <li>Item 2</li> 
    <li>
      <a href='#about'>About</a>
    </li>
  </ul> 
</div> 
<div id='about'> 
  <div class='toolbar'> 
    <a class='back' href='#'>Back</a>
    <h1>About</h1> 
  </div> 
  <ul> 
    <li>About 1</li> 
    <li>About 2</li> 
  </ul> 
</div>

That works fine; it matches jQTouch’s template system and creates a beautiful interface with nesting and back buttons and everything. Now, let’s say I want to add a little style, or maybe have some content not in a list on the home screen. In that case, you’d want to add a container, like so:

<div id='home'> 
  <div class='toolbar'> 
    <h1>Home</h1> 
  </div> 
  <div> 
    <ul> 
      <li>Item 1</li> 
      <li>Item 2</li> 
      <li>
        <a href='#about'>About</a>
      </li>
    </ul> 
  </div> 
</div> 
<div id='about'> 
  <div class='toolbar'> 
    <a class='back' href='#'>Back</a>
    <h1>About</h1> 
  </div> 
  <ul> 
    <li>About 1</li> 
    <li>About 2</li> 
  </ul> 
</div>

Boom! Too late, already broken. Navigation no longer works. When you click that “About” link, which normally takes you to the about screen (denoted by the #about div), you instead get an error where the application continually tries to go ‘back’ but there are no pages in the global history hash.

God help you if you want to register click or tap handlers. Basically, I ended up implementing all the navigation by hand, and because of that it’s buggy. Next time, I’ll do more research on mobile js frameworks - maybe go with iUI or jQueryMobile.

I occasionally use Siege to benchmark stuff. It's a super-quick way to do some scalability testing and find out if your application can stand up to some load. It can also help you identify slow pages and bottlenecks in the application. Recently I wanted to find out where a toy I'm making hockeysticks - that is, where the graph of request time / concurrent requests curves sharply upward.

To do this, I ran Siege ten times, stepping up the number of concurrent requests logarithmically. I set an option in my ~/.siegerc file to log all runs of siege to a file in my home directory, and it gave me a nice csv log output file. Unfortunately, headers were not included! And it didn't match the single-run output for siege. So I had a bunch of numbers, and no real indication of what they meant. I searched for documentation, but came up empty-handed. So I dug through the source code and found out. Here's the breakdown:

field                          value
date                           2010-11-19 13:23:22
total requests                 125
elapsed time (sec)             59.44
bytes (MB)                     0
avg response time (sec)        0.01
req / sec                      2.10
bytes / sec (MB / s)           0.00
total time / elapsed time      0.02
successes                      125
failures                       0

As you can see from the graph, the important numbers here are requests per second and average response time.
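If you want to work with the log programmatically, the breakdown above is enough to label each row. Here's a small sketch. The column symbols are names I made up to match the breakdown, and the comma-plus-padding layout of the sample line is an assumption about what the log file looks like.

```ruby
# Made-up column names, in the order given in the breakdown above.
SIEGE_COLUMNS = %i[
  date requests elapsed_sec megabytes avg_response_sec
  requests_per_sec megabytes_per_sec total_over_elapsed successes failures
].freeze

# Columns that hold whole numbers; everything else except the date is a float.
SIEGE_INTEGER_COLUMNS = %i[requests successes failures].freeze

# Turn one log line into a Hash keyed by column name.
def parse_siege_line(line)
  values = line.strip.split(',').map(&:strip)
  unless values.size == SIEGE_COLUMNS.size
    raise ArgumentError, "expected #{SIEGE_COLUMNS.size} fields, got #{values.size}"
  end

  SIEGE_COLUMNS.zip(values).each_with_object({}) do |(name, value), row|
    row[name] =
      if name == :date then value
      elsif SIEGE_INTEGER_COLUMNS.include?(name) then Integer(value)
      else Float(value)
      end
  end
end

# Assumed layout: comma-separated values with space padding.
sample = '2010-11-19 13:23:22,    125,      59.44,           0,       0.01,' \
         '        2.10,        0.00,        0.02,     125,       0'
row = parse_siege_line(sample)
puts "#{row[:requests_per_sec]} req/s, #{row[:avg_response_sec]}s avg response"
```

Run that over every line of the log and you have the two series worth plotting against concurrency.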

Heroku currently (11/17/2010) has an issue wherein it refuses to play nice with Bundler. I have a pretty simple Gemfile for a little toy I’m working on, and it looks like this:

source :rubygems
gem 'sinatra'
gem 'haml'

I ran bundle install, checked in the Gemfile and Gemfile.lock, and pushed to Heroku aaaaannnndd…

You have deleted from the Gemfile:
  * version: 1.0.6

I looked at what it was choking on in the Gemfile.lock, and it turned out to be under the METADATA section at the bottom. Specifically, the last two lines in my Gemfile.lock were:

METADATA
  version: 1.0.6

After much searching, I found nothing. I couldn’t see any particular reason that those lines were there, didn’t find anyone else having the same issue, or anyone else having Bundler choke on metadata. So, I deleted the line. Now the end of my Gemfile.lock looks like:

METADATA

Ran “git push heroku master” and boom! No error. Soblem prolved.

Well, not really. Ideally, Heroku wouldn’t choke on metadata in Gemfile.lock.
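If you'd rather not hand-edit the lockfile every time, the same workaround can be scripted. This is only a sketch of what I did by hand: strip_lock_metadata is a made-up helper, it works on the text of the file rather than through Bundler, and the sample lockfile below is trimmed down to the parts that matter.

```ruby
# Hypothetical helper: drop the indented lines under a METADATA heading,
# keeping the heading itself (which is what my manual edit left behind).
def strip_lock_metadata(lock_text)
  in_metadata = false
  kept = lock_text.each_line.reject do |line|
    if line.strip == 'METADATA'
      in_metadata = true
      false                    # keep the heading
    elsif in_metadata && line.start_with?(' ')
      true                     # drop indented entries under the heading
    else
      in_metadata = false
      false                    # keep everything else
    end
  end
  kept.join
end

# Trimmed-down stand-in for the tail of a Gemfile.lock.
sample_lock = "DEPENDENCIES\n  haml\n  sinatra\n\nMETADATA\n  version: 1.0.6\n"
puts strip_lock_metadata(sample_lock)
```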
