Roo XLSX Parsing: Invalid value for Integer (ArgumentError)

tl;dr - got a weird error when opening a .xlsx sheet using Roo, wrote a quick fix for it. The error had a stack trace like this:

from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/nokogiri-1.15.4-x86_64-darwin/lib/nokogiri/xml/node_set.rb:234:in `upto'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/nokogiri-1.15.4-x86_64-darwin/lib/nokogiri/xml/node_set.rb:235:in `block in each'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/sheet_doc.rb:224:in `block (2 levels) in extract_cells'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/sheet_doc.rb:101:in `cell_from_xml'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/nokogiri-1.15.4-x86_64-darwin/lib/nokogiri/xml/node_set.rb:234:in `each'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/nokogiri-1.15.4-x86_64-darwin/lib/nokogiri/xml/node_set.rb:234:in `upto'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/nokogiri-1.15.4-x86_64-darwin/lib/nokogiri/xml/node_set.rb:235:in `block in each'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/sheet_doc.rb:114:in `block in cell_from_xml'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/sheet_doc.rb:172:in `create_cell_from_value'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/sheet_doc.rb:172:in `new'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/cell/number.rb:16:in `initialize'
from /Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/cell/number.rb:27:in `create_numeric'
/Users/agius/.rbenv/versions/2.7.0/lib/ruby/gems/2.7.0/gems/roo-2.10.0/lib/roo/excelx/cell/number.rb:27:in `Integer': invalid value for Integer(): "" (ArgumentError)

I made a pull request with a fix here, but it hasn’t been addressed or merged. Kind of a bummer, feels like this happens nearly every time I try to contribute to open-source projects. You can implement the fix yourself by monkey-patching like so:

# frozen_string_literal: true

module Roo
  class Excelx
    class Cell
      class Number < Cell::Base
        def create_numeric(number)
          return number if Excelx::ERROR_VALUES.include?(number)
          case @format
          when /%/
            cast_float(number)
          when /\.0/
            cast_float(number)
          else
            (number.include?('.') || (/\A[-+]?\d+E[-+]?\d+\z/i =~ number)) ? cast_float(number) : cast_int(number, 10)
          end
        end
        
        def cast_float(number)
          return 0.0 if number == ''

          Float(number)
        end

        def cast_int(number, base = 10)
          return 0 if number == ''

          Integer(number, base)
        end
      end
    end
  end
end

If you’re running Rails, drop that in lib/roo/excelx/cell/number.rb and make sure it’s require -ed in an initializer or on app boot. Your app should now be able to handle the “bad” spreadsheets producing the error.

The longer story

This wasn’t the initial error I was debugging, I think the original error was a NoMethodError on the Roo::Spreadsheet object or something. From seeing the initial error on Bugsnag, I was able to determine which file it came from, but it seemed like the stack trace was truncated. So I downloaded the sheet and tried to reproduce locally. The results were weird: I didn’t get the production error, and didn’t seem to get a “proper” local error, either. When running inside a rails console, I just saw the message:

> xlsx.each_row_streaming {|row| pp [:row, row] }
...lots of output...
[:row,
 [#<Roo::Excelx::Cell::String:0x00007f8982913580
   @cell_value="Header 1",
   @coordinate=[1, 1],
   @value="Header 1">,
  #<Roo::Excelx::Cell::String:0x00007f8982913148
   @cell_value="Header 2",
   @coordinate=[1, 2],
   @value="Header 2">,
  #<Roo::Excelx::Cell::String:0x00007f8982912ea0
   @cell_value="Header 3",
   @coordinate=[1, 3],
   @value="Header 3">]]
invalid value for Integer(): ""
> $!
nil
> $@
nil
> 

Like, huh? Where’s the stack trace? Why isn’t the error in $! , the magic Ruby global for “last error message?” I did manage to get a full stack trace by writing a test that tried to parse the “bad” spreadsheet. Maybe IRB or Pry was swallowing the error somehow.

Once I got the stack trace above, it looked like something was wrong in Roo itself, and not in our code or the way we were calling it. It seemed like something that should be a simple fix. The “Numbers” app that comes with OSX opened the bad spreadsheet no problem, and uploading to Google Sheets also didn’t pose any issue. So what was going on with Roo? Clearly it was trying to parse something as an Integer when it was an empty string, but how did it get there?

Investigating the raw XML

XLSX documents are just XML files zipped up in a particular way. You can unzip them and read the raw XML if you want. Since the each_row_streaming command above indicated which cell was causing the problem, I thought I’d dig in and see if it was some weird Unicode character conversion, mis-encoded file, or what it might be.

To unzip your XLSX file, just use your OS’s built-in unzipping tool:

$ unzip Example-sheet.xlsx -d Example-sheet-unzip
$ find empty_sheet
empty_sheet
empty_sheet/[Content_Types].xml
empty_sheet/_rels
empty_sheet/_rels/.rels
empty_sheet/xl
empty_sheet/xl/workbook.xml
empty_sheet/xl/worksheets
empty_sheet/xl/worksheets/sheet1.xml
...snip...
$ open xl/worksheets/sheet1.xml

Yup! Sure is XML. It’s not, like, super-readable XML, but you can kinda reverse-engineer it. Like reading a restaurant menu in a language you don’t know.

<c r="K58" s="1" t="s">
  <v>4</v>
</c>

Here’s a cell in the spreadsheet. You can see it’s cell K58 - column K, row 58. The t="s" only seems to appear for cells with string values in them, not the numeric ones, so it probably means “type=string” . The s="1" , I kinda think means “spreadsheet 1” , since each workbook can have multiple tabs of sheets on it that can reference each other. Don’t quote me on that, it was out of scope for this investigation.

The value inside the <v>4</v> tags indicates two possible things:

for numeric-type cells, it’s the numeric value in the cell, either int or float
for string-type cells, it’s a reference to the file sharedStrings.xml , which is just a big array of all the strings in the spreadsheet. Helps reduce filesize by de-duping strings, I guess

Knowing all that, and knowing the specific error message invalid value for Integer(): "" , I could kind of deduce what might have gone wrong with the cell that seemed to stump the Roo parser:

<c r="L58">
  <v/>
</c>

I checked some other spreadsheets that didn’t have any errors, and I couldn’t see any instances of a self-closing tag like this. Other spreadsheets had full valid XML tags:

<c r="N26">
  <v></v>
</c>

Roo didn’t blow a gasket on this type of cell. It seemed like this might be the issue. I did some Googling and figured out how to zip the dir back up as an XLSX file so I could test out my theory:

Modifying the XML in an XLSX file for fun & profit

I followed this guide to unzipping and re-zipping XLSX files. Re-posting in case the original site goes down or something:

unzip Example-sheet.xlsx -d Example-sheet-unzip
cd into the extracted zip dir Example-sheet-unzip
make edits to the xml as desired, probably in a file like xl/worksheets/sheet1.xml
use python2 ../zipdir.py Example-sheet-edited.xlsx . (see script below) to compress the objects
this should generate an .xlsx file open-able by Numbers, Excel, GSheets

#!/usr/bin/python

# Name: zipdir.py
# Version: 1.0
# Created: 2016-11-13
# Last modified: 2016-11-13
# Purpose: Creates a zip file given a directory where the files to be zipped
# are stored and the name of the output file. Don't include the .zip extension
# when specifying the zip file name.
# Usage: zipdir.py output_filename dir_name
# Note: if the output file name and directory are not specified on the
# command line, the script will prompt for them.

import sys, shutil

if len(sys.argv) == 1:
   dir_name = raw_input("Directory name: ")
   output_filename = raw_input("Zip file name: ")
elif len(sys.argv) == 3:
   output_filename = sys.argv[1]
   dir_name = sys.argv[2]
else:
   print "Incorrect number of arguments! Usage: zipdir.py output_filename dir_name"
   exit()

shutil.make_archive(output_filename, 'zip', dir_name)

Once I did all that and zipped the file back up, Roo was able to parse it and print the expected results in IRB and in my RSpec test case. Bingo! Scientific proof, basically.

I also took a stab at generating some “bad” XML files myself, since I didn’t want to check a real file with production data into our git repo if I could avoid it. I couldn’t get Numbers or Google Sheets to generate a self-closing XML tag in the output. They only generated valid open-and-close-style tags even for empty cells.

A little more investigation revealed the file came from an external web site’s reporting feature. The site was using Kendo UI with Angular on the front-end, which had a built-in export-to-Excel feature. Like most problems in web development, the blame falls on a horrible front-end framework reinventing the wheel with JavaScript.

Developing the fix

Since I couldn’t reproduce the exact path to make a “bad” file, I just wrote one myself. Made a quick, nearly-empty sheet in Numbers, exported it, and followed the steps above to change a valid <v></v> tag into an invalid <v/> tag. This I could check in to the repo along with an RSpec test reproducing the error condition. Now we could code something!

Since I had the stack trace, I could look at the file directly. The relevant class is here on Github, though I usually just use bundle show roo to inspect the exact code on my machine. Since there’s only one place with Integer() in the file, and this is basically another type of nil-check, it seemed like a pretty easy fix.

I’ve had bad luck getting fixes merged into open-source projects. Most maintainers seem overworked, short on time, and extremely particular about the kind of code they want in their repo. Even submitting a full pull request meeting the maintainers’ expectations for code style, testing, documentation, and all the other ancillary considerations, it can take a lot of back-and-forth and usually weeks to get a PR merged. I didn’t think we should wait this long to get our process fixed, so I went with monkey-patching as a first approach.

I made the modifications you can see in the monkey-patch 🙈 above. I figured we’d probably want to check for this empty-string, no-format condition for both Float and Integer values, so it’s probably worth factoring out into a separate method. This would also make it easier to extend later with a rescue block, if needed.

Added the patch to our repo, ran the specs to make sure it worked, and put up a pull request on our app. A fix is never really done until it’s verified to be working on production, so once I got code review, merged & deployed, I re-ran the file to verify it produced the expected results instead of an error. Good to go!

Upstreaming

Since I had written an essentially-empty XLSX file to test on locally, I figured it would be easy to make a PR to Roo itself. I cloned their repo, and their test suite was in RSpec and MiniTest and mostly what I expected. They had a whole pile of files with a particular naming scheme, and an easy test harness to parse the file and check for expected results.

Converted the monkey-patch into a diff on the Roo::Excelx::Cell::Number class, added my empty file as a test case, and added a unit test for the class to boot. Maybe I could have written more tests to cover every possible branch condition, but I figured I’d get the PR up and see what the maintainers thought about the approach.

Sadly I can’t end it with “they loved my PR and merged it” – not sure why my PR got ghosted when others are getting reviewed and merged. But at least we got it working for our app, so ¯\(ツ)/¯

Wrap-up

Hope that helps somebody out there! I learned a lot just debugging this one bad spreadsheet, so I figured it was worth a write-up. If you have any questions or comments, feel free to email me about it, since social media has kinda fallen off a cliff lately. Best of luck, internet programmers!

a t e v a n s . c o m