tl;dr: check out this sample code on GitHub for an AWS Textract client and results parser.
At my previous company, we wanted to use Textract to get some table-based data out of PDFs. This was in the medical field, so the only “programmatic” interface we had to the system was to set up an inbox that would receive emails from it, and those emails might contain PDFs with the data we wanted. Medicine be like that.
However, the output of Textract can be a little hard to work with. It’s just a big bag of “blocks” - elements it has identified on the page - with a geometry, confidence, and relationships to each other. They’re returned as a paginated list, and you have to reconstruct the element hierarchy in your client. This became critical when trying to visualize where in a given “type” of document the information we wanted was located. Was it the last LINE element on a page? Was it a WORD element located inside some other elements? I wanted to visualize this to get a better look.
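To make the "paginated bag of blocks" concrete, here is a minimal sketch of collecting every page of results and indexing the blocks by id. It assumes the asynchronous `get_document_analysis` call from the aws-sdk-textract gem, which takes a `job_id` and an optional `next_token`. The `fetch_all_blocks` and `index_blocks` helpers, and the `FakeClient`, `FakePage`, and `FakeBlock` stand-ins, are hypothetical names I made up so the example runs without AWS credentials.

```ruby
# Collect every block for a finished Textract job by following next_token.
# `client` is assumed to respond to get_document_analysis(job_id:, next_token:)
# and to return an object with #blocks and #next_token.
def fetch_all_blocks(client, job_id)
  blocks = []
  next_token = nil
  loop do
    params = { job_id: job_id }
    params[:next_token] = next_token if next_token
    resp = client.get_document_analysis(**params)
    blocks.concat(resp.blocks)
    next_token = resp.next_token
    break if next_token.nil?
  end
  blocks
end

# Index blocks by id so CHILD relationships can be resolved with a hash lookup.
def index_blocks(blocks)
  blocks.each_with_object({}) { |blk, map| map[blk.id] = blk }
end

# Hypothetical stand-ins for the real client and response objects.
FakePage = Struct.new(:blocks, :next_token)
FakeBlock = Struct.new(:id, :block_type, :text, :relationships)

class FakeClient
  def initialize(pages)
    @pages = pages
  end

  # Serves one page per call; the token is just the index of the next page.
  def get_document_analysis(job_id:, next_token: nil)
    index = next_token.nil? ? 0 : Integer(next_token)
    following = index + 1 < @pages.length ? (index + 1).to_s : nil
    FakePage.new(@pages[index], following)
  end
end

client = FakeClient.new([
  [FakeBlock.new('p1', 'PAGE', nil, nil)],
  [FakeBlock.new('l1', 'LINE', 'Hello', nil)]
])
blocks_map = index_blocks(fetch_all_blocks(client, 'job-123'))
puts blocks_map.keys.inspect
```

With a real client you would pass an `Aws::Textract::Client` instance and the job id returned when the analysis was started. The resulting hash is the kind of `blocks_map` the tree code further down expects.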
The result-parsing code in their tutorials is in Python, and is of the most uninspiring big-bag-of-functions type, so I thought about how to manage this in Ruby. Mostly I just wanted some data structure where I could call .parent on a particular element and recurse up to the page level, kind of like the DOM in HTML-land.
I ended up with some code that looks like this:
    class Node
      attr_reader :block, :parent, :children

      # blocks_map is a hash of block id => block, covering the whole document.
      def initialize(block, parent: nil, blocks_map: {})
        @block = block
        @parent = parent
        @children = []
        return if block.relationships.nil?

        block.relationships.each do |rel|
          # Only CHILD relationships describe the element hierarchy.
          next unless rel.type == 'CHILD'
          next if rel.ids.nil? || rel.ids.empty?

          rel.ids.each do |block_id|
            blk = blocks_map[block_id]
            next if blk.nil?

            # Recursively build the subtree, pointing each child back at this node.
            @children << self.class.new(blk, parent: self, blocks_map: blocks_map)
          end
        end
      end
    end
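For illustration, here is a rough sketch of turning a flat list of blocks into trees with that class. It assumes PAGE blocks are the roots of the hierarchy. The `build_page_trees` helper and the `Relationship` and `Block` structs are hypothetical stand-ins that only mimic the fields the class reads, and the class itself is repeated in condensed form so the snippet runs on its own.

```ruby
# Hypothetical stand-ins that mimic the fields Node reads from Textract blocks.
Relationship = Struct.new(:type, :ids)
Block = Struct.new(:id, :block_type, :text, :relationships)

# Condensed copy of the Node class above, repeated so this snippet is standalone.
class Node
  attr_reader :block, :parent, :children

  def initialize(block, parent: nil, blocks_map: {})
    @block = block
    @parent = parent
    @children = []
    (block.relationships || []).each do |rel|
      next unless rel.type == 'CHILD'

      (rel.ids || []).each do |block_id|
        blk = blocks_map[block_id]
        @children << self.class.new(blk, parent: self, blocks_map: blocks_map) unless blk.nil?
      end
    end
  end
end

# Build one tree per PAGE block from a flat list of blocks.
def build_page_trees(blocks)
  blocks_map = blocks.to_h { |blk| [blk.id, blk] }
  blocks.select { |blk| blk.block_type == 'PAGE' }
        .map { |page| Node.new(page, blocks_map: blocks_map) }
end

blocks = [
  Block.new('page-1', 'PAGE', nil, [Relationship.new('CHILD', ['line-1'])]),
  Block.new('line-1', 'LINE', 'Total due', [Relationship.new('CHILD', %w[word-1 word-2])]),
  Block.new('word-1', 'WORD', 'Total', nil),
  Block.new('word-2', 'WORD', 'due', nil)
]

page = build_page_trees(blocks).first
word = page.children.first.children.last
puts word.block.text                      # prints "due"
puts word.parent.parent.block.block_type  # prints "PAGE"
```

Ids that are missing from the map are skipped silently, which matches the `next if blk.nil?` guard in the original class.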
This gave me a tree object with a reasonable structure. If I wanted to get fancier, I could add a grep method to search the node text and its children, or other recursive tree-based functionality. If we wanted to get really fancy, we could sort the tree by x * y in the geometry, making it easy to walk the tree from top-left to bottom-right.
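As a sketch of what those fancier additions might look like, here is a small mixin with a recursive `grep` and a sort. Note that the sort here orders children by `[top, left]` of the bounding box, which is a swapped-in substitute for the x * y product mentioned above, not the same thing. The `TreeSearch` module and the `SimpleNode`, `Block`, `Geometry`, and `Box` stand-ins are hypothetical. The only assumption about real Textract blocks is that they expose `geometry.bounding_box` with `top` and `left`.

```ruby
# Mixin for any node that exposes #block and #children, such as the Node class above.
module TreeSearch
  # Yield this node and every descendant, depth first.
  def each_node(&visitor)
    return enum_for(:each_node) unless visitor

    visitor.call(self)
    children.each { |child| child.each_node(&visitor) }
  end

  # Return every node in the subtree whose text matches the pattern.
  def grep(pattern)
    each_node.select { |node| node.block.text.to_s.match?(pattern) }
  end

  # Order children top-to-bottom, then left-to-right, at every level of the tree.
  def sort_children!
    children.sort_by! do |child|
      box = child.block.geometry.bounding_box
      [box.top, box.left]
    end
    children.each(&:sort_children!)
    self
  end
end

# Hypothetical stand-ins so the sketch runs without a Textract response.
Box = Struct.new(:top, :left)
Geometry = Struct.new(:bounding_box)
Block = Struct.new(:block_type, :text, :geometry)

class SimpleNode
  include TreeSearch
  attr_reader :block, :children

  def initialize(block, children = [])
    @block = block
    @children = children
  end
end

def leaf(type, text, top, left)
  SimpleNode.new(Block.new(type, text, Geometry.new(Box.new(top, left))))
end

page = SimpleNode.new(
  Block.new('PAGE', nil, Geometry.new(Box.new(0.0, 0.0))),
  [leaf('LINE', 'Total: 42', 0.8, 0.1), leaf('LINE', 'Invoice', 0.1, 0.1)]
)

page.sort_children!
puts page.children.map { |n| n.block.text }.inspect
puts page.grep(/total/i).map { |n| n.block.text }.inspect
```

Writing it as a module keeps the tree class itself small. Including it in the node class would be one way to bolt these methods on later.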
But since we were writing pretty basic extractors, this was enough to let me walk through, find the element I wanted with the right block.text value, and walk up its parents to see where it lived in the document structure.
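Walking up the parents can be sketched like this. The `Ancestry` module, its `ancestors` and `path` methods, and the `SimpleNode` and `Block` stand-ins are all hypothetical names for illustration. The only thing taken from the tree class is that each node has a `parent` and a `block`.

```ruby
# Mixin for any node that exposes #block and #parent, such as the Node class above.
module Ancestry
  # Return the chain of nodes from this node's parent up to the root.
  def ancestors
    chain = []
    node = parent
    while node
      chain << node
      node = node.parent
    end
    chain
  end

  # Describe where this node lives as a path of block types, root first.
  def path
    (ancestors.reverse + [self]).map { |n| n.block.block_type }.join(' > ')
  end
end

# Hypothetical stand-ins so the sketch runs without a Textract response.
Block = Struct.new(:block_type, :text)

class SimpleNode
  include Ancestry
  attr_reader :block, :parent, :children

  def initialize(block, parent: nil)
    @block = block
    @parent = parent
    @children = []
    parent.children << self if parent
  end
end

page = SimpleNode.new(Block.new('PAGE', nil))
line = SimpleNode.new(Block.new('LINE', 'Total due'), parent: page)
word = SimpleNode.new(Block.new('WORD', 'due'), parent: line)

puts word.path # prints "PAGE > LINE > WORD"
```

Printing the path for a handful of matching elements is a quick way to check whether the value you want always sits at the same place in the hierarchy.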
I added some code to print the whole tree to console so you can easily visualize it:
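    # These methods live inside the Node class.

    # Short one-line summary: block type, truncated text, and block id.
    def to_s
      txt = if block.text.nil?
              ''
            elsif block.text.length > 10
              "#{block.text[0..7]}..."
            else
              block.text
            end
      "<#{block.block_type} #{txt} #{block.id}>"
    end

    # Print this node and its descendants, indenting two spaces per level.
    def print_tree(indent = 0)
      indent_txt = indent > 0 ? ' ' * (indent * 2) : ''
      puts "#{indent_txt}#{to_s}"
      children.each { |chld| chld.print_tree(indent + 1) }
    end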
This is in lieu of just making a nicer inspect method and using something like awesome_print or the built-in pp method. While those are great, we don’t really need the Ruby object ids and other properties for this visualization; they just clutter up the terminal. We could override inspect to show only the info we want, but I feel like that’s a POLA (principle of least astonishment) violation, so it’s better to just write this functionality where it belongs.
If you’d like to run a Textract analysis and play with the results, I’ve got the sample code up on GitHub. It’s not well-tested or ready for deployment, but it can be a starting point if you want to do a quick integration of Textract into your own Ruby project. Hope this helps someone!