This is Part 1 in a multi-part series detailing the creation of a “simple” project combining Ruby, MongoDB, RSpec, Sinatra, and Capybara in preparation for a larger-scale side project set to begin January 2013. For more in this series, see the Pokephile category. Part 1 covers getting started with MongoDB and creating a collection from data scraped off the web with Nokogiri. The code for this side project is located on GitHub.
A little background
NoSQL is a family of database systems used when working with large amounts of data that don’t fit a relational model (read: wikipedia). It allows for mass storage without the overhead of SQL relations. There are many types of schemaless databases (here’s a list), but in particular I’ve been looking into what’s called a “document store.”
Documents can be any number of key-value fields with a unique id. Document Store services usually encode data in a simple format such as XML, YAML, JSON, or BSON for storage purposes. MongoDB is a document store service which uses BSON to store documents. In Mongo, we connect to a specific database and then we can look through “collections,” which are more-or-less equivalent to “tables” in relational databases.
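To make the idea of a document concrete, here is a rough sketch in plain Ruby. It does not touch MongoDB at all: a plain array stands in for a collection, JSON (from Ruby’s standard library) stands in for BSON, and the field names other than ‘_id’ are just illustrative. The helper ‘find_by_id’ is a made-up name for this example.

```ruby
require 'json'

# Two differently shaped documents. Field names and values are illustrative.
BULBASAUR = { '_id' => 1, 'name' => 'Bulbasaur', 'types' => %w[Grass Poison] }
NOTE      = { '_id' => 2, 'text' => 'no fixed schema required' }

# A plain array stands in for a collection in this sketch.
COLLECTION = [BULBASAUR, NOTE].freeze

# Look a document up by its unique id; returns nil when nothing matches.
def find_by_id(collection, id)
  collection.find { |doc| doc['_id'] == id }
end

# Each document is only key-value pairs, so it serializes directly.
COLLECTION.each { |doc| puts JSON.generate(doc) }
puts find_by_id(COLLECTION, 1)['name']
```

Note that the two documents share nothing but the id field, which is exactly what a relational table would not allow.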
What about MongoDB and the Ruby driver?
The first step is to get MongoDB working on your machine. Install MongoDB for your system – on Ubuntu 12.10 I do this:
Then we start up the daemon:
What’s the concept?
The concept here is that we are going to have a database populated with Pokemon. The user types a Pokemon’s name into a search field and submits the form, which brings up an image of the Pokemon and some useful information.
Since I would like to focus on MongoDB, we can start by populating our database with Pokemon. If you’re not familiar with Pokemon, there are lots of them (~650 at the date of this blog post). For my purposes, I may want to only add the first ~150 Pokemon, or I may want to add every Pokemon imaginable. I want it to be easy to add more if any new ones are added. So I’m going to start this project by creating a Populater, and we’re going to use TDD to help us create it.
If you don’t have RSpec installed, it’s as easy as opening up a shell and running ‘gem install rspec’.
I’m going to put the Populater in a tools directory, and I’m going to put my spec files in a test/spec directory. The directory structure I want to use is as follows:
In the ‘tools/test/spec’ directory, I create ‘populater_spec.rb’. We’ll write our first test:
The syntax for RSpec is mostly pseudo-English, so it’s fairly straightforward to follow. The first ‘describe’ block says that we are describing the Populater class. The second ‘describe’ block says that we are describing the ‘new’ method of the ‘Populater’ class. The innermost block is our test. We want to make sure that no exception is thrown when we create a new Populater. To run this test, open a terminal and type:
We get a big fat NameError, obviously due to the fact that there’s no such thing as a ‘Populater’ class yet. So create the file ‘populater.rb’ in ‘project/tools/populate’ and create the class:
And include the ‘Populater’ class in our spec file:
Now run rspec. Hooray, we’re passing all our tests! Let’s add another test and have RSpec do a little work before each test.
The ‘before :each’ syntax tells RSpec to run the given block before each test. This way, we don’t have to type out ‘Populater.new’ in each test. When we run RSpec, this test passes. Now let’s actually do something meaningful in our new call. We want the Populater to empty all Pokemon from our database as it begins. In order to do this, we also need to tell the Populater which database to use, so we’ll refactor slightly to pass the name of our database to the Populater.
Similar to the ‘before :each’ syntax, the ‘before :all’ syntax runs its block only once, before any of the tests in the group. Here we want to get a handle to the ‘pokemons’ collection from our ‘test’ database. In our test, we run a ‘find’ with no arguments on the ‘pokemons’ collection to query everything in that collection. We also have an ‘insert’ statement where we insert an arbitrary document into our collection. You’ll note later that this garbage document looks nothing like the Pokemon documents we insert, which is just another reason to love document-store databases. We run RSpec and we fail the test. Let’s open up ‘populater.rb’ and fix this.
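A minimal sketch of that fix, under a couple of loud assumptions: ‘FakeCollection’ is a hypothetical in-memory stand-in so the example runs without a MongoDB server, and this ‘Populater’ takes a collection handle directly rather than a database name. With the real driver you would hand it the ‘pokemons’ collection instead.

```ruby
# Hypothetical in-memory stand-in for a MongoDB collection, so the sketch
# runs without a server. Only the calls mentioned in this post are mimicked.
class FakeCollection
  def initialize
    @docs = []
  end

  # Store one document (a Hash).
  def insert(doc)
    @docs << doc
  end

  # Delete every document.
  def remove
    @docs.clear
  end

  # With no arguments, return every document.
  def find
    @docs.dup
  end
end

class Populater
  # Takes a collection handle (an assumption; the post passes a database
  # name) and empties it so every run starts clean.
  def initialize(collection)
    @collection = collection
    @collection.remove
  end
end

pokemons = FakeCollection.new
pokemons.insert({ 'junk' => 'data' }) # arbitrary pre-existing document
Populater.new(pokemons)
puts pokemons.find.count              # prints 0
```

Passing the collection in keeps the class easy to test, since the spec can hand it anything that responds to ‘remove’.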
Test fixed. We connect to the same database, access the same collection, and remove all the old data on initialize. So now we actually want to add Pokemon to the collection. We’ll pick up a new ‘describe’ block for an ‘add_pokemon’ method. We’ll then test that calling it with 0 adds no Pokemon to the collection.
When we run our tests, we get a NoMethodError and fail. We create a trivial fix in ‘populater.rb’:
And we pass the test, having added 0 Pokemon to our database. Let’s do it with 1 now.
We fail. Another trivial fix:
We pass again. We’ll also pass when checking for multiple Pokemon:
But we’re missing substance. There’s only garbage being shoved in our database. Our TDD methodology breaks down slightly here because we want our database to have dynamic information scraped from a website, and I don’t want to hard code any data nor do I want to scrape the same website in my tests and my implementation. So we’re going to do a little bit of behind-the-scenes stuff and test that the fields we want are simply not nil. I want each Pokemon to have a number, name, an array of types, and a link to an image:
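As a rough illustration of that “not nil” check, here is a small plain-Ruby helper. The field names (‘number’, ‘name’, ‘types’, ‘image’) and the helper ‘missing_fields’ are my assumptions for this sketch, and the image value is a placeholder rather than a real link.

```ruby
# Field names are assumptions based on the list in the text.
REQUIRED_FIELDS = %w[number name types image].freeze

# Return the required fields that are missing or nil in a document.
def missing_fields(doc)
  REQUIRED_FIELDS.select { |field| doc[field].nil? }
end

complete = { 'number' => 1, 'name' => 'Bulbasaur',
             'types' => %w[Grass Poison], 'image' => 'bulbasaur.png' }
garbage  = { 'junk' => 'data' }

p missing_fields(complete) # => []
p missing_fields(garbage)  # => ["number", "name", "types", "image"]
```

A spec can then assert that the result is empty for every document in the collection, without hard-coding any scraped values.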
There are many websites where you can get this kind of data for each Pokemon, but I chose the Pokemon Wiki for its consistency. In the initializer of the Populater, I open up the URL using Nokogiri so I can access the sweet, creamy data contained within. In my ‘add_pokemon’ method, I extract the data I want based on the way the table is set up on the website. To continue, we need to install the Nokogiri gem (‘gem install nokogiri’):
And now we add the logic to add_pokemon:
I’ll admit the ‘add_pokemon’ method is now quite a bit more daunting to interpret. Here’s the breakdown of what’s going on: Nokogiri finds us the table tag with a class of ‘wikitable sortable’ and we iterate over its rows. There are two breaking conditions for our loop: we hit the max number of Pokemon as given, or we can’t find any more Pokemon in the table. So we check that we haven’t hit our max. Then we find the Pokemon’s number in the table after we manually parse the HTML. In the case of this table, the first row is all garbage, so we continue to the next row if we are on the first row. We then grab the name from the table, which is luckily always in the same place. The branch is for the special case of Pokemon #000 (Missingno), which is set up slightly differently in the table for some reason. We create an empty array and shove our types in it, but we have to be careful because not all Pokemon have two types. We then create a document in the braces and insert it into the collection. The final step is to decrement the loop counter.
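The control flow is easier to see with the HTML taken out of the picture. The sketch below is not the real scraper: ‘SAMPLE_ROWS’ is hypothetical data standing in for table rows already reduced to cell text, the column order is an assumption, the image values are placeholders, and a plain array stands in for the collection. The Missingno special case is left out.

```ruby
# Rows stand in for table rows already reduced to cell text. The column
# layout (number, name, type 1, type 2, image) is hypothetical.
SAMPLE_ROWS = [
  ['#', 'Name', 'Type', 'Type', 'Image'], # heading row, skipped
  ['001', 'Bulbasaur', 'Grass', 'Poison', 'bulbasaur.png'],
  ['004', 'Charmander', 'Fire', nil, 'charmander.png'],
  ['007', 'Squirtle', 'Water', nil, 'squirtle.png']
].freeze

# Build up to `max` documents and append them to `collection`
# (anything that responds to <<). Returns how many were added.
def add_pokemon(rows, collection, max)
  remaining = max
  rows.each_with_index do |cells, index|
    break if remaining <= 0 # reached the requested maximum
    next if index.zero?     # first row holds headings, not a Pokemon

    number, name, type1, type2, image = cells
    types = [type1, type2].compact # drop a missing second type
    collection << { 'number' => number, 'name' => name,
                    'types' => types, 'image' => image }
    remaining -= 1
  end
  max - remaining
end

docs = []
puts add_pokemon(SAMPLE_ROWS, docs, 2) # prints 2
p docs.last['types']                   # => ["Fire"]
```

The loop also ends on its own when the rows run out, which covers the second breaking condition.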
Tests pass. We now have a working Populater! Now we can either write a script or open up irb and populate as necessary, knowing that the Populater is functional:
If you want to further familiarize yourself with the MongoDB Ruby driver, you should check out the MongoDB Koans. Unfortunately, the original MongoDB Koans have not been updated in a while, so they didn’t work with my more recent installations of Ruby and the MongoDB driver. I found a set of updated koans which worked with my install of Ruby 1.9.3. However, the updated version also had a couple of annoying issues with deprecations, so I created my own fork on GitHub with the fixes.