We recently covered finding journeys within a shape using PostGIS, and this time I’m going to talk about how we generated data to test and fine-tune our matching algorithm.

Although the SQL query we covered last time is fairly straightforward and easy to cover with a unit test, we also wanted some way to play with accurate data, adjusting various parameters until we got good results, and we found that this kind of tuning needed testing by hand.

So we needed some way to get a set of testing data. Since the product was a new one, we had no data from users, so we had to generate data ourselves. Simply randomly generating coordinates wasn’t going to work (given that Great Britain is an island, you’re fairly likely to generate a coordinate in the sea somewhere), and we wanted precise coordinates (individual houses/buildings), so using a list of coordinates of major cities/towns was out as well.

In the end, the solution we came up with was to use Google’s geocoding service (which we were already using in the app) to search for something we could be sure would return a result given a general locality, such as a town or city.

Here’s the general process we used:

  1. To start with, we needed a list of towns and cities. We ended up getting our list from Wikipedia. A quick screen-scrape later, we had an array of city names.
  2. Next, we needed something to search Google with. We did a few rounds of testing with our list of cities, and got the best results with “station”, “restaurant”, “post office”, and – this being the UK – “Tesco”.
  3. We then wrote a simple service object that would randomly generate a list of journeys (using FactoryGirl) and associate them with a geocoded origin and destination, sourced from our list of city names.

The service object

Here’s a cut-down version of the service we wrote:

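class GeoSeeder
  CITIES = %w[London Edinburgh Cardiff ...]
  PLACES = %w[station restaurant post\ office tesco]

  def initialize(options = {})
    @num_of_users    = options.fetch(:users).to_i
    @num_of_journeys = options.fetch(:journeys).to_i
  end

  def run
    generate_users!
    generate_journeys!
  end

  def generate_users!
    @users ||= FactoryGirl.create_list(:user, @num_of_users)
  end

  def generate_journeys!
    cities_collection.each do |(origin, destination)|
      origin_search      = "#{random_place} near #{origin}"
      destination_search = "#{random_place} near #{destination}"

      FactoryGirl.create(:journey, user: random_user, origin: origin_search, destination: destination_search)

      sleep 1 # pause between geocoding requests (interval is an assumption)
    end
  end

  # A random search term, e.g. "station" or "post office"
  def random_place
    PLACES.sample
  end

  # A random subset of all ordered (origin, destination) city pairs
  def cities_collection
    CITIES.permutation(2).to_a.sample(@num_of_journeys)
  end

  def random_user
    @users.sample
  end
end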

To ensure that we don’t try to generate a journey with the same origin and destination, we use Array#permutation to generate a list of all the possible pairs of cities, from which we then grab a subset to use to create journeys (we have to call #to_a first, since #permutation returns an Enumerator).
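As a small illustration of the pairing step, here is a sketch using a made-up three-city list (the `cities`, `pairs` and `subset` names are only for this example, not taken from the real seeder):

```ruby
# Hypothetical city list, for illustration only.
cities = %w[London Edinburgh Cardiff]

# permutation(2) yields every ordered pair of two different elements,
# so no pair can have the same origin and destination.
pairs = cities.permutation(2).to_a

pairs.length # 3 * 2 = 6 ordered pairs
pairs.first  # ["London", "Edinburgh"]

# An Enumerator doesn't respond to #sample, hence the #to_a above.
subset = pairs.sample(2)
```

Note that ["London", "Edinburgh"] and ["Edinburgh", "London"] both appear, which is what we want here since they are different journeys. Also, #sample never returns more elements than the array holds, so asking for more journeys than there are pairs simply returns every pair.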

We also make liberal use of Array#sample, which returns a single random element when called without an argument, or an array of n random elements (each picked at most once) when called with n, so you can think of it as being roughly equivalent to [...].shuffle.take(n).
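A quick sketch of the two forms of #sample, using the search terms from above (the variable names are just for illustration):

```ruby
places = ["station", "restaurant", "post office", "tesco"]

one  = places.sample    # a single random element
some = places.sample(2) # an array of 2 elements, each picked at most once

# Asking for more elements than exist returns the whole array, shuffled.
all = places.sample(10)

# Edge cases on an empty array
empty_one  = [].sample    # nil
empty_some = [].sample(2) # []
```

The nil return on an empty array is worth remembering: if no users have been generated yet, a helper built on #sample will quietly hand back nil rather than raise.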

Geocoding is handled within the model by the geocoder gem, so we just need to construct the search string that’s passed on to Google’s geocoding API. Since the service is only a one-off task run in development, we also include a sleep() call between requests to avoid triggering the API’s rate limiting.
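The throttling can be as simple as pausing between requests. A minimal sketch, where `each_with_pause` is a hypothetical helper and the intervals are arbitrary values chosen for illustration (neither comes from the original service, so check the current limits of whichever geocoding API you use):

```ruby
# Hypothetical helper: yields each item, pausing between iterations
# so that consecutive API calls are spaced out.
def each_with_pause(items, seconds: 1)
  items.each_with_index do |item, index|
    sleep(seconds) if index.positive? # no need to wait before the first call
    yield item
  end
end

searches = ["station near London", "tesco near Cardiff"]

each_with_pause(searches, seconds: 0.1) do |search|
  puts "Geocoding: #{search}" # stand-in for the model save that triggers geocoding
end
```

A fixed sleep is crude, but for a one-off development task it is usually enough and keeps the seeder free of retry logic.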

We then created a rake task that calls the service object:

# lib/tasks/geoseed.rake
desc "Seed users and geocoded journeys for manual testing"
task :geoseed => :environment do
  GeoSeeder.new(users: ENV["USERS_COUNT"], journeys: ENV["JOURNEYS_COUNT"]).run
end

and call it like so:

$ USERS_COUNT=20 JOURNEYS_COUNT=50 bin/rake geoseed