We recently covered finding journeys within a shape using PostGIS, and this time I’m going to talk about how we generated data to test and fine-tune our matching algorithm.

Although the SQL query we covered last time is fairly straightforward and easy to cover with a unit test, we also wanted a way to play with realistic data – adjusting various parameters until we got good results – and we found that this kind of tuning needed to be done by hand.

So we needed some way to get a set of test data. Since the product was a new one, we had no data from users, so we had to generate it ourselves. Generating coordinates purely at random wasn’t going to work (given that Great Britain is an island, you’re fairly likely to generate a coordinate in the sea somewhere), and we wanted precise coordinates (individual houses/buildings), so using a list of coordinates of major cities/towns was out as well.

In the end, the solution we came up with was to use Google’s geocoding service (which we were already using in the app) to search for something we could be sure would return a result given a general locality, such as a town or city.

Here’s the general process we used:

  1. To start with, we needed a list of towns and cities. We ended up getting our list from Wikipedia. A quick screen-scrape later, we had an array of city names.
  2. Next, we needed something to search Google with. We did a few rounds of testing with our list of cities, and got the best results with “station”, “restaurant”, “post office”, and – this being the UK – “Tesco”.
  3. We then wrote a simple service object that would randomly generate a list of journeys (using FactoryGirl, since renamed FactoryBot) and associate each one with a geocoded origin and destination, sourced from our list of city names.

The service object

Here’s a cut-down version of the service we wrote:

class GeoSeeder
  # Truncated here for brevity; the full list came from Wikipedia
  CITIES = %w[London Edinburgh Cardiff ...]
  PLACES = %w[station restaurant post\ office tesco]

  def initialize(options = {})
    @num_of_users    = options.fetch(:users).to_i
    @num_of_journeys = options.fetch(:journeys).to_i
  end

  def run
    generate_users!
    generate_journeys!
  end

  private

  def generate_users!
    @users ||= FactoryGirl.create_list(:user, @num_of_users)
  end

  def generate_journeys!
    cities_collection.each do |(origin, destination)|
      # Build free-text searches, e.g. "station near London"
      origin_search      = "#{random_place} near #{origin}"
      destination_search = "#{random_place} near #{destination}"

      FactoryGirl.create(
        :journey,
        user:        random_user,
        origin:      origin_search,
        destination: destination_search
      )

      # Pause between journeys to stay under the geocoding API's rate limit
      sleep(5)
    end
  end

  def random_place
    PLACES.sample
  end

  def cities_collection
    CITIES.permutation(2).to_a.sample(@num_of_journeys)
  end

  def random_user
    @users.sample
  end
end

To ensure that we never try to generate a journey with the same origin and destination, we use Array#permutation to build every possible ordered pair of distinct cities, from which we then grab a random subset to use to create journeys (we have to call #to_a first, since #permutation returns an Enumerator when called without a block).
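To make the pairing concrete, here is a minimal sketch using a three-city list. The city_pairs helper is a hypothetical name used only for illustration; the seeder itself inlines the same call in cities_collection.

```ruby
# Hypothetical helper: every ordered pair of distinct cities.
# Array#permutation returns an Enumerator when called without a block,
# so #to_a is needed to get an Array we can sample from.
def city_pairs(cities)
  cities.permutation(2).to_a
end

pairs = city_pairs(%w[London Edinburgh Cardiff])
# => [["London", "Edinburgh"], ["London", "Cardiff"],
#     ["Edinburgh", "London"], ["Edinburgh", "Cardiff"],
#     ["Cardiff", "London"], ["Cardiff", "Edinburgh"]]

# Grab a random subset of pairs, one per journey we want to create.
journeys = pairs.sample(4)
journeys.each { |origin, destination| puts "#{origin} -> #{destination}" }
```

Note that the pairs are ordered, so London to Edinburgh and Edinburgh to London count as separate journeys. A list of n cities produces n * (n - 1) pairs, which is easily enough for a seeding task but does mean the whole list is built in memory before sampling.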

We also make liberal use of Array#sample, which returns either a single random element or, when given a count n, an array of n random elements (so you can think of it as being roughly equivalent to [...].shuffle.take(n)).
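The different forms of Array#sample are easiest to see side by side. This is only an illustrative sketch: shuffle_take is a hypothetical helper that spells out the "shuffle then take" mental model, and the comments show example values rather than guaranteed output, since the results are random.

```ruby
# Hypothetical helper spelling out the "shuffle, then take n" mental model.
def shuffle_take(array, n)
  array.shuffle.take(n)
end

places = ["station", "restaurant", "post office", "tesco"]

one  = places.sample      # a single element, e.g. "tesco"
some = places.sample(2)   # an array of 2 elements from different positions
all  = places.sample(10)  # never more elements than the array holds
none = [].sample          # nil when the array is empty

puts one
puts some.inspect
puts all.size             # 4, because the array only has 4 elements
puts none.inspect         # nil
puts shuffle_take(places, 2).inspect
```

The capping behaviour matters for the seeder: asking for more journeys than there are city pairs quietly returns every pair rather than raising an error.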

Geocoding is handled within the model by the geocoder gem, so we just need to construct the search string that’s passed on to Google’s geocoding API. Since the service is only a one-off task, run in development, we also include a sleep() call to avoid triggering the API’s rate-limiting.
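If you want the pause to be testable, one option is to pull it into a small helper. The following is a sketch under assumptions rather than what our seeder does: each_with_pause and its sleeper parameter are hypothetical names, and unlike the plain sleep(5) above it skips the pause before the first item so that no time is wasted at either end of the run.

```ruby
# Hypothetical helper: yields each item, pausing between items.
# `sleeper` is injectable so tests can record pauses instead of waiting.
def each_with_pause(items, seconds, sleeper: Kernel.method(:sleep))
  items.each_with_index do |item, index|
    sleeper.call(seconds) unless index.zero?
    yield item
  end
end

searches = ["station near London", "tesco near Cardiff"]

# Demo with a fake sleeper so the example runs instantly.
fake_sleep = ->(seconds) { puts "(would sleep #{seconds}s)" }
each_with_pause(searches, 5, sleeper: fake_sleep) do |search|
  puts "geocoding: #{search}"
end
```

Leaving out the sleeper argument falls back to the real Kernel#sleep, which is what you would want when actually hitting the API.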

We then created a rake task that calls the service object:

# lib/tasks/geoseed.rake
task :geoseed => :environment do
  GeoSeeder.new(users: ENV["USERS_COUNT"], journeys: ENV["JOURNEYS_COUNT"]).run
end

and call it like so:

$ USERS_COUNT=20 JOURNEYS_COUNT=50 bin/rake geoseed
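One thing to watch out for: if either variable is left unset, the task above passes nil through to the service, nil.to_i evaluates to 0, and nothing gets created. A possible way to guard against that is sketched below. The count_from_env helper and the default values are assumptions made for illustration and are not part of our original task.

```ruby
# Hypothetical helper: read a positive integer from ENV (or any Hash-like
# object), falling back to a default when the variable is unset.
def count_from_env(env, key, default)
  raw   = env.fetch(key, default.to_s)
  count = Integer(raw, 10) # raises ArgumentError for input such as "abc"
  raise ArgumentError, "#{key} must be positive, got #{count}" unless count.positive?
  count
end

# Possible usage inside the rake task (defaults are illustrative):
#   GeoSeeder.new(
#     users:    count_from_env(ENV, "USERS_COUNT", 10),
#     journeys: count_from_env(ENV, "JOURNEYS_COUNT", 25)
#   ).run

puts count_from_env({ "USERS_COUNT" => "20" }, "USERS_COUNT", 10) # 20
puts count_from_env({}, "USERS_COUNT", 10)                        # 10
```

Taking the environment as an argument, rather than reading ENV directly, keeps the helper easy to exercise with a plain Hash.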