Idempotent seed files

06 Jan 16

Making your seed files idempotent for an easier project setup

Getting an existing Rails project set up on your machine can be a right hassle, and is often fraught with out-of-date docs and code. Chief among these is the db/seeds.rb file, which receives a lot of attention at the outset but is often neglected as the project progresses. This is a particular problem when someone leaves the project for a while and returns to find that the database structure has moved on whilst they were away, and their dev data is now broken.

Obviously, keeping the seeds file up-to-date is a developer-discipline issue, but I’ve found it to be a great help to have an idempotent seed file, which can be run and re-run every time someone git pulls the project.

Idempotent?

Idempotent is one of those great computer-sciencey words to throw about, and probably misused a fair bit. Strictly speaking, it’s a mathematical term that means that something remains the same when some operation is applied to it, using itself as input (for example, 1 x 1 is still 1). In computer circles, it’s generally come to mean an operation that can be performed multiple times with the same outcome each time.
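As a quick illustration in plain Ruby (nothing Rails-specific, and the method names are made up for this example), compare an operation that accumulates on every run with one that settles after the first run:

```ruby
require 'set'

# Not idempotent: every call appends another copy of the name.
def add_to_list(list, name)
  list.push(name)
end

# Idempotent: adding a name the Set already holds leaves it unchanged.
def add_to_set(set, name)
  set.add(name)
end

city_list = []
city_set  = Set.new

3.times do
  add_to_list(city_list, 'Copenhagen')
  add_to_set(city_set, 'Copenhagen')
end

puts city_list.size # 3
puts city_set.size  # 1
```

A seed file that behaves like the Set version can be re-run as often as you like.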

Standard seeds files often don’t fit this description – consider the Rails default example:

cities = City.create([{ name: 'Chicago' }, { name: 'Copenhagen' }])
Mayor.create(name: 'Emanuel', city: cities.first)

If you ran that multiple times, you’d end up with either multiple City and Mayor records (there’s only one Copenhagen!), or a validation error that may or may not halt execution halfway through.

A fix

To make this seed file idempotent, we only need to make a minor change – instead of using create, we switch over to find_or_create_by, which (as the name suggests) will first look in the database for a record matching the attributes hash we give it and, if there isn’t one, create a new record with those same attributes, returning the record in either situation.

So, for our default seed file (further updated since find_or_create_by can’t take an array the way create can, but I’d argue this revision is a little clearer):

chicago    = City.find_or_create_by(name: 'Chicago')
copenhagen = City.find_or_create_by(name: 'Copenhagen')

Mayor.find_or_create_by(name: 'Emanuel', city: chicago)

Generally, most records will have some column(s) that uniquely identify them, with the content of the other columns not being so important. For those cases, we can pass a block to find_or_create_by, which will yield the record for adding attributes if no record is found:

City.find_or_create_by(name: 'Chicago') do |city|
  # This only runs when creating a new record (before it's saved)
  # population is an example column that's likely to change, and doesn't uniquely identify the record
  city.population = 2_719_000
end

For passing a simple hash of attributes, we can also use create_with as an alternative:

City.create_with(population: 2_719_000).find_or_create_by(name: 'Chicago')
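To see why re-running is safe, here is a rough plain-Ruby sketch of the behaviour described above. FakeTable is a hypothetical in-memory stand-in invented for this example; it is not how ActiveRecord is implemented, it only mimics the find-first, create-otherwise logic and the block that runs on creation only:

```ruby
# Hypothetical in-memory stand-in for a model; rows are plain Hashes.
class FakeTable
  attr_reader :rows

  def initialize
    @rows = []
  end

  # Returns the first row matching every key/value in attrs.
  # If there is none, builds a row from attrs, yields it to the
  # optional block for extra attributes, stores it and returns it.
  def find_or_create_by(attrs)
    found = @rows.find { |row| attrs.all? { |key, value| row[key] == value } }
    return found if found

    row = attrs.dup
    yield row if block_given?
    @rows << row
    row
  end
end

cities = FakeTable.new

# Run the same "seed file" three times over.
3.times do
  cities.find_or_create_by(name: 'Chicago') { |city| city[:population] = 2_719_000 }
  cities.find_or_create_by(name: 'Copenhagen')
end

puts cities.rows.size # 2
```

Because the block is skipped when a row is found, a population that someone has since edited locally would not be reset by re-running the seeds.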

Revising data

Of course, this strategy falls down if we need to change the seed data in some way – e.g. changing the spelling of the mayor’s name, or the city that he’s mayor of – since we won’t create the revised record as the original already exists. We could chain an update_attributes (or update, as newer Rails versions call it) or two onto the new/found records, or start the seed file by deleting out-of-date records, but that’d probably get messy fast.
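For what it’s worth, the find-then-update idea looks something like this in plain Ruby. seed_record and the Hash-backed store are hypothetical stand-ins made up for this sketch; in a real seed file the rough equivalent would probably be find_or_initialize_by followed by an update:

```ruby
# Hypothetical helper: looks the record up by its identifying key,
# creates an empty one if it is missing, then overwrites its
# attributes with whatever the seed file currently says.
def seed_record(store, key, attributes)
  record = (store[key] ||= {})
  record.merge!(attributes)
  record
end

mayors = {}
seed_record(mayors, 'Chicago', name: 'Emanual') # original seed, misspelt
seed_record(mayors, 'Chicago', name: 'Emanuel') # corrected seed, re-run
seed_record(mayors, 'Chicago', name: 'Emanuel') # another run changes nothing

puts mayors.size              # 1
puts mayors['Chicago'][:name] # Emanuel
```

This stays idempotent and picks up corrections, but it also overwrites any local edits on every run, and it does nothing for a record whose identifying key has changed, which is where the mess starts.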

In these cases, a custom Rake task is probably a better approach, especially since you’re most likely writing one to migrate your production data anyway. Rake tasks do have the downside that you have to pass on to your teammates a list of tasks to run in a certain order, but hopefully this list won’t get too long! If it does, the teammate in question may be better off dropping their copy of the database and starting afresh.

Conclusion

I hope this helps make your project setup a little easier – making your processes idempotent is generally a good thing to do (e.g. email-sending ActiveJobs, payment processing gateways), and I’ve certainly found that to be the case with seed files.