Duplicate Data
Forgotten to back validates_uniqueness_of with a unique constraint in your database? Oh no! Here is some SQL that will pull out all the duplicate records for you.
1 2 3 4 5 6 7 8 9 |
User.find_by_sql <<-EOS SELECT * FROM users WHERE name IN ( SELECT name FROM users GROUP BY name HAVING count(name) > 1); EOS |
You will need your own strategy for resolving the duplicates, since it is totally dependent on your data. Some ideas:
- Arbitrarily deleting one of the records. Perhaps based on latest update time? Don’t forget about child records! If you have forgotten a uniqueness constraint it is likely you have also forgotten a foreign key, so you will have to delete child records manually.
- Merge the records, including child records.
- Manually resolving the conflicts on a case by case basis. Possible if there are not too many duplicates.
STI is the global variable of data modelling
A Single Table Inheritance table is really easy to both update and query. This makes it ideal for rapid prototyping: just throw some extra columns on it and you are good to go! This is why STI is so popular, and it fits perfectly into the Rails philosophy of getting things up and running fast.
Fast coding techniques do not always transfer into solid, maintainable code however. It is really easy to hack something together with global variables, but we eschew them when writing industry code. STI falls into the same category. I have written about the downsides of STI before: it clutters your data model, weakens your data integrity, and can be difficult to index. STI is a fast technique to get started with, but is not necessarily a great option for maintainable applications, especially when there are other modelling techniques such as class table inheritance available.
Updating Class Table Inheritance Tables
My last post covered querying class table inheritance tables; this one presents a method for updating them. Having set up our ActiveRecord models using composition, we can use a standard rails method accepts_nested_attributes_for to allow easy one-form updating of the relationship.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 |
class Item < ActiveRecord::Base validates_numericality_of :quantity SUBCLASSES = [:dvd, :car] SUBCLASSES.each do |class_name| has_one class_name end accepts_nested_attributes_for *SUBCLASSES end @item = Dvd.create!( :title => 'The Matix', :item => Item.create!(:quantity => 1)) @item.update_attributes( :quantity => 2, :dvd_attributes => { :id => @item.dvd.id, :title => 'The Matrix'}) |
This issues the following SQL to the database:
1 2 |
UPDATE "items" SET "quantity" = 10 WHERE ("items"."id" = 12)
UPDATE "dvds" SET "title" = 'The Matrix' WHERE ("dvds"."id" = 12)
|
Note that depending on your application, you may need some extra locking to ensure this method is concurrent, for example if you allow items to change type. Be sure to read the accepts_nested_attributes_for documentation for the full API.
I talk about this sort of thing in my “Your Database Is Your Friend” training sessions. They are happening throughout the US and UK in the coming months. One is likely coming to a city near you. Head on over to www.dbisyourfriend.com for more information and free screencasts
Class Table Inheritance and Eager Loading
Consider a typical class table inheritance table structure with items as the base class and dvds and cars as two subclasses. In addition to what is strictly required, items also has an item_type parameter. This denormalization is usually a good idea, I will save the justification for another post so please take it for granted for now.
The easiest way to map this relationship with Rails and ActiveRecord is to use composition, rather than trying to hook into the class loading code. Something akin to:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 |
class Item < ActiveRecord::Base SUBCLASSES = [:dvd, :car] SUBCLASSES.each do |class_name| has_one class_name end def description send(item_type).description end end class Dvd < ActiveRecord::Base belongs_to :item validates_presence_of :title, :running_time validates_numericality_of :running_time def description title end end class Car < ActiveRecord::Base belongs_to :item validates_presence_of :make, :registration def description make end end |
A naive way to fetch all the items might look like this:
1 |
Item.all(:include => Item::SUBCLASSES) |
This will issue one initial query, then one for each subclass. (Since Rails 2.1, eager loading is done like this rather than joining.) This is inefficient, since at the point we preload the associations we already know which subclass tables we should be querying. There is no need to query all of them. A better way is to hook into the Rails eager loading ourselves to ensure that only the tables required are loaded:
1 2 3 |
Item.all(opts).tap do |items| preload_associations(items, items.map(&:item_type).uniq) end |
Wrapping that up in a class method on items is neat because we can then use it as a kicker at the end of named scopes or associations – person.items.preloaded, for instance.
Here are some tests demonstrating this:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 |
require 'test/test_helper' class PersonTest < ActiveRecord::TestCase setup do item = Item.create!(:item_type => 'dvd') dvd = Dvd.create!(:item => item, :title => 'Food Inc.') end test 'naive eager load' do items = [] assert_queries(3) { items = Item.all(:include => Item::SUBCLASSES) } assert_equal 1, items.size assert_queries(0) { items.map(&:description) } end test 'smart eager load' do items = [] assert_queries(2) { items = Item.preloaded } assert_equal 1, items.size assert_queries(0) { items.map(&:description) } end end # Monkey patch stolen from activerecord/test/cases/helper.rb ActiveRecord::Base.connection.class.class_eval do IGNORED_SQL = [/^PRAGMA/, /^SELECT currval/, /^SELECT CAST/, /^SELECT @@IDENTITY/, /^SELECT @@ROWCOUNT/, /^SAVEPOINT/, /^ROLLBACK TO SAVEPOINT/, /^RELEASE SAVEPOINT/, /SHOW FIELDS/] def execute_with_query_record(sql, name = nil, &block) $queries_executed ||= [] $queries_executed << sql unless IGNORED_SQL.any? { |r| sql =~ r } execute_without_query_record(sql, name, &block) end alias_method_chain :execute, :query_record end |
I talk about this sort of thing in my “Your Database Is Your Friend” training sessions. They are happening throughout the US and UK in the coming months. One is likely coming to a city near you. Head on over to www.dbisyourfriend.com for more information and free screencasts
Last minute training in Seattle
If you or someone you know missed out on Saturday, I’ve scheduled a last minute database training for Seattle tomorrow. Register here. Last chance before I head to Chicago for a training on Friday.
Constraints assist understanding
The hardest thing for a new developer on a project to wrap his head around is not the code. For the most part, ruby code stays the same across projects. My controllers look like your controllers, my models look like your models. What defines an application is not the code, but the domain. The business concepts, and how they are translated into code, can take weeks or months to understand cleanly. Modelling your domain in a way that it is easily understood is an important principle to speed up this learning process.
In an application I am looking at there is an email field in the user model. It is defined as a string that allows null values. This is confusing. I need to figure in what circumstances a null value makes sense (can they choose to withhold that piece of information? Is there a case where a new column I am adding should be null?), which is extra information I need to locate and process before I can understand the code. There is a validates_presence_of declaration on the attribute, but production data has some null values. Two parts of the application are telling me two contradicting stories about the domain.
Further, when I am tracking down a bug in the application, eliminating the possibility that a column could be null is an extra step I need to take. The data model is harder to reason about because there are more possible states than strictly necessary.
Allowing a null value in a column creates another piece of information that a developer has to process. It creates an extra question that needs to be answered when reading the code: in what circumstances is a null value appropriate? Multiply this problem out to multiple columns (and factor in other sub-optimal modeling techniques not covered here), and the time to understanding quickly grows out of hand.
Adding not-null constraints on your database is a quick and cheap way to bring your data model inline with the code that sits on top of it. In addition to cutting lines of code, cut out extraneous information from your data model. For little cost, constraints simplify your application conceptually and allow your data to be reasoned about more efficiently.
I talk about this sort of thing in my “Your Database Is Your Friend” training sessions. They are happening throughout the US and UK in the coming months. One is likely coming to a city near you. Head on over to www.dbisyourfriend.com for more information and free screencasts
Concurrency with AASM, Isolation Levels
I’ve posted two guest articles over on the Engine Yard blog this week on database related topics:
They’re in the same vein as what I’ve been posting here, so worth a read if you’ve been digging it.
The US tour kicks off this Saturday in San Francisco, and there’s still a couple of spots available. You can still register over at www.dbisyourfriend.com
“Your Database Is Your Friend” training sessions are happening throughout the US and UK in the coming months. One is likely coming to a city near you. For more information and free screencasts, head on over to www.dbisyourfriend.com
Relational Or NoSQL With Rails?
With all the excitement in the Rails world about “NoSQL” databases like MongoDB, CouchDB, Cassandra and Redis, I am often asked why am I running a course on relational databases?
The “database is your friend” ethos is not about relational databases; it’s about finding the sweet spot compromise between the tools you have available to you. Typically the database has been underused in Rails applications—to the detriment of both quality and velocity—and my goal is to provide tools and understanding to ameliorate this neglect, no matter whether you are using Oracle or Redis.
The differences between relational and NoSQL databases have been documented extensively. To quickly summarize the stereotypes: relational gives you solid transactions and joins, NoSQL is fast and scales. In addition, the document oriented NoSQL databases (NoSQL is a bit of a catch-all: there’s a big difference between key/value stores and document databases) enable you to store “rich” documents, a powerful modelling tool.
That’s a naive summary, but gives you a general idea of the ideologies. To make a fair comparison between the two you need to understand both camps. If you don’t know what a relational database can do for you in terms of transactional support or data integrity, you will not know what your are losing when choosing NoSQL. Conversely, if you are not familiar with document modelling techniques and why denormalization isn’t so scary, you are going to underrate NoSQL technologies and handicap yourself with a relational database.
For example, representing a many-to-many relationship in a relational database might look something like:
1 2 3 |
Posts(id, title, body) PostTags(post_id, tag_id) Tags(id, name) |
This is a standard normalization, and relational databases are tuned to deal with this scenario using joins and foreign keys. In a document database, the typical way to represent this is:
1 2 3 4 5 |
{
title: 'My Post',
body: 'This post has a body',
tags: ['ruby', 'rails']
}
|
Notice the denormalization of tags so that there is no longer a table for it, creating a very nice conceptual model—everything to do with a post is included in the one object. The developer only superficially familiar with document modelling will quickly find criticisms, however. To choose just one, how do you get a list of all tags? This specific problem has been addressed by the document crowd, but not in a way that relational developers are used to thinking: map/reduce.
1 2 3 4 5 6 7 8 9 10 |
db.runCommand({
mapreduce: 'posts',
map: function() {
for (index in this.tags) {
emit(this.tags[index], 1);
}
},
reduce: function(key, values) { return; },
out: "tags"
})
|
This function can be run periodically to create a tags collection from the posts collection. It’s not quite real-time, but will be close enough for most uses. (Of course if you do want real-time, there are other techniques you can use.) Yes, the query is more complicated than just selecting out of Tags, but inserting and updating an individual post (the main use case) is simpler.
I’m not arguing one side or another here. This is just one simplistic example to illustrate my point that if you don’t know how to use document database specific features such as map/reduce, or how to model your data in such a way as to take advantage of them, you won’t be able to adequately evaluate those databases. Similarly, if you don’t know how to use pessimistic locking or referential integrity in a relational database, you will not see how much time and effort it could be saving you over trying to implement such robustness in a NoSQL database that wasn’t designed for it.
It is imperative that no matter which technology you ultimately choose for your application (or even if you mix the two!), that you understand both sides thoroughly so that you can accurately weigh up the costs and benefits of each.
The pitch
This is why I’m excited to announce a brand new training session on MongoDB. For the upcoming US tour, this session will be only be offered once exclusively at the Lone Star Ruby Conference. The original relational training is the day before the conference (at the same venue), to create a two day database training bonanza: relational on Wednesday 25th August, MongoDB on Thursday 26th.
We’ll be adding MongoDB to an existing site—Spacebook, the social network for astronauts!—to not only learn MongoDB in isolation, but practically how to integrate it into your existing infrastructure. The day starts with the basics: What it is, what it isn’t, how to use it, how to integrate with Rails, and we’ll build and investigate some of the typical MongoDB use cases like analytics tracking. As we become comfortable, we will move into some more advanced querying and data modelling techniques that MongoDB excels at to ensure we are getting the most out of the technology, and discuss when such techniques are appropriate.
Since I am offering the MongoDB training in association with the Lone Star Ruby Conference, you will have to register for the conference to attend. At only an extra $175 above the conference ticket, half price of the normal cost, the Lone Star Ruby Conference MongoDB session is the cheapest this training will ever be offered, not to mention all the win of the rest of the conference! Aside from the training, it has a killer two-day line up of talks which are going to be awesome. I’m especially excited about the two keynotes by Tom Preson-Werner and Blake Mizerany, and there’s some good database related talks to get along to: Adam Keys is giving the low down on the new ActiveModel in rails 3, Jesse Wolgamott is comparing different NoSQL technologies, and Bernerd Schaefer will be talking about what Mongoid (the ORM we’ll be using with Spacebook) is doing to stay at the head of the pack. I’ll certainly be hanging around.
Register for the relational training separately. There’s a $50 early bird discount for the next week (in addition to the half price Mongo training), but if you miss that and are attending both sessions get in touch and I’ll extend the offer for you. This is probably going to send me broke, but I really just want to get this information out there. Cheaper, higher quality software makes our industry better for everyone.
“Your Database Is Your Friend” training sessions are happening throughout the US and UK in the coming months. One is likely coming to a city near you. For more information and free screencasts, head on over to www.dbisyourfriend.com
Five Tips For Adding Foreign Keys To Existing Apps
You’re convinced foreign keys are a good idea, but how should you retroactively add them to your production application? Here are some tips to help you out.
Identify and fix orphan records. If orphan records exist, creating a foreign key will fail. Use the following SQL to identify children that reference a parent that doesn’t exist:
1 |
SELECT * FROM children LEFT JOIN parents ON parent_id = parents.id WHERE parents.id IS NULL |
Begin with new or unimportant relationships. With any new change, it’s best to walk before you run. Targeting the most important relationships in your application head on can quickly turn into a black hole. Adding foreign keys to new or low value relationships first means you have a smaller code base that is affected, and allows you to test your test suite and plugins for compatibility over a smaller area. Get this running in production early, so any issues will crop up early on low value code where they’ll be easier to fix. Be agile in your approach and iterate.
Move away from fixtures and mocking in your tests. Rails fixture code is not designed to work well with foreign keys. (Fixtures are generally not a good idea regardless.) Also, the intense stubbing of models that was in vogue back when rspec first came on the scene doesn’t play nice either. The current best practice is to use object factories (such as Machinist) to create your test data, and this works well with foreign keys.
Use restrict rather than cascade for ON DELETE. You still want to keep on_destroy logic in your models, so even if conceptually a cascading delete makes sense, implement it using the :dependent => :destroy option to has_many, with a restrict option at the database level to ensure all cascading deletes run through your callbacks.
Be pragmatic. Ideally every relationship will have a foreign key, but for that model filled with weird hacks and supported by a massive old school test suite, it may be just too much effort to get everything working smoothly with database constraints. In this case, set up a test suite that runs over your production data regularly to quickly identify any data problems that arise (see the SQL above).
Foreign keys give you confidence and piece of mind about your data and your application. Rails may be afraid of them, but that doesn’t mean you have to be.
July through September I am running full day training sessions in the US and UK on how to make use of your database and write solid Rails code, increasing your quality without compromising your velocity. Chances are I’m coming to your city, so check it out at http://www.dbisyourfriend.com
acts_as_state_machine is not concurrent
Here is a short 4 minute screencast in which I show you how the acts as state machine (AASM) gem fails in a concurrent environment, and also how to fix it.
(If embedding doesn’t work or the text is too small to read, you can grab a high resolution version direct from Vimeo)
It’s a pretty safe bet that you want to obtain a lock before all state transitions, so you can use a bit of method aliasing to do just that. This gives you much neater code than the quick fix I show in the screencast, just make sure you understand what it is doing!
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
class ActiveRecord::Base def self.obtain_lock_before_transitions AASM::StateMachine[self].events.keys.each do |t| define_method("#{t}_with_lock!") do transaction do lock! send("#{t}_without_lock!") end end alias_method_chain "#{t}!", :lock end end end class Tractor # ... aasm_event :buy do transitions :to => :bought, :from => [:for_sale] end obtain_lock_before_transitions end |
This is a small taste of my DB is your friend training course, that helps you build solid rails applications by finding the sweet spot between stored procedures and treating your database as a hash. July through September I am running full day sessions in the US and UK. Chances are I’m coming to your city. Check it out at http://www.dbisyourfriend.com
Three Reasons Why You Shouldn't Use Single Table Inheritance
It creates a cluttered data model. Why don’t we just have one table called objects and store everything as STI? STI tables have a tendency to grow and expand as an application develops, and become intimidating and unweildy as it isn’t clear which columns belong to which models.
It forces you to use nullable columns. A comic book must have an illustrator, but regular books don’t have an illustrator. Subclassing Book with Comic using STI forces you to allow illustrator to be null at the database level (for books that aren’t comics), and pushes your data integrity up into the application layer, which is not ideal.
It prevents you from efficiently indexing your data. Every index has to reference the type column, and you end up with indexes that are only relevant for a certain type.
The only time STI is the right answer is when you have models with exactly the same data, but different behaviour. You don’t compromise your data model, and everything stays neat and tidy. I have yet to see a case in the wild where this rule holds, though.
If you are using STI (or inheritance in general) to share code, you’re doing it wrong. Having many tables does not conflict with the Don’t-Repeat-Yourself principle. Ruby has modules, use them. (I once had a project where a 20 line hash drove the creation of migrations, models, data loaders and test blueprints.)
What you should be doing is using Class Table Inheritance. Rails doesn’t “support it natively”, but that doesn’t particularly mean much since it’s a simple pattern to implement yourself, especially if you take advantage of named scopes and delegators. Your data model will be much easier to work with, easier to understand, and more performant.
I expand on this topic and guide you through a sample implementation in my DB is your friend training course. July through September I am running full day sessions in the US and UK. Chances are I’m coming to your city. Check it out at http://www.dbisyourfriend.com
Debugging Deadlocks In Rails
Here is an 13 minute long screencast in which I show you how to go about tracking down a deadlock in a ruby on rails application. I make two assumptions:
- You are using MySQL
- You know the difference between shared and exclusive locks (in short: a shared lock allows other transactions to read the row, an exclusive blocks out everyone)
(If embedding doesn’t work or the text is too small to read, you can grab a high resolution version direct from Vimeo)
This is only one specific example of a deadlock, in reality there are many ways this can occur. The process for tracking them down is always the same though. If you get stuck, read through the innodb documentation again. Something normally jumps out. If you are not sure what ruby code is generating what SQL, the query trace plugin is excellent. It gives you a stack trace for every single SQL statement ActiveRecord generates.
This is a small taste of the type of thing I cover in my DB is your friend training course. July through September I am running full day sessions in the US and UK. Chances are I’m coming to your city. Check it out at http://www.dbisyourfriend.com
acts_as_list will break in production
acts_as_list doesn’t work in a typical production deployment. It pretends to for a while, but every application will eventually have issues with it that result in real problems for your users. Here is a short 4 minute long screencast showing you how it breaks, and also a quick fix which will prevent your data from becoming corrupted.
(View it over at Vimeo if embedding doesn’t work for you)
Here is the “quick fix” I apply in the screencast. It’s ugly, but it will work.
1 2 3 4 5 6 7 8 |
def move_down Tractor.transaction do Tractor.connection.execute("SET SESSION TRANSACTION ISOLATION LEVEL SERIALIZABLE") @tractor = Tractor.find(params[:id]) @tractor.move_to_bottom end redirect_to(tractors_path) end |
Some things to note when fixing your application in a nicer way:
- This is not MySQL specific, all databases will exhibit this behaviour.
- The isolation level needs to be set as the first statement in the transaction (or globally, but you don’t want serializable globally!)
- For bonus points, add a unique index to the position column, though you’ll have to re-implement most of
acts_as_listto make it work. - It’s possible to do this under read committed, but it’s pretty complicated and optimised for concurrent access rather than individual performance.
- Obtaining a row lock before moving will fix this specific issue, but won’t address all the edge cases.
_This is a small taste of the type of thing I cover in my DB is your friend training course. July through September I am running full day sessions in the US and UK. Chances are I’m coming to your city. Check it out at http://www.dbisyourfriend.com _
Rails DB Training - US/UK Tour
You may not know, but I run an advanced rails training session titled “Your Database Is Your Friend”. Previously, I have only done this in Australia. Starting late July, I will be running this session throughout the United States and the United Kingdom. I’m still planning dates and venues, if you or someone you know is interested in hosting a session, please get in touch.
For details, see the DB is your friend rails training website.
Ruby debugging with puts, tap and Hirb
I use puts heaps when debugging. Combined with tap, it’s pretty handy. You can jump right in the middle of a method chain without having to move things around into variables.
1 |
x = long.chain.of.methods.tap {|x| puts x }.to.do.something.with
|
I thought hey why don’t I merge the two? And for bonus points, add in Hirb’s table display to format my models nicely. These are fairly personal customizations, and aren’t specific to a project, so I put them in my own ~/.railsrc file rather than each project.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 |
# config/initializers/developer_specific_customizations.rb if %w(development test).include?(Rails.env) railsrc = "#{ENV['HOME']}/.railsrc" load(railsrc) if File.exist?(railsrc) end # ~/.railsrc require 'hirb' Hirb.enable :pager => false class Object def tapp(prefix = nil, &block) block ||= lambda {|x| x } tap do |x| value = block[x] value = Hirb::View.formatter.format_output(value) || value.inspect if prefix print prefix if value.lines.count > 1 print ":\n" else print ": " end end puts value end end end # Usage (in your spec files, perhaps?) "hello".tapp # => hello "hello".tapp('a') # => a - "hello "hello".tapp(&:length) # => 5 MyModel.first.tapp # => # +----+-------------------------+ # | id | created_at | # +----+-------------------------+ # | 7 | 2009-12-29 00:15:56 UTC | # +----+-------------------------+ # 1 row in set |