Robot Has No Heart

Xavier Shay blogs here


Deleting duplicate data with PostgreSQL

Here is an update to a query I posted a while back for detecting duplicate data. It allows you to select all but one of the resulting duplicates, for easy deletion. It only works on PostgreSQL, but is pretty neat. It uses a window function!

DELETE FROM users 
USING (
  SELECT id, first_value(id) OVER (
    PARTITION BY name ORDER BY created_at DESC
  ) first_id
  FROM users
  WHERE name IN (
    SELECT name 
    FROM users 
    GROUP BY name 
    HAVING count(name) > 1
  )
) dups
WHERE dups.id != dups.first_id AND users.id = dups.id;

The order by is optional, but handy if you need to keep a particular row (here, the most recent per name) rather than just an arbitrary one. You need the extra sub-query because you can’t use a window function in a WHERE clause.
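If you want to sanity-check which rows are about to go before running the delete, the same window function works in a plain select. A minimal sketch, assuming the same users table with id, name and created_at columns:

-- Preview the duplicate rows that the DELETE above would remove,
-- keeping the most recent row per name.
SELECT users.*
FROM users
JOIN (
  SELECT id, first_value(id) OVER (
    PARTITION BY name ORDER BY created_at DESC
  ) first_id
  FROM users
) dups ON users.id = dups.id
WHERE dups.id != dups.first_id;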

For more tasty PostgreSQL tricks, check out my Meet PostgreSQL screencast, a steal at only $12. Plug, plug, plug.

Logging SQL statistics in Rails

When your sysadmin comes to you whinging with a valid concern that your app is reading 60 gazillion records from the DB, you kinda wish you had a bit more information than % time spent in the DB. So I wrote a plugin that counts both the number of selects/updates/inserts/deletes and the number of records affected. [This plugin is no longer available; the code is below for posterity.]

That does the counting; you still need to decide how to log it. I am personally quite partial to adding it to the request log line, thus getting stats per request:

# vendor/rails/actionpack/lib/action_controller/benchmarking.rb:75
log_message << " | Select Records: #{ActiveRecord::Base.connection.select_record_count}"
log_message << " | Selects: #{ActiveRecord::Base.connection.select_count}"

ActiveRecord::Base.connection.reset_counters!

Don’t forget the last line; otherwise you get cumulative numbers. That may be handy, but I doubt it. We’re only logging selects because that’s all we care about at the moment. I am sure this will change in time.

UPDATE: Moved to GitHub; the bzr repo is no longer available.

UPDATE 2: Pasted the code inline below; it’s way old and probably doesn’t work anymore.

module ActiveRecord::ConnectionAdapters
  class MysqlAdapter
    class << self
      def counters
        @counters ||= []
      end

      def attr_accessor_with_default(name, default)
        attr_accessor name
        define_method(name) do
          instance_variable_get(:"@#{name}") || default 
        end
      end

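      # Wraps the named adapter method (via alias_method_chain) so every call
      # bumps a call counter and adds to a record counter.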
      def define_counter(name, record_func = lambda {|ret| ret })
        attr_accessor_with_default("#{name}_count", 0)
        attr_accessor_with_default("#{name}_record_count", 0)

        define_method("#{name}_with_counting") do |*args|
          ret = send("#{name}_without_counting", *args)
          send("#{name}_count=", send("#{name}_count") + 1)
          send("#{name}_record_count=", send("#{name}_record_count") + record_func[ret])
          ret
        end
        alias_method_chain name, :counting

        self.counters << name
      end
    end

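    # select returns the result rows, so count them; the other methods'
    # return values are added to the record count as-is.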
    define_counter :select, lambda {|ret| ret.length }
    define_counter :update
    define_counter :insert
    define_counter :delete

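    # Zero all counters; call this once per request so the numbers
    # don't accumulate across requests.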
    def reset_counters!
      self.class.counters.each do |counter|
        self.send("#{counter}_count=", 0)
        self.send("#{counter}_record_count=", 0)
      end
    end
  end
end

A Banana a Day

SQL Optimization

Consider the following join (t1.a is a char(32), t2.b is a varchar):

select * from t1, t2 where instr(t1.a, t2.b) = 0

The actual code I worked with was a bit more complex, but essentially that is it. It works fine; however, performance is rather lacking. This is because for every row in t2, the DBMS must evaluate the instr() function against every row in t1 to test the condition.

If a field is used in a function, any indexes on that field cannot be used

That’s bad. In this case, since a is fixed length, we can rewrite the query thusly:

select * from t1, t2 where substr(t1.a,0,32) = t2.b

This way, the substr() is only computed once per row of t1, and its result can be checked quickly against an index on t2.b.
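For the index lookup to pay off, t2.b actually needs an index. A minimal sketch (the index name is made up for this example):

-- Hypothetical index so the rewritten join can probe t2.b directly
-- with each substr(t1.a, 0, 32) value.
CREATE INDEX t2_b_idx ON t2 (b);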

In the particular case I was working on (there were multiple joins to be rewritten), this cut execution time down from 2 minutes to just under a second.

Lonely Hammer Syndrome

When all you have is a hammer, every problem looks like a nail. After optimising the query referred to above, I found out its context. It’s a data collection problem: a server log where each URL contains a UID, a CSV that contains details about each UID (page title, etc.), and the two needed to be collated. The original process was:

  1. Import both tables into an Access DB (the server log into a table with one field – whole_line)
  2. Using linked tables and SQL string functions, convert the server log into a friendlier format in a table on a local Oracle server
  3. Collate the two files using the above-mentioned SQL
  4. Export the collated data to CSV (from Access)
  5. Append to the master file

Basically, we’re going from CSV, to Access, to Oracle, back through Access, and out to CSV again. Additionally, many of these steps required manual intervention. This process has to be done every week – how can it be optimised? Of course there are many ways – I looked into Java and JDBC to cut Access out of the loop, but figured that if I’m going to the effort I may as well cut out Oracle as well. Perl is renowned for its log-munging ability, so I put together a script (rather quickly – thank you, regular expressions!) which now automatically gives formatted data in barely a few minutes, compared to up to 60 minutes of manual labour using the old method.

The moral: expose yourself to as many tools as possible. You don’t need to be an expert in them (I had to google pretty much every Perl command :S), but if you know the pros and cons of each, you can cut down both development and operational time substantially.
