Robot Has No Heart

Xavier Shay blogs here


Finding related content with Sphinx

Previous efforts to find related posts with the classifier gem bore no fruit, so I tried another approach using Sphinx. It turned out to be a winner.

The basic theory is to index all posts by tag, then find related posts by using the current post’s tags as a search string. Remember to exclude the current post from the search results. On this blog I also use tags for the main categories, and those were corrupting the results: almost everything is tagged ‘Ruby’, so that tag adds no value in determining likeness. So rather than indexing all tags, I excluded some of the main ones.

class Post < ActiveRecord::Base
  has_many :searchable_tags, 
           :through    => :taggings,
           :source     => :tag,
           :conditions => "tags.name NOT IN ('Ruby', 'Code', 'Life')"
  
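  def related_posts(number = 3)
    # Ask for one extra result, since the current post matches its own tags
    # and gets rejected below
    Post.search(:limit => number + 1, :conditions => {
      # "|" is OR in Sphinx's extended query syntax: match any of these tags
      :tag_list => tag_list.join("|")
    }).reject {|x| x == self }.first(number)
  end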

  define_index do
    indexes searchable_tags(:name), :as => :tag_list
    # If you want to use this for normal search as well you'll have to 
    # add in title/body here as well
  end
end

For a more complete example, see the relevant RHNH commits: cdc0bf and d4d844
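
To see the idea without Sphinx in the picture, here is a minimal plain-Ruby sketch of the same theory: drop the over-used tags, exclude the current post, and keep whatever shares a tag. `SketchPost`, `related_by_tags`, and the sample titles and tags are all made up for illustration. Ranking by number of shared tags is only a stand-in, and is not how Sphinx scores its results.

```ruby
# Hypothetical stand-in for the Post model: just a title and a list of tag names.
SketchPost = Struct.new(:title, :tags)

# Tags too common to say anything about likeness (same names as excluded above).
IGNORED_TAGS = ['Ruby', 'Code', 'Life']

# Return up to `number` posts that share at least one non-ignored tag with
# `post`, ordered by how many tags they share (most first).
def related_by_tags(post, all_posts, number = 3)
  wanted = post.tags - IGNORED_TAGS
  scored = all_posts.reject { |other| other.equal?(post) }.map do |other|
    [other, ((other.tags - IGNORED_TAGS) & wanted).size]
  end
  scored.select { |_, shared| shared > 0 }
        .sort_by { |_, shared| -shared }
        .first(number)
        .map { |other, _| other }
end

# Invented sample data.
posts = [
  SketchPost.new('Breakfast recipe', ['Life', 'Food', 'Breakfast']),
  SketchPost.new('Dinner recipe',    ['Life', 'Food']),
  SketchPost.new('Rails tip',        ['Ruby', 'Code', 'Rails']),
  SketchPost.new('Parsing tip',      ['Ruby', 'Code', 'Hpricot'])
]

puts related_by_tags(posts[0], posts).map(&:title).inspect  # => ["Dinner recipe"]
puts related_by_tags(posts[2], posts).map(&:title).inspect  # => []
```

The second call shows why the exclusion matters: the two tips only have ‘Ruby’ and ‘Code’ in common, so once those are ignored they are no longer considered related.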

Showing links to related content is a good way to stop the bottom of your page from being a ‘dead end’. In the event that no related posts are found, I’m linking to the archives instead.
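
The archives fallback can be sketched as a tiny helper. This is an illustration under assumptions, not the code running on this blog: `footer_links`, the `/archives` path, and the hash-shaped posts are all hypothetical.

```ruby
# Hypothetical helper: turn related posts into [label, url] pairs, falling
# back to a single archives link when nothing related was found.
def footer_links(related, archives_url = '/archives')
  return [['Browse the archives', archives_url]] if related.empty?
  related.map { |post| [post[:title], post[:url]] }
end

p footer_links([])
p footer_links([{ :title => 'Dinner recipe', :url => '/dinner-recipe' }])
```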

Classifier gem rubbish for recommending posts

Chatting with Tim today, he suggested that Classifier::LSI might be a cool way to offer ‘related posts’ suggestions for a blog.

Not really knowing anything about it, I whipped up a prototype rake task. It creates the index then marshals it to disk because it takes ages to create and it’s not much fun to play with when you have to wait minutes each time. It then presents 3 related suggestions for each post.

require 'classifier'

namespace :lsi do
  task :test => :environment do
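    # Reuse the marshalled index if there is one, since building it is slow.
    # File.exist? replaces the deprecated File.exists?, and Marshal data is
    # binary, so the dump file is opened in binary mode.
    if File.exist?("lsidata.dump")
      lsi = File.open("lsidata.dump", "rb") {|f| Marshal.load(f) }
    else
      lsi = Classifier::LSI.new
      Post.find(:all, :order => 'published_at DESC').each do |post|
        text = post.body
        categories = post.tags.collect(&:name)
        puts "Indexing " + post.title
        lsi.add_item(text, *categories)
      end
      File.open("lsidata.dump", "wb") {|f| Marshal.dump(lsi, f) }
    end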

    Post.find(:all).each do |post|
      puts post.title
      puts lsi.find_related(post.body, 3).collect {|i| Post.find_by_body(i).title }.inspect
    end
  end
end
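
The “build once, then marshal to disk” trick in that task works for any slow-to-build object. Here is a minimal sketch of it pulled out on its own, with no Rails or classifier involved. `cached_on_disk` is a made-up helper name, and the small hash stands in for the real index.

```ruby
require 'tmpdir'

# Load a marshalled object from `path` if the file exists; otherwise build it
# with the block, write it to `path`, and return it.
def cached_on_disk(path)
  if File.exist?(path)
    File.open(path, 'rb') { |f| Marshal.load(f) }
  else
    value = yield
    File.open(path, 'wb') { |f| Marshal.dump(value, f) }
    value
  end
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'index.dump')
  builds = 0
  2.times do
    # The block stands in for the slow index build; it should only run once.
    index = cached_on_disk(path) { builds += 1; { 'ruby' => [1, 2], 'food' => [3] } }
    puts index.keys.inspect
  end
  puts "built #{builds} time(s)"  # => built 1 time(s)
end
```

Two things to keep in mind: delete the dump file whenever the posts change, or you will keep getting the stale index, and only call Marshal.load on files you wrote yourself, since loading untrusted Marshal data is unsafe.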

Here’s the data for my last 5 articles. I don’t know what I was expecting, but this just doesn’t seem very helpful. I don’t have a very rich set of tags on my posts, so that probably has something to do with it. Was kind of hoping it would just look at text and all just work * waves hands *.

Seagate 500Gb FreeAgent Pro external drive - first impressions
  - Building Firefox Extensions
  - The Colemak Diaries
  - Counting ActiveRecord associations: count, size or length?
Coconut Oats
  - The Colemak Diaries
  - Summertime Tagliarini
  - Mary Iron Chef - Chocolate Jaffa Boxes
Mary Iron Chef - Chocolate Jaffa Boxes
  - The Colemak Diaries
  - Building Firefox Extensions
  - Summertime Tagliarini
Paypal IPN fails date standards
  - Building Firefox Extensions
  - Straight Sailing with Magellan
  - The Colemak Diaries
I'm number 8!
  - Extending Rails
  - Practical Hpricot: SVG
  - Day of days

Next step is to try tagging my stuff better and seeing if that helps out.

Getting classifier working

Quick side note: the pure Ruby classifier doesn’t work out of the box with Rails, because it redefines Array#sum, which clashes with the sum that Rails already adds. If you install the GSL library and its Ruby bindings (see the classifier docs), you’ll still need this one-line patch to classifier to get it working:

Index: lib/classifier/lsi.rb
===================================================================
--- lib/classifier/lsi.rb       (revision 31)
+++ lib/classifier/lsi.rb       (working copy)
@@ -25,6 +25,8 @@
   # please consult Wikipedia[http://en.wikipedia.org/wiki/Latent_Semantic_Indexing].
   class LSI
     
+    include GSL if $GSL
+    
     attr_reader :word_list
     attr_accessor :auto_rebuild

UPDATE: I’ve forked classifier on github, so you can just grab that version if you like.
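
For anyone wondering how two libraries can trip over one method name like Array#sum, here is a small sketch of the general mechanism. It is not the actual classifier or Rails code: `demo_total` is a made-up method name, chosen so the demo leaves the real Array#sum alone, and “Library A” and “Library B” are just labels.

```ruby
class Array
  # "Library A": adds up the elements, with an optional starting value.
  def demo_total(start = 0)
    inject(start) { |acc, x| acc + x }
  end
end

first_version = [1, 2, 3].demo_total(10)
puts first_version  # => 16

class Array
  # "Library B", loaded later: same name, but no argument and a float result.
  # Reopening the class silently replaces Library A's method.
  def demo_total
    inject(0.0) { |acc, x| acc + x }
  end
end

second_version = [1, 2, 3].demo_total
puts second_version  # => 6.0

begin
  # Code written against Library A's version now fails.
  [1, 2, 3].demo_total(10)
rescue ArgumentError => e
  puts "Library A's callers now break: #{e.message}"
end
```

Whichever definition is loaded last wins, which is why a gem that works fine on its own can fall over once it is loaded alongside Rails.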
