Robot Has No Heart

Xavier Shay blogs here

Rails XHTML Validation with LibXML/HTML Tidy

I improved upon the XHTML validation technique I showed yesterday, adding nicer error messages and support for local testing via HTML Tidy. HTML Tidy isn’t quite as thorough as the W3C validator – for example, it missed a label pointing to an invalid ID – but it runs hell fast. For W3C testing I’m now using LibXML to parse the response, so the failure message actually lists the errors rather than just telling you they exist.

And it’s all customizable by setting the MARKUP_VALIDATOR environment variable. Options are: w3c, tidy, tidy_no_warnings. Tidy is the default.

def assert_valid_markup(markup=@response.body)
  ENV['MARKUP_VALIDATOR'] ||= 'tidy'
  case ENV['MARKUP_VALIDATOR']
  when 'w3c'
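    # Thanks http://scottraymond.net/articles/2005/09/20/rails-xhtml-validation
    require 'net/http'
    require 'cgi'        # CGI.escape, used to encode the markup below
    require 'xml/libxml' # XML::Parser, from the libxml-ruby bindings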
    response = Net::HTTP.start('validator.w3.org') do |w3c|
      query = 'fragment=' + CGI.escape(markup) + '&output=xml'
      w3c.post2('/check', query)
    end
    if response['x-w3c-validator-status'] != 'Valid'
      error_str = "XHTML Validation Failed:\n"
      parser = XML::Parser.new
      parser.string = response.body
      doc = parser.parse

      doc.find("//result/messages/msg").each do |msg|
        error_str += "  Line %i: %s\n" % [msg["line"], msg]
      end

      flunk error_str
    end

  when 'tidy', 'tidy_no_warnings'
    require 'tidy'
    errors = []
    Tidy.open(:input_xml => true) do |tidy|
      tidy.clean(markup)
      errors.concat(tidy.errors)
    end
    Tidy.open(:show_warnings=> (ENV['MARKUP_VALIDATOR'] != 'tidy_no_warnings')) do |tidy|
      tidy.clean(markup)
      errors.concat(tidy.errors)
    end
    if errors.length > 0
      # Tidy bundles its messages into one string, so indent the embedded newlines
      error_str = errors.map { |e| e.gsub(/\n/, "\n  ") }.join
      flunk "XHTML Validation Failed:\n  #{error_str}"
    end
  end
end
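
One thing to watch for: the case statement above quietly does nothing if MARKUP_VALIDATOR is set to a value it doesn't recognise (a typo like "w3", say), so the assertion passes without validating anything. A minimal sketch of a guard for that follows. The helper name markup_validator and the constant VALID_MARKUP_VALIDATORS are made up for illustration and are not part of the code above.

```ruby
# Hypothetical helper, not part of the original assertion: resolve the
# validator name from the environment and reject unknown values early.
VALID_MARKUP_VALIDATORS = %w[w3c tidy tidy_no_warnings].freeze

def markup_validator(env = ENV)
  # Fall back to 'tidy' when the variable is not set at all.
  choice = env.fetch('MARKUP_VALIDATOR', 'tidy')
  unless VALID_MARKUP_VALIDATORS.include?(choice)
    raise ArgumentError,
          "Unknown MARKUP_VALIDATOR #{choice.inspect}, " \
          "expected one of: #{VALID_MARKUP_VALIDATORS.join(', ')}"
  end
  choice
end

# A plain Hash stands in for ENV here so the example runs anywhere.
puts markup_validator({ 'MARKUP_VALIDATOR' => 'w3c' })
puts markup_validator({})
```

Taking the environment as a parameter means the lookup can be exercised with a plain Hash instead of mutating the real ENV in a test.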

Getting Tidy to work was an ordeal; the Ruby documentation is rather lacking. It also behaves in weird ways – the call to errors returns a one-element array, with all the errors bundled together in a single string.
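
If you'd rather have one message per entry, the bundled string can be split apart again. This is only a sketch: split_tidy_errors is a made-up helper name, and the sample input is an illustrative stand-in shaped like the one-element array described above, not captured Tidy output.

```ruby
# Hypothetical helper: turn Tidy's bundled error strings into one
# message per array element, dropping blank lines.
def split_tidy_errors(errors)
  errors.flat_map { |bundle| bundle.split("\n") }
        .map(&:strip)
        .reject(&:empty?)
end

# Illustrative input only: two messages bundled into a single string.
bundled = [
  "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n" \
  "line 3 column 5 - Error: <foo> is not recognized!\n"
]

split_tidy_errors(bundled).each { |message| puts message }
```

With the messages separated, the failure output can be counted, filtered, or indented line by line rather than patched up with gsub.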

LibXML was a little tricky – there’s no obvious way to parse an XML document in memory. You’d think XML::Document.new(xml) would do the trick, since there’s an XML::Document.file(filename) method, but that actually uses the entire XML document as the version string. Not so handy. Turns out you need to create an XML::Parser object instead, as I’ve done above. The docs don’t mention this (anywhere obvious, that is); I found the answer in a thread on the LibXML mailing list.
