Archive for the ‘development’ Category

Coding well is embarrassing

Wednesday, November 11th, 2009

I don’t use SciPy (a science/maths programming library), or indeed Python much at all, but I can closely relate to the experience of the author of this post.

Especially these quotes:

The bottleneck in writing code isn’t in the writing of the code, it’s in understanding and conceptualising what needs to be done. Once you’ve done that, i.e. come up with mathematical objects and equations that describe your algorithm, you simply express these in a few lines [..] and hit go.

[..] you spend months developing your complex algorithms and when you’re done you show somebody the result of all your efforts — a page or two of code. It looks like something that somebody could have written in an afternoon. Even worse, you start to suspect that if you had really known [..] and spent a few days carefully thinking about the problem to start with, then you probably could have coded it in an afternoon.

He’s talking about generating financial algorithms but I feel the same way about many things I’ve done. So many times I’ve written reams of code, fleshing out the way I think something should be, only to realise later that I could generalise much of it, shedding hundreds of lines and days or weeks of work in the process. I just don’t know what I want until I see it expressed in front of me, and until I know what the product should be, I can’t generalise. It’s as if the process of coding is analogous to analysing a number series – it’s only when one has enough points of data, say 5 or 10, that one can guess at the equation that generated it. In my case, it’s only when I’ve implemented the same damn thing with minor changes 5 times in a row that I finally see the pattern and am able to collapse it down into an abstracted general form.

And then days, weeks, months, later, when one inspects the results of one’s labours, one’s greeted with a couple thousand lines of code, if that, with probably only a few hundred lines of actual “live” structural code. It does indeed look like something you could have bashed out in an afternoon if you’d really tried. And no doubt I could type it all again in a few hours. The trick is knowing what to type ..

UUIDs in Rails redux

Tuesday, April 15th, 2008

I have covered forcing ActiveRecord to respect UUID data types in Migrations before. That helps us create our database – now what about in use? We need to create the UUIDs and store them in the database.

These examples all rely on the uuidtools gem, so install that if you haven’t already (and require it somewhere in environment.rb).

1. Setting a UUID using ActiveRecord callbacks

If you don’t need the UUID in the object upon creation but only want to ensure it’s there upon save, do this. Suggestion initially from this page, changes are mine.

We will use the before_create callback to ask AR to add a UUID of our choosing before the record is saved.

Add this to your lib directory:

# lib/uuid_helper.rb
require 'uuidtools'
 
module UUIDHelper
  def before_create
    self.id = UUID.random_create.to_s
  end
end

And now include this in your models:

class Airframe < ActiveRecord::Base
  include UUIDHelper
 
  #my stuff
 
end
>> Airframe.new
=> #<Airframe id: nil, maker_id: nil>
>> Airframe.create!
=> #<Airframe id: "1a82a408-32e6-480e-941d-073a7e793299", maker_id: nil>

2. Initialising a model with a UUID

If you want the UUID in the model before save, i.e. upon initialisation, we have to get a little more fancy:

# lib/uuid_init.rb
require 'uuidtools'

module UUIDInit
  def initialize(attrs = {}, &block)
    super
    self['id'] = UUID.random_create.to_s   # assign the UUID primary key as soon as the object exists
  end
end

Now include this in your models:

class Flightpath  < ActiveRecord::Base
 
  include UUIDInit
 
  # my stuff
 
end
>> Flightpath.new
=> #<Flightpath created_at: nil, id: "5e5bcd63-070d-4252-8556-2876ddd83b54">

Be aware that it will conflict with any other initialisation you do in there, so you might want to simply copy in the whole method if you need other fields upon initialisation:

class User < ActiveRecord::Base

  def initialize(attrs = {}, &block)
    super
    self['balance'] = 0.0
    self['id'] = UUID.random_create.to_s
  end

end
>> User.new
=> #

3. Sessions

All this is very well for your own models, but what about Rails’ inbuilt sessions? By default, they want an autoincrementing integer primary key.

The good news is it’s easy to override. Your migration should look like this:

create_table "sessions", :id => false, :force => true do |t|
  t.string   "session_id"
  t.text     "data"
  t.datetime "updated_at"
  t.datetime "created_at"
end

Now add this to your environment.rb file:

# config/environment.rb
CGI::Session::ActiveRecordStore::Session.primary_key = 'session_id'

And this to your Application Controller:

# app/controllers/application.rb
class ApplicationController < ActionController::Base

  before_filter :config_session # at the top, if possible

  def config_session
    session.model.id = session.session_id
  end

end

And voila, your session store is using the session_id as its primary key. I don’t see any point in using a UUID for your sessions’ PK, but if you want to you’ll find an example override class in:

actionpack/lib/action_controller/session/active_record_store.rb.

Remember to drop any preexisting sessions table in your database, or it will likely complain of null ids when you switch to session_id as your primary key.
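
(If you’d rather do that with a migration than by hand, a throwaway one like this does the job – the class name is arbitrary:)

class DropOldSessionsTable < ActiveRecord::Migration
  def self.up
    drop_table :sessions   # throw away the old integer-keyed rows before recreating the table
  end

  def self.down
    # nothing to restore – sessions are disposable
  end
end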

Import successful

Wednesday, April 9th, 2008

I’m happy to report that I’ve successfully imported all posts and comments from this blog into a new system. While I’m still unsure as to when that’s going to go live, I hope it’s soon – I’m sick of this blog. I mean just look at it! Ugly as hell. And so fucking common.

Anyway, looking forward to rm -rf’ing wordpress once and for all. Tally-ho!

Off topic but not really worthy of another post is a small annoyance I have with people talking about randomness in UUIDs. I often hear this quote bandied about as some sort of “proof” that UUID collisions will never occur:

1 trillion UUIDs would have to be created every nanosecond for 10 billion years to exhaust the number of UUIDs

This is a useless factoid which, while sounding superficially impressive, has no relevance to actual use of UUIDs. For starters – there’s no central authority tracking UUID use. Even in internal projects there’s usually nothing tracking their use. Most projects that I’m aware of have none, zero, zip duplicate protection or even detection. Not that this is a big problem, as this example a little further down the same Wikipedia page explains:

[..] after generating 1 billion UUIDs every second for the next 100 years, the probability of creating just one duplicate would be about 50%

Now admittedly that’s still a pretty unlikely event. Assuming you have a decent generator, you shouldn’t lose any sleep worrying that your UUIDs are colliding .. or ever will in your lifetime. But 1 billion a second for 100 years is a lot less than 1 trillion a nanosecond for 10 billion years. Don’t just wave away roughly 20 orders of magnitude just to make a point.
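
For the curious, that 50% figure is just the birthday approximation at work. A quick back-of-the-envelope check (a sketch, assuming the 122 random bits of a version 4 UUID):

# birthday approximation: p(collision) ≈ 1 - e^(-n² / 2N)
n_space = 2.0 ** 122                         # possible random v4 UUIDs

per_second = 1.0e9                           # 1 billion UUIDs per second...
n = per_second * 100 * 365.25 * 24 * 3600    # ...for 100 years

p_collision = 1 - Math.exp(-(n ** 2) / (2 * n_space))
puts p_collision                             # ≈ 0.6 – same ballpark as the quoted 50%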

DataMapper – one ORM to rule them all?

Tuesday, April 8th, 2008

I’ve just watched the DataMapper presentation from Mountain West 2008, and it’s very interesting. I’ve been thinking we need a high-level super-abstraction library for Ruby for some time, and DataMapper (DM) looks like it might grow into fitting that bill.

What’s wrong with what we’re using now? Let me count the ways – or rather, let me count how many different sources to and from which I am reading and writing data in a Rails app of mine today:

  1. An RDBMS (Postgres), using ActiveRecord
  2. A number of resources, using ActiveResource
  3. A document-based database, using a custom i/o library
  4. YAML config files, using a constant read from a file on server load
  5. Cookies, via ActionController
  6. (not really a full point) data structure information in Model files

That’s too many. And I see no sign of this trend slowing down – why, just today I found out about a new project, yet another library to address document-based databases. We have at least three gems to read and write to CouchDB. Thingfish uses Sequel. If you use Amazon S3 that’s yet another. Enough!

It’s more and more obvious that these new developments are being written at the wrong level of abstraction. I don’t know what we can do about Cookies but all the others – RDBMS, RESTful resources, YAML files, etc – are the same types of data and should be accessible by a common API within Rails or Merb or plain Ruby or anywhere. So why can’t we access them all via a common method?

The correct way to do this is to have a single ORM at the highest possible level of abstraction. Storage adapters can be added to it as drivers. Then, if you need to split storage types for whatever reason, you can configure that on a case by case basis.

DataMapper enables this, and provides a plug-in system – if you can implement 10 or so basic actions in your storage system, and support 5 or so basic data types, you should be able to use it transparently. To me, this is a very appealing step forward. There are many types of data storage, all with their strengths and weaknesses. If we can flexibly and transparently include pretty much any useful type of storage into our programs using a single consistent API and with just a little config, once, in one place, that’s a huge win.
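
To make that concrete, here’s roughly what a model looks like in DM’s property DSL – a sketch only, with made-up names and an in-memory sqlite URI standing in for whatever adapter you actually use:

require 'rubygems'
require 'dm-core'

# point the default repository at any adapter/driver you like
DataMapper.setup(:default, 'sqlite3::memory:')

class Airframe
  include DataMapper::Resource

  property :id,   Serial    # or a String key, if you're as UUID-obsessed as I am
  property :name, String
end

Swap the setup URI for a Postgres or CouchDB adapter and, in theory, the model itself doesn’t change – which is exactly the promise being made in the presentation.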

Why not just improve ActiveRecord? I think it’s too late for that. AR is a huge, tangled mess which many believe should just be scrapped and re-written from scratch, me included. Well, the good news is that DM has basically done that, and it’s smaller, faster, more modular, cleaner and – best of all – mostly compatible.

UPDATE: Whilst looking at the weblog of Rails Core member Rick Olson, aka Technoweenie, for info on ActiveDocument I came across this wonderfully candid comment on AR associations:

I can point to a few other more complex areas of the associations code (the crazy eager joins code is why I personally haven’t used :include in any of my apps in over a year).

Straight from the horse’s mouth! Couldn’t agree more.

Inconsolata

Tuesday, April 8th, 2008

I have used Monaco for all my fixed-width text needs since I can remember. I do everything in Monaco – write emails, use terminal, TextMate, everything. I remember I used to use IRC in Monaco in fricking 1994. I love Monaco. “That awesome font on the Mac”, I used to think of it as.

But lately my eyes have been tiring while staring at TextMate all day so I’ve been tweaking a bit to try and reduce eyestrain. Changing syntax colouring, reducing contrast here, increasing it there – that kind of thing. Seems to work, a bit, I guess. But as part of that I also decided to check out what other fonts people thought were good and by that route I came across Inconsolata, perhaps the first monospace font I’ve ever thought was even comparable to wonderful, wonderful Monaco.

The problem is the serifs on Monaco. The lower-case “i”, for example, looks fantastic without its left bottom serif at 10 points. I maintain to this day that 10 pt Monaco is the single best way to represent text ever fucking invented, but at 14pt (the size I use it at in TextMate) the serifs that look so kewl at 10pt start to work against it a little.

Inconsolata is basically Monaco with a few tweaks that make a surprising amount of difference. I am not some kind of beret-wearing font expert, so I have no idea how to describe it, but at 14-16pt I find Inconsolata slightly easier to read than Monaco. My terminology is probably completely wrong, but it feels like Inconsolata is “hinted” slightly better – one doesn’t need to stare at a word quite so hard to discern its composition. The lower-case “i” is a good example – the serifs at the bottom, and slightly different spacing, make it more obviously a lower-case i at a glance. It’s subtle, but when you stare at something for 6+ hours per day even a subtle improvement becomes more than worthwhile.

Anyway, it’s a free & open source font and an easy install for OSX – download the OpenType file and just double click – so if fixed-width fonts are a big part of your day, I recommend giving it a shot and seeing what you think.

RESTful confusion

Tuesday, April 1st, 2008

REST confuses me. I’m trying to come up with a consistent, elegant RESTful interface to a web application, and I just can’t understand how I’m supposed to do it while maintaining “correct” RESTful design.

I see the advantages of using REST over, say, XML-RPC. I want to use it. But I’ve got serious questions and I can’t find any answers.

At the front of my list of questions is the biggest, most basic problem – that of credentials. How the hell do you send credentials as part of a REST request?

From wikipedia:

HTTP separates the notions of a web server and a web browser. This allows the implementation of each to vary from the other based on the client-server principle. When used RESTfully, HTTP is stateless. Each message contains all the information necessary to understand the request when combined with state at the resource. As a result, neither the client nor the server needs to remember any communication state between messages. Any state retained by the server must be modeled as a resource.

REST is supposed to be completely stateless. I understand that. That means no “sessions” or login keys or cookies for the client to remember and reuse. That also means the credentials must be sent with every request if I’m understanding it right.

So where are the credentials supposed to be? In the URL? Or in the form body (assuming a POST request)? If so, what about a GET request, which doesn’t have a form body? And am I to assume that all REST requests requiring credentials are supposed to go over TLS? Why do none of the examples I see have any of this in them?

Since the whole point of using this is to standardise, are there standard form names (eg for username/password) we’re supposed to use? I have never seen what I consider to be a “canonical” example of this. In fact I have never seen a RESTful system using credentials of any kind that doesn’t fall foul of the first basic rule of what “is not REST” according to the RestWiki:

What is not REST?
- Stateful systems

Any system where you do not need to send the credentials more than once is stateful by definition.

Assume that we’re going to cheat a little and implement, say, a stateful session so we don’t have to log on every single time. Should the session token go in the URL? It has to, if it’s a GET and it usually will be. Or should we be faking out cookies? Again – if the session token is in the URL (a “secret” URL) that will need to be over TLS or anyone will be able to sniff it. But how else can you do sessions over HTTP GET without cookies? HTTP BASIC AUTH?
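
For what it’s worth, the least-bad answer I can come up with is HTTP Basic auth over TLS, re-sent on every single request – which at least keeps the server stateless. A sketch of what a client doing that might look like (host, path and credentials are made up):

require 'net/http'
require 'net/https'
require 'uri'

uri = URI.parse('https://api.example.com/flights/42')

http = Net::HTTP.new(uri.host, uri.port)
http.use_ssl = true                      # credentials on the wire, so TLS is non-negotiable

request = Net::HTTP::Get.new(uri.path)
request.basic_auth('alice', 's3cret')    # sent with *every* request – no server-side session

response = http.request(request)
puts response.code, response.body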

Confused yet? Me too. Has anyone ever actually implemented a pure RESTful system in an application that requires authentication?

PostgreSQL macports upgrade destroyed data

Friday, March 28th, 2008

A little reminder for me today of a few things:

1. Don’t install important things like databases via convenience systems like macports
2. If you do, keep very good backups.

Luckily, although I’d neglected 1, I hew very closely to 2 so I didn’t lose anything. But I’m pretty astonished that an uninstall/reinstall of a port destroyed user data – totally unacceptable.

Be careful with what you install via port systems. For convenience, they’re wonderful. But don’t trust them with your data, and BACKUP!

Horrible WP export data format

Wednesday, March 12th, 2008

So, I’m writing blog software, and one of the obvious things I want to do is import from this blog. As a first step towards that, I export the entire contents using WP’s export tool (after purging more than 17,000 comment spams since I last manually purged) and this is what it looks like (previous post to this):

<item>
  <title>Interesting words in your OSX Dictionary</title>
  <link>https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/</link>
  <pubDate>Tue, 11 Mar 2008 03:03:46 +0000</pubDate>
  <dc:creator>Sho</dc:creator>
  <category><![CDATA[…]]></category>
  <category><![CDATA[…]]></category>
  <category><![CDATA[…]]></category>
  <category domain="tag"><![CDATA[…]]></category>
  <category domain="tag"><![CDATA[…]]></category>

  <guid isPermaLink="false">https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/</guid>
  <description></description>
  <content:encoded><![CDATA[…]]></content:encoded>
  <wp:post_id>713</wp:post_id>
  <wp:post_date>2008-03-11 12:03:46</wp:post_date>
  <wp:post_date_gmt>2008-03-11 03:03:46</wp:post_date_gmt>
  <wp:comment_status>open</wp:comment_status>
  <wp:ping_status>open</wp:ping_status>
  <wp:post_name>fake-words-in-your-osx-dictionary</wp:post_name>
  <wp:status>publish</wp:status>
  <wp:post_parent>0</wp:post_parent>
  <wp:menu_order>0</wp:menu_order>
  <wp:post_type>post</wp:post_type>
</item>

Jesus, that is *horrible*. Firstly, if the post_type is defined only towards the end, what’s with the post_id, post_date, post_name etc? It’s a post – of post_type post! Secondly, where’s the “updated at” field? What’s the “dc:” namespace for the creator tag only? What’s with having an “isPermalink” switch in the guid tag? The permalink is in the link tag, I presume. Why does it need to be content:encoded when obviously the content is CDATA – implying that WP somehow supports XML parsing inside some contents!? Why is pubDate camelCase while everything else is underscore_style? Man, I hate camelCase. Etc etc. What a mess.

I know what you’re thinking: that’s just RSS format! Sure it’s ugly, it’s RSS! Well, no. The RSS is similar but different for this post – I examined the feed for that, too. Note that the description is empty, it isn’t in the RSS. So they’re using a modified RSS format to store internal data. If they’re not going to store description, but just generate it on the fly – why export empty description tags?!

Just for comparison, here’s the much nicer atom feed. Obviously doesn’t have all the wp: internal data, but I much prefer the design:

<entry>
  <author>
    <name>Sho</name>
    <uri>https://fukamachi.org/</uri>
  </author>
  <title type="html"><![CDATA[…]]></title>
  <link rel="alternate" type="text/html" href="https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/" />
  <id>https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/</id>
  <updated>2008-03-11T03:04:31Z</updated>
  <published>2008-03-11T03:03:46Z</published>
  <category scheme="https://fukamachi.org/wp" term="Language" />
  <category scheme="https://fukamachi.org/wp" term="leopard" />
  <category scheme="https://fukamachi.org/wp" term="mac" />
  <category scheme="https://fukamachi.org/wp" term="dictionary" />
  <category scheme="https://fukamachi.org/wp" term="esquivalience" />
  <summary type="html"><![CDATA[…]]></summary>
  <content type="html" xml:base="https://fukamachi.org/wp/2008/03/11/fake-words-in-your-osx-dictionary/"><![CDATA[
    Using Leopard? Try this. Look up the word esquivalience by selecting it and choosing dictionary from the contextual menu. Read the dictionary definition, then the wikipedia one underneath : )
  ]]></content>
</entry>

Note logical, consistent design, self-closing tags, and other innovations.

UPDATE: Check out the comment format:

<wp:comment>
  <wp:comment_id>3</wp:comment_id>
  <wp:comment_author><![CDATA[…]]></wp:comment_author>
  <wp:comment_author_email><![CDATA[…]]></wp:comment_author_email>
  <wp:comment_author_url>http://nigger.org/</wp:comment_author_url>
  <wp:comment_author_IP>127.0.0.1</wp:comment_author_IP>
  <wp:comment_date>2005-07-16 10:23:48</wp:comment_date>
  <wp:comment_date_gmt>2005-07-16 14:23:48</wp:comment_date_gmt>
  <wp:comment_content>Hey, is this that new gay nigger cock website I've been hearing about?</wp:comment_content>
  <wp:comment_approved>1</wp:comment_approved>
  <wp:comment_type></wp:comment_type>
  <wp:comment_parent>0</wp:comment_parent>
</wp:comment>

The comment author is CDATA, but the content isn’t? WTF?

Reducing code duplication with Rails ActionMailer

Sunday, March 9th, 2008

ActionMailer is a weird, fussy black box whose intricacies produce a lot of code duplication, especially for dealing with multi-language email, where you need to use a different .erb file for each language. Here’s an example of how you can reduce some of that duplication.

Code speaks louder than words so I’ll just cut and paste – you can see pretty easily what I’ve done. Works, although it’s completely undocumented and is probably not very good practice.

Previous, highly redundant code. I’ve snipped this a lot – you don’t have to use much imagination to see why I don’t like this:

def tokenmail_eng(user, subject)
  subject       subject
  body          :token => user.password_hash,
                :password => user.password,
                :user_id => user.id,
                :name => user.nickname
  recipients    user.email_address_with_name
  from          ''
  sent_on       Time.now
  headers       "Reply-to" => ""
end
 
def tokenmail_jpn(user, subject)
  subject       subject
  body          :token => user.password_hash,
                :password => user.password,
                :user_id => user.id,
                :name => user.nickname
  recipients    user.email_address_with_name
  from          ''
  sent_on       Time.now
  headers       "Reply-to" => ""
end
 
def tokenmail_zhs(user, subject)
  subject       subject
  body          :token => user.password_hash,
                :password => user.password,
                :user_id => user.id,
                :name => user.nickname
  recipients    user.email_address_with_name
  from          ''
  sent_on       Time.now
  headers       "Reply-to" => ""
end

Woah. That is *awful*. When I finally decided to clean up my mail code I found several problems even in my cut and pasted code – when you do that a few times, and then go in to make a change, you are almost guaranteed to leave something out. That is precisely why cut and pasting code is something you should never, ever do.

So how do we clean this up and merge into a “generic” mail, while fooling ActionMailer into using the correct file?

Turns out we can do this:

def tokenmail_gen(user, subject)
  subject       subject
  body          :token => user.password_hash,
                :password => user.password,
                :user_id => user.id,
                :name => user.nickname
  recipients    user.email_address_with_name
  from          ''
  sent_on       Time.now
  headers       "Reply-to" => ""
end
 
def tokenmail_eng(user, subject)
 tokenmail_gen(user, subject)
end
 
def tokenmail_jpn(user, subject)
 tokenmail_gen(user, subject)
end
 
def tokenmail_zhs(user, subject)
 tokenmail_gen(user, subject)
end
 
def tokenmail_zht(user, subject)
 tokenmail_gen(user, subject)
end

That’s so much better it brings a tear to my eye. Needless to say you will need to handle translation of the subject line in your Controller.
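
If the list of languages keeps growing, the stubs themselves can be generated – a sketch, assuming ActionMailer keeps resolving the template from the name of the method that was actually called:

# same delegation as above, just generated in a loop
%w( eng jpn zhs zht ).each do |lang|
  define_method("tokenmail_#{lang}") do |user, subject|
    tokenmail_gen(user, subject)
  end
end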

SERIOUS NOTE: This is undocumented and while it seems to work, I haven’t heard of anyone else doing it and it might not work in future. Use at your own risk and make sure your tests cover it!

Eval, friend and enemy

Sunday, March 9th, 2008

The prevailing wisdom in the Ruby world is that using eval is bad. It’s brittle, hackish and inflexible – if you find yourself using it, it’s usually a sign you’re doing something wrong.

Well, I agree, but the keyword is “usually”. Sometimes using eval is a lot better than any alternatives, and it can help work around problems in, say, other people’s code, where the design decisions they made stop you doing what you want. I avoided using it for a long time, maybe based on my acceptance of the consensus opinion – but recently I’ve had cases where it’s the “lesser of two evils”, and I wanted to share one with you.

The example is Rails’ horrible mailing system. It’s one of the worst parts of Rails – but let me temper that by saying it’s a hard problem and difficult to see how else they could have done it. A Mailer in Rails is kind of a pseudo-model black box that waits to be sent a list of parameters, then inserts some of them into a specific text file, then sends that text using the remainder of the parameters.

All very well and good so far. But the problem with this is when you want to have mail in multiple languages. So you don’t just have

Notifier.deliver_welcome_mail(params)

you have

Notifier.deliver_welcome_mail_eng(params)
Notifier.deliver_welcome_mail_jpn(params)
Notifier.deliver_welcome_mail_spa(params)

etc. All with corresponding entries in the model, and matching text files.

Now, I don’t know any way to get around the need for those text files and entries in the model that doesn’t involve writing some huge text generator (bad) or hacking Rails itself (even worse). But we can at least improve on the statements used to call these deliveries.

Here’s an example of what I used to have:

# NB: receiver and argument names here are illustrative
if user.language_iso == 'zht'
  Notifier.deliver_invite_zht(user, invite, subject)
elsif user.language_iso == 'zhs'
  Notifier.deliver_invite_zhs(user, invite, subject)
elsif user.language_iso == 'jpn'
  Notifier.deliver_invite_jpn(user, invite, subject)
else
  Notifier.deliver_invite_eng(user, invite, subject)
end

That’s a shortened version of what can be a much longer if or case statement (this is pretty old code..). But you can see this gets very nasty, very quickly.

Using eval we can change that to:

eval("Notifier.deliver_invite_#{user.language_iso}(user, invite, subject)")

Yes, it’s still nasty. But it’s one line of nastiness as opposed to 10 (or many more). I think this is one example of a case where using eval is definitely better than the alternative.
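
If you end up using the trick in more than one place, you can at least corral the eval into a single helper – a sketch, with illustrative names:

def deliver_localised(mail_name, user, *args)
  # the only eval in the whole app lives here
  eval("Notifier.deliver_#{mail_name}_#{user.language_iso}(user, *args)")
end

deliver_localised(:invite, user, invite, subject)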

Next, I’m experimenting with trying to reduce the horrendous duplication in the Mailer pseudomodel itself.

mongrel vs. thin

Thursday, February 28th, 2008

So, we’ve got a promising new web server in the ruby world – thin. Fantastic news, and it’s using some excellent libraries – the brilliant eventmachine, and the ragel HTTP parser from mongrel (ie, the only good thing about mongrel) – both of which I am using in other projects. Very promising, looks well designed, maintainable and clean code. Unlike mongrel.

So, there’s only one thing we care about with web servers. What’s the performance?

Here’s some stats on an actual serious rails app doing actual work with an actual database and everything. Production mode and I gave each server a couple of “warm-up” runs before the run copied below. I’m just going to include the whole output.

$ ab -n 1000 -c 50 http://0.0.0.0:3000/
 
Server Software:        Mongrel
Server Hostname:        0.0.0.0
Server Port:            3000
 
Document Path:          /
Document Length:        1418 bytes
 
Concurrency Level:      50
Time taken for tests:   18.633091 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      1779000 bytes
HTML transferred:       1418000 bytes
Requests per second:    53.67 [#/sec] (mean)
Time per request:       931.655 [ms] (mean)
Time per request:       18.633 [ms] (mean, across all concurrent requests)
Transfer rate:          93.22 [Kbytes/sec] received
 
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.9      0      10
Processing:    23  909 130.8    927    1763
Waiting:       19  908 131.1    927    1763
Total:         23  909 130.1    927    1763
 
Percentage of the requests served within a certain time (ms)
  50%    927
  66%    936
  75%    944
  80%    949
  90%    973
  95%   1061
  98%   1067
  99%   1069
 100%   1763 (longest request)
 
Server Software:        thin
Server Hostname:        0.0.0.0
Server Port:            3000
 
Document Path:          /
Document Length:        1418 bytes
 
Concurrency Level:      50
Time taken for tests:   18.120868 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      1746000 bytes
HTML transferred:       1418000 bytes
Requests per second:    55.18 [#/sec] (mean)
Time per request:       906.043 [ms] (mean)
Time per request:       18.121 [ms] (mean, across all concurrent requests)
Transfer rate:          94.09 [Kbytes/sec] received
 
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    1   2.7      1      31
Processing:   683  893 123.6    881    1710
Waiting:      605  721 120.3    718    1473
Total:        683  894 123.7    882    1711
 
Percentage of the requests served within a certain time (ms)
  50%    882
  66%    888
  75%    902
  80%    908
  90%    961
  95%   1002
  98%   1474
  99%   1710
 100%   1711 (longest request)

Hm, nothing much in that – an insignificant improvement. Basically, that’s Rails slowness we’re measuring here. Let’s cut most of that out of the picture and go straight for a “pulse” controller – it returns nothing but the two-byte string “OK”, meaning the app is actually up. I implemented that so monit can monitor the app without hitting the front page every few seconds. Let’s take a look:

Server Software:        Mongrel
Server Hostname:        0.0.0.0
Server Port:            3000
 
Document Path:          /api/pulse
Document Length:        2 bytes
 
Concurrency Level:      50
Time taken for tests:   8.405170 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      269000 bytes
HTML transferred:       2000 bytes
Requests per second:    118.97 [#/sec] (mean)
Time per request:       420.259 [ms] (mean)
Time per request:       8.405 [ms] (mean, across all concurrent requests)
Transfer rate:          31.17 [Kbytes/sec] received
 
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.6      0       4
Processing:    14  412  72.4    399     810
Waiting:       13  411  72.3    398     808
Total:         14  412  72.0    399     810
 
Percentage of the requests served within a certain time (ms)
  50%    399
  66%    470
  75%    477
  80%    480
  90%    486
  95%    489
  98%    491
  99%    493
 100%    810 (longest request)
 
 
Server Software:        thin
Server Hostname:        0.0.0.0
Server Port:            3000
 
Document Path:          /api/pulse
Document Length:        2 bytes
 
Concurrency Level:      50
Time taken for tests:   6.65994 seconds
Complete requests:      1000
Failed requests:        0
Write errors:           0
Total transferred:      236000 bytes
HTML transferred:       2000 bytes
Requests per second:    164.85 [#/sec] (mean)
Time per request:       303.300 [ms] (mean)
Time per request:       6.066 [ms] (mean, across all concurrent requests)
Transfer rate:          37.92 [Kbytes/sec] received
 
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.2      1      10
Processing:   200  298  49.2    278     460
Waiting:       54  239  47.5    224     404
Total:        202  299  49.0    278     460
 
Percentage of the requests served within a certain time (ms)
  50%    278
  66%    294
  75%    354
  80%    356
  90%    362
  95%    372
  98%    457
  99%    460
 100%    460 (longest request)

That’s much more of an improvement.
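
(For the record, the pulse action is about as trivial as a controller gets – something along these lines, sketched from memory rather than pasted from the real app:)

class ApiController < ApplicationController
  session :off, :only => :pulse   # no session cookie needed just to prove we're alive

  def pulse
    render :text => 'OK'
  end
end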

I’ve tested it at some length now and have encountered no stability problems – not that it takes much to beat The CrashMaster™ mongrel. In fact, I have no love for mongrel at all (please see my recent posts on hacking it so it doesn’t refuse to launch upon encountering its own PID files from previous crashes) and so I’m switching to thin, effective now. I’ll let you know how it goes!

UPDATE: Yet another potential competitor has emerged – ebb, which appears to be even faster than Thin in (artificial) benchmarks. However, it’s definitely not ready for prime time – it ran an order of magnitude slower than Thin in my initial testing, and I eventually had to force quit it after it became unresponsive. Still, those benchmarks paint a pretty promising picture of what a C implementation can do. Here are my bad results anyway.

$ ebb_rails start -e production
$ ab -n 100 -c 5 http://0.0.0.0:3000/api/pulse # i reduced the numbers
 
Server Software:        
Server Hostname:        0.0.0.0
Server Port:            3000
 
Document Path:          /api/pulse
Document Length:        2 bytes
 
Concurrency Level:      5
Time taken for tests:   49.116267 seconds
Complete requests:      100
Failed requests:        0
Write errors:           0
Total transferred:      18800 bytes
HTML transferred:       200 bytes
Requests per second:    2.04 [#/sec] (mean)
Time per request:       2455.813 [ms] (mean)
Time per request:       491.163 [ms] (mean, across all concurrent requests)
Transfer rate:          0.37 [Kbytes/sec] received
 
Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.8      0       8
Processing:   442 2394 552.0   2379    3699
Waiting:      442 2394 552.1   2378    3699
Total:        442 2394 552.1   2379    3699
 
Percentage of the requests served within a certain time (ms)
  50%   2379
  66%   2661
  75%   2795
  80%   2835
  90%   2997
  95%   3481
  98%   3665
  99%   3699
 100%   3699 (longest request)

Obviously something is horribly wrong with that result.

StrokeDB

Tuesday, February 12th, 2008

Another competitor to the exciting CouchDB project has emerged – and this time it’s in pure Ruby, not god damn Erlang, so it’s very interesting to me. Check it out here.

By the way, another project I’ve talked about before, ThingFish, has been through countless revisions and code growth – there’s not a day that goes by when I’m not treated to 50 or so changed files in response to my svn up. But the focus of the project seems to have changed from being a document-centric database to being some kind of network file store. None of it works, as far as I can tell, and I have no idea what they are doing.. Thingfish developers: what on earth is your project for?

Anyway, exciting times for this class of database, which I strongly believe is the future of large-scale web apps.

hacking native UUID support into Schema:Dump

Monday, February 11th, 2008

Want to use PostgreSQL’s native UUID datatype but AR won’t let you use it with migrations?

/Library/Ruby/Gems/1.8/gems/activerecord-2.0.2/lib/active_record/connection_adapters/postgresql_adapter.rb:

# insert into def simplified_type(field_type), inside its case statement:
  # UUID type
  when /^uuid$/
    :uuid

# insert into the hash returned by def native_database_types:
  :uuid      => { :name => "uuid" },

Well, that’ll get your data OUT of the database, but AR will throw a fit when you try to load it back in unless you also add uuid into the range of column types TableDefinition will accept:

in /Library/Ruby/Gems/1.8/gems/activerecord-2.0.2
/lib/active_record/connection_adapters/abstract/schema_definition.rb:

# near def column(name, type, options = {}) in TableDefinition, add "uuid" to
# the list of column types that get their own t.xxx helper methods:
%w( string text integer float decimal datetime timestamp time date binary boolean uuid ).each do |column_type|

Now you can do this:

    t.uuid     "uuid",   :null => false

About the nastiest possible hack you can do, but it works both dumping out and loading back in. Here’s a patch if you don’t want to do it yourself, but no guarantees.

UPDATE:

And don’t forget to write your migrations like this to stop AR from inserting its “helpful” id columns with autoincrementing serials which your DB doesn’t need and can’t use:

  def self.up
    create_table :transactions, :id => false do |t|
      t.uuid     "id",  :null => false
      t.timestamps
    end
  end

UPDATE 2:

I now do not recommend doing this. It’s more trouble than it’s worth. There is very little you gain in forcing native UUID type in Postgres, and the complexity, hacks, loss of cross-platform compatibility and general annoyance you face are just not worth it.

Just use a string class for any UUIDs. Of course, the final hint on this page – the no-id switch for migrations – is still useful and you should use that.
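
In other words, something like this – a sketch of the string-based approach; the 36-character limit and the unique index are just sensible defaults, not gospel:

def self.up
  create_table :transactions, :id => false do |t|
    t.string   "id", :limit => 36, :null => false   # plain string UUID, portable everywhere
    t.timestamps
  end
  add_index :transactions, :id, :unique => true
end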

Strict databases are great for discipline

Friday, February 8th, 2008

Oh, boy. Nothing exposes one’s shitty programming habits like using a new, stricter database like PostgreSQL. All over the place I’m discovering, and fixing, code in which I’d demonstrated a lax attitude to data integrity – from trying to insert invalid data into columns I’d previously specified should not be null (datetimes, mostly) to declaring booleans with 1 or 0 instead of true and false.

It’s annoying, sure, but it’s also quite satisfying. A database that won’t take any shit is great for your programming discipline and must lead to better, more reliable code. I really can’t believe how much MySQL allows you to get away with, and I’m very glad my eyes were opened to the problems with the data I was storing before I had to manually repair million-line tables in production.

I have discovered a few annoyances with PostgreSQL, though – just to be fair. Its sequence system is pretty silly – after importing data you have to then go and explicitly set new sequence numbers for any autoincrement fields (PgSQL calls them “serials”). A useful facility but I think the default should just be to increment the sequence on import. I have of course written a script to automate this but still.
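
(The script is nothing clever – roughly this, assuming plain integer “id” serials; table and column names obviously vary:)

# bump every serial's sequence to match the highest imported id
conn = ActiveRecord::Base.connection
conn.tables.each do |table|
  next unless conn.columns(table).any? { |c| c.name == 'id' }
  conn.execute("SELECT setval(pg_get_serial_sequence('#{table}', 'id'), " +
               "(SELECT COALESCE(MAX(id), 1) FROM #{table}))")
end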

Another complaint is regarding the security, which if anything seems *too* strict. When you use Schemas to share logical databases between multiple users, any user but the owner of the “foreign” schema must have privileges granted explicitly on not only every table they plan to use in that schema, but on the aforementioned sequence tables too! I can understand the first, kind of, although there should be an option to grant privileges at the schema level, but the second is just silly – if you have rights to add new rows to a table, it is implied you should also have rights to increment the sequence. A needless bit of complexity.

That said, I’m overall delighted with the migration and everything was up and running fine. It’s not running at the moment, though, since I decided to completely excise my nasty multi-database hacks and simplify the data structures, removing all potential conflicts and separating tables into their logical homes. I’m about half way through doing that. And again, I’m really happy that I’m doing this now – what may take a day or two with development-stage databases might take weeks to do with live production servers – not to mention all the code that would have been built on top of the original suboptimal data structure. I’d actually been just about to write a whole lot more of that – something I’d actually been putting off because I knew what a mess the databases were, and was reluctant to dig myself any deeper a hole – but now I’m really looking forward to flying through what should be a much simpler, more intuitive job.

Switching to PostgreSQL

Thursday, February 7th, 2008

I have decided to move my development efforts from MySQL to PostgreSQL. Why? There are a number of reasons, but one main one:

Schemas.

The concept of the schema is pretty unknown in the MySQL world. I admit I’d pretty much forgotten they existed even though I’ve learnt about them in the past setting up other databases (MS SQL Server – actually a pretty good product). Anyway, in MySQL a schema is nothing but the structure of your database. In PostgreSQL, a schema is a powerful feature for creating multiple “views” into the same database, but with the ability to share between them.

Here’s an example. Say you have two applications, which you want to share a Users table but still have their own tables for “local” settings. Here are your options on MySQL:

  1. Put both applications into the same database, mixing the tables in with each other, perhaps with different prefixes for the tables, and overriding in the case of Users. Make Users a giant catch-all table with preferences for both apps, with a namespace for those fields inside the table. Pros: easy, can join into the shared table. Cons: Security is poor (I want to grant on a per-database level, not per-table), ugly as hell.
  2. Put each application inside its own database and make a third database for shared tables. Set your app to normally look inside its own database, and connect to the shared database when it needs to access the Users table. Pros: Better security compartmentalisation. Better looking, more intuitively named tables. Possibility of easier scaling since you can host the DBs on different machines. Cons: Loss of ability to join into the shared tables without nasty hacks. Constrains the kind of lookups you can do without severe performance penalties. More complex, loss of a single authoritative logfile.
  3. Like number 2 but replicating the shared tables into and out of both apps by any of a number of means. Pros: solves the problem nicely. Cons: Complex, nasty solution which seems to be asking for trouble.

For the record, I’ve tried all three. I’d settled on number 2 as the better of three evils.

Here’s what you would do on PostgreSQL:

Create a single database with 3 users and three schemas. Name the Users App1, App2 and Shared, and the Schemas likewise, granting access to the matching users. Create the shared tables in the Shared schema, and the App1 and App2 tables in their schemas. Note that as far as the Schemas are concerned, they are in their own little world – no namespace conflicts.

Now set App1 and App2’s search paths to read App1/App2,Shared. There you go – as far as App1 and App2 are concerned, the table is right there – no complexity required. Set your app to use the appropriate schema and you’re done. It’s like editing your path in unix.
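
In Rails terms it’s barely any work. A sketch of the one-off setup, using the App1/App2/Shared names from the example (and assuming the roles already exist):

conn = ActiveRecord::Base.connection
%w( app1 app2 shared ).each do |schema|
  conn.execute("CREATE SCHEMA #{schema} AUTHORIZATION #{schema}")
end

# each app then just picks its search path, e.g. for App1:
conn.execute("ALTER ROLE app1 SET search_path TO app1, shared")

The ActiveRecord Postgres adapter also understands a schema_search_path setting in database.yml, which amounts to the same thing on a per-connection basis.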

This might seem like overkill for such a small issue – but actually I’ve got a number of shared tables and more apps than that. The ability to use Schemas to solve all my problems here is a godsend, one that I wish I’d thought of earlier.

PostgreSQL has some other nice features as well, such as TableSpaces, which allows easy distribution of its storage by table onto different disks: you might want to put your ultra-high-activity Users table on the fast but expensive and small SCSI disk, for example, and the much larger but lower volume CatPictures table on a big, cheap SATA drive. There’s support for millisecond timestamps – MySQL, unbelievably, doesn’t go beyond 1 second accuracy. I’ve mentioned the much more strict SQL syntax requirements below – it’s taken me hours to clean up a lot of the junk MySQL happily allowed me to store (although I’m not going to claim it wasn’t my own fault; it was). And the new native data type of UUID makes me very happy, since I’ve come to believe that basically everything in a database of any importance should have a UUID (synchronising two databases on different continents primary keyed on an integer = nightmare, keyed on a UUID = doable). And the backup facilities are far improved – it can easily write out full transaction logs while live, allowing full recoverability – something I’d been pretty worried about with MySQL. And its user rights system seems much more intuitive than MySQL’s.

I’d resisted PgSQL for quite some time, but one by one those reasons have disappeared. For one, it always had a reputation for being slow – now pretty thoroughly disproved. It seemed quite alien and unfamiliar, and I had trouble even getting it running the last time I tried it. Well, either I’ve become more knowledgeable or it’s easier to install, because I had no problems at all this time. And I worried that I didn’t know how to configure it properly – I discarded this reason upon realising I don’t really know jack shit about configuring MySQL properly either, and MySQL has hundreds of opaque options I know next to nothing about. In fact, I’ve had more trouble with MySQL! Even now, I can’t seem to force the MySQL on my local machine here to reliably use UTF8 as its default language in all situations.

I definitely won’t be deleting MySQL or anything like that. MySQL works fine for this blog and a few others. My MediaWiki is installed on it, plus a number of other apps use it. I’m of the “if it ain’t broke, don’t fix it” school when it comes to things like this so I’m just going to run them concurrently for the time being. I have nothing against MySQL, it’s served me extremely well for years – but the Schemas feature was the straw that broke the camel’s back.

I still don’t know for sure if I’ll stick with it – a horrible problem may well emerge, but one thing is for sure: I’ll keep all my data in a portable format from now on. MySQL is extremely permissive (or, dare I say, “lax”) with its enforcement of SQL syntax requirements and 90% of the time it’s taken to migrate has been in ad hoc repairs to tables and data to get them to conform. Now that’s done, I’m going to keep it done, and it’s easy to move back to MySQL at any time should the need arise. A bit of subtle vendor lock-in by MySQL, or simply making it “easier” for developers? Well, my thoughts on violating the standards to make it “easier” are pretty well known (see: any previous rant about Internet Explorer) so I’ll stick with the standards every time.

In conclusion: if you have the time, need and inclination I’d recommend giving PgSQL 8.3 a try.

Rails: Dump and reload data, unicode safe

Wednesday, February 6th, 2008

Behold my rake tasks to dump, and then reload, the contents of your database – all in highly compatible schema.rb and YAML formats. A mere rake dump_utf will create two files in /db/dump/: firstly, an independent schema dump (doesn’t touch your proper one) and secondly a YAML file which is essentially a giant serialised hash of your DB. Running rake load_utf will import schema.rb and then all your data. And unlike every other script of this type I’ve seen around the net, it actually works, and is unicode safe.

Note that load_utf is extremely destructive and will write straight over your DB without asking further permission. However, if you haven’t run dump_utf it won’t find its files anyway, so not to worry.

Thanks to Tobias Luetke whose blog post was the starting point for this script, although there’s nothing left of it but the SQL Query now.

Needless to say, a great use of this tool is if you’re changing databases. Simply run dump_utf, modify database.yml to point to your new DB, then run load_utf – done.

Oh and I wouldn’t run it if your DB is too big, since it stores it all in memory. I may change that. And it doesn’t handle multiple databases either, I want to change that too ..

require 'ya2yaml'
 
task :dump_utf => :environment do
  sql  = "SELECT * FROM %s"
  skip_tables = ["schema_info"]
  dir = RAILS_ROOT + '/db/dump'
  FileUtils.mkdir_p(dir)
  FileUtils.chdir(dir)
 
  ActiveRecord::Base.establish_connection
 
  puts "Dumping Schema..."
 
  File.open("structure.rb", "w+") do |file|
    ActiveRecord::SchemaDumper.dump(ActiveRecord::Base.connection, file)
  end
 
  giant_hash = {} # we're gonna put EVERYTHING in here!
 
  (ActiveRecord::Base.connection.tables - skip_tables).each do |table_name|
    giant_hash[table_name] = ActiveRecord::Base.connection.select_all(sql % table_name) 
    puts "Reading #{table_name}..."
  end
  puts "Writing file..."
  File.open("backup.yml", 'w+') do |file|
    file.write giant_hash.ya2yaml
  end
  puts "Finished!"
end
 
task :load_utf => :environment do
  dir = RAILS_ROOT + '/db/dump/'
  FileUtils.chdir(dir)
 
  puts "loading schema..."
 
  file = "structure.rb"
  load(file)
  puts "done! now loading data ..."
 
  content_file = YAML.load_file(dir + "backup.yml")
 
  content_file.keys.each do |table_name|
    print "loading #{table_name}"
    content_file[table_name].each do |record|
      ActiveRecord::Base.connection.execute "INSERT INTO #{table_name} (#{record.keys.join(",")}) VALUES (#{record.values.collect { |value| ActiveRecord::Base.connection.quote(value) }.join(",")})", 'Insert Record'
      print "."
    end
    puts
  end
  puts "Finished!"  
end

Reserved words in PostgreSQL

Wednesday, February 6th, 2008

Trying out PostgreSQL? You might hit some troubles importing your MySQL datasets. MySQL is far more lenient about reserved words; you might find you’ve inadvertently named your columns in a way that’ll make PgSQL scream in pain.

Here’s the hard-to-find list, some obvious (SELECT, WHERE), some not (DESC – this one got me, ORDER – same, CURRENT_USER etc):

CREATE CURRENT_DATE CURRENT_ROLE CURRENT_TIME CURRENT_TIMESTAMP
CURRENT_TRANSFORM_GROUP_FOR_TYPE CURRENT_USER DATE DEFAULT
DEFERRABLE DESC DISTINCT DO ELSE END EXCEPT FALSE FOR FOREIGN
FROM GRANT GROUP HAVING IN INITIALLY INTERSECT INTO IS ISNULL
JOIN LEADING LEFT LIKE LIMIT LOCALTIME LOCALTIMESTAMP NEW NOT
NOTNULL NULL OFF OFFSET OLD ON ONLY OR ORDER OUTER OVERLAPS
PLACING PRIMARY REFERENCES RETURNING RIGHT SELECT
SESSION_USER SIMILAR SOME SYMMETRIC TABLE THEN TO TRAILING
TRUE UNION UNIQUE USER USING VERBOSE WHEN WHERE WITH

Force mongrel to delete stale pid files upon launch

Monday, February 4th, 2008

Mongrel, currently the mainstream Rails server, has a nasty habit of not deleting its PID files when it crashes and burns. This then occasionally stops automatic restarts because mongrel sees its old PID files, assumes it’s running (without a check) and refuses to start.

All that needs to be done is a startup check that the process mentioned in the PID file is actually running. This has been implemented in a patch here which I’ve integrated and successfully tested. If you’d like to use the file I generated, I’ve attached it below. Simply replace /path/to/gems/mongrel-1.1.3/bin/mongrel_rails with the attached file and run as normal. You’ll want to make a backup copy first, of course!
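
The gist of the check is tiny – something like this (a sketch of the idea, not the code from the actual patch):

# is the PID named in the file actually a live process?
def stale_pid_file?(pid_file)
  return false unless File.exist?(pid_file)
  pid = File.read(pid_file).to_i
  Process.kill(0, pid)    # signal 0 probes for existence without sending anything
  false                   # process is alive, so the PID file is genuine
rescue Errno::ESRCH
  true                    # no such process: the file is stale and safe to delete
end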

Also on the “things to do while hacking mongrel” list is to change the fucking version so it’s actually up to date as echoed in your console log. The file is lib/mongrel/const.rb line 68,

MONGREL_VERSION="1.1.3".freeze

which the maintainers obviously missed.

Mongrel is an aptly named bit of software, poorly maintained and with many long-standing patches which should have been included long ago. This particular patch fixes an annoying bug which shouldn’t exist any more – and harks back to March 2007, 10 months ago. That’s a pretty unforgivably long time not to include something as basic as this. And as far as I can tell the mongrel_rails program hasn’t changed significantly since last time – there’s no excuse for this, and other, patches to have not made it in by now. This is the third time I’ve manually patched it like this, and was only reminded to do so in the latest version (1.1.3) when I noticed – surprise! – it was again refusing to start with a stale PID.

I’d be more worried if there weren’t several promising alternative servers on the rise. Hopefully the Rails community’s dalliance with the very doglike mongrel won’t last too much longer…

mongrel_rails – full file

mongrel_stale_pid_file.patch – same thing in patch form. Run

$ cd /path/to/gems/mongrel-1.1.3/
$ patch -p0 < /path/to/mongrel_stale_pid_file.patch

Both from http://textsnippets.com/posts/show/931.

Testing (on MacOSX):

$ cd ~/Desktop
$ /usr/bin/ruby /usr/bin/mongrel_rails start -d -e development -p 3000 -a 127.0.0.1 -P /Users/sho/Desktop/mongrel.pid -c /rails/myapp
# mongrel starts ...
$ cp mongrel.pid another.pid
$ /usr/bin/ruby /usr/bin/mongrel_rails stop -P /Users/sho/Desktop/mongrel.pid
Sending TERM to Mongrel at PID 1605...Done.
$ cp another.pid mongrel.pid
$ /usr/bin/ruby /usr/bin/mongrel_rails start -d -e development -p 3000 -a 127.0.0.1 -P /Users/sho/Desktop/mongrel.pid -c /rails/myapp
** !!! PID file /Users/sho/Desktop/mongrel.pid exists, but is stale, and will be deleted so that this mongrel can run.
# mongrel starts as hoped .. success!

Multiple classes in CSS

Friday, January 25th, 2008

I don’t know how the hell I didn’t know this, but you can assign multiple classes to any element in CSS. This allows some really neat tricks if you basically treat classes as stackable “tags”. Just separate the classes with whitespace and you can add as many as you want.

To wit:

<div class="hide faq_answer">
  The answer to the question!
</div>

Imagine you’re writing an FAQ, and want to 1. have a certain style for the answer and 2. hide the answer unless the user clicks on the question. Before I knew this, I would have just had one class – say, faq_answer – and styled that in CSS. I then would have written something in JS which hid any instance of that class in the DOM.

That sucks because it’s conflating style and behaviour. And whenever I decided I needed to hide something else, lo and behold I have to add yet another JS rule to hide any instance of that .. before you know it you’ve got 30 JS functions hiding 30 different classes. It’s nasty, it’s fragile, it’s bad practice.

Now I know this, I’ll just write one JS handler to hide anything with the class “hide”, and I’ll style by adding a second (or third, or fourth!) class to the div. Classes can act like tags! Fantastic discovery.

I have no fucking idea how I didn’t come to know this before today.

MS to force IE7 upgrade

Wednesday, January 23rd, 2008

Apparently Microsoft is going to issue a forced upgrade to IE7 to all XP users on February 12.

All I can say is: about fucking time. This is the first step to the final removal of the scourge of IE6, and the need to cater for its utterly fucked up rendering, in web design. Wonderful.