Posts Tagged ‘database’

N+1, where n > 2,000,000

Monday, June 30th, 2008

I unthinkingly executed an N+1 database operation on a table with in excess of 2 million records, doing a lookup on another table for every .. single .. one.
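In ActiveRecord terms, the shape of the blunder is something like this (model and column names are invented – the real ones don’t matter):

Record.find(:all).each do |record|                 # one query returning ~2,000,000 rows
  lookup = Lookup.find_by_record_id(record.id)     # ...plus one more query for every single row
  record.update_attribute(:value, lookup.value) if lookup && lookup.value != record.value
end
# Eager loading the association (:include => :lookup) – or doing the whole thing as one
# join-based UPDATE – would have collapsed most of those round trips.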

So let’s see .. I estimate it will need to do about 100k writes. So that’s 2*2,000,000 reads, then 100,000 writes .. a mere 4,100,000 accesses in total. Even better, it’s MySQL, and I don’t have the C adapter for that installed – so every access does a round trip through the treacle-slow pure Ruby adapter.

Oops. If it’s still going when I wake up tomorrow I might cancel it!

UPDATE: It was actually done in maybe 6 hours (wasn’t paying close attention), which I was pretty impressed by. I guess it’s not that much – MySQL probably had the whole thing in memory after a while and as I said, it was mostly reads. Still, 200 or so queries/second – not bad.

StrokeDB

Tuesday, February 12th, 2008

Another competitor to the exciting CouchDB project has emerged – and this time it’s in pure Ruby, not god damn Erlang, so it’s very interesting to me. Check it out here.

By the way, another project I’ve talked about before, ThingFish, has been through countless revisions and code growth – there’s not a day that goes by when I’m not treated to 50 or so changed files in response to my svn up. But the focus of the project seems to have changed from being a document-centric database to being some kind of network file store. None of it works, as far as I can tell, and I have no idea what they are doing.. Thingfish developers: what on earth is your project for?

Anyway, exciting times for this class of database, which I strongly believe is the future of large-scale web apps.

Strict databases are great for discipline

Friday, February 8th, 2008

Oh, boy. Nothing exposes one’s shitty programming habits like using a new, stricter database like PostgreSQL. All over the place I’m discovering, and fixing, code in which I’d demonstrated a lax attitude to data integrity – from trying to insert invalid data into columns I’d previously specified should not be null (datetimes, mostly) to declaring booleans with 1 or 0 instead of true and false.
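A couple of illustrations of the sort of thing I mean (table and column names are invented, and the SQL is wrapped in Ruby only to match the other snippets on this page):

conn = ActiveRecord::Base.connection
# MySQL, in its default non-strict mode, swallows both of these; PostgreSQL rejects them outright.
conn.execute "INSERT INTO posts (title, created_at) VALUES ('hi', '0000-00-00 00:00:00')" # not a real datetime
conn.execute "UPDATE posts SET published = 1 WHERE id = 42"                               # integer into a boolean column
# What PostgreSQL insists on instead:
conn.execute "UPDATE posts SET published = true WHERE id = 42"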

It’s annoying, sure, but it’s also quite satisfying. A database that won’t take any shit is great for your programming discipline and must lead to better, more reliable code. I really can’t believe how much MySQL allows you to get away with, and I’m very glad my eyes were opened to the problems with the data I was storing before I had to manually repair million-row tables in production.

I have discovered a few annoyances with PostgreSQL, though – just to be fair. Its sequence system is pretty silly – after importing data you then have to go and explicitly set new sequence numbers for any autoincrement fields (PgSQL calls them “serials”). Sequences are a useful facility, but I think the default should just be to advance them on import. I have of course written a script to automate this, but still.
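Something along these lines – not the actual script, and it assumes the default table_id_seq naming that serial primary keys get:

# Advance every id sequence past the highest imported id.
conn = ActiveRecord::Base.connection
conn.tables.each do |table|
  next unless conn.columns(table).collect { |c| c.name }.include?("id")
  conn.execute "SELECT setval('#{table}_id_seq', (SELECT COALESCE(MAX(id), 1) FROM #{table}))"
end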

Another complaint is regarding the security, which if anything seems *too* strict. When you use Schemas to share logical databases between multiple users, any user but the owner of the “foreign” schema must have privileges granted explicitly not only on every table they plan to use in that schema, but on the aforementioned sequences too! I can understand the first, kind of, although there should be an option to grant privileges at the schema level, but the second is just silly – if you have rights to add new rows to a table, it is implied you should also have rights to increment the sequence. A needless bit of complexity.
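For reference, this is the boilerplate I’m talking about, per table (names invented; run it as the schema’s owner or a superuser):

conn = ActiveRecord::Base.connection
# Let app1 see the shared schema at all...
conn.execute "GRANT USAGE ON SCHEMA shared TO app1"
# ...then grant on the table itself...
conn.execute "GRANT SELECT, INSERT, UPDATE, DELETE ON shared.users TO app1"
# ...and on its sequence too, or inserts relying on the default nextval() will fail.
conn.execute "GRANT USAGE ON SEQUENCE shared.users_id_seq TO app1"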

That said, I’m overall delighted with the migration, and everything was up and running fine. It isn’t right now, since I decided to completely excise my nasty multi-database hacks and simplify the data structures, removing all potential conflicts and separating tables into their logical homes – I’m about half way through doing that. And again, I’m really happy that I’m doing this now: what may take a day or two with development-stage databases might take weeks with live production servers – not to mention all the code that would have been built on top of the original suboptimal data structure. I’d been just about to write a whole lot more of that – something I’d actually been putting off because I knew what a mess the databases were, and was reluctant to dig myself any deeper into a hole – but now I’m really looking forward to flying through what should be a much simpler, more intuitive job.

Switching to PostgreSQL

Thursday, February 7th, 2008

I have decided to move my development efforts from MySQL to PostgreSQL. Why? There are a number of reasons, but one main one:

Schemas.

The concept of the schema is pretty much unknown in the MySQL world. I admit I’d pretty much forgotten they existed, even though I learnt about them in the past while setting up other databases (MS SQL Server – actually a pretty good product). Anyway, in MySQL a schema is nothing but the structure of your database. In PostgreSQL, a schema is a powerful feature for creating multiple “views” into the same database, while still being able to share tables between them.

Here’s an example. Say you have two applications which should share a Users table but still have their own tables for “local” settings. Here are your options on MySQL:

  1. Put both applications into the same database, mixing the tables in with each other, perhaps with different prefixes for the tables, and overriding in the case of Users. Make Users a giant catch-all table with preferences for both apps, with a namespace for those fields inside the table. Pros: easy, can join into the shared table. Cons: Security is poor (I want to grant on a per-database level, not per-table), ugly as hell.
  2. Put each application inside its own database and make a third database for shared tables. Set your app to normally look inside its own database, and connect to the shared database when it needs to access the Users table. Pros: Better security compartmentalisation. Better looking, more intuitively named tables. Possibility of easier scaling since you can host the DBs on different machines. Cons: Loss of ability to join into the shared tables without nasty hacks. Constrains the kind of lookups you can do without severe performance penalties. More complex, loss of a single authoritative logfile.
  3. Like number 2 but replicating the shared tables into and out of both apps by any of a number of means. Pros: solves the problem nicely. Cons: Complex, nasty solution which seems to be asking for trouble.

For the record, I’ve tried all three. I’d settled on number 2 as the least of the three evils.

Here’s what you would do on PostgreSQL:

Create a single database with three users and three schemas. Name the users App1, App2 and Shared, and the schemas likewise, granting access to the matching users. Create the shared tables in the Shared schema, and the App1 and App2 tables in their own schemas. Note that as far as the schemas are concerned, they are each in their own little world – no namespace conflicts.

Now set App1’s and App2’s search paths to read App1,Shared and App2,Shared respectively. There you go – as far as App1 and App2 are concerned, the Users table is right there – no complexity required. Point each app at the appropriate schema and you’re done. It’s like editing your PATH in Unix.
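Spelled out, the setup looks something like this (run as a superuser; it’s wrapped in Ruby only to match the other snippets here – psql works just as well):

conn = ActiveRecord::Base.connection
# One database, three roles, three schemas:
["app1", "app2", "shared"].each do |name|
  conn.execute "CREATE ROLE #{name} LOGIN"
  conn.execute "CREATE SCHEMA #{name} AUTHORIZATION #{name}"
end
# Each app looks in its own schema first and falls through to Shared:
conn.execute "ALTER ROLE app1 SET search_path = app1, shared"
conn.execute "ALTER ROLE app2 SET search_path = app2, shared"

I believe ActiveRecord’s PostgreSQL adapter will also take a schema_search_path setting in database.yml, if you’d rather configure the search path per app than per role.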

This might seem like overkill for such a small issue – but actually I’ve got a number of shared tables and more apps than that. The ability to use Schemas to solve all my problems here is a godsend, one that I wish I’d thought of earlier.

PostgreSQL has some other nice features as well, such as tablespaces, which allow easy distribution of its storage by table onto different disks: you might want to put your ultra-high-activity Users table on the fast but expensive and small SCSI disk, for example, and the much larger but lower-volume CatPictures table on a big, cheap SATA drive. There’s support for fractional-second timestamps – MySQL, unbelievably, doesn’t go beyond one-second accuracy. I’ve mentioned the much stricter SQL syntax requirements below – it’s taken me hours to clean up a lot of the junk MySQL happily allowed me to store (although I’m not going to claim it wasn’t my own fault; it was). And the new native UUID data type makes me very happy, since I’ve come to believe that basically everything in a database of any importance should have a UUID (synchronising two databases on different continents primary keyed on an integer = nightmare; keyed on a UUID = doable). And the backup facilities are far improved – it can easily write out full transaction logs while live, allowing full recoverability – something I’d been pretty worried about with MySQL. And its user rights system seems much more intuitive than MySQL’s.
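To make a couple of those concrete (table names and the disk path are invented for the example):

conn = ActiveRecord::Base.connection
# Pin the hot Users table to the fast SCSI disk via a tablespace...
conn.execute "CREATE TABLESPACE fast_scsi LOCATION '/mnt/scsi/pgdata'"
conn.execute "ALTER TABLE users SET TABLESPACE fast_scsi"
# ...and the native uuid column type (new in 8.3). Generate the values in the app,
# or via the uuid-ossp contrib module if it's installed.
conn.execute "ALTER TABLE cat_pictures ADD COLUMN guid uuid"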

I’d resisted PgSQL for quite some time, but one by one my reasons for doing so have disappeared. For one, it always had a reputation for being slow – now pretty thoroughly disproved. It seemed quite alien and unfamiliar, and I had trouble even getting it running the last time I tried it. Well, either I’ve become more knowledgeable or it’s easier to install, because I had no problems at all this time. And I worried that I didn’t know how to configure it properly – I discarded this reason upon realising I don’t really know jack shit about configuring MySQL properly either, and MySQL has hundreds of opaque options I know next to nothing about. In fact, I’ve had more trouble with MySQL! Even now, I can’t seem to force the MySQL on my local machine here to reliably use UTF8 as its default character set in all situations.

I definitely won’t be deleting MySQL or anything like that. MySQL works fine for this blog and a few others. My MediaWiki is installed on it, plus a number of other apps use it. I’m of the “if it ain’t broke, don’t fix it” school when it comes to things like this so I’m just going to run them concurrently for the time being. I have nothing against MySQL, it’s served me extremely well for years – but the Schemas feature was the straw that broke the camel’s back.

I still don’t know for sure if I’ll stick with it – a horrible problem may well emerge, but one thing is for sure: I’ll keep all my data in a portable format from now on. MySQL is extremely permissive (or, dare I say, “lax”) with its enforcement of SQL syntax requirements, and 90% of the time the migration has taken has been spent on ad hoc repairs to tables and data to get them to conform. Now that’s done, I’m going to keep it done, and it’s easy to move back to MySQL at any time should the need arise. A bit of subtle vendor lock-in by MySQL, or simply making it “easier” for developers? Well, my thoughts on violating the standards to make it “easier” are pretty well known (see: any previous rant about Internet Explorer) so I’ll stick with the standards every time.

In conclusion: if you have the time, need and inclination I’d recommend giving PgSQL 8.3 a try.

Rails: Dump and reload data, unicode safe

Wednesday, February 6th, 2008

Behold my rake tasks to dump, and then reload, the contents of your database – all in highly compatible schema.rb and YAML formats. A mere rake dump_utf will create two files in /db/dump/: firstly, an independent schema dump (doesn’t touch your proper one), and secondly a YAML file which is essentially a giant serialised hash of your DB. Running rake load_utf will import schema.rb and then all your data. And unlike every other script of this type I’ve seen around the net, it actually works, and is unicode safe.

Note that load_utf is extremely destructive and will write straight over your DB without asking further permission. However, if you haven’t run dump_utf it won’t find its files anyway, so not to worry.

Thanks to Tobias Luetke whose blog post was the starting point for this script, although there’s nothing left of it but the SQL Query now.

Needless to say, a great use of this tool is if you’re changing databases. Simply run dump_utf, modify database.yml to point to your new DB, then run load_utf – done.

Oh and I wouldn’t run it if your DB is too big, since it stores it all in memory. I may change that. And it doesn’t handle multiple databases either, I want to change that too ..

require 'ya2yaml'  # the ya2yaml gem – unicode-safe YAML serialisation

task :dump_utf => :environment do
  sql = "SELECT * FROM %s"
  skip_tables = ["schema_info"]
  dir = RAILS_ROOT + '/db/dump'
  FileUtils.mkdir_p(dir)
  FileUtils.chdir(dir)

  ActiveRecord::Base.establish_connection

  puts "Dumping schema..."

  # An independent schema dump – your regular db/schema.rb is left untouched.
  File.open("structure.rb", "w+") do |file|
    ActiveRecord::SchemaDumper.dump(ActiveRecord::Base.connection, file)
  end

  giant_hash = {} # we're gonna put EVERYTHING in here!

  (ActiveRecord::Base.connection.tables - skip_tables).each do |table_name|
    puts "Reading #{table_name}..."
    giant_hash[table_name] = ActiveRecord::Base.connection.select_all(sql % table_name)
  end

  puts "Writing file..."
  File.open("backup.yml", "w+") do |file|
    file.write giant_hash.ya2yaml
  end
  puts "Finished!"
end

task :load_utf => :environment do
  dir = RAILS_ROOT + '/db/dump/'
  FileUtils.chdir(dir)

  puts "Loading schema..."

  # Runs the dumped structure.rb, recreating every table – destructive!
  file = "structure.rb"
  load(file)
  puts "Done! Now loading data..."

  content_file = YAML.load_file(dir + "backup.yml")

  content_file.keys.each do |table_name|
    print "Loading #{table_name}"
    content_file[table_name].each do |record|
      columns = record.keys.join(",")
      values  = record.values.collect { |value| ActiveRecord::Base.connection.quote(value) }.join(",")
      ActiveRecord::Base.connection.execute "INSERT INTO #{table_name} (#{columns}) VALUES (#{values})", 'Insert Record'
      print "."
    end
    puts
  end
  puts "Finished!"
end