Any Ruby programmers who are reading this, I’m experiencing a strange issue regarding Digest::MD5. Let me show you:
>> require 'digest/md5'
=> true
>> Digest::MD5.digest "Les Rhythmes Digitales"
=> "\213U\3601\260%<j\267-\343(\213I\030\347"
What the fuck is that huge escaped thing? A Unicode issue?
Check out the same from bash:
$ md5 -s "Les Rhythmes Digitales"
MD5 ("Les Rhythmes Digitales") = 8b55f031b0253c6ab72de3288b4918e7
Now that looks more like what I expect from an MD5 hash. Is Digest::MD5 mangling the text into some kind of weird invalid unicode?
Let’s try with $KCODE set:
>> $KCODE = "UTF8"
=> "UTF8"
>> require 'digest/md5'
=> true
>> Digest::MD5.digest "Les Rhythmes Digitales"
=> "\213U?1?%<j\267-?(?I\030\347"
Great. Any different in 1.9?
irb(main):003:0> Digest::MD5.digest "Les Rhythmes Digitales"
=> "\x8BU\xF01\xB0%<j\xB7-\xE3(\x8BI\x18\xE7"
Different again. At least I can see the characters in there, though. This is causing some pain.
Am I doing something hopelessly wrong? Somewhere in all these, some character encoding crap is going down. I can’t believe I’m the only one having these problems, and they render it difficult to use hashed passwords. I am working around the issue by shelling out to bash for now, but would like to get it fixed.
UPDATE: About 2 minutes after writing that, I realised I need to use Digest::MD5.hexdigest, not plain digest. I have no idea what the difference is supposed to be, but oh well, lesson learned. Apparently writing complaints on this blog helps me solve problems, so expect it to continue.
>> Digest::MD5.hexdigest "Les Rhythmes Digitales"
=> "8b55f031b0253c6ab72de3288b4918e7"
Tags: ruby
January 6th, 2009 at 7:16 am
One is the raw binary data of the digest, 128 bits of random-looking data, and exceedingly unlikely to be valid in any text encoding. So when you ask Ruby to display it as a string, there end up being quite a few escape sequences in it for the “invalidly encoded” bytes in the sequence.
The other one is a hex representation of those bits. Each hex digit represents 4 bits, so there are 32 digits (characters) in the string. Notice that the information density has gone down here; each character (8 bits) only actually encodes 4 bits of information. The total number of bits used to “store” the 128 bits of the hash is now 256 bits.
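As a minimal sketch of that relationship (variable names here are only for illustration), the hex form can be produced from the raw bytes with unpack('H*') and turned back with pack('H*'). The last lines also check that the octal-escaped string from 1.8 and the hex-escaped string from 1.9 quoted in the post are the same 16 bytes, just displayed differently:

```ruby
require 'digest/md5'

text = "Les Rhythmes Digitales"

raw = Digest::MD5.digest(text)     # raw binary digest: 16 bytes (128 bits)
hex = Digest::MD5.hexdigest(text)  # hex representation: 32 characters, 4 bits each

puts raw.bytesize  # 16
puts hex.length    # 32

# Hex-encoding the raw bytes by hand gives the same string as hexdigest.
hex_from_raw = raw.unpack('H*').first
puts hex_from_raw == hex

# Going the other way recovers the raw bytes from the hex string.
raw_from_hex = [hex].pack('H*')
puts raw_from_hex == raw

# The two escape styles shown in the post spell out the same bytes.
octal_style = "\213U\3601\260%<j\267-\343(\213I\030\347"  # as displayed by 1.8
hex_style   = "\x8BU\xF01\xB0%<j\xB7-\xE3(\x8BI\x18\xE7"  # as displayed by 1.9
puts octal_style.bytes == hex_style.bytes
```

Comparing with .bytes sidesteps any difference in the encoding labels attached to the strings.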
Hope that’s clear.
January 6th, 2009 at 7:33 am
Yes, that’s clear. I’d mostly come to that conclusion after writing, sorry for not updating it.
I guess my misunderstanding came from use of MD5 on the command line, which defaults to hex output. I’d mistakenly thought that the output hash of MD5 was hex by definition; certainly, that’s all I’d ever encountered. Obviously not!
The point you raise about the 4-bit “loss” in information density is very familiar to me, actually. I believe I present some algorithms elsewhere around here for improving the density of hex-constrained UUIDs (or UUID-like strings, such as the MD5 hex digest) by converting them to various degrees of “safe” characters. You can get it up to 6 bits of useful data per 8-bit character while preserving (my preferred level of) URL-safety, for example. Drops to slightly over 5 bits if you want to be domain- or email-safe.
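For illustration only, here is a rough sketch of two standard encodings that land near those densities. These are not the algorithms referred to above, and the choice of alphabets is an assumption: URL-safe Base64 (6 bits per character) built with pack('m0'), and base 36 using only digits and lowercase letters (about 5.17 bits per character), which is one guess at what "domain- or email-safe" might allow.

```ruby
require 'digest/md5'

text = "Les Rhythmes Digitales"
raw  = Digest::MD5.digest(text)     # 16 raw bytes
hex  = Digest::MD5.hexdigest(text)  # 32 hex characters

# URL-safe Base64: 6 bits per character. pack('m0') emits standard Base64
# without line breaks; swap '+' and '/' for '-' and '_', drop '=' padding.
url_safe = [raw].pack('m0').tr('+/', '-_').delete('=')
puts url_safe.length  # 22 characters for a 16-byte digest

# Decoding: restore the padding and the standard alphabet, then unpack.
padded  = url_safe + '=' * ((4 - url_safe.length % 4) % 4)
decoded = padded.tr('-_', '+/').unpack('m0').first
puts decoded == raw

# Base 36: digits and lowercase letters only, so it is case-insensitive.
# A 128-bit value needs at most 25 base-36 characters.
base36 = hex.to_i(16).to_s(36)
puts base36.length

# Round trip back to the 32-character hex digest (re-adding leading zeros).
hex_again = base36.to_i(36).to_s(16).rjust(32, '0')
puts hex_again == hex
```

Both forms are shorter than the 32-character hex digest while carrying the same 128 bits.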