Weird output from Digest::MD5 in ruby

Any Ruby programmers who are reading this, I’m experiencing a strange issue regarding Digest::MD5. Let me show you:

>> require 'digest/md5'
=> true
>> Digest::MD5.digest "Les Rhythmes Digitales"
=> "\213U\3601\260%<j\267-\343(\213I\030\347"

What the fuck is that huge escaped thing? A unicode issue?

Check out the same from bash:

$ md5 -s "Les Rhythmes Digitales"
MD5 ("Les Rhythmes Digitales") = 8b55f031b0253c6ab72de3288b4918e7

Now that looks more like what I expect from an MD5 hash. Is Digest::MD5 mangling the text into some kind of weird invalid unicode?

Let’s try with $KCODE set:

>> $KCODE = "UTF8"
=> "UTF8"
>> require 'digest/md5'
=> true
>> Digest::MD5.digest "Les Rhythmes Digitales"
=> "\213U?1?%<j\267-?(?I\030\347"

Great. Any different in 1.9?

irb(main):003:0> Digest::MD5.digest "Les Rhythmes Digitales"
=> "\x8BU\xF01\xB0%<j\xB7-\xE3(\x8BI\x18\xE7"

Different again. At least I can see the characters in there, though. This is causing some pain.

Am I doing something hopelessly wrong? Somewhere in all these, some character encoding crap is going down. I can’t believe I’m the only one having these problems, and they render it difficult to use hashed passwords. I am working around the issue by shelling out to bash for now, but would like to get it fixed.

UPDATE: About 2 minutes after writing that, I realised I need to use Digest::MD5.hexdigest, not plain digest. I have no idea what the difference is supposed to be, but oh well, lesson learned. Apparently writing complaints on this blog helps me solve problems, so expect it to continue.

>> Digest::MD5.hexdigest "Les Rhythmes Digitales"
=> "8b55f031b0253c6ab72de3288b4918e7"
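For anyone landing here with the same confusion, here is a minimal sketch of how the two methods relate. It assumes nothing beyond the standard digest library; the variable names are just for illustration. The raw digest is 16 bytes, and hex-encoding those bytes by hand should give the same string that hexdigest returns.

```ruby
require 'digest/md5'

input = "Les Rhythmes Digitales"

raw = Digest::MD5.digest(input)      # raw digest: a 16-byte binary string
hex = Digest::MD5.hexdigest(input)   # the same digest as 32 hex characters

# "H*" unpacks every byte as two hex digits, high nibble first,
# which is the same encoding hexdigest applies.
hex_from_raw = raw.unpack("H*").first

puts hex
puts hex_from_raw == hex
```

So digest is the one to use when storing or comparing bytes, and hexdigest when the hash has to survive being printed, pasted or put in a text column.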

2 Responses to “Weird output from Digest::MD5 in ruby”

  1. Wincent Colaiuta Says:

    I have no idea what the difference is supposed to be

    One is the raw binary data of the digest, 128 bits of random-looking data, and exceedingly unlikely to be in any valid text encoding. So when you ask Ruby to display it as a string, there end up being quite a few escape sequences in it for the “invalidly encoded” chars in the sequence.

    The other one is a hex representation of those bits. Each hex digit represents 4 bits, so there are 32 digits (characters) in the string. Notice that the information density has gone down here; each character (8 bits) only actually encodes 4 bits of information. The total number of bits used to “store” the 128 bits of the hash is now 256 bits.

    Hope that’s clear.
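
The bit counts in the comment above can be checked directly. This is only a quick sketch using the standard digest library, with variable names chosen for illustration:

```ruby
require 'digest/md5'

raw = Digest::MD5.digest("Les Rhythmes Digitales")
hex = Digest::MD5.hexdigest("Les Rhythmes Digitales")

raw_bits = raw.bytesize * 8   # 16 bytes of digest
hex_bits = hex.bytesize * 8   # 32 one-byte hex characters

puts "raw: #{raw.bytesize} bytes = #{raw_bits} bits"
puts "hex: #{hex.bytesize} bytes = #{hex_bits} bits"

# Bits of digest carried per 8-bit character of the hex string.
puts raw_bits / hex.bytesize
```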

  2. Sho Says:

    Yes, that’s clear. I’d mostly come to that conclusion after writing, sorry for not updating it.

    I guess my misunderstanding came from use of MD5 on the command line, which defaults to hex output. I’d mistakenly thought that the output hash of MD5 was hex by definition; certainly, that’s all I’d ever encountered. Obviously not!

    The point you raise about the 4-bit “loss” in information density is very familiar to me, actually. I believe I present some algorithms elsewhere around here for improving the density of hex-constrained UUIDs (or UUID-like strings, such as the MD5 hex digest) by converting them to various degrees of “safe” characters. You can get it up to 6 bits of useful data per 8-bit character while preserving (my preferred level of) URL-safety, for example. Drops to slightly over 5 bits if you want to be domain- or email-safe.
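
The algorithms mentioned in the comment above are not shown here, so the following is not that scheme. As a rough stand-in, standard URL-safe Base64 (the `-` and `_` alphabet) illustrates the same idea of packing about 6 bits into each character. The sketch uses only core `pack`/`unpack` plus the digest library, and the variable names are illustrative.

```ruby
require 'digest/md5'

raw = Digest::MD5.digest("Les Rhythmes Digitales")   # 16 raw bytes

# "m0" is Base64 without line breaks; swap in the URL-safe characters
# and drop the "=" padding, which carries no information.
encoded = [raw].pack("m0").tr("+/", "-_").delete("=")

puts encoded
puts encoded.length                         # characters needed for 128 bits
puts (raw.bytesize * 8.0 / encoded.length)  # bits of digest per character

# Round trip: restore the standard alphabet and padding, then decode.
decoded = (encoded.tr("-_", "+/") + "==").unpack("m").first
puts decoded == raw
```

Compared with the 32-character hex string, this needs 22 characters for the same 128 bits, at the cost of being case-sensitive.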