letters
to an unknown audience
-----------------------
~
Functors and Double-Encoding / April 03, 2005

The other day I wrote about how typeful programming can help programmers detect mistakes about encodings, particularly the failure to choose a single interpretation for a string, whether it's encoded or raw. The objections that I've seen have fallen into two categories:

  • You shouldn't have to be precise about what is encoded and what isn't.
  • Using types is too limiting because any value can have multiple encodings applied.

The first objection is dealt with in a comment. I'll go into the second issue here. The roots of this objection, I think, lie in the kind of typeful thinking that C++ and Java train us for, which sees the type landscape as a finite hierarchy of "classes" that are each painstakingly defined. Some languages have a more expressive type system, though, and these can be harnessed to precisely define the kind of encoded string that should be allowed in a given location.

The way into all this is through the simple idea of double-encoding.

What if you want to double-encode a raw string? First of all, you should do it deliberately, and not just to get some kind of extra protection against later decodings. That's muddy thinking, though it's common enough in practice.

To encode something that's already encoded, you need a function like encode_encoded : encodedstring -> encodedencodedstring. The type of encode_encoded, being different from plain encode, suggests that the function, too, might be different. But this would be frustrating: our encode actually does the right thing on the underlying bytes. Why can't we just use that? Is type analysis preventing us from doing something easy?
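To make the frustration concrete, here is a sketch of that duplicated world in Haskell. Every name in it (RawString, EncodedString, escape, and so on) is a hypothetical stand-in, and escape is a toy routine, not a real encoder:

    -- One unrelated type per level of encoding.
    newtype RawString            = RawString String
    newtype EncodedString        = EncodedString String
    newtype EncodedEncodedString = EncodedEncodedString String

    -- Toy byte-level escaping; the details don't matter here.
    escape :: String -> String
    escape = concatMap (\c -> if c == '&' then "&amp;" else [c])

    encode :: RawString -> EncodedString
    encode (RawString s) = EncodedString (escape s)

    -- Same work on the underlying bytes, but the types force us
    -- to write a second, otherwise identical function.
    encode_encoded :: EncodedString -> EncodedEncodedString
    encode_encoded (EncodedString s) = EncodedEncodedString (escape s)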

No; we just need type modifiers.

Imagine that instead of primitive types rawstring and encodedstring, we instead had a modifier, encoded, which we could apply to various types (types like string!). Instead of writing rawstring we'd write just string and instead of encodedstring (a primitive) we'd write encoded string (that is, the string type, modified by the encoded modifier). This makes the relationship clear to the compiler, so it knows that encoded string is a lot like string—just modified a bit. Also, we can modify types that are already modified, so as to produce encoded encoded string.
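In Haskell, one way to approximate such a modifier is a type parameterized by the thing it modifies; applying it twice is then perfectly ordinary. A sketch (the name Encoded is mine, not from any library):

    -- 'Encoded' modifies any type a, the way the modifier in the
    -- text turns string into encoded string.
    newtype Encoded a = Encoded a

    type EncodedString        = Encoded String            -- encoded string
    type EncodedEncodedString = Encoded (Encoded String)  -- encoded encoded string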

We now have a way to mint new types. But that doesn't give us the code for free. Can we get some leverage on turning encode into encode_encoded? Perhaps so.

Let's give a general method for wrapping any function f : X -> Y in encoded, producing a function encoded f : encoded X -> encoded Y. This is easy, since we can just forget the encoding, apply f, and then inject the result to get back into the "encoded" world.

   function encoded(f : X -> Y)
   {
       return function(x : encoded X) {
                  return inject(f(forget x));
              };
   }

(The pseudo-code might be confusing; this is a function encoded that takes one argument, a function called f, and returns a new function, which operates like f but within the encoding. I've given this function (this function modifier) the same name as the type modifier because I like the idea of writing encoded f as a way to refer to the thing that does what f does, but on encoded strings instead of regular ones.)
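The same definition in Haskell might look like this; it's a sketch that repeats the Encoded wrapper from the earlier snippet so it stands alone, with forget and inject being nothing more than unwrapping and wrapping the constructor:

    newtype Encoded a = Encoded a

    forget :: Encoded a -> a
    forget (Encoded x) = x

    inject :: a -> Encoded a
    inject = Encoded

    -- Lift an ordinary function into the "encoded" world:
    -- forget the encoding, apply f, then inject the result.
    encoded :: (a -> b) -> (Encoded a -> Encoded b)
    encoded f = inject . f . forget

This is exactly the shape of fmap from Haskell's Functor class, specialized to the Encoded wrapper.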

The ability to define a type modifier like this (called, to use the sexy term, a functor) exists in languages in the ML family and in Haskell, plus probably some others.

In this example, I've assumed one particular encoding, but if we have several different encoding operations, like urlencode and htmlencode, then we could define a functor for each one and then construct types that chain these together, such as urlencoded string, htmlencoded urlencoded string, urlencoded urlencoded htmlencoded string, and so on.
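Here is a sketch of that with two wrappers. The escaping bodies are deliberately toy versions, not real urlencode and htmlencode implementations:

    newtype UrlEncoded  a = UrlEncoded  a
    newtype HtmlEncoded a = HtmlEncoded a

    -- Toy escapers standing in for the real urlencode and htmlencode.
    urlencode :: String -> UrlEncoded String
    urlencode = UrlEncoded . concatMap (\c -> if c == ' ' then "%20" else [c])

    htmlencode :: String -> HtmlEncoded String
    htmlencode = HtmlEncoded . concatMap (\c -> if c == '&' then "&amp;" else [c])

    -- One "map" per wrapper, exactly like 'encoded' above.
    mapUrl :: (a -> b) -> UrlEncoded a -> UrlEncoded b
    mapUrl f (UrlEncoded x) = UrlEncoded (f x)

    mapHtml :: (a -> b) -> HtmlEncoded a -> HtmlEncoded b
    mapHtml f (HtmlEncoded x) = HtmlEncoded (f x)

    -- The chained types from the text:
    type T1 = UrlEncoded String
    type T2 = HtmlEncoded (UrlEncoded String)
    type T3 = UrlEncoded (UrlEncoded (HtmlEncoded String))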

Let's review what we've accomplished. The "map" function (the one above called encoded) for each of these types allows us to apply any function within the encoding (even though that function may not have been written with the encoded type in mind), and we can still maintain the rigors of our type analysis, never trying to decode a raw string or encode an encoded string (unless that's what we mean to do). So for any function like urlencode, we don't have to rewrite it a hundred times with different type annotations. Note also that the "inject" and "forget" functions are no-ops on the underlying bytes. As a consequence, the encoded function above is effectively a no-op: it just dispatches to the function f, so a smart compiler can compile it away. (In fact, this depends on whether the type information is needed at run-time, but in many languages it is not and can be erased. In any case, the map function is not expensive.)
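In Haskell specifically, that erasure claim can be made concrete: a newtype has no run-time representation, so inject and forget compile to nothing, and the lifted function really is just f at a different type. A minimal sketch, again repeating the Encoded wrapper so it stands alone:

    import Data.Coerce (coerce)

    newtype Encoded a = Encoded a

    -- Because Encoded is a newtype, coerce changes only the type,
    -- never the bytes; the "map" costs nothing at run time.
    encoded :: (a -> b) -> Encoded a -> Encoded b
    encoded = coerce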

So, type modifiers allow us to mint infinite collections of new types, and cheaply, making it easy to annotate variables with just the right flavor of encoding. And when each variable has a precise type, we can never apply the wrong interpretation to it.

Further objections?
