URL-encoding is an operation that takes something called a string and turns it into something different: a URL-encoded string. This operation never trips up; there is no string you can feed it that will make it say “I can't encode that.” At the same time, every encoded string it emits is so perfectly encoded that it can be fed to another operation (decode) which emits the original string, exactly.
Encode and decode are a perfect team of cook and eater: there's no way for the eater to digest something that the cook can't make, nor can the cook make something that the eater can't eat.
Unless you mix up their roles, and feed the eater a recipe, or point the cook at a finished meal. These guys don't know what to do with the kind of stuff that the other one routinely consumes.
With strings, similarly, you should not feed an encoded string to the encoder. Why not? Because it doesn't need to be encoded; therefore you probably made a mistake somewhere. A typeful programming language with distinct types for encoded and unencoded strings would force you to recognize this error up front. The encoder operation could have the type
encoder : rawstring -> encodedstring
meaning that it only accepts a raw string as an argument and only emits encoded strings. If you then try to apply it twice to a raw string:
temp = encoder(str); encoded = encoder(temp);
The language would balk at you, saying, “function 'encoder' cannot have type encodedstring -> encodedencodedstring!” or something like that. If the two encoding steps are so close together, as above, this might seem foolish—“I can see that myself just fine, and why should it gripe at me?” says the skeptic. But if there is some complicated plumbing in between, it might not be clear whether the output of the first encoder() call would ever get fed to the second encoder() call. Here's a contrived example:
temp1 = encoder(str);
temp2 = decoder(temp1);
if (some_remote_test()) {
    temp3 = temp2;
} else {
    temp3 = temp1;
}
return temp3;
Is the return value (temp3) encoded or not? It depends on whether some_remote_test returns true or false. But a function that behaves that way never makes sense! Whoever consumes this function's return value needs to know how to interpret it. If the consumer is going to stick that value into a URL, should it encode the value first? The consumer has no way to know.
If, on the other hand, the above function were defined with a specific return type, either rawstring or encodedstring, then the compiler would complain that the wrong type had been assigned to temp3, either in the “if” clause or the “else” clause.
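To make this concrete, here is a minimal sketch of the idea in TypeScript, using “branded” string types. The names RawString, EncodedString, and some_remote_test are made up for the sketch, and encodeURIComponent/decodeURIComponent stand in for the encoder and decoder:

// Brand plain strings so the compiler treats raw and encoded as distinct types.
type RawString = string & { readonly __kind: "raw" };
type EncodedString = string & { readonly __kind: "encoded" };

// encoder : rawstring -> encodedstring
const encoder = (s: RawString): EncodedString =>
    encodeURIComponent(s) as EncodedString;

// decoder : encodedstring -> rawstring
const decoder = (s: EncodedString): RawString =>
    decodeURIComponent(s) as RawString;

declare const str: RawString;
declare function some_remote_test(): boolean;

const temp = encoder(str);   // temp : EncodedString
// encoder(temp);            // rejected: an EncodedString is not a RawString

// The contrived example above, with an explicit return type:
function contrived(s: RawString): RawString {
    const temp1 = encoder(s);
    const temp2 = decoder(temp1);
    // return some_remote_test() ? temp2 : temp1;
    //                                     ^^^^^ rejected: temp1 is an EncodedString
    return temp2;
}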
But what if you want to write a function that really does use some external determinant to decide whether to decode a given string? Maybe some wire protocol contains a flag that indicates whether a given string is encoded. Fine; our language should support this. Then I'll posit two new functions:
forget : encodedstring -> rawstring
inject : rawstring -> encodedstring
These functions have the same types as decode and encode, respectively, but they operate differently. “forget” turns an encoded string into an unencoded one by simply “forgetting” that it was encoded in the first place. It effectively reinterprets the same bytes as not being encoded.
Similarly, “inject” throws a rawstring into the encoded world without actually encoding it. This is useful if you know it's already encoded. If it's not encoded, the resulting value might be bogus, but you can always handle that as an error case whenever convenient.
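In the TypeScript sketch above, these two would be nothing but type reinterpretations; they change the label the compiler attaches to the bytes, not the bytes themselves:

// Casts only: no characters are examined or altered.
// (The detour through plain string is needed because the two brands
// are otherwise unrelated types.)
const forget = (s: EncodedString): RawString => s as string as RawString;
const inject = (s: RawString): EncodedString => s as string as EncodedString;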
The essential insight that typeful thinking brings is this: type determines interpretation. If a certain sequence of bytes sits in a variable of the rawstring type, I know I should interpret "%20" as a percent sign, followed by a 2, followed by a 0. If the same bytes are in a variable of the encodedstring type, then I should interpret "%20" as a space character. Type determines interpretation.
You cannot generally tell, by looking at it, whether a string is encoded or not. The string "Albert%20Hall" might be encoded or not; likewise "Albert" might be encoded or not. For that matter, "Albert%2525252520Hall" might be encoded many times over, or not at all! It depends only on the intended interpretation. "Albert%20Hall" might be intended as a sequence of 13 characters, one of which is the percent character; or it might be intended as the encoded form of an 11-byte string, namely "Albert Hall". If you said, “Encode this, but not if it's already encoded,” I wouldn't know what to do, because I can interpret it either way.
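Here's that ambiguity spelled out with the sketch's types; the bytes are identical, and only the claimed type (and therefore the sensible operation) differs:

const bytes: string = "Albert%20Hall";       // the same 13 bytes either way

const asRaw = bytes as RawString;            // 13 characters, one of them '%'
const asEncoded = bytes as EncodedString;    // the encoding of "Albert Hall"

decoder(asEncoded);   // "Albert Hall"     -- 11 characters
encoder(asRaw);       // "Albert%2520Hall" -- the '%' itself gets encoded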
So let's revisit the idea of a function that takes a string, encoded or not, and a boolean that tells us which. The function must return a rawstring that represents the same string under the right interpretation. In order to do this, we have to forcibly reinterpret the raw bytes; that's what "forget" and "inject" are for. So the function could look like this:
function maybe_decode(str : rawstring, is_encoded : boolean) : rawstring {
    if (is_encoded) {
        result = decode(inject(str));
    } else {
        result = str;
    }
    return result;
}
As I said, "inject" can actually produce an encodedstring value which isn't actually legal in the encoding ("decode" would choke on it). But that could only happen in this case if the is_encoded flag were incorrect—if the flag were set but the string were not really encoded. This function is trusting the caller; in that respect the type system doesn't help us catch errors. But the advantage is that, throughout the rest of the program, the types will help us find places where someone tried to pass an encoded string where a raw string was expected, or vice versa. This happens a lot. A lot a lot.
The key premise of all this is that every location where a string will be stored has just one interpretation: encoded or not. This forces you to encode and decode when moving data from a location with one interpretation to another; it spares you from scratching your head and cursing your co-workers for not documenting whether a given location holds raw or encoded data; and it prevents you from unintentionally double-encoding. Most important, it rules out a variable that could hold either, which would force you to assume (conservatively) that either might appear.
You might think this is pedantic, but I'm not done yet. Tune in tomorrow for “Monads, Functors, and Double-encoding.”
A sufficiently clever encoding algorithm would encode an already encoded string as itself (as a natural consequence of the algorithm, not through some external test), and conversely so for decoding.
Jim, the only function that would do that is the identity function.
You're suggesting a function like the following:
f(a)->b
f(b)->b
f'(b)->a
f'(a)->a
[f is the function (encoding) and f' is its inverse (decoding)]
Note that with your function, f'(f(b))->a. By definition you must have f'(f(x))->x. That is only true if a=b.
Ezra, I think this problem should be dealt with at a higher level than object types. What if you have a base64-encoded string? What if you need a base64-encoded string that is then url-encoded? What type would that be?
This seems more like metadata than type information. How about an attribute on string called, say, encoding, that contains a stack of the encoding types applied to the string? You'd have to defer validation to run-time, but it would give you more flexibility.
Jim: Not if any string is permitted in the "unencoded" pool of strings. In that case, a string may look like it's encoded, when in fact it is not. Taking the example of amp-encoding:
This&that;another thing.
This might be encoded or not. What determines it is not the character sequence but the interpretation.
This also has security implications. If I'm inserting the string "this%26that" into a URL query string, it is essential that it be actively encoded, to "this%2526that", so that when it is decoded on the other end, it returns to "this%26that" rather than "this&that". If I encoded "this%26that" as "this%26that", the agent at the other end would decode it to "this&that". Already the fidelity of the encode/decode cycle is broken.
What's wrong with that? If the agent at the other end then inserted that into another URL as a query parameter, like "monkeys=this&that", then the sanctity of the query-param container has been broken (hence the security implications). Now, you may say, "it should have encoded it before forming the URL!" But there are certainly cases where you don't encode a value as you form a URL from it: for example, a function that accepts two URL-encoded arguments and uses them directly as query params in a URL. Encoding again would be wrong, since the interpretation is clearly that they are already encoded. Again, the situation becomes muddy if you have a variable that might be encoded or not depending on runtime behavior.
No, an encoding step must always actually encode, and must not opt out based on what the string looks like. Otherwise, the identity decode(encode(str)) = str might be broken.
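For the record, JavaScript's encodeURIComponent and decodeURIComponent show that round trip directly:

const value = "this%26that";                // meant literally: 11 characters

const encoded = encodeURIComponent(value);  // "this%2526that"
decodeURIComponent(encoded);                // "this%26that" -- fidelity preserved

// If a "looks already encoded" check had skipped the encoding step:
decodeURIComponent(value);                  // "this&that"   -- fidelity broken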
Jeremy: It's a good point about different kinds of encodings, each of which can be applied multiple times.
But you would agree, wouldn't you, that each variable can be clearly labelled with the sequence of encodings in which it must be interpreted? If so, then you just need a rich type system to specify that, say, a string is amp-encoded, then doubly url-encoded, then uu-encoded....
I would argue that it makes more sense to pin this information down at compile time, so that mistakes are caught at compile time, rather than having to understand and test all the combinations of inputs whose run-time behavior might put an encoded string where an unencoded string belongs. Catching errors at compile time is good, and I posit that every string variable already has one interpretation in your head anyway. Why not let the compiler help you find mistakes?
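Just to sketch what such a rich type might look like in TypeScript (the names are mine and purely illustrative): record the whole stack of encodings in the type, outermost first, so that each encode or decode step pushes or pops exactly one layer.

type WithStack<Stack extends string[]> = string & { readonly __stack: Stack };
type Raw = WithStack<[]>;

// "amp-encoded, then doubly url-encoded, then uu-encoded":
type Wire = WithStack<["uu", "url", "url", "amp"]>;

// Each step is tracked by the compiler, rather than by a run-time attribute:
declare function url_encode<S extends string[]>(s: WithStack<S>): WithStack<["url", ...S]>;
declare function url_decode<S extends string[]>(s: WithStack<["url", ...S]>): WithStack<S>;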
