Trigraphs and other escape sequences

I was fixing a bug the other day in code that generates Java source code. I had to make a small change to code that generated Unicode escape sequences. This got me thinking about the ways in which programming languages support encoding of non-ASCII characters in source code.

Java compilers must support source programs written using the full Unicode character set. But since most text editors don’t handle all of Unicode, Java supports Unicode escape sequences using the \uXXXX syntax, where XXXX is four hexadecimal digits encoding a 16-bit value. When the source is compiled, the substitution happens before any lexical analysis is done, so an escape can stand in for any character, not just one inside a string literal. A Java source file could be composed entirely of Unicode escape sequences. (How’s that for obfuscated code?)
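To make the "before any lexical analysis" point concrete, here is a minimal sketch. The class and method names are made up for illustration; the only thing it relies on is the escape translation described above. One escape spells an identifier and two others supply the quotes of a string literal.

```java
public class UnicodeEscapeDemo {
    static String greeting() {
        // The variable name below is written as an escape for the letter 'a'.
        // Translation happens before tokenizing, so the compiler sees "int a = 42;".
        int \u0061 = 42;
        // Escapes can also stand in for punctuation: these two become double quotes.
        String s = \u0022hi\u0022;
        // The variable declared above can be referred to by its plain name.
        return s + a;
    }

    public static void main(String[] args) {
        System.out.println(greeting()); // prints "hi42"
    }
}
```

A side effect worth knowing: because the translation applies to the whole file, escapes inside comments are processed too, so a malformed escape in a comment is a compile error.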

The C programming language predates Unicode. It even predates the consistent usage of ASCII. C uses a number of characters that were not available on some European terminals (which only offered national variants of the seven-bit ISO 646 character set and used these positions for accented characters). In order to allow such systems to support C, trigraphs were added to the language. Trigraphs are sequences of three characters (introduced by two consecutive question marks) that the compiler replaces with their corresponding punctuation characters. Here’s the set of trigraphs that were defined:

??= #
??( [
??/ \
??) ]
??' ^
??< {
??! |
??> }
??- ~

Pretty ugly. Imagine having to type ??< ... ??> around every block of code. Since full ASCII support is no longer an issue, trigraphs are now just an obscure language feature. But they live on. C++ compilers usually support C as well, so trigraphs are supported by most C/C++ compilers.
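The replacement itself is a simple left-to-right scan. Here is a rough sketch of it in Java (to stay in one language for the examples in this post). It is an illustration of the table above, not how any real C compiler is implemented, and the class and method names are hypothetical.

```java
import java.util.Map;

public class TrigraphDemo {
    // Maps the third character of each trigraph to the character it stands for.
    private static final Map<Character, Character> TRIGRAPHS = Map.of(
            '=', '#', '(', '[', '/', '\\', ')', ']', '\'', '^',
            '<', '{', '!', '|', '>', '}', '-', '~');

    // Scans left to right and replaces each "??x" that appears in the table.
    static String replaceTrigraphs(String src) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < src.length()) {
            if (i + 2 < src.length()
                    && src.charAt(i) == '?' && src.charAt(i + 1) == '?') {
                Character repl = TRIGRAPHS.get(src.charAt(i + 2));
                if (repl != null) {
                    out.append(repl.charValue());
                    i += 3; // consume the whole trigraph
                    continue;
                }
            }
            // Not a trigraph: copy the character through unchanged.
            out.append(src.charAt(i));
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "#define ARR(x) x[0]"
        System.out.println(replaceTrigraphs("??=define ARR(x) x??(0??)"));
    }
}
```

Two question marks followed by anything not in the table are left alone, so ordinary text like "what??" passes through untouched.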

C# doesn't support trigraphs, but it does support Unicode escape sequences, although not in quite the same way as Java. According to the C# Language Specification, Unicode escape sequences are processed in identifiers, regular string literals, and character literals. A Unicode character escape is not processed in any other location (for example, to form an operator, punctuator, or keyword). So in C#, the processing of these escapes must happen later, during lexical analysis. I don't think it really matters in practice. The most common usage of Unicode escape sequences would be in string and character literals. Using them elsewhere, especially in identifier names, would be pretty weird.
