28 years after Unicode, we still can’t handle accents: PDF + macOS + URL = chaos

Nico had a simple task: copy a URL from a PDF document and paste it into a webpage for publication on a site. When he opened the PDF in Adobe Acrobat, that worked fine, but open the same document in Preview, Podofyllin, or any other app which uses the macOS PDFKit Quartz 2D engine, and the pasted URL is broken.

Why, when the first version of the Unicode standard was published in 1991, do things like this still happen nearly 28 years later? I’m afraid the answer is quite long and tortuous, and should embarrass everyone concerned (except Nico, of course!).

Unicode provides characters or ‘code points’ for a vast number of different languages and systems of characters and symbols, some of which are normally displayed as being visually identical. For example, Nordic languages have a letter å, and that has an upper case form Å. In the international definition of the unit of measurement known as the Ångström, named in honour of the Swedish physicist Anders Ångström, its official symbol is the upper case Å.

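However, Unicode can encode this character in different ways. The letter itself has two forms: the precomposed LATIN CAPITAL LETTER A WITH RING ABOVE, U+00C5 or UTF-8 C3 85, and the decomposed sequence of A followed by COMBINING RING ABOVE, U+0041 U+030A or UTF-8 41 CC 8A. The unit of measurement also has its own code point, ANGSTROM SIGN, U+212B or UTF-8 E2 84 AB. All of these are recognised as being canonically equivalent, so any comparisons and other operations performed should see them as functionally the same character.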
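As a minimal sketch of what canonical equivalence means in practice, the following uses only Python's standard `unicodedata` module to inspect the two byte sequences quoted above. The helper names `describe` and `canonically_equal` are made up for this illustration.

```python
import unicodedata

precomposed = "\u00c5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030a"   # LATIN CAPITAL LETTER A + COMBINING RING ABOVE

def describe(text):
    """Return (code points, UTF-8 bytes in hex) for a string."""
    points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    utf8_hex = " ".join(f"{b:02X}" for b in text.encode("utf-8"))
    return points, utf8_hex

def canonically_equal(a, b):
    """Compare two strings after normalising both to Form C."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(describe(precomposed))                       # ('U+00C5', 'C3 85')
print(describe(decomposed))                        # ('U+0041 U+030A', '41 CC 8A')
print(precomposed == decomposed)                   # False: the raw strings differ
print(canonically_equal(precomposed, decomposed))  # True: same after normalising
```

A plain `==` sees two different strings, so any software that wants the two treated as the same character has to normalise before it compares.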

Just two years after Unicode 1.0, Adobe Systems released version 1.0 of the specification for its new PDF files, initially supported by its Carousel reader which soon became Acrobat. This adopted quite a different approach, which stressed the importance of rendering a document exactly as its designer had intended, rather than encoding its text content in any interchangeable way. Instead of using the new Unicode characters, it therefore used (and still uses) code tables and lookup in fonts, which can easily obscure the underlying ‘meaning’ of characters.

If you want to understand how different this is from Unicode, consider the following PDF snippet which draws two words using font F1:
/F1 48 Tf
(Hello World)Tj

If F1 is a font such as Times-Regular, that is rendered as expected. But if F1 happens to be the non-Unicode font Symbol, it is rendered as Ηελλο Ωορλδ, which violates the principles behind Unicode ever so slightly.
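To make the font-lookup idea concrete, here is a toy sketch in Python. It is not how a PDF renderer is implemented: the `SYMBOL_GLYPHS` table and the `render` function are hypothetical, and the table is a hand-built fragment covering only the letters needed to reproduce the rendering quoted above.

```python
# Hypothetical fragment of a Symbol-style lookup table: the same character
# codes that spell Latin letters in a text font select Greek glyphs here.
SYMBOL_GLYPHS = {
    "H": "\u0397",  # GREEK CAPITAL LETTER ETA
    "e": "\u03b5",  # GREEK SMALL LETTER EPSILON
    "l": "\u03bb",  # GREEK SMALL LETTER LAMDA
    "o": "\u03bf",  # GREEK SMALL LETTER OMICRON
    "W": "\u03a9",  # GREEK CAPITAL LETTER OMEGA
    "r": "\u03c1",  # GREEK SMALL LETTER RHO
    "d": "\u03b4",  # GREEK SMALL LETTER DELTA
    " ": " ",
}

def render(codes, font):
    """Return what a reader would see for the same codes in a given font."""
    if font == "Symbol":
        return "".join(SYMBOL_GLYPHS.get(ch, "?") for ch in codes)
    return codes  # assume an ordinary Latin text font

print(render("Hello World", "Times-Regular"))
print(render("Hello World", "Symbol"))
```

The stored codes are identical in both cases; only the font's table decides what they mean, which is why the text content alone doesn't tell you which characters were intended.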

To handle the character Å, PDF uses the Mac Classic MacRomanEncoding, or Mac OS Roman (I kid you not), and the character becomes 00C5 in that platform-specific variant of extended ASCII. That happens in this case to coincide with the Unicode code point for one of the two forms, U+00C5, and would generally map to UTF-8 C3 85.

However, using that Unicode character would be a problem if it was to be embedded in a Mac HFS+ folder or file name. For those, the file system normalises to Unicode Form D, where it becomes UTF-8 41 CC 8A. I don’t think that there are rules which apply when converting from ancient encodings such as Mac OS Roman to UTF-8, but Apple’s implementation in this case includes normalisation to Form D so that the resulting text is file-system safe.

Meanwhile, Internet URLs (and URIs) took some time to catch up with changing text encoding. Non-ASCII characters had arrived in ‘percent-encoding’, in which each byte of a character’s UTF-8 encoding is written as a percent sign followed by its two hex digits. By 2005, this changed into the Internationalised Resource Identifier (IRI) which essentially encodes in Unicode’s UTF-8. I’m not aware of any requirement for normalised forms, nor to address issues of canonical equivalence, so the two forms of Å could be treated as being different characters.
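A short sketch shows how the two forms diverge once percent-encoded. It uses Python's standard `urllib.parse.quote`; the path segment Ångström is an invented example, and `percent_encode` is a helper named for this illustration only.

```python
import unicodedata
from urllib.parse import quote

def percent_encode(segment):
    """Percent-encode one path segment as UTF-8, escaping everything unsafe."""
    return quote(segment, safe="")

# The same visible word in Form C (precomposed) and Form D (decomposed).
form_c = unicodedata.normalize("NFC", "\u00c5ngstr\u00f6m")
form_d = unicodedata.normalize("NFD", form_c)

print(percent_encode(form_c))  # %C3%85ngstr%C3%B6m
print(percent_encode(form_d))  # A%CC%8Angstro%CC%88m
```

A server which compares these byte for byte sees two unrelated paths, even though both display as the same word.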

What happens when Nico copies text from within a PDF is that Adobe readers deliver it in its original Form C, but those based on the PDFKit Quartz 2D engine in macOS deliver it in Form D, in case you might want to use that text in a folder or file name (I must presume). As there’s no requirement for normalisation, and this appears to be an undocumented behaviour, macOS has no good reason to make that normalisation.

Because IRIs and browsers treat the Unicode text that they’re given as inviolate and don’t perform any normalisation, they cannot recognise that the two variants of the letter Å are canonically equivalent. This means that those characters are different in a URL, hence Nico’s original problem.

When working with accented Roman/Latin characters, the Form C versions are by far the most widely used, as they’re normally the ones which appear when you enter the character from the keyboard. So one workaround for Nico is to normalise such text before pasting it into URLs. That is easily done using my free tool Apfelstrudel if you want a graphical interface and lots of additional information such as how the different normalised forms behave when using text comparison operators. If you just want to normalise to Form C, then my free command tool unorml is readily packaged into most workflows. They’re both available from their Product Page.
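For those who would rather script the workaround, a minimal sketch using only Python's standard library might look like this. It is not how Apfelstrudel or unorml work internally; `to_form_c` is a hypothetical helper, and the example.com address is a placeholder.

```python
import unicodedata

def to_form_c(text):
    """Normalise text to Unicode Form C (NFC) before pasting it into a URL."""
    return unicodedata.normalize("NFC", text)

# Text as it might arrive from a Form D clipboard: base letters followed
# by combining marks.
pasted = "https://example.com/A\u030angstro\u0308m"
fixed = to_form_c(pasted)

print(len(pasted), len(fixed))  # the Form C string is two code points shorter
print(fixed == "https://example.com/\u00c5ngstr\u00f6m")  # True
```

Normalising text that is already in Form C leaves it unchanged, so it is safe to run everything you paste through the same step.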


Who’s got this wrong and should hang their heads in shame?

  • The Unicode Consortium, for using multiple code points for visually identical characters, which will always cause these problems.
  • Adobe, for not seeing the importance of standardising on Unicode UTF-8 encoding in PDF at an early stage, and plugging on with long-defunct encodings, which will always cause these problems.
  • Apple, for silently normalising UTF-8 text to Form D in the PDFKit Quartz 2D engine, contrary to general practice, when the only potentially useful purpose in doing so was compatibility with a file system which is now being discontinued.

I should also note that those most involved in the above, and least concerned about these continuing problems, are monolingual English-speakers. Nico works in Switzerland, where such issues are a constant plague, but being monolingual and accent-free isn’t really an option.

Finally, if you think this is bad in languages which use Roman/Latin characters, you should try it in Korean (so I’m told).

(Many thanks to Nico for bringing me this fascinating problem.)