hoakley June 20, 2019 General, Language, Macs, Technology

28 years after Unicode, we still can’t handle accents: PDF + macOS + URL = chaos

Nico had a simple task: copy a URL from a PDF document and paste it into a webpage for publication on a site. When he opened the PDF in Adobe Acrobat, that worked fine, but open the same document in Preview, Podofyllin, or any other app which uses the macOS PDFKit Quartz 2D engine, and the pasted URL is broken.

Why, when the first version of the Unicode standard was published in 1991, do things like this still happen nearly 28 years later? I’m afraid the answer is quite long and tortuous, and should embarrass everyone concerned (except Nico, of course!).

Unicode provides characters or ‘code points’ for a vast number of different languages and systems of characters and symbols, some of which are normally displayed as being visually identical. For example, Nordic languages have a letter å, and that has an upper case form Å. In the international definition of the unit of measurement known as the Ångström, named in honour of the Swedish physicist Anders Ångström, its official symbol is the upper case Å.

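However, Unicode can represent this character in more than one way. The precomposed letter is U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE, or UTF-8 C3 85; the decomposed form is a plain A followed by U+030A COMBINING RING ABOVE, or UTF-8 41 CC 8A. The unit of measurement also has a code point of its own, U+212B ANGSTROM SIGN (UTF-8 E2 84 AB), which normalises to the same letter. These are recognised as being canonically equivalent, so any comparisons and other operations performed should see them as functionally the same character.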
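To see the two byte sequences quoted above side by side, here is a minimal Python sketch using the standard `unicodedata` module. The variable names and the helper `canonically_equal` are purely illustrative; it assumes Python 3.8 or later for `bytes.hex` with a separator.

```python
import unicodedata

precomposed = "\u00C5"   # LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # A followed by COMBINING RING ABOVE

# The two spellings have different UTF-8 bytes.
print(precomposed.encode("utf-8").hex(" "))  # c3 85
print(decomposed.encode("utf-8").hex(" "))   # 41 cc 8a

# A plain comparison therefore treats them as different strings.
print(precomposed == decomposed)  # False


def canonically_equal(a: str, b: str) -> bool:
    """Compare two strings after normalising both to the same form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


# Once both are normalised, they compare as the same character.
print(canonically_equal(precomposed, decomposed))  # True
```

Any software that compares the raw strings without a step like `canonically_equal` will see two different characters.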

Just two years after Unicode 1.0, Adobe Systems released version 1.0 of the specification for its new PDF files, initially supported by its Carousel reader which soon became Acrobat. This adopted quite a different approach, which stressed the importance of rendering a document exactly as its designer had intended, rather than encoding its text content in any interchangeable way. Instead of using the new Unicode characters, it therefore used (and still uses) code tables and lookup in fonts, which can easily obscure the underlying ‘meaning’ of characters.

If you want to understand how different this is from Unicode, consider the following PDF snippet which draws two words using font F1:
/F1 48 Tf (Hello World)Tj
If F1 is a font such as Times-Roman, that is rendered as expected. But if F1 happens to be the non-Unicode font Symbol, it is rendered as Ηελλο Ωορλδ, which violates the principles behind Unicode ever so slightly.
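A rough way to picture this is as a lookup table that belongs to the font rather than to the text. The Python sketch below is only an illustration: `SYMBOL_GLYPHS` is a hypothetical, partial table covering just the letters in this example, and `render_with_symbol` is an invented helper, not how any PDF engine is actually written.

```python
# Partial, illustrative table: the same character codes that spell
# "Hello World" in a Latin font select Greek glyphs in Symbol.
SYMBOL_GLYPHS = {
    "H": "\u0397",  # GREEK CAPITAL LETTER ETA
    "e": "\u03B5",  # GREEK SMALL LETTER EPSILON
    "l": "\u03BB",  # GREEK SMALL LETTER LAMDA
    "o": "\u03BF",  # GREEK SMALL LETTER OMICRON
    "W": "\u03A9",  # GREEK CAPITAL LETTER OMEGA
    "r": "\u03C1",  # GREEK SMALL LETTER RHO
    "d": "\u03B4",  # GREEK SMALL LETTER DELTA
    " ": " ",
}


def render_with_symbol(text: str) -> str:
    """Map each character code to the glyph this partial table gives it."""
    return "".join(SYMBOL_GLYPHS.get(ch, "?") for ch in text)


print(render_with_symbol("Hello World"))  # Ηελλο Ωορλδ
```

The stored string never changes; only the table consulted when drawing it does, which is why the text alone cannot tell you what a reader will see.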

To handle the character Å, PDF uses the Mac Classic MacRomanEncoding, or Mac OS Roman, (I kid you not) and the character becomes 00C5 in that platform-specific variant of extended ASCII. That happens in this case to coincide with the Unicode code point for one of the two forms, U+00C5, and would generally map to UTF-8 C3 85.

However, using that Unicode character would be a problem if it were to be embedded in a Mac HFS+ folder or file name. For those, the file system normalises to Unicode Form D, where it becomes UTF-8 41 CC 8A. I don’t think that there are rules which apply when converting from ancient encodings such as Mac OS Roman to UTF-8, but Apple’s implementation in this case includes normalisation to Form D so that the resulting text is file-system safe.

Meanwhile, Internet URLs (and URIs) took some time to catch up with changing text encoding. Non-ASCII characters had arrived in ‘percent-encoding’ in which the UTF-8 hex for Unicode characters is given. By 2005, this changed into the Internationalised Resource Identifier (IRI) which essentially encodes in Unicode’s UTF-8. I’m not aware of any requirement for normalised forms, nor to address issues of canonical equivalence, so the two forms of Å could be treated as being different characters.

What happens when Nico copies text from within a PDF is that Adobe readers deliver it in its original Form C, but those based on the PDFKit Quartz 2D engine in macOS deliver it in Form D, in case you might want to use that text in a folder or file name (I must presume). As there’s no requirement for normalisation, and this appears to be undocumented behaviour, macOS has no good reason to normalise the text.

Because IRIs and browsers don’t normalise, and treat the Unicode text that they’re given as inviolate, they cannot recognise that the two variants of the letter Å are canonically equivalent. This means that those characters are different in a URL, hence Nico’s original problem.
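The effect on a URL can be sketched in a few lines of Python, using `urllib.parse.quote` to percent-encode a path segment. The segment `Ångström` is a made-up example, not Nico’s actual URL, and the variable names are illustrative.

```python
import unicodedata
from urllib.parse import quote

name = "\u00C5ngstr\u00F6m"  # hypothetical path segment: "Ångström"

form_c = unicodedata.normalize("NFC", name)  # precomposed characters
form_d = unicodedata.normalize("NFD", name)  # base letters + combining marks

# Percent-encoding works on the UTF-8 bytes, so the two forms diverge.
print(quote(form_c))  # %C3%85ngstr%C3%B6m
print(quote(form_d))  # A%CC%8Angstro%CC%88m

# A server or browser comparing these sees two different paths.
print(quote(form_c) == quote(form_d))  # False
```

Both strings look identical on screen, but nothing in the percent-encoding step knows that they are meant to be the same.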

When working with accented Roman/Latin characters, the Form C versions are by far the most widely used, as they’re normally the ones which appear when you enter the character from the keyboard. So one workaround for Nico is to normalise such text before pasting it into URLs. That is easily done using my free tool Apfelstrudel if you want a graphical interface and lots of additional information such as how the different normalised forms behave when using text comparison operators. If you just want to normalise to Form C, then my free command tool unorml is readily packaged into most workflows. They’re both available from their Product Page.
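For those who would rather script the workaround, a minimal Python sketch of the same idea follows. It is not how Apfelstrudel or unorml work internally, which isn’t described here; `to_form_c` is an illustrative helper, the URL is hypothetical, and `unicodedata.is_normalized` assumes Python 3.8 or later.

```python
import unicodedata


def to_form_c(text: str) -> str:
    """Return the text normalised to Unicode Form C (NFC)."""
    return unicodedata.normalize("NFC", text)


# Hypothetical URL as it might arrive from the pasteboard in Form D.
pasted = "https://example.com/A\u030Angstro\u0308m"

fixed = to_form_c(pasted)

print(unicodedata.is_normalized("NFC", pasted))  # False
print(unicodedata.is_normalized("NFC", fixed))   # True
print(len(pasted), len(fixed))                   # 30 28
```

Text that is already in Form C passes through unchanged, so it is safe to apply this to everything before pasting.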


Who’s got this wrong and should hang their heads in shame?

The Unicode Consortium, for using multiple code points for visually identical characters, which will always cause these problems.
Adobe, for not seeing the importance of standardising on Unicode UTF-8 encoding in PDF at an early stage, and plugging on with long-defunct encodings, which will always cause these problems.
Apple, for silently normalising UTF-8 text to Form D in the PDFKit Quartz 2D engine when the only potentially useful purpose in doing so was compatibility with a file system which is now being discontinued, and contrary to general practice.

I should also note that those most involved in the above, and least concerned about these continuing problems, are monolingual English-speakers. Nico works in Switzerland, where such issues are a constant plague, but being monolingual and accent-free isn’t really an option.

Finally, if you think this is bad in languages which use Roman/Latin characters, you should try it in Korean (so I’m told).

(Many thanks to Nico for bringing me this fascinating problem.)

9 Comments


1

Garland on June 20, 2019 at 6:22 pm

I agree that language encoding and normalization remains troublingly inconsistent. Although, the situation is better than it has been. (Yes, that is a bit of damning by faint praise. No longer a complete Tower of Babel, computing has progressed to merely a Turret of Babel.)

I’ve recently been investigating the macOS pasteboard subsystem. While it mostly “just works” (thus becomes transparent to most users), it mediates the encoding problem that you describe. There are two aspects to consider: (1) what formats/encodings does the source application make available on the pasteboard, and (2) what formats/encodings does the destination application consume (and in what order)?

From Safari, I printed this webpage to a PDF. From Preview, I copied the “Å” to the macOS (10.14.5) pasteboard. Both the C and D forms became available on the pasteboard — the C form on the UTF-16 encoded pasteboard and the D form on the UTF-8 encoded pasteboard.

```
$ osascript -s h -e 'get clipboard info'
«class RTF », 355, «class utf8», 3, «class HTML», 756, «class weba», 992, «class rtfd», 579, «class ut16», 4, uniform styles, 144, string, 1, styled Clipboard text, 22, Unicode text, 4, uniform styles, 144, styled Clipboard text, 22

$ osascript -s h -e 'get the clipboard as "ut16"' | hexdump -C
00000000  c3 85 0a  |...|
00000003

$ osascript -s h -e 'get the clipboard as "utf8"' | hexdump -C
00000000  41 cc 8a 0a  |A...|
00000004
```

Opening the same PDF in Adobe Reader, I copied the same “Å” to the pasteboard. The C form became available in both the UTF-8 and UTF-16 encoded pasteboards. The D form was not available.

```
$ osascript -s h -e 'get clipboard info'
«class RTF », 175, «class utf8», 2, «class ut16», 4, uniform styles, 144, string, 1, styled Clipboard text, 22, Unicode text, 2, uniform styles, 144, styled Clipboard text, 22

$ osascript -s h -e 'get the clipboard as "utf8"' | hexdump -C
00000000  c3 85 0a  |...|
00000003

$ osascript -s h -e 'get the clipboard as "ut16"' | hexdump -C
00000000  c3 85 0a  |...|
00000003
```

I’ll leave it to others to decide what the “right” behavior should be. In any case, it is trivial to make the contents of the UTF-16 pasteboard available to applications that prefer to consume the UTF-8 pasteboard.

After copying the “Å” from Preview to the pasteboard:

```
$ osascript -s h -e 'set the clipboard to the clipboard as "ut16"'

$ osascript -s h -e 'get the clipboard as "utf8"' | hexdump -C
00000000  c3 85 0a  |...|
00000003

$ osascript -s h -e 'get the clipboard as "ut16"' | hexdump -C
00000000  c3 85 0a  |...|
00000003
```

- 2
  
  hoakley on June 20, 2019 at 6:45 pm
  
  Thank you – an interesting insight.
  I don’t think the UTF-16 is ‘native’ to PDFKit or the internal PDF representation, which is generally accessed internally in UTF-8, although Quartz 2D might retain it in its horribly archaic codepage extended ASCII. That suggests that the UTF-16 is a Form C normalisation, rather than pre-normalised data.
  Quite why the two different Unicode formats normalise differently I don’t know. Covering both bases, perhaps?
  This type of problem won’t go away though so long as Unicode retains these canonically equivalent forms, and PDF doesn’t become thoroughly Unicode. But no doubt both standardisation teams insist that they’re right.
  Howard.
  
  - 3
    
    Martin Wierschin on June 20, 2019 at 11:40 pm
    
    Which UTF encoding or normalization is used to encode the text doesn’t really matter. If any clients (e.g. pasteboard, file system, etc.) care about the exact encoding, they can simply re-encode the text. The end user generally shouldn’t care either, so long as text appears correctly on screen.
    
    As Garland pointed out, a lot of this is taken care of for the programmer. If you’re writing software that cares about comparing or collating text, you’re almost always using prepackaged tools that are already Unicode aware. You’ve written many apps, Howard, so I’m sure you know the macOS text APIs do all this for you. You pass the right flags and the letter Å will be tested correctly no matter how the data was originally encoded, whether it’s a single precomposed character or a composed character sequence using a combining mark.
    
    There is however one problem lurking here: sometimes PDF text will appear correctly on screen, but the underlying characters are somehow different. It’s very similar to what you described regarding the Symbol font, but it’s more malicious as I’ve seen it with basic text (A-Z) where the font isn’t showing anything exotic. It means you can’t search in the PDF document, as the underlying characters won’t match what you see. Copy-paste also doesn’t work, because you just copy out a bunch of gibberish characters.
    
    In these situations (which are rare) I assume the font is using a wonky non-standard glyph mapping that’s not supported by PDF, or something like that. But as you mentioned this is a consequence of PDFs prioritizing exact display over textual interchange. In that primary goal PDFs have done a pretty great job if you ask me, so maybe we shouldn’t be too critical!
    
    - 4
      
      hoakley on June 21, 2019 at 11:25 am
      
      Thanks.
      I have an app (you might have guessed!) which performs this type of Unicode obfuscation, Dystextia. I must admit I haven’t yet looked at the PDF it generates. When I get a chance, that sounds like an interesting exploration.
      The easiest way to throw PDF content access is to create a document in LaTeX using some obscure non-English fonts, generate a DVI from that, then convert the DVI to PostScript, and distill that using something like Ghostscript. You’ll never recover anything intelligible after that!
      Howard.
      
5

Valdo on July 18, 2019 at 3:10 pm

Hello, my posting is a bit off topic, but I have found help here several times 🙂

I moved, for private use, from MS Office to iWork. I found a spelling library for my language to use as the spellchecker in Pages. Later I removed it because Pages behaved strangely, only to find that there are other (system?) spelling problems.
Examples: with my language installed, some words were underlined. When I checked the spelling, exactly the same word was offered as the replacement, and after I accepted it the word remained underlined. I removed the dictionary and started to write in English, which is also the default language for my Mac.
In Pages underlined words appeared again. For those written incorrectly a correct replacement was offered, but a few correctly written words were also underlined with the replacement field empty, so they remained underlined… Any idea what’s wrong?

- 6
  
  hoakley on July 18, 2019 at 3:23 pm
  
  Some users do report this with some languages. I’m never sure whether it’s anything to do with the dictionary, or the attempt to check grammar – the underlined words seem to be selected more on the basis of usage rather than spelling, as you have seen. Even in English, grammar checking is so poor that I simply turn spelling and grammar checking off.
  Which is probably why you see the occasional mistake slip through here!
  I gather that third-party checkers are no better either. So I don’t really have a solution to offer, other than to check your own spelling as you need. At least the Dictionary app works well.
  Howard.
  
  - 7
    
    Martin Wierschin on July 18, 2019 at 3:54 pm
    
    Depending on the exact type of misspellings you and Valdo are seeing, this may be a system bug that was introduced by Mojave. The system will incorrectly flag words as misspelled that are in fact spelled correctly, but violate grammar checking rules. For example, words like “by”, “of”, or “than”.
    
    The underlying bug is in the macOS spelling services and can affect any apps that use it, including Apple Mail or Pages. The system incorrectly bleeds grammar checking results into spellchecking results. I filed the bug with Apple some time ago as radar #48614141.
    
    The good news is that individual app developers can work around the problem if they filter Apple’s results. Code can inspect the NSTextCheckingResult objects returned by the system and ignore those with a “NSGrammarRange” key in their “detail” dictionary. Nisus Writer Pro version 3.0.2 implemented such a workaround.
    
    Of course it would be much better if Apple just fixed this across the board for the whole system, for all apps at once.
    
    - 8
      
      hoakley on July 18, 2019 at 4:33 pm
      
      Thank you, Martin, for explaining in detail what we had been observing (although I’m sure that I had also seen this earlier in Sierra too, although not as bad).
      I often used to see it in their and there, where the grammar checker clearly didn’t have a clue. I hadn’t realised that this was a bug, though, in leaking through like this. I’m once again happy to have and to use Nisus Writer Pro!
      Howard.
      
9

Valdo on July 18, 2019 at 3:27 pm

To mention one more detail.
My Mac region (not the language) is set to Germany instead of my country, because if I set it to my country it shows the currency symbol before the amount (American style, €125, or possibly the old currency scrapped in 2009…! The German style, 125€, is correct).


·Comments are closed.
