Writing over at Science Blogs, Martin Rundkvist reports that the last couple of ebooks he bought from Google Play Books were surprisingly terse:
On my Android smartphone, the OCRed texts in my e-book copies of Adam Roberts’ Jack Glass (2012) and Neal Stephenson’s REAMDE (2011) have lost all their apostrophes. All their quotation marks. All their long dashes. And all their diacritic characters. When Stephenson writes “naïveté”, my e-book says “navet”, which is French for turnip. When the problem first showed up, in Roberts’ book, I actually thought he wrote non-standard English as a futuristic device.
When you run operating systems in non-English language modes, like Swedish or even Chinese, you get used to misidentified characters, with ÅÄÖÜ becoming all kinds of junk symbols. But this doesn’t look like a case of that. Google’s reader software is just quietly omitting some of the most common characters in English novels!
As a geek and an ebook blogger, I like to collect stories of unusual formatting and other technical errors. Whether it’s Google deleting ebooks just because someone crossed the wrong international boundary, or strange text inserted into a copy of Game of Thrones, formatting errors are fascinating puzzles which are fun to solve.
In the case of the missing punctuation, I wasn’t able to replicate the error. But I would bet that Martin encountered a bug triggered by a language mismatch, with the user language set to Swedish and the ebook’s language code set to English.
I’m not sure where he got the idea that OCR had something to do with this, though; the ebooks in question are new enough that they were published digitally, and not converted from a scanned copy.
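For the curious, here is one way the symptom could arise. This is only a minimal Python sketch, not a claim about what Google’s reader software actually does: it assumes that some step in the pipeline keeps nothing but 7-bit ASCII characters. The function name strip_non_ascii is made up for illustration.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character outside 7-bit ASCII.

    Hypothetical lossy step: anything that cannot be encoded as ASCII
    is silently discarded instead of being replaced or reported.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")


# Escapes are used so the sample does not depend on the file's own encoding.
word = "na\u00efvet\u00e9"                    # naïveté, with precomposed ï and é
line = "\u201cIt\u2019s fine\u201d \u2014 he said"  # curly quotes, apostrophe, long dash

print(strip_non_ascii(word))
print(strip_non_ascii(line))
```

Run against the word from Martin’s example, this turns “naïveté” into “navet”. It also eats curly quotes, typographic apostrophes, and long dashes, which matches what he saw, while plain ASCII letters, spaces, and hyphens survive.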
Does anyone know how this happened? Have you seen similar errors?
image by dlofink