Google Play Books Wants You to be the Next E E Cummings

We've all encountered strange formatting in ebooks, but this next one takes the cake.

Writing over at Science Blogs, Martin Rundkvist reports that the last couple ebooks he bought from Google Play Books were surprisingly terse:

On my Android smartphone, the OCRed texts in my e-book copies of Adam Roberts’ Jack Glass (2012) and Neal Stephenson’s REAMDE (2011) have lost all their apostrophes. All their quotation marks. All their long dashes. And all their diacritic characters. When Stephenson writes “naïveté”, my e-book says “navet”, which is French for turnip. When the problem first showed up, in Roberts’ book, I actually thought he wrote non-standard English as a futuristic device.

When you run operating systems in non-English language modes, like Swedish or even Chinese, you get used to misidentified characters, with ÅÄÖÜ becoming all kinds of junk symbols. But this doesn’t look like a case of that. Google’s reader software is just quietly omitting some of the most common characters in English novels!
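One way text can end up looking like this is a conversion step that silently discards every character outside plain ASCII. This is offered purely as an illustration, not as a claim about what Google's reader actually does. The Python sketch below uses a hypothetical helper, `strip_non_ascii`, to show the effect on the kinds of characters Martin lists.

```python
def strip_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII.

    Hypothetical helper for illustration only; it mimics a lossy
    conversion step, not any known Google Play Books behaviour.
    """
    return text.encode("ascii", errors="ignore").decode("ascii")


# Escapes are used so the sample text survives any copy-and-paste mangling.
samples = [
    "na\u00efvet\u00e9",                           # diacritics
    "Roberts\u2019 book",                          # curly apostrophe
    "\u201cHello,\u201d she said \u2014 twice.",   # curly quotes and a long dash
]

for sample in samples:
    print(repr(sample), "->", repr(strip_non_ascii(sample)))
```

The first sample comes out as "navet", the same turnip Martin found. That is at least consistent with characters being dropped outright rather than decoded into the wrong symbols.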

As a geek and an ebook blogger, I like to collect stories of unusual formatting and other technical errors. Whether it's Google deleting ebooks just because someone crossed the wrong international boundary, or strange text inserted into a copy of Game of Thrones, formatting errors are fascinating puzzles which are fun to solve.

In the case of the missing punctuation, I wasn't able to replicate the error. But I would bet that Martin encountered a bug which involved the user language being set to Swedish and the language code for the ebook set to English.

I'm not sure where he got the idea that OCR had something to do with this, though; the ebooks in question are new enough that they were published digitally, and not converted from a scanned copy.

Does anyone know how this happened? Have you seen similar errors?

image by dlofink

About Nate Hoffelder
Nate Hoffelder is the founder and editor of The Digital Reader: "I've been into reading ebooks since forever, but I only got my first ereader in July 2007. Everything quickly spiraled out of control from there. Before I started this blog in January 2010 I covered ebooks, ebook readers, and digital publishing for about 2 years as a part of MobileRead Forums. It's a great community, and being a member is a joy. But I thought I could make something out of how I covered the news for MobileRead, so I started this blog."

3 Comments on Google Play Books Wants You to be the Next E E Cummings

  1. Long ago, when uploading to Amazon (I’d put it at about 7 years ago, maybe less) you could not use smart quotes, em dashes and a few other characters. They came out as oddball garbage on the reader. It didn’t take long for Amazon to add code to “translate” the HTML for these characters, but some of the early adopters who uploaded books had to avoid using certain characters. Those books, unless updated, can appear a bit stark, with -- instead of an em dash (a long single dash) and no smart quotes, etc. It’s possible that whoever converted/uploaded the book used technology that didn’t recognize some of the characters. Or that Google has a copy somewhere that is missing part of the standard character set???

    • It’s possible that whoever converted/uploaded the book used technology that didn’t recognize some of the characters.

      That is a possibility, yes.

      Surprisingly, smart quotes and other trick characters are still an issue for web publishing even in 2015. WordPress still chokes on them.

  2. My guess is that Google’s playing silly buggers again.

    Many special character entities that are perfectly fine in HTML will throw up parsing errors if pasted into XHTML or XML. Generally, best practice when making ebooks is to avoid using HTML entities and stick to literal characters in UTF-8 or UTF-16 encoding.

    My guess is that special characters encoded using HTML entities in the original ebook files are getting suppressed as a consequence of Google’s mysterious conversion process. That seems to confirm that they’re using some sort of XML, which is interesting.
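The point in the last comment about HTML entities and XML can be checked with a few lines of Python. This is a minimal sketch using the standard library's `xml.etree.ElementTree`; the helper `parses_as_xml` and the sample fragments are made up for illustration, and nothing here is known to reflect Google's actual conversion process.

```python
import xml.etree.ElementTree as ET


def parses_as_xml(fragment: str) -> bool:
    """Return True if the fragment is accepted by a plain XML parser."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


# Same sentence, three ways of writing the long dash.
named = "<p>wait &mdash; what?</p>"    # HTML named entity, not defined in bare XML
numeric = "<p>wait &#8212; what?</p>"  # numeric character reference
literal = "<p>wait \u2014 what?</p>"   # the character itself

for label, fragment in [("named", named), ("numeric", numeric), ("literal", literal)]:
    print(label, "parses:", parses_as_xml(fragment))
```

A strict parser rejects the named entity outright, while the numeric reference and the literal character both parse. Whether a more lenient converter would quietly drop the unknown entity instead of raising an error is an assumption, but it would produce the kind of missing punctuation described in the post.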

