Google Play Books Wants You to be the Next E E Cummings

We’ve all encountered strange formatting in ebooks, but this next one takes the cake.

Writing over at Science Blogs, Martin Rundkvist reports that the last couple of ebooks he bought from Google Play Books were surprisingly terse:

On my Android smartphone, the OCRed texts in my e-book copies of Adam Roberts’ Jack Glass (2012) and Neal Stephenson’s REAMDE (2011) have lost all their apostrophes. All their quotation marks. All their long dashes. And all their diacritic characters. When Stephenson writes “naïveté”, my e-book says “navet”, which is French for turnip. When the problem first showed up, in Roberts’ book, I actually thought he wrote non-standard English as a futuristic device.

When you run operating systems in non-English language modes, like Swedish or even Chinese, you get used to misidentified characters, with ÅÄÖÜ becoming all kinds of junk symbols. But this doesn’t look like a case of that. Google’s reader software is just quietly omitting some of the most common characters in English novels!
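The two failure modes Martin contrasts, junk symbols versus quietly missing characters, can be sketched in a few lines of Python. This is only an illustration: the sample string and the choice of encodings are assumptions, and nothing here is known to be what Google's reader software actually does.

```python
# Illustration only: the encodings below are assumptions, not Google's pipeline.
text = "Roberts’ naïveté"

# Misidentified characters: UTF-8 bytes read back as Windows-1252
# turn each non-ASCII character into two or three junk symbols.
misread = text.encode("utf-8").decode("cp1252")

# Quiet omission: forcing the text into ASCII while ignoring errors
# drops every character outside the basic Latin range.
dropped = text.encode("ascii", errors="ignore").decode("ascii")

print(misread)
print(dropped)
```

The second result is the interesting one: the apostrophe and both accented letters vanish without any visible junk, leaving "navet" where "naïveté" used to be.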

As a geek and an ebook blogger, I like to collect stories of unusual formatting and other technical errors. Whether it’s Google deleting ebooks just because someone crossed the wrong international boundary, or strange text inserted into a copy of Game of Thrones, formatting errors are fascinating puzzles which are fun to solve.

In the case of the missing punctuation, I wasn’t able to replicate the error. But I would bet that Martin encountered a bug which involved the user language being set to Swedish and the language code for the ebook set to English.

I’m not sure where he got the idea that OCR had something to do with this, though; the ebooks in question are new enough that they were published digitally, and not converted from a scanned copy.

Does anyone know how this happened? Have you seen similar errors?

image by dlofink

Comments


Maria (BearMountainBooks) April 6, 2015 at 3:04 pm

Long ago, when uploading to Amazon (I’d put it at about 7 years ago, maybe less) you could not use smart quotes, em dashes and a few other characters. They came out as oddball garbage on the reader. It didn’t take long for Amazon to add code to "translate" the HTML for these characters, but some of the early adopters who uploaded books had to avoid using certain characters. Those books, unless updated, can appear a bit stark, with -- instead of an em dash (a long single dash) and no smart quotes, etc. It’s possible that whoever converted/uploaded the book used technology that didn’t recognize some of the characters. Or that Google has a copy somewhere that is missing part of the standard character set???

Nate Hoffelder April 6, 2015 at 3:35 pm

It’s possible that whoever converted/uploaded the book used technology that didn’t recognize some of the characters.

That is a possibility, yes.

Surprisingly, smart quotes and other trick characters are still an issue for web publishing even in 2015. WordPress still chokes on them.


Ben Hollingum April 7, 2015 at 5:15 am

My guess is that Google’s playing silly buggers again.

Many special character entities that are perfectly fine in HTML will throw up parsing errors if pasted into XHTML or XML. Generally best practice when making ebooks is to avoid using HTML entities and stick to UTF-8 or -16 encoding.

My guess is that special characters encoded using HTML entities in the original ebook files are getting suppressed as a consequence of Google’s mysterious conversion process. That seems to confirm that they’re using some sort of XML, which is interesting.
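Ben's point about entities can be checked with Python's standard library XML parser. The sketch below is hedged: the sample markup is made up, and `drop_named_entities` is a hypothetical stand-in for a lenient converter that discards what it cannot resolve. Whether Google's conversion works this way is a guess.

```python
import re
import xml.etree.ElementTree as ET

html_style = "<p>It&rsquo;s na&iuml;vet&eacute;</p>"  # named HTML entities
numeric = "<p>It&#8217;s na&#239;vet&#233;</p>"       # numeric character references
literal = "<p>It’s naïveté</p>"                       # plain Unicode text

def parse_text(markup):
    """Return the element text, or None if the strict XML parser rejects the markup."""
    try:
        return ET.fromstring(markup).text
    except ET.ParseError:
        return None

def drop_named_entities(markup):
    """Hypothetical lenient converter: silently discard every named entity.

    Deliberately crude; it would also discard the entities XML does define, such as &amp;.
    """
    return re.sub(r"&[A-Za-z][A-Za-z0-9]*;", "", markup)

print(parse_text(html_style))                       # rejected: undefined entities
print(parse_text(numeric))                          # parses fine
print(parse_text(literal))                          # parses fine
print(parse_text(drop_named_entities(html_style)))  # parses, but characters are gone
```

The named entities fail in plain XML because only five entities are predefined there, while numeric references and literal Unicode text pass through untouched. Stripping the unresolved entities makes the markup parse again, at the cost of producing exactly the kind of "Its navet" text described in the post.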


Problemas comuns de formatação de e-book April 15, 2015 at 8:19 am

[…] last, I read an article on The Digital Reader that, in turn, comments on a piece published on Science Blog, in which the reader Martin […]


Patrick Kitts March 10, 2017 at 2:09 am

I just purchased and downloaded a book from Google Play. Not a single apostrophe to be seen in any contractions. I’ll, he’ll, she’ll and we’ll become Ill, hell, shell and well. It’s 2017 and this is unacceptable. If I find out whos responsible for this grammatical butchery hell have hell to pay.

