About That "Dramatic Growth of Swearing in Books"
A recently published academic paper has shown that the weakening of censorship inside publishers has led to a dramatic increase in the use of swear words in American books since 1950.
Mark Twain wrote: “There ought to be a room in every house to swear in,” because “it’s dangerous to have to repress an emotion like that”. Today, the great American novelist might have applauded the increase in cursing, with a new study identifying a “dramatic” increase in swear words in American literature over the last 60 years.
Sifting through text from almost 1m books, the study found that “motherfucker” was used 678 times more often in the mid-2000s than the early 1950s, occurrences of “shit” multiplied 69 times, and “fuck” was 168 times more frequent.
Led by Jean Twenge, author and psychology professor at San Diego State University, the team analysed the titles making up the Google Books corpus of American English books published between 1950 and 2008, looking for uses of the words “shit”, “piss”, “fuck”, “cunt”, “cocksucker”, “motherfucker”, and “tits”. They picked these words because they were described as the “seven words you can never say on television” by comedian George Carlin in 1972.
Overall, they found that writers were “significantly more likely to use each of the seven swearwords in the years since 1950”, with books published in 2005-2008 28 times more likely to include swearwords than books published in the early 1950s.
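The "times more likely" figures are ratios of how often a word appears, relative to all words, in one period versus another. Here is a minimal sketch of that arithmetic in Python, assuming you already have yearly relative frequencies in hand. The function name and the numbers are made-up placeholders, not values from the study or from Google, and the paper's exact method may differ.

```python
from statistics import mean

def fold_change(freq_by_year, early_years, late_years):
    """Ratio of the mean relative frequency in late_years to that in early_years."""
    early = mean(freq_by_year[y] for y in early_years)
    late = mean(freq_by_year[y] for y in late_years)
    if early == 0:
        raise ValueError("word does not appear in the early period")
    return late / early

# Placeholder relative frequencies (share of all words), NOT real Ngram data.
example = {
    1950: 1e-8, 1951: 1e-8, 1952: 1e-8,
    2005: 3e-7, 2006: 3e-7, 2007: 3e-7,
}
ratio = fold_change(example, range(1950, 1953), range(2005, 2008))
print(f"{ratio:.1f} times more frequent")
```

Averaging over a few years on each side keeps a single odd year from dominating the ratio, which matters when the early counts are tiny.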
“I had guessed that the use of swearwords would increase, but I was surprised that the increase was so large – 28 times more,” Twenge said.
What caught my eye about this story was that the data came from Google Ngram Viewer and is publicly available. This means we get to play with it, too.
I’ve been exploring the data this morning, and I found an interesting quirk in the source data set.
Google has two different data sets for Google Ngram Viewer. (Actually, there are ten or so, but most of them can be divided into one of two groups.) One is based on scans of books digitized in 2009, and the other is based on scans digitized in 2012. According to Google, the latter has "more books, improved OCR, improved library and publisher metadata."
I found this on my own, but the researchers were also aware of the discrepancies. They wrote it off in a footnote because they "could not determine whether this downturn was due to an error in this database or to some other cause".
I don’t know about you, but I thought Google’s explanation of "more books" accounted for the differences well enough.
The 2009 data set was likely missing books affected by the Google Books lawsuit, while the 2012 data set could have included titles that Google was selling in Google Play Books, or books that publishers uploaded to Google Books during one of the periods when that lawsuit was almost settled.
Either way, the more complete data set paints a very different picture.
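If you want to check for yourself where the two versions disagree, one simple approach is to line up the same word's yearly frequencies from each data set and flag the years where they differ by more than some factor. This is only a sketch: the function name, the 2x threshold, and the numbers are my own placeholders, not real Ngram values.

```python
def diverging_years(series_2009, series_2012, threshold=2.0):
    """Return years where the two series differ by more than `threshold` times."""
    flagged = []
    # Only compare years present in both data sets.
    for year in sorted(set(series_2009) & set(series_2012)):
        a, b = series_2009[year], series_2012[year]
        if a == 0 or b == 0:
            # A zero on one side only counts as a divergence.
            if a != b:
                flagged.append(year)
            continue
        if max(a, b) / min(a, b) > threshold:
            flagged.append(year)
    return flagged

# Placeholder relative frequencies for one word, NOT real Ngram data.
old = {2000: 1.0e-7, 2004: 1.2e-7, 2008: 0.5e-7}
new = {2000: 1.1e-7, 2004: 1.3e-7, 2008: 1.6e-7}
print(diverging_years(old, new))
```

With these placeholder numbers only the final year is flagged, since that is the one where the two series differ by more than a factor of two.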