Sunlight Foundation’s New Anti-Churnalism Tool May Catch Some Types of Plagiarism But It Misses the Point Completely

24.04.2013 Nate Hoffelder

A handy dandy new site is getting a lot of attention this week, but I think it focuses on the wrong problem.

Churnalism is a term for the practice of journalists copy-pasting press releases rather than writing original copy. This is more prevalent than you might think, especially in tech blogging, though the practice is frowned upon. But thanks to the new Churnalism site it won’t be quite so easy to get away with.

Here’s more from The Atlantic:

Today, the Sunlight Foundation has unveiled a tool that will help us all with this work. "The tool is, essentially, an open-source plagiarism detection engine," web developer Kaitlin Devine explained to me. It will scan any text (a news article, e.g.) and compare it with a corpus of press releases and Wikipedia entries. If it finds similar language, you’ll get a notification of a detected "churn" and you’ll be able to take a look at the two sources side by side. You can also use it to check Wikipedia entries for information that may have come from corporate press releases. The tool is based on a similar project released in the United Kingdom two years ago, which the Sunlight Foundation supported with a grant to make it open source. Churnalism will be available both on the website and as a browser extension. Its database of press releases includes those from EurekaAlert! in addition to PR Newswire, PR News Web, Fortune 500 companies, and government sources.

Okay, this will stop bloggers from simply copying chunks of a press release. But guess what? It’s not going to improve the quality of the journalism – not one bit. All this site will accomplish is to make bloggers slightly more skilled at rewriting press releases, thus contributing to the larger problem.
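To make that concrete, here is a minimal sketch of one common way to detect copied text: comparing overlapping runs of words ("shingles"). This is an assumption for illustration only, not the Sunlight Foundation's actual algorithm; the function names, the 5-word window, and the sample sentences are all made up.

```python
import re

def shingles(text, n=5):
    """Return the set of n-word sequences (shingles) in text, lowercased."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap(article, release, n=5):
    """Fraction of the article's shingles that also appear in the release."""
    article_shingles = shingles(article, n)
    if not article_shingles:
        return 0.0
    return len(article_shingles & shingles(release, n)) / len(article_shingles)

# Hypothetical sample texts, invented for this sketch.
release = ("Acme Corp today announced the launch of its new ereader, "
           "which ships next month for 99 dollars.")
copied = ("Big news: Acme Corp today announced the launch of its new ereader, "
          "which ships next month for 99 dollars.")
rewritten = ("Next month Acme will start selling a 99 dollar ereader, "
             "the company said today.")

print(overlap(copied, release))     # high: most shingles match the release
print(overlap(rewritten, release))  # zero: same facts, no shared 5-word run
```

The copied post lights up, while the rewritten post carries exactly the same spoon-fed information and sails through. A matcher like this measures wording, not original work.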

The larger problem in tech blogging isn’t copying press releases; it’s the lack of original work. That includes reposting a hot story originally covered elsewhere or reposting a juicy rumor, and in this case it also includes rewriting press releases.

It’s not uncommon for bloggers to rewrite a press release and include nothing more than what they were spoon-fed. Speaking as someone who has been watching the blogosphere from the inside for the past four years, I can say this is quite common in tech blogging. In fact, any time there is a major launch or a major press release, most coverage will be the same generic article.

Take the recent year-end US book market stats from the AAP. Almost everyone (outside of the digital publishing blogosphere) who covered the story did little more than rewrite the press release. I’m sure you recall the statistics about the rise of the ebook market (from half a percent to 22.55%)? Those stats were handed to us in the press release, and so were most of the other details reported. (Everyone who used those stats made the same mistake, too, but more on that later.)

TBH I don’t really see a difference between a post that contains a copied chunk of a press release and one that was rewritten to hide the source. When compared to what the blogger is supposed to be doing (thinking, researching, and writing), a copied press release and a rewritten press release are effectively the same thing.

I really don’t think this tool is worth the effort it took to develop it. It doesn’t solve the larger problem, and I can give you an example to show you why this is a serious issue.

I mentioned the AAP year-end stats earlier in the post. There was a problem with the stats offered in the press release, one that I only discovered because I noticed a discrepancy, investigated, and basically did my job.

To put it simply, anyone who used the 22.55% ebook market share figure and tied it to the $7.1b figure for the 2012 US book market was set up by the AAP to make a mistake.

This is not easy to explain, but the 22.55% figure isn’t based on all of the sales data collected in 2012. The simplest (and possibly not completely accurate) way to put it is that the 22.55% is based on a subset of the sales data points that match up with the data points collected in earlier years. In other words, the 22.55% is based on a data set that excluded some figures that went into the $7.1b sales estimate for the US book market.

The intention was to make the 22.55% relatable to the percentages mentioned for the earlier years, but the result was that many bloggers wrote something like "in 2012 ebooks were 22.5% of the $7.1b US book market", a statement which is not at all true.
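A rough bit of arithmetic shows why mixing the two figures goes wrong. Only the 22.55% and the $7.1b come from the press release; the size of the comparable subset below is a purely hypothetical placeholder, since I don't have the real number, and the variable names are mine.

```python
# Figures quoted from the press release.
EBOOK_SHARE = 0.2255   # 22.55% ebook share
TOTAL_MARKET = 7.1e9   # $7.1b estimate for the 2012 US book market

# HYPOTHETICAL: the smaller data set the 22.55% was actually computed on.
# The real value is not given in this post; 5.0e9 is a placeholder.
COMPARABLE_SUBSET = 5.0e9

def dollars(share, base):
    """Dollar amount implied by applying a share to a given base."""
    return share * base

# What many posts implied: 22.55% of the full $7.1b market.
naive = dollars(EBOOK_SHARE, TOTAL_MARKET)

# What the percentage actually refers to: 22.55% of its own base.
consistent = dollars(EBOOK_SHARE, COMPARABLE_SUBSET)

print(f"naive:      ${naive / 1e9:.2f}b")
print(f"consistent: ${consistent / 1e9:.2f}b")
```

Whatever the real subset size is, the point stands: a percentage only means something against the base it was computed on, and the $7.1b is a different base.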

Do you see the problem? Everyone accepts the 22.55% as legit, only it really has nothing to do with the publicly available sales data for the US book market in 2012. But now that it has been repeated unquestioningly, it has become the legitimate statistic.

The larger problem in journalism isn’t blatant copying; it’s the lack of thought. We are all too often robotically repeating what we are told without thinking it through. And that includes me; I am almost as guilty as anyone else.

I’m sure you’ve seen this problem before; any time a rumor comes around that mentions a hot topic, the rumor will be repeated no matter how ridiculous. And the same goes for tech stories that sound good but aren’t true.

Frankly, I don’t see a way to solve the real problem. The only real solution is to raise one’s professional standards, which is difficult to sustain and cannot be forced upon others.