by Andrew Rhomberg, founder of Jellybooks
The promise of a formula for predicting a best-seller is getting many in the publishing industry and those who write about books excited.
Several journalists contacted me for an opinion about the book because of my background in pub tech and reader analytics. Thus I became interested in reading the book and the book’s publisher St. Martin’s Press was kind enough to provide me with an advance reader copy last week.
First of all this is a delightful book to read. I would recommend it as both an entertaining and educational read for anybody interested in the business of books. This is not a magisterial work like “Merchants of Culture” by J. Thompson, but a book written for the mass market with lots of anecdotes and examples that readers and authors can relate to.
It is a book for a general audience and avoids jargon and “academic” language as far as possible. The “code” is based on some of the latest advances in machine learning as applied to literature, but the authors keep the computer science behind the book to a minimum. There is no mention of “big data” or artificial intelligence, just plain and simple descriptions of what the “black box” does, with references for interested readers who want to find out more about the inner workings of that black box.
However, there is a statement in the book that is misunderstood by many of those who interviewed me about it, and that is “the algorithm can predict if a book will be a best-seller with accuracy 80%”.
I had a sense when being interviewed that most journalists thought this meant something along the following lines: “if there are something like 500 New York Times best-sellers this year, then this algorithm can produce a list of 500 titles and 400 of those will indeed turn out to be best-sellers”. Well, that’s not actually what 80% accuracy means. The misunderstanding is in the “will produce a list of 500”.
One needs a bit of statistics knowledge to understand this better. I will first provide (with some statistical elaboration) how the authors describe the 80% accuracy:
If the algorithm is applied to 50 books that are genuinely best-sellers then it will recognize that 40 of these (80%) are indeed best-sellers, but will classify incorrectly (“falsely”) that 10 of the books (20%) are not best-sellers (a “negative” result). Thus the 10 titles that are missed are what statisticians call the “false negatives”.
Now, if the algorithm is applied to 50 books that are known not to be best-sellers, then it will recognize that 40 of these (80%) are indeed not best-sellers, but will classify incorrectly (“falsely”) that 10 of the books (20%) are, in the opinion of the algorithm, best-sellers (a “positive” result) when in fact they never were NYTimes best-sellers. Thus, these 10 titles that are incorrectly predicted to be best-sellers are what statisticians call the “false positives”.
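The two cases above can be tallied in a few lines of Python. This is only a sketch: the function name `confusion_counts` is made up for illustration, and it assumes, as the description above does, that the same 80% hit rate applies to best-sellers and to non-best-sellers alike.

```python
def confusion_counts(n_best_sellers, n_others, hit_rate):
    """Split both groups into correct and incorrect calls.

    Assumption: the algorithm is right equally often on both groups.
    """
    true_positives = round(n_best_sellers * hit_rate)    # best-sellers recognized
    false_negatives = n_best_sellers - true_positives    # best-sellers missed
    true_negatives = round(n_others * hit_rate)          # other books recognized
    false_positives = n_others - true_negatives          # other books wrongly flagged
    return true_positives, false_negatives, true_negatives, false_positives


tp, fn, tn, fp = confusion_counts(50, 50, 0.80)
accuracy = (tp + tn) / (tp + fn + tn + fp)

print("true positives:", tp)
print("false negatives:", fn)
print("true negatives:", tn)
print("false positives:", fp)
print("accuracy:", accuracy)
```

With an even split of 50 and 50, the false positives and false negatives are equally numerous, which is why the 80% figure looks so reassuring in this setting.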
Let’s construct a different scenario. Imagine a Barnes & Noble megastore in the Midwest with 200,000 nicely ordered titles on its shelves including 1,000 titles in a section called “Past and Present New York Times Best-Sellers”.
Now a mob of Trump supporters enters the store and throws all the books on the floor in protest at Trump’s “Art of the Deal” not being displayed in the best-seller section. They don’t actually take any of the books with them, because, well, they are not really interested in reading books, so there are now 200,000 books lying in a jumble on the floor.
A poor BN intern is now assigned to put the 1,000 best-sellers back on the shelf, but, being an intern, the intern has no idea what makes a best-seller and thus the intern decides to make use of this magic new algorithm.
The poor intern now tests all 200,000 books against the algorithm.
When applied to the 1,000 best-sellers the algorithm identifies 800 of them correctly as best-sellers, but dismisses 200 as not being best-sellers.
Now it gets interesting though. When analyzing the remaining 199,000 books, the algorithm identifies 80%, that is 159,200 books, as not being best-sellers, but it believes (incorrectly) that the rest (20%) are in fact NYTimes best-sellers. That is a whopping 39,800 books. Our intern using the algorithm identified a total of 40,600 (39,800 + 800) books as NYTimes best-sellers. Instead of just the 1,000 NYTimes best-sellers he was looking for, he ends up with 800 real ones buried among 39,800 false “best-sellers”, while missing out on 200 real NYTimes best-sellers that were incorrectly classified by the algorithm. That is what 80% accuracy means.
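For readers who want to check the arithmetic, here is the bookstore scenario as a short Python sketch. The variable names are my own, and the only inputs are the figures given above: 200,000 books, 1,000 real best-sellers, and an 80% hit rate assumed to apply to both groups.

```python
total_books = 200_000
real_best_sellers = 1_000
hit_rate = 0.80  # assumed identical for best-sellers and all other books

other_books = total_books - real_best_sellers

true_positives = round(real_best_sellers * hit_rate)    # real best-sellers found
false_negatives = real_best_sellers - true_positives    # real best-sellers missed
true_negatives = round(other_books * hit_rate)          # other books correctly dismissed
false_positives = other_books - true_negatives          # other books wrongly flagged

flagged = true_positives + false_positives              # everything the intern shelves

print(f"other books:      {other_books:,}")
print(f"found:            {true_positives:,}")
print(f"missed:           {false_negatives:,}")
print(f"dismissed:        {true_negatives:,}")
print(f"false positives:  {false_positives:,}")
print(f"flagged in total: {flagged:,}")
```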
We applied the algorithm to a large sample that had many books in it that were not best-sellers, and as a result the algorithm produced many, many false positives.
It did do its job though. Whereas the original 200,000 books contained only 0.5% best-sellers (i.e. 1,000 books), the new, smaller list of 40,600 books contains about 2% best-sellers (800 books), a fourfold “enrichment” that came at the cost of 200 best-sellers going missing, because the algorithm is not 100% perfect.
Now, we could play this game a bit differently. The intern is lazy and fills the shelf with the first 1,000 books that the algorithm identifies as being best-sellers. Well, based on the above enrichment factor we know that among the first 1,000 books the intern selects, about 2% (i.e. 20 books, or between 19 and 20 if we don’t round) will be best-sellers. So the new “best-seller” shelf will consist almost entirely of books that are not best-sellers. There is even a 1 in 200 chance that Trump’s book will end up on the shelf.
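The enrichment factor and the lazy intern's shelf can be worked out the same way. The sketch below rests on two assumptions that the scenario implies but does not spell out: the first 1,000 flagged books are effectively a random draw from everything the algorithm flagged, and the Trump title is treated as one of the 199,000 non-best-sellers. Statisticians call the share of flagged books that are real best-sellers the "precision".

```python
total_books = 200_000
real_best_sellers = 1_000
true_positives = 800        # real best-sellers the algorithm flags
false_positives = 39_800    # other books it flags by mistake
flagged = true_positives + false_positives

base_rate = real_best_sellers / total_books   # share of best-sellers on the floor
precision = true_positives / flagged          # share of best-sellers among flagged books
enrichment = precision / base_rate            # how much the algorithm improved the odds

shelf_size = 1_000
expected_real = shelf_size * precision        # real best-sellers on the lazy intern's shelf

# Chance that one particular non-best-seller ends up on the shelf:
# it must be flagged, then be among the 1,000 picked from the flagged pile.
p_flagged = false_positives / (total_books - real_best_sellers)
p_on_shelf = p_flagged * shelf_size / flagged

print(f"base rate:     {base_rate:.2%}")
print(f"precision:     {precision:.2%}")
print(f"enrichment:    {enrichment:.2f}x")
print(f"expected real: {expected_real:.1f} of {shelf_size:,}")
print(f"one given non-best-seller on the shelf: about 1 in {round(1 / p_on_shelf)}")
```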
Now, this result doesn’t sound quite as impressive, does it? But this is what 80% accuracy means. It will not turn publishing on its head, given that one million new books or manuscripts are written every year. An algorithm with 80% accuracy will just not cut it, but don’t be deterred from reading the book. It still offers some genuine and novel insights as to what makes a best-seller. However, it is not going to put acquisition editors out of a job.
However, machines are getting smarter, machine learning improves, and artificial intelligence is getting more intelligent. What if the algorithm were 99.9% accurate rather than just 80% accurate? In that case the intern would have correctly identified 999 of the 1,000 best-sellers lying randomly on the floor as NYTimes best-sellers and missed only one. But the intern also had to test the 199,000 other books and that would have produced 199 “false positives”, meaning he would have 1,198 books to put on the shelves, 198 more than he would expect if the algorithm were 100% accurate (like an inventory list with no mistakes or typos).
Now that would sound a hell of a lot more impressive, but an algorithm that is 99.9% accurate is still a long way off for the simple reason that human taste and fashion are so incredibly unpredictable. Book publishing will always be a bit of a lottery, but that does not mean the odds cannot be improved with good data and smart algorithms. At my own company, Jellybooks, the emphasis is on generating good data. That means understanding how people read books and when they recommend them, not just judging success based on sales data or a book’s position on a particular best-seller list.
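To see the two accuracy levels side by side, the same bookstore can be run through a small helper. Again this is a sketch under the scenario's own assumptions: `shelf_summary` is a name invented here, and the hit rate is taken to be the same for best-sellers and for everything else.

```python
def shelf_summary(total_books, real_best_sellers, hit_rate):
    """Return (found, missed, false_alarms, flagged) for one hit rate.

    Assumption: the same hit rate applies to best-sellers and to all other books.
    """
    other_books = total_books - real_best_sellers
    found = round(real_best_sellers * hit_rate)
    missed = real_best_sellers - found
    false_alarms = other_books - round(other_books * hit_rate)
    return found, missed, false_alarms, found + false_alarms


for rate in (0.80, 0.999):
    found, missed, false_alarms, flagged = shelf_summary(200_000, 1_000, rate)
    share_real = found / flagged
    print(f"{rate:.1%} accurate: {flagged:,} flagged, {found:,} real, "
          f"{missed:,} missed, {false_alarms:,} false positives, "
          f"{share_real:.1%} of the shelf is real")
```

Even at 99.9%, roughly one book in six on the shelf would not belong there, simply because non-best-sellers outnumber best-sellers 199 to 1.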
Code will appear more and more in publishing even if code can't write novels yet or predict with 100% accuracy the next NYTimes best-seller.