For those who have ever wondered how many different books are out there in the world, Google has an answer for you: 129,864,880, according to Leonid Taycher, a Google software engineer who works on the Google Books project.
Estimating the number of books in the world is more than an exercise in curiosity for the search giant: It also provides a roadmap of some of the work still left to be done in meeting the company’s ambitious goal of organizing all the world’s information.
“When you are part of a company that is trying to digitize all the books in the world, the first question you often get is: ‘Just how many books are out there?’,” Taycher explained in a blog post announcing the estimate.
To come up with a reasonable approximation, the company started by ingesting book information from multiple cataloging systems, such as the International Standard Book Number (ISBN) system.
Such catalogues, while helpful, do not provide a definitive count. For instance, ISBNs have been assigned to books only since the 1960s, and tend to be used mainly in Western countries.
Also, some ISBNs have been assigned to more than one book, and publishers have assigned ISBNs to items other than books, such as T-shirts and DVDs.
So Google engineers have written programs to comb through about 150 such catalogues and directories, and to eliminate as many duplicate entries as could be found.
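Taycher's post is the authority on how Google actually does this. As a rough illustration only, the Python sketch below shows one simple way duplicate catalogue entries could be collapsed: it normalizes ISBN-10 and ISBN-13 values to a single form and falls back to a cleaned-up title and author when no valid ISBN is present. The record layout (`isbn`, `title`, `author` keys) and all function names are assumptions made for this example, not Google's code, and the sketch deliberately ignores the harder cases mentioned above, such as one ISBN shared by several books.

```python
import re
import unicodedata


def is_valid_isbn10(isbn: str) -> bool:
    """Check the ISBN-10 checksum (weights 10..1, sum divisible by 11)."""
    if not re.fullmatch(r"\d{9}[\dX]", isbn):
        return False
    total = sum((10 - i) * (10 if c == "X" else int(c)) for i, c in enumerate(isbn))
    return total % 11 == 0


def is_valid_isbn13(isbn: str) -> bool:
    """Check the ISBN-13 checksum (alternating weights 1 and 3, sum divisible by 10)."""
    if not re.fullmatch(r"\d{13}", isbn):
        return False
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(isbn))
    return total % 10 == 0


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert a valid ISBN-10 to ISBN-13 by adding the 978 prefix and a new check digit."""
    core = "978" + isbn10[:9]
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    return core + str((10 - total % 10) % 10)


def normalize_isbn(raw):
    """Return a canonical ISBN-13 string, or None if the value is missing or invalid."""
    cleaned = re.sub(r"[\s-]", "", raw or "").upper()
    if is_valid_isbn13(cleaned):
        return cleaned
    if is_valid_isbn10(cleaned):
        return isbn10_to_isbn13(cleaned)
    return None


def text_key(text: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", text or "").casefold()
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return re.sub(r"[\W_]+", " ", no_accents).strip()


def record_key(record: dict):
    """Hypothetical matching rule: prefer the ISBN, else fall back to title plus author."""
    isbn = normalize_isbn(record.get("isbn"))
    if isbn:
        return ("isbn", isbn)
    return ("text", text_key(record.get("title", "")), text_key(record.get("author", "")))


def deduplicate(records):
    """Keep the first record seen for each key, preserving input order."""
    seen = {}
    for record in records:
        seen.setdefault(record_key(record), record)
    return list(seen.values())
```

Under this rule, a record carrying the ISBN-10 `0-306-40615-2` and another carrying the ISBN-13 `9780306406157` count as the same entry. A real pipeline would need far more than this, including the editorial decisions about what counts as a distinct book.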
The company also had to make a number of tough decisions about what is and isn’t a book, Taycher explained.
For instance, softcover and hardcover editions of a text are counted as two books, as are the many different versions of a popular text, such as Shakespeare’s “Hamlet,” because of the forewords and commentaries they may contain. Serials may count as individual books or as a collected work.
As of June, the company had scanned 12 million books, according to a presentation given by Google Books engineering manager Jon Orwant at the USENIX Annual Technical Conference in Boston. These books are written in about 480 languages (including three books in Klingon, the language that originated in Star Trek).
The company plans to complete the scanning of existing books within a decade. The resulting virtual collection will consist of four billion pages and two trillion words, Orwant said.
About 20 percent of the world’s books are in the public domain, Orwant explained. About 10 to 15 percent of all books are in print. The remaining books, the vast majority of all titles, are still under copyright but out of print. Google is borrowing copies of these books from about 40 large libraries worldwide in order to digitize them.
It’s this act of scanning books that are out of print but still covered by copyright that has met with some resistance from the publishing industry.
The company is now waiting for a judgment from the U.S. District Court for the Southern District of New York on whether it can scan these books.
In 2005, the Authors Guild and the Association of American Publishers separately filed lawsuits against the search giant, asserting that the company was infringing on authors’ copyrights by scanning the books.
Google has claimed it wants to sell digital copies of these otherwise out-of-print books, and set aside royalties for the authors to claim. The company also hopes to reveal snippets of these books in Web searches, and claims this use falls under the U.S. Fair Use doctrine.
Scanning all the world’s books will lead to other benefits in addition to improving searches, Orwant explained. Once all these volumes are digitized, their contents can be subjected to analysis, which can lead to new insights. Linguists can discover when certain words came into widespread use, or who first started using these words.
Google Book Search could also help answer some outstanding historical questions: For instance, it could inform the debate over whether Isaac Newton or Gottfried Leibniz, or someone else entirely, invented calculus.
“We can search not just for a phrase but for a concept,” Orwant explained. “We can take all the different ways [that the idea of] infinity can be inflected, translate that into different languages, and do a search in parallel.”
“My hope is that as we start to expose a lot more of this collection, it will allow people to ask questions like this that they haven’t been able to ask before,” he said.
IDG News Service editor Juan Carlos Perez contributed to this report.
Joab Jackson covers enterprise software and general technology breaking news for The IDG News Service. Follow Joab on Twitter at @Joab_Jackson. Joab’s e-mail address is Joab_Jackson@idg.com.