10 February 2006

Book Scanning Projects

Two book scanning projects were outlined at different plenary sessions at Vala 2006: the Million Book project and the Google Print project (now renamed Google Book Search, so that users do not expect to print a page). The first project started in 2001 and has major scanning sites in India, China, Egypt and the U.S., with a major focus on multilingual issues and a mobile print-on-demand van to visit remote and rural villages in India. Professor Narayanswamy Balakrishnan gave an informative and entertaining presentation showing the audacious nature of the project. In 2001, who would have thought they could scan a million books by 2005? As it turns out, the project expects to scan the millionth book this year. The approach to selection was very simple: it doesn't matter what is scanned, because selecting and considering a book for scanning takes much longer than actually scanning it, so just do it. And the value of books can change over time; as the professor said, "a two year old car is useless, but a 200 year old car is a classic".

Daniel Clancy, Engineering Director of the Google Print project, presented a similar approach. Google is willing to scan all of its partners' collections, but it has started with content that is less easily accessible right now, e.g. material held in storage collections. Both projects also share the same philosophy on scanning quality: good enough is OK for now, because it is better to get 80% of books scanned and accessible to users than to dwell on quality issues and make little progress. Professor Balakrishnan explained that quality will improve as more books are scanned: with more in the database, the computers can be better trained to correct mistakes in the process. Google's approach is to build better software that will correct quality issues such as skewing, blur, and OCR mistakes. In addition, the community of users may be able to assist with error corrections via the Google Books display.