Digital versions of a large number of articles published between 1830 and 1922 can be found in the Making of America (MoA) collection at the Cornell University and University of Michigan Libraries.
Cornell University and University of Michigan have different parts of
the Making of America digital Library with slightly different basic
Searching the Cornell MoA catalog finds a lot of journal articles. Entries need to be selected initially by reading the titles, and then narrowed to a short-list by looking at the page images.
Searching the Michigan MoA also gives many hits. Unfortunately it does not allow filtering or ordering by the number of keyword matches in each journal article. Sorting would have been the simplest way to remove the single-hit cases. Most of the articles with more than 5 matches of Ceylon were interesting and about Lanka. However, when a page is split between two articles, half of the matches are allocated to each of the articles. This leads to adjacent unrelated stories getting picked up with the specified keyword. Even so, selecting a clean short-list of interesting articles was not difficult.
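The kind of filtering and ordering wished for here can be sketched in a few lines of Python. This is only an illustration and not anything the MoA site offers: the function names, the `articles` mapping of title to OCR text, and the `min_hits` threshold (6, i.e. "more than 5 matches") are all assumptions made for the example.

```python
import re

def keyword_hits(text, keyword):
    """Count whole-word, case-insensitive occurrences of keyword in text."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

def shortlist(articles, keyword, min_hits=6):
    """Return (hits, title) pairs with at least min_hits matches, most hits first.

    `articles` is a hypothetical dict mapping an article title to its OCR text.
    """
    counted = [(keyword_hits(text, keyword), title)
               for title, text in articles.items()]
    kept = [pair for pair in counted if pair[0] >= min_hits]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)
```

Calling `shortlist(articles, "Ceylon")` would drop the single-hit cases and put the articles with the most matches at the top. It would not fix the split-page problem, since that comes from how matches are allocated to articles in the first place.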
It was a pleasure to find so many interesting articles on Ceylon (Lanka) in this Making of America collection. These are a subset of what was published on Ceylon in the 19th century.
I wrote to both Cornell University and the University of Michigan for permission to post reformatted or new OCR text of selected articles on Ceylon from the MoA digital Library collection as time permits. Both UMich.Edu and Cornell.edu have granted formal permission for posting reformatted ASCII text on Lanka from the MoA on this non-commercial site.
However, the OCR text in MoA, done mainly to index the collection, is not very readable. For example, the 1891 article The city of the sacred Bo-tree by James Ricalton can be found at the Cornell Digital Library. Note that the MoA copy is faulty: text page 322 is missing (image page 321 is duplicated), and the double-column plain text of pages 327, 332 has been OCR'ed as single-column text, making it hardly readable. It illustrates some typical errors from automated scanning and OCR.
If anyone is interested in contributing by proofreading some of the MoA articles, please let me know which titles you will work on. I will let you know if anyone else is working on them, to ensure there is no duplication of effort. The articles reformatted so far are listed. See the guide with manual text editing instructions.
The error of OCRing multi-column plain text as a single column is
unexpectedly frequent. Four out of the eight multi-column articles,
from three different journals, that I looked at on the University of
Michigan MoA site were this way. This is a pity, since the number of
columns on a page could easily have been predefined for the journal,
or for a range of pages, before OCR.
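To illustrate why a predefined column count matters, here is a minimal Python sketch. It assumes, purely for the example, that the OCR step yields each word with the pixel position of its top-left corner, that the columns are of equal width, and that words on the same printed line share the same y value. The function name and input format are hypothetical and are not taken from MoA or any OCR package.

```python
def reading_order(words, page_width, n_columns):
    """Return word texts in reading order for a page with n_columns columns.

    words: list of (x, y, text) tuples, an assumed OCR output format.
    Columns are assumed to be of equal width across the page.
    """
    column_width = page_width / n_columns

    def key(word):
        x, y, _ = word
        # Column index first, then top-to-bottom, then left-to-right.
        column = min(int(x // column_width), n_columns - 1)
        return (column, y, x)

    return [text for _, _, text in sorted(words, key=key)]
```

With `n_columns=1` the words of the two columns come out interleaved line by line, which is the error seen in the MoA text. With `n_columns=2` the left column is read to the bottom before the right column starts.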
The MoA documentation claims 99% OCR accuracy. Is that words or characters? I suspect a 1% error rate in characters is messing up over 5% of the words.
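The suspicion can be checked with a rough calculation. Assuming, as a simplification, that each character is misread independently with probability p, a word of n characters contains at least one error with probability 1 - (1 - p)^n. The function below is only a sketch of that arithmetic, and the independence assumption is mine, not something stated in the MoA documentation.

```python
def word_error_rate(char_error_rate, word_length):
    """Probability that a word of word_length characters has at least one
    misread character, assuming independent character errors."""
    return 1.0 - (1.0 - char_error_rate) ** word_length
```

Under this assumption, `word_error_rate(0.01, 5)` is about 0.049 and `word_error_rate(0.01, 6)` is about 0.059, so a 1% character error rate would indeed spoil roughly 5% or more of words of ordinary length.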
Textbridge 9 on a 600 dpi scan gave much better OCR results than those posted at the MoA site. As an experiment we put the 100% scans at MoA into Textbridge 9 and found that even that gave fewer errors than the ASCII text posted. Textbridge 9 also automatically merges any words broken at the end of a line with a ``-''. Sometimes, however, the ``-'' is a genuine hyphen between words; these cases are probably recognized using a spell-check. This indicates that the MoA Library should be able to OCR their existing scans again with the latest OCR software, to generate much cleaner ASCII text and better recognition of multi-column pages as such.
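How a spell-check might separate a word broken at a line end from a genuine hyphen can be sketched as follows. This is a guess at the general idea, not a description of how Textbridge works. The function name is hypothetical, and `dictionary` is assumed to be a set of known lowercase words supplied by the caller.

```python
def join_broken_words(lines, dictionary):
    """Merge a word split by '-' at a line end when the joined form is a
    known word; otherwise keep the '-' as a real hyphen."""
    out = []
    carry = ""  # first half of a word broken at the previous line end
    for line in lines:
        words = line.split()
        if carry and words:
            joined = carry + words[0]
            # Ignore trailing punctuation when looking the word up.
            if joined.lower().strip('.,;:!?"') in dictionary:
                words[0] = joined
            else:
                words[0] = carry + "-" + words[0]
            carry = ""
        if words and words[-1].endswith("-") and len(words[-1]) > 1:
            carry = words.pop()[:-1]
        out.append(" ".join(words))
    if carry:
        # Text ended on a broken word; put it back unchanged.
        out[-1] = (out[-1] + " " + carry + "-").strip()
    return out
```

With a dictionary containing "capital" but not "botree", the broken word "capi-" / "tal" is merged into "capital", while "Bo-" / "tree" is kept as "Bo-tree". A real word list would be needed for this to be useful, and proper names would still need proofreading.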
I hope both Cornell and UMich will decide to redo the OCR of the full page-image database in automated batch mode, which with faster computers and better OCR software should surely not be as difficult as when it was first done a few years ago. The scanning is clearly the hard part, and it has already been done. Another option is to OCR on the fly, just as they convert the 600dpi .tif images to 100dpi or 200dpi .gif on request to view a page. The CPU time required for OCR is probably not significantly different. Otherwise it could be a special option which a reader can select after finding that a page has been OCRed badly. The ASCII text in the database could then be replaced automatically, so that it self-corrects as it is read by users. Unfortunately I understand the page servers are UNIX and the Textbridge OCR runs on MS-Windows, which is clearly not designed for automated batch processing. The solution is clearly a Unix version of TextBridge.
For a large index of online articles on digital archives read RLG DigiNews.
A lot of research has been done recently on voice and handwriting recognition. IMHO insufficient research has been done to improve OCR, which is a much simpler problem but has less current business interest. OCR of knowledge in the libraries is purely of academic interest. Having personally developed software to automate the MDS analysis of Hubble Space Telescope galaxy images, I feel confident that an almost perfect OCR can probably be done by applying current computer image processing algorithms to this task. A few years ago I discussed OCR with Raj Reddy, who was Dean of the School of Computer Science at Carnegie Mellon University, and also suggested a rare book on Hindoostan for his Universal Library (UL) project. His practical suggestion, however, was to get it typed in India, where data entry is cheap and most often faster for text from old books.
I have many books/documents I would like to put online. I once got an estimate from India for typing at US$ 0.50 per thousand (1K) words. For a good typist at 33 words per minute (about 2000 words per hour), LKRs 100/- per hour is not too unreasonable IMHO for someone in Lanka. OCR may be faster for someone who is not a fast typist. I have not yet found someone in Lanka willing to type, or OCR and proofread, documents and E-mail them in at that price. For example, see document which is about 8K words. Please E-mail me any offers or comments.
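The figures above can be put together in a small Python sketch. The function name and its defaults are just the numbers quoted in this paragraph, and the result is a back-of-envelope estimate that ignores proofreading time.

```python
def typing_estimate(words, usd_per_1k=0.50, words_per_minute=33, lkr_per_hour=100):
    """Rough cost and time to type `words` words at the rates quoted above."""
    hours = words / (words_per_minute * 60)
    return {
        "usd": words / 1000 * usd_per_1k,  # at the per-thousand-word rate
        "hours": hours,                    # at the typing speed
        "lkr": hours * lkr_per_hour,       # at the hourly rate
    }
```

For the 8K word document, `typing_estimate(8000)` works out to US$ 4 at the per-word rate, or about 4 hours of typing and so about LKRs 400/- at the hourly rate.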
There is a lot of content which could be put online. IMHO it is a great project for kids. My son Rhajiv did http://lakdiva.org/codrington/ about 2 years ago when he was 14. He learnt both Lankan history and HTML in the process. In August 1998 I posted chapter 80 of the Mahavamsa. We estimated the whole at just about 200K words without all the footnotes. My son got motivated to start on it in the summer of 2002 and has posted the first volume of the Mahavamsa. He hopes to have the Chulavamsa online by next year.
Older text such as this and more recent feature articles on Lanka add significantly to the richness of information on Lanka on the web. I strongly encourage others to add pages like this from documents they own or can get permission to post online. Most publications older than 1922 are free from Copyright issues.
The few years of online Lankan newspapers also contribute to the available documents on Lanka.