Editing the Text of Articles in the MoA Digital Library Collection.

The guide below is a sequence of instructions for editing and proofing generic MoA OCR text pages into a nicely formatted, readable HTML webpage.
You first need permission from the MoA Digital Library to post the text online.

  1. Print out the scanned images of the pages (selecting View as 100%).
    Save any pages with illustrations as .gif images.
  2. Download the OCR text of every page and save it as text (selecting View as text).
  3. Use global editing to eliminate runs of multiple blank spaces (text lines are displayed centered and are generally saved that way).
  4. Join long lines that were broken at the end.
    Delete any lines on the first and last pages that contain only text from the previous or subsequent articles.
    Rectify any multi-column page recognition errors. This is the most time-consuming operation. It is not needed for single-column pages or for pages that were OCRed properly, since the OCR program will already have recognized the multiple columns correctly.
    A faster option, if possible, is to re-OCR the page.
  5. Delete the extra lines with title text, page numbers, etc., between pages.
    Also edit out the figure captions if the text is illustrated.
  6. Join words broken with a "-" at the end of a line that the OCR software did not rejoin.
  7. Check against the printout and add HTML paragraph breaks <p>.
  8. Spell-check to correct most OCR errors.
  9. Re-wrap or fill the lines to get formatted ASCII text.
  10. Print out and proofread to correct any remaining errors.
    The image on the computer screen is far clearer than the printed images of the pages.
  11. Crop any illustrations out of the saved .gif files of the pages and save them as compressed (say 50%) .jpg images.
  12. Insert the text into an HTML template file and edit the HTML to add any images and figure captions to display on the webpage.
    Any computer-literate kid (or adult) should be able to figure out basic HTML in a few minutes by using View - Page Source on a webpage written in simple HTML.
    I hope the reader is not brainwashed into thinking that this is a complex task that requires learning the syntax of the menu-driven commercial webpage editors of Micro(brained)Software.
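
The mechanical parts of steps 3, 6, 7 and 9 can also be scripted instead of done by hand in an editor. The following is only a rough Python sketch, not the procedure used for the article mentioned here: the function names are made up for illustration, the 72-column width is an arbitrary choice, and it assumes paragraphs are already separated by blank lines. The hyphen rule is deliberately naive and will also join genuine compounds such as "menu-" / "driven", so the result still has to be proofread as in step 10.

```python
import html
import re
import textwrap


def collapse_spaces(text: str) -> str:
    """Step 3: drop the padding of centered OCR lines and squeeze runs of blanks."""
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    return "\n".join(lines)


def join_hyphenated(text: str) -> str:
    """Step 6: rejoin words split with a '-' at the end of a line.

    Naive assumption: every end-of-line hyphen is a line-break hyphen.
    """
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


def rewrap(text: str, width: int = 72) -> str:
    """Step 9: refill each paragraph (blocks separated by blank lines)."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    filled = [textwrap.fill(" ".join(p.split()), width=width) for p in paragraphs]
    return "\n\n".join(filled)


def clean_page(text: str, width: int = 72) -> str:
    """Run steps 3, 6 and 9 in order on the saved OCR text of one page."""
    return rewrap(join_hyphenated(collapse_spaces(text)), width)


def paragraphs_to_html(text: str) -> str:
    """Step 7: wrap each blank-line-separated paragraph in <p> ... </p>."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(
        "<p>" + html.escape(p, quote=False) + "</p>" for p in paragraphs
    )


if __name__ == "__main__":
    sample = (
        "   The  quick   brown\n"
        "      fox jumped over the fen-\n"
        "ce and ran.\n"
        "\n"
        "   Second   para-\n"
        "graph here.  \n"
    )
    print(paragraphs_to_html(clean_page(sample)))
```

Multi-column recognition errors (step 4) and the spell-check (step 8) are left out on purpose, since they need a human eye or a dictionary rather than a regular expression.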

I have not, however, convinced myself that editing the OCR output is faster than typing the text directly, particularly in cases where the OCR has messed up multiple columns. I hope the instructions above will help avoid some mistakes that can make the process take longer. I did one article in order to write and illustrate the instructions above, and it took far too long, even with a good editor like EMACS. A good typist can probably retype the article much faster.