FineReader

  • ABBYY FineReader Pro is an unparalleled OCR solution

    by 
    TJ Luoma
    06.16.2014

    If you want fine-grained control over OCR and unmatched export options to a plethora of formats, ABBYY FineReader Pro for Mac is definitely worth a close look, but the current version has some significant caveats which you should consider before spending US$100 on it.

    FineReader's Exceptional Features

    If the most important feature of an OCR app is how well it recognizes text from a PDF or image file, then FineReader Pro is, by far, the best OCR app that I have ever used. I have thousands of journal articles saved as PDFs. Some of them are pretty good quality, but a few of them have image quality best described as "a hasty Xerox made on a Friday afternoon before Spring Break by a work-study student who was far more interested in literally anything else." Crooked, dark, speckled, you name it. Time and time again, FineReader came through. Its automatic analysis was generally good, but when I took the time to use its more advanced features, it rewarded me with output that was as near-perfect as anyone can expect from an OCR application.

    FineReader then gives you an unparalleled assortment of export options, including four different options for PDFs alone.

    "Text under the page image" is what most people usually expect and want from an OCR app: the OCR'd document looks the same on the screen, but you can copy/paste from it into any other application. However, the option for "Text over the page image" lets you keep the formatting close to the original while still editing the results, if needed, and seeing them on the screen. This is especially useful if you want to edit the resulting PDF to correct any OCR mistakes, which will still happen, regardless of which app you choose.

    Export as Word (or RTF or ODT), Excel, CSV, or PowerPoint

    Exporting as PDF is only the beginning.
You can also export to Word (.docx), including four layout options (exact copy, editable copy, formatted text, plain text), plus options to retain page numbers, headers and footers; keep line breaks and hyphenation; keep page breaks; keep pictures; keep text and background colors; highlight uncertain characters; and keep line numbers. (All of those options are also available for RTF and ODT/OpenOffice.)

I was surprised and somewhat perplexed to see options for Excel (.xlsx) and PowerPoint (.pptx) files, but that was mostly because I rarely use either application. Now that I have seen it and thought about it some more, it does seem like something that could come in extremely handy for people who need to process scans of documents which were originally produced in spreadsheet or presentation apps.

If your scan is tabular data, you might also consider exporting as CSV. FineReader includes an option to ignore all text outside tables when exporting to CSV.

Export as Ebook

FineReader also supports two ebook formats, but probably not the two you would expect. You can export to ePub (as you'd expect) and FB2, which is a format that I had never even heard of, but is apparently for something called FictionBook.

Options for ebooks include setting metadata such as the title, author, keywords, and an annotation (I do wish exporting to PDF had similar options). You can also use the first page as the book cover image and preserve/embed fonts (the latter option is for ePub only, not FB2).

Those hoping for MOBI support should probably consider exporting to ePub and then either using Send To Kindle for Mac or some other solution.

Export as Image

I'm not sure who uses an OCR app to export the finished document as an image, but if you're one of those people, you can choose from JPEG, JPEG 2000, TIFF, PNG, BMP, JBIG2, PCX, and DCX. (What? No option to export a multi-page document as an animated GIF?
Apparently the FineReader developers aren't familiar with Tumblr!)

Export as Text or HTML

The simplest way to use the text from your scan is to Export as TXT. (Markdown fans: take special note of the option to use a blank line to separate paragraphs.)

To those who are hoping to take complex print magazine formatting and reproduce it on the web, I have two messages: a) please don't and b) no, really, don't, but if you do, don't rely on FineReader for it. There are some options here for "fixed" vs "flexible" layout, and there's even an exciting-until-you-see-what-it-does option for CSS, but the end result is fairly ugly and convoluted HTML that will make tidy laugh and validators weep. (Or vice versa.)

Seriously, if you really want an HTML copy of your document, export it as text, convert that to Markdown, and generate your HTML from that. The world will be a better place.

Who shouldn't buy FineReader Pro

As excellent as FineReader Pro is, it is not the right OCR tool for everyone. If you are just trying to clear your physical desk of office or household paperwork (mail, bills, memos, letters, etc.), then you probably just want to drop them all on your scanner and have them saved to the computer as fast as possible so you can shred/recycle the originals and get on with the rest of your day.

If that is what you want to do, don't buy FineReader Pro. Instead, use whatever came with your scanner, or buy PDFpen for $60 and use AppleScript to automate OCR and save $40 (or buy PDFpenPro instead). You'll probably get more use out of PDFpen/PDFpenPro's features than you would from FineReader Pro.

FineReader Pro is exceedingly non-scriptable. It will not fit into any sort of automated process. In fact, when you open a PDF in FineReader Pro, it does not open the original file but instead imports the PDF into a new, untitled document.
After processing, you can save the file as a FineReader document (.frdoc), which will save all of the customizations that you made to the various scan areas.

"I'm still not convinced that FineReader Pro is really full-on nerdy enough for me... what else does it have to offer?"

  • Area Types: Designate parts of your document as a Text Area, Table Area, Picture Area, Background Image Area, Barcode Area, or Recognition Area.
  • Text Functions: Is this section of text Main Body Text, or a Header/Footer, or a Floating Text Block, or a Caption, or a Line Number, or the ever-vague Other?
  • Need to scan a document where the text goes vertically instead of horizontally? FineReader can do that.
  • Want to change the order in which all of the recognized text areas are processed? If you're planning to export to a non-PDF format, you almost certainly do, to make sure that columns flow properly, etc.
  • You can also choose from 186 different languages (although choosing more than 5 will present a warning that it will increase recognition time). This comes in very handy if you need to be able to identify Latin terms, or even Greek or Hebrew, Nyanja or Papiamento! It even recognizes BASIC, C/C++, Java, Fortran, COBOL, Pascal, and "Simple chemical formulas." Sorry, no Klingon.

A Quick Word: Express vs Pro

For the rest of this article, I'm just going to refer to the app as "FineReader" rather than "ABBYY FineReader Pro for Mac," but for the sake of clarity I want to make it clear that this is a separate app from ABBYY FineReader Express Edition for Mac. If you purchased that app and want to upgrade, this section will deliver the good and bad news. If you are a new FineReader customer, you can skip to the next section.

ABBYY isn't going to win any customer appreciation awards from early adopters.
Those who purchased the $70 ABBYY FineReader Express Edition for Mac can upgrade to "Pro" for an additional $80. Ouch. The description of FineReader Express in the comparison chart is definitely not the effusive self-praise ABBYY's marketing department gave Express back when it was the only product they were offering for Mac.

The meager $20 discount is almost insulting, especially when ABBYY's Why Upgrade? page ends with "Learn how to upgrade with a significant discount." The only worse news is for Mac App Store customers who, of course, don't qualify for any upgrade pricing at all.

The Express version will continue to work, but I wouldn't expect to see any new features, especially since the Mac App Store version has apparently been pulled from the store. (N.B. Those who purchased it from the Mac App Store should still be able to download it from the "Purchased" tab in the App Store app.)

Remember: OCR Is Hard.

OCR is similar to speech recognition in that they both seem like something that a computer just ought to be able to do, and anything that falls short of 100% accuracy can feel disappointing.

If you compare an OCR program's ability to the combination of an adult human brain and eyes, the OCR program is going to lose every time. Human brains are remarkably good at making sense out of gibberish. They move easily between different typefaces and styles, navigate language changes with casual savoir faire, and can almost always tell the difference between a capital I and a lowercase L by subconsciously evaluating context.

However, if we think about what a computer has to do in order to perform OCR on a document, we can recognize a number of elements which can affect the outcome:

  • the quality of the original document
  • typeface(s) of the original document (e.g. a single word or phrase in italics; limited character spacing (aka "kerning") which can make it difficult to distinguish an "n" from an "ri" etc. Not to mention the use of large initial capital letters in some print magazines where the first letter might be the equivalent height of several lines of the rest of the text.)
  • layout (multiple columns in magazines)
  • settings used when the document was scanned (DPI set too low or too high can cause problems)
  • the language(s) used in the original document (or multiple languages, or technical jargon)
  • hyphenation and line justification (should that word be hyphenated, or was it only hyphenated to keep the fully-justified column of text looking pretty?)
  • determining what is important and what isn't

The last one is so difficult that most programs don't even attempt it; they just scan everything. That is probably OK in most cases, but there is often superfluous information on the page, such as the title of the article or the author's name in the header of each page, which the human eye easily dismisses but a computer cannot. Then, of course, there's also the problem of stray marks on the page.

Bugs and Shortcomings

FineReader Professional for Windows is currently at version 12, and the Mac version that I tested calls itself version 12.0.3, but this is not entirely true. It is certainly true that the text recognition engine has been refined for years on Windows, and Mac users are getting a mature program in that sense. However, the Mac implementation of FineReader Pro is very new, and there are some bugs.

By far the most severe bug that I encountered during my testing was a rare and difficult to reproduce problem where PDFs imported into FineReader Pro, analyzed, and then exported as a PDF would be missing pages. I only saw this 2–3 times out of a hundred or more scans. If I had to guess I would put the occurrence rate at less than 1%. Those times when I did see it, I could tell that there was a problem when the file was being imported. For example, an 18-page original was reported as only having 16 pages as FineReader imported it.
When I retried the same document, it always imported properly, making it extremely difficult to reproduce this problem. I did report this issue to the developers at ABBYY, who are investigating it.

The problem, of course, is that if you do not notice the page difference when importing or exporting, you might delete your original PDF and be left with an incomplete copy. I hesitated to even mention this, as I am concerned that it will overly discourage people from trying the app; however, potential data loss is almost always the most serious bug that an app can have. Given that FineReader leaves the original PDF alone and imports a copy, the only way that you could lose data is if you delete your original PDF manually. However, I assume that most people would do exactly that.

The second-worst bug that I encountered was that occasionally (perhaps 1–2% of the time), the PDF that I exported from FineReader actually looked worse than the original. Marco Arment identified this problem back in 2011:

"The ScanSnap came with ABBYY FineReader, which does an acceptable job, but degrades the image quality noticeably when it saves the text-embedded PDF copy. It's enough of a problem that I'm not comfortable deleting the original, and I'd rather not keep two copies of every file around, so I tried to find an alternative that could output better-quality PDFs with text."

Before anyone dismisses this as Marco (or me) being hypercritical, I invite you to look for yourself. I took a screenshot showing "before-and-after" to illustrate that this problem is clearly visible to the average human. (Be warned: that screenshot link leads to a 1.3 MB TIFF file. I didn't want image compression being blamed.) The FineReader PDF was created with the image quality set to the highest value and "Compress images using MRC" was turned off. The resulting file is undeniably worse, which seems like something that should never happen.
This problem has also been reported to the ABBYY developers, and I hope that they will improve it in a free update to FineReader.

My last comments in this section aren't about bugs, but about usability problems.

The first is that FineReader Pro's error reporting is extremely weak. I have run into several documents which generated an error saying "Some of the pages have not been processed" without telling me how many pages or which ones. I am not a developer, so I don't know what is involved in making that error reporting more specific, but as a user I can tell you that the experience was highly unsatisfying. From a user's perspective I assume that the application knows how many pages were imported or exported, and how many errors were encountered, and I expect that it will give me precise information so I can track it down. It was also unclear to me what "processed" meant. Did that mean that OCR had failed completely for some pages? Had some pages failed to export? What am I supposed to do with such a vague error?

Secondly, when deleting a page or pages from the app, the confirmation window asks if I am ready to delete the selected "page(s)." Again, as a user, this seems lazy. If I am about to delete a single page, then the app should be specific, e.g. "Are you sure you want to delete page 8?" If I am about to delete multiple pages, I expect it to tell me exactly which ones: "Are you sure you want to delete pages 19–23?"

Is this finicky? Perhaps, but if you are selling a $100 app in a world where people hesitate to spend 99¢, and if you are going to label that app as a "pro" app, I am going to hold you to high standards. (Aside: I think the app is worth $100, and I think it's extremely unfortunate that a $5 app is considered "premium," but this is the world in which we live.)

FineReader vs FineReader

In the comparison chart for FineReader Pro vs Express, ABBYY describes the "Recognition accuracy" in Pro as "Unmatched" whereas Express is "Superior."
Presumably ABBYY hopes that we will gloss over those Meaningless MarketingSpeak Designations and not ask how, in a comparison chart between two products, "FineReader Express" can be labeled as "Superior" when, contextually, it is obvious that Express is the inferior of the two. Instead, ABBYY wants us to read between the lines (or table cells, as it were) and interpret this to mean "Express is Superior To Our Competition Although We Are Not Coming Right Out And Saying That Especially In Countries Which Forbid That Sort of Product Comparison. But Pro is Better."

DEVONthink Pro Office uses ABBYY FineReader Engine 11 for Mac for its internal OCR. Owners of recent ScanSnap scanners get an app called "ABBYY FineReader for ScanSnap" which can be used to automatically OCR scanned documents.

The final product of those two options will be good, but not as good as FineReader Pro, especially if the original document has a more complex layout such as multiple columns. This is not surprising, considering that FineReader Pro is a new product. I would hope that the FineReader Engine will be updated and DEVONthink Pro Office users will be able to benefit from it. Here again, the trade-off is automated batch processing against the advanced features of FineReader Pro.

Excellence, Not Perfection

For complex documents, FineReader is your best option for turning a scanned file into a usable OCR'd document or converting it into a Word document or something else. It's a professional tool at a professional price, and while it lacks automation features, it is great at what it does. My only hope is that it will continue to be developed and improved, not just sit around for a year before a "new version" comes out. If that is the development model they are using, they'd better come up with better upgrade pricing than their current system.