Amazon's Textract AI can read millions of pages in a few hours

It doesn't spit out jumbled text from complex document layouts like basic OCR tools do.

By Mariella Moon May 30, 2019 4:02 am EST

Amazon has launched a new offering called Textract for its Web Services customers, and it's like optical character recognition on steroids. Despite what its name implies, it does more than just extract text from documents: Amazon says it can identify different document formats and their contents so it can process them properly. The product was designed to recognize when it's pulling text from tables and forms in documents such as scanned receipts, tax paperwork or inventories. It then generates structured data without needing human input.

Since basic OCR tools typically spit out jumbled information when taking text from tables and forms, companies have to resort to manual data entry, which can be both costly and time-consuming. Textract can process millions of pages in just a few hours, which can lower document processing costs. Plus, customers can use it even if they have no previous machine learning experience.

Amazon says Textract can recognize information like names and Social Security numbers, and it can transfer table data from PDFs, for instance, into easily searchable spreadsheets. For much larger stacks of documents, the information it extracts could be used to build smart searches or could be loaded into databases. The bad news for some AWS customers is that the product is only available in some parts of the US (Ohio, N. Virginia, Oregon) and in Ireland for now. It will, however, make its way to more regions over the next year.
