UPDATED 22:29 EDT / MAY 29 2019

AWS announces general availability of its document reading service Textract

Amazon Web Services Inc.’s Textract service, which uses machine learning to extract text and data from documents including tables and forms, is now generally available.

Textract was first announced during the AWS re:Invent conference last November as one of several new machine learning services designed for use by people with no expertise in the subject.

Amazon reckons the service is a big improvement over the traditional optical character recognition software that enterprises have previously relied on to extract text-based data from documents. The problem with traditional OCR is that it can’t recognize common layouts seen on forms and tables. As a result, OCR software is often inaccurate when attempting to pull data from those kinds of sources.

Amazon says Textract is more of an “OCR++ service” because it can recognize tables within a document and understand that the data is arranged in rows and columns.
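To make the rows-and-columns idea concrete, here is a minimal Python sketch that rebuilds a table from cell records carrying row and column positions. It is an illustration under stated assumptions, not Amazon's implementation: the field names (BlockType, RowIndex, ColumnIndex, Relationships) mirror the block structure Textract's document analysis response is understood to use, while the helper name table_to_rows and the sample data are hypothetical and hand-written rather than real service output.

```python
from collections import defaultdict

# Hand-written sample shaped like a Textract-style block list (assumption:
# CELL blocks carry 1-based RowIndex/ColumnIndex and point to WORD blocks
# through CHILD relationships). This is not real service output.
SAMPLE_BLOCKS = [
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Price"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Coffee"},
    {"Id": "w4", "BlockType": "WORD", "Text": "$3.50"},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
]


def table_to_rows(blocks):
    """Rebuild a list of rows (each a list of cell strings) from blocks."""
    by_id = {block["Id"]: block for block in blocks}

    def child_ids(block):
        # Yield the ids of every block listed as a CHILD of this one.
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                yield from rel["Ids"]

    # grid[row][column] -> cell text
    grid = defaultdict(dict)
    for block in blocks:
        if block["BlockType"] != "CELL":
            continue
        words = [
            by_id[i]["Text"]
            for i in child_ids(block)
            if by_id[i]["BlockType"] == "WORD"
        ]
        grid[block["RowIndex"]][block["ColumnIndex"]] = " ".join(words)

    if not grid:
        return []

    # Pad missing cells with empty strings so every row has the same width.
    n_cols = max(max(cols) for cols in grid.values())
    return [
        [grid[row].get(col, "") for col in range(1, n_cols + 1)]
        for row in sorted(grid)
    ]


if __name__ == "__main__":
    for row in table_to_rows(SAMPLE_BLOCKS):
        print(row)
```

Plain OCR would return these four words as a flat run of text. The positional indexes are what allow “Coffee” to be paired with “$3.50” rather than with “Price.”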

“The power of Amazon Textract is that it accurately extracts text and structured data from virtually any document with no machine learning experience required,” Swami Sivasubramanian, AWS’s vice president of machine learning, said in a statement. “Subsequently, developers can analyze and query the extracted text and data using our database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB, and Amazon Athena and integrate with other machine learning services like Amazon Comprehend, Amazon Comprehend Medical, Amazon Translate, and Amazon SageMaker to help customers derive deeper meaning from the extracted text and data.”

Textract accepts several input formats, including JPEG and PNG image files, scanned documents and PDFs.

Amazon’s announcement that Textract is now generally available was met with excitement by analyst Patrick Moorhead of Moor Insights & Strategy:

“I believe that Textract will be a game changer for industries like healthcare that still rely on printed documents,” Moorhead told SiliconANGLE. “Unlike OCR, Textract identifies text positionally so it’s accurate and useful.”

Numerous customers have been using Textract since it was made available in limited preview last year, including The Globe and Mail Inc., PricewaterhouseCoopers, UiPath Inc. and Alfresco Software Inc., Amazon said.

Textract is currently available in four AWS regions, namely US East (Ohio), US East (Northern Virginia), US West (Oregon) and EU (Ireland). The company said the service will be extended to more regions later in the year.

Photo: Goumbik/Pixabay
