UPDATED 19:44 EDT / SEPTEMBER 20 2017

BIG DATA

Syncsort quality manager aims to purify Hadoop data lakes

Syncsort Inc. is extending the data quality features of the Trillium Software Inc. subsidiary it acquired last November to native Hadoop environments with Trillium Quality for Big Data.

The offering combines Trillium’s data quality features with its Intelligent Execution data integration platform so that information technology organizations can normalize and integrate data at the same time. The Trillium platform was previously available in native format only on Linux, Unix and Windows operating systems. The Hadoop support marks the first time Syncsort has applied its data quality features to applications.

Data quality is about identifying inconsistencies, errors and duplication. Examples include a ZIP code entered in a date field or duplicate customer records that appear to be different because of misspellings. Normalizing data is a tricky process. For example, different countries have different address and date formats, and two people with the same name in the same ZIP code may or may not be the same person.
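To make those two examples concrete, here is a minimal Python sketch of a date-format check and a fuzzy duplicate check. It is an illustration only and not Trillium's method: the record fields, the `is_valid_date` and `likely_duplicates` helpers and the 0.85 similarity threshold are all assumptions made for this example, and it relies solely on Python's standard library.

```python
from datetime import datetime
from difflib import SequenceMatcher


def is_valid_date(value, fmt="%Y-%m-%d"):
    """Return True if value parses as a date in the assumed format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def likely_duplicates(a, b, threshold=0.85):
    """Flag two records as possible duplicates: same ZIP code, similar name.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    if a["zip"] != b["zip"]:
        return False
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold


# Hypothetical customer records, invented for illustration.
records = [
    {"name": "Jonathan Smith", "zip": "02139", "signup": "2017-09-20"},
    {"name": "Jonathon Smith", "zip": "02139", "signup": "02139"},  # ZIP code in a date field
    {"name": "Maria Garcia", "zip": "02139", "signup": "2017-08-01"},
]

# Records whose signup field does not parse as a date.
bad_dates = [r["name"] for r in records if not is_valid_date(r["signup"])]

# Every pair of records that looks like the same customer.
flagged = [
    (a["name"], b["name"])
    for i, a in enumerate(records)
    for b in records[i + 1:]
    if likely_duplicates(a, b)
]

print(bad_dates)
print(flagged)
```

The sketch only flags candidate pairs rather than merging them, because a matching name and ZIP code does not prove two records describe the same person.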

Users are rushing to extract data from production systems and load it into analytics engines, but are discovering that quality problems limit their effectiveness. “Everybody is trying to govern the data once it’s in the data lake so it doesn’t turn into a data swamp,” said Tendü Yoğurtçu, Syncsort’s chief technology officer. “The volume and variety of data makes it complex.”

Trillium has hundreds of matching algorithms to identify data quality problems, and it can be configured to apply corrective algorithms automatically, Yoğurtçu said. The offering includes address- and name-matching data for 150 countries, as well as postal directories and geocoding. Intelligent Execution examines the topology of a data flow and optimizes resources for the job without changes to the application. It supports both new and existing Trillium data quality projects across Hadoop, MapReduce and Apache Spark, on-premises or in the cloud.

“Once you understand the data you can create the rules to cleanse that data,” Yoğurtçu said. “For example, if you have duplicates you can specify a process to flag them or get rid of them.”

Trillium Quality for Big Data is available on all Hadoop distributions, including Cloudera Inc.’s CDH, Hortonworks Inc.’s HDP and MapR Technologies Inc.’s Converged Data Platform. It installs and deploys through Cloudera Manager and Apache Ambari. Pricing is per node or by cloud subscription, but Syncsort didn’t provide specifics.

