David Menninger's Analyst Perspectives

Informatica Parses the World of Hadoop

Written by David Menninger | Nov 2, 2011 3:57:26 PM

Informatica recently introduced HParser, an expansion of its capabilities for working with Hadoop data sources. Beginning with Version 9.1, introduced earlier this year, Informatica’s flagship product has been able to access data stored in HDFS as either a source or a target for information management processes. However, it could not manipulate or transform the data within the Hadoop environment. With this announcement, Informatica starts to bring its data transformation capabilities to Hadoop.

We recently completed benchmark research on Hadoop, the open source large-scale processing technology, and I have been writing regularly about Hadoop in this blog. In this era of big data, Hadoop has quickly become a popular technology for storing and analyzing large volumes of data; more than half (54%) of participants in our research are either using or evaluating it. The research shows that they most often use Hadoop in conjunction with unstructured data, including application logs and event data. Before anyone can analyze this data, it must be parsed to identify the individual pieces of information recorded in each row of the log files.
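To make the parsing step concrete, here is a minimal sketch of what extracting fields from one log row can look like. It is not HParser and does not reflect how Informatica implements parsing; the Apache Common Log Format, the sample line, and the names `LOG_PATTERN` and `parse_log_line` are all assumptions chosen for illustration.

```python
import re
from typing import Optional

# Assumption: rows follow the Apache Common Log Format. Real application
# logs vary widely, which is exactly why parsing tools exist.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_log_line(line: str) -> Optional[dict]:
    """Split one log row into named fields, or return None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be aggregated later.
    fields["status"] = int(fields["status"])
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Hypothetical sample row, invented for this sketch.
sample = '127.0.0.1 - frank [10/Oct/2011:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
record = parse_log_line(sample)
```

Even for a single, well-known format the pattern takes some care to get right, which gives a sense of the hand-coding effort involved when an organization has many log formats to handle.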

Informatica’s HParser is designed to make this process easier. Using DT Studio, Informatica’s Eclipse-based integrated development environment (IDE), organizations can use a graphical user interface to create data transformation routines that parse the information in log files and other types of data typically processed with Hadoop. Once developed, these routines are deployed to the Hadoop cluster and invoked as part of the MapReduce scripts, which enables them to use the full distributed processing and parallel execution capabilities of Hadoop. Using a graphical environment to develop these routines should make it easier and faster to create the code necessary to parse the data. As our research shows, staffing and training are the two biggest obstacles to leveraging Hadoop, so tools like HParser that can minimize the specialized skills required can be valuable to organizations deploying Hadoop.
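As a rough illustration of a parsing routine being invoked inside a MapReduce job, the sketch below simulates the map and reduce phases in plain Python on a single machine. It does not use Hadoop or HParser; `parse_status` is a hypothetical stand-in for a deployed parsing routine, and the log layout (status code as the second-to-last field) and sample rows are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple


def parse_status(line: str) -> str:
    """Hypothetical parsing routine: return the HTTP status code, assumed
    to be the second-to-last whitespace-separated field of the row."""
    parts = line.split()
    return parts[-2] if len(parts) >= 2 else "unparsed"


def map_phase(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Map step: call the parser on every row and emit (status, 1) pairs.
    On a real cluster, many mappers would run this in parallel on file splits."""
    for line in lines:
        yield parse_status(line), 1


def reduce_phase(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce step: sum the counts emitted for each status code."""
    counts: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


# Hypothetical log rows, invented for this sketch.
log_lines = [
    '10.0.0.1 - - [02/Nov/2011:10:00:00 +0000] "GET / HTTP/1.1" 200 512',
    '10.0.0.2 - - [02/Nov/2011:10:00:01 +0000] "GET /missing HTTP/1.1" 404 128',
    '10.0.0.1 - - [02/Nov/2011:10:00:02 +0000] "GET /about HTTP/1.1" 200 734',
]
status_counts = reduce_phase(map_phase(log_lines))
```

The point of the design is that the parser is called once per row inside the map step, so the parsing work is spread across however many mappers the cluster runs.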

Informatica is making two versions of HParser available. The community edition is free, but it’s not open source. It can be used to process log files, Omniture Web analytics data, XML documents and data in the JavaScript Object Notation (JSON) interchange format. The enterprise edition also supports a number of industry-standard data formats including SWIFT, X12, and NACHA for the financial industry, HL7 and HIPAA for healthcare, ASN.1 for telecommunications, and documents in PDF, XLS or Microsoft Word formats. For the most part, the enterprise offering is targeted at those in the Informatica user base who might be extending their efforts into Hadoop. The community edition may provide enough value for customers not currently working with Informatica to consider trying some of the company’s other products.

We’ve seen other information management vendors take a similar approach. Earlier this year Syncsort announced a free version of its sort routines for the Hadoop market as well as an enterprise edition. HParser appears to be part of a bigger effort on the part of Informatica to embrace Hadoop. The company has been conducting a series of webcasts called Hadoop Tuesdays, one of which I participated in last week, to help educate the market about Hadoop. You may find these useful if you want to learn more about Hadoop. They are not product sales pitches but are focused on explaining the technology and its uses. In addition, Informatica will be delivering a keynote presentation at Hadoop World next week.

We’ve seen business intelligence vendors and information management vendors alike embrace Hadoop. I expect we’ll continue to see more investment from Informatica and others as organizations work to make Hadoop a disciplined part of their IT infrastructure processes. As our research shows, integration is one of the top four issues for organizations working with Hadoop. The more that existing products can be extended to incorporate Hadoop or new products can be developed to make Hadoop easier to use, the more widespread its usage will become.

Die-hard MapReduce programmers may not feel that they need HParser. However, enterprise IT organizations already using Informatica should find it a welcome addition in their efforts to deal with Hadoop-based data sources. You can give it a try for yourself here.

Regards,

David Menninger – VP & Research Director