The business intelligence market is bounded on one side by big data and on the other by data preparation. That is, to get the most value from their information, organizations must collect and analyze ever-increasing volumes of data even as the available tools continue to evolve within the big data ecosystem that I have written about. In our benchmark research on big data analytics, half (51%) of organizations said they want to access big data using their existing BI tools. At the same time, as I have noted, end users are demanding self-service data preparation capabilities to facilitate their analyses.
Part of the engineering work that made the new visualization features possible was a behind-the-scenes simplification and unification of Pentaho’s server architecture. PDI now interacts more directly with Pentaho Business Analytics. As a result, data engineers and data analysts doing data preparation can share data at any point with users of Business Analytics, which facilitates collaboration across the organization. The divide between business and IT can be an obstacle, and these new capabilities should help bridge it.
Pentaho 7.0 also includes enhanced big data capabilities. Recognizing the growing popularity of Spark for big data processing and analysis, which I have commented on, the company has added Spark capabilities to PDI. PDI now supports Spark SQL as a data source, and Spark Streaming routines as well as machine learning routines written in Spark ML or MLlib can be included as objects in data pipelines managed by PDI. Kerberos security capabilities have also been enhanced, including integration with Cloudera’s Sentry for more granular access control over Hadoop data sets. Other big data enhancements include support for Avro, Kafka and Parquet.
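To give a concrete sense of the kind of routine PDI can now orchestrate, here is a minimal PySpark sketch, not drawn from Pentaho’s documentation: the paths, table name and column names are hypothetical, and in a real deployment PDI would manage the Spark session and the hand-offs between pipeline steps.

```python
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import VectorAssembler

# Build a Spark session; in a PDI-managed pipeline the session and its
# configuration would be supplied by the orchestration environment.
spark = SparkSession.builder.appName("sensor-scoring").getOrCreate()

# Spark SQL as a data source: register a Parquet data set and query it.
# The path and column names here are illustrative only.
spark.read.parquet("hdfs:///data/sensor_readings").createOrReplaceTempView("readings")
recent = spark.sql(
    "SELECT device_id, temperature, vibration "
    "FROM readings WHERE reading_date >= '2016-11-01'"
)

# A Spark ML routine of the kind PDI can now include as a pipeline object:
# cluster devices by their sensor profile.
features = VectorAssembler(
    inputCols=["temperature", "vibration"], outputCol="features"
).transform(recent)
model = KMeans(k=3, seed=42).fit(features)
clustered = model.transform(features)

# Hand the scored data back to the rest of the pipeline.
clustered.write.parquet("hdfs:///data/sensor_clusters")
```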
Beginning with version 6.1 of PDI, Pentaho introduced the concept of “metadata injection.” Data integration, especially big data integration, often involves many files, potentially thousands. Processing them without creating a separate integration routine for each variation in file type is a challenge, even when the variation is as simple as the file names. Metadata injection addresses this by enabling PDI routines to accept parameters at various steps in the data pipeline, so a single routine can be applied to many different files. PDI 7.0 adds metadata injection support to another 30 of the steps used in processing files.
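PDI expresses metadata injection through its transformation steps rather than application code, but the underlying idea is easy to sketch. The following Python sketch, with hypothetical file patterns and field names, shows the pattern: one generic routine whose per-file specifics are injected as parameters instead of being hard-coded into thousands of near-identical routines.

```python
import csv
from pathlib import Path

# Per-layout metadata; in PDI this is what gets "injected" into the
# steps of a single template transformation at run time.
LAYOUTS = {
    "sensors_*.csv": {"delimiter": ",", "fields": ["device_id", "temp_c"]},
    "events_*.txt": {"delimiter": "|", "fields": ["event_id", "status"]},
}

def run_pipeline(path, delimiter, fields):
    """One generic routine; the file-specific details arrive as parameters."""
    with path.open(newline="") as f:
        return [dict(zip(fields, row)) for row in csv.reader(f, delimiter=delimiter)]

# Apply the single routine to many files (potentially thousands) by
# injecting the metadata that matches each one.
for pattern, meta in LAYOUTS.items():
    for path in Path("incoming").glob(pattern):
        rows = run_pipeline(path, meta["delimiter"], meta["fields"])
        # downstream steps would consume the normalized rows here
```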
A year ago we wrote about the prior major version of Pentaho and its relationship with its new parent Hitachi Data Systems (HDS). At the time HDS executives promised to leave Pentaho’s roadmap intact, and they appear to have done so. However, we haven’t seen as much integration of Pentaho into HDS products as expected. HDS has a major focus on IoT. Our recently completed IoT and Operational Intelligence benchmark research will be published soon. Over time I expect to see more IoT-related capabilities emerge that draw upon both Pentaho’s and HDS’s technologies.
In the context of IoT and beyond, Pentaho is in a position to bring data preparation and analytics together and thus fill a gap in the market. Pure-play data preparation vendors lack analytic capabilities, and pure-play analytics vendors lack data preparation capabilities, although several have started to add them. Bringing these capabilities together will be good for users. If your organization needs to support users with both data preparation and analytic capabilities, I recommend considering the new features in Pentaho 7.0.
Regards,
David Menninger
SVP & Research Director
Follow me on Twitter @dmenningerVR and connect with me on LinkedIn.