Monday, May 14, 2012

The Rise of the Connector - Part 1

Connectors are the heart and soul of Federated Search (FS) engines. With the rising importance of FS in today’s fast-paced, Big Data, analyze-everything world, they are crucial to smooth and efficient data virtualization and flow. MuseGlobal has been building Connectors, along with the architecture to use them (the Muse/ICE platform) and to maintain and support them (the Muse Source Factory), for over 12 years. The people who design and build Connectors need rich technical expertise and a deep understanding of data and information in their myriad forms.

This series of posts will look at the problems that arose as data grew in volume, spread across systems, moved outside the enterprise, and became all-important for the business intelligence that informs current corporate decisions. Not surprisingly, as a leading FS platform, Muse and its ecosystem are at the forefront of providing solutions to data problems in the modern world.

This first post considers the growing importance of being able to access data from inside an organization. (The second post looks at the problems arising as data is needed from outside the enterprise, and the complexities of access and extraction that result.)

Part 1 - Wanted: data from over there, over here

As the world of Big Data grows daily and the importance of unstructured data becomes more evident to information workers and managers everywhere, methods of accessing that data become critical to success.

Typically, the majority of an enterprise’s data is held in relational DBMSs attached to the transaction systems that generate and use it. These include HR, Bill of Materials, Asset Management systems and the like. However, it is difficult for managers to make strategic decisions even on this data, because they need to see it all at once. The analysis managers need is performed by a Business Intelligence (BI) system, which works on data held in its own (OLAP) database, specially structured to give quick answers to pre-formulated questions.

And here is the first problem: transaction systems with lots of data, and an analysis system with an empty database. The solution: set up and run a batch process for each working database that takes a snapshot of its data, transforms it, and loads it into the OLAP database. This is ETL (Extract, Transform, Load), and it is where most big company systems are at the moment. The transaction systems have no method of exporting the data, and the analysis engine just works from what it has. This three-part solution works, and works well, but it has some problems.
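The batch process described above can be sketched in a few lines. This is only an illustration under stated assumptions: the `orders` and `sales_by_customer` tables, the function names, and the use of in-memory SQLite databases are all hypothetical stand-ins for a real transaction system and a real OLAP store.

```python
import sqlite3

def extract(source):
    """Snapshot every row from the (hypothetical) transaction system."""
    return source.execute(
        "SELECT customer, product, amount FROM orders"
    ).fetchall()

def transform(rows):
    """Reshape the snapshot into the pre-aggregated form the analysis side wants."""
    totals = {}
    for customer, _product, amount in rows:
        totals[customer] = totals.get(customer, 0.0) + amount
    return sorted(totals.items())

def load(target, summary):
    """Replace the previous snapshot in the (hypothetical) analysis database."""
    target.execute(
        "CREATE TABLE IF NOT EXISTS sales_by_customer "
        "(customer TEXT PRIMARY KEY, total REAL)"
    )
    target.execute("DELETE FROM sales_by_customer")
    target.executemany("INSERT INTO sales_by_customer VALUES (?, ?)", summary)
    target.commit()

def run_nightly_etl(source, target):
    load(target, transform(extract(source)))

# Demo data: an in-memory stand-in for one transaction system.
transactions = sqlite3.connect(":memory:")
transactions.execute(
    "CREATE TABLE orders (customer TEXT, product TEXT, amount REAL)"
)
transactions.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("acme", "widget", 100.0), ("acme", "gear", 50.0), ("globex", "widget", 75.0)],
)

olap = sqlite3.connect(":memory:")
run_nightly_etl(transactions, olap)
result = olap.execute(
    "SELECT customer, total FROM sales_by_customer ORDER BY customer"
).fetchall()
print(result)
```

Note that `extract` reads the whole table on every run; that is the defining trait of the snapshot approach, and the source of the freshness and volume problems discussed next.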

Running a snapshot ETL on each working system at “midnight” obviously takes time, and the data can be nearly a day old before the process even starts. This lack of “freshness” of the data didn’t matter too much 5 or even 2 years ago. It took so long to change systems as a result of the analysis that data a day or so old was not on the critical path. But today’s systems can adapt much more rapidly, and business decisions need to be based on hourly or even by-the-minute data. (Of course, if you are in the stock and financial markets then your timescale is down to microseconds, and you have specialist systems tailored for that level of response.) So first we need to improve on our timing.

In order to do that we need to move from a just-in-case operation to a just-in-time one. Rather than collect all the data once a day, we need to be able to gather it exactly when we need it. Of course, gathering it overnight as historic data is still important, and it makes the whole process work more smoothly and quickly: the just-in-time data is now only a few hours’ worth, and so can be processed that much quicker to get it into the BI system. Now we have a two-legged approach: batch bulk and focused immediate updates. Sounds good, but the ETL software for the batch work will not handle the real-time nature of the j-i-t data requests.

For a start, the ETL process grabs everything in the transaction system database: all customers, all products, all markets. But a manager is generally going to ask for a report on a specific customer or product. It would be endlessly wasteful to grab all that “fresh” data for all customers when only data for one is needed. So the j-i-t process has to be able to query the transaction system, rather than sweep up everything. It is also almost certain that the required report will need data from more than one transaction system, but probably not all of them. ETL is not set up to do this; the job needs a system capable of directing queries at designated systems and transforming the results. And, finally, the extracted data may well need to be in a different format. After all, we are now loading the data directly into the Business Intelligence analysis engine for this report (for speed), and not importing it into the OLAP database. This means that the structure and semantics are all different.
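To make the contrast with a full sweep concrete, here is a minimal sketch of a just-in-time request: one customer, queried from only the systems the report needs, and reshaped into a flat record for the analysis engine. The system names (`billing`, `support`, `hr`), the `records` table, and the output format are all assumptions made for illustration.

```python
import sqlite3

def make_system(rows):
    """Build an in-memory stand-in for one transaction system."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE records (customer TEXT, metric TEXT, value REAL)")
    db.executemany("INSERT INTO records VALUES (?, ?, ?)", rows)
    return db

# Three hypothetical transaction systems; the report below needs only two.
systems = {
    "billing": make_system([("acme", "invoiced", 1200.0), ("globex", "invoiced", 800.0)]),
    "support": make_system([("acme", "open_tickets", 3.0), ("globex", "open_tickets", 1.0)]),
    "hr": make_system([("n/a", "headcount", 42.0)]),
}

def fetch_for_customer(customer, system_names):
    """Query only the designated systems, and only for the requested customer."""
    report = {"customer": customer}
    for name in system_names:
        rows = systems[name].execute(
            "SELECT metric, value FROM records WHERE customer = ?", (customer,)
        ).fetchall()
        # Transform: flatten into the key/value shape our imagined BI engine accepts.
        for metric, value in rows:
            report[f"{name}.{metric}"] = value
    return report

report = fetch_for_customer("acme", ["billing", "support"])
print(report)
```

The `WHERE customer = ?` clause and the explicit list of system names are the two things a snapshot ETL job does not have: a query, and a choice of targets.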

Increasingly, the tools of choice for these j-i-t operations are Federated Search (FS) systems such as MuseGlobal’s Muse platform. They can search a designated set of sources (transaction systems), run a specific query against them, and then re-format the results and send them directly to the analysis engine. Early FS systems were user-driven, but for this data integration purpose the more sophisticated FS systems can accept command strings and messages in a wide variety of protocols, formats and languages and act on them. This allows the FS system to act as a middleman, getting the data the BI engine needs exactly when it needs it. Muse, for example, through its use of “Bridges”, can accept command inputs in over a dozen distinctly different protocols, and can query all the major enterprise management suites in a native or standards-based protocol.
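The general pattern (designated sources, one query, results re-formatted into a common shape) can be sketched as follows. This is not Muse's API or architecture; the `Connector` class, its `search` and `normalize` fields, and the two sample sources are hypothetical names invented for this illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable

@dataclass
class Connector:
    """Hypothetical connector: knows how to query one source and normalize its records."""
    name: str
    search: Callable[[str], list]      # runs the query in the source's own terms
    normalize: Callable[[dict], dict]  # maps a native record to the common format

def federated_search(query, connectors):
    """Send the same query to every designated source in parallel and merge the results."""
    def run(connector):
        return [
            dict(connector.normalize(record), source=connector.name)
            for record in connector.search(query)
        ]
    with ThreadPoolExecutor(max_workers=max(1, len(connectors))) as pool:
        batches = list(pool.map(run, connectors))  # map() keeps connector order
    return [record for batch in batches for record in batch]

# Two made-up sources with different native field names.
crm_data = [{"CustName": "Acme", "Rev": 1200}, {"CustName": "Globex", "Rev": 800}]
erp_data = [{"customer_id": "acme", "open_orders": 4}]

crm = Connector(
    name="crm",
    search=lambda q: [r for r in crm_data if r["CustName"].lower() == q.lower()],
    normalize=lambda r: {"customer": r["CustName"].lower(), "revenue": r["Rev"]},
)
erp = Connector(
    name="erp",
    search=lambda q: [r for r in erp_data if r["customer_id"] == q.lower()],
    normalize=lambda r: {"customer": r["customer_id"], "open_orders": r["open_orders"]},
)

results = federated_search("acme", [crm, erp])
print(results)
```

Each connector hides the source's native query style and field names behind the same two functions, so adding a source means writing one more connector rather than changing the search logic.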

Should we move?

The need for speed of analysis and the volume of data involved grow every day, it seems. It takes time to extract all that data and to build a big OLAP database just in case we want it. What’s more, building the structure, and changing it to adapt to changing analysis needs, takes time, and a lot of it.

So modern BI systems have moved to holding their database in memory, rather than on disk, just so everything is that much faster. Modern analysis engines, many based on the Apache project’s Hadoop engine, can handle a lot of data in a big computer, and do it rapidly. Both Oracle (Exalytics) and SAP (HANA) have introduced these combined in-memory database plus analytics engines, and others are coming. (InformationWeek has a take on the war of words surrounding them.) These engines can be rapidly configured (often in real time, through a dashboard) to give a new analysis report, as long as they have the data!

Moving all that data from the transaction system takes time, so the current mode is to leave it there and rely on real-time acquisition of what is needed. This is, of course, much less disruptive, and the data is fresher and much more focused on the analysis at hand. This is not to say that historical data is not important; it is, and it is used by these engines, but the emphasis is more and more on that last bar on the graph.

So, again, we need a delivery engine to get our data for us from all the corporate data silos, get it when it is needed, and then deliver it to the maw of the BI analytics engine. Once again the systems integration, dynamic configuration and deep extraction technologies of a Federated Search engine come to the rescue. Muse provides the real-time capabilities, parallel processing architecture, session management, and protocol flexibility needed to deliver large quantities of data when asked for, or on a continuing “feed” basis.
