
This series of posts looks at the problems that arise as data grows in volume, spreads across systems, moves outside the enterprise, and becomes all-important for the business intelligence that informs corporate decisions. Not surprisingly, as a leading Federated Search (FS) platform, Muse and its ecosystem are at the forefront of providing solutions to these data problems.
This first post considers the growing importance of being
able to access data from inside an organization. (The second post looks at the
problems arising as data is needed from outside the enterprise, and the
complexities of access and extraction that result.)
Part 1. Wanted: data from over there, over here
As the world of Big Data grows daily and the
importance of unstructured data becomes more evident to information workers and
managers everywhere, methods of accessing that data become critical to success.
Typically, the majority of an enterprise's data is held in relational DBMSs attached to the transaction systems that generate and use it: HR, Bill of Materials, Asset Management systems and the like. However, it is difficult for managers to make strategic decisions on even this data, because they need to see it all at once. The analysis managers need is performed by a Business Intelligence (BI) system, and it works on data held in its own OLAP database, which is specially structured to give quick answers to pre-formulated questions.
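To make the "pre-formulated questions" point concrete, here is a minimal sketch in Python of why an OLAP-style store answers them quickly: the totals are aggregated ahead of time along known dimensions, so a report is a lookup rather than a scan of the transaction tables. The dimension names and figures are purely illustrative, not any particular schema.

```python
# Minimal sketch: pre-aggregate facts along known dimensions at load time,
# so a pre-formulated question becomes a direct lookup. Illustrative data only.
from collections import defaultdict

# Fact rows as they might arrive from a transaction system (illustrative).
sales = [
    {"customer": "ACME",   "product": "widget", "month": "2012-03", "amount": 120.0},
    {"customer": "ACME",   "product": "gadget", "month": "2012-03", "amount": 75.0},
    {"customer": "Globex", "product": "widget", "month": "2012-03", "amount": 40.0},
]

# Pre-aggregate along the (product, month) dimensions.
cube = defaultdict(float)
for row in sales:
    cube[(row["product"], row["month"])] += row["amount"]

# The pre-formulated question "total widget sales in March 2012" is now a lookup.
print(cube[("widget", "2012-03")])   # 160.0
```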
And here is the first problem: transaction systems with lots of data, and an analysis system with an empty database. The solution: set up and run a batch process for each working database that takes a snapshot of its data, transforms it, and loads it into the OLAP database. This is ETL (Extract, Transform, Load), and it is where most big-company systems are at the moment. The transaction systems have no method of exporting the data themselves, and the analysis engine just works from what it has been given. This three-part solution works, and it works well, but it has some problems.
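As a rough illustration of the pattern (not any particular vendor's ETL tooling), a nightly snapshot job can be sketched as below; the table names, connection type and transformation are assumptions made for the example.

```python
# Minimal sketch of a nightly snapshot ETL job: extract everything from a
# transaction database, transform it into the OLAP schema, load it in bulk.
# Table names, schema and the transformation are illustrative assumptions.
import sqlite3  # stand-in for the real transactional and OLAP databases

def extract(source_conn):
    """Take a full snapshot of the transaction table."""
    return source_conn.execute(
        "SELECT customer, product, amount, sold_at FROM orders"
    ).fetchall()

def transform(rows):
    """Reshape rows into the OLAP fact-table layout (month bucket, numeric amount)."""
    return [(customer, product, sold_at[:7], float(amount))
            for customer, product, amount, sold_at in rows]

def load(olap_conn, facts):
    """Bulk-load the snapshot into the OLAP fact table."""
    olap_conn.executemany(
        "INSERT INTO sales_facts (customer, product, month, amount) VALUES (?, ?, ?, ?)",
        facts,
    )
    olap_conn.commit()

def run_nightly_etl(source_conn, olap_conn):
    """The whole 'midnight' batch: snapshot, reshape, bulk load."""
    load(olap_conn, transform(extract(source_conn)))
```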
Running a snapshot ETL on each working system at "midnight" obviously takes time, and the data can be nearly a day old before the process even starts. This lack of "freshness" did not matter too much five or even two years ago: it took so long to change systems as a result of the analysis that data a day or so old was not on the critical path. But today's systems can adapt much more rapidly, and business decisions need to be based on hourly or even by-the-minute data. (Of course, if you are in the stock and financial markets then your timescale is down to microseconds, and you have specialist systems tailored for that level of response.) So first we need to improve on our timing.
In order to do that we need to move from a just-in-case operation to a just-in-time one. Rather than collect all the data once a day, we need to be able to gather it exactly when we need it. Gathering it overnight as historic data is still important, and it makes the whole process work more smoothly and quickly, since the just-in-time data is now only a few hours' worth and can be processed that much faster to get it into the BI system. Now we have a two-legged approach: bulk batch loads plus focused, immediate updates. Sounds good, but the ETL software built for the batch work will not handle the real-time nature of the just-in-time (j-i-t) data requests.
For a start, the ETL process grabs everything in the transaction system database: all customers, all products, all markets. But a manager is generally going to ask for a report on a specific customer or product, and it would be endlessly wasteful to grab all that "fresh" data for every customer when only one is needed. So the j-i-t process has to be able to query the transaction system, rather than sweep up everything. It is also almost certain that the required report will need data from more than one transaction system, but probably not all of them. ETL is not set up to do this; what is needed is a system capable of directing queries at designated systems and transforming the results. And, finally, the extracted data may well need to be in a different format: we are now loading it directly into the Business Intelligence analysis engine for this report (for speed), not importing it into the OLAP database, which means the structure and semantics are all different.
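To make the contrast with the snapshot job concrete, here is a minimal sketch of a just-in-time pull: query only the designated systems, only for the customer in question, and reshape the rows for the analysis engine rather than for the OLAP schema. The source interface and record layouts are assumptions made for illustration.

```python
# Minimal sketch of a just-in-time pull: query designated transaction systems
# for one customer only, then reshape the rows for direct delivery to the BI
# analysis engine. Source names and record layouts are illustrative assumptions.
from typing import Callable, Dict, List

# Each "source" is just a callable taking a customer id and returning rows.
# In a real deployment these would be connectors to HR, billing, asset systems, etc.
Source = Callable[[str], List[dict]]

def jit_extract(customer_id: str, sources: Dict[str, Source], wanted: List[str]) -> List[dict]:
    """Query only the designated sources, only for the customer in question."""
    records = []
    for name in wanted:                      # not all systems, just the relevant ones
        for row in sources[name](customer_id):
            # Normalise each row into the shape the analysis engine expects,
            # rather than the OLAP fact-table layout used by the nightly load.
            records.append({"source": name, "customer": customer_id, **row})
    return records
```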
Increasingly, the tools of choice for these j-i-t operations are Federated Search systems such as MuseGlobal's Muse platform. They can search a designated set of sources (the transaction systems), run a specific query against them, and then re-format the results and send them directly to the analysis engine. Early FS systems were user-driven, but for this data-integration purpose the more sophisticated FS systems are able to accept command strings and messages in a wide variety of protocols, formats and languages and act on them, allowing the FS system to act as a middleman that gets the BI engine the data it needs exactly when it needs it. Muse, for example, through its use of "Bridges", can accept command inputs in over a dozen distinctly different protocols, and can query all the major enterprise management suites in a native or standards-based protocol.
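The broad shape of that middleman role can be sketched as follows. This is not Muse's Bridge API (which this post does not show), just a generic illustration of accepting one request, fanning it out to designated source adapters in parallel, and re-formatting the merged results for the analysis engine.

```python
# Generic sketch of the federated-search middleman role: accept a request,
# fan the query out to the designated source adapters, merge the results and
# re-format them for the analysis engine. Illustrative pattern only, not Muse's API.
import json
from concurrent.futures import ThreadPoolExecutor

class SourceAdapter:
    """Wraps one transaction system behind a common search() interface."""
    def __init__(self, name, search_fn):
        self.name = name
        self.search_fn = search_fn   # a real adapter would speak the source's own protocol

    def search(self, query: str) -> list:
        return self.search_fn(query)

def federated_query(query: str, adapters: list) -> str:
    """Run the query against all designated sources in parallel and merge results."""
    with ThreadPoolExecutor(max_workers=max(1, len(adapters))) as pool:
        futures = {pool.submit(a.search, query): a for a in adapters}
        merged = []
        for future, adapter in futures.items():
            for hit in future.result():
                merged.append({"source": adapter.name, **hit})
    # Re-format into whatever the analysis engine ingests; JSON here for illustration.
    return json.dumps(merged)
```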
Should we move?
The need for speed of analysis, and the volume of data involved, seem to grow every day. It takes time to extract all that data and to build a big OLAP database just in case we want it. What's more, building the structure, and changing it to adapt to changing analysis needs, takes time – a lot of it.
So modern BI systems have moved to holding their database in memory, rather than on disk, just so everything is that much faster. Modern analysis engines, many building on the Apache Hadoop project, can handle a lot of data on a big computer, and do it rapidly. Both Oracle (Exalytics) and SAP (Hana) have introduced such combined in-memory database plus analytics engines, and others are coming. (See here for an InformationWeek take on the war of words surrounding them.) These engines can be rapidly configured (often in real time, through a dashboard) to give a new analysis report – as long as they have the data!
Moving all that data from the transaction system takes
time, so the current mode is to leave it there and rely on real-time
acquisition of what is needed. This is of course much less disruptive, fresher,
and much more focused on the analysis at hand. This is not to say that
historical data is not important; it is, and it is used by these engines, but
the emphasis is more and more on that last bar on the graph.
So, again we need a delivery engine to get our data for
us from all the corporate data silos, get it when it is needed, and then deliver
it to the maw of the BI analytics engine. Once again the systems integration,
dynamic configuration and deep extraction technologies of a Federated Search
engine come to the rescue. Muse provides the real-time capabilities, parallel processing architecture, session management, and protocol flexibility to deliver large quantities of data on demand, or on a continuing "feed" basis.
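As a closing illustration of that "feed" mode (a generic sketch, not Muse's interface), a continuing feed amounts to repeatedly pulling whatever is new since the last delivery and pushing it straight to the analytics engine. The polling interval, source interface and delivery call below are assumptions made for the example.

```python
# Generic sketch of a continuing "feed": poll the sources for anything newer
# than the last delivery and push it straight to the analytics engine.
# Interval, fetch_since() and deliver() are illustrative assumptions.
import time
from datetime import datetime, timezone

def run_feed(fetch_since, deliver, interval_seconds=60):
    """fetch_since(timestamp) -> new records; deliver(records) -> hand them to the BI engine."""
    last_run = datetime.now(timezone.utc)
    while True:
        now = datetime.now(timezone.utc)   # mark the cutoff before fetching, so nothing is missed
        new_records = fetch_since(last_run)
        if new_records:
            deliver(new_records)           # straight into the analytics engine, no OLAP staging
        last_run = now
        time.sleep(interval_seconds)
```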