The world of independent Federated Search is diminishing;
last week IBM announced that they will be acquiring Vivisimo.[1] There are a number of interesting aspects to
this, and the analysts have covered some of them [2],[3], but some particular quotes
from IBM itself and the analysts piqued my interest:
“The combination of IBM's big data analytics capabilities with Vivisimo software
will further IBM's efforts to automate the flow of data into business analytics
applications …” [IBM]
“IBM also intends to use Vivisimo's technology to help fuel the learning process for
their Watson applications.” [IDC]
“Overall, this is a very smart move for IBM, and it indicates that unstructured
information is going to play an increasingly large role in the Big Data story…” [IDC]
All this shows that the handling of structured and
unstructured information is growing in importance.
What does IBM want Vivisimo for? It all seems to stem
from Big Data and the analytics that it can produce to enable better corporate
decisions. Of course, there’s also the
lovely teaser of a better performing Watson! Both Watson and Analytics massage
vast amounts of data and information to draw conclusions, assign values, and
create relationships. But, like all such endeavors, the quality of the result
depends critically on the quality of the incoming data. GIGO says it all!
Big Data analytics work very well with structured data,
where the “meaning” of each number or term is exactly known and can be
algorithmically combined with its peers, parents, siblings, and opposites to
give a visualization of the state of play at the moment or over time. Gathering
such data is a tedious process (hooray for computers!), but is not
intrinsically difficult. All that needs to happen is to set up a mapping from
each data Source to the master and let it run. The mappings are precise and the
process effective, but the volumes are vast and the time-to-repeat is rather slow
for today’s fast-paced world.
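
To make the "mapping from each data Source to the master" idea concrete, here is a minimal Python sketch. The field names, the CRM_MAPPING table, and both helper functions are invented for illustration; they do not describe any particular product.

```python
# Hypothetical mapping table: source field name -> master field name.
CRM_MAPPING = {"cust_id": "customer_id", "amt": "amount", "dt": "date"}


def map_record(record, mapping):
    """Rename the fields of one source record to the master schema.

    Source fields that are not in the mapping are dropped.
    """
    return {
        master: record[source]
        for source, master in mapping.items()
        if source in record
    }


def load_batch(records, mapping):
    """Apply the same mapping to every record in a batch."""
    return [map_record(record, mapping) for record in records]
```

Because the mapping is just a lookup table, the same code runs unchanged for every Source that shares a schema. That is what makes this side of the problem tedious rather than difficult.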
However, now add the fact that not everything you want to
know is held in those nice regular relational database tables, and the picture
looks far less rosy. Product reviews are unstructured, press releases are
vague, social comments are fleeting, and technical and legal documents tend to
be abstruse. But all these are vital if you want to make a really informed
decision. So bring in Federated Search to the rescue.
Federated Search is a real time activity. It is focused
on just what data or information is needed now. And it provides quality data.
It is directed to just those Sources needed for “this report”, and it analyzes
them in terms of known semantics so that the reviews, blogs, etc. mesh with the
numerical analytics, and then provide the essential “external view” of the situation.
And this is done right now, in real time. For the knowledge based systems (like
Watson) the FS Sources provide in-depth data pertinent to the current problem.
And if the Sources don’t have it, FS goes and finds it, thus allowing Watson (as an example) to add it to its knowledge base, and provide a
more informed opinion.
So that is why IBM is adding Federated Search to its
armory. What are the issues? In a word (or two): coverage and completeness.
All the Big Data systems use standardized access to the
massive databases of the corporation’s transaction and repository systems. Most
of these understand SQL or some other standard access language, and the
customization is a matter of reading a schema mapping table. That mapping table
is the same for every SharePoint or Exchange system (or similar), so once
created, it is easily deployed. These types of standardized accesses are often
referred to as “Indexing Connectors” because they extract enough data to enable
the content to be indexed and searched. (For more on this see a future post on
the deep differences between Connectors and Crawlers.)
Now, move to the world of web data and the complexity and
difficulty escalates enormously. The number
of formats and access methods multiplies almost to the point of one-to-one for
each Source. As an example, look at the two announcements of this acquisition:
IBM’s is a press release with an initial dateline and no tags; Vivisimo’s [4]
is a blog post with tags and an author. The same Connector will not make sense
of both at the level of detail needed for a decision making analysis.
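
A rough Python sketch of why one Connector cannot serve both follows. Everything in it is assumed for illustration: the "PLACE, DATE -- body" press release layout, the blog post dictionary keys, and the function names. Neither IBM's nor Vivisimo's actual page structure is being described.

```python
def parse_press_release(text):
    """Parse an assumed 'PLACE, DATE -- body' layout; no author, no tags."""
    dateline, _, body = text.partition(" -- ")
    place, _, date = dateline.rpartition(", ")
    return {"place": place, "date": date, "author": None,
            "tags": [], "body": body.strip()}


def parse_blog_post(post):
    """Parse an assumed dict with 'author', 'tags', 'posted', 'content'."""
    tags = [tag.strip() for tag in post["tags"].split(",") if tag.strip()]
    return {"place": None, "date": post["posted"], "author": post["author"],
            "tags": tags, "body": post["content"].strip()}


# One parser per kind of Source; both produce the same record shape.
PARSERS = {"press_release": parse_press_release, "blog": parse_blog_post}
```

The two parsers share nothing except the shape of the record they return, and that common shape is what lets the downstream analysis treat both Sources alike.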
Add in the velocity of the data in the social media
(“velocity”, as you will recall, is one of the 3 “v”s that define Big Data –
Volume, Variety, Velocity) and the relatively slow aggregation times of
conventional databases become a problem. Timing is an issue because of volume,
but also because applications have to analyze input data from users and other
sources, store it in their transactional database, and then the ETL function
has to extract from that database and move the data to the analytics database
or storage area. These are two stages, both relatively slow, that must be
batched together.
So, once moving from structured data to unstructured data,
and from the sheltered waters of the corporation to the rough seas of the Web,
a very different set of techniques is needed. And that is where Federated
Search (FS) comes in. This is the truly
difficult part, and it’s where MuseGlobal shines. But first, some more information on what FS
is, and what it needs to do.
FS is immediate, which involves many synchronization and
“freshness” issues, but essentially solves the “velocity” problem by obtaining
data as it is needed. That is because FS is an “on-demand” service. It is
brought into play just-in-time to get the data when needed, not in batch mode
to store it away just-in-case. Since it is used when needed it needs to be able
to target the Sources of interest right now. That means it is flexible and
dynamically configured, not painstakingly set up ahead of time and left alone.
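
The on-demand, pick-your-Sources-now behavior can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each Connector is modelled as a plain callable that takes a query and returns a list of dict records, and the name federated_search is invented here. A production system would also need per-Source timeouts, authentication, and result merging.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def federated_search(query, connectors):
    """Send one query to the chosen Sources in parallel.

    connectors: dict mapping a Source name to a callable(query) -> list of dicts.
    Returns (results, failures); a failing Source does not stop the others.
    """
    results, failures = [], {}
    if not connectors:
        return results, failures
    with ThreadPoolExecutor(max_workers=len(connectors)) as pool:
        futures = {pool.submit(fn, query): name
                   for name, fn in connectors.items()}
        for future in as_completed(futures):
            name = futures[future]
            try:
                for record in future.result():
                    # Tag each record with the Source it came from.
                    results.append({"source": name, **record})
            except Exception as exc:
                failures[name] = repr(exc)
    return results, failures
```

The set of Connectors is passed in with each call rather than fixed at start-up, which is the "dynamically configured" part: a different report can target a different set of Sources.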
Since it is a focused operation, targeting only the data
needed, it must be able to get the maximum out of each Source. This requires
two levels of complexity not common in other types of connectors or crawlers. First, the
Sources have specific protocols and search languages and often security
requirements. All these must be handled by the FS Connector so that the search
is faithfully translated to the language of the Source, and the results are
accurately retrieved. Second is getting the retrieved data into a useable form
(and format). This involves a “deep extract” involving record formats,
field/tag/schema semantics, content semantics, data normalization and
cleansing, reference to ontologies, field splitting, field combination, entity
extraction on rules and vocabularies, conversion to standard forms, enhancement
with data from third Sources, and other manipulations. None of this is
off-the-shelf processing where a single connector can be parameterized to work
with all Sources. So FS has started at the “single, deep” end of the spectrum
(crawlers are the epitome of the “broad, shallow” end) and builds Connectors to
the characteristics of each Source.
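
A small Python sketch of a few of the "deep extract" steps (conversion to a standard form, field splitting, entity extraction against a vocabulary, and cleansing) might look like this. The vocabulary, the accepted date formats, the "Last, First" author convention, and all function names are assumptions made for the example. Month names are parsed with strptime, so it assumes an English locale.

```python
import re
from datetime import datetime

# Hypothetical controlled vocabulary: lower-case token -> canonical name.
COMPANY_VOCABULARY = {"ibm": "IBM", "vivisimo": "Vivisimo",
                      "museglobal": "MuseGlobal"}
# Date layouts this imaginary Source is assumed to use.
DATE_FORMATS = ("%Y-%m-%d", "%d %B %Y", "%B %d, %Y")


def normalize_date(raw):
    """Convert a date string to ISO 8601, or None if no format matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def split_author(raw):
    """Field splitting: 'Last, First' -> separate first and last names."""
    last, _, first = raw.partition(",")
    return {"first": first.strip(), "last": last.strip()}


def extract_companies(text):
    """Entity extraction by vocabulary lookup, in order of first mention."""
    found = []
    for word in re.findall(r"[A-Za-z]+", text.lower()):
        name = COMPANY_VOCABULARY.get(word)
        if name and name not in found:
            found.append(name)
    return found


def deep_extract(raw):
    """Turn one raw Source record into a normalized record."""
    return {
        "date": normalize_date(raw["date"]),
        "author": split_author(raw["author"]),
        "companies": extract_companies(raw["body"]),
        "body": " ".join(raw["body"].split()),  # collapse stray whitespace
    }
```

Every constant at the top of the sketch is specific to one imaginary Source, which is the point: the rules change from Source to Source, so the Connector has to change with them.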
These Connectors bring focused, quality data, but they
come at a price. Vivisimo, MuseGlobal, and the other FS vendors build a very
special type of software – something that we know will eventually fail when
the characteristics of the Source change. This needs a special dynamic
architecture to accommodate it. It needs very powerful ways to build Connectors
which can involve data analysts and programmers, as well as highly
sophisticated tools, such as the Muse Connector Builder. It needs a robust and
automated way to check for end-of-life situations, such as the Muse Source
Checker, and a highly automated build and deploy process – the Muse Source
Factory has been delivering automated software updates for 11 years now. Source
Connectors *will* stop working, and a big part of a viable FS ecosystem is
being able to get them back on line quickly and reliably. MuseGlobal has put together a data
virtualization platform with thousands of Connectors, because we know there’s a
one-to-one relationship with each data source if you want to connect to the
world out there. Figuring out the
unstructured data problem was one of our main goals at Muse from the very
beginning, some 11 years ago.
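
The idea of checking Connectors for end-of-life can be illustrated with a very small Python health check. This is a generic sketch, not a description of how the Muse Source Checker works. The required field list, the probe query, and the function name check_connector are all assumptions, and a Connector is again modelled as a callable returning a list of dicts.

```python
# Fields every healthy Connector is assumed to return for a known probe query.
REQUIRED_FIELDS = ("title", "date", "body")


def check_connector(connector, probe_query):
    """Run a known query and report whether the Connector still works."""
    try:
        records = connector(probe_query)
    except Exception as exc:
        return {"ok": False, "reason": f"error: {exc}"}
    if not records:
        return {"ok": False, "reason": "no records"}
    # A Source redesign usually shows up as fields that come back empty.
    missing = sorted({field for record in records
                      for field in REQUIRED_FIELDS if not record.get(field)})
    if missing:
        return {"ok": False, "reason": "missing fields: " + ", ".join(missing)}
    return {"ok": True, "reason": ""}
```

Run on a schedule against every Connector in the library, a check like this flags a broken Connector for rebuilding before a user runs into the failure.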
Of course, building Connectors in the first place is an equal
challenge, including the human element of dealing with a multitude of companies
publishing information and data. This is something all FS vendors have to
handle, and MuseGlobal chose to create a Content Partner Program about 10 years
ago where we talk regularly to hundreds of major Sources and content vendors.
Breadth of coverage of the Connector library is a major factor in “getting up
and running” time, and a major investment for the FS vendors. We believe that
Muse has one of the largest libraries, with over 6,000 Source-specific
Connectors, as well as Connectors for all the standard APIs, protocols, and search
languages where that kind of access is appropriate, but still with the “deep
extraction” which is the hallmark of Federated Search.
It is not an easy task to get right at a quality and
sustainable level, but a few vendors have produced the technology. MuseGlobal
is one – and Vivisimo is another.
IBM Analytics and Watson are set for a real quality
revolution!
Another analyst's comments can be found on the enterprise search blog at [6].
[5] http://vivisimo.com/technology/connectivity.html
[6] http://www.typepad.com/services/trackback/6a00d8341c84cf53ef016304c436dc970d
(*) You will need to be a subscriber to see the report