IBM Watson's natural language processing takes crack at unstructured data
Craig Rhinehart is director of strategy for enterprise content management at IBM. His ECM Insights blog covers a variety of information management topics.
He started his first business at 19, bought his first company at 22 and has been involved with seven acquisitions. He came to IBM through the 2006 acquisition of FileNet, where he was vice president of product marketing.
He has recently been talking about Watson, the IBM computer that defeated some human contestants on the US quiz show Jeopardy earlier this year.
Jeopardy was launched in 1964. The peculiarity of its format lies in responses being given in the form of a question. So, "Author of America; negotiated Louisiana Purchase" should elicit the response "Who is Thomas Jefferson?"
On a recent trip to London, Rhinehart briefed SearchDataManagement.co.UK. What follows is an edited version of that conversation.
What's the significance of Watson's triumph on Jeopardy? It's not a show we have here in the UK.
Craig Rhinehart: We don't think of Watson as a game show computer. We think of it as a breakthrough in computing.
Unstructured information and communicating in natural language have not been well-adopted in IT terms. This technology will enable new ways to interact with computers, opening up new solutions.
Natural language is very ambiguous, as opposed to data, where a five is always a five. By contrast, the meaning of the word "major," to give an example, is dependent on context - a major in the army, a term in music, an adjective meaning "big," and so on. In natural language, we speak in riddles, abbreviations, with pop culture references.
For computers, numbers are easy; ambiguous things are really hard. But more than 80% of our information is unstructured, and we are expecting 44 times more growth in the next 10 years.
In the knowledge management (KM) sphere, a dozen or so years ago, similar things were said about the importance of unstructured content and tacit knowledge discovery. IBM was very involved in KM. How would you compare and contrast 2011 with, say, 1998?
Rhinehart: The legacy approaches [to unstructured information] have all failed. Search is a good example. The correct answer to a question might not be the most popular item that shows up through a search engine. It's up to you as a user to break down your question for the search engine in terms of keywords. You then get pages and pages of results, ranked in order of popularity, with the influence of ads, and so on. You then spend more time reading through to get the answer. That is today's search experience, and if you are using it for decision support, that is anything but optimal.
Watson [on the other hand] is a natural-language-based interface machine. You can ask questions in natural language. It will search its trusted knowledge base, and it will return results based on its confidence score. On the TV show, it gave the top choice, but in real life, in business, you'd want multiple choices and confidence ratings.
It is a new paradigm. Keyword matching isn't always the way to go. You lose the context of the question. Keywords are not natural-language questions.
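To make the idea of "multiple choices and confidence ratings" concrete, here is a minimal Python sketch. It is not IBM's implementation: the `Candidate` class, the `rank_with_confidence` function, the raw evidence scores and the softmax normalisation are all assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    answer: str
    evidence_score: float  # hypothetical raw score from some upstream scorer


def rank_with_confidence(candidates, top_k=3):
    """Convert raw scores into confidences that sum to 1, best first.

    Assumption: a softmax over the raw scores stands in for whatever
    confidence estimation a real question-answering system would use.
    """
    if not candidates:
        return []
    peak = max(c.evidence_score for c in candidates)
    # Subtract the peak before exponentiating for numerical stability.
    weights = [math.exp(c.evidence_score - peak) for c in candidates]
    total = sum(weights)
    ranked = sorted(
        ((c.answer, w / total) for c, w in zip(candidates, weights)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:top_k]


# Made-up scores for the Jeopardy clue quoted earlier in this article.
candidates = [
    Candidate("Thomas Jefferson", 2.0),
    Candidate("James Monroe", 1.0),
    Candidate("Napoleon Bonaparte", 0.0),
]

for answer, confidence in rank_with_confidence(candidates):
    print(f"{answer}: {confidence:.2f}")
```

A game-show setting would take only the first entry; a business setting, as Rhinehart suggests, would show the whole ranked list so that the decision maker can weigh the alternatives.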
OK, but much of what you have been saying, in webinars and podcasts, has been based on American English representing "natural language." But there are surely two orders of challenge for computers - what one might call formal-linguistic and cultural-linguistic - not one? How confident are you that that doesn't matter?
Rhinehart: Watson is an English language-based technology, US and other. The Jeopardy game show has references to all flavours of the English language. Watson supports dialects and slang. IBM has multiple language support in other technologies that are deployed in call centres, crime fighting, in a whole slew of languages. It's not a limitation.
You see social content as a different animal to be harnessed. Why?
Rhinehart: There used to be a time when an executive might dictate a letter to a secretary. Word processors reproduced that. It was all about the document. Now, those kinds of media are no longer fast enough. I used to do all my work in Word and Excel. These days, my communication is much more short-form and casual, using text-based communication tools more appropriate to the audience - blogging, tweeting.
And it's about multimedia too?
Rhinehart: Yes, a few years ago you couldn't have done video unless you had a multimillion-dollar studio in your basement. Now you can do it on a laptop.
Does content analytics show up the limits of traditional business intelligence? If so, how?
Rhinehart: What prevents people getting the maximum value from BI? No. 1 is the time it takes to roll out BI projects. No. 2 is that BI only deals with structured data. Now, being an ECM [enterprise content management] guy, I care more about unstructured than structured information. You could have a different perspective from one of IBM's BI executives. But what we've done with our content analytics products is shorten the time to value to address the first problem. And we have also built integrations to structured data environments such as Cognos in BI or Netezza in data warehousing.
Isn't there a myth of perfect knowledge at play here that is necessarily unattainable? From a hard-nosed business point of view, you never have enough information, and that's where strategy and gut feel come in.
Rhinehart: The inability to find the right information when making a decision is very frustrating. Watson can bring the right information where a decision has to be made, relieving the decision maker of the task of finding all this stuff and deciding what is relevant. In the future, the goal will shift from trying to manage all your information to managing that which is relevant and valuable to your business. That's not what organisations tend to do today. All information is seen as equal.