My prior musings on search tools started from a discussion of how keyword searching has reached its limits. Courts and lawyers are struggling with the immense volume of information involved in discovery and are hoping for more cost-effective ways of finding relevant case information.

The effort to improve search goes well beyond legal discovery issues and straight into the heart of knowledge management (KM). KM’s sister struggle works from two angles. First, there is an effort to better structure information as it is captured. Second, there are efforts to create structure out of chaotic information (a.k.a. BLOBs), which is where next-generation search tools come into play.

Previously I lumped concept and semantic search into the same broad category. This time I will differentiate between the two and take the discussion another step. For now I will break search into three categories: Keyword, Concept and Semantic.

Keyword searching, for this discussion, means searching for exact word matches. The match can be a complete word, such as Nation, or the root word extended into variations such as National, Nations or Nationalistic. The main technical feature of keyword search is that the computer looks for literal, character-for-character matches to the search query; it has no notion of what the words mean. The method has been extended and improved by adding search by data type (in a structured database), search for multiple words within a defined proximity (e.g. within 10 words) and known connectors (e.g. and, or, not …). The keyword method has been very useful to date, especially when searching large structured databases, where it allows users to search by date, location, category, etc., and come up with useful results.
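To make the mechanics concrete, here is a minimal sketch of keyword search in Python. It is only an illustration under stated assumptions: root matching is approximated with a simple prefix test (real engines use proper stemming), and the function names and sample documents are invented for this example.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def positions(tokens, root):
    """Positions of tokens that begin with the root (a crude stand-in for stemming)."""
    root = root.lower()
    return [i for i, tok in enumerate(tokens) if tok.startswith(root)]

def within(tokens, root_a, root_b, max_distance=10):
    """True if a match for root_a sits within max_distance words of a match for root_b."""
    return any(
        abs(a - b) <= max_distance
        for a in positions(tokens, root_a)
        for b in positions(tokens, root_b)
    )

def search(docs, all_of=(), none_of=()):
    """Connector search: every root in all_of must match (AND), no root in none_of may (NOT)."""
    hits = []
    for doc_id, text in docs.items():
        tokens = tokenize(text)
        has_all = all(positions(tokens, root) for root in all_of)
        has_none = not any(positions(tokens, root) for root in none_of)
        if has_all and has_none:
            hits.append(doc_id)
    return hits

# Hypothetical sample documents, made up for illustration.
documents = {
    "doc1": "The national budget was debated by both nations.",
    "doc2": "Brown paint was on sale.",
    "doc3": "Nationalistic rhetoric rose while the budget stalled.",
}

print(search(documents, all_of=["nation"]))
print(search(documents, all_of=["nation"], none_of=["rhetoric"]))
print(within(tokenize(documents["doc3"]), "nation", "budget", max_distance=3))
```

Note that the code only ever compares strings. It will happily return the "Brown paint" document to someone looking for a person named Brown, which is exactly the limitation discussed next.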

The problem with keyword searching is the expanding mass of unstructured information we now have. Against that mass, keyword searching has become inadequate and at times counterproductive for finding the right information quickly and affordably.

Concept search is one method for solving this problem. My definition: the ability to extract structure from unstructured data. In plain English, this means the tool can evaluate text, break it down into its component parts and ascertain their structure. As an example, feeding this article into a concept search engine would identify paragraphs, sentences, nouns, phrases and perhaps even proper names, dates and numbers. This allows a database-like search, where the user can search for Name = Brown (and not get color references) or Date = December 25 and get back useful results. Concept search tools are just coming into the market, with players like Recommind, Autonomy and Collexis. As with any emerging technology, the challenge is good implementation. Companies and firms are attacking this problem now, so I expect the challenge to diminish over time.
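A toy sketch of the idea in Python follows. It is not how any of the vendors named above work; their methods are not described here. It assumes two deliberately naive extraction rules (a name is a capitalized word after an honorific, a date is a month name followed by a day number), and every identifier and sample memo is hypothetical.

```python
import re

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Naive extraction rules, assumed for illustration only.
NAME_RE = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.\s+([A-Z][a-z]+)")
DATE_RE = re.compile(rf"\b({MONTHS})\s+(\d{{1,2}})\b")

def extract_fields(text):
    """Pull structured fields (names, dates) out of free text."""
    return {
        "names": set(NAME_RE.findall(text)),
        "dates": {f"{month} {int(day)}" for month, day in DATE_RE.findall(text)},
    }

def search_field(index, field, value):
    """Database-like lookup, e.g. field='names', value='Brown'."""
    return [doc_id for doc_id, fields in index.items() if value in fields[field]]

# Hypothetical sample documents.
documents = {
    "memo1": "Dr. Brown signed the contract on December 25.",
    "memo2": "The brown envelope arrived on March 3.",
}

# Extract structure once, then search the structure instead of the raw text.
index = {doc_id: extract_fields(text) for doc_id, text in documents.items()}

print(search_field(index, "names", "Brown"))
print(search_field(index, "dates", "December 25"))
```

A plain keyword search for "brown" would match both memos. Searching the extracted name field returns only the memo that mentions a person called Brown.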

Semantic search is truly Web 3.0. Sir Tim Berners-Lee suggested this concept over a decade ago and now efforts are under way to make it a reality. My definition: attach meaning to each piece of data. In practice this means describing each piece of information by its relationship to another piece. In the geek world this is referred to as “subject, predicate, object” and is defined with a standard called RDF, the Resource Description Framework (more on that in another post). To give a very simple example: Mary has son Dave. This is referred to as a “triple.” In our semantic world we will move away from structured databases to a store of these triples. Extending our example with another triple: Dave has spouse Judy. Combining these two triples, and given a rule that the mother of one's spouse is one's mother-in-law, a computer can determine that Judy has mother-in-law Mary. The result is that the machine can understand the data. In fact, in this environment the machine can discover knowledge. By connecting all the triples via their relationships, the machine can answer questions we never thought to ask. This is a quantum leap ahead of keyword searching.
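The Mary, Dave and Judy example can be sketched in a few lines of Python. This is a simplified illustration, not RDF itself: triples are stored as plain tuples, the predicate names (has_son, has_spouse, has_mother_in_law) are made up for this example, and the single inference rule assumes, as the example does, that the parent in a "has son" triple is the mother.

```python
# A tiny triple store: each fact is (subject, predicate, object).
triples = {
    ("Mary", "has_son", "Dave"),
    ("Dave", "has_spouse", "Judy"),
}

def infer_mothers_in_law(facts):
    """Rule: if X has_son Y and Y has_spouse Z, then Z has_mother_in_law X."""
    inferred = set()
    for parent, pred_a, child in facts:
        if pred_a != "has_son":
            continue
        for subject, pred_b, spouse in facts:
            if pred_b == "has_spouse" and subject == child:
                inferred.add((spouse, "has_mother_in_law", parent))
    return inferred

def query(facts, subject=None, predicate=None, obj=None):
    """Return every triple matching the given parts; None acts as a wildcard."""
    return {
        t for t in facts
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    }

# Add the derived facts to the store, then ask a question nobody entered an answer for.
knowledge = triples | infer_mothers_in_law(triples)
print(query(knowledge, subject="Judy", predicate="has_mother_in_law"))
```

Nobody ever typed in the fact that Mary is Judy's mother-in-law; it falls out of combining the two stored triples with the rule.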

Semantic search currently lives mostly in the minds of geeks and venture capitalists (with some exceptions). Still, it is a viable and growing world. What it needs are more standards and some time to develop. Its potential is tremendous, but as yet undefined.

Well … that’s my attempt to capture the past, present and future of search technology in a blog post. Future posts will delve deeper into this, as it is such a critical aspect of KM and will define KM’s future.