January 01, 2002
Building a Synonymous Search Index

Peter Morville

WebReview.com: Building a Synonymous Search Index

Most of us are familiar with the traditional thesaurus or dictionary of synonyms. Roget's Thesaurus was an invaluable tool in helping us spruce up our high school English papers with impressive, multi-syllabic words like pulchritudinous. Simply put, this traditional thesaurus helped us go from one known term to multiple synonymous terms.

In contrast, a thesaurus for your Web site works primarily in the opposite direction, mapping many known terms onto one acceptable term per concept. Its purpose is to help users find the documents they need within a large information system. Until recently, online thesauri were familiar only to librarians, expert searchers, and developers of high-end information systems such as Dialog and MEDLINE. However, as Web sites and intranets grow into large mission-critical information systems, we're seeing a rising need to employ online thesauri as tools to help users find what they're looking for quickly and effectively.

What is a thesaurus?

A thesaurus can be defined as "a controlled vocabulary that leverages synonymous, hierarchical, and associative relationships among terms to help users find the information they need." It sounds rather complex, but once you understand the challenges that a thesaurus is designed to address, things should become clearer.

The value of a thesaurus stems from the inherent problems of natural language indexing and searching. Different users define the same query using different terms. Document authors, indexers, and information architects describe the same concepts using different terms. Consider the following example:

Figure 1: Many information needs go unanswered because a user's search terms don't map to the terms used by document authors and indexers.

Three users are looking for information about a car. However, they each use different terms to describe this same information need. Similarly, the people who indexed the documents selected different terms to describe the same concept. The users meet with varying levels of success, and no one finds all the relevant documents.

To address this problem, a thesaurus maps variant terms (synonyms, abbreviations, acronyms, and alternate spellings) to a single preferred term for each concept. For document indexers, the thesaurus tells them which index term must be used to describe each concept. This enforces indexing consistency. For users of the Web site, the thesaurus works in the background, mapping their keywords onto the single preferred term, so they find the complete set of relevant documents.

Figure 2: Variant terms serve as entry points into the information system, connecting the words that users have in mind with the preferred terms applied by document indexers.
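
The many-to-one mapping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the variant terms, the preferred term "automobile," and the document IDs are hypothetical and are not taken from the figures.

```python
# Hypothetical variant-to-preferred mapping; the terms are illustrative only.
PREFERRED_TERM = {
    "car": "automobile",
    "cars": "automobile",
    "auto": "automobile",
    "autos": "automobile",
    "motorcar": "automobile",
    "automobile": "automobile",
}

# Hypothetical index: indexers file documents under preferred terms only.
INDEX = {
    "automobile": {"doc1", "doc2", "doc3"},
}


def normalize(term):
    """Map a user's keyword onto its preferred term, if the thesaurus knows it."""
    key = term.strip().lower()
    # Unknown terms pass through unchanged.
    return PREFERRED_TERM.get(key, key)


def search(term):
    """Return every document indexed under the preferred term for `term`."""
    return INDEX.get(normalize(term), set())
```

With this sketch, a search for "Car," "auto," or "automobile" returns the same three documents, because every variant resolves to the one term the indexers were required to use.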

A thesaurus can also leverage the richness of hierarchical and associative relationships. Users may express their information need at a broader or narrower level of specificity than that used by the indexer to describe the documents. The mapping of hierarchical relationships addresses this problem.
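
One way to picture the hierarchical mapping is as a table of narrower terms that a search can walk. The sketch below is an assumption-laden illustration, not a prescribed design: the terms "vehicle," "truck," "sedan," and "convertible" and both function names are hypothetical.

```python
# Hypothetical hierarchy: each preferred term lists its narrower terms.
NARROWER = {
    "vehicle": ["automobile", "truck"],
    "automobile": ["sedan", "convertible"],
}


def expand_narrower(term):
    """Return `term` plus every term beneath it in the hierarchy."""
    found = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            # Queue the narrower terms of the current term, if any.
            stack.extend(NARROWER.get(current, []))
    return found


def broader_of(term):
    """Return the terms that list `term` as one of their narrower terms."""
    return [parent for parent, children in NARROWER.items() if term in children]
```

A user who searches at the broad level ("vehicle") can then be matched against documents indexed at a narrower level ("sedan"), and a user who searches too narrowly can be offered the broader term.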

Figure 3: A thesaurus can be more than a dictionary of synonyms. You can also specify and leverage hierarchical and associative relationships.

Additionally, there may be value in mapping associations to related terms. In this example, the decision is made that users interested in automobiles may also be interested in related terms such as mechanic and accident. Identifying these subjective relationships increases the chances of success and promotes associative learning. In a commercial setting, the explicit suggestion that if you're interested in a particular product you may also be interested in other related products can be valuable to both buyers and sellers.
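
Associative relationships can be stored as a simple "see also" table. The following Python sketch reuses the automobile, mechanic, and accident terms from the example; the table layout, the function name, and the choice to treat associations as symmetric are assumptions made for illustration.

```python
# Associative links between preferred terms, following the example above.
RELATED = {
    "automobile": ["mechanic", "accident"],
}


def suggest_related(term):
    """Return 'see also' suggestions for a preferred term."""
    suggestions = set(RELATED.get(term, []))
    # Assumption: associations are symmetric, so also include
    # any term whose own list points at `term`.
    for source, targets in RELATED.items():
        if term in targets:
            suggestions.add(source)
    return sorted(suggestions)
```

Because these relationships are subjective, the table is best thought of as an editorial decision rather than something derived automatically from the documents.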


How Do You Build a Thesaurus?
Peter lays out the steps you can take to build a thesaurus through term generation and consolidation.
