Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

C Programming


JAN95: C PROGRAMMING

C PROGRAMMING

Of Text Seeks and Perplexed Geeks

Al Stevens

In his November 1994 "Swaine's Flames," Michael Swaine said these words: "Magician is the second geekiest entertainment profession, right behind ventriloquist. If you think back to high-school talent shows, you'll see that I'm right."

Well. My first reaction was to go on the defensive. In high school, I was a ventriloquist and a magician and demonstrated both skills in talent shows. I did not know then that I was a geek. My eyes have been opened.

Michael's revelation gave me pause and cause to contemplate. Where did it all start? Perhaps in early childhood in the small Baltimore suburb where I was born. An older boy named Noel Stuckey would gather the neighborhood children at his house and put on shows where he performed magic tricks, used puppets and ventriloquism, and played the guitar and sang. A remarkable and talented boy, Noel achieved fame as the tall guy in the folk group Peter, Paul and Mary. Until reading Michael's column, I did not know that Noel Stuckey was a geek and that through his influence I was destined to be one too.

There must be other geeks, in and out of the closet. I went in search of geeks that everyone knows about, so that I might find kindred empathy and learn to live with this stigma. Where to look? Johnny Carson's hobbies include magic and ventriloquism. He also flies airplanes and is a competent jazz drummer. Everyone knows what a geek Johnny Carson is. Edgar Bergen was a respected humorist, famous radio and movie star, sired Murphy Brown, and was undisputedly the world's most famous geek, er, ventriloquist. Paul Winchell dropped out of show business as a popular ventriloquist to bring innovation to medical cybernetics. He invented a mechanical heart that saved thousands of lives. Who but a geek would do that? Anthony Hopkins, geek-in-hiding, learned ventriloquism to play the demented protagonist in Magic. Not only did he enjoy a romp with Ann-Margaret in that role, but went on to win an Oscar as geekmeister-apparent Hannibal the Cannibal in the movie version of Silence of the Lambs. Anyone who has witnessed David Copperfield's sensuous levitation of a gorgeous, willowy assistant can detect latent geekist tendencies straining to get out from under all that macho magic.

Finding geeks is easy, as it turns out, but how can you tell when someone is definitely not a geek--who is the Anti-Geek, the standard bearer, the example for the rest of us who would escape geekism? Who, indeed? Turn now to the last page in this magazine and look at Michael Swaine, whose total absence of geekism warrants a full-color portrait next to his column. Observe those steely eyes, the confident, self-assured gaze, the powerful, in-charge bearing and countenance. Look at the firm set to that jaw. (Use your imagination, it's in there somewhere.) See the Fitzgeraldian total repose, the casual gesture of the left hand. Somehow you know that the right hand, the one you can't see, sports a macho tattoo and grips a Marlboro between the first two fingers. This picture could be of a seasoned 747 captain, a rugged outdoorsman, a captain of industry, or even a HyperCard programmer. The antithesis of geekdom. The one who gets to say who the geeks are.

We are not worthy.

Static Text Databases

In last month's column I introduced a static text-search engine that will be the subject of the next several columns. This month describes the system's purpose, analysis, and architecture. As with all "C Programming" column projects, the source code and, in this case, the database are available to all readers.

Purpose

The purpose of the database engine is to permit online retrieval of small documents within a large, static text database. The retrievals should support simple navigation and complex Boolean queries. The navigation should allow retrievals based on a user-specified document identification as well as forward and backward scans of the database, one document at a time.

These are common requirements. Many organizations need to archive their textual resources and provide orderly and structured access to them. The purpose of this project, then, is to provide a foundation from which programmers can launch custom text-retrieval systems where the feature set or royalty policies of commercial packages are restrictive.

The engine in this project supports a front end that provides the user interface. The example application uses a DOS text-mode user interface to test the engine and a Visual Basic UI to implement the application. The engine itself is written in C and implemented both as a stand-alone DOS application for testing and a Windows DLL for the running application.

Analysis

To build a text database, you first perform an analysis to determine the best way to identify, store, index, and retrieve documents. You must develop the database organization as a function of the organization of the text itself. This first step will drive the details of the text engine as well as the user-interface front end. Furthermore, the results of this analysis will differ from project to project, and the differences will influence the architectures of the database and engine software.

Here are some of the questions you must answer: What is the hierarchical organization, if any, of the text database? What is the smallest identifiable entity of text? The largest? The levels between? Magazine archives are one example, and their answers are easy. The database hierarchy for a collection of periodicals could include: publication, issue year, issue month, article, paragraph. The Bible, the text database that this project uses as an example, is organized into testaments, books, chapters, and verses. Each individual text collection implies its own organization, usually derived from the volumes, chapters, headings, and so on of the source documents.

What is a document? In this project, a "document" is defined as the smallest identifiable text entity in the database--for this example, a verse. This works well in a database organization because verses in the Bible are numbered, and these numbers are the standard method of reference for quotes, stories, and so on. Furthermore, the largest verse is less than 600 characters long, representing a manageable document size for retrievals, searches, and display. Unlike most other text collections, there are no chapter and paragraph headings. The books have names--Genesis, Exodus, and so on--but nothing below that level has identifying text. Chapters are numbered within books, and verses are numbered within chapters. This organization eliminates tables of contents other than for a list of the books within the two testaments. The example project, therefore, has no support for tables of contents. Other text applications will need them. If you use this project to launch your own, you will need to augment the software accordingly.

The organization of the text determines how it is stored and how the retrieval tables work. At the lowest level, the documents are retrieved by document numbers, which are an ascending series starting at document #1 for the first logical document in the database and proceeding serially through to the highest document number. The example application uses document numbers 1 through 31102. Note that these document numbers are static. You determine what they will be in the analysis phase of the project. Inasmuch as the engine supports a static text database, the numbers never change once you have assigned them.

The user-interface front end does not deal with these document numbers. Users specify documents using the application-specific external identifying nomenclature. The text engine must be able to convert between its internal document number and the user's language, which, in the example, consists of book name and chapter and verse numbers (John 3:16 is document number 26137, for example). Initial analysis must determine how to do this run-time conversion. The example application uses a static table that records the number of books in each testament, the number of chapters in each book, and the number of verses in each chapter. I built the table by writing a C program that scans the text and counts verses and chapters within books. The program emits the table in C source code that the application includes. I'll describe the format of the table in a subsequent column.

The conversion must go in both directions. If the user asks for a particular document by its user nomenclature, the engine converts the specification to a document number. If the user steps or pages through the database a document at a time or retrieves a document with a query, the engine must convert the retrieved document numbers to their correct user nomenclature.

Indexes

A text engine must support retrievals. The nature of the application will determine what kinds of retrievals you must support. Typical retrievals use ranges of publication dates, author, subject matter, keywords, and Boolean word and phrase queries to find documents in the database. Each of these retrieval techniques implies an indexing function that, given a retrieval specification, returns a list of document numbers.

The example application uses only the phrase and word-search queries. The others are not applicable to either the text or the application or not possible given the size of the database and the absence of applicable scholarship and time available among the project team. Authors are either implied by the book names or unknown. The original dates of publication can only be approximated. Indexes of keywords and subject matter involve an extensive analysis of the subject matter with manual decisions being made about what to include. Other text databases will require that these analyses be performed. First you determine that the requirement exists. Next you develop a technique for recording the index. Then you employ specialists in the subject matter to extract and record the indexing keywords.

Word Index

The word and phrase queries use an index of text words. The index returns a list of document numbers where the argument word appears. An early part of the text analysis builds a word-frequency table. This table does not become a part of the database or its index. Its purpose is to identify the words that will not be included in the index. Some words appear so often that to include them in the index would generate extremely long document lists and result in an unnecessarily long index file. You want to eliminate words that have the highest frequency--this involves a word-frequency list, which requires some custom programs. The first program simply reads all the database text, extracts each word, and writes a file of word records, where each record contains a word. This program is specific to the application because it depends on the format of the source-text data. Assuming a common output format from the first program, the others can use common code across projects.

The second program sorts the word file into word sequence using a case-insensitive sort. The third program reads the sorted word file, counts each word, and writes a file with one record for each unique word. The record contains the word and its frequency count. The fourth program sorts the word-count file by the frequency count. Now you can look at the sorted output from that program and see which words occur with the greatest frequency. You can decide, based on the distribution of words, how far down into the frequency list to go when selecting words to omit from the index. The omitted words go into a table that the database-building program uses to decide which words to include in the index and the retrieval program uses to decide which words to omit from a search.

What about those omitted words? How does the system find documents that include the high-frequency words? Quite simply, it assumes that those words appear in every document in the database. For example, a query that searches for the phrase "this is it" will return an initial list consisting of every document in the database because those three words are probably in every English language high-frequency list. To work effectively, a query must contain other criteria with which to narrow the search. The system scans the documents selected by those other criteria to see if the high-frequency words also appear before adding the documents to the list of retrieved documents.

Boolean Word and Phrase Queries

Boolean word queries use the Boolean operators AND, OR, and NOT to construct query expressions based on the presence and/or absence of specified words. A phrase search looks for the combination of words in the phrase in the order in which they appear in the expression.

A query returns a list of documents that match the expression. The user can use this list to view the selected documents. The Boolean query expression "Jesus AND Mary" returns 11 verses. The expression "Jesus OR Mary" returns 977 verses. The UI should use a threshold of hits to determine whether or not to continue. In the first case, it might go ahead and display the list. But since building the list involves a certain amount of overhead for each entry, the user interface could permit the user to decide whether to proceed. Some application-specific number of hits is the built-in threshold that determines whether to involve the user in the decision.

Two-Pass Searches

Because the index assumes a hit on all high-frequency words and because the engine supports phrase queries, each query potentially involves two search passes. The first pass develops a list of potential hit documents. The second pass examines the text of those selected documents to see if they really do match the search criteria. The first pass of a phrase search constructs an internal Boolean query expression with every word in the phrase joined with the AND operator. That search returns all the documents that contain all the words in the phrase. The second search examines those documents with a serial, case-insensitive text scan to see if the actual phrase exists in the document. The scan ignores excess white space and punctuation in the expression and in the documents to develop the match.

Hypertext Links

Your project might include hypertext links. Most text-search projects do. A hypertext link highlights specific words and phrases in the displayed text and associates the highlighted data with another document in the database. Every reference to Stan Musial, for example, would be linked to his biography. Building the software is not difficult. The database includes embedded tokens that identify a hypertext phrase, which displays in a highlight, and a hidden document number, which is the hypertext link. The front end of the application provides a way for the user to select a link and jump to the linked document. That's the easy part. Setting up the links, however, involves an extensive manual effort by your applications specialists, who read all the text, identify the link phrases and the linked documents, and insert the link tokens into the text. The example application in this project does not support hypertext links, primarily because of the effort involved to establish the links. The implementation would be done primarily in the front end, which would manage the display of the link and the retrieval from the engine of the associated document. The database builder would need to recognize the link's semantics when building the index. The text engine itself does not need to know that it is supporting hypertext links.

Engine and Database Architecture

The text engine is a C program that can be compiled as a stand-alone DOS program or as a Windows DLL. In DOS mode, it includes a simulated user interface that exercises all of its retrieval functions. As a DLL, it provides entry functions that do the same thing.

The database consists of one file that is a concatenation of five files developed during the database-build task. The first three files implement the index, which associates words with document numbers. A fourth file is used to develop the position in the text of the start of each document. The last logical file is the text itself, compressed with a Huffman table. Figure 1 shows the relationship of the five files.

The index is implemented as a sequential list of words. The words are, of course, of variable length, so the first part of the index is a table of fixed-length offsets into the word list. The search uses the offsets to perform a binary search of the words to see if the word is in the database. The word list is followed by a file of document lists with one list for each word in the index. Each word in the word list points to a document list. Each document list contains a list of document numbers that contain the word.

Following the index is a table of fixed-length offsets into the text part of the database. Each offset represents a document number and contains a byte/bit offset value into the text.

The text itself uses Huffman compression, which is why the document offsets include byte and bit values. The lengths of the five files are a function of several application-specific variables. The number of unique indexed words, the total character length of those words, the word distribution among documents, and the number of documents all contribute to the lengths of the files. The database-build task develops these files individually, and their lengths are determined by viewing them in a directory listing. The actual lengths are built into the text engine with macros at compile time. The files are concatenated with the DOS COPY command.

Next Month...

Having analyzed the text and developed the format for the database and the index, the project will now focus on building the database from the raw text. That effort involved more custom software, which will be the subject of next month's column.

By the way, it is not my intention that the "C Programming" column be perceived as showing a reverent bias. Many people distrust discussions that focus on religious subjects when the particular forum is unrelated to matters of religion. Each of us has private and personal feelings about such things, knows that efforts by others to modify those feelings are usually without effect, and would prefer to be spared the trouble. A prominent contemporary PC C/C++ compiler in the past included spiritual messages scattered frequently throughout the documentation. I'll leave the assessment of their greater relevance to others, but the small, well-intentioned testaments added nothing to and in fact distracted from my understanding of the compiler. (The vendor has since dropped the practice.) I don't want the subject of this project to similarly deter anyone's continued patronage of this column. In short, you'll find no preaching here.

Figure 1 The text-engine database format.


Copyright © 1995, Dr. Dobb's Journal


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.