Natural Language Expansions for Tense and Number


June 1993/Natural Language Expansions for Tense and Number

Russell Suereth


Russell Suereth has been consulting for over 12 years in the New York City and Boston areas. He started designing and coding systems on IBM mainframes and now also builds PC software systems. You can write to Russell at 84 Old Denville Rd, Boonton, NJ 07005, or call him at (201) 334-0051.

This article expands the natural language processor presented in "A Natural Language Processor" (CUJ, April 1993) to include tense and number. Tense and number help determine the grammatical usage of auxiliaries and verbs, and derive meaning from the sentence. Tense indicates the time of the action or state: past, present, or future. Number indicates how many: one or more than one. These simple meanings help identify similar information between sentences. The processor uses similar information in sentences to generate an appropriate response.

I added and expanded several processes to implement tense and number. This natural language processor can now accept input sentences with auxiliary combinations, identify tense and number, and use meaning to generate grammatical responses. These expansions enable the natural language processor to process tense and number, and to respond more correctly.

Extracting Auxiliary and Verb Information

The processor uses auxiliary and verb information from the dictionary to help identify tense and number. Auxiliaries, words such as is, have, and would, can be combined to create other auxiliaries such as would have and would have been. Underlying structures identify auxiliary combinations in the input sentence.

Underlying structures define the kinds of input sentences that can be processed. The original processor had two underlying structures that defined two kinds of input sentences. In this article, I expand the underlying structures to accept auxiliary combinations. For instance, the structure NAME-AUX-VERB-PREP-DET-NOUN accepted Jim could run in the race. I expanded the structure to NAME-AUX-AUX-VERB-PREP-DET-NOUN and NAME-AUX-AUX-AUX-VERB-PREP-DET-NOUN. These structures accept Jim could be running in the race and Jim could have been running in the race.
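As a rough illustration of how an input sentence might be compared with these structures, the sketch below joins the word types of a sentence with hyphens and looks for an identical structure string. It is not the code from Listing 1: the function name find_structure, the structures table, and the fixed buffer size are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table of underlying structures named in the text.
   The processor's own structures are coded in check_underlying. */
static const char *const structures[] = {
    "NAME-AUX-VERB-PREP-DET-NOUN",
    "NAME-AUX-AUX-VERB-PREP-DET-NOUN",
    "NAME-AUX-AUX-AUX-VERB-PREP-DET-NOUN",
};

/* Returns the index of the structure equal to the hyphen-joined word
   types of the sentence, or -1 when no structure matches. */
int find_structure(const char *const types[], size_t n_types)
{
    char joined[128] = "";
    size_t i;

    assert(types != NULL);
    for (i = 0; i < n_types; i++) {
        /* +2 leaves room for a hyphen and the terminating NUL */
        if (strlen(joined) + strlen(types[i]) + 2 > sizeof joined)
            return -1;
        if (i > 0)
            strcat(joined, "-");
        strcat(joined, types[i]);
    }
    for (i = 0; i < sizeof structures / sizeof structures[0]; i++)
        if (strcmp(joined, structures[i]) == 0)
            return (int)i;
    return -1;
}
```

With the word types of Jim could be running in the race (NAME, AUX, AUX, VERB, PREP, DET, NOUN), this sketch selects the second structure in the table.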

Listing 1 and Listing 2 show only the expanded code for the original processor. This expanded code can serve as a model for processing tense and number, or be added to the original processor. You can get the complete processor code, including the expansions for tense and number, from the various sources for CUJ online source code (see page 6).

check_underlying contains the expanded structures. check_type matches the underlying structure to the input sentence. If the input sentence has the underlying structure, check_underlying concatenates the auxiliary words to the auxiliaries array. Then get_aux is called to find the auxiliary in the dictionary.

The dictionary contains the auxiliary tense, number, meaning, and usage. get_aux reads each dictionary record and calls match_aux to match the auxiliary with the dictionary word. If the match is successful, match_aux extracts the tense, number, meaning, and usage for later use. The natural language processor uses this information to match the input sentence auxiliary with the verb, and to retrieve the correct verb when generating a response.

The existence of a matching underlying structure determines the processor's success in understanding an input sentence. The coded underlying structures could be expanded further with code for phrase structures such as AUX, AUX-AUX, and AUX-AUX-AUX. Such an expansion would reduce the number of coded underlying structures while accepting more kinds of input sentences.

The processor also extracts verb and pronoun information from the dictionary to later help determine that an input sentence is grammatical, and retrieve the correct verb for a generated response. match_record extracts the verb and pronoun information according to the word type. That is, for the word type VERB, the processor extracts the verb root, usage, tense, and number. For the word type PRON, it extracts only the pronoun number.

The dictionary in this expanded processor has new words and information for processing tense and number, and a new layout for the new information. Listing 3 shows the new dictionary. Figure 2 shows the dictionary layout.

Determining Tense from Auxiliary and Verb

The processor determines tense primarily by matching auxiliary tenses with verb tenses. For example, in the sentence Jim had run in the race, the processor matches the tenses of the auxiliary had with the tenses of the verb run. The auxiliary had is defined in the dictionary as past tense. The verb run is defined as past, present, and future. The past tense matches successfully and so the sentence is past tense. If no match is successful, then the auxiliary and verb don't agree and the sentence is in error.

check_aux_verb matches the auxiliary with the verb. The routine first processes sentences that have an auxiliary. If the auxiliary and verb usage don't match, then the sentence tense and usage are unknown. If the auxiliary and verb usage match, then each auxiliary tense is matched with each verb tense. If the number of successful tense matches is one, then the matched tense and usage are assigned to the sentence tense and usage.

The usage helps identify the correct verb that can be used with the auxiliary. Some kinds of verbs can't be used with a specific auxiliary. For example, was running contains a correct verb and was run contains an incorrect one. The verb run has a ROOT usage, while was and running have an ING usage. The ING usage identifies a verb that ends with ing, NOAUX identifies a verb with no auxiliary, and ROOT identifies the main form of the verb. If the auxiliary and verb usage match, then that verb may be used with the auxiliary.

Some auxiliary and verb combinations have more than one tense match. Consider, for example, the sentence Jim can run in the race. The auxiliary can is in the dictionary as present and future tense. The verb run is in the dictionary as past, present, and future tense. This auxiliary and verb combination may be present or future tense. More than one tense match causes the tense to be unclear. In this case, you must use an alternative method to determine tense.
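The usage check and tense match described above can be sketched as follows. This is only an illustration, not the check_aux_verb code from the listings: the bit-flag representation of tenses, the function name match_tense, and the three result codes are assumptions chosen for this example.

```c
#include <assert.h>
#include <string.h>

enum { PAST = 1, PRESENT = 2, FUTURE = 4 };   /* assumed tense flags */
enum tense_result { TENSE_UNKNOWN, TENSE_MATCHED, TENSE_AMBIGUOUS };

/* Compares the usage and tenses of an auxiliary and a verb.
   *matched receives the tenses they share (0 when none). */
enum tense_result match_tense(unsigned aux_tenses, const char *aux_usage,
                              unsigned verb_tenses, const char *verb_usage,
                              unsigned *matched)
{
    unsigned common, bit;
    int count = 0;

    assert(aux_usage && verb_usage && matched);
    *matched = 0;
    /* A usage mismatch (for example ING against ROOT) leaves the
       sentence tense unknown. */
    if (strcmp(aux_usage, verb_usage) != 0)
        return TENSE_UNKNOWN;

    common = aux_tenses & verb_tenses;
    for (bit = PAST; bit <= FUTURE; bit <<= 1)
        if (common & bit)
            count++;
    *matched = common;
    if (count == 0)
        return TENSE_UNKNOWN;        /* auxiliary and verb don't agree */
    return count == 1 ? TENSE_MATCHED : TENSE_AMBIGUOUS;
}
```

Under these assumptions, a past-only auxiliary such as had combined with run (past, present, and future) yields a single match on PAST, while can (present and future) with run is reported as ambiguous and needs the sentence context to settle the tense.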

Determining Tense from a Previous Sentence

Input sentence context determines tense when auxiliary and verb tenses have more than one match. The processor determines sentence context by analyzing information in previous sentences. Look at the sentences Jim will run in the finals, and Jim can run in the first lane. The second sentence tense is unclear when only that sentence is analyzed. The auxiliary can is in the dictionary as present and future tense, and run is in the dictionary as past, present, and future. The second sentence may be present or future tense. But when the tense of the first sentence is also analyzed, then the second sentence tense can be determined and becomes future tense.

However, a previous sentence tense may be irrelevant or forgotten if that previous sentence occurred long ago. For example, the speaker says Jim will run in the finals and fifteen sentences later says Jim can run in the first lane. The listener may not be sure first lane refers to the finals. The sentence about the finals may even be forgotten. When no recent sentence indicates the tense, the listener assumes the tense is present.

check_aux_verb looks at the previous three sentences when the number of successful tense matches is greater than one. The current sentence's subject, action, and possible tenses are matched with the previous three sentences' subject, action, and tense. If a match is successful, then the matching sentence tense is assigned to the current sentence tense. If no match is successful, then the current sentence tense is present.
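A minimal sketch of this look-back follows. The struct layout, the name resolve_tense, and the choice to search the most recent sentence first are assumptions for illustration; the article does not state the search order, and the listings use separate arrays rather than a struct.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PAST = 1, PRESENT = 2, FUTURE = 4 };   /* assumed tense flags */

/* Hypothetical record of an earlier sentence. */
struct prior_sentence {
    const char *subject;
    const char *action;
    unsigned tense;          /* a single tense flag */
};

#define LOOKBACK 3           /* only the previous three sentences count */

/* Picks a tense for the current sentence from its possible tenses by
   checking up to LOOKBACK earlier sentences, newest first (assumed).
   Falls back to PRESENT when nothing recent matches. */
unsigned resolve_tense(const struct prior_sentence *history, size_t n_history,
                       const char *subject, const char *action,
                       unsigned possible)
{
    size_t i, checked = 0;

    assert(subject != NULL && action != NULL);
    assert(history != NULL || n_history == 0);
    for (i = n_history; i > 0 && checked < LOOKBACK; i--, checked++) {
        const struct prior_sentence *prev = &history[i - 1];
        if (strcmp(prev->subject, subject) == 0 &&
            strcmp(prev->action, action) == 0 &&
            (prev->tense & possible) != 0)
            return prev->tense;
    }
    return PRESENT;
}
```

If the history holds Jim will run in the finals as a future-tense sentence, then Jim can run in the first lane, with possible tenses present and future, resolves to future. If that earlier sentence lies more than three sentences back, the sketch falls back to present.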

Determining Tense with No Auxiliary

Some sentences don't have an auxiliary. In these sentences, the processor uses the verb to determine the input sentence tense. When the verb is past tense, then the sentence is past tense. The sentence Jim ran in the race is past tense. When the verb is not past tense, then the sentence tense is unclear. The sentence Jim runs in the race may be present or future tense. The previous three sentences are used to determine the tense when the tense is unclear.

check_aux_verb also processes tense when the input sentence has no auxiliary. The current sentence's subject, action, and possible verb tenses are matched with the previous three sentences' subject, action, and tense. If a match is successful, then the previous sentence tense is assigned to the current sentence tense. If no match is successful, then the current sentence tense is present.

Determining Number

The sentence subject determines number. The number can be singular or plural depending on whether the subject refers to one or more than one. The sentence He runs in the race shows a singular subject; They run in the race shows a plural subject. The auxiliary, verb, and subject number in a grammatical sentence must be all singular or all plural.

Many verb forms are both singular and plural. For example, running is singular in He is running in the race, and plural in They are running in the race. Many auxiliaries also can be singular and plural. For example, could be is singular in He could be running in the race, and plural in They could be running in the race. But some auxiliaries and verbs can be only singular or plural. check_number matches the auxiliary, verb, and subject number. If the match is successful, the matched number is assigned to the sentence number. If the match is unsuccessful, the sentence number is unknown.
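One way to picture the agreement test is as an intersection of number flags, as in the sketch below. The flag values and the name agree_number are assumptions for this example and are not taken from check_number in the listings.

```c
#include <assert.h>

enum { SINGULAR = 1, PLURAL = 2 };          /* assumed number flags */
#define ANY_NUMBER (SINGULAR | PLURAL)      /* form is both singular and plural */

/* Returns the number shared by auxiliary, verb, and subject,
   or 0 when they do not agree (sentence number unknown). */
unsigned agree_number(unsigned aux, unsigned verb, unsigned subject)
{
    /* Only the two number flags are valid inputs. */
    assert((aux | verb | subject) <= ANY_NUMBER);
    return aux & verb & subject;
}
```

For He is running, a singular subject and auxiliary with a verb form marked ANY_NUMBER give SINGULAR. A plural subject with a singular-only auxiliary gives 0, so the sentence number is unknown.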

Number is used to identify the correct verb in a generated response. Using number enables the correct verb to be extracted from the dictionary. The sentence Jim runs in the race has a correct verb because Jim and runs are singular. A subject and verb that agree help make the response grammatical and effective.

Meaning and the Response

The original processor generated only two kinds of responses. It returned a simple OK when given a statement, and someone's location when given a question. The previous sentence that matched the same subject and action words gave the location for the response. Previous sentences were matched by the same words to identify similar information. But similar information is often determined by meaning rather than by the same words.

The expanded response process uses tense and the auxiliary's meaning to help identify similar information, making the natural language processor sound more human. Figure 1 shows an example. The processor responded with correct information because it used tense and auxiliary meaning when it matched the sentences. The tense matched because the two sentences refer to past tense. The auxiliary meaning matched because did and had been mean PARTICULAR_POINT_OF_TIME.

The processor currently assigns auxiliary meaning only when the auxiliary exists. Auxiliary meaning helps match similar information between sentences. A sentence without an auxiliary meaning can't be properly matched to another sentence. A further processor expansion can derive an auxiliary meaning from the sentence when no auxiliary exists. The derived auxiliary meaning will allow a sentence without an auxiliary to be properly matched.

Matching Sentences for the Response

When the processor generates a response, it searches in previous sentences for similar information. The processor uses the most accurate information in the response. There are four separate matches the processor uses to find information for the response. The first match has the highest probability that the information is accurate; the last match has the lowest. People in conversation respond in a similar manner. A person may not have enough knowledge for a correct response. But that person may create an alternative response to show knowledge about the information.

The first match criterion is subject, action, tense, and auxiliary meaning. Subject and action are always in the match criteria for the response. This ensures that all matched sentences have the same subject and action. The sentences Jim had been running in the race and Where did Jim run? match because had been and did have the same tense and meaning. make_response matches information in the current sentence with information in previous sentences. In the first match, the subjects, actions, tenses, and aux_meaning arrays match the current sentence to a previous sentence. When a match is successful, make_answer generates a response with information from the previous, matched sentence (see Figure 1).

The second match criterion is subject, action, and tense. The subjects, actions, and tenses arrays match the current sentence to a previous sentence. Here, the sentences Jim had been running in the race and Where would Jim run? match because had been and would have the same tense.

The third match criterion is subject, action, and auxiliary meaning. The subjects, actions, and aux_meaning arrays match the current sentence to a previous sentence. The sentences Jim should have run in the race and Where should Jim run? match because should have and should share the same meaning. The tense is different, and so the response may not give the correct information. Because of this, the processor prefaces the response with I'm not sure, but.

The fourth match criterion is the same as in the original processor. Only subject and action are matched. The subjects and actions arrays match the current sentence to a previous sentence. The sentences Jim is running in the race and Where will Jim run? match because they have the same subject and action. But the tense and auxiliary meaning don't match so the response may be incorrect. Because of this, the processor prefaces the response with I'm not sure, but. When the four matches are unsuccessful, the processor can't find the similar information. Then make_response moves the statement I don't know to the response.
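The four match criteria can be sketched as a loop that tries the strictest criterion first and relaxes it step by step. This is an illustration only, not the make_response code: the struct (the listings use parallel arrays), the names find_match and response_prefix, the string representation of tense, and the newest-first search order are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-sentence record. */
struct sentence_info {
    const char *subject;
    const char *action;
    const char *tense;
    const char *aux_meaning;
};

/* Match levels in order of decreasing reliability. */
enum { NO_MATCH = 0, MATCH_ALL, MATCH_TENSE, MATCH_MEANING, MATCH_BASIC };

static int same(const char *a, const char *b) { return strcmp(a, b) == 0; }

/* Tries each criterion in turn over the previous sentences (newest
   first, an assumption). On success stores the index of the matched
   sentence in *found and returns the level that matched. */
int find_match(const struct sentence_info *prev, size_t n,
               const struct sentence_info *cur, size_t *found)
{
    int level;
    size_t i;

    assert((prev != NULL || n == 0) && cur != NULL && found != NULL);
    for (level = MATCH_ALL; level <= MATCH_BASIC; level++) {
        int need_tense = (level == MATCH_ALL || level == MATCH_TENSE);
        int need_meaning = (level == MATCH_ALL || level == MATCH_MEANING);
        for (i = n; i > 0; i--) {
            const struct sentence_info *p = &prev[i - 1];
            /* Subject and action are part of every criterion. */
            if (!same(p->subject, cur->subject) || !same(p->action, cur->action))
                continue;
            if (need_tense && !same(p->tense, cur->tense))
                continue;
            if (need_meaning && !same(p->aux_meaning, cur->aux_meaning))
                continue;
            *found = i - 1;
            return level;
        }
    }
    return NO_MATCH;
}

/* Text placed before the answer for a given match level. */
const char *response_prefix(int level)
{
    if (level == NO_MATCH)
        return "I don't know";
    return level >= MATCH_MEANING ? "I'm not sure, but " : "";
}
```

Ordering the levels from strictest to loosest lets one loop express the preference for the most reliable information, and the level returned decides whether the hedge is added to the response.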

Extracting the Correct Verb

The correct verb helps ensure a grammatical response. make_answer generates a response by moving appropriate words to the response variable. The appropriate subject and auxiliary are first moved to the response. Then get_verb is called to extract the correct verb and move it to the response. get_verb reads each record in the dictionary and calls match_verb to find the correct verb. match_verb matches the passed tense, number, and usage with the tense, number, and usage of the current dictionary record. When a match is successful, the correct verb is extracted from the dictionary and moved to the response. After the verb is extracted, the place where the action occurred is also moved to the response.
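The verb lookup can be pictured with a small in-memory table, as sketched below. The real processor reads dictionary records from a file (Listing 3), so the table, the struct fields, and the name find_verb are assumptions. The entries are illustrative: they follow the article where it states a value, and the remaining values are guesses, not dictionary data.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { PAST = 1, PRESENT = 2, FUTURE = 4 };   /* assumed tense flags */
enum { SINGULAR = 1, PLURAL = 2 };            /* assumed number flags */

/* Hypothetical verb record; the actual layout is shown in Figure 2. */
struct verb_entry {
    const char *word;
    const char *root;
    const char *usage;       /* ROOT, ING, or NOAUX */
    unsigned tenses;
    unsigned number;
};

/* Illustrative entries only, not the contents of Listing 3. */
static const struct verb_entry verb_dict[] = {
    { "ran",     "run", "NOAUX", PAST,                    SINGULAR | PLURAL },
    { "runs",    "run", "NOAUX", PRESENT | FUTURE,        SINGULAR },
    { "run",     "run", "ROOT",  PAST | PRESENT | FUTURE, SINGULAR | PLURAL },
    { "running", "run", "ING",   PAST | PRESENT | FUTURE, SINGULAR | PLURAL },
};

/* Returns the first verb form of the given root whose usage, tense,
   and number fit the request, or NULL when none does. */
const char *find_verb(const char *root, unsigned tense, unsigned number,
                      const char *usage)
{
    size_t i;

    assert(root != NULL && usage != NULL);
    for (i = 0; i < sizeof verb_dict / sizeof verb_dict[0]; i++) {
        const struct verb_entry *e = &verb_dict[i];
        if (strcmp(e->root, root) == 0 && strcmp(e->usage, usage) == 0 &&
            (e->tenses & tense) != 0 && (e->number & number) != 0)
            return e->word;
    }
    return NULL;
}
```

With this table, a request for the root run with ING usage returns running, and a request for a past-tense form with no auxiliary returns ran.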

Processing the Pronoun

The processor compares several word types to determine that the sentence agrees in tense and number. Another word type that must agree in number is the pronoun, words such as he, she, and they. A pronoun can replace the name in the sentence, and be the sentence subject. Pronouns must be defined in the underlying structures, be identified as singular or plural, and be used properly in a response.

The structure PRON-AUX-VERB-PREP-DET-NOUN (Listing 1) is for a statement, and WH-AUX-PRON-VERB (Listing 1) is for a question. These structures allow the input sentences He is running in the race and Where did he run?

Number is extracted from the dictionary and identifies whether the pronoun is singular or plural. match_record extracts the pronoun number and assigns it to the subject number. Subject number is used to determine number agreement and to extract the correct verb.

Pronouns in a generated response must have their first letter changed to uppercase or lowercase. If the pronoun is the first word in the response, then it must be uppercase. If the pronoun is not the first word, then it must be lowercase. make_answer has the code for the pronoun in a response. The subjects_type array has an entry for each sentence. Each entry identifies whether the subject is a name or a pronoun. If the response's subject is a pronoun, then the subject's first letter is changed. check_subject assigns values to the subjects_type array. Pronoun letter change helps the processor use the pronoun properly in a response.
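The letter change can be sketched with the standard toupper and tolower functions. The enum values and the name set_subject_case are assumptions for this example; the actual code is part of make_answer in the listings.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical stand-in for an entry of the subjects_type array. */
enum subject_type { SUBJ_NAME, SUBJ_PRON };

/* Adjusts the first letter of a pronoun subject in place: uppercase
   when it starts the response, lowercase otherwise. Names are left
   unchanged. */
void set_subject_case(char *subject, enum subject_type type, int is_first_word)
{
    assert(subject != NULL);
    if (type != SUBJ_PRON || subject[0] == '\0')
        return;
    /* The cast to unsigned char keeps the ctype calls well defined. */
    subject[0] = (char)(is_first_word
                            ? toupper((unsigned char)subject[0])
                            : tolower((unsigned char)subject[0]));
}
```

Because the function writes to its argument, the subject must be held in a modifiable buffer rather than a string literal.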

Conclusion

Several processes are required to process tense and number. The processes described in this article extract auxiliary and verb information; determine tense from the auxiliary and verb, a previous sentence, or with no auxiliary; determine number; match sentences for the response; extract the correct verb; and process the pronoun. Each of these processes is an expansion of the original natural language processor.

Further expansions to the processor can process time features. These expansions identify time words such as last week, match time meaning with auxiliary meaning, generate a response from time meaning, and generate a response to explain time and number errors.

The processes described in this article expand the natural language processor to include tense and number. Tense and number are used to extract the correct verb for a grammatical response. Tense and the auxiliary are used to match similar information between sentences. This match enables the processor to generate a response based on word meaning. These expansions provide the processor with more accurate responses.

(C) Copyright 1993 Russell Suereth.

Bibliography

Liles, Bruce L. 1971. An Introductory Transformational Grammar. Englewood Cliffs: Prentice-Hall.

Quirk, Randolph, and Sidney Greenbaum. 1973. A Concise Grammar of Contemporary English. San Diego: Harcourt Brace Jovanovich.

Suereth, Russell. April 1993. "A Natural Language Processor." The C Users Journal. Lawrence, KS: R&D Publications.

