COM Objects, C#, and the Microsoft Speech API


September 2002


One of the key benefits of the new .NET architecture is the ease with which you can get huge functionality with very little code. This article describes how accessing a few components from a C# program can create a voice-recognition/voice-response system that can form the framework for any voice-driven program. The example program can respond to verbal questions and create appropriate responses. It's all table driven so the recognized phrases and responses can be easily modified as the program is running.

The three components presented in this article are the Microsoft Speech API (SAPI 5.1 SDK), the DataGrid, and the OleDb database access mechanisms. These are married together in the sample program, which uses the DataGrid to edit a database table (which stores the recognized phrases and associated responses), the voice-recognition/text-to-speech system, and some interesting tidbits such as "scraping" data from a web site and having the computer "speak" the result. (Figure 1 shows a block diagram of the sample program.)

All three of these systems — DataGrid, OleDb, and SAPI — are sophisticated, complex objects with hundreds of properties and methods, so an exhaustive description would be quite lengthy. The point of this article is that you can achieve useful functionality with a very small subset of each of these objects, so your program can be off and running with a minimum learning curve.

Introducing the Microsoft SAPI

First, two brief descriptions of Text-to-Speech (TTS) and Speech-Recognition (SR). TTS is software functionality that generates output as spoken voice sounds for a text string. For this application, the text string is untagged, but there is a full suite of XML tags that can be added to a text string to control pronunciation, emphasis, speed of speech, and so on. SR software takes spoken sounds as input, usually from the computer's microphone, and creates events when it recognizes words or phrases. For this application, the SR engine is limited to recognizing specific phrases, or "Command Mode," as opposed to recognizing the full language in "Dictation Mode." Most speech engines adapt to a particular speaker by training and become more accurate the more they are used.

Microsoft's SAPI system makes integrating TTS and SR into an application straightforward. The API itself is an interface layer into which speech engines from Microsoft and other vendors can be attached. As such, it contains not only the required TTS and SR functionality but also numerous methods that allow you to query the characteristics of the underlying engine. Fortunately, Microsoft includes TTS and SR engines free with the SAPI SDK, and if you use them, you can be certain of the engines' capabilities when you create an application.

Because of its flexibility, SAPI is a fairly extensive and complex interface. This article shows you how to implement basic items fundamental to most TTS- and SR-enabled applications. After you've tried it, you can explore other features of SAPI through the documentation and samples included with the API. Unfortunately, the SDK was created before the advent of C#, but it contains both VB and C++ sample code. I found the VB samples easier to mentally translate into C# because all the SAPI access is COM, and C# and VB have quite similar COM access. Also, once you know the names of the objects you need to access, the IntelliSense features of C# will tell you most of what you need to know. For more information, see the "Useful C# Tricks" sidebar.

What to expect: TTS and SR are both evolving technologies that are far from perfect today. TTS can speak any text, but the available voices have a somewhat mechanical sound, and TTS engines can be quirky in the way they handle things like abbreviations, numbers, and so on. SR engines have reached a point of being useful in specific, confined applications. If the vocabulary of words to be recognized is small and the words don't sound alike (as is the case with the article's sample application), recognition can be quite accurate with multiple speakers and no training, even accounting for variations in speakers' voices. For dictation with a large vocabulary, training will be necessary — both training of the engine to recognize a specific speaker and training and practice by the user to learn how to speak in a way that achieves good recognition results.

I suggest using the SAPI 5.1 SDK because it has been released and is immediately available. As this is being written, Microsoft's SAPI.NET system is in beta; it will expand on the capabilities presented here, particularly in the area of speech-enabled web pages. The article's sample program was developed with SAPI 5.1 and appears to work properly with SAPI.NET as well. You'll need the SAPI 5.1 SDK, Visual Studio .NET, and a microphone for voice recognition.

Installing SAPI

The SAPI 5.1 SDK is free from Microsoft at http://www.microsoft.com/speech/download/sdk51. It's a whopping 68-MB download, so you may wish to order the CD if you're on a dialup line. Along with the normal SDK and documentation, you'll get demo programs and the Microsoft TTS and SR engines.

The reason the downloads for speech-related software are often so large is that they include the huge tables of vocabulary words and their associated pronunciations used by TTS engines, along with tables of word frequencies and transition frequencies, which are needed for SR engines. If you want your program to work in a language other than English, you'll also need to download a language pack — currently Japanese and Chinese are available from Microsoft for an additional 81.5-MB download.

Once the SDK is installed (with the included TTS and SR engines), you'll find a "Speech" entry in your Windows Control Panel. If you click on it, you'll be able to train the SR engine, select a default TTS voice, and so on. I suggest not training the SR engine at this point because it is interesting to try out the untrained engine to get a feel for the limitations of speaker-independent SR technology and then notice the improvement you get from training.

You may also wish to try out the sample programs that are included with the SDK (which are shipped as executables). The TTS Application (VB) example shows much of the flexibility in the TTS system, and the DictPad (C++) shows how to use the SR system for both command input (like the example program for this article) and for dictation.

If you are running Windows 2000 or earlier, the operating system does not inherently include any speech functionality. Windows XP includes a TTS engine and Office XP includes optional SR capabilities as well. Either way, you'll want to install the entire SDK to get the documentation and example programs. You can determine what speech capabilities your computer has from the Control Panel. If you have a "Speech" entry on your Control Panel, you have at least some speech components installed and can click on the entry and investigate the Speech-Recognition and Text-to-Speech tabs.

Making Your Computer Speak

With SAPI, making your computer speak is so easy that soon, all our programs will be babbling away. Once you have installed the SAPI SDK, all you need to do is create a project and add a reference to the "Microsoft Speech Object Library" (version 5.1); the SpeechLib namespace then includes everything you need. Just instantiate an SpVoiceClass object and call its Speak() method:

using SpeechLib;

// add this to your program's startup code
SpVoice voice = new SpeechLib.SpVoiceClass();
voice.Speak("Hello World",
    SpeechVoiceSpeakFlags.SVSFlagsAsync);

Synchronizing With Speech

The SVSFlagsAsync in the Speak() method call causes the TTS engine to start and run asynchronously and the Speak() method returns immediately. Your program is now free to perform additional actions in parallel with the TTS engine. If your program has nothing to do until the TTS engine is finished speaking, you could, instead, pass in SVSFDefault and the Speak() call won't return until your computer is done speaking the phrase you passed it. Alternatively, if your program needs to wait for the TTS engine to complete, it can call the WaitUntilDone() method:

// wait up to 2 seconds
voice.WaitUntilDone(2000);

or you can add an event handler that is called whenever the TTS engine runs out of things to say; see Listing 1.

Be aware that the event handler isn't called for each Speak() request but only when the TTS engine becomes idle. So you can call Speak("Hello",...) followed by Speak("World",...) and your computer will say "Hello world" and generate a single event after saying "world."
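Listing 1 isn't reproduced here, but a minimal sketch of such a handler looks like the following; it assumes the EndStream event (the event the sample program uses, as discussed later) and the delegate name the COM interop wrapper generates for the _ISpeechVoiceEvents interface, so confirm both with IntelliSense in your own project.

// Hook the EndStream event in your startup code.
// The delegate name follows the interop wrapper's naming convention.
voice.EndStream +=
    new _ISpeechVoiceEvents_EndStreamEventHandler(voice_EndStream);

// Called when the TTS engine finishes speaking and becomes idle.
private void voice_EndStream(int StreamNumber, object StreamPosition)
{
    // e.g., re-enable speech recognition here (see later)
}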

Choosing a Voice

You can select the default voice used by the TTS engine in the Control Panel under "Speech," but you can also do it programmatically. The SDK ships with a set of voices, and various other voices may be available depending on what other speech software you have loaded on your computer. The way to select a voice is to enumerate all the voices available to the TTS engine, getting a token for each one, and then select the one you want by assigning the token. The following code loops through the voices and selects the first one whose description contains the string in sVoiceName.

foreach (SpeechLib.SpObjectToken t in voice.GetVoices("", ""))
{
    string a = t.GetDescription(0);
    if (a.IndexOf(sVoiceName) >= 0)
    {
        voice.Voice = t;
        break;
    }
}

My favorite voice is "Microsoft Mary" and she's hard-coded into the sample program. If you'd like to let your end-user change it, add some code that uses this enumeration to fill a drop-down list and lets the user select the preferred voice.
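A minimal sketch of that idea, assuming a hypothetical ComboBox named cmbVoices on the form:

// Populate a drop-down with the available voice descriptions.
foreach (SpeechLib.SpObjectToken t in voice.GetVoices("", ""))
    cmbVoices.Items.Add(t.GetDescription(0));

// In the ComboBox's SelectedIndexChanged handler, run the selection loop
// shown above, using cmbVoices.Text as the sVoiceName to search for.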

Creating the Lookup Tables

Before jumping into the use of the SR engine, let's discuss what we are going to do with it. Our sample program, BackChat, can recognize a set of phrases and when a phrase is received, an associated verbal response is spoken. Accordingly, we need a lookup table that associates the command input phrases and output responses. We want the user (you in this case) to be able to edit the commands and responses so as to create entertaining dialogs and explore the capabilities and limitations of the SR and TTS engines. We also want the table to be nonvolatile — new phrases you add when you run the program should still be there the next time you run the program. BackChat has the following features:

  • Recognizes arbitrary spoken phrases ("Computer, what time is it?").

  • Responds with customized spoken responses ("It's 4:32 pm, Charles, thanks for asking.").

  • Allows programmatic responses (opens browser to a specific URL).

  • Allows the user to add/delete/edit recognized phrases, spoken responses, and programmatic responses dynamically.

  • Stores all data in a database table for nonvolatile storage and uses the DataGrid object for a GUI.

Using a Database for Nonvolatile Storage

I like to do things the easy way, and here is a technique I've used in many different applications. Most programs require some nonvolatile storage for configuration information; the sample program needs to store data about recognition phrases and responses from one run to the next and give the user the ability to edit them. While you can do this using any file format or the system registry, I prefer to use a database and, in particular, an MS-Access-format .mdb file because it's so easy and flexible. If you're not familiar with databases at all, the sample program offers an elementary example with a single database table that stores the data.

Why a database? Unlike the registry, you know what your settings are because you can view them from within Visual Studio and you can transfer them easily from one machine to another. You can even have multiple configuration versions (for testing) simply by having multiple databases. Unlike a custom binary file, you can easily read and modify the data from outside your program with Visual Studio (a real advantage when debugging) and not have to rewrite much if you change your mind about which data is to be saved. Even if a program ends up storing its data somewhere else, using a database offers an easy development framework that can be replaced with registry access later, if desired.

Further, Microsoft provides all the UI tools you need to allow the end user to edit table contents without your having to write much code. Opening a database is a little more involved than opening an ordinary file, but the code is always the same. You can cut and paste it, and during development you can open the data table(s) in a Visual Studio window to see and edit the data as you go. You can use this technique on almost any application.

Although SQL Server Desktop Edition ships with Visual Studio and offers you great control from within Visual Studio, if you store your nonvolatile data in it, your end user will need to have SQL Server as well. With an .mdb file, your end user only needs the appropriate JET driver (which is easily included in your installation). Support for .mdb files within Visual Studio is not as good as for SQL Server: You can view and edit data, but not the database table and column layout. So during development, you'll be flipping back and forth between Visual Studio and Access to create tables and columns.

If you're not familiar with methods of opening a database, this may seem a little complex and convoluted, but the sample program gets so much functionality out of so little code that it's worth it. Data is accessed through an OleDbConnection that defines which database to use and how to connect to it. The OleDbDataAdapter handles the actual reading from and writing to the database, while the DataSet is the structure that holds the data in RAM once it's been read in. The OleDbCommandBuilder allows changes to the DataSet to be posted back to the database without your having to write any database-update code yourself. The code fragment in Listing 2 connects to the database and reads all the data from the table named "BackChat" into a DataSet named "ds."
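Listing 2 isn't reproduced here, but it follows the standard OleDb pattern. Here is a minimal sketch, assuming a hypothetical file name BackChat.mdb (the sample program's actual connection string may differ); in the sample, the adapter and DataSet are form-level members rather than locals.

using System.Data;
using System.Data.OleDb;

// Connect to the Access database through the JET OleDb provider.
OleDbConnection dbConn = new OleDbConnection(
    "Provider=Microsoft.Jet.OLEDB.4.0;Data Source=BackChat.mdb");

// The adapter reads and writes the BackChat table; the command builder
// generates the INSERT/UPDATE/DELETE statements Update() needs.
OleDbDataAdapter dbAdapter = new OleDbDataAdapter(
    "SELECT * FROM BackChat", dbConn);
OleDbCommandBuilder dbBuilder = new OleDbCommandBuilder(dbAdapter);

// Read everything into the in-memory DataSet named "ds".
DataSet ds = new DataSet();
dbAdapter.Fill(ds, "BackChat");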

Using the DataGrid Windows Control

Here is one real payoff from using the database. You can add a DataGrid to a Windows Form by dragging it from the Toolbox. If you want to display the database data in the DataGrid, just one line of code in the form's Load event does it:

myGrid.SetDataBinding(ds, "BackChat");

With a DataGrid named "myGrid" in a dialog box, this code will display all the data and let the user edit it. You can even save a little trouble by using DataGrid properties to connect directly to the database — but I prefer to open the DataSet explicitly because that way, I can easily use the data in the DataSet for other things within the program.

Of course, the raw DataGrid isn't very pretty, but there is plenty of flexibility for making it prettier by setting column styles, colors, and such. Also, the sample program handles size events for the dialog box and resizes the DataGrid. When the DataGrid is resized, it automatically adjusts its column widths to keep a reasonable display. The DataGrid has hundreds of properties, events, and methods. Becoming a "DataGrid Expert" is time-consuming, so Listing 3 from the sample program shows how to set up just a few useful styles.
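Listing 3 isn't reproduced here, but the general pattern is to add a DataGridTableStyle containing column styles to the grid. A brief sketch, assuming the column names used elsewhere in this article (the headers and widths are illustrative only):

// Map a table style to the "BackChat" table and define a couple of columns.
DataGridTableStyle style = new DataGridTableStyle();
style.MappingName = "BackChat";

DataGridTextBoxColumn colCommand = new DataGridTextBoxColumn();
colCommand.MappingName = "Command";        // database column name
colCommand.HeaderText = "Spoken command";
colCommand.Width = 200;
style.GridColumnStyles.Add(colCommand);

DataGridTextBoxColumn colResponse = new DataGridTextBoxColumn();
colResponse.MappingName = "Response";
colResponse.HeaderText = "Spoken response";
colResponse.Width = 250;
style.GridColumnStyles.Add(colResponse);

myGrid.TableStyles.Add(style);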

Be aware that there are two DataGrid objects, one from System.Windows.Forms and one from System.Web.UI.WebControls, and they are significantly different; the Windows.Forms one is the one we are using here.

Saving Results From the User's Input

When the user presses the Save button on the dialog box, we'll receive the event in our button1_Click event handler. All we need to do is ask the OleDbDataAdapter to Update() itself from the DataSet. Because we have attached the OleDbCommandBuilder to the OleDbDataAdapter, all of the underlying database modification is handled automatically. After saving the data, we need to send the changes to the SR engine, which will be discussed later.

private void button1_Click(object sender, System.EventArgs e)
{
    // on the Save button, update the file and rebuild the grammar
    dbAdapter.Update(ds.Tables["BackChat"]);
    RebuildGrammar();
}

That's it! Typing in about 20 executable lines of code (and a few more to make it look pretty, as shown in Figure 2) gives us a program configuration editing system. The end user can add, delete, or edit entries. Editing is done by simply typing into the table. It's a great technique if you need to have a prototype system running today.

Using SAPI for Command Recognition

Now that we have a table that holds the commands we want our program to respond to, we can delve into SR. SR is somewhat more complex to use than TTS because there are many more variables and options. You start by creating an instance of the SR engine:

private SpSharedRecoContext RecoContext;
RecoContext = new SpSharedRecoContext();

Now you have to tell the SR engine what to recognize — called a "grammar." You could enable a dictation mode grammar, which can generate an event whenever a phrase is recognized.

private ISpeechRecoGrammar Grammar;
Grammar = RecoContext.CreateGrammar(m_GrammarID);
Grammar.DictationLoad("", SpeechLoadOption.SLOStatic);    // our program doesn't do this
Grammar.DictationSetState(SpeechRuleState.SGDSActive);    // our program doesn't do this

This is the method you'd use to allow the user to dictate text (which your program could put in a textbox).

Using this method and parsing the results received for commands, however, won't yield useful results because SR engines aren't accurate enough. Instead, you create a grammar that defines the specific words and word transitions that the SR engine is allowed to recognize. If your grammar only contains the words "file" and "edit," the SR engine is much more likely to recognize these words accurately than if it is attempting to recognize the entire English language without knowing that the words "file" and "edit" are important.

Grammars can be quite complex, consisting of multiple "rules" that can be individually enabled or disabled depending on what type of input your program expects. Rules consist of one or more phrases and transitions between them. For this application, the grammar has only a single rule with each phrase in the rule representing a complete command. A more complex grammar might have individual commands for "file" and "edit," each of which would create a transition to their child rules. Our program uses a dynamic grammar — the user can change the phrases being recognized as the program runs. If your grammar is static, the SDK includes a grammar builder that allows you to specify your grammar using an XML syntax and then convert it to a grammar file, which can be read in. Instead, we'll create the grammar programmatically.

The code fragment in Listing 4 creates a grammar and either creates or retrieves an existing rule named "CommandRule." There is no method for deleting individual phrases from a rule, so the program calls the rule's Clear() method to remove them all, then adds an entry for each phrase in the "Command" column of the DataSet. (Remember how we conveniently left the DataSet around for ourselves earlier?) Then the grammar is committed. The final call, SetRuleState(), is what actually starts the SR engine.
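Listing 4 isn't reproduced here; the sketch below follows the sequence just described using the SAPI automation interfaces. The AddWordTransition argument list is an assumption based on the interop signatures, and CmdSetRuleState is the automation-interface spelling of the SetRuleState call mentioned above, so verify both against the SDK documentation and IntelliSense.

// A sketch of RebuildGrammar(): get the rule, clear it, add one phrase
// per row, commit, and activate. (The real program first checks whether
// the rule already exists before adding it.)
ISpeechGrammarRule rule = Grammar.Rules.Add("CommandRule",
    SpeechRuleAttributes.SRATopLevel | SpeechRuleAttributes.SRADynamic, 1);
rule.Clear();

object propValue = "";
foreach (System.Data.DataRow row in ds.Tables["BackChat"].Rows)
{
    string sPhrase = row["Command"].ToString();
    // Assumed parameter order -- check the SDK documentation.
    rule.InitialState.AddWordTransition(null, sPhrase, " ",
        SpeechGrammarWordType.SGLexical, "", 0, ref propValue, 1.0F);
}

Grammar.Rules.Commit();
Grammar.CmdSetRuleState("CommandRule", SpeechRuleState.SGDSActive);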

Adaptation is a feature of the Microsoft SR engine (and most others) that allows it to train itself while it is being used. The more it recognizes, the better it gets. This is a valuable feature in dictation applications but has an unfortunate side effect in some command applications (like this one). If you speak commands only occasionally, the SR engine is constantly adjusting itself to the background noise and extraneous speech it picks up. If the program is left running for a few days, the SR engine trains itself to random noise and degrades to a point where it no longer works properly. Because of this, the program turns off adaptation with:

RecoContext.Recognizer.SetPropertyNumber("AdaptationOn", 0);

Handling SR Events

Next, you'll attach an event delegate that will be called whenever a phrase is recognized; see Listing 5. Whenever one of the spoken commands in the grammar is detected by the SR engine, it generates an event in which you can interpret the result and generate an appropriate response.
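Listing 5 isn't reproduced here; a minimal hookup sketch follows, assuming the delegate and parameter names the interop wrapper generates for the _ISpeechRecoContextEvents interface (confirm them with IntelliSense).

// Attach the Recognition event handler in your startup code.
RecoContext.Recognition +=
    new _ISpeechRecoContextEvents_RecognitionEventHandler(RecoContext_Recognition);

// Called by SAPI whenever a phrase in the active grammar is recognized.
private void RecoContext_Recognition(int StreamNumber, object StreamPosition,
    SpeechRecognitionType RecognitionType, ISpeechRecoResult Result)
{
    // interpret Result and respond (see below)
}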

The result is a complex structure that will tell you everything you need to know (and much more). To get the result as a text string, call:

string sCommand = Result.PhraseInfo.GetText(0, -1, true);

Now, you have the text that was recognized and you can write an application to do whatever you want with it. In our case, we'll look in our database table and see if there is an associated verbal response to speak and/or an action to be performed.

SR Confidence

Here we get into some of the art of working with a leading-edge technology such as speech recognition. We'd like an SR engine that recognizes commands from a user, is always correct, and can understand various users' speaking voices, accents, and so on. Unfortunately, the state of SR technology has yet to achieve these levels, and we have to cope with some of its current limitations. The Microsoft SR engine does its best to interpret whatever it receives as microphone input based on the current recognition grammar. If you have a command "hello" in your recognition grammar and no other similar-sounding words, you'll probably get a recognition event for "hello" if you say "mellow," "fellow," "yellow," or any other similar-sounding word or phrase. If both "yellow" and "hello" are phrases in the grammar, the engine can differentiate between them pretty well.

The SR engine returns a confidence level for each word in the recognized phrase. The higher the confidence number, the better the match between what the SR engine heard and the engine's stored pronunciation. An issue with a small command set is that background noises will often be returned as events and the receiving program will need to check the confidence levels of the words and make a judgment as to whether or not a command was actually received. A crude way to do this is to take the average confidence for all the words and see if it is greater than a certain threshold, like this:

float fTotConf = 0;
for (int i = 0; i < Result.PhraseInfo.Elements.Count; i++)
    fTotConf += Result.PhraseInfo.Elements.Item(i).EngineConfidence;
float fAveConf = fTotConf / Result.PhraseInfo.Elements.Count;
if (fAveConf > fThreshold) ...

If the threshold is not met, the program could ignore the command or perhaps say "Please repeat." Your program could also be more sophisticated by requiring each word to meet a minimum threshold, or by requiring greater confidence in the first word of a command; a per-word check is sketched below. For the example program, the required threshold is zero, and this works fairly well.
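Here is one possible per-word check, with a hypothetical minimum-confidence value fMinWordConf:

// Reject the phrase if any individual word falls below a minimum confidence.
bool bAccept = true;
for (int i = 0; i < Result.PhraseInfo.Elements.Count; i++)
{
    if (Result.PhraseInfo.Elements.Item(i).EngineConfidence < fMinWordConf)
    {
        bAccept = false;
        break;
    }
}
if (!bAccept)
    voice.Speak("Please repeat", SpeechVoiceSpeakFlags.SVSFlagsAsync);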

Doing Something Useful

We've conquered the basics of having your computer receive speech input and speak. Now all we need to do is tie the two together. Within the Speech-Recognition event, we just need to see what response is associated with the phrase that was received. Because our phrase list is relatively small, we can do this by iterating through the entire dataset searching for a match.

foreach (System.Data.DataRow theRow in ds.Tables["BackChat"].Rows)
{
    if (sCommand == theRow["Command"].ToString())
    {
        // do the work here
    }
}

This is a simple sequential search that will be plenty fast enough until our phrase list gets huge. If we ever reach that point, we could create a second dataset and a SQL SELECT statement to find the match, letting the database engine do the work.
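As a sketch of that idea, reusing the connection opened earlier (dbConn in the earlier sketch) and glossing over the need to escape any quotes in the recognized phrase:

// Let the database engine do the lookup instead of scanning the DataSet.
OleDbDataAdapter findAdapter = new OleDbDataAdapter(
    "SELECT * FROM BackChat WHERE Command = '" + sCommand + "'", dbConn);
DataSet dsMatch = new DataSet();
findAdapter.Fill(dsMatch, "BackChat");
if (dsMatch.Tables["BackChat"].Rows.Count > 0)
{
    System.Data.DataRow theRow = dsMatch.Tables["BackChat"].Rows[0];
    // do the work here
}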

Now we can get the associated response string and action string with:

string sResponse = theRow["Response"].ToString();
string sAction = theRow["Action1"].ToString();

Now we can speak the response. ProcessResponse() in Listing 6 handles text replacement for strings containing the time of day, and so on. Notice that before actually calling the Speak() method, we disable the SR engine. Remember that the SR engine tries to interpret everything the microphone hears, including the output from the TTS engine, so it is best to disable speech recognition for the duration of TTS output. To disable the SR engine, set the top-level rule to Inactive. One simple way to reenable the SR engine is with a timer that waits either for the TTS engine to complete or for five seconds to elapse. The more sophisticated solution (used in the sample program) is to hook the TTS engine's EndStream event, which fires when the TTS engine actually finishes.
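Sketched in code, using the rule name and EndStream handler introduced earlier (the exact flow in Listing 6 may differ):

// Stop listening while we talk, so the SR engine doesn't hear the TTS output.
Grammar.CmdSetRuleState("CommandRule", SpeechRuleState.SGDSInactive);
voice.Speak(sResponse, SpeechVoiceSpeakFlags.SVSFlagsAsync);

// ...and in the voice's EndStream handler (sketched earlier), listen again:
//     Grammar.CmdSetRuleState("CommandRule", SpeechRuleState.SGDSActive);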

The TTS engine is started asynchronously because this allows the computer to speak at the same time as the action is being performed. The example program includes a PerformAction() function with a fairly limited scope of functionality, but this is where the functionality of any application can be attached.
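PerformAction() itself isn't shown here; one illustrative possibility (not necessarily what the sample program does) is to hand the action string to the shell, so it can be a URL, a document, or an executable:

// A hypothetical PerformAction(): let the shell open whatever the string names.
private void PerformAction(string sAction)
{
    if (sAction.Length == 0)
        return;
    try
    {
        System.Diagnostics.Process.Start(sAction);
    }
    catch (System.Exception)
    {
        voice.Speak("Sorry, I could not do that.",
            SpeechVoiceSpeakFlags.SVSFlagsAsync);
    }
}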

Simplicity and Functionality

That's it: We've got a program that can recognize arbitrary phrases from a table, generate arbitrary verbal responses, and perform actions. We can extend this to any desired degree of complexity to create a fully voice-enabled program. We've got a remarkably simple system for letting users edit the command phrases and responses. And we've accomplished it all in just a few hundred lines of code. The key is that we have leveraged tremendous functionality buried in the OleDb system, the DataGrid, and SAPI. w::d


Charles Simon is a software engineer and CEO of Synthigence Corp. in Washington State. You can contact him at [email protected].

