


The Learning C/C++urve


October 1996/The Learning C/C++urve

Bobby investigates the origins of Hungarian Notation, and explains why it is not so cool as some people (and companies) think it is.
Copyright © 1996 Robert H. Schmidt


(Disclaimer: The views expressed in this column are solely the author's, and do not necessarily reflect those of the staff and management of CUJ ... although if they're astute, they'll agree with me anyway. The author further makes no claims to fairness, objectivity, good sportsmanship, or fashion sense.)

Throw the deadbolt, put the kids to bed early, and pour yourself a double-tall skinny. It's time once again for The Why Files, where readers try to stump i Virtuosi di Redmond (my combination house band and research staff) with tantalizing C and C++ coding conundrums.

Redmond, We Have A Problem

This month's cry in the wilderness comes from a poor soul overseas, who writes

I've been retained by a small European town government to study and recommend treatment for a localized predator/prey ecosystem imbalance. Hoping to leverage previous research, I searched the Web, finding several analysis and expert-system programs on this topic. Unfortunately, when I downloaded the corresponding source code, it was full of declarations like:


LPRODENTINFOSTRUCT rodGetInfoStructA
        (DWORD cbSize,
         TCHAR *lpszRodentName,
         LPVOID *lprisNext,
         int ndxRIS,
         BOOL fEnable)

I immediately assumed line noise or modem problems caused these spurious characters. But when I went to a direct connection, the problem persisted. I took a giant hair dryer to my computer, thinking there may be moisture buildup causing electrical trouble, but it didn't help. My comm software checks out okay, so I'm stumped. Can you advise me here? For now, my work is halted, and I'm left sitting around playing old Jethro Tull tunes on my flute.

Regards,
P. Piper
Hamlin Germany

Poor P. Piper. What he attributes to line noise or proverbial ghosts in the machine is really no more than acute congenital Hungarian Notation. Intrigued by Piper's problem, I tasked i Virtuosi di Redmond to research the origins of this notation and present their findings.

Origin Of The lpszSpecies

As with so many phenomena, that Hungarian Notation is so pervasive does not imply that its origins are at all obvious. After a long caucus punctuated by several loud debates, one police visit, and a round of Rock-Paper-Scissors, i Virtuosi offered the following possible sources for the term "Hungarian Notation:"

  • The spellings are an English transliteration of actual Hungarian words, the original meanings of which, when published, would violate the Communications Decency Act.
  • The notation was first introduced by Eva Gabor's television character on Green Acres, and first applied by Arnold Ziffel.
  • The prefix letters, when mapped to musical notes, conform closely to Béla Bartók's third string quartet.
  • The term is actually Hungry or Greedy Notation, so-called for the resulting increased length in code listings. We speculate the notation may have been spread by magazine columnists who get paid by the line.
  • Perhaps least likely, the notation honors Charles Simonyi, a programmer of Hungarian descent at Microsoft. i Virtuosi dismiss this as urban myth, reckoning that even Brand M would not consciously confuse and polarize the programming community this way.

Grammar Rock

Regardless of its origin, Hungarian's eventual intent seems to be bundling a C object's type with its name. I say "seems" because, unfortunately, there are many subtle dialects of Hungarian Notation. While I cannot give you a specific grammar and English-Hungarian translation, our reverse engineering has shown several general (if sometimes conflicting) trends.

First and foremost, Hungarian turns an object's type name into a diminutive stunted form which, for purposes of discussion, I'll dub a "wart." Warts are usually one to four letters long, but not always. Warts are usually made of consonants, but not always. Warts are usually lower case, but not always. Warts are always prepended to their object name; that original object name in turn always starts with an upper-case letter.

While warts are derived from the original type name, the mapping is not consistent. For example, the type name long translates to the wart l, and the type name int to i. This could imply that a wart uses the first letter of its type name. But how then do we explain the wart by, derived from byte, or dw from DWORD?

Perhaps the algorithm is "use the first N characters, fewer for commonly used types, more for other types to resolve ambiguities." And indeed, there is potential ambiguity, for while by is byte, a lone b is BOOL; clearly they both couldn't use b. char yields the wart c, WORD yields w, HANDLEs (of all flavors) yield h. Perhaps we've found the Hungarian Rosetta Stone.

But then we run into p. If the object name iResult means "Result is an object of type int", what does piResult mean? Is there some type that starts with the letter pi? Or is Result expressed in polar coordinates?

Actually, a leading p means the object is a pointer. piResult means Result is a pointer (p) to an int (i). plResult would mean Result points to a long. You will also see variations like lpResult. Not the same as plResult, lpResult means Result is a so-called far or long pointer. Such pointers are a vestige of 16-bit MS-DOS and Windows. Interestingly, while Microsoft would have us believe such pointers are yesterday's news, they continue to promulgate lps throughout their 32-bit source code and documentation. One would think that by now, lps would have gone the way of, well, LPs.

The Incredible Shrinking Name

lp really comes into its own in names like lpszName. The object is Name, and it's a long pointer to ... what? An sz? What is that? Here the pattern breaks down once again, for sz is not derived from a type name. sz means "zero-terminated array of char", the usual encoding for character strings in C. lpszName therefore means "Name is a pointer to a zero-terminated string."

Okay, fair enough, but there are a few problems here:

  • What C string doesn't end in a zero byte? Would we lose any meaning by using lpsName?
  • If we stick to a "flat" memory model, there are no long pointers, so we can reduce to psName.
  • In almost all contexts, an array decays to a pointer to its first element. If we know that Name references an array of char, we almost always manipulate it as a char *. Thus, the presence of the p often doesn't add any useful meaning. In those contexts, we could reduce the name to sName.
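The array-to-pointer decay behind that last point can be sketched as follows. This is only an illustration: the names count_chars and demo_decay are invented for the example and appear nowhere in the code under discussion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: it receives only a pointer, so the array's
// size is no longer part of the parameter's type.
std::size_t count_chars(const char *text)
    {
    return std::strlen(text);
    }

// Hypothetical demonstration function.
std::size_t demo_decay()
    {
    char name[] = "Piper";        // array of 6 char, counting the zero byte
    assert(sizeof name == 6);

    const char *first = name;     // the array decays to &name[0]
    assert(first == &name[0]);

    return count_chars(name);     // the array decays again at the call
    }
```

Here demo_decay() returns 5. At no point after the definition does the code need to know that name is an array rather than a char *, which is why the p in the wart so often carries no extra meaning.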

Finally, there's C++. While Microsoft (as I write) have just released a compiler supporting the predefined string class, they have long used their proprietary CString class as a moral substitute. To differentiate a Name that is a zero-terminated char array from a Name that is a CString, they created the new Hungarian wart lpcsz, which I think means "pointer to zero-terminated CString." That a CString object is not conceptually zero-terminated, and is more naturally passed by reference instead of pointer, apparently doesn't matter.

Violating Abstraction

The wart lpsz introduces my first objection to Hungarian notation: it betrays an object's implementation. While this is arguably acceptable in C, I find it fairly indefensible in C++.

Consider character strings. In C, about the only simple way to implement the abstracted notion of a "character string" is with an array of char. The language supports such arrays directly (with quoted strings like "doodle"), as does the standard library (strlen and its ilk). Using the C [] operator on a char array yields a single character, the expected and intuitive result.

Fact is, every experienced C programmer knows the only credible implementation for a string in C is by character array, so that betraying this implementation (in a name like lpszName) gives away nothing. Were a programmer to implement a string some other way (say, with a length byte), the resulting objects would be incompatible with the rest of the language. The programmer would have to provide an API to turn these new strings into old array-of-char strings. Since there's no robust way to hide the fact that this new string has a different implementation, attaching a wart giving away that implementation (like lples for "long pointer to length-encoded string") betrays few secrets.

In sum, C types typically expose their (conceptual) interface and implementation simultaneously. A C string is both an abstracted character string and its underlying char array implementation. The two aspects are inextricably coupled.

The story changes radically in C++, which is designed expressly to hide the implementation. Once you craft a type like string or CString, you are telling the world "think of this as an abstracted string of text, not as a particular underlying implementation." If you have written your string class well, users of that class will be forever ignorant of its implementation. This is A Good Thing, one of the principal tenets of C++. To turn around and purposely expose the object's implementation violates that tenet. Thus, names like lpcszName, which tell the world this abstracted type is actually a zero-terminated char array under the hood, I consider poor C++ practice.

The above discussion betrays a major difference between the two languages: C relies exclusively on built-in types with their built-in type rules, while C++ allows user-defined types containing pretty much whatever type rules those users want. In C, you have little choice but to implement a string with a char array. That array decays into a char *, cannot have its length extended at run time, must end in a zero-byte, and so on -- the language specifies the type's set of properties. That you may or may not like this particular set matters not; it is predefined and inviolate.

Because C programmers can't control these sets of properties -- that is, they can't tell the compiler to flag type usage they don't want -- Hungarian actually has some benefit. Let's say you write the non-Hungarian code

char *sturm;

int drung;

sturm += drung;

Do you really mean to add an integer to a character pointer? If not, too bad, since C allows this. Casually observing the statement

sturm += drung;

won't betray that you are mixing types here. Now add the proper Hungarian warts:

char *lpszSturm;

int iDrung;

lpszSturm += iDrung;

The compiler is just as happy to accept this, caring not a whit about the Hungarian warts. Instead you must rely on what I call ocular regression (i.e., eyeballing). This time, your casual observation of

lpszSturm += iDrung;

lets you know you are mixing types, alerting you to a possible bug.

Although I am quasi-defending Hungarian for C, there are still caveats. If you brand an int object iClaudius and later change its data type to unsigned, you must recast all your object names to uClaudius. To do otherwise would be worse than having no wart at all, for the name would clearly mislead the reader (vs. merely leaving the reader no clues at all).

In C++, such pro-Hungarian considerations often become moot, for two principal reasons:

  • You can use built-in types sparingly, and
  • You can use objects closer to their point of definition.

I discuss each of these reasons in turn below.

Built-In Types

In C++, objects of your string class type behave exactly as you decide they will. Don't want your string to turn into a char *? Then don't expose an operator char *() member. Want to encode the string as something other than zero-terminated? Make the internal data member(s) private, and your users won't know. Want to make the string extensible? Implement operator[] appropriately.

I apply this same reasoning to all built-in types. I could argue that most everywhere a C program uses a built-in type with its fixed set of rules, a C++ program could and often should use a user-defined type that controls those rules. Note how many of the Hungarian prefixes betray the underlying implementation: i, dw, c, l, lpsz, u, ul, w. Were you to replace use of built-in types with judicious use of class types, many of these warts would go away, since the objects would no longer be of the corresponding built-in type.
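As a minimal sketch of this idea, consider a string type that stores an explicit length instead of a terminating zero, and offers no operator char *(). The class name Text and the function demo_length are hypothetical, invented for this illustration; this is neither the Standard string class nor Microsoft's CString. Copying is disabled simply to keep the sketch short.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical class for illustration only.
class Text
    {
public:
    Text(const char *source)
        : length_(std::strlen(source)), data_(new char[length_])
        {
        // Length-encoded: the terminating zero is deliberately not stored.
        std::memcpy(data_, source, length_);
        }
    ~Text() { delete [] data_; }

    std::size_t length() const { return length_; }
    char operator[](std::size_t index) const { return data_[index]; }

    // Deliberately no operator char *(): a Text never turns into a pointer.

private:
    Text(const Text &);             // declared but not defined:
    Text &operator=(const Text &);  // copying is disabled in this sketch

    std::size_t length_;
    char *data_;
    };

// Hypothetical demonstration function.
std::size_t demo_length()
    {
    Text name("Piper");
    // name += 3;       // would not compile: Text defines no operator+=
    // char *p = name;  // would not compile: no conversion to char *
    assert(name[0] == 'P');
    return name.length();
    }
```

Since users of Text cannot see or reach the representation, no wart like lpsz or lples could tell them anything true and useful about it. The compiler, not the reader's eye, rejects the commented-out lines.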

Some Hungarian warts refer to the type's logical meaning and use, rather than to its physical implementation. These warts are, like those we've already seen, contractions of the actual type name. Such warts include b (for BOOL), f (for "flag," also a BOOL), and h (for HANDLE). In these examples, the types in question are either macros or typedefs aliasing built-in types; as such, they can often be accidentally mixed with other conceptually unrelated types [1].

An extension of this is Hungarian for actual structures, like msg (for the Windows message structure MSG) and pt (for the POINT structure). Since structs, even in C, are type-safe and can't be mixed, the normal motive for warts goes away. The compiler enforces the rules already, obviating the need for an eye check.
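The difference between typedef aliases and genuine structures can be sketched as follows. All the names here (WindowAlias, BrushAlias, WindowHandle, BrushHandle, and the functions) are hypothetical and are not the actual Windows definitions.

```cpp
#include <cassert>

// Two aliases of the same built-in type: the compiler sees one type.
typedef unsigned WindowAlias;
typedef unsigned BrushAlias;

// Two distinct struct types wrapping the same built-in type.
struct WindowHandle { unsigned value; };
struct BrushHandle  { unsigned value; };

unsigned describe_alias(WindowAlias window)
    {
    return window;
    }

unsigned describe_window(WindowHandle window)
    {
    return window.value;
    }

// Hypothetical demonstration function.
unsigned demo_handles()
    {
    BrushAlias brush_alias = 7;
    // Compiles silently: BrushAlias and WindowAlias are both unsigned.
    unsigned mixed = describe_alias(brush_alias);

    WindowHandle window = { 42 };
    // BrushHandle brush = { 7 };
    // describe_window(brush);   // would not compile: distinct struct types
    assert(describe_window(window) == 42);

    return mixed + describe_window(window);
    }
```

With the aliases, only a wart and a sharp eye could catch the brush passed as a window. With the structs, the compiler catches it, and the wart has nothing left to do.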

Where this structure type safety fails is with pointers. Because C lacks references, structs are passed by pointer; and because C lacks overloading and templates, those passed-around pointers are very often turned into void * or, in Windows, unsigned long. Once you turn an MSG * into another type, unless you encode the original type with a Hungarian wart, you can't easily tell what the conceptual type is supposed to be.

Other warts have nothing to do with the type name, and are instead derived from the object's role; examples here include cb ("count of bytes," or size), n (counter), and ndx (index). These warts tell you nothing directly of the object's actual type (e.g., is a counter signed or unsigned?), so neither you nor the compiler is able to directly enforce type checking.

You can't get away from built-in types completely, because too many useful C++ language aspects (like literal constants) rely on them. The best you can do is encapsulate those types within a class implementation, with appropriate conversion constructors and operators where needed. With judicious use of inlining, you can make classes that have the efficiency and over-all behavior of built-in types while still allowing your own tweaking of those types' rules.
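One possible shape for such an encapsulation is sketched below, assuming a compiler that supports the explicit keyword. The classes ByteCount and Index are hypothetical stand-ins for objects that would otherwise carry the warts cb and ndx; nothing here comes from an existing library.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical wrapper: a count of bytes.
class ByteCount
    {
public:
    explicit ByteCount(std::size_t value) : value_(value) {}
    std::size_t value() const { return value_; }
    ByteCount &operator+=(ByteCount other)
        {
        value_ += other.value_;
        return *this;
        }
private:
    std::size_t value_;
    };

// Hypothetical wrapper: an index, implemented identically but
// deliberately unrelated to ByteCount.
class Index
    {
public:
    explicit Index(std::size_t value) : value_(value) {}
    std::size_t value() const { return value_; }
private:
    std::size_t value_;
    };

// Small inline operator, so the wrapper costs little over a raw size_t.
inline ByteCount operator+(ByteCount left, ByteCount right)
    {
    left += right;
    return left;
    }

// Hypothetical demonstration function.
std::size_t demo_total()
    {
    ByteCount header(16);
    ByteCount body(240);
    Index position(3);
    assert(position.value() == 3);

    // header += position;  // would not compile: Index is not a ByteCount
    // header += 3;         // would not compile: the constructor is explicit

    ByteCount total = header + body;
    return total.value();
    }
```

Here demo_total() returns 256. The design choice is that each class admits only the operations that make sense for its role, so the rules a wart could merely hint at are instead enforced at compile time.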

Location of Object Definitions

Hungarian also reflects another C trait that C++ does not share: all block-scope objects must be declared at the start of the block. Assume you are trying to read and understand a function like


void f(void)
    {
    int Winken = 1;
    char Blinken;
    unsigned Nod;
    /* ... 50 lines of code here */
    Blinken = 0; /* wait, what is Blinken's type? */
    /* ... 50 more lines */
    Nod = Winken; /* I forget, what is Winken? */
    /* .. and so on */
    }


The objects Winken, Blinken, and Nod are not used until way into the function, and even then, they are not all used together. Because there is such a distance between where the objects are defined and where they are used, you may have trouble remembering their types when you finally read the code that uses them. In a large function with many local objects and parameters, keeping all these types straight can be difficult.

C++ avoids this by letting you define an object at the very point you need it, and not before. A possible C++ rewrite of the above is


void f(void)
    {
    /* ... 50 lines of code here */
    char Blinken = 0;
    /* ... 50 more lines */
    int Winken = 1;
    unsigned Nod = Winken;
    /* .. and so on */
    }

Note that, as an alternative, you can declare blocks within blocks, as in

void f(void)
    {
    /* ... 50 lines of code here */
        {
        char Blinken = 0;
        /* ... 50 more lines */
            {
            int Winken = 1;
            unsigned Nod = Winken;
            /* .. and so on */
            }
        }
    }

Each { starts a new block, meaning object declarations can appear immediately after that {. This example works in both C and C++.

Now that the objects are defined at their point of use, you have less trouble reckoning their data type. This notion extends to data members in classes; instead of having a collection of "global" variable definitions scattered hither and yon, you bundle all these objects inside a class definition, making their data types easy to spot. I find that corralling object declarations this way, combined with a decreased use of built-in types, almost completely eliminates the need for Hungarian warts that betray an object's implementation.

Types That Aren't Built-In

While this is all well and good for built-in types, what about types that aren't built-in? Say you adopt my model and turn lpszName from a char array into a string object. Couldn't you then call the object sName, with s the Hungarian wart for string?

Yes, you can; I question whether you want to. With built-in types, you carted around the wart to help enforce (by eye) what the compiler couldn't. With user-defined types, you know and control the type behavior. You also control your design and use of those types.

My belief is that, if you still lose track of how your user-defined objects are supposed to interact, adding Hungarian warts just addresses symptoms. The real cause is a muddled design, often associated with mixed levels of abstraction and too-tight coupling between types. There should be no mystery or guessing; if you are confused by your own design, imagine how others will fare.

Also remember that a single conceptual "thing" actually may be implemented as multiple objects, each at a different program context. For example, you may have a high-level function that declares and uses a string like


string buffer =
    "some initializing value";

size_t buffer_length =
    buffer.length();

Here you manipulate buffer as an abstracted textual string entity, and have little risk of confusing its meaning. Within the string::length member function, you may have code like


size_t string::length() const
    {
    return strlen(buffer);
    }

where buffer now is a string data member of type char *. Again, within this domain there is little confusion of buffer's type, so there is correspondingly little need for a Hungarian wart like lpsz.

The type of buffer in each domain is implied by context, corresponding to the abstraction level of entities around it. If you find you are referencing two distinct buffer abstractions in a given context, rather than paste over the differences with Hungarian, you may want to reconsider your design.

Extending Hungarian for Masochists

A small set of Hungarian warts is fairly universal across most programs: i for int, ul for unsigned long, and so on. Other warts are not so universal; for example, does f mean flag or float? Then there are warts specific to a particular library or project. In Piper's example we saw a Hungarian parameter


LPVOID *lprisNext

The ris is Hungarian for RODENTINFOSTRUCT. While the actual data type is "long pointer to pointer to void", the conceptual data type is "long pointer to pointer to RODENTINFOSTRUCT".

This scheme shows how you extend Hungarian. To illustrate, if you have an API for car purchasing, you could well end up with code like


void carPurchaseCar(LPPOWEREVERYTHINGINCLUDINGSUNROOF
        *lppeisrPreferredEquipmentPackage)
    {
    FORDPROBEGT fpgtNewCar;
    // ...
    }


The possibilities frankly make the mind reel [2]. In addition to all the problems cited above, these project-specific warts suffer one additional trouble: nobody reading them for the first time has clue one what lppeisr can possibly mean. I am reminded of my first attempts to read a Russian novel; with names so long, can a casual reader be blamed for confusing lppeisrWhatever with lppesirWhatever?

Hungarian is much like C++ name mangling. Both encode an object's type with a set of unique extra characters. However, unlike Hungarian, name mangling never claims to be human-readable. When I read warty code, I feel as if I'm wading through waist-deep mud [3]; my eye slows, and I often have to visually parse a statement multiple times to grok its full intent. Independent of its dubious value in C++, I often find Hungarian code aesthetically unpleasant and borderline unreadable.

Beyond Hungarian

With the advent of the Microsoft Foundation Classes (MFC), we now have a new wrinkle to Hungarian: warts that don't encode type information at all. MFC classes begin with C, as in CString or CTime or CSpotRun. This leading C says absolutely nothing about the type; it just tells you that the name in question is a class. Compare this to the Standard C++ library and built-in type names, which are all lower-case. By beginning with C, Hungarian class names break any illusion they could look built-in.

An adjunct to C for classes is a leading m_ for those classes' data members. As with C, this wart contains no type information; instead, it contains scope information, implying the object is in class scope. Perhaps surprisingly, I cannot fault Microsoft here, for I use a similar encoding scheme (trailing _) to distinguish member from non-member data [4]. Even though I support Microsoft's idea, I do find that the prepended m_ makes for slightly awkward reading, especially since it is tacked onto a name that is already Hungarian (e.g., m_lpszName).
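The two member-naming schemes might look like this side by side. Both classes are hypothetical, invented for comparison; neither is taken from MFC or from any other library.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical class in the MFC-like style.
class CRodent
    {
public:
    explicit CRodent(const char *lpszName) : m_lpszName(lpszName) {}
    const char *GetName() const { return m_lpszName; }
private:
    const char *m_lpszName;   // m_ marks class scope, lpsz marks the type
    };

// Hypothetical class using a trailing underscore instead.
class rodent
    {
public:
    explicit rodent(const char *name) : name_(name) {}
    const char *name() const { return name_; }
private:
    const char *name_;        // trailing _ marks class scope, no type wart
    };

// Hypothetical demonstration function.
bool demo_same_name()
    {
    CRodent first("Hamlin");
    rodent second("Hamlin");
    assert(first.GetName() != 0);
    return std::strcmp(first.GetName(), second.name()) == 0;
    }
```

Both schemes let a constructor parameter and a data member share the same root name without colliding. They differ only in how much the reader must wade through to find that root.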

Postscript

i Virtuosi have broken into a rousing glockenspiel arrangement of In-A-Gadda-Da-Vida, which tells me it's time to wrap up.

After receiving his letter, I talked to Mr. Piper on the phone. He told me that, while he appreciated my help, he'd found another way around his problem. It seems his flute playing precipitated a massive decrease in the prey population, ending the crisis. He also said he was having some difficulty getting the town to pay him, but quickly added he was working on a plan to solve that problem too.

This month's Erattica is that I have no Erattica, no embarrassing gaffes to explain away. That fact itself, ironically, may indicate some sort of problem. Surely I can't be hitting that many out of the park ... can I?

I close by noting this column ends my first year as "contributing editor" and monthly columnist for CUJ. I'd like to thank the usual suspects: my parents, my 9th grade English teacher, the members of the Academy, and most of all, you, the diligent reader. As I wrote some months ago, feel free to drop me a line, letting me know what you'd like to see covered as I, or we, commence year number two.

Notes

[1] Starting with Windows version 3.1, Microsoft let you define their various handles as structs, so that you couldn't

