October 01, 2004
Flexible C++ #8: Union Casts Considered Harmful, but Necessary
Matthew Wilson
Unions in C and C++ are aggregate quantities like structs, except that each element of the union has offset 0, and the total size of the union is only as large as is required to hold its largest member [1]. Only one member of a union may be "active" at a time. Unions are most often used to provide variant functionality; i.e., allowing a variable to contain data of different types, as in the following structure from a GUI list control:
struct CellItem
{
UInt mask; /* valid field mask. (Combination of flags). */
. . .
Boolean bReadOnly; /* Indicates item's editable state. */
Int type; /* Type of data. */
union
{
long l;
float f;
} u;
u.l = 999;
assert(u.f > 998.0 && u.f < 1000.0); // Not a chance, bluey!
The actual value of u.f in this case is going to be some wildly different number; in my testing environment it is 1.39990 x 10^-42. Not exactly a victory for the union cast!
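The figure above can be reproduced without a union. The following is a minimal sketch, not code from the article: it assumes a 32-bit IEEE 754 float and uses std::memcpy to copy the bits of the integer 999 into a float. The helper name bits_as_float is hypothetical. Under those assumptions the bit pattern denotes a tiny denormal value, 999 x 2^-149, which is roughly 1.4 x 10^-42 and nowhere near 999.0.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not from the article): reinterprets the 32 bits of an
// integer as a float by copying the bytes, which is what the union "cast"
// above effectively does on a platform where long and float are both 32 bits.
inline float bits_as_float(std::int32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::int32_t),
                  "sketch assumes a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Usage example: the value read back is not a conversion of 999 to 999.0f.
inline void demo_bits_as_float()
{
    float const f = bits_as_float(999);

    assert(!(f > 998.0f && f < 1000.0f)); // the article's assertion would fail
    assert(f > 0.0f && f < 1.0e-40f);     // assuming IEEE 754: a tiny denormal
}
```

Copying bytes sidesteps the question of which union member is active, but it does nothing about the representation mismatch itself, which is the point the article is making.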
So, if you didn't know before, hopefully you now realize that using a union to perform a cast is a pretty bad idea. It almost completely circumvents the compiler's ability to do any type checking, offers no protection from misalignment, truncation, or representation mismatches, and will likely get you only "dangerous and nonportable" [3] code.
(Actually, in chapter 19 of Imperfect C++ [4], I show how union casts, in the form of the STLSoft [5] union_cast template class, can be made into a very robust and useful technique by using constraints and a dash of template metaprogramming to restrict the cast types. Notwithstanding those techniques, union casts should be considered harmful and avoided wherever possible.)
Unfortunately, sometimes the carelessness of some library writers leaves other library writers (that's us!) with little choice but to get them out of the bottom of the toolbox and carefully dust them off. Take, for example, the Microsoft WinInet API.
Win32 ANSI / Unicode Compilation
The WinInet function FtpFindFirstFile() is actually a macro that is defined as either FtpFindFirstFileA() or FtpFindFirstFileW(), depending on whether ANSI or Unicode compilation is selected (by the absence or presence of the UNICODE preprocessor symbol). Similarly, the WIN32_FIND_DATA structure is actually a macro defined either as WIN32_FIND_DATAA or WIN32_FIND_DATAW. Hence, although the third parameter of FtpFindFirstFile() is notionally a pointer to a WIN32_FIND_DATA structure, in actuality the two function variants take pointers to the corresponding structure variants, as in:
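HINTERNET FtpFindFirstFileA( HINTERNET hConnect
                           , char const *lpszSearchFile
                           , WIN32_FIND_DATAA *lpFindFileData
                           , DWORD dwFlags, DWORD_PTR dwContext);
HINTERNET FtpFindFirstFileW( HINTERNET hConnect
                           , wchar_t const *lpszSearchFile
                           , WIN32_FIND_DATAW *lpFindFileData
                           , DWORD dwFlags, DWORD_PTR dwContext);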
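The macro mechanism described above can be sketched portably, without the Windows headers. Everything in the following block is a hypothetical stand-in: DemoFindDataA/DemoFindDataW play the role of the two structure variants, DemoFindFirstA/DemoFindFirstW play the role of the two function variants, and DEMO_UNICODE plays the role of the UNICODE symbol. The sketch only illustrates how one generic name resolves to one of two variants at preprocessing time.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for the ANSI and Unicode structure variants.
struct DemoFindDataA { char    name[16]; };
struct DemoFindDataW { wchar_t name[16]; };

// Hypothetical stand-ins for the two function variants: each takes a pointer
// to its own structure variant and records the search spec in it.
inline bool DemoFindFirstA(char const* spec, DemoFindDataA* fd)
{
    std::strncpy(fd->name, spec, 15);
    fd->name[15] = '\0';
    return true;
}
inline bool DemoFindFirstW(wchar_t const* spec, DemoFindDataW* fd)
{
    std::wcsncpy(fd->name, spec, 15);
    fd->name[15] = L'\0';
    return true;
}

// The generic names are macros that select a variant, mirroring the way the
// Win32 headers key off UNICODE (DEMO_UNICODE is an assumption of this sketch).
#ifdef DEMO_UNICODE
# define DemoFindFirst  DemoFindFirstW
# define DemoFindData   DemoFindDataW
# define DEMO_TEXT(s)   L##s
#else
# define DemoFindFirst  DemoFindFirstA
# define DemoFindData   DemoFindDataA
# define DEMO_TEXT(s)   s
#endif

// Code written purely in terms of the generic names compiles either way,
// because the function macro and the structure macro switch together.
inline bool demo_generic_call()
{
    DemoFindData fd;
    return DemoFindFirst(DEMO_TEXT("*.txt"), &fd);
}
```

Because both macros switch on the same symbol, the generic function name and the generic structure name always agree with each other.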
Add a Dash of Traits
In writing a traits class as part of the InetSTL libraries [6], I came across a problem that necessitated using the any_caster class described in this article. The problem it addresses is that in the WinInet header files that come with versions 5 and 6 of the Visual C++ compiler, both variants of the FtpFindFirstFile() function are declared as taking a pointer to a WIN32_FIND_DATA structure, and not to the requisite variant as shown above. Because WIN32_FIND_DATA is defined as WIN32_FIND_DATAA for ANSI compilation, and as WIN32_FIND_DATAW for Unicode compilation, there is no conflict with the analogous definition of the FtpFindFirstFile() macro. This means that if you code in terms of the two macros, rather than any of the specific ANSI/Unicode variants, they are in sync and you won't have a problem. Consider the following code:
WIN32_FIND_DATA fd;
FindFirstFile(..., &fd, ...);
If you compile this without UNICODE defined, it is actually translated to:
WIN32_FIND_DATAA fd;
FindFirstFileA(..., &fd, ...);
Whether this is compiled with the correct form, FindFirstFileA(..., WIN32_FIND_DATAA*, ...), or the incorrect form, FindFirstFileA(..., WIN32_FIND_DATA*, ...), it still works since WIN32_FIND_DATA is translated to WIN32_FIND_DATAA. Conversely, if you compile this with UNICODE defined, it becomes:
WIN32_FIND_DATAW fd;
FindFirstFileW(..., &fd, ...);
Again, this works with both forms because WIN32_FIND_DATA is translated to WIN32_FIND_DATAW in the presence of UNICODE.
However, if you want to write code that references the functions/structures explicitly, you're in a bit of a pickle, whether you attempt to use FtpFindFirstFileA() explicitly from a Unicode compilation, or FtpFindFirstFileW() from an ANSI compilation. In either case, the function will be prototyped to point to the wrong structure variant.
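To make the mismatch concrete, here is a small sketch that models an ANSI build with portable stand-ins. None of these names come from WinInet: DemoFindData is a hypothetical generic name fixed to the A variant, FaultyFindFirstW is declared in terms of that generic name (the faulty style), and CorrectFindFirstW is declared in terms of the W variant. The second_param helper is likewise an assumption of the sketch; it just extracts the second parameter type so the difference can be checked at compile time.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical stand-ins for the ANSI and Unicode structure variants.
struct DemoFindDataA { char    name[16]; };
struct DemoFindDataW { wchar_t name[16]; };

// Model an ANSI build: the generic name resolves to the A variant.
typedef DemoFindDataA DemoFindData;

// Faulty style: the W function is declared with the generic structure name.
inline bool FaultyFindFirstW(wchar_t const*, DemoFindData* fd)   { return fd != 0; }
// Correct style: the W function is declared with the W structure.
inline bool CorrectFindFirstW(wchar_t const*, DemoFindDataW* fd) { return fd != 0; }

// Helper that extracts the second parameter type of a two-argument function.
template <typename F> struct second_param;
template <typename R, typename A1, typename A2>
struct second_param<R (*)(A1, A2)> { typedef A2 type; };

// In this modelled ANSI build the faulty W declaration really wants an A
// pointer, so passing a DemoFindDataW* to it would not compile.
static_assert(std::is_same<second_param<decltype(&FaultyFindFirstW)>::type,
                           DemoFindDataA*>::value,
              "faulty W declaration expects the A structure");
static_assert(std::is_same<second_param<decltype(&CorrectFindFirstW)>::type,
                           DemoFindDataW*>::value,
              "correct W declaration expects the W structure");

inline void demo_declarations()
{
    DemoFindDataW fdw;
    assert(CorrectFindFirstW(L"*", &fdw));
    // FaultyFindFirstW(L"*", &fdw);  // error: DemoFindDataW* is not DemoFindDataA*
}
```

The same reasoning applies in the other direction for a Unicode build, where the generic name would resolve to the W variant and the A function would be the one with the wrong parameter type.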
Consider the following correct code:
WIN32_FIND_DATAA fda;
WIN32_FIND_DATAW fdw;
FindFirstFileA(..., &fda, ...); // 1
FindFirstFileW(..., &fdw, ...); // 2
This does not work with the incorrect form of the WinInet headers. If UNICODE is not defined, then line 2 won't compile. If UNICODE is defined, then line 1 won't compile. This is not good. Because traits that specialize on character type explicitly use the ANSI or Unicode variants of a given function (traits are entirely independent of the presence/absence of UNICODE), either the wchar_t or char specialization will cause an error when compiled with the Visual C++ 5/6 headers. One solution is to attempt to discriminate which compiler you're using, and write the code with casts, as in:
template <>
struct inetstl::filesystem_traits<wchar_t>
{
  . . .
  static HINTERNET find_first_file( HINTERNET hconn
                                  , wchar_t const *spec
                                  , find_data_type *findData
                                  , uint32_t flags = 0
                                  , uint32_t ctxt = 0)
  {
#if defined(_MSC_VER) && \
    _MSC_VER <= 1200
    return ::FtpFindFirstFileW( hconn, spec
                              , reinterpret_cast<LPWIN32_FIND_DATA>(findData)
                              , flags, ctxt);
#else /* ? compiler */
    return ::FtpFindFirstFileW(hconn, spec, findData, flags, ctxt);
#endif /* ? compiler */
  }
  . . .
Naturally, this is very ugly, and a maintenance headache: We'd have to vet each and every compiler's WinInet.h. But we have to put up with headaches every day, and ugliness is part and parcel of any portable coding effort. The overriding objection to this approach is that it is no solution at all.
If you specify the include directory of any recent version of the Microsoft Platform SDK ahead of those that come with the compiler, Visual C++ 5/6 will see the correct declarations and build them without issue. If we then present it with the code shown above, it will fail to compile because the LPWIN32_FIND_DATAA type is no longer incorrectly expected by FtpFindFirstFileW().
This is where our naughty cast comes in. Let's look at it in action before we see how it's implemented. Rewriting the function in an error-free form we get:
template <> struct inetstl::filesystem_traits<wchar_t>
. . .
The any_caster template (shown in Listing 1) takes a source type, followed by two or more conversion types (it has to be at least two, otherwise there'd be no point), and provides implicit conversion from the former to any of the latter. Naturally, there should be no ambiguity between the types to convert to, but that's the user's responsibility.
Make no mistake: At no time do we try to pass an ANSI structure to a Unicode function, or vice versa. It's just that the compiler thinks that we should, and we must do the right thing while making the compiler believe we're doing what it thinks is the right thing (which is wrong).
The converter has a remarkably simple implementation. It is just a union of nine types. Its constructor takes a single parameter of the source type, and there are eight implicit conversion operators for the eight destination types. In order to support between two and eight destination types, the latter six are defaulted. I originally wanted to default them to void, but naturally one cannot have implicit conversions to void. Nor can they be the same type, as the compiler would rightly complain about having multiple implicit conversion operators to the same type. So what I've done is default them to pointers to distinct instantiations of the InvalidType helper template.
any_caster is implemented as a union precisely because we want, in this rare case, to subvert all type checking. The alternative would be to implement the conversion operators using C++ casts. Unfortunately, as I show in [4], it can be very difficult to generically code such conversions with the appropriate mix of C++ casts, and even C casts will precipitate warnings with some compilers.
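The shape of such a converter can be sketched in a few lines. The following is a cut-down illustration written for this purpose, not the STLSoft implementation: any_caster_sketch is a hypothetical name, it supports only two destination types rather than eight, and the DemoFindData structures and demo functions are portable stand-ins for the WinInet types. Note that reading a union member other than the one last written is not sanctioned by the C++ standard; the sketch relies on all members being pointers with the same representation.

```cpp
#include <cassert>

// Hypothetical, cut-down sketch: one source type and two destination types
// (the article's any_caster supports up to eight destination types).
template <typename T, typename T1, typename T2>
union any_caster_sketch
{
public:
    explicit any_caster_sketch(T t) : m_t(t) {}

    // Each operator reads the stored bits back as a different type. This
    // deliberately subverts type checking and assumes that T, T1 and T2
    // share size, alignment and representation.
    operator T1() const { return m_t1; }
    operator T2() const { return m_t2; }

private:
    T  m_t;
    T1 m_t1;
    T2 m_t2;
};

// Hypothetical stand-ins for the ANSI and Unicode structure variants.
struct DemoFindDataA { char    name[16]; };
struct DemoFindDataW { wchar_t name[16]; };

// "Faulty header" style: the W function is declared with the A structure.
inline void const* demo_faulty_find_w(wchar_t const*, DemoFindDataA* fd)  { return fd; }
// "Correct header" style: the W function is declared with the W structure.
inline void const* demo_correct_find_w(wchar_t const*, DemoFindDataW* fd) { return fd; }

// The same call expression compiles against either declaration, and in both
// cases the function receives the address of the W structure.
inline void demo_any_caster()
{
    typedef any_caster_sketch<DemoFindDataW*, DemoFindDataA*, DemoFindDataW*> caster_t;

    DemoFindDataW fdw;
    assert(demo_faulty_find_w(L"*", caster_t(&fdw)) == &fdw);
    assert(demo_correct_find_w(L"*", caster_t(&fdw)) == &fdw);
}
```

The point of offering both destination types is that the call site no longer needs to know which form of the header is in use: the compiler picks whichever conversion operator matches the parameter type it believes the function has.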
Since the size, alignment and bit representation of our convertee types are compatible (LPWIN32_FIND_DATAA and LPWIN32_FIND_DATAW are both 32-bit pointers, and the conversion from one to the other is valid in this case because one actually is the other), the union cast is the appropriate choice in this (unusual) case. Hence union casts may be considered necessary.
And that's it! Readers of Imperfect C++ might want to apply some of the techniques described in the discussion of the union_cast template to any_caster in order to increase its robustness by constraining its range of acceptable types; the full implementation is available with the STLSoft libraries. We might, for example, constrain all the types to be the same size and to be, say, all integral types or all pointers, using static assertions [4]. For example:
~any_caster() // Place in dtor so always gets checked
{
  STATIC_ASSERT(  is_pointer_type<T>::value &&
                  is_pointer_type<T1>::value &&
                  is_pointer_type<T2>::value &&
                  is_pointer_type<T3>::value &&
                  is_pointer_type<T4>::value &&
                  is_pointer_type<T5>::value &&
                  is_pointer_type<T6>::value &&
                  is_pointer_type<T7>::value &&
                  is_pointer_type<T8>::value);
}
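With a C++11 or later compiler, the same kind of constraint can be written with the language's own static_assert and the standard type traits. The following sketch is an assumption-laden illustration rather than the STLSoft code: all_pointers and checked_caster_sketch are hypothetical names, only two destination types are supported, and the checks are limited to "all pointers" and "all the same size".

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical helper: true only if every type in the pack is a pointer.
template <typename... Ts> struct all_pointers;
template <> struct all_pointers<> : std::true_type {};
template <typename T, typename... Rest>
struct all_pointers<T, Rest...>
    : std::integral_constant<bool,
          std::is_pointer<T>::value && all_pointers<Rest...>::value>
{};

// Hypothetical two-destination converter with compile-time constraints.
template <typename T, typename T1, typename T2>
union checked_caster_sketch
{
    static_assert(all_pointers<T, T1, T2>::value,
                  "all types must be pointers");
    static_assert(sizeof(T) == sizeof(T1) && sizeof(T) == sizeof(T2),
                  "all types must have the same size");

    explicit checked_caster_sketch(T t) : m_t(t) {}
    operator T1() const { return m_t1; }
    operator T2() const { return m_t2; }

private:
    T  m_t;
    T1 m_t1;
    T2 m_t2;
};

struct DemoA { int value; };
struct DemoB { int value; };

inline void demo_checked_caster()
{
    typedef checked_caster_sketch<DemoA*, DemoB*, DemoA*> ok_t;

    DemoA a = { 42 };
    DemoB* pb = ok_t(&a);
    assert(static_cast<void*>(pb) == static_cast<void*>(&a));

    // checked_caster_sketch<DemoA*, long, DemoA*> would be rejected at
    // compile time, because long is not a pointer type.
}
```

Placing the assertions in the body of the union means they are checked whenever the template is instantiated, so there is no need for the destructor trick used with the macro-based STATIC_ASSERT.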
We could even constrain all types to be pointers that point to things that are the same size. Even if you elect to take such measures to increase the safety of your union casts, it's worth stressing one last time that using unions for casting is something to be done only in extremis. But I think that when pushed into it by poorly designed/tested libraries, we are entitled to get out the big guns!
Acknowledgments
Thanks to Bjorn Karlsson, Garth Lancaster, Greg Peet, John Torjo and Walter Bright, for their excellent criticisms and suggestions.
About the Author
Matthew Wilson is a software development consultant for Synesis Software, and creator of the STLSoft libraries. He is author of the book Imperfect C++ (Addison-Wesley, 2004), and is currently working on his next two books, one of which is not about C++. Matthew can be contacted via http://imperfectcplusplus.com/.
Notes & References
[1] Kernighan, Brian and Dennis Ritchie. The C Programming Language, Prentice-Hall, 1988.
[2] Kahan and Darcy. How Java's Floating-point Hurts Everybody Everywhere, http://http.cs.berkeley.edu/~wkahan/JAVAhurt.pdf.
[3] Stroustrup, Bjarne. The C++ Programming Language (Special Edition), Addison-Wesley, 2000.
[4] Wilson, Matthew. Imperfect C++, Addison-Wesley, 2004. (I can't recommend this book highly enough, ;-) )
[5] STLSoft is an open-source organization whose focus is the development of robust, lightweight, cross-platform STL-compatible software, and is located at http://www.stlsoft.org/.
[6] InetSTL (http://www.inetstl.org/) is the Internet-related subproject of STLSoft, which was introduced with STLSoft version 1.7.1. It currently provides STL-like mapping of the Win32 WinInet APIs, but will evolve to cover other Internet APIs (including those on platforms other than Win32).