Dr. Dobb's is part of the Informa Tech Division of Informa PLC

This site is operated by a business or businesses owned by Informa PLC and all copyright resides with them. Informa PLC's registered office is 5 Howick Place, London SW1P 1WG. Registered in England and Wales. Number 8860726.


Channels ▼
RSS

Flexible C++ #3: Efficient Integer to String Conversions


Efficient Integer To String Conversions, part 3

First of all, I'd like to apologize for a gaff on my part [1], in not making it clear that the first few instalments in this column are to feature several different methods for achieving efficient integer to string conversions, and that none of these would represent an absolute zenith of implementation correctness and efficiency. (See sidebar "Flexible C++".)

In fact, there are five solutions (the latter four all build on my original solution described in [2]). Each of these solutions to the integer to string conversions are flawed in one way or another, and examination of these flaws will highlight issues within C++ that tend to bite the unwary (and sometimes the wary).

Throughout the next couple of columns, I'll be describing the remaining three solutions, and highlighting the drawbacks of each along the way. I'll finally end with suggesting how an easy-to-implement new keyword added to the language would result in our being able to transform one of these into a "perfect" solution. Not that that's going to be controversial or anything …

Solutions #1 & #2

So let's briefly do a recap. Solution 1 [2], in the form of the STLSoft [3] integer_to_string template functions, carries out conversion by writing the least significant digit backwards into a caller-supplied buffer, dividing by the base (10) and repeating until all digits are written and the integer is fully converted to string form. Its advantages are that it is thread-safe, does not use any local static memory, works with arbitrary character encoding (e.g. both char and wchar_t), and that the compiler can deduce the types required to instantiate the template.

uint64_t      i = . . . 
wchar_t       buff[21];
wchar_t const *result = integer_to_string(buff, stlsoft_num_elements(buff), i);

The disadvantages are that it requires the caller to supply a character buffer within which the converted form is written, and that it also requires the caller to supply the buffer capacity, which obviously introduces the possibility for mismatch between the two.

uint64_t      i = . . . 
wchar_t       buff[12]; // Typo! 12 will not handle the full-range of 64-bit integers!
wchar_t const *result = integer_to_string(buff, 21, i); // Ouch! Underwrite

(A somewhat prosaic additional criticism is that it involves typing a lot of characters, making client code far from succinct.)

Solution 2 [4] provides an alternative approach. Based on Solution 1, it uses thread-specific local static buffers such that a caller-supplied buffer is not required. It maintains practically all the efficiency of the Solution 1, but in a more easy to use package. (The details of the thread-specific buffers may feature in a future instalment.)

The advantages of this approach are that it is thread-safe, works with any character type, and does not require a caller-supplied buffer. Naturally, it does not need a caller-supplied buffer length either, so there is no possibility for an insufficient buffer being passed through to integer_to_string. The minor disadvantage is that the compiler cannot deduce the character type, since that is expressed only in the return type, rather than the function arguments, requiring us to explicitly instantiate the function template, as in:

uint64_t      i = . . . 
wchar_t const *result = int_to_string<wchar_t>(i);

However there is another, more significant, drawback, which we're going to examine, and partially address, in Solution 3.

Solution #3

Consider the following example, in light of the implementation of solution #2:

printf("%s %s", int_to_string<char>(5), int_to_string<char>(10));

With our current int_to_string() function implemention, we may get "5 5" or "10 10", but there's no way we're going to get the intended "5 10". This is because the value returned by the two calls to int_to_string() are the same, i.e. a pointer to the thread-specific buffer for the particular combination of character and integer type, in this case char and int.

This is a problem of return value lifetime, which can also occur in somewhat subtler cases:

int some_func(int, char **);

printf("%s %d\n", int_to_string<char>(argc), some_func(argc, argv);

If some_func() calls int_to_string<char>(int), whether directly or indirectly, then we're back to undefined behaviour in our output. We may get what is intended on one compiler, but whatever the last int_to_string<char>(int) call within some_func()left in the thread's conversion buffer on another. Nasty.

Let's look again at the implementation of int_to_string() and one of the possible implementations of its supporting function i2str_get_tss_buffer() that was described in [4].

template< typename C
        , typename I
        >
inline C const *int_to_string(I value)
{
  const size_t  CCH     = 21; // fits 64-bit + sign
  C             *buffer = i2str_get_tss_buffer<C, CCH>();

  return stlsoft::integer_to_string(buffer, CCH, value);
}

template< typename C
        , size_t   CCH
        >
inline C *i2str_get_tss_buffer()
{
  __declspec(thread) static C s_buffer[CCH];

  return s_buffer;
}

For efficiency reasons the conversion functions return C-strings, rather than instances of std::basic_string() or similar. The problem is that a C-string is not a value type; it is a pointer type whose value is the address of the integer's string representation laid out in memory. This problem is not unique to int_to_string(): any function which returns a pointer to a structure can suffer from this, irrespective of whether or not they are, like int_to_string(), thread-safe.

This is a classic problem in C++, and the conventional wisdom is, rightly, to avoid returning the addresses of, or references to, local static variables, whether they be of fundamental or class type, if they are subject to change.

So what can we do about it? Let's assume that we are adamant that we want to return pointers to C-strings. Clearly we want to be able to return distinct buffers from a parameterisation of i2str_get_tss_buffer() when the corresponding parameterisation of int_to_string() is called multiple times within a single expression. Unfortunately, I think that that's pretty much impossible, or at least would have a heavy runtime cost. However, we don't actually need to know whether successive calls are from within a single expression; one option is simply to make sure that the likelihood is very low. Because of the nature of integer to string conversion - i.e. there are fixed maximum lengths to the converted string form — we can approximate to impossible by changing the implementation of i2str_get_tss_buffer() to the following

template< typename C
        , size_t   CCH
        >
inline C *i2str_get_tss_buffer()
{
  const size_t                     DEGREE  =   32;
  __declspec(thread) static C       s_buffers[DEGREE][CCH];
  __declspec(thread) static size_t  s_index;

  s_index = (s_index + 1) % DEGREE;

  return s_buffers[s_index];
}

By picking a number that we believe is large enough, we reduce the likelihood of overwrites. 32 buffers of thread-specific storage, each of size CCH (the size adequate for a converted integer) are declared, along with a thread-specific indexing variable. Upon each call, the indexer is incremented, and is cycled back to 0 when it reaches 32. Thus each of the 32 buffers is used in turn, this cycling occurring on a thread-specific basis. Naturally 32 is a guess at the maximum number of integer to string conversions (remember that this is per-integer type, ie. there are 32 for uint32_t, 32 for int32_t, etc.), and represents a compromise between desired "safety" and stack-size. You would choose your own limit.

Please note that this is only the implementation for the __declspec(thread) version. As described in the last article, __declspec(thread) is suitable only for a limited number of development scenarios. In other cases you would have to use a proprietary API, something like:

template< typename C
        , size_t   CCH
        >
inline C *i2str_get_tss_buffer()
{
  const size_t  DEGREE  =  32;
  struct TSS
  {
    C       buffers[DEGREE][CCH];
    size_t  index;
  } *tss;

  tss = static_cast<TSS*>(TssStore_GetThreadDatum(. . .));

  if(tss == 0)
  {
    . . . // allocate & initialise TSS instance, and 
          // place in slot
  }

  tss->index = (tss->index + 1) % DEGREE;

  return tss->buffers[tss->index];
}

As for performance, both the __declspec(thread) form and the proprietary form have performances that are indistinguishable from their single-buffer variants described in [4].

So we've ameliorated the return-value lifetime problem. Sadly, we have not removed it, and I hope it is clear to you that it is theoretically impossible to do so. Of course we can practically remove it by selecting a sufficiently large degree of buffer array, but this is hackery gone nuts. Don't get me wrong; there are some circumstances where it's valid to go with practical, but not theoretical, correctness. I just don't think this is one of them. (Furthermore, there will be a non-trivial amount of memory used for large degrees, especially when multiple threads are using the functions.)

I would suggest that this solution is actually less desirable than solution #2, since at least in that case there is no attempt to give users of the function a false sense of security; any multiple-use of the return value will result in erroneous results. That's preferable to using a library that promises: "you're unlikely to encounter an error". In my opinion, therefore, this solution must be rejected completely [5].

If we want to return a pointer to a buffer, it's looking like we're going to have to pass in the buffer ourselves.

Solution #4

By now you may be despairing that these incremental solutions all represent retrograde steps. Thankfully that's not the case, and I suggest that solution 4 represents the optimal solution to the problem, given the current state of the language and the majority of its supporting compilers.

Getting back to the original STLSoft integer_to_string() functions, the only aspect of them that drew any significant criticism is the ability for a caller to supply an invalid value for the buffer length. If the length is too small, then at least an assertion will fire in debug builds. If the length value stipulated is sufficient, but does not accurately represent the actual, undersized, buffer length (as shown in the example at beginning of this article), then underrun will occur, which will result in corruption of either the stack or the heap.

A nice feature of modern compilers (including Borland C++ 5.6+, CodeWarrior 7+, Comeau, Digital Mars 8.32+, Intel, GCC and Visual C++ 7.0+) is that they can statically (i.e. at compile-time) deduce the size of arrays. One can define a template function taking an array parameter, and the compiler can deduce the length of the array as a template parameter, which is then accessible within the body of the template function. This means that we can define overloads of the original integer_to_string() functions that take an array parameter instead of the pointer & size parameters, as follows:

template< typename C
        , size_t   N
        >
inline C const *integer_to_string(C (&buf)[N], int8_t i)
{
  return integer_to_string(buf, N, i);
}

This eliminates the possibility of an erroneous buffer length being passed to the function. Even better, we can use compile-time checking (in the form of a static-assertion) to ensure that the buffer length is sufficient.

template< typename C
        , size_t   N
        >
inline C const *integer_to_string(C (&buf)[N], int8_t i)
{
  stlsoft_static_assert(!(N < printf_traits<int8_t>::min_value));

  return integer_to_string(buf, N, i);
}

Now there're no concerns about whether we'll run into a release-build bug that was not located via the assertion in debug-mode testing. If the given parameter is not sufficient the code will not compile. Pretty spiffy, no? ([6])

This solution is thread-safe, has no problems with return value lifetime, is not susceptible to erroneous length specification, works with arbitrary character encodings, and does not require explicit instantiation. It is readily inlined into the function it is implemented in terms of, so it does not sacrifice efficiency. Furthermore, it is pleasingly simple and does not rely on any platform-specific functionality. And, finally, it leads to more succinct code:

uint64_t      i = . . . 
wchar_t       buff[12];
wchar_t const *result = integer_to_string(buff, i);

The only downside is that one must still supply a buffer. In the last part of this series I'll look at a different technique altogether, one that obviates the problems of solutions 2, 3 and 4. Of course, this being C++, it introduces another of its own.

Acknowledgements

Thanks to Chuck Allison and Walter Bright for some useful input. Also to Igor Tandetnik for his eagle eye and eager keystrokes, in letting me know about having failed to have got the "Flexible C++" message (see sidebar) across in the first column. And thanks too to Eelis van der Weegen, for some interesting observations on Solution #2 and having the patience to wait for me to address his points; we're going to do that in solution #5, in the next column.

Notes and references

[1] Hopefully we can put this down to "first-time columnist finding his feet"!
[2] "Efficient Integer To String Conversions", Matthew Wilson, C/C++ User's Journal, December 2002.
[3] STLSoft is an open-source organisation whose focus is the development of robust, lightweight, cross-platform STL software, and is located at http://stlsoft.org.
[4] "Efficient Integer To String Conversions, part 2", Matthew Wilson, C/C++ User's Journal experts Forum, September 2003. http://www.cuj.com/experts/2109/
[5] Actually, an approach such as this might be appropriate when the (rare) repeated use of an object would result in loss of efficiency, rather than breaking correctness, so it's not entirely without its place.
[6] The updated stlsoft_integer_to_string.h is included in the archive for this article, and will appear in the STLSoft libraries from v1.7.1 onwards.

About the Author

Matthew Wilson is a software development consultant for Synesis Software, specializing in robustness and performance in C, C++, C# and Java. Matthew is the author of the STLSoft libraries, and the forthcoming book Imperfect C++ (to be published by Addison-Wesley, 2004). He can be contacted via [email protected] or at http://stlsoft.org/.


Related Reading


More Insights






Currently we allow the following HTML tags in comments:

Single tags

These tags can be used alone and don't need an ending tag.

<br> Defines a single line break

<hr> Defines a horizontal line

Matching tags

These require an ending tag - e.g. <i>italic text</i>

<a> Defines an anchor

<b> Defines bold text

<big> Defines big text

<blockquote> Defines a long quotation

<caption> Defines a table caption

<cite> Defines a citation

<code> Defines computer code text

<em> Defines emphasized text

<fieldset> Defines a border around elements in a form

<h1> This is heading 1

<h2> This is heading 2

<h3> This is heading 3

<h4> This is heading 4

<h5> This is heading 5

<h6> This is heading 6

<i> Defines italic text

<p> Defines a paragraph

<pre> Defines preformatted text

<q> Defines a short quotation

<samp> Defines sample computer code text

<small> Defines small text

<span> Defines a section in a document

<s> Defines strikethrough text

<strike> Defines strikethrough text

<strong> Defines strong text

<sub> Defines subscripted text

<sup> Defines superscripted text

<u> Defines underlined text

Dr. Dobb's encourages readers to engage in spirited, healthy debate, including taking us to task. However, Dr. Dobb's moderates all comments posted to our site, and reserves the right to modify or remove any content that it determines to be derogatory, offensive, inflammatory, vulgar, irrelevant/off-topic, racist or obvious marketing or spam. Dr. Dobb's further reserves the right to disable the profile of any commenter participating in said activities.

 
Disqus Tips To upload an avatar photo, first complete your Disqus profile. | View the list of supported HTML tags you can use to style comments. | Please read our commenting policy.