Questions & Answers

By Pete Becker, May 01, 1996

May 1996/Questions & Answers

Random number generation remains one of the most poorly understood topics in programming. This month Pete clears up some questions about rand() raised in one of his previous columns.

Q

I was reading your magazine and came across your discussion on rand. You provide the following example:

#include <stdlib.h>
#include <stdio.h>

int roll(void) {
      const int min_val = 1;
      const int max_val = 6;
      const int range = max_val - min_val + 1;
      return rand() % range + min_val;
}

int main(void) {
      int n;
      for (n=0; n<100; n++)
      printf("%2d:%d\n",n+1,roll());
      return 0;
}

Then you go and tell us this doesn't work because roll is distorted towards low values. Did I miss something here? I've tried this on the new Borland Win95 compiler several times doing several 1000 rolls and the results came back very evenly distributed. You blew this one!
— Gerald Wright

A

I've received several messages saying the same sort of thing as this one. It looks like I didn't do a good enough job of explaining the underlying problem, so I'll try it again. But instead of looking at a small range, let's look at a much bigger one.

rand returns a number between 0 and RAND_MAX, inclusive. To see the problem in its simplest form, consider what happens when we try to generate random numbers between 0 and RAND_MAX-1, inclusive:


unsigned random()
{
      return rand() % RAND_MAX;
}

Whenever rand returns a number between 0 and RAND_MAX-1 random returns that value. When rand returns the number RAND_MAX, random returns 0. That is, the function random will return 0 when rand returns 0 and when rand returns RAND_MAX. random will return 1 when rand returns 1; it will return 2 when rand returns 2; and so on. The result is that random will produce twice as many 0's as any other number, because there are twice as many values returned by rand that end up producing 0.

Similarly, if we try to get numbers between 0 and RAND_MAX-2, our function looks like this:

unsigned random()
{
      return rand() % (RAND_MAX-1);
}

The same sort of analysis tells us that random will return 0 whenever rand returns 0 or RAND_MAX-1; it will return 1 whenever rand returns 1 or RAND_MAX; it will return 2 whenever rand returns 2; and so on. There are two return values of rand that will produce 0, two return values that will produce 1, and only one value that will produce 2, 3, 4, or any other number up to RAND_MAX-2. So this version of random will return 0 and 1 twice as often as it returns any other value.

The problem we see here isn't some deep property of rand or random numbers. It's simply a counting problem, like trying to put 32,768 marbles into 32,767 bins. No matter how we distribute them, there is going to be at least one bin that gets more than one marble. The same problem occurs when we try to map the range of numbers returned by rand to the range of numbers that we need for our problem: unless the ranges fit together evenly we'll get some numbers more often than others. Those extra marbles have to go somewhere.

In practice this problem can be hard to see. Let's continue with the example that started this off: trying to generate numbers in the range 0 to 5. With Borland C++ 5.0, RAND_MAX is 32767, so when we call rand we get back numbers in the range 0 to 32767. If we divide the resulting number by 6 and take the remainder, we'll get numbers in the range 0 to 5. Let's look at which numbers returned by rand map to these values:

rand()                    random()

0,6,12,.,32760,32766      0
1,7,13,.,32761,32767      1
2,8,14,.,32762            2
3,9,15,.,32763            3
4,10,16,.,32764           4
5,11,17,.,32765           5

There are 5,462 different values returned by rand that result in random returning 0. Since rand returns 32,768 possible values, this means that the probability of getting a 0 from random is 5462/32768, or 16.6687%. There are 5461 different values that result in random returning 5, so the probability of getting a 5 from random is 5461/32768, or 16.6656%. The difference between the probability of getting a 0 and the probability of getting a 5 is so small that you won't notice it without checking a very large number of results. Nevertheless, it is there. The bias that we see here wouldn't make users return your adventure game, but in other applications it could make a difference.

Q

I just got a chance to read your discussion about mapping rand's output into a different integer range and thought that there are a couple of issues that you might like to account for.

The most important point is that the lower bits of the output from the usual (linear congruential) random number generators are the least "random." That is, patterns in the lower bits are common. Hence, the output from the routine roll in your discussion is not surprising. Also, it is avoidable by relying on the upper bits to determine the integer returned.

Second, there is a payoff to thinking about mapping ranges of values in a fairly general way and then applying the thoughts in the particular circumstances.

The first point means that getting a fraction on the interval from zero to one and then converting to an integer is far preferable to using the remainder. While using an intermediate floating-point number is slower (at least on the machines that I use), it will typically not have an output of predictable sequences. Given the strategy of using a fraction and then converting to integral values, it is fairly straightforward to think about converting ranges of values.

The output of the Borland DOS and OS/2 compilers ranges from 0 to RAND_MAX, which can be written [0, RAND_MAX]. The following line generates a fraction on the interval [0, RAND_MAX/(RAND_MAX+1)]:

rand()/((double)RAND_MAX+1.0)

where the cast should avoid possible integer overflow. [The cast is not necessary: the compiler will convert to double because 1.0 is a double. — pjb] For most purposes, multiplying the resulting fraction by the maximum value desired, max_val in your routines, truncating to an integer, and then adding one will do the trick, producing only nonobvious patterns. (I say this in this convoluted way because all random number generators are deterministic — perfectly predictable given the initial conditions and the rule generating the "random" numbers.) This will work fine as long as max_val is small relative to RAND_MAX+1 or RAND_MAX+1 is perfectly divisible by max_val+1. If max_val is large relative to RAND_MAX+1 and RAND_MAX+1 is not perfectly divisible by max_val+1, then more care is necessary to make sure that the relative proportions of the different values are the same. (As you do, I assume that max_val is less than RAND_MAX.) This is where throwing away values could come into play, although there are other possibilities.

I enjoy your column very much and would not bother you with this, but no one can know everything and it is a well-established rule to use the upper bits of congruential generators to generate numbers other than the integers produced by the built-in routines.

— Jerry Dwyer

A

You're right that I should not have neglected the well- known problem with the low bits of values produced by linear congruential generators. Let's look in a bit more detail at what we mean when we say that we want rand to generate a good pseudo-random sequence. Along the way we'll get a sense of how well the implementation in BC++ 5.0 does, because that's the compiler I've used to generate the results that we'll see here.

First, a pseudo-random number generator needs to produce every number in its output range with equal probability. That's easy to measure: just run the random number generator many times and count the number of occurrences of each number. The following program scales the output of rand down so it generates numbers greater than or equal to 0 and less than 10. That makes it easy to count how many times each number occurs, and to compare them by eye.

#include <stdlib.h>
#include <stdio.h>
#include <time.h>

int main()
{
      int i;
      int data[10];
      srand(time(NULL));
      for( i = 0; i < 10; i++ )
   data[i] = 0;
      for( i = 0; i < 1000000000; i++ )
   data[rand()%10]++;
      for( i = 0; i < 10; i++ )
   printf( "count[%d] = %d\n", i, data[i] );
      return 0;
}

The results show no glaring discrepancies from the even distribution we would hope to get:

count[0] = 100006000
count[1] = 100018371
count[2] = 100006464
count[3] = 100001979
count[4] = 100009869
count[5] = 99998331
count[6] = 99999846
count[7] = 100007726
count[8] = 99977846
count[9] = 99973568

That is, we would expect each number to occur 100,000,000 times. The actual results are quite close: none of the counts is off by more than a few tenths of a percent. For an adventure game, that's certainly good enough.

Knowing that we can produce values that are evenly spread over the range we are interested in is a good first step. However, in itself it's not enough. A good pseudo-random number generator also avoids short-term patterns. For example, a generator that always produces a 2 immediately after it has produced a 1 is not a good pseudo-random number generator, even if it covers its full range reasonably well. It is too predictable. To make this notion a bit more formal, in such a generator the sequence { 1,2 } would be much more common than the sequence { 1,3 }. Stated the other way around, a good pseudo-random number generator will produce every sequence of a given length with equal frequency. That's where linear congruential generators are said to fall down: they produce short-term patterns that are fairly easy to see. The following program checks the frequency of all possible sequences of two numbers:


#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NUM_VALS 1000000000

int main()
{
   int i,j;
   int count[10][10];
   int lastnum;
   srand(time(NULL));
   for( i = 0; i < 10; i++ )
      for( j = 0; j < 10; j++ )
         count[i][j] = 0;
   lastnum = rand()%10;
   for( i=0; i < NUM_VALS; i++ )
      {
      int curnum = rand()%10;
      count[lastnum][curnum]++;
      lastnum = curnum;
      }
   for( i = 0; i < 10; i++ )
      {
      for( j = 0; j < 10; j++ )
        printf( "%6.3f",
      count[i][j]*100/(double)NUM_VALS );
      printf( "\n" );
      }
   return 0;
}

When this program has completed the count array holds the number of two-number sequences whose first number is the value of the first array index and whose second number is the value of the second array index. If we have a good pseudo-random number generator these counts should be very close to equal. Rather than show the raw data, the program displays the frequency of each two-number sequence as a percentage of the total number of two-number sequences in the output. This produces the results shown in Table 1. There are one hundred possible two number sequences, so each one should occur one percent of the time. What we see here is that we're very close to one percent. None of the values is off by more than one part in one thousand. By this test, rand does pretty well in its low digits.

It is also said that linear congruential generators sometimes alternate odd and even numbers. If that were so, the previous table would show a higher percentage of sequences where the first number is odd and the second even, and where the first number is even and the second odd. Again, the data does not show this, so this particular random number generator does not seem to suffer from this problem.

If you need to carefully test your random number generator you need to perform tests on longer sequences than these. The bookkeeping involved becomes a bit harder, but the principle is the same: count the number of distinct sequences and look for imbalances. If you find any, you should be suspicious of your random number generator. If you don't find any, be careful about concluding that your random number generator works well. You may not be looking hard enough.

If your random number generator does have problems in its low digits, you're right that using floating-point computations to map to the desired range will help reduce the effect of that bias. Note, however, that it does not help with the problem mentioned in the preceding letter: if you're trying to put 32,768 marbles into 32,767 bins, counting in floating-point doesn't make the extra one go away. It just changes where the extra value ends up, putting it in the middle instead of at the beginning.

Q

The answer to the question about private virtual functions ("Questions & Answers," November 1995) was incorrect; the sample code won't compile. Member access is a compile-time check, whereas the virtual function mechanism is a run-time feature. Below is the sample code, compiled with three different C++ compilers, with error messages.

class Base {
    virtual void func();
};

class Derived : public Base {
    public:
    void func();
};

void sample()
{
    Base *bp = new Derived;
    //calls Derived::func?
    //doesn't compile!
    bp->func();

}

— Gerard C. Weatherby

A

AI left out the error messages you got, to avoid further embarrassment. You're absolutely right: that code is incorrect. It should not compile, because func is private in Base. I usually compile the code that I include in this column, but this fragment was so simple that I just knew I couldn't get it wrong. We recently lost half a day of testing when someone made a simple change to some code that didn't need to be changed and broke our debugger. It took that long to figure out what had gone wrong, back out the change, rebuild, and redistribute the code to our testing staff. Don't ever assume that writing code is simple. Our business is precision, and precision requires close attention. If you take shortcuts you'll make mistakes.

Pete Becker is Senior Development Manager for C++ Quality Assurance at Borland International. He has been involved with C++ as a developer and manager at Borland for the past six years, and is Borland's principal representative to the ANSI/ISO C++ standardization committee.

1 2 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Questions & Answers

Q

A

Q

A

Q

A

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Questions & Answers

Q

A

Q

A

Q

A

Related Reading

News

Commentary

Slideshow

Video

Most Popular

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content