Tim is an associate professor in the Computer Science Department of Eastern Washington University. He can be contacted at [email protected].
Assume you are developing C or C++ programs to run under Linux or another UNIX variant, and you know that they may run on a machine with more than one processor. Sequential programs, of course, use only one of those processors, which probably is sufficient if the machine is otherwise able to keep all processors busy. But if the machine ends up with an idle processor, you may want to capture those processing cycles.
Of course, there are freely available systems such as Parallel Virtual Machine (PVM) and Message Passing Interface (MPI) under which you can rewrite your programs to take advantage of message passing; see the accompanying text box entitled "Message Passing Systems." However, it may happen that your program deals with a number of smaller subproblems that are effectively independent of each other, what the parallel programming literature calls an "embarrassingly parallel" problem. In this case, you may be able to take advantage of the UNIX multiprocessing system call fork to get multiple copies of your program executing. With only small changes in the original sequential program, you can keep both processors on dual-processor computers busy and complete your calculations in roughly half the time.
The call that opens up the possibility of such a speed-up is fork. If fork succeeds, there are two copies (the parent process and the child process) of the same program executing at the same time. The child process executes in a copy of the memory used by the parent process. This means that it receives all of the data of the parent, but copied into its own area. However, the processes share any files opened for output. The operating system lets them write to the same file, and handles the interleaving, though you should be careful to write only complete, newline-terminated records as single writes.
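The fork/wait pattern can be boiled down to a minimal sketch. The function and names below (run_in_child, demo) are illustrative, not taken from the article's source code; the sketch assumes a POSIX system:

```cpp
#include <sys/wait.h>
#include <unistd.h>

// run_in_child forks, runs fn() in the child process, and returns the
// child's exit status to the parent -- demonstrating that the child
// sees a return value of zero from fork while the parent sees the
// child's (strictly positive) process ID.
int run_in_child(int (*fn)(void))
{
    pid_t child = fork();
    if (child < 0)          // fork failed: no child process exists
        return -1;
    if (child == 0)         // this is the child: fork returned zero
        _exit(fn());        // exit immediately with fn's result
    int status = 0;         // this is the parent: reap the child
    waitpid(child, &status, 0);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

static int demo(void) { return 42; }   // trivial child workload
```

Because the child exits through _exit(), it avoids flushing stdio buffers that were copied from the parent, which would otherwise produce duplicated output.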
To illustrate, I use a program designed to measure the characteristics of Binary Search Trees by generating large numbers of them from random data (the complete source code for this and other examples is provided electronically; see "Resource Center," page 5). Listing One is C preprocessor code for enabling/disabling the fork processing, currently set to enable the dual-processor mode. A process determines whether it is the child or parent process by examining the variable that receives the return value of fork. The parent receives a strictly positive number, the process ID for the child; the child receives zero. (Failure to fork causes a negative return value.)
Listing Two is the preprocessor code surrounding the set-up for either sequential- or dual-processing. In this case, I am only setting up for two processes, so that the alternating sequence can easily be generated: the child process does the even offset values for N (after Nstep has been doubled), while the parent process does the odd offset values (by stepping ahead by 1 before Nstep is doubled). Prior to pid_t Child = fork(), the program opened an output stream, using ofstream Fout("BSTstats.csv",ios::app); both parent and child processes write to this output stream.
The working heart of the simulation (Listing Three) does not even know whether it is running as the only process or as a parent or child process. Even on an otherwise unloaded dual-processor computer, though, the operating system does require the use of some processor cycles, slightly slowing down one side of the calculation. If you just want the statistical data, this isn't a problem. However, if you are interested in the time required, you may see that the parent and child run at slightly different speeds.
You can also get data back from the child process or processes via a shared output file. In this case, the parent needs to wait for the child process to terminate, something easily done with the UNIX C Library function wait(). The child process writes its information to a shared file, which the parent process then rewinds and reads.
Listing Four is from a program implementing the "world's worst" method to calculate a logarithm as ln(x): the Monte Carlo integration of dt/t from 1 to x. The parent opens a file on a local disk, then the child process writes its information to that file. When the child is finished, the parent reads through the file to process the information. Although in this case only a single long int is being returned, you can see that the object written to the binary file and read back from it could be arbitrarily large.
Java Note
Java threads, of course, make multiprocessing easily available, but require a bit more programming effort to split the sequential code into its parallel incarnation. Rather than simply executing from the point of a fork statement, the Java thread begins processing within its run() method. Hence, the segment of code to run in parallel needs to be moved to that method. Similarly, data transfer needs to be handled explicitly in constructing the thread rather than through a copied data area. Listing Five comes from the Java analogue to the C++ code examining Binary Search Tree characteristics (Listing One). The first code segment is taken from the main() method. The Thread constructor captures the subproblem parameters into instance fields to drive the simulation in the run() method; see Listing Six. The run() method (Listing Seven) doing all of the work is simply a copy of the appropriate section of the sequential program.
Since the System.currentTimeMillis() method explicitly returns wall-clock time rather than the processor time returned by the C Library function clock(), this Java implementation may see even more of a speed difference between the threads, depending on which processor is doing the necessary operating-system operations.
In Java environments, information can flow from the threads back to the originating process because they are sharing the same memory, based on references passed as parameters. For the "world's worst" logarithm calculation, I needed to provide a reference to a mutable integer, even though the Java API class Integer is immutable. Listing Eight does this. As in the C example, the size of the object returned could be arbitrarily large in the Java implementation. Indeed, because of the shared memory, it is easier for the threads to return information to the originating process through such mutable objects.
DDJ
Listing One
/* If a dual-processor UNIX machine, you can use a fork and run
   two experiments in parallel.
   Comment out the following #define to disable */
#define DUAL_PROCESSOR      /* UNIX system, dual-processor computer */

#ifdef DUAL_PROCESSOR
#include <unistd.h>         /* fork (at least for Linux) */
#endif
Listing Two
#ifndef DUAL_PROCESSOR
   srand(time(NULL));       // Get a chance start to rand() sequence
#else
// If it IS dual-processor, fork the second process, then start two different
// random sequences and arrange to alternate tree sizes during the simulation.
   pid_t Child = fork();
   if ( Child != 0 )        // I.e., if this is the parent process
      N += Nstep;           // Do the odd-offset ones
   Nstep += Nstep;          // Each process steps by twice as much
   // Chance start to rand() sequence, different for parent and child
   srand(time(NULL) + Child);
#endif
Listing Three
for ( ; N <= Nlim ; N += Nstep )      // Use value of N from above
{  clock_t Start = clock();           // Capture the starting tick
   double CPUsecs,                    // Processor time required
          SigmaHt = 0,                // Start summations at zero
          SigmaDp = 0;
   for (Pass = 0; Pass < Npasses; Pass++)
   {  for ( k = 0; k < N; k++ )       // Fill the tree
         Tree.Insert ( rand() );
      Height = Tree.Height();         // Capture this tree's data
      AvgDepth = Tree.AvgDepth();
      SigmaHt += Height;              // Collect the statistics
      SigmaDp += AvgDepth;
      Tree.Empty();                   // Empty out the tree
   } // end for ( Pass...
   CPUsecs = double(clock()-Start) / CLOCKS_PER_SEC;
   // Capture statistics in "comma-separated value" mode for later use.
   Fout << N << ',' << SigmaHt / Npasses << ','
        << SigmaDp / Npasses << ',' << CPUsecs << endl;
} // end for ( N...
Listing Four
int fd = open("/tmp/Joint.data", O_RDWR|O_TRUNC|O_CREAT, 0666);
. . . < code segment omitted >
// Spawn all the child processes as a chain
for ( Proc = 1; Proc < Nproc; Proc ++ )
{  Child = fork();
   if ( Child != 0 )     // Child spawns the NEXT child.
      break;
}
. . . < code segment omitted >
if ( Proc == 1 )         // That is, the first parent in the chain
   wait(NULL);           // wait till everything is finished.
else
{  if ( Child != 0 )
      wait(NULL);        // Let the immediate descendent finish
   N_out = write (fd, &Nhits, sizeof(Nhits));   // write out Nhits
   exit(1);              // Bow out so that the next can process
}
N_out = lseek(fd, 0, SEEK_SET);    // Rewind the file
for ( Proc = 1; Proc < Nproc; Proc++ )
{  N_out = read (fd, &Part, sizeof(Part));      // read into Part
   Nhits += Part;
}
Listing Five
BSTrun[] worker;   // Compute-engine threads
. . . < code segment omitted >
worker = new BSTrun[nThreads];
for ( k = 0; k < nThreads; k++ )
{  // Note the threads have differing starting values for n,
   // while they all use a correspondingly coarser nStep.
   worker[k] = new BSTrun(n+k*nStep, nLim, nThreads*nStep,
                          nPasses, fOut);
   worker[k].start();    // Start processing
}
// Now wait for the workers to finish
for ( k = 0; k < nThreads; k++ )
   try
   {  worker[k].join(); }   // Returns when worker[k] terminates
   catch (InterruptedException e)   // We should never see this
   {  System.err.println (e);
      System.exit(-1);
   }
Listing Six
class BSTrun extends Thread
{  // Seeded in the constructor, used in run()
   java.util.Random generator;
   // Subproblem description parameters / instance fields
   int n,          // Starting tree size
       nLim,       // Upper limit on tree size
       nStep,      // Increment for tree size
       nPasses;    // Number of trees to build for averaging
   // Output stream for the results
   PrintWriter fOut;
   // Copy all the construction parameters into their fields.
   BSTrun(int n, int nLim, int nStep, int nPasses, PrintWriter fOut)
   {  this.n = n;
      this.nLim = nLim;
      this.nStep = nStep;
      this.nPasses = nPasses;
      this.fOut = fOut;
      // Since n values differ, we have somewhat different seeds.
      generator = new java.util.Random(System.currentTimeMillis()
                                       + n * nLim);
   } // end BSTrun constructor
Listing Seven
// Compute engine portion --- copied from sequential code
public void run()
{  BST tree = new BST();
   for ( ; n <= nLim; n += nStep )
   {  long start = System.currentTimeMillis();
      double elapsed;
      double sigmaHt = 0,      // Total of tree heights
             sigmaDp = 0;      // Total of tree average depths
      int height;              // Individual tree height
      double avgDepth;         // Individual tree average depth
      String stats;            // Build the formatted output here
      DecimalFormat fmt = new DecimalFormat ("0.000");
      for ( int pass = 0; pass < nPasses; pass++ )
      {  for ( int k = 0; k < n; k++ )     // Fill the tree
            tree.insert ( generator.nextInt() );
         height = tree.height();           // Capture this tree's data
         avgDepth = tree.avgDepth();
         sigmaHt += height;                // Collect the statistics
         sigmaDp += avgDepth;
         tree.empty();                     // Empty out the tree
      } // end for (pass...
      elapsed = (System.currentTimeMillis()-start) * 1.0e-03;
      stats = "," + fmt.format(sigmaHt / nPasses)
            + "," + fmt.format(sigmaDp / nPasses)
            + "," + fmt.format(elapsed);
      fOut.println (n + stats);
      fOut.flush();
   } // end for (n...
} // end run()
Listing Eight
class IntWrapper
{  public int value;        // All values allowed, so direct access
   IntWrapper (int value)   // Constructor: copy over value
   {  this.value = value; }
} // end class IntWrapper
. . . < code segment omitted >
// The following code segment is taken from the main() method--showing the
// creation of an array of threads th[], and an array of IntWrapper objects
// p[] to handle parameter passing.  The p[] objects also pass information
// into the thread--the instance number of the thread.
LnCalc[] th = null;
IntWrapper[] p = null;
. . . < code segment omitted >
th = new LnCalc[nThreads];
p = new IntWrapper[nThreads];
for ( k = 0; k < nThreads; k++ )   // Thread initiation
{  p[k] = new IntWrapper(k+1);
   th[k] = new LnCalc(x, nShots, p[k]);
   th[k].start();
} // end thread initiation and start
for ( k = 0; k < nThreads; k++ )   // Thread completion
{  try
   {  th[k].join(); }   // Wait for completion of this thread.
   catch (InterruptedException e)
   {  System.out.println (e);
      System.exit(-1);
   }
   nHits += p[k].value;
} // end wait for thread termination