What's Different About Multiprocessor Software? Part 2


In general, multiprocessor scheduling is NP-complete [Gar79]. That is, if we want to minimize total execution time on an arbitrary multiprocessor, we have no known way to find the shortest schedule in polynomial time.

Of course, many NP-complete problems have useful approximations and heuristics. By taking advantage of what we know about the multiprocessor structure, by limiting the combinations of process executions that we consider, or by other simplifications, we can create a number of simplified but useful multiprocessor scheduling problems. For example, two-processor multiprocessors can be scheduled optimally under some conditions.

One of the first multiprocessor scheduling algorithms was developed by Stone [Sto77]. Although he referred to the problem as a scheduling one, it is more accurately described as an allocation problem, since it selects the CPUs on which processes execute and determines only implicitly the times at which they execute.

Stone's model considered a network of heterogeneous processors; he solved the allocation problem using network flow algorithms, finding an exact solution for the two-processor case and heuristics for systems with arbitrary numbers of processors.

Figure 6-3 Models for Stone's multiprocessor scheduling algorithm. After Stone

As shown in Figure 6-3 above, the problem is specified in two parts. An intermodule connection graph describes the time cost of communicating between two processes that are assigned to different processors; communication between processes on the same processor has zero cost.

The execution time table specifies the execution time of each process on each processor; it is possible that not all processes will be able to run on both processors.

The minimum running time balances the communication cost and the execution cost. Stone formulates the scheduling problem as one of finding a cutset of a modified version of the intermodule connection graph.

Two additional nodes are added to represent the two processors. One such node is the source of the graph (representing CPU 1) and the other is the sink (representing CPU 2).

Edges are added from each non-sink node to the source and the sink. The weight of an edge to the source is equal to the cost of executing that node's module on CPU 2 (the sink); the weight of an edge to the sink is equal to the cost of executing that node's module on CPU 1 (the source).

The cutset divides the intermodule connection graph into two sets, with the nodes in each set being assigned to the same processor. The weight of a cutset is the cost of the assignment of nodes to the two processors given by that cutset. To find the allocation that minimizes the total execution time, we find a minimum-weight cutset, which can be done by solving a maximum flow problem on the graph.
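
As a rough illustration of the cost function that the minimum-weight cutset captures, the following sketch evaluates every possible two-processor assignment of a tiny, made-up module set by brute force. The module names, execution times, and communication costs are hypothetical, and Stone's actual method uses a max-flow computation rather than enumeration:

# Sketch of the cost that Stone's minimum-weight cutset captures, evaluated
# by brute force on a tiny made-up example. Stone's algorithm finds the same
# minimum with a max-flow/min-cut computation instead of enumeration.
from itertools import product

# exec_time[module] = (time on CPU 1, time on CPU 2)
exec_time = {"A": (5, 10), "B": (2, 4), "C": (4, 3), "D": (6, 1)}

# comm_cost[(m1, m2)] = cost paid only if m1 and m2 run on different CPUs
comm_cost = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 4, ("A", "D"): 1}

modules = list(exec_time)

def assignment_cost(assign):
    """Total execution cost plus the cost of edges cut by the assignment."""
    cost = sum(exec_time[m][assign[m]] for m in modules)
    cost += sum(c for (m1, m2), c in comm_cost.items()
                if assign[m1] != assign[m2])
    return cost

best = min(
    (dict(zip(modules, cpus)) for cpus in product((0, 1), repeat=len(modules))),
    key=assignment_cost,
)
print("best assignment:", best, "cost:", assignment_cost(best))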

Stone extended the problem to n processors by generalizing the notion of a cutset. The generalized cutset divides the graph into n disjoint subsets such that no proper subset of a cutset is also a cutset.

He also generalized the distinguished nodes, using n of them (one per processor) rather than just a source and a sink. His heuristic for this problem iteratively applied two-processor assignments to build up the n-processor assignment.

Why static tasks?
Many embedded systems statically allocate processes to processing elements. We can efficiently find bounds on the execution time of the processes in these multiprocessor systems. We will assume that there is a set of processes with data dependencies between them; in general, they can form one or more subtasks.

We will also assume that each CPU schedules processes using rate monotonic scheduling. Although we can easily figure out the schedule if we don't have data dependencies, the combination of data dependencies and rate monotonic scheduling makes the problem more challenging, although tractable.

Minimizing buffer sizes. Bhattacharyya et al. [Bha97] developed methods to efficiently schedule synchronous data flow graphs on multiprocessors. Figure 6-4 below shows an SDF graph with the nodes assigned to processing elements in a multiprocessor. We are primarily interested in the communication between PEs, since we can schedule each SDF on a processor using other methods that produce a sequential schedule.

Figure 6-4 Models for multiprocessor communication.

We model the system using an interprocessor communication modeling graph (IPC graph), also shown in the figure. The IPC graph has the same nodes as the SDF graph and all of its edges, plus additional edges that model the sequential schedule on each PE; these added edges are shown by dashed lines in the figure.

The edges in the allocated SDF graph that cross processor boundaries are known in the IPC graph as IPC edges because they define interprocessor communication. Any communication across an IPC edge must use an interprocess communication mechanism to cross the boundary between the processors.

We can determine whether communication across each IPC edge is bounded; edges that are not part of a strongly connected component (SCC) are not bounded. When implementing interprocess communication on an unbounded edge, we must use a protocol that ensures the number of tokens crossing the edge never exceeds a predetermined buffer size; interprocess communication on bounded edges can use a simpler protocol.
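
A minimal sketch of this classification, under the assumption that an IPC edge is bounded exactly when its sink can reach its source (that is, the edge lies on a cycle); the example graph is made up for illustration:

# Classify IPC edges as bounded or unbounded with a simple reachability test:
# an edge lies in a strongly connected component exactly when its sink can
# reach its source. The graph below is illustrative, not from the figures.
from collections import deque

graph = {            # adjacency list of the IPC graph
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "d"], # a-b-c form a cycle; d lies outside any cycle
    "d": [],
}

def reaches(graph, start, target):
    """True if there is a directed path from start to target."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for succ in graph[node]:
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

for src, dsts in graph.items():
    for dst in dsts:
        bounded = reaches(graph, dst, src)   # can the sink get back to the source?
        print(f"edge {src}->{dst}: {'bounded' if bounded else 'unbounded'}")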

The IPC graph may have some redundant edges. An edge e is redundant if there is another path from source(e) to sink(e) whose total delay does not exceed the delay along e, since that path already enforces the same synchronization. All redundant edges can be removed, and the order in which we remove them does not affect how many we can eliminate. The asymptotic iteration period T for a strongly connected IPC graph G is

    T = max over cycles C in G of ( Σ_{v on C} t(v) ) / delay(C)

where C is a cycle through the graph, t(v) is the execution time of a node v, and delay(C) is the sum of the delays around the cycle C. The ratio for a single cycle is known as its cycle mean. The maximum cycle mean of an IPC graph, λmax, is the largest cycle mean for any SCC in the graph. A cycle whose cycle mean is equal to the maximum is known as a critical cycle.
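
The following sketch evaluates the cycle mean and the maximum cycle mean for a small, made-up IPC graph; the cycles are listed by hand rather than enumerated, and all execution times and delays are illustrative:

# Cycle mean and maximum cycle mean for a tiny made-up IPC graph. A real tool
# would enumerate cycles (or use a maximum-cycle-mean algorithm) instead of
# listing them explicitly.
exec_time = {"a": 3, "b": 2, "c": 4, "d": 1}       # t(v) for each node
delay = {("a", "b"): 0, ("b", "a"): 1,             # delay on each edge
         ("b", "c"): 0, ("c", "d"): 0, ("d", "b"): 2}

cycles = [["a", "b"], ["b", "c", "d"]]             # the cycles of this example

def cycle_mean(cycle):
    """(sum of node execution times) / (sum of edge delays) around the cycle."""
    nodes_time = sum(exec_time[v] for v in cycle)
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))
    return nodes_time / sum(delay[e] for e in edges)

lambda_max = max(cycle_mean(c) for c in cycles)
print("cycle means:", [round(cycle_mean(c), 2) for c in cycles])
print("maximum cycle mean (lambda_max):", lambda_max)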

We can construct a strongly connected synchronization graph by adding edges between strongly connected components. We add edges that chain together source SCCs, edges that chain together sink SCCs, and an edge that connects the overall sink of the graph to the source.

(A strongly connected component is a source SCC if any edge whose sink is in the strongly connected component also has its source in the strongly connected component. A sink SCC is such that any edge whose source is in the SCC also has its sink in the SCC.)

We need to add delays to the edges, corresponding to buffer memory, that ensure the system will not deadlock while minimizing the sum of the buffer bounds over all the IPC edges. The added edges can be divided into disjoint sets that help organize the graph and determine these delays.

Delays can be added optimally if the graph has one source SCC and one sink SCC; if the graph's structure is more complex, we must fall back on heuristics. We can determine the minimum delay on each edge that ensures that the graph's maximum cycle mean is not exceeded.

Figure 6-5 RATAN process model.

Mathur et al. [Mat98] developed the RATAN tool for analyzing the rates of multitasking systems. As shown in Figure 6-5 above, each process is modeled as a single-threaded CDFG-style graph. Data dependencies may extend from a node within one process to a node within another process. When we abstract away the process internals and look only at the processes and the edges between them, each control edge is labeled with [min, max] delays measured from the activation signal for the process to the start of its execution.

Those bounds are specifications on the allowable delay; our goal is to find execution rates for the processes that satisfy these bounds. A process starts to execute after all of its enable signals have become ready. If we denote the delay of an edge i -> j in the graph as dij, then the delay of a cycle C in the process graph is given by

    d(C) = Σ over edges (i -> j) in C of dij

The mean delay of a cycle is given by

    d(C) / |C|

where |C| is the number of edges in C. The maximum mean cycle delay is known as lambda. In a strongly connected graph, all nodes execute at the same rate, namely lambda.
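
A few lines suffice to instantiate these reconstructed formulas on a made-up cycle; the edge delays here are illustrative:

# Cycle delay d(C) and mean cycle delay d(C)/|C| for a made-up process graph.
d = {("p1", "p2"): 4, ("p2", "p3"): 2, ("p3", "p1"): 6}   # delay on edge i -> j

cycle = ["p1", "p2", "p3"]                        # one cycle C in the graph
edges = list(zip(cycle, cycle[1:] + cycle[:1]))

d_C = sum(d[e] for e in edges)                    # delay of the cycle, d(C)
mean_delay = d_C / len(edges)                     # d(C) / |C|
print("d(C) =", d_C, " mean delay =", mean_delay)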

We call [rl(X), ru(X)] the lower and upper bounds on the rate of a subgraph X. If we have two maximal SCCs of the graph, P and C, and the graph has edges from P to C, then P is a producer and C is a consumer; the actual rate interval of the consumer C is therefore constrained by the rate interval of its producer P, since the consumer cannot run faster than the producer that feeds it.

Data dependencies and scheduling
The problems created by data dependencies are illustrated in Figure 6-6, below. Here, two subtasks are divided among three processors. Take, for example, processing element M1.

Figure 6-6 Preemption and scheduling.

This CPU runs two processes that will clearly affect each other's schedules. But the completion times of the processes on M1 also depend on the behavior of the processes on all the other PEs in the system. The data dependency between P1 and P2 adds M2 to the set of interrelated PEs, and the data dependency between P3 and P4 adds M3 as well.

Getting a process to run faster doesn't always help. Consider Figure 6-7 below; in this example, changing the computation time of process Px changes the response time of P3 even though they run on different processors.

Figure 6-7 Period shifting. From Yen and Wolf [Yen98] © 1998 IEEE.

Because of the data dependencies, the shortened computation time of Px causes process P2 to run sooner, preempting P3 and lengthening its response time.

Ramamritham [Ram90b] proposed scheduling over an unrolled schedule. He noted that the maximum interval that must be considered for a set of processes is the least common multiple (LCM) of their periods; in a longer schedule, the order of processes must necessarily repeat itself. He then constructed a schedule using the LCM interval.
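
For example, the length of the interval that must be scheduled is just the LCM of the periods; a one-line sketch (Python 3.9+ provides math.lcm, and the periods below are illustrative):

# Hyperperiod of a set of process periods: the schedule repeats after the LCM.
from math import lcm  # available in Python 3.9+

periods = [10, 15, 25]                # process periods, in the same time unit
hyperperiod = lcm(*periods)
print("schedule over LCM of periods:", hyperperiod)   # 150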

To develop bounds on the CPUs in the system, we can make use of a theorem from Lehoczy et al. [Leh89]. They bounded the response times for a set of independent (no data dependencies) processes running on a single CPU.

The processes are {P1, P2, ...}, with P1 being the highest-priority process. The minimum period of the ith process is pi and its worst-case execution time is ci. The worst-case response time of Pi is wi, which can be computed as the smallest non-negative root of

    x = g(x) = ci + Σ_{j = 1..i-1} ⌈x/pj⌉ cj

The ci term gives the computation time of the process itself. The ⌈x/pj⌉ terms count how many releases of each higher-priority process Pj can delay the execution of the ith process within an interval of length x. We cannot solve this equation in closed form, but we can solve it numerically.
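
One common numerical approach is fixed-point iteration, sketched below; the process parameters are illustrative and the code is not taken from [Leh89]:

# Solve the response-time equation x = ci + sum_{j<i} ceil(x/pj)*cj by
# fixed-point iteration. Processes are listed highest priority first.
from math import ceil

procs = [              # (worst-case execution time ci, period pi)
    (1, 4),            # P1, highest priority
    (2, 6),            # P2
    (3, 10),           # P3
]

def response_time(i, procs):
    """Smallest non-negative root of x = ci + sum over j < i of ceil(x/pj)*cj."""
    ci = procs[i][0]
    x = ci
    while True:
        nxt = ci + sum(ceil(x / pj) * cj for cj, pj in procs[:i])
        if nxt == x:                      # fixed point reached
            return x
        if nxt > procs[i][1]:             # exceeds the period: unschedulable
            return None
        x = nxt

for i in range(len(procs)):
    print(f"P{i+1}: worst-case response time = {response_time(i, procs)}")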

We can use the worst-case response times of the various processes to help us solve the system schedule. But this formula is not sufficient because it does not take into account the data dependencies either between the CPUs or within a CPU. To handle those data dependencies, we need to use a more complex graph algorithm.

Static scheduling algorithm. Yen and Wolf [Yen98] developed an algorithm that handles multiple CPUs and data dependencies. This algorithm models the processes to be run as a task graph that can have one or more subtasks. Each process in the task graph is given bounds for its computation time [ci lower, ci upper].

This is more general than a single, fixed computation time, but it does assume that we can strictly bound the computation time of a process. Each subtask has a period that is also modeled as an interval [pi lower, pi upper]. The architecture of the platform is modeled as a processor graph. The allocation of each process in the task graph to a processing element is given in Figure 6-8, below.

Figure 6-8 An example of reduced computation time leading to longer response times. From Yen and Wolf [Yen98].

The algorithm finds bounds on the start and the finish times of the processes. Given a process Pi, the start time is bounded by earliest[Pi, request] and latest[Pi, request]; the end time is bounded by earliest[Pi, finish] and latest[Pi, finish].

The following code summarizes the delay estimation algorithm. The maxsep[,] data structure holds the earliest[] and latest[] bounds for each process; it starts out with infinite (unbounded) times for the processes. This performance analysis algorithm iteratively tightens these bounds until they stop changing (or until a predetermined iteration limit is reached).

maxsep.lower = maxsep.upper = infinity;
step = 0; /* keep track of number of iterations */
do {
    /* use longest path algorithm to find the request and finish times */
    foreach Pi { EarliestTimes(Gi); LatestTimes(Gi); }
    /* handle max constraints */
    foreach Pi { MaxSeparations(Gi); }
    step++;
} while (maxsep has changed and step < limit);

At each iteration, the algorithm performs two types of operations. The EarliestTimes()/LatestTimes() procedures analyze data dependencies using a modified longest-path algorithm.

The MaxSeparations() procedure uses a modified max-constraint algorithm to look for combinations of process executions that cannot occur. Each of these procedures is applied separately to each subtask Pi in the task graph, but the procedures look at the execution times of all the other subtasks while updating Pi.

Figure 6-9 An example of phase adjustment and separation analysis. From Yen and Wolf [Yen98] © 1998 IEEE

To understand why we need these two steps, consider the example of Figure 6-9, above. A naive analysis might lead one to conclude that the total time required to finish both tasks is 80. But two conditions cause it to be shorter. First, P1 cannot preempt both P2 and P3 in a single execution of the task. Second, P2 cannot preempt P3 since P2 must finish before P3 can start. Therefore, the worst-case delay to execute both tasks is only 45.

Phase constraints
We will use phase constraints to summarize the relationships between a process that is preempted and the process that preempts it. The algorithm uses two types of phases.

1) The request phase, which we will call phase[i,j,r] in the code, describes the smallest interval between the execution of Pi on one iteration and the succeeding iteration of Pj.

2) The finishing phase, which we will call phase[i,j,f] in the code, describes the smallest interval between the finishing time of one iteration of Pi and the first request time of the next iteration of Pj.

Figure 6-10 Request and finishing phases.

These definitions are illustrated in Figure 6-10 above. The gray regions at the ends of the boxes represent the min/max timing bounds.

Data dependencies are one form of constraint on the relative execution times of processes. LatestTimes() and EarliestTimes() are modified longest-path algorithms that determine the timing and slack of the processes based on their data dependencies. Pseudocode for LatestTimes() is shown in the following block of code from [Yen98]; it is similar in form to the EarliestTimes() procedure.

LatestTimes(G) {
    /* initialize */
    foreach (process Pi) {
        latest[Pi,request] = 0;
        foreach (process Pj) phase[i,j,r] = 0;
    }
    foreach (process Pi in topological order) {
        wi = worst-case response time of Pi with phase adjustment phase[i,j,r];
        foreach (process Pj such that priority(Pj) > priority(Pi)) {
            latest[Pi,finish] = latest[Pi,request] + wi;
            calculate phase[i,j,f] relative to latest[Pi,finish] for each j;
            foreach (immediate successor Pk of Pi) {
                delta = latest[Pk,request] - latest[Pi,finish];
                if (latest[Pk,request] < latest[Pi,finish])
                    latest[Pk,request] = latest[Pi,finish];
                update phase[k,j,r] for each process Pj according to
                    phase[i,j,f] and delta;
            }
        }
    }
}
The algorithm walks through the processes in topological order, as they appear in the task graph. The worst-case response time wi is computed using the response-time equation above (EQ 6-6), with the modification that the term inside the summation takes the request phase into account.

After we have computed wi, we then compute the phases relative to latest[Pi, finish]. There are two cases to consider: Pj preempts Pi, and Pj does not preempt Pi. Updating the request phases requires looking ahead one iteration, examining the successors Pk of Pi. If delta >= 0, there is slack between the latest finish time of Pi and the latest request time of Pk. If delta < 0, the request phase must be updated accordingly.

These relationships are illustrated in Figure 6-11, below.

Figure 6-11. Relationships related to phases.

The MaxSeparations() procedure uses knowledge of combinations of process executions that cannot occur together to tighten the bounds on execution time. It uses the results of the phase analysis performed in LatestTimes() and EarliestTimes() to check the separations between the phases of processes, taking preemption into account.

Max constraints model the relationship between a process and its predecessors: the initiation time of a process is the maximum of the finish times of its predecessors.

Max constraints are harder to solve than the linear constraints imposed by data dependencies. We can use a modified version of an algorithm developed by McMillan and Dill [McM92] to solve these constraints.
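
The max constraint itself is easy to state in code when delays are fixed; the sketch below simply propagates request times forward through a made-up task graph in topological order. The hard part that the McMillan and Dill style algorithm addresses, handling timing bounds and preemption, is not shown:

# Propagate max constraints through a made-up task graph: each process may
# request execution only after all of its predecessors have finished.
preds = {               # predecessors of each process (illustrative task graph)
    "P1": [],
    "P2": ["P1"],
    "P3": ["P1"],
    "P4": ["P2", "P3"],
}
exec_time = {"P1": 3, "P2": 5, "P3": 2, "P4": 4}

request, finish = {}, {}
for p in ["P1", "P2", "P3", "P4"]:           # already in topological order
    request[p] = max((finish[q] for q in preds[p]), default=0)
    finish[p] = request[p] + exec_time[p]

print(request)   # {'P1': 0, 'P2': 3, 'P3': 3, 'P4': 8}
print(finish)    # {'P1': 3, 'P2': 8, 'P3': 5, 'P4': 12}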

To read Part 1, go to  The role of the operating system
Next in Part 3: Event Driven Scheduling Analysis

Used with the permission of the publisher, Newnes/Elsevier, this series of five articles is based on copyrighted material from "High-Performance Embedded Computing," by Wayne Wolf. The book can be purchased on line.

Wayne Wolf is professor of electrical engineering at Princeton University. Prior to joining Princeton he was with AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.

References:

[Gar79] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, 1979.
[Sto77] Harold S. Stone, "Multiprocessor scheduling with the aid of network flow algorithms," IEEE Transactions on Software Engineering, SE-3(1), January 1977.
[Bha97] Shuvra S. Bhattacharyya, Sundararajan Sriram, and Edward A. Lee, "Optimizing synchronization in multiprocessor DSP systems," IEEE Transactions on Signal Processing, 45(6), June 1997.
[Mat98] Anmol Mathur, Ali Dasdan, and Rajesh K. Gupta, "Rate analysis for embedded systems," ACM Transactions on Design Automation of Electronic Systems, 3(3), July 1998.
[Ram90b] Krithi Ramamritham, "Allocation and scheduling of complex periodic tasks," Proceedings, 10th International Conference on Distributed Computing Systems, IEEE, 1990.
[Leh89] J. Lehoczky, L. Sha, and Y. Ding, "The rate monotonic scheduling algorithm: exact characterization and average case behavior," Proceedings, IEEE Real-Time Systems Symposium, IEEE, 1989.
[Yen98] Ti-Yen Yen and Wayne Wolf, "Performance analysis of distributed embedded systems," IEEE Transactions on Parallel and Distributed Systems, 9(11), November 1998.
[McM92] K. McMillan and D. Dill, "Algorithms for interface timing verification," Proceedings, IEEE International Conference on Computer Design, 1992.

