What's Different About Multiprocessor Software? Part 2


In general, multiprocessor scheduling is NP-complete [Gar79]. That is, if we want to minimize total execution time on an arbitrary multiprocessor, we have no known way to find the shortest schedule in polynomial time.

Of course, many NP-complete problems have useful approximations and heuristics. By taking advantage of what we know about the multiprocessor structure, by limiting the combinations of process executions that we consider, or by other simplifications, we can create a number of simplified but useful multiprocessor scheduling problems. For example, two-processor multiprocessors can be scheduled optimally under some conditions.

One of the first multiprocessor scheduling algorithms was developed by Stone [Sto77]. Although he referred to the problem as a scheduling one, it is more accurately described as an allocation problem, since it selects the CPUs on which processes execute and determines only implicitly the times at which they execute.

Stone's model considered a network of heterogeneous processors; he solved the allocation problem using network flow algorithms, finding an exact solution for the two-processor case and heuristics for systems with arbitrary numbers of processors.

Figure 6-3 Models for Stone's multiprocessor scheduling algorithm. After Stone

As shown in Figure 6-3 above, the problem is specified in two parts. An intermodule connection graph describes the time cost of communicating between two processes that are assigned to different processors; communication between processes on the same processor has zero cost.

The execution time table specifies the execution time of each process on each processor; it is possible that not all processes will be able to run on both processors.

The minimum running time balances the communication cost and the execution cost. Stone formulates the scheduling problem as one of finding a cutset of a modified version of the intermodule connection graph.

Two additional nodes are added to represent the two processors. One such node is the source of the graph (representing CPU 1) and the other is the sink (representing CPU 2).

Edges are added from each non-sink node to the source and the sink. The weight of an edge to the source is equal to the cost of executing that node's module on CPU 2 (the sink); the weight of an edge to the sink is equal to the cost of executing that node's module on CPU 1 (the source).

The cutset divides the intermodule connection graph into two sets, with the nodes in each set being assigned to the same processor. The weight of a cutset is the cost of the assignment of nodes to the two processors given by that cutset. To find the allocation that minimizes the total execution time, we find a minimum-weight cutset, which can be done by solving a maximum flow problem on the graph.
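
As a rough illustration of the cost function that the minimum-weight cutset captures, the following sketch evaluates every possible two-processor assignment of a tiny, made-up module set by brute force. The module names, execution times, and communication costs are hypothetical, and Stone's actual method uses a max-flow computation rather than enumeration:

# Sketch of the cost that Stone's minimum-weight cutset captures, evaluated
# by brute force on a tiny made-up example. Stone's algorithm finds the same
# minimum with a max-flow/min-cut computation instead of enumeration.
from itertools import product

# exec_time[module] = (time on CPU 1, time on CPU 2)
exec_time = {"A": (5, 10), "B": (2, 4), "C": (4, 3), "D": (6, 1)}

# comm_cost[(m1, m2)] = cost paid only if m1 and m2 run on different CPUs
comm_cost = {("A", "B"): 3, ("B", "C"): 2, ("C", "D"): 4, ("A", "D"): 1}

modules = list(exec_time)

def assignment_cost(assign):
    """Total execution cost plus the cost of edges cut by the assignment."""
    cost = sum(exec_time[m][assign[m]] for m in modules)
    cost += sum(c for (m1, m2), c in comm_cost.items()
                if assign[m1] != assign[m2])
    return cost

best = min(
    (dict(zip(modules, cpus)) for cpus in product((0, 1), repeat=len(modules))),
    key=assignment_cost,
)
print("best assignment:", best, "cost:", assignment_cost(best))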

Stone extended the problem to n processors by generalizing the notion of a cutset. The generalized cutset divides the graph into n disjoint subsets such that no proper subset of a cutset is also a cutset.

He also generalized the distinguished nodes, using n of them (one per processor) rather than just a source and a sink. His heuristic for this problem iteratively applied two-processor assignments to build up the n-processor assignment.

Why static tasks?
Many embedded systems statically allocate processes to processing elements. We can efficiently find bounds on the execution time of the processes in these multiprocessor systems. We will assume that there is a set of processes with data dependencies between them; in general, they can form one or more subtasks.

We will also assume that each CPU schedules processes using rate monotonic scheduling. Although we can easily figure out the schedule if we don't have data dependencies, the combination of data dependencies and rate monotonic scheduling makes the problem more challenging, although tractable.

Minimizing buffer sizes. Bhattacharyya et al. [Bha97] developed methods to efficiently schedule synchronous data flow graphs on multiprocessors. Figure 6-4 below shows an SDF graph with the nodes assigned to processing elements in a multiprocessor. We are primarily interested in the communication between PEs, since we can schedule each SDF on a processor using other methods that produce a sequential schedule.

Figure 6-4 Models for multiprocessor communication.

We model the system using an interprocessor communication modeling graph (IPC graph), also shown in the figure. The IPC graph has the same nodes as the SDF graph and all of its edges, plus additional edges that model the sequential schedule on each PE; these added edges are shown by dashed lines in the figure.

The edges in the allocated SDF graph that cross processor boundaries are known in the IPC graph as IPC edges because they define interprocessor communication. Any communication across an IPC edge must use an interprocess communication mechanism to cross the boundary between the processors.

We can determine whether communication across each IPC edge is bounded; edges that are not part of a strongly connected component (SCC) are not bounded. When implementing interprocess communication on an unbounded edge, we must use a protocol that ensures the number of tokens crossing the edge never exceeds a predetermined buffer size; interprocess communication on bounded edges can use a simpler protocol.
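
A minimal sketch of this classification, under the assumption that an IPC edge is bounded exactly when its sink can reach its source (that is, the edge lies on a cycle); the example graph is made up for illustration:

# Classify IPC edges as bounded or unbounded with a simple reachability test:
# an edge lies in a strongly connected component exactly when its sink can
# reach its source. The graph below is illustrative, not from the figures.
from collections import deque

graph = {            # adjacency list of the IPC graph
    "a": ["b"],
    "b": ["c"],
    "c": ["a", "d"], # a-b-c form a cycle; d lies outside any cycle
    "d": [],
}

def reaches(graph, start, target):
    """True if there is a directed path from start to target."""
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for succ in graph[node]:
            if succ not in seen:
                seen.add(succ)
                queue.append(succ)
    return False

for src, dsts in graph.items():
    for dst in dsts:
        bounded = reaches(graph, dst, src)   # can the sink get back to the source?
        print(f"edge {src}->{dst}: {'bounded' if bounded else 'unbounded'}")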

The IPC graph may have some redundant edges. An edge e is redundant if there is another path from source(e) to sink(e) whose total delay does not exceed the delay along e, since that path already enforces the same synchronization. All redundant edges can be removed, and the order in which we remove them does not affect how many we can eliminate. The asymptotic iteration period T for a strongly connected IPC graph G is

    T = max over cycles C in G of ( Σ_{v on C} t(v) ) / delay(C)

where C is a cycle through the graph, t(v) is the execution time of a node v, and delay(C) is the sum of the delays around the cycle C. The ratio for a single cycle is known as its cycle mean. The maximum cycle mean of an IPC graph, λmax, is the largest cycle mean for any SCC in the graph. A cycle whose cycle mean is equal to the maximum is known as a critical cycle.
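
The following sketch evaluates the cycle mean and the maximum cycle mean for a small, made-up IPC graph; the cycles are listed by hand rather than enumerated, and all execution times and delays are illustrative:

# Cycle mean and maximum cycle mean for a tiny made-up IPC graph. A real tool
# would enumerate cycles (or use a maximum-cycle-mean algorithm) instead of
# listing them explicitly.
exec_time = {"a": 3, "b": 2, "c": 4, "d": 1}       # t(v) for each node
delay = {("a", "b"): 0, ("b", "a"): 1,             # delay on each edge
         ("b", "c"): 0, ("c", "d"): 0, ("d", "b"): 2}

cycles = [["a", "b"], ["b", "c", "d"]]             # the cycles of this example

def cycle_mean(cycle):
    """(sum of node execution times) / (sum of edge delays) around the cycle."""
    nodes_time = sum(exec_time[v] for v in cycle)
    edges = list(zip(cycle, cycle[1:] + cycle[:1]))
    return nodes_time / sum(delay[e] for e in edges)

lambda_max = max(cycle_mean(c) for c in cycles)
print("cycle means:", [round(cycle_mean(c), 2) for c in cycles])
print("maximum cycle mean (lambda_max):", lambda_max)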

We can construct a strongly connected synchronization graph by adding edges between strongly connected components. We add edges that chain together source SCCs, edges that chain together sink SCCs, and an edge that connects the overall sink of the graph to the source.

(A strongly connected component is a source SCC if any edge whose sink is in the strongly connected component also has its source in the strongly connected component. A sink SCC is such that any edge whose source is in the SCC also has its sink in the SCC.)

We need to add delays to the edges, corresponding to buffer memory, that ensure the system will not deadlock while minimizing the sum of the buffer bounds over all the IPC edges. The added edges can be divided into disjoint sets that help organize the graph and determine these delays.

Delays can be added optimally if the graph has one source SCC and one sink SCC; if the graph's structure is more complex, we must fall back on heuristics. We can determine the minimum delay on each edge that ensures that the graph's maximum cycle mean is not exceeded.

Figure 6-5 RATAN process model.

Mathur et al. [Mat98] developed the RATAN tool for analyzing the rates of multitasking systems. As shown in Figure 6-5 above, each process is modeled as a single-threaded CDFG-style graph. Data dependencies may extend from a node within one process to a node within another process. When we abstract away the process internals and look only at the processes and the edges between them, each control edge is labeled with [min, max] delays measured from the activation signal for the process to the start of its execution.

Those bounds are specifications on the allowable delay; our goal is to find execution rates for the processes that satisfy these bounds. A process starts to execute after all of its enable signals have become ready. If we denote the delay of an edge i -> j in the graph as dij, then the delay of a cycle C in the process graph is given by

    d(C) = Σ over edges (i -> j) in C of dij

The mean delay of a cycle is given by

    d(C) / |C|

where |C| is the number of edges in C. The maximum mean cycle delay is known as lambda. In a strongly connected graph, all nodes execute at the same rate, namely lambda.
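
A few lines suffice to instantiate these reconstructed formulas on a made-up cycle; the edge delays here are illustrative:

# Cycle delay d(C) and mean cycle delay d(C)/|C| for a made-up process graph.
d = {("p1", "p2"): 4, ("p2", "p3"): 2, ("p3", "p1"): 6}   # delay on edge i -> j

cycle = ["p1", "p2", "p3"]                        # one cycle C in the graph
edges = list(zip(cycle, cycle[1:] + cycle[:1]))

d_C = sum(d[e] for e in edges)                    # delay of the cycle, d(C)
mean_delay = d_C / len(edges)                     # d(C) / |C|
print("d(C) =", d_C, " mean delay =", mean_delay)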

We call [rl(X), ru(X)] the lower and upper bounds on the rate of a subgraph X. If we have two maximal SCCs of the graph, P and C, and the graph has edges from P to C, then P is a producer and C is a consumer; the actual rate interval of the consumer C is therefore constrained by the rate interval of its producer P, since the consumer cannot run faster than the producer that feeds it.

Data dependencies and scheduling
The problems created by data dependencies are illustrated in Figure 6-6, below. Here, two subtasks are divided among three processors. Take, for example, processing element M1.

Figure 6-6 Preemption and scheduling.

This CPU runs two processes that will clearly affect each other's schedules. But the completion times of the processes on M1 also depend on the behavior of the processes on all the other PEs in the system. The data dependency between P1 and P2 adds M2 to the set of interrelated PEs, and the data dependency between P3 and P4 adds M3 as well.

Getting a process to run faster doesn't always help. Consider Figure 6-7 below; in this example, changing the computation time of process Px changes the response time of P3 even though they run on different processors.

Figure 6-7 Period shifting. From Yen and Wolf [Yen98] © 1998 IEEE.

Because of the data dependencies, the shortened computation time of Px causes process P2 to run sooner, preempting P3 and lengthening its response time.

Ramamritham [Ram90b] proposed scheduling over an unrolled schedule. He noted that the maximum interval that must be considered for a set of processes is the least common multiple (LCM) of their periods; in a longer schedule, the order of processes must necessarily repeat itself. He then constructed a schedule using the LCM interval.
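
For example, the length of the interval that must be scheduled is just the LCM of the periods; a one-line sketch (Python 3.9+ provides math.lcm, and the periods below are illustrative):

# Hyperperiod of a set of process periods: the schedule repeats after the LCM.
from math import lcm  # available in Python 3.9+

periods = [10, 15, 25]                # process periods, in the same time unit
hyperperiod = lcm(*periods)
print("schedule over LCM of periods:", hyperperiod)   # 150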

To develop bounds on the CPUs in the system, we can make use of a theorem from Lehoczy et al. [Leh89]. They bounded the response times for a set of independent (no data dependencies) processes running on a single CPU.

The processes are {P1, P2, ...}, with P1 being the highest-priority process. The minimum period of the ith process is pi and its worst-case execution time is ci. The worst-case response time of Pi is wi, which can be computed as the smallest non-negative root of

    x = g(x) = ci + Σ_{j = 1..i-1} ⌈x/pj⌉ cj

The ci term gives the computation time of the process itself. The ⌈x/pj⌉ terms count how many releases of each higher-priority process Pj can delay the execution of the ith process within an interval of length x. We cannot solve this equation in closed form, but we can solve it numerically.
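
One common numerical approach is fixed-point iteration, sketched below; the process parameters are illustrative and the code is not taken from [Leh89]:

# Solve the response-time equation x = ci + sum_{j<i} ceil(x/pj)*cj by
# fixed-point iteration. Processes are listed highest priority first.
from math import ceil

procs = [              # (worst-case execution time ci, period pi)
    (1, 4),            # P1, highest priority
    (2, 6),            # P2
    (3, 10),           # P3
]

def response_time(i, procs):
    """Smallest non-negative root of x = ci + sum over j < i of ceil(x/pj)*cj."""
    ci = procs[i][0]
    x = ci
    while True:
        nxt = ci + sum(ceil(x / pj) * cj for cj, pj in procs[:i])
        if nxt == x:                      # fixed point reached
            return x
        if nxt > procs[i][1]:             # exceeds the period: unschedulable
            return None
        x = nxt

for i in range(len(procs)):
    print(f"P{i+1}: worst-case response time = {response_time(i, procs)}")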

We can use the worst-case response times of the various processes to help us solve the system schedule. But this formula is not sufficient because it does not take into account the data dependencies either between the CPUs or within a CPU. To handle those data dependencies, we need to use a more complex graph algorithm.

Static scheduling algorithm. Yen and Wolf [Yen98] developed an algorithm that handles multiple CPUs and data dependencies. This algorithm models the processes to be run as a task graph that can have one or more subtasks. Each process in the task graph is given bounds for its computation time [ci lower, ci upper].

This is more general than a single, fixed computation time, but it does assume that we can strictly bound the computation time of a process. Each subtask has a period that is also modeled as an interval [pi lower, pi upper]. The architecture of the platform is modeled as a processor graph. The allocation of each process in the task graph to a processing element is given in Figure 6-8, below.

Figure 6-8 An example of reduced computation time leading to longer response times. From Yen and Wolf [Yen98].

The algorithm finds bounds on the start and the finish times of the processes. Given a process Pi, the start time is bounded by earliest[Pi, request] and latest[Pi, request]; the end time is bounded by earliest[Pi, finish] and latest[Pi, finish].

The following code summarizes the delay estimation algorithm. The maxsep[,] data structure holds the earliest[] and latest[] bounds for each process; it starts out with infinite (unbounded) times for the processes. This performance analysis algorithm iteratively tightens these bounds until they stop changing (or until a predetermined iteration limit is reached).

maxsep.lower = maxsep.upper = infinity;
step = 0; /* keep track of number of iterations */
do {
    /* use longest path algorithm to find the request and finish times */
    foreach Pi { EarliestTimes(Gi); LatestTimes(Gi); }
    /* handle max constraints */
    foreach Pi { MaxSeparations(Gi); }
    step++;
} while (maxsep has changed and step < limit);

At each iteration, the algorithm performs two types of operations. The EarliestTimes()/LatestTimes() procedures analyze data dependencies using a modified longest-path algorithm.

The MaxSeparations() procedure uses a modified max-constraint algorithm to look for combinations of process executions that cannot occur. Each of these procedures is applied separately to each subtask Pi in the task graph, but the procedures look at the execution times of all the other subtasks while updating Pi.

Figure 6-9 An example of phase adjustment and separation analysis. From Yen and Wolf [Yen98] © 1998 IEEE

To understand why we need these two steps, consider the example of Figure 6-9, above. A naive analysis might lead one to conclude that the total time required to finish both tasks is 80. But two conditions cause it to be shorter. First, P1 cannot preempt both P2 and P3 in a single execution of the task. Second, P2 cannot preempt P3 since P2 must finish before P3 can start. Therefore, the worst-case delay to execute both tasks is only 45.

Phase constraints
We will use phase constraints to summarize the relationships between a process that is preempted and the process that preempts it. The algorithm uses two types of phases.

1) The request phase, which we will call phase[i,j,r] in the code, describes the smallest interval between the execution of Pi on one iteration and the succeeding iteration of Pj.

2) The finishing phase, which we will call phase[i,j,f] in the code, describes the smallest interval between the finishing time of one iteration of Pi and the first request time of the next iteration of Pj.

Figure 6-10 Request and finishing phases.

These definitions are illustrated in Figure 6-10 above. The gray regions at the ends of the boxes represent the min/max timing bounds.

Data dependencies are one form of constraint on the relative execution times of processes. LatestTimes() and EarliestTimes() are modified longest-path algorithms that determine the timing and slack of the processes based on their data dependencies. Pseudocode for LatestTimes() is shown in the following block of code from [Yen98]; it is similar in form to the EarliestTimes() procedure.

LatestTimes(G) {
    /* initialize */
    foreach (process Pi) {
        latest[Pi,request] = 0;
        foreach (process Pj) phase[i,j,r] = 0;
    }
    foreach (process Pi in topological order) {
        wi = worst-case response time of Pi with phase adjustment phase[i,j,r];
        foreach (process Pj such that priority(Pj) > priority(Pi)) {
            latest[Pi,finish] = latest[Pi,request] + wi;
            calculate phase[i,j,f] relative to latest[Pi,finish] for each j;
            foreach (immediate successor Pk of Pi) {
                delta = latest[Pk,request] - latest[Pi,finish];
                if (latest[Pk,request] < latest[Pi,finish])
                    latest[Pk,request] = latest[Pi,finish];
                update phase[k,j,r] for each process Pj according to
                    phase[i,j,f] and delta;
            }
        }
    }
}
The algorithm walks through the processes in topological order, as they appear in the task graph. The worst-case response time wi is computed using the response-time equation above (EQ 6-6), with the modification that the term inside the summation takes the request phase into account.

After we have computed wi, we then compute the phases relative to latest[Pi, finish]. There are two cases to consider: Pj preempts Pi, and Pj does not preempt Pi. Updating the request phases requires looking ahead one iteration, examining the successors Pk of Pi. If delta >= 0, there is slack between the latest finish time of Pi and the latest request time of Pk. If delta < 0, the request phase must be updated accordingly.

These relationships are illustrated in Figure 6-11, below.

Figure 6-11. Relationships related to phases.

The MaxSeparations() procedure uses knowledge of combinations of process executions that cannot occur together to tighten the bounds on execution time. It uses the results of the phase analysis performed in LatestTimes() and EarliestTimes() to check the separations between the phases of processes, taking preemption into account.

Max constraints model the relationship between a process and its predecessors: the initiation time of a process is the maximum of the finish times of its predecessors.

Max constraints are harder to solve than the linear constraints imposed by data dependencies. We can use a modified version of an algorithm developed by McMillan and Dill [McM92] to solve these constraints.
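
The max constraint itself is easy to state in code when delays are fixed; the sketch below simply propagates request times forward through a made-up task graph in topological order. The hard part that the McMillan and Dill style algorithm addresses, handling timing bounds and preemption, is not shown:

# Propagate max constraints through a made-up task graph: each process may
# request execution only after all of its predecessors have finished.
preds = {               # predecessors of each process (illustrative task graph)
    "P1": [],
    "P2": ["P1"],
    "P3": ["P1"],
    "P4": ["P2", "P3"],
}
exec_time = {"P1": 3, "P2": 5, "P3": 2, "P4": 4}

request, finish = {}, {}
for p in ["P1", "P2", "P3", "P4"]:           # already in topological order
    request[p] = max((finish[q] for q in preds[p]), default=0)
    finish[p] = request[p] + exec_time[p]

print(request)   # {'P1': 0, 'P2': 3, 'P3': 3, 'P4': 8}
print(finish)    # {'P1': 3, 'P2': 8, 'P3': 5, 'P4': 12}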

To read Part 1, go to  The role of the operating system
Next in Part 3: Event Driven Scheduling Analysis

Used with the permission of the publisher, Newnes/Elsevier, this series of five articles is based on copyrighted material from "High-Performance Embedded Computing," by Wayne Wolf. The book can be purchased on line.

Wayne Wolf is professor of electrical engineering at Princeton University. Prior to joining Princeton he was with AT&T Bell Laboratories. He has served as editor in chief of the ACM Transactions on Embedded Computing Systems and of Design Automation for Embedded Systems.

References:

[Gar79] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, 1979.
[Sto77] Harold S. Stone, "Multiprocessor scheduling with the aid of network flow algorithms," IEEE Transactions on Software Engineering, SE-3(1), January 1977.
[Bha97] Shuvra S. Bhattacharyya, Sundararajan Sriram, and Edward A. Lee, "Optimizing synchronization in multiprocessor DSP systems," IEEE Transactions on Signal Processing, 45(6), June 1997.
[Mat98] Anmol Mathur, Ali Dasdan, and Rajesh K. Gupta, "Rate analysis for embedded systems," ACM Transactions on Design Automation of Electronic Systems, 3(3), July 1998.
[Ram90b] Krithi Ramamritham, "Allocation and scheduling of complex periodic tasks," Proceedings, 10th International Conference on Distributed Computing Systems, IEEE, 1990.
[Leh89] J. Lehoczky, L. Sha, and Y. Ding, "The rate monotonic scheduling algorithm: exact characterization and average case behavior," Proceedings, IEEE Real-Time Systems Symposium, IEEE, 1989.
[Yen98] Ti-Yen Yen and Wayne Wolf, "Performance analysis of distributed embedded systems," IEEE Transactions on Parallel and Distributed Systems, 9(11), November 1998.
[McM92] K. McMillan and D. Dill, "Algorithms for interface timing verification," Proceedings, IEEE International Conference on Computer Design, 1992.

