Embedded Systems

Assembly-Language Control-Flow Graphing

By Edward J. Beroset, September 01, 1998

Ed presents a tool that automates the task of creating a control-flow diagram for assembly-language programs. Although written for the Mitsubishi 740 series processor, the tool can be adapted for other controllers as well.

Sep98: Assembly-Language Control-Flow Graphing

Edward is an engineer for ABB Power Transmission and Distribution Inc. in Raleigh, North Carolina. He can be contacted at [email protected].

Application programmers using high-level languages such as Java or C++ typically have an array of graphical tools to aid the development effort. Embedded-systems programmers -- particularly those of us who write in assembly language -- have relatively few such tools. This is somewhat ironic, since assembly language is harder to read and understand because it contains even fewer visual cues to help us discern logical structure.

In what situations might you need to understand the structure of assembly-language programs? To answer this question, imagine the following scenario:

You're an embedded-systems programmer, working on a firmware project with an aggressive schedule. Marketing decides that it needs "one small change" in the way a particular feature operates, and it's your job to figure out how to implement that change without breaking anything else. There are 40,000 lines of assembly-language code in the project and the 32-KB ROM you're using has a little less than a 100 free bytes left. You quickly identify the locality of the code change, but the control flow of the code is complex. The code is well commented, but it's assembly language after all, and things like breaks out of FOR loops aren't as easy to spot as they would be in Fortran, C++, or Java.

In fact, this is the situation I recently found myself in, and I realized that what I needed was a graphical "map" to understand the program's structure. To that end, I wrote the tool presented in this article. It is a Perl script that reads through assembly-language source code and creates a PostScript map of the control flow.

A Solution

When studying assembly-language source code, one technique is to print the code on paper and use a pencil to draw lines and arrows to indicate the control flow. This is a useful aid, but time consuming. The tool I present here uses basically the same technique, but fully automates it. An implementation of a bubble sort (see Listing One) as written for a Mitsubishi 740 series processor. When the diagramming tool is run on this code, the result is a control-flow diagram that contains an extract of the logical structure of the code (see Figure 1).

At the upper-left corner of the diagram is the source-code filename. The rest of the page contains three corresponding columns. The right column shows the lines of code relevant to the control structure of the input source code. These lines include conditional and unconditional branches, branch destinations (labels, for instance), and calls to and returns from subroutines. In the left column are the corresponding source-code line numbers, which are important for two reasons. First, line numbers help in matching up the control diagram with the original source. Second, because the nodes are spaced equally down the page, the line numbers give a rough indication of how far apart the control points are in the actual source. The center column is the graph that shows control flow. Each node on the graph corresponds with the aligned control-flow statement in the right column. Sequential execution of instructions is depicted by a vertical line. Forward branches are depicted as a sweeping curve to the right, and backward branches are depicted as a sweeping curve to the left. Using these simple lines, it's possible to construct many of the usual control-flow constructs used in assembly language. An unconditional branch is shown by a node with a sweeping curve to the left or right with no vertical line beneath. Conditional branch nodes have both a vertical segment and a sweeping curve. A return from subroutine node has no lines emanating from it, since it's not possible to determine from whence a subroutine was called.

The diagram is created in two distinct steps. First, a Perl script scans the input text and creates a PostScript page description. Second, some program or device (a PostScript-capable printer or display program) interprets the PostScript page description. Each line of the source code is read and scanned to determine whether it's a control-flow statement. If it's not a control-flow statement, it's not relevant to the drawing and is skipped. Otherwise, the statement is categorized into code labels, unconditional jumps, conditional jumps, subroutine calls, or subroutine ends. A statement could be in more than one category if, for example, an assembly-language source line containing a subroutine call also had a code label.

The vertical straight line segments representing sequential control flow are simple to generate from these categories by following this rule: Draw a vertical line segment below this point unless the category is an unconditional jump or subroutine end.

The curved line segments representing jumps, and calls are somewhat more complicated, since each jump or call must be matched to its corresponding target (code label) so that both ends of the curved line segment can be specified. Since the code is read sequentially, a jump target could be known before the jump or vice versa. To prevent the tool from having to make several passes through the source code, jumps, calls, and labels are stored in associative arrays and processed only after the first pass is complete. An associative array is just like the usual kind of array, except that instead of referencing array members by numerical index, the index can be a string or integer. In this application, a label name is used as an index value when matching jumps or calls to targets.

After the straight and curved lines have been resolved, the program creates the PostScript output by sending out a boilerplate header followed by descriptions for all straight line segments, then curved segments. The PostScript file ends with the command "showpage," which tells a PostScript interpreter to actually render the page.

About Perl and PostScript

I chose Perl to write this program because of its excellent string-handling functions (including regular expression matching) and built-in associative array capability.

I chose PostScript as the output format for the program because it provides a relatively simple means of describing complex figures. Although it's often overlooked as "just another printer format," it is actually a full programming language, with syntax quite similar to Forth.

Perhaps as importantly, implementations of Perl and PostScript are both ubiquitous and available free of charge. The same Perl code runs on my laptop under Windows 95 and on my desktop machine under Linux, and can be shown on screen (using Ghostscript) or sent directly to a PostScript-capable laser printer.

The Perl Script

The Perl script diag.pl (see Listing Two) takes one or more command-line arguments. Each argument is interpreted as the name (or full path specification) of an assembly-language source code file. Notice the line my $showsubs = 1; within the first few lines of code. When using this program on complex software with many subroutine calls, I discovered that the diagram sometimes became too crowded. Since, in my project, the subroutines mostly just did their work and returned, I incorporated a method to turn off the drawing of subroutine calls. I find I rarely want or need to change this setting, so I used a simple static variable instead of a command-line option. To turn off the drawing of subroutine calls, simply set the value to 0 instead of 1. The main part of the program passes each specified source-code filename to the processFile subroutine that does all of the work.

In my environment, the assembly- language source files have a file extension of .asm, and I wanted to create the output PostScript files with an extension of .ps so the processFile routine is written to strip ".asm" from the input file specification and change it to ".ps" to create an output file.

The while(<SOURCEFILE>) line within the processFile subroutine reads in the source file, one line at a time. As each line is read, the script uses regular expression matching to attempt to classify the line into one of five previously listed categories. The regular expressions for these categories are stored in variables named $label, $ujump, $cjump, $subcall, and $subend. A variable $gotone is used to remember if one or more of these patterns was successfully matched and is initially set to zero for each line. If a code label is found, the $gotone variable is set to 1, and the label is added to an associative array (an associative array is often called a "hash" in Perl parlance), which associates the label with the current control node number. If a jump or call is found, the Perl script extracts the destination name (for matching to a label later on) and sets the $gotone variable to 1; and if a subroutine end is found, that fact is noted by setting the $gotone variable to one without any further action.

At the end of all pattern match attempts, if no pattern was matched, the $gotone variable is equal to zero and the source-code line is rejected because it's unrelated to control flow. If one or more patterns were matched, the source line is modified to render it printable from within PostScript. This modification adds backslashes in front of any parentheses and backslashes in the input source line to ensure that the PostScript interpreter doesn't misinterpret these characters.

After any necessary modifications are done to the source line, the script decides what kind of line(s) to draw based on the identified category. First, a PostScript fragment to draw the source-code line number and line are saved to the @outrecords array for later printing. Next, if it's neither an unconditional jump nor a return, then there is a straight vertical line segment beneath, so another fragment to draw a line is added to the @outrecords array. Finally, if the input line was a jump or call, the jump or call destination label is saved to the %jumplist hash.

When the whole assembly-language source file has been read, the code iterates through the %jumplist hash and attempts to find a matching target in the %nodeof hash. If no match is found, the target is assumed to be outside the current file and a special dead end node is added to the list. The dead end node is just a straight horizontal line with the name of the target label printed at one end. In either case, every item in the %jumplist hash causes one more code fragment to be added to the @outrecords array.

The PostScript file is written with a single print statement. Perl supports a "here-document" approach to printing in which a pair of programmer-chosen delimiters brackets a multiline file that is printed directly to the chosen output file. Each line below the print OUTFILE << "EOM"; up to the last line that contains EOM is printed as-is to the output file with a few exceptions.

The exceptions are the Perl variables $recnum, $sourcefile, and @outrecords, which appear in Listing Two. Perl substitutes the actual values for these variable names before printing. For $recnum, this is an integer representing the number of nodes on the drawing. In this context, it's used primarily to scale the node size so the drawing will fit on the page and so that the nodes are all evenly spaced down the page. $sourcefile contains the name of the source file from which the drawing is made. The most powerful construct is embodied in the @outrecords line; Perl prints the entire array with single spaces separating elements.

The PostScript Output

The PostScript file created by the tool for the bubble sort sample code in Listing One is in Listing Three. Logically, it consists of three sections -- the header, fixed definitions, and variable data section.

The first line of the header identifies the file as a PostScript Version 2.0 file. Although there are newer versions of PostScript defined, Version 2.0 is sufficient for this program and PostScript versions are designed to be backwards compatible. The next few lines announce the PostScript program creator and show the name of the PostScript source file. The BoundingBox and DocumentPaperSizes both refer to the size of paper on which the file is intended to be printed. A PostScript-capable printer equipped with several trays with different sized paper can use this information to figure out which paper to use. The BoundingBox describes the coordinate system of the paper, which is hardcoded in this application to set the lower-left corner to 0,0 and the upper-right hand corner as 612,792. PostScript's native unit of measure is the point, which is defined as 1/72 inch, so 612,792 is identical to 8.5×11 inches, which is the size of paper I use. The header section ends with the %%EndComments line.

The fixed definitions section contains a number of common PostScript idioms. PostScript uses a postfix notation, which is similar to the reverse Polish notation (RPN), employed by some scientific pocket calculators. For example, to compute 4×7 in PostScript, you might write 4 7 mul, which causes the PostScript interpreter to push 4 on the stack, then push 7 on the stack, and then execute the mul operator. The mul operator removes two numbers from the stack, multiplies them, and places their product back on the stack. Forth programmers will find PostScript familiar.

The line /inch { 72 mul } def is an example of a literal name definition. The / (slash) indicates that the text string that immediately follows, in this case "inch," is a literal name. The body of the definition is contained with curly braces, and the def indicates that the definition of the literal name is added to the dictionary currently in use. Thereafter, "inch" is an executable name such that the fragment "11 inch" is interpreted as though it were "11 72 mul," which converts inches into points.

The pagewidth and pageheight constants are defined, as well as a number of operations used in creating the drawings. The sweep operation takes the starting and ending coordinates from the stack and uses the curveto operator to create a Bezier curve to represent a branch.

The segment operator is used to draw the straight vertical line segments, and the deadend operator is used for showing jumps or calls to other modules. The linelabel operator draws the line number and the text of the assembly-language source line. The showtitle operator draws the title on the page and selects the appropriate fonts as a side effect. All the definitions are created but no actual code is executed until after the comment line that reads "main program begins." The first thing done is to translate the coordinate system origin to one that simplifies the math for the rest of the program. After that, a few scaling factors are calculated and the rest of the program consists of lines created by the Perl script.

The comment "variable data starts here" marks the start of the lines that define the actual drawing. After the number of nodes is set to the appropriate count, showtitle is invoked to draw the title and set up the font for the rest of the drawing. In PostScript, string literals are indicated by enclosing them in parentheses, so the line "(bubble.asm) showtitle" pushes the literal "bubble.asm" onto the stack and then executes the previously defined showtitle operator.

What follows are the definitions to draw each node's line number and matching source-code line (although long source lines will be clipped). Interlaced with these are invocations of the segment operator to draw the vertical lines between nodes. When all text and straight lines are done, the curved lines are drawn using the sweep operator. Finally, the PostScript program ends with the showpage operator to cause the marks to actually appear on the printed page or display.

Conclusion

While this tool was written for use with Mitsubishi 740 assembly language, it can be adapted to other assembly languages with little effort. Adaptation would involve little more than changing the regular expressions and the small part of the code that identifies the target of a jump or call. The tool currently scales everything to fit onto a single page, so large files with complex control paths can result in diagrams that print the text in a microscopic font. However, this too can be a useful diagnostic since it usually indicates that the file should be broken up into multiple files and/or that the control paths should be examined for simplifications.

DDJ

Listing One

;--------------------------------; bubble sort routine for Mitsubishi 740 series microprocessor
; Entry: list = zero page pointer to unordered list of integers
;              the first element in the list is the length
; Exit: the list is sorted
;-------------------------------- 
sort:   ldx #0                  ; clear the "swapped" flag
        lda (list,x)            ; fetch length
        tay                     ; y = length of list
loop    lda (list),y            ; a = list[y]
        dey                     ; decrement index v'ble
        beq finish              ; skip out if done with list
        cmp (list),y            ; is list[y] > list[y-1] ?
        bcs loop                ; if so, keep going
        tax                     ; save list[y] in x reg
        lda (list),y            ; fetch the other list element
        iny                     ; increment pointer
        sta (list),y            ; store the larger element
        txa                     ; remember the smaller one
        dey                     ; decrement pointer again
        sta (list),y            ; store that element
        ldx #1                  ; set the "swapped" flag
        bra loop                ; loop back
finish  txa                     ; were any elements exchanged?
        bne sort                ; if so, go through the list again
        rts                     ; otherwise we're done

Back to Article

Listing Two

#!/usr/bin/perl# This Perl program is designed to go through a number of source code files
# and extract the following information:
#   1.  labels
#   2.  calls
#   3.  jumps
#   4.  conditional jumps
# written on Thu  03-27-1997  by Edward J. Beroset
#----------------------------------------------------------------------------
use strict;
my $showsubs = 1;
my $file;
foreach $file (@ARGV) {
    &processFile($file);
}
#----------------------------------------------------------------------------
# This subroutine processes each individual file
#----------------------------------------------------------------------------
sub processFile {
    my $gotone;
    my %nodeof;
    my %jumplist;
    my $outfile;
    my ($label, $ujump, $cjump, $subcall, $subend);
    my ($category, $modline, $destination, $targetlabel, $nodenum);
    my $linenum  = 0;
    my $recnum   = 1;
    my @outrecords = ();
    my $sourcefile = pop;
    # open the input and output files.
    # We create form the output filespec from the input filespec by
    # changing the file extension from .asm to .ps
    open SOURCEFILE, $sourcefile or die "Can't open $sourcefile: $!\n";
    ($outfile = $sourcefile) =~ s/\.asm/\.ps/i;
    open OUTFILE, ">$outfile"  or die "Can't open $outfile: $!\n";
    # here are the regular expressions which describe how to identify
    # the control elements in our assembly language source file
    $label  = '^([a-zA-Z_]\w*):?[^=;\.]*(;|$)';
    $ujump  = '^\w*\s*:?\s*[bjBJ][rmRM][apAP]\s+(\(?\\\\?\w+\)?)\W?';
    $cjump  = '^\w*\s*:?\s*b[bcnepmvBCNEPMV][cseqliCSEQLI]\s+(\w*),?\s*([^,;
                                                \t]*),?\s*(\\\\?[^,; \t]*)';
    $subcall = '^\w*\s*:?\s*jsr\s+(\\\\?\w+\W)';
    $subend  = '^\w*\s*:?\s*[rR][tT][iIsS]';
    # iterate through every line of the source code file


</p>
    while(<SOURCEFILE>) {
        $linenum++;
        chop;
        $gotone = 0;
        # find code label
        if (/$label/o) {
            $category = "T";
            $nodeof{$1} = $recnum;  # add label to hash
            $gotone = 1;
        }
        # find call
        if (/$subcall/o) {
            $category = "S";
            $destination = $1;
            $gotone = 1;
        }
        # find unconditional jump
        if (/$ujump/o) {
            $category = "U";
            $destination = $1;
            $gotone = 1;
        }
        # find conditional jumps
        if (/$cjump/o) {
            $category = "C";
            if ($3) {
                $destination = $3;
            } elsif ($2) {
                $destination = $2;
            } else {
                $destination = $1;
            }
            $gotone = 1;
        }
        # find subroutine ends
        if (/$subend/o) {
            $category = "R";
            $gotone = 1;
        }
        # if we found a flow control item, add it to our lists
        if ($gotone) {
            # first, render the line printable by PostScript
            # by replacing all single backslashes with double backslashes
            ($modline = $_) =~ s/\\/\\\\/g;
            # ... then replacing all left parens with backslash left paren
            $modline =~ s/\(/\\\(/g;
            # ... then replacing all right parens with backslash right paren
            $modline =~ s/\)/\\\)/g;
            if ($showsubs || ($category ne "S"))
            {
                # remember this record
                push @outrecords, "$recnum ($linenum) ($modline) linelabel\n";


</p>
                # if it's a neither an unconditional jump nor a return,
                #  then there is a straight vertical line segment beneath
                if (($category ne "U") && ($category ne "R")) {
                     push @outrecords, "$recnum $recnum 1 add segment\n";
                }
                # if it's some kind of jump or call,
                #   then remember that we'll need a curved line later
                if (($category eq "C") || ($category eq "U") || 
                                                   ($category eq "S")) {
                    $jumplist{$recnum} = $destination;
                }
                $recnum++;
            }
        }
    }
    close(SOURCEFILE);
    while (($nodenum, $targetlabel) = each %jumplist) {
        if ($destination = $nodeof{$targetlabel}) {
            push @outrecords, "$nodenum $destination sweep\n";
        }
        else {
            push @outrecords, "($targetlabel) $nodenum deadend\n";
        }
    }
    print OUTFILE << "EOM";
%!PS-Adobe-2.0
%%Creator:  diag1.pl version 1.0 Copyright 1997 Edward J. Beroset
%%Title:  $outfile
%%BoundingBox 0 0 612 792
%%DocumentPaperSizes: Letter
%%EndComments
% ----- some definitions are in order...
/inch { 72 mul } def


</p>
/pagewidth 8.5 inch def
/pageheight 11 inch def


</p>
/sweepdict 3 dict def
/x {2 inch} def


</p>
/sweep
{
    sweepdict begin
    /first  exch node def
    /second exch node def
    /bbox second first sub -2 div def
    x first moveto
    x bbox sub first x bbox sub second x second curveto
    x first moveto
    stroke
    end
} def
/segment
{
    node x exch moveto
    node x exch lineto
    stroke
} def
/deadend
{                           % (comment) n
    node dup                % n' n'
    x exch moveto           % n' x n' (moveto x n')
    nodescalar x add exch   % nodescalar+x n'
    lineto                  %
    dup stringwidth         % (comment) wx wy
    pop                     % (comment) wx
    0.5 node -0.3 mul       % (comment) -wx -charheight
    rmoveto show            % show that string
    stroke                  %
} def
/linelabel
{
    % get ready to show comment
    3 -1 roll node dup pagewidth 2 div exch moveto
    exch show               % show the comment
    0 exch moveto           % ready to show line number
    show                    % show the line number
} def
/showtitle
{
    /Helvetica-Bold findfont
    16 scalefont setfont
    0 16 moveto
    show
    /Times-Roman findfont
    -0.5 node scalefont
    setfont
} def


</p>
% ------ main program begins
1.0 inch pageheight 1.0 inch sub translate
newpath


</p>
/nodescalar { pageheight 2 inch sub -1 mul numnodes div } def
/node { nodescalar mul } def


</p>
% --- variable data starts here
/numnodes $recnum def
($sourcefile) showtitle
@outrecords
showpage
EOM
    close(OUTFILE);
}

Back to Article

Listing Three

%!PS-Adobe-2.0
%%Creator:  diag1.pl version 1.0 Copyright 1997 Edward J. Beroset
%%Title:  bubble.ps
%%BoundingBox 0 0 612 792
%%DocumentPaperSizes: Letter
%%EndComments
% ----- some definitions are in order...
/inch { 72 mul } def


</p>
/pagewidth 8.5 inch def
/pageheight 11 inch def
/sweepdict 3 dict def
/x {2 inch} def
/sweep
{
    sweepdict begin
    /first  exch node def
    /second exch node def
    /bbox second first sub -2 div def
    x first moveto
    x bbox sub first x bbox sub second x second curveto
    x first moveto
    stroke
    end
} def
/segment
{
    node x exch moveto
    node x exch lineto
    stroke
} def
/deadend
{                           % (comment) n
    node dup                % n' n'
    x exch moveto           % n' x n' (moveto x n')
    nodescalar x add exch   % nodescalar+x n'
    lineto                  %
    dup stringwidth         % (comment) wx wy
    pop                     % (comment) wx
    0.5 node -0.3 mul       % (comment) -wx -charheight
    rmoveto show            % show that string
    stroke                  %
} def
/linelabel
{
    % get ready to show comment
    3 -1 roll node dup pagewidth 2 div exch moveto
    exch show               % show the comment
    0 exch moveto           % ready to show line number
    show                    % show the line number
} def
/showtitle
{
    /Helvetica-Bold findfont
    16 scalefont setfont
    0 16 moveto
    show
    /Times-Roman findfont
    -0.5 node scalefont
    setfont
} def
% ------ main program begins
1.0 inch pageheight 1.0 inch sub translate
newpath
/nodescalar { pageheight 2 inch sub -1 mul numnodes div } def
/node { nodescalar mul } def
% --- variable data starts here
/numnodes 9 def
(bubble.asm) showtitle
1 (14) (sort:   ldx #0          ; clear the "swapped" flag) linelabel
 1 1 1 add segment
 2 (17) (loop    lda \(list\),y     ; a = list[y]) linelabel
 2 2 1 add segment
 3 (19) (        beq finish     ; skip out if done with list) linelabel
 3 3 1 add segment
 4 (21) (        bcs loop       ; if so, keep going) linelabel
 4 4 1 add segment
 5 (30) (        bra loop       ; loop back) linelabel
 6 (31) (finish  txa            ; were any elements exchanged?) linelabel
 6 6 1 add segment
 7 (32) (        bne sort       ; if so, go through the list again) linelabel
 7 7 1 add segment
 8 (33) (        rts            ; otherwise we're done) linelabel
 3 6 sweep
 4 2 sweep
 5 2 sweep
 7 1 sweep


</p>
showpage

Back to Article

DDJ

1 2 Next

More Insights

INFO-LINK


	To upload an avatar photo, first complete your Disqus profile. \| View the list of supported HTML tags you can use to style comments. \| Please read our commenting policy.

Embedded Systems

Assembly-Language Control-Flow Graphing

A Solution

About Perl and PostScript

The Perl Script

The PostScript Output

Conclusion

Listing One

Listing Two

Listing Three

Related Reading

More Insights

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content

Embedded Systems

Assembly-Language Control-Flow Graphing

A Solution

About Perl and PostScript

The Perl Script

The PostScript Output

Conclusion

Related Reading

More Insights

White Papers

Reports

Webcasts

Currently we allow the following HTML tags in comments:

Single tags

Matching tags

Embedded Systems Recent Articles

Most Popular

This month's Dr. Dobb's Journal

Upcoming Events

Featured Reports

Featured Whitepapers

Most Recent Premium Content