March 12, 2007
Debugger Performance Matters: The Importance of Good Metrics
Anderson MacKay, Green Hills Software
A brief tutorial by Green Hills' Anderson MacKay on debugging complex embedded system designs: the importance of collecting and assessing debug metrics, and of properly interpreting and acting on the results.
Debugging is the most difficult and costly phase of software development for systems large and small. Deeply embedded systems don't have the standard PC user interfaces of keyboards, mice, graphic displays or even network consoles, so you need specialized debugging tools to get the critical system information necessary to find and fix bugs.
For many systems, that access is provided by a hardware debug device that communicates with your system's microprocessors through an on-chip debug (OCD) port. These debug devices can have dramatically different performance characteristics. Your development time is valuable, so when you build your system and select a hardware debug device, consider debugging performance carefully, or you may find yourself waiting when you should be debugging.
Debugging Performance Metrics
Piece by piece, you try to pin down the problem, working through the same section of code as different scenarios are examined, tested, and then set aside. During this process, you spend a lot of time reloading or reprogramming your application, re-running the application to specific breakpoints, stepping through code, uploading logging or trace information, and examining the state of the system. Fortunately for measuring debugging performance, the time taken for these tasks is dominated by a single factor: memory access speed on your system. Reloading or reprogramming the application requires direct writes to RAM and/or non-volatile memory.
Need to read out that log of what the system was doing when it died? Want to view a peripheral's memory-mapped registers? Trying to debug your deadlocked application, looking for which of your system's tasks is holding a semaphore when it shouldn't? In all of these cases, it's memory access to the rescue. If memory access is slow, you're looking at a lot of dead time while debugging, waiting on your debugging system to catch up. As such, memory access speed is the fundamental measure of productivity and performance for debugging an embedded processor. It is also easy to measure: simply dump a large number of pseudo-random bytes from the debugging host into the memory of the system under debug and time how long it takes to complete. This gives the memory write speed of the system; read speed can be measured by timing how long it takes to read the same data back.
Why does memory access performance vary from system to system?
Performance bottlenecks lurk everywhere, but the most important ones
are the hardware debug device and the design of the microprocessor's
debug port. In addition, certain other design factors of the system under debug (not just the selection of microprocessor) can affect memory access speed. To understand this better, we must examine the details of debug-mode memory access through a debug port.
Memory Access Through a Debug Port
The JTAG standard (IEEE 1149.1) has a number of characteristics that make it well suited to in-system debugging, including access to multiple devices simultaneously and the possibility of combining software debug, manufacturing test, and device programming into a single low-pin-count connector. JTAG is a simple interface at the pin level, with a single clock called TCK driven by the debug device. Along with TCK, the debug device sends one bit of data per TCK cycle on the TDI signal and one bit of control information on the TMS signal. On each cycle the system under debug replies with a single bit of data on the TDO signal. Since only one bit can be sent and received per TCK period, the frequency of TCK is a significant factor in the performance of the JTAG interface; at a TCK frequency of 10MHz the interface can carry no more than 10 million bits per second.
On top of this simple signaling scheme, microprocessors add their own protocols for allowing memory access. Some device families require thousands of JTAG TCK periods per byte read from or written to memory, while the most efficient device families require only slightly more than 8 TCK cycles for each byte of memory accessed. On the whole, most devices add between 20% and 100% overhead in TCK periods for their most efficient memory access method, so each byte of memory read or written requires 10 to 16 JTAG TCK periods.
The topology of the system under debug can also affect memory access efficiency. One of the strengths of the JTAG standard lies in its ability to serially chain multiple devices from different manufacturers into a single scan chain that is accessible through a single debug device. This makes system-level testing, visibility and debug very convenient, but it comes at a cost. Systems with multiple devices in a scan chain incur extra overhead for each operation, which reduces throughput. A system with tens of devices chained together can easily cut the theoretical best-case memory access throughput in half.
Careful system design and signal routing are also required for a JTAG-based system to perform at its full potential. Remember that JTAG-based systems can send and receive only a single bit of data per TCK cycle, so it is very important that the system handle high TCK frequencies while maintaining the timing relationship of TCK to the other three JTAG signals. If the four high-speed JTAG signals are not treated carefully in circuit design and layout, the maximum frequency of the JTAG interface may be limited, and this will limit the maximum memory access performance of the system.
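To see why a long scan chain costs throughput, here is a rough Python model. It relies on one property of IEEE 1149.1: a device placed in BYPASS adds a single bit to every data scan that passes through it. Everything else is an assumption made for illustration: the 40-cycle access, the single data scan per access, and the decision to ignore instruction-register scans are not figures from any particular device.

```python
def chained_throughput_bytes_per_s(
    tck_hz: float,
    base_cycles_per_access: int,   # TCK cycles per access with no other devices in the chain
    bypassed_devices: int,         # other devices in the chain, each held in BYPASS
    scans_per_access: int = 1,     # assumed number of data scans per memory access
    bytes_per_access: int = 4,
) -> float:
    """Best-case memory throughput when other chained devices sit in BYPASS."""
    # Each bypassed device lengthens every data scan by one TCK cycle.
    cycles = base_cycles_per_access + bypassed_devices * scans_per_access
    return tck_hz / cycles * bytes_per_access


# Hypothetical 40-cycle access at 10MHz, alone and with 40 bypassed neighbours.
alone = chained_throughput_bytes_per_s(10_000_000, 40, 0)
chained = chained_throughput_bytes_per_s(10_000_000, 40, 40)
print(f"alone: {alone:,.0f} bytes/sec, chained: {chained:,.0f} bytes/sec")
```

Under these assumptions, 40 bypassed devices double the cycles per access and halve the throughput, which is the order of effect the text describes for chains of tens of devices.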
The ARM1176JZF-S: putting performance metrics to work
The ARM11 debug port allows arbitrary opcodes to be fed to and executed on the processor core while in debug mode, and offers a register (the Debug Data Transfer Register, or DTR) that is visible to both the processor core and the debug port. A naïve but logical way to read memory from the debug port is shown in Figure 2 below.
This works, but it is inefficient for large-scale memory access, requiring 648 JTAG clock cycles to read a single 4-byte value from memory. To put that level of efficiency into context, we can easily compute the memory access speed of a debug device when given the number of TCK cycles required per memory access:
memory access speed (bytes/sec) = TCK frequency (Hz) / TCK cycles per access × bytes per access
So this scan sequence running at a typical 10MHz JTAG clock can read memory at no more than 60.3 kilobytes per second:
10,000,000 / 648 × 4 = 61,728 bytes/sec ≈ 60.3 KB/sec (1 KB = 1,024 bytes)
This same sequence with minor changes can be used to write memory at the same efficiency. Unfortunately, 60 kilobytes per second isn't very fast. As an example, a developer with a 2.5 megabyte application would have to wait 42 seconds each time the program is downloaded. An extra 42 seconds for every new test case or scenario quickly adds up to a significant loss of expensive developer time. Fortunately, it is easy to do much better. If we only execute steps 1 through 4 once and use a load instruction with auto-increment in step 5, then we increase efficiency so only 216 cycles are used for each 4-byte load or store. Thanks to the ingenuity and forethought of the ARM11 engineering team, steps 5 and 6 can also be combined and optimized so each 4-byte load or store consumes just 41 JTAG clock cycles as shown in Figure 3, below.
Now the best-case memory transfer speed at a 10MHz JTAG clock is much faster, and debugging cycles for our hypothetical developer are practically instantaneous:
10,000,000 / 41 × 4 = 975,610 bytes/sec ≈ 952.7 KB/sec
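The arithmetic in this section can be collected into a short Python sketch that compares the three access sequences discussed above. The cycle counts, the 10MHz clock and the 2.5 megabyte image come from the text; the function names are illustrative, and 1 KB is taken as 1,024 bytes to match the article's figures.

```python
KB = 1024


def memory_throughput_kb_per_s(tck_hz: float, tck_cycles_per_access: int,
                               bytes_per_access: int = 4) -> float:
    """Best-case memory throughput for a given TCK clock and cycles per access."""
    return tck_hz / tck_cycles_per_access * bytes_per_access / KB


def download_seconds(image_bytes: int, throughput_kb_per_s: float) -> float:
    """Time to transfer an image of image_bytes at the given throughput."""
    return image_bytes / (throughput_kb_per_s * KB)


TCK_HZ = 10_000_000                      # typical 10MHz JTAG clock from the text
IMAGE_BYTES = int(2.5 * 1024 * 1024)     # the 2.5 megabyte application from the text

for label, cycles in [
    ("naive sequence", 648),
    ("auto-increment load", 216),
    ("optimized sequence", 41),
]:
    rate = memory_throughput_kb_per_s(TCK_HZ, cycles)
    seconds = download_seconds(IMAGE_BYTES, rate)
    print(f"{label:<20} {cycles:>4} cycles  {rate:7.1f} KB/sec  {seconds:5.1f} sec")
```

The three rows work out to roughly 60.3, 180.8 and 952.7 KB/sec, or about 42.5, 14.2 and 2.7 seconds for the download.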
This analysis is simple for the ARM1176JZF-S, but for other devices the process of efficient memory access is not always obvious or well documented. It is critical that debug devices use efficient memory access routines, and they must execute those routines within tight time constraints in order to achieve high performance.
Debug Device Implementation
If a debug device's microcontroller needs even as few as 20 of its own cycles per TCK clock edge and runs at 60MHz, then 1,640 microcontroller cycles (41 TCK cycles × 2 edges × 20 cycles) will be required per 4-byte shift command, and the maximum effective clock speed of the JTAG interface for this device will be only about 1.5MHz:
60,000,000 / 1,640 × 41 = 1,500,000 TCK cycles/sec = 1.5MHz
After accounting for the microcontroller handling the transfer of data to and from the debugging host, such a system is slowed further: if the microcontroller spends half its time moving data from the host, the effective TCK speed drops to 750kHz. Substituting this figure into the memory transfer speed equation for the ARM1176 (41 cycles per 4-byte load/store) yields a transfer speed of 71.5KB/sec.
One way to speed things up is to use programmable logic to transform high-level "shift commands" into the bit-level signal patterns. Using the same microcontroller example, if 100 cycles are required on average per command and each command can shift 16 JTAG TCK cycles, then each 4-byte memory access (which uses 41 JTAG TCK cycles, and so needs three commands) can be accomplished in 300 microcontroller CPU cycles. With only 300 cycles required per 4-byte memory access, and assuming that 50% of the CPU cycles are still dedicated to transferring data from the debugging host, memory access throughput increases above 390 KB/sec:
60,000,000 × 50% / 300 × 4 = 400,000 bytes/sec ≈ 390.63 KB/sec
Does actual TCK frequency matter? This microcontroller system could have a TCK clock generator capable of 50MHz or more, and throughput would still be exactly 390.63KB/sec. The throughput of the debug device has become limited by the computing power of the microcontroller and its programmable shifting logic. The only way to increase memory access performance for this debug device is to increase the computational throughput of the microcontroller, either by increasing its clock rate or by decreasing the number of cycles needed per shift command.
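The two debug device designs above can be modelled with a short Python sketch. It reproduces the article's example figures (60MHz microcontroller, 20 cycles per TCK edge, 100 cycles per 16-bit shift command, 41 TCK cycles per 4-byte access, half the CPU time spent on host traffic). The function and parameter names are illustrative, and the model assumes the microcontroller is the only bottleneck.

```python
import math

KB = 1024


def bitbang_effective_tck_hz(mcu_hz: float, mcu_cycles_per_tck_edge: int,
                             host_share: float = 0.0) -> float:
    """Effective TCK rate when the microcontroller drives every TCK edge itself."""
    # Two edges per TCK cycle; host_share is the fraction of CPU time spent on host transfers.
    return mcu_hz * (1.0 - host_share) / (2 * mcu_cycles_per_tck_edge)


def command_limited_kb_per_s(mcu_hz: float, mcu_cycles_per_command: int,
                             tck_cycles_per_command: int, tck_cycles_per_access: int,
                             host_share: float = 0.0, bytes_per_access: int = 4) -> float:
    """Throughput when programmable logic shifts bits and the CPU only issues commands."""
    commands = math.ceil(tck_cycles_per_access / tck_cycles_per_command)
    mcu_cycles_per_access = commands * mcu_cycles_per_command
    accesses_per_s = mcu_hz * (1.0 - host_share) / mcu_cycles_per_access
    return accesses_per_s * bytes_per_access / KB


tck = bitbang_effective_tck_hz(60_000_000, 20, host_share=0.5)
bitbang_kb = tck / 41 * 4 / KB
pld_kb = command_limited_kb_per_s(60_000_000, 100, 16, 41, host_share=0.5)
print(f"bit-banged: {tck / 1e3:.0f}kHz effective TCK, {bitbang_kb:.1f} KB/sec")
print(f"PLD-assisted: {pld_kb:.2f} KB/sec")
```

Note that the TCK generator's maximum frequency does not appear in `command_limited_kb_per_s` at all: in this model, throughput depends only on how quickly the microcontroller can issue commands.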
This is an important point to remember as you consider any JTAG-oriented hardware debug device: maximum TCK frequency is important, but the ability to fill those TCK cycles with useful work is even more critical. Today's typical high-end hardware debug devices are often built like the example microcontroller+PLD device benchmarked in Figure 4 above. With a high-performance microprocessor and dedicated JTAG management logic, memory access speeds of 2MB/sec or more on a system like this ARM1176 example are common, but they are only possible with a well-designed and highly optimized hardware debug device.
In fact, the debug ports of some of today's devices are capable of correct operation at TCK speeds above 100MHz, offering a challenge to designers of hardware debug devices. For the ARM1176 example, 100MHz means a throughput of 9527 KB/sec, making debugging and programming tasks virtually instantaneous.
To live up to the performance potential of such systems, careful system-level design of the hardware debug device is required to ensure that bottlenecks within the device do not limit performance. If the device is connected to the debug host by USB, the device must support USB 2.0 high speed or be limited by the 1.5 MB/sec throughput of USB 1.1. If the debug device is connected by Ethernet, a high-performance networking subsystem capable of nearly saturating 100 megabit Ethernet must be used. On top of that, the debug device must be able to issue whatever commands are necessary to execute a 4-byte load or store every 410 ns to maintain the 9527 KB/sec transfer rate, and the system must have sufficient buffering and power to sustain that throughput while simultaneously transferring nearly 10 megabytes of data per second from the debugging host.
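The system-level budget described above can be checked with a small Python sketch. The 100MHz clock, 41-cycle access and 1.5 MB/sec USB 1.1 figure come from the text. The Ethernet figure is the raw 100 megabit line rate with no allowance for protocol overhead, which is a simplifying assumption, and the function names are illustrative.

```python
KB = 1024


def access_budget_ns(tck_hz: float, tck_cycles_per_access: int) -> float:
    """Time available per memory access if every TCK cycle does useful work."""
    return tck_cycles_per_access / tck_hz * 1e9


def sustained_kb_per_s(target_kb_per_s: float, host_link_kb_per_s: float) -> float:
    """End-to-end throughput is capped by the slower of the target port and the host link."""
    return min(target_kb_per_s, host_link_kb_per_s)


target = 100_000_000 / 41 * 4 / KB     # ARM1176 example at 100MHz TCK
usb_1_1 = 1.5 * KB                     # 1.5 MB/sec, from the text
ethernet_100 = 100_000_000 / 8 / KB    # raw line rate, protocol overhead ignored (assumption)

print(f"budget per 4-byte access: {access_budget_ns(100_000_000, 41):.0f} ns")
print(f"over USB 1.1: {sustained_kb_per_s(target, usb_1_1):.0f} KB/sec")
print(f"over 100 megabit Ethernet: {sustained_kb_per_s(target, ethernet_100):.0f} KB/sec")
```

Under these assumptions a USB 1.1 link caps throughput at 1,536 KB/sec, about a sixth of what the target's debug port could deliver, while raw 100 megabit Ethernet leaves the debug port as the limit.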
Conclusion
High-performance debugging equipment means you can spend less time waiting around for system restart and critical debugging information, and more time solving the real-world problems of a deeply embedded system.
Anderson MacKay is Engineering Manager in Green Hills Software's Target Connections group, responsible for product planning, engineering, and project management for the Probe and SuperTrace Probe products.