Debugging Asymmetric Multi-core Applications: Part 1

By Julien Carreno and Intel Corp., February 27, 2007

In the first of a three-part series, Intel's Julien Carreno describes a typical asymmetric multi-core application and the problems commonly encountered in such a system.

In a project life cycle, there always comes a time when one must debug a software issue found during testing. Development teams are always looking to pro-actively find potential defects as early as possible in the development cycle.

Unfortunately, these early-detection methods are not foolproof, and there will always be a few issues that appear only when testing the full system. Teams then end up debugging issues reactively as they appear.

Because of this, development teams always look for ways to improve their debugging techniques for faster, more effective debugging as well as better performance characterizations and bottleneck identification in real-time applications. Issues related to this subject increase dramatically in the context of a multi-core application where data is being passed from one core to another.

Complexity will increase even further when dealing with an asymmetric multi-core scenario where only a single debugging interface may be present for the entire system. In this series, the aim is to provide a clear understanding of typical issues that can occur in an asymmetric multi-core application and provide a set of tools for effectively debugging these issues.

In order to provide the appropriate level of detail, this topic will be covered in a series of three articles. This first article establishes a common understanding of what an asymmetric multi-core application is and of the typical problems that can be encountered in such a system.

Typical system scenarios
In order for the developer to be able to effectively debug an asymmetric multi-core application, he/she must first clearly understand the system; this section's purpose is to set a common understanding of what an asymmetric multi-core system is.

For the purpose of this article, this section will cover the specific example of a highly integrated System on a Chip (SoC) with an accelerated Ethernet interface and cryptography acceleration. In this case, we will assume a network routing application with packet encryption enabled.

This type of application can be split into two parts: the network input/outputs and the packet encryption/decryption. Figure 1 below shows how the two parts of the application are accelerated.

Figure 1: Hardware overview

For the purpose of this article, let's assume the main core is based on a typical Intel architecture. This assumption does not have any major implications for the issues and techniques detailed later on.

Secondary cores 2 and 3 are purpose-built acceleration engines, each with hardware functionality targeted at a specific type of application. In this case, we have two versions of the acceleration engine, as described below.

Scenario #1. Core 2 is a secondary core whose sole responsibility is to offload low-level data processing for a network interface from the main core. The secondary core will receive packets from the interface (for example an Ethernet MAC built into the core as a coprocessor), do a pre-configured set of checks and processing on the data (checksum validation, filtering, VLAN tagging) and then will pass the data up to the main core for higher level processing (protocol stack).

The main core will also pass data to the secondary core for transmission. The secondary core then performs a pre-configured set of actions (such as appending a checksum) and sends the data out on the interface. This scenario is typically described as inline acceleration.
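The receive and transmit paths of Scenario #1 can be modelled on a workstation in a few lines of Python. This is only a sketch under stated assumptions: the function names `rx_offload` and `tx_offload` are hypothetical, a 4-byte CRC-32 trailer stands in for whichever checksum the real interface uses, and filtering is reduced to a destination-address allow list.

```python
import zlib
from typing import Optional


def tx_offload(payload: bytes) -> bytes:
    """Transmit path: append a checksum trailer before sending (assumed CRC-32)."""
    return payload + zlib.crc32(payload).to_bytes(4, "little")


def rx_offload(frame: bytes, allowed_dst: set) -> Optional[bytes]:
    """Receive path: validate the checksum, filter, then pass the data up.

    Returns the payload for the main core, or None if the frame is dropped.
    """
    if len(frame) < 6 + 4:            # too short for a 6-byte address + trailer
        return None
    payload, trailer = frame[:-4], frame[-4:]
    if zlib.crc32(payload).to_bytes(4, "little") != trailer:
        return None                   # checksum validation failed
    if payload[:6] not in allowed_dst:
        return None                   # filtered out by destination address
    return payload                    # handed to the main core's protocol stack
```

A model like this is mainly useful for generating the expected data that a test compares against what the real secondary core produces.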

Scenario #2. Core 3 is a secondary core whose sole responsibility is to offload specific data encryption/decryption processing functions (task) from the main core. The main core will provide data to the secondary with a description of the task to perform. The secondary core will perform the action and return the result to the main core.

This scenario is typically referred to as look-a-side acceleration. Here, offloading security processing to core 3 is not only about the performance gained by having another core do some of the processing for each packet.

It also has the added benefit of core 3 having hardware acceleration functionality (hashing coprocessor in this instance) allowing for much faster processing compared to a standard core.
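The request and response flow of Scenario #2 can be sketched as a pair of queues. Everything named here is an assumption made for illustration: `Task`, `Result` and `LookAsideEngine` are hypothetical, and `hashlib.sha256` merely stands in for the hashing coprocessor.

```python
import hashlib
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    task_id: int
    operation: str      # description of the work to perform, e.g. "sha256"
    data: bytes


@dataclass
class Result:
    task_id: int
    output: Optional[bytes]   # None means the operation is not supported


class LookAsideEngine:
    """Host-side model of a look-a-side accelerator with two queues."""

    def __init__(self):
        self.requests = deque()    # main core -> secondary core
        self.responses = deque()   # secondary core -> main core

    def submit(self, task: Task) -> None:
        """Main core side: hand over data plus a description of the task."""
        self.requests.append(task)

    def step(self) -> bool:
        """Secondary core side: process one pending task, if any."""
        if not self.requests:
            return False
        task = self.requests.popleft()
        if task.operation == "sha256":
            output = hashlib.sha256(task.data).digest()
        else:
            output = None
        self.responses.append(Result(task.task_id, output))
        return True

    def poll(self) -> Optional[Result]:
        """Main core side: collect one completed result, if any."""
        return self.responses.popleft() if self.responses else None
```

Carrying a `task_id` in both directions lets the main core match each result to its request, which becomes important later when looking for lost or duplicated responses.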

As with any asymmetric multi-core application, the primary objective of both core 2 and 3 is to free up resources and CPU cycles on the main core for higher level data processing. Effectively, secondary cores will perform some of the complex functions for the main core, such as hashing and/or encryption (core 3), or provide accelerated input-outputs (core 2).

An asymmetric multi-core application will always be made up of a main core, the master in the system that will drive the application, and one or more secondary cores, the slaves in the system that perform a limited set of tasks and are driven by either an external signal/interface or by the master core.

For each task, data is passed from one core to another using a shared memory model or using hardware communication mechanisms (mailbox type of mechanism for instance).

Typically, only the main core will provide a debug interface. Access to the internals of the secondary cores can vary widely: it can be nonexistent, very limited, or can even impede the system's normal operation (for example, when access requires halting one of the cores).

Techniques described later in this series will always assume a single debug interface and no access to the internals of the secondary cores unless specifically stated. A typical debug interface can be, for instance, a serial port providing a command line interface or a JTAG interface.

The debugging guidelines and techniques described in a later part of this series are not confined to the specific application and system described above. The material presented is relevant regardless of the architecture of the main and secondary cores.

For instance, we could even argue that provided there is a main core driving the application and the secondary cores depend on the main core providing them data to process, all cores could have the same architecture with the main core being determined at boot time.

In this case we have a physically symmetric multi-core architecture running an asymmetric application, with each core dedicated to a specific task. Further, these techniques could even be applied to systems where the secondary cores are off chip, i.e. a multi-processor configuration.

Environment
In the life cycle of a project involving silicon as well as silicon enabling software, it is often the case for the software to be ready before the first version of the silicon is available for use.

Common techniques used to alleviate this are simulation of the silicon on a traditional workstation and/or emulation of the silicon using FPGAs (or some other similar technology).

Both these techniques, although very useful at the start of a project, have the major drawback of being much slower than the real silicon (on the order of 10 times slower for emulation, and one thousand times or more for simulation).

All the debugging techniques listed in this paper will always assume the real silicon to be the starting point of any investigation. Should simulation and/or emulation be relevant to debugging a specific type of issue, it will be specifically stated.

The techniques described hereafter will assume that the acceleration cores are a black box on the real silicon and access to their internals is very limited or even non-existent under normal operations.

Problem statement
As stated previously, in an asymmetric multi-core environment, it is often the case for only the main core to have a debug interface, and for the secondary core(s) to be black boxes with no visibility of the internals.

This often raises the issue of trying to find out what is happening in the secondary core, on top of investigating whatever communication mechanism is in place between the cores for passing data back and forth. Debugging tools (VisionICE, JTAG, internally developed tools, etc.) are often used to connect directly to these secondary cores in some way in order to access internal data that is otherwise inaccessible.

However, these tools often have drawbacks, such as impacting the performance of the core while running, causing a change in the behaviour of the system, or having to stop the core to examine its state.

In a real-time application this type of debugging may not be of any use. The application has real-time constraints, so any issue under investigation may be related to timing in the system, and the behaviour of the system must be preserved in order to reproduce the issue while debugging.

In these cases it is left up to the designer of the application and the tester to build other means of debugging into the application. These must allow the entire system to be debugged through the only debug interface with no impact, or minimal impact, on performance, thus preserving the real-time behaviour of the system during debugging. These techniques are what we will cover in detail at a later stage in the series.

Typical applications: two examples
Detailed in the following section are two possible applications based on an asymmetric multi-core system. Should more detail be required, consulting the documentation on products such as the IXP425 or IXP2350 can be a good starting point.

Inline Ethernet acceleration. The diagram in Figure 2 below gives a simplified overview of a network application with an accelerated network interface. The application design is based on a network processing engine (purpose built processor with a radically different architecture than an Intel XScale or Intel architecture core) performing the low level packet processing for the main Intel XScale core.

As mentioned previously, the specific architecture of each core is incidental to the following material, but it helps in gaining a clear understanding of this specific example. Transmit requests and receive notifications are communicated through a simple First-In-First-Out queuing mechanism implemented using software-based memory queues.

Figure 2: Inline acceleration
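A software-based memory queue of this kind is commonly built as a fixed-size ring with one index owned by the producer and one by the consumer. The sketch below is an illustration rather than the layout used by any particular product: `RingFifo` is a hypothetical name, and the Python model ignores the memory ordering and cache coherency concerns that a real shared-memory queue between two cores must handle.

```python
class RingFifo:
    """Fixed-size FIFO with one producer and one consumer.

    One slot is always left empty so that "full" and "empty" can be told
    apart using only the two indices.
    """

    def __init__(self, slots: int):
        self.buf = [None] * slots
        self.head = 0   # next slot the consumer reads; written only by the consumer
        self.tail = 0   # next slot the producer writes; written only by the producer

    def put(self, item) -> bool:
        """Producer side: enqueue an item, or return False if the queue is full."""
        nxt = (self.tail + 1) % len(self.buf)
        if nxt == self.head:
            return False
        self.buf[self.tail] = item
        self.tail = nxt
        return True

    def get(self):
        """Consumer side: dequeue the oldest item, or return None if empty."""
        if self.head == self.tail:
            return None
        item = self.buf[self.head]
        self.head = (self.head + 1) % len(self.buf)
        return item
```

A queue that reports full is the point at which a producer must either drop the request or retry, which is one place where the data drops described later can originate.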

Look-a-side pixel shading acceleration. The diagram in Figure 3 below gives a simplified overview of a graphics application with look-a-side pixel shading acceleration.

The main core (Intel architecture or Intel XScale) will offload graphics content calculations to the graphics module (core with multiple hardware thread support). Communication of pixel shading requests and completion notifications is done through the use of hardware rings.

Figure 3: Look-a-side acceleration

Some possible problems
In this section we will list the different types of issues that can be encountered in an asymmetric multi-core application. This list is by no means exhaustive. Other types of issues are possible. Listed here are just the common and main issues that can come up in such projects.

How to approach them for an effective and timely defect resolution will be covered in a later part; first we will set the ground work for a common understanding of these typical issues.

Drop in performance. Any real-time application has requirements specifying the performance to be achieved for a specific scenario. Performance is usually specified by the number of events per second that the system can handle. One possible issue is that performance drops sharply and unexpectedly during the course of a use case. The drop is recurring: it happens every time the test case is run, once or more per run (see Figure 4 below).

The system recovers without any intervention from the user. Below is an example of a system operating at maximum capacity with approx 80 percent of incoming events being processed and sent back out; the rest of the events are dropped by the system. You can see clearly that throughput drops on occasion and then recovers.

Figure 4: Recurring performance drop

Another possibility is for performance to drop only occasionally in the same scenario. In this case, the drop cannot be reproduced on every run of the test case and, when it does occur, happens only once (see Figure 5, below). Here as well the system will recover without any outside intervention.

Figure 5: Occasional performance drop

Yet another scenario involves the throughput dropping (or even stopping altogether) and not recovering until there is an outside intervention, such as the user stopping traffic and then starting it again.

Figure 6: Performance drop requiring outside intervention
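The three patterns above can be told apart mechanically from throughput samples collected over several runs of the same test case. The following sketch assumes the samples are events per second taken at a fixed interval; the function names, the `baseline` value and the 50 percent `threshold` are all assumptions chosen for illustration.

```python
def find_drops(samples, baseline, threshold=0.5):
    """Return (start, end) sample-index pairs where throughput fell below
    threshold * baseline. end is None if the trace finishes while throughput
    is still low, i.e. the system never recovered on its own."""
    drops = []
    start = None
    for index, value in enumerate(samples):
        low = value < threshold * baseline
        if low and start is None:
            start = index
        elif not low and start is not None:
            drops.append((start, index))
            start = None
    if start is not None:
        drops.append((start, None))
    return drops


def classify_runs(runs, baseline, threshold=0.5):
    """Classify the behaviour seen across several runs of one test case."""
    per_run = [find_drops(run, baseline, threshold) for run in runs]
    if any(end is None for drops in per_run for _, end in drops):
        return "no recovery without intervention"
    affected = sum(1 for drops in per_run if drops)
    if affected == 0:
        return "no drop"
    if affected == len(runs):
        return "recurring"
    return "occasional"
```

For example, `classify_runs([[80, 80, 20, 80], [80, 10, 80, 80]], baseline=80)` reports a recurring drop because every run contains one, while a run that ends at low throughput is reported as needing intervention.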

Application lock-up. In worst-case scenarios, the application may lock up completely. For example, the secondary core may lock up and become totally unresponsive.

In this scenario, the application was functioning correctly but then unexpectedly stopped and does not recover when all activity on the system is stopped then restarted.

The seriousness of the lock-up can be assessed through the extent of the steps taken by the user to recover the system. It can range from having to reconfigure the application, to stopping and restarting one or more of the secondary cores, and even rebooting the entire system.
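One way to record this assessment consistently is to try the recovery steps in order of increasing disruption and note the first one that works. The sketch below is hypothetical throughout: the step names simply mirror the paragraph above, and `recovers_after` is an assumed callback that a test harness would supply to perform a step and report whether the system came back.

```python
# Recovery steps in order of increasing disruption (taken from the text above).
RECOVERY_STEPS = [
    "reconfigure application",
    "restart secondary core",
    "reboot system",
]


def lockup_severity(recovers_after):
    """Return (severity, step) for the first step that recovers the system.

    recovers_after(step) must perform the step and return True on recovery.
    Severity starts at 1; if nothing works, the result is
    (len(RECOVERY_STEPS) + 1, None).
    """
    for severity, step in enumerate(RECOVERY_STEPS, start=1):
        if recovers_after(step):
            return severity, step
    return len(RECOVERY_STEPS) + 1, None
```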

Data drops. In a network acceleration scenario, the application may start dropping data while maintaining adequate performance. In this case, the system is not functioning at full capacity and there is still bandwidth available, but the system is dropping data in specific scenarios.

In Figure 7 below, the same system bandwidth is used no matter the packet size (i.e. the same number of bytes are processed per second, only the packet size varies). And in specific test cases, a percentage of the data is dropped even though with slightly different settings the same throughput can be achieved with no errors.

Figure 7: Percentage packet processed depending on packet size
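The measurement behind a plot like Figure 7 can be sketched as follows. At a constant byte rate, the number of packets per second rises as the packet size shrinks, so per-packet load differs between test cases even though bandwidth does not. The function names and the counter layout are assumptions for illustration, and any numbers fed in would come from the test equipment.

```python
def packets_per_second(bytes_per_second: int, packet_size: int) -> int:
    """Packet rate needed to fill a fixed byte rate with packets of one size."""
    return bytes_per_second // packet_size


def percent_processed(counters):
    """counters maps packet_size -> (packets_sent, packets_received).

    Returns packet_size -> percentage of packets that made it through.
    Sizes with nothing sent are skipped.
    """
    return {
        size: 100.0 * received / sent
        for size, (sent, received) in counters.items()
        if sent
    }


def sizes_with_drops(counters):
    """Packet sizes for which at least one packet was lost."""
    return sorted(
        size for size, (sent, received) in counters.items() if received < sent
    )
```

Comparing `sizes_with_drops` against the packet rate for each size is a quick way to see whether losses track the per-packet load or a particular packet size.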

Extra data appearing. While testing secondary core behaviour it is common to compare actual data with expected data. It is sometimes possible for the actual data to contain all of the expected data but also some extra, non-corrupting data. This extra data can be valid or invalid, but it does not impact the expected data. It can take the form of entirely new packets or of duplicates of packets transmitted or received.
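A comparison of expected and actual data that separates duplicates from entirely new packets can be sketched with a multiset difference. This is a minimal illustration assuming packets are available as `bytes` objects and that ordering does not matter; `compare_packets` is a hypothetical helper name.

```python
from collections import Counter


def compare_packets(expected, actual):
    """Compare two packet lists as multisets.

    Returns a dict with:
      missing    - expected packets that never appeared, with counts
      duplicates - surplus copies of packets that were expected
      new        - packets that were never expected at all
    """
    exp, act = Counter(expected), Counter(actual)
    missing = exp - act          # Counter subtraction keeps positive counts only
    surplus = act - exp
    duplicates = {pkt: n for pkt, n in surplus.items() if pkt in exp}
    new = {pkt: n for pkt, n in surplus.items() if pkt not in exp}
    return {"missing": dict(missing), "duplicates": duplicates, "new": new}
```

An empty `missing` entry together with a non-empty `duplicates` or `new` entry corresponds to the case described above: all expected data arrived, plus something extra.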

Data corruption. Data corruption is probably the most common of issues related to any network data processing application or data processing offloading application.

Corruption can take several forms, such as increased or decreased packet length, changed data values causing CRC errors, or corrupted entries in a FIFO queue. Data corruption can be summed up by two scenarios:

* The output data from processing is different from the expected output.
* The data taken out of a storage location is not what was supposed to have been input.

Data corruption can manifest itself at different levels of the application and falls into two broad categories. First, some of these corruptions may happen systematically but may not be detectable under normal functioning. Second, other corruptions may happen occasionally and are detectable immediately through normal path checks (such as CRC checks in a network application).
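When a test does catch a corruption, recording how the data differs is often more useful than a simple pass or fail. The sketch below applies to both scenarios listed above, since "expected" can be either the expected output of processing or the data originally written to a storage location. The helper name `describe_corruption` and the shape of its result are assumptions.

```python
def describe_corruption(expected: bytes, actual: bytes):
    """Summarise how actual differs from expected.

    Returns None when the two match. Otherwise returns a dict with:
      length_delta - actual length minus expected length
      first_diff   - offset of the first differing byte within the common
                     length, or None if only the length changed
    """
    if expected == actual:
        return None
    common = min(len(expected), len(actual))
    first_diff = next(
        (i for i in range(common) if expected[i] != actual[i]), None
    )
    return {"length_delta": len(actual) - len(expected), "first_diff": first_diff}
```

Collected over many failures, the offsets and length changes can show whether corruption always lands in the same place, which helps separate systematic corruption from occasional corruption.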

Timing misses. Certain types of applications may have hard deadline requirements whereby the secondary core must respond to a specific signal within a certain interval of time.

Missing this timing window can have drastic repercussions on the performance and stability of the system. A timing miss may lead to a variety of errors ranging from an application lock-up to sporadic or systematic data corruption.

A timing miss is not usually linked to a specific defect in the application but more to a lack of performance at a certain point of the data processing. This lack of performance may be due to the application itself or may be related to other applications interfering and, for example, overusing a shared resource.
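Given timestamps for each signal and its response, checking for timing misses is a simple comparison against the deadline. The sketch assumes timestamps in microseconds captured by the test environment and uses `None` for a response that never arrived; the function name and data layout are hypothetical.

```python
def timing_misses(events, deadline_us):
    """Return the indices of events whose response missed the deadline.

    events is a list of (signal_time_us, response_time_us) pairs, where
    response_time_us is None if no response was ever observed.
    """
    misses = []
    for index, (signal, response) in enumerate(events):
        if response is None or response - signal > deadline_us:
            misses.append(index)
    return misses
```

For instance, with a 50 microsecond deadline, `timing_misses([(0, 40), (100, 180), (200, None)], 50)` returns `[1, 2]`: the second response was late and the third never came.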

Non-responsive secondary core. A non-responsive core, as opposed to a core lock-up, will present the problem of having a single path from one core to another not functioning.

For instance, any requests using a specific type of communication mechanism might get ignored, or a specific acceleration feature on the secondary core might not be responsive. This type of issue could occur from start-up or after an unknown event which destabilizes the system.
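The distinction between a non-responsive path and a full lock-up can be made by probing every communication path to the secondary core and looking at which ones answer. The sketch below assumes the harness has already sent one probe per path and recorded whether a response arrived in time; the path names in the usage note and the labels returned are illustrative only.

```python
def classify_core(path_results):
    """Classify a secondary core from per-path probe results.

    path_results maps a path name to True if a probe sent on that path was
    answered in time. Returns (label, unresponsive_paths).
    """
    dead = sorted(name for name, ok in path_results.items() if not ok)
    if not dead:
        return "responsive", dead
    if len(dead) == len(path_results):
        return "lock-up suspected", dead
    return "non-responsive path", dead
```

For example, `classify_core({"rx queue": True, "tx queue": False})` returns `("non-responsive path", ["tx queue"])`, pointing the investigation at a single mechanism rather than at the core as a whole.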

Next in Part 2: Tools and techniques available for debugging multicore applications.

To read more about multicore issues, go to "More on Multicores and Multiprocessors."

Julien Carreno is a senior engineer and technical lead within the Digital Enterprise Group at Intel Corp. He is currently the technical lead on a team responsible for delivering VoIP solution software for the next generation of Intel's Embedded Intel Architecture processors.

He has worked at Intel for more than three years, specialising in acceleration technology for the Embedded Intel Architecture markets. His areas of expertise are Ethernet, E1/T1 TDM, device drivers, embedded assembler and C development, multi-core application architecture and design.
