Tag Archives: QorIQ

Expert Viewpoint / Fused core logic: Multithreading for multicore performance

By John Dixon – The term ‘multithreading’ is a generic description of the method used to increase the utilization of a single core with thread level as well as instruction level parallelism. In my conversations with multiple customers on this topic following our recent QorIQ Advanced Multiprocessing (AMP) series announcements, including the T4240 based on [...]

The right stuff: How a communications processor became the smarts of a flight computer

By Tom Thompson — One thing is certain in the semiconductor industry: when you make powerful and robust microprocessor units (MPUs), our creative customers use them in all sorts of places to solve thorny design problems. Because of their low power consumption, array of I/O interfaces and computational prowess, Freescale MPUs have found their way [...]

M2M: It’s mobile and embedded!

By Iain Davidson — This week is a BIG week in Europe. We have Mobile World Congress and embedded world in the same week and I’m torn. Where do I go to talk M2M? The answer? I will be in both places, two days each. At Mobile World Congress, it’s all about M2M for healthcare [...]

Making the move to multicore: Core communications and channels (Part 5)

By Rob Oshana – This article continues my discussion of migrating legacy network routing software to a multicore platform. While multiple cores can boost throughput, care must be taken in how they access shared data to avoid corrupting it or using stale information. This can be managed by using the processor’s memory manager or hypervisor to [...]

Making the move to multicore: Overview

By Rob Oshana – Products with multicore processors have become nearly ubiquitous today. The CPU in a year-old laptop has dual cores and—even more likely—quad cores. The latest crop of tablets possess dual core processors, whether they are Apple’s iPad 2, Research In Motion’s Playbook, or Amazon’s newly-introduced Kindle Fire. Even some of today’s high-end smartphones have [...]

The CodeWarrior debugger never sleeps!

By Jim Trudeau – Catnaps are good things, for cats, people, and electronic devices. More and more of the gadgets we use are battery powered. More and more is expected of them: meaning faster processors with greater capabilities to get all that great work done. Yet, while doing all of these great things, the processor must [...]

Case study: Correcting parallel code with Prism

By Tom Thompson

Recently this blog covered issues encountered when debugging parallel software on multicore processors. Specifically, the article mentioned Prism, a software tool developed by CriticalBlue Ltd. Prism is an embedded code analysis tool that scans the execution traces of parallel software to detect problems that can occur in a multicore environment, such as race conditions, resource contention, and other concurrency-related software errors. When Prism was applied to Freescale’s Long Term Evolution (LTE) Layer 2 code, which was being ported from a single-core processor to a dual-core processor, it identified data integrity issues and an execution bottleneck that hurt the code’s performance. With Prism’s help, these software errors were corrected. Not only was the resulting communications code more robust, but its throughput also improved by over forty percent (Figure 1).

[caption id="attachment_3892" align="alignnone" width="605" caption="Figure 1. Performance of LTE P2020 port compared to MPC8548. Notice how performance degraded until the second core was put to use effectively"]Figure 1. Performance of LTE P2020 port compared to MPC8548. Notice how performance degraded until the second core was put to use effectively[/caption]

Since I provide support for Freescale’s multicore development work, I was intrigued by such a result. I wanted more details on how Prism accomplished this, which could be useful information to pass on to fellow developers. Therefore, what follows is an in-depth look at the LTE code port. The focus is on how Prism helped embedded software engineers locate bugs and identify software dependencies that impacted the performance of the parallel code.

Prism is a plugin for the Eclipse Integrated Development Environment (IDE). It works by analyzing the dynamic trace data generated by executing software on a simulator or the target hardware. Prism can analyze C or C++ code written as a bare-board application, or as a parallel application running on Linux.

The parallel application under investigation is first run and a trace file is generated. Prism then reads and processes the dynamic trace information to detect dependencies in the code. Dependencies can take several forms, and each requires a different solution to reduce or eliminate its impact on a parallel application’s performance.

The Devil’s in the Dependencies

The first type of dependency is a data dependency. As its name implies, it occurs when an operation depends upon data calculated in a previous operation. For example, many digital filtering algorithms rely on results obtained from previously processed data samples. Unfortunately, there isn’t a way to eliminate this type of dependency. However, its effect can be mitigated by reorganizing the code so that other, independent calculations execute on other cores until the required data is ready.
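To make the filtering example concrete, here is a minimal sketch in C. It is not taken from the LTE stack; the function names and the choice of a first-order recursive filter are my own assumptions for illustration. Each output sample needs the previous output, so the loop inside one channel cannot be split across cores, but separate channels are independent of one another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: first-order recursive low-pass filter,
 * y[n] = a * x[n] + (1 - a) * y[n-1].
 * Every iteration needs the result of the one before it, which is
 * the data dependency described above. */
void iir_lowpass(const float *x, float *y, size_t n, float a)
{
    assert(x != NULL && y != NULL);
    float prev = 0.0f;                          /* y[-1], assumed to be zero */
    for (size_t i = 0; i < n; i++) {
        prev = a * x[i] + (1.0f - a) * prev;    /* uses the prior output */
        y[i] = prev;
    }
}

/* Channels do not depend on each other, so each call below is a
 * candidate for running on a different core. Samples are assumed to be
 * stored channel by channel, n samples per channel. */
void filter_channels(const float *x, float *y,
                     size_t channels, size_t n, float a)
{
    for (size_t c = 0; c < channels; c++)
        iir_lowpass(x + c * n, y + c * n, n, a);
}
```

The dependency inside `iir_lowpass` stays put. The reorganization is in deciding which independent work, here the other channels, can proceed in the meantime.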

The second type of dependency occurs when the parallel application runs out of resources. This often happens when different parallel operations contend for the same variables or buffers. This type of dependency is termed “resource incorrect” or a concurrency error, because the problems center on poorly allocated resources. Fortunately, such dependencies can be fixed by studying the application’s available resources at a given moment and distributing them properly.

The third type of dependency occurs when an operation relies on the result of a previous operation that has not yet been committed to memory. In this case, the operation uses stale data in its calculation, with possibly disastrous results. These dependencies, where portions of the code get ahead of the data, are also known as race conditions. Such dependencies can be corrected by reorganizing the code or by synchronizing access to the shared data.

Prism helps the programmer spot these dependencies so that they can be corrected. As you know, half the battle in fixing a bug is identifying where it lies in the embedded application code. This is even more important when dealing with parallel applications. Prism’s value is that it provides snapshots of the embedded software’s activities and resource consumption, so that you can schedule resources based upon modeled information rather than educated guesswork.

With that in mind, let’s turn to the case where Prism was used to spot problems in the code port.

Strategy for Migrating from Single to Dual-Core

The code under scrutiny was an LTE Layer 2 stack for a wireless (or cellular) base station that was originally implemented on a Freescale single-core PowerQUICC MPC8548 processor. (Note: the PowerQUICC processor consists of several different communications-specific cores, but only a single Power Architecture core.) The protocol stack was being ported to a QorIQ processor, the dual-core P2020. The porting efforts focused on the wireless-side protocols because of their hard real-time requirements. An LTE base station must be able to handle the traffic of multiple communication channels on both uplinks (from user to base station) and downlinks (from base station to the user), all within a one-millisecond processing window. The embedded code ran under a Linux OS and used POSIX threading services.

Given these requirements, the original software used a low-priority thread to manage IP network traffic, while a high-priority thread handled the uplink and downlink operations sequentially. The downlink function took data collected by packet reception software from the IP network, processed it, and placed the results into a buffer for transmission. The uplink function performed a similar buffering operation with the incoming stream of wireless data. A loop iterated once per millisecond to invoke the downlink/uplink functions and process the accumulated data in the buffers. By moving the LTE stack to a QorIQ processor, its multiple cores could simultaneously manage IP packet reception, uplink processing, and downlink processing. The increased throughput would allow the P2020-based LTE base station to handle more channels (users).

The first stage of the port was to simply migrate the code to the P2020 RDB platform and get the application running successfully (Figure 2). This was a straightforward recompile using the P2020’s Board Support Package (BSP) software. However, stability and performance issues soon surfaced. Code traces were generated from the board and Prism was brought to bear on the problem.

[caption id="attachment_3858" align="alignnone" width="590" caption="Figure 2. How the migration of the wireless stack was planned."]Figure 2. How the migration of the wireless stack was planned.[/caption]

Removing Race Conditions Brought About by Concurrent Access

Prism’s Data Race view was able to quickly pinpoint the stability problem. Two processes were accessing the same pool of data buffers, and occasionally race conditions occurred where invalid data was consumed (Figure 3). The Data Race view let the engineers quickly determine that the LTE Layer 2 stack had no mechanism to safeguard the integrity of the buffer pool in a multicore environment. The problem was corrected by using software locks to synchronize data accesses. Unfortunately, the overhead of this code made the port run slower than the PowerQUICC version, as Figure 1 points out.

[caption id="attachment_3865" align="alignnone" width="598" caption="Figure 3. Prism displaying a race condition"]Figure 3. Prism displaying a race condition[/caption]
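The locking code added to the LTE stack isn’t shown here, so what follows is only a minimal sketch of the general technique, assuming a POSIX mutex guards a small free list of buffer indices. The type `buffer_pool`, the functions, and `POOL_SIZE` are all hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define POOL_SIZE 8                 /* hypothetical pool depth */

typedef struct {
    pthread_mutex_t lock;           /* guards free_list and free_count */
    int free_list[POOL_SIZE];       /* indices of buffers not in use */
    size_t free_count;
} buffer_pool;

void pool_init(buffer_pool *p)
{
    assert(p != NULL);
    pthread_mutex_init(&p->lock, NULL);
    for (size_t i = 0; i < POOL_SIZE; i++)
        p->free_list[i] = (int)i;
    p->free_count = POOL_SIZE;
}

/* Returns a buffer index, or -1 if the pool is empty. */
int pool_acquire(buffer_pool *p)
{
    int idx = -1;
    pthread_mutex_lock(&p->lock);
    if (p->free_count > 0)
        idx = p->free_list[--p->free_count];
    pthread_mutex_unlock(&p->lock);
    return idx;
}

/* Hands a buffer back to the pool. */
void pool_release(buffer_pool *p, int idx)
{
    pthread_mutex_lock(&p->lock);
    assert(p->free_count < POOL_SIZE);
    p->free_list[p->free_count++] = idx;
    pthread_mutex_unlock(&p->lock);
}
```

Every acquire and release now pays for a lock and an unlock, which is the kind of overhead that can make a freshly synchronized port run slower than the single-core original.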

Making Code Parallel Through Modeling

The next stage of the porting process was to reverse the trend toward worse performance by utilizing the P2020’s second core for parallel processing. The plan was to factor the code so that the uplink function executed in a thread on the second core. Meanwhile, the downlink function remained in the application’s main thread on the first core, along with the packet reception code. Using the data from extensive code traces, Prism can generate virtual models of a multicore environment and let you experiment with what-if scenarios to see how they affect code execution. For example, you can “migrate” a thread to a second virtual core and observe the results without writing any code. The software models that Prism generated from this virtual arrangement for the LTE stack revealed a potential for corruption of the state variables that manage downlink retransmissions.

When data is downlinked, it is stored in the base station’s Medium Access Control (MAC) layer until a response is returned from the user device. This allows the data to be resent with minimal delay when a data error is detected. The original design had the uplink function process the response information as it was received. That strategy works as long as the uplink and downlink functions execute sequentially. However, once they were modeled as operating in parallel, the data dependencies on the retry state variables became evident in the Prism display. The fix was to have the uplink thread gather the retry information and leave the final processing to the downlink thread. This eliminated the data dependencies and the data corruption problem.
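One way to picture that split is a small mailbox: the uplink side only records what the user device reported, and the downlink side is the sole code that touches the retry state. The sketch below is an assumption about how such a hand-off might look, not the stack’s actual code. The types, function names, and sizes are hypothetical, and I’ve assumed the mailbox itself is guarded by a POSIX mutex.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_FEEDBACK 32     /* hypothetical mailbox depth */
#define MAX_PROCS    8      /* hypothetical number of retry processes */

typedef struct { int proc_id; bool ack; } feedback;

typedef struct {
    pthread_mutex_t lock;           /* guards items and count */
    feedback items[MAX_FEEDBACK];
    size_t count;
} feedback_mailbox;

/* Retry state: read and written by the downlink thread only. */
typedef struct {
    bool awaiting_ack[MAX_PROCS];
    int  retries[MAX_PROCS];
} retry_state;

void mailbox_init(feedback_mailbox *m)
{
    assert(m != NULL);
    pthread_mutex_init(&m->lock, NULL);
    m->count = 0;
}

/* Uplink side: gather the response, do not interpret it.
 * Returns false if the mailbox is full. */
bool uplink_post_feedback(feedback_mailbox *m, int proc_id, bool ack)
{
    bool ok = false;
    pthread_mutex_lock(&m->lock);
    if (m->count < MAX_FEEDBACK) {
        m->items[m->count++] = (feedback){ proc_id, ack };
        ok = true;
    }
    pthread_mutex_unlock(&m->lock);
    return ok;
}

/* Downlink side: drain the mailbox, then update the retry state.
 * Returns how many retransmissions need to be scheduled. */
int downlink_apply_feedback(feedback_mailbox *m, retry_state *s)
{
    feedback local[MAX_FEEDBACK];
    size_t n;

    pthread_mutex_lock(&m->lock);   /* hold the lock only while copying */
    n = m->count;
    for (size_t i = 0; i < n; i++)
        local[i] = m->items[i];
    m->count = 0;
    pthread_mutex_unlock(&m->lock);

    int resend = 0;
    for (size_t i = 0; i < n; i++) {
        int id = local[i].proc_id;
        if (id < 0 || id >= MAX_PROCS)
            continue;               /* ignore malformed feedback */
        if (local[i].ack) {
            s->awaiting_ack[id] = false;
            s->retries[id] = 0;
        } else {
            s->retries[id]++;
            resend++;
        }
    }
    return resend;
}
```

Because only one thread ever modifies `retry_state`, those variables need no lock of their own. The shared surface shrinks to the mailbox, which is held only long enough to copy its contents.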

The code was then revised and run. However, its performance was half what was expected. Again, Prism was trained on the code (Figure 4).

[caption id="attachment_3877" align="alignnone" width="597" caption="Figure 4: Prism reveals that core 1 (blue) is not executing concurrently with core 0 (green)."]Figure 4: Prism reveals that core 1 (blue) is not executing concurrently with core 0 (green).[/caption]

Smarter Threading to Better Utilize the Second Core

In Prism’s Schedule view, it became apparent that the uplink and downlink threads were still operating sequentially. Prism helped zero in on a synchronization call, Synchronize(COND_VAR), made after the downlink code had completed. This call kicked off the uplink code. The arrangement made sense when the LTE stack ran sequentially on a single-core processor, but now it stalled the uplink code on the second core. The code was corrected by starting the uplink thread before downlink processing began. This scheme allowed both threads to run in parallel, a fact Prism confirmed when it analyzed the revised code (Figure 5). The code now ran approximately forty percent faster than the original implementation. That meant the LTE stack could handle more users, which could give a wireless product a competitive edge.

[caption id="attachment_3878" align="alignnone" width="597" caption="Figure 5. Now the LTE code executes concurrently in both cores."]Figure 5. Now the LTE code executes concurrently in both cores.[/caption]
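The reordering can be sketched with plain POSIX threads. This is a simplified illustration rather than the stack’s real code: the worker functions are empty placeholders, `subframe_result` and `process_subframe` are hypothetical names, and a thread is created per call only to keep the example short. A production loop running every millisecond would more likely keep a long-lived uplink thread and wake it with a condition variable.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical record of the work done in one processing window. */
typedef struct {
    int uplink_done;
    int downlink_done;
} subframe_result;

static void *uplink_worker(void *arg)
{
    subframe_result *r = arg;
    r->uplink_done = 1;             /* placeholder for uplink processing */
    return NULL;
}

static void downlink_process(subframe_result *r)
{
    r->downlink_done = 1;           /* placeholder for downlink processing */
}

/* Start the uplink work first, then do the downlink work on the
 * calling thread, and wait for the uplink at the end.
 * Returns 0 on success or a pthread error code. */
int process_subframe(subframe_result *r)
{
    pthread_t uplink;
    int rc;

    assert(r != NULL);
    r->uplink_done = 0;
    r->downlink_done = 0;

    rc = pthread_create(&uplink, NULL, uplink_worker, r);
    if (rc != 0)
        return rc;

    downlink_process(r);            /* overlaps with the uplink thread */

    return pthread_join(uplink, NULL);
}
```

The original ordering would be the equivalent of calling `downlink_process` before `pthread_create`, which leaves the second core idle until the downlink work is finished.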

Exorcising Dependencies

In this case study, Prism uncovered all three types of parallel code dependencies: race conditions, concurrency errors, and data dependencies. Even better, the engineers used Prism to try alternatives without committing time and effort until a boost in performance was assured. It’s important to note that the LTE stack being ported was reliable and had been deployed in the field. The problems arose because the stack’s algorithms were designed with sequential execution in mind. It’s a completely different situation when the code executes in a multicore environment. For example, the contents of buffers or variables are often no longer sacrosanct because now they are shared. A process executing on another core can modify their contents without warning unless you coordinate the accesses carefully. Prism is a valuable tool that exposes such vulnerabilities.

Prism also lets you model the effects of one or more additional cores, which allows you to discover any dependency conditions and correct them. Also, you can add more cores during the modeling phase to determine when the overhead of managing the multiple cores offsets any performance gains. The QorIQ family has processors ranging from one to eight cores, and Prism allows you to design a solution that uses a QorIQ part with just the right number of cores. This lets you meet the design’s software processing requirements while still hitting a specific price point.
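Prism bases its models on real execution traces, which this post can’t reproduce. As a rough back-of-the-envelope stand-in, the trade-off between added cores and management overhead can be illustrated with an Amdahl-style estimate plus a per-core coordination cost. Both the formula and the function names below are my own assumptions for illustration, not how Prism computes anything.

```c
#include <assert.h>

/* Assumed model: run time normalized to 1.0 on a single core.
 *   time(n) = serial part + parallel part / n + overhead * (n - 1)
 * overhead_per_core is a made-up linear cost for coordinating each
 * extra core. */
double modeled_speedup(double parallel_fraction, int cores,
                       double overhead_per_core)
{
    assert(cores >= 1);
    double serial = 1.0 - parallel_fraction;
    double time = serial
                + parallel_fraction / cores
                + overhead_per_core * (cores - 1);
    return 1.0 / time;
}

/* Core count, from 1 to max_cores, with the best modeled speedup. */
int best_core_count(double parallel_fraction, int max_cores,
                    double overhead_per_core)
{
    int best = 1;
    double best_speedup =
        modeled_speedup(parallel_fraction, 1, overhead_per_core);

    for (int n = 2; n <= max_cores; n++) {
        double s = modeled_speedup(parallel_fraction, n, overhead_per_core);
        if (s > best_speedup) {
            best_speedup = s;
            best = n;
        }
    }
    return best;
}
```

With a workload that is 90 percent parallel and an assumed overhead of 0.05 per extra core, `best_core_count(0.9, 8, 0.05)` settles on four cores; past that point the coordination cost outweighs the extra parallelism.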

You don’t have to take my word for Prism’s capabilities. CriticalBlue offers a free evaluation version that you can download. Check it out.
