
Programming and optimizing C code, part 2
This second of a five-part series shows how to optimize DSP "kernels," i.e., inner loops. It also shows how to write fast floating-point and fractional code.


DSP DesignLine

[Editor's note: Part 1 introduces the basic principles of writing C code for DSP. Part 3 will explain how to access DSP features like hardware loops and circular addressing from portable C. It will be published Monday, March 5. For more programming tips, see the DSP programmer's guide.]
DSP Kernels
In the past, the performance of signal processing applications rested heavily on hand-coded "kernels." These kernel loops consisted of specialized arithmetic that consumed most of the computational load. The rest of the application consisted of generic "control code." This control code consumed only a few MIPS, and it was written in C. This division between kernels and control code heavily influenced processor design. For example, DSPs will have a set of instructions and registers specialized for DSP kernels, and may have another set appropriate for control code.
Today, the distinction between DSP kernels and control code is blurring. Newer application kernels require all sorts of operations, including operations that were previously seen only in control code. In addition, compilers are penetrating into previously hand-coded areas. It is in these C-coded DSP kernels that performance must be considered most carefully.
Floating Point
DSP algorithms are usually conceived in floating point. Most design packages that emit C (such as Matlab) primarily use double precision floating point, but the majority of DSP platforms are fixed-point machines. In the past, this meant that the C code must be transformed from floating-point into fixed-point code. This transformation entails painstaking attention to scaling in order to preserve accuracy.
As DSPs have gotten faster, it has become practical to simply leave less-critical code in floating point in order to reduce development costs—even when the target DSP lacks native floating-point instructions. This has led to a re-evaluation of the floating-point support functions that vendors provide with their C compilers. In the past these functions were an afterthought, provided only to ensure code portability. Today, they are often carefully handcrafted. In the quest for speed, these libraries may even omit some aspects of the IEEE standard—such as standards-compliant processing of NaN values—which are mathematically useful but are seldom critical for DSP applications. This is illustrated in Figure 1, which shows reference IEEE-compliant functions for ADI's Blackfin on the right. The left-hand side shows highly optimized, non-compliant functions. (These are sample figures that do not show the entire range of performance.)

Figure 1. IEEE-compliant (right) vs. non-compliant (left) floating-point libraries.
Also consider whether you really need the 64-bit (or "double") precision that is the normal ANSI C portability standard. Many applications, such as those in the automotive and audio areas, only require 32-bit (or "float") precision. Using the lower precision can double your speed, whether you use native floating-point instructions or software emulation.
[Editor's note: For a great intro to floating-point arithmetic, see this tutorial.]
Fractional processing
Even if you use a high-speed library, native fractional arithmetic can be a hundred times faster than software-emulated floating point. Unfortunately, portable C has no fractional type. As a result, it is difficult to tell a standard ANSI compiler that you want fractional arithmetic.
To solve this problem, you can evolve the language, either by creating your own dialect of C or by working through an international standards committee. The problem with creating your own dialect is that your code is no longer portable. The problem with going through standards committees is that it can take decades for the world to adopt a new coding standard.
Another approach is to enhance the semantic capability of the compiler in the hope that it will comprehend that complex chunks of C correspond to fractional operations. This is challenging, but it can be done. We'll look at an example in the next section.
We can also offer intrinsics (or built-in functions), which map directly to single machine instructions. This produces a clumsy but efficient programming style. We'll look at an example in the following text.
Given the drawbacks to all of these approaches, it is tempting to use C++ instead of C. C++ allows the programmer to define new types and overloaded operators. This may appear to be a natural way to express fractional arithmetic. However, the semantic gap between expression and intention is wider in C++ than it is in C, and this approach requires very careful coding and analysis. The C++ language is more powerful than C, but that means it does more for you automatically. For instance, C++ compilers may unexpectedly create constructors and destructors. Also, C++ style involves more indirection, which can cause problems in a compiler's alias analysis. For example, C++ programs tend to produce more temporary variables and common subexpressions, which the compiler must then analyze. As another example, data tends to exist within structs or objects, rather than as stand-alone variables.


http://www.dspdesignline.com/showArticle.jhtml?articleID=197008568

An example of improved semantic analysis
In Figure 2 we implement a fractional operation using portable C. The line
y[i] += (scaler * x[i]) >> 15;
can compile to a single machine instruction (specifically, a fractional MAC) on most DSP architectures if the compiler is smart enough.

Figure 2. When constructed carefully, C code can compile very efficiently.
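For concreteness, here is a hedged reconstruction of what such a kernel can look like in full (the function name and Q15 framing are illustrative, not the figure's exact code):

```c
#include <stdint.h>

/* Scale a Q15 vector x[] by a Q15 coefficient and accumulate into
   y[]. A Q15*Q15 product is Q30; shifting right by 15 realigns it to
   Q15. A smart compiler can map the whole statement onto a single
   fractional MAC instruction. */
void scale_accum(int16_t *y, const int16_t *x, int16_t scaler, int n)
{
    int i;
    for (i = 0; i < n; i++)
        y[i] += (int16_t)(((int32_t)scaler * x[i]) >> 15);
}
```

With scaler = 16384 (0.5 in Q15), each output accumulates half of the corresponding input, entirely in integer arithmetic.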
This approach enables you to maintain complete portability, but it leaves open questions. Will we encounter overflow, or has the compiler planted saturating arithmetic code? Are we exploiting the extra bit width found in DSP accumulators, or do we need to scale the result every time around the loop? (Signed variables are sometimes easier for the compiler to work with in this context: C defines unsigned overflow to wrap around, while signed overflow is undefined behavior, which leaves the compiler more freedom to optimize.)
This approach works well when the loop is very small and simple. However, as you increase the number of adjacent fractional operations, you will eventually strain the compiler's ability to bridge the semantic gap.
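One portable way to exploit a wider accumulator is to keep the running sum in a 32-bit integer and rescale only once, after the loop, rather than on every iteration. A sketch under that assumption (the function name is illustrative):

```c
#include <stdint.h>

/* Q15 dot product accumulated in 32 bits. Each Q15*Q15 product is
   Q30; summing in a 32-bit accumulator preserves guard bits, and the
   result is shifted back to Q15 once at the end instead of inside
   the loop. */
int16_t dot_q15(const int16_t *a, const int16_t *b, int n)
{
    int32_t acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];   /* Q30 partial sums */
    return (int16_t)(acc >> 15);       /* rescale once */
}
```

On a DSP with a 40-bit accumulator, the compiler may keep acc in the accumulator register and fold the multiply and add into a single MAC per iteration.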
Intrinsics
Another approach is to use intrinsics, which are pre-defined primitives that give the compiler exact knowledge of the programmer's intention. They look like function calls, but they map to single instructions. Because the compiler understands the programmer's intent and the impact of the intrinsic, the flow of optimization can be maintained. You will find intrinsics (also called built-in functions) defined in compiler manuals and in supplied include files. Figure 3 shows an example use of intrinsics.

Figure 3. Intrinsics look like function calls, but map to single instructions. Here we see the add_fr1x32 and mult_fr1x32 intrinsics.
One downside of intrinsics is that the intrinsic names can be obscure and hard to remember. In addition, the code is often criticized as lacking in elegance.
Helpfully, a common set of intrinsics has appeared on several platforms, creating a de facto standard. These are the ITU/ETSI built-ins, which are shown in Figure 4. These intrinsics were defined by the European Telecommunications Standards Institute in order to specify GSM codecs. Using these built-ins is a quick way to get impressive performance on supporting platforms.

Figure 4. The ITU/ETSI intrinsics.
When evaluating intrinsics, check that a precise definition is supplied in C. This allows you to maintain portability to platforms lacking an implementation and to be sure of the definition.
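Such a reference definition is ordinary C. For example, a saturating 16-bit fractional add, in the spirit of the ETSI add() built-in, can be defined portably like this (an illustrative sketch, not the official reference code):

```c
#include <stdint.h>

/* Saturating Q15 addition: compute the sum in 32 bits, then clamp it
   to the 16-bit fractional range [-32768, 32767]. On a DSP this maps
   to one saturating add instruction; on other platforms it serves as
   the portable fallback definition. */
int16_t add_sat16(int16_t a, int16_t b)
{
    int32_t sum = (int32_t)a + b;
    if (sum > 32767)  sum = 32767;
    if (sum < -32768) sum = -32768;
    return (int16_t)sum;
}
```

Having this fallback means code written against the intrinsic still compiles and produces bit-exact results on a platform with no native support.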
How the compiler optimizes kernels
Unless you use profile-guided optimization or literal loop counts, the compiler will assume that your program spends most of its time in its inner loops. Poor performance may result if inner loops are constructed with low iteration counts.
Most inner loop optimization relies on loop unrolling. With loop unrolling, the compiler expands the code so that it can generate more opportunities for instruction rearrangement, avoid stalls, and combine operations to achieve parallel execution. The two basic techniques used to achieve these aims are software pipelining and vectorization.
Software pipelining allows work from separate iterations of the loop to be combined in a single machine cycle.

Figure 5. Software pipelining of a loop that contains load (L), multiply-accumulate (M), and store (S) operations. E.g., L1 indicates a load from loop iteration 1.
On the third iteration of this loop, we are storing the results of the first iteration (S1); doing the MAC operation from the second iteration (M2); and loading the data inputs for the third iteration (L3), all in a single cycle. Achieving this pattern is very significant for performance. The alternative—performing only one operation per cycle—is much slower.
[Editor's note: For more on loop unrolling and software pipelining, see this tutorial.]


Vectorization (also referred to as SIMD) means performing an operation on more than one element of data at a time. The performance benefits of this technique are obvious: If you process more data per cycle, you spend fewer cycles processing the data.
Figure 6 illustrates these concepts at the machine instruction level. First we have naïve code generation. We are loading each 16-bit data operand individually and then performing a multiply-accumulate. In the second version of the code the compiler has applied vectorization. Vectorization combines two iterations of this loop and requires each memory access to fetch 32 bits, or two data items. Finally we see the effect of software pipelining. The total effect is to accelerate the program by a factor of six. Note that the code size has expanded, as entry and exit processing are now required.

Figure 6. Three versions of the same code with varying degrees of optimization.
Speeding up your program will require you to inspect the generated code and check that these transformations have taken place wherever you hope to see them. If they haven't, ask yourself why not. Is the compiler missing information? Does the code structure prevent optimization? It is particularly important to look for the following:
  • Aliasing of the data. If two data buffers might overlap, the compiler will not vectorize the loop.
  • Data dependencies: If a calculation depends on a result from a previous loop iteration, the compiler will not vectorize the loop. We will look at a few examples in the next section.
  • Sequential memory access: If we're going to fetch two or more operands with a single load instruction, the operands must be arranged sequentially in memory.
  • Appropriate alignment of that data: Many machines only fetch data items from addresses which are multiples of the data size. If we want to load 32 bits at a time, for example, we should ensure that the data is aligned on 32-bit addresses.
  • Simplify the loop: Keep the loop short and remove conditional expressions and transfers of control.
  • Make the control flow transparent: Use constants or literals in loop control rather than variables or globals. In other words, use:
    for (n = 0; n < 100; n++)
    rather than something like:
    for (n=start; n<stop; n+=step)
    If the compiler can see how many times the loop will iterate, it can avoid costly vectorization when the loop count is low. If the increment is unknown at compile time, it might prevent the compiler from vectorizing the loop. Variable start and end values can also force the compiler to insert extra loop prolog and epilog code.
  • Going forwards rather than backwards through the data may have advantages on some processors.
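Putting several of these points together, a loop a compiler is likely to vectorize might look like the following sketch (restrict requires C99; keeping the buffers suitably aligned is the caller's responsibility, and the names are illustrative):

```c
#include <stdint.h>

#define N 256  /* literal trip count, visible at compile time */

/* Vectorization-friendly: the restrict qualifiers promise the
   buffers do not overlap, the trip count is a literal, the accesses
   are sequential with unit stride, and the body is short and
   branch-free. */
void vadd(int16_t *restrict y,
          const int16_t *restrict a,
          const int16_t *restrict b)
{
    int n;
    for (n = 0; n < N; n++)
        y[n] = a[n] + b[n];
}
```

With 16-bit data, a compiler targeting a machine with 32-bit loads can fetch two elements of a and two of b per load and perform two additions per cycle.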



Good and bad expressions
As noted above, data dependencies will prevent the compiler from vectorizing a loop. Figure 7 illustrates two common types of dependencies. Sometimes dependencies are unavoidable. This is particularly true for algorithms with tight feedback loops, such as IIR filters. However, programmers should avoid introducing dependencies unless they are absolutely necessary.

Figure 7. Operations that inhibit optimization.
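A loop-carried dependency of the kind Figure 7 warns about can be sketched in C as a running sum, the simplest relative of an IIR feedback loop (the name is illustrative):

```c
#include <stdint.h>

/* Each y[i] depends on y[i-1], so iteration i cannot begin until
   iteration i-1 has finished. The compiler must execute this loop
   serially; it cannot be vectorized. */
void running_sum(int16_t *y, const int16_t *x, int n)
{
    int i;
    for (i = 1; i < n; i++)
        y[i] = y[i - 1] + x[i];
}
```

If the algorithm permits, restructuring to remove the feedback (for example, computing outputs from the inputs alone) restores the compiler's freedom to vectorize.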
As a counterpoint, Figure 8 shows operations which lend themselves to optimization. In the reduction example, X appears on both the left and right of the expression, but the compiler can tolerate that this time because we're adding: addition and multiplication are arithmetically associative, which means the compiler can reorder them freely.

Figure 8. Optimization-friendly operations.
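A reduction in C looks like this (an illustrative sketch). Because addition is associative, the compiler is free to split the sum into several partial sums computed in parallel and combine them after the loop:

```c
#include <stdint.h>

/* Reduction: the accumulator appears on both sides of the
   assignment, but only as a running sum, so the compiler may keep
   multiple partial sums in vector lanes and add them together once
   the loop is done. */
int32_t sum16(const int16_t *x, int n)
{
    int32_t acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += x[i];
    return acc;
}
```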
Keep it simple
Perhaps the most important thing you can do to help your compiler is to keep your code simple. This makes it easy for the compiler to understand your intentions.
Don't unroll inner loops by hand. You might think that unrolling a loop would assist the compiler, but this can make it harder for the compiler to understand what is going on. This is illustrated in Figure 9. An exception can sometimes be made for unrolling outer loops.

Figure 9. Unrolling a loop by hand makes it harder for a compiler to understand your code.
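The contrast can be sketched as follows: both functions compute the same sum, but the plain loop gives the compiler a clean pattern to unroll on its own terms, while the hand-unrolled version buries that pattern under extra accumulators and cleanup code (an illustrative sketch, not the figure's exact code):

```c
#include <stdint.h>

/* Preferred: a plain loop the compiler can unroll, vectorize, and
   software pipeline as it sees fit. */
int32_t sum_simple(const int16_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Discouraged: hand-unrolled by four. The multiple accumulators and
   the leftover-element cleanup loop obscure the simple access
   pattern, making the compiler's analysis harder rather than
   easier. */
int32_t sum_unrolled(const int16_t *x, int n)
{
    int32_t a0 = 0, a1 = 0, a2 = 0, a3 = 0;
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    for (; i < n; i++)
        a0 += x[i];
    return a0 + a1 + a2 + a3;
}
```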
Similarly, do not software pipeline loops by hand. Figure 10 shows why this is a problem. On the left is a simple loop that a compiler can readily understand. On the right we see what happens when the programmer tries to do the compiler's work: it is easy to see why a compiler would have difficulty recovering the programmer's intent.

Figure 10. Pipelining by hand makes it harder for a compiler to understand your code.
Part 3 will explain how to access DSP features like hardware loops and circular addressing from portable C. It will be published Monday, March 5.

