


Programming and optimizing C code, part 3
This third installment in a five-part series explains how to access DSP features such as hardware loops and circular addressing from portable C. It also shows how to use pragmas and inline assembly.


DSP DesignLine

[Editor's note: Part 2 shows how to optimize DSP kernels (i.e., inner loops), and how to write fast floating-point and fractional code. Part 4 explains why it is important to optimize "control code," and shows how to do so.]
To optimize your code, the compiler analyzes your program and makes deductions. Compilers often attempt optimization of wide scope, so this analysis can span multiple functions, multiple files, or even the entire application. This approach gives the compiler maximum flexibility, but it can slow compile times. More importantly, such schemes can be badly affected by aliasing and by flow-of-control uncertainty.
Thus, you must consider what the compiler can deduce about your program. You can often improve compiler performance by simplifying the code and by adding information. Information can be added to a program at a high level by manipulating compilation options. It can also be added at a low level by recoding and adding informative statements. We will look at this latter approach in this article.
Pointers or Arrays
A common question is whether data access in tight loops should be coded using pointers or arrays. To a modern optimizing compiler, it doesn't make a lot of difference. Arrays have the useful property of an implied length, which helps the compiler prove that two data references definitely do not overlap, something that is often vital for aggressive optimization. On the other hand, pointers are closer to the hardware.

Figure 1. Alternatives for addressing in tight loops.
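Figure 1 itself is not reproduced here; a minimal sketch of the two alternatives, with illustrative names, might look like this:

```c
#include <stddef.h>

/* Array style: the implied bounds make it easier for the compiler to
   prove that the two data references never overlap. */
int dot_array(const short a[], const short b[], size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Pointer style: closer to the hardware's post-increment addressing. */
int dot_pointer(const short *a, const short *b, size_t n) {
    int sum = 0;
    while (n--)
        sum += *a++ * *b++;
    return sum;
}
```

With a modern optimizing compiler, both forms typically compile to essentially the same inner loop.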
Loop Structure
It often helps to experiment with the loop structure of the program. Compilers focus their optimization efforts on the inner loop, so pulling operations from the outer loop into the inner loop can improve performance. Sometimes you can even collapse the inner and the outer loop into a single loop. The downside to this approach is that the single remaining loop may be too complicated for the compiler to optimize.
You can also reverse the order of the loop nest, that is, you can make the inner loop become the outer loop. One reason to make this transformation is to achieve sequential data access. As we mentioned in the last article, sequential data access is a requirement for vectorization. In order to achieve sequential access patterns, you may need to re-order the data after reversing the order of the loop nest.
To illustrate this point, consider the loops in Figure 2. In the original form, the access pattern to the array "Input" has us striding through the data with a large gap between successive elements, so it is impossible to load multiple data elements simultaneously. If we notice that the accumulation is an associative operation, however, we can reorder the loops. In the second form, we step through the data sequentially, controlled by "k", and get about six times the performance on a Blackfin processor.

Figure 2. Two equivalent loops with non-sequential (top) and sequential (bottom) data access. The original form moves through memory jumping "NC" words at a time, which prevents the compiler from vectorizing.
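The figure's code is not reproduced, so here is a representative sketch of the transformation (the names "Input", "NC", and "k" follow the text; the sizes are illustrative):

```c
#define NR 4   /* rows - illustrative size */
#define NC 8   /* columns - illustrative size */

/* Original form: the inner loop strides through Input NC words at a
   time, which prevents loading multiple elements at once. */
void col_sums_strided(const short Input[NR][NC], int Out[NC]) {
    for (int j = 0; j < NC; j++) {
        Out[j] = 0;
        for (int i = 0; i < NR; i++)
            Out[j] += Input[i][j];      /* stride of NC between accesses */
    }
}

/* Interchanged form: the inner loop, controlled by k, walks memory
   sequentially, so it is a candidate for vectorization. */
void col_sums_sequential(const short Input[NR][NC], int Out[NC]) {
    for (int k = 0; k < NC; k++)
        Out[k] = 0;
    for (int i = 0; i < NR; i++)
        for (int k = 0; k < NC; k++)    /* unit stride */
            Out[k] += Input[i][k];
}
```

Because addition is associative (ignoring overflow), both forms compute the same column sums; only the access order changes.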
When looking at loop structure, keep in mind that memory access costs can be significant. If a number of loops access the same data in external memory, it may make sense to unify those loops into one large loop. This will allow you to fetch the data only once.
If you still don't get the desired vectorization, it is time to explore vectorizing built-ins: intrinsic functions that tell the compiler explicitly to perform a dual (two-elements-at-a-time) operation.
Compilers often fail to vectorize operations because they cannot determine if it is safe to do so. With built-ins, the compiler does not have to work out that vectorization is safe: You have told it explicitly how to proceed.
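The built-in from the original article is not reproduced here. As a portable illustration only, the following sketch emulates what a dual 16-bit add built-in (such as Blackfin's add_fr2x16) accomplishes; dual_add16 is a hypothetical name, and unlike the real fractional intrinsic it wraps rather than saturates:

```c
#include <stdint.h>

/* Hypothetical stand-in for a dual-add built-in: adds the high and low
   16-bit lanes of two packed 32-bit words in one operation. On a DSP,
   the real intrinsic maps to a single instruction. */
uint32_t dual_add16(uint32_t a, uint32_t b) {
    uint32_t lo = (a + b) & 0xFFFFu;                 /* low lane  */
    uint32_t hi = ((a >> 16) + (b >> 16)) & 0xFFFFu; /* high lane */
    return (hi << 16) | lo;
}
```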


Hardware Loop Constructs
Many DSPs have a zero-overhead hardware loop feature. This specialized type of branch is much more efficient than a standard conditional branch. When a compiler emits such a specialized construct, it indicates that the loop is well understood. This is an important sign to watch for: If a compiler understands the loop well enough to use hardware looping, it is likely to go on to vectorization and software pipelining.
What might stop the compiler from emitting a hardware loop instruction? Start by looking for function calls with unknown side effects inside the loop. (Pre-defined support functions are acceptable, as are simple local functions where the compiler can see exactly what goes on inside the function.) In addition, there must be no transfer of control into the loop, other than the normal entry. (You can usually get away with jumping out of a loop, although that adds conditional jumps with their own overheads.) It is also important to note that DSPs support only a few hardware loops, so only the inner-most loops will make use of this feature.
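A sketch of the contrast (function names are illustrative; which loops actually become hardware loops depends on the compiler and target):

```c
#include <stddef.h>

/* A counted loop with no calls and no abnormal control flow: a good
   candidate for a zero-overhead hardware loop. */
int sum_plain(const int *x, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Defined here only so the sketch links; imagine it lives in another
   file, so its side effects are unknown to the compiler. */
int log_sample(int v) { return v; }

/* The opaque call in the body typically blocks the hardware loop. */
int sum_logged(const int *x, size_t n) {
    int s = 0;
    for (size_t i = 0; i < n; i++)
        s += log_sample(x[i]);
    return s;
}
```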
Alignment of Data
The compiler will usually lay out the data to assist you. For instance, all top-level arrays will be allocated on four-byte boundaries. A common mistake is to clutter the data. For example, programmers often place flags or other control data at the top of a data array, with the real data following. Then the programmer passes the data as "&Buffer[n]" to the function. That may confuse the compiler; keeping data uniform and passing the zeroth address is much clearer.
The payoff from aligning data is so great that the compiler may go to great lengths to ensure that data is aligned. For instance, if the Blackfin compiler is not certain of the data alignment, it may generate two versions of a loop, prefaced by a run time test of alignment to decide which to execute. The performance boost is well worth the extra code if you don't know the alignment ahead of time, but it is better to clarify things for the compiler and avoid the extra code.
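A sketch of the layout mistake described above, with illustrative names and sizes:

```c
#include <stdint.h>

#define N 64   /* illustrative buffer length */

/* Cluttered layout: control flags share the array with the samples, so
   callers pass something like &Buffer[2] - an interior address whose
   alignment the compiler can no longer easily prove. */
int16_t Buffer[2 + N];

/* Uniform layout: control data lives apart, the samples start at the
   zeroth address, and the alignment is explicit. (Top-level arrays are
   normally placed on four-byte boundaries anyway; _Alignas makes it
   portable and certain for this sketch.) */
int16_t Flags[2];
_Alignas(4) int16_t Samples[N];
```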
Data Layout
You should also think about the layout of multi-part data. For instance, suppose you are working with complex numbers comprising 16-bit real and imaginary parts. You could declare the real and imaginary parts as two separate arrays. The problem with this approach is that it requires two separate loads every time you want to access a complex number. A better approach is to interleave the parts, declaring the data as an array of two-field structures. With this approach, you can load both the real and imaginary parts with a single 32-bit load.
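The article's own declarations were not reproduced; representative versions of the two layouts look like this:

```c
#include <stdint.h>

#define N 128   /* illustrative buffer length */

/* Split layout: every complex access needs two independent 16-bit loads. */
int16_t real_part[N];
int16_t imag_part[N];

/* Interleaved layout: both parts sit in one 32-bit word, so a single
   32-bit load fetches a whole complex value. */
typedef struct {
    int16_t re;
    int16_t im;
} complex16;

complex16 signal[N];
```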
It is also important to construct two-dimensional and higher arrays carefully. A common way to construct a 2D array is to use malloc(). With this approach, you create a column of pointers to malloc'ed rows of data. This allows complete flexibility in row and column sizes, as well as in where the array is stored. However, this approach prevents the compiler from understanding how one row relates to another: it doesn't know whether rows are adjacent in memory, or even whether they have a constant offset between them. The compiler has an easier time optimizing the code if you use a straightforward two-dimensional declaration of the array.
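A sketch of the two constructions (names and sizes are illustrative; error checking is omitted for brevity):

```c
#include <stdlib.h>

/* Pointer-to-rows construction: flexible, but the compiler cannot tell
   whether rows are adjacent or even a constant distance apart. */
short **make_matrix(int rows, int cols) {
    short **m = malloc(rows * sizeof *m);
    for (int r = 0; r < rows; r++)
        m[r] = malloc(cols * sizeof **m);   /* each row may land anywhere */
    return m;
}

/* Flat declaration: element [r][c] sits at a known constant offset, so
   the compiler understands the entire layout. */
short matrix[16][32];
```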
Aliasing
The biggest cause of problems for the compiler is data aliasing. Aliasing means that multiple variables refer to the same data. For example, two buffers might overlap, or two pointers might point to the same data. In such cases, changes to one data item have to be treated as potentially affecting other data items. In the worst case, a single pointer with an unknown address may disrupt all of the compiler's assumptions: data held in registers may have been altered in memory and must be expensively reloaded before its next use. Aliasing is generally unintended, so the compiler must make conservative assumptions about variables that might be aliased. Pointers coming from outside a function, in the form of arguments or globals, are regarded with suspicion, as are pointers that serve several purposes. Reference parameters in C++ are also suspect.
When the compiler makes conservative assumptions about aliasing, it has a huge negative impact on performance. This is as it should be: compilers can seem extraordinarily pessimistic, but however remote the danger, correctness always trumps performance. Thus, it is critical to ensure that the compiler knows, with complete confidence, which variables reference which data.
Aliasing applies to more than overlapping data buffers. For instance, if a global is used in a loop, the compiler will handle any writes in that loop conservatively in case they over-write the global variable. Using a global loop control variable has a particularly disastrous impact on performance.
Local variables are easier for a compiler to analyze. The compiler knows that something on the local stack frame cannot overlap data in global space or alias an address passed in from outside. A compiler can be confident that the address of a local variable has never been taken; it cannot follow the same logic for a global variable.
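One practical consequence: copy a suspect global into a local before the loop, since the compiler can reason about the local freely. A sketch with illustrative names:

```c
int g_count;            /* global: every store in the loop might alias it */

/* Slow pattern: g_count may have to be re-read on each iteration in
   case the stores to out[] overwrite it. */
void scale_global(int *out, const int *in) {
    for (int i = 0; i < g_count; i++)
        out[i] = in[i] * 2;
}

/* Faster pattern: a local copy cannot be aliased by the stores. */
void scale_local(int *out, const int *in) {
    int n = g_count;    /* read once; the local is trivially alias-free */
    for (int i = 0; i < n; i++)
        out[i] = in[i] * 2;
}
```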


Pragmas
One way to work around these problems is to use pragma statements. Many compilers support pragmas, although they often choose different names for the same concepts. Fortunately, compilers simply ignore pragmas they do not recognize. Thus, pragmas are a useful way to optimize code while maintaining portability.
Consider the powerful "no_alias" loop pragma from the Blackfin compiler. This pragma tells the compiler that none of the loads and stores in the loop that follows will overlap.
Of course, if this assertion is untrue, you will get incorrect answers from your program. Another way of achieving the same effect is to use the standard C qualifier "restrict" on the declarations. Applied to a pointer, "restrict" promises that the data it accesses is not also accessed through some other pointer.
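The pragma spelling is Blackfin-specific, but the equivalent standard C99 form looks like this:

```c
#include <stddef.h>

/* restrict promises the compiler that out, a, and b never overlap, so
   it is free to reorder, vectorize, and software-pipeline the loop. */
void vec_add(short *restrict out,
             const short *restrict a,
             const short *restrict b,
             size_t n) {
    for (size_t i = 0; i < n; i++)
        out[i] = a[i] + b[i];
}
```

As with the pragma, passing overlapping buffers breaks the promise and yields undefined behavior.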
To assist vectorization, the Blackfin compiler has pragmas such as "vector_for", which tells the compiler it is safe to execute multiple iterations of the loop in parallel.
As noted earlier, data alignment is critical for optimization. Thus, it can be very useful to assert data alignment to the compiler. On Blackfin, you can use the "all_aligned(n)" pragma. Omitting the "n" variable in this pragma tells the compiler that all of the data is aligned. (On StarCore DSPs, "#pragma align *p 8" has a similar effect.) Setting "n" to a non-zero value tells the compiler that the data will be aligned after n iterations of the loop. This lets the compiler generate efficient prolog and epilog code.
You will always get the best performance if you code your for loops with literal bounds. However, if you need the generality of variables, several compilers let you supply the missing trip-count information explicitly. The Blackfin compiler supports a "loop_count(min,max,mod)" pragma.
The first parameter is the minimum trip count: the loop will always iterate at least this many times. The middle parameter is the maximum trip count, which helps the compiler decide whether it is worth making a big effort with this loop. Finally, there is the trip modulo (the step size between possible trip counts), which is needed for software pipelining and vectorization. (On the TI C6x series, the very similar "#pragma MUST_ITERATE(min, max, multiple)" has the same effect.)
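A sketch of the pragma in use, with the Blackfin spelling described above (compilers that do not recognize it simply ignore it; the bounds here are illustrative):

```c
/* The trip count arrives as a variable, so the pragma supplies what a
   literal bound would have told the compiler: at least 8 iterations,
   at most 256, and always a multiple of 4. */
void halve(short *restrict out, const short *restrict in, int n) {
#pragma loop_count(8, 256, 4)
    for (int i = 0; i < n; i++)
        out[i] = in[i] >> 1;
}
```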
Qualify Memory
Any real application will ultimately derive its input from I/O or DMA transfers. Applications may also share data or manipulate memory-mapped registers. An optimizing compiler must be advised that such data may change underneath it. This is done with the "volatile" qualifier, which tells the compiler that the data may change unexpectedly and must be genuinely reloaded before each use. Compiler optimizers have become so aggressive that a loop which appears to do nothing will simply be eliminated. Indeed, forgetting the volatile qualifier is the most commonly reported programmer error in the Blackfin processor environment.
Conversely, mark data "const" if it is unchanging.
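A sketch of both qualifiers in use (the register name and address are illustrative, not a real memory map):

```c
#include <stdint.h>

/* Memory-mapped status register: volatile forces a genuine load on
   every read, so the polling loop below cannot be optimized away. */
#define RX_READY  (*(volatile uint32_t *)0xFFC00500u)  /* illustrative address */

/* Coefficients that never change: const lets the compiler cache them
   freely and may let the linker place them in ROM. */
const int16_t taps[4] = {3, -1, -1, 3};

void wait_for_sample(void) {
    while ((RX_READY & 1u) == 0)
        ;   /* without volatile, this loop could be deleted entirely */
}
```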
The Circular Addressing Pattern
Many DSP platforms can perform circular addressing very efficiently. Circular addressing is useful for addressing filter coefficient arrays and for other common tasks. The compiler is unlikely to recognize this style of addressing in basic "if then else" statements, but the portable use of a modulo operator ( array[i%n] ) can work. This is a little dangerous, as most DSP machines lack a division instruction: if the compiler does not pick up on the circularity, you will spend a hundred cycles in a division support function, so you should check the generated code to make sure this has not happened. If the compiler supports circular addressing but is not using it, you can use options such as "force-circbuf" on the Blackfin compiler, or use explicit built-in functions.
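The modulo idiom in portable form; the second variant is a common defensive alternative that never needs division when the buffer length is a power of two (names and length are illustrative):

```c
#include <stddef.h>

#define N 8   /* power-of-two length makes the wrap a cheap AND */

/* Portable circular read: i % N may become a circular-addressing mode,
   or a costly division call if the compiler misses the idiom. */
short delay_mod(const short buf[N], size_t i) {
    return buf[i % N];
}

/* Power-of-two wrap: i & (N - 1) is equivalent and never divides. */
short delay_mask(const short buf[N], size_t i) {
    return buf[i & (N - 1)];
}
```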
Specialized Instructions
There are usually some DSP idioms which the compiler will not understand. For instance, most DSPs provide highly complex instructions for FFT and Viterbi algorithms. The compiler is unlikely to recognize the C code that corresponds to these instructions. However, DSP vendors typically supply built-ins that map to these instructions. It is important to review the documentation and header files and make these part of your programming vocabulary. Examples from the Blackfin processor include a Viterbi support instruction and a Bitmux instruction. Correct use of these can make a critical loop go ten or twenty times faster.
Figures 3 and 4 show another example: many machines have an instruction for counting bits. This is a real customer example from the Blackfin processor. In Figure 3, the programmer can see that the code is counting the occurrence of bits in a word, but the compiler doesn't pick up on this. If you use the correct intrinsic, as shown in Figure 4, the loop goes 60 times faster.

Figure 3. In the original code, the compiler will not make use of the processor's specialized hardware.

Figure 4. The code now uses a built-in function that maps to a single, complex instruction.
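The figures themselves are not reproduced here. The transformation they describe can be sketched with GCC's __builtin_popcount standing in for the Blackfin intrinsic (the actual Blackfin built-in has its own name):

```c
/* Figure 3 style: a portable bit-counting loop that the compiler is
   unlikely to map onto a single population-count instruction. */
int count_bits_loop(unsigned w) {
    int n = 0;
    while (w) {
        n += w & 1u;
        w >>= 1;
    }
    return n;
}

/* Figure 4 style: one intrinsic that maps to one instruction.
   __builtin_popcount is the GCC/Clang spelling; DSP compilers expose
   an equivalent built-in under their own names. */
int count_bits_intrinsic(unsigned w) {
    return __builtin_popcount(w);
}
```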
Inline Assembler
If all else fails, you can select a few functions to implement in assembler. However, this is usually a time-consuming endeavor. As a more restricted alternative, many compilers support inline asm statements, usually modeled on the format pioneered by GNU.
Figure 5 shows an example use of inline assembly. This is from Ogg Vorbis, which requires a fractional, or widening, 32-bit multiplication. Neither the C language nor the Blackfin processor supports this directly. As a result, the C implementation of this operation required the use of 64-bit integers, from which the desired values were extracted. In contrast, the inline asm implements the required operation in 14 percent of the cycles of the C approach. The invocation of this code is kept readable by defining it as a function.

Figure 5. A "widening" 32-bit multiply specified using inline assembly
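The asm body itself is Blackfin-specific and not reproduced here; the portable 64-bit C fallback it replaces looks like this (written in the style of Tremor's MULT31 idiom, a Q31 fractional multiply):

```c
#include <stdint.h>

/* Portable fractional (Q31) multiply: widen both operands to 64 bits,
   multiply, then shift the product back down into Q31 format. The
   Blackfin inline-asm version does the same work in far fewer cycles. */
int32_t mult31(int32_t x, int32_t y) {
    return (int32_t)(((int64_t)x * y) >> 31);
}
```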
Compilers don't actually parse inline assembly text; they just pass it on to the assembler. To the compiler, these inserted machine instructions may invalidate many assumptions and disrupt the flow of optimization. However, the Blackfin compiler supports a "regs_clobbered" pragma that lets the programmer tell the compiler exactly which registers are affected by the asm() statement, and how memory is modified. Without this information, the use of inline assembler is often counter-productive.
Part 4 will explain why it is important to optimize "control code," and show how to do so. It will be published Monday, March 12.

