
Programming and optimizing C code, part 4
This fourth in a five-part series explains why it is important to optimize "control code," and shows how to do so.


DSP DesignLine

[Editor's note: Part 3 explains how to access DSP features like circular addressing from portable C. It also shows how to use pragmas and inline assembly. Part 5 shows how to optimize memory performance, and how to make speed vs. size tradeoffs.]
To most DSP programmers, "code optimization" implies optimization of inner loops. However, there can be more to application performance than tightly vectorized inner loops.
Firstly, embedded applications include an expanding range of algorithms, and a surprising variety of expressions are creeping into critical loops. In many cases, these new types of expressions cannot be easily vectorized.
Secondly, there is a tradition in the DSP world that the bulk of the application, or "control code," only matters insofar as its code size, but sometimes this bulk can account for a significant percentage of the execution time. "Control code" mainly refers to decision-making code (such as switch statements), but it can also be thought of as "generic" C code. This type of code is found in all applications—not just DSP applications—so all the normal optimization techniques from the computing mainstream apply here.
These factors are becoming more important as DSP pipelines grow deeper. Deeper pipelines create longer branch delays, which makes decision-making code increasingly costly. In addition, DSP-oriented processors often skimp in areas such as division and speculation (for example, they often lack the speculative execution hardware found in general-purpose processors). All of this presents new opportunities for optimization.
Conditionals
The most important aspect of control code is decision making. Decision-making code is full of conditional expressions, and its performance depends on the behavior of the branches associated with those expressions. It is faster not to take branches. If the processor has branch-prediction hardware, a correctly predicted branch is faster than an incorrectly predicted one. Thus, control code executes faster if the processor can avoid taking branches—or at least predict them correctly—most of the time.
To achieve this goal, compilers use profile-guided optimization. With this approach, the toolset uses multiple builds and runs to analyze the application's behavior. Each run generates a complete trace of the program path showing every transfer of control. ("Compiled simulation" techniques are very useful here, as they offer speeds hundreds of times faster than conventional simulators.) Once the compiler knows (statistically) which way every branch went, the program is recompiled to use the most efficient branches. This optimization can easily result in a 10 to 15% improvement in control code performance.
Of course, the program path may depend upon the training data you provide. In general, this means that you want to use training data that represents a typical case. In some application areas, however, it is better to use specialized test data. For example, it may be desirable to accelerate the worst-case path rather than the average performance.
The results of control flow optimization can be disorienting because the compiler will aggressively rearrange the program. On many DSPs, the cheapest control flow decision is to not take the branch and drop through to the following instruction instead. On machines like Blackfin, even correctly predicted jumps cost a few stall cycles, so it is better not to take branches at all. To achieve this goal the compiler may have to substantially rearrange your code, so large blocks of assembly will move around after this optimization.
If you can't use simulation to build a trace then you can still intervene locally in key areas. You can insert "expected true" and "expected false" into your C to tell the compiler which way you expect branches to go:
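A minimal sketch of such a hint, assuming a GCC/Clang-style toolchain where the hints map onto __builtin_expect (the expected_true/expected_false names follow the article; the process function and its logic are illustrative):

```c
/* Branch hints: on GCC/Clang they are spelled __builtin_expect;
   the expected_true/expected_false names used in the article can
   be defined on top of it. */
#define expected_true(x)   __builtin_expect(!!(x), 1)
#define expected_false(x)  __builtin_expect(!!(x), 0)

int process(int status)
{
    if (expected_false(status < 0)) {
        /* rare error path: the compiler moves it out of line */
        return -1;
    }
    /* common path: falls through with no taken branch */
    return status * 2;
}
```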

Here we see it used with an error call. In most cases the code will proceed without executing the error routine. The "expected_false" tells the compiler that calling the error function is rare. As a result, the compiler rearranges the code as follows:
<Compare>
<Conditional branch to X>
Y: <the rest of the function>

X: <Call error routine>
<Jump Y>
In most cases we have a zero cost drop through to Y. The rearrangement also benefits the cache. Because the most-frequently executed code is now lumped together with a linear control flow, cache misses are less likely.
It is also possible to remove some conditional expressions altogether. For instance, Blackfin offers min, max, and abs machine instructions. These instructions provide the same functionality as a simple "if" statement, as illustrated in Figure 1.

Figure 1. Simple conditional expressions (left) can often be replaced by single machine instructions (right).
In this example, we remove a conditional branch, which might have taken many cycles, and replace it with a single cycle machine instruction. It also helps that the code now takes the same number of cycles whether the condition is true or false.
When using this trick, make sure you know what data types the machine instructions support. For example, the min and max instructions on Blackfin are defined for signed 32-bit values. The compiler will easily perform the transformation for 32-bit signed data, but it may require assistance from the programmer to handle 16-bit and/or unsigned data.
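A sketch of the kind of source Figure 1 has in mind; the function names are illustrative, and the mapping to MIN/ABS instructions depends on the compiler:

```c
/* Simple "if" patterns that a compiler can map onto single
   Blackfin MIN/MAX/ABS instructions (for signed 32-bit data). */
int clip_high(int x, int limit)
{
    /* equivalent to: if (x > limit) x = limit;
       recognizable as a MIN instruction */
    return (x < limit) ? x : limit;
}

int iabs(int x)
{
    /* equivalent to: if (x < 0) x = -x;
       recognizable as an ABS instruction */
    return (x < 0) ? -x : x;
}
```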

If you cannot remove a conditional expression, try to reduce the number of times you execute it. For example, if a conditional expression is inside a "for" loop, restructure the code to move the conditional expression outside the loop. This technique is illustrated in Figure 2. Note that this technique requires you to duplicate the "for" loop, which increases code size.

Figure 2. If a conditional expression is inside a "for" loop (top), restructure the code to move it outside the loop (bottom).
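The Figure 2 transformation (often called loop unswitching) can be sketched as follows; the names and loop bodies are illustrative:

```c
/* Before: the condition is tested on every iteration. */
void scale_before(int *a, int n, int doubled)
{
    for (int i = 0; i < n; i++) {
        if (doubled)
            a[i] *= 2;
        else
            a[i] += 1;
    }
}

/* After: the condition is hoisted out of the loop. The loop body
   is duplicated, which is the code-size cost noted in the text. */
void scale_after(int *a, int n, int doubled)
{
    if (doubled)
        for (int i = 0; i < n; i++) a[i] *= 2;
    else
        for (int i = 0; i < n; i++) a[i] += 1;
}
```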
Many processors also have fast conditional instructions called predicated instructions. For example, Blackfin supports a register move instruction that executes only if the "conditional code" (CC) flag is set:
< Perform an operation that sets the CC flag>
IF CC Rx = Ry; // Predicated instruction: perform a register move if CC is set
These instructions greatly speed up the execution of conditional expressions. You should look for them in the machine instructions the compiler generates. If you don't see them, think about what might have blocked the compiler and consider whether you can help the compiler by shuffling the code around. For example, suppose a condition determines which of two expressions to evaluate, as illustrated in Figure 3. In order to use predicated instructions, you need to evaluate both expressions. At first glance, it might appear that it would take longer to execute both expressions. However, doing so eliminates branches, which results in a significant cycle savings.

Figure 3. An "if-else" expression (top) can be broken up into multiple expressions that map easily into predicated instructions (bottom).
Compilers are reluctant to make this kind of rearrangement for fear of side effects from the expression that normally would not be executed. For example, what if executing this expression causes an address error? This is why the programmer often needs to manually rearrange the code.
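A sketch of the Figure 3 rearrangement, with hypothetical names; both arms here are side-effect free, so evaluating both is safe:

```c
/* Before: one arm is branched over. */
int blend_branchy(int c, int x, int y)
{
    int r;
    if (c)
        r = x + y;
    else
        r = x - y;
    return r;
}

/* After: both expressions are evaluated unconditionally, then one
   result is selected. The select can map onto a predicated move
   (IF CC Rx = Ry on Blackfin) instead of a branch. */
int blend_predicated(int c, int x, int y)
{
    int sum  = x + y;
    int diff = x - y;
    return c ? sum : diff;
}
```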
Figure 4 shows an example of creative thinking about removing conditional branches. This example comes from real application code. The code checks a Boolean array (i.e., an array of ones and zeros) called "KeyArray." Based on the value read from KeyArray, it either adds or subtracts buffer[i+10]*64 from a sum.
Instead of holding ones and zeros, KeyArray can hold +64 and -64 values. This lets you create a loop without any conditional expressions. This is an example of a transformation that a compiler won't do for you: the compiler is not going to change the contents of your data arrays!

Figure 4. A conditional expression can be avoided by changing the values in "KeyArray" from ones and zeros (top) to +64 and -64 (bottom).
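A sketch of the Figure 4 transformation, reconstructed from the description above (the original listing is not shown, so the exact details are assumptions):

```c
/* Before: KeyArray holds 1 or 0, and a branch decides whether to
   add or subtract buffer[i+10]*64. */
int sum_branchy(const int *KeyArray, const int *buffer, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++) {
        if (KeyArray[i])
            sum += buffer[i + 10] * 64;
        else
            sum -= buffer[i + 10] * 64;
    }
    return sum;
}

/* After: KeyArray holds +64 or -64, so the decision becomes a
   plain multiply and the loop has no conditional at all. */
int sum_branchless(const int *KeyArray, const int *buffer, int n)
{
    int sum = 0;
    for (int i = 0; i < n; i++)
        sum += buffer[i + 10] * KeyArray[i];
    return sum;
}
```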
One place you get an unexpected conditional is on entry to a standard C "for" loop. The loop hardware in DSPs is not usually an exact match for the C definition of "for" loops: a "for" loop in C may iterate zero times, whereas DSP loop hardware assumes the loop will iterate at least once. The result of this mismatch is that the compiler has to plant guard code in front of loops to test whether the loop will iterate at all. You can avoid this by using constant loop bounds or by using the loop_count pragma.
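A sketch of both options; the loop_count pragma syntax shown is the VisualDSP++ form (minimum count, maximum count, count modulo) and may differ on other toolchains, which will typically just ignore the unknown pragma:

```c
/* A run-time trip count: n could be zero, so the compiler must
   plant a guard test before committing to a hardware loop. */
void scale(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] *= 2;
}

/* Telling the compiler the loop runs at least 4 times lets it
   drop the guard code. (VisualDSP++-style pragma; assumption.) */
void scale_hinted(int *a, int n)
{
#pragma loop_count(4, 256, 4)
    for (int i = 0; i < n; i++)
        a[i] *= 2;
}
```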


Division
Division is often considered more of a control code item than a DSP essential. Thus, DSPs often lack hardware support for single-cycle division. For instance, Blackfin offers a division-acceleration instruction, but even with this acceleration it takes several cycles to complete a division. Therefore, you should get division out of loops wherever possible. You should also bear in mind that the modulus operator (%) implies division.
If you are dividing by a power of two, you can get rid of the division simply by converting it into a right shift. However, this trick doesn't always work with a signed dividend. Consider the operation (-1)/4. The correct answer is zero, but a simple signed shift returns -1. To avoid this problem, the compiler inserts the expression shown in Figure 5. This is another case where it's worth thinking about whether you should use signed or unsigned variables.

Figure 5. The compiler will insert a sign check for signed division-by-shifting. Code example is for a Blackfin processor.
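A C-level sketch of the Figure 5 fix-up for a divisor of 4, assuming an arithmetic right shift for signed values (as on Blackfin and most compilers); the function name is illustrative:

```c
/* Signed division by shifting: a plain shift rounds toward minus
   infinity, but C division truncates toward zero. Biasing negative
   dividends by (divisor - 1) before the shift makes them agree. */
int div4_signed(int x)
{
    if (x < 0)
        x += 3;      /* bias: (1 << 2) - 1 */
    return x >> 2;   /* now matches x / 4 for all inputs */
}
```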
Even for an optimization as simple as division-by-shifting, the compiler must have full visibility into the nature of the divisor. If the divisor is a constant or a literal, the compiler can see whether it's a power of two. But what if it's a variable or a global? Is there any way you can expect the compiler to know what's in that variable? Should you simplify the program? If you have two possible values for the divisor, say 8 and 16, it may be worth specializing into two constant-divisor divisions and choosing between them at run time.

On Blackfin, the division support has two implementations, depending on the precision that's required. Roughly speaking, if the result and the divisor fit into 16 bits, the division will take about 40 cycles. However, if you need a full 32-bit division, it could take as long as 400 cycles. So once more, deciding whether we really need 32-bit precision is important. It may be worth scaling the values before the division so that you get the 16-bit case. As this could involve loss of precision, you should be careful about making this optimization.
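The divisor specialization suggested above can be sketched like this (the function name is hypothetical); each constant division can then be compiled as a cheap shift sequence:

```c
/* The divisor is known to be either 8 or 16, so branch once and
   let the compiler turn each constant division into shifts
   (with the sign fix-up for negative dividends). */
int div_8_or_16(int x, int divisor)
{
    if (divisor == 8)
        return x / 8;
    else
        return x / 16;
}
```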
Figure 6 shows a real-world code example where we can get rid of division. This example also illustrates the point that the programmer must consider the need for precision. In this example we are comparing two ratios. This can be made to go hundreds of times faster by using high school algebra: if you multiply both sides of the equation by the divisors, the equation is still true. This replaces expensive division with cheap multiplication.

Figure 6. An example where division (top) can be replaced by multiplication (bottom).
However, this only works if we avoid overflow. If the inputs are 12-bit values, for example, the operations will not overflow: the results of the multiplies are 24-bit values that fit in a 32-bit register. What if the inputs had 17 bits each? Then we would have 34-bit results and a potential overflow. To avoid overflow, the compiler acts conservatively. It only considers the standard precisions of 8, 16, 32, and 64 bits; it won't know if a 16-bit short happens to hold 12-bit data. Thus, the compiler will not make the algebraic substitution above.
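A sketch of the Figure 6 substitution; it assumes both denominators are positive and the products fit in 32 bits (e.g. 12-bit inputs), and the function name is illustrative:

```c
/* Compare a/b > c/d without dividing: multiply both sides by the
   (positive) denominators. Two cheap multiplies replace two slow
   divisions. Valid only while a*d and c*b cannot overflow. */
int ratio_greater(int a, int b, int c, int d)
{
    return (a * d) > (c * b);
}
```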
Integer size
Many processors today are fusing aspects of DSP and microcontroller, but they don't have the instruction-set space to be complete in both respects. One of the places you can see the gaps is in integer size. For DSP-style calculations you usually work with 16-bit data, but 32-bit ints are popular for control code.
Compare instructions on Blackfin, for example, compare numbers at 32-bit precision. If you use a 16-bit "short" in a comparison, the compiler will simply promote the short up to 32 bits. This conversion may take a few cycles, so it's better to start off with 32 bits for your control code variables. This is especially relevant for loop control variables: the compiler worries more about overflow with 16 bits, and thus treats a loop controlled by shorts more cautiously.
The other thing to note about integer size is that wider integers require more work. For instance, you can do two 16-bit multiplications per cycle on a Blackfin, but it takes 3 cycles to do one 32-bit multiplication. Thus, it's six times slower to use the extra width. One place this can sneak in is address arithmetic, which is done at 32 bits in standard C.
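The loop-counter advice above can be sketched as follows (both functions are hypothetical):

```c
/* A 16-bit counter: the compiler must promote i to 32 bits for
   each compare and be cautious about 16-bit wraparound. */
void copy_short_counter(int *dst, const int *src, short n)
{
    for (short i = 0; i < n; i++)
        dst[i] = src[i];
}

/* An int counter compares natively at 32-bit precision, which is
   the cheaper choice for control code on a 32-bit-compare machine. */
void copy_int_counter(int *dst, const int *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}
```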
Function inlining
Control code typically contains lots of small functions. These inhibit optimization. If the compiler sees code peppered with function calls, it will optimize cautiously if at all.
You can attack this by inlining functions, which means taking the body of a function's code and inserting it at the point of the call. This can be done by marking small functions as inline functions. You can also have the compiler inline functions automatically; on Blackfin, this is done with the -Oa option. Inlining a function can give you big savings because it eliminates the call instruction, the return instruction, and the construction of parameters and return values.
Just as important, inlining gives the optimizer greater visibility, as the optimizer can now see exactly what the function does. Inlining also removes transfers of control and provides the optimizer with a larger linear block of code. This is important because optimizers perform better with contiguous blocks of code than with multiple disconnected chunks of code. A contiguous block gives the optimizer more scope to move instructions around, to schedule them into any stalls, and increase efficiency. However, you have a delicate conflict to manage here. If there is too much going on in a block, you will strain the compiler analysis and the number of available registers. Of course, inlining functions can also dramatically increase code size if you're not careful.
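A minimal sketch of inlining a small helper; the names are illustrative:

```c
/* Marking a small helper inline lets the compiler drop the
   call/return overhead and see the whole loop body as one
   contiguous block it can schedule freely. */
static inline int saturate_u8(int x)
{
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return x;
}

void clamp_buffer(int *a, int n)
{
    for (int i = 0; i < n; i++)
        a[i] = saturate_u8(a[i]);
}
```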
Part 5 shows how to optimize memory performance, and how to make speed vs. size tradeoffs.

