


Programming and optimizing C code, part 5
Part five of this five-part series shows how to optimize memory performance, and how to make speed vs. size tradeoffs.


DSP DesignLine

[Editor's note: Part 4 explains why it is important to optimize "control code," and shows how to do so.]
Memory Performance
Compilers tend to treat memory as a uniform and infinite resource that can be accessed at no cost. In reality, memory access costs may dominate the performance of your application. Thus, it is worth learning a little about memory effects.
Most DSPs access both internal (on-chip) and external (off-chip) memories. There is an enormous performance gap between the fast on-chip memories (commonly SRAM) and the slower external memories (commonly SDRAM). Compilers seldom engage in intelligent data placement. Thus, it is up to the programmer to place the most-critical program code and data into on-chip memory. This is usually done via a linker control file. You can also tell the linker to place specific arrays into internal memory. For example, the following statement tells the linker to place the array in internal memory if there is room:
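The statement itself is missing here; the following is a representative sketch using the VisualDSP++ `section` qualifier. The section name "L1_data_a" is an assumption based on Analog Devices' conventions; check your linker description file for the actual section names, and note that GCC-style toolchains would use `__attribute__((section("...")))` with a matching linker-script entry instead.

```c
/* Place a hot coefficient buffer in on-chip L1 data SRAM.
   The section("...") qualifier is VisualDSP++-specific; the guard
   lets the same source build on a host compiler for testing. */
#ifdef __ADSPBLACKFIN__
section("L1_data_a") short coeffs[128];   /* on-chip L1 placement */
#else
short coeffs[128];                        /* host build: default placement */
#endif
```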

The speed of internal memory accesses depends on the data access patterns. Under ideal conditions, most DSPs can make two data accesses per cycle. However, DSP memories are usually split into multiple "banks," and a bank may stall if it receives two simultaneous access requests. Thus, on-chip memory may stall when two simultaneous accesses target addresses that lie close to each other (that is, in the same bank).
Unfortunately, the compiler does not know whether two data entities are close by in space, because memory layout is a linker function. How can the compiler decide whether it should generate dual-access code? Part of the answer is to use pragmas. For example, Blackfin has a pragma "different_banks," which tells the compiler that the data will derive from different banks, and to schedule aggressively for dual access.
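To make this concrete, here is a hedged sketch of a loop that benefits from dual access. The exact placement of the `different_banks` pragma (immediately before the loop) is an assumption about the toolchain's syntax; the guard keeps the code portable to host compilers.

```c
#include <stddef.h>

/* Dot product over two 16-bit buffers. If a[] and b[] are placed in
   different L1 banks, the Blackfin compiler can schedule both loads
   in the same cycle; the pragma asserts that placement. */
long dot16(const short *a, const short *b, size_t n)
{
    long acc = 0;
#ifdef __ADSPBLACKFIN__
#pragma different_banks   /* promise: a[] and b[] live in different banks */
#endif
    for (size_t i = 0; i < n; ++i)
        acc += (long)a[i] * b[i];   /* candidate for a dual access per cycle */
    return acc;
}
```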
The speed of external memory accesses depends upon properties of the memory as well as the speed and width of the bus linking memory to processor. On many DSPs, the programmer can select the speed of the memory bus. To save power, you can start by selecting a low bus speed. If you discover that your application performance is memory-bound, you can ramp up the bus speed. This is trickier than it sounds, because the processor and bus speeds are normally provided as multiples of an input clock signal. (More precisely, the processor speed is set as a multiple of the input clock, and the bus speed is set as a fraction of the processor speed.) Thus, only certain combinations of processor and bus speeds will be useful. For example, suppose you are using the ADSP-BF533 Blackfin and you want to run the memory bus at its maximum rate of 133 MHz. As illustrated in Figure 1, only specific combinations of processor and bus speeds meet this goal.

Figure 1. The yellow bars show combinations that come closest to meeting the target of a 133 MHz memory bus. The values shown are for a 750 MHz ADSP-BF533.
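The combinations in Figure 1 can be reproduced with a small search over clock multipliers and dividers. The 27 MHz input clock, the multiplier and divider ranges, and the 8 MHz search window below are illustrative assumptions, not BF533 datasheet values.

```c
/* Core clock = input clock * multiplier; bus (system) clock = core / divider.
   Only integer multiplier/divider pairs are available, so only certain
   bus speeds near the 133 MHz target are reachable. */
double bus_speed(double clkin_mhz, int msel, int ssel)
{
    return clkin_mhz * (double)msel / (double)ssel;
}

/* Count multiplier/divider pairs whose bus speed lands in
   [target - window, target]; these are the "yellow bar" candidates. */
int count_near_target(double clkin_mhz, double target, double window)
{
    int count = 0;
    for (int msel = 10; msel <= 28; ++msel)        /* assumed multiplier range */
        for (int ssel = 2; ssel <= 15; ++ssel) {   /* assumed divider range */
            double bus = bus_speed(clkin_mhz, msel, ssel);
            if (bus <= target && bus >= target - window)
                ++count;
        }
    return count;
}
```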
Figure 2 illustrates the significance of external memory performance. The first row (L1) shows that internal memory has single-cycle access. Cached results appear in the second row (L3 cached). The third row (L3) shows what happens in an uncached system: sequential 16-bit transfers take 40 cycles per item. Note that a memory access cost of this magnitude can swamp any benefit of an optimizing compiler.
External memory is organized into rows (where one row occupies perhaps 4 Kbytes) and a larger structure of banks. When memory accesses move from one row to the next, an extra delay is incurred. This delay is not a significant issue for sequential accesses, because sequential access generates many accesses within each row before moving on to the next. The "alternate rows" column shows a worse case arising from a more random access pattern. You can recover most of the lost performance by scattering data amongst the banks, as illustrated in the last column.

Figure 2. Memory access times for various scenarios.
A natural reaction to Figure 2 might be to rely on caching. Clearly, caching provides a significant advantage over the raw memory costs in row three. However, note that sequential access from cached external memory still costs 7.7 times more than accessing internal memory. Caches work best when you re-use data, so you should think about your data access patterns. Try to craft loops that re-use data as much as possible, rather than constantly fetching new data from external memory.
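The effect of access order is easy to demonstrate. Both functions below compute the same column sums; the names and the 64-column shape are illustrative, not from the article. The first strides down through memory and touches a new cache line (and often a new SDRAM row) on every access, while the second walks each row sequentially and uses every element of each fetched line.

```c
#include <stddef.h>

#define COLS 64

/* Cache-unfriendly order: for each output column, stride down through
   every row, so consecutive accesses are COLS elements apart. */
void col_sums_strided(const short in[][COLS], size_t rows, long out[COLS])
{
    for (size_t c = 0; c < COLS; ++c) {
        long acc = 0;
        for (size_t r = 0; r < rows; ++r)
            acc += in[r][c];
        out[c] = acc;
    }
}

/* Cache-friendly order: walk each row sequentially, so every element
   of a fetched line is used before the line is evicted. Same result,
   far fewer external-memory fetches. */
void col_sums_rowwise(const short in[][COLS], size_t rows, long out[COLS])
{
    for (size_t c = 0; c < COLS; ++c)
        out[c] = 0;
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < COLS; ++c)
            out[c] += in[r][c];
}
```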
The compiler can give you very little help with all this. As mentioned in part 1, you will often spot a memory problem by using a statistical profiler. Large numbers of unexplained stalls on a load or store instruction are a useful hint that you need to re-think your use of memory.


http://www.dspdesignline.com/showArticle.jhtml?articleId=198001797
Code Speed vs. Space


Small code has many benefits. Smaller code fits better into internal memory, so smaller code can raise the speed of the application. Smaller code also reduces the need for external memory, thereby reducing the cost and power consumption of the system.
Compilation tools will often assist you in optimizing the size of your program. You can request optimization for maximum speed, request minimum code size, or aim for a trade-off between the two.
Often you can give the compiler guidance at a low level. You can decide that one file should be compiled for space and another for speed, and you can even decide how individual functions should be optimized by using pragmas.
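A sketch of per-function tuning is shown below. The attribute syntax is GCC's; the article's DSP toolchain would use its own pragmas for the same purpose, and the function names here are invented for illustration.

```c
#include <string.h>

/* Rarely-run setup code: favor small size. */
__attribute__((optimize("Os")))
void clear_state(short *buf, int n)
{
    memset(buf, 0, (size_t)n * sizeof *buf);
}

/* Hot inner loop: favor speed (unrolling, pipelining, etc.). */
__attribute__((optimize("O3")))
long energy(const short *buf, int n)
{
    long acc = 0;
    for (int i = 0; i < n; ++i)
        acc += (long)buf[i] * buf[i];
    return acc;
}
```

Splitting the decision this way keeps the bulky speed optimizations confined to the handful of functions that actually dominate the cycle count.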
Figure 3 shows the results we obtained from switching certain optimizations on and off to uncover the differences between "compiled for speed" and "compiled for space".

Figure 3. Differences in file size for different optimization options.
Two optimization options caused significant differences in code size. The first culprit was function inlining. That was a bit of a surprise, and a warning to discover exactly how aggressive your compiler is when it is told to go for speed at all costs. The other code-expanding optimizations mostly stem from heavy optimization of loops, which relies on techniques such as loop unrolling and pipelining. Of minor interest was the 2% gained by arranging data accesses to maximize the use of 16-bit data offsets rather than 32-bit address calculations.
A small thing to watch out for is the fact that memory has edges. Programmers tend to place data arrays at the first address, or to butt data up against the last address. However, this creates problems for software pipelining. Software pipelining is the most powerful technique the compiler has for optimizing tight loops, because it allows the DSP to fetch the next set of data while processing the previous data. When a data array is placed at a memory edge, the last iteration of the loop will attempt to fetch data that lies outside the memory space, causing an address error. To avoid this problem, compilers reduce the loop count by one and execute the last iteration as an epilog to the loop. This creates safe code, but it adds many instructions. To avoid this code bloat, the Blackfin toolchain has an "-extra-loads" compiler option, which tells the compiler it is safe to load one element off the end of arrays.
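In source terms, the epilog transformation looks roughly like the sketch below (the function is invented for illustration; the actual transformation happens on the generated machine code, where the pipelined body really does load one element ahead).

```c
/* Sketch of the safe-but-bulky code the compiler emits when it cannot
   prove that reading one element past the array is harmless. */
void scale16(short *dst, const short *src, int n, short k)
{
    int i;
    /* Pipelined body: on the DSP, the load of src[i+1] for the next
       iteration overlaps the multiply for this one. */
    for (i = 0; i < n - 1; ++i)
        dst[i] = (short)(src[i] * k);
    /* Epilog: the peeled last iteration performs no lookahead load,
       so nothing past src[n-1] is ever read. The -extra-loads option
       removes the need for this peeling. */
    if (n > 0)
        dst[n - 1] = (short)(src[n - 1] * k);
}
```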
Advanced compilers attempt to provide intelligent blending of space- and speed-sensitive optimization. For example, the Blackfin compiler offers a selection from 0% to 100% space-sensitive optimization. The compiler combines the target it is given with its own understanding of how space-expansive each of its optimizations is. The compiler also evaluates the execution profile of the application as discovered under simulation: blocks of code that are infrequently executed are compiled to save space rather than to maximize speed. All of this results in a very flexible solution.
Naturally we want to know if it works. In Figure 4 we graph the response for a test program, in this case, a JPEG compression (cjpeg) program. This graph shows that at the extremes of minimizing space or maximizing speed, we get very low returns. Several points are clustered near the origin corresponding to target values 30% through 70%. Any of those give a good tradeoff between speed and space.

Figure 4. Results for speed/size compiler tradeoffs. On both axes, a lower number represents a better result.
If you are not comfortable with fully automatic optimization, or you cannot use simulation, you can approach this problem manually. Figure 5 shows what happens when you take each of the files comprising this application individually, compile them for speed and space, and measure the effects for each file.

Figure 5. Results of optimizing individual files for speed and size. The "% of avail" column shows how much each file contributes to the overall optimization. For example, fileio.c accounts for 24.95% of the total available speedup.
In the "speed" column we see the difference between optimizing for speed and optimizing for space. Only the files highlighted in yellow show a significant performance effect. This tells us that most of our application could be compiled neutrally or to save space. Similarly, the right-hand columns show how code size varies under the same conditions. Only the files highlighted in blue show a significant effect.
The only files that need careful consideration are those which are highlighted in both yellow and blue. For the others, the best optimization settings are obvious. Interestingly, most of the files that are highlighted in yellow don't match up with the files highlighted in blue. This demonstrates that a little analysis can substantially reduce the complexity of the optimization choices.
We end this series where we began. To optimize C code successfully, your efforts must be applied intelligently. You should base your efforts on study of the application and the target processor, rather than optimizing indiscriminately.




