I do the following:
    invalidate_icache_all();
    icache_disable();

This boils down to:

    #define CP15ISB asm volatile ("mcr p15, 0, %0, c7, c5, 4" : : "r" (0))
    #define CP15DSB asm volatile ("mcr p15, 0, %0, c7, c10, 4" : : "r" (0))

    /* Invalidate all instruction caches
     * Also flushes branch target cache.
     */
    asm volatile ("mcr p15, 0, %0, c7, c5, 0" : : "r" (0));

    /* Invalidate entire branch predictor array */
    asm volatile ("mcr p15, 0, %0, c7, c5, 6" : : "r" (0));

    /* Full system DSB - make sure that the invalidation is complete */
    CP15DSB;

    /* ISB - make sure the instruction stream sees it */
    CP15ISB;

    get_SCTLR ( sctlr );
    sctlr &= ~SCTLR_I_CACHE;
    set_SCTLR ( sctlr );
    core 0 SCTLR = 00c5087d
    100K: 77271 77212 77234 77150 per 1K = 772
    20K: 15422 15436 15454 15459 per 1K = 771
    1K: 775 774 767 773 per 1K = 775

Note that the SCTLR register no longer has bit 0x1000 set, so the I cache is indeed disabled.
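The bit test on the SCTLR value can be written out as a small helper. This is only a sketch: the macro and function names are made up for illustration (the code above uses SCTLR_I_CACHE), and the bit positions are the ARMv7-A ones for the M, C, Z and I bits.

```c
#include <stdint.h>
#include <stdio.h>

/* ARMv7-A SCTLR enable bits. The names are illustrative, not from Kyu. */
#define SCTLR_M  (1u << 0)   /* MMU enable */
#define SCTLR_C  (1u << 2)   /* data/unified cache enable */
#define SCTLR_Z  (1u << 11)  /* branch prediction enable */
#define SCTLR_I  (1u << 12)  /* instruction cache enable (0x1000) */

/* Return 1 if any bit of mask is set in the SCTLR value, else 0. */
int sctlr_bit_set(uint32_t sctlr, uint32_t mask)
{
    return (sctlr & mask) != 0;
}

/* Print a one-line-per-bit summary of an SCTLR value (hypothetical helper). */
void sctlr_show(uint32_t sctlr)
{
    printf("SCTLR = %08x\n", (unsigned)sctlr);
    printf("  MMU     %s\n", sctlr_bit_set(sctlr, SCTLR_M) ? "on" : "off");
    printf("  D cache %s\n", sctlr_bit_set(sctlr, SCTLR_C) ? "on" : "off");
    printf("  Z bit   %s\n", sctlr_bit_set(sctlr, SCTLR_Z) ? "on" : "off");
    printf("  I cache %s\n", sctlr_bit_set(sctlr, SCTLR_I) ? "on" : "off");
}
```

Calling sctlr_show(0x00c5087d) with the value printed above reports the I cache as off while the MMU and D cache bits are still set, so only the instruction cache was affected.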
I am now omitting the 500K timing because it doesn't add new information and is annoyingly slow. What I do find though is that my LED blink delay test now runs extremely slowly. The heart of this test is a call to delay_ms(), which looks like this:
    void delay_us ( int delay )
    {
        volatile unsigned int count;

        count = delay * us_delay_count;
        while ( count -- )
            ;
    }

    // 1003 gives 1.000 ms
    void delay_ms ( int delay )
    {
        unsigned int n;

        for ( n=delay; n; n-- )
            delay_us ( 1003 );
    }

The heart of the delay_us() loop looks like this:
    40002dac:  e51b3008  ldr r3, [fp, #-8]
    40002db0:  e2432001  sub r2, r3, #1
    40002db4:  e50b2008  str r2, [fp, #-8]
    40002db8:  e3530000  cmp r3, #0
    40002dbc:  1afffffa  bne 40002dac

The load from and store to [fp, #-8] are references to the count variable on the stack. This surprises me somewhat. It could be that the volatile forces this, or it is simply because we are not giving a -O switch to the compiler.
The question, though, is why this loop slows down so much when memcpy seems to run in the same amount of time. Whatever the case, we have certainly confirmed that we are able to switch off the I cache.
Interestingly there is a 64 bit counter here (fed by the 24M clock).
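A counter fed by a 24 MHz clock advances 24 ticks per microsecond, so two readings convert to elapsed microseconds as sketched below. The function name and the TIMER_HZ macro are made up for illustration, and how the counter itself is read is not shown here.

```c
#include <stdint.h>

/* Counter input clock, per the text above: 24 MHz. */
#define TIMER_HZ 24000000ull

/* Convert two readings of a free-running 64-bit counter into elapsed
 * microseconds. Unsigned subtraction gives the right difference even if
 * the counter wrapped between the two readings.
 */
uint64_t ticks_to_us(uint64_t start, uint64_t end)
{
    uint64_t delta = end - start;

    return delta / (TIMER_HZ / 1000000ull);
}
```

With this conversion a true 10 ms interval corresponds to 240000 ticks, or 10000 microseconds.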
    Delay for 10 ms = 8019   (with I cache enabled)
    Delay for 10 ms = 144717 (with I cache disabled)
    Delay for 10 ms = 144715 (with I cache disabled, optimized)
    Delay for 10 ms = 9991   (BBB with I cache enabled, just right!)
    Delay for 10 ms = 9991   (BBB with I cache enabled, optimized, just right!)
    Delay for 10 ms = 104896 (BBB with I cache disabled)

Comparing to the BBB timings is interesting, but not the main focus right now. It is interesting that the BBB timings with the I cache enabled are almost exactly 10 ms.
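To put a number on the slowdown, the ratio of the two measurements is enough, whatever unit they are in. A minimal sketch, with a made-up function name:

```c
/* Ratio of the delay measured with the I cache disabled to the delay
 * measured with it enabled. Both arguments must be in the same unit.
 */
double icache_slowdown(unsigned long enabled, unsigned long disabled)
{
    return (double)disabled / (double)enabled;
}
```

Using the figures above, icache_slowdown(8019, 144717) comes out at roughly 18, and for the BBB icache_slowdown(9991, 104896) comes out at roughly 10.5.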
I removed the volatile, and there was no change in the timing. I tried adding "register", but that changed nothing either. Apparently (as advertised), register is no more than a hint and is in general just ignored these days.
I was very much surprised that the "optimized" version (with no memory references in the loop) took just as long as the unoptimized one. Details follow as to what this "optimized" version is all about.
As for the "optimized" timing, I discovered I could optimize just one function in a file (using gcc) with the following:
    __attribute__ ((optimize(1)))
    static void e_delay_us ( int delay )
    {
        // volatile unsigned int count;
        register unsigned int count;

        count = delay * us_delay_count;
        while ( count -- ) {
            asm volatile ( "nop" );
            asm volatile ( "nop" );
            asm volatile ( "nop" );
        }
    }

I added the 3 "nop" instructions so there would be 5 instructions in the loop, as in the non-optimized case above. The idea is to eliminate memory references in the loop, while keeping everything else constant. This being a RISC machine, it should execute every instruction in a single clock if there are no conflicts.
    4001dd90:
    4001dd90:  e52db004  push {fp}        ; (str fp, [sp, #-4]!)
    4001dd94:  e28db000  add fp, sp, #0
    4001dd98:  e30b3590  movw r3, #46480  ; 0xb590
    4001dd9c:  e3443003  movt r3, #16387  ; 0x4003
    4001dda0:  e5933000  ldr r3, [r3]
    4001dda4:  e0000093  mul r0, r3, r0
    4001dda8:  e3500000  cmp r0, #0
    4001ddac:  0a000004  beq 4001ddc4
    4001ddb0:  e320f000  nop {0}
    4001ddb4:  e320f000  nop {0}
    4001ddb8:  e320f000  nop {0}
    4001ddbc:  e2500001  subs r0, r0, #1
    4001ddc0:  1afffffa  bne 4001ddb0
    4001ddc4:  e28bd000  add sp, fp, #0
    4001ddc8:  e49db004  pop {fp}         ; (ldr fp, [sp], #4)
    4001ddcc:  e12fff1e  bx lr
Kyu / [email protected]