Raising BogoMIPS
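When I ported linux 3.0 and linux 3.4.2 to the s5pv210 platform, BogoMIPS was always close to 1000:
[ 0.000083] Calibrating delay loop... 997.78 BogoMIPS (lpj=2494464)
But the ports of linux-3.7, linux-3.9.7, linux-3.9.11 and linux-3.10.28 all report values in the 600s:
Calibrating delay loop... 663.55 BogoMIPS (lpj=1658880)
With a gap this large, I was worried that the clock port, or some other problem, was reducing performance.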
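The two figures can be reproduced from the lpj values in the logs. The sketch below mirrors the integer arithmetic the kernel uses when printing the calibration result in init/calibrate.c. HZ = 200 is an assumption inferred from the logged lpj/BogoMIPS pairs, not something the logs state, and the helper names are hypothetical.

```c
/* Assumption: HZ = 200, inferred from the logged lpj/BogoMIPS pairs. */
#define HZ 200UL

/* Integer part of the BogoMIPS figure, as printed at boot. */
unsigned long bogomips_int(unsigned long lpj)
{
	return lpj / (500000UL / HZ);
}

/* Two-digit fractional part of the BogoMIPS figure. */
unsigned long bogomips_frac(unsigned long lpj)
{
	return (lpj / (5000UL / HZ)) % 100;
}
```

With lpj=2494464 this gives 997 and 78 (997.78), and with lpj=1658880 it gives 663 and 55 (663.55), so under that assumption the reported BogoMIPS is just lpj rescaled. The question is therefore why lpj itself differs.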
Debugging showed the following.
In the function:
static unsigned long __cpuinit calibrate_delay_converge(void)
{
	/* First stage - slowly accelerate to find initial bounds */
	unsigned long lpj, lpj_base, ticks, loopadd, loopadd_base, chop_limit;
	int trials = 0, band = 0, trial_in_band = 0;

	lpj = (1<<12);

	/* wait for "start of" clock tick */
	ticks = jiffies;
	while (ticks == jiffies)
		; /* nothing */
	/* Go .. */
	ticks = jiffies;
	do {
		if (++trial_in_band == (1<<band)) {
			++band;
			trial_in_band = 0;
		}
		__delay(lpj * band);
		trials += band;
	} while (ticks == jiffies);
	/*
	 * We overshot, so retreat to a clear underestimate. Then estimate
	 * the largest likely undershoot. This defines our chop bounds.
	 */
	trials -= band;
	loopadd_base = lpj * band;
	lpj_base = lpj * trials;

recalibrate:
	lpj = lpj_base;
	loopadd = loopadd_base;

	/*
	 * Do a binary approximation to get lpj set to
	 * equal one clock (up to LPS_PREC bits)
	 */
	chop_limit = lpj >> LPS_PREC;
	while (loopadd > chop_limit) {
		lpj += loopadd;
		ticks = jiffies;
		while (ticks == jiffies)
			; /* nothing */
		ticks = jiffies;
		__delay(lpj);
		if (jiffies != ticks)	/* longer than 1 tick */
			lpj -= loopadd;
		loopadd >>= 1;
	}
	/*
	 * If we incremented every single time possible, presume we've
	 * massively underestimated initially, and retry with a higher
	 * start, and larger range. (Only seen on x86_64, due to SMIs)
	 */
	if (lpj + loopadd * 2 == lpj_base + loopadd_base * 2) {
		lpj_base = lpj;
		loopadd_base <<= 2;
		goto recalibrate;
	}

	return lpj;
}
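The binary approximation in the second half of this function can be tried out in user space. The following is only a rough model, not kernel code: the hypothetical delay_exceeds_tick() stands in for "__delay(lpj) ran past one tick", and true_lpj is an assumed number of loops that fit into one jiffy. LPS_PREC is 8, as in init/calibrate.c.

```c
#define LPS_PREC 8	/* same precision constant as init/calibrate.c */

/*
 * Hypothetical stand-in for "did __delay(lpj) run past one tick?".
 * true_lpj models the real number of loops that fit in one jiffy.
 */
int delay_exceeds_tick(unsigned long lpj, unsigned long true_lpj)
{
	return lpj > true_lpj;
}

/* Second stage only: binary approximation from a known underestimate. */
unsigned long chop_lpj(unsigned long lpj_base, unsigned long loopadd_base,
		       unsigned long true_lpj)
{
	unsigned long lpj = lpj_base;
	unsigned long loopadd = loopadd_base;
	unsigned long chop_limit = lpj >> LPS_PREC;

	while (loopadd > chop_limit) {
		lpj += loopadd;
		if (delay_exceeds_tick(lpj, true_lpj))
			lpj -= loopadd;	/* longer than 1 tick: back off */
		loopadd >>= 1;
	}
	return lpj;
}
```

For example, chop_lpj(2097152, 1048576, 2494464), with made-up starting bounds, settles on 2490368, within about 0.2% of the modelled value. The result depends only on how many __delay() iterations fit into one tick, which is why the delay routine itself is the next thing to examine.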
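When this wait loop executes:
	ticks = jiffies;
	while (ticks == jiffies)
the ticks values are the same in each case, so the CPU clock and the cycle time of a single instruction can be ruled out.
That leaves only one suspicious place:
	__delay(lpj * band);
In the linux 3.4.2 kernel it is implemented in arch/arm/lib/delay.S: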
/*
 * loops = r0 * HZ * loops_per_jiffy / 1000000
 *
 * Oh, if only we had a cycle counter...
 */

@ Delay routine
ENTRY(__delay)
		subs	r0, r0, #1
#if 0
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
#endif
		bhi	__delay
		mov	pc, lr
ENDPROC(__delay)
In linux 3.9.7 and linux 3.13 it is implemented in delay-loop.S (the listing below is the linux 3.13 version):
/*
 * loops = r0 * HZ * loops_per_jiffy / 1000000
 */
		.align 3

@ Delay routine
ENTRY(__loop_delay)
		subs	r0, r0, #1
#if 0
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
		movls	pc, lr
		subs	r0, r0, #1
#endif
		bhi	__loop_delay
		mov	pc, lr
ENDPROC(__loop_delay)
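Comparing the versions shows that linux 3.13 has an extra alignment directive:
.align 3
After adding the same directive to linux 3.9.7, BogoMIPS went from 663 to 994, to my surprise.
I was overjoyed.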
[liujia@210]#cat /proc/cpuinfo
processor : 0
model name : ARMv7 Processor rev 2 (v7l)
BogoMIPS : 997.78
Features : swp half thumb fastmult vfp edsp thumbee neon vfpv3 tls
CPU implementer : 0x41
CPU architecture: 7
CPU variant : 0x2
CPU part : 0xc08
CPU revision : 2
Hardware : SMDKV210
Revision : 0000
Serial : 0000000000000000
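One way to read the jump is in cycles per pass of the subs/bhi loop. The sketch below is a back-of-the-envelope calculation only: it assumes HZ = 200 (inferred from the logged numbers) and a 1 GHz core clock, neither of which is stated in the logs, and the function names are hypothetical.

```c
/* Assumptions: HZ = 200 (inferred from the logs), core clock = 1 GHz. */
#define HZ 200UL
#define CPU_HZ 1000000000UL

/* Loops executed per second according to the calibrated lpj. */
unsigned long loops_per_second(unsigned long lpj)
{
	return lpj * HZ;
}

/* Approximate CPU cycles spent on one pass of the subs/bhi loop. */
double cycles_per_loop(unsigned long lpj)
{
	return (double)CPU_HZ / (double)loops_per_second(lpj);
}
```

Under those assumptions, lpj=2494464 works out to roughly 2 cycles per iteration and lpj=1658880 to roughly 3, which would account for the 1.5x gap. That figure describes only this two-instruction loop, not code in general.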
However, testing with real applications showed that actual performance had not improved.
I found this passage online (http://www.linux-mips.org/wiki?title=BogoMIPS&oldid=6231):
BogoMIPS used to be the infamous, prestigious benchmark for Linux machines over a decade. Unfortunately - or fortunately - depending of point of view - the BogoMIPS number of the favorite machine BogoMIPS have little to nothing to do with actual processor performance. Certain other microarchitectural details will also very overproportionally influence the benchmark. On the other side memory performance, I/O performance, cache size and speed and many other processor and system architecture feature that make a crucial difference for system performance will not influence BogoMIPS at all. The BogoMIPS number for any given processor architecture is basically proportional to the clock rate. On most processor architectures the BogoMIPS loop is compiled into just two instructions. Accordingly small is the aspects of a processor that are actually tested. And processors again are just a small part of an overall system which includes other hardware and software. To show the actual code on MIPS:
	.set	noreorder
loop:	bnez	$reg, loop
	subu	$reg, 1
	.set	reorder
A typical modern machine with efficient branches or branch prediction can execute this loop at a rate of one instruction per cycle. Out of Order Execution which provides roughly a 50% speedup on real workloads provides no benefit. Not even second level caches or memory subsystems are exercised. The more surprising it is that BogoMIPS have become a benchmark for performance as important as extra inches in spam email. Having been a permanent annoyance over the years due to misinterpretation by users and due to excessive output on multiprocessor machines Linux by default will no longer print the BogoMIPS number since 2.6.9-rc2.
The purpose of the BogoMIPS benchmark is to calibrate internal delay loops which are used for very short delays or in situations where a process can't sleep. This is done by calling the mdelay(), udelay() and ndelay() functions which take the time to delay as the argument in units of milliseconds, microseconds or nanoseconds, respectively.
So it seems that BogoMIPS does not reflect real performance.
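The calibration purpose can be made concrete with the formula from the comment in delay.S, loops = r0 * HZ * loops_per_jiffy / 1000000. The sketch below restates it in plain integer arithmetic; it is not the kernel's fixed-point implementation, HZ = 200 is an assumption inferred from the logged numbers, and udelay_loops is a hypothetical name.

```c
/* Assumption: HZ = 200, inferred from the logged lpj/BogoMIPS pairs. */
#define HZ 200ULL

/* Loop iterations needed to busy-wait for `usecs` microseconds. */
unsigned long long udelay_loops(unsigned long long usecs,
				unsigned long long lpj)
{
	return usecs * HZ * lpj / 1000000ULL;
}
```

For a 10 microsecond delay this gives 4988 iterations with lpj=2494464 and 3317 with lpj=1658880. Both take about the same wall-clock time, because each lpj was calibrated against the same timer tick. A higher lpj only means each pass of the loop is faster, not that the rest of the system is.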