Linux Power Management for x86 CPU (1)---- C-State
http://blog.sina.com.cn/s/blog_7014a5340100mv7m.html
Linux Power Management for x86 CPU
------------------------------------------------
Modern CPUs are more and more powerful. When a CPU has no job to do, it
enters the idle state. During its idle period we can certainly cut off
its power and put it into a low-power state, but only if we know when a
new assignment arrives, so that we can re-activate the CPU and have it do
its jobs again. The process is like this:
To achieve the above goal, we need to answer the following questions:
1.
-----------------
The answer to the first question is in fact very simple: when it is idle,
the CPU runs the swapper process (process ID 0; it should probably be
called the idle thread, but swapper is the legacy name and all textbooks
use it). So the CPU must be idle whenever it runs swapper.
Traditionally, the swapper process does nothing. In an endless loop, it
just checks whether there is another task to run. If not, it delays for a
while and checks again; otherwise, it tells the process scheduler to
schedule the other task. The code is like this:
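The original listing is not reproduced here. As a rough illustration only,
the user-space C sketch below mimics the polling loop described above. The
names sim_cpu, sim_need_resched() and sim_schedule() are hypothetical
stand-ins, not kernel functions, and the loop is bounded so that it ends.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of one CPU's run queue; not a kernel structure. */
struct sim_cpu {
    int runnable;   /* tasks waiting to run */
    int idle_polls; /* times the idle loop found nothing to do */
    int scheduled;  /* tasks handed over to the scheduler */
};

static bool sim_need_resched(const struct sim_cpu *cpu)
{
    return cpu->runnable > 0;
}

static void sim_schedule(struct sim_cpu *cpu)
{
    cpu->runnable--;
    cpu->scheduled++;
}

/* The real loop never ends; 'rounds' bounds it for the demonstration. */
static void sim_idle_loop(struct sim_cpu *cpu, int rounds)
{
    assert(rounds >= 0);
    for (int i = 0; i < rounds; i++) {
        if (sim_need_resched(cpu))
            sim_schedule(cpu);   /* let the scheduler run another task */
        else
            cpu->idle_polls++;   /* nothing to do: "delay", then check again */
    }
}
```

With two runnable tasks and five rounds, the loop schedules twice and then
polls three times without finding work, which is exactly the wasted effort
that the C-states described below are meant to avoid.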
2. How to Cut Off Power
-----------------------
Note that a CPU consists of many units: besides the core logic, it has
caches, a BIU (Bus Interface Unit) and a Local APIC. When a CPU is idle, we
can cut the clock signal and power from some of these units. The more units
are stopped, the more power is saved.
Cutting CPU power has a side effect that we need to consider: each unit
takes some time to power up. So the more units are stopped, the longer it
takes for the CPU to be re-activated (woken up). We call this time the
entry/exit latency.
2.1
-------------
To strike a balance between power saving and entry/exit latency, Intel CPUs
provide several low-power states called C-states, or sleeping states.
Depending on the CPU model, Intel CPUs support the C-states C1, C2, C3, C4,
C5, C6, ... (C0 is the active state). In a sleeping state the CPU does not
execute any instruction, but consumes less power.
Besides Cx, some Intel CPUs have enhanced CxE states. For example, Intel
Core 2 Duo introduced enhanced C-states. These states have one additional
feature compared with the Cx states: they reduce the CPU voltage before
entering the Cx state (in fact, the voltage reduction is implemented on top
of EIST/T-states).
2.2
---------------------------
Then how do we enter a particular C-state? Intel provides three methods.
2.2.1
----------------------
As we know, Intel x86 has a HLT (halt) instruction. Starting with the
486DX4, this instruction causes the CPU to enter the C1 or C1E state. If the
BIOS enables the C1E feature, the CPU enters C1E; otherwise it enters C1.
The BIOS enables C1E via an MSR. For example, on Intel Xeon 7000 the BIOS
can set bit 25 of IA32_MISC_ENABLE_MSR (MSR 1A0).
Note that HLT can be used for C1/C1E entry only. That means you cannot put
the CPU into C2 or deeper with HLT.
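To make the MSR bit concrete, here is a minimal sketch that tests and sets
bit 25 in a value already read from MSR 1A0. The bit position is taken from
the text above (Xeon 7000) and may differ on other models. The helper names
are invented for this example, and actually reading or writing the MSR
needs ring 0 (or a driver), which the sketch leaves out.

```c
#include <stdbool.h>
#include <stdint.h>

#define IA32_MISC_ENABLE 0x1A0u /* MSR address, as given in the text */
#define C1E_ENABLE_BIT   25     /* bit position, as given in the text */

/* Hypothetical helper: is the C1E bit set in this MSR value? */
static bool c1e_enabled(uint64_t misc_enable)
{
    return ((misc_enable >> C1E_ENABLE_BIT) & 1u) != 0;
}

/* Hypothetical helper: return the MSR value with the C1E bit set or cleared. */
static uint64_t c1e_set(uint64_t misc_enable, bool on)
{
    const uint64_t mask = UINT64_C(1) << C1E_ENABLE_BIT;
    return on ? (misc_enable | mask) : (misc_enable & ~mask);
}
```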
2.2.2
----------------------------
Intel also defines the P_LVLx I/O registers (x is 2 ~ 5). An I/O read of a
P_LVLx register causes the CPU to enter a C-state. Generally, P_LVL2 is for
C2, but the mapping depends on the CPU model: P_LVL3 on Core i7 means C6,
while P_LVL3 on Core 2 Duo means C3.
2.2.3
--------------------------------
Besides the HLT instruction and the P_LVLx registers, Intel provides another
way to put the CPU into a C-state: MWAIT. This instruction is used together
with MONITOR. Normally, we use the monitor instruction to watch a range of
memory, and then use mwait with some hints to make the CPU enter a Cx state.
Without this instruction, when a CPU is in a sleeping state and other CPUs
want to wake it up, the only way is to send an IPI. However, an IPI is an
expensive operation and takes much time (compared to Monitor/MWait). With
the Monitor/MWait pair, other CPUs can wake up a sleeping CPU by modifying
the memory watched (monitored) by the sleeping CPU.
/*-----------------------------------------------------------
 * The code currently used
 *----------------------------------------------------------*/
stop_critical_timings();
if (!need_resched()) {
        /* Arm the monitor on this thread's flags word. */
        __monitor((void *)&current_thread_info()->flags, 0, 0);
        smp_mb();
        /* Re-check: a wakeup may have raced with arming the monitor. */
        if (!need_resched())
                __mwait(eax, ecx);
}
start_critical_timings();
3.
-----------------------------
When a CPU runs the swapper process, there might be some processes in
various wait queues of this CPU. Once the condition changes, those
processes can become runnable again. Because they have already been
assigned to this CPU, the CPU must, before sleeping, make sure it will be
able to run those waiting processes in the near future.
Then, what conditions can a process wait for? Time and/or interrupts. A
process can wait on a timer, on an interrupt, or on some event that will be
triggered during interrupt handling.
An Intel CPU returns to C0 from a sleeping state once it receives an
interrupt, and timers are implemented via the hardware timer interrupt. So
the processes in the wait queues will be executed once they become runnable
(we skip the tickless kernel and the C3-stop LAPIC timer for the time
being).
Besides, other CPUs can assign jobs to an idle CPU and wake it up via an
interrupt or via the method provided by monitor/mwait.
4.
-------------------
ACPI (Advanced Configuration and Power Interface) defines two methods
(control interfaces) to control CPU C-states, and the ACPI specification
defines 3 C-states. Note that ACPI C-states are not the same as Intel CPU
C-states. For example, we can map Intel CPU C1/C1E to ACPI C1, Intel C2/C2E
to ACPI C2, and Intel C3, C4, C5, C6 to ACPI C3.
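The example mapping above can be written down as a small function. This is
only a sketch of that one mapping; the function name is made up, and a real
platform reports its own mapping through the ACPI tables.

```c
#include <assert.h>

/*
 * Map an Intel C-state number (1 for C1/C1E, 2 for C2/C2E, 3..6 for C3..C6)
 * to an ACPI C-state type, following the example mapping in the text.
 * Returns 0 for C0, the active state.
 */
static int acpi_type_for_intel_cstate(int intel_cx)
{
    assert(intel_cx >= 0);
    if (intel_cx == 0)
        return 0;   /* C0: not a sleeping state */
    if (intel_cx <= 2)
        return intel_cx;   /* C1/C1E -> ACPI C1, C2/C2E -> ACPI C2 */
    return 3;       /* C3 and deeper all collapse into ACPI C3 */
}
```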
4.1.
-------------------------------
In the DSDT table, each processor can optionally have a P_BLK register
block, which contains the P_CNT, P_LVL2 and P_LVL3 registers. Reading P_LVL2
causes the CPU to enter the C2 state; reading P_LVL3 causes the CPU to enter
the C3 state.
In the FADT table, two fields, P_LVL2_LAT and P_LVL3_LAT, give the C2 and C3
entry/exit latency respectively.
Based on the entry/exit latency, the OS can select which C-state to enter
when the CPU is idle. The OS should select as deep a sleeping state as
possible, so as to save more power. In fact, the hardware entry/exit latency
is used only as a reference point, and the OS adjusts the entry/exit latency
for each C-state at runtime.
When the CPU is idle, the OS looks at the nearest pending timer, compares the
interval until it expires with the C-state latencies, and selects one of the
C-states to enter.
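The selection rule can be sketched as follows: pick the deepest state whose
latency still fits into the time left before the next timer. This is a
simplified illustration, not the logic of any real governor; the struct,
the function name and the latency values are all invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one C-state. */
struct cstate {
    const char *name;
    unsigned int latency_us; /* entry/exit latency in microseconds */
};

/* Illustrative values only, ordered from shallowest to deepest. */
static const struct cstate demo_states[] = {
    { "C1", 1 },
    { "C2", 20 },
    { "C3", 100 },
};

/*
 * Return the index of the deepest state whose latency does not exceed the
 * time until the next timer. Index 0 (the shallowest state) is the fallback.
 */
static size_t pick_cstate(const struct cstate *states, size_t n,
                          unsigned int next_timer_us)
{
    size_t best = 0;

    assert(n > 0);
    for (size_t i = 1; i < n; i++) {
        if (states[i].latency_us <= next_timer_us)
            best = i;
    }
    return best;
}
```

With these numbers, a timer due in 50 us selects C2, because waking from C3
would take longer than the time the CPU is expected to stay idle.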
4.2. _CST & _CSD ACPI objects
-----------------------------
4.2.1 _PDC
----------
_PDC: the OS uses it to inform the platform of the level of CPU power
management it supports.
Note that the OS must use the _PDC/_OSC method to inform the platform of the
level of power management which the OS can handle. Based on this information,
the ACPI firmware can return different values (packages) for _CST and _CSD.
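4.2.2 _CST
----------
_CST: the platform declares the supported C-states. ACPI can define a _CST
object in ASL code, which reports to OSPM the C-states supported by the
platform's CPUs. Its format is as follows:
CSTPackage : Package ( Count , CState ,…, CState )
where Count is the number of supported C-states, and
CState: Package ( Register , Type , Latency , Power )
Register describes how OSPM switches to the C-state, Type is the type of the
C-state (1=C1, 2=C2, 3=C3), Latency is the maximum latency to enter the
C-state, and Power is the power consumption in that C-state (in milliwatts).
The sample code (whose comments explain it clearly) describes a CPU0 that
supports 4 C-states: C1 is accessed via FFixedHW, the other 3 C-states are
entered via P_LVL, and the third and fourth C-states are both mapped to
ACPI C3.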
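The original ASL sample is not available here. As a stand-in, the C sketch
below models one CState package as a struct and builds a table shaped like
the sample described in the text (four states, the last two of type C3).
The type names, the latency values and the power values are assumptions
made up for illustration; they do not come from any real platform.

```c
#include <assert.h>
#include <stddef.h>

/* How OSPM reaches the state; a hypothetical simplification of "Register". */
enum cst_access { CST_FFIXEDHW, CST_SYSTEM_IO };

/* One CState package: ( Register , Type , Latency , Power ). */
struct cst_entry {
    enum cst_access access;
    unsigned int type;      /* 1 = C1, 2 = C2, 3 = C3 */
    unsigned int latency;   /* illustrative value */
    unsigned int power_mw;  /* milliwatts, illustrative value */
};

/* Four states, the third and fourth both mapped to ACPI C3. */
static const struct cst_entry demo_cst[] = {
    { CST_FFIXEDHW,  1,   1, 1000 },
    { CST_SYSTEM_IO, 2,  20,  500 },
    { CST_SYSTEM_IO, 3, 100,  250 },
    { CST_SYSTEM_IO, 3, 200,  100 },
};

/* Count the entries that map to a given ACPI C-state type. */
static size_t cst_count_type(const struct cst_entry *e, size_t n,
                             unsigned int type)
{
    size_t count = 0;

    assert(e != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (e[i].type == type)
            count++;
    }
    return count;
}
```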
4.2.3 _CSD
------------
_CSD (C-State Dependency): the platform provides cross-logical-processor
C-state dependency information to the OS. For example, on a dual-core
platform each core may be able to enter C1 independently, but if one core
switches to C2 the other must switch to C2 as well; this information has to
be provided in _CSD.
I am copying the following words from the ACPI spec:
OSPM can coordinate the transitions between logical processors, choosing to initiate
the transition when doing so does not lead to incorrect or non-optimal system behavior.
This OSPM coordination is referred to as Software Coordination. Alternately, it might
be possible for the underlying hardware to coordinate the state transition requests
on multiple logical processors, causing the processors to transition to the target
state when the transition is guaranteed to not lead to incorrect or non-optimal
system behavior. This scenario is referred to as Hardware (HW) coordination
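One simple way to picture software coordination is that a dependency domain
can only go as deep as the shallowest state requested by any of its logical
processors. The sketch below encodes just that rule. It is an assumption
made for illustration, not text from the ACPI specification, and the
function name is invented.

```c
#include <assert.h>
#include <stddef.h>

/*
 * requested[i] is the C-state number that logical processor i asks for.
 * The domain as a whole enters the shallowest (smallest) requested state.
 */
static int domain_cstate(const int *requested, size_t n)
{
    int shallowest;

    assert(requested != NULL && n > 0);
    shallowest = requested[0];
    for (size_t i = 1; i < n; i++) {
        if (requested[i] < shallowest)
            shallowest = requested[i];
    }
    return shallowest;
}
```

For two cores requesting C2 and C1, the domain stays in C1; only when both
request C2 can the shared hardware actually enter C2.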
5. Linux C-State Related Code
--------------------------
Linux has a global function pointer, pm_idle. If nobody changes it, it is
set to default_idle(). The routine default_idle() just executes the HLT
instruction to put the CPU into the halt state. If the CPU supports
C-states, this causes the CPU to enter C1, or C1E if the BIOS enabled the
C1E feature.
In fact, many modules try to make pm_idle point to their own routine. For
example,
The priority of the swapper process is very low; it executes only when there
is no other runnable process. Any runnable process can preempt the swapper
process. In an endless loop, the swapper process executes cpu_idle() like
this,
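The kernel listing is missing from this copy. The user-space sketch below
only illustrates the function-pointer dispatch described above. It reuses
the names pm_idle and default_idle for readability, but these are local
stand-ins, not the kernel symbols; custom_idle() and cpu_idle_once() are
hypothetical, and the counters replace the real HLT instruction.

```c
#include <stddef.h>

static int halt_calls;   /* how often the default routine ran */
static int custom_calls; /* how often the replacement routine ran */

/* Stand-in for the default routine; the real one executes HLT. */
static void default_idle(void)
{
    halt_calls++;
}

/* Stand-in for a driver-specific idle routine. */
static void custom_idle(void)
{
    custom_calls++;
}

/* Global hook; other code may point it at its own routine. */
static void (*pm_idle)(void) = default_idle;

/* One pass of the idle loop body: call whatever pm_idle points to. */
static void cpu_idle_once(void)
{
    void (*idle)(void) = pm_idle;

    if (idle == NULL)
        idle = default_idle; /* fall back if nobody installed a routine */
    idle();
}
```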
5.1
--------------------------
Linux CPU C-State related modules/drivers are organized as follows,
5.1.1 Driver Register
-----------------------
In acpi_processor_init(), which is a module initialization routine and
called by do_initcalls(), two related drivers, acpi processor bus driver
and acpi_idle_driver, are registered. If you really want to look into it,
take a look at the following path:
kernel_init()
notes:
5.1.2 Device Discovery & Register
---------------------------------
ACPI subsystem parses ACPI tables, and for each ACPI processor object,
it calls acpi processor bus driver's add entrypoint, acpi_processor_add(),
to add an acpi processor device.
After adding an acpi processor device, acpi subsystem will call processor
driver's start entrypoint function, acpi_processor_start().
In acpi_processor_start(), the routine acpi_processor_power_init() is
called to evaluate _PDC, and read & parse _CST, _CSD or use FADT/MADT
info to initialize processors' power state information, and then calls
cpuidle_register_device() to register a cpuidle device into cpuidle
infrastructure.
For hotplug CPUs, during acpi_processor_init() execution, the routine
acpi_processor_install_hotplug_notify() is called to register a CPU
hotplug callback. When a CPU comes online, acpi_processor_start() is
executed.
Please note that several drivers operate on the same physical CPUs:
besides the cpuidle driver, there are other processor-related drivers,
such as the T-state driver, the P-state driver, etc. The ACPI processor
driver acts as a bridge/coordinator among those drivers.
5.1.3 Driver/Device attach
-----------------------
The ACPI (Advanced Configuration and Power Interface) subsystem registers
processors with acpi_process_driver; if/when the registered CPU is online,
the start entrypoint, acpi_processor_start(), is called. This entry function
performs many initialization jobs for T-state, P-state and C-state. Here we
look only at C-state, for which it calls two routines.
The first routine evaluates _CST, or reads the FADT if _CST fails, to get
the C-state description from the ACPI tables. Refer to sections 4.1/4.2 to
see how the C-state information is handled.
The second one will setup some information for each valid c-state, note
for most cases (without kernel parameter, bus master,
This enter routine is used to enter corresponding C-state.
5.1.4
-----------------
The cpuidle governors are simple to read/understand. A governor provides 3
main callbacks for the cpuidle infrastructure.
Each governor has a rating in its structure. When governors are registered
with the cpuidle infrastructure by the routine cpuidle_register_governor(),
cpuidle selects the one with the highest rating unless the user specified
one via the sysfs interface. The cpuidle_curr_governor pointer points to the
selected one.
Only one governor can be used at a time. When the OS decides to put a CPU
into a C-state, it calls the select entrypoint of the current governor, and
the governor chooses one C-state according to its policy.
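The rating rule can be sketched in a few lines. The struct below is a
hypothetical simplification, not the kernel's governor structure, and the
rating values in the demo table are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down governor description. */
struct governor {
    const char *name;
    unsigned int rating;
};

/* Illustrative table; the ratings are assumptions for this example. */
static const struct governor demo_governors[] = {
    { "ladder", 10 },
    { "menu",   20 },
};

/* Return the governor with the highest rating, or NULL if there is none. */
static const struct governor *pick_governor(const struct governor *govs,
                                            size_t n)
{
    const struct governor *best = NULL;

    assert(govs != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (best == NULL || govs[i].rating > best->rating)
            best = &govs[i];
    }
    return best;
}
```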
6. Linux Files related to C-States
----------------------------------
drivers/acpi/processor_core.c
drivers/acpi/processor_idle.c
drivers/cpuidle/cpuidle.c
drivers/cpuidle/driver.c
drivers/cpuidle/governor.c
drivers/cpuidle/sysfs.c
drivers/cpuidle/governors/ladder.c
drivers/cpuidle/governors/menu.c
7. Some Kernel Parameters
-------------------------------
idle=poll,
idle=halt,
idle=nomwait
idle=mwait
max_cstate=n
Others (which may help locate issue when C-State doesn't work),
nohz=off
nolapic_timer
lapic_timer_c2_ok
clocksource=tsc (or hpet, pit, acpi_pm, jiffies),
8. Sysfs & Proc
-----------------
Check C-State statistics & state,
Check governor & driver,
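The commands that followed are not preserved in this copy. On kernels with
cpuidle, per-state attributes such as name, latency, usage and time are
exposed under /sys/devices/system/cpu/cpuN/cpuidle/stateM/, and the active
driver and governor under /sys/devices/system/cpu/cpuidle/. The sketch
below builds such a path and reads the first line of a file; both helper
names are made up for this example, and the attributes available depend on
the kernel version.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build the sysfs path of one attribute of one idle state of one CPU. */
static int cpuidle_attr_path(char *buf, size_t len, unsigned int cpu,
                             unsigned int state, const char *attr)
{
    assert(buf != NULL && attr != NULL);
    return snprintf(buf, len,
                    "/sys/devices/system/cpu/cpu%u/cpuidle/state%u/%s",
                    cpu, state, attr);
}

/* Read the first line of a file; returns 0 on success, -1 on failure. */
static int read_first_line(const char *path, char *out, size_t len)
{
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    if (fgets(out, (int)len, f) == NULL) {
        fclose(f);
        return -1;
    }
    fclose(f);
    out[strcspn(out, "\n")] = '\0'; /* strip the trailing newline */
    return 0;
}
```

A caller would build the path for, say, cpu 0, state 1, attribute "usage",
pass it to read_first_line(), and treat a return value of -1 as "this
kernel or machine does not expose that attribute".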
9. TBD
-----------
9.1 Broadcast Timer
------------------
When some CPUs enter a deep C-state (C3 or above), their Local APIC timers
stop as well (Linux uses the LAPIC timer as the tick device in most cases).
This issue is handled by the "broadcast timer" scheme.
9.2 Dynamic Tick /Tickless
--------------------------
Linux supports tickless operation, which makes the C-State code more complex.
9.3 Idle Load balancing
-----------------------
When CPUs enter the idle state, one of the idle CPUs is nominated as the ILB
(Idle Load Balancer). It is responsible for pulling tasks from busy CPUs,
re-assigning them to idle CPUs and having the idle CPUs start up.