http://blog.sina.com.cn/s/blog_7014a5340100mv7m.html

Linux Power Management for x86 CPU (1): C-State
------------------------------------------------

Modern CPUs are more and more powerful, yet a CPU with no work to do simply
enters the idle state. During an idle period we can cut power to the CPU
and put it into a low-power state, provided we know when new work arrives
so that we can re-activate the CPU and have it run again. The process looks
like this:

                  no job                 cut off power
   CPU active  ----------->  CPU idle  ---------------->  low-power state
       ^                                                         |
       |                       re-power up                       |
       +---------------------------------------------------------+

To achieve the above goal, we need to answer the following questions:

          1) How to know CPU is idle so that we can cut off power;
          2) How to cut off power;
          3) When and how to re-power up CPU;


1.  When CPU is idle
-----------------
The answer to the first question is actually very simple: when a CPU is
idle, it runs the swapper process (process ID 0; "idle thread" would
probably be a better name, but swapper is the legacy name and all the
textbooks use it). So a CPU must be idle whenever it is running the swapper.

Traditionally, the swapper process does nothing useful. In an endless loop,
it checks whether there is another task to run. If not, it waits for a while
and checks again; otherwise it asks the process scheduler to schedule the
other task. The code looks like this:

        while (1) {
            while (no_job_to_do) {
                delay_for_a_while;  <------- in fact, a halt instruction
            }
            schedule_other_process;
        }

So, to cut CPU power, we change the above code to:

        while (1) {
            while (no_job_to_do) {
                cut_off_cpu_power;  <------- done in pm_idle() on Linux
                ...
            }
            schedule_other_process;
        }



2. How to Cut Off Power
-----------------------
Note that a CPU consists of many units: besides the core logic, it has
caches, a BIU (Bus Interface Unit) and a local APIC. When a CPU is idle, we
can cut the clock signal and the power to some of these units. The more
units are stopped, the more power is saved.

Cutting CPU power has a side effect, though: each unit needs some time to
power up again. So the more units are stopped, the longer it takes for the
CPU to be re-activated (woken up). We call this time the entry/exit latency.


2.1  C-State
-------------
To strike a balance between power saving and entry/exit latency, Intel CPUs
provide a number of low-power states called C-states, or sleeping states.
Depending on the CPU model, Intel CPUs support C1, C2, C3, C4, C5, C6, ...
(C0 is the active state). While in a sleeping state (C1 or above), the CPU
does not execute any instructions and consumes less power.


    C0 - The CPU is fully powered and executes instructions;
    C1 - Stops the main internal core clocks;
    C2 - Has two sub-modes: Stop-Grant and Stop-Clock;

         While in C1/C2, the CPU still services bus snoops and snoops from
         other cores. That is, the CPU automatically exits C1/C2, handles
         the snoop, and then returns to C1/C2.

    C3 - Flushes the cache, so the CPU does not exit C3 to handle snoops.
    C4 - For multi-core processors. For example, on a Core 2 Duo, if both
         cores are in C4, the package enters a deeper sleep state.

    C5 - I don't know :)
    C6 - On Intel Core i7, the package enters an even deeper sleep state if
         all cores are in C6, with some additional power saving from the
         QPI link.

    Cn - ...
   
Besides Cx, some Intel CPUs have enhanced CxE states. For example, the Intel
Core 2 Duo introduced the enhanced C-states C1E, C2E, C3E and C4E. Compared
with the plain Cx states, the enhanced states have one additional feature:
they reduce the CPU voltage before entering the Cx state (in fact, the
voltage reduction is implemented on top of EIST/T-states).


2.2  HLT, P_LVLx and MWait
---------------------------
So how do we enter a particular C-state? Intel provides three methods.


2.2.1  HLT instruction
----------------------
As we know, Intel x86 has a HLT (halt) instruction. Starting with the
486DX4, this instruction causes the CPU to enter the C1 or C1E state: if
the BIOS has enabled the C1E feature, the CPU enters C1E; otherwise it
enters C1. The BIOS enables C1E through an MSR. For example, on the Intel
Xeon 7000, the BIOS can set bit 25 of IA32_MISC_ENABLE_MSR (MSR 1A0).

Note that HLT can be used for C1 entry only. That is, you cannot make the
CPU enter C2 or deeper with HLT.


2.2.2  P_LVLx I/O registers
----------------------------
Intel also defines the P_LVLx I/O registers (x is 2 to 5). An I/O read from
a P_LVLx register causes the CPU to enter a C-state. Generally P_LVL2 maps
to C2, but P_LVL3 maps to C6 on a Core i7 and to C3 on a Core 2 Duo; the
mapping depends on the CPU model.



2.2.3  Monitor/MWait instruction
--------------------------------
Besides the HLT instruction and the P_LVLx registers, Intel provides another
way to put a CPU into a C-state: MWait. This instruction is used together
with Monitor. Normally, we use the monitor instruction to watch a range of
memory, and then use mwait with some hints to put the CPU into a Cx state.

Without these instructions, when a CPU is in a sleeping state and another
CPU wants to wake it up, the only way is to send an IPI. However, an IPI is
an expensive operation that takes a long time (compared to Monitor/MWait).
With the Monitor/MWait pair, other CPUs can wake up the sleeping CPU by
modifying the memory watched (monitored) by the sleeping CPU.

The code currently in use:

    stop_critical_timings();
    if (!need_resched()) {
        /* Arm the monitor on this task's flags word. */
        __monitor((void *)&current_thread_info()->flags, 0, 0);
        smp_mb();
        /* Re-check, so a wakeup that raced with the arming is not lost. */
        if (!need_resched())
            __mwait(eax, ecx);  /* eax/ecx carry the C-state hints */
    }
    start_critical_timings();



3.  Re-activate CPU
-----------------------------
When a CPU runs the swapper process, there may be processes sitting in
various wait queues on this CPU. Once the conditions they wait for change,
those processes become runnable again. Because they are already assigned to
this CPU, the CPU must, before going to sleep, be prepared to run these
waiting processes in the near future.

Then what conditions can a process wait for? Time and/or interrupts. A
process can wait on a timer, on an interrupt, or on some event that is
triggered during interrupt handling.

An Intel CPU returns to C0 from a sleeping state as soon as it receives an
interrupt, and timers are implemented via the hardware timer interrupt. So
the processes in the wait queues run once they become runnable (we skip the
tickless kernel and the LAPIC timer stopping in C3 for the time being).

Besides, other CPUs can assign jobs to an idle CPU and wake it up via an
interrupt or via the method provided by monitor/mwait.


4.  ACPI & C-State
-------------------
ACPI (Advanced Configuration and Power Interface) defines two methods
(control interfaces) for controlling CPU C-states, and the ACPI
specification defines three C-states: C1, C2 and C3. Note that ACPI C-states
are not the same as Intel CPU C-states. For example, we can map Intel C1/C1E
to ACPI C1, Intel C2/C2E to ACPI C2, and Intel C3, C4, C5 and C6 to ACPI C3.
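
The example mapping above can be written down as a tiny helper. This is only
a sketch: the function name is made up for illustration, and the real mapping
is whatever the firmware of a given platform reports.

```c
#include <assert.h>

/* Hypothetical helper: map an Intel C-state number (1 for C1/C1E, 2 for
 * C2/C2E, 3..6 for C3..C6) to the ACPI C-state type, following the example
 * mapping in the text. C0 (active) maps to 0. */
static int acpi_type_for_intel_cstate(int intel_cstate)
{
    assert(intel_cstate >= 0);
    if (intel_cstate <= 2)
        return intel_cstate;   /* C0, C1/C1E, C2/C2E keep their number */
    return 3;                  /* C3 and deeper all map to ACPI C3 */
}
```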


4.1.  P_LVLx registers in P_BLK
-------------------------------
   
In the DSDT, each processor can optionally have a P_BLK register block.
For example,

     Processor (
               \_PR.CPU0,      // Namespace name
               1,              // Processor ID
               0x120,          // P_BLK system I/O address
               6               // size of P_BLK
         ) {...}

    P_LVL2:   P_BLK + 4, 1 byte, system I/O space;
    P_LVL3:   P_BLK + 5, 1 byte, system I/O space;

Reading P_LVL2 causes the CPU to enter the C2 state; reading P_LVL3 causes
it to enter the C3 state.
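
As a small worked example, the port numbers follow directly from the P_BLK
base address. The helper below is a sketch with made-up names; it assumes,
as in the example above, that a usable P_BLK is 6 bytes long. It only
computes the port and does not perform the (privileged) I/O read itself.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets of the P_LVLx registers inside P_BLK, as listed above. */
enum { P_LVL2_OFFSET = 4, P_LVL3_OFFSET = 5, P_BLK_LEN = 6 };

/* Hypothetical helper: return the I/O port of P_LVL2 (level == 2) or
 * P_LVL3 (level == 3), or 0 if the processor has no usable P_BLK. */
static uint32_t p_lvl_port(uint32_t p_blk_base, uint8_t p_blk_len, int level)
{
    assert(level == 2 || level == 3);
    if (p_blk_base == 0 || p_blk_len < P_BLK_LEN)
        return 0;
    return p_blk_base + (level == 2 ? P_LVL2_OFFSET : P_LVL3_OFFSET);
}
```

With the P_BLK from the example (base 0x120, size 6), P_LVL2 is port 0x124
and P_LVL3 is port 0x125.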

The FADT has two fields that give the C2 and C3 entry/exit latency
respectively, in microseconds:

    FADT.P_LVL2_LAT   The worst-case hardware latency to enter/exit a
                      C2 state. A value > 100 indicates the system does
                      not support a C2 state.

    FADT.P_LVL3_LAT   The worst-case hardware latency to enter/exit a
                      C3 state. A value > 1000 indicates the system does
                      not support a C3 state.

Based on the entry/exit latency, the OS can select which C-state to enter
when a CPU is idle. The OS should select the deepest sleeping state possible
so as to save the most power. In fact, the hardware entry/exit latency is
only used as a reference point, and the OS adjusts the entry/exit latency
of each C-state at runtime.

When a CPU is idle, the OS looks up the nearest pending timer, compares the
time remaining until it fires with the C-state latencies, and selects one
of the C-states to enter.
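
A minimal sketch of that selection rule follows. The struct and function are
hypothetical and much simpler than a real governor, which weighs more inputs
than the next timer alone; the states are assumed to be sorted from
shallowest to deepest.

```c
#include <assert.h>
#include <stddef.h>

struct cstate {              /* hypothetical, simplified descriptor */
    int type;                /* 1 = C1, 2 = C2, 3 = C3 */
    unsigned latency_us;     /* entry/exit latency */
};

/* Return the index of the deepest state whose latency is shorter than the
 * expected idle time; fall back to index 0 (the shallowest state). */
static size_t pick_cstate(const struct cstate *states, size_t n,
                          unsigned idle_us)
{
    size_t best = 0;

    assert(states != NULL && n > 0);
    for (size_t i = 1; i < n; i++)
        if (states[i].latency_us < idle_us)
            best = i;
    return best;
}
```

For instance, with latencies of 20, 40, 60 and 100 and a timer due in 50
time units, the second state (latency 40) is chosen.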
   
4.2. _CST & _CSD ACPI objects
-----------------------------   

4.2.1 _PDC
----------
_PDC: the OS uses it to inform the platform of the level of CPU power
      management support provided by the OS.

Note that the OS must use the _PDC/_OSC method to tell the platform which
level of power management it can handle. Based on this information, the
ACPI firmware can return different values (packages) for _CST and _CSD.

4.2.2 _CST
----------
_CST is how the platform reports to OSPM, through ACPI ASL code, the
C-states that its CPUs support. ACPI can define a _CST object for a
processor like this:

          Name (_CST, Package() {Count, CState, ..., CState}),  where
          CState: Package (Register, Type, Latency, Power)

Count is the number of supported C-states. Register describes how OSPM
enters the C-state, Type is the C-state type (1=C1, 2=C2, 3=C3), Latency is
the worst-case latency to enter the C-state, and Power is the power
consumption in that C-state, in milliwatts.
   
    For example,   
   
    Processor (\_PR.CPU0,1, 0x120, 6) {
        ...
        Name (_CST, Package() {
            4,      //the number of supported C-States
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)}, 1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40, 750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60, 500},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x163)}, 3, 100, 250}
        })   
        ...
    }

    In this example, CPU0 has four C-states: C1, C2 and two C3 states with
    different latency and average power consumption (the third and fourth
    entries are both mapped to ACPI C3). C1 is entered via FFixedHW; the
    other three are entered via P_LVL-style I/O reads.

        C1: FFixedHW, which means using the "halt" or "mwait" instruction to enter C1;
        C2: SystemIO, 8-bit size, so a byte read from I/O address 0x161 enters C2;

    If a Cx state uses FFixedHW, we check whether the CPU supports the mwait
    instruction. Executing CPUID with EAX = 0x05 returns a value in the EDX
    register that tells us which C-states are supported by mwait (including
    the number of sub-states of each C-state).
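
    To make the EDX layout concrete, here is a small decoding sketch. It
    assumes the layout documented for CPUID leaf 5 in the Intel SDM: one
    4-bit field per C-state, starting with C0 in bits 3:0. The function
    name is made up, and the EDX value in the usage note is an invented
    sample, not the output of any particular CPU.

```c
#include <assert.h>
#include <stdint.h>

#define MWAIT_SUBSTATE_BITS 4      /* each C-state gets a 4-bit field */
#define MWAIT_SUBSTATE_MASK 0xfu

/* Hypothetical helper: number of MWAIT sub-states for C-state `cstate`
 * (0..7), given the EDX value returned by CPUID leaf 5. */
static unsigned mwait_substates(uint32_t edx, unsigned cstate)
{
    assert(cstate < 8);
    return (edx >> (cstate * MWAIT_SUBSTATE_BITS)) & MWAIT_SUBSTATE_MASK;
}
```

    For a sample EDX of 0x00001120 this gives 2 sub-states for C1, 1 for C2,
    1 for C3 and none for C4.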


4.2.3 _CSD
----------
_CSD (C-State Dependency): the platform uses it to give the OS the C-state
control dependencies across logical processors. For example, on a dual-core
platform each core may be able to enter C1 independently, but if one core
switches to C2 the other must switch to C2 as well; _CSD is where this
information is provided.
     
      CSDPackage: Package (CStateDep,…, CStateDep), where,
      CStateDep:  Package (NumberOfEntries, Revision, Domain, CoordType,
                           NumProcessors, Index)
     
    For example,
     
    Processor (\_SB.CPU0, 1, 0x120, 6) {
        Name (_CST, Package() {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)}, 1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40, 750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60, 500}           
        })
        Name(_CSD, Package() {
            Package(){6, 0, 0, 0xFD, 2, 1}, // 6 entries, Revision 0, Domain 0, OSPM Coordinate
                                           // Initiate on Any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 2} // 6 entries, Revision 0, Domain 0, OSPM Coordinate
                                          // Initiate on Any Proc, 2 Procs, Index 2 (C3-type)
       })
    }   
    Processor (\_SB.CPU1, 2, 0x130, 6) {
        Name(_CST, Package() {
            3,
            Package(){ResourceTemplate(){Register(FFixedHW, 0, 0, 0)}, 1, 20, 1000},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x161)}, 2, 40, 750},
            Package(){ResourceTemplate(){Register(SystemIO, 8, 0, 0x162)}, 3, 60, 500}
        })
        Name(_CSD, Package() {
            Package(){6, 0, 0, 0xFD, 2, 1}, // 6 entries (fields in this package), Revision 0,
                                            // Domain 0, OSPM Coordinate
                                            // Initiate on any Proc, 2 Procs, Index 1 (C2-type)
            Package(){6, 0, 0, 0xFD, 2, 2} // 6 entries, Revision 0, Domain 0, OSPM Coordinate
                                           // Initiate on any Proc, 2 Procs, Index 2 (C3-type)
        })
    }

The following is quoted from the ACPI spec:

OSPM can coordinate the transitions between logical processors, choosing to initiate
the transition when doing so does not lead to incorrect or non-optimal system behavior.
This OSPM coordination is referred to as Software Coordination. Alternately, it might
be possible for the underlying hardware to coordinate the state transition requests
on multiple logical processors, causing the processors to transition to the target
state when the transition is guaranteed to not lead to incorrect or non-optimal
system behavior. This scenario is referred to as Hardware (HW) coordination.
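
To tie the quote back to the _CSD examples, the sketch below models one
CStateDep package and tells software coordination from hardware
coordination. The struct and function names are made up; the CoordType
values (0xFC SW_ALL, 0xFD SW_ANY, 0xFE HW_ALL) are the ones defined by the
ACPI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* CoordType values defined by the ACPI specification. */
enum csd_coord { CSD_SW_ALL = 0xFC, CSD_SW_ANY = 0xFD, CSD_HW_ALL = 0xFE };

struct csd_entry {              /* one CStateDep package, fields in order */
    uint32_t num_entries;       /* 6 in the examples above */
    uint32_t revision;
    uint32_t domain;
    uint32_t coord_type;
    uint32_t num_processors;
    uint32_t index;             /* which _CST entry this applies to */
};

/* Hypothetical helper: nonzero if OSPM (software) has to coordinate the
 * transition, zero if the hardware coordinates it. */
static int csd_needs_sw_coordination(const struct csd_entry *e)
{
    assert(e != NULL);
    return e->coord_type == CSD_SW_ALL || e->coord_type == CSD_SW_ANY;
}
```

The entries in the example above use 0xFD, so they call for software
coordination.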


5. Linux C-State Related Code
--------------------------
Linux has a global function pointer, pm_idle. If nobody changes it, it is
set to default_idle(). The routine default_idle() simply executes the HLT
instruction to halt the CPU. If the CPU supports C-states, this causes the
CPU to enter C1, or C1E if the BIOS has enabled the C1E feature.

In fact, many modules try to point pm_idle at their own routine. For
example,

     APM                          apm_cpu_idle()       // legacy APM power management
     cpuidle                      cpuidle_idle_call()
     AMD CPU                      c1e_idle()           // AMD C1E acts like Intel C3
     CPU supporting MWait         mwait_idle()         // C1 only
     idle=poll kernel parameter   poll_idle()          // no-op, no power saving
     idle=halt kernel parameter   default_idle()
     ...

The priority of the swapper process is very low: it executes only when there
is no other runnable process, and any runnable process can preempt it. In an
endless loop, the swapper process executes cpu_idle() like this,

        void cpu_idle(void)
        {
                ...
                while (1) {
                        while (!need_resched()) {  <---- no runnable process
                                local_irq_disable();
                                pm_idle();
                        }
                        ...
                        schedule();  <------- select a new process to run
                        ...
                }
        }
        
        

5.1  Architecture Overview
--------------------------
The Linux modules/drivers related to CPU C-states are organized as follows,
        

                        ----------------
                        |    sysfs     |
                        ----------------
                                |
        ----------  --------    |
        | ladder |  | menu |    |
        ----------  --------    |
             |          |       |
          --------------------------
          | cpuidle infrastructure |
          --------------------------
                      |
                      |
           -----------------------
           | acpi-cpuidle driver |
           -----------------------
                      |
                      |
        -----------------------------
        | ACPI processor bus driver |
        -----------------------------


5.1.1 Driver Register
-----------------------
In acpi_processor_init(), which is a module initialization routine called
by do_initcalls(), two related drivers are registered: the ACPI processor
bus driver and acpi_idle_driver. If you really want to look into it, follow
this call path:

kernel_init()
    ==> do_basic_setup()
         ==> do_initcalls()
             ==>   ... acpi_processor_init();
                       ==> cpuidle_register_driver(&acpi_idle_driver);
                           acpi_bus_register_driver(&acpi_processor_driver);

  The driver registration code is in drivers/acpi/processor_core.c.

Notes:
    a) The cpuidle infrastructure is NOT a driver; it is initialized by
       core_initcall(). It provides:
         I) a sysfs interface through which userland apps/users can check
            or switch the cpuidle governor:
            /sys/devices/system/cpu/(cpuX)/cpuidle/
         II) interfaces for registering governors;
         III) interfaces for cpuidle devices and the cpuidle driver;
         IV) it sets the global pm_idle pointer to cpuidle_idle_call();
    b) acpi_idle_driver is registered with the cpuidle infrastructure, while
       acpi_processor_driver is registered with the ACPI subsystem as an
       ACPI bus driver;
    c) The cpuidle infrastructure allows only one driver to register; it
       keeps a global pointer to the registered driver (here,
       acpi_idle_driver). Refer to cpuidle_register_driver(), provided by
       the cpuidle infrastructure in drivers/cpuidle/driver.c;
    d) The ACPI processor driver registers a callback for CPU hotplug, so
       it is notified when a CPU goes online or offline.


5.1.2 Device Discovery & Register
---------------------------------
ACPI subsystem parses ACPI tables, and for each ACPI processor object,
it calls acpi processor bus driver's add entrypoint, acpi_processor_add(),
to add an acpi processor device.

After adding an acpi processor device, acpi subsystem will call processor
driver's start entrypoint function, acpi_processor_start().

In acpi_processor_start(), the routine acpi_processor_power_init() is
called to evaluate _PDC, and read & parse _CST, _CSD or use FADT/MADT
info to initialize processors' power state information, and then calls
cpuidle_register_device() to register a cpuidle device into cpuidle
infrastructure.

For hotplug CPUs, during the execution of acpi_processor_init(), the routine
acpi_processor_install_hotplug_notify() is called to register a CPU hotplug
callback. When a CPU comes online, acpi_processor_start() is executed.

Please note that these drivers all operate on the same physical CPUs:
besides the cpuidle driver, there are other processor-related drivers, such
as the T-state driver, the P-state driver, the CPU-hotplug infrastructure,
etc. The ACPI processor driver acts as a bridge/coordinator among those
drivers.



5.1.3 Driver/Device attach
-----------------------
The ACPI (Advanced Configuration and Power Interface) subsystem registers
processors with acpi_processor_driver. If/when a registered CPU is online,
the start entrypoint, acpi_processor_start(), is called. This entry function
performs much of the initialization for T-states, P-states and C-states. For
now we only look at C-states, for which it calls
 
            acpi_processor_power_init();
                ==> acpi_processor_get_power_info();
                ==> acpi_processor_setup_cpuidle();   

The first routine evaluates _CST, or reads the FADT if _CST fails, to get
the C-state descriptions from the ACPI tables. Refer to sections 4.1/4.2 to
see how the C-state information is handled.

The second one sets up some information for each valid C-state. Note that
in most cases (no special kernel parameters, bus-master handling, etc.):

    C1, state->enter = acpi_idle_enter_c1;
    C2, state->enter = acpi_idle_enter_simple;
    C3, state->enter = acpi_idle_enter_bm;

The enter routine is what is called to enter the corresponding C-state.



5.1.4  Governor
-----------------
The cpuidle governors are simple to read and understand. Each governor
provides a rating and three main callbacks to the cpuidle infrastructure:

         rating        <-- menu is 20, ladder is 10;
         enable()
         select()
         reflect()

Each governor has a rating in its structure. When governors are registered
with the cpuidle infrastructure through cpuidle_register_governor(),
cpuidle selects the one with the highest rating, unless the user has
specified one via the sysfs interface. The cpuidle_curr_governor pointer
points to the selected one.
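
The "highest rating wins" rule can be sketched in a few lines. The struct
and function below are simplified stand-ins, not the kernel's own
definitions, and the sketch ignores the sysfs override.

```c
#include <assert.h>
#include <stddef.h>

struct governor {            /* hypothetical, simplified */
    const char *name;
    unsigned rating;
};

/* Return the governor with the highest rating; on a tie, the one that
 * appears first in the array wins. */
static const struct governor *pick_governor(const struct governor *govs,
                                            size_t n)
{
    const struct governor *best;

    assert(govs != NULL && n > 0);
    best = &govs[0];
    for (size_t i = 1; i < n; i++)
        if (govs[i].rating > best->rating)
            best = &govs[i];
    return best;
}
```

With ladder rated 10 and menu rated 20, as listed above, menu is selected.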

Only one governor can be in use at a time. When the OS decides to put a CPU
into a C-state, it calls the select entrypoint of the current governor, and
the governor chooses a C-state according to its policy,

      cpuidle_idle_call()
      {
          ...
          next_state = cpuidle_curr_governor->select();
          target_state = &dev->states[next_state];
          ...     <---- note: this part of the code has since changed
          dev->last_state = target_state;
          dev->last_residency = target_state->enter(dev, target_state);
          ...
          cpuidle_curr_governor->reflect();
      }


6. Linux Files related to C-States
----------------------------------
drivers/acpi/processor_core.c
drivers/acpi/processor_idle.c
drivers/cpuidle/cpuidle.c
drivers/cpuidle/driver.c
drivers/cpuidle/governor.c
drivers/cpuidle/sysfs.c
drivers/cpuidle/governors/ladder.c
drivers/cpuidle/governors/menu.c



7. Some Kernel Parameters
-------------------------------
idle=poll              polling, always in C0, almost no power saving;
idle=halt              use the HLT instruction only, so only C1 is entered;
idle=nomwait           don't use mwait; the P_LVLx method is used instead;
idle=mwait             force the OS to use mwait for C-states;
max_cstate=n           specify the deepest available C-state, n is a number


Others (which may help locate issue when C-State doesn't work),

nohz=off             don't use dynamic tick/tickless mode
nolapic_timer        don't use local APIC timer
lapic_timer_c2_ok    Local APIC timer is ok in C2
clocksource=tsc (or hpet, pit, acpi_pm, jiffies),      override clock source




8. Sysfs & Proc
-----------------

Check C-state statistics & state,

    /proc/acpi/processor/CPUX/

Check governor & driver,

    /sys/devices/system/cpu/cpuidle/       (system-wide)
    /sys/devices/system/cpu/cpuX/cpuidle/  (per CPU)




9. TBD
-----------
9.1 Broadcast Timer
------------------
When a CPU enters a deep C-state (C3 or above), its local APIC timer stops
as well (Linux uses the LAPIC timer as the tick device in most cases). This
issue is handled by the "broadcast timer" scheme.


9.2 Dynamic Tick /Tickless
--------------------------
Linux supports tickless operation, which makes the C-state code more complex.


9.3 Idle Load balancing
-----------------------
When CPUs enter the idle state, one of the idle CPUs is nominated as the ILB
(Idle Load Balancer). It is responsible for pulling tasks from busy CPUs,
re-assigning them to idle CPUs, and getting those idle CPUs started again.

