How Intel Server CPUs Save Power - Intro to C-states

Keywords: #c-states #power-management

Note: this post describes how C-states work on Intel server CPUs. C-states on other platforms are often similar, but the specifics of each state differ. However, the basic concept of using C-states for power management is common to all platforms, including client systems.

What are C-states?

C-states are power states for CPU cores. Each state consumes a different amount of power and behaves slightly differently. They are used to save energy while the CPU is powered on but idle.

As a human analogy, think of C-states as different stages of relaxation, ranging from a quick sit-down all the way to a deep meditation.

As with humans, the deeper the state a CPU is in, the longer it takes to resume working or respond to interruptions. This is a central trade-off in CPU power management: balancing the power saved by entering deeper states against the extra time it takes to resume operating.

Differences between “hardware” and “software” C-states

It is important to be aware that there are two types of C-states, and both are often referred to simply as “C-states”.

Software C-states are also commonly referred to as “Requested C-states” or “Logical C-states”, because they are the C-states that the operating system can request. Hardware C-states, by contrast, reflect the actual state of a CPU. The notation differs slightly between hardware and software C-states, but how the two map to each other is usually intuitive.

For example, the OS can request that the CPU go into the hardware state “CC1” by referencing the software state “C1”.
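On Linux, the software C-states that the OS can request for each CPU are exposed through the cpuidle sysfs interface: one stateN directory per state under /sys/devices/system/cpu/cpuN/cpuidle/, containing files such as name, latency (exit latency in microseconds), residency (target residency in microseconds) and disable. The following is a minimal Python sketch that lists them. The function names and the sysfs_root parameter (added so the code can be pointed at a test directory) are illustrative, and the state names you see depend on which idle driver is in use.

```python
from pathlib import Path


def _read_attr(state_dir: Path, attr: str) -> str:
    """Read one sysfs attribute file and strip the trailing newline."""
    return (state_dir / attr).read_text().strip()


def list_cstates(cpu: int = 0, sysfs_root: str = "/sys/devices/system/cpu") -> list[dict]:
    """Return the software (requestable) C-states Linux exposes for one CPU."""
    cpuidle_dir = Path(sysfs_root) / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so that state10 is listed after state9.
    state_dirs = sorted(cpuidle_dir.glob("state*"), key=lambda p: int(p.name[len("state"):]))
    return [
        {
            "name": _read_attr(d, "name"),
            "exit_latency_us": int(_read_attr(d, "latency")),
            "target_residency_us": int(_read_attr(d, "residency")),
            "disabled": _read_attr(d, "disable") == "1",
        }
        for d in state_dirs
    ]


if __name__ == "__main__":
    # Prints nothing if the cpuidle directory does not exist on this machine.
    for state in list_cstates(cpu=0):
        print(state)
```

Each returned entry is a software C-state the OS may request; which hardware C-state the core actually enters is still decided by the processor.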

How are C-states used to save power?

When the operating system has no runnable tasks, it can take the opportunity to save power. It first estimates which software C-state is most suitable to request, based on factors such as how long it expects the system to stay idle. It then uses the mwait instruction to request that C-state from the CPU as a hint. Finally, it is up to the processor to determine exactly which hardware C-state it enters.
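Linux cpuidle governors (such as ‘menu’ and ‘teo’) make this choice using, among other inputs, each state's target residency (the minimum idle time for which entering the state is worthwhile) and its exit latency. The sketch below is a heavily simplified illustration of that core rule, not the kernel's actual algorithm: pick the deepest state that fits within the predicted idle time and whose exit latency is within an allowed limit. The state table, its numbers, and the function name select_cstate are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CState:
    name: str
    exit_latency_us: int      # time needed to wake up from this state
    target_residency_us: int  # minimum idle time for the state to be worth entering


# Hypothetical example values, ordered from shallowest to deepest.
# On a real system they come from the idle driver (e.g. via the Linux cpuidle sysfs interface).
STATES = [
    CState("C1", exit_latency_us=2, target_residency_us=2),
    CState("C1E", exit_latency_us=10, target_residency_us=20),
    CState("C6", exit_latency_us=200, target_residency_us=600),
]


def select_cstate(predicted_idle_us: int, latency_limit_us: int,
                  states: list[CState] = STATES) -> CState:
    """Pick the deepest state that pays off and wakes up fast enough.

    A heavily simplified version of the rule cpuidle governors apply;
    it ignores how the idle-time prediction itself is made.
    """
    chosen = states[0]  # fall back to the shallowest state
    for state in states:
        if (state.target_residency_us <= predicted_idle_us
                and state.exit_latency_us <= latency_limit_us):
            chosen = state  # later entries are deeper, so keep the last match
    return chosen
```

For example, with these hypothetical numbers, select_cstate(5000, latency_limit_us=100) returns the C1E entry: C6 would fit the predicted 5 ms idle period, but its 200 µs exit latency exceeds the limit. This is the power-versus-latency balance described earlier, expressed as two comparisons.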

What are the different C-states?

Going from “shallowest” to “deepest” (i.e. lowest to highest power saving), here are examples of commonly used hardware C-states:

This list is not exhaustive, and many platforms support additional hardware C-states. However, those listed below are very widely supported and used in production systems today.

  • CC0 - The busy state: the core is currently executing instructions.
  • CC1 - The core begins “core clock gating”. In CC1, the core has stopped executing but its state is preserved, so to return to normal operation it only has to stop clock gating.
  • CC1e - Known as “Enhanced C1”. In addition to the clock gating involved in CC1, there is also a hint to drop the core voltage. This means that CC1e has a similar wake-up latency to CC1, but the core will operate at a slightly lower frequency after waking up while the voltage is restored. Note: it is only a hint because the CPU will generally only reduce the voltage if every core has entered CC1e.
  • CC6 - Rather than clock gating, in CC6 the CPU begins “power gating”, which means that the core voltage is reduced to 0V. Cores in CC6 can take up to 10 times longer to wake from idle than in the previously mentioned C-states. This is because anything left in the core is lost once its voltage drops to 0V, so the core first has to flush its caches and save the rest of its state elsewhere. When the core is woken up, it has to restore that saved state rather than simply resuming. None of the previous C-states require this save-and-restore step.

All of the C-states above are core C-states, which means that they affect power consumption at the core level. There are also C-states with a larger scope, such as package C-states (e.g. PC6 - package C6).

Wake Latency

As mentioned above, going to deeper states means saving power. But this comes at a cost. That cost is called “Wake Latency”. Wake latency is the time it takes for the CPU to wake from a given C-state and continue operating.

At Intel, I work on a free and open-source tool for measuring wake latency called ‘wult’ (Wake Up Latency Tracer). ‘wult’ measures the wake latency of CPUs with the following process:

  1. Set a timer in the future - the CPU “alarm clock”.
  2. Tell the CPU to sleep.
  3. When the timer expires, trigger an interrupt which wakes the CPU.
  4. As soon as the CPU is awake, read the current time and calculate the difference between the timer triggering and the current time. That’s the wake latency!
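The four steps above can be imitated from user space. The following Python sketch is only a rough illustration of the idea and is not how wult works: each sample sets an absolute deadline (the “alarm clock”), sleeps until it, then records how many nanoseconds late the process was actually running again. Because it runs in user space, the result mixes any C-state exit latency with timer slack, scheduler and interpreter overhead, and nothing guarantees which C-state (if any) the CPU entered while sleeping. The function name and its parameters are hypothetical.

```python
import time


def measure_wake_delays(sleep_us: int = 500, samples: int = 100) -> list[int]:
    """Rough user-space approximation of the timer-based wake latency method.

    Returns, for each sample, how many nanoseconds after the requested
    deadline this process was running again. The value includes OS and
    interpreter overhead, so it is not a pure C-state exit latency.
    """
    delays = []
    for _ in range(samples):
        # 1. Set a "timer" in the future: an absolute deadline on the monotonic clock.
        deadline = time.monotonic_ns() + sleep_us * 1000
        # 2./3. Sleep so the CPU can go idle; the OS timer wakes us.
        #       Loop in case sleep returns before the deadline.
        while (remaining := deadline - time.monotonic_ns()) > 0:
            time.sleep(remaining / 1e9)
        # 4. Read the time immediately and compute how late we woke up.
        delays.append(time.monotonic_ns() - deadline)
    return delays


if __name__ == "__main__":
    delays = sorted(measure_wake_delays())
    print(f"median delay: {delays[len(delays) // 2] / 1000:.1f} us")
```

Measuring from an absolute deadline, rather than from the moment sleep is called, matches step 4: the interesting quantity is how long after the timer fired the code was running again.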

For more information, the wult documentation describes in detail how the process works and the different methods that can be used to implement it.

How can I modify C-states on Linux?

Another free and open-source tool I have contributed to is ‘pepc’ (Power, Energy & Performance Configurator). On Linux, it provides a command-line interface for easily enabling and disabling C-states.
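For reference, the Linux kernel itself exposes a per-state disable file in each cpuidle sysfs state directory: writing 1 stops the kernel from requesting that software C-state on that CPU, and writing 0 allows it again. The following is a minimal Python sketch of that kernel interface, assuming root privileges on a real system. It is not pepc's implementation or command-line syntax, and the function name and sysfs_root parameter are hypothetical.

```python
from pathlib import Path


def set_cstate_enabled(cpu: int, state_name: str, enabled: bool,
                       sysfs_root: str = "/sys/devices/system/cpu") -> bool:
    """Enable or disable the cpuidle state called state_name on one CPU.

    Writing to the real sysfs requires root. Returns False if no state
    with that name exists for the CPU.
    """
    for state_dir in (Path(sysfs_root) / f"cpu{cpu}" / "cpuidle").glob("state*"):
        if (state_dir / "name").read_text().strip() == state_name:
            # "1" tells the kernel not to request this state on this CPU.
            (state_dir / "disable").write_text("0" if enabled else "1")
            return True
    return False
```

For example, set_cstate_enabled(0, "C6", enabled=False) would stop CPU 0 from requesting C6; to apply the change system-wide, call it for every CPU.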

Further reading

If you would like to learn more about server CPU power management, I strongly recommend the freely available book Energy Efficient Servers by Corey Gough, Ian Steiner & Winston Saunders. I was given this book when I started at Intel, and it taught me a lot; it is a great introduction to server CPU power management.

Additionally, if you are interested in the Linux specifics, the Linux documentation on CPU Idle Time Management describes how Linux handles idle time and C-states.

I hope you enjoyed reading this blog post!