Server processors don’t know what holidays are, since they are designed to work non-stop, 24 hours a day and 7 days a week. Yet it seems that AMD EPYC Rome CPUs stop dead after 1044 days of uptime. Why does this happen, and what are the consequences?
Today, many businesses offer services that depend on server infrastructure connected to the Internet. They run all over the world without a break, so CPUs that hang from time to time, leaving no choice but to schedule maintenance reboots or give up power-saving features, are still an economic problem.
Why do AMD EPYC Rome CPUs stop their activity after 1044 days?
In reality, the entire processor does not stop after that time, and what occurs is a hang rather than a shutdown: activity simply freezes in place. According to AMD itself, this happens when one of the cores, after going idle, is unable to wake up again. The underlying cause is not yet known, since no official explanation has been given.
We must bear in mind that AMD’s EPYC Rome is based on the Zen 2 architecture and has been on the market for a few years now. The curious thing is that the error occurs almost three years (1044 days is roughly 2.9 years) after the last system reset. Although a server is designed to run without interruption, it is completely normal for different parts of the system to undergo scheduled maintenance shutdowns on a regular basis.
What’s more, unlike a PC, a modern server has mechanisms to save the state of RAM and processor cache lines for immediate recovery, whether after a voltage drop, a power failure, or maintenance. So the problem is not as serious as it may seem at first glance.
AMD does not intend to provide a solution
The problem is not found in any firmware or driver, but within the processor itself. Since the launch of the EPYC Rome parts that freeze at 1044 days, AMD has released two further generations of server processors, one based on the Zen 3 architecture and the other on Zen 4, so the company has no interest in solving the problem.
The underlying problem lies in the way each core handles the so-called CC6 state, in which the voltage of a core is reduced to 0 volts. Processors of every kind do this continuously today in order to reduce energy consumption. Here, however, after a certain time the subsystem in charge of this is unable to reactivate the affected core.
In other words, the problem is limited in scope because it does not affect a core while it is active. Once 1044 days have passed since the last reboot, however, any cores that go to sleep will no longer wake up. That is why AMD currently recommends disabling the CC6 state, which prevents the processor cores from going to sleep.
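For administrators who prefer to schedule a reboot rather than disable CC6, the 1044-day figure can be turned into a simple uptime check. The following is only a minimal sketch under stated assumptions: it assumes a Linux host that exposes `/proc/uptime`, and the function names and the 30-day warning margin are hypothetical choices for illustration, not anything published by AMD.

```python
from pathlib import Path

THRESHOLD_DAYS = 1044      # uptime figure quoted in the article
SECONDS_PER_DAY = 86_400


def days_until_threshold(uptime_seconds: float,
                         threshold_days: int = THRESHOLD_DAYS) -> float:
    """Return how many days of uptime remain before the threshold.

    A negative result means the threshold has already been passed.
    """
    return threshold_days - uptime_seconds / SECONDS_PER_DAY


def needs_reboot_soon(uptime_seconds: float, warn_days: float = 30) -> bool:
    """True when the remaining margin is at or below warn_days.

    warn_days is an arbitrary safety margin chosen for this sketch.
    """
    return days_until_threshold(uptime_seconds) <= warn_days


def read_uptime_seconds(path: str = "/proc/uptime") -> float:
    """Read the uptime in seconds from a Linux-style /proc/uptime file.

    The first whitespace-separated field of that file is the uptime.
    """
    return float(Path(path).read_text().split()[0])


if __name__ == "__main__":
    uptime = read_uptime_seconds()
    remaining = days_until_threshold(uptime)
    print(f"Days of uptime left before day {THRESHOLD_DAYS}: {remaining:.1f}")
    if needs_reboot_soon(uptime):
        print("Consider scheduling a reboot or disabling CC6.")
```

Run on a schedule (for example from a monitoring job), a check like this would flag machines approaching the limit well before a sleeping core fails to wake.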