One of the most important new features introduced with the RTX 40 graphics cards is the so-called NVIDIA SER, a feature that will be common to all of the brand's GPUs from now on and that promises to increase performance in games, especially with ray tracing. In this article we explain how it works and what problem it solves.
Faced with the growing problem of rising chip costs, the brute-force approach of simply adding more and more cores will soon stop making sense. On top of that, memory bottlenecks will not allow performance to keep growing at the rate it has until now. It is at times like these that solutions like NVIDIA SER make all the sense in the world.
What is NVIDIA SER?
“SER” is an acronym for Shader Execution Reordering. It is not some new and previously unknown device that NVIDIA has pulled out of its sleeve, but rather a feature that has been implemented for the first time in the RTX 40 series. It is one of the reasons for the increased performance of the brand's current and future GPU generations.
Furthermore, this feature is not unique to NVIDIA, since the same concept is found on Intel Arc GPUs under the name Thread Sorting Unit. At first sight we could say that this is a type of out-of-order execution, like the one found in mobile and PC CPUs today.
However, we must bear in mind that the way a GPU manages its information is different from that of a CPU. The concept is the same: the execution of instructions is reorganized internally so that they do not have to wait for the corresponding units to become free. But the concept is one thing and the operation is quite another, and that is where things differ completely.
A GPU works differently from a CPU
To understand what NVIDIA SER is, we have to look at how a GPU executes its instructions and how that differs from a conventional CPU. To begin with, the concept of "cores" used in traditional marketing is misleading. The so-called “CUDA cores” cannot be considered cores, since they do not perform the other classic tasks of a CPU core, such as:
- Instruction fetch and decode.
- Branch prediction.
- Out-of-order execution.
They are simple ALUs, which provide the ability to perform arithmetic operations, usually in floating point. What could be classified as a complete core are the units that each vendor names as follows:
- At NVIDIA it is called the SM (Streaming Multiprocessor).
- At AMD they were named CU (Compute Unit) and, more recently in the RDNA architectures, WGP (Workgroup Processor).
However, these do not have the complete functionality of a core either, since they are in charge of scheduling the order in which to execute the instructions previously sent by the command processor, which sits in the central part of the chip. The command processor is the part that reads the command list created for the GPU by the CPU, actually decodes it, and then sends the work to the appropriate units.
Occupancy in a GPU and the movie theater analogy
To understand how a GPU works, we are going to compare it with a conventional cinema that shows several films at the same time. Each thread of execution is a person or a group of people going to watch a movie.
The command processor is the box office that sells us a ticket for a specific session. After that, the spectators look for their screening room, which in NVIDIA terms would be the SM, while the usher would be the equivalent of the scheduler within each unit and is therefore in charge of seating those who are going to see the movie.
Yes, we know that this is a very simplified explanation of how a GPU handles its instructions, and the reality is somewhat more complex, but it helps us explain the concept of occupancy and its effect on performance. The most commonly used units in such a chip are the so-called SIMD units, short for single instruction, multiple data.
The problem? When there is not enough data, some of the seats, that is, some of the ALUs, remain unused. This means you are not using 100% of the unit, and it happens more often than you might think. This is where SER, the Shader Execution Reordering that NVIDIA has implemented in its RTX 40, comes in.
How does NVIDIA SER work?
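The occupancy idea above can be put into numbers with a minimal sketch. It assumes a SIMD width of 32 lanes (the size of an NVIDIA warp); the function name `lane_utilization` and the sample figures are hypothetical and only illustrate the cinema analogy, not how the hardware measures anything.

```python
WARP_SIZE = 32  # assumed SIMD width: one "room" with 32 seats


def lane_utilization(active_lanes_per_issue, width=WARP_SIZE):
    """Return the fraction of SIMD lanes that did useful work.

    active_lanes_per_issue: for each instruction issue, how many lanes
    (seats) actually had data to process.
    """
    if not active_lanes_per_issue:
        return 0.0
    used = sum(active_lanes_per_issue)
    available = width * len(active_lanes_per_issue)
    return used / available


# Two half-empty rooms: each issue fills only 16 of the 32 seats.
half_empty = lane_utilization([16, 16])

# One full room showing the same film to the same 32 spectators.
full = lane_utilization([32])

print(half_empty, full)  # 0.5 1.0
```

In both cases 32 lanes' worth of work gets done, but the first case needs two issues to do it.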
Well, quite simply really. We have to assume that the SIMD units all carry out the same instruction at the same time, and what we are interested in is keeping as many of their lanes occupied as possible. In the case of NVIDIA SER, the trick is that all the instructions of one type are gathered into the same block and executed in unison, regardless of the order in which they were originally issued. In other words, going back to the cinema example, instead of having two half-empty rooms showing the same movie, we move all the spectators into the same room.
At the end of the day, the amount of work done is the same and the number of processing units is the same, but we do not waste resources by occupying extra SM units, which among other things reduces the chip's power consumption. Keeping more units active raises consumption, just as it costs more to keep the equipment running in two theaters in our cinema example. Avoiding that also leaves headroom to reach higher clock speeds, that is, more MHz, on our GPU.
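As a rough illustration of the regrouping described above, the following toy model sorts threads by the shader they need to run before packing them into SIMD groups. It is only a sketch under stated assumptions: the group width of 4, the shader labels "A" and "B", and the helpers `chunk`, `passes` and `utilization` are all hypothetical, and a plain sort stands in for whatever the hardware actually does.

```python
WARP_SIZE = 4  # deliberately tiny SIMD width so the example is easy to follow


def chunk(items, size=WARP_SIZE):
    """Pack items into consecutive groups of `size` (one group = one warp)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def passes(warps):
    """Count execution passes: a warp runs once per distinct shader it holds,
    with the lanes that need a different shader sitting idle on that pass."""
    return sum(len(set(warp)) for warp in warps)


def utilization(threads, warps, width=WARP_SIZE):
    """Useful lane-work divided by lane slots consumed over all passes."""
    return len(threads) / (passes(warps) * width)


# Each entry is the shader a thread must run, in arrival order.
threads = ["A", "B", "A", "B", "B", "A", "B", "A"]

before = chunk(threads)         # mixed warps: [A,B,A,B] and [B,A,B,A]
after = chunk(sorted(threads))  # reordered:   [A,A,A,A] and [B,B,B,B]

print(passes(before), utilization(threads, before))  # 4 0.5
print(passes(after), utilization(threads, after))    # 2 1.0
```

The same eight threads run the same two shaders in both cases; only the grouping changes, and with it the number of passes needed.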
How does Ray Tracing improve?
We must start from the fact that NVIDIA sells this new feature as a key to increasing performance in ray tracing. However, that is not the only improvement it brings, since branch instructions, which have historically been the slowest on a GPU, also benefit. For ray tracing, though, the key lies in the concept of memory coherence.
Traditionally, a GPU works with a graphics primitive, such as a vertex, that is located in a specific part of the screen. This lets us sort primitives according to their position in the frame, since at most they will affect the surrounding elements.
Therefore, data coherence, that is, making sure that everyone has the same view of memory, only has to be maintained for adjoining graphics primitives. However, a ray in ray tracing is different in that it can cross the entire screen and affect multiple objects at the same time.
Unifying most of the instructions in a single unit ensures that they all use the same cache and, therefore, coherence increases. The alternative would be to have the instructions related to a ray's path through the scene spread over several units, which would mean coordinating the coherence of several different local caches and a significant impact on the second-level cache.
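The cache argument can also be sketched with a toy model. Here each ray is labelled with the part of the scene it touches, each SM is assumed to have its own local cache, and we count how many times scene data has to be brought into a local cache. The labels "wall" and "floor", the two-SM setup and the helpers `assign_in_blocks` and `local_cache_fills` are hypothetical; real caches work on cache lines and evict data, which this sketch ignores.

```python
NUM_SMS = 2  # assumed number of units, each with its own local cache


def assign_in_blocks(items, num_sms=NUM_SMS):
    """Give each SM a contiguous block of the work list.

    Returns a list of (sm_id, item) pairs.
    """
    per_sm = -(-len(items) // num_sms)  # ceiling division
    return [(i // per_sm, item) for i, item in enumerate(items)]


def local_cache_fills(assignment):
    """Each distinct (SM, scene region) pair costs one fill of a local cache."""
    return len(set(assignment))


# Scene region hit by each ray, in the order the rays were generated.
rays = ["wall", "floor", "wall", "floor", "wall", "floor", "wall", "floor"]

scattered = assign_in_blocks(rays)        # both SMs need both regions
grouped = assign_in_blocks(sorted(rays))  # one region per SM

print(local_cache_fills(scattered))  # 4
print(local_cache_fills(grouped))    # 2
```

With the rays grouped, each region's data lives in a single local cache, so fewer copies have to be fetched through the second-level cache.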