Although yesterday we had a glimpse of the AD102 and its architecture, today it is time to break it down further, since NVIDIA has shown its complete block diagram with slight changes and some quite interesting details. Logically, we are going to talk about the full version that, in theory, would arrive with the RTX 4090 Ti, although this can be extrapolated to the RTX 4090 by adjusting the general specifications, since both carry the same chip, only cut down. What is this gaming compute monster with the Ada Lovelace architecture like?
As for general specifications, nothing has changed from what we saw yesterday, but it is worth remembering that this chip is manufactured by TSMC on 4N, a process exclusive to NVIDIA, and that density has therefore been maximized, making it possible to fit 76.3 billion transistors.
NVIDIA AD102, architecture and previously unseen details
The chip has an area of 608.4 mm², so it is a little smaller than its predecessor; even so, thanks to its density, NVIDIA has been able to squeeze a lot more into this series of chips. As an architecture, AD102 has 12 GPCs, 72 TPCs and 144 SMs with 18,432 shaders (CUDA Cores), plus no fewer than 144 RT Cores and 576 Tensor Cores.
Boost frequencies go up to 2.5 GHz, while the memory interface has not been modified and stays at 384 bits. Its NVENC encoder moves to the eighth generation and supports twice the performance in AV1, something that is surely in high demand among certain users.
Having reviewed all the theory we already covered in two previous articles, what are the actual novelties? Well, as NVIDIA has shown and commented, everything lies at the SM level, where the changes are substantial even though the general structure has been maintained.
What's new in the Ada Lovelace SM
As we know, and going from the outside in, each GPC gets 6 TPCs with a common rasterization engine for all of them. There are interesting novelties in these rasterization engines: each GPC has an engine dedicated to itself, but it now has two ROP partitions with eight ROPs each. That means Ada Lovelace has 71.42% more ROPs than Ampere, which is exactly the difference in shaders between the GA102 and this AD102.
That is, each rasterization engine has 16 ROPs, which multiplied by the 12 GPCs gives us a total of 192 ROPs. Moving on, each TPC integrates two SMs, and each SM continues to implement 4 Sub-Cores, as in Ampere.
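The ROP arithmetic above can be checked with a short sketch. The AD102 figures come from the text; the GA102 baseline of 112 ROPs is an assumption brought in for the comparison, since the article does not state it.

```python
# AD102 figures as described in the text.
GPCS = 12
ROP_PARTITIONS_PER_GPC = 2
ROPS_PER_PARTITION = 8

# Assumption: GA102 baseline, not stated in the article.
GA102_ROPS = 112

rops_per_gpc = ROP_PARTITIONS_PER_GPC * ROPS_PER_PARTITION
total_rops = GPCS * rops_per_gpc
increase_pct = (total_rops / GA102_ROPS - 1) * 100

print(f"ROPs per GPC: {rops_per_gpc}")
print(f"Total ROPs:   {total_rops}")
print(f"Increase over GA102: {increase_pct:.2f}%")
```

With these inputs the increase comes out at roughly 71.4%, the same ratio as 144 SMs to 84.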
Now we go inside each Sub-Core, where all the news and the magic of this new architecture are. As we have seen, the arrangement of elements as architectural concepts does not change as such, and it should not be confused with their physical layout on the chip. The theory and the diagrams are one thing; how they are physically etched into the silicon is another.
Following the theory, each Sub-Core is divided into three different engines. NVIDIA again leaves outside of them, as resources shared across each SM, the RT Core, the texture units in four groups (64 KB per SM) and, in addition, the L1 Data Cache and Shared Memory, which stays at 128 KB.
Therefore, the four Sub-Cores, each divided into three different engines, implement a very important novelty that had not been seen until now: the L0 Instruction Cache, Warp Scheduler and Dispatch are now a single unit that processes 32 threads per clock. The only thing NVIDIA has left unchanged here relative to Ampere is the register file, which still has a size of 16,384 x 32-bit.
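To put the 16,384 x 32-bit figure in more familiar units, here is a minimal sketch that converts it to capacity per Sub-Core, per SM and for the full chip. It assumes the figure applies to each Sub-Core and uses the 4 Sub-Cores per SM and 144 SMs given in the text.

```python
REGISTERS_PER_SUBCORE = 16_384   # entries, from the text
BITS_PER_REGISTER = 32
SUBCORES_PER_SM = 4
SMS = 144                        # full AD102

bytes_per_subcore = REGISTERS_PER_SUBCORE * BITS_PER_REGISTER // 8
kib_per_subcore = bytes_per_subcore // 1024
kib_per_sm = kib_per_subcore * SUBCORES_PER_SM
mib_total = kib_per_sm * SMS // 1024

print(f"{kib_per_subcore} KB per Sub-Core, {kib_per_sm} KB per SM, {mib_total} MB in total")
```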
Going back to the engines, what we see is that we have the 4th Gen Tensor Cores, an engine dedicated exclusively to FP32 and one shared between FP32 and INT32. The latter can work with both data types, so NVIDIA counts all of it as FP32 and adds it to the exclusive engine it already has. Therefore, we have 16 FP32 units and 16 FP32/INT32 units per Sub-Core; multiplied by the four Sub-Cores in an SM, this gives a count of 128 FP32 units, which in turn, multiplied by the 144 SMs that the AD102 has in its architecture, gives us the figure of 18,432 shaders.
With this understood, we stress again that these are two different engines that share common resources, but one works only with floating-point data, while the other can switch to integers if required.
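The shader count can be rebuilt step by step from the hierarchy described in the text (GPC, TPC, SM, Sub-Core). The sketch below only multiplies the article's own figures; the variable names are illustrative.

```python
# Hierarchy of the full AD102, as given in the text.
GPCS = 12
TPCS_PER_GPC = 6
SMS_PER_TPC = 2
SUBCORES_PER_SM = 4

# Per Sub-Core: one FP32-only engine and one shared FP32/INT32 engine.
FP32_ONLY_UNITS = 16
FP32_INT32_UNITS = 16

sm_count = GPCS * TPCS_PER_GPC * SMS_PER_TPC
fp32_per_sm = (FP32_ONLY_UNITS + FP32_INT32_UNITS) * SUBCORES_PER_SM
shaders = fp32_per_sm * sm_count

print(f"SMs: {sm_count}, FP32 units per SM: {fp32_per_sm}, shaders: {shaders}")
```

Note that the total counts the shared FP32/INT32 units as FP32, which is how NVIDIA arrives at its headline figure.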
NVIDIA, diagrams and integers
The biggest problem in understanding the changes comes from NVIDIA itself, as each architecture has an advanced and a simplified SM diagram. The one just above is the advanced one and, as we can see, it is incorrect: it shows four engines rather than three, and a single LD/ST unit when there are actually 4, in addition to the SFU (Special Function Unit).
It was also said that Ada Lovelace would have 192 KB of L1, and in the end it has 128 KB (18 MB in total). But this is not the most important point, since neither the number of FP32 units nor that of INT32 units is greater than expected. The surprise is finding a single group of units for the L0, Warp Scheduler and Dispatch, which NVIDIA now simply calls Warp, with a size of 64, although it maintains the threads per clock.
The novelty is that Ada Lovelace gets a single block with 33% more warps than Ampere, which was necessary because of the novelties mentioned in the rasterization engines.
Is Ada Lovelace a totally new architecture?
The reality is that no, it is only an evolution of Ampere optimized for the new 3rd-generation RT Cores and 4th-generation Tensor Cores. The numbers against Ampere prove it, because the vast majority scale by the same factor:
- +71% GPC.
- +71% SM.
- +71% Shaders.
- +71% ROP.
- +71% L1 cache.
- +1,500% in L2 cache (from 6,144 KB to 98,304 KB, 16 times as much; the RTX 4090 has 73,728 KB of L2, 12 times as much as the RTX 3090 Ti with 6,144 KB).
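The scaling figures in this list can be reproduced with a small sketch. The AD102 values and the 6,144 KB of GA102 L2 come from the text; the remaining GA102 values (7 GPCs, 84 SMs, 10,752 shaders, 112 ROPs, 128 KB of L1 per SM) are assumptions added for the comparison.

```python
# GA102 baseline: only the L2 figure is stated in the article, the rest are assumptions.
ga102 = {"GPC": 7, "SM": 84, "Shaders": 10_752, "ROP": 112,
         "L1 KB": 84 * 128, "L2 KB": 6_144}

# Full AD102, as described in the text.
ad102 = {"GPC": 12, "SM": 144, "Shaders": 18_432, "ROP": 192,
         "L1 KB": 144 * 128, "L2 KB": 98_304}

def pct_increase(old: float, new: float) -> float:
    """Percentage increase going from old to new."""
    return (new / old - 1) * 100

scaling = {name: pct_increase(ga102[name], ad102[name]) for name in ga102}

for name, pct in scaling.items():
    print(f"{name:8s} +{pct:.0f}%  ({ad102[name] / ga102[name]:.2f}x)")
```

Everything except the L2 cache scales by the same factor of 12/7; the L2 grows by a factor of 16.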
Therefore, much of the improvement comes down to including “more of everything”, having a higher final frequency and, of course, minimizing trips out to VRAM/PCIe thanks to the larger caches. NVIDIA’s move is to boost Ray Tracing and AI with scaling through DLSS 3, but as soon as these are not used, what we are going to see are performance increases ranging from +50% up to (hopefully) +80% in games that do not support them, and that, and no other, will be the real and scalable performance of the architecture.
Clock for clock, the IPC gain over Ampere as an architecture and SM should not exceed 10% without RT or DLSS, since all we have are larger caches, which obviously improve what is already there and are welcome.
It is true, though, that we do not know the direct impact on IPC of Shader Execution Reordering (SER): although NVIDIA speaks of an improvement of up to three times the performance in Ray Tracing, the gain in frame rate is only a little over 25%.
Therefore, in principle, the performance impact when Ray Tracing is not required or supported should be zero, because the goal of SER is to dynamically reorganize inefficient workloads, which is why it is exposed through NVAPI. We could imagine that it will improve traditional rendering somewhat; the question is how much.
In summary, Ada Lovelace is an evolution of Ampere that resembles it too closely; that is, it has more similarities than real novelties, and all because NVIDIA is trying to implement and strengthen the area where AMD is weakest, relying on game releases that use its features. In terms of improvements as such, the price being asked for these graphics cards is not in line with the evolution of the AD102 architecture and Ada Lovelace in general.
AD102, its architecture in terms of performance and efficiency
Only the higher node density and frequency, along with the new RT Core and Tensor Core engines, seem to make a real difference, because the architecture changes are really minimal (Warp and L2 cache). That is, the 2x efficiency improvement NVIDIA shows in the image above is owed more to TSMC than to the architecture itself, where only the larger L2 cache contributes.
Comparatively speaking, and knowing that the NVIDIA-specific TSMC 4N is a direct evolution of the foundry’s N5P tailored to Huang’s company, what the green team should have in hand compared with TSMC N7 (the gap with Samsung 8N is greater) is 40% lower power consumption (there is talk of 50% on 4N) and 80% more density, together with frequencies 15%-20% higher from the node itself, which NVIDIA has pushed to 35% between the RTX 3090 Ti and the RTX 4090… Well, let’s see how it goes.
If you can reach a 50% reduction in power consumption from the node (at least), assume that the architecture changes explained above add a minimal percentage of efficiency (around 5%?) and push frequencies up by 35%, you get a nice 90% or so (a conservative estimate). If we add to this the hypothetical 10% IPC increase from the changes in the AD102 architecture and Ada Lovelace, then we land on the values NVIDIA itself gives in the image above, always as an approximation, of course.
We insist that we are comparing the hypothetical figures for 4N against TSMC’s N7 and not against Samsung’s 8N, which is far behind the Taiwanese foundry’s node. In theory, then, NVIDIA must really have better figures than those laid out here, which would further reduce the IPC improvement, because the node’s gains are greater than stated.
All of this is obviously a rough outline: we will never know how much 4N improves on N5, much less on Samsung 8N, and short of a direct comparison we will not know how much IPC has improved either. On the upside, the numbers add up (too well), and they somehow reinforce the idea that Ada Lovelace is a technical upgrade rather than a new architecture.
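The tally in this paragraph can be written out as a back-of-the-envelope sketch. It simply adds the hypothetical percentages as the text does; it is not a rigorous efficiency model, and every input is one of the article's own estimates.

```python
# Hypothetical contributions taken from the text, in percent.
node_power_saving_pct = 50   # assumed reduction in consumption from the node
arch_efficiency_pct = 5      # the article marks this one with a question mark
frequency_gain_pct = 35      # RTX 3090 Ti to RTX 4090
ipc_gain_pct = 10            # hypothetical IPC improvement

# Summing percentages is the article's own shortcut, reproduced as-is.
subtotal_pct = node_power_saving_pct + arch_efficiency_pct + frequency_gain_pct
total_pct = subtotal_pct + ipc_gain_pct
efficiency_factor = 1 + total_pct / 100

print(f"Node + architecture + frequency: about {subtotal_pct}%")
print(f"Adding IPC: about {total_pct}%, i.e. roughly {efficiency_factor:.1f}x")
```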