Hardware optimization and well-considered software architecture can make a difference, but only so much.
Multi-core processors can, in theory, run many threads of code in parallel, but certain categories of operations currently undermine attempts to raise overall performance through parallel computing.
Is it time to use accelerators to run highly parallel code? Standard processors now contain many CPU cores, and keeping them all coordinated through cache coherency and synchronization can consume thousands of low-level system cycles. Depending on CPU width and data dependencies, the ability to exploit instruction-level parallelism is also limited.
These CPU performance bottlenecks are real and widespread, and they are difficult to address. Although software developers play a significant role in creating parallel algorithms, there is still room for hardware specifically designed to execute parallel code.
Three major bottlenecks
CPU architects have spent countless cycles looking for ways to improve their processors, whether to enhance performance or to reduce power consumption, and that search has driven the continuous improvement of CPUs. "CPUs are designed to run unstructured application code," explains Steve Roddy, Chief Marketing Officer of Quadric. "To speed up execution without burdening programmers with considerations of the code or the target machine, modern CPUs have accumulated a long list of increasingly exotic features designed to run random unknown code as fast as possible."
Technologies that have evolved over the years include:
Superscalar architecture, characterized by a decoder that can simultaneously issue multiple instructions to a series of parallel functional units.
Speculative execution, where the CPU guesses the most likely outcome of a branch decision before it is resolved and proceeds down that path. If the guess is correct, execution is a step ahead. If the guess is wrong, the pipeline must be flushed and the other branch executed. (A small sketch of this trade-off appears after this list.)
Out-of-order execution, where multiple instructions in a sequence are executed out of order as long as they are not interdependent.
Tracking of outstanding memory accesses, where multiple requests can be issued to memory at the same time. Depending on congestion and traffic, requests issued later may return sooner than earlier ones.
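The cost of a wrong guess is easiest to see in a data-dependent branch. The sketch below is illustrative only (the function names and the 128 threshold are invented): with random input, the branch in the first version defies prediction, while the second turns the decision into arithmetic that compilers typically lower to a conditional move, leaving nothing to speculate on.

```cpp
#include <cstdint>
#include <vector>

// A data-dependent branch. On random data the predictor guesses wrong about
// half the time, and each mispredict forces a pipeline flush.
int64_t sum_branchy(const std::vector<int>& v) {
    int64_t sum = 0;
    for (int x : v) {
        if (x >= 128) {          // hard to predict when values are random
            sum += x;
        }
    }
    return sum;
}

// A branchless form. The condition becomes arithmetic, so there is no branch
// to mispredict; compilers typically emit a conditional move or select here.
int64_t sum_branchless(const std::vector<int>& v) {
    int64_t sum = 0;
    for (int x : v) {
        sum += (x >= 128) ? x : 0;
    }
    return sum;
}
```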
Despite these efforts, bottlenecks remain. The first is the time required to fetch data from memory, which typically takes hundreds of cycles. Multiple CPUs and threads can issue multiple requests, overlapping their access times to minimize the apparent latency. Caching helps reduce the latency of future accesses, but if one thread changes a value that other threads are using, maintaining coherency takes time.
The second bottleneck is synchronization caused by data dependencies between threads. If multiple threads access the same data, that data may need to be locked for exclusive use while it is changed and then released. Alternatively, if multiple threads contribute to an overall computation, there may be a point where all of them must finish their work before any can continue. The overhead of managing this synchronization can be substantial.
The third bottleneck involves instruction-level parallelism, where multiple instructions are executed simultaneously. Superscalar CPUs explicitly support this behavior, but they also have limitations.
Flow Computing Chief Technology Officer and Chief Architect Martti Forsell said: "CPUs cannot properly handle latency hiding. They cannot properly handle synchronization, and they cannot properly handle low-level parallelism."
Coherency latency
Caches buffer data to compensate for long DRAM access times. However, cache systems are usually hierarchical, mixing private and shared caches. "The fundamental choice that affects performance is to optimize locality of reference by adding levels of cache," said Arif Khan, Senior Director of Product Marketing at Cadence's Silicon Solutions Division.

Level 1 (L1) caches are closest to the CPU and prioritize fast access. They are usually private, meaning only one CPU can access them. Some CPUs also provide a private Level 2 (L2) cache. The benefit is that more data can be kept in cache, with some of it moved into the larger, slightly slower (but cheaper) L2. With these private caches, if multiple CPUs need the same data, each keeps a copy in its own cache. "In CPUs, coherency problems are unavoidable because of the private caches," Forsell pointed out.
Some CPUs have a shared L2 cache, and most Level 3 (or last-level) caches are shared as well. The advantage of a shared cache is that multiple CPUs can access a single piece of data without separate copies. However, L2 sharing usually spans only two or four CPUs. An eight-core processor built this way would therefore have two L2 caches, and sharing does not cross between them. That means if two threads running on CPUs under different L2 caches need the same data, each L2 must hold its own copy and be kept coherent. Only the last-level cache (LLC) is always shared among all CPUs and therefore needs no coherency management.
Figure 1: An example cache hierarchy. L1 caches are usually dedicated to a single CPU. L2 caches may be dedicated as well, or shared by some or all CPUs. The last-level cache is typically shared by all. Source: Bryon Moyer/Semiconductor Engineering
But sharing brings its own complications. In a processor with many CPUs, the cores are distributed across the die, making it difficult to place a unified LLC where every CPU can access it with the same latency. And larger caches come at a cost. "The larger the cache, the higher the latency," Steven Woo, distinguished inventor at Rambus, pointed out. In practice, although the LLC is logically a single unit, it is physically divided into multiple slices, each placed close to a CPU.
Coherency protocols have been established for many years. Fred Piry, Vice President and Researcher of CPU Technology at Arm, said: "MESI is the most common cache coherency protocol, which describes cache lines as being modified (M), exclusive (E), shared (S), or invalid (I)."
Woo said: "Currently, MESI is a fairly stable protocol. In fact, it has not changed much, which indicates that we have done a lot of work in the past to fine-tune it."
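As a rough illustration of what the protocol tracks, the sketch below models the four MESI states and one simplified transition, a local write. It is not any vendor's implementation (real protocols add transient states and variants such as MOESI or MESIF); the point is that a write to a Shared line cannot complete until the other copies have been invalidated.

```cpp
// The four MESI states for a cache line (simplified).
enum class MesiState { Modified, Exclusive, Shared, Invalid };

// Simplified view of one core's cache line when that core writes to it.
// Writing to a Shared or Invalid line requires messaging other caches first,
// which is where the cross-core latency comes from.
MesiState on_local_write(MesiState current, bool& must_invalidate_others) {
    must_invalidate_others = false;
    switch (current) {
        case MesiState::Modified:   // already owned and dirty: write locally
            return MesiState::Modified;
        case MesiState::Exclusive:  // owned and clean: silent upgrade to Modified
            return MesiState::Modified;
        case MesiState::Shared:     // other caches hold copies: invalidate them
            must_invalidate_others = true;
            return MesiState::Modified;
        case MesiState::Invalid:    // must request ownership before writing
            must_invalidate_others = true;
            return MesiState::Modified;
    }
    return MesiState::Invalid;      // unreachable; keeps compilers happy
}
```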
CPUs send messages to other CPUs, indicating status changes. "If you are writing to a shared line, you need to tell other CPUs, 'I am modifying this cache line, please forget it. I have a new version now.' In mobile phones or laptops, there are not so many CPUs, so the speed is very fast," Piry said. "If there are hundreds of CPUs in a large server, the latency may be significant."
Tracking these messages is not easy. "It can be very challenging, because you can have situations where messages are chasing each other through the system," Woo said. "In the worst case, on a big chip, the core in one corner has to invalidate a line held by the core in the opposite corner. That takes the time to traverse the chip, plus the time for the messages to work their way through the hierarchy."
This messaging usually travels over a dedicated coherency mesh. For smaller processors, the latency may be limited to about 10 cycles. "Arm's Coherent Mesh Network (CMN), which supports AMBA CHI, is such a fabric," Khan noted. "IP providers also have solutions in this space." For processors with many cores, however, the mesh can suffer congestion and delays as the messages traverse it, with the worst case potentially reaching about 1,000 cycles.
This coherency activity takes place behind the scenes, at the hardware and system-software level, so application developers have little direct control over it. They are, however, best positioned to influence performance through the structure of the program and the way threads share variables. For hardware designers, the available levers are cache sizes, which caches are private or shared, and the coherency mesh. Improving them can boost performance, but at the cost of chip area, expense, and power.
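One concrete example of how data layout influences coherency traffic is false sharing. The sketch below is a hypothetical illustration, assuming C++17 and a 64-byte cache line (common, but architecture-dependent): padding each per-thread counter onto its own line keeps one thread's writes from invalidating the line its neighbors are using.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Without padding, adjacent counters share a cache line, so every increment
// by one thread invalidates that line in the other cores' caches. alignas(64)
// gives each counter its own line (64 bytes assumed as the line size).
struct alignas(64) PaddedCounter {
    std::atomic<long> value{0};
};

int main() {
    constexpr int kThreads = 4;
    std::vector<PaddedCounter> counters(kThreads);

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&counters, t] {
            for (int i = 0; i < 1'000'000; ++i) {
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```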
Synopsys ARC-V RPX processor product manager Mohit Wani said, "An efficient, coherent shared-cache architecture is essential for minimizing shared-data access latency for all the threads involved in a shared computation."
Synchronization Delays
There are two main synchronization mechanisms: locks (also called mutexes, short for "mutual exclusion") and barriers. A lock takes data that is accessible to multiple CPUs and restricts access to one CPU at a time while the data is being changed. This helps ensure that every CPU works from the same value, with none using a stale or intermediate version while others use a new one.
Barriers halt execution at a specific point until all of the threads contributing to a result have caught up and completed their work, after which execution can continue. A reduction algorithm, in which multiple threads compute a single result, is one example.
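As a minimal sketch of that pattern, the example below uses C++20's std::barrier for a toy reduction. The thread count and the work each thread does are arbitrary; the point is that the completion step (summing the partial results) runs only after every thread has arrived at the barrier.

```cpp
#include <barrier>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    constexpr int kThreads = 4;
    std::vector<long> partial(kThreads, 0);
    long total = 0;

    // The completion function runs exactly once, after all threads arrive.
    std::barrier sync(kThreads, [&]() noexcept {
        total = std::accumulate(partial.begin(), partial.end(), 0L);
    });

    std::vector<std::thread> workers;
    for (int t = 0; t < kThreads; ++t) {
        workers.emplace_back([&, t] {
            // Each thread computes its share of the work...
            for (int i = t * 250; i < (t + 1) * 250; ++i) partial[t] += i;
            // ...then waits until every partial sum is ready.
            sync.arrive_and_wait();
        });
    }
    for (auto& w : workers) w.join();
    return 0;
}
```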
There are at least two ways to implement each of these tools, and both have costs. "Sometimes what people do when they hit a barrier is the equivalent of going to sleep, and then something like an interrupt wakes them up again," Woo explained. "Interrupts are expensive, and sending all the release messages and restarting the core takes a lot of time." The advantage of this approach is that it saves power, because the core sleeps while it waits.
The other method is to use a flag or semaphore, with the core busy-looping as it polls the flag to see when it changes. "When the value at that memory location changes, they know they can move on," Woo said. This technique is faster because no interrupt is involved, but the core consumes more power while it spins.
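Here is a hedged sketch of the two waiting strategies described above, using standard C++ primitives (the function names are illustrative): the blocking version lets the waiting core sleep at the cost of a slower wake-up path, while the spinning version wakes almost immediately but burns power polling.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Strategy 1: block until notified. The waiting core can sleep, saving power,
// but the wake-up path (notification, scheduling, possibly an interrupt) is slow.
std::mutex m;
std::condition_variable cv;
bool ready = false;

void wait_blocking() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [] { return ready; });
}

void release_blocking() {
    { std::lock_guard<std::mutex> lock(m); ready = true; }
    cv.notify_all();
}

// Strategy 2: busy-wait on a flag. The waiter sees the store on its next poll,
// so the latency is tiny, but the core burns power while it spins.
std::atomic<bool> flag{false};

void wait_spinning() {
    while (!flag.load(std::memory_order_acquire)) { /* spin */ }
}

void release_spinning() {
    flag.store(true, std::memory_order_release);
}
```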
In addition, a thread sitting on an idle core may be swapped out for another thread in the meantime, and those context switches are also expensive. "When a task is suspended to free up a processor core while waiting for other tasks to catch up, the context-switch overhead can cause significant losses," Wani noted. Finally, just as with coherency, once a barrier or lock is released, the message must go out to the other cores, which can take thousands of cycles on a many-core device. "If you have 100 CPUs in your system, the command will propagate to all of them," Piry explained. "All of the CPUs will receive the synchronization request and acknowledge that they have received and completed it. Then they need to act on the command, which may take thousands of cycles."
"CPU hardware can help by providing an instruction set and memory architecture optimized for thread synchronization and data sharing," Wani said. Some small processors for embedded use come with hardware semaphores or mailboxes to speed up synchronization. This approach, while suitable for small CPUs, does not scale well.
More sophisticated units provide specific instructions, such as atomic memory operations. These perform a read/modify/write sequence atomically, meaning no other entity can see or affect the data before the sequence completes. Such instructions help avoid the explicit synchronization otherwise needed to prevent data hazards.
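A minimal sketch of the difference for a simple shared counter (names illustrative): the first version needs an explicit lock around the read/modify/write, while the second relies on an atomic fetch-and-add, which completes without exposing any intermediate value.

```cpp
#include <atomic>
#include <mutex>

// Without an atomic read-modify-write, the shared counter needs a lock.
std::mutex counter_lock;
long counter_locked = 0;

void increment_with_lock() {
    std::lock_guard<std::mutex> guard(counter_lock);
    ++counter_locked;            // read, modify, write under the lock
}

// With an atomic fetch-and-add, no other core can observe an intermediate
// value, so no explicit lock is needed.
std::atomic<long> counter_atomic{0};

void increment_atomically() {
    counter_atomic.fetch_add(1, std::memory_order_relaxed);
}
```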
Program architecture is also an important tool. "The best way to address synchronization overhead is to adopt a software architecture that can reduce the frequency at which threads need to synchronize," Wani said.
Instruction-Level Parallelism
Parallel execution can occur at multiple levels. Instruction-level parallelism refers to the simultaneous execution of multiple instructions, which modern superscalar CPUs allow, but with limitations. "You can execute multiple instructions in multiple functional units in a superscalar processor, but the requirement is that those instructions must be independent," Forsell said.
Figure 4 shows part of a 10-wide CPU microarchitecture, reflecting the widest CPUs in commercial use today. By definition, the maximum parallelism here is 10 instructions, but two issues limit it. The first relates to the distribution of functional units; the second comes from data dependencies.
A 10-wide CPU has 10 sets of functional units available, and which set receives an instruction depends on what that instruction is. In the fictional example in Figure 4, each set contains two functional units, one of which is an integer ALU. For integer operations, therefore, up to 10 can run at once. But only five sets have floating-point units, two have bitwise-operation units, and there is one each for the multiplier, the divider, and the fused multiply-add (FMA) unit. So up to two bitwise operations can run together, while multiplies, divides, and FMAs cannot run in parallel with each other. These opportunities represent the best case, and they materialize only when the instructions involved are not interdependent. If one operation depends on the result of a prior one, for example, the two must execute sequentially.
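Here is a small sketch of how dependencies cap instruction-level parallelism (the function names are illustrative, and reordering floating-point additions can change rounding slightly): the single-accumulator loop forms one serial chain of additions, while the four-accumulator version gives a wide CPU independent chains it can issue in parallel.

```cpp
#include <cstddef>
#include <vector>

// One accumulator: every addition depends on the previous one, so the adds
// must complete one after another no matter how wide the CPU is.
double sum_serial(const std::vector<double>& v) {
    double s = 0.0;
    for (double x : v) s += x;
    return s;
}

// Four independent accumulators: the four chains have no dependencies on each
// other, so a superscalar CPU can keep several floating-point units busy.
double sum_unrolled(const std::vector<double>& v) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    std::size_t i = 0;
    for (; i + 4 <= v.size(); i += 4) {
        s0 += v[i];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < v.size(); ++i) s0 += v[i];   // handle the leftover elements
    return (s0 + s1) + (s2 + s3);
}
```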
Once again, software architecture is the most powerful lever for maximizing parallelism. "Algorithms are always the best way to optimize performance on a given hardware platform," said Khan. "Distributed data parallelism and fully sharded data parallelism techniques can be used for model replication and data partitioning, achieving optimal system utilization and model performance."
Identifying data dependencies requires software analysis. Declared variables can be analyzed statically, and the compiler can order the object code to maximize performance on the target CPU. Many programs, however, make heavy use of pointers, which cannot be analyzed statically because their values are determined at run time. Dynamic analysis, which monitors the access patterns of loads and stores, may be necessary. Its results can help developers increase a program's parallelism while ensuring that dependencies are respected.
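The pointer problem is easy to see in a sketch (illustrative only; __restrict is a widely supported compiler extension rather than standard C++): if the compiler cannot prove that two pointers never overlap, it must assume each store could feed a later load and hold back reordering and vectorization.

```cpp
#include <cstddef>

// The compiler cannot prove dst and src do not overlap, so it must assume a
// store to dst[i] might change a value later read through src, which limits
// reordering and vectorization.
void scale_maybe_aliased(float* dst, const float* src, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = k * src[i];
}

// With a no-alias promise the iterations are provably independent. __restrict
// is a common compiler extension (restrict is standard only in C), and keeping
// the promise is the programmer's responsibility.
void scale_no_alias(float* __restrict dst, const float* __restrict src,
                    float k, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] = k * src[i];
}
```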
Control and Data Planes
Threads that run a series of instructions without any branching (so-called basic blocks) can achieve the greatest degree of parallelism. Branching introduces uncertainty, and speculative execution can end in wrong guesses that must be unwound.
Engineers sometimes describe two coding styles, one for control and one for data manipulation. The distinction underlies network-processing chip architectures, which provide dedicated hardware for long chains of data computation alongside a CPU that executes the control code, which typically contains many branches. Data-plane code is better suited to parallel execution than control code.
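As a rough, hypothetical illustration of the two styles (the protocol numbers and function names are invented for the example): the control-plane routine is dominated by data-dependent branches, while the data-plane routine is a single straight-line pass over a buffer, the kind of basic block that maps well onto parallel hardware.

```cpp
#include <cstddef>
#include <cstdint>

// Control-plane style: lots of branching, little arithmetic.
int classify_packet(std::uint8_t protocol, std::uint16_t port) {
    if (protocol == 6) {                              // TCP
        if (port == 80 || port == 443) return 1;      // web traffic
        return 2;                                     // other TCP
    } else if (protocol == 17) {                      // UDP
        return 3;
    }
    return 0;                                         // everything else
}

// Data-plane style: one long, branch-free pass over a payload.
std::uint32_t checksum_payload(const std::uint8_t* data, std::size_t n) {
    std::uint32_t sum = 0;
    for (std::size_t i = 0; i < n; ++i) sum += data[i];
    return sum;
}
```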
Given the limited opportunities for improvement within traditional CPU architectures, one suggestion is to add a separate hardware unit for data-plane-like code, which could deliver far more parallelism than a CPU alone. "Platforms at every performance level have now embraced the idea of offloading structured tasks from the general-purpose CPU onto task-specific processing engines," said Roddy.
Startup Flow Computing is proposing such a unit. "It is an IP block tightly integrated with the CPU on the same silicon, rather than hanging off a bus the way a GPU or neural accelerator would," explained Timo Valtonen, CEO of Flow Computing. "It is a configurable array of small processors that can execute parallel code without needing many barriers or locks."
The proposed unit has a large cache shared by all of its cores, eliminating coherency latency within the unit. If the host CPU and the parallel unit were working on some of the same data, coherency between the CPU's caches and the parallel unit's cache would be needed, but Flow considers that unlikely.

Fibers provide a lightweight way to handle work that would otherwise be handled by threads. Fibers are created by the CPU, can be handed to the accelerator, and avoid the operating-system calls required to manage threads. That relieves some of the synchronization latency.
The ability to chain instructions together can maximize instruction-level parallelism. "Chaining is taking the output of one instruction and feeding it directly into the next," Woo explained. "Supercomputers from companies like Cray have been doing this since the 1980s and early 1990s."
Using the parallel unit requires recompiling the code. "We apply a special compilation algorithm to handle code with dependencies in order to achieve chained execution," Forsell said. The new code lets the CPU hand off the highly parallel portions, and the dependency analysis allows the instructions flowing through the parallel unit to be chained.
The catch is that the entire codebase uses the native instruction set of the CPU. The compiler does not generate separate object code for the parallel portions, which means the parallel unit must be equipped with decoders and other logic matching the companion CPU. That work must be done in cooperation with the CPU vendor, since decoding someone else's instruction set could be considered a violation of intellectual property.
Compilers, however, are not all-powerful. "You have to remember that compilers value correctness over performance," Woo cautioned. "They sometimes have to make conservative assumptions rather than aggressive ones. To get the most out of the system, programmers must give the compiler directives or bypass its automatic code generation."
Beyond incremental growth
Although CPU designers keep looking for opportunities to improve performance, the low-hanging fruit is long gone. The roofline model, which identifies where a computation is limited by memory bandwidth and where it is limited by compute, can help developers spot opportunities for improvement and weigh their costs. But absent some game-changing development, gains will remain incremental from one generation to the next.
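The roofline bound itself is just the lesser of peak compute and arithmetic intensity times memory bandwidth. The sketch below works through one made-up case (the machine numbers are invented, and the kernel is a simple dot product at 2 flops per 16 bytes loaded) to show how the model flags a memory-bound computation.

```cpp
#include <algorithm>
#include <cstdio>

int main() {
    // Illustrative machine numbers, not a real part.
    const double peak_gflops   = 500.0;   // peak compute, GFLOP/s
    const double bandwidth_gbs = 100.0;   // memory bandwidth, GB/s

    // Example kernel: a dot product on doubles does 2 flops per 16 bytes
    // loaded, i.e. an arithmetic intensity of 0.125 flop/byte.
    const double intensity = 2.0 / 16.0;

    // Roofline bound: min(peak compute, intensity * bandwidth).
    const double bound = std::min(peak_gflops, intensity * bandwidth_gbs);
    std::printf("attainable: %.1f GFLOP/s (%s-bound)\n", bound,
                bound < peak_gflops ? "memory" : "compute");
    return 0;
}
```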
Synopsys points to four architectural concepts for improving performance:
Making higher-level caches configurable, either as a larger cache or as a smaller cache plus some local memory, with the ability to reconfigure dynamically to suit the current workload;
Using prefetchers at each cache level to hide more of the fetch latency (a software sketch of the prefetch idea follows this list);
Allowing out-of-order memory requests even on an in-order CPU, so that whatever returns first can be used even while earlier requests are still pending;
Implementing quality of service (QoS) in software to give certain cores priority for critical workloads.
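Hardware prefetchers are invisible to code, but the idea they exploit, issuing a fetch well before the data is needed so the latency overlaps useful work, can be sketched in software. The example below is illustrative only; __builtin_prefetch is a GCC/Clang builtin hint, and the 16-element prefetch distance is a guess rather than a tuned value.

```cpp
#include <cstddef>

// Request a cache line well ahead of its use so the fetch overlaps the work
// on earlier elements. __builtin_prefetch is a GCC/Clang builtin; the
// prefetch distance of 16 elements is arbitrary, for illustration only.
long sum_with_prefetch(const long* data, std::size_t n) {
    long sum = 0;
    for (std::size_t i = 0; i < n; ++i) {
        if (i + 16 < n) __builtin_prefetch(&data[i + 16]);
        sum += data[i];
    }
    return sum;
}
```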
Will a parallel offload unit prove a bigger game-changer? Possibly, provided system developers believe it can slot into their existing workflows. If too much changes, designers (who tend to be conservative by nature) will resist the risk those changes pose to a project. In addition, the incremental silicon area must add enough value to keep the chip suitably profitable. That may make the approach most attractive for dedicated SoCs, where the value of the added performance is clear.
The tools and development flows that accompany such an accelerator also need to integrate smoothly with existing flows, again respecting programmers' reluctance to change too much. And even if the hardware concept and its tools are sound in principle, new users will still need convincing that the hardware and software implementations are correct and bug-free.
Whether Flow succeeds may help set the technology's direction. If it fails, will that be due to a fundamental flaw in the concept or to problems of execution? If the latter, others may pick up the idea and carry it forward. If the former, the whole idea will be called into question.
If it succeeds, on the other hand, you can bet that others will try to imitate and improve on that success. Given how early the proposal is, it may be at least a year, and possibly longer, before we know whether there is a winner.