
[IT Insight] SK Hynix Vice President Lim Eui-cheol: "PIM is the key to unlocking bottlenecks in the AI era."

Published 2025.08.20 10:17


▲Lim Eui-cheol, Vice President of SK Hynix

Fundamentally solving cost and latency problems by placing computation inside memory
LPDDR-based PIM targeting performance per watt competitive with the latest accelerators

By using PIM and GPUs together and offloading attention computations into memory, the cost and latency problems that grow rapidly with token length can be fundamentally reduced.

SK Hynix Vice President Lim Eui-cheol presented on the topic 'Crushing the token cost wall of LLM Service: Attention offloading with a PIM-GPU heterogeneous System' at the '2025 2nd Sangsaeng Forum Deep Tech Convergence Networking Day' held at COEX on August 19.

Vice President Lim Eui-cheol noted that AI is spreading faster than economic and energy systems can keep up: large language model (LLM) services require electricity and infrastructure costs that exceed users' willingness to pay, and they bring structural challenges such as expanding data centers and securing power.

The core of this problem is not the algorithm, but the energy efficiency of the computing infrastructure. SK Hynix pointed out that the source of this bottleneck is 'memory' and proposed a solution with memory semiconductors based on processing-in-memory (PIM).

The gap between processor and memory development speeds, which has widened for over half a century, the so-called 'memory wall', has become a critical limitation in the AI era.

Unlike traditional workloads that rely on caching and data reuse, LLM inference, especially the decode step, is memory-intensive: the full set of parameters must be reloaded for every token and is used only once.
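As a back-of-envelope illustration of why this makes decode bandwidth-bound, the sketch below streams a model's weights once per token; the 70B parameter count and the ~1TB/s HBM figure are illustrative assumptions, not numbers from the talk.

```python
# Back-of-envelope: why LLM decode is memory-bound.
# The parameter count and bandwidth are illustrative assumptions,
# not figures from the presentation.

PARAMS = 70e9          # 70B-parameter model (assumed)
BYTES_PER_PARAM = 2    # FP16 weights
HBM_BW = 1e12          # ~1 TB/s HBM bandwidth (order of magnitude, assumed)

bytes_per_token = PARAMS * BYTES_PER_PARAM   # every weight re-read per token
t_mem = bytes_per_token / HBM_BW             # time just to stream the weights

print(f"Weights streamed per token: {bytes_per_token / 1e9:.0f} GB")
print(f"Bandwidth floor: {t_mem * 1e3:.0f} ms/token "
      f"(~{1 / t_mem:.0f} tokens/s), no matter how many FLOPs the GPU has")
```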

This is why compute unit utilization plummets even with the latest GPUs.

Ultimately, system performance and power are determined by the bandwidth and latency of the memory subsystem, including HBM.

Data centers have been reusing model weights across multiple batches, shifting much of the feedforward path from memory-bound to compute-bound.

Attention, on the other hand, has a different character.

Since input and generated tokens are different for each user, there is virtually no shareable data, so increasing the batch size does not reduce memory round-trips.

As token lengths grow past 10K and even 100K with longer input contexts and internal reasoning, attention emerges as the dominant component of overall latency and energy.
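A small sketch makes the asymmetry concrete. Using rough Llama-3-70B-like dimensions (all assumed, not from the talk), it compares per-user weight traffic, which batching amortizes, with per-user KV-cache traffic, which it cannot:

```python
# Sketch: batching amortizes weight traffic, but not attention (KV-cache)
# traffic. Model dimensions are rough assumptions, not from the talk.

BATCH   = 32
SEQ_LEN = 32_000            # context tokens per user
LAYERS  = 80
KV_HEADS, HEAD_DIM = 8, 128
BYTES   = 2                 # FP16

weight_traffic = 70e9 * BYTES              # weights read once per decode step...
weight_per_user = weight_traffic / BATCH   # ...and shared by the whole batch

# K and V per layer, per user: unique to each user, so no sharing is possible
kv_per_user = 2 * LAYERS * SEQ_LEN * KV_HEADS * HEAD_DIM * BYTES

print(f"Weight bytes per user per token:   {weight_per_user / 1e9:.1f} GB")
print(f"KV-cache bytes per user per token: {kv_per_user / 1e9:.1f} GB")
# Doubling the batch halves the first number; the second is untouched and
# grows linearly with token length -- exactly the trend described above.
```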

PIM is an approach that places core operations such as matrix-vector multiplication (GEMV) inside the memory bank, completes the computation without moving the data, and sends only the result outside.

Eliminating board-level movement between memory and processor dramatically reduces power consumption, while bank-level parallelism scales effective throughput linearly.

It is a structure that is precisely aligned with GEMV, which accounts for more than 90% of LLM inference time, and thus targets the decode bottleneck head-on.
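The mechanism can be pictured as a row-partitioned GEMV in which each bank computes on the slice it stores; the NumPy sketch below is a conceptual illustration under that assumption, not SK Hynix's actual PIM command set.

```python
# Conceptual model of bank-level GEMV: each bank multiplies only the matrix
# rows it stores, and only the small result vector leaves the memory.
# Illustration only -- not SK Hynix's actual PIM interface.

import numpy as np

def pim_gemv(W: np.ndarray, x: np.ndarray, n_banks: int = 16) -> np.ndarray:
    """Row-partition W across banks; each bank computes its slice locally."""
    bank_rows = np.array_split(np.arange(W.shape[0]), n_banks)
    # In hardware the banks run in parallel; this loop stands in for that.
    partials = [W[rows] @ x for rows in bank_rows]
    return np.concatenate(partials)   # only outputs cross the data path

W = np.random.randn(4096, 4096).astype(np.float32)
x = np.random.randn(4096).astype(np.float32)
y = pim_gemv(W, x)
assert np.allclose(y, W @ x, rtol=1e-3, atol=1e-3)  # same math, less movement
print(f"Bytes leaving memory: {y.nbytes:,} vs {W.nbytes:,} held in the banks")
```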

SK Hynix taped out GDDR6-based PIM silicon in 2022 and secured working samples.

Based on this, the company designed an 'AMX' accelerator card that integrates the PIM dies on a PCB, and implemented an AI control hub on an FPGA that receives models and commands from the host and orchestrates the PIM chips.

On a server equipped with both a GPU and the AMX card, the company demonstrated LLM operation (e.g., Llama 3 70B, batch 8, 2K tokens) in a showcase where prefill runs on the GPU and decode runs on the PIM.

It is also noteworthy that a VLM workload was run live: visitors connected via a QR code, selected a model, entered a prompt, and received a response.

The next step is based on LPDDR.

The LPDDR-series PIM, which prioritizes low power and high density, targets approximately 256GB of capacity per card, 70TB/s of internal bandwidth, and performance per watt competitive with the latest accelerators.
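Taking the two stated targets at face value, a quick calculation shows what they buy for attention; the 10GB-per-user KV-cache size below is an assumed example, not a figure from the presentation.

```python
# What the stated LPDDR PIM card targets (256 GB, 70 TB/s internal bandwidth)
# would mean for attention. The per-user KV-cache size is an assumption.

CARD_CAPACITY = 256e9     # bytes (stated target)
INTERNAL_BW   = 70e12     # bytes/s, aggregate across banks (stated target)
KV_PER_USER   = 10e9      # ~10 GB KV cache for a long-context user (assumed)

users_resident = int(CARD_CAPACITY // KV_PER_USER)
t_sweep = KV_PER_USER / INTERNAL_BW   # one full pass over a user's KV cache

print(f"Long-context users resident per card: {users_resident}")
print(f"Attention sweep per token: {t_sweep * 1e6:.0f} us "
      f"(~{1 / t_sweep:,.0f} sweeps/s per user)")
```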

The system configuration is two-pronged.

First, an 'All-In-One' combination attaches LPDDR6 PIM in parallel alongside the existing accelerator + HBM and dedicates the attention path to the PIM; second, a 'Disaggregated' separation keeps the GPU/accelerator for prefill and non-attention decode and offloads only attention to a separate PIM card.
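The difference between the two topologies comes down to where attention lands; the toy router below makes that explicit (the device names and the route() helper are hypothetical, purely for illustration).

```python
# Toy router contrasting the two topologies. Device names and route() are
# hypothetical illustrations, not a real SK Hynix or GPU-vendor API.

def route(op: str, disaggregated: bool) -> str:
    """Map one inference stage to the device that would execute it."""
    if op == "attention":
        # In both topologies, attention executes inside PIM memory.
        return "separate_pim_card" if disaggregated else "pim_beside_hbm"
    # Prefill and non-attention decode stay on the GPU/accelerator.
    return "gpu_accelerator"

for mode, name in ((False, "All-In-One"), (True, "Disaggregated")):
    plan = {op: route(op, mode) for op in ("prefill", "attention", "ffn")}
    print(f"{name}: {plan}")
```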

The common goal is to structurally remove the memory bottleneck of attention and lower the overall TCO.

The expansion of model size and token length exposes the limitations of conventional computing, but creates a market for memory-centric semiconductors such as PIM.

Additionally, simply eliminating data movement reduces power and heat generation, alleviating power and cooling constraints in data centers.

Hardware adoption must be accompanied by the maturation of the software stack, runtime, and graph partitioning tools to be effective in large-scale deployment environments.

In addition, the broader the open collaboration on interfaces, command sets, and kernel optimizations, the faster the spread of PIM will accelerate.

Vice President Lim Eui-cheol said that the direction of next-generation semiconductors is clear: computation must be brought to the data.

Furthermore, to address the dominant bottleneck of the LLM era, the memory-bound nature of attention, PIM that exploits memory-internal parallelism and low-cost data paths is essential; the GDDR6 PIM silicon, AMX card, and LPDDR6 PIM roadmap that SK Hynix presented form a blueprint for realizing this transition through a hybrid architecture with GPUs/accelerators.

▲Attendees of the 2025 2nd Sangsaeng Forum Deep Tech Convergence Networking Day pose for a commemorative photo.