Modern computation workloads contain abundant Data-Level Parallelism (DLP), which calls for specialized data-parallel architectures such as Graphics Processing Units (GPUs). With parallel programming models such as CUDA and OpenCL, GPUs can be easily programmed for non-graphics applications and have therefore become a cost-effective platform for data-parallel computing. The large amount of available parallelism places heavy stress on the memory system, as the limited number of pins constrains the number of memory controllers on the chip. This creates a potential bottleneck for the performance scalability of GPUs. To accelerate communication with the memory system, we propose the Intra-Clustering on-chip network for data-parallel architectures, which is built upon a traditional two-dimensional electrical mesh network, with memory controllers connected through a nanophotonic ring and compute cores grouped into clusters. Our evaluations with CUDA benchmarks show that the Intra-Clustering architecture can reduce communication delay by an average of 17% (up to 32%) and improve IPC by an average of 5% (up to 11.5%).
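To make the cluster-plus-ring organization concrete, the short sketch below routes memory-bound traffic toward a cluster's photonic ring stop and keeps core-to-core traffic on the electrical mesh. It is a minimal sketch under assumed parameters: the node layout, cluster size, and function names are illustrative, not the paper's implementation.

```cpp
// Hypothetical sketch of an Intra-Clustering-style route decision:
// core-to-core traffic uses the electrical mesh, while memory-bound traffic
// is steered to the nanophotonic ring that connects the memory controllers.
// The 8x8 mesh and 4x4 clusters are assumed values for illustration only.
#include <cstdio>

enum class Path { ElectricalMesh, PhotonicRing };

struct NodeId {
    int x, y;             // position in the 2D mesh
    bool is_memory_ctrl;  // memory controllers sit on the nanophotonic ring
};

// Assume a 4x4 cluster tiling of an 8x8 mesh (illustrative only).
int clusterOf(const NodeId& n) { return (n.y / 4) * 2 + (n.x / 4); }

Path selectPath(const NodeId& dst) {
    // Memory-controller traffic leaves the cluster through the ring stop.
    if (dst.is_memory_ctrl) return Path::PhotonicRing;
    // Ordinary core-to-core traffic stays on the 2D electrical mesh.
    return Path::ElectricalMesh;
}

int main() {
    NodeId core{1, 2, false}, mc{7, 0, true}, peer{3, 3, false};
    std::printf("core->MC   : %s\n",
                selectPath(mc) == Path::PhotonicRing ? "photonic ring" : "mesh");
    std::printf("core->core : %s (same cluster: %d)\n",
                selectPath(peer) == Path::ElectricalMesh ? "mesh" : "photonic ring",
                clusterOf(core) == clusterOf(peer));
    return 0;
}
```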
Publications
Bandwidth Efficient On-Chip Interconnect Designs for GPGPUs
A Case for Handshake in Nanophotonic Interconnects
Nanophotonics has been proposed for designing low-latency, high-bandwidth NOCs for future Chip Multiprocessors (CMPs). Recent nanophotonic NOC designs adopt token-based arbitration coupled with credit-based flow control, which leads to low bandwidth utilization. In this work, we propose two handshake schemes for nanophotonic interconnects in CMPs, Global Handshake (GHS) and Distributed Handshake (DHS), which eliminate traditional credit-based flow control, reduce the average token waiting time, and ultimately improve network throughput. Furthermore, we enhance the basic handshake schemes with set-aside buffer and circulation techniques to overcome Head-of-Line (HOL) blocking. Our evaluation shows that the proposed handshake schemes improve network throughput by up to 62% under synthetic workloads. With trace traffic extracted from real applications, the handshake schemes reduce communication delay by up to 59%. The basic handshake schemes add only 0.4% hardware overhead for optical components and negligible power consumption. In addition, the performance of the handshake schemes is independent of on-chip buffer space, which makes them feasible for large-scale nanophotonic interconnect designs.
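As a rough illustration of how a handshake differs from credit-based flow control, the sketch below lets a sender transmit without tracking receiver credits and simply retry flits that receive a NACK. The cycle loop, buffer sizes, and names are assumptions for illustration, not the GHS/DHS hardware.

```cpp
// A minimal sketch of handshake-style flow control: the sender transmits a
// flit without a credit counter and retries whenever the receiver answers
// with a NACK. Buffer capacity and drain rate are assumed values.
#include <cstdio>
#include <queue>

struct Receiver {
    std::queue<int> buffer;
    std::size_t capacity = 2;
    bool deliver(int flit) {            // ACK = true, NACK = false
        if (buffer.size() >= capacity) return false;
        buffer.push(flit);
        return true;
    }
    void drainOne() { if (!buffer.empty()) buffer.pop(); }
};

int main() {
    Receiver rx;
    std::queue<int> pending;            // flits waiting at the sender
    for (int f = 0; f < 6; ++f) pending.push(f);

    int cycles = 0, retries = 0;
    while (!pending.empty()) {
        ++cycles;
        // Send the head flit; on NACK keep it and try again next cycle.
        if (rx.deliver(pending.front())) pending.pop();
        else ++retries;
        if (cycles % 2 == 0) rx.drainOne();  // receiver consumes a flit every other cycle
    }
    std::printf("delivered in %d cycles with %d NACK retries\n", cycles, retries);
    return 0;
}
```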
APCR: An Adaptive Physical Channel Regulator for On-Chip Interconnects
Chip Multi-Processor (CMP) architectures have become mainstream for processor design. With a large number of cores, the Network-on-Chip (NOC) provides a scalable communication fabric for CMP architectures, where wires are an abundant on-chip resource. NOCs must be carefully designed to meet power and area constraints while providing ultra-low latency. In this paper, we propose an Adaptive Physical Channel Regulator (APCR) for NOC routers to exploit these abundant wiring resources. The flit size in an APCR router is smaller than the physical channel width (phit size) to provide finer-granularity flow control. An APCR router allows flits from different packets or flows to share the same physical channel in a single cycle. Three regulation schemes (Monopolizing, Fair-sharing, and Channel-stealing) intelligently allocate output channel resources, considering not only the availability of physical channels but also the occupancy of input buffers. In an APCR router, each Virtual Channel can forward a dynamic number of flits per cycle depending on run-time network status. Our simulation results using a detailed cycle-accurate simulator show that an APCR router improves network throughput by over 100% under synthetic workloads, compared with a traditional design with the same buffer size. An APCR router can outperform a traditional router even when its buffer size is halved.
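The sketch below conveys the flavor of this kind of regulation: a wide physical channel carries several flits per cycle, each virtual channel first receives a fair share of the slots, and leftover slots are stolen by VCs with backlog. The slot count, VC occupancies, and function names are illustrative assumptions, not the router's actual allocation logic.

```cpp
// An illustrative sketch of fair-sharing followed by channel-stealing over a
// physical channel wide enough for several flits per cycle. All parameters
// are assumed values for illustration.
#include <cstdio>
#include <vector>
#include <algorithm>

// Returns how many flits each VC may forward this cycle.
std::vector<int> allocateSlots(const std::vector<int>& occupancy, int slotsPerCycle) {
    const int numVCs = static_cast<int>(occupancy.size());
    std::vector<int> grant(numVCs, 0);

    // Fair-sharing pass: give every VC up to an equal share of the channel.
    int fairShare = std::max(1, slotsPerCycle / numVCs);
    int remaining = slotsPerCycle;
    for (int vc = 0; vc < numVCs && remaining > 0; ++vc) {
        grant[vc] = std::min({fairShare, occupancy[vc], remaining});
        remaining -= grant[vc];
    }
    // Channel-stealing pass: hand unused slots to VCs that still have flits queued.
    for (int vc = 0; vc < numVCs && remaining > 0; ++vc) {
        int extra = std::min(occupancy[vc] - grant[vc], remaining);
        grant[vc] += extra;
        remaining -= extra;
    }
    return grant;
}

int main() {
    std::vector<int> occupancy = {4, 0, 1, 2};   // flits buffered per VC
    auto grant = allocateSlots(occupancy, 4);    // 4 flit slots per phit
    for (std::size_t vc = 0; vc < grant.size(); ++vc)
        std::printf("VC%zu forwards %d flit(s)\n", vc, grant[vc]);
    return 0;
}
```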
A Hybrid Buffer Design with STT-MRAM for On-Chip Interconnects
As chip multiprocessor (CMP) design moves toward many-core architectures, communication delay in the Network-on-Chip (NoC) has become a major bottleneck in CMP systems. Using high-density memories for input buffers helps alleviate this bottleneck by increasing throughput. Spin-Torque Transfer Magnetic RAM (STT-MRAM) is a suitable candidate due to its high density and near-zero leakage power, but its long latency and high power consumption in write operations still need to be addressed. We explore the design issues of using STT-MRAM for NoC input buffers. Motivated by the short intra-router latency of flits, we adopt a previously proposed technique that reduces write latency by sacrificing retention time. We then propose a hybrid input buffer design that uses both SRAM and STT-MRAM to hide the long write latency efficiently. Since simple data migration in the hybrid buffer consumes more dynamic power than SRAM, we provide a lazy migration scheme that reduces the dynamic power consumption of the hybrid buffer. Simulation results show that the proposed scheme enhances throughput by 21% on average.
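A minimal sketch of the hybrid-buffer idea follows: arriving flits take the fast SRAM write path, and data moves into the denser STT-MRAM region lazily, only when SRAM space runs out. Buffer sizes and the migration trigger are assumptions for illustration, not the proposed design.

```cpp
// Sketch of a hybrid SRAM/STT-MRAM input buffer with lazy migration.
// Capacities and the migration policy are assumed values for illustration.
#include <cstdio>
#include <deque>

class HybridBuffer {
    std::deque<int> sram, sttmram;
    std::size_t sramCap, sttCap;
    int migrations = 0;
public:
    HybridBuffer(std::size_t sramSlots, std::size_t sttSlots)
        : sramCap(sramSlots), sttCap(sttSlots) {}

    bool push(int flit) {
        if (sram.size() == sramCap) {            // lazy migration: only on demand
            if (sttmram.size() == sttCap) return false;
            sttmram.push_back(sram.front());     // slow STT-MRAM write happens here
            sram.pop_front();
            ++migrations;
        }
        sram.push_back(flit);                    // fast SRAM write on arrival
        return true;
    }
    bool pop(int& flit) {                        // the oldest flit may live in STT-MRAM
        std::deque<int>& src = sttmram.empty() ? sram : sttmram;
        if (src.empty()) return false;
        flit = src.front();
        src.pop_front();
        return true;
    }
    int migrationCount() const { return migrations; }
};

int main() {
    HybridBuffer buf(2, 4);
    for (int f = 0; f < 5; ++f) buf.push(f);
    int flit;
    while (buf.pop(flit)) std::printf("pop flit %d\n", flit);
    std::printf("lazy migrations: %d\n", buf.migrationCount());
    return 0;
}
```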
Efficient Data Packet Compression for Cache Coherent Multiprocessor Systems
Multiprocessor systems have become popular for their high performance, not only in server markets but also in general-purpose computing environments. With increasing software complexity, networking overheads in multiprocessor systems are becoming one of the most influential factors in overall system performance. In this paper, we attempt to reduce communication overheads through a data packet compression technique integrated with the cache coherence protocol. We propose a Variable Size Compression (VSC) scheme that compresses or completely eliminates data packets while harmonizing with existing cache coherence protocols. Simulation results show approximately 23% improvement on average in overall system performance compared with the most recent compression scheme. VSC also improves cache miss latency by 20% on average.
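As a hedged illustration of variable-size compression, the sketch below sizes a data packet according to simple value patterns in a cache line (an all-zero line is eliminated entirely, narrow values are sent in fewer bytes). The patterns and encoding are assumptions for illustration, not the VSC protocol.

```cpp
// Illustrative compressor: chooses a payload size per cache line based on
// simple value patterns. Line size and patterns are assumed values.
#include <cstdio>
#include <cstdint>
#include <array>
#include <algorithm>

constexpr std::size_t kLineWords = 8;
using CacheLine = std::array<uint64_t, kLineWords>;

// Returns the payload size in bytes needed to transmit the line.
std::size_t compressedSize(const CacheLine& line) {
    if (std::all_of(line.begin(), line.end(), [](uint64_t w) { return w == 0; }))
        return 0;                                  // data packet eliminated
    bool narrow = std::all_of(line.begin(), line.end(),
                              [](uint64_t w) { return w <= 0xFF; });
    if (narrow) return kLineWords;                 // 1 byte per word
    return kLineWords * sizeof(uint64_t);          // send uncompressed
}

int main() {
    CacheLine zeros{};                             // all-zero line
    CacheLine small{1, 2, 3, 4, 5, 6, 7, 8};       // narrow values
    CacheLine wide{};  wide[0] = 0x1234567890ABCDEFULL;
    std::printf("zero line : %zu bytes\n", compressedSize(zeros));
    std::printf("narrow    : %zu bytes\n", compressedSize(small));
    std::printf("wide      : %zu bytes\n", compressedSize(wide));
    return 0;
}
```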
Scalable and Efficient Bounds Checking for Large-Scale CMP Environments
Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks
As the number of cores on a single chip increases with newer technologies, packet-switched on-chip interconnection networks have become the de facto communication paradigm for chip multiprocessors (CMPs). However, communication latency inevitably grows with the increasing number of hops. In this paper, we attempt to accelerate network communication by exploiting communication temporal locality with minimal additional hardware cost in an existing state-of-the-art router architecture. We observe that packets frequently traverse the same path chosen by previous packets due to repeated communication patterns, such as frequent pair-wise communication. Motivated by this observation, we propose a pseudo-circuit scheme. Based on previous communication patterns, the scheme reserves crossbar connections to create pseudo-circuits, sharable partial circuits within a single router. It reuses previous arbitration information to bypass switch arbitration if the next flit traverses the same pseudo-circuit. To accelerate communication performance further, we also propose two aggressive schemes: pseudo-circuit speculation and buffer bypassing. Pseudo-circuit speculation creates more pseudo-circuits using unallocated crossbar connections, while buffer bypassing skips buffer writes to eliminate one pipeline stage.
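The sketch below captures the pseudo-circuit idea in a few lines: each input port remembers the crossbar connection granted to its previous flit, and a following flit headed to the same output reuses it and skips switch arbitration. The router model and names are illustrative assumptions, not the paper's implementation.

```cpp
// Sketch of arbitration reuse: a kept crossbar connection (pseudo-circuit)
// lets a matching flit bypass switch arbitration. Port count and model are
// assumed values for illustration.
#include <cstdio>
#include <array>

constexpr int kPorts = 5;          // N, E, S, W, local

struct Router {
    // Last output port granted to each input port (-1 = no pseudo-circuit).
    std::array<int, kPorts> pseudoCircuit{};
    Router() { pseudoCircuit.fill(-1); }

    // Returns true if switch arbitration was bypassed for this flit.
    bool forward(int inPort, int outPort) {
        if (pseudoCircuit[inPort] == outPort)
            return true;                       // reuse the reserved crossbar link
        // Normal switch arbitration would run here; on grant, the connection
        // is kept as a pseudo-circuit for subsequent flits.
        pseudoCircuit[inPort] = outPort;
        return false;
    }
};

int main() {
    Router r;
    int in = 0, out = 2;
    std::printf("flit 1 bypassed arbitration: %d\n", r.forward(in, out));
    std::printf("flit 2 bypassed arbitration: %d\n", r.forward(in, out));
    std::printf("flit 3 (new output) bypassed: %d\n", r.forward(in, 3));
    return 0;
}
```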
Efficient Lookahead Routing and Header Compression for Multicasting in Networks-On-Chip
As technology advances, Chip Multi-Processor (CMP) architectures have emerged as a viable solution for processor design. Networks-on-Chip (NOCs) provide a scalable communication method for CMP architectures as the number of cores increases. Although there has been significant research on NOC designs for unicast traffic, research on multicast router design is still in its infancy. Considering that one-to-many (multicast) and one-to-all (broadcast) traffic are common in CMP applications, it is important to design a router that provides efficient multicasting. In this paper, we propose an efficient lookahead routing scheme with limited area overhead for a recently proposed multicast routing algorithm, Recursive Partitioning Multicast (RPM) [17]. We also present a novel compression scheme for the multicast packet header, which becomes a significant overhead in large networks. Comprehensive simulation results show that with our route computation logic design, providing lookahead routing in the multicast router costs less than 20% area overhead, and this percentage keeps decreasing with larger network sizes. Compared with a basic lookahead routing design, our design saves over 50% area. With header compression and lookahead multicast routing, network performance is improved by 22% on average in a 16 × 16 network.
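To illustrate why multicast header compression matters, the sketch below compares two destination encodings (an explicit ID list versus a full bitmap) and picks the smaller per packet. Both encodings and their sizes are assumptions for illustration, not the proposed header format.

```cpp
// Illustrative header-size comparison for a multicast destination set.
// The two encodings and the 16x16 network size are assumed values.
#include <cstdio>
#include <vector>
#include <algorithm>

// Header bits needed for a multicast packet in an n-node network.
std::size_t headerBits(std::size_t numNodes, const std::vector<int>& dests) {
    std::size_t idBits = 1;
    while ((1u << idBits) < numNodes) ++idBits;          // bits per node ID
    std::size_t listEncoding   = dests.size() * idBits;  // explicit ID list
    std::size_t bitmapEncoding = numNodes;               // one bit per node
    return std::min(listEncoding, bitmapEncoding) + 1;   // +1 bit format flag
}

int main() {
    const std::size_t nodes = 256;                       // 16 x 16 network
    std::vector<int> fewDests = {3, 77, 200};
    std::vector<int> manyDests(128);
    for (int i = 0; i < 128; ++i) manyDests[i] = 2 * i;
    std::printf("3 destinations  : %zu header bits\n", headerBits(nodes, fewDests));
    std::printf("128 destinations: %zu header bits\n", headerBits(nodes, manyDests));
    return 0;
}
```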
Adaptive Prefetching Scheme Using Web Log Mining in Cluster-based Web Systems
Main memory management has been a critical issue for providing high performance in web cluster systems. To overcome the speed gap between processors and disks, many prefetch schemes have been proposed for memory management in web cluster systems. However, inefficient prefetch schemes can degrade the performance of the web cluster system. Dynamic access patterns caused by the web cache mechanism in proxy servers increase mispredictions, wasting I/O bandwidth and available memory. Overly aggressive prefetch schemes cause shortages of available memory and performance degradation. Furthermore, modern web frameworks, including persistent HTTP, make the problem more challenging by reducing the available memory space through multiple connections per client and web process management in prefork mode. Therefore, we attempt to design an adaptive web prefetch scheme that predicts memory status more accurately and dynamically. First, we design a Double Prediction-by-Partial-Match Scheme (DPS) that can be adapted to the modern web framework. Second, we propose an Adaptive Rate Controller (ARC) that dynamically determines the prefetch rate depending on memory status. Finally, we suggest Memory-Aware Request Distribution (MARD), which distributes requests based on the available web processes and memory. To evaluate the prefetch gain in a server node, we implement an Apache module in Linux. In addition, we build a simulator to verify our scheme in cluster environments. Simulation results show a 10% performance improvement on average across various workloads.
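As a rough illustration of adaptive rate control, the sketch below scales the number of prefetched pages by the fraction of free memory, backing off entirely under memory pressure. The thresholds and scaling are assumptions for illustration, not the ARC algorithm from the paper.

```cpp
// Illustrative memory-aware prefetch throttling: the prefetch count shrinks
// as free memory shrinks. Thresholds are assumed values.
#include <cstdio>
#include <algorithm>

// Returns how many of the predicted candidate pages to actually prefetch.
int prefetchCount(int predicted, double freeMemRatio) {
    if (freeMemRatio < 0.10) return 0;                 // memory pressure: stop prefetching
    double scale = std::min(1.0, (freeMemRatio - 0.10) / 0.40);
    return static_cast<int>(predicted * scale);        // throttle gradually
}

int main() {
    for (double freeRatio : {0.60, 0.30, 0.15, 0.05})
        std::printf("free=%.0f%% -> prefetch %d of 8 candidates\n",
                    freeRatio * 100, prefetchCount(8, freeRatio));
    return 0;
}
```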