
High Performance Computing Laboratory

Texas A&M University College of Engineering

Conference & Workshops

A Case for Handshake in Nanophotonic Interconnects

L. Wang, J. Jayabalan, M. Ahn, H. Gu, K. H. Yum and E. J. Kim

Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS), Boston, MA, May 2013.

Nanophotonics has been proposed for designing low-latency, high-bandwidth Networks-on-Chip (NOCs) for future Chip Multiprocessors (CMPs). Recent nanophotonic NOC designs adopt token-based arbitration coupled with credit-based flow control, which leads to low bandwidth utilization. In this work, we propose two handshake schemes for nanophotonic interconnects in CMPs, Global Handshake (GHS) and Distributed Handshake (DHS), which eliminate the traditional credit-based flow control, reduce the average token waiting time, and thereby improve network throughput. Furthermore, we enhance the basic handshake schemes with set-aside buffer and circulation techniques to overcome Head-of-Line (HOL) blocking. Our evaluation shows that the proposed handshake schemes improve network throughput by up to 62% under synthetic workloads. With trace traffic extracted from real applications, the handshake schemes reduce communication delay by up to 59%. The basic handshake schemes add only 0.4% hardware overhead for optical components and negligible power consumption. In addition, the performance of the handshake schemes is independent of on-chip buffer space, which makes them feasible for large-scale nanophotonic interconnect designs.
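For context, the credit-based flow control that the handshake schemes replace can be sketched as follows. This is a minimal illustration of the general mechanism, not the paper's design; the class and method names are hypothetical:

```python
class CreditChannel:
    """Minimal credit-based flow control: the sender may only inject a
    flit when the downstream buffer is known to have a free slot."""

    def __init__(self, buffer_slots):
        self.credits = buffer_slots   # one credit per downstream buffer slot
        self.downstream = []          # models the receiver's input buffer

    def try_send(self, flit):
        if self.credits == 0:         # no known free slot: stall the sender
            return False
        self.credits -= 1             # consume a credit for the occupied slot
        self.downstream.append(flit)
        return True

    def drain(self):
        """Receiver consumes a flit and returns a credit upstream."""
        flit = self.downstream.pop(0)
        self.credits += 1
        return flit
```

With a two-slot buffer, two sends succeed, a third stalls until the receiver drains a flit and a credit returns; this round-trip dependence on buffer occupancy is what limits bandwidth utilization in the credit-based baseline.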

APCR: An Adaptive Physical Channel Regulator for On-Chip Interconnects

L. Wang, P. Kumar, K. H. Yum and E. J. Kim

Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, September 2012

Chip Multi-Processor (CMP) architectures have become mainstream for designing processors. With a large number of cores, a Network-on-Chip (NOC) provides a scalable communication method for CMP architectures, where wires become abundant resources inside the chip. NOCs must be carefully designed to meet power and area constraints while providing ultra-low latencies. In this paper, we propose an Adaptive Physical Channel Regulator (APCR) for NOC routers to exploit these vast wiring resources. The flit size in an APCR router is smaller than the physical channel width (phit size) to provide finer-granularity flow control. An APCR router allows flits from different packets or flows to share the same physical channel in a single cycle. Its three regulation schemes (Monopolizing, Fair-sharing, and Channel-stealing) intelligently allocate output channel resources considering not only the availability of physical channels but also the occupancy of input buffers. In an APCR router, each Virtual Channel can forward a dynamic number of flits every cycle depending on run-time network status. Our simulation results using a detailed cycle-accurate simulator show that an APCR router improves network throughput by over 100% under synthetic workloads, compared with a traditional design with the same buffer size. An APCR router can outperform a traditional router even if the buffer size is halved.

A Hybrid Buffer Design with STT-MRAM for On-Chip Interconnects

H. Jang, B. S. An, N. Kulkarni, K. H. Yum and E. J. Kim

Proceedings of ACM/IEEE International Symposium on Networks-on-Chip (NOCS), Copenhagen, Denmark, May 2012

As chip multiprocessor (CMP) design moves toward many-core architectures, communication delay in the Network-on-Chip (NoC) has become a major bottleneck in CMP systems. Using high-density memories in input buffers helps to relieve the bottleneck by increasing throughput. Spin-Torque Transfer Magnetic RAM (STT-MRAM) is a suitable candidate due to its high density and near-zero leakage power, but its long latency and high power consumption in write operations still need to be addressed. We explore the design issues of using STT-MRAM for NoC input buffers. Motivated by short intra-router latency, we use a previously proposed write-latency reduction technique that sacrifices retention time. We then propose a hybrid input-buffer design using both SRAM and STT-MRAM to efficiently hide the long write latency. Considering that simple data migration in the hybrid buffer consumes more dynamic power than SRAM, we provide a lazy migration scheme that reduces the dynamic power consumption of the hybrid buffer. Simulation results show that the proposed scheme enhances throughput by 21% on average.

Efficient Data Packet Compression for Cache Coherent Multiprocessor Systems

B. S. An, M. Lee, K. H. Yum and E. J. Kim

Proceedings of the Data Compression Conference (DCC), Snowbird, Utah, April 2012

Multiprocessor systems have become popular for their high performance, not only in server markets but also in computing environments for general users. With increased software complexity, networking overheads in multiprocessor systems are becoming one of the most influential factors in overall system performance. In this paper, we attempt to reduce communication overheads through a data packet compression technique integrated with the cache coherence protocol. We propose a Variable Size Compression (VSC) scheme that compresses or completely eliminates data packets while harmonizing with existing cache coherence protocols. Simulation results show approximately 23% improvement on average in overall system performance when compared with the most recent compression scheme. VSC also improves cache miss latency by 20% on average.

Scalable and Efficient Bounds Checking for Large-Scale CMP Environments

B. S. An, K. H. Yum and E. J. Kim

Proceedings of International Conference on Parallel Architectures and Compilation Techniques (PACT-2011), Galveston Island, Texas, October 2011

We attempt to provide architectural support for fast and efficient bounds checking for multithreaded workloads in chip-multiprocessor (CMP) environments. Bounds-information sharing and smart tagging help perform bounds checking more effectively by exploiting the characteristics of a pointer. In addition, the BCache architecture allows fast access to bounds information. Simulation results show that the proposed scheme increases the μPC of memory operations by 29% on average compared to the previous hardware scheme.
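To illustrate what is being accelerated, the following is a minimal software analogue of bounds checking: each pointer carries (base, size) metadata and every dereference is validated against it. This is an assumption-laden sketch of the general technique, not the paper's hardware mechanism; the class and field names are hypothetical:

```python
class CheckedPointer:
    """Software analogue of bounds checking: a pointer is paired with
    (base, size) metadata, and every access is validated. Hardware
    schemes cache and check this metadata to avoid the software cost."""

    def __init__(self, buf):
        self.buf = buf
        self.base = 0
        self.size = len(buf)   # bounds metadata carried with the pointer

    def load(self, offset):
        # The check performed (in software here) on every dereference.
        if not (0 <= offset < self.size):
            raise IndexError("bounds violation")
        return self.buf[self.base + offset]
```

Performing this check on every memory operation in software is expensive, which is why hardware support and fast access to the bounds metadata matter.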

Pseudo-Circuit: Accelerating Communication for On-Chip Interconnection Networks

Minseon Ahn and E. J. Kim

International Symposium on Microarchitecture (MICRO-43), Atlanta, Georgia, December 2010

As the number of cores on a single chip increases with recent technologies, the packet-switched on-chip interconnection network has become the de facto communication paradigm for chip multiprocessors (CMPs). However, it inevitably suffers from high communication latency due to the increasing number of hops. In this paper, we attempt to accelerate network communication by exploiting communication temporal locality, with minimal additional hardware cost, in an existing state-of-the-art router architecture. We observe that packets frequently traverse the same path chosen by previous packets due to repeated communication patterns, such as frequent pair-wise communication. Motivated by this observation, we propose a pseudo-circuit scheme. Based on previous communication patterns, the scheme reserves crossbar connections, creating pseudo-circuits: sharable partial circuits within a single router. It reuses previous arbitration information to bypass switch arbitration if the next flit traverses the same pseudo-circuit. To further accelerate communication, we also propose two aggressive schemes: pseudo-circuit speculation and buffer bypassing. Pseudo-circuit speculation creates more pseudo-circuits using unallocated crossbar connections, while buffer bypassing skips buffer writes to eliminate one pipeline stage.
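The core reuse idea can be sketched as follows. This is a simplified model assuming one retained connection per input port; the class and method names are hypothetical, and the paper's router handles speculation and teardown in more detail:

```python
class PseudoCircuitCrossbar:
    """Sketch of pseudo-circuit reuse: remember the last input->output
    crossbar connection per input port; a flit whose desired output
    matches the retained connection bypasses switch arbitration."""

    def __init__(self):
        self.reserved_out = {}   # input port -> output port of retained connection

    def forward(self, in_port, out_port):
        """Return True if the flit bypasses switch arbitration by
        reusing the pseudo-circuit left by a previous packet."""
        if self.reserved_out.get(in_port) == out_port:
            return True          # hit: skip the switch-arbitration stage
        # Miss: arbitrate normally, then retain the new connection.
        # Any other input holding this output loses its reservation.
        for p in [p for p, o in self.reserved_out.items() if o == out_port]:
            del self.reserved_out[p]
        self.reserved_out[in_port] = out_port
        return False
```

Repeated pair-wise traffic hits the retained connection on every flit after the first, which is where the pipeline-stage saving comes from.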

Efficient Lookahead Routing and Header Compression for Multicasting in Networks-On-Chip

L. Wang, P. Kumar, R. Boyapati, K. H. Yum and E. J. Kim

ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), La Jolla, CA, October 2010

As technology advances, Chip Multi-processor (CMP) architectures have emerged as a viable solution for designing processors. Networks-on-Chip (NOCs) provide a scalable communication method for CMP architectures as the number of cores increases. Although there has been significant research on NOC designs for unicast traffic, research on multicast router design is still in its infancy. Considering that one-to-many (multicast) and one-to-all (broadcast) traffic are becoming more common in CMP applications, it is important to design a router that provides efficient multicasting. In this paper, we propose an efficient lookahead routing scheme with limited area overhead for a recently proposed multicast routing algorithm, Recursive Partitioning Multicast (RPM) [17]. We also present a novel compression scheme for the multicast packet header, which becomes a significant overhead in large networks. Comprehensive simulation results show that, with our route computation logic design, providing lookahead routing in the multicast router costs less than 20% area overhead, a percentage that keeps decreasing with larger network sizes. Compared with the basic lookahead routing design, our design saves over 50% of the area. With header compression and lookahead multicast routing, network performance improves by 22% on average in a 16 × 16 network.

Adaptive Prefetching Scheme Using Web Log Mining in Cluster-based Web Systems

H. K. Lee, B. S. An and E. J. Kim

Proceedings of the International Conference on Web Services (ICWS), Los Angeles, USA, July 2009

Main memory management has been a critical issue in providing high performance in web cluster systems. To overcome the speed gap between processors and disks, many prefetch schemes have been proposed for memory management in web cluster systems. However, inefficient prefetch schemes can degrade the performance of the web cluster system. Dynamic access patterns caused by the web cache mechanism in proxy servers increase mispredictions, wasting I/O bandwidth and available memory, while overly aggressive prefetch schemes incur memory shortages and performance degradation. Furthermore, modern web frameworks, including persistent HTTP, make the problem more challenging by reducing the available memory space through multiple connections from a client and web-process management in prefork mode. Therefore, we attempt to design an adaptive web prefetch scheme that predicts memory status more accurately and dynamically. First, we design a Double Prediction-by-Partial-Match Scheme (DPS) that can be adapted to the modern web framework. Second, we propose an Adaptive Rate Controller (ARC) to dynamically determine the prefetch rate depending on memory status. Finally, we suggest Memory-Aware Request Distribution (MARD), which distributes requests based on the available web processes and memory. To evaluate the prefetch gain in a server node, we implement an Apache module in Linux. In addition, we build a simulator to verify our scheme in cluster environments. Simulation results show a 10% performance improvement on average across various workloads.
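The Prediction-by-Partial-Match family of predictors underlying DPS can be sketched as follows. This is a generic PPM sketch, not the paper's double-prediction design; function names and the order limit are illustrative assumptions:

```python
from collections import defaultdict

def build_ppm_model(history, max_order=2):
    """Count next-page occurrences for every context (preceding-page
    suffix) of length 1..max_order seen in the access history."""
    model = defaultdict(lambda: defaultdict(int))
    for i in range(len(history)):
        for order in range(1, max_order + 1):
            if i >= order:
                ctx = tuple(history[i - order:i])
                model[ctx][history[i]] += 1
    return model

def predict_next(model, recent, max_order=2):
    """Predict the next page to prefetch using the longest context
    that matches the most recent accesses; fall back to shorter ones."""
    for order in range(min(max_order, len(recent)), 0, -1):
        ctx = tuple(recent[-order:])
        if ctx in model:
            counts = model[ctx]
            return max(counts, key=counts.get)
    return None   # no matching context: do not prefetch
```

A prefetcher would feed each client's recent requests to `predict_next` and fetch the predicted page into memory ahead of the actual request, subject to the rate control the abstract describes.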

Recursive Partitioning Multicast: A Bandwidth-Efficient Routing for Networks-On-Chip

L. Wang, Y. Jin, H. J. Kim and E. J. Kim

International Symposium on Networks-on-Chip (NOCS), San Diego, CA, May 2009

Chip Multi-processor (CMP) architectures have become mainstream for designing processors. With a large number of cores, Networks-on-Chip (NOCs) provide a scalable communication method for CMP architectures. NOCs must be carefully designed to meet power and area constraints while providing ultra-low latencies. Existing NOCs mostly use Dimension Order Routing (DOR) to determine the route taken by a packet in unicast traffic. However, with the development of diverse applications in CMPs, one-to-many (multicast) and one-to-all (broadcast) traffic are becoming more common, and current unicast routing cannot support them efficiently. In this paper, we propose Recursive Partitioning Multicast (RPM) routing and a detailed multicast wormhole router design for NOCs. RPM allows routers to select intermediate replication nodes based on the global distribution of destination nodes. This provides more path diversity, achieves higher bandwidth efficiency, and ultimately improves the performance of the whole network. Our simulation results using a detailed cycle-accurate simulator show that, compared with the most recent multicast scheme, RPM saves 25% of crossbar and link power and 33% of link utilization, with a 50% network performance improvement. RPM is also more scalable to large networks than the recently proposed VCTM.
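The partitioning idea can be illustrated with a deliberately simplified quadrant version: group the multicast destinations by their direction relative to the current node and replicate the packet once per non-empty group rather than once per destination. RPM itself partitions the network into finer parts and recurses at each hop, so this sketch only conveys the flavor; the function name and tie-breaking on equal coordinates are assumptions:

```python
def partition_destinations(current, dests):
    """Group multicast destinations by quadrant relative to the current
    node; the router forwards one replica per non-empty group."""
    groups = {"NE": [], "NW": [], "SE": [], "SW": []}
    cx, cy = current
    for (x, y) in dests:
        # Simplified direction rule; RPM's real partitioning is finer.
        key = ("N" if y > cy else "S") + ("E" if x >= cx else "W")
        groups[key].append((x, y))
    return {k: v for k, v in groups.items() if v}
```

Replicating per direction group, and re-partitioning at each intermediate node, is what lets destinations that share a path also share bandwidth.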

Temperature-Aware Scheduler Based on Thermal Behavior Grouping in Multicore Systems

I. Yeo and E. J. Kim

Design, Automation and Test In Europe (DATE), Nice, France, April 2009

Dynamic Thermal Management techniques have been widely accepted as a thermal solution for their low cost and simplicity. These techniques manage heat dissipation and operating temperature to avoid thermal emergencies, but they are not aware of application behavior in Chip Multiprocessors (CMPs). In this paper, we propose a temperature-aware scheduler for multicore systems based on applications' thermal behavior groups, classified by a K-means clustering method. Applications in the same thermal behavior group exhibit similar thermal patterns as well as thermal parameters. With these thermal behavior groups, we achieve thermal balance among cores with negligible performance overhead. We implement and evaluate our schemes on 4-core (Intel Quad Core Q6600) and 8-core (two Quad-Core Intel XEON E5310 processors) systems running several benchmarks. The experimental results show that the temperature-aware scheduler based on thermal behavior grouping reduces the peak temperature by up to 8°C and 5°C in our 4-core and 8-core systems, with only 12% and 7.52% performance overhead, respectively, compared to the standard Linux scheduler.
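The grouping step can be sketched with plain K-means over per-application thermal feature vectors. This is a generic K-means sketch under assumed features (e.g., temperature rise rate and steady-state temperature); the paper's actual features and parameters may differ:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: cluster per-application thermal feature vectors
    into k thermal behavior groups."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)        # initial centers from the data
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # assign each application to its nearest group center
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # recompute each center as the mean of its assigned points
        centers = [[sum(col) / len(cl) for col in zip(*cl)] if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return centers, clusters
```

The scheduler can then balance temperature by co-scheduling or separating applications according to which group they fall into, e.g., avoiding placing two "hot" applications on adjacent cores.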


© 2016–2025 High Performance Computing Laboratory
