!ASPLOS Day 5: Main Conference Day 3
!Machine Learning I
::PUMA: A Programmable Ultra-efficient Memristor-based Accelerator for Machine Learning Inference
* memristive crossbar
** 2-6 bits per cell vs. 1 bit for CMOS (SRAM) = 6x
** cell area is 4F^2 vs. 120F^2 for CMOS (SRAM) = 30x
** analog MVM: 1.34 pJ/op
* ref. RENO (DAC'15), PRIME (ISCA'16)
* domain-specific ISA
** large register address space to support the memristive crossbar
** vector width keeps instruction memory low in a spatial architecture
* hybrid core
** hybrid memristive and CMOS
* compiler optimization
** graph partitioning
** MVM instructions have high latency
* inference energy: reduced compared with Skylake and Pascal.
* PUMA compiler https://github.com/illinois-impact/puma-compiler
* PUMA simulator https://github.com/Aayush-Ankit/dpe_emulate
::FPSA: A Full System Stack Solution for Reconfigurable ReRAM-based NN Accelerator Architecture
* issues
** In ReRAM-based systems the DACs and ADCs are large (although in the logical view the ReRAM itself looks large).
** communication bound
** reliability
** flexibility
*** ReRAM-based VMM (fast), digital-based others (relatively slow)
* refs. Bridge the gap between neural networks and ..., ASPLOS'18
* FPSA: ReRAM-based processing element
** reduce digital circuits, spiking schema
** fully parallel
* routing = island-style, like an FPGA (ref. mrFPGA)
* system stack: neural synthesizer -> spatial-to-temporal mapper -> place & route
::Bit-Tactical: A Software/Hardware Approach to Exploiting Value and Bit Sparsity in Neural Networks
* Goal: feed sparse operand data to the MAC units without wasting cycles.
* -> make it possible to push data on demand into whichever compute units are idle
* My guess is that they control the data supply by inserting MUXes into the datapath to the compute units.
* Looks useful if it can be built well.
!Machine Learning II
:: TANGRAM: Optimized Coarse-Grained Dataflow for Scalable NN Accelerators
* scaling NN perf.
** use more PEs & more on-chip buffers
*** monolithic engine <- low resource utilization, long array buses, far from SRAM
*** -> tiled architecture - mostly local data transfers, easy to scale up/down
*** <- dataflow scheduling?
* intra-layer parallelism
** buffer sharing dataflow - tiles share data -> partition and distribute the data first, then exchange it between tiles later
* inter-layer pipeline
** pipeline multiple layers, pros: saves DRAM B/W, cons: uses resources less efficiently (long delay, large SRAM)
** -> fine-grained data forwarding
*** forward each subset of data to the next layer as soon as it is ready
*** requires matched access patterns between adjacent layers
*** pipeline scheduling is done with a dataflow tool
:: Packing Sparse Convolutional Neural Networks for Efficient Systolic Array Implementations: Column Combining Under Joint Optimization
* About converting sparse matrices into dense ones.
* zero weights in systolic arrays are wasteful
** -> column combining: 9 tiles become 3 tiles.
*** only the products with the retained weights are selected and computed.
* ref. Full-stack Optimization for Accelerating CNNs with FPGA Validation, ICS 2019 ???
:: Split-CNN: Splitting Window-based Operations in Convolutional Neural Networks for Memory System Optimization
* DL faces a memory problem; HBM memory is expensive
** Accelerator (e.g. GPU): 16GB/32GB/..., Host: 512GB/1TB/...
* opportunities enabled by NVLink
** ref. vDNN (Rhu, MICRO 49)
* memory profile of training a DNN
* Split-CNN
** accuracy drops slowly as we split deeper and into more patches
** the split pattern is changed for each batch
* HMMS is a static memory planner that determines the timing of memory allocation, deallocation, prefetching and offloading
To address the memory bottleneck during training, the paper proposes Split-CNN, which splits the data, and HMMS, a system that manages memory and prefetching. Evaluated on an IBM Power System S822LC.
!Storage
:: LightStore: Software-defined Network-attached Key-value Drives
Proposes LightStore, a network-attached KVS built from an embedded-class processor and several TB of NAND flash. The FTL is implemented in hardware. Compared with RocksDB on a Xeon server, its random Set speed surpasses the Xeon server, it scales with the number of nodes, and it consumes less power.
* one SSD per network port, KV interface
* optimization
** system optimization
*** memcopy, thread
* LSM-tree-specific optimization
** keys decoupled from KV pairs, bloom filter
* FTL in HW
:: SOML Read: Rethinking the read operation granularity of 3D NAND SSDs
Because 3D NAND raised density, an SSD of the same capacity now has fewer chips, so chip-level parallelism dropped and reads became slower. The authors therefore reworked both SW and HW so that multiple partial-page reads can be packed into a single read command.
* fewer NAND chips -> lower multi-chip parallelism
* ← single-operation-multiple-location
** combine partial-page reads into one READ command
:: FlatFlash: Exploiting the Byte-Accessibility of SSDs within A Unified Memory-Storage Hierarchy
To allow byte-granular access to an SSD (PCIe-attached flash storage) in the same way as DRAM, the authors implemented an SSD -> DRAM promotion mechanism.
* FlatFlash, byte-addressable interface
** avoid paging
** reduce I/O traffic
** reduce DRAM latency
* DRAM in SSD + PCIe MMIO + OpenCAPI
* ref. FlashMap, ISCA'15 - unifying the memory and storage <- FlatFlash is 1.6x faster.
* promotion to DRAM is slow -> want to run it in the background -> consistency problem
!Quantum Computing
:: A Case for Variability-Aware Policies for NISQ-Era Quantum Computers
* ref. the problem of optimizing qubit SWAPs
* not all qubits are created equal
** exploit variation in error rates to improve reliability
*** assign more operations to reliable qubits/links
*** <- rather than optimizing the SWAP count
:: Tackling the Qubit Mapping Problem for NISQ-Era Quantum Devices
* qubit connection limitation
* mapping with SWAP
** heuristic - Zulehner et al., DATE'18, Siraichi et al., CGO'18
* reduce search complexity
** swap-based search
*** Prev.: mapping-based search, high complexity - O(exp(N))
*** Proposed: search a SWAP sequence - only consider high-priority qubits - O(N^2.5)
** reverse traversal for init. mapping
*** Prev.: random initial mapping
*** Proposed: inspired by reversibility
** control the parallelism
:: Noise-Adaptive Compiler Mappings for Noisy Intermediate-Scale Quantum Computers
* There is a gap between quantum algorithms and real machines.
* NISQ resource constraints
** few qubits: 5-72
** high gate error rates: 1-10%
** qubits hold state for 100us
* cur.
** compile once per input: more optimization opportunities
** reduce program execution time to avoid decoherence
** communication/SWAP optimization
** used in IBM, Rigetti, Google compilers
** -> NISQ systems have ~10x spatial and temporal noise variation!
* proposed: noise-adaptive compilation
* noise variation impacts success rate
** noise data is measured twice daily by IBM - https://quantumexperience.ng.bluemix.net/qx/devices
* #1: choose a good initial mapping
* #2: coherence-aware scheduling
** influences mapping: choose qubits with good coherence time
* #3: reduce SWAPs, use low-error-rate routes
* -> implement as a constrained optimization
* Scaffold Program -> LLVM IR ScaffCC -> Optimization using z3 SMT Solver* -> OpenQASM
* noise data is fed in at the step marked with * https://github.com/prakashmurali/TriQ
:: Optimized Compilation of Aggregated Instructions for Realistic Quantum Computers
There is a large gap between logical quantum operations and physical operations. For efficient physical control, rather than using 1- and 2-qubit operations, the compiler aggregates them into units that operate on up to 10 qubits at once. At least, that is how I understood it.
* layered approach to quantum compilation
* GRAPE - GRadient Ascent Pulse Engineering
* how to maximally utilize optimal control? - physical gate decomposition, physical gate optimization
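Side note on the variability-aware and noise-adaptive mapping talks above ("assign more operations to reliable qubits/links", "use low-error-rate routes"): the toy sketch below shows why the shortest SWAP route is not always the best one. It is not the method from either paper. The noise-adaptive compiler formulates mapping as a constrained optimization solved with Z3, whereas this sketch swaps in plain Dijkstra over -log(1 - error). The function name `most_reliable_route`, the coupling graph and the error rates are all made up for illustration, and each hop is treated as a single gate with the link's error rate, which is a simplification.

```python
import heapq
import math


def most_reliable_route(error_rates, src, dst):
    """Return (path, success_probability) for the most reliable src -> dst route.

    error_rates maps an undirected link (a, b) to a hypothetical 2-qubit gate
    error rate. Maximizing the product of (1 - error) along a path is the same
    as minimizing the sum of -log(1 - error), so Dijkstra applies directly.
    """
    graph = {}
    for (a, b), err in error_rates.items():
        cost = -math.log(1.0 - err)
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))

    best = {src: 0.0}   # lowest accumulated cost found so far per qubit
    prev = {}           # predecessor on the best known route
    heap = [(0.0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            break
        if cost > best.get(node, math.inf):
            continue  # stale heap entry
        for nxt, weight in graph.get(node, []):
            new_cost = cost + weight
            if new_cost < best.get(nxt, math.inf):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))

    if dst not in best:
        return None, 0.0  # dst is unreachable from src

    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    path.reverse()
    return path, math.exp(-best[dst])


# Made-up device: the 2-hop route 0-1-2 uses noisy links (10% error each),
# while the 3-hop route 0-3-4-2 uses good links (2% error each).
rates = {(0, 1): 0.10, (1, 2): 0.10, (0, 3): 0.02, (3, 4): 0.02, (4, 2): 0.02}
path, prob = most_reliable_route(rates, 0, 2)
print(path, round(prob, 4))  # [0, 3, 4, 2] 0.9412
```

With these made-up numbers the longer route succeeds with probability 0.98^3 (about 0.94) versus 0.9^2 = 0.81 for the shorter one, which is the intuition behind optimizing for reliability rather than SWAP count.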