トップ差分一覧 Farm ソース検索ヘルプ PDF RSS ログイン

Diary/2019-2-2

FPGAX

Google@六本木にて．
https://fpgax.connpass.com/event/115446/

FPGAXメモ

「TPUの最近の話」Google 佐藤さん

TPU Pod - HPC-powered scalable all reduce distributed training
- Cloud TPU - https://cloud.google.com/tpu/
実はいろんなHWがある．BigQuery architectureとかもある
TPU v1, v2, v3 - 2018
- 2016年時点ではMLPが6割くらいだった
brain floating point format
- exp 8bit, mantissa 7bit - FP16と違って，FP32と同じだけ指数部を多めにとってる
TPU v2, v3
- v2: 180TFLOPS, 64GB HBM - $4.5/h($1.35/h preembitble) @us
- v3: 420TFLOPS, 120GB HBM
DAWNBenchでコスト評価 https://dawn.cs.stanford.edu/benchmark/
- GPUの1/5で学習できますよ，と．
TPU3.0 Pod : > 100PFLOPS (8x faster than v2)
All reduce with 2-D toroidal mesh network by Google's HPC hardware
Coud TPU v2(64 units) vs. NVIDIA V100(8 units)
- 27x faster training at 35% lower cost
ebay
- 55M training image
- accuracy boost +10%, training time speedup 100x
data parallel, model parallel
- これからはmodel parallel
  - 参考 https://research.preferred.jp/2018/12/model-parallelism-in-dnn/
- model parallel training with biggan
  - https://colab.research.google.com/github/tensorflow/hub/blob/master/examples/colab/biggan_generation_with_tf_hub.ipynb
Cloud TPU APIs
- estimator
- keras
Edge TPU

「AIチップ最新レビュー」北海道大学百瀬啓さん

ニューロモフィックな商用チップもある
- https://www.mythic-ai.com/
- Eta tensai - https://etacompute.com/news/press-releases/684/
写真 penneの~/Pictures/fpgax_20190202に
- ニューロチップの世界の動向 - IMG_20190202_131619613.png
- (Server) - Scaling Trend - IMG_20190202_132028592.png
- DaDianNao: 中国CAS 学習/推論用 2014年 - IMG_20190202_132138781.png
- 量子化・圧縮の適用 - IMG_20190202_132429613.png
- (edge) - Quantization Trend - IMG_20190202_132843358.png
- 量子化: CNN+RNN・(DNPU) ISSCC '17/14.2 KAIST D.Shin - IMG_20190202_133013971.png
- Log量子化/ビットシリアル ... QUEST '18 北大 - IMG_20190202_133250029.png
- (edge) in-memory processing - IMG_20190202_133538324.png
- BRein Memory : Binary/Ternary in-memory ('17/6月) - IMG_20190202_133712446.png
- BWN/TWN/BNN(精華大) '18 VLSIC - IMG_20190202_133831586.png
- Mixed Signal Binary CNN (プリンストン大 '18 VLSIC) - IMG_20190202_134035448.png
- ReRAM(RAND) パナソニク VLSI Tech. 2018, 16-4 - IMG_20190202_134149363.png
- 8bit Analog Chip with P-PCM (IBM) '18 IEDM - IMG_20190202_134452078.png
- エネルギー効率改善 (量子化と低電力回路技術) - IMG_20190202_134945142.png
- Technology Trend - IMG_20190202_135046754.png

「 LUT-Network ～本物のリアルタイムコンピューティングを目指して～」渕上竜司さん

論よりRun
バイナリよりLUTのテーブルを活用できる方がいい，というアプローチ
https://github.com/ryuz/BinaryBrain
https://www.slideshare.net/ryuz88/lut-network-fpgx201902/1
発表スライド: https://www.slideshare.net/ryuz88/lut-network-fpgx201902/1

「（仮）DNNコンパイラの歩みと最近の動向」ぼのたけ/NII 今井健男さん

Deep Learning コンパイラ
- nGraph Intel Nervana - https://github.com/NervanaSystems/ngraph
- TensorFlow XLA Google - https://www.tensorflow.org/xla/
- TVM Washington Univ. - https://tvm.ai/
  - HalideIR - https://github.com/dmlc/HalideIR
- PlaidML Vertex.ai - https://github.com/plaidml/plaidml
- DLVM Illinoi Univ. - http://dlvm.org/
- Tensor Comprehensions Facebook - https://github.com/facebookresearch/TensorComprehensions
- TIRAMISU MIT - https://www.csail.mit.edu/research/tiramisu-compiler
- GLOW Facebook - https://facebook.ai/developers/tools/glow
- ONNC Skymizer - https://github.com/ONNC/onnc
Deep Learning コンパイラ中間表現
- グラフレベル
- オペレータレベル
TVMの場合
- Operator fusion - 複数のオペレータの融合
- Constant folding - 定数伝播，簡約
- Static memory planning - 中間テンソルのためのメモリの確保
- Data layout transformation - 内部のテンソル計算効率化のためにデータのレイアウトを変換
TVM Conference - https://sampl.cs.washington.edu/tvmconf/
TVMの方向性
- AutoTVM - 機械学習を用いたオペレータ自動最適化
  - GRU, XGBoost
  - Halideでは，Mullapudi et al. '16がある - https://dl.acm.org/citation.cfm?id=2925952
- VTA(ヴィータ) - TVM専用のAIチップ
  - http://sampl.cs.washington.edu/tvmconf/slides/Thierry-Moreau-VTA.pdf
  - 2階層のISA(CISCベースマルチサイクル: DENSE, ALU, LOAD, STORE + RISCベースマイクロサイクル)-
- TVMと連動したlatency hiding - DNNコンパイル時に命令列内の依存関係を解析
- Relay - グラフレベル中間言語
  - 参考 https://www.slideshare.net/bonotake/tvmir-relay
  - グラフ最適化からプログラム最適化へ
  - shapeのチェックを型検査で
  - 自動微分可能な高階関数を採用
  - - Differential programming language - https://popl18.sigplan.org/event/popl-2018-papers-keynote-some-principles-of-differential-programming-languages
DNNコンパイラの「正しさ」とは？
- 1. DNNの正しい振舞いとは？ - そもそも精度劣化とかあるし
- 2. DNNとコンパイル結果の間の等価性とは？
- 3. 等価性をどうチェックすればいいか？
- 4. 等価性に従えばどの最適化が適用可能なのか？

「RISC-V の現況と Esperanto Technologies のアプローチ」京都産業大学情報理工学部安田豊さん

WesternDigitalのRISC-V https://github.com/westerndigitalcorporation/omnixtend
Nvidia + RISC-V - https://riscv.org/2018/08/sifive-announces-first-open-source-risc-v-based-soc-platform-with-nvidia-deep-learning-accelerator-technology/
EsperantのDavid Ditze(RISC-V Tokyoの話) ... David Ditze - Crusoeの人
- bulding the highest TeraFLOPS per Watt Machine Learning computing system
- 要はRISC-VでBig.Littleやるという話
  - (Big) Maxion: 64k L1, 4M L2, Deep Pipelines, OoO, Branch Prediciton
  - (Little) Minion: 4096コアのせる，In-Order, Vector extension of floating tensor instruction(F16, F32, F64 F128)
tool chainもいろいろある
- gcc, gdb, Qemu などなど
- 商用シミュレータもある http://www.imperas.com/imperas-riscv-solutions

「HBM-FPGA をさわってみた」長瀬産業西沢正登さん

Q: 自然言語のベクトルとかに使えるといいのでは？ by 佐藤さん

「Deep Learning推論を高速化するソフトウェア技術」Idein 中村晃一さん

https://github.com/nineties
ハードウェアに依存したくない場合がある - スマホとか．
モデルアーキテクチャ - よいモデル重要
- モデルの精度と計算量 https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet_v1.md
  - 1%稼ぐのは大変(特に100%に近い場合)．正しい目を持ちましょう．
  - 参考 https://www.slideshare.net/ren4yu/deep-neural-network-79382352
例1: アルゴリズム，ループ融合，メモリレイアウトで4倍くらいはかわる
例2: MobileNet/STM32H7
- Tightly Coupled Memory)使う，SIMD使う，アライメントきをつける 9秒->3.1秒
- float32->float16 で 1.1秒
レイヤー融合
- Conv. Batch -> Conv.
- Conv. ReLU -> Conv. + Relu (Reluを最内ループでインライン化)
アルゴリズム
- 2D Conv.のアルゴリズム Direct, im2col, Winograd, FFT
- 参考: Convolutionの数理とアルゴリズム - https://speakerdeck.com/nineties/convolutionfalseshu-li-toarugorizumu
テンソルのshapeによって最適実装は変わる
- 入力の方とか出力の方とか
- フィルタサイズ(3x3とか1x1とか)でも変わる
VidroCore IVアーキテクチャ
- 参考: Using Raspberry Pi GPU for DNN - https://www.slideshare.net/notogawa/using-raspberry-pi-gpu-for-dnn
- 書き戻しが細い(キャッシュとおらないDMA)のがネック
- py-videocore https://github.com/nineties/py-videocore
- QMKL https://github.com/Idein/qmkl
Actcastというサービスを開発中 - https://actcast.io/
FPGA使いたいのはレイテンシつめたいところ．制御系．
発表スライド - https://speakerdeck.com/nineties/deep-learningtui-lun-wogao-su-hua-surusohutoueaji-shu

「TensorFlow XLA：XLAとは、から、最近の利用事例について」@vengineer（ソースコード解析職人）さん

参考 XLA.jl を試してみた - https://qiita.com/antimon2/items/ccfb5c2353d99fcb1976
参考 JuliaからCloud TPUを使う論文の、ざっくりまとめ - https://kiszk.github.io/2018/12/19/TensorFlow-Julia-TPU-XLA/
参考 Introducing PyTorch across Google Cloud https://cloud.google.com/blog/products/ai-machine-learning/introducing-pytorch-across-google-cloud
- How To Build And Run PyTorch For TPU - https://github.com/pytorch/xla
参考 JAX: Autograd and XLA - https://github.com/google/jax
参考 LeFlow: XLA - https://github.com/danielholanda/LeFlow
- LeFlow: Enabling Flexible FPGA High-Level Synthesis of Tensorflow Deep Neural Networks https://arxiv.org/pdf/1807.05317.pdf
- XLA->LLVM->(LegUp)->Verilog HDL

「MN-Coreについて」PFN 金子紘也さん

PFNとしてはTrainingの需要が大きい
Trainingにおける演算
- 誤差逆伝搬．途中の値も覚えておかないとダメ．
- 大きなGEMMとして書き下せる．学習時は基本，密行列．
- V100のピーク性能向上はfp16 GEMM専用エンジン(Tenssor Core)
- 図の参考は http://cs231n.stanford.edu/
データ並列による分散学習
- バッチサイズ増やして → GPUにばらまいて → All-reduce
- 単純にバッチサイズ増やすと精度劣化．テクニックが必要．
MN-1a = Tesla P100 1024, MN-1b = Tesla V100 512．インフィニバンド接続．
Deep Learningの研究動向
- SoTAではモデルサイズは増える
  - 画像から動画/立体，巨大化
  - - 時間方向/空間方向へのConv.HD
  - MoE(Mixture of Experts)
  - NAS(Netwrok Architecture Search)
  - - ネットワークアーキテクチャ自体を自動探索する試み
MN-Core - 深層学習用プロセッサ
- 階層メモリ型SIMDアーキテクチャ．512MABを1chipに集積(MAB -> L1B -> L2B -> Chip)
  - 倍/単/半精度演算はロジックは共有している．命令で切り替え．
  - 1 TFLOPS/W(半精度)，500W
- 空冷
- 2020年に運用予定
計算能力は競争力の源泉
- NIPSの論文提出締切で大手クラウドのGPUが枯渇 https://www.theregister.co.uk/2017/05/22/cloud_providers_ai_researchers/?mt=1495474350040
HWをささえるSW
- ChainerX https://docs.chainer.org/en/latest/chainerx/index.html
  - 高速な自動微分の実装，選択可能なbackend
  - 参考: ChainerX とりあえず入れてみよう https://qiita.com/SatoshiTerasaki/items/defbb1ea49b88c452118
- Chainer-compiler https://github.com/pfnet-research/chainer-compiler
  - Pythonから拡張ONNXフォーマットへのconvert
  - 拡張ONNX上における計算グラフの最適化，自動微分

「私のMNISTのFPSは530000です。ですがもちろんフルパワーで（以下略」なかはらさん

参考: Can FPGAs beat GPUs in Accelerating Next-Generation Deep Neural Networks? - http://isfpga.org/fpga2017/slides/D1_S1_InvitedTalk.pdf
CNNの結果可視化してみる → 情報分類 → XOR使ってエントロピーを最大化する問題
- クラス分類はいい
- 回帰にはあまりよくない
参考: Optuna: A hyperparameter optimization framework - https://github.com/pfnet/optuna
- Chainer + Optunaの例 - https://github.com/pfnet/optuna/blob/master/examples/chainer_simple.py
関数分解法で回路分割
- エンコーダとみなす(どれだけ違うか，で小さくできる)
参考: Analysis and synthesis of weighted-sum functions https://ieeexplore.ieee.org/document/1624513
参考: Logic synthesis for a single large look-up table https://ieeexplore.ieee.org/document/528842

あとでよむ

RAPIDNN: In-Memory Deep Neural Network Acceleration Framework https://arxiv.org/pdf/1806.05794.pdf
SDA: Software-Defined Accelerator for LargeScale DNN Systems https://www.hotchips.org/wp-content/uploads/hc_archives/hc26/HC26-12-day2-epub/HC26.12-5-FPGAs-epub/HC26.12.545-Soft-Def-Acc-Ouyang-baidu-v3--baidu-v4.pdf
3D rendering in fpga - http://www.cs.columbia.edu/~sedwards/classes/2014/4840/reports/BallBalance-presentation.pdf, http://www.cs.columbia.edu/~sedwards/classes/2014/4840/reports/BallBalance.pdf
Learning Transferable Architectures for Scalable Image Recognition https://arxiv.org/abs/1707.07012

Diary/2019-2-2

FPGAX

FPGAXメモ

あとでよむ

検索

趣味の工作

コンピュータ

イベント

リンク