!SeeDotで遊んでみる
::STM32F407 Discovery KitをArduinoで使う．
* Arduino IDEをインストール
* ボードマネージャのURLを追加 https://raw.githubusercontent.com/stm32duino/BoardManagerFiles/master/STM32/package_stm_index.json
* ツール → ボードマネージャ で，STM32 Coreをインストール
* ボードでDiscoveryを，Board part numberでSTM32F407G-DISC1を選択
* サンプルの 01.Basics→Blinkが動作するのを確認する(LD4がチカチカする)
::SeeDotを使う
仮想環境用意して，https://github.com/microsoft/EdgeML/tree/master/Tools/SeeDotに書いてある手順で実行する．
* venvを用意
 python3 -m venv EdgeML
 source ./EdgeML/bin/activate
* GitHubからcloneして環境準備
 git clone https://github.com/Microsoft/EdgeML
 cd EdgeML/tf/
 pip install -r requirements-cpu.txt
 pip install -e .
* usps10でProtoNNを学習
 cd examples/ProtoNN
 python fetch_usps.py
 python process_usps.py
 mkdir usps10/output 
 python protoNN_example.py --data-dir ./usps10 --projection-dim 25 --num-prototypes 55 --epochs 100 -sW 0.3 -o usps10/output
*Arudino向けにビルド
 cd ../../../Tools/SeeDot
 mkdir arduino
 python SeeDot.py -a protonn --train ../../tf/examples/ProtoNN/usps10/train.npy --test ../../tf/examples/ProtoNN/usps10/test.npy --model ../../tf/examples/ProtoNN/usps10/output -o arduino
** こんな感じですすむ
 (EdgeML) miyo@tama:% python SeeDot.py -a protonn --train ../../tf/examples/ProtoNN/usps10/train.npy --test ../../tf/examples/ProtoNN/usps10/test.npy --model ../../tf/examples/ProtoNN/usps10/output -o arduino
 
 ================================
 Executing on protonn for Arduino
 --------------------------------
 Train file: ../../tf/examples/ProtoNN/usps10/train.npy
 Test file: ../../tf/examples/ProtoNN/usps10/test.npy
 Model directory: ../../tf/examples/ProtoNN/usps10/output
 ================================
 
 -----------------------
 Collecting profile data
 -----------------------
 Generating input files for float training dataset...done
 
 Build...success
 ...
 Accuracy is 89.985%
 
 ------------------------------
 Generating code for arduino...
 ------------------------------
 
 Generating input files for fixed testing dataset...done
 
 Generating code...completed
 
 Arduino sketch dumped in the folder arduino
** できたもの
 (EdgeML) miyo@tama:% ls arduino
 arduino.ino  config.h  input/  library.h  model.h  predict.cpp  predict.h
::ボードとの接続
* UART2を使う．PA2がTX, PA3がRX
 HardwareSerial Serial1(USART2);
** F407上のシリアル通信について - ~/.arduino15/packages/STM32/hardware/stm32/1.5.0/variants/DISCO_F407VG/PeripheralPins.c
* なんかでた
 4596: Predicted label: 9; True label: 9; Correct prediction
 4597: Predicted label: 9; True label: 9; Correct prediction
 4598: Predicted label: 9; True label: 9; Correct prediction
 4599: Predicted label: 9; True label: 9; Correct prediction
 4600: Predicted label: 9; True label: 9; Correct prediction
 
 ------------------------
 Average prediction time:
 1380.68
 ------------------------
 
 4601: Predicted label: 9; True label: 9; Correct prediction
 4602: Predicted label: 9; True label: 9; Correct prediction
 4603: Predicted label: 9; True label: 9; Correct prediction
* ソースコードによると，micros()を使って測定した値をイテレーション回数で割ってるみたい
** 入力データはflash上に用意されているデータ
** 1イテレーションあたり1380.68u秒で推論できてるよ，ってことみたい．
* Accuracy Modeにすると，シリアルから入力を受けつけて推論できるみたい
::メモ
* usps - https://www.kaggle.com/bistaumanga/usps-dataset

!TVM/AWS-F1で遊んでみた(失敗)
https://github.cfom/dmlc/tvm/blob/master/docs/deploy/aws_fpga.md をやってみる．
まだうまくいってない．

::やってみたこと
ビルド用にセットアップしたc4.4xlargeマシンにログインして，AWS-F1用の環境変数を
 % source src/project_data/aws-fpga/sdaccel_setup.sh
 % source ${XILINX_SDX}/settings64.sh
でセット．
TVMを，https://docs.tvm.ai/install/from_source.html を参考にビルドする．
LLVMがいるみたいなのでLLVM 4.0.1をビルド．CMakeが古いのでCMakeから...
 % wget https://cmake.org/files/v3.8/cmake-3.8.2.tar.gz
 % wget http://releases.llvm.org/4.0.1/llvm-4.0.1.src.tar.xz
 % wget http://releases.llvm.org/4.0.1/cfe-4.0.1.src.tar.xz
 % tar xvf cfe-4.0.1.src.tar.xz
 % tar xvf llvm-4.0.1.src.tar.xz
 % mv cfe-4.0.1.src llvm-4.0.1.src/tools/clang
 % cd llvm-4.0.1.src
 % mkdir build; cd build
 % cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=$HOME/llvm-4.0.1 ../
 % make -j8 && make install
 % export PATH=$HOME/llvm-4.0.1/bin:$PATH
 % sudo yum install python36 python36-devel python36-pip
 % sudo pip3 install numpy decorator
で準備してから
 % git clone --recursive https://github.com/dmlc/tvm
 % cd tvm
 % git submodule init
 % git submodule update
 % mkdir build
 % cp cmake/config.cmake build
 % cd build
config.cmake の
 set(USE_LLVM OFF)
 set(USE_SDACCEL OFF)
 set(USE_OPENCL OFF)
を
 set(USE_LLVM ON)
 set(USE_SDACCEL ON)
 set(USE_OPENCL ON)
に変更して，
 % cmake ..
 % make -j8
おわったら，
 % export TVM_HOME=$HOME/tvm
 % export PYTHONPATH=$TVM_HOME/python:$TVM_HOME/topi/python:$TVM_HOME/nnvm/python:${PYTHONPATH}
で，利用の準備が完了．
エミュレーション環境の設定を
 % emconfigutil --platform ${AWS_PLATFORM} --nd 1
 % sudo cp emconfig.json $(dirname $(which python))
build.pyとrun.pyを用意して，
 % export XCL_EMULATION_MODE=1
 % export XCL_TARGET=sw_emu
 % python3 build.py
と実行すると
 TypeError: string argument without an encoding
とエラーが．
$TVM_HOME/python/tvm/contrib/sdaccel.py の
 out_file.write(bytes(code))
を
 out_file.write(bytes(code, 'UTF-8'))
に変更して，
 python3 build.py
myadd.soとかができるので，
 export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
 python3 run.py
で実行．
 [centos@ip-172-31-23-90 tvm-test]$ python3 run.py 
 ERROR: xclProbe-scan failed at fpga_pci_get_all_slot_specs
 xclProbe found 0 FPGA slots with xocl driver running
 ERROR: [SDx-EM 08] Please set XCL_EMULATION_MODE to "hw_emu" to run hardware emulation. 
 ERROR: [SDx-EM 09] Please set XCL_EMULATION_MODE to "sw_emu" to run software emulation. 
 ERROR: No devices found
 [03:43:37] /home/centos/tvm/src/runtime/opencl/opencl_device_api.cc:263: No OpenCL platform matched given existing options ...
 [03:43:37] /home/centos/tvm/src/runtime/opencl/opencl_device_api.cc:263: No OpenCL platform matched given existing options ...
 Traceback (most recent call last):
   File "run.py", line 17, in <module>
     a = tvm.nd.array(np.random.uniform(size=n).astype("float32"), ctx)
   File "/home/centos/tvm/python/tvm/ndarray.py", line 214, in array
     return empty(arr.shape, arr.dtype, ctx).copyfrom(arr)
   File "/home/centos/tvm/python/tvm/_ffi/ndarray.py", line 132, in empty
     ctypes.byref(handle)))
   File "/home/centos/tvm/python/tvm/_ffi/base.py", line 314, in check_call
     raise get_last_ffi_error()
 tvm._ffi.base.TVMError: Traceback (most recent call last):
   [bt] (2) /home/centos/tvm/build/libtvm.so(TVMArrayAlloc+0x9c) [0x7f743f29588c]
   [bt] (1) /home/centos/tvm/build/libtvm.so(tvm::runtime::NDArray::Empty(std::vector<long, std::allocator<long> >, DLDataType, DLContext)+0x1b8) [0x7f743f295758]
   [bt] (0) /home/centos/tvm/build/libtvm.so(+0xec74c6) [0x7f743f2e74c6]
   File "/home/centos/tvm/src/runtime/opencl/opencl_device_api.cc", line 123
 TVMError: Check failed: context != nullptr: No OpenCL device
といわれる．ERRORの通り，
 export XCL_EMULATION_MODE=sw_emu
として実行．
 [centos@ip-172-31-23-90 tvm-test]$ python3 run.py 
 ERROR: xclProbe-scan failed at fpga_pci_get_all_slot_specs
 xclProbe found 0 FPGA slots with xocl driver running
 ERROR: device::load_binary binary target=Bin, no Hw HAL handle
 Traceback (most recent call last):
   File "run.py", line 21, in <module>
     fadd(a, b, c)
   File "/home/centos/tvm/python/tvm/_ffi/function.py", line 153, in __call__
     return f(*args)
   File "/home/centos/tvm/python/tvm/_ffi/_ctypes/function.py", line 209, in __call__
     raise get_last_ffi_error()
 tvm._ffi.base.TVMError: Traceback (most recent call last):
   [bt] (4) /home/centos/tvm/build/libtvm.so(TVMFuncCall+0x46) [0x7f8d3998ac76]
   [bt] (3) /home/centos/tvm/build/libtvm.so(+0xed1998) [0x7f8d399fb998]
   [bt] (2) /home/centos/tvm/build/libtvm.so(+0xed159a) [0x7f8d399fb59a]
   [bt] (1) /home/centos/tvm/build/libtvm.so(+0xecd99f) [0x7f8d399f799f]
   [bt] (0) /home/centos/tvm/build/libtvm.so(+0x722392) [0x7f8d3924c392]
   File "/home/centos/tvm/src/runtime/opencl/opencl_module.cc", line 219
   File "/home/centos/tvm/src/runtime/module_util.cc", line 73
 TVMError: Check failed: ret == 0 (-1 vs. 0) : Check failed: err == CL_SUCCESS: OpenCL Error, code=-44: CL_INVALID_PROGRAM
 [centos@ip-172-31-23-90 tvm-test]$ 
FPGAささってるマシンじゃないとだめな想定のようにみえる．とりあえず，合成だけでもしておく．
 % unset XCL_EMULATION_MODE
 % export XCL_TARGET=hw
 % python3 build.py
とすると，最後に同様のエラーはでるけど，xclbinの合成はできた．おわったら
 % $SDACCEL_DIR/tools/create_sdaccel_afi.sh \
   -xclbin=myadd.xclbin \
   -o=myadd \
   -s3_bucket=[バケット名] \
   -s3_dcp_key=[DCP保存フォルダ名] \
   -s3_logs_key=[ログ保存フォルダ名]
で，AWS-F1用のイメージ作成処理をキック
 % cat *_afi_id.txt
でFpgaImageIdを確認して，
 % aws ec2 describe-fpga-images --fpga-image-ids [FpgaImageId]
で，Stateがpendingからavailableになったら完了．

::AWS-F1インスタンスでトライ
AWS-F1インスタンスを起動して，tvm，llvmなどの一切合切をc4インスタンスからコピーして
 % sudo -s
 % source $AWS_FPGA_REPO_DIR/sdaccel_setup.sh
 % export LD_LIBRARY_PATH=.:$LD_LIBRARY_PATH
 % export TVM_HOME=/home/centos/tvm
 % export PYTHONPATH=$TVM_HOME/python:$TVM_HOME/topi/python:$TVM_HOME/nnvm/python:${PYTHONPATH}
と準備．
 % export XCL_EMULATION_MODE=sw_emu
 % export XCL_TARGET=sw_emu
 % python3 build.py
 % python3 run.py
とすると，
 [root@ip-172-31-62-53 tvm-test]# python3 run.py 
 xclProbe found 1 FPGA slots with xocl driver running
 [0.5068266  0.1325183  0.9167701  ... 0.46502367 0.02036605 0.5523464 ]
 [0.6834595  0.8389502  0.16160455 ... 0.9921764  0.5801108  0.86317337]
 [0. 0. 0. ... 0. 0. 0.]
 [1.1902862  0.9714685  1.0783746  ... 1.4572     0.60047686 1.4155197 ]
 [root@ip-172-31-62-53 tvm-test]#
と計算できた(出力用にprintを適当に追加した)
 % export XCL_EMULATION_MODE=hw_emu
 % export XCL_TARGET=hw_emu
 % python3 build.py
 % python3 run.py
では，
 [root@ip-172-31-62-53 tvm-test]# python3 run.py 
 xclProbe found 1 FPGA slots with xocl driver running
 [0.64612037 0.24518912 0.6705971  ... 0.75197536 0.02399846 0.12009709]
 [0.8558938  0.9514521  0.5152762  ... 0.3747665  0.5249482  0.58834535]
 [0. 0. 0. ... 0. 0. 0.]
 INFO: [SDx-EM 01] Hardware emulation runs simulation underneath. Using a large data set will result in long simulation times. It is recommended that a small dataset is used for faster execution. This flow does not use cycle accurate models and hence the performance data generated is approximate.
 [1.5020142  1.1966412  1.1858733  ... 1.1267419  0.5489466  0.70844245]
 INFO: [SDx-EM 22] [Wall clock time: 04:39, Emulation time: 0.0579385 ms] Data transfer between kernel(s) and global memory(s)
 myadd_kernel0_1:m_axi_gmem-DDR          RD = 8.000 KB               WR = 4.000 KB        
と計算できたみたい．
FPGAでは，と，
 % unset XCL_EMULATION_MODE
 % export XCL_TARGET=hw
 % python3 run.py
とやってみた．
 [root@ip-172-31-62-53 tvm-test]# python3 run.py 
 xclProbe found 1 FPGA slots with xocl driver running
 xclAllocBO ERROR: AllocBO IOCTL failed
 ERROR: std::bad_alloc
 ERROR: Operation failed due to earlier error 'std::bad_alloc'
 Traceback (most recent call last):
   File "run.py", line 18, in <module>
     b = tvm.nd.array(np.random.uniform(size=n).astype("float32"), ctx)
   File "/home/centos/tvm/python/tvm/ndarray.py", line 214, in array
     return empty(arr.shape, arr.dtype, ctx).copyfrom(arr)
   File "/home/centos/tvm/python/tvm/_ffi/ndarray.py", line 254, in copyfrom
     check_call(_LIB.TVMArrayCopyFromBytes(self.handle, data, nbytes))
   File "/home/centos/tvm/python/tvm/_ffi/base.py", line 314, in check_call
     raise get_last_ffi_error()
 tvm._ffi.base.TVMError: Traceback (most recent call last):
   [bt] (2) /home/centos/tvm/build/libtvm.so(TVMArrayCopyFromBytes+0x768) [0x7fe99b3b9368]
   [bt] (1) /home/centos/tvm/build/libtvm.so(+0xecb6f2) [0x7fe99b4106f2]
   [bt] (0) /home/centos/tvm/build/libtvm.so(+0x722392) [0x7fe99ac67392]
   File "/home/centos/tvm/src/runtime/opencl/opencl_device_api.cc", line 171
 TVMError: Check failed: e == CL_SUCCESS: OpenCL Error, code=-5: CL_OUT_OF_RESOURCES
で，固まってしまった．
ためしに，F1インスタンスでも
 % python3 build.py
 % python3 run.py
としてみたが，かわらず．残念．