!Ubuntu/ZCU106
Built the same way as for the ZCU111.
The build is scripted: https://github.com/miyo/build-zcu106-linux
After that, all that remains is writing the required files to the SD card.
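:: Preparing the SD card
Create a FAT partition of about 200 MB at the start of the card and an ext4 partition in the remaining space.
Set the type of the FAT partition to c (W95 FAT32 (LBA)) and the type of the ext4 partition to 83 (Linux).
Then format them with something like
mkfs.vfat -F 32 -n boot /dev/sdX1
mkfs.ext4 -L root /dev/sdX2
Replace the X with whatever matches your own environment.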
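Partition names differ between card readers (/dev/sdX1 on USB readers, /dev/mmcblk0p1 on built-in slots), so a small helper can make the "replace the X" step less error-prone. The sketch below only prints the two mkfs commands from above instead of running them. The function names part_name and print_format_cmds are made up for illustration and are not part of build-zcu106-linux.

```shell
# Hypothetical helpers for illustration only.

# Build a partition device name:
#   /dev/sdb + 1      -> /dev/sdb1
#   /dev/mmcblk0 + 1  -> /dev/mmcblk0p1  (devices ending in a digit need a "p")
part_name() {
    case "$1" in
        *[0-9]) printf '%sp%s\n' "$1" "$2" ;;
        *)      printf '%s%s\n' "$1" "$2" ;;
    esac
}

# Print (do not execute) the two format commands for the given device.
print_format_cmds() {
    printf 'mkfs.vfat -F 32 -n boot %s\n' "$(part_name "$1" 1)"
    printf 'mkfs.ext4 -L root %s\n' "$(part_name "$1" 2)"
}
```

Running print_format_cmds /dev/sdb shows the commands for review; they still have to be run by hand as root once the device name is confirmed.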
:: Copying
Copy the contents of ${WORK}/image to the FAT partition created at the start of the SD card.
Extract the root partition built with QEMU onto the second partition.
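The two copy steps could be wrapped roughly as follows. This is a sketch under assumptions: the notes do not say what form the QEMU-built root partition is in, so a tar archive is assumed here, and the function names and any mount points are placeholders.

```shell
# Hypothetical helpers; the tar archive format for the rootfs is an assumption.

# Copy everything inside src (including dotfiles) into dst.
copy_boot_files() {
    cp -r "$1"/. "$2"/
}

# Unpack a root filesystem tarball into dst, preserving permissions.
extract_rootfs() {
    tar -xpf "$1" -C "$2"
}
```

With the FAT partition mounted at, say, /mnt/boot and the ext4 partition at /mnt/root, the calls would be copy_boot_files "${WORK}/image" /mnt/boot and extract_rootfs followed by the archive path and /mnt/root, both run as root.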
:: USB-UART
On Linux, /dev/ttyUSB{0,1,2,3} show up. Connect to /dev/ttyUSB0.
(Note that on the ZCU111 it was /dev/ttyUSB1.)
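Since the console port differs between the two boards, a tiny lookup avoids opening the wrong device. The function name uart_for_board is invented for this sketch; the device mapping is the one noted above.

```shell
# Map a board name to the console device noted above.
uart_for_board() {
    case "$1" in
        zcu106) echo /dev/ttyUSB0 ;;
        zcu111) echo /dev/ttyUSB1 ;;
        *)      echo "unknown board: $1" >&2; return 1 ;;
    esac
}
```

A typical use would be screen "$(uart_for_board zcu106)" 115200. The 115200 baud rate is an assumption (a common default on these evaluation boards), not something recorded in these notes.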
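!Building rftool
Copied rftool from the PetaLinux project onto the board and tried building it there.
* The build asks for "rfdc.h", so rewrote the include as <rfdc.h>
* Copied the rfdc header files to /usr/local/include/rfdc
* Added CFLAGS = -I/usr/local/include/rfdc to the Makefile
With that, the build went through, at least for now.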
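The include rewrite in the first bullet can be done by hand, or with something like the sketch below. The helper name fix_rfdc_include is hypothetical, and it only touches the exact line #include "rfdc.h".

```shell
# Hypothetical helper: rewrite  #include "rfdc.h"  as  #include <rfdc.h>  in one file.
fix_rfdc_include() {
    tmp=$(mktemp) || return 1
    # Write to a temp file first, then overwrite in place so permissions are kept.
    sed 's|#include "rfdc\.h"|#include <rfdc.h>|' "$1" > "$tmp" && cat "$tmp" > "$1"
    rm -f "$tmp"
}
```

Applied over the sources with a loop such as for f in *.c *.h; do fix_rfdc_include "$f"; done, followed by the header copy and the CFLAGS line from the list above.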
!fpgautil is missing
When I went to try running rftool, fpgautil was not there.
git clone https://github.com/Xilinx/meta-xilinx-tools.git
and then
cd meta-xilinx-tools/recipes-bsp/fpga-manager-script/files
make fpgautil
was all it took.
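To notice the missing tool before rftool trips over it, a quick PATH check is enough. This is only a sketch; have_tool and fpgautil_hint are made-up names.

```shell
# Return success if the named tool is on PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Say whether fpgautil still needs to be built as described above.
fpgautil_hint() {
    if have_tool fpgautil; then
        echo "fpgautil found"
    else
        echo "fpgautil missing: build it from meta-xilinx-tools"
    fi
}
```

Note that make fpgautil leaves the binary in the files directory, so it still has to be copied somewhere on PATH (or called by full path) for the check to pass.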
!ZCU106 benchmarks
To start with, BYTE UNIX Benchmarks:
Benchmark Run: Thu Dec 03 2020 05:46:53 - 06:14:57
4 CPUs in system; running 1 parallel copy of tests
Dhrystone 2 using register variables 6372358.5 lps (10.0 s, 7 samples)
Double-Precision Whetstone 1156.6 MWIPS (9.8 s, 7 samples)
Execl Throughput 1674.6 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 168388.8 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 50505.4 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 438224.3 KBps (30.0 s, 2 samples)
Pipe Throughput 397549.8 lps (10.0 s, 7 samples)
Pipe-based Context Switching 73965.6 lps (10.0 s, 7 samples)
Process Creation 4355.1 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 2905.9 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 963.4 lpm (60.0 s, 2 samples)
System Call Overhead 608372.4 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 6372358.5 546.0
Double-Precision Whetstone 55.0 1156.6 210.3
Execl Throughput 43.0 1674.6 389.4
File Copy 1024 bufsize 2000 maxblocks 3960.0 168388.8 425.2
File Copy 256 bufsize 500 maxblocks 1655.0 50505.4 305.2
File Copy 4096 bufsize 8000 maxblocks 5800.0 438224.3 755.6
Pipe Throughput 12440.0 397549.8 319.6
Pipe-based Context Switching 4000.0 73965.6 184.9
Process Creation 126.0 4355.1 345.6
Shell Scripts (1 concurrent) 42.4 2905.9 685.3
Shell Scripts (8 concurrent) 6.0 963.4 1605.7
System Call Overhead 15000.0 608372.4 405.6
========
System Benchmarks Index Score 430.0
------------------------------------------------------------------------
Benchmark Run: Thu Dec 03 2020 06:14:57 - 06:43:02
4 CPUs in system; running 4 parallel copies of tests
Dhrystone 2 using register variables 25487070.9 lps (10.0 s, 7 samples)
Double-Precision Whetstone 4627.6 MWIPS (9.8 s, 7 samples)
Execl Throughput 6123.9 lps (30.0 s, 2 samples)
File Copy 1024 bufsize 2000 maxblocks 318714.5 KBps (30.0 s, 2 samples)
File Copy 256 bufsize 500 maxblocks 88220.4 KBps (30.0 s, 2 samples)
File Copy 4096 bufsize 8000 maxblocks 914718.4 KBps (30.0 s, 2 samples)
Pipe Throughput 1596831.6 lps (10.0 s, 7 samples)
Pipe-based Context Switching 287838.8 lps (10.0 s, 7 samples)
Process Creation 12114.5 lps (30.0 s, 2 samples)
Shell Scripts (1 concurrent) 7948.9 lpm (60.0 s, 2 samples)
Shell Scripts (8 concurrent) 1031.4 lpm (60.1 s, 2 samples)
System Call Overhead 2331185.6 lps (10.0 s, 7 samples)
System Benchmarks Index Values BASELINE RESULT INDEX
Dhrystone 2 using register variables 116700.0 25487070.9 2184.0
Double-Precision Whetstone 55.0 4627.6 841.4
Execl Throughput 43.0 6123.9 1424.2
File Copy 1024 bufsize 2000 maxblocks 3960.0 318714.5 804.8
File Copy 256 bufsize 500 maxblocks 1655.0 88220.4 533.1
File Copy 4096 bufsize 8000 maxblocks 5800.0 914718.4 1577.1
Pipe Throughput 12440.0 1596831.6 1283.6
Pipe-based Context Switching 4000.0 287838.8 719.6
Process Creation 126.0 12114.5 961.5
Shell Scripts (1 concurrent) 42.4 7948.9 1874.8
Shell Scripts (8 concurrent) 6.0 1031.4 1719.0
System Call Overhead 15000.0 2331185.6 1554.1
========
System Benchmarks Index Score 1187.7
user@zcu106:~/byte-unixbench/UnixBench$
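As a sanity check on the tables above: each INDEX value is the result divided by the baseline, times 10, and the final Index Score is the geometric mean of the per-test indexes. A minimal sketch with awk (the function names ub_index and geomean are invented here):

```shell
# Index for one test: result / baseline * 10, printed to one decimal.
ub_index() {
    awk -v b="$1" -v r="$2" 'BEGIN { printf "%.1f\n", r / b * 10 }'
}

# Geometric mean of the numbers on stdin (one per line), printed to one decimal.
geomean() {
    awk '{ s += log($1); n++ } END { if (n) printf "%.1f\n", exp(s / n) }'
}
```

For example, ub_index 55.0 1156.6 reproduces the Whetstone index of the 1-copy run, and piping the twelve 1-copy index values into geomean should land at or very near the reported 430.0 (small differences are possible because the table values are already rounded).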
STREAM results:
user@zcu106:~/STREAM-master$ gcc -DSTREAM_ARRAY_SIZE=40000000 -O2 -fopenmp -o stream stream.c
user@zcu106:~/STREAM-master$ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 40000000 (elements), Offset = 0 (elements)
Memory per array = 305.2 MiB (= 0.3 GiB).
Total memory required = 915.5 MiB (= 0.9 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 4
Number of Threads counted = 4
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 119349 microseconds.
(= 119349 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 9058.9 0.071932 0.070649 0.074034
Scale: 7858.5 0.084209 0.081440 0.090125
Add: 7354.7 0.131362 0.130529 0.132072
Triad: 5933.6 0.162834 0.161791 0.163859
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
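The "Best Rate" column can be reproduced from the "Min time" column: STREAM counts 8 bytes per element for every array a kernel touches (2 arrays for Copy and Scale, 3 for Add and Triad) and reports MB as 10^6 bytes. A small sketch, with stream_rate being a name made up for this note:

```shell
# Best rate in MB/s (1 MB = 10^6 bytes):
#   arrays touched * 8 bytes * elements / min time
stream_rate() {
    awk -v a="$1" -v n="$2" -v t="$3" \
        'BEGIN { printf "%.1f\n", a * 8 * n / t / 1000000 }'
}
```

With the 4-thread numbers above, stream_rate 2 40000000 0.070649 gives the Copy rate and stream_rate 3 40000000 0.161791 gives the Triad rate.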
user@zcu106:~/STREAM-master$ OMP_NUM_THREADS=1 ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 40000000 (elements), Offset = 0 (elements)
Memory per array = 305.2 MiB (= 0.3 GiB).
Total memory required = 915.5 MiB (= 0.9 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 1
Number of Threads counted = 1
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 298444 microseconds.
(= 298444 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 4044.8 0.163095 0.158229 0.170722
Scale: 2209.2 0.291470 0.289698 0.293948
Add: 2277.1 0.421664 0.421598 0.421831
Triad: 1804.5 0.532179 0.532003 0.532374
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
user@zcu106:~/STREAM-master$ OMP_NUM_THREADS=2 ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 40000000 (elements), Offset = 0 (elements)
Memory per array = 305.2 MiB (= 0.3 GiB).
Total memory required = 915.5 MiB (= 0.9 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 2
Number of Threads counted = 2
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 160329 microseconds.
(= 160329 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 7377.2 0.090815 0.086754 0.093672
Scale: 4324.7 0.150540 0.147987 0.152486
Add: 4442.9 0.216218 0.216074 0.216463
Triad: 3487.0 0.276336 0.275306 0.277016
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
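To compare the three runs, the multi-threaded rates can be divided by the single-thread rate. The helper below is only a sketch (speedup is an invented name):

```shell
# Speedup of a multi-threaded rate over the single-thread rate, two decimals.
speedup() {
    awk -v base="$1" -v rate="$2" 'BEGIN { printf "%.2f\n", rate / base }'
}
```

For Triad, speedup 1804.5 3487.0 gives about 1.93 with 2 threads and speedup 1804.5 5933.6 gives about 3.29 with 4 threads. Copy scales less well: speedup 4044.8 9058.9 comes out around 2.24 with 4 threads.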