Whetstone Benchmark - whetstonePi64g8 and g12
Vector Versions - Whetv64SPg8 and g12, whetvDP64g8 and g12
This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations and no accessing of data in L2 cache or RAM.
Results are provided for the original scalar single precision (SP) version, along with those for single and double precision (DP) varieties of the vector version, originally written for use on the first Cray 1 supercomputer delivered to the UK. For more information see Pi 5 The Vector Processor later.
Examination of the time used by the different tests shows that the total can be dominated by those executing functions such as COS and EXP.
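As a rough illustration of that point, the sketch below adds up the seconds reported for each test in the Pi 5 GCC 12 single precision vector run and works out the share taken by the two function tests. The helper name `time_share` is invented for this example and is not part of the benchmark.

```python
# Seconds per test, copied from the Pi 5 GCC 12 single precision
# vector Whetstone results reported in this document.
seconds = {
    "N1 floating point": 0.286,
    "N2 floating point": 2.009,
    "N3 if then else": 0.804,
    "N4 fixed point": 14.457,
    "N5 sin,cos etc.": 51.673,
    "N6 floating point": 7.351,
    "N7 assignments": 0.770,
    "N8 exp,sqrt etc.": 22.961,
}


def time_share(times, names):
    """Return the fraction of total run time spent in the named tests.

    Hypothetical helper for illustration only.
    """
    total = sum(times.values())
    return sum(times[name] for name in names) / total


function_share = time_share(seconds, ["N5 sin,cos etc.", "N8 exp,sqrt etc."])
print(f"N5 + N8 share of run time: {function_share:.1%}")
```

With these figures the two function tests account for roughly three quarters of the run time, which is why large MFLOPS gains have limited effect on overall MWIPS.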
Pi 5/Pi 4 comparisons are provided for the gcc 8 scalar versions, indicating performance gains of between 2.44 and 2.59 times for the three (MFLOPS) floating point tests and 2.79 times on overall MWIPS. Performance of the Pi 5 gcc 12 compilations was essentially identical to that from gcc 8.
Pi 5/Pi 4 vector SP and DP gcc 8 performance gains were similar, between 2.34 and 3.10 times for MFLOPS and around 2.3 times for MWIPS. Pi 5 SP Vector/Scalar gains are also provided, giving 5.40 to 7.86 times for MFLOPS but only 1.88 times for overall MWIPS, deflated by the COS/EXP tests. Maximum SP scalar speed was 1.36 GFLOPS, with vectors at 8.08 GFLOPS SP and 4.0 GFLOPS DP. Note the different single and double precision numeric results.
Dhrystone Benchmark - dhrystonePi64g8 and g12
This is the most popular ARM integer benchmark, often subject to over optimisation, rated in VAX MIPS aka DMIPS.
The Pi 5 GCC 8 gain over the Pi 4 was 2.37 times. There was a slight gain using GCC 12, where the DMIPS/MHz ratio reached 8.57.
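The ratings can be reproduced from the Dhrystones per Second figures in the results. The sketch below uses the conventional divisor of 1757 Dhrystones per second for one VAX MIPS, and assumes a 2400 MHz Pi 5 clock, which is not stated in the results themselves. The function and constant names are made up for this example.

```python
# Conventional reference score: 1757 Dhrystones per second = 1 VAX MIPS.
VAX_DHRYSTONES_PER_SECOND = 1757

# Assumed Pi 5 CPU clock in MHz (not given in the benchmark output).
PI5_CLOCK_MHZ = 2400


def vax_mips(dhrystones_per_second):
    """Convert a Dhrystones per second score into VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_DHRYSTONES_PER_SECOND


# Dhrystones per Second values copied from the reported results.
pi4_gcc8 = vax_mips(13729822)
pi5_gcc8 = vax_mips(32578833)
pi5_gcc12 = vax_mips(36120831)

print(f"Pi 4 GCC 8  DMIPS: {pi4_gcc8:.2f}")
print(f"Pi 5 GCC 8  DMIPS: {pi5_gcc8:.2f}, gain {pi5_gcc8 / pi4_gcc8:.2f}")
print(f"Pi 5 GCC 12 DMIPS: {pi5_gcc12:.2f}, "
      f"DMIPS/MHz {pi5_gcc12 / PI5_CLOCK_MHZ:.2f}")
```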
Linpack 100 Benchmark MFLOPS - linpackPi64g8 and g12, linpackPi64gSP, linpackPi64NEONig8
This original Linpack benchmark executes double precision arithmetic. I introduced two single precision versions, one using NEON functions to include vector processing.
Performance of this benchmark can vary, with its dependence on data placement in L2 cache.
Unlike when the Pi 5 was introduced, later compilers produced code as fast as the NEON version. Now, with GCC 12, the NEON variety was slower and the others produced a small gain over GCC 8 compilations. Comparisons for the latter indicated Pi 5 gains of between 3.16 and 3.54 times over the three versions. Maximum Pi 5 speeds were 6.60 GFLOPS SP and 3.93 GFLOPS DP.
Livermore Loops Benchmark MFLOPS - liverloopsPi64g8 and g12
This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, where the Cray 1 supercomputer was rated as 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.
Although each kernel is executed for a relatively long time, performance of some can be inconsistent.
Pi 5 GCC 8 maximum speed was 9.87 DP GFLOPS, with gains over the Pi 4 between 2.14 and 4.65 over the 24 loops.
Maximum performance via GCC 12 was 10.57 DP GFLOPS, with those for all of the loops similar to GCC 8 scores.
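For readers unfamiliar with the averages quoted under Overall Ratings, the following sketch shows how geometric and harmonic means differ from a plain average, using the 24 Pi 4 GCC 8 figures. The function names are illustrative only. The reported Overall Ratings include a maximum that does not appear among the 24 listed figures, so they seem to be drawn from more measurements than those listed, and values computed here need not match them.

```python
import math

# MFLOPS for the 24 kernels, copied from the Pi 4 GCC 8 results.
pi4_mflops = [
    2108.4, 936.3, 959.9, 965.1, 382.5, 808.6,
    2312.9, 2488.4, 2065.7, 668.7, 500.3, 980.7,
    180.7, 404.8, 815.0, 643.8, 726.8, 1189.6,
    449.8, 397.2, 1716.0, 366.9, 817.7, 312.7,
]


def arithmetic_mean(values):
    """Plain average."""
    return sum(values) / len(values)


def geometric_mean(values):
    """n-th root of the product, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def harmonic_mean(values):
    """Reciprocal of the average reciprocal, weighted towards slow kernels."""
    return len(values) / sum(1.0 / v for v in values)


print(f"Average {arithmetic_mean(pi4_mflops):.1f}")
print(f"Geomean {geometric_mean(pi4_mflops):.1f}")
print(f"Harmean {harmonic_mean(pi4_mflops):.1f}")
```

The geometric mean is less influenced by one or two very fast kernels than the plain average, which is presumably why it is used as the official figure.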
Fast Fourier Transforms Benchmarks - fft1Pi64g, fft3cPi64g8 and g12
This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in performance levels occur at data size changes between caches, then to RAM.
Comparisons of averages of the three runs are provided. Those for FFT1 demonstrate the clear but varying advantage of the Pi 5 over the Pi 4, depending on the source of the data, with that from L3 cache providing gains of up to 13.34 times and up to 4.71 times involving the larger L2 cache. Most other gains are in the two to four times range. With the faster, CPU speed limited FFT3c, gains were mainly between 2 and 3 times. GCC 12 over GCC 8 comparisons indicate a slight advantage of the former using data from caches, but the roles reversed when dealing with RAM data transfers.
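The comparison columns can be checked from the millisecond timings. As a sketch, the following averages the three single precision FFT1 runs at 256K for each system and divides one by the other. The variable and function names are chosen for this example only.

```python
def mean(values):
    """Average of a list of timings."""
    return sum(values) / len(values)


# FFT1 single precision timings in milliseconds at size 256K,
# copied from the Pi 4 GCC 8 and Pi 5 GCC 8 results.
pi4_sp_256k_ms = [295.55, 280.43, 303.20]
pi5_sp_256k_ms = [23.23, 21.27, 21.40]

# Lower time is faster, so the gain is Pi 4 time over Pi 5 time.
gain = mean(pi4_sp_256k_ms) / mean(pi5_sp_256k_ms)
print(f"Pi 5/Pi 4 gain at 256K SP: {gain:.2f}")
```

This gives the 13.34 times quoted as the largest gain.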
BusSpeed Benchmark - busspeedPi64g8 and g12
This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word increments for the next one, then repeats the pass with decreasing increments (Inc16 down to Inc2), finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds as 16 times the speed of the appropriate measurements at Inc16.
The most important ratios are those from Read All. The others demonstrate what happens when not all data is read sequentially, where the Pi 5 appears to be significantly faster than the Pi 4. The main results indicate Pi 5 gains of just over twice when reading data from L1 and L2 caches, but these can be more than four times from L3 and more than three times from RAM. Maximum bus speed, using one CPU core, is estimated as around 14 GB/second from Inc16, and is also shown under Read All. See MP results for higher estimates.
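The bus speed estimate is simple arithmetic on the Inc16 column. The sketch below applies the factor of 16 to the Pi 5 GCC 8 RAM readings. One possible reading of that factor (an assumption, not stated in the text) is that each Inc16 read of a 4 byte word pulls in a 64 byte burst. The names used are illustrative.

```python
# Multiplier given in the text for Inc16 measurements.
WORDS_PER_BURST = 16


def estimated_bus_mb_per_second(inc16_mb_per_second):
    """Estimate bus speed as 16 times the Inc16 reading speed."""
    return WORDS_PER_BURST * inc16_mb_per_second


# Pi 5 GCC 8 RAM sized tests: KBytes -> (Inc16 MB/s, Read All MB/s),
# copied from the reported results.
pi5_ram = {16384: (864, 14015), 65536: (796, 12699)}

for kbytes, (inc16, read_all) in pi5_ram.items():
    estimate = estimated_bus_mb_per_second(inc16)
    print(f"{kbytes} KB: estimate {estimate} MB/s, Read All {read_all} MB/s")
```

At 16384 KBytes the estimate is 13824 MB/second, close to the Read All figure and to the 14 GB/second quoted.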
Pi 5 performance produced from GCC 8 and GCC 12 compilations was essentially the same.
MemSpeed Benchmark MB/Second - memspeedPi64g8 and g12
The benchmark includes CPU speed dependent calculations using data from caches and RAM, via single and double precision floating point and integer functions. The instruction sequences used are shown in the results column titles.
When compiled with GCC 6, earlier results identified unusually slow operation dealing with 32 bit floating point and integer calculations. It looks as though the effect is that data is read from RAM instead of caches, which would explain why Pi 5 performance gains were mainly less than two times. With double precision floating point, average Pi 5 gains were around four times for the first two sets of calculations, including more than 10 times with L3 cache involvement.
The GCC 12 compilation appears to have corrected the above misoperations, providing gains of more than eight times over GCC 8. The results also show slight improvements in double precision calculations. Maximum calculated speeds are provided, indicating 15.3 single core GFLOPS SP and 6.86 GFLOPS DP, the relationship expected using SIMD calculations. The tests also confirmed this, with nearly 6.4 GFLOPS/GHz SP and nearly half that DP. This performance was obtained using data from L1 and L2 caches, with almost the same from L3 cache.
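The Max MFLOPS figures appear consistent with a simple conversion from the best x[m]=x[m]+s*y[m] MB/second readings, on the assumption that each result reads two words and performs two floating point operations (one multiply and one add). The sketch below works under that assumption, with an assumed 2.4 GHz clock for the per GHz figure. Neither assumption is spelled out in the results, and the function name is invented here.

```python
# Assumed Pi 5 CPU clock in GHz (not given in the benchmark output).
PI5_CLOCK_GHZ = 2.4


def mflops_from_mb_per_second(mb_per_second, bytes_per_word,
                              words_per_result=2, flops_per_result=2):
    """Convert MB/second into MFLOPS for x[m] = x[m] + s * y[m].

    Assumes x[m] and y[m] are both counted as data read, and that
    each result costs one multiply plus one add.
    """
    million_words = mb_per_second / bytes_per_word
    return million_words / words_per_result * flops_per_result


# Best GCC 12 readings from the first Sngl and Dble columns.
sp_mflops = mflops_from_mb_per_second(61264, bytes_per_word=4)
dp_mflops = mflops_from_mb_per_second(54902, bytes_per_word=8)

print(f"SP {sp_mflops:.0f} MFLOPS, "
      f"{sp_mflops / 1000 / PI5_CLOCK_GHZ:.2f} GFLOPS/GHz")
print(f"DP {dp_mflops:.0f} MFLOPS, "
      f"{dp_mflops / 1000 / PI5_CLOCK_GHZ:.2f} GFLOPS/GHz")
```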
NeonSpeed Benchmark MB/Second - NeonSpeedPi64g8 and g12
This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. Norm functions were as generated by the compiler, and NEON functions were produced using intrinsic functions.
The initial GCC 8 test functions produced the same irregular results as MemSpeed for the first “Normal Float and Int” calculations, which appear to only read RAM based data. Performance from NEON code indicated that the Pi 5 was typically 2.5 times faster than the Pi 4 using cache based data, and 1.5 times faster from RAM. Exceptions were gains of up to 7.9 times using L3 cache and nearly 4.8 times from lower level caches.
The GCC 12 compiler produced acceptable “Normal” performance on the Pi 5, reflected by gains of up to more than ten times over GCC 8 results. This compiler is also shown to provide faster operation than that from NEON functions. Many of the latter show 20% improvements but some were slower. Maximum floating point speed demonstrated was nearly 17 GFLOPS.
Code:
Pi 4 GCC 8

Whetstone Single Precision C Benchmark 64 Bit gcc 8R, Fri May 22 10:48:53 2020

Loop content                    Result     MFLOPS       MOPS  Seconds

N1 floating point -1.12475013732910156    524.251               0.076
N2 floating point -1.12274742126464844    534.904               0.524
N3 if then else    1.00000000000000000              2978.570    0.073
N4 fixed point    12.00000000000000000              2493.078    0.264
N5 sin,cos etc.    0.49911010265350342                57.643    3.012
N6 floating point  0.99999982118606567    397.676               2.831
N7 assignments     3.00000000000000000               996.647    0.387
N8 exp,sqrt etc.   0.75110864639282227                27.327    2.841

MWIPS                                    2085.311              10.008

Pi 5 GCC 8

Whetstone Single Precision C Benchmark 64 Bit gcc 8R, Thu Aug 10 15:44:50 2023

Loop content                    Result     MFLOPS       MOPS  Seconds G8 Pi5/4

N1 floating point -1.12475013732910156   1279.196               0.087     2.44
N2 floating point -1.12274742126464844   1364.748               0.573     2.55
N3 if then else    1.00000000000000000              7190.834    0.084     2.41
N4 fixed point    12.00000000000000000              5995.954    0.306     2.41
N5 sin,cos etc.    0.49911010265350342               154.725    3.131     2.68
N6 floating point  0.99999982118606567   1027.998               3.055     2.59
N7 assignments     3.00000000000000000              2398.668    0.449     2.41
N8 exp,sqrt etc.   0.75110864639282227                93.596    2.314     3.43

MWIPS                                    5822.922               9.998     2.79

Pi 5 GCC 12

Whetstone Single Precision C Benchmark 64 Bit gcc 12, Thu Sep 28 11:46:43 2023

Loop content                    Result     MFLOPS       MOPS  Seconds

N1 floating point -1.12475013732910156   1279.140               0.088
N2 floating point -1.12274742126464844   1364.558               0.575
N3 if then else    1.00000000000000000              3594.939    0.168
N4 fixed point    12.00000000000000000              5994.963    0.307
N5 sin,cos etc.    0.49911010265350342               157.996    3.075
N6 floating point  0.99999982118606567   1027.940               3.064
N7 assignments     3.00000000000000000              2398.054    0.450
N8 exp,sqrt etc.   0.75110864639282227                95.590    2.273

MWIPS                                    5839.767              10.000

#################### Vector Whetstone Vector Length 258 ####################

Pi 4 GCC 8 SP

Whetstone Vector Benchmark 64 Bit Single Precision, Wed Aug 30 10:41:57 2023

Loop content                    Result     MFLOPS       MOPS  Seconds

N1 floating point -1.13316142559051514   2338.496               0.391
N2 floating point -1.13312149047851562   1651.957               3.877
N3 if then else    1.00000000000000000              4427.445    1.114
N4 fixed point    12.00000000000000000              1733.458    8.659
N5 sin,cos etc.    0.49998238682746887                74.913   52.923
N6 floating point  0.99999982118606567   2573.346               9.988
N7 assignments     3.00000000000000000             18596.381    0.474
N8 exp,sqrt etc.   0.75002217292785645                78.503   22.581

MWIPS                                    4764.843             100.007
Code:
Pi 5 GCC 8 SP

Whetstone Vector Benchmark 64 Bit Single Precision, Sat Oct 7 10:15:16 2023

Loop content                    Result     MFLOPS       MOPS  Seconds G8 Pi5/4

N1 floating point -1.13316142559051514   7111.676               0.290     3.04
N2 floating point -1.13312149047851562   3857.446               3.746     2.34
N3 if then else    1.00000000000000000             10141.446    1.097     2.29
N4 fixed point    12.00000000000000000              2396.242   14.135     1.38
N5 sin,cos etc.    0.49998238682746887               177.032   50.534     2.36
N6 floating point  0.99999982118606567   7986.011               7.263     3.10
N7 assignments     3.00000000000000000             42584.598    0.467     2.29
N8 exp,sqrt etc.   0.75002217292785645               178.102   22.459     2.27

MWIPS                                   10753.538              99.990     2.26

Pi 5 GCC 12 SP

Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct 7 10:46:30 2023

                                                                               Vector/
                                                                         Pi 5   Scalar
Loop content                    Result     MFLOPS       MOPS  Seconds  GCC12/8  G12 Pi5

N1 floating point -1.13316142559051514   7393.282               0.286     1.04     5.78
N2 floating point -1.13312149047851562   7364.751               2.009     1.91     5.40
N3 if then else    1.00000000000000000             14169.053    0.804     1.40     3.94
N4 fixed point    12.00000000000000000              2398.742   14.457     1.00     0.40
N5 sin,cos etc.    0.49998238682746887               177.260   51.673     1.00     1.12
N6 floating point  0.99999982118606567   8078.622               7.351     1.01     7.86
N7 assignments     3.00000000000000000             26419.105    0.770     0.62    11.02
N8 exp,sqrt etc.   0.75002217292785645               178.359   22.961     1.00     1.87

MWIPS                                   10974.928             100.311     1.02     1.88

Pi 4 GCC 8 DP

Whetstone Vector Benchmark 64 Bit Double Precision, Wed Aug 30 10:48:05 2023

Loop content                    Result     MFLOPS       MOPS  Seconds

N1 floating point -1.13314558088707962   1146.624               0.709
N2 floating point -1.13310306766606850   1094.230               5.203
N3 if then else    1.00000000000000000              4405.221    0.995
N4 fixed point    12.00000000000000000              1730.427    7.711
N5 sin,cos etc.    0.49998080312723675                73.193   48.149
N6 floating point  0.99999988868927014   1294.129              17.655
N7 assignments     3.00000000000000000              9967.123    0.785
N8 exp,sqrt etc.   0.75002006515491115                83.614   18.845

MWIPS                                    4233.571             100.052

Pi 5 GCC 8 DP

Whetstone Vector Benchmark 64 Bit Double Precision, Sat Oct 7 10:18:59 2023

Loop content                    Result     MFLOPS       MOPS  Seconds G8 Pi5/4

N1 floating point -1.13314558088707962   3499.307               0.535     3.05
N2 floating point -1.13310306766606850   2793.370               4.688     2.55
N3 if then else    1.00000000000000000             10158.471    0.993     2.31
N4 fixed point    12.00000000000000000              2396.163   12.809     1.38
N5 sin,cos etc.    0.49998080312723675               171.834   47.176     2.35
N6 floating point  0.99999988868927014   3994.760              13.156     3.09
N7 assignments     3.00000000000000000             21713.754    0.829     2.18
N8 exp,sqrt etc.   0.75002006515491115               184.857   19.607     2.21

MWIPS                                    9763.593              99.793     2.31

Pi 5 GCC 12 DP

Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct 7 10:50:40 2023

Loop content                    Result     MFLOPS       MOPS  Seconds

N1 floating point -1.13314558088707962   3602.841               0.523
N2 floating point -1.13310306766606739   3619.564               3.647
N3 if then else    1.00000000000000000             14167.623    0.718
N4 fixed point    12.00000000000000000              2398.696   12.898
N5 sin,cos etc.    0.49998080312723675               172.068   47.491
N6 floating point  0.99999988868927014   3997.801              13.252
N7 assignments     3.00000000000000000             13172.392    1.378
N8 exp,sqrt etc.   0.75002006515491115               182.557   20.014

MWIPS                                    9829.517              99.920
Code:
Pi 4 GCC 8

Dhrystone Benchmark 2.1 64 Bit gcc8, Mon May 25 22:16:05 2020

Nanoseconds one Dhrystone run:       72.83
Dhrystones per Second:            13729822
VAX MIPS rating =                  7814.36

Numeric results were correct

Pi 5 GCC 8

Dhrystone Benchmark 2.1 64 Bit gcc8, Thu Aug 10 15:49:13 2023

Nanoseconds one Dhrystone run:       30.69
Dhrystones per Second:            32578833
VAX MIPS rating =                 18542.31   Pi 5/Pi 4 Gain 2.37

Numeric results were correct

Pi 5 GCC 12

Dhrystone Benchmark 2.1 64 Bit gcc12, Thu Sep 28 11:44:33 2023

Nanoseconds one Dhrystone run:       27.68
Dhrystones per Second:            36120831
VAX MIPS rating =                 20558.24   GCC 12/8 Gain 1.11

Numeric results were correct
Code:
Pi 4 GCC 8

Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Mon May 25 22:05:47 2020

Speed    1111.51 MFLOPS

Numeric results were as expected

Linpack Single Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Mon May 25 22:09:12 2020

Speed    1930.27 MFLOPS

Numeric results were as expected

Linpack Single Precision Benchmark n @ 100
NEON Intrinsics 64 bit gcc 8, Mon May 25 22:11:15 2020

Speed    2030.95 MFLOPS

Numeric results were as expected

------------------------------------------------------

Pi 5 GCC 8                                    Pi5/Pi4

Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Thu Aug 10 16:12:47 2023

Speed    3933.38 MFLOPS                          3.54

Numeric results were as expected

Linpack Single Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 8, Thu Aug 10 16:04:18 2023

Speed    6106.68 MFLOPS                          3.16

Numeric results were as expected

Linpack Single Precision Benchmark n @ 100
NEON Intrinsics 64 bit gcc 8, Thu Aug 10 16:13:52 2023

Speed    6603.58 MFLOPS                          3.25

Numeric results were as expected

------------------------------------------------------

Pi 5 GCC 12                                  GCC 12/8

Linpack Double Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 12, Thu Sep 28 15:58:07 2023

Speed    4136.39 MFLOPS                          1.05

Numeric results were as expected

Linpack Single Precision Unrolled Benchmark n @ 100
Optimisation 64 Bit gcc 12, Thu Sep 28 16:04:19 2023

Speed    6472.77 MFLOPS                          1.06

Numeric results were as expected

Linpack Single Precision Benchmark n @ 100
NEON Intrinsics 64 bit gcc 12, Thu Sep 28 15:49:56 2023

Speed    5665.39 MFLOPS                          0.86

Numeric results were as expected

But 4 needed changing in program, via #define GCC12ARM64N, to avoid unnecessary error reports.
Code:
Pi 4 GCC 8

Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Mon May 25 10:39:10 2020

MFLOPS for 24 loops
 2108.4   936.3   959.9   965.1   382.5   808.6  2312.9  2488.4  2065.7   668.7   500.3   980.7
  180.7   404.8   815.0   643.8   726.8  1189.6   449.8   397.2  1716.0   366.9   817.7   312.7

Overall Ratings
Maximum Average Geomean Harmean Minimum
 2616.7   959.8   766.7   613.0   169.7

Numeric results were as expected

Pi 5 GCC 8

Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Thu Aug 10 16:14:33 2023

MFLOPS for 24 loops
 7423.6  2147.9  2356.6  2472.9   911.5  1871.0  9872.3  5317.7  5162.9  2125.8  1173.2  2672.0
  709.1  1108.7  2966.6  1598.5  1761.3  5526.8  1190.0   956.0  5425.1  1489.5  2147.9   858.2

Overall Ratings
Maximum Average Geomean Harmean Minimum
 9872.3  2873.9  2208.3  1763.4   646.6

Numeric results were as expected

-----------------------------------------------------------------------------------

GCC 8 Pi5/Pi4 Performance Ratios

For 24 loops
   3.52    2.29    2.46    2.56    2.38    2.31    4.27    2.14    2.50    3.18    2.34    2.72
   3.92    2.74    3.64    2.48    2.42    4.65    2.65    2.41    3.16    4.06    2.63    2.74

Min 2.14   Max 4.65

Overall Ratings
Maximum Average Geomean Harmean Minimum
   3.77    2.99    2.88    2.88    3.81

-----------------------------------------------------------------------------------

Pi 5 GCC 12

Livermore Loops Benchmark 64 Bit gcc 12 via C/C++ Thu Sep 28 16:38:37 2023

MFLOPS for 24 loops
 7833.8  2404.6  2377.2  2346.8   913.0  1857.1   10577  5350.6  5109.2  2117.4  1186.0  2351.4
  760.0  1121.2  3103.4  1597.7  1776.1  5455.9  1197.2  2490.5  5657.5  1855.7  2139.8   780.4

Overall Ratings
Maximum Average Geomean Harmean Minimum
10576.9  2964.4  2308.1  1870.7   733.9

Numeric results were as expected via #define GCC12ARMPI
Code:
Pi 4 GCC 8

Pi 4 RPi FFT gcc 8 64 Bit Benchmark 1 Mon May 25 10:54:42 2020

  Size                      milliseconds
     K       Single Precision            Double Precision
     1     0.05     0.04     0.04     0.04     0.04     0.05
     2     0.08     0.08     0.08     0.15     0.14     0.14
     4     0.23     0.23     0.23     0.39     0.38     0.44
     8     0.73     0.80     0.70     0.97     1.04     0.97
    16     1.98     1.87     1.79     2.66     2.52     2.83
    32     4.92     4.92     5.29     5.67     4.92     4.89
    64     8.80     8.69     8.67    32.21    32.23    33.31
   128    49.82    49.79    50.17   161.36   159.61   159.39
   256   295.55   280.43   303.20   411.97   415.90   340.34
   512   506.01   601.29   572.36   781.10   779.05   782.21
  1024  1375.42  1377.64  1375.77  1898.28  1876.88  1896.22

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Mon May 25 10:55:00 2020

Pi 4 RPi FFT gcc 8 64 Bit Benchmark 3c.0 Mon May 25 10:56:49 2020

  Size                      milliseconds
     K       Single Precision            Double Precision
     1     0.06     0.04     0.04     0.04     0.04     0.03
     2     0.09     0.07     0.07     0.10     0.10     0.10
     4     0.23     0.20     0.20     0.23     0.26     0.23
     8     0.50     0.44     0.46     0.52     0.50     0.50
    16     1.21     1.19     1.05     1.23     1.17     1.19
    32     2.36     2.23     2.18     3.33     3.32     3.29
    64     6.16     5.70     5.31    10.20    10.20    10.18
   128    16.39    15.69    15.69    24.35    24.45    24.48
   256    38.70    37.46    37.40    54.57    54.65    54.59
   512    83.83    80.96    81.40   119.71   118.70   119.27
  1024   182.08   176.05   176.97   268.43   259.16   259.30

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Mon May 25 10:56:52 2020

Pi 5 GCC 8

Pi 5 RPi FFT gcc 8 64 Bit Benchmark 1 Fri Aug 11 16:47:11 2023

  Size                      milliseconds                      Average Pi5/Pi4
     K       Single Precision            Double Precision         SP      DP
     1     0.02     0.02     0.02     0.02     0.02     0.02    2.20    2.51
     2     0.04     0.04     0.04     0.04     0.04     0.04    1.98    3.81
     4     0.09     0.09     0.09     0.09     0.09     0.09    2.64    4.71
     8     0.19     0.20     0.19     0.29     0.29     0.29    3.88    3.48
    16     0.56     0.56     0.56     0.65     0.67     0.78    3.35    3.82
    32     1.30     1.27     1.29     1.55     1.50     1.80    3.92    3.18
    64     3.18     3.00     2.99     4.16     3.90     3.91    2.85    8.17
   128     7.76     7.30     7.28    14.27    14.44    13.71    6.70   11.33
   256    23.23    21.27    21.40    99.92    94.38    94.97   13.34    4.04
   512   157.82   152.33   173.93   329.15   321.16   323.41    3.47    2.41
  1024   608.66   606.77   600.94  1069.84  1048.00  1049.41    2.27    1.79

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Fri Aug 11 16:47:19 2023

Pi 5 RPi FFT gcc 8 64 Bit Benchmark 3c.0 Fri Aug 11 16:48:27 2023

  Size                      milliseconds                      Average Pi5/Pi4
     K       Single Precision            Double Precision         SP      DP
     1     0.03     0.02     0.02     0.02     0.02     0.02    1.88    1.96
     2     0.05     0.04     0.04     0.04     0.04     0.04    1.93    2.61
     4     0.10     0.08     0.08     0.09     0.09     0.09    2.37    2.74
     8     0.21     0.18     0.18     0.23     0.21     0.21    2.43    2.37
    16     0.45     0.41     0.41     0.53     0.48     0.49    2.70    2.40
    32     1.16     0.90     0.93     1.22     1.07     1.06    2.27    2.97
    64     2.39     2.04     2.39     2.98     2.76     2.69    2.52    3.63
   128     5.26     4.82     4.86     9.92     9.90     9.86    3.20    2.47
   256    14.58    13.92    13.89    29.15    27.71    26.90    2.68    1.96
   512    42.03    39.73    39.84    72.71    72.32    71.70    2.02    1.65
  1024   101.56    99.35    98.31   176.62   171.45   175.48    1.79    1.50

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Fri Aug 11 16:48:29 2023

Pi 5 GCC 12

RPi FFT gcc 12 64 Bit Benchmark 1 Thu Sep 28 19:10:33 2023

  Size                      milliseconds                      Average GCC 12/8
     K       Single Precision            Double Precision         SP      DP
     1     0.02     0.02     0.02     0.02     0.02     0.02    1.15    1.02
     2     0.06     0.04     0.04     0.04     0.04     0.04    0.92    1.05
     4     0.08     0.08     0.08     0.08     0.08     0.08    1.09    1.05
     8     0.18     0.18     0.18     0.80     0.26     0.25    1.09    0.65
    16     0.55     0.62     0.61     0.78     0.62     0.68    0.95    1.01
    32     1.19     1.19     1.18     3.14     1.66     2.23    1.08    0.69
    64     2.90     2.87     3.12     4.14     3.83     4.62    1.03    0.95
   128     8.01     7.72     8.41    19.04    16.31    19.17    0.93    0.78
   256    28.65    29.22    30.38   142.81   143.44   144.91    0.75    0.67
   512   256.41   209.11   215.07   400.84   410.99   448.06    0.71    0.77
  1024   798.30   749.85   753.61  1073.95  1075.09  1051.38    0.79    0.99

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Thu Sep 28 19:10:41 2023

RPi FFT gcc 12 64 Bit Benchmark 3c.0 Thu Sep 28 19:13:51 2023

  Size                      milliseconds                      Average GCC 12/8
     K       Single Precision            Double Precision         SP      DP
     1     0.02     0.02     0.02     0.02     0.02     0.02    1.20    1.06
     2     0.04     0.04     0.04     0.04     0.04     0.04    1.04    1.06
     4     0.09     0.08     0.08     0.08     0.08     0.08    1.06    1.06
     8     0.19     0.18     0.18     0.20     0.19     0.19    1.06    1.10
    16     0.41     0.39     0.39     0.46     0.43     0.43    1.07    1.12
    32     0.88     0.85     0.86     1.01     0.96     0.96    1.15    1.14
    64     1.98     1.91     1.91     2.57     2.48     2.47    1.17    1.12
   128     5.65     4.68     4.63    10.10    10.04    10.06    1.00    0.98
   256    14.59    14.50    14.59    36.02    35.29    34.84    0.97    0.79
   512    55.50    54.91    55.79   100.99   102.62    99.96    0.73    0.71
  1024   143.39   142.49   143.22   231.27   228.44   229.17    0.70    0.76

  1024 Square Check  Maximum Noise  Average Noise
  SP   9.999520e-01   3.346482e-06   4.565234e-11
  DP   1.000000e+00   1.133294e-23   1.428110e-28

End at Thu Sep 28 19:13:53 2023
Code:
Pi 4 GCC 8

BusSpeed 64 Bit gcc 8 Mon May 25 22:13:11 2020

Reading Speed 4 Byte Words in MBytes/Second

 Memory   Inc32   Inc16    Inc8    Inc4    Inc2    Read
 KBytes   Words   Words   Words   Words   Words     All  Cache      Pi 5

     16    4898    5109    5626    5860    5879    9238  L1         L1
     32    1109    1389    2485    3804    5026    8435
     64     804    1030    2025    3285    4871    8312  L2 Shared
    128     737     951    1877    3130    4908    8556             L2
    256     732     953    1897    3147    4941    8617
    512     701     939    1766    2902    4601    8150
   1024     323     494     986    1807    3060    5553  RAM        L3 Shared
   4096     242     259     486     964    1932    3856             RAM
  16384     236     268     493     971    1939    3878
  65536     242     271     494     973    1942    3884

End of test Mon May 25 22:13:21 2020

Pi 5 GCC 8                                                P5/P4 Comparison

BusSpeed 64 Bit gcc 8 Fri Aug 11 16:46:13 2023

Reading Speed 4 Byte Words in MBytes/Second

 Memory   Inc32   Inc16    Inc8    Inc4    Inc2    Read   Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes   Words   Words   Words   Words   Words     All   Words  Words  Words  Words  Words    All  MP-bus

     16    8300    8413   15451   17849   18151   18721    1.69   1.65   2.75   3.05   3.09   2.03
     32    9159    9235   15509   17911   18132   18721    8.26   6.65   6.24   4.71   3.61   2.22
     64    7460    7644   13739   17008   17665   18593    9.28   7.42   6.78   5.18   3.63   2.24
    128    2375    4452    7168   11555   13968   18203    3.22   4.68   3.82   3.69   2.85   2.13
    256    2375    4425    7225   11540   13964   18243    3.24   4.64   3.81   3.67   2.83   2.12
    512    1784    2980    5758   10362   13685   18203    2.54   3.17   3.26   3.57   2.97   2.23
   1024    1225    2325    4639    9336   13467   18281    3.79   4.71   4.70   5.17   4.40   3.29
   4096     656    1375    2700    5120    9599   15984    2.71   5.31   5.56   5.31   4.97   4.15
  16384     579     864    1741    3502    7020   14015    2.45   3.22   3.53   3.61   3.62   3.61
  65536     604     796    1595    3195    6351   12699    2.50   2.94   3.23   3.28   3.27   3.27

End of test Fri Aug 11 16:46:22 2023

Pi 5 GCC 12                                               Pi 5 GCC 12/8 Comparison

BusSpeed 64 Bit gcc 12 Thu Sep 28 19:02:33 2023

Reading Speed 4 Byte Words in MBytes/Second

 Memory   Inc32   Inc16    Inc8    Inc4    Inc2    Read   Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes   Words   Words   Words   Words   Words     All   Words  Words  Words  Words  Words    All

     16    8493    8509   16377   17918   18170   18733    1.02   1.01   1.06   1.00   1.00   1.00
     32    9127    9295   16478   18023   18212   18740    1.00   1.01   1.06   1.01   1.00   1.00
     64    7530    7604   14030   17241   17877   18603    1.01   0.99   1.02   1.01   1.01   1.00
    128    2375    4189    7212   11566   13961   18230    1.00   0.94   1.01   1.00   1.00   1.00
    256    2358    4275    7265   11595   13985   18274    0.99   0.97   1.01   1.00   1.00   1.00
    512    1557    2879    5524   10229   13877   18231    0.87   0.97   0.96   0.99   1.01   1.00
   1024    1225    2339    4606    9318   13902   18271    1.00   1.01   0.99   1.00   1.03   1.00
   4096     780    1387    2672    5115    9407   16053    1.19   1.01   0.99   1.00   0.98   1.00
  16384     652     880    1763    3479    7034   13979    1.13   1.02   1.01   0.99   1.00   1.00
  65536     624     801    1605    3178    6416   12800    1.03   1.01   1.01   0.99   1.01   1.01
Code:
Pi 4 GCC 8

Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

Start of test Mon May 25 22:23:53 2020

Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]            x[m]=y[m]
KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
  Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

     8  15531   3999   3957  15576   4387   4358  11629   9313   9314
    16  15717   3992   3922  15770   4355   4377  11799   9444   9446
    32  12020   3818   3814  12043   4179   4198  11549   9496   9497
    64  12228   3816   3887  12220   4166   4195   8935   8506   8506
   128  12265   3869   3941  12157   4182   4206   8080   8193   8196
   256  12230   3873   3932  12073   4199   4216   8129   8224   8223
   512   9731   3832   3902   9709   4150   4171   8029   7845   7865
  1024   3772   3682   3769   3467   3887   3920   5478   5543   5378
  2048   1896   3463   3496   1886   3616   3612   2937   2945   2923
  4096   1924   3520   3528   1933   3651   3394   2752   2796   2785
  8192   1996   3523   3555   1988   3643   3630   2668   2661   2663

End of test Mon May 25 22:24:10 2020

Pi 5 GCC 8

Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

Start of test Fri Aug 11 16:34:06 2023

Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]            x[m]=y[m]
KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
  Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

     8  50862   6851   6746  50686   7193   7490  37629  18595  25168
    16  51032   6820   6717  51024   7164   7468  38002  18888  24946
    32  49985   6814   6676  50568   7150   7446  37609  18972  25259
    64  50868   6857   6656  50864   7168   7411  37799  19114  25426
   128  32618   6797   6670  32666   7142   7278  35466  19143  25439
   256  32540   6788   6640  32744   7183   7278  34821  19144  25360
   512  26949   6786   6668  30112   7155   7246  33493  14598  16816
  1024  25094   6719   6645  19272   6821   7206  21805  17292  22671
  2048  20586   6365   6586  19261   6887   7172   4740   4662  13673
  4096   5004   6680   6710   4963   6776   6249   7938   8990   8797
  8192   3229   5589   4662   3205   6496   6573   6654   6719   4613

End of test Fri Aug 11 16:34:22 2023

P5/P4 Comparison

Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]            x[m]=y[m]
KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
  Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

     8   3.27   1.71   1.70   3.25   1.64   1.72   3.24   2.00   2.70
    16   3.25   1.71   1.71   3.24   1.65   1.71   3.22   2.00   2.64
    32   4.16   1.78   1.75   4.20   1.71   1.77   3.26   2.00   2.66
    64   4.16   1.80   1.71   4.16   1.72   1.77   4.23   2.25   2.99
   128   2.66   1.76   1.69   2.69   1.71   1.73   4.39   2.34   3.10
   256   2.66   1.75   1.69   2.71   1.71   1.73   4.28   2.33   3.08
   512   2.77   1.77   1.71   3.10   1.72   1.74   4.17   1.86   2.14
  1024   6.65   1.82   1.76   5.56   1.75   1.84   3.98   3.12   4.22
  2048  10.86   1.84   1.88  10.21   1.90   1.99   1.61   1.58   4.68
  4096   2.60   1.90   1.90   2.57   1.86   1.84   2.88   3.22   3.16
  8192   1.62   1.59   1.31   1.61   1.78   1.81   2.49   2.52   1.73

Pi 5 GCC 12

Memory Reading Speed Test 64 Bit gcc 12 by Roy Longbottom

Start of test Thu Sep 28 18:54:28 2023

Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]            x[m]=y[m]
KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
  Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

     8  54902  61264  65610  55241  65554  63848  37768  25475  25486
    16  54803  60539  64671  55169  64700  64750  38078  24891  24891
    32  51859  60967  64278  52558  65247  65275  37520  25234  25234
    64  52597  61169  65523  52485  65514  65523  37945  25408  25402
   128  33580  60278  63742  33647  63692  62897  37218  25370  25457
   256  33724  60317  63873  33711  63840  63865  35555  25371  25375
   512  33522  59194  63298  33502  63259  63175  35909  25459  25451
  1024  32078  57946  60718  31576  60680  59199  26110  22319  23059
  2048  29249  55376  57648  29028  57558  57290  16245  18242  19514
  4096   4508  11981  11906   4864  11894   9313  10254  10529  10668
  8192   3175   6507   6150   3178   6441   6499   6678   6904   6364

Max MFLOPS   6862  15316

End of test Thu Sep 28 18:54:43 2023

Pi 5 GCC 12/8

Memory  x[m]=x[m]+s*y[m] Int+     x[m]=x[m]+y[m]            x[m]=y[m]
KBytes   Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
  Used   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

     8   1.08   8.94   9.73   1.09   9.11   8.52   1.00   1.37   1.01
    16   1.07   8.88   9.63   1.08   9.03   8.67   1.00   1.32   1.00
    32   1.04   8.95   9.63   1.04   9.13   8.77   1.00   1.33   1.00
    64   1.03   8.92   9.84   1.03   9.14   8.84   1.00   1.33   1.00
   128   1.03   8.87   9.56   1.03   8.92   8.64   1.05   1.33   1.00
   256   1.04   8.89   9.62   1.03   8.89   8.78   1.02   1.33   1.00
   512   1.24   8.72   9.49   1.11   8.84   8.72   1.07   1.74   1.51
  1024   1.28   8.62   9.14   1.64   8.90   8.22   1.20   1.29   1.02
  2048   1.42   8.70   8.75   1.51   8.36   7.99   3.43   3.91   1.43
  4096   0.90   1.79   1.77   0.98   1.76   1.49   1.29   1.17   1.21
  8192   0.98   1.16   1.32   0.99   0.99   0.99   1.00   1.03   1.38
The NEON Speed benchmark carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. The Norm functions use code as generated by the compiler, while the NEON functions are written using intrinsic functions.
The initial GCC 8 test functions produced the same irregular results as MemSpeed, with the “Normal” Float and Int calculations appearing to read only RAM-based data. Performance of the NEON code indicated that the Pi 5 was typically 2.5 times faster than the Pi 4 using cache-based data, and 1.5 times faster from RAM. Exceptions were gains of up to 7.9 times using L3 cache and nearly 4.8 times from lower level caches.
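As a worked check of the comparison columns, the sketch below divides Pi 5 speeds by Pi 4 speeds for three Float v=v+s*v NEON entries copied from the GCC 8 results in this post. The dictionary names and the `gain` helper are illustrative only and are not part of the benchmark programs.

```python
# Float v=v+s*v NEON speeds in MB/second, copied from the GCC 8 results
# in this post and keyed by memory used in KBytes.
pi4_neon = {256: 11565, 1024: 3331, 16384: 1821}
pi5_neon = {256: 27491, 1024: 26340, 16384: 2850}


def gain(new_speed, old_speed):
    """Return the speed ratio rounded to two decimals, as in the tables."""
    return round(new_speed / old_speed, 2)


# Pi 5/Pi 4 ratio for each memory size
gains = {kb: gain(pi5_neon[kb], pi4_neon[kb]) for kb in pi4_neon}

for kb, ratio in gains.items():
    print(f"{kb:6d} KBytes  Pi 5/Pi 4 = {ratio:.2f}")
```

The 256 KByte entry is an example of the typical cache-based gain, the 1024 KByte entry is the 7.9 times exception and the 16384 KByte entry is representative of RAM.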
The GCC 12 compiler produced acceptable “Normal” performance on the Pi 5, reflected by gains of up to just over ten times relative to the GCC 8 results. The compiled code is also shown to run faster than the NEON intrinsic functions. Many of the latter were around 20% faster than with GCC 8, but some were slower. The maximum floating point speed demonstrated was nearly 17 GFLOPS.
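The Max MFLOPS figures can be related to the MB/second columns. In the results shown in this post, the reported maximum matches the highest Float v=v+s*v speed divided by 4 bytes per single precision word, which is consistent with one floating point operation per word read. That relationship is inferred from the printed numbers, not taken from the benchmark source, so the sketch below should be read as an assumption. `mflops_from_mbps` is a hypothetical helper name.

```python
BYTES_PER_SP_WORD = 4  # 32 bit floating point


def mflops_from_mbps(mb_per_second, bytes_per_word=BYTES_PER_SP_WORD):
    """Assumed relationship: one floating point operation per word read."""
    return mb_per_second / bytes_per_word


# Highest Float v=v+s*v speed (MB/second) from each NEON Speed run in this
# post, paired with the Max MFLOPS value printed by the same run.
runs = {
    "Pi 4 GCC 8": (14987, 3747),
    "Pi 5 GCC 8": (47104, 11776),
    "Pi 5 GCC 12": (67812, 16953),
}

for name, (peak_mbps, reported) in runs.items():
    estimate = mflops_from_mbps(peak_mbps)
    print(f"{name:12s} estimated {estimate:8.0f}  reported {reported}")
```

On this basis the GCC 12 peak of 67812 MB/second corresponds to 16953 MFLOPS, the "nearly 17 GFLOPS" quoted above.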
Code:
Pi 4 GCC 8

NEON Speed 64 Bit gcc 8 Mon May 25 22:21:51 2020

Vector Reading Speed in MBytes/Second

Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v
KBytes   Norm   Neon   Norm   Neon  Float    Int

    16   3629  14987   3925  13643  14457  16642
    32   3475  10933   3821   9970  11029  11055
    64   3447  11749   3845  11098  11802  12079
   128   3332  11392   3912  10813  11430  11513
   256   3325  11565   3926  10981  11598  11699
   512   3313  10553   3917  10269  10755  10740
  1024   3239   3331   3737   3291   3302   3321
  4096   2987   1888   3331   1777   1881   1878
 16384   3150   1821   3347   1814   1812   1834
 65536   2747   1954   3132   2017   1904   2021

Max MFLOPS 3747

End of test Mon May 25 22:22:11 2020

Pi 5 GCC 8                                       P5/P4 Comparison

NEON Speed 64 Bit gcc 8 Fri Aug 11 16:44:52 2023

Vector Reading Speed in MBytes/Second

Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v Float v=v+s*v   Int v=v+v+s    Neon v=v+v
KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

    16   6745  46851   6968  44490  46849  46847   1.86   3.13   1.78   3.26   3.24   2.81
    32   6727  47104   6947  44618  47061  47056   1.94   4.31   1.82   4.48   4.27   4.26
    64   6703  46642   6962  44166  47040  46955   1.94   3.97   1.81   3.98   3.99   3.89
   128   6587  27383   6840  27199  27404  27398   1.98   2.40   1.75   2.52   2.40   2.38
   256   6579  27491   6857  27299  27509  27509   1.98   2.38   1.75   2.49   2.37   2.35
   512   6571  27433   6862  26599  24237  26163   1.98   2.60   1.75   2.59   2.25   2.44
  1024   6531  26340   6756  25226  24597  24527   2.02   7.91   1.81   7.67   7.45   7.39
  4096   6414   9410   6505   9986   9474   8835   2.15   4.98   1.95   5.62   5.04   4.70
 16384   5690   2850   5501   2830   2865   2488   1.81   1.57   1.64   1.56   1.58   1.36
 65536   4837   2534   4736   2458   2401   2450   1.76   1.30   1.51   1.22   1.26   1.21

Max MFLOPS 11776

End of test Fri Aug 11 16:45:12 2023

Pi 5 GCC 12                                      Pi 5 GCC 12/8

NEON Speed 64 Bit gcc 12 Thu Sep 28 18:57:35

Vector Reading Speed in MBytes/Second

Memory Float v=v+s*v   Int v=v+v+s    Neon v=v+v Float v=v+s*v   Int v=v+v+s    Neon v=v+v
KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

    16  67042  45164  67037  45358  54228  54166   9.94   0.96   9.62   1.02   1.16   1.16
    32  67631  45190  67621  45415  53833  53675  10.05   0.96   9.73   1.02   1.14   1.14
    64  67812  44856  67491  45171  52338  51321  10.12   0.96   9.69   1.02   1.11   1.09
   128  62779  33147  64360  33074  33619  33458   9.53   1.21   9.41   1.22   1.23   1.22
   256  64352  33405  64803  33187  33699  33719   9.78   1.22   9.45   1.22   1.23   1.23
   512  61159  33171  61798  32263  33178  28319   9.31   1.21   9.01   1.21   1.37   1.08
  1024  58937  32149  57732  31639  32219  32108   9.02   1.22   8.55   1.25   1.31   1.31
  4096   9215   2639   7168   3800   3823   3776   1.44   0.28   1.10   0.38   0.40   0.43
 16384   5546   2830   5592   2772   2753   2503   0.97   0.99   1.02   0.98   0.96   1.01
 65536   4633   2445   4196   1922   2196   2294   0.96   0.96   0.89   0.78   0.91   0.94

Max MFLOPS 16953
Statistics: Posted by RoyLongbottom — Tue Jan 16, 2024 2:32 pm