Channel: Raspberry Pi Forums

General discussion • Single Core Benchmarks

Whetstone Benchmark - whetstonePi64g8 and g12
Vector Versions - Whetv64SPg8 and g12, whetvDP64g8 and g12


This has a number of simple programming loops, with the overall MWIPS rating dependent on floating point calculations, and with no accessing of data in L2 cache or RAM.

Results are provided for the original scalar single precision (SP) version, along with those for single and double precision (DP) varieties of the vector version, originally written for use on the first Cray 1 supercomputer delivered to the UK. For more information see Pi 5 The Vector Processor later.

Examination of the time used by the different tests shows that this can be dominated by those executing functions such as COS and EXP.

Pi 5/Pi 4 comparisons are provided for the gcc 8 scalar versions, indicating performance gains of between 2.44 and 2.59 times for the three (MFLOPS) floating point tests and 2.79 times on overall MWIPS. Performance of the Pi 5 gcc 12 compilations was essentially identical to that from gcc 8.
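As a cross-check on how the Pi5/4 column is derived, the ratios can be recomputed from the MFLOPS and MWIPS figures in the gcc 8 scalar results. This is only a sketch in Python, not part of the benchmark programs, and the names used are illustrative. The N6 ratio works out at about 2.59, in line with the range quoted above, so the 3.59 printed against N6 in the Pi 5 table looks like a typing slip.

```python
# MFLOPS figures copied from the gcc 8 scalar Whetstone results
# (N1, N2 and N6 are the three floating point tests).
pi4_mflops = {"N1": 524.251, "N2": 534.904, "N6": 397.676}
pi5_mflops = {"N1": 1279.196, "N2": 1364.748, "N6": 1027.998}
pi4_mwips, pi5_mwips = 2085.311, 5822.922


def gain(new, old):
    """Performance gain as a simple ratio, rounded as in the tables."""
    return round(new / old, 2)


mflops_gains = {k: gain(pi5_mflops[k], pi4_mflops[k]) for k in pi4_mflops}
mwips_gain = gain(pi5_mwips, pi4_mwips)

print(mflops_gains)
print(mwips_gain)
```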

Pi 5/Pi 4 vector SP and DP gcc 8 performance gains were similar, at between 2.34 and 3.10 times for MFLOPS and around 2.3 times for MWIPS. Pi 5 SP Vector/Scalar gains are also provided, giving 5.40 to 7.86 times for MFLOPS but only 1.88 times for overall MWIPS, deflated by the COS/EXP tests. Maximum SP scalar speed was 1.36 GFLOPS, with vectors at 8.08 GFLOPS SP and 4.0 GFLOPS DP.

Code:

Pi 4 GCC 8

Whetstone Single Precision C Benchmark  64 Bit gcc 8R, Fri May 22 10:48:53 2020

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.12475013732910156       524.251               0.076
N2 floating point      -1.12274742126464844       534.904               0.524
N3 if then else         1.00000000000000000                2978.570     0.073
N4 fixed point         12.00000000000000000                2493.078     0.264
N5 sin,cos etc.         0.49911010265350342                  57.643     3.012
N6 floating point       0.99999982118606567       397.676               2.831
N7 assignments          3.00000000000000000                 996.647     0.387
N8 exp,sqrt etc.        0.75110864639282227                  27.327     2.841

MWIPS                                            2085.311              10.008

Pi 5 GCC 8

Whetstone Single Precision C Benchmark  64 Bit gcc 8R, Thu Aug 10 15:44:50 2023

Loop content                   Result              MFLOPS      MOPS   Seconds  G8 Pi5/4

N1 floating point      -1.12475013732910156      1279.196               0.087    2.44
N2 floating point      -1.12274742126464844      1364.748               0.573    2.55
N3 if then else         1.00000000000000000                7190.834     0.084    2.41
N4 fixed point         12.00000000000000000                5995.954     0.306    2.41
N5 sin,cos etc.         0.49911010265350342                 154.725     3.131    2.68
N6 floating point       0.99999982118606567      1027.998               3.055    3.59
N7 assignments          3.00000000000000000                2398.668     0.449    2.41
N8 exp,sqrt etc.        0.75110864639282227                  93.596     2.314    3.43

MWIPS                                            5822.922               9.998    2.79

Pi 5 GCC 12

Whetstone Single Precision C Benchmark  64 Bit gcc 12, Thu Sep 28 11:46:43 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.12475013732910156      1279.140               0.088
N2 floating point      -1.12274742126464844      1364.558               0.575
N3 if then else         1.00000000000000000                3594.939     0.168
N4 fixed point         12.00000000000000000                5994.963     0.307
N5 sin,cos etc.         0.49911010265350342                 157.996     3.075
N6 floating point       0.99999982118606567      1027.940               3.064
N7 assignments          3.00000000000000000                2398.054     0.450
N8 exp,sqrt etc.        0.75110864639282227                  95.590     2.273

MWIPS                                            5839.767              10.000

#################### Vector Whetstone Vector Length 258 ####################

Pi 4 GCC 8 SP

Whetstone Vector Benchmark 64 Bit Single Precision, Wed Aug 30 10:41:57 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.13316142559051514      2338.496               0.391
N2 floating point      -1.13312149047851562      1651.957               3.877
N3 if then else         1.00000000000000000                4427.445     1.114
N4 fixed point         12.00000000000000000                1733.458     8.659
N5 sin,cos etc.         0.49998238682746887                  74.913    52.923
N6 floating point       0.99999982118606567      2573.346               9.988
N7 assignments          3.00000000000000000               18596.381     0.474
N8 exp,sqrt etc.        0.75002217292785645                  78.503    22.581

MWIPS                                            4764.843             100.007

Note different single and double precision numeric results.

Code:

Pi 5 GCC 8 SP

Whetstone Vector Benchmark 64 Bit Single Precision, Sat Oct  7 10:15:16 2023

Loop content                   Result              MFLOPS      MOPS   Seconds  G8 Pi5/4

N1 floating point      -1.13316142559051514      7111.676               0.290    3.04
N2 floating point      -1.13312149047851562      3857.446               3.746    2.34
N3 if then else         1.00000000000000000               10141.446     1.097    2.29
N4 fixed point         12.00000000000000000                2396.242    14.135    1.38
N5 sin,cos etc.         0.49998238682746887                 177.032    50.534    2.36
N6 floating point       0.99999982118606567      7986.011               7.263    3.10
N7 assignments          3.00000000000000000               42584.598     0.467    2.29
N8 exp,sqrt etc.        0.75002217292785645                 178.102    22.459    2.27

MWIPS                                           10753.538              99.990    2.26

Pi 5 GCC 12 SP

Whetstone Vector Benchmark gcc 12 64 Bit Single Precision, Sat Oct  7 10:46:30 2023

                                                                                      Vector/
                                                                                 Pi 5 Scalar
Loop content                   Result              MFLOPS      MOPS   Seconds GCC12/8 G12 Pi5

N1 floating point      -1.13316142559051514      7393.282               0.286    1.04    5.78
N2 floating point      -1.13312149047851562      7364.751               2.009    1.91    5.40
N3 if then else         1.00000000000000000               14169.053     0.804    1.40    3.94
N4 fixed point         12.00000000000000000                2398.742    14.457    1.00    0.40
N5 sin,cos etc.         0.49998238682746887                 177.260    51.673    1.00    1.12
N6 floating point       0.99999982118606567      8078.622               7.351    1.91    7.86
N7 assignments          3.00000000000000000               26419.105     0.770    0.62   11.02
N8 exp,sqrt etc.        0.75002217292785645                 178.359    22.961    1.00    1.87

MWIPS                                           10974.928             100.311    1.02    1.88

Pi 4 GCC 8 DP

Whetstone Vector Benchmark 64 Bit Double Precision, Wed Aug 30 10:48:05 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.13314558088707962      1146.624               0.709
N2 floating point      -1.13310306766606850      1094.230               5.203
N3 if then else         1.00000000000000000                4405.221     0.995
N4 fixed point         12.00000000000000000                1730.427     7.711
N5 sin,cos etc.         0.49998080312723675                  73.193    48.149
N6 floating point       0.99999988868927014      1294.129              17.655
N7 assignments          3.00000000000000000                9967.123     0.785
N8 exp,sqrt etc.        0.75002006515491115                  83.614    18.845

MWIPS                                            4233.571             100.052

Pi 5 GCC 8 DP

Whetstone Vector Benchmark 64 Bit Double Precision, Sat Oct  7 10:18:59 2023

Loop content                   Result              MFLOPS      MOPS   Seconds  G8 Pi5/4

N1 floating point      -1.13314558088707962      3499.307               0.535    3.05
N2 floating point      -1.13310306766606850      2793.370               4.688    2.55
N3 if then else         1.00000000000000000               10158.471     0.993    2.31
N4 fixed point         12.00000000000000000                2396.163    12.809    1.38
N5 sin,cos etc.         0.49998080312723675                 171.834    47.176    2.35
N6 floating point       0.99999988868927014      3994.760              13.156    3.09
N7 assignments          3.00000000000000000               21713.754     0.829    2.18
N8 exp,sqrt etc.        0.75002006515491115                 184.857    19.607    2.21

MWIPS                                            9763.593              99.793    2.31

Pi 5 GCC 12 DP

Whetstone Vector Benchmark gcc 12 64 Bit Double Precision, Sat Oct  7 10:50:40 2023

Loop content                   Result              MFLOPS      MOPS   Seconds

N1 floating point      -1.13314558088707962      3602.841               0.523
N2 floating point      -1.13310306766606739      3619.564               3.647
N3 if then else         1.00000000000000000               14167.623     0.718
N4 fixed point         12.00000000000000000                2398.696    12.898
N5 sin,cos etc.         0.49998080312723675                 172.068    47.491
N6 floating point       0.99999988868927014      3997.801              13.252
N7 assignments          3.00000000000000000               13172.392     1.378
N8 exp,sqrt etc.        0.75002006515491115                 182.557    20.014

MWIPS                                            9829.517              99.920

Dhrystone Benchmark - dhrystonePi64g8 and g12

This is the most popular ARM integer benchmark, often subject to over-optimisation, rated in VAX MIPS, also known as DMIPS.

Pi 5 GCC 8 gain over Pi 4 was 2.37 times. There was a slight gain using GCC 12, where the DMIPS/MHz ratio reached 8.57.
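For reference, the VAX MIPS rating is conventionally the Dhrystones per second score divided by 1757, the score of the VAX 11/780. The Python sketch below reproduces the ratings in the results and the DMIPS/MHz ratio. It assumes the Pi 5 was running at its standard 2400 MHz; the clock speed is not given in the log, so treat that value, and the function names, as assumptions.

```python
VAX_11_780_DHRYSTONES_PER_SECOND = 1757  # conventional 1 MIPS reference score
PI5_CLOCK_MHZ = 2400                     # assumed: standard Pi 5 clock, not in the log


def vax_mips(dhrystones_per_second):
    """Convert a Dhrystones per second score to VAX MIPS (DMIPS)."""
    return dhrystones_per_second / VAX_11_780_DHRYSTONES_PER_SECOND


def dmips_per_mhz(dmips, clock_mhz):
    """Normalise a DMIPS rating by CPU clock speed."""
    return dmips / clock_mhz


pi4_gcc8 = vax_mips(13729822)    # Pi 4 GCC 8 score from the results
pi5_gcc12 = vax_mips(36120831)   # Pi 5 GCC 12 score from the results
pi5_ratio = dmips_per_mhz(pi5_gcc12, PI5_CLOCK_MHZ)

print(round(pi4_gcc8, 2), round(pi5_gcc12, 2), round(pi5_ratio, 2))
```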

Code:

 Pi 4 GCC 8

 Dhrystone Benchmark 2.1 64 Bit gcc8, Mon May 25 22:16:05 2020

 Nanoseconds one Dhrystone run:        72.83
 Dhrystones per Second:             13729822
 VAX MIPS rating =                   7814.36

 Numeric results were correct

 Pi 5 GCC 8

 Dhrystone Benchmark 2.1 64 Bit gcc8, Thu Aug 10 15:49:13 2023

 Nanoseconds one Dhrystone run:        30.69
 Dhrystones per Second:             32578833
 VAX MIPS rating =                  18542.31   Pi 5/Pi 4 Gain 2.37

 Numeric results were correct

 Pi 5 GCC 12

 Dhrystone Benchmark 2.1 64 Bit gcc12, Thu Sep 28 11:44:33 2023

 Nanoseconds one Dhrystone run:        27.68
 Dhrystones per Second:             36120831
 VAX MIPS rating =                  20558.24   GCC 12/8 Gain 1.11

 Numeric results were correct

Linpack 100 Benchmark MFLOPS - linpackPi64g8 and g12, linpackPi64gSP, linpackPi64NEONig8

This original Linpack benchmark executes double precision arithmetic. I introduced two single precision versions, one using NEON functions to include vector processing.
Performance of this benchmark can vary, owing to its dependence on data placement in L2 cache.

Unlike when the Pi 5 was introduced, later compilers produced code as fast as the NEON version. Now, with GCC 12, the NEON variety was slower and the others produced a small gain over GCC 8 compilations. Comparisons for the latter indicated Pi 5 gains of between 3.16 and 3.54 times over the three versions. Maximum Pi 5 GCC 8 speeds were 6.60 GFLOPS SP and 3.93 GFLOPS DP.

Code:

 Pi 4 GCC 8

 Linpack Double Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Mon May 25 22:05:47 2020
 Speed    1111.51 MFLOPS
 Numeric results were as expected

 Linpack Single Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Mon May 25 22:09:12 2020
 Speed    1930.27 MFLOPS
 Numeric results were as expected

 Linpack Single Precision Benchmark n @ 100
 NEON Intrinsics 64 bit gcc 8, Mon May 25 22:11:15 2020
 Speed    2030.95 MFLOPS
 Numeric results were as expected

------------------------------------------------------

 Pi 5 GCC 8                                                   Pi5/Pi4

 Linpack Double Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Thu Aug 10 16:12:47 2023
 Speed    3933.38 MFLOPS                                        3.54
 Numeric results were as expected

 Linpack Single Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 8, Thu Aug 10 16:04:18 2023
 Speed    6106.68 MFLOPS                                        3.16
 Numeric results were as expected

 Linpack Single Precision Benchmark n @ 100
 NEON Intrinsics 64 bit gcc 8, Thu Aug 10 16:13:52 2023
 Speed    6603.58 MFLOPS                                        3.25
 Numeric results were as expected

------------------------------------------------------

 Pi 5 GCC 12                                                  GCC 12/8

 Linpack Double Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 12, Thu Sep 28 15:58:07 2023
 Speed    4136.39 MFLOPS                                        1.05
 Numeric results were as expected

 Linpack Single Precision Unrolled Benchmark n @ 100
 Optimisation 64 Bit gcc 12, Thu Sep 28 16:04:19 2023
 Speed    6472.77 MFLOPS                                        1.06
 Numeric results were as expected

 Linpack Single Precision Benchmark n @ 100
 NEON Intrinsics 64 bit gcc 12, Thu Sep 28 15:49:56 2023
 Speed    5665.39 MFLOPS                                        0.86
 Numeric results were as expected

 But 4 needed changing in program, via #define GCC12ARM64N, to avoid unnecessary error reports.

Livermore Loops Benchmark MFLOPS - liverloopsPi64g8 and g12

This benchmark measures performance of 24 double precision kernels, initially used to select the latest supercomputer. The official average is the geometric mean, where the Cray 1 supercomputer was rated at 11.9 MFLOPS. Following are MFLOPS for the individual kernels, followed by overall scores.
Although each kernel is executed for a relatively long time, performance of some can be inconsistent.
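To illustrate how the three kinds of average differ, the Python sketch below applies the standard library functions to the 24 Pi 5 GCC 8 kernel speeds listed in the results. The values computed this way need not match the Overall Ratings line exactly (the arithmetic mean here comes to about 2885, against 2873.9 reported), which suggests the program derives its ratings from more measurements than the 24 shown; that is an inference, not something stated in the log.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Pi 5 GCC 8 MFLOPS for the 24 loops, copied from the results
pi5_gcc8_mflops = [
    7423.6, 2147.9, 2356.6, 2472.9, 911.5, 1871.0,
    9872.3, 5317.7, 5162.9, 2125.8, 1173.2, 2672.0,
    709.1, 1108.7, 2966.6, 1598.5, 1761.3, 5526.8,
    1190.0, 956.0, 5425.1, 1489.5, 2147.9, 858.2,
]

maximum = max(pi5_gcc8_mflops)
average = mean(pi5_gcc8_mflops)
geomean = geometric_mean(pi5_gcc8_mflops)  # the official rating style
harmean = harmonic_mean(pi5_gcc8_mflops)
minimum = min(pi5_gcc8_mflops)

print(f"Maximum {maximum:.1f}  Average {average:.1f}  "
      f"Geomean {geomean:.1f}  Harmean {harmean:.1f}  Minimum {minimum:.1f}")
```

The geometric mean is less influenced by one or two very fast kernels than the arithmetic mean, which is presumably why it was chosen as the official figure.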

Pi 5 GCC 8 maximum speed was 9.87 DP GFLOPS, with gains over the Pi 4 of between 2.14 and 4.65 times over the 24 loops.

Maximum performance via GCC 12 was 10.57 DP GFLOPS, with scores for the individual loops mostly similar to those from GCC 8.

Code:

 Pi 4 GCC 8

 Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Mon May 25 10:39:10 2020

 MFLOPS for 24 loops
 2108.4  936.3  959.9  965.1  382.5  808.6 2312.9 2488.4 2065.7  668.7  500.3  980.7
  180.7  404.8  815.0  643.8  726.8 1189.6  449.8  397.2 1716.0  366.9  817.7  312.7

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  2616.7   959.8   766.7   613.0   169.7

 Numeric results were as expected

 Pi 5 GCC 8

 Livermore Loops Benchmark 64 Bit gcc 8 via C/C++ Thu Aug 10 16:14:33 2023

 MFLOPS for 24 loops
 7423.6 2147.9 2356.6 2472.9  911.5 1871.0 9872.3 5317.7 5162.9 2125.8 1173.2 2672.0
  709.1 1108.7 2966.6 1598.5 1761.3 5526.8 1190.0  956.0 5425.1 1489.5 2147.9  858.2

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
  9872.3  2873.9  2208.3  1763.4   646.6

 Numeric results were as expected

-----------------------------------------------------------------------------------

 GCC 8 Pi5/Pi4 Performance Ratios

 For 24 loops
   3.52   2.29   2.46   2.56   2.38   2.31   4.27   2.14   2.50   3.18   2.34   2.72
   3.92   2.74   3.64   2.48   2.42   4.65   2.65   2.41   3.16   4.06   2.63   2.74

   Min    2.14   Max    4.65

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
    3.77    2.99    2.88    2.88    3.81

-----------------------------------------------------------------------------------

 Pi 5 GCC 12

 Livermore Loops Benchmark 64 Bit gcc 12 via C/C++ Thu Sep 28 16:38:37 2023

 MFLOPS for 24 loops
 7833.8 2404.6 2377.2 2346.8  913.0 1857.1  10577 5350.6 5109.2 2117.4 1186.0 2351.4
  760.0 1121.2 3103.4 1597.7 1776.1 5455.9 1197.2 2490.5 5657.5 1855.7 2139.8  780.4

 Overall Ratings
 Maximum Average Geomean Harmean Minimum
 10576.9  2964.4  2308.1  1870.7   733.9

 Numeric results were as expected via #define GCC12ARMPI

Fast Fourier Transforms Benchmarks - fft1Pi64g, fft3cPi64g8 and g12

This is a real application provided by my collaborator at Compuserve Forum. There are two benchmarks. The first one is the original C program. The second is an optimised version, originally using my x86 assembly code, but translated back into C code, making use of the partitioning and (my) arrangement to optimise for burst reading from RAM. Three measurements use both single and double precision data, calculating FFT sizes between 1K and 1024K, with data from caches and RAM. Note that steps in performance levels occur at data size changes between caches, then to RAM.

Comparisons of averages of the three runs are provided. Those for FFT1 demonstrate the clear but varying advantage of the Pi 5 over the Pi 4, depending on the source of the data, with that from L3 cache providing gains of up to 13.34 times, and up to 4.71 times involving the larger L2 cache. Most other gains are in the two to four times range. With the faster, CPU speed limited FFT3c, gains were mainly between 2 and 3 times. GCC 12 over GCC 8 comparisons indicate a slight advantage of the former using data from caches, but with the roles reversed when dealing with RAM data transfers.
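The comparison columns can be reproduced by averaging the three timings at each size and dividing the Pi 4 average by the Pi 5 average (times are in milliseconds, so lower is better). The Python sketch below does this for the FFT1 single precision 256K entry. The function name is illustrative, and at the smallest sizes the printed two decimal place timings are too coarse to reproduce the table ratios exactly.

```python
def average_gain(old_times_ms, new_times_ms):
    """Ratio of average run times; above 1.0 means the new system is faster."""
    old_avg = sum(old_times_ms) / len(old_times_ms)
    new_avg = sum(new_times_ms) / len(new_times_ms)
    return old_avg / new_avg


# FFT1, single precision, 256K size, three runs each (from the results)
pi4_sp_256k = [295.55, 280.43, 303.20]
pi5_sp_256k = [23.23, 21.27, 21.40]

sp_256k_gain = average_gain(pi4_sp_256k, pi5_sp_256k)
print(round(sp_256k_gain, 2))
```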

Code:

 Pi 4 GCC 8

  Pi 4 RPi FFT gcc 8 64 Bit Benchmark 1 Mon May 25 10:54:42 2020

   Size                    milliseconds
      K      Single Precision        Double Precision
      1    0.05    0.04    0.04    0.04    0.04    0.05
      2    0.08    0.08    0.08    0.15    0.14    0.14
      4    0.23    0.23    0.23    0.39    0.38    0.44
      8    0.73    0.80    0.70    0.97    1.04    0.97
     16    1.98    1.87    1.79    2.66    2.52    2.83
     32    4.92    4.92    5.29    5.67    4.92    4.89
     64    8.80    8.69    8.67   32.21   32.23   33.31
    128   49.82   49.79   50.17  161.36  159.61  159.39
    256  295.55  280.43  303.20  411.97  415.90  340.34
    512  506.01  601.29  572.36  781.10  779.05  782.21
   1024 1375.42 1377.64 1375.77 1898.28 1876.88 1896.22

       1024 Square Check Maximum Noise Average Noise
       SP   9.999520e-01  3.346482e-06  4.565234e-11
       DP   1.000000e+00  1.133294e-23  1.428110e-28

              End at Mon May 25 10:55:00 2020

  Pi 4  RPi FFT gcc 8 64 Bit Benchmark 3c.0 Mon May 25 10:56:49 2020

   Size                    milliseconds
      K      Single Precision        Double Precision
      1    0.06    0.04    0.04    0.04    0.04    0.03
      2    0.09    0.07    0.07    0.10    0.10    0.10
      4    0.23    0.20    0.20    0.23    0.26    0.23
      8    0.50    0.44    0.46    0.52    0.50    0.50
     16    1.21    1.19    1.05    1.23    1.17    1.19
     32    2.36    2.23    2.18    3.33    3.32    3.29
     64    6.16    5.70    5.31   10.20   10.20   10.18
    128   16.39   15.69   15.69   24.35   24.45   24.48
    256   38.70   37.46   37.40   54.57   54.65   54.59
    512   83.83   80.96   81.40  119.71  118.70  119.27
   1024  182.08  176.05  176.97  268.43  259.16  259.30

       1024 Square Check Maximum Noise Average Noise
       SP   9.999520e-01  3.346482e-06  4.565234e-11
       DP   1.000000e+00  1.133294e-23  1.428110e-28

              End at Mon May 25 10:56:52 2020

 Pi 5 GCC 8

  Pi 5 RPi FFT gcc 8 64 Bit Benchmark 1 Fri Aug 11 16:47:11 2023

   Size                    milliseconds                  Average Pi5/Pi4
      K      Single Precision        Double Precision         SP      DP
      1    0.02    0.02    0.02    0.02    0.02    0.02     2.20    2.51
      2    0.04    0.04    0.04    0.04    0.04    0.04     1.98    3.81
      4    0.09    0.09    0.09    0.09    0.09    0.09     2.64    4.71
      8    0.19    0.20    0.19    0.29    0.29    0.29     3.88    3.48
     16    0.56    0.56    0.56    0.65    0.67    0.78     3.35    3.82
     32    1.30    1.27    1.29    1.55    1.50    1.80     3.92    3.18
     64    3.18    3.00    2.99    4.16    3.90    3.91     2.85    8.17
    128    7.76    7.30    7.28   14.27   14.44   13.71     6.70   11.33
    256   23.23   21.27   21.40   99.92   94.38   94.97    13.34    4.04
    512  157.82  152.33  173.93  329.15  321.16  323.41     3.47    2.41
   1024  608.66  606.77  600.94 1069.84 1048.00 1049.41     2.27    1.79

       1024 Square Check Maximum Noise Average Noise
       SP   9.999520e-01  3.346482e-06  4.565234e-11
       DP   1.000000e+00  1.133294e-23  1.428110e-28

              End at Fri Aug 11 16:47:19 2023

  Pi 5 RPi FFT gcc 8 64 Bit Benchmark 3c.0 Fri Aug 11 16:48:27 2023

   Size                    milliseconds                  Average Pi5/Pi4
      K      Single Precision        Double Precision         SP      DP
      1    0.03    0.02    0.02    0.02    0.02    0.02     1.88    1.96
      2    0.05    0.04    0.04    0.04    0.04    0.04     1.93    2.61
      4    0.10    0.08    0.08    0.09    0.09    0.09     2.37    2.74
      8    0.21    0.18    0.18    0.23    0.21    0.21     2.43    2.37
     16    0.45    0.41    0.41    0.53    0.48    0.49     2.70    2.40
     32    1.16    0.90    0.93    1.22    1.07    1.06     2.27    2.97
     64    2.39    2.04    2.39    2.98    2.76    2.69     2.52    3.63
    128    5.26    4.82    4.86    9.92    9.90    9.86     3.20    2.47
    256   14.58   13.92   13.89   29.15   27.71   26.90     2.68    1.96
    512   42.03   39.73   39.84   72.71   72.32   71.70     2.02    1.65
   1024  101.56   99.35   98.31  176.62  171.45  175.48     1.79    1.50

       1024 Square Check Maximum Noise Average Noise
       SP   9.999520e-01  3.346482e-06  4.565234e-11
       DP   1.000000e+00  1.133294e-23  1.428110e-28

              End at Fri Aug 11 16:48:29 2023

 Pi 5 GCC 12

  RPi FFT gcc 12 64 Bit Benchmark 1 Thu Sep 28 19:10:33 2023

   Size                    milliseconds                  Average GCC 12/8
      K      Single Precision        Double Precision         SP      DP
      1    0.02    0.02    0.02    0.02    0.02    0.02     1.15    1.02
      2    0.06    0.04    0.04    0.04    0.04    0.04     0.92    1.05
      4    0.08    0.08    0.08    0.08    0.08    0.08     1.09    1.05
      8    0.18    0.18    0.18    0.80    0.26    0.25     1.09    0.65
     16    0.55    0.62    0.61    0.78    0.62    0.68     0.95    1.01
     32    1.19    1.19    1.18    3.14    1.66    2.23     1.08    0.69
     64    2.90    2.87    3.12    4.14    3.83    4.62     1.03    0.95
    128    8.01    7.72    8.41   19.04   16.31   19.17     0.93    0.78
    256   28.65   29.22   30.38  142.81  143.44  144.91     0.75    0.67
    512  256.41  209.11  215.07  400.84  410.99  448.06     0.71    0.77
   1024  798.30  749.85  753.61 1073.95 1075.09 1051.38     0.79    0.99

       1024 Square Check Maximum Noise Average Noise
       SP   9.999520e-01  3.346482e-06  4.565234e-11
       DP   1.000000e+00  1.133294e-23  1.428110e-28

              End at Thu Sep 28 19:10:41 2023

  RPi FFT gcc 12 64 Bit Benchmark 3c.0 Thu Sep 28 19:13:51 2023

   Size                    milliseconds                  Average GCC 12/8
      K      Single Precision        Double Precision         SP      DP
      1    0.02    0.02    0.02    0.02    0.02    0.02     1.20    1.06
      2    0.04    0.04    0.04    0.04    0.04    0.04     1.04    1.06
      4    0.09    0.08    0.08    0.08    0.08    0.08     1.06    1.06
      8    0.19    0.18    0.18    0.20    0.19    0.19     1.06    1.10
     16    0.41    0.39    0.39    0.46    0.43    0.43     1.07    1.12
     32    0.88    0.85    0.86    1.01    0.96    0.96     1.15    1.14
     64    1.98    1.91    1.91    2.57    2.48    2.47     1.17    1.12
    128    5.65    4.68    4.63   10.10   10.04   10.06     1.00    0.98
    256   14.59   14.50   14.59   36.02   35.29   34.84     0.97    0.79
    512   55.50   54.91   55.79  100.99  102.62   99.96     0.73    0.71
   1024  143.39  142.49  143.22  231.27  228.44  229.17     0.70    0.76

       1024 Square Check Maximum Noise Average Noise
       SP   9.999520e-01  3.346482e-06  4.565234e-11
       DP   1.000000e+00  1.133294e-23  1.428110e-28

              End at Thu Sep 28 19:13:53 2023

BusSpeed Benchmark - busspeedPi64g8 and g12

This is a read only benchmark with data from caches and RAM. The program reads one word with 32 word increments for the next one, skipping the following data words, then repeats with decreasing increments, finally reading all data. This shows where data is read in bursts, enabling estimates to be made of bus speeds, as 16 times the speed of appropriate measurements at Inc16.

The most important ratios are from Read All, the others demonstrating that, when not all data is read sequentially, the Pi 5 appears to be significantly faster than the Pi 4. The main results indicate Pi 5 gains of just over two times reading data from L1 and L2 caches, but these can be more than four times from L3 and more than three times from RAM. Maximum bus speed, using one CPU core, is estimated as around 14 GB/second from Inc16, as also shown under Read All. See MP results for higher estimates.
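The bus speed estimate follows from the burst size. Assuming data arrives from RAM in 64 byte bursts (16 words of 4 bytes), reading one word in every 16 still transfers every burst, so the traffic on the bus is about 16 times the reported Inc16 speed. A minimal Python sketch of that arithmetic, using the Pi 5 GCC 8 RAM based results, is given below; the 64 byte burst size and the names used are assumptions for illustration.

```python
WORD_BYTES = 4
BURST_BYTES = 64  # assumed burst (cache line) size
WORDS_PER_BURST = BURST_BYTES // WORD_BYTES  # 16, matching the Inc16 test


def estimated_bus_mb_per_second(inc16_mb_per_second):
    """Estimate bus traffic when only one word per burst is actually used."""
    return inc16_mb_per_second * WORDS_PER_BURST


# Pi 5 GCC 8 Inc16 speeds from RAM at 16384 and 65536 KBytes
estimate_16m = estimated_bus_mb_per_second(864)
estimate_64m = estimated_bus_mb_per_second(796)

print(estimate_16m, estimate_64m)  # compare with Read All: 14015 and 12699
```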

Pi 5 performance produced from GCC 8 and GCC 12 compilations was essentially the same.

Code:

 Pi 4 GCC 8

   BusSpeed 64 Bit gcc 8 Mon May 25 22:13:11 2020

    Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All  Cache      Pi 5

     16   4898   5109   5626   5860   5879   9238  L1          L1
     32   1109   1389   2485   3804   5026   8435
     64    804   1030   2025   3285   4871   8312  L2 Shared
    128    737    951   1877   3130   4908   8556              L2
    256    732    953   1897   3147   4941   8617
    512    701    939   1766   2902   4601   8150
   1024    323    494    986   1807   3060   5553  RAM         L3 Shared
   4096    242    259    486    964   1932   3856              RAM
  16384    236    268    493    971   1939   3878
  65536    242    271    494    973   1942   3884

        End of test Mon May 25 22:13:21 2020

 Pi 5 GCC 8                                       P5/P4 Comparison

   BusSpeed 64 Bit gcc 8 Fri Aug 11 16:46:13 2023

    Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All  Words  Words  Words  Words  Words    All

     16   8300   8413  15451  17849  18151  18721   1.69   1.65   2.75   3.05   3.09   2.03
     32   9159   9235  15509  17911  18132  18721   8.26   6.65   6.24   4.71   3.61   2.22
     64   7460   7644  13739  17008  17665  18593   9.28   7.42   6.78   5.18   3.63   2.24
    128   2375   4452   7168  11555  13968  18203   3.22   4.68   3.82   3.69   2.85   2.13
    256   2375   4425   7225  11540  13964  18243   3.24   4.64   3.81   3.67   2.83   2.12
    512   1784   2980   5758  10362  13685  18203   2.54   3.17   3.26   3.57   2.97   2.23
   1024   1225   2325   4639   9336  13467  18281   3.79   4.71   4.70   5.17   4.40   3.29
   4096    656   1375   2700   5120   9599  15984   2.71   5.31   5.56   5.31   4.97   4.15
  16384    579    864   1741   3502   7020  14015   2.45   3.22   3.53   3.61   3.62   3.61
  65536    604    796   1595   3195   6351  12699   2.50   2.94   3.23   3.28   3.27   3.27

        End of test Fri Aug 11 16:46:22 2023

 Pi 5 GCC 12                                      Pi 5 GCC 12/8 Comparison

  BusSpeed 64 Bit gcc 12 Thu Sep 28 19:02:33 2023

   Reading Speed 4 Byte Words in MBytes/Second

 Memory  Inc32  Inc16   Inc8   Inc4   Inc2   Read  Inc32  Inc16   Inc8   Inc4   Inc2   Read
 KBytes  Words  Words  Words  Words  Words    All  Words  Words  Words  Words  Words    All

     16   8493   8509  16377  17918  18170  18733   1.02   1.01   1.06   1.00   1.00   1.00
     32   9127   9295  16478  18023  18212  18740   1.00   1.01   1.06   1.01   1.00   1.00
     64   7530   7604  14030  17241  17877  18603   1.01   0.99   1.02   1.01   1.01   1.00
    128   2375   4189   7212  11566  13961  18230   1.00   0.94   1.01   1.00   1.00   1.00
    256   2358   4275   7265  11595  13985  18274   0.99   0.97   1.01   1.00   1.00   1.00
    512   1557   2879   5524  10229  13877  18231   0.87   0.97   0.96   0.99   1.01   1.00
   1024   1225   2339   4606   9318  13902  18271   1.00   1.01   0.99   1.00   1.03   1.00
   4096    780   1387   2672   5115   9407  16053   1.19   1.01   0.99   1.00   0.98   1.00
  16384    652    880   1763   3479   7034  13979   1.13   1.02   1.01   0.99   1.00   1.00
  65536    624    801   1605   3178   6416  12800   1.03   1.01   1.01   0.99   1.01   1.01

MemSpeed Benchmark MB/Second - memspeedPi64g8 and g12

The benchmark includes CPU speed dependent calculations using data from caches and RAM, via single and double precision floating point and integer functions. The instruction sequences used are shown in the results column titles.

When compiled with GCC 6, earlier results identified unusually slow operation dealing with 32 bit floating point and integer calculations. This looks as though the effect is to read data from RAM instead of caches, which is why Pi 5 performance gains were mainly less than two times. With double precision floating point, average Pi 5 gains were around four times for the first two sets of calculations, including more than 10 times with L3 cache involvement.

The GCC 12 compilation appears to have corrected the above misoperations, providing gains of more than eight times over GCC 8. These results also show slight improvements in double precision calculations. Maximum calculated speeds are provided, indicating 15.3 single core GFLOPS SP and 6.86 DP, the relationship expected using SIMD calculations. The tests also confirmed this, with near 6.4 GFLOPS/GHz SP and near half that for DP. This performance was obtained using data from L1 and L2 caches, with almost the same from L3 cache.
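The GFLOPS figures can be derived from the MB/second results for the x[m]=x[m]+s*y[m] test. The Python sketch below assumes that the reported data rate counts both arrays, so each pair of words read (one from x, one from y) corresponds to one multiply and one add, that is, one floating point operation per word. It also assumes a 2.4 GHz Pi 5 clock. Both assumptions are inferred from the quoted numbers rather than taken from the benchmark source.

```python
PI5_CLOCK_GHZ = 2.4  # assumed standard Pi 5 clock


def gflops_from_mb_per_second(mb_per_second, bytes_per_word):
    """One flop per word read (2 flops per x, y pair), result in GFLOPS."""
    mega_words_per_second = mb_per_second / bytes_per_word
    return mega_words_per_second / 1000


# Pi 5 GCC 12, 8 KBytes row: Sngl 61264 MB/S, Dble 54902 MB/S
sp_gflops = gflops_from_mb_per_second(61264, 4)
dp_gflops = gflops_from_mb_per_second(54902, 8)
sp_gflops_per_ghz = sp_gflops / PI5_CLOCK_GHZ

print(round(sp_gflops, 1), round(dp_gflops, 2), round(sp_gflops_per_ghz, 2))
```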

Code:

 Pi 4 GCC 8

    Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

               Start of test Mon May 25 22:23:53 2020

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   15531   3999   3957  15576   4387   4358  11629   9313   9314
      16   15717   3992   3922  15770   4355   4377  11799   9444   9446
      32   12020   3818   3814  12043   4179   4198  11549   9496   9497
      64   12228   3816   3887  12220   4166   4195   8935   8506   8506
     128   12265   3869   3941  12157   4182   4206   8080   8193   8196
     256   12230   3873   3932  12073   4199   4216   8129   8224   8223
     512    9731   3832   3902   9709   4150   4171   8029   7845   7865
    1024    3772   3682   3769   3467   3887   3920   5478   5543   5378
    2048    1896   3463   3496   1886   3616   3612   2937   2945   2923
    4096    1924   3520   3528   1933   3651   3394   2752   2796   2785
    8192    1996   3523   3555   1988   3643   3630   2668   2661   2663

                End of test Mon May 25 22:24:10 2020

 Pi 5 GCC 8

     Memory Reading Speed Test 64 Bit gcc 8 by Roy Longbottom

               Start of test Fri Aug 11 16:34:06 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   50862   6851   6746  50686   7193   7490  37629  18595  25168
      16   51032   6820   6717  51024   7164   7468  38002  18888  24946
      32   49985   6814   6676  50568   7150   7446  37609  18972  25259
      64   50868   6857   6656  50864   7168   7411  37799  19114  25426
     128   32618   6797   6670  32666   7142   7278  35466  19143  25439
     256   32540   6788   6640  32744   7183   7278  34821  19144  25360
     512   26949   6786   6668  30112   7155   7246  33493  14598  16816
    1024   25094   6719   6645  19272   6821   7206  21805  17292  22671
    2048   20586   6365   6586  19261   6887   7172   4740   4662  13673
    4096    5004   6680   6710   4963   6776   6249   7938   8990   8797
    8192    3229   5589   4662   3205   6496   6573   6654   6719   4613

                End of test Fri Aug 11 16:34:22 2023

 P5/P4 Comparison

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8    3.27   1.71   1.70   3.25   1.64   1.72   3.24   2.00   2.70
      16    3.25   1.71   1.71   3.24   1.65   1.71   3.22   2.00   2.64
      32    4.16   1.78   1.75   4.20   1.71   1.77   3.26   2.00   2.66
      64    4.16   1.80   1.71   4.16   1.72   1.77   4.23   2.25   2.99
     128    2.66   1.76   1.69   2.69   1.71   1.73   4.39   2.34   3.10
     256    2.66   1.75   1.69   2.71   1.71   1.73   4.28   2.33   3.08
     512    2.77   1.77   1.71   3.10   1.72   1.74   4.17   1.86   2.14
    1024    6.65   1.82   1.76   5.56   1.75   1.84   3.98   3.12   4.22
    2048   10.86   1.84   1.88  10.21   1.90   1.99   1.61   1.58   4.68
    4096    2.60   1.90   1.90   2.57   1.86   1.84   2.88   3.22   3.16
    8192    1.62   1.59   1.31   1.61   1.78   1.81   2.49   2.52   1.73

 Pi 5 GCC 12

     Memory Reading Speed Test 64 Bit gcc 12 by Roy Longbottom

               Start of test Thu Sep 28 18:54:28 2023

  Memory   x[m]=x[m]+s*y[m] Int+   x[m]=x[m]+y[m]         x[m]=y[m]
  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32
    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S

       8   54902  61264  65610  55241  65554  63848  37768  25475  25486
      16   54803  60539  64671  55169  64700  64750  38078  24891  24891
      32   51859  60967  64278  52558  65247  65275  37520  25234  25234
      64   52597  61169  65523  52485  65514  65523  37945  25408  
25402     128   33580  60278  63742  33647  63692  62897  37218  25370  25457     256   33724  60317  63873  33711  63840  63865  35555  25371  25375     512   33522  59194  63298  33502  63259  63175  35909  25459  25451    1024   32078  57946  60718  31576  60680  59199  26110  22319  23059    2048   29249  55376  57648  29028  57558  57290  16245  18242  19514    4096    4508  11981  11906   4864  11894   9313  10254  10529  10668    8192    3175   6507   6150   3178   6441   6499   6678   6904   6364Max MFLOPS  6862  15316                End of test Thu Sep 28 18:54:43 2023 Pi 5 GCC 12/8  Memory   x[m]=x[m]+s*y[m] Int+  x[m]=x[m]+y[m]        x[m]=y[m]  KBytes    Dble   Sngl  Int32   Dble   Sngl  Int32   Dble   Sngl  Int32    Used    MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S   MB/S       8    1.08   8.94   9.73   1.09   9.11   8.52   1.00   1.37   1.01      16    1.07   8.88   9.63   1.08   9.03   8.67   1.00   1.32   1.00      32    1.04   8.95   9.63   1.04   9.13   8.77   1.00   1.33   1.00      64    1.03   8.92   9.84   1.03   9.14   8.84   1.00   1.33   1.00     128    1.03   8.87   9.56   1.03   8.92   8.64   1.05   1.33   1.00     256    1.04   8.89   9.62   1.03   8.89   8.78   1.02   1.33   1.00     512    1.24   8.72   9.49   1.11   8.84   8.72   1.07   1.74   1.51    1024    1.28   8.62   9.14   1.64   8.90   8.22   1.20   1.29   1.02    2048    1.42   8.70   8.75   1.51   8.36   7.99   3.43   3.91   1.43    4096    0.90   1.79   1.77   0.98   1.76   1.49   1.29   1.17   1.21    8192    0.98   1.16   1.32   0.99   0.99   0.99   1.00   1.03   1.38  
NeonSpeed Benchmark MB/Second - NeonSpeedPi64g8 and g12

This carries out some of the same calculations as MemSpeed. All results are for 32 bit floating point and integer calculations. The “Norm” functions are as generated by the compiler, while the “Neon” versions use NEON intrinsic functions.

With GCC 8, the “Normal” Float and Int test functions produced the same irregular results as the first MemSpeed calculations, running at speeds that suggest only RAM based data is being read. Performance from the NEON code indicated that the Pi 5 was typically 2.5 times faster than the Pi 4 using cache based data, and 1.5 times faster from RAM. Exceptions were gains of up to 7.9 times using L3 cache and nearly 4.8 from lower level caches.
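The comparison columns can be reproduced by dividing the Pi 5 speed by the Pi 4 speed. The following is a minimal Python sketch of that arithmetic, not part of the benchmark itself. The function name `speed_ratio` is hypothetical, and rounding to two decimal places is an assumption made to match the printed tables. The input figures are NEON Float v=v+s*v speeds taken from the GCC 8 results.

```python
def speed_ratio(new_mb_per_sec, old_mb_per_sec):
    """Return new/old speed ratio, rounded to two decimals (assumed table format)."""
    return round(new_mb_per_sec / old_mb_per_sec, 2)


# (KBytes, Pi 4 gcc 8 MB/s, Pi 5 gcc 8 MB/s) for the NEON Float v=v+s*v column
neon_float = [
    (256, 11565, 27491),    # cache based data on both systems
    (1024, 3331, 26340),    # the largest gain reported
    (16384, 1821, 2850),    # RAM based data
]

for kbytes, pi4, pi5 in neon_float:
    print(f"{kbytes:6d} KB  Pi 5/Pi 4 = {speed_ratio(pi5, pi4):.2f}")
```

These three entries give 2.38, 7.91 and 1.57, matching the corresponding P5/P4 comparison values.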

The GCC 12 compiler produced acceptable “Normal” performance on the Pi 5, reflected by gains of up to slightly more than ten times over the GCC 8 results. Its compiled “Normal” code is also shown to run faster than the NEON intrinsic functions. Many of the NEON results improved by around 20% under GCC 12, but some were slower. The maximum floating point speed demonstrated was nearly 17 GFLOPS.
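The post does not say how the Max MFLOPS figures are obtained from the MB/second results. The reported values do appear consistent with counting two floating point operations (one multiply, one add) for each pair of 4 byte words read in the Float v=v+s*v test. The Python sketch below works through that inferred relationship. The function name `mflops_from_mbps` and its parameters are hypothetical, and the formula is an assumption based on the numbers, not the benchmark's documented method.

```python
def mflops_from_mbps(mb_per_sec, bytes_per_word=4,
                     flops_per_element=2, words_read_per_element=2):
    """Convert a data reading speed to MFLOPS under the stated assumptions.

    Assumed model: each result element reads two words (one from each
    vector) and performs one multiply plus one add.
    """
    bytes_per_element = bytes_per_word * words_read_per_element
    return mb_per_sec * flops_per_element / bytes_per_element


# Fastest Float v=v+s*v speeds (MB/s) from the NeonSpeed results
print(mflops_from_mbps(14987))   # Pi 4 gcc 8 NEON,  reported maximum 3747
print(mflops_from_mbps(47104))   # Pi 5 gcc 8 NEON,  reported maximum 11776
print(mflops_from_mbps(67812))   # Pi 5 gcc 12 Norm, reported maximum 16953
```

Under this assumption, the fastest Pi 5 GCC 12 speed of 67812 MB/second corresponds to 16953 MFLOPS, the "nearly 17 GFLOPS" quoted above.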

Code:

Pi 4 GCC 8

NEON Speed 64 Bit gcc 8 Mon May 25 22:21:51 2020

       Vector Reading Speed in MBytes/Second

 Memory  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v
 KBytes   Norm   Neon   Norm   Neon  Float    Int

     16   3629  14987   3925  13643  14457  16642
     32   3475  10933   3821   9970  11029  11055
     64   3447  11749   3845  11098  11802  12079
    128   3332  11392   3912  10813  11430  11513
    256   3325  11565   3926  10981  11598  11699
    512   3313  10553   3917  10269  10755  10740
   1024   3239   3331   3737   3291   3302   3321
   4096   2987   1888   3331   1777   1881   1878
  16384   3150   1821   3347   1814   1812   1834
  65536   2747   1954   3132   2017   1904   2021

    Max  MFLOPS          3747

       End of test Mon May 25 22:22:11 2020

Pi 5 GCC 8                                        P5/P4 Comparison

NEON Speed 64 Bit gcc 8 Fri Aug 11 16:44:52 2023

       Vector Reading Speed in MBytes/Second

 Memory  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v
 KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

     16   6745  46851   6968  44490  46849  46847   1.86   3.13   1.78   3.26   3.24   2.81
     32   6727  47104   6947  44618  47061  47056   1.94   4.31   1.82   4.48   4.27   4.26
     64   6703  46642   6962  44166  47040  46955   1.94   3.97   1.81   3.98   3.99   3.89
    128   6587  27383   6840  27199  27404  27398   1.98   2.40   1.75   2.52   2.40   2.38
    256   6579  27491   6857  27299  27509  27509   1.98   2.38   1.75   2.49   2.37   2.35
    512   6571  27433   6862  26599  24237  26163   1.98   2.60   1.75   2.59   2.25   2.44
   1024   6531  26340   6756  25226  24597  24527   2.02   7.91   1.81   7.67   7.45   7.39
   4096   6414   9410   6505   9986   9474   8835   2.15   4.98   1.95   5.62   5.04   4.70
  16384   5690   2850   5501   2830   2865   2488   1.81   1.57   1.64   1.56   1.58   1.36
  65536   4837   2534   4736   2458   2401   2450   1.76   1.30   1.51   1.22   1.26   1.21

    Max  MFLOPS         11776

       End of test Fri Aug 11 16:45:12 2023

Pi 5 GCC 12                                       Pi 5 GCC 12/8

NEON Speed 64 Bit gcc 12 Thu Sep 28 18:57:35

       Vector Reading Speed in MBytes/Second

 Memory  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v  Float v=v+s*v   Int v=v+v+s  Neon  v=v+v
 KBytes   Norm   Neon   Norm   Neon  Float    Int   Norm   Neon   Norm   Neon  Float    Int

     16  67042  45164  67037  45358  54228  54166   9.94   0.96   9.62   1.02   1.16   1.16
     32  67631  45190  67621  45415  53833  53675  10.05   0.96   9.73   1.02   1.14   1.14
     64  67812  44856  67491  45171  52338  51321  10.12   0.96   9.69   1.02   1.11   1.09
    128  62779  33147  64360  33074  33619  33458   9.53   1.21   9.41   1.22   1.23   1.22
    256  64352  33405  64803  33187  33699  33719   9.78   1.22   9.45   1.22   1.23   1.23
    512  61159  33171  61798  32263  33178  28319   9.31   1.21   9.01   1.21   1.37   1.08
   1024  58937  32149  57732  31639  32219  32108   9.02   1.22   8.55   1.25   1.31   1.31
   4096   9215   2639   7168   3800   3823   3776   1.44   0.28   1.10   0.38   0.40   0.43
  16384   5546   2830   5592   2772   2753   2503   0.97   0.99   1.02   0.98   0.96   1.01
  65536   4633   2445   4196   1922   2196   2294   0.96   0.96   0.89   0.78   0.91   0.94

    Max MFLOPS  16953

Statistics: Posted by RoyLongbottom — Tue Jan 16, 2024 2:32 pm


