[Repost] A developer abroad exclaims: the world is changing fast! PostgreSQL 14 on ARM 8.1 with the LSE patch soars to 1.4 million TPS



https://www.modb.pro/db/91515

 

PostgreSQL, ARM, LSE, 14


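Background

A developer abroad exclaims: the world is changing fast!

ARM 8.1 defines a set of LSE instructions, which, in particular, provide a way to implement an atomic operation in a single instruction (just like x86).

In the performance tests, PG 14 with the patch that uses LSE instructions reached 1.4 million read-only TPS and 180,000 read-write TPS in pgbench on a 100-million-row data set. ARM 8.1 is now on par with x86: a 64-core x86 machine would probably also top out at around 1.4 million TPS.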

[The full article is reproduced below]

It includes some test data.

https://akorotkov.github.io/blog/2021/04/30/arm/

https://www.postgresql.org/message-id/CAPpHfdsGqVd6EJ4mr_RZVE5xSiCNBy4MuSvdTrKmTpM0eyWGpg%40mail.gmail.com

https://www.postgresql.org/message-id/CAPpHfdsKrh7c7P8-5eG-qW3VQobybbwqH%3DgL5Ck%2BdOES-gBbFg%40mail.gmail.com

The world changes. ARM architecture breaks into new areas of computing. Only a decade ago, only your mobile, router, or another specialized device could be ARM-based, while your desktop and server were typically x86-based. Nowadays, your new MacBook is ARM-based, and your EC2 instance could be ARM as well.

In mid-2020, Amazon made Graviton2 instances publicly available. The maximum number of CPU cores there is 64. This number is where it becomes interesting to check PostgreSQL scalability. It’s exciting to check because ARM implements atomic operations using a pair of load/store instructions. So, in a sense, ARM is just like Power, where I’ve previously seen a significant effect of platform-specific atomics optimizations.

But on the other hand, ARM 8.1 defines a set of LSE instructions, which, in particular, provide a way to implement an atomic operation in a single instruction (just like x86). What would be better: a special optimization, which puts custom logic between the load and store instructions, or just a simple loop of LSE CAS instructions? I’ve tried them both.
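To make the "simple loop of CAS instructions" concrete, here is a minimal sketch in portable C11. It is not PostgreSQL's code; the helper name `cas_fetch_add_u32` is hypothetical. The point is that the same source can be lowered in two ways: without LSE, GCC and Clang typically emit a load-exclusive/store-exclusive retry sequence, while with LSE enabled the compare-exchange can become a single CAS-family instruction.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical helper (not from PostgreSQL): atomically add `delta` to
 * `*target` using a compare-and-swap retry loop, and return the value
 * that was stored before the add. */
uint32_t cas_fetch_add_u32(_Atomic uint32_t *target, uint32_t delta)
{
    uint32_t expected = atomic_load_explicit(target, memory_order_relaxed);

    /* When the exchange fails, `expected` is refreshed with the current
     * value of `*target`, so the loop retries with up-to-date data. */
    while (!atomic_compare_exchange_weak_explicit(
               target, &expected, expected + delta,
               memory_order_acq_rel, memory_order_relaxed)) {
        /* another thread changed the value first: try again */
    }
    return expected;
}
```

Compiling this once with and once without LSE enabled, then disassembling both objects, is one way to see the two instruction sequences the article compares.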

You can see the results of read-only and read-write pgbench on the graphs below (details on experiments are here). pg14-devel-lwlock-ldrex-strex is the patched PostgreSQL with special load/store optimization for lwlock, pg14-devel-lse is PostgreSQL compiled with LSE support enabled.

pic

pic

You can see that load/store optimization gives a substantial positive effect, but LSE rocks here!

So, if you’re running PostgreSQL on a Graviton2 instance, make sure you have binaries compiled with LSE support (see the instructions) because the effect is dramatic.

BTW, it appears that none of these optimizations have a noticeable effect on the performance of Apple M1. Probably, M1 has a smart enough inner optimizer to recognize these different implementations to be equivalent. And it was surprising that LSE usage might give a small negative effect on Kunpeng 920. It was discouraging for me to learn of an ARM processor where a single-instruction operation is slower than its multiple-instruction equivalent. Hopefully, processor architects will fix this in new Kunpeng processors.

In general, we see that now different ARM embodiments have different performance characteristics and different effects of optimizations. Hopefully, this is a problem of growth, and it will be overcome soon.

LSE

C/C++ on Graviton

Enabling Arm Architecture Specific Features

To build code with the optimal processor features use the following. If you want to support both Graviton and Graviton2 you'll have to limit yourself to the Graviton features.

CPU | GCC | LLVM
---------|----------------------|-------------
Graviton | -march=armv8-a+crc+crypto | -march=armv8-a+crc+crypto
Graviton2 | -march=armv8.2-a+fp16+rcpc+dotprod+crypto | -march=armv8.2-a+fp16+rcpc+dotprod+crypto

Note: GCC-7 does not support +rcpc+dotprod.

Core Specific Tuning

CPU | GCC < 9 | GCC >= 9
---------|----------------------|-------------
Graviton | -mtune=cortex-a72 | -mtune=cortex-a72
Graviton2 | -mtune=cortex-a72 | -mtune=neoverse-n1

Large-System Extensions (LSE)

The Graviton2 processor in C6g, M6g, and R6g instances has support for the Armv8.2 instruction set. Armv8.2 specification includes the large-system extensions (LSE) introduced in Armv8.1. LSE provides low-cost atomic operations. LSE improves system throughput for CPU-to-CPU communication, locks, and mutexes. The improvement can be up to an order of magnitude when using LSE instead of load/store exclusives.

The POSIX threads library needs LSE atomic instructions. LSE is important for locking and thread synchronization routines. The following systems distribute a libc compiled with LSE instructions:
- Amazon Linux 2,
- Ubuntu 18.04 (needs apt install libc6-lse),
- Ubuntu 20.04.

The compiler needs to generate LSE instructions for applications that use atomic operations. For example, the code of databases like PostgreSQL contains atomic constructs; C++11 code with std::atomic statements translates into atomic operations. GCC's -march=armv8.2-a flag enables all instructions supported by Graviton2, including LSE. To confirm that LSE instructions are created, the output of the objdump command line utility should contain LSE instructions:

    $ objdump -d app | grep -i 'cas\|casp\|swp\|ldadd\|stadd\|ldclr\|stclr\|ldeor\|steor\|ldset\|stset\|ldsmax\|stsmax\|ldsmin\|stsmin\|ldumax\|stumax\|ldumin\|stumin' | wc -l

To check whether the application binary contains load and store exclusives:

    $ objdump -d app | grep -i 'ldxr\|ldaxr\|stxr\|stlxr' | wc -l

GCC's -moutline-atomics flag produces a binary that runs on both Graviton and Graviton2. Supporting both platforms with the same binary comes at a small extra cost: one load and one branch. To check that an application has been compiled with -moutline-atomics, use the nm command line utility, which displays the names of functions and global variables in an application binary. The boolean variable that GCC uses to check for LSE hardware capability is __aarch64_have_lse_atomics and it should appear in the list of symbols:
```
$ nm app | grep __aarch64_have_lse_atomics | wc -l
# the output should be 1 if app has been compiled with -moutline-atomics
```
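The "one load and one branch" cost comes from testing a capability flag at run time. As a rough illustration only (this is not GCC's implementation), the sketch below queries the same kind of information from user code. It assumes Linux on AArch64, where `getauxval(AT_HWCAP)` and the `HWCAP_ATOMICS` bit are available; on any other platform the hypothetical helper `cpu_has_lse` simply reports false. The second helper, also hypothetical, reports whether this translation unit was built with LSE enabled inline, using the ACLE macro `__ARM_FEATURE_ATOMICS`.

```c
#include <stdbool.h>

#if defined(__linux__) && defined(__aarch64__)
#include <sys/auxv.h>   /* getauxval, AT_HWCAP */
#include <asm/hwcap.h>  /* HWCAP_ATOMICS */
#endif

/* Hypothetical helper: true if the running CPU advertises LSE atomics.
 * Assumption: Linux/AArch64; elsewhere this always returns false. */
bool cpu_has_lse(void)
{
#if defined(__linux__) && defined(__aarch64__) && defined(HWCAP_ATOMICS)
    return (getauxval(AT_HWCAP) & HWCAP_ATOMICS) != 0;
#else
    return false;
#endif
}

/* Hypothetical helper: true if the compiler was allowed to emit LSE
 * instructions inline for this file (for example via -march=armv8.2-a). */
bool built_with_inline_lse(void)
{
#ifdef __ARM_FEATURE_ATOMICS
    return true;
#else
    return false;
#endif
}
```

A binary for which `built_with_inline_lse()` is true will not run on first-generation Graviton, which is the situation -moutline-atomics is meant to avoid.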

Porting codes with SSE/AVX intrinsics to NEON

When programs contain code with x64 intrinsics, the following procedure can help to quickly obtain a working program on Arm, assess the performance of the program running on Graviton processors, profile hot paths, and improve the quality of code on the hot paths.

To quickly get a prototype running on Arm, one can use https://github.com/DLTcollab/sse2neon, a translator of x64 intrinsics to NEON. sse2neon provides a quick starting point in porting performance critical codes to Arm. It shortens the time needed to get a working Arm program that can then be used to extract profiles and to identify hot paths in the code. A header file sse2neon.h contains several of the functions provided by standard x64 include files like xmmintrin.h, only implemented with NEON instructions to produce the exact semantics of the x64 intrinsic. Once a profile is established, the hot paths can be rewritten directly with NEON intrinsics to avoid the overhead of the generic sse2neon translation.
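As a sketch of what rewriting a hot path directly with NEON intrinsics can look like, the hypothetical kernel below adds two float arrays. The NEON branch is the hand-written Arm version, the SSE branch is the kind of x64 code one would be porting from, and the scalar loop handles the tail and serves as a fallback. The function name and the choice of kernel are illustrative assumptions, not taken from sse2neon.

```c
#include <stddef.h>

#if defined(__aarch64__) || defined(__ARM_NEON)
#include <arm_neon.h>
#define HAVE_NEON 1
#elif defined(__SSE__)
#include <xmmintrin.h>
#define HAVE_SSE 1
#endif

/* Hypothetical hot-path kernel: out[i] = a[i] + b[i] for n floats. */
void add_f32(const float *a, const float *b, float *out, size_t n)
{
    size_t i = 0;
#if defined(HAVE_NEON)
    /* Native NEON: four lanes per iteration. */
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#elif defined(HAVE_SSE)
    /* Original x64 form: the same operation with SSE intrinsics. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar tail, and the whole loop when no SIMD path is available. */
    for (; i < n; i++)
        out[i] = a[i] + b[i];
}
```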

Signed vs. Unsigned char

The C standard doesn't specify the signedness of char. On x86 char is signed by default while on Arm it is unsigned by default. This can be addressed by using standard int types that explicitly specify the signedness (e.g. uint8_t and int8_t) or by compiling with -fsigned-char.
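A small sketch of the difference, using hypothetical helper names: widening the byte 0xFF through plain char gives a platform-dependent result, while the fixed-width types give the same answer everywhere.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>

/* True where plain char is signed (typical on x86), false where it is
 * unsigned (the default on Arm Linux). */
bool plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widen one raw byte to int with the signedness spelled out in the type:
 * the result is the same on every platform. */
int widen_signed(uint8_t raw)   { return (int8_t)raw; }
int widen_unsigned(uint8_t raw) { return raw; }

/* Widen through plain char: 0xFF becomes -1 where char is signed and 255
 * where it is unsigned. */
int widen_plain(uint8_t raw)    { return (char)raw; }
```

Code that compares a plain char against a negative sentinel, or uses it as a table index, is where this difference usually shows up when moving from x86 to Arm.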

Using Graviton2 Arm instructions to speed-up Machine Learning

Graviton2 processors have been optimized for performance and power-efficient machine learning by enabling Arm dot-product instructions, commonly used for Machine Learning (quantized) inference workloads, and by enabling Half precision floating point (_float16), which doubles the number of operations per second and reduces the memory footprint compared to single precision floating point (_float32), while still enjoying a large dynamic range.
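For orientation, the inner loop of a quantized inference kernel often has the shape below: 8-bit inputs multiplied pairwise and accumulated into a 32-bit sum. This is a plain scalar sketch with a hypothetical function name. Whether a compiler turns it into dot-product instructions is an assumption that depends on the compiler version, the optimization level, and building with a -march value that includes +dotprod.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical kernel: dot product of two int8 vectors accumulated in
 * int32, the typical shape of a quantized-inference inner loop. */
int32_t dot_i8(const int8_t *a, const int8_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];
    return acc;
}
```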

Using Graviton2 Arm instructions to speed-up common code sequences

The Arm instruction set includes instructions that can be used to speed up common code sequences. The table below lists common operations and links to code sequences:

Operation | Description
----------|------------
crc | Graviton processors support instructions to accelerate both CRC32, which is used by Ethernet, media and compression, and CRC32C (Castagnoli), which is used by filesystems.
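The code sequences the table links to are not reproduced here. As a stand-in, the sketch below computes CRC32C one byte at a time: when the compiler targets a CPU with the CRC extension (the ACLE macro `__ARM_FEATURE_CRC32` is defined, for example with +crc in -march), it uses the `__crc32cb` intrinsic from `arm_acle.h`; otherwise it falls back to a bitwise software loop over the reflected Castagnoli polynomial. The function name `crc32c` and the byte-at-a-time structure are illustrative choices; a production version would process wider words per step.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>   /* __crc32cb */
#endif

/* Illustrative CRC32C (Castagnoli) over a byte buffer. */
uint32_t crc32c(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t crc = 0xFFFFFFFFu;          /* standard initial value */

    for (size_t i = 0; i < len; i++) {
#if defined(__ARM_FEATURE_CRC32)
        /* Hardware path: one CRC32C instruction per byte. */
        crc = __crc32cb(crc, p[i]);
#else
        /* Software fallback: reflected polynomial 0x82F63B78. */
        crc ^= p[i];
        for (int bit = 0; bit < 8; bit++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
#endif
    }
    return crc ^ 0xFFFFFFFFu;            /* standard final inversion */
}
```

Both paths should agree on the usual check string: the CRC32C of the nine ASCII bytes "123456789" is 0xE3069283.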
