Parallel Programming Abstractions

Abstraction vs. Implementation

Conflating abstraction with implementation is a common cause for confusion in this course.

This lecture uses ISPC as a running example to explain the distinction in detail:

ISPC: Intel SPMD Program Compiler
SPMD: single program multiple data
http://ispc.github.com/
The Story of ISPC

Part 1: Intro to ISPC

ISPC essentially does its work at the compilation level.

Key features:

gang: a group of program instances
instance: one of the logical instruction streams in a gang
programCount: number of simultaneously executing instances in the gang (uniform value, equal to 8 in this class)
programIndex: id of the current instance in the gang (varying value)

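Think about it

At first glance this looks a lot like C++ thread programming:

an instance is like an independent thread
a gang is like a thread pool
programIndex is like a thread ID

The main difference between the two is really the difference between abstraction and implementation: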

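(1) ISPC (abstraction: SPMD, implementation: SIMD)

ISPC uses the SPMD (single program, multiple data) programming model.

The programmer writes one program and imagines programCount logical instruction streams running in parallel, each with a different programIndex value.

However, the ISPC compiler implements this SPMD abstraction with SIMD instructions, such as AVX2 or ARM NEON.

This means that, by default, an ISPC gang is implemented with SIMD instructions inside a single hardware thread on a single CPU core.

The compiler handles conditional control flow with techniques such as masking, so that different program instances can follow different paths depending on their own data.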
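As a rough illustration of masking, the following C++ sketch emulates one gang of eight instances evaluating an if/else on per-instance data. It is not ISPC output: `kProgramCount`, `gang_abs_or_double`, and the loops over `lane` are hypothetical stand-ins, with each loop playing the role of one vector instruction applied to all lanes.

```cpp
#include <array>

// Hypothetical gang width, matching the programCount of 8 used in these notes.
constexpr int kProgramCount = 8;

// Emulates a gang computing: y = (x < 0) ? -x : 2 * x, one value per lane.
// Each loop over `lane` stands in for a single SIMD instruction.
std::array<float, kProgramCount> gang_abs_or_double(
    const std::array<float, kProgramCount>& x) {
    std::array<float, kProgramCount> y{};
    std::array<bool, kProgramCount> mask{};

    // Step 1: every lane evaluates the condition; the results form the mask.
    for (int lane = 0; lane < kProgramCount; ++lane) {
        mask[lane] = x[lane] < 0.0f;
    }

    // Step 2: the "if" branch is issued for the whole gang,
    // but only lanes whose mask bit is set commit a result.
    for (int lane = 0; lane < kProgramCount; ++lane) {
        if (mask[lane]) y[lane] = -x[lane];
    }

    // Step 3: the "else" branch is issued with the inverted mask.
    for (int lane = 0; lane < kProgramCount; ++lane) {
        if (!mask[lane]) y[lane] = 2.0f * x[lane];
    }
    return y;
}
```

Both branches are issued for the whole gang and the mask decides which lanes keep a result, which is why divergent control flow costs time on SIMD hardware.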

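(2) C++ thread programming (abstraction and implementation are usually closer)

A std::thread from the C++ standard library usually maps directly to an operating system thread, which the OS schedules onto a hardware thread (an execution context).

This means each std::thread can actually run in parallel on its own CPU core or hardware thread.

The mapping from this abstraction to the underlying multi-core hardware is usually more direct.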
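For contrast, here is a minimal sketch of the std::thread side, assuming a hypothetical helper `sum_with_threads` that splits a vector into contiguous chunks and requires `num_threads >= 1`. Each std::thread here is a real OS thread that may run on its own core, unlike the lanes of a single ISPC gang.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` using `num_threads` OS threads.
// Assumes num_threads >= 1.
long long sum_with_threads(const std::vector<int>& data, unsigned num_threads) {
    std::vector<long long> partial(num_threads, 0);
    std::vector<std::thread> workers;
    // Ceiling division so every element belongs to exactly one chunk.
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            long long local = 0;
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            partial[t] = local;  // each thread writes only its own slot
        });
    }
    for (auto& w : workers) w.join();  // wait for all threads to finish

    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

Because every thread writes to a distinct element of `partial`, no lock is needed in this particular sketch.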

(1) Schedule 1: interleaved assignment

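Code pattern:

[figure: ISPC code using interleaved assignment]

Resulting schedule:

[figure: elements assigned to program instances in an interleaved pattern]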
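Since the original figures are not available here, the interleaved pattern can be sketched in C++. The function `interleaved_indices` is a hypothetical helper, not ISPC code; it lists the element indices that one instance would process when the loop advances by `program_count` each iteration and instance `program_index` takes element `i + program_index`.

```cpp
#include <vector>

// Hypothetical helper: element indices handled by one instance
// under interleaved assignment over n elements.
std::vector<int> interleaved_indices(int program_index, int program_count, int n) {
    std::vector<int> indices;
    for (int i = 0; i < n; i += program_count) {
        const int idx = i + program_index;     // this instance's element in this iteration
        if (idx < n) indices.push_back(idx);   // guard for n not divisible by program_count
    }
    return indices;
}
```

With 8 instances and 20 elements, instance 1 gets elements 1, 9, and 17. In any single iteration the gang as a whole touches consecutive elements, which maps well onto a contiguous vector load.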

(2) Schedule 2: blocked assignment

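Code pattern:

[figure: ISPC code using blocked assignment]

Resulting schedule:

[figure: elements assigned to program instances in contiguous blocks]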
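The blocked pattern can be sketched the same way. `blocked_indices` is a hypothetical C++ helper, not ISPC code; it uses ceiling division so that element counts that are not a multiple of `program_count` are still covered, which is an assumption of this sketch.

```cpp
#include <vector>

// Hypothetical helper: element indices handled by one instance
// under blocked assignment over n elements.
std::vector<int> blocked_indices(int program_index, int program_count, int n) {
    // Each instance owns one contiguous block of this many elements (ceiling division).
    const int per_instance = (n + program_count - 1) / program_count;
    const int start = program_index * per_instance;

    std::vector<int> indices;
    for (int j = 0; j < per_instance; ++j) {
        const int idx = start + j;
        if (idx < n) indices.push_back(idx);  // trailing instances may get fewer elements
    }
    return indices;
}
```

With 8 instances and 20 elements, instance 1 gets elements 3, 4, and 5. In any single iteration the instances touch elements that are `per_instance` apart, so a SIMD implementation needs strided (gather-style) memory access instead of one contiguous load.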

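(3) The foreach keyword

[figure: ISPC code using foreach]

foreach provides an abstraction: it declares that the loop iterations are parallel and leaves it to the ISPC implementation to choose how to assign them to instances, for example by interleaved or blocked assignment.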

(4) ISPC: abstraction vs. implementation

Programming model: SPMD (single program, multiple data)
Underlying implementation: SIMD (single instruction, multiple data)
- ISPC compiler emits vector instructions (e.g., AVX2, ARM NEON) that carry out the logic performed by an ISPC gang
- ISPC gang abstraction is implemented by SIMD instructions that execute within one thread running on one x86 core of a CPU

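Warning

Of course, the gang is not the only abstraction ISPC offers. It is a simple "bunch" abstraction, and a limited one, because a gang is confined to a single core.

To make the implementation use multiple cores, use the task keyword.

We will come back to this in a later lecture.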

Part 2

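Concept 1: parallel programming models

This matters a great deal: a programming model shapes the programmer's first intuition when writing a program, and it also influences the design of the underlying hardware:

influence how programmers think when writing programs
influence the design of parallel hardware platforms designed to execute them efficiently

Concept 2: communication/synchronization

These are the two most important behaviors in a distributed system; they are what allows separate workers to act as one.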

Three programming models

A programming model is an abstraction.

Shared address space
Message passing
Data parallel

(1) Shared address space

Several concepts matter for the shared address space model:

Threads communicate by reading/writing to locations in a shared address space (shared variables)
Reads and writes to shared variables can conflict, so some operations must be atomic (atomicity):
- Lock/unlock mutex around a critical section
  ```cpp
  mylock.lock();
  // critical section
  mylock.unlock();
  ```
- Some languages have first-class support for atomicity of code blocks
  ```cpp
  atomic {
    // critical section
  }
  ```
- Intrinsics (built-in functions) for hardware-supported atomic read-modify-write operations
  ```cpp
  atomicAdd(x, 10);
  ```
Methods for thread communication
- R/W to shared variables in a shared address space: load() / store()
- Manipulating synchronization primitives: lock() / unlock()
Hardware implementation of a shared address space

PS: Non-uniform memory access (NUMA)

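NUMA illustrates an important point: the latency and bandwidth a CPU core sees when accessing memory depend on where that memory sits relative to the core. Not all memory accesses are equally fast.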
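Returning to the synchronization mechanisms in this subsection, the sketch below shows two of them in standard C++: a mutex around a critical section, and a hardware-supported atomic read-modify-write via `std::atomic::fetch_add`. The helper names `count_with_mutex` and `count_with_atomic` are hypothetical and exist only for this illustration.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: several threads increment a shared counter under a mutex.
int count_with_mutex(int num_threads, int increments_per_thread) {
    int counter = 0;   // shared variable
    std::mutex m;      // protects `counter`
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int k = 0; k < increments_per_thread; ++k) {
                std::lock_guard<std::mutex> guard(m);  // lock(); unlock() on scope exit
                ++counter;                             // critical section
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}

// Hypothetical helper: same work, using an atomic read-modify-write instead of a lock.
int count_with_atomic(int num_threads, int increments_per_thread) {
    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int k = 0; k < increments_per_thread; ++k) {
                counter.fetch_add(1);  // atomic increment, no explicit lock
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter.load();
}
```

Without the mutex or the atomic, the concurrent `++counter` would be a data race and the final count would be unpredictable.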

(2) Message passing model

Threads operate within their own private address spaces
Threads communicate by sending/receiving messages
- send: specifies recipient, buffer to be transmitted, and optional message identifier (“tag”)
- receive: specifies sender, buffer to store data, and optional message identifier

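[figure: message passing between threads]

As the figure shows:

Threads communicate by exchanging messages.

This communication is explicit: the source code states it directly through calls to send and receive.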
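A minimal sketch of the explicit send/receive style follows. It is only an emulation: two std::thread objects live in the same address space, and the hypothetical `Mailbox` class is itself built from a mutex and a condition variable. A real message passing system (for example, processes communicating through MPI) would keep the address spaces separate. What the sketch does show is the discipline: the two threads exchange data only through `send` and `receive`.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

// Hypothetical one-way mailbox carrying int messages.
class Mailbox {
public:
    void send(int value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            queue_.push(value);
        }
        cv_.notify_one();  // wake a receiver that may be waiting
    }

    int receive() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !queue_.empty(); });  // block until a message arrives
        const int value = queue_.front();
        queue_.pop();
        return value;
    }

private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> queue_;
};

// Hypothetical driver: a producer thread sends 1..count, the caller receives and sums them.
int sum_received(int count) {
    Mailbox box;
    std::thread producer([&box, count] {
        for (int i = 1; i <= count; ++i) box.send(i);
    });
    int total = 0;
    for (int i = 0; i < count; ++i) total += box.receive();
    producer.join();
    return total;
}
```

The receiver never reads the producer's variables directly; every value it sees arrived as a message.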

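Thinking one step further

In the shared address space model, we said that threads communicate by:

loading/storing the same variables
using specific primitives such as lock() / unlock()

Here we say that threads communicate through send() / recv().

Is this a contradiction?

No, it is not, but the answer involves an important concept:

One address space can contain multiple threads; communication among those threads falls under the shared address space model.
Communication between threads in different address spaces falls under the message passing model.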

(3) The data-parallel model

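In one sentence: apply the same operation to every element of a large collection of data, which simplifies programming and enables efficient optimization.

The data-parallel model aims to provide a programming paradigm that maximizes processor efficiency, especially when processing large amounts of homogeneous data, for example on throughput-oriented processors such as GPUs.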
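The idea of "same operation on every element" can be sketched with `std::transform` from the C++ standard library. The helper `map_scale` is hypothetical. This version runs sequentially; the point is that the code only states what happens to each element, so an implementation is free to spread the elements across SIMD lanes, cores, or a GPU.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper: multiplies every element of x by a.
// Each output element depends only on the matching input element,
// so all elements could be processed in parallel.
std::vector<float> map_scale(const std::vector<float>& x, float a) {
    std::vector<float> y(x.size());
    std::transform(x.begin(), x.end(), y.begin(),
                   [a](float v) { return a * v; });
    return y;
}
```

Because no element's result depends on another element, there is no communication or synchronization between iterations to reason about.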