Parallel Programming Basics

top-down: creating a program


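The figure for this section is the heart of this lecture and is very important! It shows the four steps covered below: decomposition, assignment, orchestration, and mapping.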

decomposition
1. who's responsible: programmer
2. Compilers currently struggle to do this automatically, so the programmer still has to do it by hand
assignment
1. Task: assigning tasks to threads
  1. tasks: things to do
  2. threads: workers
2. who's responsible: compiler / runtime
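3. Both modes are possible:
  1. static: the assignment is fixed before the application runs
  2. dynamic: the assignment is made incrementally while the program executes
4. Goal: good load balance + reduced communication cost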
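
The static and dynamic modes above can be sketched in a few lines of Python. This is only an illustration, not how any particular compiler or runtime does it: the function names `static_assignment` and `dynamic_assignment`, the `work` callback, and the use of a shared `queue.Queue` as the task pool are all assumptions made for this example.

```python
import queue
import threading


def static_assignment(tasks, num_threads):
    """Static (blocked) assignment: every worker's slice is fixed before running."""
    chunk = (len(tasks) + num_threads - 1) // num_threads  # ceiling division
    return [tasks[i * chunk:(i + 1) * chunk] for i in range(num_threads)]


def dynamic_assignment(tasks, num_threads, work):
    """Dynamic assignment: workers pull the next task from a shared queue at run time."""
    pending = queue.Queue()
    for t in tasks:
        pending.put(t)
    done = [[] for _ in range(num_threads)]  # tasks completed by each worker

    def worker(wid):
        while True:
            try:
                t = pending.get_nowait()  # grab whatever task is next
            except queue.Empty:
                return  # no work left
            work(t)
            done[wid].append(t)

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return done
```

With static assignment the split is known up front and costs nothing at run time, but uneven task costs hurt load balance. The dynamic version balances load automatically, at the price of every task fetch touching the shared queue.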
orchestration
1. Tasks:
  - Structuring communication
    - e.g., the communication mechanisms from the previous lecture: the Shared Address Space model communicates by reading and writing variables in shared memory; the Message Passing model uses functions such as send() and recv(); the Data-parallel model uses special built-in primitives
  - Adding synchronization to preserve dependencies if necessary
    - e.g., the control-oriented operations from the previous lecture: lock() / unlock(), barriers, and atomic operations
  - Organizing data structures in memory
    - e.g., Blocked Assignment and Interleaved Assignment from the previous lecture
  - Scheduling tasks
    - e.g., the static and dynamic modes described just above
2. who's responsible: mainly the threads themselves
3. Goals:
  1. reduce communication/synchronization cost
  2. reduce overhead
  3. preserve locality of data reference
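
A minimal sketch of these orchestration pieces working together, assuming Python's standard `threading` module as a stand-in for the lock() / unlock() and barrier primitives. The function `parallel_sum` and its variable names are hypothetical, chosen only for this example.

```python
import threading


def parallel_sum(data, num_threads):
    """Sum `data` with an interleaved assignment, one lock, and one barrier."""
    total = 0
    lock = threading.Lock()
    barrier = threading.Barrier(num_threads)
    seen = [None] * num_threads  # value of `total` each worker reads after the barrier

    def worker(wid):
        nonlocal total
        # Interleaved assignment: worker wid takes elements wid, wid + num_threads, ...
        partial = sum(data[wid::num_threads])
        # Accumulate locally first, so the lock is taken once per thread
        # instead of once per element (less synchronization cost).
        with lock:
            total += partial
        # Barrier preserves the dependency: nobody reads `total`
        # until every worker has added its partial sum.
        barrier.wait()
        seen[wid] = total

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_threads)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return total, seen
```

The local `partial` variable is the design choice worth noting: it keeps most of the work on thread-private data and touches shared state only once per thread.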
mapping
1. Task: mapping "threads" ("workers") to hardware execution units
2. who's responsible: it varies
  1. OS: HW exec. context -> CPU core
  2. compiler: ISPC program instances -> vector instruction lanes
  3. hardware: CUDA thread blocks -> GPU cores

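Mapping involves a few decisions:

Placing related threads on the same CPU:
1. maximize locality
2. data sharing
3. minimize sync/comm cost
Placing unrelated threads on the same CPU:
1. useful when they are limited by different resources (one by bandwidth, another by computation), so together they use the machine more efficiently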

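What is the difference between a compiler and a runtime?

A compiler does its work before the program runs. It focuses on translating, transforming, and statically optimizing code, including generating parallel instructions and performing static task assignment.
A runtime system does its work while the program runs. It focuses on execution management, resource scheduling, and dynamic coordination, especially dynamic task assignment and managing multiple threads.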

Amdahl’s Law: dependencies limit maximum speedup due to parallelism

\(\text{speedup} = \frac{1}{s + \frac{1-s}{p}}\)

s: the fraction of execution that must run sequentially, e.g., because of dependencies
p: number of processors

Conclusion: A small serial region can limit speedup on a large parallel machine
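
To get a feel for the formula, here is a small sketch that evaluates it directly. The function name `amdahl_speedup` and the sample value s = 0.01 are assumptions picked for illustration.

```python
def amdahl_speedup(s, p):
    """Speedup predicted by Amdahl's Law for serial fraction s on p processors."""
    if not 0.0 <= s <= 1.0:
        raise ValueError("s must be a fraction between 0 and 1")
    if p < 1:
        raise ValueError("p must be at least 1")
    return 1.0 / (s + (1.0 - s) / p)


# Even with only 1% serial work, speedup flattens out as p grows:
# the limit as p goes to infinity is 1 / s, which is 100 here.
for p in (1, 10, 100, 1000):
    print(p, round(amdahl_speedup(0.01, p), 2))
```

As p grows, the (1 - s) / p term vanishes and the speedup approaches 1 / s, which is why a small serial region dominates on a large machine.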
