Porting a C Library for Fine Grained Independent Task Parallelism to Enea OSE RTOS
Master of Science Thesis
YIHUI WANG
Master's Thesis at Xdin AB
Supervisors: Barbro Claesson, Xdin AB; Detlef Scholle, Xdin AB
Examiner: Ingo Sander, KTH
TRITA-ICT-EX-2012:141
Acknowledgements

I would like to express my sincere gratitude to my supervisors, Detlef Scholle and Barbro Claesson from Xdin AB, for giving me the precious opportunity to carry out this master thesis and for their constant support. I would also like to thank Ingo Sander, my examiner from KTH, for the useful suggestions and important guidance in writing this report. I would like to give my thanks to the other thesis workers at Xdin, especially Sebastian Ullström and Johan Sundman Norberg, who helped me a lot with both the project and the thesis. The amazing time that we spent together has become one of my most precious memories in life. My thanks extend to everyone at Xdin AB and Enea AB, who were friendly and hospitable. Finally, I would like to thank my beloved family for their support throughout my entire life; I could not have come so far academically without them. In particular, I must acknowledge my friend Junzhe Tian for his encouragement and constant assistance throughout my master studies.
Yihui Wang Stockholm, Sweden June, 2012
Abstract

Multi-core opens an era of improving computation performance by executing instructions in parallel. However, the improvement in performance is not linear in the number of cores, because of the overhead caused by inter-core communication and unbalanced load over the cores. Wool provides a solution for improving the performance of multi-core systems. It is a C library for fine grained independent task parallelism developed by Karl-Filip Faxén at SICS, which helps to keep the load balanced over the cores by work stealing and leap frogging. In this master thesis project, Wool is ported to the Enea OSE real-time operating system, aiming at supplying an approach to improve the performance of the multi-core system. To reach this goal, multi-core architecture, task parallelism algorithms, as well as POSIX threads are studied. Besides, the hardware synchronization primitives defined by the processor are studied and implemented in Wool. The target hardware for this study is a Freescale P4080 board with eight e500mc cores. Wool is ported to this target with both the Linux and OSE operating systems. Finally, the porting is tested and verified.
Contents

List of Figures
List of Tables
List of Abbreviations

1 Introduction
  1.1 Background
  1.2 Problem statement
  1.3 Goals
  1.4 Method
  1.5 Contributions

I Theoretical Study

2 Multi-core
  2.1 Multi-core operating system
  2.2 Comparison with multi-processor & multi-threading
  2.3 Summary

3 Parallel Computing
  3.1 Parallel performance metrics
  3.2 Levels of parallelism
    3.2.1 Instruction-level parallelism (ILP)
    3.2.2 Thread-level parallelism (TLP)
  3.3 Parallel programming models
  3.4 Design in parallel
  3.5 Summary

4 Threads
  4.1 Thread overview
    4.1.1 Process & thread
    4.1.2 Threaded program models
  4.2 Thread management
  4.3 Mutual exclusion
  4.4 Conditional variables
  4.5 Synchronization
    4.5.1 Atomic operations
    4.5.2 Hardware primitives
  4.6 Threads on multi-core
  4.7 Summary

5 Wool
  5.1 Basic concepts
    5.1.1 Worker
    5.1.2 Task pool
    5.1.3 Granularity in Wool
    5.1.4 Load balance in Wool
    5.1.5 Structure
  5.2 Work stealing
  5.3 Leap frogging
  5.4 Direct task stack algorithm
    5.4.1 Spawn
    5.4.2 Sync
  5.5 Wool optimization
    5.5.1 Sampling victim selection
    5.5.2 Set based victim selection
  5.6 Other multi-threaded schedulers
  5.7 Summary

II Implementation

6 OSE
  6.1 OSE fundamentals
  6.2 OSE process & IPC
  6.3 OSE for multi-core
    6.3.1 OSE load balancing
    6.3.2 Message passing APIs
  6.4 OSE pthreads
  6.5 Summary

7 P4080 and TILE64
  7.1 Features
    7.1.1 Instruction set
    7.1.2 Memory consistency
    7.1.3 Memory barrier
  7.2 Freescale QorIQ P4080 platform
    7.2.1 P4080 architecture
    7.2.2 e500mc core
  7.3 Tilera's TILE64
    7.3.1 TILE64 architecture
    7.3.2 TILE

8 Porting Wool to P4080
  8.1 Design
    8.1.1 Storage access ordering
    8.1.2 Atomic update primitives
  8.2 Implementation
    8.2.1 Prefetch & CAS
    8.2.2 Gcc inline assembler code
    8.2.3 Enea Linux
  8.3 Experiment
    8.3.1 Experiment: Fibonacci code with and without Wool
    8.3.2 Analysis

9 Porting Wool to OSE
  9.1 Design
    9.1.1 Malloc library
    9.1.2 Pthread
  9.2 Implementation
    9.2.1 Load module configuration
    9.2.2 Makefile
    9.2.3 System information
    9.2.4 Application on OSE multi-core
  9.3 Experiments
    9.3.1 Experiment on OSE & Linux
    9.3.2 Experiment on the performance vs. the number of cores (1)
    9.3.3 Experiment on the performance vs. the number of cores (2)

10 Conclusions and Future Work
  10.1 Conclusions
    10.1.1 Theoretical study
    10.1.2 Implementation
    10.1.3 Result Discussion
  10.2 Future work

Appendices
A Demo Results
Bibliography
List of Figures

1.1 Implementation Steps
2.1 Multi-core Architecture
2.2 Architecture of SMP System
2.3 Architecture of AMP System
4.1 Mutex Structure [14]
4.2 Condition Variable [15]
5.1 Work Stealing Structure
5.2 Work Stealing
5.3 Leap Frogging
5.4 Spawn
6.1 Processes in Blocks with Pools in Domains [26]
6.2 Message Passing
6.3 MCE: Hybrid AMP/SMP Multiprocessing [31]
7.1 P4080
7.2 Tilera's TILE64 Architecture [38]
8.1 Enea Linux
8.2 Fib(5)
9.1 Fib(43) with Wool Library on Linux & OSE on P4080
9.2 Fib(12) with Large Task Body on Multiple Cores
List of Tables

2.1 Comparison of Multiple Processors and Multiple Cores
2.2 Comparison between Multiple Threads and Multiple Cores
3.1 Comparison of Shared Memory Model and Message Passing Model
4.1 Processes and Threads
4.2 Advantages and Disadvantages of Threads on Multi-core
7.1 Instruction Set Comparison
7.2 Memory Consistency Models
7.3 Memory Barriers & Instructions
8.1 Time for Fib(43) with Wool Library on the Linux X86 Platform
9.1 Fib(46) with Memalign, Malloc and Calloc
9.2 Fib(43) with Wool Library on Multiple Cores
10.1 Target
List of Abbreviations

Pthread: POSIX thread
OSE: Operating System Embedded
P4080: QorIQ P4080 Eight-Core Communications Processor
CMP: Chip Multi-Processor
SMP: Symmetric Multi-Processing
OS: Operating System
SMT: Simultaneous Multi-Threading
ILP: Instruction-Level Parallelism
TLP: Thread-Level Parallelism
API: Application Programming Interface
LIFO: Last In First Out
FIFO: First In First Out
IPC: Inter Process Communication
MCE: Multi-Core Embedded
ISA: Instruction Set Architecture
CISC: Complex Instruction Set Computer
RISC: Reduced Instruction Set Computing
MPPA: Massively Parallel Processor Array
Mutex: Mutual Exclusion
Fib: Fibonacci
Chapter 1
Introduction

1.1 Background

This thesis project is conducted at Xdin AB (http://xdin.com/), a technology and IT consulting company that develops and delivers competence for world-leading companies. The thesis is part of the MANY project (Many-core programming and resource management for high-performance embedded systems) hosted by ITEA2 (http://www.itea2.org). The objective of MANY is to provide scalable, reusable and quickly developed software for embedded systems. Multi-core became popular in the recent decade; it improves on single-core processors by means of parallel computing. However, the overhead caused by synchronization among processors becomes a problem as the number of cores keeps increasing. An efficient task parallelism scheduler is needed to decrease the communication overhead and improve the performance. Wool (http://www.sics.se/~kff/wool/) is a C library for fine grained independent task parallelism on top of POSIX threads (pthreads) [1], developed at SICS (Swedish Institute of Computer Science) by Karl-Filip Faxén. It can be regarded as a low overhead scheduler for concurrent programs. The demand for efficient parallel computing makes it important to port Wool to OSE (a real-time embedded operating system created by Enea AB) multi-core platforms to gain a better performance. This master thesis focuses on porting Wool to the Enea OSE real-time operating system.
1.2 Problem statement

Wool is a C library concentrating on reducing the communication overhead and keeping the load balanced across multiple cores. It is designed to give a better performance for multi-core systems, like the P4080 (http://www.freescale.com), because it helps to distribute the work load to each core with a low overhead. Wool currently works on Linux on X86 (a series of microprocessor instruction set architectures based on the Intel 8086 CPU). To port Wool to Enea OSE on the P4080, good knowledge of Wool task parallelism, the PowerPC instruction set and the OSE operating system is required. Differences between the operating systems and the hardware targets should be considered before the porting. The main problems are listed below.

P1: Wool task parallelism
Wool works on task parallelism. How tasks are assigned to different cores and cooperate to achieve a more efficient parallelism is an interesting topic. The modification of the Wool source code to fit the new platform is the main part of the project, so it is important to find out which parts of the source code should be modified to port Wool to the new platform without changing the task parallelism.

P2: Pthread
Wool is a C library based on POSIX threads (pthreads), a library provided by the operating system. The Linux operating system implements pthreads, and all the thread functions and data types are declared in the pthread header file. However, the Enea OSE operating system is based on OSE processes and message passing instead of pthreads. Though most pthread APIs are implemented and supported by OSE to make it easier to port pthread based applications, changes and reconfigurations of OSE pthreads remain problems in this project.

P3: Hardware synchronization primitives
Hardware synchronization primitives are used in the implementation of Wool, where they serve to optimize its performance. These primitives are defined by processor vendors. To port Wool to the new target P4080, one must add these primitives for the specified target. This hardware dependent code relies on the memory consistency model, the memory barriers and the type of instruction set. This low level assembly code must be embedded in the C program.
1.3 Goals

This project is conducted in two phases (see Figure 1.1):

Step 1: Port Wool to P4080.
1. Configure P4080 with a Linux kernel.
2. Modify the hardware dependent code of Wool using gcc inline assembler.
3. Verify Wool on P4080.

Step 2: Port Wool to OSE.
1. Configure OSE on P4080.
2. Modify OSE libraries, including the pthread library.
3. Verify Wool upon OSE pthreads.

Figure 1.1. Implementation Steps (Linux on X86 → Linux on P4080 → OSE pthreads on P4080)
1.4 Method

This thesis lasts for 20 weeks and includes two phases: theoretical study and implementation. The theoretical study is conducted during the first 10 weeks, in which background knowledge of Wool, OSE and P4080 is acquired by reading manuals and papers. In the second phase, the design and implementation of Wool on P4080 are carried out first, followed by testing and verification of the implementation. Finally, the performance of Wool is measured on the new platform and a demonstration is given. The platform is a Freescale board with the QorIQ P4080 Eight-Core Communications Processor, and the implementation is done according to Xdin AB standards.
1.5 Contributions

This master thesis involves the study of parallel programming on multi-core, POSIX threads and Wool's task parallelism strategies. Enea Linux has been set up on the Freescale P4080 platform. The source code related to the hardware dependent primitives has been changed to fit the new platform, and a set of tests has been performed to verify the results. Wool has also been ported to Enea OSE on the Freescale P4080 platform. The main problem is that some of the OSE libraries differ from the Linux libraries. To fix the library problem, additional configurations and functions are used. Wool is configured as a load module in the implementation. By compiling the Wool library together with the application code, the application gains a better performance. The performance of Wool has been tested on the new platform and the results are compared with those on Enea Linux.
Part I
Theoretical Study
Chapter 2

Multi-core

According to Moore's Law, chip performance doubles every 18 months [2]. To keep up with this trend, micro-processor vendors used to improve computer performance by increasing the clock frequency of single-core processors. However, this increase in performance has reached a bottleneck, due to power consumption and heat dissipation, which grow rapidly with the clock frequency. Multi-core therefore comes as a solution, improving performance with multiple execution units executing in parallel [3]. However, the massive parallelism makes a multi-core system more difficult to program, and inter-core dependencies can decrease its performance as well. A multi-core processor integrates more than one processing unit on a single chip [4]; each independent unit is called a "core". Multi-core is also called Chip Multi-Processor (CMP) because the cores fit on a single chip. The architecture of a multi-core processor generally includes several cores, one system bus, private or shared caches (there is always at least one cache level that is shared, which helps to speed up data transfers, reduce cache-coherency complexity and reduce data-storage redundancy) and a shared memory, as shown in Figure 2.1. It differs from a single-core processor in both architecture and parallel computing.

2.1 Multi-core operating system
Operating systems on multi-core are mainly divided into Symmetric Multi-Processing (SMP) and Asymmetric Multi-Processing (AMP). SMP requires homogeneous hardware with the same operating system on each core, while AMP is heterogeneous, with different operating systems running on different cores. There are also bare systems with no operating system running at all.
Figure 2.1. Multi-core Architecture (N cores, each with its own CPU and private L1 data and instruction caches, sharing an L2 cache, the system bus and memory)
SMP systems
SMP systems (see Figure 2.2) are the most commonly used form; they allow any core to work on any task regardless of where the data is located in memory. There is a single operating system (OS) image in an SMP system. Each processor is able to access the shared memory and complete a given task, in contrast to master and slave processors [5]. It is therefore easy to distribute tasks among cores to balance the workload dynamically.

Figure 2.2. Architecture of SMP System (one operating system image scheduling applications onto Core1 through CoreN)
AMP systems
AMP systems (see Figure 2.3) are used in some embedded devices where the cores are assigned different roles. Independent OS images are used on the cores, and each OS kernel has a dedicated local memory as well as shared memory. Tasks cannot easily be moved between processors in AMP, and resources cannot be treated uniformly across the system [5].

Figure 2.3. Architecture of AMP System (independent OS images, for example OSE on Core1 and Linux on the other cores, each running its own applications)
2.2 Comparison with multi-processor & multi-threading

Multi-core systems are unlike multi-processor systems and multi-threading. To illustrate multi-core systems clearly, we make a comparison below.
Multi-processor system
Multi-processor systems require more power because signals between cores are routed off-chip. Compared with a multi-processor system, a multi-core system is faster thanks to the inter-core bus and cache snooping [4]. The differences between them are shown in Table 2.1.

Multiple Processors                          | Multiple Cores
Two different chips connected by a bus       | Cores connected within one chip
Parallelism needs external software support  | Processes run in parallel automatically
Higher heat consumption                      | Less heat consumption
-                                            | Lower package cost

Table 2.1. Comparison of Multiple Processors and Multiple Cores.
Multi-threading
Simultaneous Multi-Threading (SMT) belongs to instruction level parallelism and can run on both single-core and multi-core processors. Multi-threading is used to avoid pipeline stalls: if one thread is waiting for a hardware resource to respond, another thread can take over and execute without waiting. However, if both threads are waiting for the same hardware resource, they will stall. The differences between multi-core and SMT are shown in Table 2.2.

Multiple Threads                 | Multiple Cores
Instruction level parallelism    | Thread level parallelism
Share one core and L1 cache      | More than one set of cores and L1 caches
One large super-scalar core      | Several cores

Table 2.2. Comparison between Multiple Threads and Multiple Cores
2.3 Summary

Multi-core allows microprocessors to achieve a boost in performance without increasing the power consumption and the complexity of the hardware components. However, the improvement in performance can only be achieved when the multi-core architecture is combined with parallel programming. Multi-core confronts designers with difficulties like cache coherency and inter-communication overhead; if these problems are handled improperly, performance may even decrease. To summarize, software designers should come up with an efficient parallel strategy to take full advantage of the multi-core architecture.
Chapter 3

Parallel Computing

Parallel computing executes computations simultaneously, in contrast to serial computing, which executes one instruction at a time. The main idea of parallelism is to break a problem into small parts, assign them to different execution units, execute them simultaneously and complete the problem faster [6]. A highly efficient parallel computation can improve the performance of the system, but programming parallel hardware is difficult because of the massive parallelism. Parallel computing models and challenges are discussed in this chapter.

3.1 Parallel performance metrics
There are various ways to evaluate the improvement of performance in a parallel computation. The actual speedup of a parallel computation can be measured as

\[ S_p = \frac{T_{\mathrm{serial}}}{T_{\mathrm{parallel}}} \]

The upper bound of this speedup equals the number of CPUs. However, the speedup is not only limited by the number of CPUs but also by the proportion of the program that can be parallelized, which can be calculated with Amdahl's law. By splitting a program into a serial part and a parallel part, the maximum speedup is

\[ S_p \le \frac{1}{(1 - P) + \frac{P}{N}} \]

where P is the proportion of the program that can be made parallel and N is the number of cores.

The upper limit of the speedup is determined by the serial fraction of the code, so by parallelizing as much as possible it is possible to get a better performance [4]. The basic idea of parallelism is to break a program of instructions into small pieces and execute them simultaneously. However, as the number of cores increases, the performance does not increase linearly, due to the overhead caused by cache effects, extra synchronization and bus contention [7]. Therefore tasks (pieces of a program) need to be large enough to reduce the overhead, but not so large that there is not enough work to run in parallel and keep the load balanced.
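As a concrete illustration (this worked example is mine, not from the thesis), the following C snippet evaluates Amdahl's law for a program whose parallelizable fraction is P = 0.9: even with eight cores the predicted speedup is only about 4.7.

    #include <stdio.h>

    /* Amdahl's law: upper bound on speedup for a program with
     * parallel fraction p running on n cores. */
    static double amdahl_speedup(double p, int n)
    {
        return 1.0 / ((1.0 - p) + p / (double)n);
    }

    int main(void)
    {
        double p = 0.9;   /* assume 90% of the work can be parallelized */
        for (int n = 1; n <= 8; n *= 2)
            printf("cores = %d, max speedup = %.2f\n", n, amdahl_speedup(p, n));
        /* prints 1.00, 1.82, 3.08 and 4.71 for 1, 2, 4 and 8 cores */
        return 0;
    }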
3.2 Levels of parallelism

There are different ways to implement parallelism to improve performance. By using parallel programming, we can decrease the time needed for a single problem and increase the throughput at the same time.

3.2.1 Instruction-level parallelism (ILP)
Instructions can be overlapped and executed in parallel when they are independent. With ILP, processors reorder the instruction stream, decompose instructions into sub-instructions and execute multiple instructions simultaneously. ILP is implicit parallelism, which reduces the latency of memory accesses by applying pipelining techniques (splitting an instruction into several stages: instruction fetch, instruction decode, execution, memory access and write back) and super-scalar architectures. Furthermore, by reordering instructions, processors can perform useful work instead of stalling on data and instruction dependencies [8].

3.2.2 Thread-level parallelism (TLP)
TLP is the main form of parallelism in high performance multi-core or multi-processor systems. TLP means that when one thread is idle waiting for a memory access, another thread is started and run immediately, so that the pipeline stays busy all the time. In addition, the throughput of the system is increased.
3.3 Parallel programming models
Parallel programming models exist as an abstraction above hardware and memory architectures [9]. They can be applied to any type of hardware or memory architecture. For example, a shared memory model on a distributed memory machine appears as shared memory to the users but is physically distributed. Several parallel programming models are stated below.

Shared memory
In this model, programs share a common piece of memory, which is accessed by the programs asynchronously. A typical example in this model is a global variable, which can be accessed and modified by all the programs. Shared variables can be protected by various mechanisms like locks (see Section 4.5), which help to control the accesses to them [9].

Message passing
A distributed memory model is used here. Each processor owns its data in a private memory, and different processors exchange data by sending and receiving messages. The processors need to cooperate with each other; for example, a send must match a receive operation [9]. A comparison of the shared memory model and the message passing model is shown in Table 3.1.

                | Shared memory              | Message passing
Communication   | load and store             | send and receive
Memory          | shared or private          | private
Synchronization | required when sharing data | implicit

Table 3.1. Comparison of Shared Memory Model and Message Passing Model
3.4 Design in parallel

Though parallel computing improves performance, challenges like deadlock and race conditions exist because of the competition for shared resources (processors, memory or devices) and sequential dependencies (data dependency, communication and synchronization). To cope with these challenges, synchronization primitives are used, see Section 4.5. When designing in parallel, the following aspects should be paid attention to.

Decomposition
To convert a serial program into a parallel one, the first step is to break the program into several pieces. There are two ways to implement decomposition: domain decomposition and functional decomposition. Domain decomposition means that each task works on a partition of the data, while functional decomposition means that each task performs a partition of the overall work [10].

Inter-communication
If the partitioned tasks are assigned to different processors and they share data, communication must be performed, which introduces both latency and overhead. The cost of communication needs to be considered before decomposition. A process that stalls while waiting for an event or some data is called blocking; otherwise, the process is non-blocking and can make progress without suspending while waiting.

Synchronization & data dependency
Synchronization is used to coordinate different tasks and ensure data correctness. Most synchronization operations, like semaphores, are based on memory barriers. A memory barrier makes the memory operations before the barrier visible to all tasks before execution continues. Applications with little data dependency benefit most from a multi-core system, because fewer of the costly synchronizations are needed.

Load balancing
In a multi-core system, the overall performance is limited by the slowest CPU. So it is important for each task to have a similar work load, keeping the workload on each CPU balanced. Dynamic work assignment can be used to keep the load balanced.

Granularity
Granularity is a quantitative measure of the ratio of computation to communication. According to the amount of work done between communication events, granularity is divided into fine-grain and coarse-grain parallelism. Fine-grain parallelism consists of a larger number of small tasks and requires more communication. It helps to keep the load balanced, at the cost of higher overhead.
3.5 Summary

To achieve a better performance on multi-core, one should convert a serial program into several fragments and execute them concurrently. Parallel computing enables concurrent computation but increases the programming complexity, while different levels of parallelism can be combined to gain a better performance. Designers need to pay attention to the parallel challenges. Parallel programs are designed upon parallelism models; the shared memory model and the message passing model are commonly used to implement parallelism. When applying the shared memory model, one should synchronize between processes to keep memory coherent and avoid other problems like deadlock. Solutions to these problems are given in Chapter 4.
Chapter 4

Threads

Threads are commonly used to implement parallelism in a multi-core system with shared memory. They are used to increase the throughput and decrease the idle time of the processing units. There are several versions of threads; the most commonly used are POSIX threads (also called pthreads), a standardized C language thread API specified by the IEEE POSIX 1003.1c standard. Linux implements native POSIX threads, and our discussion is limited to pthreads in this chapter [11]. The pthread APIs can be grouped into four groups: thread management, mutex variables, condition variables and synchronization. The features of the pthread APIs make it simpler to guarantee memory coherency and implement synchronization.

4.1 Thread overview
A thread is an independent stream of instructions that can be scheduled to run as such by the operating system [12]. Threads are defined in the program environment and initialized by the compiler; they run simultaneously and independently within a process and are managed by the operating system. Threads are used to improve performance with less overhead and fewer system resources compared with processes. The actual execution unit of a thread is a processor [11].

4.1.1 Process & thread
A process includes the program in execution and all the resources that the program involves. Processes are created by the operating system together with a large amount of information, like the program, control words, data and status. A thread is started by a process and is the basic element of a process running as a logical flow. Several threads may execute in parallel within one process, which is then called a multi-threaded process. Each thread is independent because it has its own thread context, thread ID, pointer, stack, registers and condition word, and threads are recognized by their thread IDs. A comparison between threads and processes is listed in Table 4.1.

Category          | Process                                                  | Thread
Address space     | A process has its own address space protected by the OS  | Threads in the same process share the same address space
Interaction       | Shared locations in the operating system                 | Shared memory within the process
Context switching | Heavy: the entire process state must be preserved        | Light: only the current register state needs to be saved

Table 4.1. Processes and Threads
Managing a thread requires fewer system resources and less overhead than managing a process. Multiple threads can overlap CPU work with I/O operations (while one thread waits for I/O, another thread is executed by the CPU). On the other hand, shared resources within a process must be synchronized, which increases the difficulty of writing and debugging the program.

4.1.2 Threaded program models
Parallel programming is well suited to threaded programs. Methods for designing parallel programs (like decomposing a serial application into small independent tasks) are given in Section 3.4. Here we discuss threaded program models.

Manager/worker: A manager is a single thread which assigns work to other threads (workers). The manager takes charge of task assignment. The size of the worker pool can be either static or dynamic.

Pipeline: A task is broken into smaller pieces; each thread takes care of one piece of code and executes it in series. Different threads work concurrently and the task is executed in parallel.

Peer: Unlike the manager/worker model, threads are equal in the peer model. Each thread, a worker, can assign tasks to the others.
4.2 Thread management

There are four states of a thread: ready, running, blocked and terminated. The state can be changed by the thread management APIs and by the status of shared resources. Threads can be created and attributes (joinable, scheduling) can be assigned with these routines.
Creating and terminating threads There is one default thread per process, while other threads are created and initialized by the default one. Each thread is named by a unique ID (identifier). The threads become peers after they are created, which means there is neither hierarchy nor dependency between them, and each thread can create new threads. Once its tasks are done, a thread terminates itself; threads can also be terminated by other threads or by the main function. Thread attributes can be set through the arguments of the routines [11].
Joining and detaching threads Joining is a way to synchronize threads. Once worker threads complete their tasks, they are joined with their master thread. Some threads need not be joined; they can be detached to free system resources [11].
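The following minimal sketch shows thread creation and joining with the pthread API (my own illustrative example, not code from the thesis; error handling is omitted):

    #include <pthread.h>
    #include <stdio.h>

    /* Worker function: each thread receives its own id as argument. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("worker %ld running\n", id);
        return (void *)(id * 2);        /* value collected by pthread_join */
    }

    int main(void)
    {
        pthread_t threads[4];
        for (long i = 0; i < 4; i++)
            pthread_create(&threads[i], NULL, worker, (void *)i);

        for (long i = 0; i < 4; i++) {
            void *result;
            pthread_join(threads[i], &result);  /* wait for the thread and fetch its result */
            printf("worker %ld returned %ld\n", i, (long)result);
        }
        return 0;
    }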
4.3 Mutual exclusion
Mutex is short for "mutual exclusion", which is used to protect data when multiple write operations occur. A mutex is one way to implement thread synchronization: once a thread enters a critical section, other threads must wait until it has finished. Even if multiple threads ask for the same mutex, only one of them will succeed. This is useful in parallel execution because data on the critical path can only be modified by one thread at a time, so that races are avoided [11]. A mutex is implemented as a lock (see Section 4.5). Once it is created and initialized, threads attempt to lock it. While one thread succeeds in locking it and works in the critical section, the losing threads block on that call, or they can avoid blocking by using trylock instead of lock. After a thread completes its work in the critical section, the mutex is released and available for other threads to lock [13]. The structure of a mutex is shown in Figure 4.1.
Figure 4.1. Mutex Structure [14]
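A small sketch of the lock/trylock/unlock protocol described above (my own example, not taken from the thesis):

    #include <pthread.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static long shared_counter = 0;     /* data protected by the mutex */

    /* Blocking variant: wait until the mutex is available. */
    void increment(void)
    {
        pthread_mutex_lock(&lock);
        shared_counter++;               /* critical section */
        pthread_mutex_unlock(&lock);
    }

    /* Non-blocking variant: give up immediately if another thread holds the lock. */
    int try_increment(void)
    {
        if (pthread_mutex_trylock(&lock) != 0)
            return 0;                   /* lock is busy, do something else */
        shared_counter++;
        pthread_mutex_unlock(&lock);
        return 1;
    }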
4.4 Conditional variables
Conditional synchronization means that a thread stays blocked until the system satisfies a condition. In other words, it makes the thread wait until the condition is satisfied, like waiting for a notification. It is another way to implement synchronization. Conditional variables differ from mutexes because conditional synchronization synchronizes threads based on variable values instead of controlling thread accesses to protected data. Furthermore, multiple threads may be permitted access to the condition at the same time. The conditions are specified by the programmer [11].
Figure 4.2. Condition Variable [15]
A condition is implemented by sharing the status of a condition on the shared data. When the specific condition is satisfied, the waiting thread is woken up. There are three operations on conditional variables, all of them atomic: wait(L), which releases the lock and waits until the condition is signaled; signal(L), which wakes up one waiting thread while the lock is still held by the signaling thread; and broadcast(L), which wakes up all threads that are blocked waiting on the condition. While a thread is waiting for a conditional variable, it is blocked until the condition is signaled. In Figure 4.2, one thread waits on the condition "ready", then wakes up and proceeds, while the other thread signals the condition "ready".
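The wait/signal pattern of Figure 4.2 corresponds roughly to the following pthread sketch (my own example; the predicate ready and the function names are illustrative):

    #include <pthread.h>

    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  ready_cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;               /* the shared condition */

    /* Thread A: block until another thread has signaled "ready". */
    void wait_until_ready(void)
    {
        pthread_mutex_lock(&m);
        while (!ready)                          /* re-check: wakeups may be spurious */
            pthread_cond_wait(&ready_cond, &m); /* releases m while waiting */
        pthread_mutex_unlock(&m);
    }

    /* Thread B: set the condition and wake one waiter (or all, with broadcast). */
    void set_ready(void)
    {
        pthread_mutex_lock(&m);
        ready = 1;
        pthread_cond_signal(&ready_cond);       /* pthread_cond_broadcast wakes all */
        pthread_mutex_unlock(&m);
    }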
4.5 Synchronization

In a multi-core system, threads share memory and other resources, which requires synchronization to coordinate the parallel tasks, including serializing memory instructions, protecting shared resources and waiting for multiple tasks to reach a specified point. Synchronization keeps threads coherent and protects shared memory by constraining relative instruction orderings. At the same time, synchronization can be a major factor in decreasing parallel speedup, because tasks have to wait for each other's completion.

4.5.1 Atomic operations
An atomic operation is a simple way to achieve synchronization by working on data types [13]. Atomic operations are performed with no possibility of interruption: an operation is visible as either completed or not started, with no intermediate state [16]. Atomic operations are used to implement other synchronization primitives, like semaphores. They do not block competing threads when they access shared data, which makes it possible to achieve a better performance [13].
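For example, gcc exposes atomic operations as built-in functions that map to the processor's atomic instructions. The sketch below uses the legacy __sync built-ins; it is an illustration of the concept, not code from Wool:

    /* Atomic increment: no lock, no possibility of observing a half-done update. */
    long counter = 0;

    void atomic_increment(void)
    {
        __sync_fetch_and_add(&counter, 1);
    }

    /* Compare-and-swap: write new_val only if the location still holds old_val.
     * Returns non-zero on success; the check and the update form one atomic step. */
    int try_update(long *loc, long old_val, long new_val)
    {
        return __sync_bool_compare_and_swap(loc, old_val, new_val);
    }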
4.5.2 Hardware primitives
Memory barriers
To achieve a better performance, both CPUs and compilers reorder instructions to keep the pipeline fully utilized. However, after such reordering optimizations, the instructions which access shared memory may be performed out of order, which can lead to incorrect results, especially when there are data dependencies. A memory barrier is a common way to synchronize threads on multi-core.
It is a non-blocking mechanism, implemented by instructions that ensure memory accesses are performed in the expected order, by forcing the processor to see a load or store positioned in front of the barrier before it sees the ones after the barrier [13]. In a word, it enforces ordering on the memory operations of one thread and guarantees that all other threads have a consistent view of memory in the system. Memory barriers are always defined and specified by the processor.
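A sketch of how a barrier is typically used from C (my own example): gcc's __sync_synchronize() built-in emits a full barrier for the target processor, and on PowerPC targets such as the e500mc the same effect can be written as inline assembly, although the exact choice of barrier instruction depends on the core and on the ordering that is required.

    int data = 0;
    volatile int flag = 0;

    /* Producer: make sure 'data' is globally visible before 'flag' is set. */
    void publish(int value)
    {
        data = value;
        __sync_synchronize();   /* full memory barrier (gcc built-in) */
        /* On PowerPC this could also be written as:
         *   __asm__ __volatile__ ("sync" ::: "memory");
         */
        flag = 1;
    }

    /* Consumer: only read 'data' after seeing 'flag', with a barrier in between. */
    int consume(void)
    {
        while (flag == 0)
            ;                   /* spin until the producer has published */
        __sync_synchronize();
        return data;
    }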
4.6 Threads on multi-core

Multiple threads on multi-core differ from threads on a single-core system, see Table 4.2. Threads on multi-core need not wait for resources (like CPUs) because they run independently on their own cores and do not compete for resources such as the floating point unit. There are two main differences between multi-threading on single-core and on multi-core: caching and priority.

Advantages                                 | Disadvantages
Increased performance                      | Data races
Better resource utilization                | Deadlocks
Efficient data sharing                     | Code complexity
Fewer resources needed to change contexts  | Portability issues
Communication between tasks is simple      | Testing and debugging difficulty

Table 4.2. Advantages and Disadvantages of Threads on Multi-core
Cache synchronization becomes an issue for threads on multi-core, because shared memory modified by two threads may cause their caches to interfere with each other [13]. Besides, threads with higher priority cannot ignore lower priority ones in a multi-core system, because they can execute in parallel, which may lead the system to an unstable state.
4.7 Summary

The use of threads enables multi-core systems to achieve performance boosts by parallelizing computations. Threads on multi-core need to be synchronized to keep private caches coherent and protect shared resources. Most synchronization techniques, like locks and atomic operations, generally involve the use of memory barriers and kernel-level synchronization [13], which are hardware dependent. However, synchronization ensures correctness at the sacrifice of performance: the stricter a barrier is, the more costly it is.
Parallel overhead is the barrier to a desired speedup. This overhead and latency are caused by thread creation, synchronization, scheduling mechanisms as well as inter-communication. To get a desired speedup, one should reduce bus contention as well as communication overhead. Therefore, a scheduler for a multi-core system is required to keep the load balanced over the threads (to make the threads work in full parallelism) without inducing much overhead. It is up to the designer to make the final trade-off between sufficient parallelism and parallel overhead.
Chapter 5

Wool — A low overhead C scheduler

To achieve a good performance on multi-core, problems are broken down into independent tasks that execute concurrently. It is of great importance to apply an efficient scheduler to multi-threaded computations to keep the load balanced with a low overhead. Work stealing and leap frogging are dedicated to scheduling multi-threaded computations, and help the system achieve a good performance by distributing work to underutilized processors with a minimized communication overhead between threads. Wool is a C library applying work stealing and leap frogging, developed by Karl-Filip Faxén at SICS, aiming at improving the performance of multi-core systems by distributing sequential programs over multiple cores. Wool is a low overhead user level task scheduler, providing lightweight tasks on top of pthreads [1] (see Section 3.3). According to the test results shown in his paper [17], the performance of Wool is comparable with that of Cilk, Intel TBB and OpenMP on an eight core system.

5.1 Basic concepts

5.1.1 Worker
Each thread acts as a worker, and all workers are initialized with the same attributes. In Wool, each physical processor is assigned one thread by default, which means that each processor owns one worker. It is possible for each processor to have multiple workers, but one worker per processor is recommended, because context switches and cache collisions on the same processor would increase the overhead. On the other hand, the number of workers should be large enough to keep each processor busy [1]. Workers act as peers in Wool (see Section 4.1.2) and take care of task management, each with a pool of tasks that are ready to execute [17].
5.1.2 Task pool

A task pool is a pool of tasks which are managed by a worker. As the task pool grows and shrinks dynamically, Wool implements it as a stack (a double-ended queue) [17]. Newly spawned tasks are placed on top of the pool, and old tasks can be stolen by other workers from the bottom of the queue; more details are given in Section 5.1.5.

5.1.3 Granularity in Wool
As mentioned in Section 3.4, granularity measures the ratio of computation to communication, which reflects the performance of task parallelism. The efficiency of parallelism can be measured in two ways: task granularity

\[ G_t = \frac{T_s}{N_t} \qquad (5.1) \]

and load balancing granularity

\[ G_l = \frac{T_s}{N_m} \qquad (5.2) \]

where T_s is the sequential execution time (with no task overheads), N_t is the number of tasks spawned, and N_m is the number of task migrations, in our case the number of steals.

Task granularity represents the average useful work per task; lower task granularity means higher overhead. Load balancing granularity measures the average useful work per steal [1]. Wool uses a fine grained parallelism strategy, which is good for load balancing. There is a finest grain constant that defines the smallest sequential unit of work to execute. It is mainly used in the loop constructs, where several iteration tasks may be executed on one worker as sequential tasks instead of parallel tasks if the cost of one iteration is less than the finest grain constant. By defining this constant, Wool increases the load balancing granularity while keeping the task granularity low.

5.1.4 Load balance in Wool
The concept of load balance was given in Section 3.4. Load balance is an important factor in the performance of the system: a good scheduler distributes work evenly over the processors. In Wool, work stealing is defined so that tasks are stolen from the bottom of the task pool, which is located closest to the root. In this way, fewer steals are needed to distribute work evenly, because of the relatively larger parallel regions [18]. A deeper discussion is given in Section 5.2.

5.1.5 Structure
A multi-threaded computation consists of multiple threads, with one thread per processor. A thread (which usually starts one worker) takes care of a pool of tasks organized as a double-ended queue, with two pointers pointing to the top and the bottom of the queue respectively. Newly created tasks are added to the top of the queue and are ready to execute, so the queue is also called a ready queue. A ready queue is treated as a stack by its own processor, which pushes and pops tasks at the top of the queue, LIFO (last in, first out). But it is regarded as a queue by other processors (which attempt to steal work), and tasks are stolen from the bottom of the queue, FIFO (first in, first out) [19]. Tasks are initialized and ready to execute when they are spawned. Newly created tasks are always inserted at the top of the task pool and are called child tasks, while the tasks that create the new tasks are called parent tasks. Parent tasks are located closer to the root than child tasks in a given task pool. The work stealing model of Wool is shown in Figure 5.1.
5.2 Work stealing
Work stealing is an algorithm for task parallelism and load balancing across processors [20]. Whenever a worker has no tasks to do, it attempts to steal tasks from another worker: the worker whose task is stolen is called the victim, and the stealing worker is called the thief. The victim is chosen randomly. If the victim has no tasks in its own task pool, the steal attempt fails and the thief chooses another victim. In Wool, workers scan victims linearly, starting from a randomly chosen victim [20]. Once a thief steals successfully, it executes the task and returns the result to the victim, and the steal process ends. In this way, underutilized processors keep working instead of idling while other processors are busy.
Figure 5.1. Work Stealing Structure (each worker pushes and pops tasks at the top of its own task pool, while other workers steal from the bottom)
A worker is in one of three states.

Working: the worker is working on a task.
Stealing: the worker has no task to do and starts to steal; it keeps stealing until it gets a task to do.
Spin: the worker stalls and waits for the result of a stolen task which has not been finished yet.

The work stealing algorithm starts with only the main task in the ready queue. The main task spawns tasks and the other workers start to work by stealing tasks. The work stealing process is shown in Figure 5.1. Assume that at some point there are four tasks in the system: three tasks in Worker1 and one task in WorkerN, and no task in Worker2. In Worker1, Task A is the parent task, and the tasks on top of it are child tasks. Figure 5.2 shows the next time stamp: Worker1 spawns a new task (Task E) and puts it on top of its task pool; Worker2 has no work to do, so it tries to steal from a randomly chosen worker, finds a task in Worker1 and steals it (Task A), thereby becoming a thief; WorkerN has its own task, so it executes it (Task D). In the next time stamp, suppose that Task A spawns the new tasks Task F and Task G. They are children of Task A, so they are put in the same worker as their parent (Task A), that is, in Worker2.
Figure 5.2. Work Stealing (Worker2, the thief, has stolen Task A from the bottom of Worker1's pool and holds its children Task F and Task G; Worker1 keeps Tasks B, C and E; WorkerN executes its own Task D)
Compared with work sharing
Work stealing is a lazy scheduling strategy: a task migrates to another processor only when that processor has become idle and steals it. This gives a lower overhead than eager scheduling, which keeps assigning tasks in order to maintain load balance. A typical eager scheduling strategy is work sharing, which attempts to migrate some of the tasks to other processors as soon as new tasks are generated, in order to distribute the work load [21]. One advantage of a work stealing algorithm is that tasks are migrated less frequently than with work sharing. Another advantage is that, by stealing tasks from the bottom of the queue, parent tasks are stolen, which will generate child tasks later. Because parent tasks are close to the root, it is more efficient to steal one parent task than several child tasks. This helps to keep the load balanced and reduces overhead by reducing the number of steals.
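The behaviour described in this section can be summarized by the following simplified worker loop (an illustrative sketch in C form written for this text; the types and helper functions are placeholders and do not correspond to the real Wool sources):

    /* Illustrative sketch of a work-stealing worker loop, not Wool code. */
    struct task;
    struct worker;

    struct task *pop_top(struct worker *w);        /* take newest task from own pool  */
    struct task *steal_bottom(struct worker *v);   /* take oldest task from a victim  */
    void execute(struct task *t);                  /* run a task (may spawn new ones) */
    int random_index(int n);                       /* random number in [0, n)         */

    void worker_loop(struct worker *self, struct worker **workers, int n_workers)
    {
        for (;;) {
            struct task *t = pop_top(self);                 /* own pool: LIFO */
            if (t == NULL) {
                /* No local work: pick a random victim, then scan linearly. */
                int start = random_index(n_workers);
                for (int i = 0; i < n_workers && t == NULL; i++) {
                    struct worker *victim = workers[(start + i) % n_workers];
                    if (victim != self)
                        t = steal_bottom(victim);           /* victim's pool: FIFO */
                }
            }
            if (t != NULL)
                execute(t);
        }
    }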
5.3 Leap frogging
Leap frogging is a way to optimize the work stealing algorithm. It helps to enhance load balance and improves performance by reducing the blocking time. In the work stealing algorithm, the victim is blocked and must wait for the result if one of its tasks has been stolen and is still in execution. Leap frogging lets the waiting victim steal tasks back from the thief, thereby helping to finish its own task. In this way it avoids deadlock and solves the memory problem, since the task that the victim steals back is always a task that the victim would have executed if no steal had happened [23]. Stealing work from a random worker instead has a number of drawbacks; for instance, a task pool may grow beyond its size, since stealing adds a new stack on top of the blocked task. The leap frogging process is shown in Figure 5.3, which is the next time stamp after Figure 5.2. After Worker2 steals Task A, Task A spawns two more tasks: Task F and Task G. At the same time, Worker1 completes Task E, Task C and Task B, and WorkerN still works on Task D. As there is no task left in Worker1, it begins to synchronize on Task A, which has been stolen by Worker2 and is still in execution. Instead of waiting for Task A to complete, Worker1 tries to steal tasks back from the thief and helps to finish Task A (a parent task can only finish once all its child tasks are done). So Task F is stolen back, because it is at the bottom of the ready queue. In this way, the workers can work in parallel and the load balance is enhanced. However, if the task which the victim steals back is very small and no other tasks can be stolen back, the victim will spin and wait for the result, so parallelism is lost in this extreme situation.
Figure 5.3. Leap Frogging (while waiting for the stolen Task A to finish, Worker1 steals Task F back from the thief Worker2; WorkerN continues with Task D)
5.4 Direct task stack algorithm

The Wool scheduler uses macros and inline functions to implement independent task parallelism. The basic operations are spawn and sync, which behave like asynchronous function calls. With a spawn, a task is created and put into the worker's ready queue, but it executes only when control reaches the corresponding sync, rather than being executed immediately [17]. The task may be executed either by the processor which spawned it, or by a thief who steals it.
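To give a feel for the programming model, the classic Fibonacci example looks roughly as follows in Wool (adapted from the Wool paper and documentation; the exact macro spellings and the way main is declared may differ between Wool versions):

    #include <stdio.h>
    #include <stdlib.h>
    #include "wool.h"

    /* A two-way recursive task: fib(n-1) is spawned as a separate task while
     * fib(n-2) is called directly; SYNC waits for (or inlines) the spawned task. */
    TASK_1( int, fib, int, n )
    {
        if( n < 2 ) {
            return n;
        } else {
            int a, b;
            SPAWN( fib, n - 1 );      /* push a task on this worker's task pool  */
            b = CALL( fib, n - 2 );   /* ordinary (inlined) call                 */
            a = SYNC( fib );          /* pop/execute, or fetch the stolen result */
            return a + b;
        }
    }

    TASK_2( int, main, int, argc, char **, argv )
    {
        int n = atoi( argv[ 1 ] );
        printf( "fib(%d) = %d\n", n, CALL( fib, n ) );
        return 0;
    }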
5.4.1 Spawn
By spawning a task, the task is initialized and its space is allocated. The task is put into the pool of the worker that spawns it. The task is neither executed nor returns a value until control reaches a sync. Any task in the task pool can be stolen by other workers. If it has not been stolen when control reaches the corresponding sync, it is executed inline by the same worker that spawned it; otherwise, the worker gets the result back from the thief [1].

Figure 5.4. Spawn (a newly spawned task, Task4, is pushed onto the top of the worker's task pool and the top pointer moves one step up)
A spawn operation performs the following steps. First, it allocates a task descriptor on top of the task pool and initializes the descriptor with the input arguments. Secondly, it initializes the status of the task to NOT STOLEN; this field changes after the task is stolen by another worker. Then it records the owner of the task: if the task is not stolen, the owner is the worker itself, otherwise the owner is a thief. This field is important when synchronizing the task or leap frogging. A task with a wrapper function (the wrapper executes the task and returns the value of the task function) can either be stolen by others or inlined by its own worker. At the end of the process, the top pointer is moved one step up and points to the first blank space, where the next spawned task will be put [19].
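Conceptually, spawn manipulates the task descriptor and the top pointer along the following lines (a simplified sketch of the steps above; the type and field names are mine and do not match the real Wool implementation, which also needs careful store ordering so that a thief never sees a half-initialized descriptor):

    /* Simplified, illustrative view of a task descriptor and the spawn operation. */
    enum task_state { NOT_STOLEN, STOLEN, DONE };

    struct task_desc {
        void          (*wrapper)(struct task_desc *); /* runs the task, stores result */
        void           *args;                         /* task arguments               */
        enum task_state state;
        struct worker  *owner;                        /* worker or, later, the thief  */
    };

    struct worker {
        struct task_desc pool[1024];   /* task pool, organized as a stack */
        int              top;          /* index of the first free slot    */
    };

    static void spawn(struct worker *w, void (*wrapper)(struct task_desc *), void *args)
    {
        struct task_desc *t = &w->pool[w->top]; /* 1. allocate descriptor on top    */
        t->wrapper = wrapper;
        t->args    = args;                      /*    ... and fill in the arguments */
        t->state   = NOT_STOLEN;                /* 2. mark as not stolen            */
        t->owner   = w;                         /* 3. owner is the spawning worker  */
        w->top++;                               /* 4. move the top pointer up       */
    }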
5.4.2 Sync
A pair of spawn & sync works like a LIFO (last in, first out) stack: a spawn pushes a task onto the task pool while a sync pops the top task off. When the code reaches a sync, it finds the most recently spawned and not yet synchronized task. By synchronizing, a task is popped from the task pool and handled in one of the following ways, depending on the situation (see the sketch after this list).

Case 1: If the task has not been stolen, the worker itself executes it by a direct call.
Case 2: If the task has been stolen and finished, the result is returned to the victim.
Case 3: If the task has been stolen but not finished, leap frogging is performed.
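The three cases can be expressed in code form as follows (again a simplified sketch that reuses the illustrative types from the spawn sketch in Section 5.4.1, not actual Wool code):

    /* Simplified sync: pop the most recently spawned, unsynchronized task. */
    void leap_frog(struct worker *w, struct task_desc *t);   /* see Section 5.3 */

    static void sync_top(struct worker *w)
    {
        struct task_desc *t = &w->pool[--w->top];   /* pop the top task */

        if (t->state == NOT_STOLEN) {
            t->wrapper(t);          /* Case 1: not stolen, execute it inline       */
        } else if (t->state == DONE) {
            /* Case 2: stolen and finished, the thief has stored the result
             * in the descriptor; simply pick it up.                               */
        } else {
            leap_frog(w, t);        /* Case 3: stolen but unfinished, steal work
                                       back from the thief while waiting           */
        }
    }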
5.5 Wool optimization

Work stealing and leap frogging reduce the communication overhead and keep the load balanced in a multi-core system. However, as the number of cores increases to 50 or more, the random work stealing algorithm leads to significant overhead which affects the work stealing efficiency and cannot be ignored. Non-random victim selection has been proposed to solve this problem. Two advanced algorithms are discussed in this section: sampling victim selection and set based victim selection.

5.5.1 Sampling victim selection
Instead of stealing work from the first victim the thief finds, a thief samples several workers before each steal operation and steals the task that is closest to the root. In the task tree, tasks closer to the root are typically larger, so stealing them contributes to load balance with less overhead (fewer steal operations take place because larger tasks are stolen). Sampling and selecting victims takes extra time, but less than the time it takes to complete one steal operation. When the number of cores grows large, for example to 128 cores, the performance is expected to improve considerably [18].
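A sketch of this idea in C follows; NUM_SAMPLES, depth_of_top and steal_from are hypothetical names used only for illustration, and the depth of a worker's topmost task is assumed to be a usable measure of its distance from the root.

#include <stdlib.h>

#define NUM_SAMPLES 4

int depth_of_top(int victim);      /* depth of the victim's topmost task, or -1 if empty */
int steal_from(int victim);        /* try to steal from the chosen victim */

int sampled_steal(int self, int num_workers)
{
    int best = -1;
    int best_depth = 1 << 30;

    for (int i = 0; i < NUM_SAMPLES; i++) {       /* sample a few workers */
        int v = rand() % num_workers;
        if (v == self)
            continue;
        int d = depth_of_top(v);
        if (d >= 0 && d < best_depth) {           /* shallower means closer to the root */
            best_depth = d;
            best = v;
        }
    }
    return best >= 0 ? steal_from(best) : -1;     /* nothing found: give up this round */
}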
5.5.2 Set based victim selection
When the number of thieves is large, contention for the task pools is fierce, which means many more steal attempts than successful steals. In this case, each thief only steals from a subset of workers, given by a private random permutation P of the indices of the other workers (these workers are its potential victims) [18]. When a steal starts, the thief picks a random position in the permutation and proceeds through it attempting to steal. If there is no work to steal, it refreshes the permutation and starts over from the starting point [18].
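The following C fragment sketches the stealing loop over such a private permutation; try_steal_from and the refresh step are placeholders for whatever the real implementation does.

#include <stdlib.h>

int try_steal_from(int victim);   /* returns 1 on a successful steal */

int set_based_steal(const int *perm, int set_size)
{
    int start = rand() % set_size;                 /* random starting point */

    for (int i = 0; i < set_size; i++) {
        int victim = perm[(start + i) % set_size]; /* walk the private permutation */
        if (try_steal_from(victim))
            return victim;
    }
    return -1;   /* no work found: the caller refreshes and starts over */
}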
5.6 Other multi-threaded schedulers
Cilk
Cilk was designed at the MIT Laboratory for Computer Science as a general-purpose programming language for multi-threaded parallel computing (http://supertech.csail.mit.edu/cilk/). Cilk provides simple linguistic extensions to ANSI C, for example the keywords spawn and sync, and implements a work stealing scheduler. Cilk++ is open source, but the Intel compiler (ICC with Intel Cilk Plus) and tool suite are commercial.

Intel TBB
Intel TBB, short for "Threading Building Blocks", is an open source library for task-based parallelism in C++ (http://threadingbuildingblocks.org/). The TBB library manages threads dynamically and provides templates for common parallel patterns. It is implemented with a work stealing scheduler and keeps the load balanced automatically.

OpenMP
The OpenMP API targets shared memory parallel programming in C/C++ and Fortran (http://openmp.org/wp/). It is built on top of native threads, and a number of compilers implement it, such as gcc and Visual Studio. However, since it is designed for shared memory systems, it does not work for distributed memory systems.
Wool Optimizations & Limitations
Compared with other multi-threaded schedulers, Wool is much simpler and performs well at the cost of some limitations; for instance, only independent tasks may be executed in parallel [22]. In addition, task descriptors have a fixed size in the current version of Wool (all task descriptors share the same size to simplify the management of the task pool and to control the total size of task arguments) [24, 25].
5.7 Summary
Wool is a macro-based parallelism library in C. It is built on two basic operations, spawn and sync, and focuses on independent fine grained parallelism based on work stealing and leapfrogging. Its performance is comparable to that of Cilk, TBB and OpenMP. However, it is limited to independent tasks, which makes it less flexible.
Part II
Implementation
Chapter 6
OSE

Enea OSE is short for "Operating System Embedded". It has two types of kernels: the OSE Real Time Kernel for embedded systems and the OSE Soft Kernel for host computers (such as UNIX workstations and PCs). Both kernels work in single-core and multi-core systems [26]. OSE is a distributed operating system based on message passing. This chapter is based on Enea documents and white papers [26, 27, 28, 29, 30].
6.1 OSE fundamentals
Load module
Modules are applications. They can be included in the core module (the core module is the module that is linked with the kernel at compile time), or they can be built as separately linked load modules and loaded at runtime [27]. Each load module is assigned a piece of memory in the system and consists of a block of processes and a block pool. Memory is shared by processes within a block. Communication within or across load modules is done by message passing. Each load module is linked to one core dynamically at run time, and parallelism is realized at the load module level [26]. Figure 6.1 shows the structure of OSE.

Memory pool
A memory pool is the basic type of memory, with buffers and stacks allocated in it. There is one global memory pool for system processes and data, called the system pool, which is allocated in kernel memory. Local pools belong to their own processes. These pools can be created dynamically or statically, and their maximum size and fragment size can also be changed dynamically.
Figure 6.1. Processes in Blocks with Pools in Domains [26]
Block
A block is used to group OSE processes together, and acts like a process in that it can be started, stopped and killed. Each block may have its own memory pool. If a pool becomes corrupted, only the block connected to it is affected, without any influence on other blocks. Normally, a new process (child process) becomes part of the same block as its parent unless another block is specified when it is created.

Domain
A domain is a memory region containing programs, formed by grouping one or several pools into a separate "domain". This avoids the danger that arises when a signal contains a pointer into the memory pool of the sender, since the receiving process would then be able to destroy the pool. With domains, the user can choose to copy the signal buffer from the sender when sending a signal across segment boundaries.
6.2 OSE process & IPC
The OSE kernel is based on message passing between processes. This IPC (inter-process communication) is implemented as a simple API between processes in a distributed single-core or multi-core system [29]. OSE processes are equivalent to POSIX threads with some special features, including an individual stack and a set of specific variables and register values. Processes within a load module share the CPU time allocated by the kernel based on their priorities, so a process is not always running; it has three states: ready, waiting and running (almost the same as a thread, see Section 4). A static process is created by the kernel at the start of a load module and is not allowed to be killed during the existence of the load module. In contrast, dynamic processes can be created, configured and killed at run time. Each process is assigned a process identity (ID) for its lifetime.
Figure 6.2. Message Passing
Inter-process communications and synchronizations are handled by signals. A signal is used either as an acknowledgement message (like semaphores, barriers or monitors) or as a data-carrying message. Each process is assigned one unique signal queue. Each message carries information about the sender, the receiver and the owner, which makes it easier to trace and redirect. A message can only be received by the dedicated process it is sent to, and it is up to that process to choose which message to accept. This scheme is called Direct Asynchronous Message Passing (DAMP). Message passing between cores is supported by OSE using the same message passing API as the single-core version of OSE [28]. The message passing process is shown in Figure 6.2.
6.3 OSE for multi-core
The OSE multi-core version, called MCE (Multi-core Embedded), is available for PowerPC from OSE 5.4 and later. The architecture of MCE is shown in Figure 6.3. MCE is a hybrid of SMP and AMP (see Section 2.1): it behaves as SMP at the application level, with a homogeneous single OS image and a shared memory programming model, and it has AMP characteristics at the kernel level, with multiple schedulers for multiple cores and support for message passing [30]. One OSE image controls all the cores and shared resources. Cores are identified by cpu_id, and cpu_id 0 is the main execution unit from which the system is started. OSE supports SMP transparently, without modifying applications that are moved from single core to multi-core. With the kernel thread running on the main execution unit, programs in load modules are allowed to run on any execution unit. Entire programs can be moved between execution units while running, apart from interrupt or timer interrupt processes. Programs can also be locked to a given execution unit by the designer [26].
6.3.1 OSE load balancing
OSE offers signal interfaces in the program manager to measure load statistics and move programs between execution units. Applications see OSE as an SMP system, since it handles work load distribution over the cores, but to achieve deterministic behaviour it is up to the applications to control load balancing. Parallelism is at the load module level [33]. Thread level parallelism (see Section 3.2.2) and dynamic load balancing are therefore desirable additions to OSE, and this is exactly what Wool provides.
6.3.2 Message passing APIs
OSE message passing APIs are supported in MCE to communicate across cores. The message passing model fits multi-core better, because in the shared memory model the inter-communication overhead grows with the number of cores; with a large number of cores, the shared memory model has a higher overhead than message passing. Moreover, explicit synchronizations are not needed, because they are implicit in the message passing itself. However, message passing may also introduce higher latency when parallel programs have many dependencies.
6.4 OSE pthreads
POSIX threads are supported by OSE. The OSE pthread is an emulation layer that implements most of the thread standard in POSIX 9945-1 (1996), which is essentially 1003.1c [27]. OSE implements a subset of the POSIX thread API, dedicated to simplifying the porting of POSIX thread based applications. It is recommended to use native OSE processes, which are more efficient than OSE pthreads. An OSE pthread can be regarded as a prioritized OSE process, except that killing a thread is not allowed (it can be terminated instead). Fast semaphores are used to implement OSE pthread mutexes, so they cannot be used for other purposes. Pthreads in the core module need no special configuration, while pthreads in load modules require that the program is configured in shared mode [27].

Figure 6.3. MCE: Hybrid AMP/SMP Multiprocessing [31]
6.5 Summary
First, OSE is based on the message passing model, which is a good fit for multi-core. Secondly, OSE supports POSIX threads by mapping them onto OSE processes; most pthread functions are supported, which makes OSE a friendly programming environment for migrating POSIX-compliant applications. Thirdly, the GCC compiler is supported by OSE, which makes it easier to port Wool, since the same compiler is used when Wool is compiled on Linux. Last but not least, LINX handles communication across cores in MCE regardless of the operating system, which makes it possible to build a hypervisor version of Wool (with either pthreads or signals) that can run on a platform with multiple operating systems.
Chapter 7
P4080 and TILE64

Two multi-core platforms, the Freescale QorIQ P4080 with eight cores and the Tilera TILE64 with 64 cores, are discussed in this chapter with regard to their features and architectures. The Freescale P4080 is the target board for the porting; the TILE64 is discussed for comparison.
7.1 Features

7.1.1 Instruction set
The instruction set architecture (ISA) is one of the most important features of a computer architecture related to programming. An instruction set includes a specification of a set of instructions, data types, addressing modes, and so on. The operation width of an instruction set describes the number of bits an instruction takes on a specific CPU; mainstream CPUs are either 32-bit or 64-bit. Instruction sets can also be classified into CISC and RISC. RISC includes the most frequently used instructions, with the same number of bits in every instruction, while CISC puts the emphasis on hardware, has complex instructions, and can give a smaller code size. Several ISAs are compared in Table 7.1.

Architecture   Bits    Instruction set
PowerPC        32/64   RISC
Tile           32      RISC
IA-64          64      EPIC
SPARC          64      RISC
X86-64         64      CISC
X86            32      CISC

Table 7.1. Instruction Set Comparison
Programs built on different instruction sets differ a lot; one CISC style instruction may need to be implemented by several RISC style instructions.
7.1.2 Memory consistency
To enhance the performance of multi-core systems, large caches are used to reduce the overhead of memory accesses, but this raises the problem that memory operations may be performed out of order and that the same memory location may present different values to different processors [32]. Different computer architectures define different models addressing this problem, called memory consistency models. A memory consistency model is a contract between the shared memory and the programs running on it [33]. The models are mainly divided into sequential consistency and relaxed consistency. Sequential consistency performs the accesses to memory in the same order as the program order of each individual processor; however, sequential consistency is complicated to implement and causes a lot of overhead. Relaxed consistency means that instruction reorderings are permitted but synchronization is required when there are data dependencies; in a relaxed consistency model, memory barriers are needed to constrain the memory access order. Most processors define a relaxed memory consistency model.

Out-of-order
Out-of-order execution means that the execution order (the order in which the individual instructions are executed on a given CPU) of a program differs from the program order (the order specified in the code for that CPU) due to both compiler and CPU optimizations. The perceived order (the order in which a given CPU perceives its own and other CPUs' memory operations), which is determined by caching, interconnect and memory-system optimizations, may in turn differ from the execution order. Out-of-order execution is an optimization that gives better performance, but it may lead to unexpected results. Here is an example of out-of-order behaviour:

Initial state: x = 0, s = 0

P1                    P2
x = 42;               while (s == 0)
// memory barrier         ;
s = 1;                print x;
If this piece of code is executed in program order, the result is 42, since x is printed only after s = 1 has been executed. If the program is executed or perceived out of order, the result may be 0, because P2 may see s = 1 before it sees x = 42. Memory barriers are needed for protection in this case.
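For readers who prefer running code, the same scenario can be written in C with explicit fences. This sketch uses GCC's __atomic builtins and pthreads; it illustrates the ordering argument above and is not taken from the thesis.

#include <pthread.h>
#include <stdio.h>

static int x = 0;
static int s = 0;

static void *p1(void *arg)
{
    __atomic_store_n(&x, 42, __ATOMIC_RELAXED);
    __atomic_thread_fence(__ATOMIC_RELEASE);    /* the "memory barrier" line */
    __atomic_store_n(&s, 1, __ATOMIC_RELAXED);
    return NULL;
}

static void *p2(void *arg)
{
    while (__atomic_load_n(&s, __ATOMIC_RELAXED) == 0)
        ;                                       /* spin until s == 1 */
    __atomic_thread_fence(__ATOMIC_ACQUIRE);    /* pairs with the release fence */
    printf("%d\n", __atomic_load_n(&x, __ATOMIC_RELAXED));  /* always prints 42 */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}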
Memory consistency models for different microprocessors

Different processors supply different memory consistency models. For instance, X86 and SPARC have strict constraints and do not reorder write operations, while PowerPC and IA-64 have weaker constraints and need memory barriers to constrain the execution order and the perceived order. Table 7.2 lists the memory consistency models defined by different processors [36] [32] [38].

Order                 IA64   PowerPC   Sparc   X86   TILE
Read After Read?      Y      Y         N       N     Y
Read After Write?     Y      Y         N       N     Y
Write After Write?    Y      Y         N       N     Y
Write After Read?     Y      Y         Y       Y     Y

Table 7.2. Memory Consistency Models
7.1.3 Memory barrier
Memory barriers (memory fences) are mainly used with weak consistency models to constrain the memory access order and ensure data correctness. There are different kinds of barriers; for example, a store fence (SFENCE) constrains the execution order of write operations to be the same as the program order. Generally, the stricter the memory barrier, the more costly it is, because the pipeline is forced to drop its prefetched instructions; developers should therefore only use memory barriers when necessary. Memory barriers are hardware dependent and are defined by the processor vendors. Table 7.3 lists some types of memory barriers and the corresponding instructions.

Barrier   sparc                               x86_64   powerpc   TILE
SFENCE    membar(StoreStore)                  sfence   lwsync    MF
MFENCE    membar(StoreLoad or StoreStore)     mfence   msync     MF

Table 7.3. Memory Barriers & Instructions
7.2 Freescale QorIQ P4080 platform

7.2.1 P4080 architecture
The Freescale P4080 QorIQ integrated multi-core communication processor is based on eight Power Architecture processor cores (e500mc). Power Architecture is a broad term describing similar RISC instruction sets for microprocessors developed and manufactured by companies such as IBM, Freescale, AMCC, Tundra and P.A. Semi; the e500mc is a 32-bit microprocessor core based on Power Architecture [34]. The P4080 is a high performance networking platform, which can be used in routers, switches, base station controllers, and general-purpose embedded computing systems. Compared with multiple discrete devices, it offers better performance and simplifies the board design [35].
Figure 7.1. P4080.
7.2.2 e500mc core
The P4080 is based on eight e500mc cores and supports both symmetric and asymmetric modes. Each core is a 32-bit low power processor based on the Power Architecture technology and dedicated to embedded systems. The e500mc is a super-scalar dual issue processor (two instructions per clock cycle) that supports out-of-order execution and in-order completion. With a seven-stage pipeline, the e500mc core is able to perform more instructions per clock [35].
Memory consistency model
The Power ISA provides a weakly consistent memory access model, which creates opportunities for the processor to reschedule memory transactions [36]:
1. Read after read may be reordered unless caching-inhibited storage is in use.
2. Write after write can be reordered.
3. Read after write and write after read can only be protected by msync.

Memory barrier primitives
msync: this instruction ensures that all reads and writes preceding the msync have completed before subsequent instructions are executed.

lwarx and stwcx.: these instructions are used to perform a read-modify-write operation on memory. They ensure that only one processor modifies the memory location between the execution of the lwarx instruction and the stwcx. instruction. With the combination of lwarx and stwcx., primitives such as prefetch and compare-and-swap can be implemented.
7.3 Tilera's TILE64

7.3.1 TILE64 architecture
TILE64 is a multi-core processor with a mesh network of 64 cores (each core is called a tile); it can be classified as a Massively Parallel Processor Array (MPPA). TILE64 delivers scalable performance, power efficiency, and low processing latency [38].
7.3.2 TILE
A TILE core has a 32-bit RISC instruction set with a three-way VLIW pipeline for instruction level parallelism (VLIW, very long instruction word, refers to a processor architecture designed to take advantage of instruction level parallelism [38]).

Memory consistency model
There are two properties of the TILE memory consistency model: instruction reordering rules and store atomicity [38]. The model is a relaxed memory model (see Section 7.1.2). For a given processor, memory accesses, as well as their order as seen by other processors, may be reordered, except in the following cases [38]:
1. Data dependencies within a single processor are enforced, including read after write, write after write and write after read.
2. The locally visible order is determined by data dependencies through registers or memory.
3. The globally visible order cannot be determined by the locally visible order.

Figure 7.2. Tilera's TILE64 Architecture [38]

Memory barrier primitives
TILE processors define a relaxed consistency model and need memory barriers to make memory updates visible and to guarantee that inter-network communications occur in order [37]. The TILE64 processor provides memory fence and global barrier sync instructions. MF: the memory fence (MF) instruction ensures that memory operations prior to the fence are globally visible before the operations after the fence [38].
Chapter 8
Porting Wool to P4080

In the design of the port of Wool to the Freescale P4080 platform with the Linux operating system, the idea is to walk through the source code and modify the code that involves hardware primitives. Since Wool is a C library built on top of the operating system, nothing other than the hardware dependent parts of the code needs to be touched. The operating system running on the P4080 is Enea Linux, which supports all the libraries used in Wool.
8.1 Design
The hardware synchronization primitives used in Wool are mainly memory related code, such as memory fences, atomic operations and memory allocation functions, which have been discussed in Chapters 4 and 7. The objective is to add the hardware synchronization primitives defined by the e500mc processor to Wool's source code, written in the gcc inline assembly format. Freescale introduced the e500mc processor, based on Power ISA v.2.06 (a new instruction set combining late versions of the POWER and PowerPC instruction sets, designed by IBM and Freescale), in the QorIQ family of chips. The e500mc is a 32-bit processor supporting shared storage between processors [39]. Its predominant memory model is weakly consistent, which allows reordering of code and provides an opportunity to improve performance over a stronger consistency model. In this case, it is up to the programmer to take care of ordering and synchronization instructions when shared storage is used among multiple cores.
8.1.1 Storage access ordering
The different memory fences used in Wool must be defined in the header file. SFENCE is short for store fence, which ensures that all write operations preceding the barrier are committed before subsequent write operations are issued. MFENCE is short for memory fence, which controls all memory accesses (both read and write operations): it ensures that memory accesses before the barrier are committed before subsequent memory accesses are initiated. A memory fence induces more overhead than a load fence or a store fence, because the CPU has to discard all prefetched instructions and empty the pipeline. The e500mc core defines the following operations, which are used in Wool.

sync: provides full sequential consistency. It is used as MFENCE in Wool and is strict with respect to any instruction ordering.

msync: ensures that all instructions preceding the msync have completed before the msync completes. It also orders data accesses across all storage classes, so it can also be used as MFENCE; one can regard it as an alternative to sync.

lwsync: a low overhead memory fence, which constrains the following accesses: read after read, write after write and write after read. It is used as SFENCE in Wool.

isync: causes the prefetched instructions to be discarded by the core, which ensures that the instructions preceding the isync are fetched and executed before subsequent instructions. isync does not order data accesses.
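As an illustration of how these instructions could back the fence macros, the following GCC inline assembly definitions are one possible mapping; the macro names mirror the text above but are not copied from Wool's source.

#define MFENCE()  __asm__ __volatile__ ("sync"   : : : "memory")   /* full barrier */
#define SFENCE()  __asm__ __volatile__ ("lwsync" : : : "memory")   /* lightweight store fence */
#define ISYNC()   __asm__ __volatile__ ("isync"  : : : "memory")   /* discard prefetched instructions */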
8.1.2 Atomic update primitives
The atomic update primitives used in Wool are the prefetch and CAS operations, which are based on lwarx and stwcx. lwarx creates a reservation, and stwcx. stores a word on the condition that a reservation created by lwarx still exists for the same storage location. The reservation is lost if another processor modifies the same storage location before the stwcx.; in that case a new reservation must be established to retry the atomic operation.
8.2 Implementation

8.2.1 Prefetch & CAS
The prefetch primitive loads and replaces a word in storage atomically [39]. In the following assembly code, the processor stores the new value in r4 to the address in r3, and the old value from that location is returned in r5. The key property of this operation is that it updates the memory value atomically [16].

loop: lwarx  r5,0,r3   # load the old value into r5 from the address in r3 and reserve
      stwcx. r4,0,r3   # store the new value in r4 to the address in r3 if still reserved
      bne-   loop      # retry if the reservation was lost
The Compare and Swap (CAS) primitive compares a value in a register with a word in storage [39]. It loads the old value from the address in r3 into r6 and compares it with the expected value in r4. If they are equal, the contents of the location addressed by r3 are replaced with the given value in r5. The old value is returned in any case. This is a typical atomic CPU operation used to achieve synchronization: it prevents two threads from updating the value at the same time, because the second one will fail and must recompute [16].

loop: lwarx  r6,0,r3   # load the old value and reserve
      cmpw   r4,r6     # does r4 equal r6?
      bne-   exit      # skip if not
      stwcx. r5,0,r3   # store the new value if still reserved
      bne-   loop      # retry if the reservation was lost
exit: mr     r4,r6     # return the value from storage
8.2.2 Gcc inline assembler code
The hardware primitives (assembly code) described in the design section must be embedded into Wool's source code, which is written in C. The assembly code is therefore written as gcc inline assembler, which is supported by the gcc compiler. Gcc inline assembler must follow its basic rules. The basic format is:

asm ( "instructions" : outputs : inputs : registers-modified );

The first parameter contains the instructions, separated by ";". The second and third parameters can be either storage locations or registers; they can be left empty when no input or output is used. The assembly instructions used in Wool operate on memory, so the last parameter should include "memory" to declare that memory has been changed.
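Putting these rules together, a compare-and-swap built from lwarx/stwcx. could be expressed as follows. This is a hedged sketch of the technique rather than Wool's exact code; the constraint letters follow standard GCC extended asm usage for 32-bit Power.

static inline int cas32(volatile int *addr, int expected, int desired)
{
    int old;
    __asm__ __volatile__ (
        "1: lwarx   %0,0,%2  \n"   /* load the old value and reserve          */
        "   cmpw    %0,%3    \n"   /* is it equal to the expected value?      */
        "   bne-    2f       \n"   /* no: exit without storing                */
        "   stwcx.  %1,0,%2  \n"   /* store the new value if still reserved   */
        "   bne-    1b       \n"   /* reservation lost: retry                 */
        "2:                  \n"
        : "=&r" (old)
        : "r" (desired), "r" (addr), "r" (expected)
        : "cr0", "memory");
    return old;                    /* old == expected means the swap happened */
}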
8.2.3 Enea Linux
The operating system used is Enea Linux, which is powered by the Yocto Project (http://www.yoctoproject.org/) open source configuration. The Yocto Project provides standard tools and ensures quick access to the latest Board Support Packages (BSPs) for the most common hardware architectures [40]. The architecture of Enea Linux is shown in Figure 8.1. The operating system can assign tasks to different cores automatically.
Figure 8.1. Enea Linux.
8.3 Experiment
A Fibonacci program is used as the test code. There are two versions of the test code: with Wool (fib_Wool) and without Wool (fib_wo). The main difference is that without Wool the tasks are executed at once, while with Wool the tasks are SPAWNed into a task pool instead of being executed immediately, and they are executed only when they reach the corresponding SYNC. The execution time is measured for comparison.
8.3.1 Experiment: Fibonacci code with and without Wool
fib_Wool: pseudo test code with Wool, for each iteration:

SPAWN( Fibonacci, n-1 );
k = CALL( Fibonacci, n-2 );
m = SYNC( Fibonacci );
return m + k;

fib_wo: pseudo test code without Wool, for each iteration:

k = Fibonacci( n-1 );
m = Fibonacci( n-2 );
return m + k;
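A minimal harness of the kind that could produce such measurements is sketched below for the plain version; the use of clock_gettime is an assumption about methodology, not a description of the thesis's test driver. A program of this shape corresponds to the fib_wo configuration in Table 8.1.

#include <stdio.h>
#include <time.h>

static long fib(int n)
{
    return n < 2 ? n : fib(n - 1) + fib(n - 2);   /* the fib_wo version */
}

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    long r = fib(43);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("fib(43) = %ld in %.2f s\n", r, secs);
    return 0;
}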
Fibonacci(43) is tested on different platforms with different gcc optimization levels. From Table 8.1 we can see that Wool needs gcc optimization to achieve good performance: the first two rows show that fib with Wool gains speedups of 4.32 and 4.70, respectively, from the gcc optimization option, while fib without Wool gains a speedup of 2.08 on the X86 platform and 1.54 on the P4080 platform. Wool is written to be friendly to the compiler and performs much better with the gcc optimization option enabled. Comparing fib with Wool against fib without Wool (gcc -O3), fib with Wool performs better on the P4080 (9.2 s with Wool versus 11.13 s without Wool). On the X86 platform the situation is different: fib with Wool needs more time to finish (5.14 s with Wool versus 4.7 s without Wool). This is because the program runs on a single core on X86, so the extra SPAWN and SYNC operations only create overhead there. When multiple cores are used, Wool distributes tasks to different cores with low overhead and gives better performance. There is more discussion in Chapter 9.

Code        Gcc optimization   X86        P4080
fib_Wool    gcc -O0            22.23(s)   43.25(s)
fib_Wool    gcc -O3            5.14(s)    9.2(s)
fib_wo      gcc -O0            9.79(s)    17.13(s)
fib_wo      gcc -O3            4.7(s)     11.13(s)

Table 8.1. Time for Fib(43) with and without the Wool Library on the X86 and P4080 Linux Platforms
8.3.2 Analysis
Take fib(5) as an example, and suppose that there are multiple cores in the system. The task tree of fib(5) is shown in Figure 8.2.

Figure 8.2. Fib(5)
From Figure 8.2, we can see that there are 15 tasks in total in fib(5). If fib(5) is executed without Wool, all 15 tasks are executed by one core and the overall time is 15 task execution times. If the Wool library is used, one processor executes fib(5), spawns fib(4) and executes fib(3), then spawns fib(2) and executes fib(1), and then performs the syncs in fib(3) and fib(5) (fib(4) and fib(2), marked in red in the figure, which fib(3) and fib(5), in blue, sync on, may be stolen by other cores). The critical path of completing fib(5) consists of at least 2 spawns, 2 syncs and 3 task units (assuming an infinite number of cores and that the other spawned tasks are stolen and executed by other processors). Each SPAWN and SYNC takes little time when each task unit is large. However, when a task consists of almost nothing but SPAWN and SYNC (as in fib_Wool), the overhead they create cannot be ignored. As shown in Table 8.1, fib(43) with Wool running on one core (X86) takes 5.14 s, while it takes 4.7 s without Wool; the difference mainly consists of the cost of the extra SPAWN and SYNC operations.
Chapter 9
Porting Wool to OSE

When porting Wool from the Linux operating system to Enea OSE, the library differences must be considered first. OSE is designed to work with OSE processes and message passing, but it also provides most pthread APIs. Since Wool is built on top of pthreads, OSE pthreads are tried first. OSE supports the gcc compiler, which makes it easier to port Wool. Modifications have been made in Wool to make it fit OSE, and the OSE load module configuration has also been edited to support Wool.
9.1 Design
OSE supplies standard C libraries, but there are differences between the OSE library and the Linux library. Some functions are not supported by OSE and have to be compensated for when porting Wool.
9.1.1 Malloc library
Data alignment refers to the relationship between a data address and the size of memory blocks: "A variable located at a memory address that is a multiple of its size is said to be naturally aligned [41]." The alignment rules derive from the hardware. Some computer architectures have stringent requirements on data alignment, where unaligned accesses result in a processor trap. Wool is supposed to work on multiple platforms, so data should be naturally aligned to make it fit different platforms.

OSE malloc library
The malloc library contains the heap memory allocation functions [28]. The storage allocation functions include malloc, realloc and calloc: malloc allocates dynamic memory for an object of a specified size, in either private or shared mode; calloc allocates dynamic memory for an array of objects of a specified size; realloc changes the size of dynamically allocated memory. The pointers returned by the storage allocation functions are unique and discontiguous, and they remain valid from the point of allocation until deallocation [28]. They can be set as private pointers or shared pointers. A shared pointer can be resized or deallocated from any OSE process, whereas a private pointer can only be operated on by the process that allocated it. Private pointers can be collected when the owner process terminates, even if they have not been deallocated explicitly. An allocated pointer is always private by default in OSE. In Wool, pointers should be set to shared mode so that they can be shared between OSE processes to synchronize among the workers. This can be done in three ways: (1) the block execution mode is shared; (2) the block environment variable HEAP_SHARED_MODE is set; (3) the code is compiled with MALLOC_SHARED. Setting MALLOC_SHARED is the most convenient way; it forcibly makes allocations shared at compile time, even when standard allocation functions like malloc are used. An OSE error is generated if there is insufficient memory to satisfy an allocation attempt [28]. The compiler and the C library handle data alignment: malloc, calloc and realloc return memory that is aligned on an 8 byte boundary on 32-bit systems and a 16 byte boundary on 64-bit systems [41]. When a larger boundary is required, such as a page, additional functions are needed, like memalign.

Linux malloc library
Besides these basic functions, the Linux C library supplies more advanced memory allocation functions. One of them, memalign, is used in Wool; it is included in the Linux malloc library but not in the OSE malloc library. Memalign allocates a block of dynamic memory whose size in bytes is given by the second parameter; the returned address is a multiple of the alignment given by the first parameter, which must be a power of 2 [41].
void *memalign(size_t alignment, size_t size);
memalign is used when a larger boundary or a stricter alignment is required; its advantage is improved performance. More information on how it improves performance is available in [16].
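The substitution discussed below (falling back to malloc on OSE) can be captured in a small portability wrapper. HAVE_MEMALIGN and TASK_ALIGNMENT are illustrative names, not values from the thesis, and the fallback simply accepts malloc's default 8/16 byte alignment.

#include <stdlib.h>
#ifdef HAVE_MEMALIGN
#include <malloc.h>                 /* memalign is declared here on Linux */
#endif

#define TASK_ALIGNMENT 64           /* assumed alignment, for illustration only */

static void *alloc_aligned(size_t bytes)
{
#ifdef HAVE_MEMALIGN
    return memalign(TASK_ALIGNMENT, bytes);  /* Linux: request a larger boundary */
#else
    return malloc(bytes);                    /* OSE: fall back to the default alignment */
#endif
}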
Comparison

A simple test is performed to compare the performance of malloc, calloc and memalign on the Linux platform. The results are shown in Table 9.1.

Gcc optimization   memalign   malloc    calloc
gcc -O0            94.36s     100.30s   94.32s
gcc -O3            21.94s     22.32s    22.36s

Table 9.1. Fib(46) with Memalign, Malloc and Calloc
Memalign has the best performance because of the size of the boundary. Malloc and calloc have comparable performance, and both guarantee the correctness of the code with the same size of aligned memory. As there is no memalign function supplied by OSE, we substitute it with malloc at the expense of some performance.
9.1.2 Pthread
The OSE pthread emulates the POSIX thread and supports most of the thread standard in POSIX 9945-1 (1996), which is essentially 1003.1c [27]. In general, native OSE processes are used instead of OSE pthreads to gain higher efficiency, but OSE pthreads make it easier to port POSIX applications, like Wool, to OSE.

Properties
An OSE pthread is implemented on top of a prioritized OSE process and therefore has properties similar to an OSE process. Each pthread ID is unique and is the same as the ID of its corresponding OSE process. Unlike OSE processes, OSE pthreads cannot be killed; they should be terminated as pthreads, for example with pthread_exit [42]. Pthreads are widely used in systems with shared memory. The OSE block in which the pthreads run should be set up with a shared heap, so that all resources are taken from the same heap and are freed when the block is killed. Users can specify which OSE block a thread should belong to [27].
From OSE 5.5 onwards, OSE pthreads can be used in both the core module and load modules. Since Wool is a C library, we build it as a load module, which has to be configured in shared mode. More information is available in [44].

Mutex
OSE pthread mutexes are implemented with fast semaphores, so fast semaphores cannot be used for other purposes in OSE pthread applications [27].

Condition variable
Condition variable functions are not well supported in OSE [43], but they can be implemented with OSE processes using condition_signal and condition_wait signaling. It is recommended to have a signaler process, a waiter process and a condition server process, which can also be customized according to the requirements. However, unlike the other pthread APIs, condition variables cannot be used without change.

Pthread binding to cores
Programs are created within one core by default, so all the pthreads started by Wool would be assigned to one core. To distribute pthreads over multiple cores, the program must be set to supervisor mode so that it can use a system call that moves functions to different execution units. As the P4080 has 8 cores, we start 8 pthreads and bind each pthread to a core using the ose_bind_process() function [26]. The system call ose_bind_process binds an OSE process, block or domain to a specified core; it can also be used to rebind (move) processes or blocks at run time. If a process is specified, only that process is bound; if a block is specified, all its processes are bound to that core together with the block; if a domain is specified, all the processes and blocks inside the domain are bound to the same core. This system call is only valid in an SMP system [44].
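A hedged sketch of starting one worker pthread per core follows. The bind step uses the OSE system call named above (ose_bind_process), but since its exact argument list is documented in [44], it is only indicated in a comment here rather than called with a guessed signature; worker_main and NUM_WORKERS are illustrative names.

#include <pthread.h>

#define NUM_WORKERS 8              /* one worker per e500mc core on the P4080 */

void *worker_main(void *arg);      /* hypothetical Wool worker loop */

static pthread_t workers[NUM_WORKERS];

static void start_workers(void)
{
    for (long i = 0; i < NUM_WORKERS; i++) {
        pthread_create(&workers[i], NULL, worker_main, (void *) i);
        /* each thread then binds itself to core i with ose_bind_process(),
         * as described in the text; see [44] for the exact call */
    }
}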
9.2 Implementation

9.2.1 Load module configuration
To enable Wool to run on multiple cores, the load module must be configured in shared mode. To bind a pthread to a dedicated CPU, the load module must be set to supervisor mode instead of user mode in order to use the system call. Besides this, the stack size must be big enough to give each worker enough memory space.
9.2.2 Makefile
All the source files that should be compiled together must be included in the Makefile; the compiler is invoked and links all the source code listed there. For example, to compile fib with the support of the Wool library, both wool.c and fib.c must be included in the Makefile. The Makefile should be in the same folder as the application code to be built.
9.2.3 System information
Some multicore API calls are used to check information about pthreads, load modules and cores. This system information is useful for debugging.
9.2.4 Application on OSE multi-core
The OSE architecture makes it possible to migrate an OSE single core program to multi-core without changes; all single core C runtime APIs are supported by OSE MCE [26]. Processes can be spread out over multiple cores by MCE automatically to achieve parallelism, and they communicate with each other via OSE signaling, just as in single core applications. However, a multi-core system can neither assume that only one entity at a time executes code nor assume that a higher priority provides exclusiveness, because the parallel execution units can execute processes with different priorities at the same time. It is therefore very important to ensure that all dependent code is locked to the same core in a multi-core environment. As Wool is a scheduler that tries to keep the load balanced among cores, all of the pthreads (actually prioritized processes) should be given the same priority [45].
9.3 Experiments
The same test case as on Linux is performed on OSE: the Fibonacci test code with and without Wool.
Experiment on OSE & Linux
The target used for this test is the Freescale P4080 board, and the test code is run on Enea Linux and Enea OSE respectively. Fib(43) is compiled with the highest optimization option provided by gcc. From the column chart in Figure 9.1 we can see that fib(43) on Linux and on OSE have comparable performance, and that fib_Wool performs somewhat better. The reason why the improvement is so small was explained in Section 8.3.2, and the following tests further support that analysis.

Figure 9.1. Fib(43) with Wool Library on Linux & OSE on P4080
9.3.2 Experiment on the performance vs. the number of cores (1)
Since pthreads can be bound to cores, we can choose the number of working cores. Two experiments are performed: one is fib(43); the other is fib(12) with a slowdown loop, which is used to enlarge the size of each task.

Code       Number of cores   Operating system   Time for fib(43) (s)
fib_Wool   1                 OSE                19.154
fib_Wool   2                 OSE                9.949
fib_Wool   3                 OSE                9.188
fib_Wool   4                 OSE                8.939
fib_Wool   5                 OSE                8.896
fib_Wool   6                 OSE                8.885
fib_Wool   7                 OSE                9.459
fib_Wool   8                 OSE                9.424

Table 9.2. Fib(43) with Wool Library on Multiple Cores
From Table 9.2 we can see that the performance is almost doubled with two cores compared with one core. However, as the number of cores keeps increasing, the performance improves very little, and it even begins to decrease when more than six cores are used. In other words, the overhead induced by synchronization limits the performance. One reason may be that the basic version of Wool is used here; if the more advanced version were ported, the results would be expected to be better. To further verify whether Wool works well on a multi-core system, the following test is conducted.
9.3.3 Experiment on the performance vs. the number of cores (2)
An extra slowdown loop is added to the test code, which makes the program about 20,000,000 times slower than before; that is, the size of each task is increased. The results are shown in Figure 9.2, where we can see that the performance of fib_Wool keeps improving with the number of cores.
Figure 9.2. Fib(12) with Large Task Body on Multiple Cores
The red columns indicate the ideal speedup, while the blue columns show the real speedup; the difference between them is the overhead induced by the extra synchronization among threads. In Figure 9.2, the performance of Wool improves nearly linearly. However, at 8 cores the overhead increases noticeably, because core 0 then also acts as a worker. Considering that core 0 also takes care of other system tasks, it is reasonable that the speed of the test code is affected.
Chapter 10
Conclusions and Future Work

10.1 Conclusions
This master thesis discusses Wool and its performance on multi-core systems. In the theoretical study part, multi-core, parallel programming and pthreads are discussed. As multi-core starts an era of improving the performance of computations, Wool provides a solution for carrying out computations concurrently and efficiently. In the implementation part, Wool is ported to a new platform and a set of tests is performed on it.
10.1.1 Theoretical study
The answers to the questions raised in Chapter 1 are stated below.

P1: Wool task parallelism
Wool implements task parallelism with work stealing and leapfrogging. The source code related to hardware synchronization primitives and to the OSE libraries needs to be modified to fit the new platform.

P2: Pthreads
The OSE multi-core edition MCE supports most pthread functions, which are built upon OSE processes. However, the pthread condition functions are not fully supported by OSE; if these functions are used, they have to be implemented with OSE processes. This is covered in Chapter 6 and Chapter 9.

P3: Hardware architecture
Wool is able to work on the P4080, with the hardware specific synchronization primitives guaranteeing its performance. In multi-core systems, the number of cores influences Wool's performance, and features like sampling victim selection are added as the number of cores keeps growing. This is illustrated in Chapter 5 and Chapter 9.
10.1.2 Implementation
To port Wool, a C library, to another platform, one should consider the hardware related code, the library support, the compiler, and the application configuration.

Porting from X86 to P4080
When porting Wool from one platform to another, the hardware dependent code must be modified to fit the new platform. All the hardware dependent code in Wool consists of memory related primitives defined by the processor. These primitives are written in the gcc inline assembler format so that they can be compiled together with the rest of the code.

Porting from Linux to OSE
The operating system supplies the libraries and the compiler, and both may affect the porting. To port Wool to OSE, the gcc compiler is used (gcc is also the compiler used on the Linux platform), but some functions are missing from the libraries provided by OSE, so the code has to be changed to compensate. Another difference is that the file system and the architecture of OSE differ from Linux, so the configuration has to be changed accordingly.
10.1.3 Result Discussion
Table 10.1 shows the results of fib(43) on the different platforms. fib(43) has more than one billion tasks spawned, executed and synced, and the extra cost induced by the spawns and syncs is quite low considering this large number of operations (around one billion spawns and syncs cost 0.43 s, which is the difference between running with and without Wool on Linux (X86)).

Platform        Time with Wool   Time without Wool
OSE (P4080)     9.42s            11.50s
Linux (P4080)   9.2s             11.13s
Linux (X86)     5.14s            4.71s

Table 10.1. Fib(43) on the Target Platforms
Wool gives better performance on the P4080 multi-core platform. The fundamental operations SPAWN and SYNC are quite cheap, so a low scheduler overhead can be achieved. Besides, Wool works with plain ordinary C, unlike the C++ used by Intel TBB or the C# used by Microsoft TPL. From this perspective, Wool is a good choice for Enea OSE. However, as Wool implements independent task parallelism, its usage is limited to independent tasks, and users have to take care of task dependencies themselves.
10.2 Future work
(1) Although Wool runs on OSE, it is recommended to replace the pthread primitives with OSE message passing, which would make it more efficient. Wool currently works on top of pthreads, which need explicit synchronization when workers communicate across cores; with OSE message passing, communication among cores does not need to be synchronized explicitly.

(2) The porting is done with the basic version of Wool. It is worthwhile to port the newest version to OSE, which may give better performance with its more advanced algorithms.

(3) The Enea Hypervisor supports both the OSE and Linux operating systems. It is worth working out a hypervisor version of Wool (with both pthreads and signals) that can work across the system regardless of OS type.

(4) A series of target boards may enhance their performance with Wool, for example the multi-core Ambric Am2045, Sun SPARC T3 (Rainbowfalls), IBM POWER7, and so on. They may become further target platforms.
Appendix A
Demo Results ose5> ./romfs/wooll.gz -p 4 12 Worker started Worker started Worker started Worker started Worker 558303160 stole a task from worker 558175144 Worker 558596184 stole a task from worker 558175144 Worker 558461016 stole a task from worker 558303160 Worker 558175144 is leapfrogging with worker 558596184 Worker 558175144 stole a task from worker 558596184 Worker 558596184 is leapfrogging with worker 558175144 Worker 558596184 stole a task from worker 558175144 Worker 558175144 is leapfrogging with worker 558596184 Worker 558175144 stole a task from worker 558596184 Worker 558596184 is leapfrogging with worker 558175144 Worker 558596184 stole a task from worker 558175144 Worker 558175144 is leapfrogging with worker 558596184 Worker 558175144 stole a task from worker 558596184 Worker 558596184 is leapfrogging with worker 558175144 Worker 558596184 stole a task from worker 558303160 Worker 558175144 is leapfrogging with worker 558303160 Worker 558175144 stole a task from worker 558303160 Worker 558303160 is leapfrogging with worker 558175144 Worker 558303160 stole a task from worker 558175144 Worker 558175144 is leapfrogging with worker 558303160 Worker 558175144 stole a task from worker 558303160 Worker 558303160 is leapfrogging with worker 558175144 Worker 558303160 is leapfrogging with worker 558596184 Worker 558303160 stole a task from worker 558596184 Worker 558175144 stole a task from worker 558303160
Worker 558596184 is leapfrogging with worker 558303160 Worker 558596184 stole a task from worker 558303160 Worker 558303160 is leapfrogging with worker 558596184 Worker 558303160 is leapfrogging with worker 558175144 Worker 558303160 stole a task from worker 558175144 Worker 558175144 is leapfrogging with worker 558303160 Worker 558596184 stole a task from worker 558461016 Worker 558303160 is leapfrogging with worker 558461016 Worker 558303160 stole a task from worker 558461016 Worker 558175144 stole a task from worker 558303160 Worker 558303160 is leapfrogging with worker 558175144 Worker 558303160 stole a task from worker 558175144 Worker 558461016 is leapfrogging with worker 558303160 Worker 558175144 is leapfrogging with worker 558303160 Worker 558461016 is leapfrogging with worker 558596184 Worker 558461016 stole a task from worker 558596184 Worker 558303160 stole a task from worker 558461016 Worker 558175144 stole a task from worker 558303160 Worker 558303160 is leapfrogging with worker 558175144 Worker 558303160 stole a task from worker 558175144 Worker 558175144 is leapfrogging with worker 558303160 Worker 558175144 stole a task from worker 558303160 Worker 558303160 is leapfrogging with worker 558175144 Worker 558303160 stole a task from worker 558175144 Worker 558175144 is leapfrogging with worker 558303160 Worker 558303160 stole a task from worker 558461016 Worker 558461016 is leapfrogging with worker 558303160 Worker 558461016 stole a task from worker 558596184 Worker 558596184 is leapfrogging with worker 558461016 Worker 558596184 stole a task from worker 558461016 Worker 558461016 is leapfrogging with worker 558596184 Worker 558461016 stole a task from worker 558596184 Worker 558596184 is leapfrogging with worker 558461016 result = 144 time is 28.519000. workers stoped Yay! ose5>
Bibliography

[1] Karl-Filip Faxén, "Wool 0.1 users guide," http://www.sics.se/kff/wool, June 1, 2009.
[2] M. Hill and M. Marty, "Amdahl's law in the multicore era," IEEE Comput., vol. 41, no. 7, pp. 33-38, Jul. 2008.
[3] Kumar, R.; Farkas, K.I.; Jouppi, N.P.; Ranganathan, P.; Tullsen, D.M., "Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction," Microarchitecture, 2003. MICRO-36. Proceedings of the 36th Annual IEEE/ACM International Symposium on, pp. 81-92, 3-5 Dec. 2003.
[4] S. Akhter and J. Roberts, "Multi-Core Programming: Increasing Performance through Software Multi-threading," Intel Press, 2006.
[5] Paul N. Leroux and Robert Craig, "Easing the Transition to Multi-Core Processors," Information Quarterly, Vol. 5, No. 4, pp. 34-37, 2006.
[6] Timothy M.; Beverly S.; and Berna M., Patterns for Parallel Programming (First ed.), Addison-Wesley Professional, 2004.
[7] Duranton, M., "The Challenges for High Performance Embedded Systems," Digital System Design: Architectures, Methods and Tools, DSD 2006, 9th EUROMICRO Conference on, pp. 3-7, 2006.
[8] Chu, M.; Ravindran, R.; Mahlke, S., "Data Access Partitioning for Fine-grain Parallelism on Multicore Architectures," Microarchitecture, 2007. MICRO 2007. 40th Annual IEEE/ACM International Symposium on, pp. 369-380, 1-5 Dec. 2007.
[9] Gene Golub and James M. Ortega, Scientific Computing: An Introduction with Parallel Computing, Academic Press Prof., Inc., San Diego, CA, USA, 1993.
[10] Nicholas Carriero and David Gelernter, "How to write parallel programs: a guide to the perplexed," ACM Comput. Surv. 21, 3 (September 1989), 323-357, 1989.
[11] Blaise Barney, "POSIX Threads Programming," Lawrence Livermore National Laboratory, https://computing.llnl.gov/tutorials/pthreads.
[12] "IEEE Standard for Information Technology - Portable Operating System Interface (POSIX) - Part 1: System Application Program Interface (API) - Amendment J: Advanced Real-time Extensions [C Language]," IEEE Std 1003.1j-2000, pp. 0_1-88, 2000.
[13] Apple Developer, Threading Programming Guide, 2010.
[14] Paul Bridger, Concepts and Synchronisation Primitives, http://www.paulbridger.com/deadlock/, 2006.
[15] Bil Lewis and Daniel J. Berg, Pthreads Primer: A Guide to Multithreaded Programming, Prentice Hall, California, USA, 1995.
[16] John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, pages 196-264, 2007.
[17] Faxén, K.-F., "Efficient Work Stealing for Fine Grained Parallelism," Parallel Processing (ICPP), 2010 39th International Conference on, pp. 313-322, 13-16 Sept. 2010.
[18] Karl-Filip Faxén and John Ardelius, "Manycore work stealing," in Proceedings of the 8th ACM International Conference on Computing Frontiers (CF '11), ACM, New York, NY, USA, Article 10, 2 pages, 2011.
[19] Karl-Filip Faxén, "Wool - A work stealing library," SIGARCH Comput. Archit. News 36, 5 (June 2009), 93-100, 2009.
[20] Sen, Siddhartha, "Dynamic Processor Allocation for Adaptively Parallel Work Stealing Jobs," Thesis, Chapter 4, Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2004.
[21] Blumofe, R.D.; Leiserson, C.E., "Scheduling multithreaded computations by work stealing," Foundations of Computer Science, 1994. Proceedings, 35th Annual Symposium on, pp. 356-368, 20-22 Nov. 1994.
[22] Faxén, K.-F., "Efficient Work Stealing for Fine Grained Parallelism," Parallel Processing (ICPP), 2010 39th International Conference on, pp. 313-322, 13-16 Sept. 2010.
[23] David B. Wagner and Bradley G. Calder, "Leapfrogging: a portable technique for implementing efficient futures," SIGPLAN Not. 28, 7 (July 1993), 208-217, 1993.
[24] Podobas, Artur; Brorsson, Mats; Faxén, Karl-Filip, "A Comparison of some recent Task-based Parallel Programming Models," in 3rd Workshop on Programmability Issues for Multi-Core Computers, 24 Jan 2010, Pisa, Italy.
[25] Karl-Filip Faxén, Christer Bengtsson, Mats Brorsson, Håkan Grahn, Erik Hagersten, Bengt Jonsson, Christoph Kessler, Björn Lisper, Per Stenström, Bertil Svensson, "Multicore computing - the state of the art," SICS, 2008.
[26] Enea, Enea OSE Architecture User's Guide, REV.BL140702.
[27] Enea, Enea OSE Core User's Guide, REV.BL140702.
[28] Enea, OSE Application Programming Interface Reference Manual.
[29] Patrik Stromblad, "White Paper: ENEA Multicore: High performance packet processing enabled with a hybrid SMP/AMP OS technology," Enea AB Whitepapers, 2009.
[30] Enea, "High-Availability RTOS for Complex, Distributed Systems," Enea AB, OSE 5.5 Datasheet, 2010.
[31] Magnus Karlsson, "The Future Multicore OS - A Visionary Look into the Future of Multicore Embedded Operating Systems," Enea.
[32] Ibrahim Hur and Calvin Lin, "Memory scheduling for modern microprocessors," ACM Trans. Comput. Syst. 25, 4, Article 10 (December 2007), 2007.
[33] Kourosh Gharachorloo, Daniel Lenoski, James Laudon, Phillip Gibbons, Anoop Gupta, and John Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors," SIGARCH Comput. Archit. News 18, 3a (May 1990), 15-26, 1990.
[34] Freescale, "e500mc Core Reference Manual," Rev. 1, 03/2012.
[35] Freescale, "P4080 QorIQ Integrated Multicore Communication Processor Family Reference Manual," Rev. 0, 04/2011.
[36] Freescale, "EREF: A Programmer's Reference Manual for Freescale Embedded Processors," Rev. 1, 12/2007.
[37] Wentzlaff, D.; Griffin, P.; Hoffmann, H.; Liewei Bao; Edwards, B.; Ramey, C.; Mattina, M.; Chyi-Chang Miao; Brown, J.F.; Agarwal, A., "On-Chip Interconnection Architecture of the Tile Processor," Micro, IEEE, vol. 27, no. 5, pp. 15-31, Sept.-Oct. 2007.
[38] Tilera, "TILE Processor User Architecture Manual," pages 274-279, REL.2.4, DOC.NO.UG101, 05/2011.
[39] Power ISA Version 2.06 Revision B, July 23, 2010.
[40] Enea AB, Enea Linux User's Guide, REV.20120419_1426.
[41] Robert Love, "Linux System Programming," Chapter 8, O'Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.
[42] Enea software AB, ENEA OSE Pthreads User's Guide.
[43] Enea software AB, ENEA OSE Pthreads Reference Manual.
[44] Enea AB, OSE System Programming Interface Reference Manual.
[45] Enea AB, OSE 5 Migration Guide, REV.BL150551.