Transcript
Written Exam / Tentamen∗ Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of Technology
2015-01-14
Examiner / Examinator: David Broman Teacher on duty / Ansvarig l¨arare: David Broman,
[email protected], +46 73 765 20 44
Instructions in English • Allowed aids: One sheet of A4 paper with handwritten notes. You may write on both sides of the paper. • Explicitly forbidden aids: Textbooks, electronic equipment, calculators, mobile phones, machinewritten pages, photocopied pages, pages of different size than A4. • Please write and draw carefully. Unreadable text may lead to zero points. • You do not need to return these exam papers when you hand in your exam solutions. • You may write your answers in either Swedish or English. The exam consists of two parts: • Part I: Fundamentals: The maximal number of points for Part I is 48 points (for IS1500) and 40 points (for IS1200). There are 8 points for each of the six course modules. All questions in Part I expect only short answers. At most a few sentences are needed. • Part II: Advanced: The maximal number of points for Part II is 50 points. In the answers, it is required that the student discuss, analyze, or construct. Forthermore, answers to these questions require clear motivations. True/False/Don’t know Questions in Part I These questions can give between 0 and 4 points. Each question consists of 4 statements. For each statement, you should answer either true, false, or “don’t know”. The points are calculated as follows: Each correct answer (answer true or false) gives one point. Each incorrect answer (true or false) gives minus one point. If you answer “don’t know” for a statement, it neither adds nor removes points. The rationale for introducing “don’t know” answers is to avoid that the student makes guesses. Example: Assume that the correct answers to four statements are: true, false, true, false. Assume that the student answered: true, true, don’t know, false. The total number of points is then: 1 - 1 + 0 + 1 = 1 point. Note that even if the answers to all four statements are wrong, the points for the whole question can never be negative, that is, the final points will always be between 0 and 4. ∗
This version of the exam paper has been update February 6, 2015. The updates include changes of the pass criteria (the pass criterion was changed (lowered) from 36 to 33 points in the fundamental part for IS1500 and 30 to 27 for IS1200). We also updated the criterion for FX (only 11 points are needed on the advanced part) and added a few clarifications on the solution suggestions.
1
Grades To get a pass grade (A, B, C, D, or E), it is required to pass Part I of the exam. For IS1500 students, it is required to get 33 points or more on Part I to pass the exam. IS1200 students should not answer question 1 in Part I. For IS1200 students, it is required to have 27 points or more in total for questions 2-6 on Part I. Grading scale (For both IS1200 and IS1500): • A: 41-50 points on Part II • B: 31-40 points on Part II • C: 21-30 points on Part II • D: 11-20 points on Part II • E: 0-10 points on Part II • FX: 30-32 points (IS1500) or 24-26 (IS1200) on Part I, and 11-50 points on Part II. • F: otherwise Results • The result will be announced at latest 2015-02-04. • If a student received grade FX, it is possible to request a complementary oral examination. Such complementary oral examination must be requested by the student. Please send an email to
[email protected] at latest 2015-02-25.
Instruktioner p˚a Svenska • Till˚atna hj¨alpmedel: En A4-sida med handskrivna anteckningar. Det a¨ r till˚atet att skriva p˚a b˚ada sidorna av pappret. • F¨orbjudna hj¨alpmedel: L¨arob¨ocker, elektroniska hj¨alpmedel, minir¨aknare, mobiltelefoner, maskinskrivna sidor, kopierade papper, sidor av andra storlekar a¨ n A4. • Skriv och rita noggrant. Ol¨asbar text kan resultera i noll po¨ang. • Du beh¨over inte l¨amna tillbaka dessa tentamenspapper n¨ar du l¨amnar in tentamensl¨osningarna. • Du kan skriva dina svar p˚a antingen engelska eller svenska. Tentamen best˚ar av tv˚a delar: • Del I: Fundamentala delen: Maximalt antal po¨ang f¨or del I a¨ r 48 po¨ang (f¨or IS1500) och 40 po¨ang (f¨or IS1200). Totalpo¨angen per kursmodul a¨ r 8 po¨ang (6 moduler totalt). F¨or del I f¨orv¨antas det endast korta svar p˚a fr˚agorna. Endast ett f˚atal meningar kr¨avs. • Del II: Avancerade delen: Det maximala antalet po¨ang f¨or del II a¨ r 50 po¨ang. I svaren f¨or denna del kr¨avs att studenten diskuterar, analyserar och konstruerar. Vidare kr¨aver svaren till dessa fr˚agor tydliga motiveringar.
2
Sant/Falskt/”Vet ej” fr˚agor f¨or Del I Dessa fr˚agor kan ge mellan 0 och 4 po¨ang. Varje fr˚aga best˚ar av 4 p˚ast˚aenden. F¨or varje p˚ast˚aende ska du svara antingen sant, falskt, eller “vet ej”. Po¨angen ber¨aknas enligt f¨oljande: Varje korrekt svar (svar sant eller falskt) ger 1 po¨ang. Varje felaktigt svar (sant eller falskt) ger minus en po¨ang. Om du svarar “vet ej” f¨or ett p˚ast˚aende ger det varken extrapo¨ang eller minuspo¨ang. Svarsalternativet “vet ej” finns f¨or att undvika att studenten gissar. Exempel: Anta att de korrekta svaren f¨or fyra p˚ast˚aenden a¨ r: sant, falskt, sant, falskt. Anta att studenten svarar: sant, sant, “vet ej”, falskt. Det totala antalet po¨ang blir d˚a: 1 - 1 + 0 + 1 = 1 po¨ang. Notera att a¨ ven om samtliga svar till alla fyra p˚ast˚aenden a¨ r felaktiga s˚a kan po¨angen f¨or hela fr˚agan aldrig bli negativ. S˚aldes a¨ r den slutliga po¨angen f¨or fr˚agan alltid mellan 0 och 4 po¨ang. Betyg F¨or att erh˚alla godk¨ant betyg (A, B, C, D eller E) kr¨avs att man f˚ar godk¨ant p˚a del I. F¨or IS1500studenter kr¨avs det 33 po¨ang eller mer f¨or del I f¨or att f˚a godk¨ant p˚a tentamen. IS1200-studenter ska inte svara p˚a fr˚aga 1 i del I. F¨or IS1200-studenter kr¨avs det 27 po¨ang eller mer totalt f¨or fr˚agorna 2-6 p˚a del I f¨or att bli godk¨and. Betygsskala (F¨or b˚ade IS1200 och IS1500): • A: 41-50 po¨ang p˚a del II • B: 31-40 po¨ang p˚a del II • C: 21-30 po¨ang p˚a del II • D: 11-20 po¨ang p˚a del II • E: 0-10 po¨ang p˚a del II • FX: 30-32 po¨ang (IS1500) eller 24-26 (IS1200) p˚a del I och 11-50 po¨ang p˚a del II. • F: i o¨ vriga fall Resultat • Resultaten kommer att meddelas senast 2015-02-04. • Om en student f˚ar betyg FX a¨ r det m¨ojligt att beg¨ara en muntlig examination. En s˚adan muntlig examination m˚aste beg¨aras via epost av studenten. Skicka ett epost-meddelande till
[email protected] senast 2015-02-25.
Part I: Fundamentals 1. Module 1: Logic Design (for IS1500 only) (a) English: Consider the figure below. Assume that the register initially has a zero value and that it is triggered on a rising clock edge. What are then the values for signals A, B, Y0 , Y1 , Y2 , and Y3 after the first raising clock edge when all signals have stabilized? (4 points) Swedish: Studera figuren nedan. Anta att register initialt har v¨ardet noll och att stigande klockflank a¨ r aktiv. Vad a¨ r d˚a v¨ardena p˚a signalerna A, B, Y0 , Y1 , Y2 , och Y3 efter den f¨orsta stigande klockflank d˚a alla signaler har stabiliserats? (4 po¨ang)
3
2" +
310#
2"
A# CLK# 2"
2" B#
1# 0# 1# 0#
00" 01" 10" 11"
1#
Decoder" 00" 0" 01" 1" 10" 11"
Y0# Y1# Y2# Y3#
Solution: A = 102 , B = 112 , Y0 = 0, Y1 = 0, Y2 = 0, and Y3 = 1. Explanation: The initial state value of the register is zero. Hence, after the first raising clock edge (and after stabilization) the register receives value 3 + 0 = 3, and therefore B = 112 = 310 . The adder then adds 3 plus 3 and since the signal is only two bits, the counter wraps around and A = 102 . The multiplexer selects the third signal 10, which has value 1. Hence, the signal that is decoded is 11, which means that only Y3 is 1, and the rest of the output signals from the decoder are zero. (b) English: For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points) Swedish: F¨or f¨oljande p˚ast˚aenden, svara om p˚ast˚aendet a¨ r sant, falskt, eller “vet ej”. Notera att felaktigt svar ger minuspo¨ang. F¨or en mer detaljerad f¨orklaring av hur po¨ang ber¨aknas f¨or sant/falskt/”vet ej” fr˚agor, se de f¨orsta sidorna av tentamen. (4 po¨ang) • Statement 1: English: Proof by perfect induction means that a theorem in boolean algebra can be proven correct by exhaustively showing all possible combinations in a truth table. Swedish: Bevis med hj¨alp av perfekt induktion inneb¨ar att ett teorem i boolesk algebra kan bevisas vara korrekt genom att utt¨ommande visa alla olika kombinationer i en sanningstabell. Solution: True. Proof by perfect induction is the same as Proof by Exhaustion (see the lecture slides). Proof by perfect induction (note the word perfect) should not be mixed up with standard mathematical induction proofs. • Statement 2: English: When a tristate buffer is disabled, the output is said to be floating. Swedish: N¨ar en tristate buffer a¨ r avaktiverad a¨ r dess utsignal flytande (floating). Solution: True. • Statement 3: English: The main difference between a SR latch and a D latch is that a D latch is triggered on the edge of a clock signal, whereas an SR latch is transparent. Swedish: Den st¨orta skillnaden mellan en SR-latch och en D-latch a¨ r att en Dlatch aktiveras p˚a flanken av en klocksignal, medan en SR-latch a¨ r transparent. Solution: False. Neither SR latches nor D latches are edge triggered. The main difference between an SR latch and a D latch is that a D latch is clocked, whereas 4
an SR latch is not clocked. • Statement 4: English: In a synchronous sequential circuit, the propagation delay of the combinational logic in the circuit is an important factor when determining the maximal clock frequency. Swedish: F¨or en synkront sekvenskrets a¨ r grindf¨ordr¨ojningen av den kombinatoriska logiken i kretsen en viktig faktor n¨ar man best¨ammer den maximala klockfrekvensen. Solution: True. There are several delays in the circuit that determines the maximal clock frequency. The delay in the combinatorial logic is one of these important delays.
2. Module 2: C and Assembly Programming ========================================== (a) English: What is the binary machine code representation of the MIPS instruction lb
$s1,-67($t2)
Answer as a 32-bit binary number. Note that a page with the structure of the encoding of MIPS is available at the end of the exam. (4 points) Swedish: Vad a¨ r den bin¨ara maskinkodsrepresentationen f¨or MIPS-instruktionen lb
$s1,-67($t2)
Svara som ett 32-bitars bin¨art tal. Notera att strukturen f¨or MIPS-kodning a¨ r tillg¨angligt p˚a en sida i slutet av tentan. (4 po¨ang) Solution: 1000 0001 0101 0001 1111 1111 1011 1101 (b) English: For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points) Swedish: F¨or f¨oljande p˚ast˚aenden, svara om p˚ast˚aendet a¨ r sant, falskt, eller “vet ej”. Notera att felaktigt svar ger minuspo¨ang. F¨or en mer detaljerad f¨orklaring av hur po¨ang ber¨aknas f¨or sant/falskt/”vet ej” fr˚agor, se de f¨orsta sidorna av tentamen. (4 po¨ang) • Statement 1: int x = 5; int y = 6; int *p = &x; *p = x + y; p = &y; *p = x + y; printf("%d,%d", x, y);
English: When executing the C code above, the string 11,17 is printed to the standard output. Swedish: N¨ar C-programmet ovan exekveras s˚a printas str¨angen 11,17 till standard output. Solution: True. 5
• Statement 2: English: The datastructure, that is used for storing local variables and additional arguments during function calls, stores the values as a first-in first-out (FIFO) queue in the memory. Swedish: Datastrukturen, som anv¨ands f¨or att spara lokala variabler och ytterligare argument vid funktionsanrop, sparar v¨ardena som en f¨orst-in-f¨orst-ut (FIFO) k¨o i minnet. Solution: False. The datastructure is a stack and it is stored as a last-in first-out (LIFO) queue. • Statement 3: English: In the MIPS ISA, a calling function (called the caller) does not have to save the registers $s0 to $s7 because the callee (the function that is called) is responsible for saving these registers, if they are used in the called function. Swedish: F¨or MIPS ISA, beh¨over inte en anropande funktion (¨aven kallad “caller”) spara registerna $s0 till $s7, d˚a funktionen som a¨ r anropad a¨ r ansvarig f¨or att spara dessa register, om de anv¨ands i den anropade funktionen. Solution: True. • Statement 4: English: A benefit of PC-relative addressing is that not all bits of the absolute address need to be stored in the instruction. Swedish: En f¨ordel med PC-relativ adressering a¨ r att alla bitar av den absoluta adressen inte beh¨over lagras i instruktionen. Solution: True. Parts of the current PC address is used when computing the branch target address (BTA). That is, the absolute address that is used when updating the PC and when executing the branch. 3. Module 3: Processor Design (a) English: Consider the figure below that shows the datapath for a single-cycle MIPS processor. Assume that the current instruction that is executing is slt $t0,$t1,$t3. What are then the values of signals MemToReg, ALUSrc, Branch, and A3? (4 points) Swedish: Studera figuren nedan som visar en datav¨ag f¨or en enkel-cyklig MIPSprocessor. Anta att den nuvarande instruktionen som exekveras a¨ r slt $t0,$t1,$t3. Vad a¨ r d˚a v¨ardena av signalerna MemToReg, ALUSrc, Branch och A3? (4 po¨ang)
6
*
PC#
32#
25:21# A1$ 20:16# A2$
RD2$
0* 1*
32#
WD3$ 32#
A$
RD$
0* 1* *
32#
WD$
+$
0* 15:11# 1* *
4* 25:0#
WE$
Zero#
*
20:16#
31:28#
<<2*
32#
A3$ 32#
27:0#
3#
RD1$
MemToReg*
CLK$
15:0#
<<2* Sign*Extend*
+$
32#
A$ RD$
ALUControl*
WE3$
Data* Memory*
0* 1*
Inst#
ALU$
32#
Instruc(on* Memory*
*
MemWrite*
ALUSrc*
CLK$
CLK$ 0* 1*
Branch*
RegWrite** RegDst*
Jump*
32#
Solution: MemToReg = 0, ALUSrc = 0, Branch = 0, and A3 = 010002 (b) English: For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points) Swedish: F¨or f¨oljande p˚ast˚aenden, svara om p˚ast˚aendet a¨ r sant, falskt, eller “vet ej”. Notera att felaktigt svar ger minuspo¨ang. F¨or en mer detaljerad f¨orklaring av hur po¨ang ber¨aknas f¨or sant/falskt/”vet ej” fr˚agor, se de f¨orsta sidorna av tentamen. (4 po¨ang) • Statement 1: English: In a five-stage MIPS pipelined datapath, the execute stage is usually used for reading out the values from the register file. Swedish: I en fem-stegs MIPS-pipelinad datav¨ag s˚a l¨aser vanligtvis exekveringssteget ut v¨ardena fr˚an registerfilen. Solution: False. It is done in the decode stage. • Statement 2: English: A benefit of an arithmetic logic unit (ALU) is that the same hardware can be used for different arithmetic operations. Swedish: En f¨ordel med en aritmetisk-logisk enhet (ALU) a¨ r att samma h˚ardvara kan anv¨andas f¨or olika aritmetiska operationer. Solution: True. • Statement 3: English: A pipelined datapath may, compared to a single-cycle datapath, increase the average cycles per instruction (CPI). Swedish: En pipelinad datav¨ag kan, i j¨amf¨orelse med en enkel-cyklig datav¨ag, o¨ ka genomsnittliga v¨ardet f¨or antalet cykler per instruktion (CPI). Solution: True. Compared to a single-cycle datapath (where each instruction takes one clock cycle) the pipeline may result in hazards, which can result in stalling. Hence, the average CPI may increase when a pipelined datapath is used. • Statement 4: 7
English: Instructions in a CISC instruction set architecture (ISA) can in general perform more complex operations than instructions in a RISC ISA. Swedish: Instruktioner i en CISC instruction set architecture (ISA) kan generellt utf¨ora mer komplexa operationer a¨ n instruktioner i ett RISC ISA. Solution: True. CISC stands for “Complex Instruction Set Computing”. An example of CISC is x86, where an instruction can perform several tasks, for instance both load from memory and perform an addition. ISAs that are based on RISC, which stands for “Reduced Instruction Set Computing” have in general few simple instructions. 4. Module 4: Memory Hierarchy (a) English: Assume that we have a 2-way set associative cache for a processor that uses 32-bits for addressing. The cache has 1024 cache blocks in total and the block size is 16 bytes. How many bits are then the tag field, the set field (also called the index), and the byte offset field, and how many validity bits does the cache contain in total? (4 points) Swedish: Anta att vi har en 2-way set associative cache f¨or en processor som anv¨ander 32 bitar f¨or adressering. Cachen har 1024 cache block totalt och en blockstorlek p˚a 16 bytes. Hur m˚anga bitar a¨ r d˚a adressetiketten (¨aven kallad tag), f¨altet f¨or radnummer (¨aven kallad index eller set), och f¨altet f¨or bytenummer (¨aven kallad byte offset), samt hur m˚anga giltighetsbitar inneh˚aller totalt denna cache? (4 po¨ang) Solution: There are in total 1024/2 = 512 sets (swedish: rader). Hence, the set field is 9 bits. The block size is 16 bytes, making the byte offset field 4 bits. As a consequence, the tag is 32 − 9 − 4 = 19 bits. Finally, there are in total 1024 number of validity bits, the same number as blocks. (b) English: For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points) Swedish: F¨or f¨oljande p˚ast˚aenden, svara om p˚ast˚aendet a¨ r sant, falskt, eller “vet ej”. Notera att felaktigt svar ger minuspo¨ang. F¨or en mer detaljerad f¨orklaring av hur po¨ang ber¨aknas f¨or sant/falskt/”vet ej” fr˚agor, se de f¨orsta sidorna av tentamen. (4 po¨ang) • Statement 1: English: A common replacement policy for direct mapped caches is Least Recently Used (LRU), since a set does not always have to be replaced directly, even if the validity bit is 1. Swedish: An vanlig utbytespolicy f¨or direkt mappade cachar a¨ r Least Recently Used (LRU). Detta d˚a raden inte alltid beh¨over ers¨attas, a¨ ven om giltighetsbiten a¨ r satt till 1. Solution: False. A direct mapped cache do not need an specific replacement policy; if the set is used, it must always be replaced. • Statement 2:
8
English: One benefit of virtual memory is that each process (each running program) can have its own virtual memory space, which is protected from other concurrently running processes in the operating system. Swedish: En f¨ordel med virtuellt minne a¨ r att varje process (varje exekverande program) kan ha sitt eget virtuella minnesutrymme, vilket a¨ r skyddat fr˚an andra samtidiga processer som exekveras i operativsystemet. Solution: True. • Statement 3: English: Assume that we have a direct mapped cache with block size 16 bytes and 256 blocks in total. If a load byte instruction reads from address 0xff20 215e, followed by another load byte instruction that reads from address 0x1000 5150, the second load instruction will result in a cache miss. Swedish: Anta att vi har en direkt mappad cache med blockstorlek 16 bytes och 256 block totalt. Om en load byte-instruktion l¨aser fr˚an adress 0xff20 215e, vilket f¨oljs av en ytterligare load byte-instruktion som l¨aser fr˚an adress 0x1000 5150, s˚a kommer den andra load-instruktionen att resultera i en cachemiss. Solution: True. Both instructions access the same set (0x15), but have different tags. • Statement 4: English: A modern processor, such as the Intel Core-I7, has only one large cache because the speed of one large cache is typically higher than having several small caches. Swedish: En modern processor, som t.ex. Intel Core-I7, har endast en stor cache d˚a hastigheten p˚a en stor cache typiskt a¨ r h¨ogre a¨ n alternativet att ha flera sm˚a cachar. Solution: False. A modern processor has several caches in the memory hierarchy (e.g., L1, L2, and L3 caches). 5. Module 5: I/O Systems (a) English: Assume that an external interrupt occurs, a program A is preempted, and the program counter is changed so that an interrupt handling routine is executed. Explain shortly how it is possible to continue to execute program A correctly at the program point where it was interrupted. (4 points) Swedish: Anta att ett externt avbrott sker, att ett program A a¨ r a˚ sidosatt (preempted) och programr¨aknaren a¨ ndras s˚a att en avbrottshanteringsrutin exekveras. F¨orklara kortfattat hur det a¨ r m¨ojligt att forts¨atta exekveringen av program A p˚a ett korrekt s¨att vid den programpunkt d¨ar avbrottet intr¨affade (4 po¨ang). Solution: When the external interrupt occurs, the processor must automatically save the program counter, before jumping to the interrupt handling routine. For instance, in MIPS, the PC is saved in a register called EPC. The interrupt handling routine must save all registers on the stack before performing its task. When the interrupt handling routine has finished, it restores all registers and returns to the original program by using the saved PC. 9
(b) English: For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points) Swedish: F¨or f¨oljande p˚ast˚aenden, svara om p˚ast˚aendet a¨ r sant, falskt, eller “vet ej”. Notera att felaktigt svar ger minuspo¨ang. F¨or en mer detaljerad f¨orklaring av hur po¨ang ber¨aknas f¨or sant/falskt/”vet ej” fr˚agor, se de f¨orsta sidorna av tentamen. (4 po¨ang) • Statement 1: English: SPI and UART use synchronous and asynchronous serial communication, respectively. Swedish: SPI och UART anv¨ander synkron respektive asynkron seriell kommunikation. Solution: True. • Statement 2: English: A memory-mapped general-purpose I/O (GPIO) pin can be configured to be either an output or input port, by writing a configuration parameter to a specific memory address that is dedicated for configuring this GPIO pin. Swedish: En minnes-mappad general-purpose I/O (GPIO) pin kan konfigureras att vara antingen en input eller en output port. Detta g¨ors genom att skriva en konfigurationsparameter till en specifik minnesadress, vilken a¨ r dedikerad till att konfigurera denna GPIO pin. Solution: True. • Statement 3: English: Direct memory access (DMA) is a good alternative to instruction level parallelism (ILP) because it enables multiple instructions to be fetched by the datapath and executed in parallel since the code is read directly from memory. Swedish: Direct memory access (DMA) a¨ r ett bra alternativ till instruktionsparallellism (ILP) d˚a det m¨ojligg¨or att flera instruktioner kan l¨asas av datav¨agen och sedan exekveras parallellt, eftersom koden l¨ases direkt fr˚an minnet. Solution: False. DMA is not related to ILP and is not used for fetching several instructions in parallel. • Statement 4: English: Declaring a pointer volatile in C (as in the example below) means that the pointer itself is volatile and may change to point to a different address at any point in time. Swedish: Att deklarera en pekare flyktig (volatile) i C (som i exemplet nedan) betyder att pekaren sj¨alv a¨ r flyktig och kan n¨ar som helst a¨ ndras s˚a att den pekar p˚a en annan adress. volatile int* x = (volatile int*) 0xff88;
Solution: False. The pointer itself cannot change indirectly, but the value that the pointer points to may change. 6. Module 6: Parallel Processors and Programs 10
(a) English: A program consists of two parts, part A and part B. Part A is trivial to parallelize, whereas part B is not possible to parallelize at all and consists only of sequential code. Assume that the amount of improvement that can be achieved by parallelizing part A is proportional to the number of cores, i.e., using 4 cores, we achieve 4 times improvement on part A. The theoretical maximal speedup is 5, assuming that we have infinite number of cores. Our competitor can run a sequential version of the program in 16s, whereas we can, using our parallel implementation, run the program in 20s on one core. What are then the relative speedup and the true speedup of our implementation when executing the program on 4 cores? Hint: recall that true speedup compares with the fastest available sequential implementation, whereas relative speedup compares with your own implementation running sequentially. (4 points) Swedish: Ett program best˚ar av tv˚a delar, del A och del B. Del A a¨ r trivialt att parallellisera, medan del B inte alls a¨ r m¨ojlig att parallellisera och best˚ar endast av sekventiell programkod. Anta att f¨orb¨attringsm¨ojligheten som man kan uppn˚a genom att parallellisera del A a¨ r proportionell mot antalet processork¨arnor, dvs, om man anv¨ander 4 k¨arnor s˚a uppn˚ar man 4 g˚angers f¨orb¨attring. Den teoretiska maixmala speedupen a¨ r 5, om man antar att vi har o¨andligt m˚anga k¨arnor. V˚ar konkurrent kan k¨ora en sekventiell version av programmet p˚a 16s, medan vi kan k¨ora v˚ar parallella implementation p˚a 20s om den exekverar p˚a 1 core. Vad blir d˚a den relativa speedupen och den sanna speedupen (true speedup) f¨or v˚ar implementation om man exekverar programmet p˚a 4 k¨arnor? Tips: Notera att sann speedup j¨amf¨or med den snabbast tillg¨angliga sekventiella implementationen, medan relativ speedup j¨amf¨or med din egna implementation n¨ar den k¨ors sekventiellt. (4 po¨ang) Solution: Since we know that the theoretical speedup limit is 5, 51 of the execution time is due to part B, the sequential code. By applying Amdahl’s law, we get 16 20 = 20 = 2.5 and speedup true = 16/4+4 = 16 = 2.0. Hence, speedup relative = 16/4+4 8 8 the true speedup is, as expected, somewhat lower than the relative speedup. (b) English: For each of the following statements, answer if the statement is true, false, or “don’t know”. Note that incorrect answers give minus points. Please see the first pages of the exam for an explanation of how the points are calculated for these true/false/don’t know questions. (4 points) Swedish: F¨or f¨oljande p˚ast˚aenden, svara om p˚ast˚aendet a¨ r sant, falskt, eller “vet ej”. Notera att felaktigt svar ger minuspo¨ang. F¨or en mer detaljerad f¨orklaring av hur po¨ang ber¨aknas f¨or sant/falskt/”vet ej” fr˚agor, se de f¨orsta sidorna av tentamen. (4 po¨ang) • Statement 1: English: A good example of MIMD is a modern multicore processor. Swedish: Ett bra exempel p˚a MIMD a¨ r en modern multicore-processor. Solution: True. • Statement 2: English: MapReduce can be based on message passing techniques and is today used extensively in Warehouse-scale computers. Swedish: MapReduce kan vara baserat p˚a meddelandehanteringsteknik och anv¨ands idag i h¨og omfattning i Warehouse-scale computers. 11
Solution: True. • Statement 3: English: If a semaphore is used for mutual exclusion (mutex), it means the programmer can define critical sections by first locking a mutex before entering the critical section, and then unlocking the mutex when exiting the critical section. Hence, a mutex can be used to have controlled access to shared resources in a concurrent environment. Swedish: Om en semaphore anv¨ands f¨or mutual exclusion (mutex), betyder det att programmeraren kan definiera kritiska sektioner genom att f¨orst l˚asa en mutex innan man kommer in i den kritiska sektionen, och sen l˚asa upp mutexen n¨ar man l¨amnar den kritiska sektionen. Allts˚a, en mutex kan anv¨andas f¨or att kontrollera tillg˚ang till delade resurser i en samtidig (concurrent) milj¨o. Solution: True. • Statement 4: English: Assume we have a shared memory processor (SMP) consisting of 4 cores with separate L1 caches and an L2 cache that is shared among the cores. Assume further that two of the cores frequently reads or writes to the same address in memory. As a consequence, we get easily inconsistencies in the L1 caches. This phenomena is called false sharing. Swedish: Anta att vi har en processor med delat minne (shared memory processor) som best˚ar av 4 k¨arnor med separata L1 cachar och en gemensam L2 cache som delas mellan k¨arnorna. Anta a¨ ven att tv˚a av k¨arnorna frekvent l¨aser eller skriver till samma minnesadress. S˚aledes leder detta till inkonsistens i L1 cacharna. Detta fenomen kallas falsk delning (false sharing). Solution: False. The problem described is related to cache coherence, but not directly to false sharing. If it was about false sharing, the cores would write to the same cache block, but not to the exact same address.
12
Part II: Advanced 1. English: Explain in detail how a 2-way set associative data cache works. Your solution should include the following: • Sketch a hardware solution that can handle hit and miss detection. The solution should also return the correct cache value for a cache hit. This illustration should include and explain the terms of tag, set, byte offset, validity bit, and way. • Explain the concept of replacement policy. In particular, explain the meaning of Least Recently Used (LRU). You do not have to provide the hardware solution for the LRU. • Show assembly code examples and step-by-step guides for how the execution of the example affects the cache. Your code examples must illustrate both temporal locality and spatial locality. Your example code does not have to show the effects of the replacement policy. You may write NIOS II or MIPS assembly code. Clearly describe and motivate your answers. Diagrams or figures without explanations will not give any points. (15 points) Swedish: F¨orklara i detalj hur en 2-v¨ags associativ data-cache fungerar. Din l¨osning ska inneh˚alla f¨oljande: • Skissa en h˚ardvarul¨osning som kan hantera hit- och miss-detektering. L¨osningen skall a¨ ven returnera korrekt cache-v¨arde vid en cache tr¨aff. L¨osningen ska inneh˚alla och f¨orklara termerna adressetikett (¨aven kallad tag), rad (¨aven kallad set), bytenummer, giltighetsbit samt “way”. • F¨orklara konceptet utbytespolicy (replacement policy). F¨orklara speciellt betydelsen av Least Recently Used (LRU). Du beh¨over dock inte tillhandah˚alla h˚ardvarul¨osningen f¨or LRU. • Visa assembler kodexempel med steg-f¨or-steg guide f¨or hur exekveringen av exemplet p˚averkar cachen. Dina exempel m˚aste visa b˚ade tidsm¨assig (temporal) och spatiell lokalitet. Dina exempel beh¨over inte visa utbytespolicyns effekter. Du kan v¨alja att skriva antingen assemblerkod f¨or NIOS II eller MIPS. Beskriv och motivera ditt svar tydligt. Diagram eller figurer utan f¨orklaringar ger inga po¨ang. (15 po¨ang) Solution: This tasks can be answered in many different ways. A complete solution of the task is therefore omitted. Instead, we just give a few comments: • When describing replacement policies, it is important to relate this to LRU. A good solution explains this by using an example. • The code examples should show temporal and spatial locality in the data cache (data cache is part of the exercise. The old solution text talked about instruction cache). Temporal locality can be shown by reading from the same data element several times, for instance in a loop. Spatial locality can be shown by, for instance, reading from an array.
13
• In the hardware solution, it is important to show how the tag is compared in the address and in the cache, as well as how the validity bit is compared. This is usually done with an AND-gate. Note also that we need to use a multiplexor to select the correct output from the cache. • It is important to explain all the terms (see item 1 in the question).
14
2. English: There are many ways to make use of parallelism to achieve better performance in a computer system. Three important concepts/techniques are SIMD instructions (for instance multimedia extensions), instruction level parallelism (ILP), and multicore. In this task, you should analyze, discuss, and compare these different techniques. In what way are they similar? In what way are they different? Which are their pros and cons? How do they affect the programmer or the compiler? Where are the limitations? Your answer should consist of a comprehensive and well-thought-out discussion. (10 points) Swedish: Det finns m˚anga olika s¨att att anv¨anda parallellism f¨or att uppn˚a b¨attre prestanda i ett datorsystem. Tre viktiga koncept/tekniker a¨ r SIMD-instruktioner (t.ex. multimedia-ut¨okningar), instruktions-niv˚a parallellism (instruction level parallelism, ILP), och multicore (flera k¨arnor). I denna uppgift ska du analysera, diskutera, och j¨amf¨ora dessa olika tekniker. P˚a vilka s¨att a¨ r de lika? P˚a vilka s¨att a¨ r de olika? Vilka a¨ r deras f¨ordelar och nackdelar? P˚a vilket s¨att p˚averkar de programmeraren eller kompilatorn? Vilka a¨ r begr¨ansningarna? Ditt svar ska innefatta en utf¨orlig och genomt¨ankt diskussion. (10 po¨ang) Solution: This task has not one solution. Completely different answers can still give the same number of points. In the following, we list some important aspects that can be included in an answer. • SIMD makes use of data-level parallelism. Hence, there is limitations of what kind of programs that can actually utilize this kind of parallelism. The same instruction needs to be applied to different data. • ILP can take the form of static and dynamic multiple issue. Today, dynamic multiple issue is very common in modern processors. The main benefit is that ILP does not affect the programmer, parallelism “comes for free” even for a sequential program. However, the amount of parallelism is limited due to dependencies between instructions. • Multicore processors can give parallelism by the help of the programmer. Typically, for a shared memory processor (SMP), a multithreaded programming can be used to achieve parallelism. This is a form of task parallelism (compared to data-level parallelism for SIMD). • Each of these techniques do not have to work in isolation. Instead, a program can make use of all these concepts and techniques to achieve speedups. • A similarity between multicore and SIMD are that both these techniques typically need help from the programmer to actually work. Certain compilers exist, e.g., OpenMP, that can be used to program multicore systems in an easy way. • ILP does not need to have support from the programmer, but if the programmer programs in a special way, more ILP can be explored. One such technique is loop unrolling, which makes it possible for the processor to fetch more instructions to always try to fill the pipeline. • A problem with multicore programming is that it is quite hard, unless there are few dependencies between different tasks. Programs need to be synchronized using for instance semaphores. 15
• SIMD is today typically programmed with special platform dependent libraries. • A problem with SMP multicore systems is cache coherence. If the programmer is not aware of how the communication and access to memory effects the caches, the performance improvements may be much less than expected.
3. English: Explain in detail the meaning of the following terms and concepts and their relationships: Execution time of programs, Cycles Per Instruction (CPI), Clock frequency, Clock period, Power, and Energy. Explain also how computer architects and processor manufacturers try to decrease energy consumption and why it is difficult to do (10 points). Swedish: F¨orklara i detalj betydelsen av f¨oljande termer och koncept, samt deras relationer: Exekveringstid f¨or program, Cykler per instruktion (CPI), Klockfrekvens, Klockperiod, Effekt och Energi. F¨orklara ocks˚a hur datorarkitekter och processortillverkare f¨ors¨oker att minska energi˚atg˚angen, samt varf¨or detta a¨ r sv˚art (10 po¨ang). Solution: This task has not one solution. Completely different answers can still give the same number of points. In the following, we list some important aspects that can be included in an answer. • A programs execution time depends on several factors, where the most important ones are i) the number of executed instructions, ii) the cycles per instruction (CPI), and iii) the clock period. • If a processor has a pipeline, this can increase the clock frequency, which gives better performance. On the other hand, a pipeline can also introduce pipeline hazards, that can result in stalls. This will increase the CPI. • Clock frequency is defined as one divided by the clock period. • For over 30 years, processor manufactures have constantly increased the clock rate of processors, thus also increased the power. As a consequence, the so called power wall was reached around year 2006. Instead of increasing the clock frequency of the processor, manufacturers started to add processor cores. • The dominating source of energy consumption is the dynamic energy, which is consumed when each transistor is switching. The dynamic energy for switching a transistor is proportional to the capacitive load (the number of transistors connected to an output and the manufacturing technology), and the square of the voltage. Since the voltage is the dominating factor (the term is squared), processors are today using much lower voltage for their power supply, compared to just a few years ago. • The dynamic power is proportional to the energy used for a transistor transition, multiplied with the frequency switched. This means that power increases when frequency increases, but the voltage is still dominating due to the square term. Note that the energy for computing a task does not decrease just because the frequency is lower, but the power becomes lower. Although efforts are made to lower voltage, there is a limit for how low the voltage can be without increasing the leakage. In server systems, the static power dissipation due to leakage can be as high as 16
40%. Static energy is consumed (or transformed to other forms of energy) in CMOS technology even if a transistor is off. As a consequence, increasing the number of transistor (for instance, increasing the number of cores in a processor) increases the static energy consumption, even if the transistors are not switching. It is therefore hard to decrease the energy because of the voltage limitation and the increased demand on more cores and larger caches. There are also techniques for switching off parts of a chip during execution to decrease energy further.
17
4. English: Carefully analyze the partially commented MIPS assembler program on the next page. Construct a C program that generates the same result in memory as the MIPS program. Note that there are two memory arrays. A 16x16 word matrix starting at address 0x1001 0000 and an output array of size 16 words (address computed and stored in $s3 in part 2 of the MIPS code). Your C program should start with the variable declarations shown below. Assume that there exist code at the end of the program the prints out the arrays to standard output. Explain in detail how your program works and what the expected result of the program is, i.e., you should explain what the program is actually trying to accomplish (15 points). Swedish: Analysera noggrant det partiellt kommenterade MIPS-assembler-programmet p˚a n¨asta sida. Konstruera ett C-program som genererar samma resultat i minnet som MIPS-programmet. Notera att det finns tv˚a stycken minnesarrayer. En 16x16-ords matris som startar p˚a adress 0x1001 0000 och en “output”-array av storlek 16 ord (adressen ber¨aknas och sparas i $s3 i del 2 av MIPS-koden). Ditt C-program ska b¨orja med de variabeldeklarationer som finns nedan. Anta att det finns kod i slutet av programmet som skriver ut arrayerna till standard output. F¨orklara i detalj hur ditt program fungerar och vad det f¨orv¨antade resultatet av programmet a¨ r, dvs. du ska f¨orklara vad programmet egentligen f¨ors¨oker att g¨ora (15 po¨ang). int main(){ const int maxval = 16; int matrix[maxval*maxval]; int output[maxval]; // Insert your code here. // Assume that there is code here that prints // out the arrays to standard output. }
18
###### PART 1 ##### addi sll lui addi loop1_outer: slt beq addi loop1_inner: slt beq sll add sll add addi addi mul sw addi j done1_inner: addi j done1_outer:
$s0, $s1, $s2, $t1,
$0, 16 $s0, 4 0x1001 $0, 0
sll add addi loop2_outer: slt beq addi addi loop2_inner: slt beq sll add lw bne addi no_match: addi j done2_inner: beq sw addi no_output: addi j done2_outer:
$s3, $s1, 2 $s3, $s3, $s2 $t1, $0, 2
# Address to a matrix that is 16x16 word # counter i
$t3, $t1, $s0 $t3, $0, done1_outer $t2, $0, 0 # counter j $t3, $t2, $s0 $t3, $0, done1_inner $t3, $t1, 4 # $t3, $t3, $t2 $t3, $t3, 2 $t3, $t3, $s2 $t4, $t1, 2 # $t5, $t2, 2 $t4, $t4, $t5 $t4, 0($t3) # $t2, $t2, 1 # loop1_inner
compute address
compute element value
store result in matrix increase and loop
$t1, $t1, 1 loop1_outer ###### PART 2 ##### # address to output array # counter i
$t3, $t3, $t6, $t2,
$t1, $s0 $0, done2_outer $0, 1 $0, 0 # counter j
$t3, $t3, $t4, $t4, $t5, $t1, $t6,
$t2, $s1 $0, done2_inner $t2, 2 # get element from matrix $t4, $s2 0($t4) $t5, no_match # check if elements match $0, 0
$t2, $t2, 1 loop2_inner
# next
$t6, $0, no_output # check if write output $t1, 0($s3) $s3, $s3, 4 $t1, $t1, 1 loop2_outer
# next
19
Solution: The program computes all prime numbers between 1 and 16 and stores them in the output array. The first part of the program computes a multiplication matrix, that is then used in the second part to search for numbers that are prime numbers (i.e. that do not exist in the matrix). This is a simple, but not very efficient way of computing prime numbers. An example C code is shown below. int main(){ const int maxval = 16; int matrix[maxval*maxval]; int output[maxval]; // Compute the multiplication matrix int i,j; for(i=0; i
>>#shamt;# shi`#right#logical#(R,0,2) #srl rd,rt,shamt #reg(rd)#:=#reg(rt)#>>#shamt;# shi`#right#logical#variable#(R,0,6) #srlv rd,rt,rs reg(rd)#:=#reg(rt)#>>#reg(rs4:0); ## store#byte#(I,40,na) # #sb rt,imm(rs) #mem[reg(rs)#+#signext(imm)]7:0#:=#reg(rt)7:0;# store#word#(I,43,na) # #sw rt,imm(rs) #mem[reg(rs)#+#signext(imm)]#:=#reg(rt);# subtract#(R,0,34) # #sub rd,rs,rt reg(rd)#:=#reg(rs)#[#reg(rt);# subtract#unsigned#(R,0,35) #subu rd,rs,rt #reg(rd)#:=#reg(rs)#[#reg(rt);# xor#(R,0,38) # #xor rd,rs,rt reg(rd)#:=#reg(rs)#^#reg(rt);# xor#immediate#(I,14,na) #xori rt,rs,imm #reg(rt)#:=#rerg(rs)#^#zeroext(imm);# # Defini