Final Exam Prep: Concepts

S22.1a: All the threads in a warp share the same front end (i.e. fetch, decode, etc.). State two reasons why GPU designers decided to do so instead of having a front end for each thread, that is, one front end per SP.

  • This will save room in the chip for more SPs and hence more potential parallelism.

  • This will reduce the cost of the hardware as we will need fewer front ends which are expensive.

  • All threads are executing the same code. So, it makes sense to reduce the front-end hardware.

S22.1b: Why did NVIDIA GPU designers decide to group several SPs in an SM?

  • It is useful, in many applications, for threads executing on SPs to exchange data in a fast way and to synchronize.

  • It is prohibitively expensive to have a full connection among all the SPs in the chip (we have hundreds of SPs)

  • Also, it leads to poor performance if we let the SPs communicate through the global memory

  • Therefore, a middle ground is to group SPs in clusters called SMs and have fast communication among SPs in the same SM.

S22.1c: If the warp concept did not exist, what would be the implications for GPU programming? State three implications and, for each one, explain in 1-2 lines.

  • The programmer will need to know the number of SPs per SM because the number of threads per block cannot exceed that number.

  • Without warps, we would need a front end for each SP, so there would be no need for the number of threads per block to be a multiple of the warp size.

  • No need to worry about thread-divergence.

S22.1d: Suppose we want to implement vector addition on a multi-GPU system because the two vectors we want to add are huge. Each thread will be responsible for a few hundred elements. Which is more beneficial: using the traditional cudaMalloc() and cudaMemcpy(), or using unified memory? And why?

Unified memory is better here [4], for several reasons [6; two reasons, 3 points each, are enough]:

  • Data will not need to move several times among different GPUs, because each thread will be working exclusively on a subset of the elements.

  • Programming is easier.

  • Since the vectors are huge, there is a chance we run out of memory. With unified memory, you can exceed the capacity of the device’s memory and use the system virtual memory.
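A minimal sketch of this approach in CUDA C++ (vector size, launch configuration, and the even split across GPUs are illustrative assumptions; error checks are omitted for brevity):

```cuda
#include <cuda_runtime.h>

// Grid-stride kernel: one thread handles many elements, as in the question.
__global__ void vecAdd(const float *a, const float *b, float *c, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (; i < n; i += stride)
        c[i] = a[i] + b[i];
}

int main() {
    const size_t N = 1u << 28;          // hypothetical "huge" vector size
    float *a, *b, *c;
    // One allocation visible to the host and to every GPU; pages migrate
    // on demand and can exceed a single device's memory capacity.
    cudaMallocManaged(&a, N * sizeof(float));
    cudaMallocManaged(&b, N * sizeof(float));
    cudaMallocManaged(&c, N * sizeof(float));
    for (size_t i = 0; i < N; i++) { a[i] = 1.0f; b[i] = 2.0f; }

    int nGpus;
    cudaGetDeviceCount(&nGpus);
    size_t chunk = N / nGpus;           // assume nGpus divides N evenly
    for (int d = 0; d < nGpus; d++) {
        cudaSetDevice(d);
        // Each GPU works on a disjoint chunk, so pages are not bounced
        // back and forth between devices.
        vecAdd<<<256, 256>>>(a + d * chunk, b + d * chunk, c + d * chunk, chunk);
    }
    for (int d = 0; d < nGpus; d++) { cudaSetDevice(d); cudaDeviceSynchronize(); }
    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```

Note how the disjoint chunks are what makes unified memory cheap here: no page is demanded by two GPUs.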

S22.2c: Can more than one grid exist in the same GPU at the same time? Explain in 1-2 sentences.

Yes, it can happen [3]. A grid is a kernel in execution. There may be more than one kernel (belonging to the same process or to different processes) executing on the GPU at the same time [2].

S22.2d: Even though registers are much faster than shared memory, it is sometimes more beneficial to put data in shared memory. When does this happen?

If the data will be shared among threads of the same block, then using shared memory is better [3] because it reduces the trips to global memory and hence may enhance the performance [2].

F21.2a: If the total number of threads in the whole grid is t, is it always the case that the total number of warps created in the whole grid is ceil(t/32)? Assume warp size = 32 threads. Explain in no more than 2-3 lines.

No. For example, 32 threads can be divided into two blocks (hence two warps) rather than assigned to one block (hence one warp). With equal-sized blocks: #warps = #blocks × warps per block = #blocks × ceil((t / #blocks) / 32).

F21.2b: State two advantages of warps.

  • Reduces the cost of GPU hardware.

  • Decouples the number of threads per block from number of SPs per SM.

F21.2c: What is needed for GPUs to be standalone machines and not just co-processors to CPUs?

Nothing; the GPU can be a standalone machine if we write an OS that runs on the GPU. The performance would be bad, though, because an OS is not GPU-friendly code.

F21.2d: Do you think there is any kind of virtual memory in GPUs? Justify your answer in 1-2 lines.

Yes, there is. Example: Two blocks assigned to the same SM cannot see each other’s memory. Also, local memories for different threads are isolated from each other.


The GPU global memory is connected to the device itself (i.e. the GPU) and to the system memory (i.e. the one connected to the CPU).

F20.3a: Which connection is slower? And why do you think it is slower?

  • The one connecting the global memory to system memory is slower.

  • The main reason is that GPU companies have to follow the industry standard (e.g. PCIe) in order for their GPUs to work with CPUs and motherboards from other vendors. Otherwise, GPU vendors could build faster proprietary buses (e.g. NVLink).

F20.3b: The GPU global memory and the system memory are both built from the same technology, yet, they are different. State two main differences between the GPU global memory and the system memory.

  • System memory is designed for latency while GPU memory is designed for bandwidth.

  • GPU memory is, on average, smaller in size than system memory.

  • The bus connecting the system memory to the CPU follows the machine architecture (64-bit for the majority of machines nowadays), while GPU memory has much wider buses.

F19.1: Can some threads of a warp finish before other threads of that same warp? Explain.

Yes. For example, if there is an if-else and nothing after the else-part, the threads taking the if-part will finish before those taking the else-part. This requires branch divergence.

F19.2: If we take a snapshot of a warp during execution we may not always find 32 threads executing. State three scenarios that may cause this to happen.

  • Number of threads in a block is not multiple of 32

  • Branch-divergence

  • Threads finished earlier in the warp.

F19.3: State one reason you will use unified memory in a multi-GPU system. And state another reason you will use the traditional `cudaMemcpy()` instead.

  • Use unified memory: programming is easier, and when the data is not shared among different GPUs, unified memory will automatically move the data to the GPU that needs it.

  • Use traditional cudaMemcpy(): Some data may need to be shared among multiple GPUs. So, unified memory may move the data back and forth. But, if done by hand, you can duplicate the data and/or use streams to overlap data movement and computation.
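A sketch of the hand-managed alternative in CUDA C++, assuming at least two GPUs and a placeholder `scaleKernel`; per-device streams let the copies and kernels of the two devices proceed concurrently (error checks omitted):

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float *x, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;   // placeholder computation
}

int main() {
    const size_t N = 1 << 20, half = N / 2, bytes = half * sizeof(float);
    float *h;
    cudaMallocHost(&h, N * sizeof(float));  // pinned host memory, needed
                                            // for truly async copies
    float *d[2];
    cudaStream_t s[2];
    for (int g = 0; g < 2; g++) {           // assumes >= 2 GPUs
        cudaSetDevice(g);
        cudaMalloc(&d[g], bytes);
        cudaStreamCreate(&s[g]);
        // Copy this GPU's half, run the kernel, copy back -- all queued
        // in the same stream, so they stay ordered per device but the
        // two devices overlap with each other.
        cudaMemcpyAsync(d[g], h + g * half, bytes,
                        cudaMemcpyHostToDevice, s[g]);
        scaleKernel<<<(unsigned)((half + 255) / 256), 256, 0, s[g]>>>(d[g], half);
        cudaMemcpyAsync(h + g * half, d[g], bytes,
                        cudaMemcpyDeviceToHost, s[g]);
    }
    for (int g = 0; g < 2; g++) { cudaSetDevice(g); cudaStreamSynchronize(s[g]); }
    return 0;
}
```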


a. Two blocks corresponding to kernels from different processes can simultaneously exist in the same SM: ✅

b. Two blocks corresponding to kernels from different threads can simultaneously exist in the same SM: ✅

c. An L1 cache miss always results in coalesced global memory access: ❌

d. The concept of streams is important for performance only but has nothing to do with correctness of execution: ❌

e. If a thread in a grid has two local variables: int x; and int a[100];

  • Access to x is faster than access to a.

f. In a multi-GPU system, you can have two blocks corresponding to the same kernel but executing on two different GPUs: ❌

g. A computation intensive application with mostly independent computations is for sure a GPU friendly application: ❌

h. In a multi-GPU system, if there is a small piece of data that is needed by all the GPUs: traditional cudaMemcpy() is better than unified memory.

F18.1: GPUs support virtual memory. State, in no more than one sentence each, two pieces of evidence of this.

  • Two blocks assigned to the same SM cannot see each other’s shared memory.

  • The existence of unified virtual address and unified memory.


We have seen that using an array of structures (AoS) in a kernel is not a good idea.

F18.2a: Why?

Because if two threads access two consecutive elements of the array, the accesses cannot be coalesced: the fields they read are separated by the other members of the structure.

F18.2b: How can we fix that?

Use a structure of arrays (SoA) instead: several arrays, one for each structure member.

F18.2c: Can you find a scenario where using an array of structure is indeed a good idea in CUDA kernel execution? [Use your imagination but be brief and no code is necessary].

If structure members consist of large arrays and threads in a warp are accessing those elements.

F18.4: Does each block of the same kernel get the same amount of shared memory? Justify.

Yes. The shared-memory allocation (static or dynamic) is fixed once the kernel is launched, so every block of the kernel receives the same amount.

F18.5: Does each block of the same kernel get the same number of registers? Justify.

In general, yes: the compiler fixes the register usage per thread, and all blocks of the kernel run the same code with the same number of threads.

F18.6: In any CUDA program, we have computations, communication, and memory access. For each one of them, state one technique to deal with it to get the best overall performance.

  • Computation: Maximizing the number of threads in the system to ensure higher utilization.

  • Communication: Hide it by using streams to overlap communication and computation.

  • Memory access: Ensure memory access can be coalesced.
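As a sketch of the memory-access point, here are two CUDA kernels doing the same copy with different access patterns (array names are hypothetical):

```cuda
// Coalesced: thread k of a warp touches element base+k, so a warp's 32
// accesses fall in contiguous memory and merge into few transactions.
__global__ void copyCoalesced(const float *in, float *out, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Anti-pattern: consecutive threads are `stride` elements apart, forcing
// many separate memory transactions per warp.
__global__ void copyStrided(const float *in, float *out,
                            size_t n, size_t stride) {
    size_t i = ((size_t)blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n) out[i] = in[i];
}
```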


a. Having several command queues between the host and a device is needed in OpenCL in order to get as good performance as multiple streams in CUDA. This statement is: ❌

b. Which one of the following is NOT an advantage of streams in CUDA: Allows dynamic parallelism (i.e. kernel calls another kernel).

Advantages of streams in CUDA:

  • Allows parallel kernel execution

  • Overlaps communication and computation

  • Overlaps simultaneous communications

c. An L2 cache miss always results in coalesced memory access: ✅

d. Coalescing is as important for shared memory as it is important for global memory: ✅

e. If a large number of threads needs to update the same data item, which technique do we use to ensure correctness? Atomic instructions.

F17.1: We have seen many issues that can affect the performance of a kernel. State three issues that can affect the performance of a kernel, in no more than one sentence each.

  • Branch divergence

  • Non-coalesced memory access

  • Too many resources per block, reducing the number of blocks executing in parallel per SM.

F17.2: State three useful usages of streams.

  • Execute multiple kernels at the same time

  • Overlap communication and computation

  • Ensure ordered execution for commands in the same stream

F17.5: In class, we have seen tiling as a useful technique in matrix multiplication. State two scenarios where tiling is useful:

  • To allow more usage of shared memory to reduce global memory access

  • To allow dealing with very large matrices that cannot fit in global memory

F17.6: We know that CUDA does not allow synchronization among threads in different blocks. Suppose CUDA allows this. State one potential problem that may arise.

Not all blocks of the grid are executing at the same time. Some are waiting to be scheduled. This can be a cause of deadlock.

F17.7: We have seen many type of memories in the GPU. One of them is the local memory. Given that registers are used on per thread basis, what is the point of having a local memory?

For local variables that do not fit in a register, like arrays, structures, etc.

F16.1: State three reasons why a GPU version of a code can be slower than a sequential code even though the code has data parallelism.

  • Problem size not big enough to produce a lot of parallelism

  • Communication overhead is too high

  • There is not enough data parallelism

F16.3: State one advantage and one disadvantage of warps.

Advantages:

  • Amortize the hardware cost of the instruction fetch and decode stages

  • Simplify scheduling

Disadvantages:

  • Lockstep execution makes branch divergence costly

  • Introduce restrictions on block sizes (ideally a multiple of the warp size)

F16.7a: State advantages of CUDA over OpenCL:

  • More mature → more libraries and APIs

  • Supports GPU to GPU communication

  • Gives the programmer more control

F16.7b: State one advantage of OpenCL over CUDA:

  • Can be used with other accelerators than GPUs.

F15.1: We know in CUDA that commands in a stream (e.g. kernel launch, data movement between host and device, etc.) are executed in order. Why impose this restriction, given that it may lead to some performance loss?

Because commands in a stream usually depend on each other (moving data from device to host needs to wait until the kernel is done, etc.), they must execute in order to ensure correctness. If commands are independent, they can be issued in different streams for parallelism.

F15.2: We have seen that if-else may lead to branch divergence in a warp due to lockstep execution of instructions. Now, suppose there is a thread that has an `if` without `else`. Can this also lead to performance loss in some cases? If yes, explain a scenario where there is performance loss. If no, explain why not. No need to write full code, just explain.

Yes, it can lead to some performance loss if some threads in the warp have a true condition and others do not. In that case, the threads whose condition is false must wait until the others finish the if-part, leading to performance loss. However, it is not as severe as the if-else case.
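A minimal CUDA sketch of such an if-without-else (the kernel name and clamping logic are illustrative):

```cuda
__global__ void clampNegatives(float *x, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f) {
        // If only some lanes of a warp enter here, the remaining lanes
        // sit idle until this path finishes. If no lane (or every lane)
        // of the warp enters, there is no divergence and no loss.
        x[i] = 0.0f;
    }
}
```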


In OpenCL there is only one queue between the host and each device. So we cannot have several queues between the host and device like streams in CUDA.

F15.3a: Does this restrict the performance of OpenCL? Justify.

No, because the command queue can be configured to be out-of-order, enabling some parallelism.

F15.3b: Will we gain any performance in OpenCL if we allow multiple queues between the host and the device? If yes, give a scenario where multiple queues give better performance. If no, explain why not.

No, we will not. In fact, we may lose some performance due to the extra work of managing several command queues instead of one.

F15.4: Beside overlapping data-transfer and computation, state two other scenarios where streams are useful.

  • Executing two different kernels in parallel

  • Streams ensure correctness when commands of a single stream are executed in order

F15.5: State two characteristics of a problem that makes GPU a good candidate instead of CPU

  • Massive data parallelism

  • Compute bound

F15.6: State three reasons you may want to have several kernels instead of one big kernel.

  • If you need to synchronize among threads of different blocks

  • If the different kernels require different geometry

  • If you want smaller kernels that require fewer resources per block, to allow more blocks to execute in parallel on the same SM

F15.7: Suppose NVIDIA decides to have larger warps in their future GPUs. Give advantages and disadvantages of doing so.

Advantages:

  • More opportunities for memory coalescing

  • Potentially more parallelism

Disadvantages:

  • Higher probability of thread divergence

  • More restrictions on the programmer, who must choose the block size to be a multiple of a bigger number


We have discussed a lot the importance of memory coalescing. Also we said that having an L2 cache (servicing all the SMs) helps in global memory coalescing.

F15.8a: How does L2 help with memory coalescing?

On an L2 miss, a whole cache block is fetched from global memory. That fetch is coalesced by construction, since a cache block consists of contiguous bytes of memory.

F15.8b: Does the existence of L2 mean that the programmer does not need to pay attention to global memory access to be coalesced? Explain.

No. The programmer still needs to be aware of the memory access pattern: accessing scattered memory locations means the L2 cache may need to fetch several blocks to satisfy the requests, which is not coalesced.


A block is not assigned to an SM until it has been given all the resources it needs.

F14.1a: What are these resources?

  • registers

  • shared memory

  • slots in SM scheduler

F14.1b: What is the advantage of doing so?

Zero-overhead scheduling: once a block is assigned, it never stalls waiting for resources, so the SM can switch among resident warps with no context-switch cost.

F14.2: What is wrong with this piece of kernel code? How do we deal with it if we need to call __syncthreads() in both the if-body and the else-body (i.e., how do we change the code yet preserve the semantics)?

if (cond) {
 // ...
 __syncthreads();
} else {
 // ...
 __syncthreads();
}

It may result in deadlock if some threads of the block take the if-part and others take the else-part, since each group waits at a barrier the other group never reaches. Solution: move __syncthreads() outside the if-else.
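A sketch of the fixed pattern as a full kernel (the condition and per-branch work are hypothetical):

```cuda
__global__ void stage(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (data[i] < 0.0f) {
            data[i] = -data[i];   // if-body work, no barrier here
        } else {
            data[i] *= 2.0f;      // else-body work, no barrier here
        }
    }
    // One barrier at a point that every thread of the block reaches,
    // regardless of which branch it took.
    __syncthreads();
}
```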

F14.3: Briefly explain why CUDA code optimized for one GPU might run inefficiently when executed on another model of GPU (assuming warp size is the same).

  • Different number and sizes of SMs

  • Different size and architecture of shared memory (e.g. separate or with L1 cache)

  • Existence and non-existence of L2 cache

F14.4: A kernel launch is non-blocking from the host’s perspective. Suppose we have two kernels: kernelA and kernelB. The first kernel produces some results and leaves them in the device’s global memory. The second kernel uses these results to do more calculations. Given the following piece of code at the host:

// ...
kernelA<<<a, b>>>(arg1);
kernelB<<<a, b>>>(arg1);

Does kernelB face a problem of starting execution before the results generated by kernelA are ready? Justify.

No, both are assigned to the default stream and stream commands are executed in order.
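A hedged fragment contrasting the two cases (`kernelA` and `kernelB` as in the question; `kernelC` and `arg2` are hypothetical independent work):

```cuda
// Dependent kernels: the default stream already serializes them,
// so kernelB cannot start before kernelA completes.
kernelA<<<a, b>>>(arg1);
kernelB<<<a, b>>>(arg1);

// Independent kernels: place them in different streams so the
// hardware is free to run them concurrently.
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
kernelA<<<a, b, 0, s1>>>(arg1);
kernelC<<<a, b, 0, s2>>>(arg2);
```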

F14.5: The fact that instructions in a warp are executed in lockstep makes branch divergence a big performance-loss problem. Why, then, do GPU designers insist on this lockstep design?

To amortize the cost of fetching/decoding instructions, and reduce hardware needed for non-computational purposes.

F14.6: Suppose that your code needs to deal with a large amount of data that does not fit in the device memory. Explain briefly what you can do to overcome this problem. Assume that applications are usually NOT embarrassingly parallel (i.e. there may be dependencies that need to be taken into account).

  • Tiling

  • Using streams to move data back-and-forth between CPU and GPU to get new data

  • Make the best use of shared memory

F13.1: As a CUDA programmer, how does knowing about the concept of warps help you, especially that warps are transparent to the programmer?

  • In the kernel invocation, choose a number of threads per block that is a multiple of the warp size. Otherwise, you will not make the best use of the underlying hardware.

  • Knowing about warps helps writing your code in a way to reduce branch-divergence as much as you can.

  • Knowing about warps helps you access the memory in a way to increase the opportunities of memory coalescing.

F13.2: Suppose we have a compute-bound application with enough parallelism. Discuss the pros and cons of each of the following two strategies: (i) more blocks per grid and fewer threads per block; (ii) more threads per block but fewer blocks per grid. Assume the total number of threads is fixed.

(i) More blocks, fewer threads per block

Advantages:

  • Smaller blocks can make better use of SM resources when several blocks are assigned to the same SM.

Disadvantages:

  • Fewer opportunities to synchronize among threads, because synchronization happens among threads of the same block, and here the block size is small.

  • If the block size turns out to be less than the warp size, there is a performance loss.

(ii) Fewer blocks, more threads per block

Advantages:

  • More opportunities for sharing among threads.

Disadvantages:

  • Large blocks may not make the best use of the resources in an SM, because assignment to an SM happens at block granularity.

F13.5: The line of code below checks for a special case to avoid calling an expensive square root. Describe a situation in which it makes sense for CUDA to do that, and a different situation when it makes no sense (meaning it would be faster to do the square root all the time). Assume that 50% of the time d is equal to 1.

if ( d == 1 ) s = 1; else s = sqrt(d);

It makes sense if there is no branch divergence, meaning that for all warps the branch is either always taken or always not taken. If there is branch divergence then the code will run more slowly than code always performing the square root since within a warp the execution time will be the sum of both paths through the if.

S12.4: In the Fermi memory hierarchy we have 64KB that can be configured either as 48KB shared memory and 16KB L1 cache, or as 48KB L1 cache and 16KB shared memory. Indicate when you would use each configuration.

If there is a lot of locality in your code’s data accesses, then a cache is needed more than shared memory, because a cache is very good at exploiting spatial and temporal locality. But if you want full control over what stays close to the SM and do not want it to be evicted, then shared memory is better.

S12.5: Assume we have M total threads, each of which needs some data from the global memory of the GPU. The threads can be grouped into X blocks with Y threads each (i.e. X·Y = M). Keeping the total number of threads fixed, discuss the effect on bandwidth requirements of increasing X (and decreasing Y to keep M fixed) or increasing Y. Assume the GPU can accommodate M threads per SM and only 1 block per SM, and the total number of SMs is M (i.e. a maximum of M·M threads can exist in the whole GPU at the same time). Justify your answer.

On the one hand, more threads per block means more warps, which means more opportunities for coalescing and hence a potential reduction in bandwidth. On the other hand, more blocks means they will be spread over several SMs and hence have access to more L1 caches, which can reduce L2 accesses and hence the bandwidth requirement.
