Final Exam Prep: Concepts
S22.1a: All the threads in a warp share the same front-end (i.e. fetch, decode, etc.). State two reasons why GPU designers decided to do so instead of having a front-end for each thread (i.e. one front-end per SP).
S22.1b: Why did NVIDIA GPU designers decide to group several SPs in an SM?
S22.1c: If the warp concept did not exist, what would be the implications for GPU programming? State three implications and, for each one, explain in 1-2 lines.
S22.1d: Suppose we want to implement vector addition on a multi-GPU system because the two vectors we want to add are huge. Each thread will be responsible for a few hundred elements. Which is more beneficial: the traditional cudaMalloc() and cudaMemcpy(), or unified memory? And why?
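No code is needed to answer S22.1d, but a minimal single-GPU sketch of the two allocation styles being compared may help; the kernel name `vecAdd` and the sizes below are illustrative assumptions, and a multi-GPU version would additionally split the index range across devices.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// One element per thread here; the exam's "a few hundred elements per thread"
// would be expressed as a strided loop inside the kernel.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Style 1: explicit allocation and copies.
    float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes), *h_c = (float*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_a[i] = 1.0f; h_b[i] = 2.0f; }
    float *d_a, *d_b, *d_c;
    cudaMalloc((void**)&d_a, bytes); cudaMalloc((void**)&d_b, bytes); cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    // Style 2: unified memory -- one pointer visible to the host and to every GPU.
    float *u_a, *u_b, *u_c;
    cudaMallocManaged((void**)&u_a, bytes); cudaMallocManaged((void**)&u_b, bytes); cudaMallocManaged((void**)&u_c, bytes);
    for (int i = 0; i < n; i++) { u_a[i] = 1.0f; u_b[i] = 2.0f; }
    vecAdd<<<(n + 255) / 256, 256>>>(u_a, u_b, u_c, n);
    cudaDeviceSynchronize();   // required before the host touches u_c
    printf("%f %f\n", h_c[0], u_c[0]);
    return 0;
}
```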
S22.2c: Can more than one grid exist in the same GPU at the same time? Explain in 1-2 sentences.
S22.2d: Even though registers are much faster than shared memory, it is sometimes more beneficial to put data in shared memory. When does this happen?
F21.2a: If the total number of threads in the whole grid is t, is it always the case that the total number of warps created in the whole grid is ceil(t/32)? Assume warp size = 32 threads. Explain in no more than 2-3 lines.
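A hypothetical concrete case to reason about for F21.2a: a grid of 10 blocks of 48 threads each has t = 480; compare ceil(480/32) = 15 with 10 × ceil(48/32) = 20.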
F21.2b: State two advantages of warps.
F21.2d: Do you think there is any kind of virtual memory in GPUs? Justify your answer in 1-2 lines.
F20.3
F20.3a: Which connection is slower? And why do you think it is slower?
F20.3b: The GPU global memory and the system memory are both built from the same technology, yet they are different. State two main differences between the GPU global memory and the system memory.
F19.1: Can some threads of a warp finish before other threads of that same warp? Explain.
F19.2: If we take a snapshot of a warp during execution, we may not always find 32 threads executing. State three scenarios that may cause this to happen.
F19.3: State one reason you will use unified memory in a multi-GPU system. And state another reason you will use the traditional `cudaMemcpy()` instead.
F19.5:
F18.1: GPUs support virtual memory. State, in no more than one sentence each, two pieces of evidence of this.
F18.2:
F18.2a: Why?
F18.2b: How can we fix that?
F18.2c: Can you find a scenario where using an array of structures is indeed a good idea in CUDA kernel execution? [Use your imagination but be brief; no code is necessary.]
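The code for F18.2 itself is not reproduced in these notes; as a hypothetical sketch of the two layouts its sub-questions contrast (the names `Particle`, `scaleAoS`, and `scaleSoA` are assumptions):

```cuda
// Array of structures (AoS): when each thread of a warp reads p[i].x, the 32
// reads are strided by sizeof(Particle), so they land far apart in global memory.
struct Particle { float x, y, z; };
__global__ void scaleAoS(Particle *p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= 2.0f;
}

// Structure of arrays (SoA): consecutive threads read consecutive floats,
// which the hardware can coalesce into a few memory transactions.
struct ParticlesSoA { float *x, *y, *z; };
__global__ void scaleSoA(ParticlesSoA p, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= 2.0f;
}
```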
F18.4: Does each block of the same kernel get the same amount of shared memory? Justify.
F18.5: Does each block of the same kernel get the same number of registers? Justify.
F18.6: In any CUDA program, we have computations, communication, and memory accesses. For each one of them, state one technique to deal with it to get the best overall performance.
F18.7
F17.1: We have seen many issues that can affect the performance of a kernel. State three of them, in no more than one sentence each.
F17.2: State three useful usages of streams.
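For reference when thinking about F17.2 and the later stream questions, a minimal sketch of one common usage, overlapping transfers with computation; the kernel `process`, the two-way split, and all names are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;   // placeholder work
}

// Split the data into two halves, each handled by its own stream, so the copy
// of one half can overlap with the kernel working on the other half.
// (h should be allocated with cudaMallocHost for the async copies to actually overlap.)
void runWithStreams(float *h, float *d, int n) {
    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);
    int half = n / 2;
    for (int k = 0; k < 2; k++) {
        float *hp = h + k * half;
        float *dp = d + k * half;
        cudaMemcpyAsync(dp, hp, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        process<<<(half + 255) / 256, 256, 0, s[k]>>>(dp, half);
        cudaMemcpyAsync(hp, dp, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaStreamSynchronize(s[0]);
    cudaStreamSynchronize(s[1]);
    cudaStreamDestroy(s[0]);
    cudaStreamDestroy(s[1]);
}
```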
F17.5: In class, we have seen tiling as a useful technique in matrix multiplication. State two scenarios where tiling is useful:
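As a reminder of what F17.5 refers to, a minimal sketch of a classroom-style tiled kernel; TILE = 16 and the assumption that n is a multiple of TILE are mine, not from the exam:

```cuda
#define TILE 16

// Minimal square tiled multiply, C = A * B, all matrices n x n.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];
    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;
    for (int t = 0; t < n / TILE; t++) {
        // Each thread loads one element of the current A tile and one of the B tile.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();                      // the whole tile must be loaded before use
        for (int k = 0; k < TILE; k++)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();                      // finish using the tile before it is overwritten
    }
    C[row * n + col] = acc;
}
```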
F17.6: We know that CUDA does not allow synchronization among threads in different blocks. Suppose CUDA allows this. State one potential problem that may arise.
F17.7: We have seen many types of memory in the GPU. One of them is local memory. Given that registers are used on a per-thread basis, what is the point of having local memory?
F16.1: State three reasons why a GPU version of a code can be slower than a sequential code even though the code has data parallelism.
F16.3: State one advantage and one disadvantage of warps.
F16.7a: State advantages of CUDA over OpenCL:
F16.7b: State one advantage of OpenCL over CUDA:
F15.1: We know in CUDA that commands in a stream (e.g. kernel launch, data movement between host and device, etc.) are executed in order. Why impose this restriction, given that it may lead to some performance loss?
F15.2: We have seen that if-else may lead to branch divergence in a warp due to lockstep execution of instructions. Now, suppose there is a thread that has an `if` without `else`. Can this also lead to performance loss in some cases? If yes, explain a scenario where there is performance loss. If no, explain why not. No need to write full code, just explain.
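A hypothetical kernel of the kind F15.2 describes (the names and the condition are assumptions):

```cuda
// An if with no else: if the condition splits a warp, the threads that skip the
// if-body are masked off and wait, in lockstep, while the others execute it.
__global__ void ifOnly(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && x[i] < 0.0f) {
        x[i] = sqrtf(-x[i]) + 1.0f;   // work taken by only some threads of the warp
    }
    // all threads of the warp continue together from here
}
```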
F15.3:
F15.3a: Does this restrict the performance of OpenCL? Justify.
F15.3b: Will we gain any performance in OpenCL if we allow multiple queues between the host and the device? If yes, give a scenario where multiple queues give better performance. If no, explain why not.
F15.4: Beside overlapping data-transfer and computation, state two other scenarios where streams are useful.
F15.5: State two characteristics of a problem that make a GPU a good candidate instead of a CPU.
F15.6: State three reasons you may want to have several kernels instead of one big kernel.
F15.7: Suppose NVIDIA decides to have larger warps in their future GPUs. Give advantages and disadvantages of doing so.
F15.8:
F15.8a: How does L2 help with memory coalescing?
F15.8b: Does the existence of L2 mean that the programmer does not need to pay attention to making global memory accesses coalesced? Explain.
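For F15.8, a hypothetical pair of access patterns to keep in mind (the kernel names and the stride parameter are assumptions):

```cuda
// Coalesced: consecutive threads of a warp read consecutive floats, so the
// warp's 32 loads can be served by a small number of wide memory transactions.
__global__ void copyCoalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: thread i reads in[i * stride], so the warp's loads are scattered
// across memory and many separate transactions are needed.
__global__ void copyStrided(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n) out[i] = in[i * stride];
}
```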
F14.1:
F14.1a: What are these resources?
F14.1b: What is the advantage of doing so?
F14.2: What is wrong with that piece of kernel code? How can we deal with it if we need to call __syncthreads() in both the if-body and the else-body (i.e. how to change that code yet preserve the semantics)?
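The exam's code for F14.2 is not reproduced in these notes. A hypothetical reconstruction of the pattern it describes, together with one way to restructure it, might look like this (the per-thread work is made up):

```cuda
// Problematic pattern: threads of the same block take different branches, so
// they wait at two different __syncthreads() calls and never all reach the same barrier.
__global__ void divergentSync(float *x) {
    int i = threadIdx.x;
    if (i % 2 == 0) {
        x[i] += 1.0f;
        __syncthreads();
        x[i] *= 2.0f;
    } else {
        x[i] -= 1.0f;
        __syncthreads();
        x[i] *= 3.0f;
    }
}

// One way to preserve the semantics: split the work around a single barrier
// that every thread of the block reaches.
__global__ void restructuredSync(float *x) {
    int i = threadIdx.x;
    if (i % 2 == 0) x[i] += 1.0f; else x[i] -= 1.0f;
    __syncthreads();                       // executed by all threads of the block
    if (i % 2 == 0) x[i] *= 2.0f; else x[i] *= 3.0f;
}
```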
F14.3: Briefly explain why CUDA code optimized for one GPU might run inefficiently when executed on another model of GPU (assuming warp size is the same).
F14.4: A kernel launch is non-blocking from the host perspective. Suppose we have two kernels, kernelA and kernelB. The first kernel produces some results and leaves them in the device's global memory; the second kernel uses these results to do more calculations. Given the following piece of code at the host:
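The host snippet for F14.4 is not reproduced in these notes; a hypothetical fragment of the pattern the question describes (all variable names are assumptions) would be:

```cuda
// Both launches return immediately to the host, but because they are issued
// into the same (default) stream, kernelB does not start until kernelA finishes.
kernelA<<<gridA, blockA>>>(d_data);                        // writes results to global memory
kernelB<<<gridB, blockB>>>(d_data, d_out);                 // reads kernelA's results
cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);   // blocks the host until both are done
```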
F14.5: The fact that instructions in a warp are executed in lockstep makes branch divergence a big performance-loss problem. Then why do GPU designers insist on this lockstep design?
F14.6: Suppose that your code needs to deal with a large amount of data that does not fit in the device memory. Explain briefly what you can do to overcome this problem. Assume that applications are usually NOT embarrassingly parallel (i.e. there may be dependencies that need to be taken into account).
F13.1: As a CUDA programmer, how does knowing about the concept of warps help you, especially given that warps are transparent to the programmer?
F13.2: Suppose we have a compute-bound application with enough parallelism. Discuss the pros and cons of each of the following two strategies: (i) more blocks per grid and fewer threads per block, (ii) more threads per block but fewer blocks per grid. Assume the total number of threads is fixed.
F13.5: The line of code below checks for a special case to avoid calling an expensive square root. Describe a situation in which it makes sense to do that in CUDA, and a different situation in which it makes no sense (meaning it would be faster to do the square root all the time). Assume that 50% of the time d is equal to 1.
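The exact line is not shown in these notes; a plausible form of what F13.5 describes (the names d and r are assumptions) is:

```cuda
float r = (d == 1.0f) ? 1.0f : sqrtf(d);   // skip the sqrt when d is exactly 1
```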
S12.4: In the Fermi memory hierarchy we have 64KB that can be configured as 48KB shared memory and 16KB L1 cache, or as 48KB L1 cache and 16KB shared memory. Indicate when you would use the first configuration and when you would use the second.
S12.5: Assume we have M total threads, each of which needs some data from the global memory of the GPU. Those threads can be grouped into X blocks of Y threads each (i.e. X·Y = M). Keeping the total number of threads fixed, discuss the effect on bandwidth requirements of increasing X (and decreasing Y) or increasing Y (and decreasing X). Assume the GPU can accommodate M threads per SM and only 1 block per SM, and that the total number of SMs is M (i.e. a maximum of M·M threads can exist in the whole GPU at the same time). Justify your answer.