Final Exam Prep: Concepts
- This will save room in the chip for more SPs and hence more potential parallelism.
- This will reduce the cost of the hardware, since we will need fewer front ends, which are expensive.
- All threads are executing the same code. So, it makes sense to reduce the front-end hardware.
- It is useful, in many applications, for threads executing on SPs to exchange data in a fast way and to synchronize.
- It is prohibitively expensive to have a full connection among all the SPs in the chip (we have hundreds of SPs).
- Also, it leads to poor performance if we let the SPs communicate through the global memory.
- Therefore, a middle ground is to group SPs in clusters called SMs and have fast communication among SPs in the same SM.
- The programmer will need to know the number of SPs per SM because the number of threads per block cannot exceed that number.
- Without warps, we would need a front end for each SP. So, there would be no need for the number of threads per block to be a multiple of the warp size.
- There would also be no need to worry about thread divergence.
Unified memory is better here [4], for several reasons [6; two reasons with three points each are enough]:
- Data will not need to move several times among different GPUs, because each thread will be working exclusively on a subset of the elements.
- Programming is easier.
- Since the vectors are huge, there is a chance we run out of memory. With unified memory, you can exceed the capacity of the device's memory and fall back on the system's virtual memory (a minimal sketch follows below).
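A minimal sketch of this approach, assuming a hypothetical vector-add kernel vecAdd and vector length n (names and launch geometry are illustrative, not from the original question):
// Hedged sketch: managed allocations are migrated on demand and may
// oversubscribe device memory (exact behavior depends on the GPU and driver).
float *a, *b, *c;
size_t bytes = n * sizeof(float);
cudaMallocManaged(&a, bytes);
cudaMallocManaged(&b, bytes);
cudaMallocManaged(&c, bytes);
// ... initialize a and b on the host ...
vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);
cudaDeviceSynchronize();   // wait for the kernel so the host can read c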
Yes, it can happen [3]. A grid is a kernel in execution, and there may be more than one kernel (belonging to the same process or different processes) in execution on the GPU at the same time [2].
If the data will be shared among threads of the same block, then using shared memory is better [3], because it reduces trips to global memory and hence may enhance performance [2] (see the sketch below).
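A rough sketch of the idea (the 256-thread block size and all names are illustrative): each thread loads one element into shared memory, and afterwards the whole block can reuse those elements without extra global-memory trips.
__global__ void shareWithinBlock(const float *in, float *out, int n)
{
    __shared__ float tile[256];                      // assumes blockDim.x == 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];                   // one coalesced read per thread
    __syncthreads();                                 // tile is now visible to the whole block
    // ... threads can read each other's elements from tile[] here ...
    if (i < n)
        out[i] = tile[threadIdx.x];
}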
No. For example, 32 threads can be divided into two blocks (hence two warps) instead of being assigned to one block (hence one warp). In general, #warps = #blocks × (#warps per block) = #blocks × ceil((t / #blocks) / 32).
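As a quick worked example: with t = 64 threads, launching 2 blocks of 32 gives #warps = 2 × ceil(32/32) = 2, while launching 4 blocks of 16 gives #warps = 4 × ceil(16/32) = 4, because every block occupies at least one warp even when it has fewer than 32 threads.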
- Reduces the cost of GPU hardware.
- Decouples the number of threads per block from number of SPs per SM.
F21.2c: What is needed for GPUs to be standalone machines and not just co-processors to CPUs?
Nothing, the GPU can be a standalone chip if we write an OS that runs on the GPU. The performance will be bad, though, because an OS is not GPU-friendly code.
Yes, there is. Example: two blocks assigned to the same SM cannot see each other's shared memory. Also, the local memories of different threads are isolated from each other.
The GPU global memory is connected to the device itself (i.e. the GPU) and to the system memory (i.e. the one connected to the CPU).
- The one connecting the global memory to system memory is slower.
- The main reason is that GPU companies have to follow the industry standard (e.g. PCIe) in order for their GPUs to work with CPUs and motherboards from other vendors. Otherwise, GPU vendors could have built faster buses (e.g. NVLink).
- System memory is designed for latency while GPU memory is designed for bandwidth.
- GPU memory is, on average, smaller in size than system memory.
- The bus connecting system memory to the CPU follows the machine architecture (64-bit for the majority of machines nowadays), while GPU memory has much wider buses.
Yes. If, for example, there is an if-else and there is nothing after the else-part, then the threads in the if-part will finish before those in the else-part. This requires branch divergence to occur.
- Number of threads in a block is not multiple of 32
- Branch-divergence
- Some threads in the warp finished earlier than others.
- Use unified memory: easier programming, and since the data are not shared among different GPUs, unified memory will automatically move each piece of data to the GPU that needs it.
- Use traditional cudaMemcpy(): some data may need to be shared among multiple GPUs, so unified memory may move the data back and forth. If done by hand, you can duplicate the data and/or use streams to overlap data movement and computation (see the sketch below).
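A hedged sketch of the hand-managed option, assuming two GPUs, a read-only array that is duplicated on both, and a hypothetical kernel work that processes each GPU's slice of the input (host pointers, sizes, and launch geometry are placeholders):
// Illustrative only: one stream per GPU, shared data duplicated by hand.
cudaStream_t stream[2];
float *d_shared[2], *d_slice[2];
for (int dev = 0; dev < 2; ++dev) {
    cudaSetDevice(dev);
    cudaStreamCreate(&stream[dev]);
    cudaMalloc((void **)&d_shared[dev], sharedBytes);
    cudaMalloc((void **)&d_slice[dev], sliceBytes);
    cudaMemcpyAsync(d_shared[dev], h_shared, sharedBytes,
                    cudaMemcpyHostToDevice, stream[dev]);       // duplicate the shared data
    cudaMemcpyAsync(d_slice[dev], h_in + dev * sliceElems, sliceBytes,
                    cudaMemcpyHostToDevice, stream[dev]);       // each GPU gets its own slice
    work<<<grid, block, 0, stream[dev]>>>(d_shared[dev], d_slice[dev], sliceElems);
}
For the copies to actually overlap with computation, the host buffers would need to be pinned (e.g. allocated with cudaMallocHost).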
a. Two blocks corresponding to kernels from different processes can simultaneously exist in the same SM: ✅
b. Two blocks corresponding to kernels from different threads can simultaneously exist in the same SM: ✅
c. An L1 cache miss always results in coalesced global memory access: ❌
d. The concept of streams is important for performance only but has nothing to do with correctness of execution: ❌
e. If a thread in a grid has two local variables (int x; and int a[100];): Access to x is faster than access to a.
f. In a multi-GPU system, you can have two blocks corresponding to the same kernel but executing on two different GPUs: ❌
g. A computation intensive application with mostly independent computations is for sure a GPU friendly application: ❌
h. In a multi-GPU system, if there is a small piece of data that is needed by all the GPUs: Traditional cudaMemcpy() is better than unified memory.
- Two blocks assigned to the same SM cannot see each other’s shared memory.
- The existence of a unified virtual address space and unified memory.
We have seen that using an array of structures (struct) in a kernel execution is not a good idea, because if two threads access two consecutive elements of the array, the accesses cannot be coalesced: they are separated by the other members of the structure.
To have several arrays, one for each structure member (i.e. a structure of arrays; see the sketch below).
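A minimal sketch of the two layouts (field and kernel names are made up):
// Array of structures: the x fields of consecutive threads are sizeof(Particle)
// bytes apart, so a warp reading them produces a strided, non-coalesced pattern.
struct Particle { float x, y, z; };
__global__ void readAoS(const Particle *p, float *out)
{
    out[threadIdx.x] = p[threadIdx.x].x;
}
// Structure of arrays: consecutive threads read consecutive floats,
// which the hardware can coalesce into a few wide transactions.
__global__ void readSoA(const float *x, float *out)
{
    out[threadIdx.x] = x[threadIdx.x];
}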
If structure members consist of large arrays and threads in a warp are accessing those elements.
Yes, because no additional shared memory can be allocated once the kernel has been launched.
In general yes.
- Computation: Maximizing the number of threads in the system to ensure higher utilization.
- Communication: Hide it by using streams to overlap communication and computation.
- Memory access: Ensure memory access can be coalesced.
a. Having several command queues between the host and a device is needed in OpenCL in order to get as good performance as multiple streams in CUDA. This statement is: ❌
b. Which one of the following is NOT an advantage of streams in CUDA: Allows dynamic parallelism (i.e. kernel calls another kernel).
Advantages of streams in CUDA:
- Allows parallel kernel execution
- Overlaps communication and computation
- Overlaps simultaneous communications (see the sketch below)
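A hedged sketch of the first two points (buffer names, kernels, and launch geometry are placeholders; the host buffers should be pinned for the async copies to truly overlap):
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);
// Work issued to different streams may run concurrently on the device;
// commands inside one stream still execute in order.
cudaMemcpyAsync(d_a, h_a, bytes, cudaMemcpyHostToDevice, s0);
kernelX<<<grid, block, 0, s0>>>(d_a);
cudaMemcpyAsync(d_b, h_b, bytes, cudaMemcpyHostToDevice, s1);   // may overlap with s0's kernel
kernelY<<<grid, block, 0, s1>>>(d_b);                           // may run alongside kernelX
cudaStreamSynchronize(s0);
cudaStreamSynchronize(s1);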
c. An L2 cache miss always results in coalesced memory access: ✅
d. Coalescing is as important for shared memory as it is important for global memory: ✅
e. If a large number of threads need to update the same data item, which technique do we use to ensure correctness? Atomic instructions.
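As a small sketch (the kernel is hypothetical): many threads bumping one global counter must use an atomic read-modify-write, otherwise concurrent updates would be lost.
__global__ void countMatches(const int *data, int n, int target, int *counter)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && data[i] == target)
        atomicAdd(counter, 1);   // the hardware serializes the read-modify-write
}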
- Branch divergence
- Non-coalesced memory access
- Too many resources per block, reducing the number of blocks executing in parallel per SM.
- Execute multiple kernels at the same time
- Overlap communication and computation
- Ensure ordered execution for commands in the same stream
- To allow more usage of shared memory to reduce global memory access
- To allow dealing with very large matrices that cannot fit in global memory
Not all blocks of the grid are executing at the same time; some are waiting to be scheduled. This can cause deadlock: a running block may wait for a block that cannot be scheduled until the running block releases its SM, so neither makes progress.
For local variables that do not fit in a register, like arrays, structures, etc.
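A rough illustration (names are made up): a scalar normally lives in a register, while a per-thread array is typically placed in local memory, which is backed by global memory (and cached).
__global__ void localStorage(float *out, int n)
{
    int x = threadIdx.x;            // scalar: usually kept in a register (fast)
    float a[100];                   // per-thread array: usually placed in local memory (slower)
    for (int k = 0; k < 100; ++k)
        a[k] = x + k;
    if (x < n)
        out[x] = a[x % 100];
}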
- Problem size not big enough to produce a lot of parallelism
- Communication overhead is too high
- There is not enough data parallelism
Advantages:
- Amortize the hardware cost of instruction fetch and decode phases
- Simplifies the scheduling
Disadvantages:
- Lockstep execution leads to the branch-divergence penalty
- Introduces some restrictions on block sizes
- More mature → more libraries and APIs
- Supports GPU to GPU communication
- Gives the programmer more control
- Can be used with accelerators other than GPUs.
Because commands in a stream usually depend on each other (moving data from device to host needs to wait until the kernel is done, etc.), they need to be in order to ensure correctness. If they are independent, they can be issued to different streams for parallelism.
F15.2: We have seen that if-else may lead to branch divergence in a warp due to lockstep execution of instructions. Now, suppose there is a thread that has an `if` without `else`. Can this also lead to performance loss in some cases? If yes, explain a scenario where there is performance loss. If no, explain why not. No need to write full code, just explain.
Yes, it can lead to some performance loss if some threads in the warp have a true condition and others do not. In that case, the threads whose condition is false must wait until the others finish the if-part, leading to performance loss. However, it is not as severe as the if-else case.
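A tiny sketch of such a scenario (the kernel and condition are made up): only some lanes of a warp take the branch, and the remaining lanes sit idle until the if-body finishes.
__global__ void fixNegatives(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n && v[i] < 0.0f)        // typically true for only part of a warp
        v[i] = sqrtf(-v[i]);         // the other lanes wait for this work to finish
}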
In OpenCL there is only one queue between the host and each device. So we cannot have several queues between the host and device like streams in CUDA.
No, because the task queue can be configured to be out-of-order, hence ensuring some parallelism.
No, we will not. Actually we can lose some performance due to the extra work in managing several task queues instead of one.
- Executing two different kernels in parallel
- Streams ensure correctness when commands of a single stream are executed in order
- Massive data parallelism
- Compute bound
- If you need to synchronize among threads of different blocks
- If the different kernels require different geometry
- If you want smaller blocks with fewer resources, to ensure parallelism among blocks in the same SM
Advantages:
- More opportunities for memory coalescing
- Potentially more parallelism
Disadvantages:
- Higher probability of thread divergence
- Puts more restrictions on the programmer, since the block size must now be a multiple of a bigger number
We have discussed a lot the importance of memory coalescing. Also we said that having an L2 cache (servicing all the SMs) helps in global memory coalescing.
In an L2 miss, a cache block is fetched from global memory. A cache block access is coalesced, since it consists of contiguous bytes of memory.
No; the programmer still needs to be aware of the memory access pattern, because accessing scattered memory locations means the L2 cache may need to fetch several blocks to satisfy the requests, which is not coalesced (see the sketch below).
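A minimal illustration (the stride is arbitrary): in the first kernel a warp touches one contiguous region per access, while in the second its addresses are spread over many cache lines, so the L2 has to fetch several blocks.
__global__ void coalescedCopy(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // consecutive threads, consecutive addresses
    if (i < n)
        out[i] = in[i];
}
__global__ void stridedCopy(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;  // addresses scattered across cache lines
    if (i < n)
        out[i] = in[i];
}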
A block is not assigned to an SM before it is given all the resources it needs:
- registers
- shared memory
- slots in SM scheduler
This is what allows zero-time scheduling: no resources need to be acquired while the block is running.
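If you want to see this resource limit in practice, the runtime can report how many blocks of a given kernel fit on one SM; a hedged sketch (myKernel and the 256-thread block size are placeholders):
int blocksPerSM = 0;
// Reports how many blocks of myKernel (256 threads, no dynamic shared memory)
// can be resident on one SM given the kernel's register/shared-memory usage.
cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, 256, 0);
printf("Resident blocks per SM for this kernel: %d\n", blocksPerSM);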
if (cond) {        // cond: whatever condition the kernel branches on
    // ...
    __syncthreads();
} else {
    // ...
    __syncthreads();
}
This may result in deadlock if some threads in the warp go to the if-part and others to the else-part, because the two groups wait at two different barriers. Solution: move __syncthreads() outside the if-else, as in the sketch below.
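A corrected sketch under the same placeholder condition: every thread in the block reaches the same barrier.
if (cond) {
    // ... if-path work ...
} else {
    // ... else-path work ...
}
__syncthreads();   // single barrier reached by all threads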
- Different number and sizes of SMs
- Different size and architecture of shared memory (e.g. separate or with L1 cache)
- Existence and non-existence of L2 cache
F14.4: A kernel launch is non-blocking from the host's perspective. Suppose we have two kernels, kernelA and kernelB: the first kernel produces some results and leaves them in the device's global memory, and the second kernel uses these results to do more calculations. Given the following piece of code at the host:
// ...
kernelA<<<a, b>>>(arg1);
kernelB<<<a, b>>>(arg1);
Does kernelB face a problem of starting execution before the results generated by kernelA are ready? Justify.
No, both are assigned to the default stream and stream commands are executed in order.
To amortize the cost of fetching/decoding instructions, and reduce hardware needed for non-computational purposes.
F14.6: Suppose that your code needs to deal with a large amount of data that does not fit in the device memory. Explain briefly what you can do to overcome this problem. Assume that applications are usually NOT embarrassingly parallel (i.e. there may be dependencies that need to be taken into account).
- Tiling
- Using streams to move data back and forth between CPU and GPU to bring in new chunks (see the skeleton after this list)
- Make the best use of shared memory
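A rough skeleton of the chunking idea (buffer names, sizes, the kernel, and dependency handling are all placeholders): process the data in device-sized pieces and double-buffer with two streams so the copy of the next chunk can overlap with computation on the current one.
// Hypothetical skeleton: data larger than device memory, processed in chunks.
int cur = 0;
for (size_t off = 0; off < total; off += chunk) {
    size_t len = (total - off < chunk) ? (total - off) : chunk;
    cudaMemcpyAsync(d_buf[cur], h_data + off, len * sizeof(float),
                    cudaMemcpyHostToDevice, stream[cur]);
    process<<<grid, block, 0, stream[cur]>>>(d_buf[cur], (int)len);
    cudaMemcpyAsync(h_out + off, d_buf[cur], len * sizeof(float),
                    cudaMemcpyDeviceToHost, stream[cur]);
    cur ^= 1;                      // alternate between two buffers/streams
}
cudaDeviceSynchronize();           // wait for all chunks to finish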
- In the kernel invocation, choose a number of threads per block that is a multiple of the warp size (see the snippet after this list). Otherwise, you will not make the best use of the underlying hardware.
- Knowing about warps helps writing your code in a way to reduce branch-divergence as much as you can.
- Knowing about warps helps you access the memory in a way to increase the opportunities of memory coalescing.
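For instance, a common pattern (kernel and data names are placeholders) is to pick a block size that is a multiple of 32 and round the grid size up so that all n elements are still covered:
int threadsPerBlock = 256;                                   // a multiple of the warp size (32)
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;    // ceiling division to cover all n elements
myKernel<<<blocks, threadsPerBlock>>>(d_data, n);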
(i) More Blocks, Fewer Threads per Block
pros:
- smaller blocks can make better use of SM resources when several blocks are assigned to the same SM
cons:
- fewer opportunities to synchronize among threads, because synchronization is done among threads of the same block and here the block size is small
- if the block size turns out to be smaller than the warp size, there is a performance loss
(ii) Fewer Blocks, More Threads per Block
pros:
- more opportunities for sharing among threads
cons:
- Large blocks may not make the best use of the resources in an SM, because assignment to an SM is done at block granularity.
F13.5: The line of code below checks for a special case to avoid calling an expensive square root. Describe a situation in which it makes sense for CUDA to do that, and a different situation when it makes no sense (meaning it would be faster to do the square root all the time). Assume that 50% of the time d is equal to 1.
if ( d == 1 ) s = 1; else s = sqrt(d);
It makes sense if there is no branch divergence, meaning that for all warps the branch is either always taken or always not taken. If there is branch divergence then the code will run more slowly than code always performing the square root since within a warp the execution time will be the sum of both paths through the if.
If there is a lot of locality in your code regarding data access, then a cache is more useful than shared memory, because a cache is very good at exploiting spatial and temporal locality. But if you want full control over what stays close to the SM and do not want it to be evicted, then shared memory is better.
S12.5: Assume we have M total threads, each of which needs some data from the global memory of the GPU. Those threads can be grouped into X blocks with Y threads each (i.e. XY = M). Keeping the total number of threads fixed, discuss the effect of increasing X (and decreasing Y to keep M fixed) or increasing Y on the bandwidth requirement. Assume the GPU can accommodate M total threads per SM and only 1 block per SM, and the total number of SMs is M (i.e. a maximum of M × M threads can exist in the whole GPU at the same time). Justify your answer.
On one hand, more threads per block means more warps, which means more opportunities for coalescing and hence a potential reduction in bandwidth. On the other hand, more blocks means they will be spread over several SMs and hence will have access to more L1 caches, which can reduce L2 accesses and hence the bandwidth requirement.