CUDA
Allocating Device Memory
cudaMalloc(LOCATION, SIZE)
LOCATION: Address of a pointer variable (a pointer to a pointer); cudaMalloc writes the address of the newly allocated GPU memory into it
SIZE: number of bytes to allocate
De-allocate: cudaFree(POINTER), where POINTER is the device address obtained from cudaMalloc
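The allocate/free pair above can be sketched as follows. This is a minimal example, not a definitive pattern: the helper name allocDeviceFloats and the buffer size are assumptions made for illustration, and error handling is reduced to a single check.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not part of CUDA): allocate n floats on the device.
// Returns nullptr if the allocation fails.
float* allocDeviceFloats(size_t n) {
    float* d_ptr = nullptr;
    // cudaMalloc receives the ADDRESS of the pointer variable and stores
    // the device address of the new allocation in it.
    cudaError_t err = cudaMalloc((void**)&d_ptr, n * sizeof(float));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return nullptr;
    }
    return d_ptr;
}

int main() {
    const size_t n = 1024;  // arbitrary size chosen for the example
    float* d_data = allocDeviceFloats(n);
    if (d_data == nullptr) return 1;
    printf("allocated %zu bytes on the device\n", n * sizeof(float));
    cudaFree(d_data);  // release the device memory when done
    return 0;
}
```

Note that the pointer returned here is a device address: it can be passed to kernels and to the copy function, but it should not be dereferenced in host code.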
Copy Data between Host and Device
cudaMemcpy(DST, SRC, NUM_BYTES, DIRECTION)
DST: An address of the memory to copy into
SRC: An address of the memory to copy from
NUM_BYTES: Number of bytes to copy, e.g. N * sizeof(type) for N elements
DIRECTION:
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
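As a rough illustration of the two directions listed above, the sketch below copies a host array to the device and straight back. The helper name roundTrip and the sample values are assumptions for this example only.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copy src host -> device -> host and return what came back.
// On allocation failure it returns a zero-filled vector.
std::vector<float> roundTrip(const std::vector<float>& src) {
    const size_t numBytes = src.size() * sizeof(float);  // N * sizeof(type)
    std::vector<float> dst(src.size(), 0.0f);

    float* d_buf = nullptr;
    if (cudaMalloc((void**)&d_buf, numBytes) != cudaSuccess) return dst;

    // DST is the device buffer, SRC is host memory.
    cudaMemcpy(d_buf, src.data(), numBytes, cudaMemcpyHostToDevice);
    // DST is host memory, SRC is the device buffer.
    cudaMemcpy(dst.data(), d_buf, numBytes, cudaMemcpyDeviceToHost);

    cudaFree(d_buf);
    return dst;
}

int main() {
    std::vector<float> host = {1.0f, 2.0f, 3.0f, 4.0f};
    std::vector<float> back = roundTrip(host);
    for (float v : back) printf("%.1f ", v);
    printf("\n");
    return 0;
}
```

The argument order follows memcpy: destination first, then source. The DIRECTION value has to match which side each pointer lives on.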
Define the Kernel
Thread Index
In a kernel definition, the built-in variable threadIdx gives each thread its index within its thread block.
It has 3 components: threadIdx.x, threadIdx.y and threadIdx.z.
Block Index
Index of a block: blockIdx.x, blockIdx.y and blockIdx.z.
Indexing within Grid
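A common way to combine the block index and the thread index into one index that is unique across the whole grid is blockIdx.x * blockDim.x + threadIdx.x. Here blockDim is another CUDA built-in (the number of threads per block) that these notes do not otherwise cover. The sketch below shows the 1D case only; the names writeGlobalIndex and globalIndices are hypothetical.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread computes its global 1D index and stores it at that position.
__global__ void writeGlobalIndex(int* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;  // guard: the last block may have surplus threads
}

// Hypothetical host helper: run the kernel for n elements (n > 0 assumed)
// and return the result. Entries stay at -1 if something fails.
std::vector<int> globalIndices(int n, int threadsPerBlock) {
    std::vector<int> host(n, -1);
    int* d_out = nullptr;
    if (cudaMalloc((void**)&d_out, n * sizeof(int)) != cudaSuccess) return host;

    // Round up so that every element is covered by some thread.
    int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    writeGlobalIndex<<<numBlocks, threadsPerBlock>>>(d_out, n);

    cudaMemcpy(host.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return host;
}

int main() {
    std::vector<int> idx = globalIndices(10, 4);  // 3 blocks of 4 threads
    for (int v : idx) printf("%d ", v);
    printf("\n");
    return 0;
}
```

With 10 elements and 4 threads per block, three blocks are launched and the last two threads of the final block fall outside the array, which is why the bounds check is there.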
__syncthreads
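To explicitly synchronize all threads within a thread block (a barrier), call __syncthreads(). Each thread in the block waits at the call until every thread in that block has reached it; threads in other blocks are not affected.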
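One typical place a barrier is needed is when threads of a block exchange data through shared memory (declared with __shared__, a feature these notes do not otherwise cover). The sketch below reverses an array inside a single block; the names reverseBlock, reverseOnDevice and kBlockSize are assumptions for illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kBlockSize = 64;  // hypothetical fixed block size for this sketch

// Reverses kBlockSize elements in place using one thread block.
__global__ void reverseBlock(int* data) {
    __shared__ int tile[kBlockSize];  // visible to all threads in the block
    int t = threadIdx.x;
    tile[t] = data[t];
    // Barrier: no thread reads tile[] until every thread has written its slot.
    __syncthreads();
    data[t] = tile[kBlockSize - 1 - t];
}

// Hypothetical host helper; expects exactly kBlockSize elements.
std::vector<int> reverseOnDevice(const std::vector<int>& in) {
    std::vector<int> out(in);
    if (in.size() != static_cast<size_t>(kBlockSize)) return out;
    const size_t numBytes = kBlockSize * sizeof(int);
    int* d_data = nullptr;
    if (cudaMalloc((void**)&d_data, numBytes) != cudaSuccess) return out;
    cudaMemcpy(d_data, in.data(), numBytes, cudaMemcpyHostToDevice);
    reverseBlock<<<1, kBlockSize>>>(d_data);  // one block of kBlockSize threads
    cudaMemcpy(out.data(), d_data, numBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    return out;
}

int main() {
    std::vector<int> in(kBlockSize);
    for (int i = 0; i < kBlockSize; ++i) in[i] = i;
    std::vector<int> out = reverseOnDevice(in);
    printf("first=%d last=%d\n", out[0], out[kBlockSize - 1]);
    return 0;
}
```

Without the barrier, a thread could read a slot of tile that its neighbour has not written yet. The barrier only covers the threads of one block, so this approach does not extend to reversing data across several blocks.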
Launch the Kernel
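A kernel is launched from host code with the execution configuration syntax kernel<<<NUM_BLOCKS, THREADS_PER_BLOCK>>>(args). The following is a minimal end-to-end sketch using vector addition; the names addKernel and addOnDevice, and the choice of 256 threads per block, are assumptions for this example rather than anything the notes prescribe.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel: one thread per output element.
__global__ void addKernel(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

// Hypothetical host helper: returns a + b computed on the GPU.
// Assumes a and b are non-empty and have the same length.
std::vector<float> addOnDevice(const std::vector<float>& a,
                               const std::vector<float>& b) {
    const int n = static_cast<int>(a.size());
    const size_t numBytes = n * sizeof(float);
    std::vector<float> c(n, 0.0f);

    float *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc((void**)&d_a, numBytes);
    cudaMalloc((void**)&d_b, numBytes);
    cudaMalloc((void**)&d_c, numBytes);
    cudaMemcpy(d_a, a.data(), numBytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), numBytes, cudaMemcpyHostToDevice);

    // Execution configuration: enough blocks to cover all n elements.
    const int threadsPerBlock = 256;
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addKernel<<<numBlocks, threadsPerBlock>>>(d_a, d_b, d_c, n);

    // Launches are asynchronous: check for launch errors, then wait.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
    }
    cudaDeviceSynchronize();

    cudaMemcpy(c.data(), d_c, numBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    return c;
}

int main() {
    std::vector<float> a(1000, 1.5f), b(1000, 2.0f);
    std::vector<float> c = addOnDevice(a, b);
    printf("c[0] = %.1f\n", c[0]);
    return 0;
}
```

The sequence mirrors the steps in these notes: allocate device memory, copy inputs to the device, launch the kernel, copy the result back, and free the device memory.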