CUDA

Allocating Device Memory

cudaMalloc(LOCATION, SIZE)

  • LOCATION: Address of a pointer variable (a pointer to a pointer); cudaMalloc writes the device address of the newly allocated GPU memory into it

  • SIZE: number of bytes to allocate

De-allocate: cudaFree(POINTER), where POINTER is the device address that cudaMalloc returned
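
As a minimal sketch of the allocate/free pair, the program below reserves room for 1024 floats on the device and releases it again. The names d_data, n and numBytes are illustrative choices, not part of the notes, and the error checks are an addition that the notes do not mention.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const int n = 1024;                          // illustrative element count
    const size_t numBytes = n * sizeof(float);   // SIZE is given in bytes

    float *d_data = nullptr;                     // will receive a device address

    // Pass the ADDRESS of the pointer so cudaMalloc can store the result in it.
    cudaError_t err = cudaMalloc((void **)&d_data, numBytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // ... d_data can now be used in copies and kernel launches ...

    // Release the device allocation once it is no longer needed.
    err = cudaFree(d_data);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaFree failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    return 0;
}
```

Note that d_data holds a device address, so the host code should not dereference it directly.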

Copy Data between Host and Device

cudaMemcpy(DST, SRC, NUM_BYTES, DIRECTION)

  • DST: An address of the memory to copy into

  • SRC: An address of the memory to copy from

  • NUM_BYTES: number of bytes to copy, e.g. N * sizeof(type) for N elements

  • DIRECTION:

    • cudaMemcpyHostToDevice

    • cudaMemcpyDeviceToHost
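
The following sketch shows a round trip with the CUDA runtime call cudaMemcpy: a small host array is copied to the device and then back into a second host array. The buffer names (h_src, h_dst, d_buf) and the array size are assumptions made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    constexpr int n = 4;                         // illustrative element count
    const size_t numBytes = n * sizeof(int);     // NUM_BYTES = N * sizeof(type)

    int h_src[n] = {1, 2, 3, 4};                 // host source
    int h_dst[n] = {0, 0, 0, 0};                 // host destination
    int *d_buf = nullptr;                        // device buffer

    if (cudaMalloc((void **)&d_buf, numBytes) != cudaSuccess) return 1;

    // Host -> Device: DST is the device pointer, SRC is the host array.
    cudaError_t err = cudaMemcpy(d_buf, h_src, numBytes, cudaMemcpyHostToDevice);

    // Device -> Host: DST is the host array, SRC is the device pointer.
    if (err == cudaSuccess)
        err = cudaMemcpy(h_dst, d_buf, numBytes, cudaMemcpyDeviceToHost);

    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(err));
    } else {
        for (int i = 0; i < n; ++i) printf("%d ", h_dst[i]);
        printf("\n");
    }

    cudaFree(d_buf);
    return err == cudaSuccess ? 0 : 1;
}
```

The DIRECTION argument has to match the actual locations of DST and SRC; swapping the two constants here would make the copies fail or misbehave.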

Define the Kernel

Thread Index

Inside a kernel definition, the built-in variable threadIdx gives each thread its index within its thread block.

It has three components: threadIdx.x, threadIdx.y and threadIdx.z.

Block Index

The built-in variable blockIdx gives the index of a block within the grid: blockIdx.x, blockIdx.y and blockIdx.z.

Indexing within Grid
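
The notes leave this heading empty. One widely used convention for a one-dimensional grid, given here as a sketch rather than as the author's own method, combines the block index, the block size (the built-in variable blockDim) and the thread index into a single global index. The kernel name writeGlobalIndex and the launch shape of 3 blocks with 4 threads each are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its position in the whole grid and stores it.
__global__ void writeGlobalIndex(int *out, int n) {
    // blocks before this one * threads per block + offset inside this block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {              // guard: the grid may contain more threads than elements
        out[i] = i;
    }
}

int main() {
    constexpr int n = 10;                        // illustrative element count
    const size_t numBytes = n * sizeof(int);
    int h_out[n];
    int *d_out = nullptr;

    if (cudaMalloc((void **)&d_out, numBytes) != cudaSuccess) return 1;

    // 3 blocks of 4 threads = 12 threads; the last 2 fail the bounds check.
    writeGlobalIndex<<<3, 4>>>(d_out, n);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaError_t err = cudaMemcpy(h_out, d_out, numBytes, cudaMemcpyDeviceToHost);
    if (err == cudaSuccess) {
        for (int i = 0; i < n; ++i) printf("%d ", h_out[i]);
        printf("\n");
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_out);
    return err == cudaSuccess ? 0 : 1;
}
```

The same pattern extends to the y and z components when the grid and blocks are two- or three-dimensional.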

__syncthreads

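To explicitly synchronize all threads within a block (a barrier), call __syncthreads(). Every thread in the block waits at the call until all of the block's threads have reached it; threads in other blocks are not affected.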
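
A typical place for this barrier is between writing and reading shared memory. The sketch below reverses a small array inside one block: every thread first stores one element into a shared tile, then reads a different element that another thread wrote. The kernel name reverseBlock and the block size of 8 are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int BLOCK = 8;                 // illustrative block size

// Reverses BLOCK elements in place using a shared-memory tile.
__global__ void reverseBlock(int *data) {
    __shared__ int tile[BLOCK];
    int t = threadIdx.x;

    tile[t] = data[t];                   // each thread writes one slot

    // Barrier: no thread reads the tile until every thread has written its slot.
    __syncthreads();

    data[t] = tile[BLOCK - 1 - t];       // read a slot written by another thread
}

int main() {
    int h[BLOCK];
    for (int i = 0; i < BLOCK; ++i) h[i] = i;

    int *d = nullptr;
    const size_t numBytes = BLOCK * sizeof(int);
    if (cudaMalloc((void **)&d, numBytes) != cudaSuccess) return 1;

    cudaMemcpy(d, h, numBytes, cudaMemcpyHostToDevice);
    reverseBlock<<<1, BLOCK>>>(d);       // a single block of BLOCK threads
    cudaError_t err = cudaMemcpy(h, d, numBytes, cudaMemcpyDeviceToHost);

    if (err == cudaSuccess) {
        for (int i = 0; i < BLOCK; ++i) printf("%d ", h[i]);
        printf("\n");
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d);
    return err == cudaSuccess ? 0 : 1;
}
```

Without the barrier, a thread could read a tile slot before the owning thread has written it. The call should also be reached by every thread of the block, so it is best kept out of branches that only some threads take.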

Launch the Kernel
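
The notes give no details under this heading. As a hedged sketch, a kernel is launched from host code with the execution configuration syntax kernel<<<blocksPerGrid, threadsPerBlock>>>(arguments). The kernel addOne, the size n = 1000 and the choice of 256 threads per block are illustrative assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Adds 1 to every element of x.
__global__ void addOne(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    constexpr int n = 1000;                      // illustrative element count
    const size_t numBytes = n * sizeof(float);

    float *d_x = nullptr;
    if (cudaMalloc((void **)&d_x, numBytes) != cudaSuccess) return 1;
    cudaMemset(d_x, 0, numBytes);                // all-zero bytes, i.e. 0.0f

    // Round up so that blocksPerGrid * threadsPerBlock >= n.
    const int threadsPerBlock = 256;
    const int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;

    addOne<<<blocksPerGrid, threadsPerBlock>>>(d_x, n);

    // A launch returns immediately; check for launch errors, then wait.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    float first = 0.0f;
    if (err == cudaSuccess)
        err = cudaMemcpy(&first, d_x, sizeof(float), cudaMemcpyDeviceToHost);

    if (err == cudaSuccess) {
        printf("x[0] = %f\n", first);
    } else {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
    }

    cudaFree(d_x);
    return err == cudaSuccess ? 0 : 1;
}
```

Rounding the block count up means the last block may contain threads with no element to process, which is why the kernel keeps the i < n check.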
