CUDA C Programming: The Definitive Guide (CUDA C编程权威指南): 1.3 - A Review of CUDA Fundamentals

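These notes mainly collect the knowledge points I started summarizing many years ago (in 2013) while learning CUDA. I have not written CUDA code for a long time, and now that LLM inference calls for it I have to relearn CUDA programming. It seems that whatever you skip eventually has to be paid back.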

1.CUDA arrays
Explanation: CUDA arrays are allocated with cudaMallocArray() or cudaMalloc3DArray() and released with cudaFreeArray().

2.OpenGL/DirectX Interoperability
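Explanation: OpenGL frame buffers and DirectX vertex buffers can be mapped into an address space that CUDA can operate on, so that CUDA can read and write the data they hold. OpenGL interoperability with CUDA mainly consists of registering and unregistering buffer objects, and mapping and unmapping them, as follows:
(1) cudaGLRegisterBufferObject(): registers a buffer object.
(2) cudaGLUnregisterBufferObject(): unregisters a buffer object.
(3) cudaGLMapBufferObject(): maps a buffer object.
(4) cudaGLUnmapBufferObject(): unmaps a buffer object.
Once a buffer object has been mapped with cudaGLMapBufferObject(), CUDA can read from and write to it through the device memory address that the call returns.
Direct3D interoperability with CUDA mainly consists of setting the Direct3D device, registering resources, mapping them, querying information about the mapping, unmapping, and unregistering. Taking Direct3D 9 as an example:
(1) cudaD3D9SetDirect3DDevice(): sets the Direct3D device.
(2) cudaD3D9RegisterResource(): registers a resource.
(3) cudaD3D9MapResources(): maps resources.
(4) cudaD3D9ResourceGetMappedPointer(): gets the CUDA device memory address of a mapped resource.
(5) cudaD3D9ResourceGetMappedSize(): gets the size of a mapped resource.
(6) cudaD3D9ResourceGetMappedPitch(): gets the pitch of a mapped resource.
(7) cudaD3D9UnmapResources(): unmaps resources.
(8) cudaD3D9UnregisterResource(): unregisters a resource.
Note: these are the legacy interoperability functions. Current toolkits mark them as deprecated in favor of the cudaGraphics* family (for example cudaGraphicsGLRegisterBuffer()).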

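3.CUDA software environment
Explanation:
(1) NVIDIA Jetson TK1: a GPU-based embedded development board from NVIDIA.
(2) NVRTC (NVIDIA Runtime Compilation): a runtime compilation library for CUDA C++.
(3) cuSolver: a high-level package built on the cuBLAS and cuSPARSE libraries.
(4) ptxas: the PTX assembler.
(5) cuobjdump: a dump tool for CUDA object files.
(6) nvidia-smi: the NVIDIA System Management Interface.
(7) CUDA Binary Utilities: cuobjdump, nvdisasm, and nvprune.
(8) CUDA-MEMCHECK: a standalone memory-checking utility shipped with the CUDA toolkit.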

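4.cudaGetLastError and cudaGetErrorString
Explanation:
(1) cudaError_t cudaGetLastError( void ): returns the last error produced by a runtime call in the same host thread and resets it to cudaSuccess.
(2) const char* cudaGetErrorString( cudaError_t error ): returns the description string for an error code.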
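
The two calls are usually combined into a small checking helper. The following is a minimal sketch, not code from the original notes: the CHECK macro and the fill kernel are hypothetical names introduced for illustration. It checks an ordinary launch and then provokes an invalid launch configuration to show that the error state is reset after it has been read.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CHECK is a hypothetical helper macro (not part of CUDA): it aborts with the
// text returned by cudaGetErrorString() when a runtime call fails.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,       \
                    cudaGetErrorString(err_));                       \
            exit(EXIT_FAILURE);                                      \
        }                                                            \
    } while (0)

__global__ void fill(int *data, int n, int value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = value;
}

int main()
{
    const int n = 1 << 10;
    int *d = nullptr;
    CHECK(cudaMalloc(&d, n * sizeof(int)));

    fill<<<(n + 255) / 256, 256>>>(d, n, 7);
    CHECK(cudaGetLastError());        // errors detected at launch time
    CHECK(cudaDeviceSynchronize());   // errors raised while the kernel ran
    CHECK(cudaFree(d));

    // Deliberately invalid launch: zero threads per block.
    fill<<<1, 0>>>(nullptr, 0, 0);
    cudaError_t first = cudaGetLastError();    // reads and resets the error
    cudaError_t second = cudaGetLastError();   // state was reset by the call above
    printf("first query:  %s\n", cudaGetErrorString(first));
    printf("second query: %s\n", cudaGetErrorString(second));
    return 0;
}
```

The first query should report a launch-configuration error and the second should report success, because the first call already reset the stored error.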

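5.Zero-copy memory
Explanation: a CUDA kernel can access host memory directly, without copying it to the GPU first, as follows:
(1) Allocate the host memory: cudaHostAlloc((void**)&host_data_to_device, size_in_bytes, cudaHostAllocMapped);
(2) Get the device-side pointer: cudaHostGetDevicePointer(&dev_host_data_to_device, host_data_to_device, 0);
Note: zero-copy memory suits compute-intensive programs that read and write the data only a few times.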
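
The two steps above can be put together as in the sketch below. It is an illustration under assumptions: the scale kernel and the variable names are made up for this example, and the call to cudaSetDeviceFlags(cudaDeviceMapHost) is included because older toolkits require it before mapped allocations.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: every access to data goes to host memory over the bus.
__global__ void scale(float *data, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main()
{
    // Request mapped host memory support before the context is created.
    cudaSetDeviceFlags(cudaDeviceMapHost);

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    if (!prop.canMapHostMemory) {
        printf("device 0 cannot map host memory\n");
        return 0;
    }

    const int n = 1 << 16;
    float *h_data = nullptr;   // host pointer
    float *d_alias = nullptr;  // device-side alias of the same memory

    // Step (1): allocate mapped, page-locked host memory.
    cudaHostAlloc((void **)&h_data, n * sizeof(float), cudaHostAllocMapped);
    for (int i = 0; i < n; ++i) h_data[i] = 1.0f;

    // Step (2): obtain the pointer the kernel should use.
    cudaHostGetDevicePointer((void **)&d_alias, h_data, 0);

    scale<<<(n + 255) / 256, 256>>>(d_alias, n, 2.0f);
    cudaDeviceSynchronize();   // wait so the host sees the kernel's writes

    printf("h_data[0] = %.1f\n", h_data[0]);
    cudaFreeHost(h_data);
    return 0;
}
```

No cudaMemcpy appears anywhere: the kernel works on the host buffer in place, which is why the technique pays off only when each element is touched a few times.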

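6.static functions
Explanation: a static function (internal function) is visible only in the file where it is declared and cannot be called from other files.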

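7.cudaError_t and checkCudaErrors
Explanation: cudaError_t is the error type returned by runtime calls; checkCudaErrors is the checking macro from the CUDA samples helper headers:
(1) #include <helper_cuda.h>
(2) #include <helper_functions.h>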

8.texture<type, dimension, readtype> texreference;[4]
Explanation:
(1) type: int, uchar, float, and so on.
(2) dimension: 1, 2, or 3.
(3) readtype: cudaReadModeNormalizedFloat (normalized) or cudaReadModeElementType (the default).

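9.Binding texture memory[5]
Explanation: texture memory can be bound in two ways. One binds to a one-dimensional array in linear memory, allocated with cudaMalloc() and filled with cudaMemcpy(). The other binds to a two- or three-dimensional CUDA array, allocated with cudaMallocArray and filled with cudaMemcpyToArray.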

10.cudaCreateChannelDesc
Explanation: __host__ cudaChannelFormatDesc cudaCreateChannelDesc ( int x, int y, int z, int w, cudaChannelFormatKind f ): returns a channel descriptor using the specified format.
Note: cudaChannelFormatKind is one of cudaChannelFormatKindSigned, cudaChannelFormatKindUnsigned, or cudaChannelFormatKindFloat.

11.cudaBindTextureToArray
Explanation: __host__ cudaError_t cudaBindTextureToArray ( const textureReference* texref, cudaArray_const_t array, const cudaChannelFormatDesc* desc ): binds an array to a texture.
Note: texref is the texture to bind; array is the memory array on the device; desc is the channel format.

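12.CUDA synchronization functions
Explanation:
(1) cudaDeviceSynchronize(): blocks the CPU thread until the GPU has finished its CUDA work, including kernels and data copies.
(2) cudaThreadSynchronize(): essentially the same as cudaDeviceSynchronize(); it is the deprecated older name.
(3) cudaStreamSynchronize(): takes a stream handle and blocks the CPU until the GPU has finished the CUDA work in that stream, regardless of whether work in other streams has finished.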
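
The difference between the per-stream and the device-wide wait can be sketched as follows. This is an illustrative example only; the add_one kernel and the two stream names are assumptions made for the sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel used to give each stream some work.
__global__ void add_one(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);
    int *d_a = nullptr, *d_b = nullptr;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMemset(d_a, 0, bytes);
    cudaMemset(d_b, 0, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // Independent work in two streams.
    add_one<<<(n + 255) / 256, 256, 0, s0>>>(d_a, n);
    add_one<<<(n + 255) / 256, 256, 0, s1>>>(d_b, n);

    // Waits only for the work queued in s0; s1 may still be running.
    cudaStreamSynchronize(s0);

    // Waits for everything the device has been given, in every stream.
    cudaDeviceSynchronize();

    int a0 = 0, b0 = 0;
    cudaMemcpy(&a0, d_a, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(&b0, d_b, sizeof(int), cudaMemcpyDeviceToHost);
    printf("a[0] = %d, b[0] = %d\n", a0, b0);

    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
    cudaFree(d_a);
    cudaFree(d_b);
    return 0;
}
```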

13.cudaGetLastError
Explanation: __host__ __device__ cudaError_t cudaGetLastError ( void ): returns the last error from a runtime call.

14.The PGM image format
Explanation: PGM stands for Portable Gray Map. It is one of the simplest format standards for grayscale images.

15.CUDA architectures
Explanation: Tesla, Fermi, Kepler, Maxwell, Pascal, and Volta.

16.tex2D
Explanation: accessing texture memory from inside a kernel is called a texture fetch. tex2D() reads data from texture memory.

17.simpleTexture.cu code walkthrough
Explanation:

#include <iostream>
#include <cuda_runtime.h>
#include <helper_image.h>
using namespace std;

const char *imagePath = "./data/lena_bw.pgm";
const char *outputFilename = "./data/lena_bw_out.pgm";
const float angle = 0.5f;

// Texture reference for 2D float texture
texture<float, 2, cudaReadModeElementType> tex;

// @param outputData  output data in global memory
__global__ void transformKernel(float *outputData, int width, int height, float theta)
{
    // calculate normalized texture coordinates
    unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;

    float u = (float)x - (float)width / 2;
    float v = (float)y - (float)height / 2;
    u /= (float)width;
    v /= (float)height;

    float tu = u*cosf(theta) - v*sinf(theta);
    float tv = v*cosf(theta) + u*sinf(theta);

    // read from texture and write to global memory
    outputData[y * width + x] = tex2D(tex, tu + 0.5f, tv + 0.5f);
}

int main(int argc, char **argv)
{
    unsigned int width = 512;
    unsigned int height = 512;
    unsigned int size = width * height * sizeof(float);

    // Allocate device memory for result
    float *dData = NULL;
    cudaMalloc((void **)&dData, size);

    float *hData = NULL;
    sdkLoadPGM(imagePath, &hData, &width, &height);

    // Allocate array and copy image data
    cudaChannelFormatDesc channelDesc = cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray *cuArray;
    cudaMallocArray(&cuArray, &channelDesc, width, height);
    cudaMemcpyToArray(cuArray, 0, 0, hData, size, cudaMemcpyHostToDevice);

    // Set texture parameters
    tex.addressMode[0] = cudaAddressModeWrap;
    tex.addressMode[1] = cudaAddressModeWrap;
    tex.filterMode = cudaFilterModeLinear;
    tex.normalized = true;

    // Bind the array to the texture
    cudaBindTextureToArray(tex, cuArray, channelDesc);

    dim3 dimBlock(8, 8, 1);
    dim3 dimGrid(width / dimBlock.x, height / dimBlock.y, 1);

    // Execute the kernel
    transformKernel<<<dimGrid, dimBlock, 0>>>(dData, width, height, angle);

    // Allocate mem for the result on host side
    float *hOutputData = (float *)malloc(size);
    // copy result from device to host
    cudaMemcpy(hOutputData, dData, size, cudaMemcpyDeviceToHost);
    sdkSavePGM(outputFilename, hOutputData, width, height);

    cudaFree(dData);
    cudaFreeArray(cuArray);
}

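18.texture<Type, Dim, ReadMode> texRef;
Explanation:
(1) Type: the data type.
(2) Dim: the dimensionality of the texture reference.
(3) ReadMode: cudaReadModeNormalizedFloat or cudaReadModeElementType.
Note: a texture reference can only be declared as a static global variable; it cannot be passed as a function argument.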

19.struct cudaChannelFormatDesc { int x, y, z, w; enum cudaChannelFormatKind f;};
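Explanation:
(1) The members x, y, z, and w give the number of bits in each component of a texture element. For example, a texture whose elements are a single float has x equal to 32 and the other members equal to 0.
(2) The member f, of type enum cudaChannelFormatKind, gives the kind of data: signed integer (cudaChannelFormatKindSigned), unsigned integer (cudaChannelFormatKindUnsigned), or floating point (cudaChannelFormatKindFloat).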

20.struct cudaChannelFormatDesc cudaCreateChannelDesc(int x,int y,int z,int w,enum cudaChannelFormatKind f);
Explanation: the function cudaCreateChannelDesc creates a cudaChannelFormatDesc structure.

21.cudaError_t cudaMallocArray(struct cudaArray** array, const struct cudaChannelFormatDesc* desc, size_t width, size_t height);
Explanation: allocates a CUDA array according to the cudaChannelFormatDesc structure desc and returns a handle to the new CUDA array in *array.

22.cudaError_t cudaMemcpyToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t count, enum cudaMemcpyKind kind);
Explanation: copies count bytes from the memory area pointed to by src to the CUDA array dstArray, starting at the upper-left corner (dstX, dstY). kind gives the direction of the copy and is one of cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, or cudaMemcpyDeviceToDevice.

23.struct textureReference {int normalized; enum cudaTextureFilterMode filterMode; enum cudaTextureAddressMode addressMode[2]; struct cudaChannelFormatDesc channelDesc;}[1]
Explanation:
(1) normalized: whether texture coordinates are normalized.
(2) filterMode: the filtering mode, cudaFilterModePoint or cudaFilterModeLinear.
(3) addressMode: the addressing mode, cudaAddressModeClamp or cudaAddressModeWrap.

24.cudaError_t cudaBindTextureToArray(const struct textureReference* texRef, const struct cudaArray* array, const struct cudaChannelFormatDesc* desc);
Explanation: binds the CUDA array array to the texture reference texRef. desc describes how the memory is interpreted during texture fetches. Any memory previously bound to texRef is unbound.

25.Texture operations on CUDA arrays
Explanation:
(1) template<class Type, enum cudaTextureReadMode readMode> Type tex1D(texture<Type, 1, readMode> texRef, float x);
(2) template<class Type, enum cudaTextureReadMode readMode> Type tex2D(texture<Type, 2, readMode> texRef, float x, float y);
Note: tex1D and tex2D fetch, at texture coordinates x and y, from the region of the CUDA array bound to the texture reference texRef.
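
The texture reference API described in these notes is the older interface and is not available in recent toolkits. As a hedged alternative, the sketch below does the same kind of 2D fetch with texture objects (cudaCreateTextureObject), which is a different API from the one the notes cover. The kernel name and the small test image are assumptions made for this example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel: copies every texel to linear memory through tex2D.
__global__ void copy_from_texture(cudaTextureObject_t tex, float *out,
                                  int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        // Unnormalized coordinates; +0.5f addresses the texel center.
        out[y * width + x] = tex2D<float>(tex, x + 0.5f, y + 0.5f);
    }
}

int main()
{
    const int width = 16, height = 8;
    const size_t rowBytes = width * sizeof(float);
    std::vector<float> h_in(width * height), h_out(width * height, 0.0f);
    for (int i = 0; i < width * height; ++i) h_in[i] = (float)i;

    // One 32-bit float channel per element, stored in a CUDA array.
    cudaChannelFormatDesc desc =
        cudaCreateChannelDesc(32, 0, 0, 0, cudaChannelFormatKindFloat);
    cudaArray_t arr = nullptr;
    cudaMallocArray(&arr, &desc, width, height);
    cudaMemcpy2DToArray(arr, 0, 0, h_in.data(), rowBytes, rowBytes, height,
                        cudaMemcpyHostToDevice);

    // Describe the resource (the array) and how it should be sampled.
    cudaResourceDesc res = {};
    res.resType = cudaResourceTypeArray;
    res.res.array.array = arr;

    cudaTextureDesc td = {};
    td.addressMode[0] = cudaAddressModeClamp;
    td.addressMode[1] = cudaAddressModeClamp;
    td.filterMode = cudaFilterModePoint;
    td.readMode = cudaReadModeElementType;
    td.normalizedCoords = 0;

    cudaTextureObject_t tex = 0;
    cudaCreateTextureObject(&tex, &res, &td, nullptr);

    float *d_out = nullptr;
    cudaMalloc(&d_out, rowBytes * height);
    dim3 block(8, 8);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    copy_from_texture<<<grid, block>>>(tex, d_out, width, height);
    cudaMemcpy(h_out.data(), d_out, rowBytes * height, cudaMemcpyDeviceToHost);

    printf("%s\n", h_out == h_in ? "texture fetch matches input" : "mismatch");

    cudaDestroyTextureObject(tex);
    cudaFreeArray(arr);
    cudaFree(d_out);
    return 0;
}
```

Unlike a texture reference, a texture object is an ordinary value that is passed to the kernel as an argument, so it does not need to be a global variable.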

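26.CPU/GPU concurrency
Explanation: CPU/GPU concurrency is the ability of the CPU to keep working after it has issued requests to the GPU. Its most important use is to hide the overhead of the work requested from the GPU.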

27.cudaError_t cudaThreadSynchronize(void);
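Explanation: blocks until all requested work on the device has completed. cudaThreadSynchronize() returns an error if one of the tasks failed.

28.cudaError_t cudaThreadExit(void);
Explanation: cleans up all runtime-related resources associated with the calling host thread. Any later API call reinitializes the runtime.

29.cudaError_t cudaStreamCreate(cudaStream_t* stream);
Explanation: creates a stream.

30.cudaError_t cudaStreamQuery(cudaStream_t stream);
Explanation: returns cudaSuccess if all operations in the stream have completed, and cudaErrorNotReady otherwise.

31.cudaError_t cudaStreamSynchronize(cudaStream_t stream)
Explanation: blocks until the device has completed all operations in the stream.

32.cudaError_t cudaStreamDestroy(cudaStream_t stream);
Explanation: destroys a stream.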

33.cudaError_t cudaEventCreate(cudaEvent_t* event);
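Explanation: creates an event.

34.cudaError_t cudaEventRecord(cudaEvent_t event, cudaStream_t stream);
Explanation: records an event. If stream is non-zero, the event is recorded once all preceding operations in that stream have completed; otherwise it is recorded once all preceding operations in the CUDA context have completed. Because this operation is asynchronous, cudaEventQuery and/or cudaEventSynchronize must be used to find out when the event has actually been recorded.

35.cudaError_t cudaEventQuery(cudaEvent_t event);
Explanation: returns cudaSuccess if the event has actually been recorded, and cudaErrorNotReady otherwise.

36.cudaError_t cudaEventSynchronize(cudaEvent_t event);
Explanation: blocks until the event has actually been recorded. If cudaEventRecord() has not been called on this event, the function returns cudaErrorInvalidValue.

37.cudaError_t cudaEventDestroy(cudaEvent_t event);
Explanation: destroys an event.

38.cudaError_t cudaEventElapsedTime(float* time, cudaEvent_t start, cudaEvent_t end);
Explanation: computes the time elapsed between two events, in milliseconds. If either event has not been recorded, the function returns cudaErrorInvalidValue.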
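
The event functions above are most often used together to time GPU work. The following is a minimal sketch of that pattern; the saxpy kernel and the problem size are assumptions chosen for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical workload to be timed.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main()
{
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);
    float *x = nullptr, *y = nullptr;
    cudaMalloc(&x, bytes);
    cudaMalloc(&y, bytes);
    cudaMemset(x, 0, bytes);
    cudaMemset(y, 0, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Both events go into stream 0, bracketing the kernel launch.
    cudaEventRecord(start, 0);
    saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
    cudaEventRecord(stop, 0);

    // Recording is asynchronous, so wait until stop has actually been recorded.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```

Replacing cudaEventSynchronize(stop) with a loop on cudaEventQuery(stop) lets the CPU do other work while it waits.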

39.cudaError_t cudaGetDeviceCount(int* count);
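Explanation: returns in *count the number of devices with compute capability greater than or equal to 1.0.

40.cudaError_t cudaSetDevice(int dev);
Explanation: sets dev as the device on which the active host thread executes device code.

41.cudaError_t cudaGetDevice(int* dev);
Explanation: returns in *dev the device on which the active host thread executes device code.

42.cudaError_t cudaGetDeviceProperties(struct cudaDeviceProp* prop, int dev);
Explanation: returns the properties of device dev in *prop.

43.cudaError_t cudaChooseDevice(int* dev, const struct cudaDeviceProp* prop);
Explanation: returns in *dev the device whose properties best match *prop.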
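
The device management calls above fit together as in the sketch below. It is illustrative only: the requested compute capability of 3.0 is an arbitrary assumption, and the printed fields are just a few of those available in cudaDeviceProp.

```cuda
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess || count == 0) {
        printf("no usable CUDA device (%s)\n", cudaGetErrorString(err));
        return 0;
    }

    // List every device with a few of its properties.
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("device %d: %s, compute capability %d.%d, %zu MiB global memory\n",
               dev, prop.name, prop.major, prop.minor,
               prop.totalGlobalMem >> 20);
    }

    // Ask the runtime for the device closest to a wish list.
    cudaDeviceProp wanted;
    memset(&wanted, 0, sizeof(wanted));
    wanted.major = 3;   // assumed requirement, for illustration only
    wanted.minor = 0;
    int best = 0;
    cudaChooseDevice(&best, &wanted);

    // Make it the current device for this host thread and read it back.
    cudaSetDevice(best);
    int current = -1;
    cudaGetDevice(&current);
    printf("chosen device: %d, current device: %d\n", best, current);
    return 0;
}
```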

44.CUDA Runtime API
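Explanation:
(1) The low-level API (cuda_runtime_api.h) is a C-style interface and does not need to be compiled with nvcc.
(2) The high-level API (cuda_runtime.h) is a C++-style interface built on top of the low-level API. It can be used directly from C++ code and compiled by any C++ compiler. The high-level API also contains some CUDA-specific wrappers, and those do need nvcc.
Note: the CUDA Runtime API and the CUDA Driver API provide functions for device management, thread management, stream management, event management, memory management, texture reference management, execution control, OpenGL interoperability, Direct3D interoperability, error handling, and so on.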

45.__noinline__
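Explanation: by default, a __device__ function is always inlined (this reflects the early toolkits these notes date from; on later architectures the compiler decides). The __noinline__ qualifier serves as a hint that a function should not be inlined.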

46.#pragma unroll
Explanation: by default the compiler unrolls small loops. #pragma unroll lets you specify how many times a loop should be unrolled.
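
Both hints from the last two items can be seen in one small kernel. This is a sketch under assumptions: the function names, the constant TAPS, and the use of managed memory are choices made for the example, and the compiler remains free to ignore either hint.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TAPS = 4;   // trip count known at compile time

// __noinline__ asks the compiler to keep this as a real function call.
__device__ __noinline__ float mul(float x, float y)
{
    return x * y;
}

// Each thread computes the dot product of one TAPS-element row.
__global__ void dot_rows(const float *a, const float *b, float *out, int rows)
{
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;

    float sum = 0.0f;
    // Without a number, #pragma unroll requests full unrolling of the loop;
    // "#pragma unroll 2" would request an unroll factor of 2 instead.
#pragma unroll
    for (int k = 0; k < TAPS; ++k) {
        sum += mul(a[r * TAPS + k], b[r * TAPS + k]);
    }
    out[r] = sum;
}

int main()
{
    const int rows = 256;
    float *a = nullptr, *b = nullptr, *out = nullptr;
    cudaMallocManaged(&a, rows * TAPS * sizeof(float));
    cudaMallocManaged(&b, rows * TAPS * sizeof(float));
    cudaMallocManaged(&out, rows * sizeof(float));
    for (int i = 0; i < rows * TAPS; ++i) {
        a[i] = 1.0f;
        b[i] = 2.0f;
    }

    dot_rows<<<(rows + 127) / 128, 128>>>(a, b, out, rows);
    cudaDeviceSynchronize();   // required before the host reads managed memory

    printf("out[0] = %.1f (expected %.1f)\n", out[0], 2.0f * TAPS);

    cudaFree(a);
    cudaFree(b);
    cudaFree(out);
    return 0;
}
```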

47.CUDA built-in vector types
Explanation: char1, uchar1, char2, uchar2, char3, uchar3, char4, uchar4, short1, ushort1, short2, ushort2, short3, ushort3, short4, ushort4, int1, uint1, int2, uint2, int3, uint3, int4, uint4, long1, ulong1, long2, ulong2, long3, ulong3, long4, ulong4, float1, float2, float3, float4.
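
A short, hedged example of using one of these types: components are accessed as .x, .y, .z, and .w, and values are built with the matching make_<type>() function. The scale4 kernel is a made-up name for this sketch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: scales every component of each float4 element.
__global__ void scale4(float4 *v, int n, float s)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float4 p = v[i];   // one 16-byte load
        v[i] = make_float4(p.x * s, p.y * s, p.z * s, p.w * s);
    }
}

int main()
{
    const int n = 1024;
    float4 *v = nullptr;
    cudaMallocManaged(&v, n * sizeof(float4));
    for (int i = 0; i < n; ++i) v[i] = make_float4(1.0f, 2.0f, 3.0f, 4.0f);

    scale4<<<(n + 255) / 256, 256>>>(v, n, 2.0f);
    cudaDeviceSynchronize();   // required before the host reads managed memory

    printf("v[0] = (%.1f, %.1f, %.1f, %.1f)\n", v[0].x, v[0].y, v[0].z, v[0].w);
    cudaFree(v);
    return 0;
}
```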

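48.CUDA type conversion functions
Explanation:
(1) int __float2int_[rn,rz,ru,rd](float);: converts a floating-point argument to an integer using the specified rounding mode.
(2) unsigned int __float2uint_[rn,rz,ru,rd](float);: converts a floating-point argument to an unsigned integer using the specified rounding mode.
(3) float __int2float_[rn,rz,ru,rd](int);: converts an integer argument to floating point using the specified rounding mode.
(4) float __uint2float_[rn,rz,ru,rd](unsigned int);: converts an unsigned integer argument to floating point using the specified rounding mode.
Note: rn rounds to the nearest even value, rz rounds toward zero, ru rounds up (toward positive infinity), and rd rounds down (toward negative infinity).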
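
The four rounding modes are easiest to tell apart on a value that lies exactly halfway between two integers. The sketch below is illustrative; the kernel name and the test values are assumptions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: converts x with each of the four rounding modes.
__global__ void convert(float x, int *out)
{
    out[0] = __float2int_rn(x);   // nearest, ties to even
    out[1] = __float2int_rz(x);   // toward zero
    out[2] = __float2int_ru(x);   // toward positive infinity
    out[3] = __float2int_rd(x);   // toward negative infinity
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, 4 * sizeof(int));

    const float inputs[2] = {2.5f, -2.5f};
    for (int k = 0; k < 2; ++k) {
        int h_out[4] = {0, 0, 0, 0};
        convert<<<1, 1>>>(inputs[k], d_out);
        cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
        printf("x = %5.1f  rn = %2d  rz = %2d  ru = %2d  rd = %2d\n",
               inputs[k], h_out[0], h_out[1], h_out[2], h_out[3]);
    }

    cudaFree(d_out);
    return 0;
}
```

For 2.5f the four modes give 2, 2, 3, and 2; for -2.5f they give -2, -2, -2, and -3.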

49.asyncAPI.cu code walkthrough
Explanation:

#include <stdio.h>
#include <cuda_runtime.h>
#include <helper_cuda.h>
#include <helper_functions.h>

__global__ void increment_kernel(int *g_data, int inc_value)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    g_data[idx] = g_data[idx] + inc_value;
}

int main(int argc, char *argv[])
{
    int n = 16 * 1024 * 1024;
    int nbytes = n * sizeof(int);
    int value = 26;

    // allocate host memory
    int *a = 0;
    cudaMallocHost((void **)&a, nbytes);
    memset(a, 0, nbytes);

    // allocate device memory
    int *d_a = 0;
    cudaMalloc((void **)&d_a, nbytes);
    cudaMemset(d_a, 255, nbytes);

    // set kernel launch configuration
    dim3 threads = dim3(512, 1);
    dim3 blocks = dim3(n / threads.x, 1);

    // create cuda event handles
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // define time
    StopWatchInterface *timer = NULL;
    sdkCreateTimer(&timer);
    sdkResetTimer(&timer);

    cudaDeviceSynchronize();
    float gpu_time = 0.0f;

    // asynchronously issue work to the GPU (all to stream 0)
    sdkStartTimer(&timer);
    cudaEventRecord(start, 0);
    cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0);
    increment_kernel<<<blocks, threads, 0, 0>>>(d_a, value);
    cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0);
    cudaEventRecord(stop, 0);
    sdkStopTimer(&timer);

    // have CPU do some work while waiting for stage 1 to finish
    unsigned long int counter = 0;
    while (cudaEventQuery(stop) == cudaErrorNotReady)
    {
        counter++;
    }

    cudaEventElapsedTime(&gpu_time, start, stop);

    // print the cpu and gpu times
    printf("time spent executing by the GPU: %.2f\n", gpu_time);
    printf("time spent by CPU in CUDA calls: %.2f\n", sdkGetTimerValue(&timer));
    printf("CPU executed %lu iterations while waiting for GPU to finish\n", counter);

    // release resources
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFreeHost(a);
    cudaFree(d_a);

    cudaDeviceReset();
}

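Explanation:
(1) The event management API is mainly used to record GPU state, so that the CPU can query a CUDA event to find out whether the GPU has finished.
(2) Functions that commonly execute asynchronously (between host and device) include kernel launches; memory copy functions with the Async suffix; device-to-device memory copies; and memory initialization functions such as cudaMemset(), cudaMemset2D(), and cudaMemset3D().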

50.Creating and initializing streams
Explanation:

cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
for (int i = 0; i < nstreams; i++)
{
    checkCudaErrors(cudaStreamCreate(&(streams[i])));
}

51.simpleStreams.cu code walkthrough
Explanation:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <cuda_runtime.h>

__global__ void init_array(int *g_data, int *factor, int num_iterations)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    for (int i = 0; i < num_iterations; i++)
    {
        g_data[idx] += *factor;
    }
}

int main(int argc, char **argv)
{
    int nstreams = 4;
    int nreps = 10;
    int n = 16 * 1024 * 1024;
    int nbytes = n * sizeof(int);
    dim3 threads, blocks;
    float elapsed_time, time_memcpy, time_kernel;
    int niterations = 5;

    // allocate host memory
    int c = 5;
    int *h_a = 0;
    cudaMallocHost((void**)&h_a, nbytes);

    // allocate device memory
    int *d_a = 0, *d_c = 0;
    cudaMalloc((void **)&d_a, nbytes);
    cudaMalloc((void **)&d_c, sizeof(int));
    cudaMemcpy(d_c, &c, sizeof(int), cudaMemcpyHostToDevice);

    // allocate and initialize an array of stream handles
    cudaStream_t *streams = (cudaStream_t *)malloc(nstreams * sizeof(cudaStream_t));
    for (int i = 0; i < nstreams; i++)
    {
        cudaStreamCreate(&(streams[i]));
    }

    // create CUDA event handles
    cudaEvent_t start_event, stop_event;
    cudaEventCreate(&start_event);
    cudaEventCreate(&stop_event);

    // time memcopy from device
    cudaEventRecord(start_event, 0);
    cudaMemcpyAsync(h_a, d_a, nbytes, cudaMemcpyDeviceToHost, streams[0]);
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    cudaEventElapsedTime(&time_memcpy, start_event, stop_event);
    printf("memcopy:\t%.2f\n", time_memcpy);

    // time kernel
    threads = dim3(512, 1);
    blocks = dim3(n / threads.x, 1);
    cudaEventRecord(start_event, 0);
    init_array<<<blocks, threads, 0, streams[0]>>>(d_a, d_c, niterations);
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    cudaEventElapsedTime(&time_kernel, start_event, stop_event);
    printf("kernel:\t\t%.2f\n", time_kernel);

    // time non-streamed execution for reference
    threads = dim3(512, 1);
    blocks = dim3(n / threads.x, 1);
    cudaEventRecord(start_event, 0);
    for (int k = 0; k < nreps; k++)
    {
        init_array<<<blocks, threads>>>(d_a, d_c, niterations);
        cudaMemcpy(h_a, d_a, nbytes, cudaMemcpyDeviceToHost);
    }
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    cudaEventElapsedTime(&elapsed_time, start_event, stop_event);
    printf("non-streamed:\t%.2f\n", elapsed_time / nreps);

    // time execution with nstreams streams
    threads = dim3(512, 1);
    blocks = dim3(n / (nstreams*threads.x), 1);
    memset(h_a, 255, nbytes);
    cudaMemset(d_a, 0, nbytes);
    cudaEventRecord(start_event, 0);
    for (int k = 0; k < nreps; k++)
    {
        // asynchronously launch nstreams kernels
        for (int i = 0; i < nstreams; i++)
        {
            init_array<<<blocks, threads, 0, streams[i]>>>(d_a + i * n / nstreams, d_c, niterations);
        }
        // asynchronously issue nstreams memcopies
        for (int i = 0; i < nstreams; i++)
        {
            cudaMemcpyAsync(h_a + i * n / nstreams, d_a + i * n / nstreams, nbytes / nstreams, cudaMemcpyDeviceToHost, streams[i]);
        }
    }
    cudaEventRecord(stop_event, 0);
    cudaEventSynchronize(stop_event);
    cudaEventElapsedTime(&elapsed_time, start_event, stop_event);
    printf("%d streams:\t%.2f\n", nstreams, elapsed_time / nreps);

    // release resources
    for (int i = 0; i < nstreams; i++)
    {
        cudaStreamDestroy(streams[i]);
    }
    free(streams);
    cudaEventDestroy(start_event);
    cudaEventDestroy(stop_event);
    cudaFreeHost(h_a);   // h_a came from cudaMallocHost, so it is released with cudaFreeHost
    cudaFree(d_a);
    cudaFree(d_c);

    cudaDeviceReset();
}

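Explanation: simpleStreams.cu creates the streams and events and then times, in turn, a memory copy, a kernel launched into a stream, kernel execution without streams, and the whole workload split across nstreams streams. To make the timings more accurate, each measurement runs nreps times and the average is reported.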

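52.Streams in CUDA
Explanation: within a given stream, operations run in order, but operations in different streams may run out of order relative to one another and may also run in parallel. A stream is defined by creating a cudaStream_t object and passing it as an argument when launching kernels and copying memory. Operations given the same argument belong to the same stream, and operations given different arguments belong to different streams.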

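53.Tegra
Explanation: Tegra is a brand of general-purpose processors based on the ARM architecture (that is, CPUs; NVIDIA calls it a "Computer on a chip"). It delivers high performance at low power for portable devices.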

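54.Installing CUDA 10.1 on Ubuntu 16.04
Explanation:
(1) Disable the nouveau driver

lsmod | grep nouveau
sudo vim /etc/modprobe.d/blacklist.conf
# add the following two lines to blacklist.conf:
blacklist nouveau
options nouveau modeset=0
sudo update-initramfs -u
sudo reboot
# after the reboot this should print nothing
lsmod | grep nouveau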

(2) Run the runfile installer from a text console to install CUDA

sudo service lightdm stop
sudo sh cuda_10.1.168_418.67_linux.run --no-opengl-libs
# presumably listed for reference: these two commands uninstall the toolkit and the driver
sudo /usr/local/cuda-10.1/bin/cuda-uninstaller
sudo /usr/bin/nvidia-uninstall
ls /dev/nvidia*

(3) Set the environment variables in /etc/profile

sudo vim /etc/profile
# append the following two lines to /etc/profile:
export PATH=/usr/local/cuda-10.1/bin${PATH:+:${PATH}}
export LD_LIBRARY_PATH=/usr/local/cuda-10.1/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}
cat /proc/driver/nvidia/version
nvcc -V

(4) Build the samples shipped with CUDA

cd /home/xxx/NVIDIA_CUDA-10.1_Samples
make
cd /home/xxx/NVIDIA_CUDA-10.1_Samples/bin/x86_64/linux/release
./deviceQuery
./bandwidthTest

(5) Install cuDNN

sudo cp cuda/include/cudnn.h /usr/local/cuda/include/
sudo cp cuda/lib64/libcudnn* /usr/local/cuda/lib64/

(6) Check that cuDNN was installed successfully

~/Downloads$ cat /usr/local/cuda/include/cudnn.h | grep CUDNN_MAJOR -A 2
#define CUDNN_MAJOR 7
#define CUDNN_MINOR 6
#define CUDNN_PATCHLEVEL 4
--
#define CUDNN_VERSION (CUDNN_MAJOR * 1000 + CUDNN_MINOR * 100 + CUDNN_PATCHLEVEL)

#include "driver_types.h"

Note: at this point CUDA and cuDNN are both installed.

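55.CMakeLists.txt
Explanation:
(1) MESSAGE(STATUS "Project: SERVER"): prints a message.
(2) SET(CMAKE_BUILD_TYPE DEBUG): sets the build type, debug or release.
(3) SET(CMAKE_C_FLAGS_DEBUG "-g -Wall"): sets the compiler flags.
(4) CMAKE_C_FLAGS_DEBUG: flags for the C compiler in debug builds.
(5) CMAKE_CXX_FLAGS_DEBUG: flags for the C++ compiler in debug builds.
(6) -g: tells the compiler to generate debugging information.
(7) -Wall: enables all warnings.
(8) ADD_SUBDIRECTORY(): adds a subdirectory to the build.
(9) SET(*.cpp): sets a variable that lists all the source files.
(10) INCLUDE_DIRECTORIES: directories containing the required header files.
(11) LINK_DIRECTORIES: directories containing the required library files.
(12) ADD_LIBRARY: builds a library (a static library by default).
(13) TARGET_LINK_LIBRARIES: the libraries a target depends on.
(14) SET_TARGET_PROPERTIES: sets target properties, such as the output path of the generated executable.
(15) add_executable: declares an executable target to build.
(16) find_package(): locates a dependency package.
(17) FILE(GLOB EXTENDED doc/**): collects all files under the doc directory.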

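56.CUDA API
Explanation: there are two APIs, the Runtime API and the Driver API, and each has its own range of use.

57.PTX code
Explanation: Parallel Thread eXecution (PTX) code is an intermediate form of compiled GPU code that can be compiled again into native GPU microcode.

58.Clang
Explanation: Clang is a lightweight compiler for C, C++, and Objective-C. Its source code is released under a BSD-style license. Clang will support generic lambda expressions, simplified handling of return types, and better handling of the constexpr keyword.

59.clang-tidy
Explanation: clang-tidy is a static code analysis framework based on clang that supports C++, C, and Objective-C.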

References:
[1] CUDA Programming: http://www.cnblogs.com/stewart/archive/2013/01/05/2846860.html
[2] Classic Reference Books for Learning DirectX: http://blog.csdn.net/kuangfengwu/article/details/7344009
[3] Advanced Applications of Digital Image Processing: Implementations Based on MATLAB and CUDA
[4] Texture Memory in CUDA: http://blog.csdn.net/qq_25716575/article/details/52444686
[5] Using CUDA Textures: http://preston2006.blog.sohu.com/253531751.html
[6] Characteristics and Use of CUDA Texture Memory: http://blog.csdn.net/darkstorm2111203/article/details/4294012
[7] CUDA Compute Capability Lookup Table: https://blog.csdn.net/allyli0022/article/details/54628987