主要整理了N多年前(2013年)学习CUDA的时候开始总结的知识点,好长时间不写CUDA代码了,现在LLM推理需要重新学习CUDA编程,看来出来混迟早要还的。
1.闭扫描和开扫描
对于一个二元运算符和一个元输入数组。如果返回输出数组为,那么是闭扫描;如果返回输出数组为,那么是开扫描。串行闭扫描算法,如下所示:
/** * x: input array * y: output array */void sequential_scan(float* x, float* y, int Max_i) { y[0] = x[0]; for (int i=1; i<Max_i; i++) { y[i] = y[i-1] + x[i]; }}
复制
说明:在闭扫描的输出和开扫描的输出间互相转换还是比较简单的,只需要移一次位并填上一个元素即可。
(1)从闭扫描转换到开扫描,只需把所有元素右移,0号元素填0值。
(2)从开扫描转换到闭扫描,需要把所有元素向左移动一位,最后一个元素填充原来最后一个元素和输入数组的最后一个元素之和。
说明:假设输入数组为[3, 1, 7, 0, 4, 1, 6, 3],闭操作输出数组为[3, 4, 11, 11, 15, 16, 22, 25],开操作输出数组为[0, 3, 4, 11, 11, 15, 16, 22],可以验证。
2.简单并行扫描
在实践中,并行扫描常常作为一些并行算法的原始操作,比如基数排序、快速排序、字符串比较、多项式求值、递归求解、树操作和直方图等。
解析:
(1)__syncthreads()
确保所有线程在开始下一次迭代之前完成归约树中当前这次迭代的加法。
(2)inclusive scan表示闭扫描部分,而exclusive scan表示开扫描部分。
说明:除了简单并行扫描外,还有工作高效的并行扫描,任意输入长度的并行扫描。
3.Thrust与CUDA的互操作性
解析:Thrust与CUDA的互操作性有利于迭代开发策略,比如使用Thrust库快速开发出并行应用的原型,确定程序瓶颈,使用CUDA C实现特定算法并作必要优化。When a Thrust function is called, it inspects the type of the iterator to determine whether to use a host or a device implementation. This process is known as static dispatching since the host/device dispatch is resolved at compile time. Note that this implies that there is no runtime overhead to the dispatch process.
(1)Thrust到CUDA的互操作性
size_t N = 1024;device_vector<int> d_vec(N);int raw_ptr = raw_pointer_cast(&d_vec[0]);cudaMemset(raw_ptr, 0, N*sizeof(int));my_kernel << <N / 128, 128 >> >(N, raw_ptr);
复制
说明:通过raw_pointer_cast()将设备地址转换为原始C指针,原始C指针可以调用CUDA C API函数,或者作为参数传递到CUDA C kernel函数中。
(2)CUDA到Thrust的互操作性
size_t N = 1024;int raw_ptr;cudaMalloc(&raw_ptr, N*sizeof(int));device_ptr<int> dev_ptr = device_pointer_cast(raw_ptr);sort(dev_ptr, dev_ptr+N);dev_ptr[0] = 1;cudaFree(raw_ptr);
复制
说明:通过device_pointer_cast()将原始C指针转换为设备向量的设备指针,以便访问Thrust库中的算法。
4.GPU,SM,SP与Grid,Block,Thread之间的映射关系
解析:GPU的任务分配单元将Grid分配到GPU芯片上。任务分配单元使用轮询策略将Block分配到SM上,决定能否分配的因素包括每个Block使用的共享存储器数量,每个Block使用的寄存器数量,以及其它的一些限制条件。SM中的线程调度单元又将分配到的Block进行细分,将其中的线程组织成线程束(Warp),Block中的每一个Thread被发射到一个SP上。一个SM可以同时处理多个Block,比如现在有16个SM、64个Block、每个SM可以同时处理3个Block,那么设备刚开始的时候就会同时处理48个Block,剩下的16个Block等待SM。一个SM一次只会执行一个Block中的一个Warp,但是SM遇到正在执行的Warp需要等待的时候(比如存取Global Memory等),就切换到别的Warp继续做运算。
5.固定内存(pinned memory)
解析:malloc()分配的是可分页的主机内存,而cudaHostAlloc()分配的是页锁定的主机内存,也称固定内存(pinned memory),它的一个重要特点是操作系统不会对这块内存分页并交换到磁盘上,从而保证了这块内存不会被破坏或者重新定位。
6.CUDA 7.5和cuDNN 5.0安装
解析:
(1)解压缩会生成cuda/include、cuda/lib、cuda/bin三个目录;
(2)分别将cuda/include、cuda/lib、cuda/bin三个目录中的内容拷贝到C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v7.5对应的include、lib、bin目录下。
说明:CUDA 8.0对应的cuDNN 5.0和CUDA 7.5对应的cuDNN 5.0是不一样的。
7.NVIDIA Deep Learning SDK
解析:
(1)Deep Learning Primitives (cuDNN): High-performance building blocks for deep neural network applications including convolutions, activation functions, and tensor transformations.(2)Deep Learning Inference Engine (TensorRT): High-performance deep learning inference runtime for production deployment.(3)Deep Learning for Video Analytics (DeepStream SDK): High-level C++ API and runtime for GPU-accelerated transcoding and deep learning inference.(4)Linear Algebra (cuBLAS): GPU-accelerated BLAS functionality that delivers 6x to 17x faster performance than CPU-only BLAS libraries.(5)Sparse Matrix Operations (cuSPARSE): GPU-accelerated linear algebra subroutines for sparse matrices that deliver up to 8x faster performance than CPU BLAS (MKL), ideal for applications such as natural language processing.(6)Multi-GPU Communication (NCCL): Collective communication routines, such as all-gather, reduce, and broadcast that accelerate multi-GPU deep learning training on up to eight GPUs.(7)NVIDIA DIGITS:Interactively manage data and train deep learning models for image classification, object detection, and image segmentation without the need to write code.说明:Fast Fourier Transforms (cuFFT);Dense and Sparse Direct Solvers (cuSOLVER);Random Number Generation (cuRAND);Image & Video Processing Primitives (NPP);NVIDIA Graph Analytics Library (nvGRAPH);Templated Parallel Algorithms & Data Structures (Thrust);CUDA Math Library.
复制
8.istream_iterator和ostream_iterator
解析:
(1)template <class T, class charT=char, class traits=char_traits
#include <iostream> // std::cout#include <iterator> // std::ostream_iterator#include <vector> // std::vector#include <algorithm> // std::copy int main () { std::vector<int> myvector; for (int i=1; i<10; ++i) myvector.push_back(i*10); std::ostream_iterator<int> out_it (std::cout, ", "); std::copy ( myvector.begin(), myvector.end(), out_it ); return 0;}
复制
(2)template <class T, class charT=char, class traits=char_traits
#include <iostream> // std::cin, std::cout#include <iterator> // std::istream_iteratorusing namespace std; int main() { double value1, value2; std::cout << "Please, insert two values: "; std::istream_iterator<double> eos; // end-of-stream iterator std::istream_iterator<double> iit(std::cin); // stdin iterator if (iit != eos) { cout << *eos << endl; cout << *iit << endl; cout << "test1" << endl; value1 = *iit; } ++iit; if (iit != eos) { cout << *eos << endl; cout << *iit << endl; cout << "test2" << endl; value2 = *iit; } std::cout << value1 << "*" << value2 << "=" << (value1*value2) << '\n'; return 0;}
复制
9.__host__ __device__ int foo(int a){}
解析:__host__ int foo(int a){}表示由CPU调用的函数。__device__ int foo(int a){}表示由GPU调用的函数。__host__和__device__关键字可以连用,比如__host__ __device__
int foo(int a){}会被编译成两个版本,分别可以由CPU和GPU调用。
10.SAXPY
解析:SAXPY(Scalar Alpha X Plus Y)是一个在Basic Linear Algebra Subprograms(BLAS)数据包中的函数,并且是一个并行向量处理机(vector processor)中常用的计算操作指令。SAXPY是标量乘法和矢量加法的组合:y=ax+y,其中a是标量,x和y矢量。
struct saxpy_functor { const float a; saxpy_functor(float _a) : a(_a) {} __host__ __device__ float operator()(const float& x, const float& y) const { return a * x + y; } }; void saxpy_fast(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y) { // Y <- A * X + Y thrust::transform(X.begin(), X.end(), Y.begin(), Y.begin(), saxpy_functor(A)); } void saxpy_slow(float A, thrust::device_vector<float>& X, thrust::device_vector<float>& Y{ thrust::device_vector<float> temp(X.size()); // temp <- A thrust::fill(temp.begin(), temp.end(), A); // temp <- A * X thrust::transform(X.begin(), X.end(), temp.begin(), temp.begin(), thrust::multiplies<float>()); // Y <- A * X + Y thrust::transform(temp.begin(), temp.end(), Y.begin(), Y.begin(), thrust::plus<float>()); }
复制
说用:仿函数(functor),就是使一个类的使用看上去像一个函数。其实现就是类中实现一个operator(),这个类就有了类似函数的行为,就是一个仿函数类了。
11.Thrust中的Transformations(转换)
解析:
(1)thrust::fill
(2)thrust::sequence
(3)thrust::replace
(4)thrust::transform
(5)thrust::negate
(6)thrust::modulus
(7)thrust::zip_iterator
(8)thrust::for_each
12.Thrust中的Reductions(规约)
(1)thrust::reduce
(2)thrust::count
(3)thrust::count_if
(4)thrust::min_element
(5)thrust::max_element
(6)thrust::is_sorted
(7)thrust::inner_product
(8)thrust::transform_reduce
(9)transform_inclusive_scan
(10)transform_exclusive_scan
13.初始化thrust::device_vector
解析:
float x[4] = { 1.0, 2.0, 3.0, 4.0 };thrust::device_vector<float> d_x(x, x + 4);for (int i = 0; i < d_x.size(); i++) cout << d_x[i] << endl;
复制
14.template<typename T> struct thrust::plus< T >
解析:
#include <thrust/functional.h>。int sum = thrust::reduce(D.begin(), D.end(), (int) 0, thrust::plus<int>());float norm = std::sqrt(thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op));
复制
15.cudaDeviceReset
解析:重置当前线程所关联过的当前设备的所有资源。
16.CUDART_VERSION
解析:CUDA 7.5版本的CUDART_VERSION为7050,包含在头文件#include<cuda_runtime_api.h>中。
17.thrust::count
解析:thrust:count函数原型,如下所示:
template<typename InputIterator , typename EqualityComparable >thrust::iterator_traits<InputIterator>::difference_type thrust::count ( InputIterator first,InputIterator last,const EqualityComparable & value )
复制
说明:count returns the number of iterators i in [first, last) such that *i == value.
18.transform_reduce
解析:transform_reduce函数原型,如下所示:
template<typename InputIterator , typename UnaryFunction , typename OutputType , typename BinaryFunction >OutputType thrust::transform_reduce ( InputIterator first,InputIterator last,UnaryFunction unary_op,OutputType init,BinaryFunction binary_op )
复制
举个例子,如下所示:
#include <thrust\transform_reduce.h>#include <thrust\functional.h>#include <thrust\device_vector.h>#include <thrust\host_vector.h>#include <cmath>using namespace std;using namespace thrust; template <typename T>struct square{ __host__ __device__ T operator()(const T& x) const { return x*x; }}; int main(void){ float x[4] = { 1.0, 2.0, 3.0, 4.0 }; device_vector<float> d_x(x, x + 4); square<float> unary_op; thrust::plus<float> binary_op; float init = 10; float norm = thrust::transform_reduce(d_x.begin(), d_x.end(), unary_op, init, binary_op); cout << norm << endl; return 0;}
复制
19.Prefix-Sums:inclusive_scan和exclusive_scan
解析:
#include <thrust\scan.h>#include <thrust\device_vector.h>#include <thrust\host_vector.h>using namespace std;using namespace thrust; int main(void){ int data[6] = { 1, 0, 2, 2, 1, 3 }; // data is now {1, 1, 3, 5, 6, 9} // data[2] = data[0] + data[1] + data[2] // thrust::inclusive_scan(data, data + 6, data); // data is now {0, 1, 1, 3, 5, 6} // data[2] = data[0] + data[1] thrust::exclusive_scan(data, data + 6, data); for (int i = 0; i < 6; i++) { cout << data[i] << endl; } return 0;}
复制
20.thrust::sort
和thrust::stable_sort
解析;thrust::stable_sort函数原型,如下所示:
template<typename DerivedPolicy , typename RandomAccessIterator , typename StrictWeakOrdering >__host__ __device__ void thrust::stable_sort ( const thrust::detail::execution_policy_base< DerivedPolicy > & exec,RandomAccessIterator first,RandomAccessIterator last,StrictWeakOrdering comp )
复制
(1)exec:The execution policy to use for parallelization.
(2)first:The beginning of the sequence.
(3)last:The end of the sequence.
(4)comp:Comparison operator.
举个例子,如下所示:
#include <thrust\sort.h>using namespace std;using namespace thrust; int main(void){ const int N = 6; int A[N] = { 1, 4, 2, 8, 5, 7 }; // A is now {1, 2, 4, 5, 7, 8} // thrust::sort(A, A + N); // A is now {1, 2, 4, 5, 7, 8} thrust::stable_sort(A, A + N); for (int i = 0; i < 6; i++) { cout << A[i] << endl; } return 0;}
复制
(1)#include <thrust/functional.h>:操作的函数对象和工具。
(2)#include <thrust/execution_policy.h>:Thrust执行策略。
21.thrust::sort_by_key和thrust::stable_sort_by_key
解析:
#include <thrust\sort.h>using namespace std;using namespace thrust; int main(void){ const int N = 6; int keys[N] = { 1, 4, 2, 8, 5, 7 }; char values[N] = { 'a', 'b', 'c', 'd', 'e', 'f' }; // keys is now { 1, 2, 4, 5, 7, 8} // values is now {'a', 'c', 'b', 'e', 'f', 'd'} // thrust::sort_by_key(keys, keys + N, values); // keys is now { 1, 2, 4, 5, 7, 8} // values is now {'a', 'c', 'b', 'e', 'f', 'd'} thrust::stable_sort_by_key(keys, keys + N, values); for (int i = 0; i < 6; i++) { cout << values[i] << endl; } return 0;}
复制
22.Thrust中的Iterator
解析:
(1)constant_iterator
(2)counting_iterator
#include <thrust\iterator\constant_iterator.h>#include <thrust/iterator/counting_iterator.h>#include <thrust\reduce.h>#include <iostream>using namespace std;using namespace thrust; int main(void){ thrust::constant_iterator<int> first(10); thrust::constant_iterator<int> last = first + 3; // returns 30 (i.e. 3 * 10) // thrust::reduce(first, last); // returns 33 (i.e. 10 + 11 + 12) thrust::reduce(first, last); cout << thrust::reduce(first, last) << endl; return 0;}
复制
(3)transform_iterator
#include <thrust\iterator\transform_iterator.h>#include <thrust\device_vector.h>#include <iostream>using namespace std;using namespace thrust; int main(void){ thrust::device_vector<int> vec(3); vec[0] = 10; vec[1] = 20; vec[2] = 30; // returns -60 (i.e. -10 + -20 + -30) cout << thrust::reduce(thrust::make_transform_iterator(vec.begin(), thrust::negate<int>()), thrust::make_transform_iterator(vec.end(), thrust::negate<int>())) << endl; return 0;}
复制
(4)permutation_iterator
#include <thrust\iterator\permutation_iterator.h>#include <thrust\device_vector.h>#include <iostream>using namespace std;using namespace thrust; int main(void){ thrust::device_vector<int> map(4); map[0] = 3; map[1] = 1; map[2] = 0; map[3] = 5; thrust::device_vector<int> source(6); source[0] = 10; source[1] = 20; source[2] = 30; source[3] = 40; source[4] = 50; source[5] = 60; // sum = source[map[0]] + source[map[1]] + ... int sum = thrust::reduce(thrust::make_permutation_iterator(source.begin(), map.begin()), thrust::make_permutation_iterator(source.begin(), map.end())); cout << sum << endl; return 0;}
复制
(5)zip_iterator
#include <thrust\iterator\zip_iterator.h>#include <thrust\device_vector.h>#include <iostream>using namespace std;using namespace thrust; int main(void){ thrust::device_vector<int> A(3); thrust::device_vector<char> B(3); A[0] = 10; A[1] = 20; A[2] = 30; B[0] = 'x'; B[1] = 'y'; B[2] = 'z'; thrust::maximum< thrust::tuple<int, char> > binary_op; thrust::tuple<int, char> init = thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin()))[0]; thrust::tuple<int, char> result = thrust::reduce(thrust::make_zip_iterator(thrust::make_tuple(A.begin(), B.begin())), thrust::make_zip_iterator(thrust::make_tuple(A.end(), B.end())), init, binary_op); cout << thrust::get<0>(result) << endl; cout << thrust::get<1>(result) << endl; return 0;}
复制
23.#include<stdlib.h>
解析:
(1)#define EXIT_SUCCESS 0
(2)#define EXIT_FAILURE 1
24.cuBLAS与CUBLASXT
解析:在CUDA 6的开发包中,提供了一个新的API——CUBLASXT,它是在cuBLAS API的上层封装了一个矩阵分块算法,解决了当数据量大时显存不足的问题。
25.cuRAND库
解析:cuRAND库提供了通过GPU生成随机数的接口,包含头文件#include <curand_kernel.h>。
26.CUDA同步方式
解析:在CUDA中,有两种方式实现同步,如下所示:
(1)System-level:等待所有host和device的工作完成。
(2)Block-level:等待device中block的所有thread执行到某个点。
27.CUDA中的on-chip和off-chip内存
解析:
(1)共享内存(share memory)是on-chip。局部内存(local memory)和全局内存(global memory)是off-chip。它为一个线程块(block)中所有线程共享。
(2)局部内存(local memory)是全局内存(global memory)中划出的一部分。它为一个线程网格(grid)中的所有线程共享。
28.CUDA内存管理
(1)cudaError_t cudaMalloc(void** devPtr, size_t count);(2)cudaError_t cudaMallocPitch(void** devPtr, size_t* pitch, size_t widthInBytes, size_t height);(3)cudaError_t cudaFree(void* devPtr);(4)cudaError_t cudaMallocArray(struct cudaArray** array, const struct cudaChannelFormatDesc* desc, size_t width, size_t height);(5)cudaError_t cudaFreeArray(struct cudaArray* array);(6)cudaError_t cudaMallocHost(void** hostPtr, size_t size);(page-locked)(7)cudaError_t cudaFreeHost(void* hostPtr);(8)cudaError_t cudaMemset(void* devPtr, int value, size_t count);(9)cudaError_t cudaMemset2D(void* dstPtr, size_t pitch, int value, size_t width, size_t height)(10)cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind);说明:kind可以是cudaMemcpyHostToHost,cudaMemcpyHostToDevice,cudaMemcpyDeviceToHost或cudaMemcpyDeviceToDevice。(11)cudaError_t cudaMemcpyAsync(void* dst,constvoid*src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream);说明:它只能应用于page-locked的主机内存。(12)cudaError_t cudaMemcpy2D(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind);(13)cudaError_t cudaMemcpy2DAsync(void* dst, size_t dpitch, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind, cudaStream_t stream);说明:dpitch:Pitch of destination memory;spitch:Pitch of source memory。(14)cudaError_t cudaMemcpyToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t count, enum cudaMemcpyKind kind);(15)cudaError_t cudaMemcpyToArrayAsync(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream);说明:拷贝count字节,从src指向的内存区域到dstArray指向的CUDA数组,从数组的左上角(dstX, dstY)开始。(16)cudaError_t cudaMemcpy2DToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind);(17)cudaError_t cudaMemcpy2DToArrayAsync(struct cudaArray* dstArray, size_t dstX, size_t dstY, const void* src, size_t spitch, size_t width, size_t height, enum cudaMemcpyKind kind, cudaStream_t stream);说明:拷贝一个矩阵,从src指向的内存区域到dstArray指向的CUDA数组,从数组的左上角(dstX, dstY)开始。spitch是由src指向的2D数组中的内存宽度字节,2D 数组中每行的最后包含自动填充的数值。(18)cudaError_t cudaMemcpyFromArray(void* dst, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t count, enum cudaMemcpyKind kind);(19)cudaError_t cudaMemcpyFromArrayAsync(void* dst, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t count, enum cudaMemcpyKind kind, cudaStream_t stream);说明:拷贝count字节,从srcArray指向的CUDA数组,从数组的左上角(srcX, srcY)开始,到dst指向内存区域。(20)cudaError_t cudaMemcpy2DFromArray(void* dst, size_t dpitch, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t width, size_t height, enum cudaMemcpyKind kind);(21)cudaError_t cudaMemcpy2DFromArrayAsync(void* dst, size_t dpitch, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t width, size_t height, enum cudaMemcpyKind kind, cudaStream_t stream);说明:拷贝一个矩阵,从srcArray指向的CUDA数组,从数组的左上角(srcX, srcY)开始,到dst指向的内存区域。dpitch是由dst指向的2D数组中的内存宽度字节,2D数组中每行的最后包含自动填充的数值。(22)cudaError_t cudaMemcpyArrayToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t count, enum cudaMemcpyKind kind);说明:拷贝count字节,从srcArray指向的CUDA数组,从数组的左上角(srcX, srcY)开始,到dstArray指向的CUDA数组,从数组的左上角(dstX, dstY)。(23)cudaError_t cudaMemcpyArrayToArray(struct cudaArray* dstArray, size_t dstX, size_t dstY, const struct cudaArray* srcArray, size_t srcX, size_t srcY, size_t width, size_t height, enum cudaMemcpyKind kind);说明:拷贝一个矩阵,从srcArray指向的CUDA数组,从数组的左上角(srcX, srcY)开始,到dstArray指向的CUDA数组,从数组的左上角(dstX, dstY)。(24)template<class T> cudaError_t cudaMemcpyToSymbol(const T& symbol, const void* src, size_t count, size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyHostToDevice);说明:拷贝count字节,从src指向的内存区域到由符号symbol起始的offset字节指向的内存区域。symbol指向的是在device中的global或者constant memory。(25)template<class T> cudaError_t cudaMemcpyFromSymbol(void *dst, const T& symbol, size_t count, size_t offset = 0, enum cudaMemcpyKind kind = cudaMemcpyDeviceToHost);说明:拷贝count字节,从由符号symbol起始的offset字节指向的内存区域到dst指向的内存区域。symbol指向的是在device中的global或者constant memory。(26)template<class T> cudaError_t cudaGetSymbolAddress(void** devPtr, const T& symbol);说明:返回设备中符号symbol的地址*devPtr。(27)template<class T> cudaError_t cudaGetSymbolSize(size_t* size, const T& symbol );说明:返回符号symbol大小的地址*size。
复制
参考文献:
[1] CUDA并行算法系列之规约:http://blog.5long.me/2016/algorithms-on-cuda-reduction/
[2] 大规模并行处理器编程实战(第2版)
[3] CUDA之深入理解threadIdx:http://blog.csdn.net/canhui_wang/article/details/51730264
[4] Thrust:http://docs.nvidia.com/cuda/thrust/index.html#abstract
[5] GPU中的几个基本概念:http://blog.sina.com.cn/s/blog_80ce3a550101lntp.html
[6] Thrust:http://docs.nvidia.com/cuda/thrust/index.html#axzz4aFPI7CYb
[7] on-chip memory概念:http://bbs.csdn.net/topics/340269551
在CUDA程序中, 访存优化个人认为是最重要的优化项. 往往kernel会卡在数据传输而不是计算上, 为了最大限度利用GPU的计算能力, 我们需要根据GPU硬件架构对kernel访存进行合理的编写.