Japan Electronics Repair Tech · Graphics Cards · <Updated> Titan V + Tensor Core quick test: saying I love it is really not...

Date: 2021-09-29 Category: Repair experience

[Image: IMG_0216 - 副本.jpg, uploaded 2018-4-4 11:06]

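I've been wanting to use tensor cores, even the clumsiest, dumbest tensor core; as long as it's a tensor core, it will do. But it never got past "wanting", because I never found the time to dig into it.
Last night I had a little time and came up with a bold idea: run a test.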

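1. What is a tensor.
It is a concept from physics; here it simply means a high-dimensional collection of data. Zero dimensions is a scalar, one dimension is a vector, two dimensions is a matrix, and anything with three or more dimensions can be called a tensor. The data volumes commonly used in DL are 4-dimensional, while the newer capsule networks use 5-dimensional ones.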

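2. What processing Tensor Cores can do.
This one is easy to answer, because it is written in the manual. Open page 6 of the reference manual and it states clearly that Tensor Cores apply to only two network structures (CNN/RNN), and within each of them to only two kinds of processing: the forward pass and the backward pass. That looks like a narrow scope, but these two account for a very large share of deep learning work, roughly 90% by my estimate.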

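3. Conditions for enabling Tensor Cores.
In other words, in which situations Tensor Cores can be switched on. The conditions are quite strict; it is not as simple as using half-precision data.
3.1 First, set the math type of the convolution to tensor op (cudnnSetConvolutionMathType with CUDNN_TENSOR_OP_MATH). This tells the GPU to use Tensor Cores where possible. It is easy to do, but if the remaining conditions are not met, Tensor Cores still won't be used; it is only a "where possible" request.

3.2 When running the forward pass, use one specific matrix-multiplication algorithm (CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM). This is the easiest step; again it is just a setting. No big deal.

3.3 Input, output and convolution kernel must all be in fp16, i.e. half-precision floating point. Setting it is easy, but you have to weigh the resulting precision loss and whether your model can tolerate it. Trial and error will tell you. Fortunately most DL models are not sensitive to precision.

3.4 All channel counts must be a multiple of 8. This one hurts, because it turns around and dictates what the model has to look like. What is a channel? A color image has three channels; that is the concept. This is a strict condition and a real limitation.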

How do I know all this? Because it is written in the manual in black and white.
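The four conditions in section 3 can be written down as one predicate. This is only a sketch of the checklist as the post describes it: the enums and the ConvConfig struct are hypothetical stand-ins for illustration, not real cuDNN types, and the real library applies its own internal checks.

```cpp
#include <cassert>

// Hypothetical stand-ins for the cuDNN settings named in the post.
// These are illustrative only and are NOT the real cuDNN enums.
enum class MathType { Default, TensorOp };
enum class DataType { Float32, Float16 };
enum class FwdAlgo { ImplicitGemm, ImplicitPrecompGemm, Winograd };

struct ConvConfig {
    MathType math;       // 3.1: requested math type
    FwdAlgo algo;        // 3.2: forward algorithm
    DataType input;      // 3.3: element types
    DataType filter;
    DataType output;
    int in_channels;     // 3.4: channel counts
    int out_channels;
};

// Returns true only when all four conditions from section 3 hold.
bool tensor_core_eligible(const ConvConfig& c) {
    assert(c.in_channels > 0 && c.out_channels > 0);
    const bool math_ok = c.math == MathType::TensorOp;
    const bool algo_ok = c.algo == FwdAlgo::ImplicitPrecompGemm;
    const bool half_ok = c.input == DataType::Float16 &&
                         c.filter == DataType::Float16 &&
                         c.output == DataType::Float16;
    const bool chan_ok = c.in_channels % 8 == 0 && c.out_channels % 8 == 0;
    return math_ok && algo_ok && half_ok && chan_ok;
}
```

With this check, an ordinary 3-channel RGB input fails on condition 3.4 even when everything else is set correctly, which is why the test data below uses 8 channels.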

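4. How to test.
I did not test through a framework. Frameworks are heavily wrapped and I have no idea how far their support goes, so I used raw cuDNN code here.
I implemented and tested a single function only. The test computes the same data set over and over, which simply and directly removes the cost of data transfer. With the other variables controlled, comparison becomes easier.
What I tested is the cuDNN implementation of the classic CNN forward pass, cudnnConvolutionForward, and what I recorded is the total running time of 10,000 runs.
Below I'll post the data and share some personal thoughts along the way.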
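The measurement scheme described above (repeat one call many times, report the total wall time) can be sketched in plain C++. The helper name time_repeated is made up for this illustration and is not from the original test code. Note that for GPU work the callable would also have to wait for the device (for example with cudaDeviceSynchronize) before the clock is stopped, because kernel launches return before the work is finished.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Runs `work` `iterations` times and returns the total elapsed wall-clock
// time in seconds. `time_repeated` is a hypothetical helper name.
double time_repeated(const std::function<void()>& work, int iterations) {
    assert(iterations > 0);
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        work();
    }
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}
```

Passing a lambda that wraps the call under test, with 10000 as the iteration count, mirrors the loop used in the listings further down.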

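TEST section:

As mentioned, a classic CNN generally works on 4-dimensional data, but the four dimensions are not equals; they have different jobs. Basically the first dimension stands on its own and the other three form a group. The first dimension is the batch_size: how many samples are pushed into the GPU at once. The other three describe the properties of a single sample. A color image, for example, has a shape like 24*24*3, where the three dimensions are height, width and channels (a color image has three channels).

This means that testing is mainly about balancing these two sides: with a fixed amount of memory, if each sample is large, fewer samples can be processed at once. That is the whole idea in simple terms.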
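To make the "batch first, then height, width, channel" layout concrete, here is a small sketch of how a 4-dimensional index maps to a position in a flat buffer when the data is stored in NHWC order (the order used by the test code). Shape4D and nhwc_offset are names invented for this example.

```cpp
#include <cassert>
#include <cstddef>

// Shape of a 4-D data volume: batch, height, width, channels.
struct Shape4D {
    int n, h, w, c;
};

// Flat element offset of (n, h, w, c) in a buffer stored in NHWC order:
// the channel index varies fastest, the batch index slowest.
std::size_t nhwc_offset(const Shape4D& s, int n, int h, int w, int c) {
    assert(n >= 0 && n < s.n);
    assert(h >= 0 && h < s.h);
    assert(w >= 0 && w < s.w);
    assert(c >= 0 && c < s.c);
    return ((static_cast<std::size_t>(n) * s.h + h) * s.w + w) * s.c + c;
}
```

For a batch of 24*24*3 images, moving one pixel to the right advances the offset by 3 (one value per channel), and moving to the next sample advances it by 24*24*3 = 1728.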
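The trade-off can be put into numbers with a few lines of arithmetic. The sketch below only counts the bytes of the input volume itself; the function names are made up, and a real run also needs room for the output, the filter and the workspace.

```cpp
#include <cassert>
#include <cstddef>

// Bytes occupied by one sample of shape height x width x channels.
std::size_t sample_bytes(int height, int width, int channels, std::size_t elem_bytes) {
    assert(height > 0 && width > 0 && channels > 0 && elem_bytes > 0);
    return static_cast<std::size_t>(height) * width * channels * elem_bytes;
}

// Bytes occupied by a whole batch of such samples.
std::size_t batch_bytes(int batch, int height, int width, int channels, std::size_t elem_bytes) {
    assert(batch > 0);
    return static_cast<std::size_t>(batch) * sample_bytes(height, width, channels, elem_bytes);
}

// Largest batch whose input volume fits in `budget_bytes`.
std::size_t max_batch(std::size_t budget_bytes, int height, int width, int channels, std::size_t elem_bytes) {
    return budget_bytes / sample_bytes(height, width, channels, elem_bytes);
}
```

For example, 512 samples of 128*128*8 in fp16 (2 bytes per element) take 128 MiB, and doubling height and width to 256 cuts the batch that fits into a fixed budget to a quarter.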

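Group 1: batch size 512, i.e. 512 samples processed at once. Each sample has shape 128*128*8, which you can think of as an image with height and width 128 and 8 channels. (Real images don't have 8 channels, of course; this is fabricated test data made to satisfy the fourth condition for enabling Tensor Cores, 3.4, that the channel count be a multiple of 8.)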

Results:
Configuration                                            Time
Tensor Core off, SP cores in half precision (fp16)       78.01s
Tensor Core off, SP cores in single precision (fp32)     160.20s
Tensor Core on, half precision (fp16)                    69.60s
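Read as ratios, these three timings say that the Tensor Core run is about 2.3x faster than fp32 on the SP cores but only about 1.12x faster than fp16 on the SP cores. A trivial helper (the name speedup is just for this sketch) makes the arithmetic explicit:

```cpp
#include <cassert>

// Speedup of a candidate run over a baseline run, both given in seconds.
// A value above 1 means the candidate is faster.
double speedup(double baseline_seconds, double candidate_seconds) {
    assert(baseline_seconds > 0.0 && candidate_seconds > 0.0);
    return baseline_seconds / candidate_seconds;
}
```

Here speedup(160.20, 69.60) is the Tensor Core gain over fp32 and speedup(78.01, 69.60) the gain over plain fp16.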

Comment

Here is the code. Compile it with nvcc and it will run; the TODO tags tell you which parts to change.
This one is outdated; see post #14.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <cuda_runtime.h>
#include <assert.h>
#include <cudnn.h>
#include <stddef.h>
#include <iostream>
#include <time.h>
// Linux
#include <sys/time.h>

#define checkCUDNN(expression)                            \
  {                                                       \
cudnnStatus_t status = (expression);                   \
if (status != CUDNN_STATUS_SUCCESS) {                   \
   std::cerr << "Error on line " << __LINE__ << ": "    \
            << cudnnGetErrorString(status) << std::endl; \
   std::exit(EXIT_FAILURE);                            \
}                                                       \
  }

using namespace std;

double get_wall_time()
{
struct timeval time ;
if (gettimeofday(&time,NULL)){
      return 0;
}
return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

int main( int argc, char** argv )
{

size_t cudnn_version = cudnnGetVersion();
size_t cuda_version = cudnnGetCudartVersion();

cout << "version : cuda runtime: " << cuda_version << " | cudnn: " << cudnn_version << endl;

cudnnHandle_t cudnn;
checkCUDNN(cudnnCreate(&cudnn));

//TODO 4 : The number of Input and Output feature maps is a multiple of 8.
//TODO 5 : The Filter is of type CUDNN_TENSOR_NCHW or CUDNN_TENSOR_NHWC. When using a filter of type CUDNN_TENSOR_NHWC, Input, Filter and Output data pointers (X, Y, W, dX, dY, and dW as applicable) need to be aligned to 128 bit boundaries.

const int batch_size = 100;
const int input_channel = 8;
const int output_channel = input_channel;
const int input_height = 256;
const int input_width = input_height;
const int kernel_height = 3;
const int kernel_width = kernel_height;

//TODO 3 : Input, Filter and Output descriptors (xDesc, yDesc, wDesc, dxDesc, dyDesc and dwDesc as applicable) have dataType = CUDNN_DATA_HALF.

// INPUT
cudnnTensorDescriptor_t input_descriptor;
checkCUDNN(cudnnCreateTensorDescriptor(&input_descriptor));
checkCUDNN(cudnnSetTensor4dDescriptor(input_descriptor,
                                       /*format=*/CUDNN_TENSOR_NHWC/*batch_size | height | width | channel*/,
                                       /*dataType=*/CUDNN_DATA_HALF/*fp16*/,
                                       /*batch_size=*/batch_size,
                                       /*channels=*/input_channel,
                                       /*image_height=*/input_height,
                                       /*image_width=*/input_width));
// OUTPUT
cudnnTensorDescriptor_t output_descriptor;
checkCUDNN(cudnnCreateTensorDescriptor(&output_descriptor));
checkCUDNN(cudnnSetTensor4dDescriptor(output_descriptor,
                                       /*format=*/CUDNN_TENSOR_NHWC,
                                       /*dataType=*/CUDNN_DATA_HALF,
                                       /*batch_size=*/batch_size,
                                       /*channels=*/output_channel,
                                       /*image_height=*/input_height,
                                       /*image_width=*/input_width));

// KERNEL
cudnnFilterDescriptor_t kernel_descriptor;
checkCUDNN(cudnnCreateFilterDescriptor(&kernel_descriptor));
checkCUDNN(cudnnSetFilter4dDescriptor(kernel_descriptor,
                                       /*dataType=*/CUDNN_DATA_HALF,
                                       /*format=*/CUDNN_TENSOR_NCHW,
                                       /*out_channels=*/output_channel,
                                       /*in_channels=*/input_channel,
                                       /*kernel_height=*/kernel_height,
                                       /*kernel_width=*/kernel_width));

// CONVOLUTION
cudnnConvolutionDescriptor_t convolution_descriptor;
checkCUDNN(cudnnCreateConvolutionDescriptor(&convolution_descriptor));
checkCUDNN(cudnnSetConvolution2dDescriptor(convolution_descriptor,
                                             /*pad_height=*/1/*zero-padding*/,
                                             /*pad_width=*/1/*zero-padding*/,
                                             /*vertical_stride=*/1,
                                             /*horizontal_stride=*/1,
                                             /*dilation_height=*/1/*holing, new kernel height = dilation_factor * ( original_height - 1 ) + 1*/,
                                             /*dilation_width=*/1/*dilation_factor = 1 means no change*/,
                                             /*mode=*/CUDNN_CROSS_CORRELATION,
                                             /*computeType=*/CUDNN_DATA_FLOAT));

// TODO 1 : cudnnSetConvolutionMathType is called on the appropriate convolution descriptor setting mathType to CUDNN_TENSOR_OP_MATH.

checkCUDNN(cudnnSetConvolutionMathType(convolution_descriptor, CUDNN_TENSOR_OP_MATH));
// cudnnSetConvolutionMathType(convolution_descriptor, CUDNN_DEFAULT_MATH);

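// Use 10 GiB of GPU memory for the workspace. Byte counts are size_t,
// which is what cudaMalloc and cudnnConvolutionForward expect.
size_t workspace_bytes = 10ULL * 1024 * 1024 * 1024;
void* d_workspace{nullptr};
cudaMalloc(&d_workspace, workspace_bytes);

// NOTE: the buffers are sized for float while the descriptors above say
// CUDNN_DATA_HALF, so they are twice as large as needed. The float bit
// patterns are read as fp16 values; that changes the numbers but should
// not change the timing.
size_t dummy_input_bytes = (size_t)batch_size * input_width * input_height * input_channel * sizeof(float);
size_t output_bytes = dummy_input_bytes;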

// generate dummy input data, indexed as [batch][height][width][channel] (NHWC)
static float h_dummy_input[batch_size][input_height][input_width][input_channel];
for (int inner_batch_size = 0; inner_batch_size < batch_size; ++inner_batch_size) {
   for (int inner_height = 0; inner_height < input_height; ++inner_height) {
      for (int inner_width = 0; inner_width < input_width; ++inner_width) {
         for (int inner_channel = 0; inner_channel < input_channel; ++inner_channel) {
            h_dummy_input[inner_batch_size][inner_height][inner_width][inner_channel] = 0.12345f;
         }
      }
   }
}

float* d_dummy_input{nullptr};
cudaMalloc(&d_dummy_input, dummy_input_bytes);
cudaMemcpy(d_dummy_input, h_dummy_input, dummy_input_bytes, cudaMemcpyHostToDevice);

// output data
float* d_output{nullptr};
cudaMalloc(&d_output, output_bytes);
cudaMemset(d_output, 0, output_bytes);

// 3 by 3 dummy kernel
const float kernel_template[kernel_width][kernel_height] = {
   {1.1,  1.2, 1.3},
   {1.4, -1.5, 1.6},
   {1.7,  1.8, 1.9}
};

// filter layout matches the descriptor: [out_channel][in_channel][row][column]
float h_dummy_kernel[output_channel][input_channel][kernel_height][kernel_width];
for (int kernel = 0; kernel < output_channel; ++kernel) {
   for (int channel = 0; channel < input_channel; ++channel) {
      for (int row = 0; row < kernel_height; ++row) {
         for (int column = 0; column < kernel_width; ++column) {
            h_dummy_kernel[kernel][channel][row][column] = kernel_template[row][column];
         }
      }
   }
}

float* d_dummy_kernel{nullptr};
cudaMalloc(&d_dummy_kernel, sizeof(h_dummy_kernel));
cudaMemcpy(d_dummy_kernel, h_dummy_kernel, sizeof(h_dummy_kernel), cudaMemcpyHostToDevice);

      double start_time = get_wall_time();

      // TODO 2 : cudnnConvolutionForward is called using algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.

      // execute
      const float alpha = 1, beta = 0;
for(int i = 0; i < 10000; i++){

         checkCUDNN(cudnnConvolutionForward(cudnn,
                                                &alpha,
                                                input_descriptor,
                                                d_dummy_input,
                                                kernel_descriptor,
                                                d_dummy_kernel,
                                                convolution_descriptor,
                                                CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM,
                                                d_workspace,
                                                workspace_bytes,
                                                &beta,
                                                output_descriptor,
                                                d_output));
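}

      // cuDNN calls are queued asynchronously; wait for the GPU to finish
      // before stopping the clock.
      cudaDeviceSynchronize();

      double end_time = get_wall_time();
      cout<<"time cost : "<< end_time - start_time <<"s." << endl;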

      cudaFree(d_dummy_kernel);
      cudaFree(d_dummy_input);
      cudaFree(d_output);
      cudaFree(d_workspace);

cudnnDestroyTensorDescriptor(input_descriptor);
cudnnDestroyTensorDescriptor(output_descriptor);
cudnnDestroyFilterDescriptor(kernel_descriptor);
cudnnDestroyConvolutionDescriptor(convolution_descriptor);
cudnnDestroy(cudnn);
}
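The listing reuses the input height and width for the output descriptor. That works because of the usual convolution output-size formula, which also contains the dilation rule quoted in the code comment (effective kernel size = dilation * (kernel - 1) + 1). A small sketch, with conv_output_size as a made-up helper name:

```cpp
#include <cassert>

// Output extent of a convolution along one axis.
// effective_kernel follows the dilation rule from the code comment:
//   effective_kernel = dilation * (kernel - 1) + 1
int conv_output_size(int input, int kernel, int pad, int stride, int dilation) {
    assert(input > 0 && kernel > 0 && pad >= 0 && stride > 0 && dilation > 0);
    const int effective_kernel = dilation * (kernel - 1) + 1;
    assert(input + 2 * pad >= effective_kernel);
    return (input + 2 * pad - effective_kernel) / stride + 1;
}
```

With a 3x3 kernel, padding 1, stride 1 and dilation 1, a 256-wide input gives a 256-wide output, so input and output shapes match in this test. Without the padding the output would shrink to 254.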

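Comment
I won't say more; everywhere I look it's all gold coins.
For the Wi-Fi antenna, the new standing type from ASUS would probably look more harmonious.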

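Comment
Learned something...
Speaking of which, of GP102 and GP100 one supports int8 and the other fp16; they just won't give you the complete package...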

Comment

What model is the test bench, and where did you buy it?

OK, found it by searching the forum.

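Comment
An explainer from Mentor Shu has to get an upvote! Tensor cores are large in scale and their theoretical DL throughput is high, but getting real performance up to the theoretical level still takes enormous optimization.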

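Comment
Hold on, Mr. Shu.
Speaking as a bystander: your conclusion is torture for anyone with OCD.
You need to tell everyone plainly which specific graphics card is the better fit for this.
We don't care about optimization and all that!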

Comment
P.S. I originally meant to reply as below, and only later noticed that you used CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM. It feels like the tensor cores simply can't beat CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD...

--------
Hmm......
CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD is just too well optimized... And the tensor cores can't be optimized the same way, after all?

I compared them once:
0 CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_GEMM           0:00:28.527458
1 CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM   0:00:15.827436
6 CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD                0:00:10.990773

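Comment
Right now tensor cores look like the key to this round of ray-tracing technology. Will Old Huang add tensor cores to the next generation of gaming cards?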

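Comment

I tried it, and this Winograd algorithm really is impressive: on the first data set, half precision took 42.299s and single precision 38.70s. Intel benchmarked its i7 975 against cudnn [email protected] x (before the Winograd algorithm was applied, using CUDNN_CONVOLUTION_FWD_ALGO_DIRECT), and surprisingly it was a rout.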
https://arxiv.org/pdf/1509.09308.pdf
https://ai.intel.com/winograd/

[Image: 1.PNG, uploaded 2018-4-4 18:39]

[Image: 2.PNG, uploaded 2018-4-4 18:39]

But this algorithm can only be applied to small tiles; it needs a very small convolution kernel, such as the 3x3 used in this thread.
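To show why small kernels matter, here is the smallest case from the linked arXiv paper, F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of the 6 a direct computation needs. This is a one-dimensional sketch in plain C++ for illustration only; it is not how cuDNN implements its kernels, and the function names are made up.

```cpp
#include <array>
#include <cassert>

// Direct computation: 2 outputs of a 3-tap filter, 6 multiplications.
std::array<double, 2> direct_f23(const std::array<double, 4>& d,
                                 const std::array<double, 3>& g) {
    return {d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]};
}

// Winograd minimal filtering F(2,3): same 2 outputs, 4 multiplications.
// The filter-side factors depend only on g and can be precomputed once.
std::array<double, 2> winograd_f23(const std::array<double, 4>& d,
                                   const std::array<double, 3>& g) {
    const double g_sum = (g[0] + g[1] + g[2]) / 2.0;
    const double g_alt = (g[0] - g[1] + g[2]) / 2.0;
    const double m1 = (d[0] - d[2]) * g[0];
    const double m2 = (d[1] + d[2]) * g_sum;
    const double m3 = (d[2] - d[1]) * g_alt;
    const double m4 = (d[1] - d[3]) * g[2];
    return {m1 + m2 + m3, m2 - m3 - m4};
}

// Sanity check on one small example whose values are exact in floating point.
void winograd_self_check() {
    const std::array<double, 4> d{{1.0, 2.0, 3.0, 4.0}};
    const std::array<double, 3> g{{1.0, 2.0, 3.0}};
    assert(winograd_f23(d, g) == direct_f23(d, g));
}
```

The saving comes from the transforms around the multiplications, and those transforms grow quickly with tile and kernel size, which fits the remark that the method is used for small kernels like 3x3.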

Comment

Just buy the expensive one.

Comment

You finish me off with just six characters?

Comment
Here is a new set of test data: performance as the sample size changes, with the other parameters fixed. batch_size = 128, input_channel = 8, kernel_size = 3*3

Data (hyper_param is the input height and width; times in seconds)
hyper_param   tensor_core   sp_fp16   sp_fp32   sp_fp32_winograd   sp_fp16_winograd
32            1.24807       1.14181   2.25537   0.917808           0.921247
40            1.65325       1.72399   3.33615   1.35495            1.44183
48            2.25016       2.37614   4.67559   1.61697            1.72384
56            2.95046       3.18836   6.35148   2.31077            2.50663
64            3.76574       4.09601   8.22187   2.65733            2.85273
80            5.67113       6.2488    12.7032   4.00899            4.33426
96            8.05187       8.90431   18.2806   5.65286            6.13542
112           10.8515       12.0267   24.6829   7.57724            8.27656
128           14.0958       32.0654   32.1668   9.87031            10.7502
152           19.7337       45.0854   45.233    14.3792            15.7652
176           26.3397       29.4172   60.5687   18.3978            20.1214
200           33.9266       37.9171   78.1265   24.5277            26.8089
224           42.502        47.4743   97.9334   29.6405            32.3995
256           55.5108       32.9729   127.826   38.9912            42.0938

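As a chart:

[Image: chart.jpg, uploaded 2018-4-5 11:13]

The vertical axis is time cost; lower is better.

Conclusions:
Half-precision performance on the SP cores is all over the place...
Winograd's optimization is very strong, the strongest at this stage.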
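Both conclusions can be checked mechanically by picking the fastest configuration in each row of the table. The sketch below uses the column names from the table; the Row struct and fastest_index are made up for the example, and only the rows you feed in are examined.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Column names, in the same order as the table above.
const std::array<std::string, 5> kConfigs = {
    "tensor_core", "sp_fp16", "sp_fp32", "sp_fp32_winograd", "sp_fp16_winograd"};

// One table row: input size and the five timings in seconds.
struct Row {
    int size;
    std::array<double, 5> seconds;
};

// Index of the column with the lowest time in this row.
std::size_t fastest_index(const Row& row) {
    std::size_t best = 0;
    for (std::size_t i = 0; i < row.seconds.size(); ++i) {
        assert(row.seconds[i] > 0.0);
        if (row.seconds[i] < row.seconds[best]) {
            best = i;
        }
    }
    return best;
}
```

Fed with the size-32 row, the winner is sp_fp32_winograd. Fed with the size-256 row, the winner is sp_fp16, which is one of the jumps behind the "all over the place" remark.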

Comment
Here is the updated test code. After some simple wrapping it can be called in a loop.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>
#include <cuda_runtime.h>
#include <assert.h>
#include <cudnn.h>
#include <stddef.h>
#include <iostream>
#include <time.h>
#include<vector>
// Linux
#include <sys/time.h>
#include <boost/multi_array.hpp>

using namespace std;

#define checkCUDNN(expression)                            \
  {                                                       \
cudnnStatus_t status = (expression);                   \
if (status != CUDNN_STATUS_SUCCESS) {                   \
   cerr << "Error on line " << __LINE__ << ": "          \
            << cudnnGetErrorString(status) << endl;    \
   exit(EXIT_FAILURE);                                  \
}                                                       \
  }

double get_wall_time()
{
struct timeval time;
if (gettimeofday(&time,NULL)){
      return 0;
}
return (double)time.tv_sec + (double)time.tv_usec * .000001;
}

cudnnHandle_t cudnn;
cudnnTensorDescriptor_t input_descriptor;
cudnnTensorDescriptor_t output_descriptor;
cudnnFilterDescriptor_t kernel_descriptor;
cudnnConvolutionDescriptor_t convolution_descriptor;

size_t workspace_bytes;
void* d_workspace{nullptr};
float* d_dummy_input{nullptr};
float* d_dummy_kernel{nullptr};
float* d_output{nullptr};

void post(){
// cudnnGetCudartVersion() reports the CUDA runtime, cudnnGetVersion() reports cuDNN
cout << "version : CUDA runtime: " << cudnnGetCudartVersion() << " | cuDNN: " << cudnnGetVersion() << endl;
}

void initializeDescriptor(
int batch_size,
int input_channel,
int output_channel,
int input_height,
int input_width,
int output_height,
int output_width,
int kernel_height,
int kernel_width,
cudnnDataType_t dataType,
cudnnMathType_t mathType){

checkCUDNN(cudnnCreate(&cudnn));

//TODO 3 : Input, Filter and Output descriptors (xDesc, yDesc, wDesc, dxDesc, dyDesc and dwDesc as applicable) have dataType = CUDNN_DATA_HALF.

// INPUT
checkCUDNN(cudnnCreateTensorDescriptor(&input_descriptor));
checkCUDNN(cudnnSetTensor4dDescriptor(input_descriptor,
                                       /*format=*/CUDNN_TENSOR_NHWC/*batch_size | height | width | channel*/,
                                       /*dataType=*/dataType/*fp16*/,
                                       /*batch_size=*/batch_size,
                                       /*channels=*/input_channel,
                                       /*image_height=*/input_height,
                                       /*image_width=*/input_width));
// OUTPUT
checkCUDNN(cudnnCreateTensorDescriptor(&output_descriptor));
checkCUDNN(cudnnSetTensor4dDescriptor(output_descriptor,
                                       /*format=*/CUDNN_TENSOR_NHWC,
                                       /*dataType=*/dataType,
                                       /*batch_size=*/batch_size,
                                       /*channels=*/output_channel,
                                       /*image_height=*/output_height,
                                       /*image_width=*/output_width));

// KERNEL
checkCUDNN(cudnnCreateFilterDescriptor(&kernel_descriptor));
checkCUDNN(cudnnSetFilter4dDescriptor(kernel_descriptor,
                                       /*dataType=*/dataType,
                                       /*format=*/CUDNN_TENSOR_NCHW,
                                       /*out_channels=*/output_channel,
                                       /*in_channels=*/input_channel,
                                       /*kernel_height=*/kernel_height,
                                       /*kernel_width=*/kernel_width));

// CONVOLUTION
checkCUDNN(cudnnCreateConvolutionDescriptor(&convolution_descriptor));
checkCUDNN(cudnnSetConvolution2dDescriptor(convolution_descriptor,
                                             /*pad_height=*/1/*zero-padding*/,
                                             /*pad_width=*/1/*zero-padding*/,
                                             /*vertical_stride=*/1,
                                             /*horizontal_stride=*/1,
                                             /*dilation_height=*/1/*holing, new kernel height = dilation_factor * ( original_height - 1 ) + 1*/,
                                             /*dilation_width=*/1/*dilation_factor = 1 means no change*/,
                                             /*mode=*/CUDNN_CROSS_CORRELATION,
                                             /*computeType=*/CUDNN_DATA_FLOAT));

// TODO 1 : cudnnSetConvolutionMathType is called on the appropriate convolution descriptor setting mathType to CUDNN_TENSOR_OP_MATH.

cudnnSetConvolutionMathType(convolution_descriptor, mathType);

}

void createDeviceData(
int batch_size,
int input_channel,
int output_channel,
int input_height,
int input_width,
int output_height,
int output_width,
int kernel_height,
int kernel_width){

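// use 10 GiB of GPU memory for the workspace
workspace_bytes = 10ULL * 1024 * 1024 * 1024;
cudaMalloc(&d_workspace, workspace_bytes);

// Buffers are sized for float, the larger element type, so they are big
// enough for both CUDNN_DATA_FLOAT and CUDNN_DATA_HALF descriptors. With
// CUDNN_DATA_HALF the float bit patterns are read as fp16 values, which
// changes the numbers but should not change the timing.
size_t dummy_input_bytes = (size_t)batch_size * input_width * input_height * input_channel * sizeof(float);
size_t output_bytes = dummy_input_bytes;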

// generate dummy input data, indexed as [batch][height][width][channel] (NHWC)
boost::multi_array<float, 4> h_dummy_input(boost::extents[batch_size][input_height][input_width][input_channel]);
for (int inner_batch_size = 0; inner_batch_size < batch_size; ++inner_batch_size) {
   for (int inner_height = 0; inner_height < input_height; ++inner_height) {
      for (int inner_width = 0; inner_width < input_width; ++inner_width) {
         for (int inner_channel = 0; inner_channel < input_channel; ++inner_channel) {
            h_dummy_input[inner_batch_size][inner_height][inner_width][inner_channel] = 0.12345f;
         }
      }
   }
}

cudaMalloc(&d_dummy_input, dummy_input_bytes);
cudaMemcpy(d_dummy_input, h_dummy_input.data(), dummy_input_bytes, cudaMemcpyHostToDevice);

// output data
cudaMalloc(&d_output, output_bytes);
cudaMemset(d_output, 0, output_bytes);

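// 3 by 3 dummy kernel. The template has a fixed size because an array whose
// size is a runtime value cannot portably take an initializer list.
const float kernel_template[3][3] = {
   {1.1f,  1.2f, 1.3f},
   {1.4f, -1.5f, 1.6f},
   {1.7f,  1.8f, 1.9f}
};

// Filter layout is [out_channel][in_channel][row][column], flattened into one vector.
vector<float> h_dummy_kernel((size_t)output_channel * input_channel * kernel_height * kernel_width);
size_t kernel_index = 0;
for (int kernel = 0; kernel < output_channel; ++kernel) {
   for (int channel = 0; channel < input_channel; ++channel) {
      for (int row = 0; row < kernel_height; ++row) {
         for (int column = 0; column < kernel_width; ++column) {
            // the modulo keeps the lookup in bounds if a larger kernel is requested
            h_dummy_kernel[kernel_index++] = kernel_template[row % 3][column % 3];
         }
      }
   }
}

size_t kernel_bytes = h_dummy_kernel.size() * sizeof(float);
cudaMalloc(&d_dummy_kernel, kernel_bytes);
cudaMemcpy(d_dummy_kernel, h_dummy_kernel.data(), kernel_bytes, cudaMemcpyHostToDevice);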

}

void freeDeviceData(){
cudaFree(d_dummy_kernel);
cudaFree(d_dummy_input);
cudaFree(d_output);
cudaFree(d_workspace);
}

int main( int argc, char** argv )
{

post();

//TODO 4 : The number of Input and Output feature maps is a multiple of 8.

int batch_size_hparam[1] = {128};
int input_channel_hparam[1] = {8};
int input_height_hparam[14] = {32, 40, 48, 56, 64, 80, 96, 112, 128, 152, 176, 200, 224, 256};

int kernel_height = 3;
int kernel_width = kernel_height;

for(int x = 0; x < 1; x++){

int batch_size = batch_size_hparam[x];

      for(int y = 0; y < 1; y++){

        int input_channel = input_channel_hparam[y];
        int output_channel = input_channel;

         for(int z = 0; z < 14; z++){
            int input_height = input_height_hparam[z];
            int input_width = input_height;
            int output_height = input_height;
            int output_width = output_height;

            // cout << "hyper-param: batch_size = " << batch_size << " | input_channel = " << input_channel << " | input_height = " << input_height << endl;

            initializeDescriptor(
            batch_size,
            input_channel,
            output_channel,
            input_height,
            input_width,
            output_height,
            output_width,
            kernel_height,
            kernel_width,
            CUDNN_DATA_HALF,
            CUDNN_DEFAULT_MATH);

            createDeviceData(
            batch_size,
            input_channel,
            output_channel,
            input_height,
            input_width,
            output_height,
            output_width,
            kernel_height,
            kernel_width);

            double start_time = get_wall_time();
            // execute
            const float alpha = 1, beta = 0;
            for(int i = 0; i < 10000; i++){

            // TODO 2 : for the Tensor Core runs, call cudnnConvolutionForward with algo = CUDNN_CONVOLUTION_FWD_ALGO_IMPLICIT_PRECOMP_GEMM.
            // As posted, this listing is set up for the Winograd run.
            checkCUDNN(cudnnConvolutionForward(cudnn,
                                                      &alpha,
                                                      input_descriptor,
                                                      d_dummy_input,
                                                      kernel_descriptor,
                                                      d_dummy_kernel,
                                                      convolution_descriptor,
                                                      CUDNN_CONVOLUTION_FWD_ALGO_WINOGRAD,
                                                      d_workspace,
                                                      workspace_bytes,
                                                      &beta,
                                                      output_descriptor,
                                                      d_output));
            }
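            // cuDNN calls are queued asynchronously; wait for the GPU to
            // finish before stopping the clock.
            cudaDeviceSynchronize();
            double end_time = get_wall_time();

            // cout<<"time cost : "<< end_time - start_time <<"s." << endl;
            cout<< end_time - start_time <<endl;

            freeDeviceData();

            // initializeDescriptor() creates the handle and all descriptors on
            // every iteration, so they are destroyed on every iteration as well.
            cudnnDestroyTensorDescriptor(input_descriptor);
            cudnnDestroyTensorDescriptor(output_descriptor);
            cudnnDestroyFilterDescriptor(kernel_descriptor);
            cudnnDestroyConvolutionDescriptor(convolution_descriptor);
            cudnnDestroy(cudnn);

         }
      }
}

return 0;
}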

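Comment
Where did you buy the rack (⊙_⊙?)

Comment
Everything Shu-niang wrote is in Chinese, yet I can't understand any of it.

Comment
Just curious, does the tt5 have a "faith light"?

Comment

Taobao, BC1.

Comment

No light...

Comment

Just curious, how good is Ubuntu's support for 4K screens?

Comment

Probably 50 lengths ahead of Windows and Mac.

Comment

That good, huh...

Or are you trying to trick me into using Ubuntu again...

My impression of Ubuntu is still stuck at the point where the first thing you do after installing it is replace SimSun, YaHei and the like...

I've used a Mac all these years, but the system has more and more problems, and I still can't make up my mind to switch to Windows. The main reason is that the Mac's Chinese fonts and trackpad are much better than on Windows.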

评论
pascal开始大核心支持FP32:FP16=1:2，软件上有没有屏蔽不知道（有Quadro GP100的可以试试）
三大框架要支持这玩意不知道还需要多久，感觉目前状态下NV还是很难说服大部分人多花一倍多的价格购卖Titan V……
或许该考虑出个16GB/24GB显存的版本

评论
Titan V不是被爆出计算错误么？

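Comment

Even 16.04 and 18.04 are in the same sorry state. Google Noto is integrated now, but all sorts of problems keep coming, and Infinality is no longer maintained, so even if you want to tinker with it yourself there is nowhere to start.

It's just that real experts don't merely use Linux; they work in an all-English environment, never type Chinese and hardly ever read it, so they don't have the Chinese-language troubles you describe.

The reason Linux looks so high-end is mainly the language environment, not the operating system.

Comment

Then how do English fonts look on a 4K screen?

Ubuntu has taken too many detours over the years, Unity, Upstart and so on; the development resources were wasted.

Comment

Starting with 16.04 the Google Noto fonts are integrated, and English looks quite nice at 4K.

Comment

The Linux desktop display is programmable, and that is the coolest part; on Windows you would probably have to write a driver.

Comment
To sum up in four characters: 不明觉厉 (I don't understand it, but it sounds impressive)!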

Comment
Just a poor student passing by.