Question about matmul efficiency compared with BLAS | 编程爱好者社区 (Programming Enthusiasts Community)

Topic: Question about matmul efficiency compared with BLAS

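yeg001 [Expert points: 14390] posted on 2010-04-22 00:44:00

Has anyone compared Fortran's built-in matrix and vector operations (such as matmul) with the BLAS equivalents? If so, could you share your experience?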

Thread URL: http://bbs.pfan.cn/post/320160.html

Replies (56 in total)

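Reply 1

f2003 [Expert points: 7960] posted on 2010-04-22 16:07:00

BLAS defines a set of vector and matrix operations. Even now, papers studying the efficiency of these operations keep appearing in computational mathematics journals.

I doubt many people still use the original reference BLAS code from netlib. They use optimized implementations instead, such as ATLAS and GotoBLAS. MKL and IMSL also include BLAS, likewise tuned for specific processors and compilers. These libraries are certainly faster than the Fortran intrinsic procedures, but because calling an external procedure has a cost, they are generally used only when the arrays involved exceed a certain size. gfortran has compile options that do exactly this.

I think GotoBLAS is probably the fastest. It computes with the processor's vector units, and although its source files are all C, most of the core kernels are written in inline assembly.

If your code depends heavily on these operations, try several libraries yourself and compare their performance.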
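
The size-threshold idea described above can be sketched by hand as follows. This is only an illustration: the module name `mm_dispatch`, the routine `mm`, and the crossover value `blas_limit = 64` are assumptions made for the example, not values from the thread, and the real crossover has to be measured on each machine. The program must be linked against some BLAS library that provides `dgemm`. In gfortran the options the post alludes to are presumably `-fexternal-blas` together with `-fblas-matmul-limit=n`, which ask the compiler to do this dispatch for `matmul` automatically.

```fortran
! Sketch: pick matmul or an external BLAS dgemm by problem size.
! Build with any available BLAS, for example: gfortran mm_dispatch.f90 -lblas
module mm_dispatch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! Hypothetical crossover size; the right value must be measured per machine.
  integer, parameter :: blas_limit = 64
contains
  subroutine mm(a, b, c)
    real(dp), intent(in)  :: a(:, :), b(:, :)
    real(dp), intent(out) :: c(:, :)
    integer :: m, n, k
    external :: dgemm
    m = size(a, 1); k = size(a, 2); n = size(b, 2)
    if (size(b, 1) /= k .or. size(c, 1) /= m .or. size(c, 2) /= n) then
      error stop "mm: array shapes do not conform"
    end if
    if (max(m, n, k) < blas_limit) then
      ! Small case: stay with the intrinsic and avoid the external call.
      c = matmul(a, b)
    else
      ! Large case: c = 1.0*a*b + 0.0*c through the optimized library.
      call dgemm('N', 'N', m, n, k, 1.0_dp, a, max(1, m), b, max(1, k), &
                 0.0_dp, c, max(1, m))
    end if
  end subroutine mm
end module mm_dispatch

program demo
  use mm_dispatch
  implicit none
  integer, parameter :: sizes(2) = [8, 200]   ! one size on each side of the limit
  real(dp), allocatable :: a(:, :), b(:, :), c(:, :)
  integer :: i, n
  do i = 1, size(sizes)
    n = sizes(i)
    allocate(a(n, n), b(n, n), c(n, n))
    call random_number(a)
    call random_number(b)
    call mm(a, b, c)
    ! Both paths should agree with matmul up to rounding.
    print *, 'n =', n, ' max difference from matmul =', maxval(abs(c - matmul(a, b)))
    deallocate(a, b, c)
  end do
end program demo
```

A hand-written wrapper like this is mainly useful when the compiler offers no such option or when different call sites need different thresholds.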

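Reply 2

yeg001 [Expert points: 14390] posted on 2010-04-22 17:22:00

Thanks for sharing your experience, f2003. My program does use these operations heavily, and the run time grows quickly as the problem size increases, so I hope to speed it up by optimizing some of the computations.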

Reply 3

yeg001 [Expert points: 14390] posted on 2010-04-22 18:44:00

I registered at TACC (Texas). It seems only GotoBLAS2 is available for download now. I'll test it in a few days.

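Reply 4

f2003 [Expert points: 7960] posted on 2010-04-22 19:07:00

Few people pay attention to BLAS; I can tell you like to think things through.
You should also look at cuBLAS, the BLAS library included in the NVIDIA CUDA SDK.

CUBLAS now supports all BLAS1, 2, and 3 routines including those for single and double precision complex numbers

http://developer.nvidia.com/object/cuda_3_0_downloads.html

GPUs are the next direction for numerical computing. With a well-designed program they are much faster than CPUs. NVIDIA ported BLAS to the GPU long ago.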

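Reply 5

yeg001 [Expert points: 14390] posted on 2010-04-22 23:11:00

I have been following the news about CUDA and other GPU computing. But my boss only provides workstations and probably won't equip us with graphics cards for computation, so for now I am just watching from the sidelines.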

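Reply 6

allocate [Expert points: 540] posted on 2010-04-25 00:06:00

[quote]Few people pay attention to BLAS; I can tell you like to think things through.
You should also look at cuBLAS, the BLAS library included in the NVIDIA CUDA SDK.

CUBLAS now supports all BLAS1, 2, and 3 routines including those for single and double precision complex numbers

http://developer.nvidia.com/object/cuda_3_0_downloads.html

GPUs are the next direction for numerical computing. With a well-designed program they are much faster than CPUs. NVIDIA ported BLAS to the GPU long ago.[/quote]
Does the GPU version support double precision now? Our program produces meaningless results in single precision, and we have been waiting for double precision support.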

Reply 7

yeg001 [Expert points: 14390] posted on 2010-04-26 17:04:00

I compiled a serial (single-threaded) GotoBLAS library myself:
  OS               ... Linux
  Architecture     ... x86_64
  BINARY           ... 64bit
  C compiler       ... GCC  (command line : gcc)
  Fortran compiler ... INTEL  (command line : ifort)
  Library Name     ... libgoto2_core2p-r1.13.a(single threaded)

I quickly wrote a matrix multiplication program to compare it with matmul.
code:

program test_blas
implicit none
  real(kind=8):: A(2000, 2000), B(2000, 2000), C(2000, 2000)
  real(kind = 8) :: time_begin, time_end
  external :: dgemm   ! BLAS routine supplied by the linked library (GotoBLAS or MKL)

  ! Fill the inputs with uniform random numbers in [0, 1).
  ! (An earlier version of this post called the non-standard RANDOM instead.)
  CALL RANDOM_NUMBER(A)
  CALL RANDOM_NUMBER(B)

  ! Time the intrinsic matmul.
  CALL CPU_TIME(time_begin)
    C=matmul(A, B)
  CALL CPU_TIME(time_end)
  WRITE(*,*)"consumed CPU_time(s):", time_end - time_begin

  ! Time the BLAS call, which computes C = 1.0*A*B + 0.0*C.
  CALL CPU_TIME(time_begin)
  CALL dgemm('N', 'N', 2000, 2000, 2000, 1.0_8, A, 2000, B, 2000, 0.0_8, C, 2000)
  CALL CPU_TIME(time_end)
  WRITE(*,*)"consumed CPU_time(s):", time_end - time_begin
end program

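The compiler is ivf 11.1.072, with MKL 10.2 Update 5.
The results are as follows (in each run the first time is matmul and the second is dgemm):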
[wangxh6@c0117 temppro]$ ifort test_blas.f90 libgoto2_core2-r1.13.a
[wangxh6@c0117 temppro]$ ./a.out
consumed CPU_time(s):   38.0132210000000
consumed CPU_time(s):   2.70958700000001

[wangxh6@c0117 temppro]$ ifort test_blas.f90 -lmkl_intel_lp64 -lmkl_core -lpthread -lmkl_sequential
[wangxh6@c0117 temppro]$ ./a.out
consumed CPU_time(s):   38.0482160000000
consumed CPU_time(s):   2.66859400000000

[wangxh6@c0117 temppro]$ ifort test_blas.f90 -O3 libgoto2_core2-r1.13.a
[wangxh6@c0117 temppro]$ ./a.out
consumed CPU_time(s):   3.76642800000000
consumed CPU_time(s):   2.70958800000000

[wangxh6@c0117 temppro]$ ifort test_blas.f90 -O3 -lmkl_intel_lp64 -lmkl_core -lpthread -lmkl_sequential
[wangxh6@c0117 temppro]$ ./a.out
consumed CPU_time(s):   3.76842800000000
consumed CPU_time(s):   2.67259400000000

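The BLAS libraries are optimized. At the compiler's default optimization level (O2), matmul is astonishingly slow: about 38 s against about 2.7 s for dgemm. Switching to -O3 closes most of the gap (about 3.8 s). Adding CPU-specific options (-xSSE3 -msse3) has no noticeable effect; the timings barely change.
So far MKL's BLAS also looks well optimized.
Is this a reasonable way to test? I would appreciate any pointers from forum members.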
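
On the question of whether the test is sound, here is one possible refinement, offered as a sketch rather than a definitive benchmark. It keeps the 2000 x 2000 size from the post but makes three changes that are my own assumptions about what is worth checking: the arrays are allocatable instead of large static arrays, the timer is wall-clock `system_clock` (with a multithreaded BLAS, `CPU_TIME` may report time summed over threads), and the two results are compared so that a wrong argument to `dgemm` would be noticed. It also converts times to GFLOP/s using about 2*n**3 floating-point operations per product. By that measure, the timings posted above correspond to roughly 0.42 GFLOP/s for matmul at O2 and roughly 5.9 GFLOP/s for dgemm. The program name and variable names are illustrative, and it must be linked against a BLAS library.

```fortran
! Sketch of a slightly more careful benchmark; link against a BLAS, for example:
!   gfortran -O2 bench.f90 -lblas
program bench_matmul_dgemm
  use, intrinsic :: iso_fortran_env, only: dp => real64, i8 => int64
  implicit none
  integer, parameter :: n = 2000            ! same size as in the post
  real(dp), allocatable :: a(:, :), b(:, :), c1(:, :), c2(:, :)
  real(dp) :: t_matmul, t_dgemm, gflop, max_diff
  integer(i8) :: t0, t1, rate
  integer :: i, j
  external :: dgemm

  allocate(a(n, n), b(n, n), c1(n, n), c2(n, n))
  call random_number(a)
  call random_number(b)
  ! About n**3 multiplications plus n**3 additions, expressed in units of 1e9.
  gflop = 2.0_dp * real(n, dp)**3 / 1.0e9_dp

  ! Wall-clock time of the intrinsic.
  call system_clock(t0, rate)
  c1 = matmul(a, b)
  call system_clock(t1)
  t_matmul = real(t1 - t0, dp) / real(rate, dp)

  ! Wall-clock time of the library call, c2 = 1.0*a*b + 0.0*c2.
  call system_clock(t0)
  call dgemm('N', 'N', n, n, n, 1.0_dp, a, n, b, n, 0.0_dp, c2, n)
  call system_clock(t1)
  t_dgemm = real(t1 - t0, dp) / real(rate, dp)

  ! Largest element-wise difference between the two results.
  max_diff = 0.0_dp
  do j = 1, n
    do i = 1, n
      max_diff = max(max_diff, abs(c1(i, j) - c2(i, j)))
    end do
  end do

  print '(a, f9.3, a, f9.2, a)', 'matmul: ', t_matmul, ' s  ', &
        gflop / t_matmul, ' GFLOP/s'
  print '(a, f9.3, a, f9.2, a)', 'dgemm : ', t_dgemm, ' s  ', &
        gflop / t_dgemm, ' GFLOP/s'
  print '(a, es10.3)', 'max |matmul - dgemm| = ', max_diff
end program bench_matmul_dgemm
```

Running each measurement several times and keeping the best time would reduce noise further; that is left out here to keep the sketch short.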

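Reply 8

f2003 [Expert points: 7960] posted on 2010-04-26 21:54:00

That's fine. It is already quite impressive that ifort can optimize matmul to this level.

Intel has not published which optimizations MKL contains. All we know is that Intel did modify the source of libraries such as BLAS, LAPACK, and FFTW. In other words, the code is tuned for the compiler so that the compiler's optimizations can take effect.

When Sandy Bridge comes out at the end of the year, SSE will be upgraded to AVX and the vector units will double in width. Operations like matrix multiplication will benefit the most: at the same clock frequency, Sandy Bridge should be a full factor of two faster than Core 2 and Nehalem.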
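
The factor of two in that prediction follows from simple arithmetic on register width, which the short program below works through. It is a back-of-the-envelope sketch under one stated assumption that is not in the post: each core can issue one vector add and one vector multiply per cycle. Under that assumption a 128-bit SSE register holds 2 doubles for a peak of 4 double-precision operations per cycle, and a 256-bit AVX register holds 4 doubles for a peak of 8. Real speedups also depend on memory bandwidth and on how well the library kernels use the wider registers.

```fortran
! Sketch: theoretical double-precision peak per cycle for SSE versus AVX.
program simd_peak
  implicit none
  integer, parameter :: bits_per_double = 64
  integer, parameter :: widths(2) = [128, 256]   ! SSE and AVX register widths in bits
  character(len=3), parameter :: names(2) = ['SSE', 'AVX']
  ! Assumption: one vector add plus one vector multiply can issue each cycle.
  integer, parameter :: ops_per_cycle = 2
  integer :: i, lanes
  do i = 1, size(widths)
    lanes = widths(i) / bits_per_double          ! doubles per register
    print '(a, a, i0, a, i0, a)', names(i), ': ', lanes, &
          ' doubles per register, ', lanes * ops_per_cycle, &
          ' DP flops per cycle per core'
  end do
end program simd_peak
```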

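Reply 9

yeg001 [Expert points: 14390] posted on 2010-04-27 09:04:00

Thanks for the advice, f2003. When I have time I should consider switching my code to BLAS, since it contains a lot of matrix and vector operations.
Everyone is welcome to keep discussing optimization.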

Reply 10

vehicle [Expert points: 310] posted on 2010-04-30 17:59:00

[quote]yeg001's benchmark post from reply 7 above (GotoBLAS and MKL dgemm versus matmul), quoted in full.[/quote]

Is GotoBLAS open source?
I am also working with a large number of matrix-vector multiplications at the moment.
