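Topic: Question about matmul efficiency compared with BLAS
yeg001
[Expert points: 14390] Posted on 2010-04-22 00:44:00
Has anyone compared the built-in matrix and vector operations (such as matmul) with BLAS? If so, could you share your experience?
Replies (56 in total)
Reply #51
yeg001 [Expert points: 14390] Posted on 2010-07-31 15:27:00
Building on my earlier test program: since I now want to test the parallel library, and given how cpu_time behaves (it measures CPU time rather than elapsed wall-clock time), I added system_clock so the two can be compared.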
program test_blasLV3
implicit none
integer, parameter :: order = 2000
integer(kind=8) :: time0, time1, time_rate
real(kind=8):: A(order, order), B(order, order), C(order, order)
real(kind = 8) :: time_begin, time_end
[color=008000]! CALL RANDOM(A)
! CALL RANDOM(B)[/color]
[color=FF0000] CALL RANDOM_NUMBER(A)
CALL RANDOM_NUMBER(B)[/color]
CALL system_clock(count=time0)
CALL CPU_TIME(time_begin)
C=matmul(A, B)
CALL system_clock(count=time1, count_rate=time_rate)
CALL CPU_TIME(time_end)
WRITE(*,*)"consumed SYSTEM_time(s):", real(time1 - time0, kind = 8) / real(time_rate, kind = 8)
WRITE(*,*)"consumed CPU_time(s):", time_end - time_begin
CALL system_clock(count=time0)
CALL CPU_TIME(time_begin)
CALL dgemm('N', 'N', order, order, order, 1.0_8, A, order, B, order, 0.0_8, C, order)
CALL system_clock(count=time1, count_rate=time_rate)
CALL CPU_TIME(time_end)
WRITE(*,*)"consumed SYSTEM_time(s):", real(time1 - time0, kind = 8) / real(time_rate, kind = 8)
WRITE(*,*)"consumed CPU_time(s):", time_end - time_begin
end program
---------------------------------------------------------------------------
First, the serial GotoBLAS2 again:
GotoBLAS build complete.
OS ... Linux
Architecture ... x86_64
BINARY ... 64bit
C compiler ... GCC (command line : gcc)
Fortran compiler ... INTEL (command line : ifort)
Library Name ... libgoto2_core2-r1.13.a (Single threaded)
[wangxh6@c0112 Blas_lib_test]$ make
ifort -O3 -xSSE3 -static-intel -msse3 -c Blas_lev3.f90
ifort -O3 -xSSE3 -static-intel -msse3 -o test Blas_lev3.o libgoto2_core2-r1.13.a
now up2dated!
[wangxh6@c0112 Blas_lib_test]$ ./tsst
-bash: ./tsst: No such file or directory
[wangxh6@c0112 Blas_lib_test]$ ./test
consumed SYSTEM_time(s): 4.19132600000000
consumed CPU_time(s): 4.19036200000000
consumed SYSTEM_time(s): 3.07168900000000
consumed CPU_time(s): 3.07153300000000
---------------------------------------------------------------------------
This one is the parallel library:
GotoBLAS build complete.
OS ... Linux
Architecture ... x86_64
BINARY ... 64bit
C compiler ... GCC (command line : gcc)
Fortran compiler ... INTEL (command line : ifort)
Library Name ... libgoto2_core2p-r1.13.a (Multi threaded; Max num-threads is 4)
[wangxh6@c0112 Blas_lib_test]$ make
ifort -O3 -xSSE3 -static-intel -msse3 -o test Blas_lev3.o libgoto2_core2p-r1.13.a
now up2dated!
[wangxh6@c0112 Blas_lib_test]$ ./test
consumed SYSTEM_time(s): 4.29963900000000
consumed CPU_time(s): 4.79727100000000
consumed SYSTEM_time(s): 0.829618000000000
consumed CPU_time(s): 3.28850000000000
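One way to read the two clocks together: with a multithreaded BLAS, CPU_TIME here evidently adds up the time of all threads, while SYSTEM_CLOCK follows elapsed time, so their ratio gives a rough count of the cores kept busy. The exact behaviour of CPU_TIME is processor-dependent, so treat this as a minimal sketch; the program and variable names are only illustrative, and the numbers are the dgemm figures from the parallel run above.

```fortran
program busy_cores
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  ! dgemm timings copied from the multithreaded run above
  real(dp), parameter :: wall_s = 0.829618_dp   ! SYSTEM_CLOCK, seconds
  real(dp), parameter :: cpu_s  = 3.28850_dp    ! CPU_TIME, seconds
  ! CPU time / elapsed time ~ average number of cores kept busy
  write(*,'(a,f6.2)') 'average busy cores: ', cpu_s / wall_s
end program busy_cores
```

A ratio near 4 is consistent with the 4-thread library build, while the matmul lines of the same run give a ratio close to 1, as expected for single-threaded code.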
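Reply #52
yeg001 [Expert points: 14390] Posted on 2010-07-31 15:40:00
Each node of the workstation is dual-socket, and each CPU has 2 cores, so the thread count is 4.
I increased the matrix order from 2000 to 5000. The parallel efficiency appears to be high, which differs from my experience with my real application (where the matrix order is generally below 1000).
---------------------------------------------------------------------------
Serial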
[wangxh6@c0112 Blas_lib_test]$ make
ifort -O3 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -c Blas_lev3.f90
ifort -O3 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -o test Blas_lev3.o libgoto2_core2-r1.13.a
now up2dated!
[wangxh6@c0112 Blas_lib_test]$ ./test
consumed SYSTEM_time(s): 75.1274450000000
consumed CPU_time(s): 75.1145810000000
consumed SYSTEM_time(s): 47.4501470000000
consumed CPU_time(s): 47.4427880000000
---------------------------------------------------------------------------
Parallel
[wangxh6@c0112 Blas_lib_test]$ make
ifort -O3 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -c Blas_lev3.f90
ifort -O3 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -o test Blas_lev3.o libgoto2_core2p-r1.13.a
now up2dated!
[wangxh6@c0112 Blas_lib_test]$ ./test
consumed SYSTEM_time(s): 77.0329820000000
consumed CPU_time(s): 77.5072180000000
consumed SYSTEM_time(s): 12.6320180000000
consumed CPU_time(s): 50.2303640000000
---------------------------------------------------------------------------
47.4 / 12.6 = 3.76
That is a fairly good parallel speedup on 4 threads.
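The same arithmetic can be carried one step further to parallel efficiency and an approximate floating-point rate. The sketch below assumes the usual 2n^3 operation count for an n x n matrix product and takes the two dgemm wall-clock times from the runs above; the program name and constants are only illustrative.

```fortran
program speedup_report
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 5000, nthreads = 4
  real(dp), parameter :: t_serial   = 47.4501470_dp  ! dgemm, single-threaded library
  real(dp), parameter :: t_parallel = 12.6320180_dp  ! dgemm, 4-thread library
  real(dp) :: flops, speedup

  flops   = 2.0_dp * real(n, dp)**3   ! approximate operation count (assumption: 2n^3)
  speedup = t_serial / t_parallel

  write(*,'(a,f6.2)') 'speedup:             ', speedup
  write(*,'(a,f6.2)') 'parallel efficiency: ', speedup / real(nthreads, dp)
  write(*,'(a,f6.2)') 'GFLOP/s (serial):    ', flops / t_serial   / 1.0e9_dp
  write(*,'(a,f6.2)') 'GFLOP/s (parallel):  ', flops / t_parallel / 1.0e9_dp
end program speedup_report
```

Reporting a rate rather than a raw time makes runs with different matrix orders easier to compare.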
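Reply #53
yeg001 [Expert points: 14390] Posted on 2010-07-31 16:18:00
cgl_lgs, I copied the program you posted as-is and ran it on the workstation, and my timings differ quite a lot from yours.
First, a description of the situation. In the serial case I watched the CPU utilization: over several refreshes of top, it stayed below 90%, which puzzles me a little.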
Tasks: 72 total, 3 running, 69 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0% us, 0.3% sy, 0.0% ni, 90.0% id, 0.0% wa, 9.7% hi, 0.0% si
Cpu1 : 88.3% us, 0.0% sy, 0.0% ni, 0.0% id, 0.0% wa, 11.7% hi, 0.0% si
Cpu2 : 0.0% us, 0.3% sy, 0.0% ni, 91.3% id, 0.0% wa, 8.3% hi, 0.0% si
Cpu3 : 0.0% us,
Then the test results:
[wangxh6@c0112 Blas_lib_test]$ ifort -O3 -xSSE3 -static-intel -msse3 -o test2 MoreTimesButSmallMatrix.f90 libgoto2_core2-r1.13.a
[wangxh6@c0112 Blas_lib_test]$ ./test2
Dynamic
consumed CPU_time(s): 7.7528210000
consumed CPU_time(s): 12.8580460000
Static
consumed CPU_time(s): 3.2395070000
consumed CPU_time(s): 12.7530620000
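[color=FF0000]So in the serial case, GotoBLAS is actually slower than matmul here. Yet the results in the earlier posts still show BLAS outperforming matmul...[/color]
--------------------------------------------------------------------------
Below is the parallel library: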
[wangxh6@c0112 Blas_lib_test]$ ifort -O3 -xSSE3 -static-intel -msse3 -o test2 MoreTimesButSmallMatrix.f90 libgoto2_core2p-r1.13.a
[wangxh6@c0112 Blas_lib_test]$ ./test2
Dynamic
consumed CPU_time(s): 8.4797110000
consumed SYSTEM_time(s): 7.99644100000000
consumed CPU_time(s): 13.1230040000
consumed SYSTEM_time(s): 13.1251480000000
Static
consumed CPU_time(s): 3.2105120000
consumed SYSTEM_time(s): 3.21062200000000
consumed CPU_time(s): 12.6910710000
consumed SYSTEM_time(s): 12.6931770000000
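This run links the parallel library libgoto2_core2p-r1.13.a; I added system_clock to the code for comparison.
It turns out that nothing ran in parallel at all: cpu_time = system_clock!
Reply #54
yeg001 [Expert points: 14390] Posted on 2010-08-18 17:51:00
First, an apology: through my carelessness, the code I originally posted had a serious flaw.
When I wrote it I had never used Fortran's random numbers. Although I had looked through the intrinsic functions listed at the back of a Fortran reference, I was not paying attention and simply used C's random() to generate the random numbers (without setting a seed, either). Perhaps IVF happens to provide an extension routine with that name, which would explain why no error was reported. But random(A) left every element of matrix A equal to zero.
Many thanks to forum member "dongyuanxun 勋" for pointing this out.
In the earlier posts I have commented out the wrong code and put in the correct code (the red part). Since the random numbers themselves are not what we care about, I did not call RANDOM_SEED().
For the revised code, see reply #7 or reply #51.
The results from a test on one node of the workstation are broadly consistent with the earlier ones.
This still uses the serial GotoBLAS library.
------ Optimization option -O2 ----------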
[wangxh6@c0102 Blas_lib_test]$ make clean
rm -f Blas_lev3.o *.mod
[wangxh6@c0102 Blas_lib_test]$ make
ifort -O2 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -c Blas_lev3.f90
ifort -O2 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -o test Blas_lev3.o libgoto2_core2-r1.13.a
now up2dated!
[wangxh6@c0102 Blas_lib_test]$ ./test
consumed SYSTEM_time(s): 43.2336820000000
consumed CPU_time(s): 43.1174440000000
consumed SYSTEM_time(s): 2.77990800000000
consumed CPU_time(s): 2.77557800000000
------ Optimization option -O3 ----------
[wangxh6@c0102 Blas_lib_test]$ make clean
rm -f Blas_lev3.o *.mod
[wangxh6@c0102 Blas_lib_test]$ make
ifort -O3 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -c Blas_lev3.f90
ifort -O3 -xSSE3 -msse3 -mcmodel=medium -i-dynamic -o test Blas_lev3.o libgoto2_core2-r1.13.a
now up2dated!
[wangxh6@c0102 Blas_lib_test]$ ./test
consumed SYSTEM_time(s): 6.25644200000000
consumed CPU_time(s): 6.23405200000000
consumed SYSTEM_time(s): 2.76002700000000
consumed CPU_time(s): 2.75558100000000
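The measured times are broadly in line with the earlier tests that mistakenly used random(). My guess is that the compiler did not know that all the matrix elements were zero, so it did not optimize much on that basis.
Although these tests ran on different nodes of the workstation, every node has the same CPUs: dual-socket, dual-core, older Xeons (I don't remember the model). The only difference is the amount of memory, so the times can still be compared. Clearly, with nonzero data the computation times all went up, matmul's more sharply, so the gap between the two widened.
Although the earlier code did have that flaw, the results are still useful as a reference. Once more, my apologies; it was my oversight.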
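A cheap guard against this kind of slip is to check the inputs before timing anything and to confirm that both routines produce the same product. The following is only a sketch: the program name and the small order n = 200 are illustrative choices, and it assumes a BLAS library providing dgemm is linked, as in the builds above.

```fortran
program check_inputs
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: n = 200            ! small order, enough for a sanity check
  real(dp) :: A(n,n), B(n,n), C1(n,n), C2(n,n)
  external :: dgemm                        ! supplied by the linked BLAS library

  call random_number(A)
  call random_number(B)

  ! Guard against the all-zero input that the mistaken RANDOM call produced
  if (all(A == 0.0_dp) .or. all(B == 0.0_dp)) stop 'input matrix is all zeros'

  C1 = matmul(A, B)
  call dgemm('N', 'N', n, n, n, 1.0_dp, A, n, B, n, 0.0_dp, C2, n)

  ! The two results should agree up to rounding error
  write(*,*) 'max |matmul - dgemm| =', maxval(abs(C1 - C2))
end program check_inputs
```

A difference far above rounding level, or the all-zero stop, would indicate that the benchmark is not measuring what it is meant to.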
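Reply #55
cgl_lgs [Expert points: 21040] Posted on 2010-08-18 23:11:00
Heh, so that's how it is :)
However, the multi-machine test plan has to be put on hold for now: MPICH2 has very poor support for Chinese machine names under XP and simply will not run, so I can only try again once the bug I reported to MPICH2 is fixed :)
But one thing is already clear from a single machine:
Parallelism is clearly very effective for large matrices, while for small matrices all the time goes into communication, which is not worthwhile :)
So for computations that take less than a millisecond or so, it is better not to parallelize :)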
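To see roughly where that crossover lies on a given machine, one option is to time dgemm over a range of matrix orders and build the program twice, once against the single-threaded library and once against the multithreaded one. This is a minimal sketch, not the test program used earlier in the thread; the sizes, the repeat counts, and the program name are arbitrary illustrative choices, and a BLAS library providing dgemm must be linked.

```fortran
program dgemm_size_sweep
  implicit none
  integer, parameter :: dp = kind(1.0d0)
  integer, parameter :: i8 = selected_int_kind(18)
  integer, parameter :: sizes(5) = [100, 200, 500, 1000, 2000]
  real(dp), allocatable :: A(:,:), B(:,:), C(:,:)
  integer(i8) :: t0, t1, rate
  integer :: i, n, rep, nrep
  real(dp) :: per_call
  external :: dgemm                  ! supplied by the linked BLAS library

  call system_clock(count_rate=rate)
  do i = 1, size(sizes)
    n = sizes(i)
    nrep = max(1, 4000 / n)          ! repeat small cases so the clock can resolve them
    allocate(A(n,n), B(n,n), C(n,n))
    call random_number(A)
    call random_number(B)

    call system_clock(count=t0)
    do rep = 1, nrep
      call dgemm('N', 'N', n, n, n, 1.0_dp, A, n, B, n, 0.0_dp, C, n)
    end do
    call system_clock(count=t1)

    ! Wall-clock seconds per call, averaged over the repeats
    per_call = real(t1 - t0, dp) / real(rate, dp) / real(nrep, dp)
    write(*,'(a,i5,a,es12.4)') 'n = ', n, '   seconds per dgemm call: ', per_call
    deallocate(A, B, C)
  end do
end program dgemm_size_sweep
```

Comparing the per-call times of the two builds at each order shows where the threaded library starts to pay off; the first timed call may also include one-off thread start-up cost.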
Reply #56
pankejia [Expert points: 0] Posted on 2012-12-24 11:17:00
There doesn't seem to be any real advantage; all three take about the same time!
Test platform: Ubuntu 11.04 64-bit + ifort 12.1.0 + gcc 4.5.2
CPU: i5 760, 4 GB RAM
$ ifort a.f90 libgoto2_nehalemp-r1.13.a -O3 -xsse4.2 -o main
$ ./main
consumed CPU_time(s): 1.80000000000000
consumed CPU_time(s): 1.47000000000000
$ ifort a.f90 -mkl -O3 -xsse4.2 -o main
$ ./main
consumed CPU_time(s): 1.76000000000000
consumed CPU_time(s): 1.50000000000000