[Original] Execution speed comparison of Lapack libraries compiled with CVF and IVF under WinXP (test programs attached)

hhsy [Expert points: 330] posted on 2007-07-13 22:15:00

First of all, thanks to f2003 for providing the Windows version of the lapack3.1 source files!

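Test environment:
Hardware:
Core Duo 2400 (1.8 GHz) CPU, 1 GB of 533 memory
Software:
Windows XP SP2, Compaq Visual Fortran 6.6A (CVF), Intel Visual Fortran 9.1 (IVF), Lapack 3.1

Three versions of the Lapack library were compiled:
1. Compiled with CVF, main optimization options OPTS     =  /optimize:5        (hereafter the CVF library)
2. Compiled with IVF, main optimization options OPTS     =  /O3 /QaxN /QxN /Qparallel (hereafter the IVF_N library)
3. Compiled with IVF, main optimization options OPTS     =  /O3 /QaxP /QxP /Qparallel (hereafter the IVF_P library)

The first library is not optimized for any particular CPU; the second can be used on ordinary Intel P4 processors; the third can be used on Intel dual-core CPUs or on Intel P4 CPUs that have SSE3 instructions.

Two programs were tested. The first factorizes a system of equations with SGETRF, then solves it with SGETRS and checks the error. The second factorizes a banded sparse system with SPBTRF and then solves it with SPBTRS. The program code is as follows: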
program sgetrf_test
use dfport !imsl
!DEC$ OBJCOMMENT LIB:'blas_WIN32.lib'
!DEC$ OBJCOMMENT LIB:'lapack_WIN32.lib'
!DEC$ OBJCOMMENT LIB:'tmglib_WIN32.lib'

character*1 trans
integer i,j,m,n,chkunit,temunit,info,lda,ldb,nrhs
integer,allocatable::ipiv(:)
real,allocatable::a(:,:),b(:) ,x(:)
real time1,time2,error
!external SGETRF
chkunit=1
temunit=2
open(chkunit,file='abc.chk')
!n=1000
1000    write(*,*)'input n='
read(*,*)n
if(n<=0)stop
m=n
lda=max(1,m)

open(temunit,file='abc.tem',FORM='UNFORMATTED')
allocate(a(n,n),b(n),x(n),ipiv(n))
a=0.;b=0.
!a=rand(a)
!b=rand(b)

do i=1,n
    do j=i+1,n
        a(i,j)=random(0)
        a(j,i)=a(i,j)
    end do
    a(i,i)=random(0)*5
    b(i)=random(0)*5
end do
write(temunit)a
write(temunit)b

call cpu_time(time1)
call SGETRF( M, N, A, LDA, IPIV, INFO )
call cpu_time(time2)
write(chkunit,*)'n and factorizing time=',n,time2-time1

trans='N';nrhs=1;ldb=lda
call SGETRS( TRANS, N, NRHS, A, LDA, IPIV, B, LDB, INFO )
x=b
rewind(temunit)
read(temunit)a
read(temunit)b
b=matmul(a,x)-b
error=dot_product(b,b)
write(chkunit,*)'error=',error
deallocate(a,b,x,ipiv)
close(temunit)
goto 1000
end program sgetrf_test
---------------------------------------------------
program SPBTRF_test
use dfport !imsl
!DEC$ OBJCOMMENT LIB:'blas_WIN32.lib'
!DEC$ OBJCOMMENT LIB:'lapack_WIN32.lib'
!DEC$ OBJCOMMENT LIB:'tmglib_WIN32.lib'

character*1 trans,uplo,date1*25,date2*25
integer i,j,m,n,chkunit,temunit,info,lda,ldab,ldb,nrhs,kd
integer,allocatable::ipiv(:)
real,allocatable::a(:,:),b(:) ,x(:)
real time1,time2,error
!external SGETRF
chkunit=1
temunit=2
open(chkunit,file='abc.chk')
!n=1000
1000    write(*,*)'input n='
read(*,*)n
if(n<=0)stop
m=n
kd=int(n/10)+2        !maxband
ldab=kd+1

open(temunit,file='abc.tem',FORM='UNFORMATTED')
allocate(a(ldab,n),b(n),x(n))
a=0.;b=0.;x=0.
!a=rand(a)
!b=rand(b)
uplo='L'
call cpu_time(time1)
do j=1,n
    do i=2,ldab
        a(i,j)=i+j
    end do
    a(1,j)=sum(a(2:ldab,j))
    b(j)=random(0)*5
end do
call cpu_time(time2)
write(chkunit,*)'generation matrix time=',time2-time1

call cpu_time(time1)
call SPBTRF( UPLO, N, KD, a, LDAB, INFO )
call cpu_time(time2)
write(chkunit,*)'n,maxband and factorizing time=',n,ldab,time2-time1

nrhs=1;ldb=n
call SPBTRS( UPLO, N, KD, NRHS, a, LDAB, B, LDB, INFO )
deallocate(a,b,x)
close(temunit)
goto 1000

2000    format(50f8.3)
end program SPBTRF_test
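
The SPBTRF test above fills the array a(ldab,n) directly in LAPACK's lower band storage (uplo='L'), where row 1 of each column holds the diagonal entry and rows 2..kd+1 hold the entries below it, i.e. AB(1+i-j, j) = A(i,j) for j <= i <= min(n, j+kd). The following is only a small illustrative sketch of that mapping; the sizes, the matrix values and the program name band_storage_demo are made up for the illustration and are not taken from the test program.

```fortran
program band_storage_demo
  implicit none
  ! Illustrative sizes only (not the sizes used in the benchmark)
  integer, parameter :: n = 5, kd = 2, ldab = kd + 1
  real :: a(n, n), ab(ldab, n)
  integer :: i, j

  ! Build a small symmetric banded matrix with kd sub-diagonals
  a = 0.0
  do j = 1, n
     do i = j, min(n, j + kd)
        if (i == j) then
           a(i, j) = 10.0
        else
           a(i, j) = real(i + j)
           a(j, i) = a(i, j)
        end if
     end do
  end do

  ! Pack the lower triangle into band storage:
  ! AB(1+i-j, j) = A(i, j) for j <= i <= min(n, j+kd)
  ab = 0.0
  do j = 1, n
     do i = j, min(n, j + kd)
        ab(1 + i - j, j) = a(i, j)
     end do
  end do

  print '(a)', 'Full matrix A:'
  do i = 1, n
     print '(5f7.1)', a(i, :)
  end do
  print '(a)', 'Band storage AB (row 1 = diagonal):'
  do i = 1, ldab
     print '(5f7.1)', ab(i, :)
  end do
end program band_storage_demo
```

The trailing entries of the lower rows of AB (the bottom-right corner) correspond to positions outside the matrix and are not referenced by SPBTRF.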

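The test results are as follows.
SGETRF: n is the order of the system, followed by the factorization time; the value after error is the computed error (the squared 2-norm of the residual A*x-b)
CVF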
n and factorizing time=        1000  0.5625000
error=  1.5267158E-04
n and factorizing time=        2000   4.375000
error=  0.1836250
n and factorizing time=        3000   14.56250
error=  1.9454738E-02
n and factorizing time=        4000   34.32812
error=  5.8565985E-02
n and factorizing time=        5000   67.81250
error=  6.5416552E-02

IVF_N
n and factorizing time=        1000  0.2812500
error=  2.3405529E-04
n and factorizing time=        2000   2.234375
error=  4.2834733E-02
n and factorizing time=        3000   7.484375
error=  4.1005041E-02
n and factorizing time=        4000   17.43750
error=  3.5797328E-02
n and factorizing time=        5000   36.84375
error=  0.8729380

IVF_P
n and factorizing time=        1000  0.3125000
error=  2.3405529E-04
n and factorizing time=        2000   2.500000
error=  4.2834733E-02
n and factorizing time=        3000   8.109375
error=  4.1005041E-02
n and factorizing time=        4000   19.00000
error=  3.5797328E-02
n and factorizing time=        5000   40.95312
error=  0.8729380

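SPBTRF: n is the order of the system; maxband is the half bandwidth, i.e. the number of elements from the first nonzero element to the main diagonal; the factorization time follows.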
CVF
generation matrix time=  6.2500000E-02
n,maxband and factorizing time=       10000        1003   7.687500
generation matrix time=  0.1718750
n,maxband and factorizing time=       20000        2003   60.01562
generation matrix time=  0.4218750
n,maxband and factorizing time=       30000        3003   200.8750

IVF_N
generation matrix time=  4.6875000E-02
n,maxband and factorizing time=       10000        1003   4.375000
generation matrix time=  0.1875000
n,maxband and factorizing time=       20000        2003   33.40625
generation matrix time=  0.4375000
n,maxband and factorizing time=       30000        3003   109.9375

IVF_P
generation matrix time=  4.6875000E-02
n,maxband and factorizing time=       10000        1003   6.390625
generation matrix time=  0.1875000
n,maxband and factorizing time=       20000        2003   49.01562
generation matrix time=  0.4375000
n,maxband and factorizing time=       30000        3003   164.2344

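Conclusions:
1. Judging from the two programs linked against the Lapack libraries built in the three different compiler setups, the IVF libraries are the most efficient and the CVF library is somewhat slower.
2. Of the two IVF libraries, IVF_N is more efficient than IVF_P. One difference I noticed: when running with the IVF_P library the CPU runs at 100%, i.e. both cores are working, whereas with the IVF_N library the CPU runs at 50%, i.e. only one core is working.

Discussion:
1. Why is running on two cores less efficient than running on one?
2. Normally, when I compile programs with IVF, the program still runs at 50% CPU, i.e. on one core, even with optimization options such as /QaxP /QxP. Yet a program linked against a Lapack library compiled with these options runs on both cores. Does Lapack use some special technique?

Note that no virtual memory was used in the tests above, so no time was spent reading from or writing to the hard disk.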

Last updated: 2007-07-13 22:20:00

Post URL: http://bbs.pfan.cn/post/242407.html

Replies (9 in total)

Reply 1

f2003 [Expert points: 7960] posted on 2007-07-14 14:10:00

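The original poster's work is very interesting.

Looking at the numbers above, first of all ivf is far faster than cvf, nearly twice as fast. I think the reason is the use of new instructions, especially sse. This strongly shows that upgrading to ivf is worthwhile. Moreover, if the program is run on a core2, it will be noticeably faster still without recompiling, because core2 can complete an sse instruction in one cycle whereas core needs two.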
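
To put a number on "nearly twice as fast", the SGETRF factorization times quoted in the first post can be turned into ratios. The sketch below only does that arithmetic; it assumes that the first, unlabeled block of SGETRF results belongs to the CVF library, and the program and array names are invented for the illustration.

```fortran
program speedup_from_post
  implicit none
  integer, parameter :: k = 5
  integer, parameter :: n(k) = [1000, 2000, 3000, 4000, 5000]
  ! SGETRF factorization times in seconds, copied from the first post.
  ! Assumption: the unlabeled first result block is the CVF library.
  real, parameter :: t_cvf(k)   = [0.5625, 4.375, 14.5625, 34.32812, 67.8125]
  real, parameter :: t_ivf_n(k) = [0.28125, 2.234375, 7.484375, 17.4375, 36.84375]
  real, parameter :: t_ivf_p(k) = [0.3125, 2.5, 8.109375, 19.0, 40.95312]
  integer :: i

  print '(a)', '     n   CVF/IVF_N   CVF/IVF_P'
  do i = 1, k
     ! A ratio above 1 means the IVF-built library finished faster
     print '(i6, 2f12.2)', n(i), t_cvf(i) / t_ivf_n(i), t_cvf(i) / t_ivf_p(i)
  end do
end program speedup_from_post
```

Keep in mind that these times were measured with cpu_time, so for a build that runs on both cores the ratio may not reflect elapsed time.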

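As for the question about the n and p cases, I think it has to do with the compiler. Dual core should certainly be able to beat single core; the reason it did not is that the optimization in ivf9 is still too weak. I suggest upgrading to ivf10 and trying the n and p cases again. ivf 9.0 and 9.1 mainly provide support for sse instructions and for core and core2, while ivf10 is mainly optimized for dual core. ivf10 uses the same license as ivf9. As for lapack, there is nothing special about it. You could also try compiling your earlier programs with ivf10; the results may change.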

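Reply 2

hhsy [Expert points: 330] posted on 2007-07-14 16:00:00

Thanks for the information. I will try it later when I have time.
For the past two days I have been compiling the BLACS and ScaLapack you provided under XP. It took a lot of time, but I still have not succeeded.
Building the BLACS library works, but none of the tests pass. The reason is probably that it contains many subroutines written in C, which involves calls between C and Fortran. This is rather complicated: the routine names do not match, and the linker keeps reporting that it cannot find the corresponding subroutines. Also, the library built from BLACS is meant to be shared by C and FORTRAN, so I keep getting the interfaces wrong at compile time.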

Reply 3

hhsy [Expert points: 330] posted on 2007-07-14 22:37:00

Results on Linux

SGETRF
IVF_P_Linux
n and factorizing time=        1000  0.3299510
error=  2.9141884E-04
n and factorizing time=        2000   2.522617
error=  4.9888656E-02
n and factorizing time=        3000   8.563700
error=  4.6666354E-02
n and factorizing time=        4000   20.51588
error=  3.3721190E-02
n and factorizing time=        5000   43.91033
error=  0.8857305

IVF_N_Linux
n and factorizing time=        1000  0.3019540
error=  2.9141884E-04
n and factorizing time=        2000   2.274655
error=  4.9888656E-02
n and factorizing time=        3000   7.553852
error=  4.6666358E-02
n and factorizing time=        4000   17.64132
error=  3.3721186E-02
n and factorizing time=        5000   37.61228
error=  0.8857305

SPBTRF
IVF_P_Linux
generation matrix time=  3.2995004E-02
n,maxband and factorizing time=       10000        1003   6.497013
generation matrix time=  0.1319790
n,maxband and factorizing time=       20000        2003   50.53831
generation matrix time=  0.3019524
n,maxband and factorizing time=       30000        3003   165.1369

IVF_N_Linux
generation matrix time=  3.1994998E-02
n,maxband and factorizing time=       10000        1003   4.375335
generation matrix time=  0.1339788
n,maxband and factorizing time=       20000        2003   33.44092
generation matrix time=  0.2999535
n,maxband and factorizing time=       30000        3003   110.0673

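Reply 4

山东益友 [Expert points: 0] posted on 2007-07-27 18:52:00

Could anyone tell me whether there is software on the market for calculating cutting dimensions in production and machining, mainly with the aim of controlling raw material consumption? Thanks.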

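Reply 5

junziyang [Expert points: 150] posted on 2007-12-18 10:34:00

It seems to me that IVF's optimization depends heavily on the CPU. My machine has a Celeron CPU, and my program mostly does matrix multiplication, with many matmul() calls. Using MATLAB's mex and compiling with IVF and with CVF, I found that CVF is about 10% more efficient than IVF. In a case like this, how should I optimize to get the best possible performance?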

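Reply 6

junziyang [Expert points: 150] posted on 2007-12-22 11:24:00

Today, on an intel dual-core machine, I compiled my program with each of the following:
1. Compiled with CVF, main optimization options OPTS     =  /optimize:5        (hereafter the CVF library)
2. Compiled with IVF, main optimization options OPTS     =  /O3 /QaxN /QxN /Qparallel (hereafter the IVF_N library)
3. Compiled with IVF, main optimization options OPTS     =  /O3 /QaxP /QxP /Qparallel (hereafter the IVF_P library)

I found that the programs built with 2 and 3 both run on two cpus, rather than 2 running on a single core as the original poster described. I also found that the IVF-compiled programs are slightly slower than the CVF one, even though the ivf programs run on both cores.

What is going on here? Is IVF's optimization done only for libraries such as LAPACK, with no optimization for Fortran intrinsic functions (matmul, for example)?

Any advice would be appreciated!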

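Reply 7

f2003 [Expert points: 7960] posted on 2007-12-22 14:19:00

Adding Qparallel is what enables use of both cores.

The Qx switch uses sse instructions instead of the x87 floating-point unit; since sse instructions are simd, they are faster. QxP is not necessarily the best choice for you; it depends on which cpu you have.
p4, socket 478: QxN, supports only up to sse2
socket 775: QxP, supports sse3
core2: QxT, supports sse3 and ssse3
penryn, the next-generation 45nm processor of the Core family due out next month, and nehalem in the next year or two will support sse4.1 and 4.2, and ivf will certainly come with new switches. sse4 has a dot-product instruction, which helps a lot with matmul and the like.

Once you know which cpu you have, the Qax switch is no longer needed. I once uploaded a document on how to use the ivf compiler options; you can search for it.

The idea that ivf optimizes only for lapack is nonsense; that cannot be the case.

You must read one book, "The Software Optimization Cookbook: High Performance Recipes for IA-32 Platforms". The English page is below, and there is also a Chinese translation priced at 59 yuan.
http://www.amazon.com/Software-Optimization-Cookbook-Performance-Platforms/dp/0976483211

The first step in optimization is to find the program's "hot spots" with a profiler such as vtune, because the speed of the "cold spots" does not matter; only code that is executed frequently and repeatedly needs optimizing. If matmul is not among the hot spots, the speed gained by optimizing it is entirely negligible.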

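Reply 8

junziyang [Expert points: 150] posted on 2007-12-23 07:57:00

Thank you, f2003! My program is fairly simple; it mainly calls matmul. Thanks for the material you provided. I will study it. Thanks.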

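Reply 9

aliouying [Expert points: 1150] posted on 2009-09-23 15:03:00

The reason is that the time reported by CALL CPU_TIME() is the combined time of both CPUs.
You can use the SECONDS() function or some other function instead.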
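
As an illustration of this point, the sketch below times the same piece of work with both cpu_time and the standard system_clock intrinsic. system_clock is swapped in here as one portable way to get elapsed (wall-clock) time; it is not the SECONDS() function mentioned above. Whether cpu_time adds up the time of all threads depends on the compiler and runtime, so treat that as an assumption to check on your own system. The matrix size and the program name are arbitrary.

```fortran
program wall_vs_cpu
  implicit none
  integer, parameter :: n = 300          ! arbitrary size for the demo
  real :: a(n, n), b(n, n), c(n, n)
  real :: cpu1, cpu2
  integer :: count1, count2, count_rate

  call random_number(a)
  call random_number(b)

  call cpu_time(cpu1)                    ! processor time
  call system_clock(count1, count_rate)  ! wall-clock ticks and ticks per second

  c = matmul(a, b)                       ! the work being timed

  call system_clock(count2)
  call cpu_time(cpu2)

  ! Print a checksum so the compiler cannot discard the computation
  print *, 'checksum  =', sum(c)
  print *, 'CPU time  =', cpu2 - cpu1
  print *, 'wall time =', real(count2 - count1) / real(count_rate)
end program wall_vs_cpu
```

In a serial run the two numbers should be close. In a build that really uses two cores, the CPU time can approach twice the wall time, which would make a parallel library look slower than it is when only cpu_time is reported.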
