We succeeded in compiling a parallel version of Vasp and want to share the experience. 

system: 
- 4 x Intel Xeon Quadcore, so 16 cores in sum 
- OpenSuse Linux 10.3 (X86-64), 64 bit 
- Intel Fortran-Compiler 10.1 
- Intel C/C++-Compiler 10.1 
- OpenMPI 1.2.6 
- GotoBLAS 1.26 
- Vasp 4.6.34 

steps: 
- install OpenMPI 
- compile the Blas libraries 
- build the vasp libraries 
- build vasp 
- run Hg benchmark 

Before we start, we set the environment variables for the compilers so that we don't have to specify them every time at the command line, for example in .bashrc for the bash shell: 

FC=/opt/intel/fce/10.1.012/bin/ifort ; export FC 
CXX=/opt/intel/cce/10.1.015/bin/icpc ; export CXX 
CC=/opt/intel/cce/10.1.015/bin/icc ; export CC 
F77=/opt/intel/fce/10.1.012/bin/ifort ; export F77 
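
Before starting the builds it may be worth checking that these four variables really point at executables. The following is only a sketch: check_compilers is a made-up helper name, not part of any of the packages, and it only tests that each path exists and is executable, not that it is the intended compiler version.

```shell
# check_compilers: report whether each compiler variable points to an executable.
# FC, F77, CC and CXX are the variables exported above; the function name
# itself is hypothetical.
check_compilers() {
    status=0
    for var in FC F77 CC CXX; do
        eval "path=\${$var:-}"
        if [ -n "$path" ] && [ -x "$path" ]; then
            echo "$var ok: $path"
        else
            echo "$var missing or not executable: ${path:-<unset>}"
            status=1
        fi
    done
    return $status
}
```

Running check_compilers in a fresh shell after editing .bashrc shows at a glance whether a path contains a typo.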

#OpenMPI# 
Installing OpenMPI is easy: it comes with a configure script (!), so we just need to specify the prefix for the installation folder, then build and install it: 
./configure --prefix=/openmpi-installation-folder 
make all install 
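
The configure script picks up FC, F77, CC and CXX from the environment, which is why they were exported first. After the install, the wrapper compilers have to be found by the shell. A possible way to set that up and to see which compiler the Fortran wrapper calls is sketched here; OMPI_PREFIX is a variable name introduced only for this example, and its value is the same placeholder as in the configure line above.

```shell
# Assumption: OpenMPI was installed under the prefix given to ./configure.
OMPI_PREFIX=/openmpi-installation-folder
PATH="$OMPI_PREFIX/bin:$PATH"; export PATH
LD_LIBRARY_PATH="$OMPI_PREFIX/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"; export LD_LIBRARY_PATH

# Print the full command line the Fortran wrapper would run, if it is installed.
# The output should start with the ifort path set in FC.
if command -v mpif90 >/dev/null 2>&1; then
    mpif90 --showme || echo "mpif90 did not accept --showme"
fi
```

If the printed command starts with a different compiler than ifort, OpenMPI was configured before the environment variables were set and should be rebuilt.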

#GotoBlas# 
Installing GotoBlas was the most confusing part! If you just use the quickbuild.64bit script, it checks whether you have a multi-CPU environment (SMP) and, if so, builds threaded Blas libraries, meaning the libraries themselves already make use of your multiple processors. But if you do so, you end up with a parallel version of vasp which is "mindblowing slow", at least for me! Furthermore, this script searches for installed Fortran compilers in a fixed order, and if you have several installed it might choose a different one than your Intel Fortran compiler (which we don't want, since we want to use the same compiler for Blas and Vasp). 
To avoid both problems you could modify the makefiles by hand or, which is the way I did it, modify the "detect" file, which does all the detection of compilers and SMP. So we prevent the detect script from looking for other compilers and we prevent it from using SMP. What follows is the "detect" file I used (I cut off the last part where I made no changes!): 
######################################### 
rm -f getarch_cmd 
rm -f getarch_cmd.exe 

make clean 

FCOMPILER=NULL 

##which g77 > /dev/null 2> /dev/null 
##if [ 0 == $? ]; then 
##FCOMPILER=G77 
##fi 

##which g95 > /dev/null 2> /dev/null 
##if [ 0 == $? ]; then 
##FCOMPILER=G95 
##fi 

##which gfortran > /dev/null 2> /dev/null 
##if [ 0 == $? ]; then 
##FCOMPILER=GFORTRAN 
##fi 

which ifort > /dev/null 2> /dev/null ##comment out everything but ifort 
if [ 0 == $? ]; then 
FCOMPILER=INTEL 
fi 

##which pgf77 > /dev/null 2> /dev/null 
##if [ 0 == $? ]; then 
##FCOMPILER=PGI 
##fi 

##which pathf90 > /dev/null 2> /dev/null 
##if [ 0 == $? ]; then 
##FCOMPILER=PATHSCALE 
##fi 

##which xlf > /dev/null 2> /dev/null 
##if [ 0 == $? ]; then 
##FCOMPILER=IBM 
##fi 

HAS_SMP=0 

##NUM_CPU=`cat /proc/cpuinfo | grep -c processor` 
##if [ $NUM_CPU -gt 1 ]; then 
##HAS_SMP=1 ##prevent the check for SMP 
##fi 

############################################# 
I know that this is an absurd way of doing it; one could just as easily edit Makefile.rule by hand. 
Anyway, it worked for me. 
Then enter 
./quickbuild.64bit 
and you should end up with a nice Blas library, in my case "libgoto_core2-r1.26.so". 
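
To see whether the library really contains the BLAS routines under the names ifort expects, one can look at its exported symbols. This is a rough sketch: exports_symbol is a hypothetical helper that filters the output of nm -D, and the symbol name dgemm_ (lowercase with one trailing underscore) is an assumption about the Intel naming convention on Linux.

```shell
# exports_symbol SYMBOL: read "nm -D" output on stdin and succeed if SYMBOL
# is listed as a defined (T/t) or weak (W/w) symbol.
exports_symbol() {
    awk -v s="$1" '
        NF >= 2 && $NF == s && $(NF-1) ~ /^[TtWw]$/ { found = 1 }
        END { exit !found }
    '
}

# Possible use (library name taken from my build, adjust to yours):
#   nm -D libgoto_core2-r1.26.so | exports_symbol dgemm_ && echo "dgemm_ found"
```

Note that the vasp makefile links with -lgoto, so the linker must be able to find a libgoto.so or libgoto.a in the directory given with -L.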

#Vasp.4.lib# 
The vasp libraries are easy again, just 
cp makefile.linux_ifc_P4 Makefile 
possibly edit the FC to match the Intel-Fortran-Compiler and 
make 
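
The vasp makefile later links against libdmy.a (via -ldmy), linpack_double.o and lapack_double.o from this directory, so it is worth checking that make produced them. A small sketch, with missing_files being a name invented for this example:

```shell
# missing_files DIR FILE...: print every FILE that does not exist in DIR.
missing_files() {
    dir=$1; shift
    for f in "$@"; do
        [ -e "$dir/$f" ] || echo "$f"
    done
}

# Possible use from the vasp.4.6 directory:
#   missing_files ../vasp.4.lib libdmy.a linpack_double.o lapack_double.o
```

No output means all three files are there.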

#Vasp.4.6# 
After 10 days of trying this also felt easy in the end. 
cp makefile.linux_ifc_P4 Makefile 
- possibly edit the FC line 
- change the OFLAG line from -O3 to -O1 (OFLAG=-O1 -xW -tpp7 in my case) 
this is just something I read in the forum; if I don't do that, I end up with memory allocation errors when running vasp! 
- edit the path to the Blas libraries 
- possibly edit the CPP precompiler flags (the lower ones after the mpi section) 
- edit the path to your mpi fortran-wrapper compiler (/opt/openmpi-1.2.6/bin/mpif90 in my case) 
- uncomment the mpi libraries 
- uncomment the fft (FFT3D) libraries in the mpi part 
What follows is the vasp makefile (I cut off the last part where I made no changes!): 
########################################## 
.SUFFIXES: .inc .f .f90 .F 
#----------------------------------------------------------------------- 
# Makefile for Intel Fortran compiler for P4 systems 

# The makefile was tested only under Linux on Intel platforms 
# (Suse 5.3- Suse 9.0) 
# the followin compiler versions have been tested 
# 5.0, 6.0, 7.0 and 7.1 (some 8.0 versions seem to fail compiling the code) 
# presently we recommend version 7.1 or 7.0, since these 
# releases have been used to compile the present code versions 

# it might be required to change some of library pathes, since 
# LINUX installation vary a lot 
# Hence check ***ALL**** options in this makefile very carefully 
#----------------------------------------------------------------------- 

# BLAS must be installed on the machine 
# there are several options: 
# 1) very slow but works: 
# retrieve the lapackage from ftp.netlib.org 
# and compile the blas routines (BLAS/SRC directory) 
# please use g77 or f77 for the compilation. When I tried to 
# use pgf77 or pgf90 for BLAS, VASP hang up when calling 
# ZHEEV (however this was with lapack 1.1 now I use lapack 2.0) 
# 2) most desirable: get an optimized BLAS 

# the two most reliable packages around are presently: 
# 3a) Intels own optimised BLAS (PIII, P4, Itanium) 
# http://developer.intel.com/software/products/mkl/ 
# this is really excellent when you use Intel CPU's 

# 3b) or obtain the atlas based BLAS routines 
# http://math-atlas.sourceforge.net/ 
# you certainly need atlas on the Athlon, since the mkl 
# routines are not optimal on the Athlon. 
# If you want to use atlas based BLAS, check the lines around LIB= 

# 3c) mindblowing fast SSE2 (4 GFlops on P4, 2.53 GHz) 
# Kazushige Goto's BLAS 
# http://www.cs.utexas.edu/users/kgoto/signup_first.html 

#----------------------------------------------------------------------- 

# all CPP processed fortran files have the extension .f90 
SUFFIX=.f90 

#----------------------------------------------------------------------- 
# fortran compiler and linker 
#----------------------------------------------------------------------- 
FC=/opt/intel/fce/10.1.012/bin/ifort 
# fortran linker 
FCL=$(FC) 


#----------------------------------------------------------------------- 
# whereis CPP ?? (I need CPP, can't use gcc with proper options) 
# that's the location of gcc for SUSE 5.3 

# CPP_ = /usr/lib/gcc-lib/i486-linux/2.7.2/cpp -P -C 

# that's probably the right line for some Red Hat distribution: 

# CPP_ = /usr/lib/gcc-lib/i386-redhat-linux/2.7.2.3/cpp -P -C 

# SUSE X.X, maybe some Red Hat distributions: 

CPP_ = ./preprocess <$*.F | /usr/bin/cpp -P -C -traditional >$*$(SUFFIX) 

#----------------------------------------------------------------------- 
# possible options for CPP: 
# NGXhalf charge density reduced in X direction 
# wNGXhalf gamma point only reduced in X direction 
# avoidalloc avoid ALLOCATE if possible 
# IFC work around some IFC bugs 
# CACHE_SIZE 1000 for PII,PIII, 5000 for Athlon, 8000-12000 P4 
# RPROMU_DGEMV use DGEMV instead of DGEMM in RPRO (depends on used BLAS) 
# RACCMU_DGEMV use DGEMV instead of DGEMM in RACC (depends on used BLAS) 
#----------------------------------------------------------------------- 

CPP = $(CPP_) -DHOST=\"LinuxIFC\" \ 
-Dkind8 -DNGZhalf -DCACHE_SIZE=12000 -Davoidalloc -DMPI -DIFC\ 
# -DRPROMU_DGEMV -DRACCMU_DGEMV 

#----------------------------------------------------------------------- 
# general fortran flags (there must a trailing blank on this line) 
#----------------------------------------------------------------------- 

FFLAGS = -FR -lowercase -assume byterecl 

#----------------------------------------------------------------------- 
# optimization 
# we have tested whether higher optimisation improves performance 
# -axK SSE1 optimization, but also generate code executable on all mach. 
# xK improves performance somewhat on XP, and a is required in order 
# to run the code on older Athlons as well 
# -xW SSE2 optimization 
# -axW SSE2 optimization, but also generate code executable on all mach. 
# -tpp6 P3 optimization 
# -tpp7 P4 optimization 
#----------------------------------------------------------------------- 

OFLAG=-O1 -xW -tpp7 

OFLAG_HIGH = $(OFLAG) 
OBJ_HIGH = 

OBJ_NOOPT = 
DEBUG = -FR -O0 
INLINE = $(OFLAG) 


#----------------------------------------------------------------------- 
# the following lines specify the position of BLAS and LAPACK 
# on P4, VASP works fastest with the libgoto library 
# so that's what I recommend 
#----------------------------------------------------------------------- 

# Atlas based libraries 
#ATLASHOME= $(HOME)/archives/BLAS_OPT/ATLAS/lib/Linux_P4SSE2/ 
#BLAS= -L$(ATLASHOME) -lf77blas -latlas 

# use specific libraries (default library path might point to other libraries) 
#BLAS= $(ATLASHOME)/libf77blas.a $(ATLASHOME)/libatlas.a 

# use the mkl Intel libraries for p4 (www.intel.com) 
# mkl.5.1 
# set -DRPROMU_DGEMV -DRACCMU_DGEMV in the CPP lines 
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4 -lpthread 

# mkl.5.2 requires also to -lguide library 
# set -DRPROMU_DGEMV -DRACCMU_DGEMV in the CPP lines 
#BLAS=-L/opt/intel/mkl/lib/32 -lmkl_p4 -lguide -lpthread 

# even faster Kazushige Goto's BLAS 
# http://www.cs.utexas.edu/users/kgoto/signup_first.html 
BLAS=-L/opt/GotoBLAS_not_threaded -lgoto 

# LAPACK, simplest use vasp.4.lib/lapack_double 
LAPACK= ../vasp.4.lib/lapack_double.o 

# use atlas optimized part of lapack 
#LAPACK= ../vasp.4.lib/lapack_atlas.o -llapack -lcblas 

# use the mkl Intel lapack 
#LAPACK= -lmkl_lapack 

#----------------------------------------------------------------------- 

LIB = -L../vasp.4.lib -ldmy \ 
../vasp.4.lib/linpack_double.o $(LAPACK) \ 
$(BLAS) 

# options for linking (for compiler version 6.X, 7.1) nothing is required 
LINK = 
# compiler version 7.0 generates some vector statments which are located 
# in the svml library, add the LIBPATH and the library (just in case) 
#LINK = -L/opt/intel/compiler70/ia32/lib/ -lsvml 

#----------------------------------------------------------------------- 
# fft libraries: 
# VASP.4.6 can use fftw.3.0.X (http://www.fftw.org) 
# since this version is faster on P4 machines, we recommend to use it 
#----------------------------------------------------------------------- 

#FFT3D = fft3dfurth.o fft3dlib.o 
FFT3D = fftw3d.o fft3dlib.o /opt/libs/fftw-3.0.1/lib/libfftw3.a 


#======================================================================= 
# MPI section, uncomment the following lines 

# one comment for users of mpich or lam: 
# You must *not* compile mpi with g77/f77, because f77/g77 
# appends *two* underscores to symbols that contain already an 
# underscore (i.e. MPI_SEND becomes mpi_send__). The pgf90/ifc 
# compilers however append only one underscore. 
# Precompiled mpi version will also not work !!! 

# We found that mpich.1.2.1 and lam-6.5.X to lam-7.0.4 are stable 
# mpich.1.2.1 was configured with 
# ./configure -prefix=/usr/local/mpich_nodvdbg -fc="pgf77 -Mx,119,0x200000" \ 
# -f90="pgf90 " \ 
# --without-romio --without-mpe -opt=-O \ 

# lam was configured with the line 
# ./configure -prefix /opt/libs/lam-7.0.4 --with-cflags=-O -with-fc=ifc \ 
# --with-f77flags=-O --without-romio 

# please note that you might be able to use a lam or mpich version 
# compiled with f77/g77, but then you need to add the following 
# options: -Msecond_underscore (compilation) and -g77libs (linking) 

# !!! Please do not send me any queries on how to install MPI, I will 
# certainly not answer them !!!! 
#======================================================================= 
#----------------------------------------------------------------------- 
# fortran linker for mpi: if you use LAM and compiled it with the options 
# suggested above, you can use the following line 
#----------------------------------------------------------------------- 

FC=/opt/openmpi-1.2.6/bin/mpif90 
FCL=$(FC) 

#----------------------------------------------------------------------- 
# additional options for CPP in parallel version (see also above): 
# NGZhalf charge density reduced in Z direction 
# wNGZhalf gamma point only reduced in Z direction 
# scaLAPACK use scaLAPACK (usually slower on 100 Mbit Net) 
#----------------------------------------------------------------------- 

CPP = $(CPP_) -DMPI -DHOST=\"LinuxIFC\" -DIFC \ 
-Dkind8 -DNGZhalf -DCACHE_SIZE=4000 -Davoidalloc \ 
-DMPI_BLOCK=500 \ 
# -DRPROMU_DGEMV -DRACCMU_DGEMV 

#----------------------------------------------------------------------- 
# location of SCALAPACK 
# if you do not use SCALAPACK simply uncomment the line SCA 
#----------------------------------------------------------------------- 

BLACS=$(HOME)/archives/SCALAPACK/BLACS/ 
SCA_=$(HOME)/archives/SCALAPACK/SCALAPACK 

#SCA= $(SCA_)/libscalapack.a \ 
# $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a $(BLACS)/LIB/blacs_MPI-LINUX-0.a $(BLACS)/LIB/blacsF77init_MPI-LINUX-0.a 

SCA= 

#----------------------------------------------------------------------- 
# libraries for mpi 
#----------------------------------------------------------------------- 

LIB = -L../vasp.4.lib -ldmy \ 
../vasp.4.lib/linpack_double.o $(LAPACK) \ 
$(SCA) $(BLAS)
 

# FFT: fftmpi.o with fft3dlib of Juergen Furthmueller 
#FFT3D = fftmpi.o fftmpi_map.o fft3dlib.o 

# fftw.3.0.1 is slighly faster and should be used if available 
FFT3D = fftmpiw.o fftmpi_map.o fft3dlib.o /opt/libs/fftw-3.0.1/lib/libfftw3.a 

#----------------------------------------------------------------------- 
# general rules and compile lines 
#----------------------------------------------------------------------- 
################################################################## 
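
Note that FC, FCL, CPP, LIB and FFT3D appear twice in this makefile, once in the serial part and once in the MPI part. Make uses the last assignment, so after uncommenting the MPI lines it is the later ones that count. To check which value is actually in effect, something like the following could be used. It is only a sketch: last_assignment is a hypothetical helper, and it prints just the first line of an assignment that continues over several lines.

```shell
# last_assignment FILE VAR: print the last uncommented "VAR = ..." line in FILE.
last_assignment() {
    awk -v v="$2" '
        $0 !~ /^[ \t]*#/ {
            line = $0
            sub(/^[ \t]*/, "", line)
            # the line must start with the exact variable name followed by "="
            if (index(line, v) == 1) {
                rest = substr(line, length(v) + 1)
                if (rest ~ /^[ \t]*=/) last = $0
            }
        }
        END { if (last != "") print last }
    ' "$1"
}

# Possible use:
#   last_assignment Makefile FC
#   last_assignment Makefile FFT3D
```

With the makefile above, the FC line reported should be the one pointing to mpif90, not the one pointing to ifort.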

#Hg benchmark# 

Finally I ended up with a working parallel version of vasp. 

If I run the Hg benchmark using all 16 cores with 
mpirun -np 16 vasp-install-dir/vasp 
it takes 24 seconds, compared to 140 seconds for the single-core version! 
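
To put the two timings in relation, the speedup and the parallel efficiency (speedup divided by the number of cores) can be worked out with a few lines of shell. The function names are made up for this example; the numbers are the ones quoted above.

```shell
# speedup SERIAL PARALLEL: serial run time divided by parallel run time.
speedup() {
    awk -v s="$1" -v p="$2" 'BEGIN { printf "%.2f\n", s / p }'
}

# efficiency SERIAL PARALLEL CORES: speedup per core, in percent.
efficiency() {
    awk -v s="$1" -v p="$2" -v n="$3" 'BEGIN { printf "%.0f%%\n", 100 * s / (p * n) }'
}

speedup 140 24        # prints 5.83
efficiency 140 24 16  # prints 36%
```

So the 16-core run is a bit less than 6 times faster than the single-core run, which for a benchmark this small is not unexpected.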

Love it! 


Back again with some news: 

The way we compiled GotoBLAS was wrong (besides the stupid way we did it). 
If you compile the Blas libraries with threading turned on (SMP = 1) but the maximum number of threads set to 1 (MAX_THREADS = 1), you gain between 33 and 50% in speed for larger calculations. 
The User Configuration part of Makefile.rule: 



# Beginning of user configuration 


# This library's version 
REVISION = -r1.26 

# Which C compiler do you prefer? Default is gcc. 
C_COMPILER = GNU 
# C_COMPILER = INTEL 
# C_COMPILER = PGI 

# Now you don't need Fortran compiler to build library. 
# If you don't spcifly Fortran Compiler, GNU g77 compatible 
# interface will be used. 
# F_COMPILER = G77 
# F_COMPILER = G95 
# F_COMPILER = GFORTRAN 
F_COMPILER = INTEL 
# F_COMPILER = PGI 
# F_COMPILER = PATHSCALE 
# F_COMPILER = IBM 
# F_COMPILER = COMPAQ 
# F_COMPILER = SUN 
# F_COMPILER = F2C 

# If you need 64bit binary; some architecture can accept both 32bit and 
# 64bit binary(X86_64, SPARC, Power/PowerPC or WINDOWS). 
BINARY64 = 1 

# If you want to build threaded BLAS 
SMP = 1 

# You can define maximum number of threads. Basically it should be 
# less than actual number of cores. If you don't specify one, it's 
# automatically detected by script. 
MAX_THREADS = 1 

# If you want to use legacy threaded Level 3 implementation. 
# Some architecture prefer this algorithm, but it's rare. 
# USE_SIMPLE_THREADED_LEVEL3 = 1 

# If you want to use GotoBLAS with accerelator like Cell or GPGPU 
# This is experimental and currently won't work well. 
# USE_ACCERELATOR = 1 

# Define accerelator type (won't work) 
# USE_CELL_SPU = 1 

# Theads are still working for a while after finishing BLAS operation 
# to reduce thread activate/deactivate overhead. You can determine 
# time out to improve performance. This number should be from 4 to 30 
# which corresponds to (1 << n) cycles. For example, if you set to 26, 
# thread will be running for (1 << 26) cycles(about 25ms on 3.0GHz 
# system). Also you can control this mumber by GOTO_THREAD_TIMEOUT 
# CCOMMON_OPT += -DTHREAD_TIMEOUT=26 

# If you need cross compiling 
# (you have to set architecture manually in getarch.c!) 
# Example : HOST ... G5 OSX, TARGET = CORE2 OSX 
# CROSS_SUFFIX = i686-apple-darwin8- 
# CROSS_VERSION = -4.0.1 
# CROSS_BINUTILS = 

# If you need Special memory management; 
# Using HugeTLB file system(Linux / AIX / Solaris) 
# HUGETLB_ALLOCATION = 1 

# Using bigphysarea memory instead of normal allocation to get 
# physically contiguous memory. 
# BIGPHYSAREA_ALLOCATION = 1 

# To get maxiumum performance with minimum impact to the system, 
# mixing memory allocation may be worth to try. In this case, 
# you have to define one of ALLOC_HUGETLB or BIGPHYSAREA_ALLOCATION. 
# Another allocation will be done by mmap or static allocation. 
# (Not implemented yet) 
# MIXED_MEMORY_ALLOCATION = 1 

# Using static allocation instead of dynamic allocation 
# You can't use it with ALLOC_HUGETLB 
# STATIC_ALLOCATION = 1 

# If you want to use CPU affinity 
# CCOMMON_OPT += -DUSE_CPU_AFFINITY 

# If you want to use memory affinity (NUMA) 
# You can't use it with ALLOC_STATIC 
# NUMA_AFFINITY = 1 

# If you want to use interleaved memory allocation. 
# Default is local allocation(it only works with NUMA_AFFINITY). 
# CCOMMON_OPT += -DINTERLEAVED_MAPPING 

# If you want to drive whole 64bit region by BLAS. Not all Fortran 
# compiler supports this. It's safe to keep comment it out if you 
# are not sure. 
# INTERFACE64 = 1 

# If you have special compiler to run script to determine architecture. 
GETARCH_CC += 
GETARCH_FLAGS += 
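
Instead of editing Makefile.rule in an editor, the two relevant switches could also be set from the shell. This is a sketch only: set_rule is a hypothetical helper, and it assumes that each key occurs on exactly one line of the form "KEY = value", either active or commented out with a leading "#". For keys such as F_COMPILER, which have many commented alternatives, it would rewrite all of them, so those are better edited by hand.

```shell
# set_rule FILE KEY VALUE: turn a commented or active "KEY = ..." line
# into "KEY = VALUE". Writes through a temporary file next to FILE.
set_rule() {
    sed "s|^#*[[:space:]]*$2[[:space:]]*=.*|$2 = $3|" "$1" > "$1.tmp" && mv "$1.tmp" "$1"
}

# Possible use in the GotoBLAS source directory:
#   set_rule Makefile.rule SMP 1
#   set_rule Makefile.rule MAX_THREADS 1
```

After changing the settings the library has to be rebuilt from a clean tree so that no objects from the earlier unthreaded build are reused.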
