FastEpistasis, a software tool capable of computing tests of epistasis for a large number of SNP pairs, is an efficient parallel extension to the PLINK
epistasis module. It tests epistatic effects in the normal linear regression of a quantitative response on marginal effects of each SNP and an interaction
effect of the SNP pair, where SNPs are coded as additive effects, taking user-defined values or the default 0, 1 and 2. The test for epistasis reduces to
testing whether the interaction term is significantly different from zero.
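As an illustration of the model only (this is not FastEpistasis code), the sketch below fits y = b0 + b1*a + b2*b + b3*a*b by least squares with a hand-written Gram-Schmidt QR decomposition and returns the interaction estimate, its standard error and the t statistic. The function names, the simulated genotypes and the effect sizes are all assumptions made for the example.

```python
import math
import random

def gram_schmidt_qr(columns):
    """Modified Gram-Schmidt: return orthonormal columns q and upper-triangular r."""
    p = len(columns)
    q = [list(map(float, c)) for c in columns]
    r = [[0.0] * p for _ in range(p)]
    for j in range(p):
        r[j][j] = math.sqrt(sum(v * v for v in q[j]))
        q[j] = [v / r[j][j] for v in q[j]]
        for k in range(j + 1, p):
            r[j][k] = sum(u * v for u, v in zip(q[j], q[k]))
            q[k] = [v - r[j][k] * u for u, v in zip(q[j], q[k])]
    return q, r

def interaction_test(y, snp_a, snp_b):
    """Fit y = b0 + b1*a + b2*b + b3*a*b and return (b3, standard error, t)."""
    n = len(y)
    design = [[1.0] * n, snp_a, snp_b, [a * b for a, b in zip(snp_a, snp_b)]]
    q, r = gram_schmidt_qr(design)
    qty = [sum(u * v for u, v in zip(col, y)) for col in q]
    fitted = [sum(qty[j] * q[j][i] for j in range(4)) for i in range(n)]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    sigma = math.sqrt(rss / (n - 4))
    # The interaction column is last, so only the last row of R is needed:
    # R[3][3] * b3 = (Q^T y)[3] and Var(b3) = sigma^2 / R[3][3]^2.
    b3 = qty[3] / r[3][3]
    se = sigma / r[3][3]
    return b3, se, b3 / se

# Simulated data (hypothetical): genotypes coded 0, 1, 2 and a true interaction of 0.5.
rng = random.Random(1)
a = [rng.randint(0, 2) for _ in range(400)]
b = [rng.randint(0, 2) for _ in range(400)]
y = [1.0 + 0.2 * ai - 0.1 * bi + 0.5 * ai * bi + rng.gauss(0, 0.1)
     for ai, bi in zip(a, b)]
print(interaction_test(y, a, b))
```

Placing the interaction column last means its estimate and standard error come straight from the last diagonal entry of R, with no back substitution. The sketch does not handle linearly dependent columns (for example a monomorphic SNP), which would make a diagonal entry of R zero.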
As of version 2.03, FastEpistasis also tests epistatic effects for binary traits using a logistic regression model similar to the one in PLINK. This feature
is currently still considered BETA and requires deeper analysis.
FastEpistasis optimizes the computations by splitting the analysis tasks into three separate applications: pre-, core- and post-computation.
- The precomputation phase loads PLINK binary format data files, reformats the data for faster computations and reduces the number of conditions to test
for in the core phase.
- The core computational phase is designed to be embarrassingly parallel: it iterates through SNP pairs and efficiently carries
out the tests for epistasis. The computations apply the QR decomposition to derive least squares estimates of the interaction
coefficient and its standard error. The core computation software comes in several versions to take advantage of different high performance
architectures: a Shared Memory Processor (SMP) version and a clustered Message Passing Interface (MPI) version.
- An optional post-computation phase is provided to aggregate results from each processor or core, include detailed SNP information, compute
p-values from each test, and convert to text files.
- An erratum in the smpFastEpistasis 2.02 version setting prevents the use of postFastEpistasis; this has been corrected in version 2.03.
If you have large data, please contact me and I shall put up a small corrective binary; otherwise rerun using v2.03.
- A bug was spotted in version 2.01. If you are using that binary package, please upgrade to 2.03.
- Version 2.x
  - FastEpistasis requires phenotype values to appear in the same order as the family data. It should warn about potential errors.
- Version 1.x
  - FastEpistasis currently uses the size_t datatype, hence output data cannot be passed from a 32-bit architecture to a 64-bit architecture.
    Furthermore, the application is designed to run on Intel and AMD processors; running it on big-endian processors will fail.
  - FastEpistasis requires phenotype values to appear in the same order as the family data. As of version 1.07 it should warn about potential errors.
Performance
- Scaling
Benchmarks were obtained using either a dual Intel Clovertown rated at 2.66GHz with 16GB RAM (SMP) or a dual Intel Harpertown rated at 2.8GHz with 16GB RAM (MPI).
They are averaged over 10 runs using data from the HapMap3 MKK. SMP results are reported using disjoint SNP sets, A of size 19,999 and B of size 2,596,
whereas MPI results use set A = B with 19,999 SNPs. Only the core computational phase is taken into account.
The divergence between the SMP (blue) and MPI (red) versions appears to be due to initial data loading and processor frequencies. The measured scaling efficiency is about 93% for both
SMP and MPI.
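For readers who want to derive a comparable figure from their own timings, the sketch below assumes the usual definition of parallel efficiency: speed-up relative to a single worker, divided by the number of workers. The source does not state which definition underlies the 93% figure, and the timings in the usage line are placeholders, not measurements.

```python
def parallel_efficiency(serial_seconds, parallel_seconds, workers):
    """Speed-up relative to one worker, divided by the number of workers."""
    if parallel_seconds <= 0 or workers < 1:
        raise ValueError("need a positive run time and at least one worker")
    speedup = serial_seconds / parallel_seconds
    return speedup / workers

# Placeholder timings (hypothetical): 800 s on one core, 107.5 s on 8 cores.
print(f"{parallel_efficiency(800.0, 107.5, 8):.2f}")
```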
- Population size scaling
8-core SMP benchmarks using an enlarged Hapmap3 MKK. Only one phenotype is used.
Only the core computational phase is taken into account.
- Scaling with the number of phenotypes
The speed-up is maximal at about 32 phenotypes and then suffers from cache penalties. Output is stored on a dedicated SATA II hard disk.
Only the core computational phase is taken into account.
Documentation
Although it uses the Plink binary data format, FastEpistasis has been designed to match Plink's command line options for epistasis as closely as possible. There are
nevertheless slight differences that may trip you up. The purpose here is to gather some common problems and walk through a full procedure based on the data used for
benchmarking. In any case, users should run the executables with the --help option to get an overview of the available options together with a brief description,
especially as the following does NOT cover every option but focuses on the basics and mandatory elements.
For this tutorial we will need the HapMap3 (~750 MB) data file, more precisely the files
hapmap3_r1_b36_fwd.MKK.qc.poly.recode.map and
hapmap3_r1_b36_fwd.MKK.qc.poly.recode.ped it contains. A working Plink executable is also required.
A run of FastEpistasis consists of several stages, each with its own executable (or several in the compute stage, to distinguish the MPI and SMP computer architectures).
- The first stage gathers the data into a single binary file that holds all the information required later on in a compact format. Most issues arise here, as the user may run into an unrecognized
format. Similar to Plink, PreFastEpistasis has several mandatory arguments that point to the different files holding the data:
- the individual relations: Plink .fam file
- the overall genotype data: Plink .ped or .bed file
- the SNP information: Plink .map or .bim file
- the phenotype(s) file
- the SNP set file
It is nevertheless worth mentioning that PreFastEpistasis imports Plink data in Plink's binary format, that is .bed and .bim files rather than .ped and .map files.
Furthermore, the family file is not always present but can be generated by Plink.
Hence one has to run Plink to generate the .fam file and the binary versions of the .map and .ped files. In our example, this is performed by the following commands:
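A minimal sketch of this conversion, assuming the standard PLINK 1.x options --file, --make-bed and --out, and assuming the executable is reachable as "plink". The helper name is made up for illustration; the script only assembles and prints the command.

```python
import shlex

def plink_make_bed_command(ped_map_prefix, out_prefix, plink="plink"):
    """Build the argument list that converts .ped/.map files to .bed/.bim/.fam."""
    return [plink, "--file", ped_map_prefix, "--make-bed", "--out", out_prefix]

cmd = plink_make_bed_command(
    "hapmap3_r1_b36_fwd.MKK.qc.poly.recode",  # prefix of the .ped and .map files
    "FastEpistasis.data.MKK",                 # prefix of the generated files
)
print(shlex.join(cmd))
# To actually run the conversion: subprocess.run(cmd, check=True)
```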
You should now have generated the binary files FastEpistasis.data.MKK.bed and FastEpistasis.data.MKK.bim as well as the family file FastEpistasis.data.MKK.fam.
Before one can run PreFastEpistasis, a phenotype file and a set file must be created.
The phenotype file is space-separated text and contains 2+N columns and M+1 rows, where N is the number of phenotypes and M the number of individuals.
As of version 2.0, there may not be more than 64 phenotypes due to the algorithm used to detect missing values. The header row should contain the column titles, that is "fid" and "iid"
for columns 1 and 2, followed by the desired phenotype names (names longer than 16 characters will be truncated).
PreFastEpistasis assumes the row order to be exactly the same as in the .ped or .fam file; any mismatch should trigger an error!
Here is an example used in the benchmarks.
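A minimal sketch of how a file in this layout could be produced, with made-up family IDs, individual IDs and values (these are not the benchmark data). The helper name and its arguments are hypothetical; the row order is taken from the first two columns of the .fam lines, and the 64-phenotype limit and 16-character names follow the description above.

```python
MAX_PHENOTYPES = 64
MAX_NAME_LENGTH = 16

def phenotype_file_text(fam_lines, phenotype_names, values):
    """Return phenotype file text with rows in the same order as the .fam lines.

    values maps (fid, iid) to a list holding one number per phenotype.
    """
    if len(phenotype_names) > MAX_PHENOTYPES:
        raise ValueError("at most 64 phenotypes are supported")
    # Truncate names here so the header shows what the tool would keep.
    names = [name[:MAX_NAME_LENGTH] for name in phenotype_names]
    rows = [" ".join(["fid", "iid"] + names)]
    for line in fam_lines:
        fid, iid = line.split()[:2]
        row = values[(fid, iid)]
        if len(row) != len(names):
            raise ValueError(f"wrong number of phenotypes for {fid} {iid}")
        rows.append(" ".join([fid, iid] + [repr(float(v)) for v in row]))
    return "\n".join(rows) + "\n"

# Hypothetical two-individual example.
fam = ["F1 I1 0 0 1 -9", "F2 I2 0 0 2 -9"]
pheno = {("F1", "I1"): [1.5], ("F2", "I2"): [2.0]}
print(phenotype_file_text(fam, ["height"], pheno), end="")
```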
The set file is text-based as well and should contain a single SNP name or keyword per line. Recognized keywords are SET_A, SET_B and END. It is
important to keep in mind that PreFastEpistasis searches for the best SNP pair correlations by taking each element of set A with respect to each element of set B.
Consequently one has to provide two sets even though they may be identical. There are three possibilities, each having its own dedicated computational phase for optimization:
- A = B, yielding a so-called pure interaction run where symmetrical pairs are taken into account and computed only once. A reduction over the best match of each thread is then
necessary at the end of the computation phase.
- A and B share some SNPs, hence some interactions will be computed twice in order to avoid any post-reduction. One might suffer a great performance penalty if there are many shared SNPs.
- A and B are disjoint; no reduction is necessary as the best pairs may be allocated to and owned by only one working thread.
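To give a feel for the workload in each of the three cases, the following sketch counts the pair tests implied by two SNP sets. It assumes that a SNP is never tested against itself, which the text does not state, and the function name is made up for illustration.

```python
def count_pair_tests(set_a, set_b):
    """Count the pair tests for the three set configurations described above."""
    a, b = set(set_a), set(set_b)
    if a == b:
        n = len(a)
        # Symmetrical pairs are computed only once.
        return {"case": "A = B", "tests": n * (n - 1) // 2, "repeated": 0}
    shared = len(a & b)
    # Every element of A against every element of B, minus self-pairs (assumption).
    tests = len(a) * len(b) - shared
    # A pair of shared SNPs shows up as (x, y) and as (y, x).
    repeated = shared * (shared - 1) // 2
    case = "shared SNPs" if shared else "disjoint"
    return {"case": case, "tests": tests, "repeated": repeated}

# Set sizes from the benchmarks; the integer labels stand in for SNP names.
print(count_pair_tests(range(19999), range(19999)))
print(count_pair_tests(range(19999), range(19999, 19999 + 2596)))
```

With the benchmark sizes this gives 199,970,001 tests for A = B with 19,999 SNPs and 51,917,404 tests for the disjoint sets of 19,999 and 2,596 SNPs.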
Here is an example of a set file used for the pure interaction benchmarks.
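A minimal sketch of such a file, under a layout assumption: each set is opened by its keyword (SET_A or SET_B), followed by one SNP name per line, and closed by END. Only the keywords and the one-item-per-line rule come from the description above; the exact arrangement, the helper name and the SNP names are hypothetical.

```python
def set_file_text(set_a, set_b):
    """Return set file text with one SNP name or keyword per line (assumed layout)."""
    lines = ["SET_A", *set_a, "END", "SET_B", *set_b, "END"]
    return "\n".join(lines) + "\n"

# Pure interaction run: the same hypothetical SNP names in both sets.
snps = ["rs0001", "rs0002", "rs0003"]
print(set_file_text(snps, snps), end="")
```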
Finally, generating the binary file FastEpistasis.data.MKK.bin for the stage 2 process is a matter of running PreFastEpistasis with all the files prepared above, that is
which should output
Note the last line, where default values were used. The EPI 1 default is 0.0001, hence p-values below this threshold will be both counted and stored, whereas EPI 2
sets the threshold (0.01) below which values are only counted. mBB, mAB and mAA are the continuous values assigned to the genotypes in the fit procedure.
While mBB, mAB and mAA can only be modified at this stage, the EPI thresholds can also be modified in stage 2 using the appropriate option (use --help to see the options).
This feature was added in version 1.07.
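The role of the two thresholds can be sketched as follows. The function name and the returned labels are made up for illustration, and the use of strict inequalities is an assumption; only the default values 0.0001 and 0.01 come from the text.

```python
def classify_pvalue(p, epi1=0.0001, epi2=0.01):
    """Say what happens to a test result under the two EPI thresholds."""
    if p < epi1:
        return "counted and stored"
    if p < epi2:
        return "counted only"
    return "ignored"

for p in (0.00001, 0.005, 0.2):
    print(p, classify_pvalue(p))

# Setting EPI 1 to zero means that nothing is stored.
print(classify_pvalue(0.00001, epi1=0.0))
```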
- Once the data has been compacted into a binary file by PreFastEpistasis, running the search for the best pairs is as simple as choosing the architecture, namely SMP for shared memory processor or MPI
for the Message Passing Interface implementation. For the SMP version the command is
which should set the EPI1 threshold so that nothing is stored and use method 4 (SSE3 switch QR) to compute the fit. It outputs
as well as, in this case, 8 files named FastEpistasis.data.MKK.epi.qt.lm_XXX.bin where XXX runs from 0 to 7 (empty here as EPI1 is null),
an index file FastEpistasis.data.MKK.epi.qt.lm.idx and a summary file
FastEpistasis.data.MKK.epi.qt.lm.summary. Only the summary file is text; the rest is binary and holds the data required for post-processing (see stage 3).
Unless one wants to perform statistical analysis on the overall data, the summary file holds all the information about the best pairs.
-
Reference
FastEpistasis: A high performance computing solution for quantitative trait epistasis.
Schüpbach T, Xenarios I, Bergmann S, Kapur K.
Bioinformatics. 2010 Apr 7
Installation
Download
Source
| Version | Description |
| v1.05 | Initial version (not recommended; use 1.06 instead) |
| v1.06 | A few optimizations; most modifications were made to ease installation |
| v1.07 | Easier to provide other BLAS/LAPACK libraries; warns about not handling missing phenotype values and checks the ordering between input data files. |
| v2.01 | The linear fit algorithm has changed to avoid the use of external BLAS/LAPACK libraries. In addition, weights are now used to limit the matrix size, thus providing a significant performance boost. Data storage has been shrunk to its minimum to reduce the IO bottleneck. Moreover, the new format is independent of the x86 or x86_64 architecture. The code can now be compiled on big-endian architectures; it ran on an IBM BlueGene/P performing hundreds of millions of interactions per second! Additional features were added to the pre- stage to account for missing phenotypes; these individuals should now be removed from the population. However, some issues remain on how to deal with multiple phenotypes: the choice was made to stop with a warning when an individual has partially missing phenotypes. The SMP version has additional features not yet implemented in the MPI version: the QR algorithm has been implemented in several ways using SSE intrinsics, selected with the --method option. We measured the best performance using --method 4. Note that this version is superseded by 2.02 due to a bug in single phenotype data reading which caused every second value to be zero! |
| v2.03 | Corrected version of 2.01 and 2.02. The SMP version has an added feature to run epistatic tests on binary trait data. So far this feature is triggered when no phenotype file is provided, and it requires the phenotype data within the standard Plink files to be coded as 1 for unaffected and 2 for affected. We would like to emphasize that this is still under heavy development and should be considered BETA software. The MPI version now allows command line options to alter the thresholds, hence there is no longer any need to rerun preFastEpistasis for that purpose. Furthermore, SSE intrinsics functions are now provided as in SMP and can be selected through the --method option. Do not attempt to run the MPI version on multiple phenotypes. This is not yet ready. |
Binaries
| Version | Description |
| v1.08 | Static build of Pre/SMP/Post FastEpistasis. Version 1.08 is just 1.07 modified to be statically linked. |
| v2.03 | Static build of Pre/SMP/Post FastEpistasis. |
Dependencies
- The whole compilation process uses Cmake version >= 2.4
- On versions < 2.01, the core computation applications depend on a LAPACK library (Intel® Math Kernel Library, ATLAS or others) and make direct calls to
- PostFastEpistasis requires the GNU Scientific Library for
Howto
- Decompress the FastEpistasis-x.xx.tar.gz file
- Go into FastEpistasis/Build directory
- Create an environment variable CC with the compiler to use, Intel C Compiler here (icc)
- Creation of the makefiles is performed by Cmake, which tries to find all dependencies.
It may not always succeed, so the user can supply or override parameters by adding "-D parameter=value" to the cmake command.
- Global Cmake options
Many options exist (see the Cmake documentation); the most important are
- CMAKE_INSTALL_PREFIX to change the default installation path.
- CMAKE_BUILD_TYPE to change the build type; possible values are Debug, Release, RelWithDebInfo
- FastEpistasis specific options
Global options
| Option name | Description | Dependencies | Default | As of version |
| COMPILE_PRE_EPISTASIS | Triggers compilation of preFastEpistasis | | ON | |
| COMPILE_SMP_EPISTASIS | Triggers compilation of smpFastEpistasis | | ON | |
| COMPILE_MPI_EPISTASIS | Triggers compilation of mpiFastEpistasis | | OFF | |
| COMPILE_POST_EPISTASIS | Triggers compilation of postFastEpistasis | | ON | |
| UNICODE | Uses Unicode characters for boxes; requires a suitable terminal or text viewer, otherwise the output is quite unreadable | | ON | |
| ADD_RPATH | Hard-codes the link path into the executables; useful if multiple libraries or compilers exist | | OFF | |
| USE_AFFINITY | Adds thread affinity settings to smpFastEpistasis. Adds optional keywords that let users specify which core to run on and how many threads to use | COMPILE_SMP_EPISTASIS | OFF | |
| USE_MKL | Triggers the search for the Intel Math Kernel Library. If Cmake fails to find MKL, export the MKL_HOME environment variable to help | COMPILE_SMP_EPISTASIS or COMPILE_MPI_EPISTASIS | ON | > 2.0 deprecated |
| STANDALONE | Triggers a static build of the executables | | OFF | > 2.0 |
| NO_PERFECT_MATCH | Perfect fits will not be counted. Allowing perfect fits results in incredibly high p-value integral boundaries due to numerical errors arising in the subtraction of (almost) identical values and their reciprocals. This is harmless but spoils the summary best match data | | OFF | > 2.0 |
MPI option when COMPILE_MPI_EPISTASIS=ON
| Option name | Description |
| MPI_COMPILER | Full path to the mpi C compiler |
- Compile
- Install compiled applications
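The build steps above can be sketched as a short script that assembles the commands. It assumes the sources sit one level above the Build directory (hence the ".." argument) and uses a placeholder path for the MPI compiler; both are assumptions, as is the helper name. The option names come from the tables above, and the script only prints the commands.

```python
import os
import shlex

def cmake_command(options, source_dir=".."):
    """Turn a dict of Cmake options into 'cmake -D name=value ... source_dir'."""
    args = ["cmake"]
    for name, value in options.items():
        args += ["-D", f"{name}={value}"]
    args.append(source_dir)
    return args

# Environment for the build: CC selects the compiler, icc as in the text.
build_env = dict(os.environ, CC="icc")

steps = [
    cmake_command({
        "CMAKE_BUILD_TYPE": "Release",
        "COMPILE_MPI_EPISTASIS": "ON",
        "MPI_COMPILER": "/path/to/mpicc",  # placeholder path
    }),
    ["make"],
    ["make", "install"],
]
for step in steps:
    print(shlex.join(step))
    # To actually run a step: subprocess.run(step, check=True, env=build_env)
```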
Manually specifying BLAS/LAPACK libraries for versions prior to 2.0
BLAS / LAPACK options when USE_MKL=OFF
| Option name | Description |
| BLAS_LIBS | Name of the BLAS library, either full path or just "-lblasname" |
| BLAS_PATH | Path to the directory containing the above library |
| LAPACK64_LIBS | Name of the LAPACK library (double precision) either full path or just "-llapackname" |
| LAPACK64_PATH | Path to the directory containing the above library |
- ATLAS BLAS and ATLAS LAPACK
Assuming the ATLAS BLAS is installed in directory /usr/lib64/blas/atlas and ATLAS LAPACK in /usr/lib64/lapack/atlas, then perform
Note that ATLAS BLAS should not be built threaded; otherwise the threads of smpFastEpistasis will compete with those of ATLAS BLAS.
- Intel Math Kernel Library (MKL)
Assuming we want the x86_64 (em64t) version of MKL installed in directory /opt/intel/mkl90/lib/em64t
The FastEpistasis program suite is written by Thierry Schuepbach.