FastEpistasis, a software tool capable of computing tests of epistasis for a large number of SNP pairs, is an efficient parallel extension to the PLINK epistasis module. It tests epistatic effects in the normal linear regression of a quantitative response on the marginal effects of each SNP and an interaction effect of the SNP pair, where SNPs are coded as additive effects taking user-defined values or the defaults 0, 1 and 2. The test for epistasis reduces to testing whether the interaction term is significantly different from zero.
As of version 2.03, FastEpistasis also tests epistatic effects for binary traits using a logistic regression model similar to PLINK's. This feature is still considered BETA and requires deeper analysis.

FastEpistasis optimizes the computations by splitting the analysis tasks into three separate applications: pre-, core- and post-computation.

  • The precomputation phase loads PLINK binary format data files, reformats the data for faster computations and reduces the number of conditions to test for in the core phase.
  • The core computational phase is designed to embarrassingly parallelize the computations, iterating through SNP pairs and efficiently carrying out the tests for epistasis. The computations are based on applying the QR decomposition to derive least squares estimates of the interaction coefficient and its standard error. The core computation software comes in several versions to take advantage of different high performance architectures - a Shared Memory Processor (SMP) version and a clustered Message Passing Interface (MPI) version.
  • An optional post-computation phase is provided to aggregate results from each processor or core, include detailed SNP information, compute p-values from each test, and convert to text files.
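As a rough illustration of the core phase, the sketch below fits the model y = b0 + b1*A + b2*B + b3*A*B for one SNP pair through a QR decomposition and reads the interaction estimate and its standard error off the last column. It is plain Python with made-up data; the function names thin_qr and interaction_test are hypothetical, modified Gram-Schmidt is used only for brevity, and none of this reproduces the optimized FastEpistasis code.

```python
import math

def thin_qr(X):
    """Thin QR of an n x p matrix (list of rows) by modified Gram-Schmidt.
    Assumes X has full column rank, otherwise a diagonal of R is zero."""
    n, p = len(X), len(X[0])
    Q = [[0.0] * p for _ in range(n)]
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        v = [X[i][j] for i in range(n)]
        for k in range(j):
            R[k][j] = sum(Q[i][k] * v[i] for i in range(n))
            v = [v[i] - R[k][j] * Q[i][k] for i in range(n)]
        R[j][j] = math.sqrt(sum(x * x for x in v))
        for i in range(n):
            Q[i][j] = v[i] / R[j][j]
    return Q, R

def interaction_test(y, snp_a, snp_b):
    """Return (estimate, standard error) of the interaction coefficient b3."""
    # Design matrix: intercept, SNP A, SNP B, and their product as LAST column.
    X = [[1.0, a, b, a * b] for a, b in zip(snp_a, snp_b)]
    n, p = len(X), 4
    Q, R = thin_qr(X)
    qty = [sum(Q[i][j] * y[i] for i in range(n)) for j in range(p)]
    fitted = [sum(Q[i][j] * qty[j] for j in range(p)) for i in range(n)]
    rss = sum((y[i] - fitted[i]) ** 2 for i in range(n))
    sigma = math.sqrt(rss / (n - p))
    # R is upper triangular, so the last coefficient needs no back substitution
    # and its variance is sigma^2 / R[p-1][p-1]^2.
    r_last = R[p - 1][p - 1]
    return qty[p - 1] / r_last, sigma / r_last

# Hypothetical genotypes coded 0, 1, 2 and a made-up quantitative trait.
snp_a = [0, 1, 2, 0, 1, 2, 0, 1, 2]
snp_b = [0, 0, 0, 1, 1, 1, 2, 2, 2]
noise = [0.1, -0.2, 0.05, 0.0, 0.15, -0.1, -0.05, 0.2, -0.15]
y = [1 + 2 * a + 3 * b + 0.5 * a * b + e for a, b, e in zip(snp_a, snp_b, noise)]
beta, se = interaction_test(y, snp_a, snp_b)
print(f"interaction estimate {beta:.3f}, standard error {se:.3f}")
```

Placing the interaction term in the last column is a design choice of this sketch: with an upper triangular R, both the estimate and its standard error then depend only on the last diagonal element of R.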

  • An erratum in the version setting of smpFastEpistasis 2.02 prevents the use of postFastEpistasis; this has been corrected in version 2.03.
    If you have large data, please contact me and I shall put up a small corrective binary; otherwise rerun using v2.03.
  • A bug was spotted in version 2.01; if you are using that binary package, please upgrade to 2.03.
  • Version 2.x
    • FastEpistasis requires phenotype values to appear in the same order as the family data. It should warn about potential errors.
  • Version 1.x
    • For the time being FastEpistasis uses the size_t datatype, hence output data cannot be passed from a 32-bit architecture to a 64-bit architecture.
      Furthermore, the application is designed to run on Intel and AMD processors; running it on big-endian processors will fail.
    • FastEpistasis requires phenotype values to appear in the same order as the family data. As of version 1.07 it should warn about potential errors.

Documentation

Although it relies on the Plink binary data format, FastEpistasis has been designed to match the Plink command line options for epistasis as closely as possible. There are nevertheless slight differences that may bother you. The purpose here is to gather some common problems and to explain a full procedure based on the data used for benchmarking. In any case, users should run the executables with the --help option to get an overview of the possibilities together with a brief description, especially as the following does NOT cover all possibilities but focuses on the basics and the mandatory elements.
For this tutorial we will need the HapMap3 (~750 MB) data file, more precisely the files hapmap3_r1_b36_fwd.MKK.qc.poly.recode.map and hapmap3_r1_b36_fwd.MKK.qc.poly.recode.ped contained in it. A working Plink executable is also required.
A run of FastEpistasis consists of several stages, each having its own executable (or several in the compute stage, to differentiate the MPI and SMP computer architectures).
  1. The first stage gathers the data into a single binary file that holds all the information required later on in a compact format. Most issues arise here, as the user may run into an unrecognized format. Similar to Plink, PreFastEpistasis has several mandatory arguments that point to the different files holding the data:
    • the individual relations: Plink .fam file
    • the overall genotype data: Plink .ped or .bed file
    • the SNPs information: Plink .map or .bim file.
    • the phenotype(s) file
    • the SNPs set file
    It is nevertheless worth mentioning that PreFastEpistasis imports Plink data in Plink's binary format, that is .bed and .bim files rather than .ped and .map files. Furthermore, the family file is not always present but may be generated by Plink.
    Hence one has to run Plink to generate the .fam file and the binary versions of the .map and .ped files. In our example, this is performed by the following commands:
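A minimal sketch of such a conversion, assuming the standard Plink options --file (text .ped/.map prefix), --make-bed (write .bed/.bim/.fam) and --out (output prefix). The helper name plink_make_bed_command is hypothetical, and the author's exact command line may have differed; the block only builds and prints the command, running it is left commented out.

```python
import shlex

def plink_make_bed_command(text_prefix, binary_prefix, plink="plink"):
    """Build the argument list converting <text_prefix>.ped/.map into
    <binary_prefix>.bed/.bim/.fam. 'plink' is assumed to be on the PATH."""
    return [plink, "--file", text_prefix, "--make-bed", "--out", binary_prefix]

cmd = plink_make_bed_command(
    "hapmap3_r1_b36_fwd.MKK.qc.poly.recode",  # input prefix named in this tutorial
    "FastEpistasis.data.MKK",                 # output prefix named in this tutorial
)
print(" ".join(shlex.quote(part) for part in cmd))

# To actually run the conversion (requires a working Plink executable):
# import subprocess
# subprocess.run(cmd, check=True)
```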

    You should now have generated the binary files FastEpistasis.data.MKK.bed and FastEpistasis.data.MKK.bim as well as the family file FastEpistasis.data.MKK.fam.
    Before one can run PreFastEpistasis, a phenotype file and a set file must be created.
    The phenotype file is space-separated text and contains 2+N columns and M+1 rows, where N is the number of phenotypes and M the number of individuals. As of version 2.0, there may not be more than 64 phenotypes due to the algorithm used in the detection of missing values. The header row should contain the column titles, that is "fid" and "iid" for columns 1 and 2, followed by the desired phenotype names (names longer than 16 characters will be truncated).
    PreFastEpistasis assumes the row order to be exactly the same as in the family data (.fam file); any change should trigger an error!
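The layout rules above can be checked before running PreFastEpistasis. The following is only a sketch: check_phenotype_file is a hypothetical helper, the family and individual identifiers are made up, and the comparison against the (fid, iid) pairs of the family data assumes that the ordering requirement concerns individuals.

```python
def check_phenotype_file(lines, fam_ids, max_phenotypes=64, max_name_len=16):
    """Validate phenotype file lines against the described layout.

    lines   : text lines of the phenotype file (header first)
    fam_ids : list of (fid, iid) tuples in family-file order
    Returns the phenotype names, truncated to max_name_len characters.
    """
    rows = [line.split() for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    if header[:2] != ["fid", "iid"]:
        raise ValueError('first two column titles must be "fid" and "iid"')
    names = [name[:max_name_len] for name in header[2:]]
    if not 1 <= len(names) <= max_phenotypes:
        raise ValueError("between 1 and %d phenotypes expected" % max_phenotypes)
    if len(body) != len(fam_ids):
        raise ValueError("one row per individual expected")
    for number, (row, ids) in enumerate(zip(body, fam_ids), start=2):
        if len(row) != len(header):
            raise ValueError("line %d: wrong number of columns" % number)
        if (row[0], row[1]) != ids:
            raise ValueError("line %d: individual order differs" % number)
    return names

# Hypothetical two-individual, two-phenotype file.
example = [
    "fid iid height a_very_long_phenotype_name",
    "F1 I1 1.70 0.3",
    "F2 I2 1.62 0.8",
]
print(check_phenotype_file(example, [("F1", "I1"), ("F2", "I2")]))
```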
    Here is an example used in the benchmarks.

    The set file is text based as well and should bear a unique SNP name or keyword per line. Recognized keywords are SET_A, SET_B and END. It is important to keep in mind that PreFastEpistasis searches for the best SNP pair correlations by taking each element of set A with respect to each element of set B. Consequently one has to provide two sets even though they may be identical. There are three possibilities, each having its own dedicated computational phase for optimization:
    • A = B, yielding a so-called pure interaction run where symmetrical pairs are taken into account and performed only once. A reduction over the best match of each thread is then necessary at the end of the computation phase.
    • A and B have shared SNPs, hence some interactions will be performed twice in order to avoid any post reduction. One might suffer a great performance penalty if there are many shared SNPs.
    • A and B are disjoint; no reduction is necessary as the best pairs may be allocated to and owned by only one working thread.
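The three cases can be told apart with a few set operations. The sketch below is illustrative only: classify_sets is a hypothetical helper, and the pair counts assume that in a pure run each unordered pair of distinct SNPs is tested once, while otherwise every element of A is paired with every element of B.

```python
def classify_sets(set_a, set_b):
    """Return (case, number of pair tests, number of pairs tested twice)."""
    a, b = set(set_a), set(set_b)
    if a == b:
        n = len(a)
        # Symmetrical pairs are performed only once.
        return "pure", n * (n - 1) // 2, 0
    shared = len(a & b)
    case = "overlapping" if shared else "disjoint"
    # A pair is visited twice when both of its SNPs belong to A and to B.
    return case, len(a) * len(b), shared * (shared - 1) // 2

print(classify_sets(["rs1", "rs2", "rs3"], ["rs3", "rs2", "rs1"]))
print(classify_sets(["rs1", "rs2", "rs3"], ["rs2", "rs3", "rs4"]))
print(classify_sets(["rs1"], ["rs2", "rs3"]))
```

The SNP names are made up; the duplicate count makes visible why many shared SNPs carry a performance penalty in the second case.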

    Here is an example of a set file used for the pure interaction benchmarks.
    Finally, generating the binary file FastEpistasis.data.MKK.bin for the stage 2 process is a matter of running PreFastEpistasis with all the prior files, that is
    which should output

    Note the last line, where default values were used. The EPI 1 default is 0.0001, hence p-values below this threshold will be both counted and stored, whereas EPI 2 sets the threshold (0.01) below which values are only counted. mBB, mAB and mAA are the continuous values assigned to the genotypes in the fit procedure.
    While mBB, mAB and mAA can only be modified at this stage, the EPI thresholds can also be modified in stage 2 using the appropriate option (use --help to see the options). This feature was added in version 1.07.
  2. Once the data has been compacted into a binary file by PreFastEpistasis, running the search for the best pairs is as simple as choosing the architecture, namely SMP for shared memory processors or MPI for the Message Passing Interface implementation. For the SMP version the command is
    which sets the EPI1 threshold so that nothing is stored and uses method 4 (SSE3 switch QR) to compute the fit. It outputs
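The role of the two thresholds can be summarized in a few lines. This is a sketch under assumptions: apply_epi_thresholds is a hypothetical name, the p-values are made up, and whether the comparison is strict at the boundary is not stated here.

```python
def apply_epi_thresholds(p_values, epi1=0.0001, epi2=0.01):
    """Return (stored p-values, number of counted tests).

    Tests below epi2 are counted; tests below epi1 are additionally stored.
    Strict comparisons are an assumption of this sketch.
    """
    stored = [p for p in p_values if p < epi1]
    counted = sum(1 for p in p_values if p < epi2)
    return stored, counted

made_up = [0.5, 0.005, 0.00005]
print(apply_epi_thresholds(made_up))            # default thresholds
print(apply_epi_thresholds(made_up, epi1=0.0))  # nothing is stored
```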

    as well as in this case 8 files named FastEpistasis.data.MKK.epi.qt.lm_XXX.bin where XXX runs from 0 to 7 (empty here as EPI1 is null), an index file FastEpistasis.data.MKK.epi.qt.lm.idx and a summary file FastEpistasis.data.MKK.epi.qt.lm.summary. Only the summary file is text, the rest is binary and holds data required for post processing (see stage 3).

    Unless one wants to perform statistical analysis on the overall data, the summary file holds all the information about the best pairs.

Reference

FastEpistasis: A high performance computing solution for quantitative trait epistasis. Schüpbach T, Xenarios I, Bergmann S, Kapur K. Bioinformatics. 2010 Apr 7.

Installation

Download

Source

v1.05  Initial version (not recommended; use 1.06 instead).
v1.06  A few optimizations; most modifications were made to ease installation.
v1.07  Other BLAS/LAPACK libraries are easier to provide. Warns that missing phenotype values are not handled and checks the ordering between input data files.
v2.01  The linear fit algorithm has changed to avoid the use of external BLAS/LAPACK libraries. In addition, weights are now used to limit the matrix size, providing a significant performance boost.
       Data storage has been shrunk to its minimum to reduce the IO bottleneck. Moreover, the new format is independent of the x86 or x86_64 architecture.
       The code now compiles on big-endian architectures; it ran on an IBM BlueGene/P performing hundreds of millions of interactions per second!
       Features were added to the pre- stage to account for missing phenotypes: the affected individuals are now removed from the population. Some issues remain on how to deal with multiple phenotypes; the choice was made to stop with a warning when an individual has only some phenotypes missing.
       The SMP version bears additional features not yet implemented in the MPI version. The QR algorithm has been implemented in several ways using SSE intrinsics, selected with the --method option. We measured the best performance using --method 4. Note that this version is superseded by 2.02 due to a bug in single phenotype data reading which caused every second value to be zero!
v2.03  Corrected version of 2.01 and 2.02.
       The SMP version has an added feature to run epistatic tests on binary trait data. So far this feature is triggered when no phenotype file is provided and requires the phenotype data within the standard Plink files to be coded as 1 for unaffected and 2 for affected. We would like to emphasize that this is still under heavy development and should be considered BETA software.
       The MPI version now accepts command line options to alter the thresholds, hence there is no need to rerun preFastEpistasis for that purpose. Furthermore, SSE intrinsics functions are now provided as in the SMP version and can be selected through the --method option.
       Do not attempt to run the MPI version on multiple phenotypes. This is not yet ready.

Binaries

v1.08  Static build of Pre/SMP/Post FastEpistasis. Version 1.08 is just 1.07 modified to be statically linked.
v2.03  Static build of Pre/SMP/Post FastEpistasis.

Dependencies

  • The whole compiling process uses Cmake version >= 2.4
  • For versions < 2.01, the core computation applications depend on a Lapack library (Intel© Math Kernel Library, ATLAS or others) and make direct calls to
    • ddot
    • dgeqrf
    • dormqr
  • PostFastEpistasis requires the GNU Scientific Library for
    • gsl_cdf_chisq_Q
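For orientation, gsl_cdf_chisq_Q(x, nu) returns the upper tail probability of a chi-square distribution with nu degrees of freedom. The sketch below shows the special case of one degree of freedom with the Python standard library only. That postFastEpistasis refers the squared ratio of the interaction coefficient to its standard error to one degree of freedom is an assumption made here by analogy with the PLINK epistasis test; the function names are hypothetical.

```python
import math

def chisq_upper_tail_1df(x):
    """Upper tail Q(x) of a chi-square with 1 degree of freedom.
    For 1 df this equals erfc(sqrt(x / 2))."""
    if x < 0:
        raise ValueError("chi-square statistic must be non-negative")
    return math.erfc(math.sqrt(x / 2.0))

def interaction_p_value(beta, se):
    """p-value of the interaction term, ASSUMING a 1 df chi-square test
    on (beta / se)^2; beta and se are the estimate and its standard error."""
    return chisq_upper_tail_1df((beta / se) ** 2)

# Made-up estimate and standard error.
print(interaction_p_value(0.5, 0.2))
```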

Howto

  1. Decompress the FastEpistasis-x.xx.tar.gz file
  2. Go into FastEpistasis/Build directory
  3. Create an environment variable CC with the compiler to use, Intel C Compiler here (icc)
  4. Creation of the makefiles is performed by Cmake which tries to find all dependencies.

    Nevertheless it may not always succeed, so the user can provide or override parameters to help. This is done by adding "-D parameter=value" to the cmake command.

    • Global Cmake options
      Although many options exist (see the Cmake documentation), of primary importance are
      • CMAKE_INSTALL_PREFIX to change the default installation path.
      • CMAKE_BUILD_TYPE to change the build type; possible values are Debug, Release, RelWithDebInfo
    • FastEpistasis specific options
      Global options
      Option name | Description | Dependencies | Default | As of version
      COMPILE_PRE_EPISTASIS | Triggers compilation of preFastEpistasis | - | ON | -
      COMPILE_SMP_EPISTASIS | Triggers compilation of smpFastEpistasis | - | ON | -
      COMPILE_MPI_EPISTASIS | Triggers compilation of mpiFastEpistasis | - | OFF | -
      COMPILE_POST_EPISTASIS | Triggers compilation of postFastEpistasis | - | ON | -
      UNICODE | Uses unicode characters for boxes; requires a suitable terminal or text viewer, otherwise quite unreadable | - | ON | -
      ADD_RPATH | Hard codes the link path into the executables; useful if multiple libraries or compilers exist | - | OFF | -
      USE_AFFINITY | Adds thread affinity settings to smpFastEpistasis: optional keywords that give users a way to specify which core to run on and how many threads to use | COMPILE_SMP_EPISTASIS | OFF | -
      USE_MKL | Triggers the search for the Intel Math Kernel Library. If Cmake fails to find MKL, export the MKL_HOME environment variable to help. | COMPILE_SMP_EPISTASIS or COMPILE_MPI_EPISTASIS | ON | > 2.0 deprecated
      STANDALONE | Triggers a static build of the executables. | - | OFF | > 2.0
      NO_PERFECT_MATCH | Perfect fits will not be accounted for. Allowing a perfect fit results in an incredibly high p-value integral boundary due to numerical errors arising in the subtraction of (almost) identical values and their reciprocals. This is harmless but spoils the summary best match data. | - | OFF | > 2.0

      MPI option when COMPILE_MPI_EPISTASIS=ON
      Option name | Description
      MPI_COMPILER | Full path to the MPI C compiler
  5. Compile

  6. Install compiled applications

Manually specifying BLAS/LAPACK libraries for versions prior to 2.0

BLAS / LAPACK options when USE_MKL=OFF
Option name | Description
BLAS_LIBS | Name of the BLAS library, either full path or just "-lblasname"
BLAS_PATH | Path to the directory containing the above library
LAPACK64_LIBS | Name of the LAPACK library (double precision), either full path or just "-llapackname"
LAPACK64_PATH | Path to the directory containing the above library


  • ATLAS BLAS and ATLAS LAPACK
    Assuming the ATLAS BLAS is installed in directory /usr/lib64/blas/atlas and ATLAS LAPACK in /usr/lib64/lapack/atlas, then perform
    Note that you do not want ATLAS BLAS to be built threaded, otherwise the threads of smpFastEpistasis will compete with those of ATLAS BLAS.
  • Intel Math Kernel Library (MKL)
    Assuming we want the x86_64 (em64t) version of MKL installed in directory /opt/intel/mkl90/lib/em64t

The FastEpistasis program suite is written by Thierry Schuepbach.
