FaST-LMM & PySnpTools

Project Home & Bibliography 

Established: October 14, 2006
Last Update: August 11, 2023

FaST-LMM

FaST-LMM, which stands for Factored Spectrally Transformed Linear Mixed Models, is a program for performing genome-wide association studies (GWAS) on datasets of all sizes, up to one million samples.

Learn more about Python FaST-LMM and install from:

FaST-LMM runs on Python 3.8, 3.9, 3.10, & 3.11. It runs on x64 on Linux, Windows, and Mac.

An older C++ version, including a Windows binary, a Linux binary, and source, supports univariate GWAS and limited epistatic testing.

PySnpTools

PySnpTools is a Python library for reading and manipulating genetic data. It efficiently reads genetic PLINK formats (including *.bed/bim/fam files) and the BGEN format. It also efficiently reads parts of files, reads kernel data, standardizes data, manipulates data in-memory, and scales to cluster-sized data.
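As an illustration of what "standardizes data" can mean for genotype data, here is a minimal pure-Python sketch of per-SNP unit standardization. It is not PySnpTools code and does not use its API. The function name `unit_standardize`, the use of `None` for missing calls, and the choice of the population standard deviation are assumptions made for this example.

```python
import math


def unit_standardize(column):
    """Scale one SNP's allele counts to mean 0 and standard deviation 1.

    Hypothetical helper for illustration only. Missing calls are given as
    None, ignored when computing the statistics, and set to 0.0 (the mean
    after standardization) in the output.
    """
    observed = [x for x in column if x is not None]
    if not observed:
        return [0.0] * len(column)
    mean = sum(observed) / len(observed)
    # Assumption: population variance (divide by the number of observed calls).
    variance = sum((x - mean) ** 2 for x in observed) / len(observed)
    std = math.sqrt(variance)
    if std == 0.0:
        # A SNP with no variation carries no signal; map every call to 0.0.
        return [0.0] * len(column)
    return [0.0 if x is None else (x - mean) / std for x in column]


# One SNP across five individuals, with one missing call.
snp = [0, 1, 2, None, 1]
standardized = unit_standardize(snp)
print(standardized)
```

Filling missing calls with the post-standardization mean keeps them from contributing to a similarity matrix built from the standardized values.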

PySnpTools runs on Python 3.8, 3.9, 3.10 & 3.11. It runs on x64 on Linux and Windows. On Mac, it runs on both x64 and ARM.

Learn more about PySnpTools and install from:

bed-reader

Read and write the PLINK BED format, simply and efficiently. Available for Python or Rust.
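For readers unfamiliar with the format, the following is a minimal sketch of how SNP-major PLINK 1 BED bytes decode into allele counts, using only the Python standard library. It is not bed-reader's implementation or API. The names `decode_bed` and `CODE_TO_A1_COUNT` are hypothetical, and the sample and variant counts, which normally come from the .fam and .bim files, are passed in directly.

```python
import math

# PLINK 1 magic number (0x6C, 0x1B) followed by the SNP-major flag (0x01).
BED_MAGIC = bytes([0x6C, 0x1B, 0x01])

# 2-bit genotype code -> count of the first (A1) allele; None marks a missing call.
CODE_TO_A1_COUNT = {0b00: 2, 0b01: None, 0b10: 1, 0b11: 0}


def decode_bed(data, n_samples, n_variants):
    """Decode SNP-major BED bytes into one list of A1 counts per variant."""
    if data[:3] != BED_MAGIC:
        raise ValueError("not a SNP-major PLINK 1 BED file")
    bytes_per_variant = math.ceil(n_samples / 4)  # four genotypes per byte
    expected = 3 + bytes_per_variant * n_variants
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    variants = []
    for v in range(n_variants):
        start = 3 + v * bytes_per_variant
        block = data[start:start + bytes_per_variant]
        calls = []
        for s in range(n_samples):
            # The low-order bit pair of each byte holds the earliest sample.
            code = (block[s // 4] >> (2 * (s % 4))) & 0b11
            calls.append(CODE_TO_A1_COUNT[code])
        variants.append(calls)
    return variants


# Two variants, five samples, so each variant occupies two bytes.
example = BED_MAGIC + bytes([0b11100100, 0b00000010, 0b00000000, 0b00000011])
genotypes = decode_bed(example, n_samples=5, n_variants=2)
print(genotypes)  # [[2, None, 1, 0, 1], [2, 2, 2, 2, 0]]
```

Because every variant occupies a fixed number of bytes, a reader can seek directly to any variant without scanning the rest of the file, which is what makes reading parts of a BED file cheap.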

Learn more about bed-reader and install from:

  • Python: PyPI or GitHub (Python 3.8, 3.9, 3.10, and 3.11. Linux [x64], Windows [x64], & Mac [x64 and ARM])

  • Rust: crates.io


Contact


Full Annotated Bibliography

Univariate GWAS

  • [U1] H. Kang, N. Zaitlen, C. Wade, A. Kirby, D. Heckerman, M. Daly, and E. Eskin, Efficient Control of Population Structure in Model Organism Association Mapping, Genetics, 178:1709-1723, March, 2008 (doi: 10.1534/genetics.107.080101).
    • Describes early efforts to make linear mixed models more computationally efficient.
  • [U2] C. Lippert*, J. Listgarten*, Y. Liu, C.M. Kadie, R.I. Davidson, D. Heckerman*. FaST linear mixed models for genome-wide association studies. Nature Methods, 8: 833-835, Oct 2011 (doi:10.1038/nmeth.1681). (*equal contributions)
    • Shows how exact linear-mixed-model computations can be performed in time and memory linear in the number of individuals when the number of SNPs used in the similarity matrix is less than the number of individuals (i.e., when the similarity matrix is low rank). This work also describes an approach to select SNPs to achieve this condition with linkage-disequilibrium-based pruning. In addition, this work shows that computations are quadratic in time and memory when the similarity matrix is full rank.
  • [U3] J. Listgarten*, C. Lippert*, C.M. Kadie, R.I. Davidson, E. Eskin, D. Heckerman*. Improved linear mixed models for genome-wide association studies. Nature Methods, 9: 525-526, June 2012 (doi:10.1038/nmeth.2037). (*equal contributions)
    • Describes a method for selecting SNPs for the linear-mixed-model similarity matrix by identifying SNPs that are predictive of the phenotype. A later publication [U6] shows this approach yields poor control of type I error, whereas the original selection method in [U2] performs well. This work also shows that the inclusion of irrelevant SNPs in the similarity matrix leads to inflated test statistics and reduced power, a phenomenon called “dilution”. Although an incorrect explanation for dilution is offered here, a correction is given in [U5]. Finally, there is a bug in the analysis of the synthetic data, which makes the prediction-based selection method appear to perform better than it actually does.
  • [U4] J. Listgarten*, C. Lippert*, D. Heckerman*. FaST-LMM-Select for addressing confounding from spatial structure and rare variants. Nature Genetics, 2013 (doi:10.1038/ng.2620). (*equal contributions)
    • Shows how the feature-selection method in [U3] addresses an open problem in statistical genetics that had been published in Nature Genetics. Based on results in [U6], however, we recommend that the selection approach in [U2] be used instead.
  • [U5] C. Lippert*, Gerald Quon, Eun Yong Kang, Carl M. Kadie, J. Listgarten*, D. Heckerman*. The benefits of selecting phenotype-specific variants for applications of mixed models in genomics. Scientific Reports, 2013 (doi:10.1038/srep01815). (*equal contributions)
    • Describes additional experiments regarding the feature-selection method in [U3] as applied to GWAS and prediction. Again, based on the results in [U6], we recommend that the selection approach in [U2] be used instead.
  • [U6] C. Widmer*, C. Lippert*, O. Weissbrod, N. Fusi, C.M. Kadie, R.I. Davidson, J. Listgarten, and D. Heckerman*. Further Improvements to Linear Mixed Models for Genome-Wide Association Studies. Scientific Reports, 4, 6874, Nov 2014 (doi:10.1038/srep06874). (*equal contributions)
    • Describes the latest version of FaST-LMM. It shows that selecting SNPs for the linear-mixed-model similarity matrix through pruning via linkage disequilibrium (as in [U2]) works well to control type I error, whereas selecting SNPs that are predictive of the phenotype (as in [U3]) does not.
  • [U7] C. Lippert and D. Heckerman. Computational and statistical issues in personalized medicine. XRDS 21, 24-27, Summer 2015 (doi:10.1145/2788502).
    • Describes statistical issues in GWAS with linear mixed models from a graphical-model perspective.
  • [U8] C. Kadie and D. Heckerman. Ludicrous Speed Linear Mixed Models for Genome-Wide Association Studies. bioRxiv, Jan 2018.
    • Shows how to scale the FaST-LMM in [U2] to 1 million samples on a cluster.
  • [U9] D. Heckerman. Toward accounting for hidden common causes when inferring cause and effect from observational data. ACM Transactions on Intelligent Systems and Technology, 10, Sept 2019 (doi:10.1145/3309720).
    • Describes how linear mixed models account for a hidden confounder by aggregating small observed signals that reveal the confounder.
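
The spectral transformation that gives FaST-LMM its name, and that underlies the complexity results summarized for [U2], can be sketched as follows. The notation is assumed here for illustration and does not appear on this page: y is the phenotype vector for n individuals, X the covariate matrix, K the n-by-n similarity matrix with spectral decomposition K = U S Uᵀ, σ_g² and σ_e² the genetic and residual variances, and δ their ratio.

```latex
% Sketch under assumed notation; see [U2] for the authoritative derivation.
\begin{align}
y &\sim \mathcal{N}\!\left(X\beta,\; \sigma_g^2 K + \sigma_e^2 I\right),
  \qquad K = U S U^{\mathsf T}, \qquad \delta = \sigma_e^2 / \sigma_g^2 \\
U^{\mathsf T} y &\sim \mathcal{N}\!\left(U^{\mathsf T} X \beta,\; \sigma_g^2 \,(S + \delta I)\right) \\
\log L(\sigma_g^2, \delta, \beta) &= -\frac{1}{2}\left[\, n \log\!\left(2\pi\sigma_g^2\right)
  + \sum_{i=1}^{n} \log\!\left(S_{ii} + \delta\right)
  + \frac{1}{\sigma_g^2} \sum_{i=1}^{n}
    \frac{\left([U^{\mathsf T} y]_i - [U^{\mathsf T} X]_{i:}\,\beta\right)^2}{S_{ii} + \delta} \right]
\end{align}
```

After rotating by Uᵀ the covariance is diagonal, so once the decomposition is available each likelihood evaluation for a candidate δ is a sum over individuals rather than a fresh matrix inversion. When the similarity matrix is built from fewer SNPs than individuals, only that many eigenvalues are nonzero, which is the low-rank setting described in [U2].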

    Set Tests for GWAS

    • [S1] J. Listgarten*, C. Lippert*, Eun Yong Kang, Jing Xiang, Carl M. Kadie, D. Heckerman*. A powerful and efficient set test for genetic markers that handles confounders. Bioinformatics, 29:1526-1533, April 2013 (doi:10.1093/bioinformatics/btt177). (*equal contributions)
      • Shows that the likelihood ratio test (LRT) can be more powerful than a score test for set association tests. This work is limited to similarity matrices that are low rank and includes an efficient algorithm for this case. This limitation is relaxed in [S2].
    • [S2] C. Lippert, Jing Xiang, Danilo Horta, Christian Widmer, Carl M. Kadie, D. Heckerman*, J. Listgarten. Greater power and computational efficiency for kernel-based association testing of sets of genetic variants. Bioinformatics, 2014 (doi:10.1093/bioinformatics/btu504). (*corresponding author)
      • Makes theoretical arguments and demonstrates empirically that the LRT is often more powerful than the traditionally used score test (e.g., SKAT). It also explains how to perform several of the algebraic computations for set tests efficiently with either a low-rank or a full-rank background kernel.

    Data Transformations/Pre-processing for GWAS

    Epigenetic Cellular Heterogeneity Correction

    • [C1] J. Zou, C. Lippert, D. Heckerman, M. Aryee, Jennifer Listgarten. Epigenome-wide association studies without the need for cell-type composition. Nature Methods, doi:10.1038/NMETH.2815.
      • Shows how FaST-LMM, with the inclusion of principal components (PCs) as covariates, can correct for the confounding effects of multiple cell types. Although a method for selecting PCs is presented here, the method in [U6] is now recommended.

    Epistatic Genome-Wide Association

    • [E1] C. Lippert*, J. Listgarten*, Robert Davidson, Scott Baxter, Hoifung Poon, Carl M. Kadie, D. Heckerman*. An Exhaustive Epistatic SNP Association Analysis on Expanded Wellcome Trust Data. Scientific Reports, 2013 (doi:10.1038/srep01099). (*equal contributions)
      • Presents results for all pairwise epistatic tests for all phenotypes in the WTCCC1 data, using a linear mixed model with a low-rank similarity matrix based on the feature-selection method in [U3]. As described above, based on the results in [U6], we now recommend that the selection method in [U2] be used instead.

    GWAS for “Functional Traits” such as Longitudinal Traits

    • [F1] N. Fusi and J. Listgarten. Leveraging Non-Linear Genetic Effects on Functional Traits for GWAS. Proceedings of RECOMB 2016.
      • Introduces a model for performing GWAS for vector-valued traits that vary smoothly in time. The framework is expressive and computationally efficient, but the null model is not nested inside the alternative model, a limitation we are addressing in ongoing work.

    Heritability Estimation

    • [H1] N. Furlotte, D. Heckerman, and C. Lippert. Quantifying the uncertainty in heritability. Journal of Human Genetics 27, March 2014 (doi:10.1038/jhg.2014.15).
      • Applies the spectral-decomposition trick from FaST-LMM [U2] to speed up Bayesian estimates of heritability.
    • [H2] D. Heckerman, D. Gurdasani, C. Kadie, C. Pomilla, T. Carstensen, H. Martin, K. Ekoru, R.N. Nsubuga, G. Ssenyomo, A. Kamali, P. Kaleebu, C. Widmer, and M.S. Sandhu. Linear mixed model for heritability estimation that explicitly addresses environmental variation. PNAS, 113: 7377–7382 (doi:10.1073/pnas.1510497113).
      • Describes a way to generalize linear mixed models to take spatial location into account when jointly modeling the influences of genomics and environment on traits.