fastlmm Documentation

FaST-LMM, which stands for Factored Spectrally Transformed Linear Mixed Models, is a program for performing both single-SNP and SNP-set genome-wide association studies (GWAS) on extremely large data sets.

See FaST-LMM’s README.md for installation instructions, documentation, code, and a bibliography.

New:

single_snp() now supports effect size, multiple phenotypes, and related features (see the notebook demonstrating the new features).

single_snp

fastlmm.association.single_snp(test_snps, pheno, K0=None, K1=None, mixing=None, covar=None, covar_by_chrom=None, leave_out_one_chrom=True, output_file_name=None, h2=None, log_delta=None, show_snp_fract_var_exp=False, cache_file=None, GB_goal=None, interact_with_snp=None, force_full_rank=False, force_low_rank=False, G0=None, G1=None, runner=None, map_reduce_outer=True, pvalue_threshold=None, random_threshold=None, random_seed=0, xp=None, count_A1=None)

Function performing single SNP GWAS using cross validation over the chromosomes and REML. Will reorder and intersect IIDs as needed. (For backwards compatibility, you may use ‘leave_out_one_chrom=False’ to skip cross validation, but that is not recommended.)

Parameters:
  • test_snps – SNPs to test. Can be any SnpReader. If you give a string, it should be the base name of a set of PLINK Bed-formatted files. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • pheno

    One or more phenotypes: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. Any individual missing all phenotype values will be removed. If more than one phenotype is given, then K1 (the second kernel) cannot be given. (For backwards compatibility, can also be a dictionary with keys ‘vals’, ‘iid’, ‘header’.)

  • K0

    similarity matrix or SNPs from which to create a similarity matrix. If not given, will use test_snps. Can be any SnpReader. If you give a string, it should be the base name of a set of PLINK Bed-formatted files. If leave_out_one_chrom is True, can be a dictionary from chromosome number to any KernelReader or the name of a KernelNpz-formatted file. If leave_out_one_chrom is False, can be any KernelReader or the name of a KernelNpz-formatted file.

  • K1 – second similarity matrix or SNPs from which to create a second similarity matrix, optional. (Also, see ‘mixing’).

  • mixing (number) – Weight between 0.0 (inclusive, default) and 1.0 (inclusive) given to K1 relative to K0. If you give no mixing number and a K1 is given, the best weight will be learned.

  • covar

    covariate information, optional: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. Missing values are not supported. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • covar_by_chrom

    dictionary from chromosome number to covariate information, optional: The covariate information can be any SnpReader. If given, leave_out_one_chrom must be True. Both covar and covar_by_chrom can be given. Missing values are not supported.

  • leave_out_one_chrom (boolean) – Perform single SNP GWAS via cross validation over the chromosomes. Defaults to True. (Warning: setting False can cause proximal contamination.)

  • output_file_name (file name) – Name of file to write results to, optional. If not given, no output file will be created. The output format is tab-delimited text.

  • h2 (number) – A parameter to LMM learning, optional. If not given, will search for the best value. If mixing is unspecified, then h2 must also be unspecified.

  • log_delta (number) – a re-parameterization of h2 provided for backwards compatibility. h2 is 1./(exp(log_delta)+1)

  • show_snp_fract_var_exp (boolean) – The output will always show ‘EffectSize’. If show_snp_fract_var_exp is True, the output will also show ‘SnpFractVarExp’. EffectSize and SnpFractVarExp are different measures of the fraction of variance explained. The first uses all the variance of the phenotype in the denominator, whereas the second uses the variance of the phenotype after removing the effects of population structure and family relatedness. Specifically, EffectSize is SNPWeight^2 * “raw” test SNP variance / phenotype variance, where SNPWeight is the beta for the “standardized” test SNP (“standardized” meaning made to have mean 0 and standard deviation 1), “raw” means the original test SNP values of 0, 1, 2 (the count of allele 1 or 2; it doesn’t matter which), and the phenotype’s original values are used (so they are not, for example, changed by regressing out the covariates).

  • cache_file (file name) – Name of file to read or write cached precomputation values to, optional. If not given, no cache file will be used. If given and file does not exist, will write precomputation values to file. If given and file does exist, will read precomputation values from file. The file contains the S and U from the decomposition of the training matrix and other values. It is in Python’s np.savez (*.npz) format. Calls using the same cache file should have the same inputs (pheno, K0, K1, covar) but test_snps can differ.

  • GB_goal (number) – gigabytes of memory the run should use, optional. If not given, will read the test_snps in blocks the same size as the kernel, which is memory efficient with little overhead on computation time.

  • interact_with_snp – index of a covariate with which to perform an interaction test (interact_with_snp x snp will be tested). Default: None.

  • force_full_rank (Boolean) – Even if kernels are defined with fewer SNPs than IIDs, create an explicit iid_count x iid_count kernel. Cannot be True if force_low_rank is True.

  • force_low_rank (Boolean) – Even if kernels are defined with fewer IIDs than SNPs, create a low-rank iid_count x sid_count kernel. Cannot be True if force_full_rank is True.

  • G0 (Same as K0.) – Same as K0. Provided for backwards compatibility. Cannot be given if K0 is given.

  • G1 (Same as K1.) – Same as K1. Provided for backwards compatibility. Cannot be given if K1 is given.

  • runner – a Runner, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.

  • pvalue_threshold (number between 0 and 1) – All output rows with p-values less than this threshold will be included. By default, all rows are included. This is used to exclude rows with large p-values.

  • random_threshold (integer) – All potential output rows are assigned a random value. All rows with a random value less than this threshold will be included. By default, all rows are included. This is used to create a random sample of output rows.

  • random_seed – Seed used to assign random values to rows. Used with random_threshold.

  • map_reduce_outer (bool) – If true (default), divides work by chromosome. If false, divides test_snp work into chunks.

  • xp (string or Python module) – The array module to use (optional), for example, ‘numpy’ (normal CPU-based module) or ‘cupy’ (GPU-based module). If not given, will try to read from the ARRAY_MODULE environment variable. If not given and ARRAY_MODULE is not set, will use numpy. If ‘cupy’ is requested, will try to ‘import cupy’. If that import fails, will revert to numpy. If two kernels are given, will ignore this and use ‘numpy’

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

Pandas dataframe with one row per test SNP. Columns include “PValue”

Example:

>>> import logging
>>> from fastlmm.association import single_snp
>>> from fastlmm.util import example_file # Download and return local file name
>>> from pysnptools.snpreader import Bed
>>> logging.basicConfig(level=logging.INFO)
>>> pheno_fn = example_file("fastlmm/feature_selection/examples/toydata.phe")
>>> test_snps = example_file("fastlmm/feature_selection/examples/toydata.5chrom.*","*.bed")
>>> results_dataframe = single_snp(test_snps=test_snps, pheno=pheno_fn, count_A1=False)
>>> print(results_dataframe.iloc[0].SNP,round(results_dataframe.iloc[0].PValue,7),len(results_dataframe))
null_576 1e-07 10000
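
The pvalue_threshold and xp options documented above can be combined with the same call. The following is only a hedged, untested sketch: it reuses test_snps and pheno_fn from the example, and if cupy cannot be imported the function reverts to numpy.

>>> # Hedged sketch (not a verified doctest): keep only small p-values and request the GPU array module.
>>> small_rows = single_snp(test_snps=test_snps, pheno=pheno_fn, count_A1=False,
...                         pvalue_threshold=1e-5,  # only rows with PValue below 1e-5 are returned
...                         xp="cupy")              # reverts to numpy if cupy is unavailable
>>> small_rows.to_csv("single_snp_small_pvalues.tsv", sep="\t", index=False)  # hypothetical output path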

For more examples, see the notebook demonstrating the new features.

single_snp_scale

fastlmm.association.single_snp_scale(test_snps, pheno, G0=None, covar=None, cache=None, memory_factor=1, output_file_name=None, K0=None, runner=None, min_work_count=1, gtg_runner=None, gtg_min_work_count=None, svd_runner=None, postsvd_runner=None, postsvd_min_work_count=None, test_snps_runner=None, test_snps_min_work_count=None, count_A1=False, clear_local_lambda=None, force_python_only=False)

Function performing single SNP GWAS using REML and cross validation over the chromosomes. Will reorder and intersect IIDs as needed. It gives the same results as single_snp() but scales a little better on a single machine and has the ability to run on a cluster. (Cluster runs require appropriate modules for parameters cache and runner.)

To work with a G0 kernel larger than about 16K, requires a NumPy library that can do eigenvalue decompositions larger than 16K, for example, one that uses MKL ILP64. Alternatively, use an older version of FaST-LMM (version 0.5.*), which includes its own version of MKL ILP64.

Compared to single_snp(), single_snp_scale() always:

  • does cross validation of chromosomes (single_snp()’s leave_out_one_chrom=True)

  • uses a low-rank kernel only (single_snp()’s force_low_rank=True)

  • uses exactly one kernel constructed from SNPs

  • searches for the best h2. (single_snp()’s h2=None)

Parameters:
  • test_snps (a SnpReader or a string) – SNPs to test. Can be any SnpReader. If you give a string, it should be the name of a Bed-formatted file. For cluster runs, this and G0 should be readable from any node on the cluster, for example, by using DistributedBed format.

  • pheno (a SnpReader or a string) – One or more phenotypes: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the name of a Pheno-formatted file. If only one phenotype is given, any individual with missing phenotype data will be removed from processing. If multiple phenotypes are given, any missing data will raise an error. (Avoid this by removing or filling in missing values beforehand.)

  • G0 (SnpReader or a string) – SNPs from which to create a similarity matrix. Defaults to test_snps. Can be any SnpReader. If you give a string, it should be the name of a Bed-formatted file. For cluster runs, this and test_snps should be readable from any node on the cluster, for example, by using DistributedBed format.

  • covar (a SnpReader or a string) – covariate information, optional: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the name of a Pheno-formatted file.

  • cache (None or string or FileCache or dictionary from number to a FileCache) – Tells where to store intermediate results. Place the cache on an SSD drive for best performance. By default, the cache will be an automatically-erasing temporary directory. (If the TEMP environment variable is set, Python places the temporary directory under it.) A string can be given and will be interpreted as the path of a local directory to use as a cache. (The local directory will not be automatically erased and so must be user managed.) A FileCache instance can be given, which provides a way to specify a cluster-distributed cache. (FileCaches will not be automatically erased and must be user managed.) Finally, a dictionary from the numbers 0 to 22 (inclusive) to FileCache (or None or string) can be given. The dictionary specifies a cache for general work (0) and for every chromosome (1 to 22, inclusive); a sketch of building such a dictionary appears after the list of processing stages below.

  • memory_factor (number) – How much memory to use proportional to G0, optional. If not given, will assume that it can use memory about the same size as one copy of G0.

  • output_file_name (file name) – Name of file to write results to, optional. If not given, no output file will be created. The output format is tab-delimited text.

  • runner (Runner) – a Runner, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.

  • min_work_count (integer) – When running the work on a cluster, the minimum number of pieces in which to divide the work. Defaults to 1, which is usually fine.

  • gtg_runner (Runner) – the Runner to use instead of runner for the GtG stage of work. For an overview of the stages, see below.

  • gtg_min_work_count (integer) – the min_work_count to use instead of min_work_count on the GtG stage of work.

  • svd_runner (Runner) – the Runner to use instead of runner for the SVD stage of work.

  • postsvd_runner (Runner.) – the Runner to use instead of runner for the PostSVD stage of work.

  • postsvd_min_work_count (integer) – the min_work_count to use instead of min_work_count on the PostSVD stage of work.

  • test_snps_runner (a Runner.) – the Runner to use instead of runner for the TestSNPS stage of work.

  • test_snps_min_work_count (integer) – the min_work_count to use instead of min_work_count on the TestSNPS stage of work.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

  • clear_local_lambda (function or lambda) – A function to run in the middle of the PostSVD stage, typically to clear unneeded large files from the local file cache.

  • force_python_only – (Default: False) Skip faster C++ code. Used for debugging and testing.

Return type:

Pandas dataframe with one row per test SNP. Columns include “PValue”

Example:

>>> import logging
>>> from fastlmm.association import single_snp_scale
>>> from pysnptools.snpreader import Bed
>>> from fastlmm.util import example_file # Download and return local file name
>>> logging.basicConfig(level=logging.INFO)
>>> test_snps = Bed(example_file('tests/datasets/synth/all.*','*.bed'),count_A1=True)[:,::10] #use every 10th SNP
>>> pheno_fn = example_file("tests/datasets/synth/pheno_10_causals.txt")
>>> cov_fn = example_file("tests/datasets/synth/cov.txt")
>>> results_dataframe = single_snp_scale(test_snps=test_snps, pheno=pheno_fn, covar=cov_fn, count_A1=False)
-etc-
>>> print(results_dataframe.iloc[0].SNP,round(results_dataframe.iloc[0].PValue,7),len(results_dataframe))
snp1200_m0_.37m1_.36 0.0 500

The stages of processing are:

  • 0: G - Read G0 (the selected SNPs), standardize, regress out covariates, output G

  • 1: GtG - Compute GtG = G.T x G

  • 2: SVD - For each chromosome in the TestSNPs, extract the GtG for the chromosome, compute the singular value decomposition (SVD) on that GtG

  • 3: PostSVD - Transform the 22 GtG SVD into 22 G SVD results called U1 .. U22. Find h2, the importance of person-to-person similarity.

  • 4: TestSNPs - For each test SNP, read its data, regress out covariates, use the appropriate U and compute a Pvalue.

All stages cache intermediate results. If the results for a stage are found in the cache, that stage will be skipped.
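
As an illustration of the cache dictionary described above, the following hedged sketch builds a per-chromosome cache using PySnpTools’ LocalCache (a FileCache backed by a plain local directory); the SSD paths and the commented-out call are hypothetical.

>>> from pysnptools.util.filecache import LocalCache
>>> cache_dict = {0: LocalCache("/mnt/ssd/fastlmm_cache/general")}  # key 0: general work
>>> for chrom in range(1, 23):  # keys 1..22: per-chromosome work
...     cache_dict[chrom] = LocalCache(f"/mnt/ssd/fastlmm_cache/chrom{chrom}")
>>> # results_df = single_snp_scale(test_snps, pheno_fn, cache=cache_dict)  # hypothetical call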

single_snp_all_plus_select

fastlmm.association.single_snp_all_plus_select(test_snps, pheno, G=None, covar=None, k_list=None, n_folds=10, seed=0, output_file_name=None, GB_goal=None, force_full_rank=False, force_low_rank=False, mixing=None, h2=None, do_plot=False, runner=None, count_A1=None)

Function performing single SNP GWAS based on two kernels. The first kernel is based on all SNPs. The second kernel is a similarity matrix constructed from the top k SNPs, where the SNPs are ordered via the PValue from single_snp() and k is determined via out-of-sample prediction. All work is done via ‘leave_out_one_chrom’; that is, one chromosome is tested and the kernels are constructed from the other chromosomes. Will reorder and intersect IIDs as needed.

Parameters:
  • test_snps

    SNPs to test. Can be any SnpReader. If you give a string, it should be the base name of a set of PLINK Bed-formatted files. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • pheno

    A single phenotype: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. Any IIDs with missing values will be removed. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • G

    SNPs from which to create a similarity matrix of the top k SNPs. If not given, will use test_snps. Can be any SnpReader. If you give a string, it should be the base name of a set of PLINK Bed-formatted files.

  • covar

    covariate information, optional: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • k_list (list of numbers) – Values of k (in addition to 0) to test. Defaults to [1,2,4,8,…8192].

  • n_folds (number) – Number of folds of cross validation to use for out-of-sample evaluation of various values of k. Defaults to 10.

  • seed (number) – (optional) Random seed used to generate permutations for lrt G0 fitting.

  • output_file_name (file name) – Name of file to write results to, optional. If not given, no output file will be created.

  • GB_goal (number) – gigabytes of memory the run should use, optional. If not given, will read the test_snps in blocks the same size as the kernel, which is memory efficient with little overhead on computation time.

  • force_full_rank (Boolean) – Even if kernels are defined with fewer SNPs than IIDs, create an explicit iid_count x iid_count kernel. Cannot be True if force_low_rank is True.

  • force_low_rank (Boolean) – Even if kernels are defined with fewer IIDs than SNPs, create a low-rank iid_count x sid_count kernel. Cannot be True if force_full_rank is True.

  • mixing (number) – A parameter to LMM learning telling how to combine the two kernels, optional. If not given, will search for the best value.

  • h2 (number) – A parameter to LMM learning that tells how much weight to give the K’s vs. the identity matrix, optional. If not given, will search for the best value.

  • do_plot (boolean) – If true, will plot, for each chrom, the negative loglikelihood vs k.

  • runner

    a Runner, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

Pandas dataframe with one row per test SNP. Columns include “PValue”

Example:

>>> import logging
>>> import numpy as np
>>> from fastlmm.association import single_snp_all_plus_select
>>> from pysnptools.snpreader import Bed
>>> from fastlmm.util import example_file # Download and return local file name
>>> from pysnptools.util.mapreduce1.runner import LocalMultiProc
>>> logging.basicConfig(level=logging.INFO)
>>> pheno_fn = example_file("fastlmm/feature_selection/examples/toydata.phe")
>>> test_snps = example_file("fastlmm/feature_selection/examples/toydata.5chrom.*","*.bed")
>>> snps = Bed(test_snps,count_A1=False)[:,::100] #To make example faster, run on only 1/100th of the data
>>> chrom5_snps = snps[:,snps.pos[:,0]==5] # Test on only chrom5
>>> results_dataframe = single_snp_all_plus_select(test_snps=chrom5_snps,G=snps,pheno=pheno_fn,GB_goal=2,runner=LocalMultiProc(20,mkl_num_threads=5), count_A1=False) #Run multiproc
>>> print(results_dataframe.iloc[0].SNP,round(results_dataframe.iloc[0].PValue,7),len(results_dataframe))
null_9800 0.0793385 4
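
If the default search over k is too expensive, the k_list and n_folds parameters documented above can be reduced. A hedged, untested sketch continuing from the variables in the example (the values shown are illustrative only):

>>> # Hedged sketch: a smaller k search with fewer cross-validation folds.
>>> quick_results = single_snp_all_plus_select(test_snps=chrom5_snps, G=snps, pheno=pheno_fn,
...                                            k_list=[1, 2, 4, 8, 16], n_folds=3,
...                                            GB_goal=2, count_A1=False)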

single_snp_linreg

fastlmm.association.single_snp_linreg(test_snps, pheno, covar=None, max_output_len=None, output_file_name=None, GB_goal=None, runner=None, count_A1=None)

Function performing single SNP GWAS using linear regression. Will reorder and intersect IIDs as needed.

Parameters:
  • test_snps (a SnpReader or a string) – SNPs to test. Can be any SnpReader. If you give a string, it should be the base name of a set of PLINK Bed-formatted files. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • pheno (a SnpReader or a string) – A single phenotype: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. Any IIDs with missing values will be removed. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • covar (a SnpReader or a string) – covariate information, optional: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • max_output_len (number) – Maximum number of Pvalues to return. Defaults to None, which means ‘return all’.

  • output_file_name (file name) – Name of file to write results to, optional. If not given, no output file will be created. The output format is tab-delimited text.

  • GB_goal (number) – gigabytes of memory the run should use, optional. If not given, will read the test_snps in blocks of size iid_count, which is memory efficient with little overhead on computation time.

  • runner (Runner) – Runner, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

Pandas dataframe with one row per test SNP. Columns include “PValue”

Example:

>>> import logging
>>> import numpy as np
>>> from fastlmm.association import single_snp_linreg
>>> from pysnptools.snpreader import Bed
>>> from fastlmm.util import example_file # Download and return local file name
>>> logging.basicConfig(level=logging.INFO)
>>> pheno_fn = example_file("fastlmm/feature_selection/examples/toydata.phe")
>>> test_snps = example_file("fastlmm/feature_selection/examples/toydata.5chrom.*","*.bed")
>>> results_dataframe = single_snp_linreg(test_snps=test_snps, pheno=pheno_fn, count_A1=False)
>>> print(results_dataframe.iloc[0].SNP,round(results_dataframe.iloc[0].PValue,7),len(results_dataframe))
null_576 1e-07 10000
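
To keep only the most significant results, set max_output_len. A hedged, untested sketch reusing test_snps and pheno_fn from the example above (100 is an arbitrary choice):

>>> # Hedged sketch: return only the 100 smallest p-values.
>>> top100 = single_snp_linreg(test_snps=test_snps, pheno=pheno_fn, max_output_len=100, count_A1=False)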

single_snp_select

fastlmm.association.single_snp_select(test_snps, pheno, G=None, covar=None, k_list=None, n_folds=10, just_return_selected_snps=False, seed=0, output_file_name=None, GB_goal=None, force_full_rank=False, force_low_rank=False, h2=None, runner=None, count_A1=None)

Function performing single SNP GWAS based on covariates (often PCs) and a similarity matrix constructed from the top k SNPs, where SNPs are ordered via the PValue from single_snp_linreg() and k is determined via out-of-sample prediction. Will reorder and intersect IIDs as needed.

Parameters:
  • test_snps (a SnpReader or a string) – SNPs to test. Can be any SnpReader. If you give a string, it should be the base name of a set of PLINK Bed-formatted files. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • pheno (a SnpReader or a string) – A single phenotype: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. Any IIDs with missing values will be removed. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • G (SnpReader or a string) – SNPs from which to create a similarity matrix of the top k SNPs. If not given, will use test_snps. Can be any SnpReader. If you give a string, it should be the base name of a set of PLINK Bed-formatted files.

  • covar (a SnpReader or a string) – covariate information, optional: Can be any SnpReader, for example, Pheno or SnpData. If you give a string, it should be the file name of a PLINK phenotype-formatted file. (For backwards compatibility can also be dictionary with keys ‘vals’, ‘iid’, ‘header’)

  • k_list (list of numbers) – Values of k (in addition to 0) to test. Defaults to [1,2,4,8,…8192].

  • n_folds (number) – Number of folds of cross validation to use for out-of-sample evaluation of various values of k. Defaults to 10.

  • just_return_selected_snps (bool) – If True, instead of returning the results of GWAS, return the list of selected top-k SNPs.

  • seed (number) – (optional) Random seed used to generate permutations for lrt G0 fitting.

  • output_file_name (file name) – Name of file to write results to, optional. If not given, no output file will be created.

  • GB_goal (number) – gigabytes of memory the run should use, optional. If not given, will read the test_snps in blocks the same size as the kernel, which is memory efficient with little overhead on computation time.

  • force_full_rank (Boolean) – Even if kernels are defined with fewer SNPs than IIDs, create an explicit iid_count x iid_count kernel. Cannot be True if force_low_rank is True.

  • force_low_rank (Boolean) – Even if kernels are defined with fewer IIDs than SNPs, create a low-rank iid_count x sid_count kernel. Cannot be True if force_full_rank is True.

  • h2 (number) – A parameter to LMM learning that tells how much weight to give the K’s vs. the identity matrix, optional. If not given, will search for the best value.

  • runner (Runner) – a Runner, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

Pandas dataframe with one row per test SNP. Columns include “PValue”

Example:

>>> import logging
>>> import numpy as np
>>> from fastlmm.association import single_snp_select
>>> from pysnptools.snpreader import Bed
>>> from fastlmm.util import example_file # Download and return local file name
>>> from fastlmm.util import compute_auto_pcs
>>> bed_fn = example_file("tests/datasets/synth/all.bed")
>>> phen_fn = example_file("tests/datasets/synth/pheno_10_causals.txt")
>>> covar = compute_auto_pcs(bed_fn,count_A1=False)
>>> results_dataframe = single_snp_select(test_snps=bed_fn, G=bed_fn, pheno=phen_fn, covar=covar, GB_goal=2, count_A1=False)
>>> print(results_dataframe.iloc[0].SNP,round(results_dataframe.iloc[0].PValue,7),len(results_dataframe))
snp495_m0_.01m1_.04 0.0 5000
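
Setting just_return_selected_snps=True skips the final GWAS and returns only the SNPs chosen by the selection step. A hedged, untested sketch reusing the variables from the example above:

>>> # Hedged sketch: ask only for the selected top-k SNPs rather than GWAS results.
>>> selected = single_snp_select(test_snps=bed_fn, G=bed_fn, pheno=phen_fn, covar=covar,
...                              just_return_selected_snps=True, GB_goal=2, count_A1=False)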

epistasis

fastlmm.association.epistasis(test_snps, pheno, G0, G1=None, mixing=0.0, covar=None, output_file_name=None, sid_list_0=None, sid_list_1=None, log_delta=None, min_log_delta=-5, max_log_delta=10, cache_file=None, runner=None, count_A1=None)

Function performing epistasis GWAS. See http://www.nature.com/srep/2013/130122/srep01099/full/srep01099.html. REML is used to optimize H2 and beta is always estimated via ML (maximum likelihood, see https://static-content.springer.com/esm/art%3A10.1038%2Fnmeth.1681/MediaObjects/41592_2011_BFnmeth1681_MOESM290_ESM.pdf).

Parameters:
  • test_snps (a SnpReader or a string) – SNPs from which to test pairs. If you give a string, it should be the base name of a set of PLINK Bed-formatted files.

  • pheno (a 'pheno dictionary' or a string) – A single phenotype: A ‘pheno dictionary’ contains an ndarray on the ‘vals’ key and an iid list on the ‘iid’ key. If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • G0 (a SnpReader or a string) – SNPs from which to construct a similarity matrix. If you give a string, it should be the base name of a set of PLINK Bed-formatted files.

  • G1 (a SnpReader or a string) – SNPs from which to construct a second similarity kernel, optional. (Also, see ‘mixing’.) If you give a string, it should be the base name of a set of PLINK Bed-formatted files.

  • mixing (number) – Weight between 0.0 (inclusive, default) and 1.0 (inclusive) given to G1 relative to G0. If you give no mixing number, G0 will get all the weight and G1 will be ignored.

  • covar (a 'pheno dictionary' or a string) – covariate information, optional: A ‘pheno dictionary’ contains an ndarray on the ‘vals’ key and an iid list on the ‘iid’ key. If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • sid_list_0 (list of strings) – list of sids, optional: All unique pairs from sid_list_0 x sid_list_1 will be evaluated. If you give no sid_list_0, all sids in test_snps will be used.

  • sid_list_1 (list of strings) – list of sids, optional: All unique pairs from sid_list_0 x sid_list_1 will be evaluated. If you give no sid_list_1, all sids in test_snps will be used.

  • output_file_name (file name) – Name of file to write results to, optional. If not given, no output file will be created. The output format is tab-delimited text.

  • log_delta (number) – A parameter to LMM learning, optional. If not given, will search for the best value.

  • min_log_delta (number) – (default: -5) When searching for log_delta, the lower bound of the search.

  • max_log_delta (number) – (default: 10) When searching for log_delta, the upper bound of the search.

  • cache_file (file name) – Name of file to read or write cached precomputation values to, optional. If not given, no cache file will be used. If given and the file does not exist, will write precomputation values to the file. If given and the file does exist, will read precomputation values from the file. The file contains the U and S matrix from the decomposition of the training matrix. It is in Python’s np.savez (*.npz) format. Calls using the same cache file should have the same ‘G0’ and ‘G1’.

  • runner (Runner) – a Runner, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

Pandas dataframe with one row per SNP pair. Columns include “PValue”

Example:

>>> import logging
>>> from pysnptools.snpreader import Bed
>>> from fastlmm.util import example_file # Download and return local file name
>>> from fastlmm.association import epistasis
>>> logging.basicConfig(level=logging.INFO)
>>> bed_file = example_file('tests/datasets/all_chr.maf0.001.N300.*','*.bed')
>>> test_snps = Bed(bed_file,count_A1=True)
>>> pheno = example_file('tests/datasets/phenSynthFrom22.23.N300.randcidorder.txt')
>>> covar = example_file('tests/datasets/all_chr.maf0.001.covariates.N300.txt')
>>> results_dataframe = epistasis(test_snps, pheno, G0=test_snps, covar=covar,
...                                 sid_list_0=test_snps.sid[:10], #first 10 snps
...                                 sid_list_1=test_snps.sid[5:15], #Skip 5 snps, use next 10
...                                 count_A1=False)
>>> print(results_dataframe.iloc[0].SNP0, results_dataframe.iloc[0].SNP1,round(results_dataframe.iloc[0].PValue,5),len(results_dataframe))
1_12 1_9 0.07779 85
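
Because the result is an ordinary Pandas dataframe, standard Pandas operations apply. A hedged sketch that keeps only the pairs below an illustrative threshold and writes them to a hypothetical file, continuing from the example above:

>>> # Hedged sketch: filter and save the pair results with plain Pandas.
>>> hits = results_dataframe[results_dataframe.PValue < 0.01]
>>> hits.to_csv("epistasis_hits.tsv", sep="\t", index=False)  # hypothetical output path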

snp_set

fastlmm.association.snp_set(test_snps, set_list, pheno, covar=None, output_file_name=None, G0=None, test='lrt', write_lrtperm=False, nperm=10, npermabs=None, mpheno=1, G0_fit='qq', qmax=0.1, seed=None, minsetsize=None, maxsetsize=None, mindist=0, idist=1, show_pvalue_5050=False)

Function performing GWAS on sets of SNPs.

Parameters:
  • test_snps (a string) – The base name of the file containing the SNPs for the alternative kernel. The file must be in PLINK Bed format.

  • set_list (a string) – The name of a tab-delimited file defining the sets. The file should contain two columns, ‘snp’ and ‘set’.

  • pheno (a string) – The name of a file containing the phenotype. The file must be in PLINK phenotype format.

  • covar (a 'pheno dictionary' or a string) – covariate information, optional: The name of a file in PLINK phenotype format.

  • output_file_name (file name) – Name of file to write results to, optional. If not given, no output file will be created.

  • G0 (a string) – Training SNPs from which to construct a similarity kernel. It should be the base name of files in PLINK Bed or Ped format.

  • test (a string) – ‘lrt’ (default) or ‘sc_davies’

  • write_lrtperm (boolean) – (default: False) If True, write the lrtperm vector (dictated by seed) to a file.

  • nperm (number) – (default: 10) number of permutations per test

  • npermabs (number) – (default: None) absolute number of permutations

  • mpheno (number) – (default: 1) integer, starting at 1, representing the index of the phenotype tested

  • G0_fit (string) – (default: “qq”) How to fit G0. Should be either “qq” or “ml”

  • qmax (number) – (default: .1) Use the top qmax fraction of G0 distrib test statistics to fit the G0 distribution

  • seed (number) – (optional) Random seed used to generate permutations for lrt G0 fitting.

  • minsetsize (number) – (optional) only include sets at least this large (inclusive)

  • maxsetsize (number) – (optional) only include sets no more than this large (inclusive)

  • mindist (number) – (default 0) SNPs within mindist of the test SNPs will be removed from the similarity kernel (G0).

  • idist (number) – (default: 1) the type of position to use with mindist: 1 for genomic distance, 2 for base-pair distance.

  • show_pvalue_5050 (Boolean) – (default: False) show a conservative P-value arising from an assumed null distribution that is a 50-50 mixture distribution of 0 and 1 dof chi squares [Molenberghs and Verbeke, 2003]. Provided for backwards compatibility.

Return type:

Pandas dataframe with one row per set.

Example:

>>> import logging
>>> from fastlmm.association import snp_set
>>> logging.basicConfig(level=logging.INFO)
>>> result_dataframe = snp_set(
...     test_snps = '../../tests/datasets/all_chr.maf0.001.N300',
...     set_list = '../../tests/datasets/set_input.23.txt',
...     pheno = '../../tests/datasets/phenSynthFrom22.23.N300.txt')
>>> print(result_dataframe.iloc[0].SetId, round(result_dataframe.iloc[0]['P-value'],15))
set23 0.0
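
The same call can request the score test and an explicit background kernel via G0. The following is a hedged, untested sketch, not part of the original example; reusing the same Bed base name for G0 and choosing ‘sc_davies’ are assumptions for illustration only.

>>> # Hedged sketch: score test ('sc_davies') with a background kernel built from the same Bed file.
>>> result_dataframe = snp_set(
...     test_snps = '../../tests/datasets/all_chr.maf0.001.N300',
...     set_list = '../../tests/datasets/set_input.23.txt',
...     pheno = '../../tests/datasets/phenSynthFrom22.23.N300.txt',
...     G0 = '../../tests/datasets/all_chr.maf0.001.N300',
...     test = 'sc_davies')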

FastLMM

class fastlmm.inference.FastLMM(GB_goal=None, force_full_rank=False, force_low_rank=False, snp_standardizer=Unit(), covariate_standardizer=Unit(), kernel_standardizer=DiagKtoN())

A predictor, somewhat in the style of scikit-learn, for learning and predicting with linear mixed models.

Constructor:
Parameters:
  • GB_goal (int) – gigabytes of memory the run should use, optional. If not given, will read the test_snps in blocks the same size as the kernel, which is memory efficient with little overhead on computation time.

  • force_full_rank (bool) – Even if kernels are defined with fewer SNPs than IIDs, create an explicit iid_count x iid_count kernel. Cannot be True if force_low_rank is True.

  • force_low_rank (bool) – Even if kernels are defined with fewer IIDs than SNPs, create a low-rank iid_count x sid_count kernel. Cannot be True if force_full_rank is True.

  • snp_standardizer (Standardizer) – The PySnpTools standardizer to be applied to SNP data. Choices include Standardizer.Unit (Default. Makes values for each SNP have mean zero and standard deviation 1.0, then fills missing with zero) and Standardizer.Identity (do nothing).

  • covariate_standardizer (Standardizer) – The PySnpTools standardizer to be applied to X, the covariate data. Some choices include Standardizer.Unit (Default. Fills missing with zero) and Standardizer.Identity (do nothing).

  • kernel_standardizer (KernelStandardizer) – The PySnpTools kernel standardizer to be applied to the kernels. Some choices include KernelStandardizer.DiagKToN (Default. Makes the diagonal sum to iid_count) and KernelStandardizer.Identity (do nothing).

Example:

>>> import numpy as np
>>> import logging
>>> from pysnptools.snpreader import Bed, Pheno
>>> from fastlmm.util import example_file # Download and return local file name
>>> from fastlmm.inference import FastLMM
>>> logging.basicConfig(level=logging.INFO)
>>> snpreader = Bed(example_file("fastlmm/feature_selection/examples/toydata.5chrom.*","*.bed"),count_A1=False)
>>> cov_fn = example_file("fastlmm/feature_selection/examples/toydata.cov")
>>> pheno_fn = example_file("fastlmm/feature_selection/examples/toydata.phe")
>>> train_idx = np.r_[10:snpreader.iid_count] # iids 10 and on
>>> test_idx  = np.r_[0:10] # the first 10 iids
>>> fastlmm = FastLMM(GB_goal=2)
>>> #We give it phenotype and covariate information for extra examples, but it reorders and intersects the examples, so only training examples are used. 
>>> _ = fastlmm.fit(K0_train=snpreader[train_idx,:],X=cov_fn,y=pheno_fn) 
>>> mean, covariance = fastlmm.predict(K0_whole_test=snpreader[test_idx,:],X=cov_fn,count_A1=False)
>>> print(list(mean.iid[0]), round(mean.val[0,0],7), round(covariance.val[0,0],7))
['per0', 'per0'] 0.1791958 0.8995209
>>> nll = fastlmm.score(K0_whole_test=snpreader[test_idx,:],X=cov_fn,y=pheno_fn,count_A1=False)
>>> print(f"{nll:.6f}")
13.462323
fit(X=None, y=None, K0_train=None, K1_train=None, h2raw=None, mixing=None, count_A1=None)

Method for training a FastLMM predictor. If the examples in X, y, K0_train, K1_train are not the same, they will be reordered and intersected.

Parameters:
  • X (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – training covariate information, optional: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • y (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – training phenotype: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • K0_train (SnpReader or a string or KernelReader) – A similarity matrix or SNPs from which to construct such a similarity matrix. Can be any SnpReader. If you give a string, can be the name of a PLINK-formated Bed file. Can be PySnpTools KernelReader. If you give a string it can be the name of a KernelNpz file.

  • K1_train (SnpReader or a string or KernelReader) – A second similarity matrix or SNPs from which to construct such a second similarity matrix. (Also, see ‘mixing’). Can be any SnpReader. If you give a string, can be the name of a PLINK-formated Bed file. Can be PySnpTools KernelReader. If you give a string it can be the name of a KernelNpz file.

  • h2raw (number) – A parameter to LMM learning that tells how much weight to give the K’s vs. the identity matrix, optional. If not given, will search for the best value. If mixing is unspecified, then h2raw must also be unspecified.

  • mixing (number) – Weight between 0.0 (inclusive, default) and 1.0 (inclusive) given to K1_train relative to K0_train. If you give no mixing number and a K1_train is given, the best weight will be learned.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

self, the fitted FastLMM predictor

predict(X=None, K0_whole_test=None, K1_whole_test=None, iid_if_none=None, count_A1=None)

Method for predicting from a fitted FastLMM predictor. If the examples in X, K0_whole_test, K1_whole_test are not the same, they will be reordered and intersected.

Parameters:
  • X (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – testing covariate information, optional: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • K0_whole_test (SnpReader or a string or KernelReader) – A similarity matrix from all the examples to the test examples. Alternatively, the test SNPs needed to construct such a similarity matrix. Can be any SnpReader. If you give a string, can be the name of a PLINK-formated Bed file. Can be PySnpTools KernelReader. If you give a string it can be the name of a KernelNpz file.

  • K1_whole_test (SnpReader or a string or KernelReader) – A second similarity matrix from all the examples to the test examples. Alternatively, the test SNPs needed to construct such a similarity matrix. Can be any SnpReader. If you give a string, can be the name of a PLINK-formated Bed file. Can be PySnpTools KernelReader. If you give a string it can be the name of a KernelNpz file.

  • iid_if_none (an ndarray of two strings) – Examples to predict for if no X, K0_whole_test, K1_whole_test is provided.

Return type:

A SnpData of the means and a KernelData of the covariance

score(X=None, y=None, K0_whole_test=None, K1_whole_test=None, iid_if_none=None, return_mse_too=False, return_per_iid=False, count_A1=None)

Method for calculating the negative log likelihood of testing examples. If the examples in X,y, K0_whole_test, K1_whole_test are not the same, they will be reordered and intersected.

Parameters:
  • X (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – testing covariate information, optional: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • y (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – testing phenotype: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • K0_whole_test (SnpReader or a string or KernelReader) – A similarity matrix from all the examples to the test examples. Alternatively, the test SNPs needed to construct such a similarity matrix. Can be any SnpReader. If you give a string, can be the name of a PLINK-formated Bed file. Can be PySnpTools KernelReader. If you give a string it can be the name of a KernelNpz file.

  • K1_whole_test (SnpReader or a string or KernelReader) – A second similarity matrix from all the examples to the test examples. Alternatively, the test SNPs needed to construct such a similarity matrix. Can be any SnpReader. If you give a string, can be the name of a PLINK-formated Bed file. Can be PySnpTools KernelReader. If you give a string it can be the name of a KernelNpz file.

  • iid_if_none (an ndarray of two strings) – Examples to predict for if no X, K0_whole_test, K1_whole_test is provided.

  • return_mse_too (bool) – If true, will also return the mean squared error.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

a float of the negative log likelihood and, optionally, a float of the mean squared error.

LinearRegression

class fastlmm.inference.LinearRegression(covariate_standardizer=Unit())

A linear regression predictor that works like the FastLMM predictor in fastlmm_predictor.py, but expects all similarity matrices to be the identity.

Constructor:
Parameters:
  • covariate_standardizer (Standardizer) – The PySnpTools standardizer to be applied to X, the covariate data. Some choices include Standardizer.Unit (Default. Fills missing with zero) and Standardizer.Identity (do nothing).

Example:

>>> import numpy as np
>>> import logging
>>> from pysnptools.snpreader import Pheno
>>> from fastlmm.inference import LinearRegression
>>> from fastlmm.util import example_file # Download and return local file name
>>> logging.basicConfig(level=logging.INFO)
>>> cov = Pheno(example_file("fastlmm/feature_selection/examples/toydata.cov"))
>>> pheno_fn = example_file("fastlmm/feature_selection/examples/toydata.phe")
>>> train_idx = np.r_[10:cov.iid_count] # iids 10 and on
>>> test_idx  = np.r_[0:10] # the first 10 iids
>>> linreg = LinearRegression()
>>> #We give it phenotype information for extra examples, but it reorders and intersects the examples, so only training examples are used. 
>>> _ = linreg.fit(X=cov[train_idx,:],y=pheno_fn) 
>>> mean, covariance = linreg.predict(X=cov[test_idx,:])
>>> print(mean.iid[0], round(mean.val[0,0],7), round(covariance.val[0,0],7))
['per0' 'per0'] 0.1518764 0.9043703
>>> nll = linreg.score(X=cov[test_idx,:],y=pheno_fn)
>>> print(f"{nll:.6f}")
13.668845
fit(X=None, y=None, K0_train=None, K1_train=None, h2=None, mixing=None, count_A1=None)

Method for training a LinearRegression predictor. If the examples in X, y, K0_train, K1_train are not the same, they will be reordered and intersected.

Parameters:
  • X (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – training covariate information, optional: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • y (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – training phenotype: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • K0_train (None) – Must be None. Represents the identity similarity matrix.

  • K1_train (SnpReader or a string or KernelReader) – Must be None. Represents the identity similarity matrix.

  • h2 (number) – Ignored. Optional.

  • mixing (number) – Ignored. Optional.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

self, the fitted Linear Regression predictor

predict(X=None, K0_whole_test=None, K1_whole_test=None, iid_if_none=None, count_A1=None)

Method for predicting from a fitted LinearRegression predictor. If the examples in X, K0_whole_test, K1_whole_test are not the same, they will be reordered and intersected.

Parameters:
  • X (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – testing covariate information, optional: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • K0_whole_test (None) – Must be None. Represents the identity similarity matrix.

  • K1_whole_test (SnpReader or a string or KernelReader) – Must be None. Represents the identity similarity matrix.

  • iid_if_none (an ndarray of two strings) – Examples to predict for if no X, K0_whole_test, K1_whole_test is provided.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

A SnpData of the means and a KernelData of the covariance

score(X=None, y=None, K0_whole_test=None, K1_whole_test=None, iid_if_none=None, return_mse_too=False, count_A1=None)

Method for calculating the negative log likelihood of testing examples. If the examples in X,y, K0_whole_test, K1_whole_test are not the same, they will be reordered and intersected.

Parameters:
  • X (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – testing covariate information, optional: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • y (a PySnpTools SnpReader (such as Pheno or SnpData) or string.) – testing phenotype: If you give a string, it should be the file name of a PLINK phenotype-formatted file.

  • K0_whole_test (None) – Must be None. Represents the identity similarity matrix.

  • K1_whole_test (SnpReader or a string or KernelReader) – Must be None. Represents the identity similarity matrix.

  • iid_if_none (an ndarray of two strings) – Examples to predict for if no X, K0_whole_test, K1_whole_test is provided.

  • return_mse_too (bool) – If true, will also return the mean squared error.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

a float of the negative log likelihood and, optionally, a float of the mean squared error.

heritability_spatial_correction

fastlmm.association.heritability_spatial_correction(G_kernel, spatial_coor, spatial_iid, alpha_list, alpha_power, pheno, map_function=map, cache_folder=None, jackknife_count=500, permute_plus_count=10000, permute_times_count=10000, seed=0, just_testing=False, always_remote=False, allow_gxe2=True, count_A1=None)

Function measuring heritability with correction for spatial location.

Parameters:
  • G_kernel (a KernelReader, SnpReader or a string) – A kernel that tells the genetic similarity between all pairs of individuals. The kernel can be given explicitly, for example with a KernelData. The kernel can also be given implicitly by providing a set of SNPs or the name of a BED file.

  • spatial_coor (a iid_count x 2 array) – The position of each individual given by two coordinates. Any units are allowed, but the two values must be compatible so that distance can be determined via Pythagoras’ theorem. (So, longitude and latitude should not be used unless the locations are near the Equator.)

  • spatial_iid (array of strings with shape [iid_count,2]) – An ndarray of the iids. Each iid is an ndarray of two strings (a family ID and a case ID) that identifies an individual.

  • alpha_list (list of numbers) – a list of numbers to search to find the best alpha, which is the similarity scale. The similarity of two individuals is here defined as exp(-(distance_between/alpha)**alpha_power). If the closest individuals are 100 units apart and the farthest individuals are 4e6 units apart, a reasonable alpha_list might be: [int(v) for v in np.logspace(np.log10(100),np.log10(1e10), 100)]. The function reports on the alphas chosen. If an extreme alpha is picked, change alpha_list to cover more range.

  • alpha_power – 2 (a good choice) means that similarity goes with area. 1 means with distance.

  • pheno (a SnpReader or string) – The target value(s) to predict. It can be a file name readable via Pheno or any SnpReader.

  • cache_folder (a string) – (default ‘None’) The name of a directory in which to save intermediate results. If ‘None’, then no intermediate results are saved.

  • map_function (a function) – (default ‘map’) A function with the same inputs and functionality as Python’s ‘map’ function. Can be used to run ‘heritability_spatial_correction’ on a cluster.

  • jackknife_count (number) – (default 500) The number of jackknife groups to use when calculating standard errors (SE). Changing to a small number, 2, speeds up calculation at the cost of unusable SEs.

  • permute_plus_count (number) – (default 10000) The number of permutations used when calculating P values. Changing to a small number, 1, speeds up calculation at the cost of unusable P values.

  • permute_times_count (number) – (default 10000) The number of permutations used when calculating P values. Changing to a small number, 1, speeds up calculation at the cost of unusable P values.

  • seed (number) – (default 0) The random seed used by jackknifing and permutation.

  • just_testing (bool) – (default False) If true, skips actual LMM-related search and calculation.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

Pandas dataframe with one row per phenotype. Columns include “h2uncorr”, “h2corr”, etc.
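
Example (hedged sketch): no example accompanies this function above, so the following untested sketch borrows the synthetic Bed and phenotype files used elsewhere in this documentation, fabricates random spatial coordinates, and uses the speed-up settings documented above (just_testing, tiny jackknife and permutation counts), so its output is not meaningful.

>>> import numpy as np
>>> from pysnptools.snpreader import Bed
>>> from fastlmm.util import example_file # Download and return local file name
>>> from fastlmm.association import heritability_spatial_correction
>>> snps = Bed(example_file('tests/datasets/synth/all.*','*.bed'), count_A1=False)
>>> pheno_fn = example_file("tests/datasets/synth/pheno_10_causals.txt")
>>> rng = np.random.RandomState(0)
>>> spatial_coor = rng.uniform(0, 1000, size=(snps.iid_count, 2))  # synthetic x,y locations
>>> alpha_list = [int(v) for v in np.logspace(np.log10(100), np.log10(1e10), 10)]
>>> df = heritability_spatial_correction(snps, spatial_coor, snps.iid, alpha_list, 2, pheno_fn,
...                                      jackknife_count=2, permute_plus_count=1,
...                                      permute_times_count=1, just_testing=True)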

util Module

util.compute_auto_pcs

fastlmm.util.compute_auto_pcs(snpreader, cutoff=0.1, k_values=array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]), output_file_name=None, count_A1=None)

Function automatically finds the best principal components (PCs).

Parameters:
  • snpreader (a SnpReader or a string) – SNPs for which to find the best PCs. If you give a string, it should be the base name of a set of PLINK Bed-formatted files.

  • cutoff (a number between 0 and 1.) – (default: .1) The degree of relatedness to remove before finding the best number of PCs. Relatedness is measured with an RRM (realized relationship matrix), so 0 is no relation, .5 is a sibling or parent, and 1 is self or twin.

  • k_values (list of integers) – (default: 0 … 10 [inclusive]) The number of PCs to search.

  • count_A1 (bool) – If it needs to read SNP data from a BED-formatted file, tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

Return type:

A phenotype dictionary with property ‘iid’ listing the iids and property ‘vals’ containing an ndarray of PC values.

Example:

>>> import logging
>>> from fastlmm.util import compute_auto_pcs
>>> from fastlmm.util import example_file # Download and return local file name
>>> logging.basicConfig(level=logging.INFO)
>>> file_name = example_file("fastlmm/feature_selection/examples/toydata.5chrom.*","*.bed")
>>> best_pcs = compute_auto_pcs(file_name,count_A1=False)
>>> print(int(best_pcs['vals'].shape[0]),int(best_pcs['vals'].shape[1]))
500 0

util.manhattan_plot

fastlmm.util.manhattan_plot(chr_pos_pvalue_array, pvalue_line=None, plot_threshold=1.0, vline_significant=False, marker='o', chromosome_starts=None, xaxis_unit_bp=True, alpha=0.5)

Function to create a Manhattan plot. See http://en.wikipedia.org/wiki/Manhattan_plot.

Parameters:
  • chr_pos_pvalue_array (numpy array) – an n x 3 numpy array. The three columns are the chrom number (as a number), the position, and pvalue.

  • pvalue_line (number) – (Default: None). If given, draws a line at that PValue.

  • plot_threshold (number) – (Default: 1) Plot only SNPs that achieve a P-value smaller than plot_threshold, to speed up plotting.

  • vline_significant (Boolean) – If true, draw a vertical line at each significant Pvalue (Default: False)

  • marker (string) – marker for the scatter plot. default: “o”

  • chromosome_starts ([Nchrom x 3] ndarray) – for each chromosome: the chromosome number, cumulative start position, and cumulative stop position, used for plotting. If None (default), this is estimated from the data.

  • xaxis_unit_bp (Boolean) – If true, plot cumulative position in basepair units on x axis. If False, only use rank of SNP positions. (default: True)

  • alpha (number) – alpha (opaqueness) for P-value markers in scatterplot (default 0.5)

Return type:

chromosome_starts, an [Nchrom x 3] ndarray of chromosome number, cumulative start position, and cumulative stop position, as used in plotting.

Example:
>>> from fastlmm.association import single_snp
>>> from pysnptools.snpreader import Bed
>>> from fastlmm.util import example_file # Download and return local file name
>>> import matplotlib.pyplot as plt
>>> import fastlmm.util.util as flutil
>>> test_snps = example_file("fastlmm/feature_selection/examples/toydata.5chrom.*","*.bed")
>>> pheno_fn = example_file("fastlmm/feature_selection/examples/toydata.phe")
>>> results_dataframe = single_snp(test_snps=test_snps, pheno=pheno_fn, h2=.2, count_A1=False)
>>> chromosome_starts = flutil.manhattan_plot(results_dataframe[["Chr", "ChrPos", "PValue"]],pvalue_line=1e-7)
>>> #plt.show()
