Multi Trait Set Test (mtSet)

Set tests are a powerful approach for genome-wide association testing between groups of genetic variants and quantitative traits. Here we implement MTSet, a mixed-model approach that enables joint analysis across multiple correlated traits while accounting for population structure and relatedness. MTSet effectively combines the benefits of set tests with multi-trait modeling and is computationally efficient, enabling genetic analysis of large cohorts (up to 500,000 individuals) and multiple traits. A detailed description of the method can be found at [CRS15].

References

[CRS15](1, 2) Casale FP, Rakitsch B, Lippert C, Stegle O (2015) Efficient set tests for the genetic analysis of correlated traits. Nature Methods, Vol. 12, No. 8. (15 June 2015), pp. 755-758.

Public interface

class limix.mtset.MTSet(Y=None, R=None, S_R=None, U_R=None, traitID=None, F=None, rank=1)[source]

Multi-trait set tests (mtSet).

Mixed-model approach that enables joint analysis across multiple correlated traits while accounting for population structure and relatedness [CRS15].

Parameters:
  • Y ((N, P) ndarray) – Individuals by phenotypes.
  • F ((N, K) ndarray, optional) – Individuals by covariates. By default, no fixed effect term is set.
  • R ((N, N) ndarray, optional) – LMM-covariance/kinship coefficients. U_R and S_R can be provided instead of R. If neither R nor U_R and S_R are provided, the null models has iid normal residuals.
  • U_R ((N, N) ndarray, optional) – Eigenvectors of R. U_R and S_R can be provided instead of R. If neither R nor U_R and S_R are provided, iid normal residuals are considered.
  • S_R ((N, ) ndarray, optional) – Eigenvalues of R. U_R and S_R can be provided instead of R. If neither R nor U_R and S_R are provided, iid normal residuals are considered.
  • traitID (ndarray, optional) – Phenotype IDs.
  • rank (int, optional) – Rank of the trait set covariance. The default value is 1.

Examples

This example shows how to fit mtSet when modelling population structure/relatedness using the full genetic relatedness matrix.

>>> from numpy.random import RandomState
>>> from limix.mtset import MTSet
>>> from numpy import dot, eye, ones, concatenate
>>> from numpy import set_printoptions
>>> set_printoptions(4)
>>>
>>> random = RandomState(1)
>>>
>>> N = 100
>>> P = 2
>>>
>>> pheno = random.randn(N, P)
>>> pheno-= pheno.mean(0)
>>> pheno/= pheno.std(0)
>>>
>>> W = random.randn(N, 10)
>>> kinship = dot(W, W.T) / float(10)
>>> kinship+= 1e-4 * eye(N)
>>>
>>> mtset = MTSet(pheno, R=kinship)
>>> res_null = mtset.fitNull()
>>>
>>> print(res_null['conv'][0])
True
>>> print('%.2f'%res_null['NLL0'])
98.43
>>>
>>> S = 4
>>> snp_set = (random.rand(N, S) < 0.2).astype(float)
>>> res = mtset.optimize(snp_set)
>>>
>>> print("%.4f" % res['LLR'][0])
0.0616
>>> print(res['Cr'])
[[ 0.0006  0.0025]
 [ 0.0025  0.0113]]

This example shows how to fit mtSet when modelling population structure/relatedness by introducing the top principle components of the genetic relatedness matrix (pc_rrm) as fixed effects.

>>> random = RandomState(0)
>>>
>>> mean = ones((N, 1))
>>> pc_rrm = random.randn(N, 4)
>>> covs = concatenate([mean, pc_rrm], 1)
>>>
>>> mtset = MTSet(pheno, F=covs)
>>> res_null = mtset.fitNull()
>>>
>>> print(res_null['conv'][0])
True
>>> print('%.2f'%res_null['NLL0'])
118.36
>>>
>>> res = mtset.optimize(snp_set)
>>>
>>> print("%.4f" % res['LLR'][0])
0.1373
>>> print(res['Cr'])
[[ 0.0002  0.0019]
 [ 0.0019  0.0207]]
fitNull(cache=False, out_dir=’./cache’, fname=None, rewrite=False, seed=None, factr=1000.0, n_times=10, init_method=None, verbose=False)[source]

Fit null model

Parameters:
  • () (verbose) –
  • cache (bool, optional) –

    If False (default), the null model is fitted and the results are not cached. If True, the cache is activated. The cache file dir and name can be specified using hcache and fname. When cache=True, we distinguish the following cases:

    • if the specified file does not exist, the output of the null model fiting is cached in the file.
    • if the specified file exists and rewrite=True, the cache file is overwritten.
    • if the specified file exists and rewrite=False, the results from the cache file are imported (the null model is not re-fitted).
  • out_dir (str, optional) – output dir of the cache file. The default value is “./cache”.
  • fname (str, optional) – Name of the cache hdf5 file. It must be specified if cache=True.
  • rewrite (bool, optional) – It has effect only if cache cache=True`. In this case, if True, the cache file is overwritten in case it exists. The default value is False
  • factr (float, optional) – optimization paramenter that determines the accuracy of the solution. By default it is 1000. (see scipy.optimize.fmin_l_bfgs_b for more details).
  • verbose (bool, optional) – verbose flag.
Returns:

dictionary containing:
  • B (ndarray): estimated effect sizes (null);
  • Cg (ndarray): estimated relatedness trait covariance (null);
  • Cn (ndarray): estimated genetic noise covariance (null);
  • conv (bool): convergence indicator;
  • NLL0 (ndarray): negative loglikelihood (NLL) of the null model;
  • LMLgrad (ndarray): norm of the gradient of the NLL.
  • time (time): elapsed time (in seconds).

Return type:

(dict)

fitNullTraitByTrait(verbose=False, cache=False, out_dir=’./cache’, fname=None, rewrite=False)[source]

Fit null model trait by trait

getInfoOpt()[source]

get information for the optimization

getInfoOptST()[source]

get information on the optimization

getLLR()[source]

get log likelihood ratio

getLMLgrad()[source]

get norm LML gradient

getNLLAlt()[source]

get negative log likelihood of the alternative

getNull()[source]

get null model info

getVariances()[source]

get variances

optimize(G, params0=None, n_times=10, vmax=5, factr=10000000.0, verbose=False)[source]

Fit alternative model

Parameters:
  • G (ndarray) – (N, S) genotype values for N samples and S variants (defines the set component)
  • params0 (ndarray, optional) – initial parameter values for optimization. By default the parameters are initialized by randonmly pertumrning the maximum likelihood estimators (MLE) under the null.
  • n_times (ndarray, optional) – maximum number of restarts in case the optimization does not converge. It has no effect if params0 is provided.
  • vmax (float, option) – maximum value for variance parameters that is accepted. If any of the variance MLE is >vmax then we say that the optimization has not converged. Default value is 5.
  • factr (float, optional) – optimization paramenter that determines the accuracy of the solution. By default it is 1e7. (see scipy.optimize.fmin_l_bfgs_b for more details).
  • verbose (bool, optional) – verbose flag.
Returns:

dictionary containing:
  • Cr (ndarray): estimated set trait covariance (alt);
  • Cg (ndarray): estimated relatedness trait covariance (alt);
  • Cn (ndarray): estimated genetic noise covariance (alt);
  • conv (ndarray): convergence indicator;
  • NLLAlt (ndarray): negative loglikelihood (NLL) of the alternative model;
  • LMLgrad (ndarray): norm of the gradient of the NLL.
  • LLR (ndarray*): log likelihood ratio statistics;
  • var (ndarray): (P, K) ndarray of variance estimates for the P traits and the K
  • time (ndarray): elapsed time (in seconds).
  • nit (ndarray): number of iterations L-BFGS. (see scipy.optimize.fmin_l_bfgs_b for more details).

Return type:

(dict)

optimizeTraitByTrait(G, verbose=False, n_times=10, factr=1000.0)[source]

Optimize trait by trait

setNull(null)[source]

set null model info