Dataset utils

Public interface

class limix.data.BedReader(prefix)

Class to read and query PLINK binary files.

Parameters: prefix (str) – Path prefix to the set of PLINK files.

Examples

Basics

>>> from limix.data import BedReader
>>> from limix.data import build_geno_query
>>> from pandas_plink import example_file_prefix
>>>
>>> reader = BedReader(example_file_prefix())
>>>
>>> print(reader.getSnpInfo().head())
  chrom         snp   cm    pos a0 a1  i
0     1  rs10399749  0.0  45162  G  C  0
1     1   rs2949420  0.0  45257  C  T  1
2     1   rs2949421  0.0  45413  0  0  2
3     1   rs2691310  0.0  46844  A  T  3
4     1   rs4030303  0.0  72434  0  G  4

Query and load genotype values into memory:

>>> # build genotype query
>>> gquery = build_geno_query(idx_start=4,
...                           idx_end=10,
...                           pos_start=45200,
...                           pos_end=80000,
...                           chrom=1)
>>>
>>> # apply geno query and impute
>>> X, snpinfo = reader.getGenotypes(gquery,
...                                  impute=True,
...                                  return_snpinfo=True)
>>>
>>> print(snpinfo)
  chrom        snp   cm    pos a0 a1  i
0     1  rs4030303  0.0  72434  0  G  4
1     1  rs4030300  0.0  72515  0  C  5
2     1  rs3855952  0.0  77689  G  A  6
3     1   rs940550  0.0  78032  0  T  7
>>>
>>> print(X)
[[ 2.  2.  2.  2.]
 [ 2.  2.  1.  2.]
 [ 2.  2.  0.  2.]]

Lazy subsetting using queries:

>>> reader_sub = reader.subset_snps(gquery)
>>>
>>> print(reader_sub.getSnpInfo().head())
  chrom        snp   cm    pos a0 a1  i
0     1  rs4030303  0.0  72434  0  G  0
1     1  rs4030300  0.0  72515  0  C  1
2     1  rs3855952  0.0  77689  G  A  2
3     1   rs940550  0.0  78032  0  T  3
>>>
>>> # genotypes are loaded into memory only when getGenotypes is called
>>> print(reader_sub.getGenotypes(impute=True))
[[ 2.  2.  2.  2.]
 [ 2.  2.  1.  2.]
 [ 2.  2.  0.  2.]]

Subsetting can also be done in place:

>>> query1 = build_geno_query(pos_start=72500, pos_end=78000)
>>>
>>> reader_sub.subset_snps(query1, inplace=True)
>>>
>>> print(reader_sub.getSnpInfo())
  chrom        snp   cm    pos a0 a1  i
0     1  rs4030300  0.0  72515  0  C  0
1     1  rs3855952  0.0  77689  G  A  1

You can also iterate over genotypes in batches to enable low-memory genome-wide analyses:

>>> from limix.data import GIter
>>>
>>> for gr in GIter(reader, batch_size=2):
...     print(gr.getGenotypes().shape)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
(3, 2)
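
As a rough illustration of the batching idea (not GIter's actual implementation, which is not shown here), the sketch below splits a range of variant indices into consecutive half-open windows. The helper name batch_bounds is hypothetical; each window could, for example, be passed to build_geno_query as idx_start and idx_end.

```python
def batch_bounds(n_variants, batch_size):
    """Yield (start, end) index pairs covering range(n_variants).

    Hypothetical helper: `end` is exclusive, matching the
    'idx < idx_end' convention documented for build_geno_query.
    """
    for start in range(0, n_variants, batch_size):
        yield start, min(start + batch_size, n_variants)


# Ten variants in batches of two give five windows.
bounds = list(batch_bounds(10, 2))
print(bounds)  # [(0, 2), (2, 4), (4, 6), (6, 8), (8, 10)]
```

Only one window of variants needs to be held in memory at a time, which is what keeps this pattern cheap for genome-wide scans.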

Have fun!

getGenotypes(query=None, impute=False, standardize=False, return_snpinfo=False)

Query and load genotype data.

Parameters:
  • query (str) – pandas query on the bim file. The default is None.
  • impute (bool, optional) – If True, the missing values in the bed file are mean imputed (variant-by-variant). When standardize is True, impute defaults to True; otherwise it defaults to False.
  • standardize (bool, optional) – If True, the genotype values are standardized. The default value is False.
  • return_snpinfo (bool, optional) – If True, the variant info is returned as well. The default value is False.
Returns:

  • X (ndarray) – (N, S) ndarray of queried genotype values for N individuals and S variants.
  • snpinfo (pandas.DataFrame) – dataframe with genotype info. Returned only if return_snpinfo=True.
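
To make the impute and standardize options concrete, here is a minimal, dependency-free sketch of variant-by-variant mean imputation followed by standardization. It is not limix's code: the function names are hypothetical, and the use of the population standard deviation and the handling of constant variants (set to zero) are assumptions made for this illustration.

```python
import math


def mean_impute(X):
    """Replace NaN entries with the mean of the observed values in the same column."""
    out = [list(row) for row in X]
    for j in range(len(X[0])):
        observed = [row[j] for row in X if not math.isnan(row[j])]
        # Assumption: a variant with no observed values is filled with 0.0.
        col_mean = sum(observed) / len(observed) if observed else 0.0
        for row in out:
            if math.isnan(row[j]):
                row[j] = col_mean
    return out


def standardize(X):
    """Scale each column to zero mean and unit variance (population std)."""
    n = len(X)
    out = [list(row) for row in X]
    for j in range(len(X[0])):
        col = [row[j] for row in X]
        mean = sum(col) / n
        std = math.sqrt(sum((v - mean) ** 2 for v in col) / n)
        for row in out:
            # Assumption: constant variants are mapped to 0.0 instead of dividing by zero.
            row[j] = (row[j] - mean) / std if std > 0 else 0.0
    return out


nan = float("nan")
# Rows are individuals, columns are variants, as in the (N, S) layout above.
X = [[2.0, nan],
     [2.0, 1.0],
     [2.0, 0.0]]

X_imp = mean_impute(X)       # the missing entry becomes (1.0 + 0.0) / 2
X_std = standardize(X_imp)
print(X_imp)
```

Imputing before standardizing matters: column means and variances are undefined while NaN values are present, which is consistent with impute being switched on when standardize is requested.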

getSnpInfo()

Return a pandas DataFrame with all variant info.

subset_snps(query=None, inplace=False)

Builds a new bed reader with filtered variants.

Parameters:
  • query (str) – pandas query on the bim file. The default value is None.
  • inplace (bool) – If True, the operation is done in place. Default is False.
Returns:

R – Bed reader with filtered variants (if inplace is False).

Return type:

limix.data.BedReader
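
Since the query argument is described as a pandas query on the bim file, the filtering step can be pictured with pandas.DataFrame.query on a toy variant table. This is only a sketch of the idea, not the reader's internals; the table below is a hand-built stand-in that reuses a few rows from the examples above.

```python
import pandas as pd

# Toy stand-in for the variant table (SNP names and positions copied
# from the example output above; the other columns are omitted).
bim = pd.DataFrame({
    "snp": ["rs4030303", "rs4030300", "rs3855952", "rs940550"],
    "pos": [72434, 72515, 77689, 78032],
})

# Keep variants with 72500 <= pos < 78000, then renumber the rows.
query = "(pos >= 72500) & (pos < 78000)"
subset = bim.query(query).reset_index(drop=True)
print(subset)
```

The two surviving variants are the same ones kept by the in-place subsetting example with pos_start=72500 and pos_end=78000.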

limix.data.query_and(*queries)

Join multiple queries using the & operator.

Examples
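
No example output survives here, so the following is a hedged sketch of what joining queries with & could look like rather than the library's implementation. The name query_and_sketch is hypothetical, and wrapping each query in parentheses is an assumption made so that operator precedence stays unambiguous.

```python
def query_and_sketch(*queries):
    """Wrap each query in parentheses and join them with ' & '."""
    return " & ".join("(%s)" % q for q in queries)


combined = query_and_sketch("pos >= 72500", "pos < 78000")
print(combined)  # (pos >= 72500) & (pos < 78000)
```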


limix.data.build_geno_query(idx_start=None, idx_end=None, pos_start=None, pos_end=None, chrom=None)

Helper function to build genotype queries.

Parameters:
  • idx_start (int, optional) – Start index. If given, the condition ‘idx >= idx_start’ is added to the query. The default is None.
  • idx_end (int, optional) – End index (exclusive). If given, the condition ‘idx < idx_end’ is added to the query. The default is None.
  • pos_start (int, optional) – Start chromosomal position. If given, the condition ‘pos >= pos_start’ is added to the query. The default is None.
  • pos_end (int, optional) – End chromosomal position (exclusive). If given, the condition ‘pos < pos_end’ is added to the query. The default is None.
  • chrom (int, optional) – Chromosome. If given, the condition ‘chrom == chrom’ is added to the query. The default is None.
Returns:

query

Return type:

str

Examples
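
The parameter descriptions above suggest that each argument that is not None contributes one condition to the final query string. The sketch below mimics that behaviour under stated assumptions: the function name is hypothetical, the exact string format produced by limix is not shown in this page, and the column name idx simply follows the parameter descriptions (the variant table printed earlier labels its index column i).

```python
def build_geno_query_sketch(idx_start=None, idx_end=None,
                            pos_start=None, pos_end=None, chrom=None):
    """Collect one condition per supplied argument and join them with ' & '."""
    conditions = []
    if idx_start is not None:
        conditions.append("idx >= %d" % idx_start)
    if idx_end is not None:
        conditions.append("idx < %d" % idx_end)
    if pos_start is not None:
        conditions.append("pos >= %d" % pos_start)
    if pos_end is not None:
        conditions.append("pos < %d" % pos_end)
    if chrom is not None:
        conditions.append("chrom == %d" % chrom)
    # Assumption: with no arguments the sketch returns an empty string.
    return " & ".join(conditions)


q = build_geno_query_sketch(pos_start=72500, pos_end=78000)
print(q)  # pos >= 72500 & pos < 78000
```

Both range bounds follow the half-open convention from the parameter list: the start is inclusive and the end is exclusive.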