
bed_reader Documentation

Read and write the PLINK BED format, simply and efficiently.

Features:

  • Fast multi-threaded Rust engine.

  • Supports all Python indexing methods. Slice data by individuals (samples) and/or SNPs (variants).

  • Used by PySnpTools, FaST-LMM, and PyStatGen.

  • Supports PLINK 1.9.

Install

pip install bed-reader

Usage

Read genotype data from a .bed file.

>>> import numpy as np
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> bed = open_bed(file_name)
>>> val = bed.read()
>>> print(val)
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]
>>> del bed

Read every second individual and the SNPs (variants) at index positions 20 (inclusive) to 30 (exclusive).

>>> file_name2 = sample_file("some_missing.bed")
>>> bed2 = open_bed(file_name2)
>>> val2 = bed2.read(index=np.s_[::2,20:30])
>>> print(val2.shape)
(50, 10)
>>> del bed2

List the first 5 individual (sample) ids, the first 5 SNP (variant) ids, and every unique chromosome. Then, read every value in chromosome 5.

>>> with open_bed(file_name2) as bed3:
...     print(bed3.iid[:5])
...     print(bed3.sid[:5])
...     print(np.unique(bed3.chromosome))
...     val3 = bed3.read(index=np.s_[:,bed3.chromosome=='5'])
...     print(val3.shape)
['iid_0' 'iid_1' 'iid_2' 'iid_3' 'iid_4']
['sid_0' 'sid_1' 'sid_2' 'sid_3' 'sid_4']
['1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '21' '22'
 '3' '4' '5' '6' '7' '8' '9']
(100, 6)

Summary

Open, Read, and Write

open_bed(filepath[, iid_count, sid_count, …])

Open a PLINK .bed file for reading.

open_bed.read([index, dtype, order, …])

Read genotype information.

to_bed(filepath, val[, properties, …])

Write values to a file in PLINK .bed format.

Properties of Individuals (samples) and SNPs (variants)

open_bed.iid_count

Number of individuals (samples).

open_bed.sid_count

Number of SNPs (variants).

open_bed.shape

Number of individuals (samples) and SNPs (variants).

open_bed.fid

Family id of each individual (sample).

open_bed.iid

Individual id of each individual (sample).

open_bed.father

Father id of each individual (sample).

open_bed.mother

Mother id of each individual (sample).

open_bed.sex

Sex of each individual (sample).

open_bed.pheno

A phenotype for each individual (sample) (seldom used).

open_bed.sid

SNP id of each SNP (variant).

open_bed.chromosome

Chromosome of each SNP (variant).

open_bed.cm_position

Centimorgan position of each SNP (variant).

open_bed.bp_position

Base-pair position of each SNP (variant).

open_bed.allele_1

First allele of each SNP (variant).

open_bed.allele_2

Second allele of each SNP (variant).

open_bed.properties

All the properties returned as a dictionary.

open_bed.property_item(name)

Retrieve one property by name.

Utilities

sample_file(filepath)

Retrieve a sample .bed file.

tmp_path()

Return a pathlib.Path to a temporary directory.

Details

open_bed

class bed_reader.open_bed(filepath: Union[str, pathlib.Path], iid_count: Optional[int] = None, sid_count: Optional[int] = None, properties: Mapping[str, List[Any]] = {}, count_A1: bool = True, num_threads: Optional[int] = None, skip_format_check: bool = False, fam_filepath: Optional[Union[str, pathlib.Path]] = None, bim_filepath: Optional[Union[str, pathlib.Path]] = None)

Open a PLINK .bed file for reading.

Parameters
  • filepath (pathlib.Path or str) – File path to the .bed file.

  • iid_count (None or int, optional) – Number of individuals (samples) in the .bed file. The default (iid_count=None) finds the number automatically by quickly scanning the .fam file.

  • sid_count (None or int, optional) – Number of SNPs (variants) in the .bed file. The default (sid_count=None) finds the number automatically by quickly scanning the .bim file.

  • properties (dict, optional) –

    A dictionary of any replacement properties. The default is an empty dictionary. The keys of the dictionary are the names of the properties to replace. The possible keys are:

    “fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.

    The values are replacement lists or arrays. A value can also be None, meaning do not read or offer this property. See examples, below.

    The list or array will be converted to a numpy.ndarray of the appropriate dtype, if necessary. Any numpy.nan values will be converted to the appropriate missing value. The PLINK .fam specification and .bim specification list the dtypes and missing values for each property.

  • count_A1 (bool, optional) – True (default) to count the number of A1 alleles (the PLINK standard). False to count the number of A2 alleles.

  • num_threads (None or int, optional) – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

  • skip_format_check (bool, optional) – False (default) to immediately check for expected starting bytes in the .bed file. True to delay the check until (and if) data is read.

  • fam_filepath (pathlib.Path or str, optional) – Path to the file containing information about each individual (sample). Defaults to replacing the .bed file’s suffix with .fam.

  • bim_filepath (pathlib.Path or str, optional) – Path to the file containing information about each SNP (variant). Defaults to replacing the .bed file’s suffix with .bim.
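The "expected starting bytes" mentioned for skip_format_check can be illustrated with a short sketch. In the PLINK 1 format, a .bed file begins with the two magic bytes 0x6C 0x1B, followed by 0x01 for SNP-major layout. The helper name has_bed_header is hypothetical, and whether open_bed checks exactly these three bytes is an assumption; this is not the library's code.

```python
import tempfile
from pathlib import Path

# PLINK 1 .bed header: two magic bytes, then 0x01 for SNP-major layout.
BED_HEADER = bytes([0x6C, 0x1B, 0x01])


def has_bed_header(path):
    """Return True if the file starts with the PLINK 1 SNP-major header.

    Hypothetical helper for illustration; not part of bed_reader.
    """
    with open(path, "rb") as f:
        return f.read(len(BED_HEADER)) == BED_HEADER


with tempfile.TemporaryDirectory() as folder:
    good = Path(folder) / "good.bed"
    good.write_bytes(BED_HEADER + b"\x00")
    bad = Path(folder) / "bad.bed"
    bad.write_bytes(b"not a bed file")
    print(has_bed_header(good), has_bed_header(bad))  # True False
```

A check like this reads only three bytes, so doing it immediately on open is cheap. Delaying it mainly helps when the .bed file may not exist or be reachable yet.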

Returns

an open_bed object

Return type

open_bed

Examples

List individual (sample) iid and SNP (variant) sid, then read() the whole file.

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> bed = open_bed(file_name)
>>> print(bed.iid)
['iid1' 'iid2' 'iid3']
>>> print(bed.sid)
['sid1' 'sid2' 'sid3' 'sid4']
>>> print(bed.read())
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]
>>> del bed  # optional: delete bed object

Open the file and read data for one SNP (variant) at index position 2.

>>> import numpy as np
>>> with open_bed(file_name) as bed:
...     print(bed.read(np.s_[:,2]))
[[nan]
 [nan]
 [ 2.]]

Replace iid.

>>> bed = open_bed(file_name, properties={"iid":["sample1","sample2","sample3"]})
>>> print(bed.iid) # replaced
['sample1' 'sample2' 'sample3']
>>> print(bed.sid) # same as before
['sid1' 'sid2' 'sid3' 'sid4']

Give the number of individuals (samples) and SNPs (variants) so that the .fam and .bim files need never be opened.

>>> with open_bed(file_name, iid_count=3, sid_count=4) as bed:
...     print(bed.read())
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]

Mark some properties as “don’t read or offer”.

>>> bed = open_bed(file_name, properties={
...    "father" : None, "mother" : None, "sex" : None, "pheno" : None,
...    "allele_1" : None, "allele_2":None })
>>> print(bed.iid)        # read from file
['iid1' 'iid2' 'iid3']
>>> print(bed.allele_2)   # not read and not offered
None

See read() for details of reading batches via slicing and fancy indexing.

property allele_1: numpy.ndarray

First allele of each SNP (variant).

Returns

array of str

Return type

numpy.ndarray

If needed, will cause a one-time read of the .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.allele_1)
['A' 'T' 'A' 'T']
property allele_2: numpy.ndarray

Second allele of each SNP (variant).

Returns

array of str

Return type

numpy.ndarray

If needed, will cause a one-time read of the .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.allele_2)
['A' 'C' 'C' 'G']
property bp_position: numpy.ndarray

Base-pair position of each SNP (variant).

Returns

array of int

Return type

numpy.ndarray

0 represents a missing value.

If needed, will cause a one-time read of the .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.bp_position)
[   1  100 1000 1004]
property chromosome: numpy.ndarray

Chromosome of each SNP (variant).

Returns

array of str

Return type

numpy.ndarray

‘0’ represents a missing value.

If needed, will cause a one-time read of the .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.chromosome)
['1' '1' '5' 'Y']
property cm_position: numpy.ndarray

Centimorgan position of each SNP (variant).

Returns

array of float

Return type

numpy.ndarray

0.0 represents a missing value.

If needed, will cause a one-time read of the .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.cm_position)
[ 100.4 2000.5 4000.7 7000.9]
property father: numpy.ndarray

Father id of each individual (sample).

Returns

array of str

Return type

numpy.ndarray

‘0’ represents a missing value.

If needed, will cause a one-time read of the .fam file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.father)
['iid23' 'iid23' 'iid22']
property fid: numpy.ndarray

Family id of each individual (sample).

Returns

array of str

Return type

numpy.ndarray

‘0’ represents a missing value.

If needed, will cause a one-time read of the .fam file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.fid)
['fid1' 'fid1' 'fid2']
property iid: numpy.ndarray

Individual id of each individual (sample).

Returns

array of str

Return type

numpy.ndarray

If needed, will cause a one-time read of the .fam file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.iid)
['iid1' 'iid2' 'iid3']
property iid_count: int

Number of individuals (samples).

Returns

number of individuals

Return type

int

If needed, will cause a fast line-count of the .fam file.
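As a rough illustration of why a line count is cheaper than a full parse, the sketch below counts newline bytes in large binary chunks without splitting or decoding any fields. The function name count_lines and the chunk size are assumptions made for this example; the library's actual counting code may differ.

```python
import tempfile
from pathlib import Path


def count_lines(path, chunk_size=1 << 20):
    """Count lines by scanning raw bytes for newlines (illustrative helper)."""
    count = 0
    last = b"\n"  # so that an empty file counts as zero lines
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            count += chunk.count(b"\n")
            last = chunk[-1:]
    if last != b"\n":
        count += 1  # final line without a trailing newline
    return count


with tempfile.TemporaryDirectory() as folder:
    fam = Path(folder) / "small.fam"
    fam.write_text(
        "fid1 iid1 iid23 iid34 1 red\n"
        "fid1 iid2 iid23 iid34 2 red\n"
        "fid2 iid3 iid22 iid33 0 blue\n"
    )
    print(count_lines(fam))  # 3
```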

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.iid_count)
3
property mother: numpy.ndarray

Mother id of each individual (sample).

Returns

array of str

Return type

numpy.ndarray

‘0’ represents a missing value.

If needed, will cause a one-time read of the .fam file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.mother)
['iid34' 'iid34' 'iid33']
property pheno: numpy.ndarray

A phenotype for each individual (sample) (seldom used).

Returns

array of str

Return type

numpy.ndarray

‘0’ may represent a missing value.

If needed, will cause a one-time read of the .fam file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.pheno)
['red' 'red' 'blue']
property properties: Mapping[str, numpy.array]

All the properties returned as a dictionary.

Returns

all the properties

Return type

dict

The keys of the dictionary are the names of the properties, namely:

“fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.

The values are numpy.ndarray.

If needed, will cause a one-time read of the .fam and .bim files.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(len(bed.properties)) #length of dict
12
property_item(name: str) → numpy.ndarray

Retrieve one property by name.

Returns

a property value

Return type

numpy.ndarray

The name is one of these:

“fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.

If needed, will cause a one-time read of the .fam or .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.property_item('chromosome'))
['1' '1' '5' 'Y']
read(index: Optional[Any] = None, dtype: Optional[Union[type, str]] = 'float32', order: Optional[str] = 'F', force_python_only: Optional[bool] = False, num_threads=None) → numpy.ndarray

Read genotype information.

Parameters
  • index

    An optional expression specifying the individuals (samples) and SNPs (variants) to read. (See examples, below). Defaults to None, meaning read all.

    (If index is a tuple, the first component indexes the individuals and the second indexes the SNPs. If it is not a tuple and not None, it indexes SNPs.)

  • dtype ({'float32' (default), 'float64', 'int8'}, optional) – The desired data-type for the returned array.

  • order ({'F','C'}, optional) – The desired memory layout for the returned array. Defaults to F (Fortran order, which is SNP-major).

  • force_python_only (bool, optional) – If False (default), uses the faster Rust code; otherwise it uses the slower pure Python code.

  • num_threads (None or int, optional) – The number of threads with which to read data. Defaults to all available processors. Can also be set with open_bed or these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
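The effect of the order parameter can be seen with plain NumPy arrays. The sketch below builds two small arrays with the same shape as the "small.bed" sample, 3 individuals by 4 SNPs, and compares their strides. It uses only NumPy and does not call bed_reader.

```python
import numpy as np

iid_count, sid_count = 3, 4
val_f = np.zeros((iid_count, sid_count), dtype="float32", order="F")
val_c = np.zeros((iid_count, sid_count), dtype="float32", order="C")

# Strides are (bytes to the next individual, bytes to the next SNP).
print(val_f.strides)  # (4, 12)
print(val_c.strides)  # (16, 4)

# In F order, all values of one SNP sit next to each other in memory.
print(val_f[:, 0].flags["C_CONTIGUOUS"])  # True
print(val_c[:, 0].flags["C_CONTIGUOUS"])  # False
```

F order tends to suit code that processes one SNP at a time. C order suits code that processes one individual at a time.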

Returns

2-D array containing values of 0, 1, 2, or missing

Return type

numpy.ndarray

Rows represent individuals (samples). Columns represent SNPs (variants).

For dtype ‘float32’ and ‘float64’, NaN indicates missing values. For ‘int8’, -127 indicates missing values.
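The two missing-value conventions can be converted into each other with a few lines of NumPy. The helper names below are hypothetical and not part of bed_reader. They only illustrate the convention stated above, NaN for floats and -127 for int8.

```python
import numpy as np

INT8_MISSING = -127  # missing-value marker for int8, as documented above


def float_to_int8(val):
    """Convert float genotypes to int8, mapping NaN to -127 (illustrative)."""
    val = np.asarray(val, dtype="float64")
    out = np.full(val.shape, INT8_MISSING, dtype="int8")
    known = ~np.isnan(val)
    out[known] = val[known].astype("int8")
    return out


def int8_to_float(val, dtype="float32"):
    """Convert int8 genotypes to float, mapping -127 to NaN (illustrative)."""
    val = np.asarray(val, dtype="int8")
    out = val.astype(dtype)
    out[val == INT8_MISSING] = np.nan
    return out


as_int = float_to_int8([[1.0, 0.0, np.nan, 0.0]])
print(as_int.tolist())  # [[1, 0, -127, 0]]
print(np.isnan(int8_to_float(as_int)).tolist())  # [[False, False, True, False]]
```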

Examples

To read all data in a .bed file, set index to None. This is the default.

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.read())
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]

To read selected individuals (samples) and/or SNPs (variants), set each part of a numpy.s_ to an int, a list of int, a slice expression, or a list of bool. Negative integers count from the end of the list.

>>> import numpy as np
>>> bed = open_bed(file_name)
>>> print(bed.read(np.s_[:,2]))  # read the SNPs indexed by 2.
[[nan]
 [nan]
 [ 2.]]
>>> print(bed.read(np.s_[:,[2,3,0]]))  # read the SNPs indexed by 2, 3, and 0
[[nan  0.  1.]
 [nan  2.  2.]
 [ 2.  0.  0.]]
>>> # read SNPs from 1 (inclusive) to 4 (exclusive)
>>> print(bed.read(np.s_[:,1:4]))
[[ 0. nan  0.]
 [ 0. nan  2.]
 [ 1.  2.  0.]]
>>> print(np.unique(bed.chromosome)) # print unique chrom values
['1' '5' 'Y']
>>> print(bed.read(np.s_[:,bed.chromosome=='5'])) # read all SNPs in chrom 5
[[nan]
 [nan]
 [ 2.]]
>>> print(bed.read(np.s_[0,:])) # Read 1st individual (across all SNPs)
[[ 1.  0. nan  0.]]
>>> print(bed.read(np.s_[::2,:])) # Read every 2nd individual
[[ 1.  0. nan  0.]
 [ 0.  1.  2.  0.]]
>>> # read the last and 2nd-to-last individuals and the last SNP
>>> print(bed.read(np.s_[[-1,-2],-1]))
[[0.]
 [2.]]
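In the outputs above, the result is 2-D even when an index component is a single integer. The sketch below emulates the documented index rules on an in-memory NumPy array: None means read all, a tuple is (individuals, SNPs), and anything else indexes SNPs. The functions to_positions and read_like are hypothetical names for this illustration and say nothing about how the library reads from disk.

```python
import numpy as np


def to_positions(component, count):
    """Resolve one index component to a 1-D array of positions."""
    return np.atleast_1d(np.arange(count)[component])


def read_like(data, index=None):
    """Emulate the documented index rules on an in-memory 2-D array."""
    if index is None:
        index = (slice(None), slice(None))  # read everything
    elif not isinstance(index, tuple):
        index = (slice(None), index)  # a lone component selects SNPs
    iid_index, sid_index = index
    rows = to_positions(iid_index, data.shape[0])
    cols = to_positions(sid_index, data.shape[1])
    return data[np.ix_(rows, cols)]  # outer indexing keeps the result 2-D


data = np.array([[1, 0, np.nan, 0],
                 [2, 0, np.nan, 2],
                 [0, 1, 2, 0]], dtype="float32")
print(read_like(data, np.s_[[-1, -2], -1]).tolist())  # [[0.0], [2.0]]
print(read_like(data, np.s_[0, :]).shape)  # (1, 4)
```

Note that plain NumPy indexing, data[[-1, -2], -1], would return a 1-D array here. The outer indexing via np.ix_ is what reproduces the 2-D shape shown in the examples.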

You can give a dtype for the output.

>>> print(bed.read(dtype='int8'))
[[   1    0 -127    0]
 [   2    0 -127    2]
 [   0    1    2    0]]
>>> del bed  # optional: delete bed object
property sex: numpy.ndarray

Sex of each individual (sample).

Returns

array of 0, 1, or 2

Return type

numpy.ndarray

0 is unknown, 1 is male, 2 is female.

If needed, will cause a one-time read of the .fam file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.sex)
[1 2 0]
property shape

Number of individuals (samples) and SNPs (variants).

Returns

number of individuals, number of SNPs

Return type

(int, int)

If needed, will cause a fast line-count of the .fam and .bim files.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.shape)
(3, 4)
property sid: numpy.ndarray

SNP id of each SNP (variant).

Returns

array of str

Return type

numpy.ndarray

If needed, will cause a one-time read of the .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.sid)
['sid1' 'sid2' 'sid3' 'sid4']
property sid_count: int

Number of SNPs (variants).

Returns

number of SNPs

Return type

int

If needed, will cause a fast line-count of the .bim file.

Example

>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.sid_count)
4

to_bed

bed_reader.to_bed(filepath: Union[str, pathlib.Path], val: numpy.ndarray, properties: Mapping[str, List[Any]] = {}, count_A1: bool = True, fam_filepath: Optional[Union[str, pathlib.Path]] = None, bim_filepath: Optional[Union[str, pathlib.Path]] = None, force_python_only: bool = False, num_threads=None)

Write values to a file in PLINK .bed format.

Parameters
  • filepath – .bed file to write to.

  • val (array-like) – A two-dimensional array (or array-like object) of values. The values should be (or be convertible to) all floats or all integers. The values should be 0, 1, 2, or missing. If floats, missing is np.nan. If integers, missing is -127.

  • properties (dict, optional) –

    A dictionary of property names and values to write to the .fam and .bim files. Any properties not mentioned will be filled in with default values.

    The possible property names are:

    “fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.

    The values are lists or arrays. See example, below.

  • count_A1 (bool, optional) – True (default) to count the number of A1 alleles (the PLINK standard). False to count the number of A2 alleles.

  • fam_filepath (pathlib.Path or str, optional) – Path to the file containing information about each individual (sample). Defaults to replacing the .bed file’s suffix with .fam.

  • bim_filepath (pathlib.Path or str, optional) – Path to the file containing information about each SNP (variant). Defaults to replacing the .bed file’s suffix with .bim.

  • force_python_only – If False (default), uses the faster Rust code; otherwise it uses the slower pure Python code.

  • num_threads (None or int, optional) – The number of threads with which to write data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
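The relationship between the two count_A1 settings can be sketched in NumPy. Each non-missing genotype carries two alleles, so a count of A2 alleles is 2 minus the count of A1 alleles, and missing values stay missing. The helper flip_allele_count is a hypothetical name for this illustration, not a bed_reader function.

```python
import numpy as np


def flip_allele_count(val, int8_missing=-127):
    """Switch between A1 counts and A2 counts (illustrative helper).

    Float input keeps NaN as missing; integer input keeps -127 as missing.
    """
    val = np.asarray(val)
    if np.issubdtype(val.dtype, np.floating):
        return 2 - val  # NaN propagates unchanged
    flipped = 2 - val.astype("int16")  # widen first to avoid int8 overflow
    return np.where(val == int8_missing, int8_missing, flipped).astype(val.dtype)


a1_counts = np.array([[1, 0, -127, 0],
                      [2, 0, -127, 2],
                      [0, 1, 2, 0]], dtype="int8")
print(flip_allele_count(a1_counts).tolist())
# [[1, 2, -127, 2], [0, 2, -127, 0], [2, 1, 0, 2]]
```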

Examples

In this example, all properties are given.

>>> import numpy as np
>>> from bed_reader import to_bed, tmp_path
>>>
>>> output_file = tmp_path() / "small.bed"
>>> val = [[1.0, 0.0, np.nan, 0.0],
...        [2.0, 0.0, np.nan, 2.0],
...        [0.0, 1.0, 2.0, 0.0]]
>>> properties = {
...    "fid": ["fid1", "fid1", "fid2"],
...    "iid": ["iid1", "iid2", "iid3"],
...    "father": ["iid23", "iid23", "iid22"],
...    "mother": ["iid34", "iid34", "iid33"],
...    "sex": [1, 2, 0],
...    "pheno": ["red", "red", "blue"],
...    "chromosome": ["1", "1", "5", "Y"],
...    "sid": ["sid1", "sid2", "sid3", "sid4"],
...    "cm_position": [100.4, 2000.5, 4000.7, 7000.9],
...    "bp_position": [1, 100, 1000, 1004],
...    "allele_1": ["A", "T", "A", "T"],
...    "allele_2": ["A", "C", "C", "G"],
... }
>>> to_bed(output_file, val, properties=properties)

Here, no properties are given, so default values are assigned. If we then read the new file and list the chromosome property, it is an array of ‘0’s, the default chromosome value.

>>> output_file2 = tmp_path() / "small2.bed"
>>> val = [[1, 0, -127, 0], [2, 0, -127, 2], [0, 1, 2, 0]]
>>> to_bed(output_file2, val)
>>>
>>> from bed_reader import open_bed
>>> with open_bed(output_file2) as bed2:
...     print(bed2.chromosome)
['0' '0' '0' '0']
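Because fam_filepath and bim_filepath default to the .bed path with its suffix replaced, a single to_bed call is expected to produce three sibling files. The sketch below derives those default names with pathlib. The helper default_sidecar_paths is a hypothetical name used only for this illustration.

```python
from pathlib import Path


def default_sidecar_paths(bed_path):
    """Return the default (.fam, .bim) paths for a .bed path (illustrative)."""
    bed_path = Path(bed_path)
    return bed_path.with_suffix(".fam"), bed_path.with_suffix(".bim")


fam_path, bim_path = default_sidecar_paths(Path("data") / "small2.bed")
print(fam_path.name, bim_path.name)  # small2.fam small2.bim
```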

sample_file

bed_reader.sample_file(filepath: Union[str, pathlib.Path]) → str

Retrieve a sample .bed file. (Also retrieves associated .fam and .bim files).

Parameters

filepath – Name of the sample .bed file.

Returns

Local name of sample .bed file.

Return type

str

By default this function puts files under the user’s cache directory. Override this by setting the BED_READER_DATA_DIR environment variable.

Example

>>> from bed_reader import sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> print(f"The local file name is '{file_name}'")
The local file name is '...small.bed'

tmp_path

bed_reader.tmp_path() → pathlib.Path

Return a pathlib.Path to a temporary directory.

Returns

a temporary directory

Return type

pathlib.Path

Example

>>> from bed_reader import to_bed, tmp_path
>>>
>>> output_file = tmp_path() / "small3.bed"
>>> val = [[1, 0, -127, 0], [2, 0, -127, 2], [0, 1, 2, 0]]
>>> to_bed(output_file, val)

Environment Variables

By default sample_file() puts files under the user’s cache directory. Override this by setting the BED_READER_DATA_DIR environment variable.

By default, open_bed uses all available processors. Override this with the num_threads parameter or by setting one of these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
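The priority order can be read as a simple lookup chain: an explicit num_threads wins, then the first environment variable that is set, then all available processors. The sketch below is one way to express that chain with the standard library. The function resolve_num_threads is a hypothetical name, and the library's own resolution logic may differ in detail.

```python
import os

# Environment variables in the priority order listed above.
THREAD_ENV_VARS = ("PST_NUM_THREADS", "NUM_THREADS", "MKL_NUM_THREADS")


def resolve_num_threads(num_threads=None, environ=None):
    """Pick a thread count following the documented priority (illustrative)."""
    environ = os.environ if environ is None else environ
    if num_threads is not None:
        return num_threads  # an explicit parameter overrides everything
    for name in THREAD_ENV_VARS:
        if name in environ:
            return int(environ[name])  # first variable that is set wins
    return os.cpu_count() or 1  # default: all available processors


print(resolve_num_threads(4, {"NUM_THREADS": "2"}))  # 4
print(resolve_num_threads(None, {"NUM_THREADS": "2", "MKL_NUM_THREADS": "8"}))  # 2
```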
