bed_reader
Documentation
Read and write the PLINK BED format, simply and efficiently.
Features:
Fast multi-threaded Rust engine.
Supports all Python indexing methods. Slice data by individuals (samples) and/or SNPs (variants).
Used by PySnpTools, FaST-LMM, and PyStatGen.
Supports PLINK 1.9.
Read data locally or from the cloud, efficiently and directly.
Install
Full version: With all optional dependencies:
pip install bed-reader[samples,sparse]
Minimal version: Depends only on numpy:
pip install bed-reader
Usage
Read genotype data from a .bed file.
>>> import numpy as np
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> bed = open_bed(file_name)
>>> val = bed.read()
>>> print(val)
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]
>>> del bed
Read every second individual and SNPs (variants) from 20 to 30.
>>> file_name2 = sample_file("some_missing.bed")
>>> bed2 = open_bed(file_name2)
>>> val2 = bed2.read(index=np.s_[::2,20:30])
>>> print(val2.shape)
(50, 10)
>>> del bed2
List the first 5 individual (sample) ids, the first 5 SNP (variant) ids, and every unique chromosome. Then, read every value in chromosome 5.
>>> with open_bed(file_name2) as bed3:
...     print(bed3.iid[:5])
...     print(bed3.sid[:5])
...     print(np.unique(bed3.chromosome))
...     val3 = bed3.read(index=np.s_[:,bed3.chromosome=='5'])
...     print(val3.shape)
['iid_0' 'iid_1' 'iid_2' 'iid_3' 'iid_4']
['sid_0' 'sid_1' 'sid_2' 'sid_3' 'sid_4']
['1' '10' '11' '12' '13' '14' '15' '16' '17' '18' '19' '2' '20' '21' '22'
'3' '4' '5' '6' '7' '8' '9']
(100, 6)
From the cloud: open a file and read data for one SNP (variant) at index position 2. (See Cloud URL Examples for details on cloud URLs.)
>>> with open_bed("https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/small.bed") as bed:
...     val = bed.read(index=np.s_[:,2], dtype="float64")
...     print(val)
[[nan]
 [nan]
 [ 2.]]
Project Links
Summary
Open, Read, and Write
| open_bed | Open a PLINK .bed file, local or cloud, for reading. |
| open_bed.read | Read genotype information. |
| open_bed.read_sparse | Read genotype information into a scipy.sparse matrix. |
| to_bed | Write values to a file in PLINK .bed format. |
| create_bed | Open a file to write values into PLINK .bed format. |
Properties of Individuals (samples) and SNPs (variants)
| open_bed.iid_count | Number of individuals (samples). |
| open_bed.sid_count | Number of SNPs (variants). |
| open_bed.shape | Number of individuals (samples) and SNPs (variants). |
| open_bed.fid | Family id of each individual (sample). |
| open_bed.iid | Individual id of each individual (sample). |
| open_bed.father | Father id of each individual (sample). |
| open_bed.mother | Mother id of each individual (sample). |
| open_bed.sex | Sex of each individual (sample). |
| open_bed.pheno | A phenotype for each individual (sample) (seldom used). |
| open_bed.sid | SNP id of each SNP (variant). |
| open_bed.chromosome | Chromosome of each SNP (variant). |
| open_bed.cm_position | Centimorgan position of each SNP (variant). |
| open_bed.bp_position | Base-pair position of each SNP (variant). |
| open_bed.allele_1 | First allele of each SNP (variant). |
| open_bed.allele_2 | Second allele of each SNP (variant). |
| open_bed.major | Major mode of a local .bed file. |
| open_bed.properties | All the properties returned as a dictionary. |
| open_bed.property_item | Retrieve one property by name. |
Utilities
| sample_file | Retrieve a sample .bed file. |
| tmp_path | Return a pathlib.Path to a temporary directory. |
Details
open_bed
- class bed_reader.open_bed(location: str | Path | ParseResult, iid_count: int | None = None, sid_count: int | None = None, properties: Mapping[str, List[Any]] = {}, count_A1: bool = True, num_threads: int | None = None, skip_format_check: bool = False, fam_location: str | Path | ParseResult | None = None, bim_location: str | Path | ParseResult | None = None, cloud_options: Mapping[str, str] = {}, max_concurrent_requests: int | None = None, max_chunk_bytes: int | None = None, filepath: str | Path | None = None, fam_filepath: str | Path | None = None, bim_filepath: str | Path | None = None)[source]
Open a PLINK .bed file, local or cloud, for reading.
- Parameters:
location (pathlib.Path or str) – File path or URL to the .bed file. See Cloud URL Examples for details on cloud URLs.
iid_count (None or int, optional) – Number of individuals (samples) in the .bed file. The default (iid_count=None) finds the number automatically by quickly scanning the .fam file.
sid_count (None or int, optional) – Number of SNPs (variants) in the .bed file. The default (sid_count=None) finds the number automatically by quickly scanning the .bim file.
properties (dict, optional) –
A dictionary of any replacement properties. The default is an empty dictionary. The keys of the dictionary are the names of the properties to replace. The possible keys are:
“fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.
The values are replacement lists or arrays. A value can also be None, meaning do not read or offer this property. See examples, below.
The list or array will be converted to a numpy.ndarray of the appropriate dtype, if necessary. Any numpy.nan values will be converted to the appropriate missing value. The PLINK .fam specification and .bim specification list the dtypes and missing values for each property.
count_A1 (bool, optional) – True (default) to count the number of A1 alleles (the PLINK standard). False to count the number of A2 alleles.
num_threads (None or int, optional) – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
skip_format_check (bool, optional) – False (default) to immediately check for expected starting bytes in the .bed file. True to delay the check until (and if) data is read.
fam_location (pathlib.Path or str or URL, optional) – Path to the file containing information about each individual (sample). Defaults to replacing the .bed file’s suffix with .fam.
bim_location (pathlib.Path or str or URL, optional) – Path to the file containing information about each SNP (variant). Defaults to replacing the .bed file’s suffix with .bim.
cloud_options (dict, optional) – A dictionary of options for reading from cloud storage. The default is an empty dictionary.
max_concurrent_requests (None or int, optional) – The maximum number of concurrent requests to make to the cloud storage service. Defaults to 10.
max_chunk_bytes (None or int, optional) – The maximum number of bytes to read in a single request to the cloud storage service. Defaults to 8MB.
filepath (same as location) – Deprecated. Use location instead.
fam_filepath (same as fam_location) – Deprecated. Use fam_location instead.
bim_filepath (same as bim_location) – Deprecated. Use bim_location instead.
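The priority order described for num_threads can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the helper name resolve_num_threads and its environ argument are hypothetical, and bed_reader's own resolution logic may differ in detail.

```python
import os

# Environment variables named in the documentation, in priority order.
ENV_VARS = ("PST_NUM_THREADS", "NUM_THREADS", "MKL_NUM_THREADS")


def resolve_num_threads(num_threads=None, environ=None):
    """Hypothetical helper: pick a thread count in the documented order."""
    environ = os.environ if environ is None else environ
    if num_threads is not None:
        return num_threads  # an explicit argument wins
    for name in ENV_VARS:
        if name in environ:
            return int(environ[name])  # the first variable that is set wins
    return os.cpu_count() or 1  # otherwise, all available processors


print(resolve_num_threads(4, environ={"NUM_THREADS": "2"}))
print(resolve_num_threads(environ={"NUM_THREADS": "2", "MKL_NUM_THREADS": "8"}))
```

The first call prints 4 because the explicit argument takes precedence; the second prints 2 because NUM_THREADS is listed before MKL_NUM_THREADS.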
- Returns:
an open_bed object
- Return type:
open_bed
Examples
Open a local file and list individual (sample) iid and SNP (variant) sid. Then, read() the whole file.
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> bed = open_bed(file_name)
>>> print(bed.iid)
['iid1' 'iid2' 'iid3']
>>> print(bed.sid)
['sid1' 'sid2' 'sid3' 'sid4']
>>> print(bed.read())
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]
>>> del bed  # optional: delete bed object
Open a cloud file with a non-default timeout. Then, read the data for one SNP (variant) at index position 2.
See Cloud URL Examples for details on reading files from cloud storage.
>>> import numpy as np
>>> with open_bed("https://raw.githubusercontent.com/fastlmm/bed-sample-files/main/small.bed",
...               cloud_options={"timeout": "10s"}) as bed:
...     print(bed.read(np.s_[:,2]))
[[nan]
 [nan]
 [ 2.]]
With the local file, replace iid.
>>> bed = open_bed(file_name, properties={"iid":["sample1","sample2","sample3"]})
>>> print(bed.iid) # replaced
['sample1' 'sample2' 'sample3']
>>> print(bed.sid) # same as before
['sid1' 'sid2' 'sid3' 'sid4']
Give the number of individuals (samples) and SNPs (variants) so that the .fam and .bim files need never be opened.
>>> with open_bed(file_name, iid_count=3, sid_count=4) as bed:
...     print(bed.read())
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]
Mark some properties as “don’t read or offer”.
>>> bed = open_bed(file_name, properties={
...    "father" : None, "mother" : None, "sex" : None, "pheno" : None,
...    "allele_1" : None, "allele_2":None })
>>> print(bed.iid) # read from file
['iid1' 'iid2' 'iid3']
>>> print(bed.allele_2) # not read and not offered
None
See read() for details of reading batches via slicing and fancy indexing.
- property allele_1: ndarray
First allele of each SNP (variant).
Returns:
- numpy.ndarray
array of str
If needed, will cause a one-time read of the .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.allele_1)
['A' 'T' 'A' 'T']
- property allele_2: ndarray
Second allele of each SNP (variant).
Returns:
- numpy.ndarray
array of str
If needed, will cause a one-time read of the .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.allele_2)
['A' 'C' 'C' 'G']
- property bp_position: ndarray
Base-pair position of each SNP (variant).
Returns:
- numpy.ndarray
array of int
0 represents a missing value.
If needed, will cause a one-time read of the .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.bp_position)
[   1  100 1000 1004]
- property chromosome: ndarray
Chromosome of each SNP (variant).
Returns:
- numpy.ndarray
array of str
‘0’ represents a missing value.
If needed, will cause a one-time read of the .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.chromosome)
['1' '1' '5' 'Y']
- property cm_position: ndarray
Centimorgan position of each SNP (variant).
Returns:
- numpy.ndarray
array of float
0.0 represents a missing value.
If needed, will cause a one-time read of the .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.cm_position)
[ 100.4 2000.5 4000.7 7000.9]
- property father: ndarray
Father id of each individual (sample).
Returns:
- numpy.ndarray
array of str
‘0’ represents a missing value.
If needed, will cause a one-time read of the .fam file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.father)
['iid23' 'iid23' 'iid22']
- property fid: ndarray
Family id of each individual (sample).
Returns:
- numpy.ndarray
array of str
‘0’ represents a missing value.
If needed, will cause a one-time read of the .fam file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.fid)
['fid1' 'fid1' 'fid2']
- property iid: ndarray
Individual id of each individual (sample).
Returns:
- numpy.ndarray
array of str
If needed, will cause a one-time read of the .fam file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.iid)
['iid1' 'iid2' 'iid3']
- property iid_count: ndarray
Number of individuals (samples).
Returns:
- int
number of individuals
If needed, will cause a fast line-count of the .fam file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.iid_count)
3
- property major: str
Major mode of a local .bed file.
Returns:
- str
‘SNP’ or ‘individual’
Almost all PLINK 1.9 .bed files are ‘SNP’ major. This makes reading the data by SNP(s) fast.
Errors
- ValueError
If the file is a cloud file.
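For a local file, the major mode is recorded in the file header. The sketch below assumes the PLINK 1 .bed layout, in which the first two bytes are 0x6C 0x1B and the third byte is 0x01 for SNP-major or 0x00 for individual-major. The helper name sniff_major is hypothetical and this is not bed_reader's own code; it only illustrates the kind of starting-bytes check that skip_format_check refers to.

```python
import tempfile
from pathlib import Path

BED_MAGIC = bytes([0x6C, 0x1B])  # first two bytes of a PLINK 1 .bed file
MODES = {0x01: "SNP", 0x00: "individual"}  # third byte selects the major mode


def sniff_major(path):
    """Hypothetical helper: report the major mode from the 3-byte header."""
    with open(path, "rb") as f:
        header = f.read(3)
    if len(header) < 3 or header[:2] != BED_MAGIC or header[2] not in MODES:
        raise ValueError(f"{path} does not look like a PLINK .bed file")
    return MODES[header[2]]


# Demonstrate on a tiny stand-in file: the header plus one data byte.
demo = Path(tempfile.mkdtemp()) / "demo.bed"
demo.write_bytes(BED_MAGIC + bytes([0x01, 0x00]))
print(sniff_major(demo))
```

The demo file is written with the SNP-major marker, so the call prints SNP.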
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.major)
SNP
- property mother: ndarray
Mother id of each individual (sample).
Returns:
- numpy.ndarray
array of str
‘0’ represents a missing value.
If needed, will cause a one-time read of the .fam file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.mother)
['iid34' 'iid34' 'iid33']
- property pheno: ndarray
A phenotype for each individual (sample) (seldom used).
Returns:
- numpy.ndarray
array of str
‘0’ may represent a missing value.
If needed, will cause a one-time read of the .fam file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.pheno)
['red' 'red' 'blue']
- property properties: Mapping[str, array]
All the properties returned as a dictionary.
Returns:
- dict
all the properties
The keys of the dictionary are the names of the properties, namely:
“fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.
The values are numpy.ndarray.
If needed, will cause a one-time read of the .fam and .bim files.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(len(bed.properties)) #length of dict
12
- property_item(name: str) → ndarray [source]
Retrieve one property by name.
Returns:
- numpy.ndarray
a property value
The name is one of these:
“fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.
If needed, will cause a one-time read of the .fam or .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.property_item('chromosome'))
['1' '1' '5' 'Y']
- read(index: Any | None = None, dtype: type | str | None = 'float32', order: str | None = 'F', force_python_only: bool | None = False, num_threads=None, max_concurrent_requests=None, max_chunk_bytes=None) → ndarray [source]
Read genotype information.
- Parameters:
index –
An optional expression specifying the individuals (samples) and SNPs (variants) to read. (See examples, below.) Defaults to None, meaning read all.
(If index is a tuple, the first component indexes the individuals and the second indexes the SNPs. If it is not a tuple and not None, it indexes SNPs.)
dtype ({'float32' (default), 'float64', 'int8'}, optional) – The desired data-type for the returned array.
order ({'F','C'}, optional) – The desired memory layout for the returned array. Defaults to F (Fortran order, which is SNP-major).
force_python_only (bool, optional) – If False (default), uses the faster Rust code; otherwise it uses the slower pure Python code.
num_threads (None or int, optional) – The number of threads with which to read data. Defaults to all available processors. Can also be set with open_bed or these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
max_concurrent_requests (None or int, optional) – The maximum number of concurrent requests to make to the cloud storage service. Defaults to 10.
max_chunk_bytes (None or int, optional) – The maximum number of bytes to read in a single request to the cloud storage service. Defaults to 8MB.
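The rule for interpreting index (None, a two-part tuple, or a single SNP index) can be illustrated without NumPy, because np.s_[::2, 20:30] is simply the tuple (slice(None, None, 2), slice(20, 30)). The helper names split_index and positions below are hypothetical, and the sketch only mirrors the documented semantics; it is not the library's implementation.

```python
def split_index(index):
    """Hypothetical helper: return (individual_part, snp_part)."""
    everything = slice(None)  # the same as ':'
    if index is None:
        return everything, everything  # read all
    if isinstance(index, tuple):
        iid_part, sid_part = index  # a tuple must have exactly two parts
        return iid_part, sid_part
    return everything, index  # a non-tuple index selects SNPs only


def positions(part, count):
    """Turn one index part into a list of non-negative positions."""
    if isinstance(part, slice):
        return list(range(*part.indices(count)))
    if isinstance(part, int):
        part = [part]
    part = list(part)
    if part and all(isinstance(x, bool) for x in part):
        return [i for i, keep in enumerate(part) if keep]  # boolean mask
    return [i + count if i < 0 else i for i in part]  # negatives count from the end


# Every second individual and SNPs 20 to 30 of a hypothetical 100 x 100 file.
iid_part, sid_part = split_index((slice(None, None, 2), slice(20, 30)))
print(len(positions(iid_part, 100)), len(positions(sid_part, 100)))

# A bare list indexes SNPs; negative values count from the end.
print(positions(split_index([-1, -2])[1], 4))
```

The first print shows 50 and 10, and the second shows [3, 2].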
- Returns:
2-D array containing values of 0, 1, 2, or missing
- Return type:
numpy.ndarray
Rows represent individuals (samples). Columns represent SNPs (variants).
For dtype ‘float32’ and ‘float64’, NaN indicates missing values. For ‘int8’, -127 indicates missing values.
Examples
To read all data in a .bed file, set index to None. This is the default.
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.read())
[[ 1.  0. nan  0.]
 [ 2.  0. nan  2.]
 [ 0.  1.  2.  0.]]
To read selected individuals (samples) and/or SNPs (variants), set each part of a numpy.s_ to an int, a list of int, a slice expression, or a list of bool. Negative integers count from the end of the list.
>>> import numpy as np
>>> bed = open_bed(file_name)
>>> print(bed.read(np.s_[:,2]))  # read the SNPs indexed by 2.
[[nan]
 [nan]
 [ 2.]]
>>> print(bed.read(np.s_[:,[2,3,0]]))  # read the SNPs indexed by 2, 3, and 0
[[nan  0.  1.]
 [nan  2.  2.]
 [ 2.  0.  0.]]
>>> # read SNPs from 1 (inclusive) to 4 (exclusive)
>>> print(bed.read(np.s_[:,1:4]))
[[ 0. nan  0.]
 [ 0. nan  2.]
 [ 1.  2.  0.]]
>>> print(np.unique(bed.chromosome)) # print unique chrom values
['1' '5' 'Y']
>>> print(bed.read(np.s_[:,bed.chromosome=='5'])) # read all SNPs in chrom 5
[[nan]
 [nan]
 [ 2.]]
>>> print(bed.read(np.s_[0,:])) # Read 1st individual (across all SNPs)
[[ 1.  0. nan  0.]]
>>> print(bed.read(np.s_[::2,:])) # Read every 2nd individual
[[ 1.  0. nan  0.]
 [ 0.  1.  2.  0.]]
>>> # read last and 2nd-to-last individuals and the last SNP
>>> print(bed.read(np.s_[[-1,-2],-1]))
[[0.]
 [2.]]
You can give a dtype for the output.
>>> print(bed.read(dtype='int8'))
[[   1    0 -127    0]
 [   2    0 -127    2]
 [   0    1    2    0]]
>>> del bed  # optional: delete bed object
- read_sparse(index: Any | None = None, dtype: type | str | None = 'float32', batch_size: int | None = None, format: str | None = 'csc', num_threads=None, max_concurrent_requests=None, max_chunk_bytes=None) → csc_matrix | csr_matrix [source]
Read genotype information into a scipy.sparse matrix. Sparse matrices may be useful when the data is mostly zeros.
- Parameters:
index –
An optional expression specifying the individuals (samples) and SNPs (variants) to read. (See examples, below.) Defaults to None, meaning read all.
(If index is a tuple, the first component indexes the individuals and the second indexes the SNPs. If it is not a tuple and not None, it indexes SNPs.)
dtype ({'float32' (default), 'float64', 'int8'}, optional) – The desired data-type for the returned array.
batch_size (None or int, optional) – Number of dense columns or rows to read at a time, internally. Defaults to round(sqrt(total-number-of-columns-or-rows-to-read)).
format ({'csc','csr'}, optional) – The desired format of the sparse matrix. Defaults to csc (Compressed Sparse Column, which is SNP-major).
num_threads (None or int, optional) – The number of threads with which to read data. Defaults to all available processors. Can also be set with open_bed or these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
max_concurrent_requests (None or int, optional) – The maximum number of concurrent requests to make to the cloud storage service. Defaults to 10.
max_chunk_bytes (None or int, optional) – The maximum number of bytes to read in a single request to the cloud storage service. Defaults to 8MB.
- Returns:
a scipy.sparse.csc_matrix (default) or scipy.sparse.csr_matrix
Rows represent individuals (samples). Columns represent SNPs (variants).
For dtype ‘float32’ and ‘float64’, NaN indicates missing values. For ‘int8’, -127 indicates missing values.
The memory used by the final sparse matrix is approximately:
# of non-zero values * (4 bytes + 1 byte (for int8))
For example, consider reading 1000 individuals (samples) x 50,000 SNPs (variants) into csc format where the data is 97% sparse. The memory used will be about 7.5 MB (1000 x 50,000 x 3% x 5 bytes). This is 15% of the 50 MB needed by a dense matrix.
Internally, the function reads the data via small dense matrices. For this example, by default, the function will read 1000 individuals x 224 SNPs (because 224 * 224 is about 50,000). The memory used by the small dense matrix is 1000 x 224 x 1 byte (for int8) = 0.224 MB.
You can set batch_size. Larger values will be faster. Smaller values will use less memory.
For this example, we might want to set the batch_size to 5000. Then, the memory used by the small dense matrix would be 1000 x 5000 x 1 byte (for int8) = 5 MB, similar to the 7.5 MB needed for the final sparse matrix.
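The arithmetic in this worked example can be checked directly. The variable names below are illustrative only, and the figures are the ones stated above (1000 individuals, 50,000 SNPs, 3% non-zero values, int8 data).

```python
import math

iid_count, sid_count = 1_000, 50_000
nonzero_count = iid_count * sid_count * 3 // 100  # 97% sparse means 3% non-zero
bytes_per_nonzero = 4 + 1  # per the formula above: 4 bytes plus 1 byte for int8

sparse_bytes = nonzero_count * bytes_per_nonzero
dense_bytes = iid_count * sid_count * 1  # a full int8 dense matrix

default_batch = round(math.sqrt(sid_count))  # the documented default batch_size
default_batch_bytes = iid_count * default_batch * 1
large_batch_bytes = iid_count * 5_000 * 1  # batch_size=5000

print(f"final sparse matrix: {sparse_bytes / 1e6} MB")
print(f"dense matrix:        {dense_bytes / 1e6} MB")
print(f"default batch_size:  {default_batch}")
print(f"default dense batch: {default_batch_bytes / 1e6} MB")
print(f"batch_size=5000:     {large_batch_bytes / 1e6} MB")
```

The results agree with the text: 7.5 MB for the sparse matrix against 50 MB dense, a default batch of 224 SNPs using 0.224 MB, and 5 MB when batch_size is 5000.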
Examples
Read all data in a .bed file into a scipy.sparse.csc_matrix. The file has 10 individuals (samples) by 20 SNPs (variants). All but eight values are 0.
>>> # pip install bed-reader[samples,sparse]  # if needed
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("sparse.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.shape)
...     val_sparse = bed.read_sparse(dtype="int8")
(10, 20)
>>> print("Nonzero Values", val_sparse.data)
Nonzero Values [1 2 2 1 1 1 1 1]
To read selected individuals (samples) and/or SNPs (variants), set each part of a numpy.s_ to an int, a list of int, a slice expression, or a list of bool. Negative integers count from the end of the list.
>>> import numpy as np
>>> bed = open_bed(file_name)
>>> print("Nonzero Values", bed.read_sparse(np.s_[:,5], dtype="int8").data)  # read the SNPs indexed by 5.
Nonzero Values [2]
>>> # read the SNPs indexed by 5, 4, and 0
>>> print("Nonzero Values", bed.read_sparse(np.s_[:,[5,4,0]], dtype="int8").data)
Nonzero Values [2 1]
>>> # read SNPs from 1 (inclusive) to 11 (exclusive)
>>> print("Nonzero Values", bed.read_sparse(np.s_[:,1:11], dtype="int8").data)
Nonzero Values [1 2 2 1 1]
>>> print(np.unique(bed.chromosome)) # print unique chrom values
['1' '5' 'Y']
>>> # read all SNPs in chrom 5
>>> print("Nonzero Values", bed.read_sparse(np.s_[:,bed.chromosome=='5'], dtype="int8").data)
Nonzero Values [1 2 2 1 1 1 1 1]
>>> # Read 1st individual (across all SNPs)
>>> print("Nonzero Values", bed.read_sparse(np.s_[0,:], dtype="int8").data)
Nonzero Values [2]
>>> print("Nonzero Values", bed.read_sparse(np.s_[::2,:], dtype="int8").data) # Read every 2nd individual
Nonzero Values [1 2 2 1 1]
>>> # read last and 2nd-to-last individuals and the 15th-from-the-last SNP
>>> print("Nonzero Values", bed.read_sparse(np.s_[[-1,-2],-15], dtype="int8").data)
Nonzero Values [2]
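The idea of building a sparse result from small dense batches, as read_sparse is described as doing internally, can be sketched in pure Python. This is an assumption-laden illustration and not the library's code: dense_columns_to_csc and read_columns are hypothetical names, and the output is the usual trio of CSC arrays (data, indices, indptr) held in plain lists.

```python
import math


def dense_columns_to_csc(read_columns, col_count, batch_size=None):
    """Hypothetical sketch: build CSC arrays from batches of dense columns.

    read_columns(start, stop) must return a list of columns, each a list
    of values for every row.
    """
    if batch_size is None:
        batch_size = max(1, round(math.sqrt(col_count)))  # documented default
    data, indices, indptr = [], [], [0]
    for start in range(0, col_count, batch_size):
        stop = min(start + batch_size, col_count)
        for column in read_columns(start, stop):  # one small dense batch
            for row, value in enumerate(column):
                if value != 0:  # keep non-zero values, including missing markers
                    data.append(value)
                    indices.append(row)
            indptr.append(len(data))  # marks where the next column begins
    return data, indices, indptr


# Stand-in for a file: 3 individuals x 4 SNPs, with -127 marking missing int8 values.
dense = [[1, 0, -127, 0],
         [2, 0, -127, 2],
         [0, 1, 2, 0]]


def read_columns(start, stop):
    return [[row[c] for row in dense] for c in range(start, stop)]


data, indices, indptr = dense_columns_to_csc(read_columns, col_count=4)
print(data)
print(indices)
print(indptr)
```

Only one batch of dense columns is held at a time, which is why a smaller batch_size lowers peak memory while a larger one reduces the number of reads.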
- property sex: ndarray
Sex of each individual (sample).
Returns:
- numpy.ndarray
array of 0, 1, or 2
0 is unknown, 1 is male, 2 is female
If needed, will cause a one-time read of the .fam file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.sex)
[1 2 0]
- property shape
Number of individuals (samples) and SNPs (variants).
Returns:
- (int, int)
number of individuals, number of SNPs
If needed, will cause a fast line-count of the .fam and .bim files.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.shape)
(3, 4)
- property sid: ndarray
SNP id of each SNP (variant).
Returns:
- numpy.ndarray
array of str
If needed, will cause a one-time read of the .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.sid)
['sid1' 'sid2' 'sid3' 'sid4']
- property sid_count: ndarray
Number of SNPs (variants).
Returns:
- int
number of SNPs
If needed, will cause a fast line-count of the .bim file.
Example:
>>> from bed_reader import open_bed, sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> with open_bed(file_name) as bed:
...     print(bed.sid_count)
4
to_bed
- bed_reader.to_bed(filepath: str | Path, val: ndarray, properties: Mapping[str, List[Any]] = {}, count_A1: bool = True, fam_filepath: str | Path | None = None, bim_filepath: str | Path | None = None, major: str = 'SNP', force_python_only: bool = False, num_threads=None) → None [source]
Write values to a file in PLINK .bed format.
If your data is too large to fit in memory, use create_bed instead.
- Parameters:
filepath – .bed file to write to.
val (array-like) – A two-dimensional array (or array-like object) of values. The values should be (or be convertible to) all floats or all integers. The values should be 0, 1, 2, or missing. If floats, missing is np.nan. If integers, missing is -127.
properties (dict, optional) –
A dictionary of property names and values to write to the .fam and .bim files. Any properties not mentioned will be filled in with default values.
The possible property names are:
“fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.
The values are lists or arrays. See example, below.
count_A1 (bool, optional) – True (default) to count the number of A1 alleles (the PLINK standard). False to count the number of A2 alleles.
fam_filepath (pathlib.Path or str, optional) – Path to the file containing information about each individual (sample). Defaults to replacing the .bed file’s suffix with .fam.
bim_filepath (pathlib.Path or str, optional) – Path to the file containing information about each SNP (variant). Defaults to replacing the .bed file’s suffix with .bim.
major (str, optional) – Use “SNP” (default) to write the file in the usual SNP-major mode. This makes reading the data SNP-by-SNP faster. Use “individual” to write the file in the uncommon individual-major mode.
force_python_only – If False (default), uses the faster Rust helper functions; otherwise it uses the slower pure Python code.
num_threads (None or int, optional) – The number of threads with which to write data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
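The two settings of count_A1 describe the same genotypes from opposite sides. Assuming each non-missing genotype carries exactly two alleles (the 0/1/2 coding used throughout), the A2 count is 2 minus the A1 count, and missing values stay missing. The helper flip_allele_count below is hypothetical and only illustrates that relationship; it is not part of bed_reader.

```python
import math


def flip_allele_count(val, int_missing=-127):
    """Hypothetical helper: convert A1 counts to A2 counts, or back again."""

    def flip(x):
        if isinstance(x, float) and math.isnan(x):
            return x  # float missing value stays missing
        if x == int_missing:
            return x  # integer missing value stays missing
        return 2 - x  # two alleles per genotype

    return [[flip(x) for x in row] for row in val]


a1_counts = [[1, 0, -127, 0], [2, 0, -127, 2], [0, 1, 2, 0]]
a2_counts = flip_allele_count(a1_counts)
print(a2_counts)
```

This prints [[1, 2, -127, 2], [0, 2, -127, 0], [2, 1, 0, 2]], and applying the helper a second time returns the original counts.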
Examples
In this example, all properties are given.
>>> import numpy as np
>>> from bed_reader import to_bed, tmp_path
>>>
>>> output_file = tmp_path() / "small.bed"
>>> val = [[1.0, 0.0, np.nan, 0.0],
...        [2.0, 0.0, np.nan, 2.0],
...        [0.0, 1.0, 2.0, 0.0]]
>>> properties = {
...    "fid": ["fid1", "fid1", "fid2"],
...    "iid": ["iid1", "iid2", "iid3"],
...    "father": ["iid23", "iid23", "iid22"],
...    "mother": ["iid34", "iid34", "iid33"],
...    "sex": [1, 2, 0],
...    "pheno": ["red", "red", "blue"],
...    "chromosome": ["1", "1", "5", "Y"],
...    "sid": ["sid1", "sid2", "sid3", "sid4"],
...    "cm_position": [100.4, 2000.5, 4000.7, 7000.9],
...    "bp_position": [1, 100, 1000, 1004],
...    "allele_1": ["A", "T", "A", "T"],
...    "allele_2": ["A", "C", "C", "G"],
... }
>>> to_bed(output_file, val, properties=properties)
Here, no properties are given, so default values are assigned. If we then read the new file and list the chromosome property, it is an array of ‘0’s, the default chromosome value.
>>> output_file2 = tmp_path() / "small2.bed"
>>> val = [[1, 0, -127, 0], [2, 0, -127, 2], [0, 1, 2, 0]]
>>> to_bed(output_file2, val)
>>>
>>> from bed_reader import open_bed
>>> with open_bed(output_file2) as bed2:
...     print(bed2.chromosome)
['0' '0' '0' '0']
create_bed
- class bed_reader.create_bed(location: str | Path, iid_count: int, sid_count: int, properties: Mapping[str, List[Any]] = {}, count_A1: bool = True, fam_location: str | Path | None = None, bim_location: str | Path | None = None, major: str = 'SNP', force_python_only: bool = False, num_threads=None)[source]
Open a file to write values into PLINK .bed format.
Values may be given in a SNP-by-SNP (or individual-by-individual) manner. For large datasets, create_bed requires less memory than to_bed().
- Parameters:
location (pathlib.Path or str) – local .bed file to create.
iid_count (int) – The number of individuals (samples).
sid_count (int) – The number of SNPs (variants).
properties (dict, optional) –
A dictionary of property names and values to write to the .fam and .bim files. Any properties not mentioned will be filled in with default values.
The possible property names are:
“fid” (family id), “iid” (individual or sample id), “father” (father id), “mother” (mother id), “sex”, “pheno” (phenotype), “chromosome”, “sid” (SNP or variant id), “cm_position” (centimorgan position), “bp_position” (base-pair position), “allele_1”, “allele_2”.
The values are lists or arrays. See example, below.
count_A1 (bool, optional) – True (default) to count the number of A1 alleles (the PLINK standard). False to count the number of A2 alleles.
fam_location (pathlib.Path or str, optional) – Path to the file containing information about each individual (sample). Defaults to replacing the .bed file’s suffix with .fam.
bim_location (pathlib.Path or str, optional) – Path to the file containing information about each SNP (variant). Defaults to replacing the .bed file’s suffix with .bim.
major (str, optional) – Use “SNP” (default) to write the file in the usual SNP-major mode. This makes reading the data SNP-by-SNP faster. Use “individual” to write the file in the uncommon individual-major mode.
force_python_only – If False (default), uses the faster Rust helper functions; otherwise it uses the slower pure Python code.
num_threads (None or int, optional) – Not currently used.
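The default naming rule for the companion files (replace the .bed suffix with .fam or .bim unless a location is given) maps directly onto pathlib. The helper companion_paths below is a hypothetical illustration of that rule, not a bed_reader function.

```python
from pathlib import Path


def companion_paths(bed_path, fam_path=None, bim_path=None):
    """Hypothetical helper: apply the documented default-suffix rule."""
    bed_path = Path(bed_path)
    fam = Path(fam_path) if fam_path is not None else bed_path.with_suffix(".fam")
    bim = Path(bim_path) if bim_path is not None else bed_path.with_suffix(".bim")
    return fam, bim


fam, bim = companion_paths("data/small.bed")
print(fam.name, bim.name)

# An explicit location overrides the default for that file only.
fam2, bim2 = companion_paths("data/small.bed", fam_path="other/people.fam")
print(fam2.name, bim2.name)
```

The first call prints small.fam small.bim; the second prints people.fam small.bim.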
create_bed Errors
Raises an error if you write the wrong number of vectors or if any vector has the wrong length.
Also, all vector values must be 0, 1, 2, or missing. If floats, missing is np.nan. If integers, missing is -127.
Behind the scenes, create_bed first creates a temporary bed file. At the end, if there are no errors, it renames the temporary file to the final file name. This helps prevent creation of corrupted files.
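The checks and the write-then-rename pattern described above can be sketched with the standard library. CheckedWriter is a hypothetical stand-in, not bed_reader code: it writes plain text rather than the packed .bed encoding, and the temporary file name it uses is an assumption made for the demonstration.

```python
import math
import os
import tempfile
from pathlib import Path


class CheckedWriter:
    """Hypothetical stand-in for the validation and rename behavior."""

    def __init__(self, location, vector_count, vector_length):
        self.location = Path(location)
        self.temp = self.location.with_suffix(".tmp")  # assumed temp name
        self.vector_count = vector_count
        self.vector_length = vector_length
        self.written = 0
        self.file = open(self.temp, "w")

    def write(self, vector):
        if self.written >= self.vector_count:
            raise ValueError("too many vectors")
        if len(vector) != self.vector_length:
            raise ValueError("wrong vector length")
        for x in vector:
            missing = x == -127 or (isinstance(x, float) and math.isnan(x))
            if not missing and x not in (0, 1, 2):
                raise ValueError(f"invalid value {x!r}")
        # Placeholder text output; a real writer would pack the values.
        self.file.write(" ".join(map(str, vector)) + "\n")
        self.written += 1

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.file.close()
        if exc_type is None and self.written == self.vector_count:
            os.replace(self.temp, self.location)  # publish only on success
            return False
        self.temp.unlink()  # discard the partial file
        if exc_type is None:
            raise ValueError("too few vectors")
        return False  # let the original error propagate


out = Path(tempfile.mkdtemp()) / "demo.txt"
with CheckedWriter(out, vector_count=2, vector_length=3) as writer:
    writer.write([1.0, 2.0, 0.0])
    writer.write([math.nan, 0, -127])
print(out.exists())

bad = out.with_name("bad.txt")
try:
    with CheckedWriter(bad, vector_count=2, vector_length=3) as writer:
        writer.write([1, 2, 3])  # 3 is not a valid value
except ValueError as error:
    print("rejected:", error)
print(bad.exists())
```

The successful write leaves demo.txt in place, while the failed one leaves neither bad.txt nor its temporary file behind, so a reader never sees a half-written result.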
Examples
In this example, all properties are given and we write the data out SNP-by-SNP. The data is floats.
>>> import numpy as np
>>> from bed_reader import create_bed, tmp_path
>>>
>>> output_file = tmp_path() / "small.bed"
>>> properties = {
...    "fid": ["fid1", "fid1", "fid2"],
...    "iid": ["iid1", "iid2", "iid3"],
...    "father": ["iid23", "iid23", "iid22"],
...    "mother": ["iid34", "iid34", "iid33"],
...    "sex": [1, 2, 0],
...    "pheno": ["red", "red", "blue"],
...    "chromosome": ["1", "1", "5", "Y"],
...    "sid": ["sid1", "sid2", "sid3", "sid4"],
...    "cm_position": [100.4, 2000.5, 4000.7, 7000.9],
...    "bp_position": [1, 100, 1000, 1004],
...    "allele_1": ["A", "T", "A", "T"],
...    "allele_2": ["A", "C", "C", "G"],
... }
>>> with create_bed(output_file, iid_count=3, sid_count=4, properties=properties) as bed_writer:
...     bed_writer.write([1.0, 2.0, 0.0])
...     bed_writer.write([0.0, 0.0, 1.0])
...     bed_writer.write([np.nan, np.nan, 2.0])
...     bed_writer.write([0.0, 2.0, 0.0])
In this next example, no properties are given, so default values are assigned. Also, we write out ints. Finally, we write the same data out as above, but individual-by-individual. If we then read the new file and list the chromosome property, it is an array of ‘0’s, the default chromosome value.
>>> output_file2 = tmp_path() / "small2.bed"
>>> with create_bed(output_file2, iid_count=3, sid_count=4, major="individual") as bed_writer:  # noqa: E501
...     bed_writer.write([1, 0, -127, 0])
...     bed_writer.write([2, 0, -127, 2])
...     bed_writer.write([0, 1, 2, 0])
>>>
>>> from bed_reader import open_bed
>>> with open_bed(output_file2) as bed2:
...     print(bed2.chromosome)
['0' '0' '0' '0']
- close() → None [source]
Close the bed_writer, writing the file to disk. If you use create_bed with the with statement, you don’t need to use this.
See create_bed for more information.
- write(vector) → None [source]
Write a vector of values to the bed_writer.
See create_bed for more information.
sample_file
- bed_reader.sample_file(filepath: str | Path) → str [source]
Retrieve a sample .bed file. (Also retrieves associated .fam and .bim files).
- Parameters:
filepath – Name of the sample .bed file.
- Returns:
Local name of sample .bed file.
- Return type:
str
Note
This function requires the pooch package. Install pooch with:
pip install --upgrade bed-reader[samples]
By default this function puts files under the user’s cache directory. Override this by setting the BED_READER_DATA_DIR environment variable.
Example
>>> # pip install bed-reader[samples]  # if needed
>>> from bed_reader import sample_file
>>>
>>> file_name = sample_file("small.bed")
>>> print(f"The local file name is '{file_name}'")
The local file name is '...small.bed'
tmp_path
- bed_reader.tmp_path() → Path [source]
Return a pathlib.Path to a temporary directory.
Returns:
- pathlib.Path
a temporary directory
Example:
>>> from bed_reader import to_bed, tmp_path
>>>
>>> output_file = tmp_path() / "small3.bed"
>>> val = [[1, 0, -127, 0], [2, 0, -127, 2], [0, 1, 2, 0]]
>>> to_bed(output_file, val)
Environment Variables
By default, sample_file() puts files under the user’s cache directory. Override this by setting the BED_READER_DATA_DIR environment variable.
By default, open_bed uses all available processors. Override this with the num_threads parameter or by setting one of these environment variables (listed in priority order):
‘PST_NUM_THREAD