pysnptools
Documentation¶
PySnpTools: A library for reading and manipulating genetic data.
See PySnpTools’s README.md for installation instructions, documentation, and code.
- Synopsis:
snpreader: Efficiently read genetic PLINK formats including *.bed/bim/fam and phenotype files. Also, efficiently read parts of files and standardize data.
distreader (new): Efficiently work with unphased BGEN format and other diploid, biallelic distribution data. Also, efficiently read parts of files. See the Distribution IPython Notebook.
kernelreader: Efficiently create, read, and manipulate kernel data.
standardizer: Specify standardizers for snpreader.
kernelstandardizer: Specify standardizers for kernelreader.
pstreader: Generalizes snpreader and kernelreader (provides the efficiency of numpy arrays with some of the flexibility of pandas).
util: In one line, intersect and re-order IIDs from snpreader, kernelreader, and other sources. Also, efficiently extract a submatrix from an ndarray.
util.IntRangeSet: Efficiently manipulate ranges of integers – for example, genetic position – with set operators including union, intersection, and set difference.
util.mapreduce1: Run loops in parallel on multiple processes, threads, or clusters.
util.filecache: Automatically copy files to and from any remote storage.
- Tutorial:
From PyData Conference, Seattle
- IPython Notebooks
- Code:
- Contacts:
Email the developers at fastlmm-dev@python.org.
Join the user discussion and announcements list (or use web sign up).
Open an issue on GitHub.
snpreader Module¶
Tools for reading and manipulating SNP data.
- class pysnptools.snpreader.Bed(filename, count_A1=None, iid=None, sid=None, pos=None, num_threads=None, skip_format_check=False, fam_filename=None, bim_filename=None, chrom_map={'MT': 26, 'X': 23, 'XY': 25, 'Y': 24})¶
A SnpReader for random-access reads of Bed/Bim/Fam files from disk. See SnpReader for details and examples. The format is described here.
- Constructor:
- Parameters:
filename (string) – The *.bed file to read. The ‘.bed’ suffix is optional. The related *.bim and *.fam files will also be read.
count_A1 (bool) – Tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.
The following three options are never needed, but can be used to avoid reading large ‘.fam’ and ‘.bim’ files when their information is already known.
iid (an array of string pairs) – The SnpReader.iid information. If not given, reads info from the ‘.fam’ file.
sid (an array of strings) – The SnpReader.sid information. If not given, reads info from the ‘.bim’ file.
pos (optional, an array of strings) – The SnpReader.pos information. If not given, reads info from the ‘.bim’ file.
- num_threads (optional, int) – The number of threads with which to read data. Defaults to all available processors.
Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- skip_format_check (bool) – By default (False), checks that the ‘.bed’ file starts with the expected bytes
the first time any file (‘.bed’, ‘.fam’, or ‘.bim’) is opened.
fam_filename (optional, string) – The *.fam file to read. Defaults to the bed filename with the suffix replaced.
bim_filename (optional, string) – The *.bim file to read. Defaults to the bed filename with the suffix replaced.
- chrom_map (optional, dictionary) – A dictionary from non-numeric chromosome names to numbers. Defaults to the PLINK mapping, namely, plink_chrom_map.
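The num_threads fallback order described above (environment variables checked in priority order, then all available processors) can be sketched in plain Python. Note that get_num_threads is a hypothetical helper for illustration, not part of the PySnpTools API:

```python
import os

def get_num_threads(default=None):
    """Resolve a thread count the way the docs describe: check the
    environment variables in priority order, then fall back to all
    available processors."""
    for name in ("PST_NUM_THREADS", "NUM_THREADS", "MKL_NUM_THREADS"):
        value = os.environ.get(name)
        if value is not None:
            return int(value)
    return default if default is not None else os.cpu_count()

os.environ.pop("PST_NUM_THREADS", None)
os.environ["NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "8"
print(get_num_threads())  # 4, because 'NUM_THREADS' outranks 'MKL_NUM_THREADS'
```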
Constants
- plink_chrom_map
A dictionary that maps ‘X’ to 23, ‘Y’ to 24, ‘XY’ to 25, and ‘MT’ to 26.
- reverse_plink_chrom_map
A dictionary that maps 23 to ‘X’, 24 to ‘Y’, 25 to ‘XY’, and 26 to ‘MT’.
Methods beyond SnpReader

The SnpReader.read() method returns a SnpData with a SnpData.val ndarray. By default, this ndarray is numpy.float64; optionally, it can be numpy.float32 or numpy.float16. For Bed, however, it can also be numpy.int8 with missing values represented by -127.

When reading, any chromosome and position values of 0 (the PLINK standard for missing) will be represented in pos as NaN.
- Example:
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/distributed_bed_test1_X.*","*.bed")
>>> snp_on_disk = Bed(bedfile, count_A1=False)
>>> snpdata1 = snp_on_disk.read() # Read into the default, a float64 ndarray
>>> snpdata1.val.dtype
dtype('float64')
>>> snpdata1 = snp_on_disk.read(dtype='int8',_require_float32_64=False) # Read into an 'int8' ndarray.
>>> snpdata1.val.dtype
dtype('int8')
- property col¶
same as sid
- property col_property¶
same as pos
- copyinputs(copier)¶
- property row¶
same as iid
- static write(filename, snpdata, count_A1=False, force_python_only=False, _require_float32_64=True, num_threads=None, reverse_chrom_map={})¶
Writes a SnpData to Bed format and returns the Bed.
- Parameters:
filename (string) – the name of the file to create
snpdata (SnpData) – The in-memory data that should be written to disk.
count_A1 (bool) – Tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.
force_python_only (bool) – Defaults to False. Tells to use Python code rather than faster Rust code. (Might be useful for debugging).
_require_float32_64 (bool) – Defaults to True. Requires that snpdata’s dtype is float32 or float64. (False is useful for writing int8 data.)
num_threads (int) – Maximum number of threads to use. Currently ignored and number of threads is 1.
reverse_chrom_map (dictionary) – Dictionary from chromosome number to chromosome string to write in the *.bim file. Defaults to empty dictionary.
- Return type:
Bed

Any pos values of NaN will be written as 0, the PLINK standard for missing chromosome and position values.

>>> import numpy as np
>>> from pysnptools.snpreader import Pheno, Bed, SnpData
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_fn = example_file("pysnptools/examples/toydata.5chrom.*", "*.bed")
>>> snpdata = Bed(bed_fn)[:,::2].read() # Read every-other SNP
>>> pstutil.create_directory_if_necessary("tempdir/everyother.bed")
>>> Bed.write("tempdir/everyother.bed",snpdata,count_A1=False) # Write data in Bed format
Bed('tempdir/everyother.bed',count_A1=False)
>>> # Can write from an int8 array, too.
>>> snpdata_int = SnpData(val=np.int_(snpdata.val).astype('int8'),iid=snpdata.iid,sid=snpdata.sid,pos=snpdata.pos,_require_float32_64=False)
>>> snpdata_int.val.dtype
dtype('int8')
>>> Bed.write("tempdir/everyother.bed",snpdata_int,count_A1=False,_require_float32_64=False)
Bed('tempdir/everyother.bed',count_A1=False)
- class pysnptools.snpreader.Dat(filename, skiprows=0)¶
A SnpReader for reading Dat/Fam/Map-formatted files from disk. See SnpReader for general examples of using SnpReaders.

This is a text format that can store any numeric values. (In contrast, Bed and Ped can only store 0, 1, 2, and missing.) Its Dat files look like:
null_0 j n 0.333 1 2
null_100 j n 2 1 1
null_200 j n 0 nan 1
...
Its Map and Fam files are described here.
- Constructor:
- Parameters:
filename (string) – The Dat file to read.
skiprows (int) – Number of lines to skip before reading. Defaults to 0.
- Example:
>>> from pysnptools.snpreader import Dat
>>> from pysnptools.util import example_file # Download and return local file name
>>> dat_file = example_file("pysnptools/examples/toydata.*","*.dat")
>>> data_on_disk = Dat(dat_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10000)
Methods beyond SnpReader
- copyinputs(copier)¶
- static write(filename, snpdata)¶
Writes a SnpData to dat/fam/map format and returns the Dat.
.- Parameters:
filename (string) – the name of the file to create
snpdata (SnpData) – The in-memory data that should be written to disk.
- Return type:
Dat

>>> from pysnptools.snpreader import Dat, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.dat")
>>> Dat.write("tempdir/toydata10.dat",snpdata) # Write data in dat/fam/map format
Dat('tempdir/toydata10.dat')
- class pysnptools.snpreader.Dense(filename, extract_iid_function=<function zero_family>, extract_sid_pos_function=<function zero_pos>)¶
A SnpReader for reading *.dense.txt (0,1,2 text files) from disk. See SnpReader for general examples of using SnpReaders.

This format looks like:
var 4006 9570 22899 37676 41236 41978 55159 66828...
1-10004-rs12354060 22222222...
1-707348-rs12184279 222222?2...
1-724325-rs12564807 00000000...
...
where rows are sids and columns are iids.
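As a sketch of that layout (a hypothetical parser, not the library's code): the first field of each data row is the row id (a sid), and each remaining character is one iid's value, with ‘?’ meaning missing.

```python
def parse_dense_line(line):
    """Split one *.dense.txt row: the first field is the row id (a sid);
    each following character is one iid's value, '?' meaning missing."""
    row_id, genotypes = line.split()
    return row_id, [None if ch == '?' else int(ch) for ch in genotypes]

sid, values = parse_dense_line("1-707348-rs12184279 222222?2")
print(sid, values)  # 1-707348-rs12184279 [2, 2, 2, 2, 2, 2, None, 2]
```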
- Constructor:
- Parameters:
filename (string) – The Dense file to read.
extract_iid_function (function) – A function that breaks the row id from the file into a family id and an individual id. Defaults to setting the family id to 0 and using the whole row id as the iid.
extract_sid_pos_function (function) – A function that breaks the column id from the file into (chromosome, sid, genetic distance, basepair distance). Defaults to setting the SnpReader.pos information to 0 and using the whole column id as the sid.
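The default extractors described above might be sketched like this; these are hypothetical stand-ins written from the parameter descriptions, not the library's actual zero_family and zero_pos functions:

```python
def zero_family(row_id):
    """Default-style iid extractor: family id '0', whole row id as the iid."""
    return ['0', row_id]

def zero_pos(col_id):
    """Default-style sid/pos extractor: whole column id as the sid, with
    chromosome, genetic distance, and basepair distance all set to 0.
    (The exact return shape in the library is an assumption here.)"""
    return col_id, [0, 0, 0]

print(zero_family('9570'))             # ['0', '9570']
print(zero_pos('1-10004-rs12354060'))  # ('1-10004-rs12354060', [0, 0, 0])
```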
- Example:
>>> from pysnptools.snpreader import Dense
>>> from pysnptools.util import example_file # Download and return local file name
>>> dense_file = example_file("pysnptools/examples/toydata100.dense.txt")
>>> data_on_disk = Dense(dense_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 100)
Methods beyond SnpReader
- static write(filename, snpdata, join_iid_function=<function just_case>, join_sid_pos_function=<function just_sid>)¶
Writes a SnpData to Dense format. The values must be 0, 1, 2 (or missing). Returns a Dense with the default extractors.
- Parameters:
filename (string) – the name of the file to create
snpdata (SnpData) – The in-memory data that should be written to disk.
join_iid_function (a function) – function to turn a family id and individual id into a file id for columns. Defaults to ignoring the family id and using the individual id as the column id.
join_sid_pos_function (a function) – function to turn sid and pos data into a file id for rows. Defaults to ignoring the SnpReader.pos information and using the sid as the row id.
- Return type:
Dense

>>> from pysnptools.snpreader import Dense, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.dense.txt")
>>> Dense.write("tempdir/toydata10.dense.txt",snpdata) # Write data in Dense format
Dense('tempdir/toydata10.dense.txt')
- class pysnptools.snpreader.DistributedBed(storage)¶
A class that implements the SnpReader interface. It stores Bed-like data in pieces on storage. When you request data, it retrieves only the needed pieces.
- Constructor:
- Parameters:
storage (string or FileCache) – Tells where the DistributedBed data is stored. A string can be given and will be interpreted as the path to a directory. A FileCache instance can be given, which provides a method to specify cluster-distributed storage.
- Example:
>>> import os
>>> from pysnptools.snpreader import DistributedBed
>>> from pysnptools.util import example_file # Download and return local file name
>>> folder = os.path.dirname(example_file('pysnptools/examples/toydataSkip10.distributedbed/*'))
>>> data_on_disk = DistributedBed(folder)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 1000)
- property col¶
same as sid
- property col_property¶
same as pos
- copyinputs(copier)¶
- property row¶
same as iid
- static write(storage, snpreader, piece_per_chrom_count=1, updater=None, runner=None)¶
Uploads from any Bed-like data to cluster storage for efficient retrieval later. If some of the contents already exist in storage, it skips uploading that part of the contents. (To avoid this behavior, clear the storage.)
- Parameters:
storage (string or FileCache or None) – Tells where to store SNP data. A string can be given and will be interpreted as the path of a local directory to use for storage. (The local directory will not be automatically erased and so must be user managed.) A FileCache instance can be given, which provides a method to specify cluster-distributed storage. (FileCache instances will not be automatically erased and must be user managed.) If None, the storage will be in an automatically-erasing temporary directory. (If the TEMP environment variable is set, Python places the temp directory under it.)
snpreader (SnpReader) – A Bed or other SnpReader with values of 0, 1, 2, or missing. (Note that this differs from most other write methods, which take a SnpData.)
piece_per_chrom_count (a number) – The number of pieces in which to store the data from each chromosome. Data is split across SNPs. For example, if piece_per_chrom_count is set to 100 and 22 chromosomes are uploaded, then data will be stored in 2200 pieces. Later, when data is requested, only the pieces necessary for the request will be downloaded to local storage.
updater (a function or lambda) – A single-argument function to write logging messages to, for example, the function created by log_in_place().
runner (Runner) – a Runner, optional: Tells how to run. (Note that Local and LocalMultiProc are good options.) If not given, the function is run locally.
- Return type:
DistributedBed

>>> import os
>>> import shutil
>>> from pysnptools.snpreader import DistributedBed, Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> directory = 'tempdir/toydataSkip10.distributedbed'
>>> if os.path.exists(directory):
...     shutil.rmtree(directory)
>>> bedfile = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpreader = Bed(bedfile,count_A1=False)[:,::10] # Read every 10 snps from Bed format
>>> DistributedBed.write(directory,snpreader,piece_per_chrom_count=5) # Write data in DistributedBed format
DistributedBed(LocalCache('tempdir/toydataSkip10.distributedbed'))
- class pysnptools.snpreader.Hdf5(*args, **kwargs)¶
- class pysnptools.snpreader.Ped(filename, missing='0')¶
A SnpReader for reading Ped-formatted (and the related Map-formatted) files from disk. See SnpReader for general examples of using SnpReaders.

This format is described here and looks like:
FAM001 1 0 0 1 2 A A G G A C
FAM001 2 0 0 1 2 A A A G 0 0
...
The direction of the encoding from allele pair to 0,1,2 is arbitrary. That is, if the alleles are “A” and “G”, then “G G” could be 0 and “A A” could be 2, or vice versa. The pair “A G” will always be 1.
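The encoding can be sketched by counting one (arbitrarily chosen) reference allele per pair; encode_pair is a hypothetical helper for illustration, not the library's code:

```python
def encode_pair(pair, reference, missing='0'):
    """Encode an allele pair like ('A', 'G') as 0, 1, or 2 by counting
    how many alleles match the chosen reference; None if missing."""
    if missing in pair:
        return None
    return sum(1 for allele in pair if allele == reference)

pairs = [('A', 'A'), ('A', 'G'), ('G', 'G'), ('0', '0')]
print([encode_pair(p, reference='A') for p in pairs])  # [2, 1, 0, None]
print([encode_pair(p, reference='G') for p in pairs])  # [0, 1, 2, None] - 0s and 2s flip, hets stay 1
```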
- Constructor:
- Parameters:
filename (string) – The Ped file to read.
missing (string) – The value in the file that represents missing data.
- Example:
>>> from pysnptools.snpreader import Ped
>>> from pysnptools.util import example_file # Download and return local file name
>>> ped_file = example_file('pysnptools/examples/toydata.ped')
>>> data_on_disk = Ped(ped_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10000)
Methods beyond SnpReader
- copyinputs(copier)¶
- static write(filename, snpdata)¶
Writes a SnpData to Ped format. The values must be 0,1,2. The direction of the encoding to allele pairs is arbitrary. This means that if a SnpData is written in Ped format and then read back, then 0’s may become 2’s and 2’s may become 0’s. (1’s will stay 1’s.) Returns the Ped.
- Parameters:
filename (string) – the name of the file to create
snpdata (SnpData) – The in-memory data that should be written to disk.
- Return type:
Ped

>>> from pysnptools.snpreader import Ped, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.ped")
>>> Ped.write("tempdir/toydata10.ped",snpdata) # Write data in Ped format
Ped('tempdir/toydata10.ped')
- class pysnptools.snpreader.Pheno(input, iid_if_none=None, missing=None)¶
A SnpReader for reading “alternative” phenotype files from disk. See SnpReader for general examples of using SnpReaders.

This text format is described here and looks like:
FID IID qt1 bmi site
F1 1110 2.3 22.22 2
F2 2202 34.12 18.23 1
...
where the heading row is optional.
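A sketch of parsing such a file (hypothetical, not the library's parser): treat the first row as a header when its phenotype columns are not numeric, then split each remaining row into a (FID, IID) pair and float values.

```python
def parse_pheno(lines):
    """Parse alternative-phenotype rows: FID, IID, then numeric phenotypes.
    The first row is treated as the (optional) header when its phenotype
    columns are not numeric."""
    rows = [line.split() for line in lines if line.strip()]
    header = None
    try:
        float(rows[0][2])
    except ValueError:
        header, rows = rows[0][2:], rows[1:]
    iids = [row[:2] for row in rows]
    vals = [[float(v) for v in row[2:]] for row in rows]
    return header, iids, vals

lines = ["FID IID qt1 bmi site", "F1 1110 2.3 22.22 2", "F2 2202 34.12 18.23 1"]
header, iids, vals = parse_pheno(lines)
print(header)   # ['qt1', 'bmi', 'site']
print(iids[0])  # ['F1', '1110']
```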
- Constructor:
- Parameters:
input (string) – The phenotype file to read.
iid_if_none (None or array of strings) – An SnpReader.iid to use if the file is empty.
missing (None or string) – The value in the file that represents missing data.
For backwards compatibility with older code, input can be a dictionary (instead of a filename) with these fields:
‘header’ : [1] array phenotype name,
‘vals’ : [N*1] array of phenotype data,
‘iid’ : [N*2] array of family IDs and individual IDs
- Example:
>>> from pysnptools.snpreader import Pheno, Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> pheno_file = example_file('pysnptools/examples/toydata.phe')
>>> data_on_disk = Pheno(pheno_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 1)
Methods beyond SnpReader
- static write(filename, snpdata, missing='NaN', sep='\t')¶
Writes a SnpData to Pheno format and returns the Pheno.
.- Parameters:
filename (string) – the name of the file to create
missing (string) – value to treat as missing (default ‘NaN’)
sep (string) – the separator (default tab)
- Return type:
Pheno

>>> from pysnptools.snpreader import Pheno, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.phe")
>>> Pheno.write("tempdir/toydata10.txt",snpdata) # Write data in Pheno format
Pheno('tempdir/toydata10.txt')
- class pysnptools.snpreader.SnpData(iid, sid, val, pos=None, name=None, parent_string=None, copyinputs_function=None, xp=None, _require_float32_64=True)¶
A SnpReader for holding SNP values (or similar values) in-memory, along with related iid and sid information. It is usually created by calling the SnpReader.read() method on another SnpReader, for example, Bed. It can also be constructed directly.

See SnpReader for details and examples.
- Constructor:
- Parameters:
iid (an array of string pairs) – The SnpReader.iid information.
sid (an array of strings) – The SnpReader.sid information.
val (a 2-D array of floats) – The SNP values.
pos (optional, an array of strings) – The SnpReader.pos information.
name (optional, string) – Information to be displayed about the origin of this data.
copyinputs_function (optional, function) – Used internally by optional clustering code.
- xp (optional, a Python module or string) – The array module that controls val, for example, ‘numpy’ or ‘cupy’. Defaults to numpy. Can also be set with the ‘ARRAY_MODULE’ environment variable.
- Example:
>>> from pysnptools.snpreader import SnpData
>>> snpdata = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]])
>>> print((float(snpdata.val[0,1]), snpdata.iid_count, snpdata.sid_count))
(2.0, 2, 3)
Equality:
Two SnpData objects are equal if their four arrays (SnpData.val, SnpReader.iid, SnpReader.sid, and SnpReader.pos) are ‘array_equal’. (Their ‘name’ does not need to be the same.) If either SnpData.val contains NaN, the objects will not be equal. However, SnpData.allclose() can be used to treat NaN as regular values.
- Example:
>>> import numpy as np
>>> from pysnptools.snpreader import SnpData
>>> snpdata1 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata2 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata1 == snpdata2) #True, because all the arrays have the same values.
True
>>> print(snpdata1.val is snpdata2.val) #False, because the two arrays have different memory.
False
>>> snpdata3 = SnpData(iid=[['a','0'],['b','0']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata4 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata3 == snpdata4) #False, because the iids are different.
False
>>> snpdata5 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata6 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata5 == snpdata6) #False, because the val's contain NaN
False
>>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True
Methods beyond SnpReader
- allclose(value, equal_nan=True)¶
- Parameters:
value (SnpData) – Other object with which to compare.
equal_nan (bool) – (Default: True) Tells if NaN in SnpData.val should be treated as regular values when testing equality.
>>> import numpy as np
>>> from pysnptools.snpreader import SnpData
>>> snpdata5 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata6 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True
>>> print(snpdata5.allclose(snpdata6,equal_nan=False)) #False, if we consider the NaN as special values, the arrays are not equal.
False
- standardize(standardizer=Unit(), block_size=None, return_trained=False, force_python_only=False, num_threads=None)¶
Does in-place standardization of the in-memory SNP data. By default, it applies ‘Unit’ standardization, that is: the values for each SNP will have mean zero and standard deviation 1.0. NaN values are then filled with zero, the mean (consequently, if there are NaN values, the final standard deviation will not be 1.0). Note that, for efficiency, this method works in-place, actually changing values in the ndarray. Although it works in place, for convenience it also returns the SnpData.
- Parameters:
standardizer (Standardizer) – optional – Specify standardization to be applied. Any Standardizer may be used. Some choices include Unit (default, makes values for each SNP have mean zero and standard deviation 1.0) and Beta.
block_size (None) – optional – Deprecated.
return_trained (bool) – If true, returns a second value containing a constant Standardizer trained on this data.
force_python_only (bool) – optional – If true, will use pure Python instead of faster C++ libraries.
num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
SnpData (standardizes in place, but for convenience, returns ‘self’)
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bed_file,count_A1=False) # Specify some data on disk in Bed format
>>> snpdata1 = snp_on_disk.read() # read all SNP values into memory
>>> print(snpdata1) # Prints the specification for this SnpData
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False))
>>> print(snpdata1.val[0,0])
2.0
>>> snpdata1.standardize() # standardize changes the values in snpdata1.val and changes the specification.
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),Unit())
>>> print('{0:.6f}'.format(snpdata1.val[0,0]))
0.229416
>>> snpdata2 = snp_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated.
>>> print('{0:.6f}'.format(snpdata2.val[0,0]))
0.229416
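The ‘Unit’ semantics can be sketched directly in NumPy (an illustrative re-implementation of the documented behavior, not the library's optimized code): each SNP column is centered and scaled, then NaNs are filled with the new mean, zero.

```python
import numpy as np

def unit_standardize(val):
    """Column-wise (per-SNP) unit standardization with NaN handling,
    mimicking the documented behavior: mean 0, std 1, then NaN -> 0."""
    val = np.array(val, dtype=np.float64)
    mean = np.nanmean(val, axis=0)
    std = np.nanstd(val, axis=0)
    std[std == 0] = 1.0          # leave constant SNPs centered; avoid 0/0
    val = (val - mean) / std
    val[np.isnan(val)] = 0.0     # fill missing with the (new) mean, zero
    return val

val = unit_standardize([[0., 2., 0.], [0., 1., 2.], [0., np.nan, 1.]])
print(np.allclose(val.mean(axis=0), 0))  # True: each SNP column has mean 0
```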
- train_standardizer(apply_in_place, standardizer=Unit(), force_python_only=False, num_thread=None)¶
Deprecated since version 0.2.23: Use standardize() with return_trained=True instead.
- property val¶
The 2D NumPy array of floats that represents the values of the SNPs. You can get or set this property.
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:5,:].read() #read data for first 5 iids
>>> print(snpdata.val[4,100]) #print one of the SNP values
2.0
- class pysnptools.snpreader.SnpGen(seed, iid_count, sid_count, chrom_count=22, cache_file=None, block_size=None, sid_batch_size=None)¶
A SnpReader that generates deterministic “random” SNP data on the fly. See SnpReader for general examples of using SnpReaders.
- Constructor:
- Parameters:
seed (number) – The random seed that (along with block_size) determines the SNP values.
iid_count (number) – the number of iids (number of individuals)
sid_count (number) – the number of sids (number of SNPs)
chrom_count (number) – the number of chromosomes to generate (must be 22 or fewer)
cache_file (string) – (default None) If provided, tells where to cache the common iid, sid, and pos information. Using it can save time.
block_size (number or None) – (default None) Tells how many SNPs to generate at once. The default picks a block_size such that iid_count * block_size is about 100,000.
- Example:
>>> from pysnptools.snpreader import SnpGen
>>> #Prepare to generate data for 1000 individuals and 1,000,000 SNPs
>>> snp_gen = SnpGen(seed=332,iid_count=1000,sid_count=1000*1000)
>>> print(snp_gen.iid_count,snp_gen.sid_count)
1000 1000000
>>> snp_data = snp_gen[:,200*1000:201*1000].read() #Generate for all iids and for SNPs 200K to 201K
>>> print(snp_data.val[1,1], snp_data.iid_count, snp_data.sid_count)
0.0 1000 1000
- property col¶
An ndarray of the col ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length col_count)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.col[-1]) # print the last column id.
z
>>> data2 = PstData(row=['a','b','c'],col=[[1,0],[2,0]],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.col[-1]) # print the last column id.
[2 0]
- property col_property¶
An ndarray of the additional information for each col. Each element is an ndarray.
- Return type:
1- or 2-dimensional ndarray (length col_count)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> from pysnptools.pstreader import PstNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz')
>>> on_disk = PstNpz(pstnpz_file)
>>> print(on_disk.col_property[:3]) # print column information for the first three cols
[[1. 0.00800801 0. ]
 [1. 0.023023 1. ]
 [1. 0.0700701 4. ]]
- copyinputs(copier)¶
- property row¶
An ndarray of the row ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length row_count)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.row[:2]) # print the first two row ids
['a' 'b']
>>> data2 = PstData(row=[[1,0],[2,0],[2,1]],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.row[:2]) # print the first two row ids
[[1 0]
 [2 0]]
- class pysnptools.snpreader.SnpHdf5(*args, **kwargs)¶
A SnpReader for reading *.snp.hdf5 files from disk. See SnpReader for general examples of using SnpReaders.

The general HDF5 format is described here. The SnpHdf5 format stores val, iid, sid, and pos information in Hdf5 format.
- Constructor:
- Parameters:
filename (string) – The SnpHdf5 file to read.
- Example:
>>> from pysnptools.snpreader import SnpHdf5
>>> from pysnptools.util import example_file # Download and return local file name
>>> hd5f_file = example_file("pysnptools/examples/toydata.snpmajor.snp.hdf5")
>>> data_on_disk = SnpHdf5(hd5f_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10000)
Methods beyond SnpReader
- property col¶
An ndarray of the col ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length col_count)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.col[-1]) # print the last column id.
z
>>> data2 = PstData(row=['a','b','c'],col=[[1,0],[2,0]],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.col[-1]) # print the last column id.
[2 0]
- property row¶
An ndarray of the row ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length row_count)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.row[:2]) # print the first two row ids
['a' 'b']
>>> data2 = PstData(row=[[1,0],[2,0],[2,1]],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.row[:2]) # print the first two row ids
[[1 0]
 [2 0]]
- static write(filename, snpdata, hdf5_dtype=None, sid_major=True)¶
Writes a SnpData to SnpHdf5 format and returns the SnpHdf5.
.- Parameters:
filename (string) – the name of the file to create
snpdata (SnpData) – The in-memory data that should be written to disk.
hdf5_dtype (string) – None (use the .val’s dtype) or an Hdf5 dtype, e.g. ‘f8’, ‘f4’, etc.
sid_major (bool) – Tells if vals should be stored on disk in sid_major (default) or iid_major format.
- Return type:
SnpHdf5

>>> from pysnptools.snpreader import SnpHdf5, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bedfile,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.snp.hdf5")
>>> SnpHdf5.write("tempdir/toydata10.snp.hdf5",snpdata) # Write data in SnpHdf5 format
SnpHdf5('tempdir/toydata10.snp.hdf5')
- class pysnptools.snpreader.SnpMemMap(*args, **kwargs)¶
A
SnpData
that keeps its data in a memory-mapped file. This allows data larger than can fit in main memory. See
SnpData
for general examples of using SnpData.
- Constructor:
- Parameters:
filename (string) – The *.snp.memmap file to read.
Also see
SnpMemMap.empty()
and SnpMemMap.write().
- Example:
>>> from pysnptools.snpreader import SnpMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>> mem_map_file = example_file('pysnptools/examples/tiny.snp.memmap')
>>> snp_mem_map = SnpMemMap(mem_map_file)
>>> print(snp_mem_map.val[0,1], snp_mem_map.iid_count, snp_mem_map.sid_count)
2.0 2 3
Methods inherited from
SnpData
Methods beyond
SnpReader
- static empty(iid, sid, filename, pos=None, order='F', dtype=<class 'numpy.float64'>)¶
Create an empty
SnpMemMap
on disk.- Parameters:
iid (an array of string pairs) – The
SnpReader.iid
informationsid (an array of strings) – The
SnpReader.sid
informationfilename (string) – name of memory-mapped file to create
pos (an array of numeric triples) – optional – The additional
SnpReader.pos
information associated with each sid. Default: Noneorder (string or None) – {‘F’ (default), ‘C’}, optional – Specify the order of the ndarray.
dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
SnpMemMap.val
ndarray.
- Return type:
>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.snpreader import SnpMemMap
>>> filename = "tempdir/tiny.snp.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> snp_mem_map = SnpMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64)
>>> snp_mem_map.val[:,:] = [[0.,2.,0.],[0.,1.,2.]]
>>> snp_mem_map.flush()
- property filename¶
The name of the memory-mapped file
- flush()¶
Flush
SnpMemMap.val
to disk and close the file. (If values or properties are accessed again, the file will be reopened.)
>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.snpreader import SnpMemMap
>>> filename = "tempdir/tiny.snp.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> snp_mem_map = SnpMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64)
>>> snp_mem_map.val[:,:] = [[0.,2.,0.],[0.,1.,2.]]
>>> snp_mem_map.flush()
- property offset¶
The byte position in the file where the memory-mapped values start.
(The disk space before this is used to store
SnpReader.iid
, etc. information. This property is useful when interfacing with, for example, external Fortran and C matrix libraries.)
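To make this concrete, here is a stdlib-only Python sketch of how external code could use such a byte offset: write a file with a metadata header followed by raw float64 values, then seek past the header to reach the matrix. The file layout below is invented for illustration; it is not the actual *.snp.memmap header format.

```python
import os
import struct
import tempfile

# Invented layout: a text header (metadata), then 2x3 float64 values.
header = b"iid/sid metadata goes here\n"
values = [0.0, 2.0, 0.0, 0.0, 1.0, 2.0]

path = os.path.join(tempfile.mkdtemp(), "toy.memmap")
with open(path, "wb") as f:
    f.write(header)
    f.write(struct.pack("<6d", *values))

offset = len(header)  # plays the role of SnpMemMap.offset
with open(path, "rb") as f:
    f.seek(offset)  # skip the metadata and land on the raw matrix
    matrix = struct.unpack("<6d", f.read(6 * 8))
print(matrix[1])  # -> 2.0
```

A Fortran or C library would do the equivalent seek (or mmap the file at this offset) to address the value block directly.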
- property val¶
The 2D NumPy memmap array of floats that represents the values. You can get this property, but cannot set it (except with itself)
>>> from pysnptools.snpreader import SnpMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>> mem_map_file = example_file('pysnptools/examples/tiny.snp.memmap')
>>> snp_mem_map = SnpMemMap(mem_map_file)
>>> print(snp_mem_map.val[0,1])
2.0
- static write(filename, snpreader, standardizer=Identity(), order='A', dtype=None, block_size=None, num_threads=None)¶
Writes a
SnpReader
toSnpMemMap
format.- Parameters:
filename (string) – the name of the file to create
snpreader (
SnpReader
) – The data that should be written to disk.
- Return type:
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> from pysnptools.snpreader import Bed, SnpMemMap
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> bed = Bed(bed_file)
>>> pstutil.create_directory_if_necessary("tempdir/toydata.5chrom.snp.memmap")
>>> SnpMemMap.write("tempdir/toydata.5chrom.snp.memmap",bed) # Write bed in SnpMemMap format
SnpMemMap('tempdir/toydata.5chrom.snp.memmap')
- class pysnptools.snpreader.SnpNpz(*args, **kwargs)¶
A
SnpReader
for reading *.snp.npz files from disk. See
SnpReader
for general examples of using SnpReaders.
The general NPZ format is described here. The SnpNpz format stores val, iid, sid, and pos information in NPZ format.
- Constructor:
- Parameters:
filename (string) – The SnpNpz file to read.
- Example:
>>> from pysnptools.snpreader import SnpNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> npz_file = example_file('pysnptools/examples/toydata10.snp.npz')
>>> data_on_disk = SnpNpz(npz_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10)
Methods beyond
SnpReader
- property col¶
A ndarray of the col ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length
col_count
)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.col[-1]) # print the last column id.
z
>>> data2 = PstData(row=['a','b','c'],col=[[1,0],[2,0]],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.col[-1]) # print the last column id.
[2 0]
- property row¶
A ndarray of the row ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length
row_count
)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.row[:2]) # print the first two row ids
['a' 'b']
>>> data2 = PstData(row=[[1,0],[2,0],[2,1]],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.row[:2]) # print the first two row ids
[[1 0]
 [2 0]]
- static write(filename, snpdata)¶
Writes a
SnpData
to SnpNpz format and returns the SnpNpz.
- Parameters:
filename (string) – the name of the file to create
snpdata (
SnpData
) – The in-memory data that should be written to disk.
- Return type:
>>> from pysnptools.snpreader import SnpNpz, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.snp.npz")
>>> SnpNpz.write("tempdir/toydata10.snp.npz",snpdata) # Write data in SnpNpz format
SnpNpz('tempdir/toydata10.snp.npz')
- class pysnptools.snpreader.SnpReader(*args, **kwargs)¶
A SnpReader is one of three things:
A class such as
Bed
for you to specify a file with data. For example,
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk) # prints the name of the file reader
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)
>>> snp_on_disk.sid_count # prints the number of SNPS (but doesn't read any SNP values)
1015
A
SnpData
class that holds SNP data in memory, typically after reading it from disk:
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> snpdata1 = snp_on_disk.read() #reads the SNP values
>>> type(snpdata1.val).__name__ # The val property is an ndarray of SNP values
'ndarray'
>>> print(snpdata1) # prints the name of the file reader
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False))
>>> snpdata1.iid_count #prints the number of iids (number of individuals) in this in-memory data
300
- A subset of any SnpReader, specified with “[ iid_index , sid_index ]”, to read only some SNP values.
It can also be used to re-order the values.
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> subset_on_disk = snp_on_disk[[3,4],::2] # specification for a subset of the data on disk. No SNP values are read yet.
>>> print(subset_on_disk.sid_count) # prints the number of sids in this subset (but still doesn't read any SNP values)
508
>>> print(subset_on_disk) #prints a specification of 'subset_on_disk'
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2]
>>> snpdata_subset = subset_on_disk.read() # efficiently reads the specified subset of values from the disk
>>> print(snpdata_subset) # prints the name of the file reader
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2])
>>> print((int(snpdata_subset.val.shape[0]), int(snpdata_subset.val.shape[1]))) # The dimensions of the ndarray of SNP values
(2, 508)
The SnpReaders Classes

Class      Format             Random Access        Suffixes         Write method?
SnpData    in-memory floats   Yes                  n/a              n/a
Bed        binary, 0,1,2      Yes (by sid)         .bed/.bim/.fam   Yes
Pheno      text, floats       No                   .txt,.phe        Yes
Dat        text, floats       No                   .dat/.map/.fam   Yes
Ped        text, 0,1,2        No                   .ped/.map        Yes
Dense      text, 0,1,2        No                   .dense.txt       Yes
SnpNpz     binary, floats     No                   .snp.npz         Yes
SnpHdf5    binary, floats     Yes (by sid or iid)  .snp.hdf5        Yes
SnpMemMap  mem-mapped floats  Yes                  .snp.memmap      Yes
SnpGen     generated values   Yes (by sid)         n/a              n/a
Methods & Properties:
Every SnpReader, such as Bed and SnpData, has these properties: iid, iid_count, sid, sid_count, pos and these methods: read(), iid_to_index(), sid_to_index(), read_kernel(). See below for details.
SnpData is a SnpReader, so it supports the above properties and methods. In addition, it supports property SnpData.val, method SnpData.standardize(), and equality testing. See below for details.
Many of the classes, such as Bed, also provide a static Bed.write() method for writing SnpData to disk.
>>> # read from Pheno, write to Bed
>>> from pysnptools.snpreader import Pheno, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_fn = example_file("pysnptools/examples/toydata.5chrom.*", "*.bed")
>>> snpdata = Bed(bed_fn)[:,::2].read() # Read every-other SNP
>>> pstutil.create_directory_if_necessary("tempdir/everyother.bed")
>>> Bed.write("tempdir/everyother.bed",snpdata,count_A1=False) # Write data in Bed format
Bed('tempdir/everyother.bed',count_A1=False)
iids and sids:
Individuals are identified with an iid, which is a ndarray of two strings: a family ID and an individual ID. SNP locations are identified with an sid string in a ndarray. For example:
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk.iid[:3]) # print the first three iids
[['POP1' '0']
 ['POP1' '12']
 ['POP1' '44']]
>>> print(snp_on_disk.sid[:9]) # print the first nine sids
['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
>>> print(snp_on_disk.iid_to_index([['POP1','44'],['POP1','12']])) #Find the indexes for two iids.
[2 1]
Selecting and Reordering Individuals and SNPs
You often don’t want to read the SNP values for all iids and sids. You can use indexing to create a subsetting SnpReader that will read only the SNP values of interest.
SnpReaders support the indexing formats supported by ndarray plus two generalizations. Here are examples of indexing with an array of indexes, with slicing, and with an array of Booleans.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> subset_snpreader_1 = snp_on_disk[[3,-1],:] #index with an array of indexes. Negatives count from end.
>>> print((subset_snpreader_1.iid_count, subset_snpreader_1.sid_count))
(2, 1015)
>>> snpdata1 = subset_snpreader_1.read() # read just the two rows of interest from the disk
>>> subset_snpreader_2 = snp_on_disk[:,:0:-2] #index with a slice
>>> print((subset_snpreader_2.iid_count, subset_snpreader_2.sid_count))
(300, 507)
>>> boolindexes = [s.startswith('23_') for s in snp_on_disk.sid] # create a Boolean index of sids that start '23_'
>>> subset_snpreader_3 = snp_on_disk[:,boolindexes] #index with array of Booleans
>>> print((subset_snpreader_3.iid_count, subset_snpreader_3.sid_count))
(300, 24)
The first generalization over what ndarray offers is full indexing on both the iid dimension and the sid dimension, in other words, full multidimensional indexing. For example,
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> subset_snpreader_4 = snp_on_disk[[3,4],:0:-2] # index on two dimensions at once
>>> print((subset_snpreader_4.iid_count, subset_snpreader_4.sid_count))
(2, 507)
The second generalization is indexing on a single integer index.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> subset_snpreader_5 = snp_on_disk[5,:] #index with single integer
>>> print((subset_snpreader_5.iid_count, subset_snpreader_5.sid_count))
(1, 1015)
Indexing is also useful when you have SNP values in memory via a
SnpData
index and want to copy a subset of those values. While you could instead index directly on the SnpData.val
ndarray, by indexing on the SnpData
instance you also get iid and sid information.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> snpdata1 = snp_on_disk.read() # read all SNP values into memory
>>> print(snpdata1.sid[:9]) # print the first 9 sids
['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
>>> snpdata_subset = snpdata1[:,::2].read(view_ok=True,order='A') # create a copy or view with every other sid
>>> print(snpdata_subset.sid[:9])# print the first 9 sids in the subset
['1_12' '1_10' '1_28' '1_36' '1_4' '1_11' '1_32' '1_9' '1_17']
You can apply indexing on top of indexing to specify subsets of subsets of data to read. In this example, only the SNP values for every 16th sid are actually read from the disk.
>>> # These are just SnpReaders, nothing is read from disk yet
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> half_snpreader = snp_on_disk[:,::2] # a reader for half the sids
>>> quarter_snpreader = half_snpreader[:,::2] # a reader for half of half the sids
>>> sixteenth_snpreader = quarter_snpreader[:,::2][:,::2] # a reader for half of half of half of half the sids
>>> print(sixteenth_snpreader) #Print the specification of this reader
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[:,::2][:,::2][:,::2][:,::2]
>>> # Now we read from disk. Only values for one sid in every 16 will be read.
>>> snpdata_sixteenth = sixteenth_snpreader.read()
>>> print(snpdata_sixteenth.val[0,3])
2.0
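As a plain-Python sanity check of the slice stacking above (standalone code, not pysnptools): composing `[::2]` four times selects exactly every 16th index.

```python
# Stand-in for 1015 sid indexes, as in the Bed example above.
sids = list(range(1015))

# Four stacked half-slices, like snp_on_disk[:,::2][:,::2][:,::2][:,::2].
sixteenth = sids[::2][::2][::2][::2]

print(len(sixteenth))           # -> 64
print(sixteenth == sids[::16])  # -> True: same as a single ::16 slice
```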
When Data is Read:
SNP data can be enormous so we generally avoid reading it to the degree practical. Specifically,
Constructing and printing a SnpReader causes no file reading. For example, these commands read no data:
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> print(snp_on_disk) # Print the Bed SnpReader specification. No data is read.
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)
>>> subset_on_disk = snp_on_disk[[3,4],::2] # Construct a subsetting SnpReader. No data is read.
>>> print(subset_on_disk) # print the subset SnpReader. No data is read.
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2]
Properties and methods related to the iids and sids (to the degree practical) read just some iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once. Consider these commands:
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> print(snp_on_disk.sid[:9]) # without reading any SNP value data from disk, read the sid and iid data from disk, cache it, and then print the first nine sids.
['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
>>> print(snp_on_disk.sid_to_index(['1_10','1_13'])) #use the cached sid information to find the indexes of '1_10' and '1_13'. (No data is read from disk.)
[2 9]
The only methods that read SNP values from file are
read()
and read_kernel()
(to the degree practical). For example:
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() #read all the SNP values from disk, creating a new SnpData instance that keeps these values in memory
>>> print(snpdata1.val[0,2])# print the SNP value for the iid with index 0 and the sid with index 2. (No data is read from disk.)
1.0
If you request the values for only a subset of the iids or sids, (to the degree practical) only that subset will be read from disk. For example:
>>> subset_on_disk = Bed(bedfile,count_A1=False)[[3,4],::2] # Construct a subsetting SnpReader. No data is read.
>>> snpdata_subset = subset_on_disk.read() # from disk, read the SNP values for the iids with index 3 and 4 AND sids with even numbered indexes.
>>> print(snpdata_subset.val[0,2]) # print the SNP value with subset iid index 0 and sid index 2 (corresponding to iid index 3 and sid index 4 in the full data). No data is read from disk.
2.0
When Data is Re-Read and Copied:
Every time you call a SnpReader’s
read()
method, the SnpReader re-reads the SNP value data and returns a new in-memorySnpData
(with SnpData.val
property containing a new ndarray of the SNP values). Likewise, when you call the kernel()
method, the SnpReader re-reads the data and returns a new kernel ndarray.
Here is an example of what not to do, because it causes all the SNP value data to be read twice.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> # The following is not recommended because it inefficiently reads all the SNP values twice.
>>> print(snp_on_disk.read().val[0,2]) # read all values into a new SnpData, print a SNP value
1.0
>>> print(snp_on_disk.read().val[0,3]) # read all values (again) into a second new SnpData, print a SNP value
2.0
Here are two efficient alternatives. First, if all SNP values can fit in memory, read them once into a
SnpData
and then access that SnpData
multiple times.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() # read all values into a new SnpData
>>> print(snpdata1.val[0,2]) # print a SNP value from snpdata1's in-memory ndarray
1.0
>>> print(snpdata1.val[0,3]) # print another SNP value from snpdata1's in-memory ndarray.
2.0
Second, if the SNP value data is too large to fit in memory, use subsetting to read only the SNP values of interest from disk.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> print(snp_on_disk[0,2].read().val[0,0]) #Define the subset of data and read only that subset from disk.
1.0
>>> print(snp_on_disk[0,3].read().val[0,0]) #Define a second subset of data and read only that subset from disk.
2.0
Because the in-memory
SnpData
class is a kind of SnpReader, you may read from it, too. Doing so creates a new SnpData
instance containing a copy of the SNP values in a new ndarray.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() # read all SNP values from disk into a new SnpData
>>> print(snpdata1.val is snpdata1.val) # Do the in-memory SNP values use the same memory as themselves? Yes
True
>>> snpdata2 = snpdata1.read() # copy all the SNP values into a new ndarray in a new SnpData
>>> print(snpdata2.val is snpdata1.val) # Do the two ndarrays of in-memory SNP values use the same memory?
False
Avoiding Unwanted ndarray Allocations
You may want a subset of SNP values from an in-memory
SnpData
and you may know that this subset and the original SnpData
can safely share the memory of the ndarray of SNP values. For this case, the read()
has optional parameters called view_ok and order. If you override the defaults of “view_ok=False,order=’F’” with “view_ok=True,order=’A’”, the read()
will, if practical, return a new SnpData
whose ndarray shares memory with the original ndarray. Use these parameters with care because any change to either ndarray (for example, via SnpData.standardize()
) will affect the other. Also keep in mind that read()
relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() # read all data from disk into a SnpData with a new ndarray
>>> column01 = snpdata1[:,0:1].read(view_ok=True,order='A') #create SnpData with the data from just the first two SNPs. Sharing memory is OK. The memory may be laid out in any order (that is sid-major and iid-major are both OK).
>>> import numpy as np
>>> #print np.may_share_memory(snpdata1.val, column01.val) # Do the two ndarray's share memory? They could (but currently they won't)
>>> column201 = snpdata1[:,[2,0,1]].read(view_ok=True,order='A') #create SnpData with the data from three SNPs, permuted. Sharing memory is OK.
>>> print(np.may_share_memory(snpdata1.val, column201.val)) # Do the two ndarray's share memory? No, ndarray decided that this indexing was too complex for sharing.
False
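The sharing behavior above tracks NumPy's own rules, which read() relies on: a basic slice of an ndarray can be returned as a view that shares memory, while "fancy" indexing with a list of indexes always copies. A minimal standalone demonstration (requires only numpy):

```python
import numpy as np

a = np.arange(10.0)

b = a[::2]        # basic slice -> NumPy returns a view
c = a[[2, 0, 1]]  # fancy (list) index -> NumPy always copies

print(np.may_share_memory(a, b))  # -> True
print(np.may_share_memory(a, c))  # -> False
```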
The
read()
Method
- The
SnpData.standardize()
Method
The
SnpData.standardize()
method, available only on SnpData
, does in-place standardization of the in-memory SNP data. By default, it applies ‘Unit’ standardization, that is: the values for each SNP will have mean zero and standard deviation 1.0. NaN values are then filled with zeros (consequently, if there are NaN values, the final standard deviation will not be exactly 1.0). Note that, for efficiency, this method works in-place, actually changing values in the ndarray. Although it works in place, for convenience it also returns itself. See SnpData.standardize()
for options and details.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> snpdata1 = snp_on_disk.read() # read all SNP values into memory
>>> print(snpdata1) # Prints the specification for this SnpData
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False))
>>> print(snpdata1.val[0,0])
2.0
>>> snpdata1.standardize() # standardize changes the values in snpdata1.val and changes the specification.
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),Unit())
>>> print('{0:.6f}'.format(snpdata1.val[0,0]))
0.229416
>>> snpdata2 = snp_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated.
>>> print('{0:.6f}'.format(snpdata2.val[0,0]))
0.229416
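For reference, the ‘Unit’ rule can be sketched in plain Python for a single SNP column. This is an illustrative re-implementation with a made-up name (unit_standardize); pysnptools' exact treatment of degrees of freedom and NaN may differ.

```python
import math

def unit_standardize(column):
    """Mean 0 and (population) standard deviation 1 over the observed
    values, then fill NaNs with 0.0 -- the post-standardization mean."""
    observed = [v for v in column if not math.isnan(v)]
    mean = sum(observed) / len(observed)
    std = math.sqrt(sum((v - mean) ** 2 for v in observed) / len(observed))
    return [0.0 if math.isnan(v) else (v - mean) / std for v in column]

column = unit_standardize([0.0, 1.0, 2.0, float("nan")])
print(column)  # the observed mean is 1.0, so the middle value maps to 0.0
```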
The
read_kernel()
Method
The
read_kernel()
method, available on any SnpReader, returns a KernelData
. The val
property of the KernelData
is an ndarray of the (possibly standardized) SNP values multiplied with their transposed selves. When applied to a read-from-disk SnpReader, such as Bed
, the method can save memory by reading (and standardizing) the data in blocks. See read_kernel()
for details.
>>> from pysnptools.standardizer import Unit
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> kerneldata1 = snp_on_disk.read_kernel(Unit()) #Create an in-memory kernel from the snp data on disk.
>>> print('{0:.6f}'.format(kerneldata1.val[0,0]))
901.421836
>>> kerneldata2 = snp_on_disk.read_kernel(Unit(),block_size=10) #Create an in-memory kernel from the snp data on disk, but only read 10 SNPS at a time from the disk.
>>> print('{0:.6f}'.format(kerneldata2.val[0,0]))
901.421836
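The memory saving rests on the identity X·Xᵀ = Σ_b X_b·X_bᵀ, where X_b is a block of SNP columns, so only block_size columns need to be in memory at once. A standalone plain-Python sketch with hypothetical helper names (not pysnptools code):

```python
def full_kernel(x):
    """x @ x.T for a list-of-rows matrix: the whole-matrix reference."""
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in x] for r1 in x]

def blockwise_kernel(x, block_size):
    """Accumulate x @ x.T over column blocks of width block_size."""
    n = len(x)
    k = [[0.0] * n for _ in range(n)]
    for start in range(0, len(x[0]), block_size):
        block = [row[start:start + block_size] for row in x]  # X_b
        for i in range(n):
            for j in range(n):
                k[i][j] += sum(a * b for a, b in zip(block[i], block[j]))
    return k

x = [[1.0, 0.0, 2.0, 1.0],   # 2 iids x 4 sids
     [0.0, 1.0, 1.0, 2.0]]
print(blockwise_kernel(x, block_size=2) == full_kernel(x))  # -> True
```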
Details of Methods & Properties:
- The
- as_dist(max_weight=2.0, block_size=None)¶
Returns a
pysnptools.distreader.DistReader
that turns an allele count of 0, 1, or 2 into a probability distribution of [1,0,0], [0,1,0], or [0,0,1], respectively. Any other value produces a distribution of [NaN,NaN,NaN].
max_weight (number) – optional – Tells the maximum allele count. Default of 2.
block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.
- Return type:
DistReader
- Example:
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpreader = Bed(bedfile, count_A1=False)
>>> print(snpreader[0,0].read().val)
[[2.]]
>>> distreader = snpreader.as_dist(max_weight=2)
>>> print(distreader[0,0].read().val)
[[[0. 0. 1.]]]
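The count-to-distribution rule can be sketched in plain Python. The function name count_to_dist is made up for illustration; the real as_dist also honors max_weight and operates on whole arrays.

```python
NAN = float("nan")

def count_to_dist(count):
    """Map an allele count of 0, 1, or 2 to a one-hot distribution;
    anything else maps to [NaN, NaN, NaN]."""
    table = {0: [1.0, 0.0, 0.0], 1: [0.0, 1.0, 0.0], 2: [0.0, 0.0, 1.0]}
    return table.get(count, [NAN, NAN, NAN])

print(count_to_dist(2))  # -> [0.0, 0.0, 1.0]
print(count_to_dist(3))  # -> [nan, nan, nan]
```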
- copyinputs(copier)¶
- property iid¶
A ndarray of the iids. Each iid is a ndarray of two strings (a family ID and an individual ID) that identifies an individual.
- Return type:
ndarray of strings with shape [
iid_count
,2]
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk.iid[:3]) # print the first three iids
[['POP1' '0']
 ['POP1' '12']
 ['POP1' '44']]
- property iid_count¶
number of iids
- Return type:
integer
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- iid_to_index(list)¶
Takes a list of iids and returns a list of index numbers
- Parameters:
list – list of iids
- Return type:
ndarray of int
This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk
>>> print(snp_on_disk.iid_to_index([['POP1','44'],['POP1','12']])) #Find the indexes for two iids.
[2 1]
- kernel(standardizer, allowlowrank=False, block_size=10000, blocksize=None, num_threads=None)¶
Warning
Deprecated. Use
read_kernel()
instead.Returns a ndarray of size iid_count x iid_count. The returned array has the value of the standardized SNP values multiplied with their transposed selves.
- Parameters:
standardizer (
Standardizer
) – Specify standardization to be applied before the matrix multiply. Any Standardizer
may be used. Some choices include Standardizer.Identity
(do nothing), Unit
(make values for each SNP have mean zero and standard deviation 1.0), Beta.
block_size (int or None) – optional – Default of 10000. None means to load all. Suggested number of sids to read into memory at a time.
num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
Calling the method again causes the SNP values to be re-read and allocates a new ndarray.
When applied to a read-from-disk SnpReader, such as
Bed
, the method can save memory by reading (and standardizing) the data in blocks.If you request the values for only a subset of the sids or iids, (to the degree practical) only that subset will be read from disk.
- Example:
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.standardizer import Unit
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk
>>> kernel = snp_on_disk.kernel(Unit())
>>> print(((int(kernel.shape[0]),int(kernel.shape[1])), '{0:.6f}'.format(kernel[0,0])))
((300, 300), '901.421836')
- property pos¶
A ndarray of the position information for each sid. Each element is a ndarray of three numpy.numbers (chromosome, genetic distance, basepair distance).
- Return type:
ndarray of float64 with shape [
sid_count
, 3]
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk.pos[:3,]) # print position information for the first three sids: #The '...' is for possible space char
[[1. 0.00800801 nan]
 [1. 0.023023 1. ]
 [1. 0.0700701 4. ]]
- read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None, _require_float32_64=True)¶
Reads the SNP values and returns a
SnpData
(with SnpData.val
property containing a new ndarray of the SNP values).
- Parameters:
order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (sid-index varies the fastest). If order is ‘A’, then the
SnpData.val
ndarray may be in any order (either C- or Fortran-contiguous).
dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
SnpData.val
ndarray. (For Bed
, only, it can also be numpy.int8. Hidden option _require_float32_64 must be set to False. See Bed
for an example.)
force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
view_ok (bool) – optional – If False (default), allocates new memory for the
SnpData.val
’s ndarray. If True, if practical and reading from a SnpData
, will return a new SnpData
whose ndarray shares memory with the original SnpData
. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray (for example, via SnpData.standardize()
) will affect the other. Also keep in mind that read()
relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.
num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
Calling the method again causes the SNP values to be re-read and creates a new in-memory
SnpData
with a new ndarray of SNP values.
If you request the values for only a subset of the sids or iids, (to the degree practical) only that subset will be read from disk.
- Example:
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile, count_A1=False) # Specify SNP data on disk
>>> snpdata1 = snp_on_disk.read() # Read all the SNP data returning a SnpData instance
>>> print(type(snpdata1.val).__name__) # The SnpData instance contains a ndarray of the data.
ndarray
>>> subset_snpdata = snp_on_disk[:,::2].read() # From the disk, read SNP values for every other sid
>>> print(subset_snpdata.val[0,0]) # Print the first SNP value in the subset
2.0
>>> subsub_snpdata = subset_snpdata[:10,:].read(order='A',view_ok=True) # Create an in-memory subset of the subset with SNP values for the first ten iids. Share memory if practical.
>>> import numpy as np
>>> # print np.may_share_memory(subset_snpdata.val, subsub_snpdata.val) # Do the two ndarray's share memory? They could. Currently they won't.
- read_kernel(standardizer=None, block_size=None, order='A', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)¶
Returns a
KernelData
such that theKernelData.val()
property will be an ndarray of the standardized SNP values multiplied with their transposed selves.- Parameters:
standardizer (
Standardizer
) – (required) Specify standardization to be applied before the matrix multiply. Any Standardizer
may be used. Some choices include Standardizer.Identity
(do nothing),Unit
(make values for each SNP have mean zero and standard deviation 1.0) andBeta
.block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.
- Return type:
KernelData
Calling the method again causes the SNP values to be re-read and allocates a new KernelData.
When applied to a read-from-disk SnpReader, such as
Bed
, the method can save memory by reading (and standardizing) the data in blocks.- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.standardizer import Unit >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk >>> kerneldata1 = snp_on_disk.read_kernel(Unit()) >>> print((int(kerneldata1.iid_count), '{0:.6f}'.format(kerneldata1.val[0,0]))) (300, '901.421836')
- property row_property¶
Defined as a zero-width array for compatibility with
PstReader
, but not used.
- property sid¶
An ndarray of the sids. Each sid is a string that identifies a SNP.
- Return type:
ndarray (length
sid_count
) of strings
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) >>> print(snp_on_disk.sid[:9]) # print the first nine sids ['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
- property sid_count¶
number of sids
- Return type:
integer
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- sid_to_index(list)¶
Takes a list of sids and returns a list of index numbers
- Parameters:
list (list of strings) – list of sids
- Return type:
ndarray of int
This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk >>> print(snp_on_disk.sid_to_index(['1_10','1_13'])) #Find the indexes for two sids. [2 9]
- property val_shape¶
Tells the shape of the value for a given individual and SNP. For SnpReaders, this is always None, meaning a single scalar value.
snpreader.SnpReader
¶
- class pysnptools.snpreader.SnpReader(*args, **kwargs)¶
A SnpReader is one of three things:
A class such as
Bed
for you to specify a file with data. For example,
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) >>> print(snp_on_disk) # prints the name of the file reader Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False) >>> snp_on_disk.sid_count # prints the number of SNPS (but doesn't read any SNP values) 1015
A
SnpData
class that holds SNP data in memory, typically after reading it from disk:
>>> snp_on_disk = Bed(bedfile,count_A1=False) >>> snpdata1 = snp_on_disk.read() #reads the SNP values >>> type(snpdata1.val).__name__ # The val property is an ndarray of SNP values 'ndarray' >>> print(snpdata1) # prints the name of the file reader SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)) >>> snpdata1.iid_count #prints the number of iids (number of individuals) in this in-memory data 300
- A subset of any SnpReader, specified with “[ iid_index , sid_index ]”, to read only some SNP values.
It can also be used to re-order the values.
>>> snp_on_disk = Bed(bedfile,count_A1=False) >>> subset_on_disk = snp_on_disk[[3,4],::2] # specification for a subset of the data on disk. No SNP values are read yet. >>> print(subset_on_disk.sid_count) # prints the number of sids in this subset (but still doesn't read any SNP values) 508 >>> print(subset_on_disk) #prints a specification of 'subset_on_disk' Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2] >>> snpdata_subset = subset_on_disk.read() # efficiently reads the specified subset of values from the disk >>> print(snpdata_subset) # prints the name of the file reader SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2]) >>> print((int(snpdata_subset.val.shape[0]), int(snpdata_subset.val.shape[1]))) # The dimensions of the ndarray of SNP values (2, 508)
The SnpReaders Classes

| Class | Format | Random Access | Suffixes | Write method? |
|---|---|---|---|---|
| SnpData | in-memory floats | Yes | n/a | n/a |
| Bed | binary, 0,1,2 | Yes (by sid) | .bed/.bim/.fam | Yes |
| Pheno | text, floats | No | .txt,.phe | Yes |
| Dat | text, floats | No | .dat/.map/.fam | Yes |
| Ped | text, 0,1,2 | No | .ped/.map | Yes |
| Dense | text, 0,1,2 | No | .dense.txt | Yes |
| SnpNpz | binary, floats | No | .snp.npz | Yes |
| SnpHdf5 | binary, floats | Yes (by sid or iid) | .snp.hdf5 | Yes |
| SnpMemMap | mem-mapped floats | Yes | .snp.memmap | Yes |
| SnpGen | generated values | Yes (by sid) | n/a | n/a |
Methods & Properties:
Every SnpReader, such as
Bed
andSnpData
, has these properties:iid
,iid_count
,sid
,sid_count
,pos
and these methods:read()
,iid_to_index()
,sid_to_index()
,read_kernel()
. See below for details.SnpData
is a SnpReader so it supports the above properties and methods. In addition, it supports propertySnpData.val
, methodSnpData.standardize()
, and equality testing. See below for details.Many of the classes, such as
Bed
, also provide a staticBed.write()
method for writingSnpData
to disk.>>> # read from Pheno, write to Bed >>> from pysnptools.snpreader import Pheno, Bed >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name
>>> bed_fn = example_file("pysnptools/examples/toydata.5chrom.*", "*.bed") >>> snpdata = Bed(bed_fn)[:,::2].read() # Read every-other SNP >>> pstutil.create_directory_if_necessary("tempdir/everyother.bed") >>> Bed.write("tempdir/everyother.bed",snpdata,count_A1=False) # Write data in Bed format Bed('tempdir/everyother.bed',count_A1=False)
iids and sids:
Individuals are identified with an iid, which is an ndarray of two strings: a family ID and an individual ID. SNP locations are identified with an sid string in an ndarray. For example:
>>> snp_on_disk = Bed(bedfile,count_A1=False) >>> print(snp_on_disk.iid[:3]) # print the first three iids [['POP1' '0'] ['POP1' '12'] ['POP1' '44']] >>> print(snp_on_disk.sid[:9]) # print the first nine sids ['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4'] >>> print(snp_on_disk.iid_to_index([['POP1','44'],['POP1','12']])) #Find the indexes for two iids. [2 1]
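The lookup methods such as iid_to_index can be understood as a cached dictionary from identifiers to row numbers, built once per reader. The sketch below is illustrative only (the helper name and data are hypothetical, not the pysnptools implementation):

```python
import numpy as np

# Hypothetical iid data: each row is (family ID, individual ID).
iid = np.array([["POP1", "0"], ["POP1", "12"], ["POP1", "44"]])

# Build the lookup table once; afterward each query is an O(1) dict hit.
iid_lookup = {tuple(row): i for i, row in enumerate(iid)}

def iid_to_index(iid_list):
    """Return an ndarray of row indexes for the requested iids."""
    return np.array([iid_lookup[tuple(pair)] for pair in iid_list])

print(iid_to_index([["POP1", "44"], ["POP1", "12"]]))  # [2 1]
```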
Selecting and Reordering Individuals and SNPs
You often don’t want to read the SNP values for all iids and sids. You can use indexing to create a subsetting SnpReader that will read only the SNP values of interest.
SnpReaders support the indexing formats supported by ndarray plus two generalizations. Here are examples of indexing with an array of indexes, with slicing, and with an array of Booleans.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format >>> subset_snpreader_1 = snp_on_disk[[3,-1],:] #index with an array of indexes. Negatives count from end. >>> print((subset_snpreader_1.iid_count, subset_snpreader_1.sid_count)) (2, 1015) >>> snpdata1 = subset_snpreader_1.read() # read just the two rows of interest from the disk >>> subset_snpreader_2 = snp_on_disk[:,:0:-2] #index with a slice >>> print((subset_snpreader_2.iid_count, subset_snpreader_2.sid_count)) (300, 507) >>> boolindexes = [s.startswith('23_') for s in snp_on_disk.sid] # create a Boolean index of sids that start '23_' >>> subset_snpreader_3 = snp_on_disk[:,boolindexes] #index with array of Booleans >>> print((subset_snpreader_3.iid_count, subset_snpreader_3.sid_count)) (300, 24)
The first generalization over what ndarray offers is full indexing on both the iid dimension and the sid dimension, in other words, full multidimensional indexing. For example,
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format >>> subset_snpreader_4 = snp_on_disk[[3,4],:0:-2] # index on two dimensions at once >>> print((subset_snpreader_4.iid_count, subset_snpreader_4.sid_count)) (2, 507)
The second generalization is indexing with a single integer.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format >>> subset_snpreader_5 = snp_on_disk[5,:] #index with single integer >>> print((subset_snpreader_5.iid_count, subset_snpreader_5.sid_count)) (1, 1015)
Indexing is also useful when you have SNP values in memory via a
SnpData
index and want to copy a subset of those values. While you could instead index directly on theSnpData.val
ndarray, by indexing on theSnpData
instance you also get iid and sid information.>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format >>> snpdata1 = snp_on_disk.read() # read all SNP values into memory >>> print(snpdata1.sid[:9]) # print the first 9 sids ['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4'] >>> snpdata_subset = snpdata1[:,::2].read(view_ok=True,order='A') # create a copy or view with every other sid >>> print(snpdata_subset.sid[:9])# print the first 9 sids in the subset ['1_12' '1_10' '1_28' '1_36' '1_4' '1_11' '1_32' '1_9' '1_17']
You can apply indexing on top of indexing to specify subsets of subsets of data to read. In this example, only the SNP values for every 16th sid are actually read from the disk.
>>> # These are just SnpReaders, nothing is read from disk yet >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format >>> half_snpreader = snp_on_disk[:,::2] # a reader for half the sids >>> quarter_snpreader = half_snpreader[:,::2] # a reader for half of half the sids >>> sixteenth_snpreader = quarter_snpreader[:,::2][:,::2] # a reader for half of half of half of half the sids >>> print(sixteenth_snpreader) #Print the specification of this reader Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[:,::2][:,::2][:,::2][:,::2] >>> # Now we read from disk. Only values for one sid in every 16 will be read. >>> snpdata_sixteenth = sixteenth_snpreader.read() >>> print(snpdata_sixteenth.val[0,3]) 2.0
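The stacking above works because each subsetting step only composes index expressions; nothing touches the disk until read() is called. A small NumPy-only sketch of the same composition (illustrative, not pysnptools internals):

```python
import numpy as np

sid_count = 1015                    # matches the example file above

# A chain of subsetting specifications, applied lazily: we only compose
# the index expressions, reading no data.
indices = np.arange(sid_count)
for _ in range(4):                  # four stacked [:, ::2] subsets
    indices = indices[::2]          # compose the slice; still no data read

# Only now would a reader touch the disk, fetching just these columns.
print(len(indices))                 # 64 of the original 1015 sids
```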
When Data is Read:
SNP data can be enormous so we generally avoid reading it to the degree practical. Specifically,
Constructing and printing a SnpReader causes no file reading. For example, these commands read no data:
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read. >>> print(snp_on_disk) # Print the Bed SnpReader specification. No data is read. Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False) >>> subset_on_disk = snp_on_disk[[3,4],::2] # Construct a subsetting SnpReader. No data is read. >>> print(subset_on_disk) # print the subset SnpReader. No data is read. Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2]
Properties and methods related to the iids and sids (to the degree practical) read just some iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once. Consider these commands:
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read. >>> print(snp_on_disk.sid[:9]) # without reading any SNP values data from disk, read the sid and iid data from disk, cache it, and then print the first nine sids. ['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4'] >>> print(snp_on_disk.sid_to_index(['1_10','1_13'])) #use the cached sid information to find the indexes of '1_10' and '1_13'. (No data is read from disk.) [2 9]
The only methods that read SNP values from file are
read()
andread_kernel()
(to the degree practical). For example:>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read. >>> snpdata1 = snp_on_disk.read() #read all the SNP values from disk, creating a new SnpData instance that keeps these values in memory >>> print(snpdata1.val[0,2])# print the SNP value for the iid with index 0 and the sid with index 2. (No data is read from disk.) 1.0
If you request the values for only a subset of the iids or sids, only that subset (to the degree practical) will be read from disk. For example:
>>> subset_on_disk = Bed(bedfile,count_A1=False)[[3,4],::2] # Construct a subsetting SnpReader. No data is read. >>> snpdata_subset = subset_on_disk.read() # from disk, read the SNP values for the iids with index 3 and 4 AND sids with even numbered indexes. >>> print(snpdata_subset.val[0,2]) # print the SNP value with subset iid index 0 and sid index 2 (corresponding to iid index 3 and sid index 4 in the full data). No data is read from disk. 2.0
When Data is Re-Read and Copied:
Every time you call a SnpReader’s
read()
method, the SnpReader re-reads the SNP value data and returns a new in-memorySnpData
(withSnpData.val
property containing a new ndarray of the SNP values). Likewise, when you call thekernel()
method, the SnpReader re-reads the data and returns a new kernel ndarray.Here is an example of what not to do, because it causes all the SNP value data to be read twice.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read. >>> # The following is not recommended because it inefficiently reads all the SNP values twice. >>> print(snp_on_disk.read().val[0,2]) # read all values into a new SnpData, print a SNP value 1.0 >>> print(snp_on_disk.read().val[0,3]) # read all values (again) into a second new SnpData, print a SNP value 2.0
Here are two efficient alternatives. First, if all the SNP values fit in memory, read them once into a
SnpData
and then access thatSnpData
multiple times.>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read. >>> snpdata1 = snp_on_disk.read() # read all values into a new SnpData >>> print(snpdata1.val[0,2]) # print a SNP value from snpdata1's in-memory ndarray 1.0 >>> print(snpdata1.val[0,3]) # print another SNP value from snpdata1's in-memory ndarray. 2.0
Second, if the SNP value data is too large to fit in memory, use subsetting to read only the SNP values of interest from disk.
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read. >>> print(snp_on_disk[0,2].read().val[0,0]) #Define the subset of data and read only that subset from disk. 1.0 >>> print(snp_on_disk[0,3].read().val[0,0]) #Define a second subset of data and read only that subset from disk. 2.0
Because the in-memory
SnpData
class is a kind of SnpReader, you may read from it, too. Doing so creates a new SnpData
instance containing a copy of the SNP values in a new ndarray.>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read. >>> snpdata1 = snp_on_disk.read() # read all SNP values from disk into a new SnpData >>> print(snpdata1.val is snpdata1.val) # Do the in-memory SNP values use the same memory as themselves? Yes True >>> snpdata2 = snpdata1.read() # copy all the SNP values into a new ndarray in a new SnpData >>> print(snpdata2.val is snpdata1.val) # Do the two ndarrays of in-memory SNP values use the same memory? False
Avoiding Unwanted ndarray Allocations
You may want a subset of SNP values from an in-memory
SnpData
and you may know that this subset and the originalSnpData
can safely share the memory of the ndarray of SNP values. For this case, the read()
method has optional parameters called view_ok and order. If you override the defaults of “view_ok=False,order=’F’” with “view_ok=True,order=’A’”, then read()
will, if practical, return a new SnpData
with an ndarray that shares memory with the original ndarray. Use these parameters with care because any change to either ndarray (for example, via SnpData.standardize()
) will affect the other. Also keep in mind that read()
relies on ndarray’s mechanisms to decide whether to actually share memory, so it may ignore your suggestion and allocate a new ndarray anyway.>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() # read all data from disk into a SnpData with a new ndarray >>> column01 = snpdata1[:,0:1].read(view_ok=True,order='A') #create SnpData with the data from just the first two SNPs. Sharing memory is OK. The memory may be laid out in any order (that is sid-major and iid-major are both OK). >>> import numpy as np >>> #print np.may_share_memory(snpdata1.val, column01.val) # Do the two ndarray's share memory? They could (but currently they won't) >>> column201 = snpdata1[:,[2,0,1]].read(view_ok=True,order='A') #create SnpData with the data from three SNPs, permuted. Sharing memory is OK. >>> print(np.may_share_memory(snpdata1.val, column201.val)) # Do the two ndarray's share memory? No, ndarray decided that this indexing was too complex for sharing. False
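The sharing behavior above follows directly from NumPy’s own rules: basic slicing can produce a view, while fancy (integer-array) indexing always copies. A standalone NumPy sketch of that underlying mechanism:

```python
import numpy as np

val = np.random.rand(300, 1015)              # stand-in for a SnpData.val ndarray

# Basic slicing produces a view: the new array shares memory with the original.
sliced = val[:, 0:1]
print(np.may_share_memory(val, sliced))      # True

# Fancy (integer-array) indexing always copies: no sharing is possible.
permuted = val[:, [2, 0, 1]]
print(np.may_share_memory(val, permuted))    # False
```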
The
read()
Method- The
SnpData.standardize()
Method The
SnpData.standardize()
method, available only onSnpData
, does in-place standardization of the in-memory SNP data. By default, it applies ‘Unit’ standardization, that is: the values for each SNP will have mean zero and standard deviation 1.0. NaN values are then filled with zeros (consequently, if there are NaN values, the final standard deviation will not be exactly 1.0). Note that, for efficiency, this method works in-place, actually changing values in the ndarray. Although it works in place, for convenience it also returns itself. See SnpData.standardize()
for options and details.>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format >>> snpdata1 = snp_on_disk.read() # read all SNP values into memory >>> print(snpdata1) # Prints the specification for this SnpData SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)) >>> print(snpdata1.val[0,0]) 2.0 >>> snpdata1.standardize() # standardize changes the values in snpdata1.val and changes the specification. SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),Unit()) >>> print('{0:.6f}'.format(snpdata1.val[0,0])) 0.229416 >>> snpdata2 = snp_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated. >>> print('{0:.6f}'.format(snpdata2.val[0,0])) 0.229416
The
read_kernel()
MethodThe
read_kernel()
method, available on any SnpReader, returns aKernelData
. Theval
property of theKernelData
is an ndarray of the (possibly standardized) SNP values multiplied with their transposed selves. When applied to a read-from-disk SnpReader, such as Bed
, the method can save memory by reading (and standardizing) the data in blocks. Seeread_kernel()
for details.>>> from pysnptools.standardizer import Unit >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format >>> kerneldata1 = snp_on_disk.read_kernel(Unit()) #Create an in-memory kernel from the snp data on disk. >>> print('{0:.6f}'.format(kerneldata1.val[0,0])) 901.421836 >>> kerneldata2 = snp_on_disk.read_kernel(Unit(),block_size=10) #Create an in-memory kernel from the snp data on disk, but only read 10 SNPS at a time from the disk. >>> print('{0:.6f}'.format(kerneldata2.val[0,0])) 901.421836
Details of Methods & Properties:
- as_dist(max_weight=2.0, block_size=None)¶
Returns a
pysnptools.distreader.DistReader
such that it turns an allele count of 0, 1, or 2 into a probability distribution of [1,0,0], [0,1,0], or [0,0,1], respectively. Any other value produces a distribution of [NaN,NaN,NaN].- Parameters:
max_weight (number) – optional – Tells the maximum allele count. Default of 2.
block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.
- Return type:
DistReader
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpreader = Bed(bedfile, count_A1=False) >>> print(snpreader[0,0].read().val) [[2.]] >>> distreader = snpreader.as_dist(max_weight=2) >>> print(distreader[0,0].read().val) [[[0. 0. 1.]]]
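The 0/1/2-to-distribution mapping can be sketched with NumPy as follows (an illustration of the conversion, not the library’s implementation):

```python
import numpy as np

# Allele counts for one individual across four SNPs; NaN marks a missing value.
val = np.array([[0.0, 1.0, 2.0, np.nan]])

# Map each count 0/1/2 to a one-hot distribution; anything else -> [NaN]*3.
dist = np.full(val.shape + (3,), np.nan)
for count in (0, 1, 2):
    dist[val == count] = np.eye(3)[count]

print(dist[0, 2])    # [0. 0. 1.]
```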
- property iid¶
An ndarray of the iids. Each iid is an ndarray of two strings (a family ID and an individual ID) that identifies an individual.
- Return type:
ndarray of strings with shape [
iid_count
,2]
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) >>> print(snp_on_disk.iid[:3]) # print the first three iids [['POP1' '0'] ['POP1' '12'] ['POP1' '44']]
- property iid_count¶
number of iids
- Return type:
integer
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- iid_to_index(list)¶
Takes a list of iids and returns a list of index numbers
- Parameters:
list – list of iids
- Return type:
ndarray of int
This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk >>> print(snp_on_disk.iid_to_index([['POP1','44'],['POP1','12']])) #Find the indexes for two iids. [2 1]
- kernel(standardizer, allowlowrank=False, block_size=10000, blocksize=None, num_threads=None)¶
Warning
Deprecated. Use
read_kernel()
instead.Returns an ndarray of size iid_count x iid_count containing the standardized SNP values multiplied with their transposed selves.
- Parameters:
standardizer (
Standardizer
) – Specify standardization to be applied before the matrix multiply. Any Standardizer
may be used. Some choices include Standardizer.Identity
(do nothing),Unit
(make values for each SNP have mean zero and standard deviation 1.0),Beta
.block_size (int or None) – optional – Default of 10000. None means to load all. Suggested number of sids to read into memory at a time.
num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
Calling the method again causes the SNP values to be re-read and allocates a new ndarray.
When applied to a read-from-disk SnpReader, such as
Bed
, the method can save memory by reading (and standardizing) the data in blocks.If you request the values for only a subset of the sids or iids, only that subset (to the degree practical) will be read from disk.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.standardizer import Unit >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk >>> kernel = snp_on_disk.kernel(Unit()) >>> print(((int(kernel.shape[0]),int(kernel.shape[1])), '{0:.6f}'.format(kernel[0,0]))) ((300, 300), '901.421836')
- property pos¶
An ndarray of the position information for each sid. Each element is an ndarray of three numbers (chromosome, genetic distance, basepair distance).
- Return type:
ndarray of float64 with shape [
sid_count
, 3]
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) >>> print(snp_on_disk.pos[:3,]) # print position information for the first three sids: #The '...' is for possible space char [[1. 0.00800801 nan] [1. 0.023023 1. ] [1. 0.0700701 4. ]]
- read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None, _require_float32_64=True)¶
Reads the SNP values and returns a
SnpData
(withSnpData.val
property containing a new ndarray of the SNP values).- Parameters:
order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (sid-index varies the fastest). If order is ‘A’, then the
SnpData.val
ndarray may be in any order (either C- or Fortran-contiguous).dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
SnpData.val
ndarray. (ForBed
, only, it can also be numpy.int8. Hidden option _require_float32_64 must be set to False. SeeBed
for an example.)force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
view_ok (bool) – optional – If False (default), allocates new memory for the
SnpData.val
’s ndarray. If True, if practical and reading from aSnpData
, will return a new SnpData
with an ndarray that shares memory with the original SnpData
. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray (for example, via SnpData.standardize()
) will affect the other. Also keep in mind that read()
relies on ndarray’s mechanisms to decide whether to actually share memory, so it may ignore your suggestion and allocate a new ndarray anyway.num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
Calling the method again causes the SNP values to be re-read and creates a new in-memory
SnpData
with a new ndarray of SNP values. If you request the values for only a subset of the sids or iids, only that subset (to the degree practical) will be read from disk.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile, count_A1=False) # Specify SNP data on disk >>> snpdata1 = snp_on_disk.read() # Read all the SNP data returning a SnpData instance >>> print(type(snpdata1.val).__name__) # The SnpData instance contains a ndarray of the data. ndarray >>> subset_snpdata = snp_on_disk[:,::2].read() # From the disk, read SNP values for every other sid >>> print(subset_snpdata.val[0,0]) # Print the first SNP value in the subset 2.0 >>> subsub_snpdata = subset_snpdata[:10,:].read(order='A',view_ok=True) # Create an in-memory subset of the subset with SNP values for the first ten iids. Share memory if practical. >>> import numpy as np >>> # print np.may_share_memory(subset_snpdata.val, subsub_snpdata.val) # Do the two ndarray's share memory? They could. Currently they won't.
- read_kernel(standardizer=None, block_size=None, order='A', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)¶
Returns a
KernelData
such that theKernelData.val()
property will be an ndarray of the standardized SNP values multiplied with their transposed selves.- Parameters:
standardizer (
Standardizer
) – (required) Specify standardization to be applied before the matrix multiply. Any Standardizer
may be used. Some choices include Standardizer.Identity
(do nothing),Unit
(make values for each SNP have mean zero and standard deviation 1.0) andBeta
.block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.
- Return type:
KernelData
Calling the method again causes the SNP values to be re-read and allocates a new KernelData.
When applied to a read-from-disk SnpReader, such as
Bed
, the method can save memory by reading (and standardizing) the data in blocks.- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.standardizer import Unit >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk >>> kerneldata1 = snp_on_disk.read_kernel(Unit()) >>> print((int(kerneldata1.iid_count), '{0:.6f}'.format(kerneldata1.val[0,0]))) (300, '901.421836')
- property row_property¶
Defined as a zero-width array for compatibility with
PstReader
, but not used.
- property sid¶
An ndarray of the sids. Each sid is a string that identifies a SNP.
- Return type:
ndarray (length
sid_count
) of strings
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) >>> print(snp_on_disk.sid[:9]) # print the first nine sids ['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
- property sid_count¶
number of sids
- Return type:
integer
This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- sid_to_index(list)¶
Takes a list of sids and returns a list of index numbers
- Parameters:
list (list of strings) – list of sids
- Return type:
ndarray of int
This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk >>> print(snp_on_disk.sid_to_index(['1_10','1_13'])) #Find the indexes for two sids. [2 9]
- property val_shape¶
Tells the shape of the value for a given individual and SNP. For SnpReaders, this is always None, meaning a single scalar value.
snpreader.Bed
¶
- class pysnptools.snpreader.Bed(filename, count_A1=None, iid=None, sid=None, pos=None, num_threads=None, skip_format_check=False, fam_filename=None, bim_filename=None, chrom_map={'MT': 26, 'X': 23, 'XY': 25, 'Y': 24})¶
A
SnpReader
for random-access reads of Bed/Bim/Fam files from disk.See
SnpReader
for details and examples.The format is described here.
- Constructor:
- Parameters:
filename (string) – The *.bed file to read. The ‘.bed’ suffix is optional. The related *.bim and *.fam files will also be read.
count_A1 (bool) – Tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.
The following three options are never needed, but can be used to avoid reading large ‘.fam’ and ‘.bim’ files when their information is already known.
iid (an array of strings) – The
SnpReader.iid
information. If not given, reads info from ‘.fam’ file.sid (an array of strings) – The
SnpReader.sid
information. If not given, reads info from ‘.bim’ file.pos (optional, an array of numeric triples) – The
SnpReader.pos
information. If not given, reads info from ‘.bim’ file.- num_threads (optional, int) – The number of threads with which to read data. Defaults to all available processors.
Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- skip_format_check (bool) – By default (False), checks that the ‘.bed’ file starts with the expected bytes
the first time any file (‘.bed’, ‘.fam’, or ‘.bim’) is opened.
fam_filename (optional, string) – The *.fam file to read. Defaults to the bed filename with the suffix replaced.
bim_filename (optional, string) – The *.bim file to read. Defaults to the bed filename with the suffix replaced.
- chrom_map (optional, dictionary) – A dictionary from non-numeric chromosome names to numbers. Defaults to the PLINK
mapping, namely,
plink_chrom_map
.
Constants
- plink_chrom_map
A dictionary that maps ‘X’ to 23, ‘Y’ to 24, ‘XY’ to 25, and ‘MT’ to 26.
- reverse_plink_chrom_map
A dictionary that maps 23 to ‘X’, 24 to ‘Y’, 25 to ‘XY’, and 26 to ‘MT’.
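The naming convention behind these constants can be sketched with a plain dictionary. This is illustration only (the real constants live on the Bed class); the `chrom_to_number` helper is hypothetical:

```python
# A plain-dict sketch of the PLINK chromosome-name convention described above.
plink_chrom_map = {"X": 23, "Y": 24, "XY": 25, "MT": 26}
reverse_plink_chrom_map = {number: name for name, number in plink_chrom_map.items()}

def chrom_to_number(name, chrom_map=plink_chrom_map):
    # Numeric chromosome names ("1".."22") pass through unchanged;
    # non-numeric names are translated with the map.
    return int(name) if name.isdigit() else chrom_map[name]
```

A custom `chrom_map` argument to `Bed` would play the same role as the `chrom_map` parameter here, overriding the default translation.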
Methods beyond
SnpReader
The
SnpReader.read()
method returns aSnpData
with aSnpData.val
ndarray. By default, this ndarray is numpy.float64. Optionally, it can be numpy.float32. ForBed
, however, it can also be numpy.int8 with missing values represented by -127.When reading, any chromosome and position values of 0 (the PLINK standard for missing) will be represented in
pos
as NaN.- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/distributed_bed_test1_X.*","*.bed") >>> snp_on_disk = Bed(bedfile, count_A1=False) >>> snpdata1 = snp_on_disk.read() # Read into the default, a float64 ndarray >>> snpdata1.val.dtype dtype('float64') >>> snpdata1 = snp_on_disk.read(dtype='int8',_require_float32_64=False) #Read into an 'int8' ndarray. >>> snpdata1.val.dtype dtype('int8')
- static write(filename, snpdata, count_A1=False, force_python_only=False, _require_float32_64=True, num_threads=None, reverse_chrom_map={})¶
Writes a
SnpData
to Bed format and returns theBed
.- Parameters:
filename (string) – the name of the file to create
snpdata (
SnpData
) – The in-memory data that should be written to disk.count_A1 (bool) – Tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.
force_python_only (bool) – Defaults to False. Tells to use Python code rather than faster Rust code. (Might be useful for debugging).
_require_float32_64 (bool) – Defaults to True. Requires that snpdata’s dtype is float32 or float64. (False is useful for writing int8 data.)
num_threads (int) – Maximum number of threads to use. Currently ignored; one thread is used.
reverse_chrom_map (dictionary) – Dictionary from chromosome number to chromosome string to write in the *.bim file. Defaults to empty dictionary.
- Return type:
Any
pos
values of NaN will be written as 0, the PLINK standard for missing chromosome and position values.>>> from pysnptools.snpreader import Pheno, Bed, SnpData >>> import numpy as np >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_fn = example_file("pysnptools/examples/toydata.5chrom.*", "*.bed") >>> snpdata = Bed(bed_fn)[:,::2].read() # Read every-other SNP >>> pstutil.create_directory_if_necessary("tempdir/everyother.bed") >>> Bed.write("tempdir/everyother.bed",snpdata,count_A1=False) # Write data in Bed format Bed('tempdir/everyother.bed',count_A1=False) >>> # Can write from an int8 array, too. >>> snpdata_int = SnpData(val=np.int_(snpdata.val).astype('int8'),iid=snpdata.iid,sid=snpdata.sid,pos=snpdata.pos,_require_float32_64=False) >>> snpdata_int.val.dtype dtype('int8') >>> Bed.write("tempdir/everyother.bed",snpdata_int,count_A1=False,_require_float32_64=False) Bed('tempdir/everyother.bed',count_A1=False)
snpreader.SnpData
¶
- class pysnptools.snpreader.SnpData(iid, sid, val, pos=None, name=None, parent_string=None, copyinputs_function=None, xp=None, _require_float32_64=True)¶
A
SnpReader
for holding SNP values (or similar values) in-memory, along with related iid and sid information. It is usually created by calling theSnpReader.read()
method on anotherSnpReader
, for example,Bed
. It can also be constructed directly.See
SnpReader
for details and examples.- Constructor:
- Parameters:
iid (an array of string pair) – The
SnpReader.iid
information.sid (an array of strings) – The
SnpReader.sid
information.val (a 2-D array of floats) – The SNP values
pos (optional, an array of numeric triples) – The
SnpReader.pos
informationname (optional, string) – Information to be displayed about the origin of this data
copyinputs_function (optional, function) – Used internally by optional clustering code
- xp (optional, a Python module or string) – The array module that controls val,
for example, ‘numpy’ or ‘cupy’. Defaults to numpy. Can also be set with the ‘ARRAY_MODULE’ environment variable.
- Example:
>>> from pysnptools.snpreader import SnpData >>> snpdata = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]]) >>> print((float(snpdata.val[0,1]), snpdata.iid_count, snpdata.sid_count)) (2.0, 2, 3)
Equality:
Two SnpData objects are equal if their four arrays (
SnpData.val
,SnpReader.iid
,SnpReader.sid
, andSnpReader.pos
) are ‘array_equal’. (Their ‘name’ does not need to be the same). If eitherSnpData.val
contains NaN, the objects will not be equal. However,SnpData.allclose()
can be used to treat NaN as regular values.- Example:
>>> import numpy as np >>> from pysnptools.snpreader import SnpData >>> snpdata1 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata2 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata1 == snpdata2) #True, because all the arrays have the same values. True >>> print(snpdata1.val is snpdata2.val) #False, because the two arrays have different memory. False >>> snpdata3 = SnpData(iid=[['a','0'],['b','0']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata4 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata3 == snpdata4) #False, because the iids are different. False >>> snpdata5 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata6 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata5 == snpdata6) #False, because the val's contain NaN False >>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True
Methods beyond
SnpReader
- allclose(value, equal_nan=True)¶
- Parameters:
value (
SnpData
) – Other object with which to compare.equal_nan (bool) – (Default: True) Tells if NaN in
SnpData.val
should be treated as regular values when testing equality.
>>> import numpy as np >>> snpdata5 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata6 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True >>> print(snpdata5.allclose(snpdata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal. False
- standardize(standardizer=Unit(), block_size=None, return_trained=False, force_python_only=False, num_threads=None)¶
Does in-place standardization of the in-memory SNP data. By default, it applies ‘Unit’ standardization; that is, the values for each SNP will have mean zero and standard deviation 1.0. NaN values are then filled with zero, the mean (consequently, if there are NaN values, the final standard deviation will not be exactly 1.0). Note that, for efficiency, this method works in place, actually changing values in the ndarray. For convenience, it also returns the SnpData.
- Parameters:
standardizer (
Standardizer
) – optional – Specify standardization to be applied. AnyStandardizer
may be used. Some choices includeUnit
(default, makes values for each SNP have mean zero and standard deviation 1.0) andBeta
.block_size (None) – optional – Deprecated.
return_trained (bool) – If true, returns a second value containing a constant
Standardizer
trained on this data.force_python_only (bool) – optional – If true, will use pure Python instead of faster C++ libraries.
num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
SnpData
(standardizes in place, but for convenience, returns ‘self’)
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snp_on_disk = Bed(bed_file,count_A1=False) # Specify some data on disk in Bed format >>> snpdata1 = snp_on_disk.read() # read all SNP values into memory >>> print(snpdata1) # Prints the specification for this SnpData SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)) >>> print(snpdata1.val[0,0]) 2.0 >>> snpdata1.standardize() # standardize changes the values in snpdata1.val and changes the specification. SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),Unit()) >>> print('{0:.6f}'.format(snpdata1.val[0,0])) 0.229416 >>> snpdata2 = snp_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated. >>> print('{0:.6f}'.format(snpdata2.val[0,0])) 0.229416
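The effect of Unit standardization described above can be sketched in plain NumPy. This is a toy model under the stated definition, not the library's optimized implementation; `unit_standardize` is a hypothetical helper:

```python
import numpy as np

def unit_standardize(val):
    """Toy model of Unit standardization: give each SNP (column) mean 0 and
    standard deviation 1, then fill NaNs with 0, the post-standardization mean."""
    val -= np.nanmean(val, axis=0)
    std = np.nanstd(val, axis=0)
    std[std == 0] = 1.0          # leave constant SNPs alone rather than divide by 0
    val /= std
    val[np.isnan(val)] = 0.0
    return val

val = np.array([[0., 2., 0.], [0., 1., np.nan], [2., 0., 1.]])
out = unit_standardize(val)
# Columns now have mean 0; the column that had a NaN ends with a standard
# deviation below 1.0, which is why the text says "not exactly 1.0".
```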
- train_standardizer(apply_in_place, standardizer=Unit(), force_python_only=False, num_thread=None)¶
Deprecated since version 0.2.23: Use
standardize()
with return_trained=True instead.
- property val¶
The 2D NumPy array of floats that represents the values of the SNPs. You can get or set this property.
>>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata = Bed(bed_file,count_A1=False)[:5,:].read() #read data for first 5 iids >>> print(snpdata.val[4,100]) #print one of the SNP values 2.0
snpreader.Pheno
¶
- class pysnptools.snpreader.Pheno(input, iid_if_none=None, missing=None)¶
A
SnpReader
for reading “alternative” phenotype files from disk.See
SnpReader
for general examples of using SnpReaders.This text format is described here and looks like:
FID IID qt1 bmi site
F1 1110 2.3 22.22 2
F2 2202 34.12 18.23 1
...
where the heading row is optional.
- Constructor:
- Parameters:
input (string) – The phenotype file to read.
iid_if_none (None or array of strings) – An
SnpReader.iid
to use if the file is empty.missing (None or string) – The value in the file that represents missing data.
For backwards compatibility with older code, input can be a dictionary (instead of a filename) with these fields:
‘header’ : [1] array phenotype name,
‘vals’ : [N*1] array of phenotype data,
‘iid’ : [N*2] array of family IDs and individual IDs
snpdata (SnpData) – The in-memory data that should be written to disk.
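The shape of that backwards-compatible dictionary can be sketched with the toy rows from the sample file above (shapes per the field list; the values here are illustrative only):

```python
import numpy as np

# Sketch of the backwards-compatible dictionary form described above.
pheno_dict = {
    "header": np.array(["qt1"]),           # [1] array: phenotype name
    "vals": np.array([[2.3], [34.12]]),    # [N*1] array: phenotype values
    "iid": np.array([["F1", "1110"],
                     ["F2", "2202"]]),     # [N*2] array: family and individual IDs
}
```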
- Example:
>>> from pysnptools.snpreader import Pheno, Bed >>> from pysnptools.util import example_file # Download and return local file name >>> pheno_file = example_file('pysnptools/examples/toydata.phe') >>> data_on_disk = Pheno(pheno_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 1)
Methods beyond
SnpReader
- static write(filename, snpdata, missing='NaN', sep='\t')¶
Writes a
SnpData
to Pheno format and returns thePheno
.- Parameters:
filename (string) – the name of the file to create
missing (string) – value to treat as missing (default ‘NaN’)
sep (string) – the separator (default tab)
- Return type:
>>> from pysnptools.snpreader import Pheno, Bed >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.txt") >>> Pheno.write("tempdir/toydata10.txt",snpdata) # Write data in Pheno format Pheno('tempdir/toydata10.txt')
snpreader.Ped
¶
- class pysnptools.snpreader.Ped(filename, missing='0')¶
A
SnpReader
for reading Ped-formatted (and the related Map-formatted) files from disk.See
SnpReader
for general examples of using SnpReaders.This format is described here and looks like:
FAM001 1 0 0 1 2 A A G G A C
FAM001 2 0 0 1 2 A A A G 0 0
...
The direction of the encoding from allele pair to 0,1,2 is arbitrary. That is, if the alleles are “A” and “G”, then “G G” could be 0 and “A A” could be 2 or vice versa. The pair “A G” will always be 1.
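The arbitrary direction can be illustrated by counting matches against a chosen reference allele. `encode_pair` is a hypothetical helper, not part of PySnpTools:

```python
def encode_pair(pair, reference_allele):
    """Count how many alleles in a pair match a chosen reference allele.
    Which allele plays the reference role is arbitrary, so 0s and 2s can
    swap between a write and a later read, while heterozygous pairs are
    always 1."""
    return sum(1 for allele in pair if allele == reference_allele)

# Encoding "A A" against reference 'A' gives 2; against 'G' it gives 0.
# The heterozygous pair "A G" is 1 either way.
```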
- Constructor:
- Parameters:
filename (string) – The Ped file to read.
missing (string) – The value in the file that represents missing data.
- Example:
>>> from pysnptools.snpreader import Ped >>> from pysnptools.util import example_file # Download and return local file name >>> ped_file = example_file('pysnptools/examples/toydata.ped') >>> data_on_disk = Ped(ped_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 10000)
Methods beyond
SnpReader
- static write(filename, snpdata)¶
Writes a
SnpData
to Ped format. The values must be 0,1,2. The direction of the encoding to allele pairs is arbitrary. This means that if a SnpData is written in Ped format and then read back, then 0’s may become 2’s and 2’s may become 0’s. (1’s will stay 1’s.) Returns thePed.
- Parameters:
filename (string) – the name of the file to create
snpdata (
SnpData
) – The in-memory data that should be written to disk.
- Return type:
>>> from pysnptools.snpreader import Ped, Bed >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.ped") >>> Ped.write("tempdir/toydata10.ped",snpdata) # Write data in Ped format Ped('tempdir/toydata10.ped')
snpreader.Dat
¶
- class pysnptools.snpreader.Dat(filename, skiprows=0)¶
A
SnpReader
for reading Dat/Fam/Map-formatted files from disk.See
SnpReader
for general examples of using SnpReaders.This is a text format that can store any numeric values. (In contrast, Bed and Ped can only store 0,1,2, and missing). Its Dat files look like:
null_0 j n 0.333 1 2
null_100 j n 2 1 1
null_200 j n 0 nan 1
...
Its Map and Fam files are described here.
- Constructor:
- Parameters:
filename (string) – The Dat file to read.
skiprows (int) – Number of lines to skip before reading. Defaults to 0.
- Example:
>>> from pysnptools.snpreader import Dat >>> from pysnptools.util import example_file # Download and return local file name >>> dat_file = example_file("pysnptools/examples/toydata.*","*.dat") >>> data_on_disk = Dat(dat_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 10000)
Methods beyond
SnpReader
- static write(filename, snpdata)¶
Writes a
SnpData
to dat/fam/map format and returns theDat
.- Parameters:
filename (string) – the name of the file to create
snpdata (
SnpData
) – The in-memory data that should be written to disk.
- Return type:
>>> from pysnptools.snpreader import Dat, Bed >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.dat") >>> Dat.write("tempdir/toydata10.dat",snpdata) # Write data in dat/fam/map format Dat('tempdir/toydata10.dat')
snpreader.Dense
¶
- class pysnptools.snpreader.Dense(filename, extract_iid_function=<function zero_family>, extract_sid_pos_function=<function zero_pos>)¶
A
SnpReader
for reading *.dense.txt (0,1,2 text files) from disk.See
SnpReader
for general examples of using SnpReaders.This format looks like:
var 4006 9570 22899 37676 41236 41978 55159 66828...
1-10004-rs12354060 22222222...
1-707348-rs12184279 222222?2...
1-724325-rs12564807 00000000...
...
where rows are sids and columns are iids.
- Constructor:
- Parameters:
filename (string) – The Dense file to read.
extract_iid_function (a function) – A function that breaks the row id from the file into a family id and an individual id. Defaults to setting the family id to 0 and using the whole row id as the iid.
extract_sid_pos_function (a function) – A function that breaks the column id from the file into (chromosome, sid, genetic distance, basepair distance). Defaults to setting the
SnpReader.pos
information to 0 and using the whole column id as the sid.
- Example:
>>> from pysnptools.snpreader import Dense >>> from pysnptools.util import example_file # Download and return local file name >>> dense_file = example_file("pysnptools/examples/toydata100.dense.txt") >>> data_on_disk = Dense(dense_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 100)
Methods beyond
SnpReader
- static write(filename, snpdata, join_iid_function=<function just_case>, join_sid_pos_function=<function just_sid>)¶
Writes a
SnpData
to Dense format. The values must be 0,1,2 (or missing). Returns aDense
with the default extractors.- Parameters:
filename (string) – the name of the file to create
snpdata (
SnpData
) – The in-memory data that should be written to disk.join_iid_function (a function) – function to turn a family id and individual id into a file id for columns. Defaults ignoring the family id and using the individual id as the column id.
join_sid_pos_function (a function) – function to turn a sid and pos data into a file id for rows. Defaults to ignoring the
SnpReader.pos
information and using the sid as the row id.
- Return type:
>>> from pysnptools.snpreader import Dense, Bed >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.dense.txt") >>> Dense.write("tempdir/toydata10.dense.txt",snpdata) # Write data in Dense format Dense('tempdir/toydata10.dense.txt')
snpreader.SnpHdf5
¶
- class pysnptools.snpreader.SnpHdf5(*args, **kwargs)¶
A
SnpReader
for reading *.snp.hdf5 files from disk.See
SnpReader
for general examples of using SnpReaders.The general HDF5 format is described here. The SnpHdf5 format stores val, iid, sid, and pos information in Hdf5 format.
- Constructor:
- Parameters:
filename (string) – The SnpHdf5 file to read.
- Example:
>>> from pysnptools.snpreader import SnpHdf5 >>> from pysnptools.util import example_file # Download and return local file name >>> hd5f_file = example_file("pysnptools/examples/toydata.snpmajor.snp.hdf5") >>> data_on_disk = SnpHdf5(hd5f_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 10000)
Methods beyond
SnpReader
- static write(filename, snpdata, hdf5_dtype=None, sid_major=True)¶
Writes a
SnpData
to SnpHdf5 format and returns theSnpHdf5
.- Parameters:
filename (string) – the name of the file to create
snpdata (
SnpData
) – The in-memory data that should be written to disk.hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.
sid_major (bool) – Tells if vals should be stored on disk in sid_major (default) or iid_major format.
- Return type:
>>> from pysnptools.snpreader import SnpHdf5, Bed >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> snpdata = Bed(bedfile,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.snp.hdf5") >>> SnpHdf5.write("tempdir/toydata10.snp.hdf5",snpdata) # Write data in SnpHdf5 format SnpHdf5('tempdir/toydata10.snp.hdf5')
snpreader.SnpNpz
¶
- class pysnptools.snpreader.SnpNpz(*args, **kwargs)¶
A
SnpReader
for reading *.snp.npz files from disk.See
SnpReader
for general examples of using SnpReaders.The general NPZ format is described here. The SnpNpz format stores val, iid, sid, and pos information in NPZ format.
- Constructor:
- Parameters:
filename (string) – The SnpNpz file to read.
- Example:
>>> from pysnptools.snpreader import SnpNpz >>> from pysnptools.util import example_file # Download and return local file name >>> npz_file = example_file('pysnptools/examples/toydata10.snp.npz') >>> data_on_disk = SnpNpz(npz_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 10)
Methods beyond
SnpReader
- static write(filename, snpdata)¶
Writes a
SnpData
to SnpNpz format and returns theSnpNpz
- Parameters:
filename (string) – the name of the file to create
snpdata (
SnpData
) – The in-memory data that should be written to disk.
- Return type:
>>> from pysnptools.snpreader import SnpNpz, Bed >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read() # Read first 10 snps from Bed format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.snp.npz") >>> SnpNpz.write("tempdir/toydata10.snp.npz",snpdata) # Write data in SnpNpz format SnpNpz('tempdir/toydata10.snp.npz')
snpreader.SnpMemMap
¶
- class pysnptools.snpreader.SnpMemMap(*args, **kwargs)¶
A
SnpData
that keeps its data in a memory-mapped file. This allows data larger than fits in main memory.See
SnpData
for general examples of using SnpData.- Constructor:
- Parameters:
filename (string) – The *.snp.memmap file to read.
Also see
SnpMemMap.empty()
andSnpMemMap.write()
.- Example:
>>> from pysnptools.snpreader import SnpMemMap >>> from pysnptools.util import example_file # Download and return local file name >>> mem_map_file = example_file('pysnptools/examples/tiny.snp.memmap') >>> snp_mem_map = SnpMemMap(mem_map_file) >>> print(snp_mem_map.val[0,1], snp_mem_map.iid_count, snp_mem_map.sid_count) 2.0 2 3
Methods inherited from
SnpData
Methods beyond
SnpReader
- static empty(iid, sid, filename, pos=None, order='F', dtype=<class 'numpy.float64'>)¶
Create an empty
SnpMemMap
on disk.- Parameters:
iid (an array of string pairs) – The
SnpReader.iid
informationsid (an array of strings) – The
SnpReader.sid
informationfilename (string) – name of memory-mapped file to create
pos (an array of numeric triples) – optional – The additional
SnpReader.pos
information associated with each sid. Default: Noneorder (string or None) – {‘F’ (default), ‘C’}, optional – Specify the order of the ndarray.
dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
SnpMemMap.val
ndarray.
- Return type:
>>> import pysnptools.util as pstutil >>> import numpy as np >>> from pysnptools.snpreader import SnpMemMap >>> filename = "tempdir/tiny.snp.memmap" >>> pstutil.create_directory_if_necessary(filename) >>> snp_mem_map = SnpMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64) >>> snp_mem_map.val[:,:] = [[0.,2.,0.],[0.,1.,2.]] >>> snp_mem_map.flush()
- property filename¶
The name of the memory-mapped file
- flush()¶
Flush
SnpMemMap.val
to disk and close the file. (If values or properties are accessed again, the file will be reopened.)>>> import pysnptools.util as pstutil >>> import numpy as np >>> from pysnptools.snpreader import SnpMemMap >>> filename = "tempdir/tiny.snp.memmap" >>> pstutil.create_directory_if_necessary(filename) >>> snp_mem_map = SnpMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64) >>> snp_mem_map.val[:,:] = [[0.,2.,0.],[0.,1.,2.]] >>> snp_mem_map.flush()
- property offset¶
The byte position in the file where the memory-mapped values start.
(The disk space before this is used to store
SnpReader.iid
, etc. information. This property is useful when interfacing with, for example, external Fortran and C matrix libraries.)
- property val¶
The 2D NumPy memmap array of floats that represents the values. You can get this property, but cannot set it (except with itself).
>>> from pysnptools.snpreader import SnpMemMap >>> from pysnptools.util import example_file # Download and return local file name >>> mem_map_file = example_file('pysnptools/examples/tiny.snp.memmap') >>> snp_mem_map = SnpMemMap(mem_map_file) >>> print(snp_mem_map.val[0,1]) 2.0
- static write(filename, snpreader, standardizer=Identity(), order='A', dtype=None, block_size=None, num_threads=None)¶
Writes a
SnpReader
toSnpMemMap
format.- Parameters:
filename (string) – the name of the file to create
snpreader (
SnpReader
) – The data that should be written to disk.
- Return type:
>>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> from pysnptools.snpreader import Bed, SnpMemMap >>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> bed = Bed(bed_file,count_A1=False) >>> pstutil.create_directory_if_necessary("tempdir/toydata.5chrom.snp.memmap") >>> SnpMemMap.write("tempdir/toydata.5chrom.snp.memmap",bed) # Write bed in SnpMemMap format SnpMemMap('tempdir/toydata.5chrom.snp.memmap')
snpreader.SnpGen
¶
- class pysnptools.snpreader.SnpGen(seed, iid_count, sid_count, chrom_count=22, cache_file=None, block_size=None, sid_batch_size=None)¶
A
SnpReader
that generates deterministic “random” SNP data on the fly.See
SnpReader
for general examples of using SnpReader.- Constructor:
- Parameters:
seed (number) – The random seed that (along with block_size) determines the SNP values.
iid_count (number) – the number of iids (number of individuals)
sid_count (number) – the number of sids (number of SNPs)
chrom_count (number) – the number of chromosomes to generate (must be 22 or fewer)
cache_file (string) – (default None) If provided, tells where to cache the common iid, sid, and pos information. Using it can save time.
block_size (number or None) – (default None) Tells how many SNPs to generate at once. The default picks a block_size such that iid_count * block_size is about 100,000.
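The stated block_size heuristic can be sketched as a one-liner. This is an illustration of the rule in the text, not the library's exact code, and `default_block_size` is a hypothetical name:

```python
def default_block_size(iid_count, target_values=100_000):
    # Choose block_size so that iid_count * block_size is about
    # target_values, generating at least one SNP per block.
    return max(1, target_values // iid_count)
```

For example, with 1000 individuals this picks 100 SNPs per block, keeping each generated block near 100,000 values.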
- Example:
>>> from pysnptools.snpreader import SnpGen >>> #Prepare to generate data for 1000 individuals and 1,000,000 SNPs >>> snp_gen = SnpGen(seed=332,iid_count=1000,sid_count=1000*1000) >>> print(snp_gen.iid_count,snp_gen.sid_count) 1000 1000000 >>> snp_data = snp_gen[:,200*1000:201*1000].read() #Generate for all users and for SNPs 200K to 201K >>> print(snp_data.val[1,1], snp_data.iid_count, snp_data.sid_count) 0.0 1000 1000
- Also See:
snpreader.DistributedBed
¶
- class pysnptools.snpreader.DistributedBed(storage)¶
A class that implements the
SnpReader
interface. It storesBed
-like data in pieces on storage. When you request data, it retrieves only the needed pieces.- Constructor:
- Parameters:
storage (string or
FileCache
) – Tells where the DistributedBed data is stored. A string can be given and will be interpreted as the path to a directory. AFileCache
instance can be given, which provides a method to specify cluster-distributed storage.- type storage:
string or
FileCache
- Example:
>>> import os >>> from pysnptools.snpreader import DistributedBed >>> from pysnptools.util import example_file # Download and return local file name >>> folder = os.path.dirname(example_file('pysnptools/examples/toydataSkip10.distributedbed/*')) >>> data_on_disk = DistributedBed(folder) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 1000)
- static write(storage, snpreader, piece_per_chrom_count=1, updater=None, runner=None)¶
Uploads from any
Bed
-like data to cluster storage for efficient retrieval later. If some of the contents already exists in storage, it skips uploading that part of the contents. (To avoid this behavior, clear the storage.)- Parameters:
storage (string or
FileCache
or None.) – Tells where to store SNP data. A string can be given and will be interpreted as the path of a local directory to use for storage. (The local directory will not be automatically erased and so must be user managed.) AFileCache
instance can be given, which provides a method to specify cluster-distributed storage. (FileCache
’s will not be automatically erased and must be user managed.) If None, the storage will be in an automatically-erasing temporary directory. (If the TEMP environment variable is set, Python places the temp directory under it.)snpreader (
SnpReader
) – ABed
or otherSnpReader
with values of 0,1,2, or missing. (Note that this differs from most other write methods that take aSnpData
)piece_per_chrom_count (A number) – The number of pieces in which to store the data from each chromosome. Data is split across SNPs. For example, if piece_per_chrom_count is set to 100 and 22 chromosomes are uploaded, then data will be stored in 2200 pieces. Later, when data is requested only the pieces necessary for the request will be downloaded to local storage.
updater (A function or lambda) – A single-argument function to write logging messages to, for example, the function created by
log_in_place()
.runner (
Runner
) – aRunner
, optional: Tells how to run. (Note thatLocal
andLocalMultiProc
are good options.) If not given, the function is run locally.
- Return type:
>>> from pysnptools.snpreader import DistributedBed, Bed >>> import shutil >>> import os >>> from pysnptools.util import example_file # Download and return local file name >>> directory = 'tempdir/toydataSkip10.distributedbed' >>> if os.path.exists(directory): ... shutil.rmtree(directory) >>> bedfile = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> snpreader = Bed(bedfile,count_A1=False)[:,::10] # Read every 10th snp from Bed format >>> DistributedBed.write(directory,snpreader,piece_per_chrom_count=5) # Write data in DistributedBed format DistributedBed(LocalCache('tempdir/toydataSkip10.distributedbed'))
kernelreader
Module¶
Tools for reading and manipulating kernels. A kernel is a matrix from iid to iid that typically tells the relatedness of pairs of people.
kernelreader.KernelReader
¶
- class pysnptools.kernelreader.KernelReader(*args, **kwargs)¶
A KernelReader is one of three things:
A class such as
KernelNpz
for you to specify a file with data. For example,>>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kernel_on_disk = KernelNpz(kernel_file) >>> print(kernel_on_disk) # prints specification for reading from file KernelNpz('...pysnptools/examples/toydata.kernel.npz') >>> kernel_on_disk.iid_count # prints the number of iids (but doesn't read any kernel values) 500
A
KernelData
class that holds kernel data in memory, typically after computing from a SnpReader or reading it from a KernelReader:>>> # Compute kernel from a SnpReader >>> from pysnptools.snpreader import Bed >>> from pysnptools.standardizer import Unit >>> from pysnptools.util import example_file # Download and return local file name >>> >>> bed_file = example_file('tests/datasets/all_chr.maf0.001.N300.*','*.bed') >>> snp_on_disk = Bed(bed_file,count_A1=False) >>> kerneldata1 = snp_on_disk.read_kernel(Unit()) #reads the SNP values and computes the kernel >>> type(kerneldata1.val).__name__ # The val property is an ndarray of kernel values 'ndarray' >>> print(kerneldata1) # prints the specification of the in-memory kernel information KernelData(SnpKernel(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),standardizer=Unit())) >>> kerneldata1.iid_count #prints the number of iids (number of individuals) in this in-memory data 300 >>> # Read kernel from a KernelReader >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kernel_on_disk = KernelNpz(kernel_file) >>> kerneldata2 = kernel_on_disk.read() #reads the kernel values >>> print(kerneldata2) # prints the specification of the in-memory kernel information KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz')) >>> kerneldata2.iid_count #prints the number of iids (number of individuals) in this in-memory data 500
A subset of any KernelReader, specified with “[ iid_index ]” (or specified with “[ iid0_index , iid1_index ]”), to read only some kernel values. It can also be used to re-order the values.
>>> kernel_on_disk = KernelNpz(kernel_file) >>> subset_on_disk1 = kernel_on_disk[[3,4]] # specification for a subset of the data on disk. No kernel values are read yet. >>> print(subset_on_disk1.iid_count) # prints the number of iids in this subset (but still doesn't read any kernel values) 2 >>> print(subset_on_disk1) #prints a specification of 'subset_on_disk1' KernelNpz(...pysnptools/examples/toydata.kernel.npz')[[3,4],[3,4]] >>> kerneldata_subset = subset_on_disk1.read() # efficiently (if possible) reads the specified subset of values from the disk >>> print(kerneldata_subset) # prints the specification of the in-memory kernel information KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz')[[3,4],[3,4]]) >>> print((int(kerneldata_subset.val.shape[0]), int(kerneldata_subset.val.shape[1]))) # The dimensions of the ndarray of kernel values (2, 2) >>> subset_on_disk2 = kernel_on_disk[[3,4],::2] # specification for a subset of the data on disk. No kernel values are read yet. >>> print((subset_on_disk2.iid0_count, subset_on_disk2.iid1_count)) (2, 250)
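Conceptually, two-axis kernel indexing behaves like NumPy fancy indexing on rows and columns. A minimal NumPy sketch of that idea (made-up values, not pysnptools code):

```python
import numpy as np

# A toy 5x5 "kernel" standing in for kernel values on disk
K = np.arange(25, dtype=float).reshape(5, 5)

# kernel[[3,4], ::2] selects rows 3 and 4 and every other column,
# the way np.ix_ builds an open mesh for a plain ndarray
sub = K[np.ix_([3, 4], range(0, 5, 2))]
print(sub.shape)  # (2, 3)
```

With a real KernelReader, the same bracket syntax only records the selection; no values are read until `.read()` is called.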
The KernelReaders Classes

Class      | Format    | Random Access | Suffixes     | Write method?
-----------|-----------|---------------|--------------|--------------
KernelData | in-memory | Yes           | n/a          | n/a
KernelNpz  | binary    | No            | .kernel.npz  | Yes
KernelHdf5 | binary    | Yes           | .kernel.hdf5 | Yes
Identity   | n/a       | Yes           | n/a          | No
SnpKernel  | depends   | depends       | n/a          | No
Methods & Properties:
Every KernelReader, such as
KernelNpz
andKernelData
, when square, has these properties:iid
,iid_count
, and these methods:read()
, andiid_to_index()
. A square kernel is one that has the same iid list for both its rows and columns.More generally, KernelReaders can have one iid list for its rows and a different iid list for its columns, so these properties and methods are also defined:
iid0
,iid1
,iid0_count
,iid1_count
,iid0_to_index()
, andiid1_to_index()
.See below for details.
KernelData
is a KernelReader so it supports the above properties and methods. In addition, it supports propertyKernelData.val
, methodKernelData.standardize()
, and equality testing.See below for details.
Some of the classes, such as
KernelNpz
, also provide a staticKernelNpz.write()
method for writingKernelData
.>>> # create a kernel from a Bed file and write to KernelNpz format >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> from pysnptools.standardizer import Unit >>> import pysnptools.util as pstutil
>>> bed_file = example_file('pysnptools/examples/toydata.5chrom.bed') >>> kerneldata = Bed(bed_file,count_A1=False).read_kernel(Unit()) # Create a kernel from the data in the Bed file >>> pstutil.create_directory_if_necessary("tempdir/toydata.kernel.npz") >>> KernelNpz.write("tempdir/toydata.kernel.npz",kerneldata) # Write data in KernelNpz format KernelNpz('tempdir/toydata.kernel.npz')
iids:
Individuals are identified with an iid, which is an ndarray of two strings: a family ID and an individual ID. For example:
>>> kernel_on_disk = KernelNpz(kernel_file) >>> print(kernel_on_disk.iid[:3]) # print the first three iids [['per0' 'per0'] ['per1' 'per1'] ['per2' 'per2']] >>> print(kernel_on_disk.iid_to_index([['per2','per2'],['per1','per1']])) #Find the indexes for two iids. [2 1]
KernelReader
is a kind ofPstReader
. See the documentation forPstReader
to learn about:When Data is Read
When Data is Re-Read and Copied
Avoiding Unwanted ndarray Allocations
Creating Subsetting PstReaders with Indexing
The
read()
Method- The
KernelData.standardize()
Method The
KernelData.standardize()
method, available only onKernelData
, does in-place standardization of the in-memory kernel data. The method multiplies the values by a scalar factor such that the diagonal sums to iid_count. Although it works in place, for convenience it also returns itself. SeeKernelData.standardize()
for details.>>> import numpy as np >>> kernel_on_disk = KernelNpz(kernel_file) >>> kerneldata1 = kernel_on_disk.read() # read all kernel values into memory >>> print(np.diag(kerneldata1.val).sum()) 5000000.0 >>> kerneldata1.standardize() # standardize changes the values in kerneldata1.val KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz')) >>> print(np.diag(kerneldata1.val).sum()) 500.0 >>> kerneldata2 = kernel_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated. >>> print(np.diag(kerneldata2.val).sum()) 500.0
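The scaling behind this standardization can be written directly. A NumPy sketch of the arithmetic (illustrative values, not pysnptools code):

```python
import numpy as np

K = np.array([[4.0, 1.0],
              [1.0, 4.0]])           # toy kernel; diagonal sums to 8
iid_count = K.shape[0]

factor = iid_count / np.diag(K).sum()  # 2 / 8 = 0.25
K *= factor                            # in place, like standardize()

print(np.diag(K).sum())                # now equals iid_count
```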
Details of Methods & Properties:
- property iid¶
An ndarray of the iids. Each iid is an ndarray of two strings (a family ID and an individual ID) that identifies an individual. Assumes the kernel is square, so will throw an exception if the row iids are different from the column iids.
- Return type:
ndarray (length
iid_count
) of ndarray (length 2) of strings
This property (to the degree practical) reads only iid data from the disk, not kernel value data. Moreover, the iid data is read from file only once.
- Example:
>>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kernel_on_disk = KernelNpz(kernel_file) >>> print(kernel_on_disk.iid[:3]) # print the first three iids [['per0' 'per0'] ['per1' 'per1'] ['per2' 'per2']]
- iid0_to_index(list)¶
Takes a list of row iids and returns a list of index numbers. See
iid_to_index
- iid1_to_index(list)¶
Takes a list of column iids and returns a list of index numbers. See
iid_to_index
- property iid_count¶
number of iids. Assumes the kernel is square, so will throw an exception if the row iids are different from the column iids.
- Return type:
integer
This property (to the degree practical) reads only iid data from the disk, not kernel value data. Moreover, the iid data is read from file only once.
- iid_to_index(list)¶
Takes a list of iids and returns a list of index numbers. Assumes the kernel is square, so will throw an exception if the row iids are different from the column iids.
- Parameters:
list – list of iids
- Return type:
ndarray of int
This method (to the degree practical) reads only iid data from the disk, not kernel value data. Moreover, the iid data is read from file only once.
- Example:
>>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> npz_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kernel_on_disk = KernelNpz(npz_file) >>> print(kernel_on_disk.iid_to_index([['per2','per2'],['per1','per1']])) #Find the indexes for two iids. [2 1]
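Conceptually, iid_to_index is a dictionary lookup from (family ID, individual ID) pairs to row positions. A pure-Python sketch of the idea (not the actual implementation; the iid list is made up to mirror the example above):

```python
# Hypothetical iid list like the one in the example above
iids = [('per0', 'per0'), ('per1', 'per1'), ('per2', 'per2')]
index_of = {iid: i for i, iid in enumerate(iids)}

def iid_to_index(query):
    """Map a list of iid pairs to their index numbers."""
    return [index_of[tuple(iid)] for iid in query]

print(iid_to_index([['per2', 'per2'], ['per1', 'per1']]))  # [2, 1]
```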
- read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)¶
Reads the kernel values and returns a
KernelData
(withKernelData.val
property containing a new ndarray of the kernel values).- Parameters:
order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid0-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (iid1-index varies the fastest). If order is ‘A’, then the
KernelData.val
ndarray may be in any order (either C-, Fortran-contiguous, or even discontiguous).dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
KernelData.val
ndarray.force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
view_ok (bool) – optional – If False (default), allocates new memory for the
KernelData.val
’s ndarray. If True, if practical and reading from aKernelData
, will return a newKernelData
with an ndarray that shares memory with the originalKernelData
. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray (for example, viaKernelData.standardize()
) will affect the other. Also keep in mind thatread()
relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
Calling the method again causes the kernel values to be re-read and creates a new in-memory
KernelData
with a new ndarray of kernel values.If you request the values for only a subset of the sids or iids, (to the degree practical) only that subset will be read from disk.
- Example:
>>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> npz_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kernel_on_disk = KernelNpz(npz_file) >>> kerneldata1 = kernel_on_disk.read() # Read all the kernel data returning a KernelData instance >>> print(type(kerneldata1.val).__name__) # The KernelData instance contains a ndarray of the data. ndarray >>> subset_kerneldata = kernel_on_disk[::2].read() # From the disk, read kernel values for every other iid >>> print('{0:.6f}'.format(subset_kerneldata.val[0,0])) # Print the first kernel value in the subset 9923.069928 >>> subsub_kerneldata = subset_kerneldata[:10].read(order='A',view_ok=True) # Create an in-memory subset of the subset with kernel values for the first ten iids. Share memory if practical. >>> import numpy as np >>> #print(np.may_share_memory(subset_kerneldata.val, subsub_kerneldata.val)) # Do the two ndarray's share memory? They could. Currently they won't.
- property val_shape¶
Tells the shape of the value for a given pair of individuals. For KernelReaders, always returns None, meaning a single scalar value.
kernelreader.KernelData
¶
- class pysnptools.kernelreader.KernelData(iid=None, iid0=None, iid1=None, val=None, name=None, parent_string=None, xp=None)¶
A
KernelReader
for holding kernel values in-memory, along with related iid information. It is usually created by calling theSnpReader.read_kernel()
method on aSnpReader
, for example,Bed
. It can also be created by calling theKernelReader.read()
method onSnpKernel
, for example,KernelNpz
. It can also be constructed directly.See
KernelReader
for details and examples.- Constructor:
- Parameters:
iid (an array of string pairs) – The
KernelReader.iid
information.iid0 (an array of string pairs) – The
KernelReader.iid0
information.iid1 (an array of string pairs) – The
KernelReader.iid1
information.val (a 2-D array of floats) – The kernel values
name (optional, string) – Information to be displayed about the origin of this data
- xp (optional, a Python module or string) – The array module that controls val,
for example, ‘numpy’ or ‘cupy’. Defaults to numpy. Can also be set with the ‘ARRAY_MODULE’ environment variable.
If iid is provided, don’t provide iid0 and iid1. Likewise, if iid0 and iid1 are provided, don’t provide iid.
- Example:
>>> from pysnptools.kernelreader import KernelData >>> kerneldata = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]]) >>> print((float(kerneldata.val[0,1]), kerneldata.iid_count)) (0.5, 2)
Equality:
Two KernelData objects are equal if their three arrays (
iid0
,iid1
, andKernelData.val
) are ‘array_equal’. (Their ‘name’ does not need to be the same.) If eitherKernelData.val
contains NaN, the objects will not be equal. However,KernelData.allclose()
can be used to treat NaN as regular values.- Example:
>>> import numpy as np >>> from pysnptools.kernelreader import KernelData >>> kerneldata1 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]]) >>> kerneldata2 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]]) >>> print(kerneldata1 == kerneldata2) #True, because all the arrays have the same values. True >>> print(kerneldata1.val is kerneldata2.val) #False, because the two arrays have different memory. False >>> kerneldata3 = KernelData(iid=[['a','0'],['b','0']], val=[[1.,.5],[.5,1.]]) >>> kerneldata4 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]]) >>> print(kerneldata3 == kerneldata4) #False, because the iids are different. False >>> kerneldata5 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]]) >>> kerneldata6 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]]) >>> print(kerneldata5 == kerneldata6) #False, because the val's contain NaN False >>> print(kerneldata5.allclose(kerneldata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True
Methods beyond
KernelReader
- allclose(value, equal_nan=True)¶
- Parameters:
value (
KernelData
) – Other object with which to compare.equal_nan (bool) – (Default: True) Tells if NaN in
KernelData.val
should be treated as regular values when testing equality.
>>> import numpy as np >>> kerneldata5 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]]) >>> kerneldata6 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]]) >>> print(kerneldata5.allclose(kerneldata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True >>> print(kerneldata5.allclose(kerneldata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal. False
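The NaN semantics described above mirror numpy.allclose. A NumPy sketch of the behavior (toy values, not pysnptools code):

```python
import numpy as np

a = np.array([[1.0, 0.5], [0.5, np.nan]])
b = a.copy()

print((a == b).all())                      # False: NaN != NaN elementwise
print(np.allclose(a, b, equal_nan=True))   # True: NaNs treated as equal
print(np.allclose(a, b, equal_nan=False))  # False: NaNs treated as special
```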
- standardize(standardizer=DiagKtoN(), return_trained=False, force_python_only=False, num_threads=None)¶
Does in-place standardization of the in-memory kernel data. The method multiplies the values by a scalar factor such that the diagonal sums to iid_count. Although it works in place, for convenience it also returns the KernelData.
- Return type:
KernelData
(standardizes in place, but for convenience, returns ‘self’)
>>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> import numpy as np >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kernel_on_disk = KernelNpz(kernel_file) >>> kerneldata1 = kernel_on_disk.read() # read all kernel values into memory >>> print(np.diag(kerneldata1.val).sum()) 5000000.0 >>> kerneldata1.standardize() # standardize changes the values in kerneldata1.val KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz')) >>> print(np.diag(kerneldata1.val).sum()) 500.0 >>> kerneldata2 = kernel_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated. >>> print(np.diag(kerneldata2.val).sum()) 500.0
- property val¶
The 2D NumPy array of floats that represents the values of the kernel. You can get or set this property.
>>> from pysnptools.kernelreader import KernelData >>> kerneldata = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]]) >>> print((float(kerneldata.val[0,1]), kerneldata.iid_count)) (0.5, 2)
kernelreader.KernelNpz
¶
- class pysnptools.kernelreader.KernelNpz(*args, **kwargs)¶
A
KernelReader
for reading *.kernel.npz files from disk.See
KernelReader
for general examples of using KernelReaders.The general NPZ format is described here. The KernelNpz format stores val, iid0, and iid1 information in NPZ format.
- Constructor:
- Parameters:
filename (string) – The KernelNpz file to read.
- Example:
>>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> npz_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> data_on_disk = KernelNpz(npz_file) >>> print(data_on_disk.iid_count) 500
Methods beyond
KernelReader
- static write(filename, kerneldata)¶
Writes a
KernelData
to KernelNpz format and returns theKernelNpz
.- Parameters:
filename (string) – the name of the file to create
kerneldata (
KernelData
) – The in-memory data that should be written to disk.
- Return type:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.standardizer import Unit >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file('pysnptools/examples/toydata.5chrom.*','*.bed') >>> kerneldata = Bed(bed_file,count_A1=False).read_kernel(Unit()) # Create a kernel from the data in the Bed file >>> pstutil.create_directory_if_necessary("tempdir/toydata.kernel.npz") >>> KernelNpz.write("tempdir/toydata.kernel.npz",kerneldata) # Write data in KernelNpz format KernelNpz('tempdir/toydata.kernel.npz')
kernelreader.KernelHdf5
¶
- class pysnptools.kernelreader.KernelHdf5(*args, **kwargs)¶
A
KernelReader
for reading *.kernel.hdf5 files from disk.See
KernelReader
for general examples of using KernelReaders.The general HDF5 format is described in here. The KernelHdf5 format stores val, iid, sid, and pos information in Hdf5 format.
- Constructor:
- Parameters:
filename (string) – The KernelHdf5 file to read.
- Example:
>>> from pysnptools.kernelreader import KernelHdf5 >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file('pysnptools/examples/toydata.kernel.hdf5') >>> data_on_disk = KernelHdf5(hdf5_file) >>> print(data_on_disk.iid_count) 500
Methods beyond
KernelReader
- static write(filename, kerneldata, hdf5_dtype=None, sid_major=True)¶
Writes a
KernelData
to KernelHdf5 format and returns theKernelHdf5
.- Parameters:
filename (string) – the name of the file to create
kerneldata (
KernelData
) – The in-memory data that should be written to disk.hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.
sid_major (bool) – Tells if vals should be stored on disk in sid-major (default) or iid-major format.
- Return type:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.standardizer import Unit >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file('pysnptools/examples/toydata.5chrom.*','*.bed') >>> kerneldata = Bed(bed_file,count_A1=False).read_kernel(Unit()) # Create a kernel from the data in the Bed file >>> pstutil.create_directory_if_necessary("tempdir/toydata.kernel.hdf5") >>> KernelHdf5.write("tempdir/toydata.kernel.hdf5",kerneldata) # Write data in KernelHdf5 format KernelHdf5('tempdir/toydata.kernel.hdf5')
kernelreader.Identity
¶
- class pysnptools.kernelreader.Identity(iid, test=None)¶
A
KernelReader
that represents an identity matrix. No memory for the values is allocated untilIdentity.read()
is called.See
KernelReader
for general examples of using KernelReaders.- Constructor:
- Parameters:
iid (an array of strings) – The
KernelReader.iid
information
- Example:
>>> from pysnptools.kernelreader import Identity >>> identity = Identity(iid=[['fam0','iid0'],['fam0','iid1']]) >>> print(identity.iid_count) 2 >>> print(identity.read().val) # '...' is possible space character [[...1. 0.] [...0. 1.]]
>>> identity = Identity(iid=[['fam0','iid0'],['fam0','iid1'],['fam0','iid2']],test=[['fam0','iid1'],['fam0','iid3']]) >>> print((identity.iid0_count, identity.iid1_count)) (3, 2) >>> print(identity.read().val) # '...' is possible space character [[...0. 0.] [...1. 0.] [...0. 0.]]
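The rectangular case above puts a 1 wherever a row iid equals a column iid. A NumPy sketch of those values (individual IDs only, for brevity; not pysnptools code):

```python
import numpy as np

train = ['iid0', 'iid1', 'iid2']   # row iids
test = ['iid1', 'iid3']            # column iids

# 1.0 where the row individual and column individual are the same person
val = np.array([[1.0 if r == c else 0.0 for c in test] for r in train])
print(val)
# [[0. 0.]
#  [1. 0.]
#  [0. 0.]]
```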
kernelreader.SnpKernel
¶
- class pysnptools.kernelreader.SnpKernel(snpreader, standardizer=None, block_size=None)¶
A
KernelReader
that creates a kernel from aSnpReader
. No SNP data will be read until theSnpKernel.read()
method is called. Use block_size to avoid ever reading all the SNP data into memory at once.See
KernelReader
for general examples of using KernelReaders.- Constructor:
- Parameters:
snpreader (
SnpReader
) – The SNP datastandardizer (
Standardizer
) – How the SNP data should be standardizedblock_size (optional, int) – The number of SNPs to read at a time.
If block_size is not given, then all SNP data will be read at once.
- Example:
>>> from pysnptools.snpreader import Bed >>> from pysnptools.standardizer import Unit >>> from pysnptools.util import example_file # Download and return local file name >>> bed_file = example_file('pysnptools/examples/toydata.5chrom.*','*.bed') >>> snp_on_disk = Bed(bed_file,count_A1=False) # A Bed file is specified, but nothing is read from disk >>> kernel_on_disk = SnpKernel(snp_on_disk, Unit(),block_size=500) # A kernel is specified, but nothing is read from disk >>> print(kernel_on_disk) #Print the specification SnpKernel(Bed(...pysnptools/examples/toydata.5chrom.bed',count_A1=False),standardizer=Unit(),block_size=500) >>> print(kernel_on_disk.iid_count) # iid information is read from disk, but not SNP data 500 >>> kerneldata = kernel_on_disk.read().standardize() # SNPs are read and Unit standardized, 500 at a time, to create a kernel, which is then standardized >>> print('{0:.6f}'.format(kerneldata.val[0,0])) 0.992307
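block_size works because a linear kernel decomposes into a sum of per-block outer products, so only one block of SNPs needs to be in memory at a time. A NumPy sketch with made-up data (not pysnptools code):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 12))   # 5 iids x 12 (already standardized) SNPs

K_full = X @ X.T                   # kernel from all SNPs at once

K_blocked = np.zeros((5, 5))
for start in range(0, X.shape[1], 4):   # block_size = 4 SNPs at a time
    block = X[:, start:start + 4]
    K_blocked += block @ block.T        # accumulate per-block outer product

print(np.allclose(K_full, K_blocked))  # True
```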
- property pos¶
The
SnpReader.pos
property of the SNP data.
- read_snps(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)¶
Reads the SNP values, applies the standardizer, and returns a
SnpData
.- Parameters:
order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (sid-index varies the fastest). If order is ‘A’, then the
SnpData.val
ndarray may be in any order (either C-, Fortran-contiguous).dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
SnpData.val
ndarray.force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
view_ok (bool) – optional – If False (default), allocates new memory for the
SnpData.val
’s ndarray. If True, if practical and reading from aSnpData
, will return a newSnpData
with an ndarray that shares memory with the originalSnpData
. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray (for example, viaSnpData.standardize()
) will affect the other. Also keep in mind thatread()
relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
- property sid¶
The
SnpReader.sid
property of the SNP data.
- property sid_count¶
The
SnpReader.sid_count
property of the SNP data.
standardizer
Module¶
standardizer.Standardizer
¶
- class pysnptools.standardizer.Standardizer¶
A Standardizer is a class such as
Unit
andBeta
to be used by theSnpData.standardize()
andSnpReader.read_kernel()
methods to standardize SNP data.- Example:
Read and standardize SNP data.
>>> from pysnptools.standardizer import Unit >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()) >>> print('{0:.6f}'.format(snpdata1.val[0,0])) 0.229416
Create a kernel from SNP data on disk.
>>> bedfile2 = example_file("pysnptools/examples/toydata.5chrom.*","*.bed") >>> kerneldata = Bed(bedfile2,count_A1=False).read_kernel(Unit()) >>> print('{0:.6f}'.format(kerneldata.val[0,0])) 9923.069928
A Standardizer can also return a constant SNP standardizer that can be applied to other
SnpData
.>>> snp_whole = Bed(bedfile,count_A1=False) >>> train_idx, test_idx = range(10,snp_whole.iid_count), range(0,10) #test on the first 10, train on the rest >>> snp_train, trained_standardizer = Unit().standardize(snp_whole[train_idx,:].read(),return_trained=True) >>> print('{0:.6f}'.format(snp_train.val[0,0])) 0.233550 >>> print(trained_standardizer.stats[0,:]) #The mean and stddev of the 1st SNP on the training data # '...' for a possible space character [...1.94827586 0.22146953] >>> snp_test = snp_whole[test_idx,:].read().standardize(trained_standardizer) >>> float(snp_test.val[0,0]) 0.23354968324845735
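Reusing a trained standardizer amounts to applying the training mean and stddev to the test values. A NumPy sketch of the idea for a single SNP (toy numbers, not pysnptools code):

```python
import numpy as np

train = np.array([0.0, 1.0, 2.0, 1.0])   # one SNP over four training iids
mean, std = train.mean(), train.std()    # the "trained" statistics

train_std = (train - mean) / std         # standardize the training data
test = np.array([2.0])                   # one test iid
test_std = (test - mean) / std           # reuse the *training* statistics

print(test_std)
```

Reusing the training statistics (rather than recomputing them on the test set) keeps the train and test values on the same scale.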
Standardize any Numpy array.
>>> val = Bed(bedfile,count_A1=False).read().val >>> print(val[0,0]) 2.0 >>> val = Unit().standardize(val) >>> print('{0:.6f}'.format(val[0,0])) 0.229416
Details of Methods & Properties:
- property is_constant¶
Tells if a standardizer’s statistics are constant. For example,
Unit_Trained
contains a pre-specified mean and stddev for each SNP, soUnit_Trained.is_constant
is True.Unit
, on the other hand, measures the mean and stddev for each SNP, soUnit.is_constant
is False.- Return type:
bool
- standardize(snps, block_size=None, return_trained=False, force_python_only=False, num_threads=None)¶
Applies standardization, in place, to
SnpData
(or a NumPy array). For convenience also returns theSnpData
(or a NumPy array).- Parameters:
snps (
SnpData
(or a NumPy array)) – SNP values to standardizeblock_size (None) – Not used
return_trained (bool) – If true, returns a second value containing a constant standardizer trained on this data.
force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
SnpData
(or a NumPy array), (optional) constantStandardizer
standardizer.Unit
¶
- class pysnptools.standardizer.Unit¶
A
Standardizer
to unit standardize SNP data. For each sid, the mean of the values becomes zero with standard deviation 1. NaN values are then filled with zero, the mean (consequently, if there are NaN values, the final standard deviation will not be 1).See
Standardizer
for more information about standardization.>>> from pysnptools.standardizer import Unit >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()) >>> print('{0:.6f}'.format(snpdata1.val[0,0])) 0.229416
standardizer.Beta
¶
- class pysnptools.standardizer.Beta(a, b)¶
A
Standardizer
to beta standardize SNP data.See
Standardizer
for more information about standardization.- Constructor:
- Parameters:
a (float) – The a parameter of the beta distribution
b (float) – The b parameter of the beta distribution
>>> from pysnptools.standardizer import Beta >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Beta(1,25)) >>> print('{0:.6f}'.format(snpdata1.val[0,0])) 0.680802
standardizer.Identity
¶
- class pysnptools.standardizer.Identity¶
A
Standardizer
that does nothing to SNP data.See
Standardizer
for more information about standardization.>>> from pysnptools.standardizer import Identity >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read() >>> print(snpdata1.val[0,0]) 2.0 >>> snpdata1 = snpdata1.standardize(Identity()) >>> print(snpdata1.val[0,0]) 2.0
standardizer.DiagKtoN
¶
- class pysnptools.standardizer.DiagKtoN(deprecated_iid_count=None)¶
Both a
Standardizer
and aKernelStandardizer
.When applied to a
SnpData
, it multiplies the SNP values by the square root of a factor such that a kernel constructed from the SNP values will have a diagonal that sums to iid_count. This class thus standardizes the kernel before it is even constructed.When applied to a
KernelData
, it multiplies the kernel values by a factor such that the kernel's diagonal sums to iid_count.See
Standardizer
for more information about standardization.- Constructor:
- Parameters:
deprecated_iid_count (int) (optional) – Deprecated.
Example of DiagKtoN to
SnpData
:>>> from pysnptools.standardizer import DiagKtoN, Unit, Identity >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()).standardize(DiagKtoN()) >>> kernel1 = snpdata1.read_kernel(Identity()) >>> print('{0:.6f}'.format(np.diag(kernel1.val).sum())) 300.000000
Example of DiagKtoN to
KernelData
:>>> import numpy as np >>> from pysnptools.standardizer import DiagKtoN, Unit, Identity >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()) >>> kernel1 = snpdata1.read_kernel(DiagKtoN(),block_size=None) >>> print('{0:.6f}'.format(np.diag(kernel1.val).sum())) 300.000000
standardizer.UnitTrained
¶
- class pysnptools.standardizer.UnitTrained(sid, stats)¶
A
Standardizer
to unit standardize one set of SNP data based on the mean and stddev of another set of SNP data. NaN values are then filled with zero.See
Standardizer
for more information about standardization.- Constructor:
- Parameters:
sid (array of strings) – The sids (SNP identifiers) to which the stats apply
stats (ndarray of float) – The mean and stddev of each sid
>>> from pysnptools.standardizer import Unit >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> train = Bed(bedfile,count_A1=False)[1:,:].read() # read SNP values for all but the first iid >>> _, unittrained = train.standardize(Unit(),return_trained=True) #Unit standardize and remember the mean and stddev of each sid >>> print(unittrained.stats[:5,:]) #Print the means and stddev of the first five sids [[...1.94983278 0.21828988] [...1.96989967 0.17086341] [...1.84280936 0.39057474] [...1.99665552 0.0577347 ] [...1.97658863 0.15120608]] >>> test = Bed(bedfile,count_A1=False)[0,:].read() # read SNP values for the first iid >>> test = test.standardize(unittrained) # Use the mean and stddev of the train data to unit standardize the test data. >>> print('{0:.6f}'.format(test.val[0,0])) 0.229819
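The train/test pattern above boils down to simple arithmetic: standardize the test values with the *training* mean and stddev, then fill NaN with zero. A minimal plain-numpy sketch of that idea (not pysnptools code; the arrays are made up for illustration):

```python
import numpy as np

# "Training" SNP data: 3 individuals x 2 SNPs (values made up).
train = np.array([[0., 2.],
                  [2., 0.],
                  [1., 1.]])
mean = train.mean(axis=0)   # per-sid mean, as a UnitTrained would store
std = train.std(axis=0)     # per-sid stddev, as a UnitTrained would store

# "Test" data for one individual; the second SNP value is missing.
test = np.array([[2., np.nan]])
standardized = (test - mean) / std           # standardize with the *train* stats
standardized[np.isnan(standardized)] = 0.0   # NaN values are then filled with zero
print(standardized)
```

The point of reusing the training stats is that train and test end up on the same scale, which is what makes the trained standardizer safe to apply to held-out individuals.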
standardizer.DiagKtoNTrained
¶
- class pysnptools.standardizer.DiagKtoNTrained(factor)¶
Both a
Standardizer
and a AKernelStandardizer
.When applied to a
SnpData
, DiagKtoN standardizes one set of SNP data based on another set of SNP data.When applied to a
KernelData
, DiagKtoN standardizes one kernel data based on another kernel data.See
Standardizer
for more information about standardization.- Constructor:
- Parameters:
factor (float) – The number that, when multiplied into the kernel values, makes the diagonal of the kernel sum to iid_count.
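In other words, factor is chosen so that multiplying it into the kernel makes the diagonal sum to iid_count. A minimal plain-numpy sketch of that relationship (illustrative random values; not pysnptools code):

```python
import numpy as np

rng = np.random.RandomState(0)
snps = rng.randn(5, 8)             # 5 individuals x 8 SNPs
kernel = snps @ snps.T             # kernel built from the SNP values

iid_count = kernel.shape[0]
factor = iid_count / np.diag(kernel).sum()   # the 'factor' a DiagKtoNTrained would hold
kernel_standardized = kernel * factor

print(np.diag(kernel_standardized).sum())    # sums to iid_count (5)
```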
kernelstandardizer
Module¶
kernelstandardizer.KernelStandardizer
¶
- class pysnptools.kernelstandardizer.KernelStandardizer¶
A KernelStandardizer is a class such as
DiagKtoN
andIdentity
to be used by theKernelData.standardize()
to standardize Kernel data. It always works in-place and returns theKernelData
on which it works.- Example:
Read and standardize Kernel data.
>>> import numpy as np >>> from pysnptools.kernelstandardizer import DiagKtoN >>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kerneldata1 = KernelNpz(kernel_file).read() >>> print(np.diag(kerneldata1.val).sum()) 5000000.0 >>> kerneldata1 = kerneldata1.standardize(DiagKtoN()) >>> print(np.diag(kerneldata1.val).sum()) 500.0
Can also return a constant kernel standardizer that can be applied to other
KernelData
.>>> kernel_whole = KernelNpz(kernel_file) >>> train_idx, test_idx = range(10,kernel_whole.iid_count), range(0,10) #test on the first 10, train on the rest >>> kernel_train, trained_standardizer = DiagKtoN().standardize(kernel_whole[train_idx].read(),return_trained=True) >>> print('{0:.6f}'.format(np.diag(kernel_train.val).sum())) 490.000000 >>> print('{0:.6f}'.format(trained_standardizer.factor)) 0.000100 >>> kernel_whole_test = kernel_whole[:,test_idx].read().standardize(trained_standardizer) >>> print('{0:.6f}'.format(kernel_whole_test.val[0,0])) 0.992217
Details of Methods & Properties:
- standardize(kerneldata, return_trained=False, force_python_only=False, num_threads=None)¶
Applies standardization, in place, to
KernelData
. For convenience also returns theKernelData
.- Parameters:
kerneldata (
KernelData
) – kernel values to standardizereturn_trained (bool) – If true, returns a second value containing a constant
KernelStandardizer
trained on this data.force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
KernelData
, (optional) constantKernelStandardizer
kernelstandardizer.DiagKtoN
¶
- class pysnptools.kernelstandardizer.DiagKtoN(deprecated_iid_count=None)¶
Both a
Standardizer
and a AKernelStandardizer
.When applied to a
SnpData
, it multiplies the SNP values by the square root of a factor such that a kernel constructed from the SNP values will have a diagonal that sums to iid_count. This class thus standardizes the kernel before it is even constructed.When applied to a
KernelData
, it multiplies the kernel values by a factor such that the kernel's diagonal sums to iid_count.See
Standardizer
for more information about standardization.- Constructor:
- Parameters:
deprecated_iid_count (int) (optional) – Deprecated.
Example of DiagKtoN to
SnpData
:>>> import numpy as np >>> from pysnptools.standardizer import DiagKtoN, Unit, Identity >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()).standardize(DiagKtoN()) >>> kernel1 = snpdata1.read_kernel(Identity()) >>> print('{0:.6f}'.format(np.diag(kernel1.val).sum())) 300.000000
Example of DiagKtoN to
KernelData
:>>> import numpy as np >>> from pysnptools.standardizer import DiagKtoN, Unit, Identity >>> from pysnptools.snpreader import Bed >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()) >>> kernel1 = snpdata1.read_kernel(DiagKtoN(),block_size=None) >>> print('{0:.6f}'.format(np.diag(kernel1.val).sum())) 300.000000
kernelstandardizer.DiagKtoNTrained
¶
- class pysnptools.kernelstandardizer.DiagKtoNTrained(factor)¶
Both a
Standardizer
and a AKernelStandardizer
.When applied to a
SnpData
, DiagKtoN standardizes one set of SNP data based on another set of SNP data.When applied to a
KernelData
, DiagKtoN standardizes one kernel data based on another kernel data.See
Standardizer
for more information about standardization.- Constructor:
- Parameters:
factor (float) – The number that, when multiplied into the kernel values, makes the diagonal of the kernel sum to iid_count.
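A trained factor is typically learned on one kernel (for example, the training individuals) and then reused on related kernel data, as in the train/test example earlier. A minimal plain-numpy sketch of that reuse (illustrative random values; not pysnptools code):

```python
import numpy as np

rng = np.random.RandomState(1)
snps = rng.randn(12, 6)                 # 12 individuals x 6 SNPs
kernel = snps @ snps.T

train_idx = np.arange(2, 12)            # train on the last 10 individuals
kernel_train = kernel[np.ix_(train_idx, train_idx)]
factor = len(train_idx) / np.diag(kernel_train).sum()   # learned on train only

# Apply the same factor to the full (train + test) kernel.
kernel_scaled = kernel * factor
print(np.isclose(np.diag(kernel_scaled[2:, 2:]).sum(), 10.0))   # True
```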
kernelstandardizer.Identity
¶
- class pysnptools.kernelstandardizer.Identity¶
A
KernelStandardizer
that does nothing to kernel data.See
KernelStandardizer
for more information about standardization.>>> import numpy as np >>> from pysnptools.kernelstandardizer import Identity as KS_Identity >>> from pysnptools.kernelreader import KernelNpz >>> from pysnptools.util import example_file # Download and return local file name >>> >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz') >>> kerneldata1 = KernelNpz(kernel_file).read() >>> print(np.diag(kerneldata1.val).sum()) 5000000.0 >>> kerneldata1 = kerneldata1.standardize(KS_Identity()) >>> print(np.diag(kerneldata1.val).sum()) 5000000.0
distreader
Module¶
Tools for reading and manipulating SNP distribution data. For each individual and SNP location, the data gives three probabilities: P(AA), P(AB), and P(BB), where A and B are the alleles. The probabilities should sum to 1.0. Missing data is represented by three numpy.nan values.
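Concretely, distribution values live in a 3-D array of shape (iid_count, sid_count, 3). A minimal plain-numpy sketch of the layout (made-up values; not pysnptools code):

```python
import numpy as np

# One individual, two SNPs: each entry is [P(AA), P(AB), P(BB)].
val = np.array([[[0.5, 0.5, 0.0],
                 [np.nan, np.nan, np.nan]]])   # second SNP is missing
assert val.shape == (1, 2, 3)                  # (iid_count, sid_count, 3)
assert np.isclose(val[0, 0].sum(), 1.0)        # probabilities sum to 1.0
assert np.isnan(val[0, 1]).all()               # missing = three NaN's
```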
distreader.DistReader
¶
- class pysnptools.distreader.DistReader(*args, **kwargs)¶
A DistReader is one of three things:
A class such as
Bgen
for you to specify a file with data. For example,>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) >>> print(dist_on_disk) # prints the name of the file reader Bgen('...pysnptools/examples/2500x100.bgen') >>> dist_on_disk.sid_count # prints the number of SNPS (but doesn't read any SNP distribution values) 100
A
DistData
class that holds SNP distribution data in memory, typically after reading it from disk:>>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) >>> distdata1 = dist_on_disk.read() #reads the SNP distribution values >>> print(type(distdata1.val)) # The val property is a 3-D ndarray of SNP distribution values <class 'numpy.ndarray'> >>> print(distdata1) # prints the name of the in-memory SNP distribution reader. DistData(Bgen('...pysnptools/examples/2500x100.bgen')) >>> distdata1.iid_count #prints the number of iids (number of individuals) in this in-memory data 2500
A subset of any DistReader, specified with “[ iid_index , sid_index ]”, to read only some SNP distribution values. It can also be used to re-order the values.
>>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) >>> subset_on_disk = dist_on_disk[[3,4],::2] # specification for a subset of the data on disk. No SNP distribution values are read yet. >>> print(subset_on_disk.sid_count) # prints the number of sids in this subset (but still doesn't read any SNP distribution values) 50 >>> print(subset_on_disk) #prints a specification of 'subset_on_disk' Bgen('...pysnptools/examples/2500x100.bgen')[[3,4],::2] >>> distdata_subset = subset_on_disk.read() # efficiently reads the specified subset of values from the disk >>> print(distdata_subset) # prints the specification of the in-memory SNP distribution information DistData(Bgen('...pysnptools/examples/2500x100.bgen')[[3,4],::2]) >>> print((int(distdata_subset.val.shape[0]), int(distdata_subset.val.shape[1]))) # The dimensions of the ndarray of SNP distribution values (2, 50)
The DistReader Classes

Class      | Format            | Random Access       | Suffixes      | Write method?
DistData   | in-memory floats  | Yes                 | n/a           | n/a
Bgen       | binary, floats    | Yes (by sid)        | *.bgen        | Yes
DistNpz    | binary, floats    | No                  | *.dist.npz    | Yes
DistHdf5   | binary, floats    | Yes (by sid or iid) | *.dist.hdf5   | Yes
DistMemMap | mem-mapped floats | Yes                 | *.dist.memmap | Yes
Methods & Properties:
Every DistReader, such as
Bgen
andDistData
, has these properties:iid
,iid_count
,sid
,sid_count
,pos
and these methods:read()
,iid_to_index()
,sid_to_index()
,as_snp()
. See below for details.DistData
is a DistReader so it supports the above properties and methods. In addition, it supports propertyDistData.val
.See below for details.
Many of the classes, such as
Bgen
, also provide a staticBgen.write()
method for writingDistReader
orDistData
to disk.>>> from pysnptools.distreader import DistHdf5, Bgen >>> import pysnptools.util as pstutil >>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5") >>> distreader = DistHdf5(hdf5_file)[:,:10] # A reader for the first 10 SNPs in Hdf5 format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.bgen") >>> Bgen.write("tempdir/toydata10.bgen",distreader) # Write data in BGEN format Bgen('tempdir/toydata10.bgen')
iids and sids:
same as with
SnpReader
Selecting and Reordering Individuals and SNPs
same as with
SnpReader
When Data is Read:
same as with
SnpReader
When Data is Re-Read and Copied:
same as with
SnpReader
Avoiding Unwanted ndarray Allocations
same as with
SnpReader
The
read()
MethodDetails of Methods & Properties:
- as_snp(max_weight=2.0, block_size=None)¶
Returns a
pysnptools.snpreader.SnpReader
that turns the probability distribution into an expected value.For example, if the probability distribution is [0.466804 0.38812848 0.14506752] and the max_weight is 2, then the expected value will be (0.466804*0 + 0.38812848*0.5 + 0.14506752*1)*2 = 0.67826352.
- Parameters:
max_weight (number) – optional – The expected values will range from 0 to max_weight. Default of 2.
block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.
- Return type:
class:SnpReader
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> distreader = Bgen(bgen_file) # Specify distribution data on disk >>> print(distreader[0,0].read().val) [[[0.466804 0.38812848 0.14506752]]] >>> snpreader = distreader.as_snp(max_weight=2) >>> print(snpreader[0,0].read().val) [[0.67826352]]
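The expected-value arithmetic can be checked directly. A minimal plain-numpy sketch of the same computation (using the distribution from the example above; not pysnptools code):

```python
import numpy as np

dist = np.array([0.466804, 0.38812848, 0.14506752])  # P(AA), P(AB), P(BB)
max_weight = 2.0
weights = np.array([0.0, 0.5, 1.0]) * max_weight     # allele-count weights, scaled to max_weight

expected = dist @ weights
print(expected)   # ~0.67826352, matching the as_snp example
```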
- property iid¶
A ndarray of the iids. Each iid is a ndarray of two strings (a family ID and an individual ID) that identifies an individual.
- Return type:
ndarray of strings with shape [
iid_count
,2]
This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) >>> print(dist_on_disk.iid[:3]) # print the first three iids [['0' 'iid_0'] ['0' 'iid_1'] ['0' 'iid_2']]
- property iid_count¶
number of iids
- Return type:
integer
This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.
- iid_to_index(list)¶
Takes a list of iids and returns a list of index numbers
- Parameters:
list – list of iids
- Return type:
ndarray of int
This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) # Specify SNP data on disk >>> print(dist_on_disk.iid_to_index([['0','iid_2'],['0','iid_1']])) #Find the indexes for two iids. [2 1]
- property pos¶
A ndarray of the position information for each sid. Each element is a ndarray of three numpy.numbers (chromosome, genetic distance, basepair distance).
- Return type:
ndarray of float64 with shape [
sid_count
, 3]
This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) >>> print(dist_on_disk.pos[:4,].astype('int')) # print position information for the first four sids: #The '...' is for possible space char [[ 1 0 9270273] [ 1 0 39900273] [ 1 0 70530273] [ 1 0 101160273]]
- read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)¶
Reads the SNP values and returns a
DistData
(withDistData.val
property containing a new 3D ndarray of the SNP distribution values).- Parameters:
order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (sid-index varies the fastest). If order is ‘A’, then the
DistData.val
ndarray may be in any order (either C-, Fortran-contiguous).dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
DistData.val
ndarray.force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
view_ok (bool) – optional – If False (default), allocates new memory for the
DistData.val
’s ndarray. If True, if practical and reading from aDistData
, will return a newDistData
with an ndarray that shares memory with the originalDistData
. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray will affect the other. Also keep in mind thatread()
relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
DistData

Calling the method again causes the SNP distribution values to be re-read and creates a new in-memory
DistData
with a new ndarray of SNP values.If you request the values for only a subset of the sids or iids, (to the degree practical) only that subset will be read from disk.
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) # Specify SNP data on disk >>> distdata1 = dist_on_disk.read() # Read all the SNP data returning a DistData instance >>> print(type(distdata1.val).__name__) # The DistData instance contains a ndarray of the data. ndarray >>> subset_distdata = dist_on_disk[:,::2].read() # From the disk, read SNP values for every other sid >>> print(subset_distdata.val[0,0]) # Print the first SNP value in the subset [0.466804 0.38812848 0.14506752] >>> subsub_distdata = subset_distdata[:10,:].read(order='A',view_ok=True) # Create an in-memory subset of the subset with SNP values for the first ten iids. Share memory if practical. >>> import numpy as np >>> # print np.may_share_memory(subset_distdata.val, subsub_distdata.val) # Do the two ndarray's share memory? They could. Currently they won't.
- property row_property¶
Defined as a zero-width array for compatibility with
PstReader
, but not used.
- property sid¶
A ndarray of the sids. Each sid is a string that identifies a SNP.
- Return type:
ndarray (length
sid_count
) of strings
This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) >>> print(dist_on_disk.sid[:9]) # print the first nine sids ['sid_0' 'sid_1' 'sid_2' 'sid_3' 'sid_4' 'sid_5' 'sid_6' 'sid_7' 'sid_8']
- property sid_count¶
number of sids
- Return type:
integer
This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.
- sid_to_index(list)¶
Takes a list of sids and returns a list of index numbers
- Parameters:
list (list of strings) – list of sids
- Return type:
ndarray of int
This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> dist_on_disk = Bgen(bgen_file) # Specify SNP data on disk >>> print(dist_on_disk.sid_to_index(['sid_2','sid_9'])) #Find the indexes for two sids. [2 9]
- property val_shape¶
Tells the shape of the value for a given individual and SNP. For DistReaders, this is always 3.
distreader.Bgen
¶
- class pysnptools.distreader.Bgen(filename, iid_function=<function default_iid_function>, sid_function=<function default_sid_function>, sample=None, num_threads=None, fresh_properties=True)¶
A
DistReader
for reading *.bgen files from disk.See
DistReader
for general examples of using DistReaders.The BGEN format is described here.
Tip: The ‘gen’ in BGEN stands for ‘genetic’. The ‘gen’ in
DistGen
stands for ‘generate’, because it generates random (genetic) data.- Constructor:
- Parameters:
filename (string) – The BGEN file to read.
iid_function (optional, function) – Function to turn a BGEN sample into a
DistReader.iid
. (Default:bgen.default_iid_function()
.)sid_function (optional, function or string) – Function to turn a BGEN (SNP) id and rsid into a
DistReader.sid
. (Default:bgen.default_sid_function()
.) Can also be the string ‘id’ or ‘rsid’, which is faster than using a function.sample (optional, string) – A GEN sample file. If given, overrides information in *.bgen file.
- num_threads (optional, int) – The number of threads with which to read data. Defaults to all available processors.
Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
fresh_properties (optional, bool) – When true (default), memory will be allocated for the iid, sid, and pos properties. This is safe. When false, the properties will use the Bgen’s on-disk ‘memmap’. That saves memory, but is only safe if the Bgen object is still around when the properties are used.
- Example:
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/example.bgen") >>> data_on_disk = Bgen(bgen_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 199)
Methods beyond
DistReader
- flush()¶
Close the *.bgen file for reading. (If values or properties are accessed again, the file will be reopened.)
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/example.bgen") >>> data_on_disk = Bgen(bgen_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (500, 199) >>> data_on_disk.flush()
- static genwrite(filename, distreader, decimal_places=None, id_rsid_function=<function default_id_rsid_function>, sample_function=<function default_sample_function>, block_size=None)¶
Writes a
DistReader
to Gen format- Parameters:
filename (string) – the name of the file to create (will also create filename_without_gen.sample file.)
distreader (
DistReader
) – The data that should be written to disk. It can also be any distreader, for example,DistNpz
,DistData
, or anotherBgen
.decimal_places – (Default: None) Number of decimal places with which to write the text numbers. None writes to full precision.
id_rsid_function (function) – Function to turn a
DistReader.sid
into a GEN (SNP) id and rsid. (Default:bgen.default_id_rsid_function()
.)sample_function (function) – Function to turn a
DistReader.iid
into a GEN sample. (Default:bgen.default_sample_function()
.)block_size (number) – The number of SNPs to read in a batch from distreader. Defaults to a block_size such that block_size * iid_count is about 100,000.
- Return type:
None
>>> from pysnptools.distreader import DistHdf5, Bgen >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5") >>> distreader = DistHdf5(hdf5_file)[:,:10] # A reader for the first 10 SNPs in Hdf5 format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.bgen") >>> Bgen.genwrite("tempdir/toydata10.gen",distreader) # Write data in GEN format
- static write(filename, distreader, bits=16, compression=None, sample_function=<function default_sample_function>, id_rsid_function=<function default_id_rsid_function>, iid_function=<function default_iid_function>, sid_function=<function default_sid_function>, block_size=None, qctool_path=None, cleanup_temp_files=True)¶
Writes a
DistReader
to BGEN format and returns theBgen
. Requires access to the 3rd party QCTool.- Parameters:
filename (string) – the name of the file to create
distreader (
DistReader
) – The data that should be written to disk. It can also be any distreader, for example,DistNpz
,DistData
, or anotherBgen
.bits (int) – Number of bits, between 1 and 32 used to represent each 0-to-1 probability value. Default is 16. An np.float32 needs 23 bits. A np.float64 would need 52 bits, which the BGEN format doesn’t offer, so use 32.
compression (string) – How to compress the file. Can be None (default), ‘zlib’, or ‘zstd’.
sample_function (function) – Function to turn a
DistReader.iid
into a BGEN sample. (Default:bgen.default_sample_function()
.)id_rsid_function (function) – Function to turn a
DistReader.sid
into a BGEN (SNP) id and rsid. (Default:bgen.default_id_rsid_function()
.)iid_function (function) – Function to turn a BGEN sample into a
DistReader.iid
. (Default:bgen.default_iid_function()
.)sid_function (function) – Function to turn a BGEN (SNP) id and rsid into a
DistReader.sid
. (Default:bgen.default_sid_function()
.)block_size (number) – The number of SNPs to read in a batch from distreader. Defaults to a block_size such that block_size * iid_count is about 100,000.
qctool_path (string) – Tells the path to the 3rd party QCTool. Defaults to reading the path from environment variable QCTOOLPATH. (To use on Windows, install Ubuntu for Windows, install QCTool in Ubuntu, and then give the path as “ubuntu run <UBUNTU PATH TO QCTOOL>”.)
cleanup_temp_files (bool) – Tells whether to delete the temporary *.gen and *.sample files.
- Return type:
Bgen
>>> from pysnptools.distreader import DistHdf5, Bgen >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5") >>> distreader = DistHdf5(hdf5_file)[:,:10] # A reader for the first 10 SNPs in Hdf5 format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.bgen") >>> Bgen.write("tempdir/toydata10.bgen",distreader) # Write data in BGEN format Bgen('tempdir/toydata10.bgen')
- pysnptools.distreader.bgen.default_iid_function(sample)¶
The default function for turning a Bgen sample into a two-part
pysnptools.distreader.DistReader.iid
. If the Bgen sample contains a single comma, we split on the comma to create the iid. Otherwise, the iid will be (‘0’,sample)>>> default_iid_function('fam0,ind0') ('fam0', 'ind0') >>> default_iid_function('ind0') ('0', 'ind0')
- pysnptools.distreader.bgen.default_sid_function(id, rsid)¶
The default function for turning a Bgen (SNP) id and rsid into a
pysnptools.distreader.DistReader.sid
. If the Bgen rsid is ‘’ or ‘0’, the sid will be the (SNP) id. Otherwise, the sid will be ‘ID,RSID’>>> default_sid_function('SNP1','rs102343') 'SNP1,rs102343' >>> default_sid_function('SNP1','0') 'SNP1'
- pysnptools.distreader.bgen.default_sample_function(famid, indid)¶
The default function for turning a two-part
pysnptools.distreader.DistReader.iid
into a Bgen sample. If the iid’s first part (the family id) is ‘0’ or ‘’, the sample will be the iid’s second part. Otherwise, the sample will be ‘FAMID,INDID’>>> default_sample_function('fam0','ind0') 'fam0,ind0' >>> default_sample_function('0','ind0') 'ind0'
- pysnptools.distreader.bgen.default_id_rsid_function(sid)¶
The default function for turning a
pysnptools.distreader.DistReader.sid
into a Bgen (SNP) id and rsid. If the sid contains a single comma, we split on the comma to create the id and rsid. Otherwise, the (SNP) id will be the sid and the rsid will be ‘0’>>> default_id_rsid_function('SNP1,rs102343') ('SNP1', 'rs102343') >>> default_id_rsid_function('SNP1') ('SNP1', '0')
distreader.DistData
¶
- class pysnptools.distreader.DistData(iid, sid, val, pos=None, name=None, copyinputs_function=None)¶
A
DistReader
for holding SNP distributions (or similar values) in-memory, along with related iid, sid, and pos information. It is usually created by calling theDistReader.read()
method on anotherDistReader
, for example,Bgen
. It can also be constructed directly.See
DistReader
for details and examples.- Constructor:
- Parameters:
iid (an array of string pair) – The
DistReader.iid
information.sid (an array of strings) – The
DistReader.sid
information.val (a 3-D array of floats) – The SNP value distribution
pos (optional, an array of strings) – The
DistReader.pos
informationname (optional, string) – Information to be display about the origin of this data
copyinputs_function (optional, function) – Used internally by optional clustering code
- Example:
>>> from pysnptools.distreader import DistData >>> distdata = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[.5,.5,0]]]) >>> print((distdata.val[1,1], distdata.iid_count, distdata.sid_count)) (array([0. , 0.75, 0.25]), 2, 3)
Equality:
Two DistData objects are equal if their four arrays (
DistData.val
,DistReader.iid
,DistReader.sid
, andDistReader.pos
) are ‘array_equal’. (Their ‘name’ does not need to be the same). If eitherDistData.val
contains NaN, the objects will not be equal. However,DistData.allclose()
can be used to treat NaN as regular values.- Example:
>>> import numpy as np >>> from pysnptools.distreader import DistData >>> snpdata1 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[.5,.5,0]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata2 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[.5,.5,0]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata1 == snpdata2) #True, because all the arrays have the same values. True >>> print(snpdata1.val is snpdata2.val) #False, because the two arrays have different memory. False >>> snpdata3 = DistData(iid=[['a','0'],['b','0']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[.5,.5,0]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata4 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[.5,.5,0]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata3 == snpdata4) #False, because the iids are different. False >>> snpdata5 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata6 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata5 == snpdata6) #False, because the val's contain NaN False >>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True
Methods beyond
DistReader
- allclose(value, equal_nan=True)¶
- Parameters:
value (
DistData
) – Other object with which to compare.equal_nan (bool) – (Default: True) Tells if NaN in
DistData.val
should be treated as regular values when testing equality.
>>> import numpy as np >>> snpdata5 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> snpdata6 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], ... val=[[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]], ... pos=[[0,0,0],[0,0,0],[0,0,0]]) >>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True >>> print(snpdata5.allclose(snpdata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal. False
- property val¶
The 3D NumPy array of floats that represents the distribution of SNP values. You can get or set this property.
>>> from pysnptools.distreader import Bgen >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> distdata = Bgen(bgen_file)[:5,:].read() #read data for first 5 iids >>> print(distdata.val[4,55]) #print one of the SNP values [0.23137255 0.65342184 0.11520562]
distreader.DistHdf5
¶
- class pysnptools.distreader.DistHdf5(*args, **kwargs)¶
A
DistReader
for reading *.dist.hdf5 files from disk.See
DistReader
for general examples of using DistReaders.The general HDF5 format is described here. The DistHdf5 format stores val, iid, sid, and pos information in Hdf5 format.
- Constructor:
- Parameters:
filename (string) – The DistHdf5 file to read.
- Example:
>>> from pysnptools.distreader import DistHdf5 >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5") >>> data_on_disk = DistHdf5(hdf5_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (25, 10000)
Methods beyond
DistReader
- static write(filename, distdata, hdf5_dtype=None, sid_major=True)¶
Writes a
DistData
to DistHdf5 format and returns the DistHdf5.
- Parameters:
.- Parameters:
filename (string) – the name of the file to create
distdata (
DistData
) – The in-memory data that should be written to disk.hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.
sid_major – Tells if vals should be stored on disk in sid_major (default) or iid_major format.
- Return type:
DistHdf5
>>> from pysnptools.distreader import DistHdf5, Bgen >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> distdata = Bgen(bgen_file)[:,:10].read() # Read first 10 snps from BGEN format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.dist.hdf5") >>> DistHdf5.write("tempdir/toydata10.dist.hdf5",distdata) # Write data in DistHdf5 format DistHdf5('tempdir/toydata10.dist.hdf5')
distreader.DistNpz
¶
- class pysnptools.distreader.DistNpz(*args, **kwargs)¶
A
DistReader
for reading *.dist.npz files from disk.See
DistReader
for general examples of using DistReaders.The general NPZ format is described here. The DistNpz format stores val, iid, sid, and pos information in NPZ format.
- Constructor:
- Parameters:
filename (string) – The DistNpz file to read.
- Example:
>>> from pysnptools.distreader import DistNpz >>> from pysnptools.util import example_file # Download and return local file name >>> npz_file = example_file("pysnptools/examples/toydata10.dist.npz") >>> data_on_disk = DistNpz(npz_file) >>> print((data_on_disk.iid_count, data_on_disk.sid_count)) (25, 10)
Methods beyond
DistReader
- static write(filename, distdata)¶
Writes a
DistData
to DistNpz format and returns the DistNpz.
- Parameters:
filename (string) – the name of the file to create
distdata (
DistData
) – The in-memory data that should be written to disk.
- Return type:
DistNpz
>>> from pysnptools.distreader import DistNpz, DistHdf5 >>> import pysnptools.util as pstutil >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file("pysnptools/examples/toydata.iidmajor.dist.hdf5") >>> distdata = DistHdf5(hdf5_file)[:,:10].read() # Read first 10 snps from DistHdf5 format >>> pstutil.create_directory_if_necessary("tempdir/toydata10.dist.npz") >>> DistNpz.write("tempdir/toydata10.dist.npz",distdata) # Write data in DistNpz format DistNpz('tempdir/toydata10.dist.npz')
distreader.DistMemMap
¶
- class pysnptools.distreader.DistMemMap(*args, **kwargs)¶
A
DistData
that keeps its data in a memory-mapped file. This allows working with data larger than fits in main memory. See
DistData
for general examples of using DistData.- Constructor:
- Parameters:
filename (string) – The *.dist.memmap file to read.
Also see
DistMemMap.empty()
andDistMemMap.write()
.- Example:
>>> from pysnptools.distreader import DistMemMap >>> from pysnptools.util import example_file # Download and return local file name >>> mem_map_file = example_file("pysnptools/examples/tiny.dist.memmap") >>> dist_mem_map = DistMemMap(mem_map_file) >>> print(dist_mem_map.val[0,1], dist_mem_map.iid_count, dist_mem_map.sid_count) [0.43403135 0.28289911 0.28306954] 25 10
Methods inherited from
DistData
Methods beyond
DistReader
- static empty(iid, sid, filename, pos=None, order='F', dtype=<class 'numpy.float64'>)¶
Create an empty
DistMemMap
on disk.- Parameters:
iid (an array of string pairs) – The
DistReader.iid
informationsid (an array of strings) – The
DistReader.sid
informationfilename (string) – name of memory-mapped file to create
pos (an array of numeric triples) – optional – The additional
DistReader.pos
information associated with each sid. Default: Noneorder (string or None) – {‘F’ (default), ‘C’}, optional – Specify the order of the ndarray.
dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
DistMemMap.val
ndarray.
- Return type:
DistMemMap
>>> import numpy as np >>> import pysnptools.util as pstutil >>> from pysnptools.distreader import DistMemMap >>> filename = "tempdir/tiny.dist.memmap" >>> pstutil.create_directory_if_necessary(filename) >>> dist_mem_map = DistMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64) >>> dist_mem_map.val[:,:,:] = [[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[.5,.5,0]]] >>> dist_mem_map.flush()
- property filename¶
The name of the memory-mapped file
- flush()¶
Flush
DistMemMap.val
to disk and close the file. (If values or properties are accessed again, the file will be reopened.)>>> import numpy as np >>> import pysnptools.util as pstutil >>> from pysnptools.distreader import DistMemMap >>> filename = "tempdir/tiny.dist.memmap" >>> pstutil.create_directory_if_necessary(filename) >>> dist_mem_map = DistMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64) >>> dist_mem_map.val[:,:,:] = [[[.5,.5,0],[0,0,1],[.5,.5,0]], ... [[0,1.,0],[0,.75,.25],[.5,.5,0]]] >>> dist_mem_map.flush()
- property offset¶
The byte position in the file where the memory-mapped values start.
(The disk space before this is used to store
DistReader.iid
, etc. information. This property is useful when interfacing with, for example, external Fortran and C matrix libraries.)
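As a generic illustration (numpy-only; this is not the actual DistMemMap file layout), a byte offset like this is what lets external code map the value region of a file directly, skipping the metadata header:

```python
import os
import tempfile
import numpy as np

# Hypothetical layout: the first `offset` bytes hold metadata, followed by
# raw float64 values that can be mapped without copying.
offset = 64
values = np.arange(6, dtype=np.float64).reshape(2, 3)

path = os.path.join(tempfile.mkdtemp(), "toy.memmap")
with open(path, "wb") as f:
    f.write(b"\0" * offset)          # placeholder metadata block
    f.write(values.tobytes(order="C"))

# Map the value region in place, starting at the byte offset.
mapped = np.memmap(path, dtype=np.float64, mode="r", offset=offset, shape=(2, 3))
print(mapped[1, 2])
```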
- property val¶
The 3D NumPy memmap array of floats that represents the distribution of SNP values. You can get this property, but cannot set it (except with itself).
>>> from pysnptools.distreader import DistMemMap >>> from pysnptools.util import example_file # Download and return local file name >>> mem_map_file = example_file("pysnptools/examples/tiny.dist.memmap") >>> dist_mem_map = DistMemMap(mem_map_file) >>> print(dist_mem_map.val[0,1]) [0.43403135 0.28289911 0.28306954]
- static write(filename, distreader, order='A', dtype=None, block_size=None, num_threads=None)¶
Writes a
DistReader
toDistMemMap
format.- Parameters:
filename (string) – the name of the file to create
distreader (
DistReader
) – The data that should be written to disk. It can also be any distreader, for example,DistNpz
,DistData
, or anotherBgen
.order (string or None) – {‘A’ (default), ‘F’, ‘C’}, optional – Specify the order of the ndarray. By default, will match the order of the input if knowable; otherwise, ‘F’
dtype (data-type) – {None (default), numpy.float64, numpy.float32}, optional – The data-type for the
DistMemMap.val
ndarray. By default, will match the dtype of the input if knowable; otherwise np.float64.block_size (number) – The number of SNPs to read in a batch from distreader. Defaults to a block_size such that block_size * iid_count is about 100,000.
num_threads (None or int) – optional – The number of threads with which to write data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
DistMemMap
>>> import pysnptools.util as pstutil >>> from pysnptools.distreader import Bgen, DistMemMap >>> from pysnptools.util import example_file # Download and return local file name >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen") >>> distreader = Bgen(bgen_file)[:,:10] #Create a reader for the first 10 SNPs >>> pstutil.create_directory_if_necessary("tempdir/tiny.dist.memmap") >>> DistMemMap.write("tempdir/tiny.dist.memmap",distreader) # Write distreader in DistMemMap format DistMemMap('tempdir/tiny.dist.memmap')
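The num_threads fallback chain documented above can be pictured with a hypothetical helper (not part of pysnptools) that resolves the documented priority order:

```python
import os

def resolve_num_threads(num_threads=None):
    """Return an explicit argument if given, else the first environment
    variable set in the documented priority order, else None
    (meaning: use all available processors)."""
    if num_threads is not None:
        return num_threads
    for name in ('PST_NUM_THREADS', 'NUM_THREADS', 'MKL_NUM_THREADS'):
        value = os.environ.get(name)
        if value is not None:
            return int(value)
    return None

# PST_NUM_THREADS unset, so NUM_THREADS wins over MKL_NUM_THREADS.
os.environ.pop('PST_NUM_THREADS', None)
os.environ['NUM_THREADS'] = '4'
os.environ['MKL_NUM_THREADS'] = '8'
print(resolve_num_threads())
```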
distreader.DistGen
¶
- class pysnptools.distreader.DistGen(seed, iid_count, sid_count, chrom_count=22, cache_file=None, block_size=None)¶
A
DistReader
that generates – on the fly – deterministic “random” distributions over SNP data. See
DistReader
for general examples of using DistReader.Tip: The ‘gen’ in DistGen stands for generate, because it generates random (genetic) data. The ‘gen’ in
Bgen
stands for ‘genetic’.
- Constructor:
- Parameters:
seed (number) – The random seed that (along with block_size) determines the SNP values.
iid_count (number) – the number of iids (number of individuals)
sid_count (number) – the number of sids (number of SNPs)
chrom_count (number) – the number of chromosomes to generate. Default is 22. Must be 22 or less.
cache_file (string) – (default None) If provided, tells where to cache the common iid, sid, and pos information. Using it can save time.
block_size (number or None) – (default None) Tells how many SNPs to generate at once. The default picks a block_size such that iid_count * block_size is about 100,000.
- Example:
>>> from pysnptools.distreader import DistGen >>> #Prepare to generate data for 1000 individuals and 1,000,000 SNPs >>> dist_gen = DistGen(seed=332,iid_count=1000,sid_count=1000*1000) >>> print(dist_gen.iid_count,dist_gen.sid_count) 1000 1000000 >>> dist_data = dist_gen[:,200*1000:201*1000].read() #Generate data for all users and for SNPs 200K to 201K >>> print(dist_data.val[2,2], dist_data.iid_count, dist_data.sid_count) [0.2760024 0.19624834 0.52774925] 1000 1000
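The “deterministic random” idea can be seen in a numpy-only sketch (DistGen's actual generator is internal to pysnptools): the same seed always regenerates the same values, which is what allows data to be served on the fly without ever being stored:

```python
import numpy as np

# Two independent generators seeded identically produce identical values,
# so any block of "random" data can be regenerated on demand.
a = np.random.default_rng(332).random((2, 3))
b = np.random.default_rng(332).random((2, 3))
print(np.array_equal(a, b))
```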
pstreader
Module¶
Tools for reading and manipulating matrix data along with row and column ids and properties.
pstreader.PstReader
¶
- class pysnptools.pstreader.PstReader(*args, **kwargs)¶
A PstReader is one of three things:
A
PstData
class that holds matrix data in memory:>>> import numpy as np >>> from pysnptools.pstreader import PstData >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> type(data1.val).__name__ # The val property is an ndarray of values 'ndarray' >>> data1.row_count #prints the number of rows in this in-memory data 3
A class such as
PstNpz
for you to specify data in a file. For example,>>> from pysnptools.pstreader import PstNpz >>> from pysnptools.util import example_file # Download and return local file name >>> >>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz') >>> on_disk = PstNpz(pstnpz_file) >>> print(on_disk) # prints the name of the file reader PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz') >>> on_disk.col_count # prints the number of columns (but doesn't read any matrix values) 1015
A subset of any PstReader, specified with “[ row_index , col_index ]”, to read just some values.
>>> from pysnptools.pstreader import PstNpz >>> from pysnptools.util import example_file # Download and return local file name >>> >>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz') >>> on_disk = PstNpz(pstnpz_file) >>> subset_on_disk = on_disk[[3,4],::2] # specification for a subset of the data on disk. No values are read yet. >>> print(subset_on_disk.col_count) # prints the number of columns in this subset (but still doesn't read any values) 508 >>> print(subset_on_disk) #prints a specification of 'subset_on_disk' PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz')[[3,4],::2] >>> data_subset = subset_on_disk.read() # efficiently (if practical) reads the specified subset of values from the disk >>> print(data_subset) # prints the specification of the in-memory information PstData(PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz')[[3,4],::2]) >>> int(data_subset.val.shape[0]),int(data_subset.val.shape[1]) # The dimensions of the ndarray of values (2, 508)
The PstReaders Classes
=====================  =================  =============  ==================================
Class                  Format             Random Access  Suffixes
=====================  =================  =============  ==================================
PstData                in-memory floats   Yes            n/a
PstNpz                 binary, floats     No             .pst.npz, .snp.npz, .kernel.npz
PstHdf5                binary, floats     Yes            .pst.hdf5, .snp.hdf5, .kernel.hdf5
PstMemMap              mem-mapped floats  Yes            .pst.memmap, .snp.memmap
various SnpReaders     varies             varies         varies
various KernelReaders  varies             varies         varies
=====================  =================  =============  ==================================
A
SnpReader
andKernelReader
are each a kind ofPstReader
. They have some restrictions summarized here:

============  ========  ========  ========  =================  =================
Class         val type  row type  col type  row_property type  col_property type
============  ========  ========  ========  =================  =================
PstReader     float     any       any       any                any
SnpReader     float     str,str   str       none               float,float,float
KernelReader  float     str,str   str,str   none               none
============  ========  ========  ========  =================  =================

For convenience, they allow additional ways to access rows and columns.
============  ========  ============  ============  =================  =================
Class         val name  row name      col name      row_property name  col_property name
============  ========  ============  ============  =================  =================
PstReader     val       row           col           row_property       col_property
SnpReader     val       iid,row       sid,col       none               col_property,pos
KernelReader  val       iid0,row,iid  iid1,col,iid  none               none
============  ========  ============  ============  =================  =================
- Note:
A
KernelReader.iid
may be used whenKernelReader.iid0
is equal toKernelReader.iid1
Methods & Properties:
Every PstReader, such as
PstNpz
andPstData
, has these properties:row
,row_count
,col
,col_count
,row_property
,col_property
and these methods:read()
,row_to_index()
,col_to_index()
. See below for details.PstData
is a PstReader so it supports the above properties and methods. In addition, it supports propertyPstData.val
and equality testing. See below for details.Rows and Cols:
Example:
>>> from pysnptools.pstreader import PstHdf5 >>> from pysnptools.util import print2 #print bytes strings and Unicode strings the same >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file('pysnptools/examples/toydata.iidmajor.snp.hdf5') >>> on_disk = PstHdf5(hdf5_file) # PstHdf5 can load .pst.hdf5, .snp.hdf5, and kernel.hdf5 >>> print2(on_disk.row[:3]) # print the first three rows [['per0' 'per0'] ['per1' 'per1'] ['per2' 'per2']] >>> print2(on_disk.col[:7]) # print the first seven columns ['null_0' 'null_1' 'null_2' 'null_3' 'null_4' 'null_5' 'null_6'] >>> print2(on_disk.row_to_index([[b'per2', b'per2'],[b'per1', b'per1']])) #Find the indexes for two rows. [2 1]
When Matrix Data is Read:
Matrix data can be enormous so we generally avoid reading it to the degree practical. Specifically,
Constructing and printing a PstReader causes no file reading. For example, these commands read no data:
>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read. >>> print(on_disk) # Print the PstHdf5 PstReader specification. No data is read. PstHdf5(...pysnptools/examples/toydata.iidmajor.snp.hdf5') >>> subset_on_disk = on_disk[[3,4],::2] # Construct a subsetting PstReader. No data is read. >>> print(subset_on_disk) # print the subset PstReader. No data is read. PstHdf5(...pysnptools/examples/toydata.iidmajor.snp.hdf5')[[3,4],::2]
Properties and methods related to the rows and columns (to the degree practical) read only row and col data from the disk, not value data. Moreover, the row and col data is read from file only once. Consider these commands:
>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read. >>> print2(on_disk.col[:7]) # without reading any values data from disk, read the row and col data from disk, cache it, and then print the first seven cols. ['null_0' 'null_1' 'null_2' 'null_3' 'null_4' 'null_5' 'null_6'] >>> print(on_disk.col_to_index([b'null_7',b'null_2'])) #use the cached col information to find the indexes of 'null_7' and 'null_2'. (No data is read from disk.) [7 2]
The only method that reads values from file (to the degree practical) is
read()
. For example:>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read. >>> data1 = on_disk.read() #read all the values from disk, creating a new PstData instance that keeps these values in memory >>> print(data1.val[0,2]) # print the value for the row with index 0 and the col with index 2. (No data is read from disk.) 2.0
If you request the values for only a subset of the rows or columns, (to the degree practical) only that subset will be read from disk. For example:
>>> hdf5_file2 = example_file('tests/datasets/all_chr.maf0.001.N300.hdf5') >>> subset_on_disk = PstHdf5(hdf5_file2)[[3,4],::2] # Construct a subsetting PstReader. No data is read. >>> data_subset = subset_on_disk.read() # from disk, read the values for the rows with index 3 and 4 AND cols with even numbered indexes. >>> print(data_subset.val[0,2]) # print the value with subset row index 0 and col index 2 (corresponding to row index 3 and col index 4 in the full data). No data is read from disk. 1.0
When Matrix Data is Re-Read and Copied:
Every time you call a PstReader’s
read()
method, the PstReader re-reads the value data and returns a new in-memoryPstData
(withPstData.val
property containing a new ndarray of the values).Here is an example of what not to do, because it causes all the matrix value data to be read twice.
>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read. >>> # Not recommended because it inefficiently reads all the values twice. >>> print(on_disk.read().val[0,2]) # read all values into a new PstData, print a value 2.0 >>> print(on_disk.read().val[0,3]) # read all values (again) into a second new PstData, print a value 2.0
Here are two efficient alternatives. First, if all values can all fit in memory, read them once into a
PstData
and then access thatPstData
multiple times.>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read. >>> data1 = on_disk.read() # read all values into a new PstData >>> print(data1.val[0,2]) # print a value from data1's in-memory ndarray 2.0 >>> print(data1.val[0,3]) # print another value from data1's in-memory ndarray. 2.0
Second, if the value data is too large to fit in memory, use subsetting to read only the values of interest from disk.
>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read. >>> print(on_disk[0,2].read().val[0,0]) #Define the subset of data and read only that subset from disk. 2.0 >>> print(on_disk[0,3].read().val[0,0]) #Define a second subset of data and read only that subset from disk. 2.0
Because the in-memory
PstData
class is a kind of PstReader, you may read from it, too. Doing so creates a new PstData
instance containing a copy of the matrix values in a new ndarray.>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read. >>> data1 = on_disk.read() # read all matrix values from disk into a new PstData >>> print(data1.val is data1.val) # Do the in-memory SNP values use the same memory as themselves? Yes True >>> data2 = data1.read() # copy all the matrix values into a new ndarray in a new PstData >>> print(data2.val is data1.val) # Do the two ndarrays of in-memory matrix values use the same memory? False
Avoiding Unwanted ndarray Allocations
You may want a subset of matrix values from an in-memory
PstData
and you may know that this subset and the originalPstData
can safely share the memory of the ndarray of matrix values. For this case, theread()
has optional parameters called view_ok and order. If you override the defaults of “view_ok=False,order=’F’” with “view_ok=True,order=’A’, theread()
will, if practical, return a newPstData
with a ndarray shares memory with the original ndarray. Use these parameters with care because any change to either ndarray will effect the others. Also keep in mind thatread()
relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.>>> on_disk = PstNpz(pstnpz_file) # Construct a PstNpz PstReader. No data is read. >>> data1 = on_disk.read() # read all data from disk into a PstData with a new ndarray >>> column01 = data1[:,0:1].read(view_ok=True,order='A') #create PstData with the data from just the first two columns. Sharing memory is OK. The memory may be laid out in any order (that is col-major and row-major are both OK). >>> import numpy as np >>> #print(np.may_share_memory(data1.val, column01.val)) # Do the two ndarray's share memory? They could (but currently they won't) >>> column201 = data1[:,[2,0,1]].read(view_ok=True,order='A') #create PstData with the data from three columns, permuted. Sharing memory is OK. >>> print(np.may_share_memory(data1.val, column201.val)) # Do the two ndarray's share memory? No, ndarray decided that this indexing was too complex for sharing. False
Creating Subsetting PstReaders with Indexing
You often don’t want to read the matrix values for all rows and cols. You can use indexing to create a subsetting PstReader that will read only the matrix values of interest.
PstReaders support the indexing formats supported by ndarray plus two generalizations. Here are examples of indexing with an array of indexes, with slicing, and with an array of Booleans.
>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format >>> subset_reader_1 = on_disk[[3,-1],:] #index with an array of indexes (negatives count from end) >>> print((subset_reader_1.row_count, subset_reader_1.col_count)) (2, 1015) >>> data1 = subset_reader_1.read() # read just the two rows of interest from the disk >>> subset_reader_2 = on_disk[:,:0:-2] #index with a slice >>> print((subset_reader_2.row_count, subset_reader_2.col_count)) (300, 507) >>> boolindexes = [s.startswith(b'23_') for s in on_disk.col] # create a Boolean index of cols that start '23_' >>> subset_reader_3 = on_disk[:,boolindexes] #index with array of Booleans >>> print((subset_reader_3.row_count, subset_reader_3.col_count)) (300, 24)
The first generalization over what ndarray offers is full indexing on both the row dimension and the col dimension, in other words, full multidimensional indexing. For example,
>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format >>> subset_reader_4 = on_disk[[3,4],:0:-2] # index on two dimensions at once >>> print((subset_reader_4.row_count, subset_reader_4.col_count)) (2, 507)
The second generalization is indexing on a single integer index.
>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format >>> subset_reader_5 = on_disk[5,:] #index with single integer >>> print((subset_reader_5.row_count, subset_reader_5.col_count)) (1, 1015)
Indexing is also useful when you have matrix values in memory via a
PstData
instance and want to copy a subset of those values. While you could instead index directly on the
ndarray, by indexing on thePstData
instance you also get row and col information.>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format >>> data1 = on_disk.read() # read all matrix values into memory >>> print2(data1.col[:9]) # print the first 9 cols ['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
>>> data_subset = data1[:,::2].read(view_ok=True,order='A') # create a copy or view with every other col >>> print2(data_subset.col[:9]) # print the first 9 cols in the subset ['1_12' '1_10' '1_28' '1_36' '1_4' '1_11' '1_32' '1_9' '1_17']
You can apply indexing on top of indexing to specify subsets of subsets of data to read. In this example, only the column values for every 16th col is actually read from the disk.
>>> # These are just PstReaders, nothing is read from disk yet >>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format >>> half_reader = on_disk[:,::2] # a reader for half the cols >>> quarter_reader = half_reader[:,::2] # a reader for half of half the cols >>> sixteenth_reader = quarter_reader[:,::2][:,::2] # a reader for half of half of half of half the cols >>> print(sixteenth_reader) #Print the specification of this reader PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz')[:,::2][:,::2][:,::2][:,::2] >>> # Now we read from disk. Only values for one col in every 16 will be read. >>> data_sixteenth = sixteenth_reader.read() >>> print(data_sixteenth.val[0,3]) 2.0
The
read()
Method

Details of Methods & Properties:
- property col¶
An ndarray of the col ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length
col_count
)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np >>> from pysnptools.pstreader import PstData >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> print(data1.col[-1]) # print the last column id. z >>> data2 = PstData(row=['a','b','c'],col=[[1,0],[2,0]],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> print(data2.col[-1]) # print the last column id. [2 0]
- property col_count¶
number of cols
- Return type:
integer
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- property col_property¶
An ndarray of the additional information for each col. Each element is an ndarray.
- Return type:
1- or 2-dimensional ndarray (length
col_count
)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> from pysnptools.pstreader import PstNpz >>> from pysnptools.util import example_file # Download and return local file name >>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz') >>> on_disk = PstNpz(pstnpz_file) >>> print(on_disk.col_property[:3]) # print column information for the first three cols: #The '...' is an optional space [[1. 0.00800801 0. ] [1. 0.023023 1. ] [1. 0.0700701 4. ]]
- col_to_index(list)¶
Takes a list of column ids and returns a list of index numbers.
- Parameters:
list (list) – list of cols
- Return type:
ndarray of int
This method (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> from pysnptools.pstreader import PstNpz >>> from pysnptools.util import example_file # Download and return local file name >>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz') >>> on_disk = PstNpz(pstnpz_file) # Specify matrix data on disk >>> print(on_disk.col_to_index([b'1_10',b'1_13'])) #Find the indexes for two cols. [2 9]
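Conceptually (a sketch, not the pysnptools implementation, with hypothetical col ids), col_to_index behaves like a one-time id-to-position map followed by constant-time lookups, mirroring "the row and col data is read from file only once":

```python
import numpy as np

cols = np.array(['1_12', '1_34', '1_10', '1_35'])   # hypothetical col ids

# Build the id -> position map once...
position = {col_id: i for i, col_id in enumerate(cols)}

# ...then answer any number of lookups in O(1) each.
def col_to_index(ids):
    return np.array([position[c] for c in ids])

print(col_to_index(['1_10', '1_34']))
```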
- read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)¶
Reads the matrix values and returns a
PstData
(withPstData.val
property containing a new ndarray of the matrix values).- Parameters:
order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (row-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (col-index varies the fastest). If order is ‘A’, then the
PstData.val
ndarray may be in any order (either C-, Fortran-contiguous).dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
PstData.val
ndarray.force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.
view_ok (bool) – optional – If False (default), allocates new memory for the
PstData.val
’s ndarray. If True, if practical and reading from aPstData
, will return a newPstData
with an ndarray that shares memory with the original PstData
. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray will affect the other. Also keep in mind that read()
relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
PstData
Calling the method again causes the matrix values to be re-read and creates a new in-memory
PstData
with a new ndarray of matrix values.If you request the values for only a subset of the rows or cols, (to the degree practical) only that subset will be read from disk.
- Example:
>>> from pysnptools.pstreader import PstHdf5 >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file('pysnptools/examples/toydata.iidmajor.snp.hdf5') >>> on_disk = PstHdf5(hdf5_file) # Specify matrix data on disk >>> pstdata1 = on_disk.read() # Read all the matrix data returning a PstData instance >>> print(type(pstdata1.val).__name__) # The PstData instance contains a ndarray of the data. ndarray >>> subset_pstdata = on_disk[:,::2].read() # From the disk, read matrix values for every other sid >>> print(subset_pstdata.val[0,0]) # Print the first matrix value in the subset 1.0 >>> subsub_pstdata = subset_pstdata[:10,:].read(order='A',view_ok=True) # Create an in-memory subset of the subset with matrix values for the first ten iids. Share memory if practical. >>> import numpy as np >>> # print(np.may_share_memory(subset_pstdata.val, subsub_pstdata.val)) # Do the two ndarray's share memory? They could. Currently they won't.
- property row¶
An ndarray of the row ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.
- Return type:
ndarray (length
row_count
)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np >>> from pysnptools.pstreader import PstData >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> print(data1.row[:2]) # print the first two row ids ['a' 'b'] >>> data2 = PstData(row=[[1,0],[2,0],[2,1]],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> print(data2.row[:2]) # print the first two row ids [[1 0] [2 0]]
- property row_count¶
number of rows
- Return type:
integer
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- property row_property¶
An ndarray of the additional information for each row. Each element is an ndarray.
- Return type:
1- or 2-dimensional ndarray (length
row_count
)
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> import numpy as np >>> from pysnptools.pstreader import PstData >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> print(data1.row_property[:2]) # print the first two row property values ['A' 'B']
- row_to_index(list)¶
Takes a list of row ids and returns a list of index numbers
- Parameters:
list – list of rows
- Return type:
ndarray of int
This method (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
- Example:
>>> from pysnptools.pstreader import PstNpz >>> from pysnptools.util import example_file # Download and return local file name >>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz') >>> on_disk = PstNpz(pstnpz_file) # Specify matrix data on disk >>> print(on_disk.row_to_index([[b'POP1',b'44'],[b'POP1',b'12']])) #Find the indexes for two rows. [2 1]
- property shape¶
number of rows and number of cols
- Return type:
tuple of two integers
This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.
pstreader.PstData
¶
- class pysnptools.pstreader.PstData(row, col, val, row_property=None, col_property=None, name=None, parent_string=None, copyinputs_function=None)¶
A
PstReader
for holding values in-memory, along with related row and col information. It is usually created by calling the PstReader.read()
method on another PstReader
, for example, PstNpz
. It can also be constructed directly.
See
PstReader
for details and examples.
- Constructor:
- Parameters:
row (an array of anything) – The
PstReader.row
informationcol (an array of anything) – The
PstReader.col
informationval (a 2-D array of floats or an array of floats) – The values
row_property (optional, an array of anything) – Additional information associated with each row.
col_property (optional, an array of anything) – Additional information associated with each col.
name (optional, string) – Information to be displayed about the origin of this data
copyinputs_function (optional, function) – Used internally by optional clustering code
- Example:
>>> from pysnptools.pstreader import PstData >>> pstdata = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]]) >>> print(pstdata.val[0,1], pstdata.row_count, pstdata.col_count) 2.0 2 3
Equality:
Two PstData objects are equal if their five arrays (
PstData.val
, PstReader.row
, PstReader.col
, PstReader.row_property
, and PstReader.col_property
) are equal. (Their ‘name’ does not need to be the same.) If either PstData.val
contains NaN, the objects will not be equal. However, PstData.allclose()
can be used to treat NaN as regular values. Any NaN’s in the other four arrays are treated as regular values.
- Example:
>>> import numpy as np >>> from pysnptools.pstreader import PstData >>> pstdata1 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]]) >>> pstdata2 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]]) >>> print(pstdata1 == pstdata2) #True, because all the arrays have the same values. True >>> print(pstdata1.val is pstdata2.val) #False, because the two arrays have different memory. False >>> pstdata3 = PstData(row=['a','b'], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]]) >>> pstdata4 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]]) >>> print(pstdata3 == pstdata4) #False, because the rows have different ids. False >>> pstdata5 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]]) >>> pstdata6 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]]) >>> print(pstdata5 == pstdata6) #False, because the val's contain NaN False >>> print(pstdata5.allclose(pstdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True
Methods beyond
PstReader
- allclose(value, equal_nan=True)¶
- Parameters:
value (
PstData
) – Other object with which to compare.equal_nan (bool) – (Default: True) Tells if NaN in
PstData.val
should be treated as regular values when testing equality.
>>> import numpy as np >>> from pysnptools.pstreader import PstData >>> pstdata5 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]]) >>> pstdata6 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]]) >>> print(pstdata5.allclose(pstdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values. True >>> print(pstdata5.allclose(pstdata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal. False
- property val¶
The 2D NumPy array of floats (or array of floats) that represents the values. You can get or set this property.
>>> from pysnptools.pstreader import PstNpz >>> from pysnptools.util import example_file >>> >>> pstnpz_file = example_file('tests/datasets/little.pst.npz') >>> pstdata = PstNpz(pstnpz_file)[:5,:].read() #read data for first 5 rows >>> print(pstdata.val[4,5]) #print one of the values 2.0
- property val_shape¶
Tells the shape of the value for a given row and col. If None, each value is a single scalar.
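For intuition, here is a NumPy-only sketch (not the pysnptools API) of what val_shape implies for the dimensionality of val: with val_shape None each (row, col) cell holds one scalar, so val is 2-D; with val_shape 3, as in diploid, biallelic distribution data, each cell holds a length-3 vector, so val is 3-D.

```python
import numpy as np

row_count, col_count = 5, 4

# val_shape None: each (row, col) cell is one scalar, so val is 2-D.
scalar_val = np.zeros((row_count, col_count))

# val_shape 3 (e.g., a probability distribution per cell): val is 3-D.
vector_val = np.zeros((row_count, col_count, 3))

print(scalar_val.ndim, vector_val.ndim)   # 2 3
print(vector_val[0, 0].shape)             # (3,)
```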
pstreader.PstHdf5
¶
- class pysnptools.pstreader.PstHdf5(filename, block_size=5000)¶
A
PstReader
for reading *.pst.hdf5 files from disk. See
PstReader
for general examples of using PstReaders.
The general HDF5 format is described here. The PstHdf5 format stores val, row, col, row_property, and col_property information in HDF5 format.
- Constructor:
- Parameters:
filename (string) – The PstHdf5 file to read.
- Example:
>>> from pysnptools.pstreader import PstHdf5 >>> from pysnptools.util import example_file # Download and return local file name >>> >>> psthdf5_file = example_file('pysnptools/examples/toydata.iidmajor.snp.hdf5') >>> on_disk = PstHdf5(psthdf5_file) # PstHdf5 can load .pst.hdf5, .snp.hdf5, and kernel.hdf5 >>> print(on_disk.row_count) 500
Methods beyond
PstReader
- flush()¶
Flush the connection to the HDF5 file. (If values or properties are accessed again, the file will be reopened.)
>>> import pysnptools.util as pstutil >>> from pysnptools.pstreader import PstHdf5 >>> from pysnptools.util import example_file # Download and return local file name >>> hdf5_file = example_file("pysnptools/examples/toydata.kernel.hdf5") >>> reader = PstHdf5(hdf5_file) >>> val1 = reader[0,0].read() >>> reader.flush()
- static write(filename, pstdata, hdf5_dtype=None, col_major=True)¶
Writes a
PstData
to PstHdf5 format and returns the PstHdf5
.
- Parameters:
filename (string) – the name of the file to create
pstdata (
PstData
) – The in-memory data that should be written to disk.hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.
col_major (bool) – Tells if vals should be stored on disk in col_major (default) or row_major format.
- Return type:
PstHdf5
>>> import numpy as np >>> from pysnptools.pstreader import PstData, PstHdf5 >>> import pysnptools.util as pstutil >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> pstutil.create_directory_if_necessary("tempdir/tiny.pst.hdf5") >>> PstHdf5.write("tempdir/tiny.pst.hdf5",data1) # Write data in PstHdf5 format PstHdf5('tempdir/tiny.pst.hdf5')
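The col_major flag corresponds to NumPy's 'F' (Fortran, column-major) versus 'C' (row-major) memory orders. A NumPy-only illustration of the difference, using a small made-up array rather than any particular file:

```python
import numpy as np

a = np.arange(6.0).reshape(2, 3)   # NumPy's default is C (row-major) order
f = np.asfortranarray(a)           # same values, column-major (Fortran) layout

print(a.flags['C_CONTIGUOUS'], a.flags['F_CONTIGUOUS'])   # True False
print(f.flags['C_CONTIGUOUS'], f.flags['F_CONTIGUOUS'])   # False True
print(np.array_equal(a, f))                               # True: layout changes, values do not
```

Column-major storage makes reading a whole column (e.g., one sid across all iids) a contiguous read.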
pstreader.PstNpz
¶
- class pysnptools.pstreader.PstNpz(filename)¶
A
PstReader
for reading *.pst.npz files from disk. See
PstReader
for general examples of using PstReaders.
The general NPZ format is described here. The PstNpz format stores val, row, col, row_property, and col_property information in NPZ format.
- Constructor:
- Parameters:
filename (string) – The PstNpz file to read.
- Example:
>>> from pysnptools.pstreader import PstNpz >>> from pysnptools.util import example_file # Download and return local file name >>> >>> pstnpz_file = example_file('tests/datasets/little.pst.npz') >>> data_on_disk = PstNpz(pstnpz_file) >>> print(data_on_disk.row_count) 300
Methods beyond
PstReader
- static write(filename, pstdata)¶
Writes a
PstData
to PstNpz format and returns the PstNpz
.
- Parameters:
filename (string) – the name of the file to create
pstdata (
PstData
) – The in-memory data that should be written to disk.
- Return type:
PstNpz
>>> import numpy as np >>> from pysnptools.pstreader import PstData, PstNpz >>> import pysnptools.util as pstutil >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> pstutil.create_directory_if_necessary("tempdir/tiny.pst.npz") >>> PstNpz.write("tempdir/tiny.pst.npz",data1) # Write data in PstNpz format PstNpz('tempdir/tiny.pst.npz')
pstreader.PstMemMap
¶
- class pysnptools.pstreader.PstMemMap(filename)¶
A
PstData
that keeps its data in a memory-mapped file. This allows data larger than fits in main memory. See
PstData
for general examples of using PstData.
- Constructor:
- Parameters:
filename (string) – The *.pst.memmap file to read.
Also see
PstMemMap.empty()
and PstMemMap.write()
.
- Example:
>>> from pysnptools.pstreader import PstMemMap >>> from pysnptools.util import example_file # Download and return local file name >>> >>> pst_mem_map_file = example_file('pysnptools/examples/tiny.pst.memmap') >>> pst_mem_map = PstMemMap(pst_mem_map_file) >>> print(pst_mem_map.val[0,1], pst_mem_map.row_count, pst_mem_map.col_count) 2.0 3 2
Methods beyond
PstReader
- static empty(row, col, filename, row_property=None, col_property=None, order='F', dtype=<class 'numpy.float64'>, val_shape=None)¶
Create an empty
PstMemMap
on disk.- Parameters:
row (an array of anything) – The
PstReader.row
informationcol (an array of anything) – The
PstReader.col
informationfilename (string) – name of memory-mapped file to create
row_property (an array of anything) – optional – The additional
PstReader.row_property
information associated with each row. Default: Nonecol_property (an array of anything) – optional – The additional
PstReader.col_property
information associated with each col. Default: Noneorder (string or None) – {‘F’ (default), ‘C’}, optional – Specify the order of the ndarray.
dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the
PstMemMap.val
ndarray.val_shape (None or a number) – (Default: None), optional – The shape of the last dimension of
PstMemMap.val
. None means each value is a scalar.
- Return type:
PstMemMap
>>> import numpy as np >>> import pysnptools.util as pstutil >>> from pysnptools.pstreader import PstMemMap >>> filename = "tempdir/tiny.pst.memmap" >>> pstutil.create_directory_if_necessary(filename) >>> pst_mem_map = PstMemMap.empty(row=['a','b','c'],col=['y','z'],filename=filename,row_property=['A','B','C'],order="F",dtype=np.float64) >>> pst_mem_map.val[:,:] = [[1,2],[3,4],[np.nan,6]] >>> pst_mem_map.flush()
- property filename¶
The name of the memory-mapped file
- flush()¶
Flush
PstMemMap.val
to disk and close the file. (If values or properties are accessed again, the file will be reopened.)
>>> import numpy as np >>> import pysnptools.util as pstutil >>> from pysnptools.pstreader import PstMemMap >>> filename = "tempdir/tiny.pst.memmap" >>> pstutil.create_directory_if_necessary(filename) >>> pst_mem_map = PstMemMap.empty(row=['a','b','c'],col=['y','z'],filename=filename,row_property=['A','B','C'],order="F",dtype=np.float64) >>> pst_mem_map.val[:,:] = [[1,2],[3,4],[np.nan,6]] >>> pst_mem_map.flush()
- property offset¶
The byte position in the file where the memory-mapped values start.
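The idea behind offset can be sketched with numpy.memmap alone: metadata occupies the start of the file, and the value array is mapped starting at a byte offset just past it. This toy layout is an assumption for illustration, not the actual PstMemMap file format.

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "toy.memmap")

header = b"toy-header\n"                     # stand-in for the row/col metadata at the file start
with open(path, "wb") as fh:
    fh.write(header)
    fh.write(np.arange(6.0).tobytes())       # the value payload: six float64s

# Map only the payload, starting at the byte offset just past the header.
val = np.memmap(path, dtype=np.float64, mode="r", offset=len(header), shape=(2, 3))
print(val[1, 2])   # 5.0
```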
- property val¶
The NumPy memmap array of floats that represents the values. You can get this property, but cannot set it (except with itself)
>>> from pysnptools.pstreader import PstMemMap >>> from pysnptools.util import example_file # Download and return local file name >>> >>> pst_mem_map_file = example_file('pysnptools/examples/tiny.pst.memmap') >>> pst_mem_map = PstMemMap(pst_mem_map_file) >>> print(pst_mem_map.val[0,1]) 2.0
- static write(filename, pstdata)¶
Writes a
PstData
to PstMemMap
format and returns the PstMemMap
.
- Parameters:
filename (string) – the name of the file to create
pstdata (
PstData
) – The in-memory data that should be written to disk.
- Return type:
PstMemMap
>>> import numpy as np >>> import pysnptools.util as pstutil >>> from pysnptools.pstreader import PstData, PstMemMap >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C']) >>> pstutil.create_directory_if_necessary("tempdir/tiny.pst.memmap") >>> PstMemMap.write("tempdir/tiny.pst.memmap",data1) # Write data1 in PstMemMap format PstMemMap('tempdir/tiny.pst.memmap')
util
Module¶
util
¶
- pysnptools.util.array_module(xp=None)¶
Find the array module to use, for example numpy or cupy.
- Parameters:
xp (optional, string or Python module) – The array module to use, for example, ‘numpy’ (normal CPU-based module) or ‘cupy’ (GPU-based module). If not given, will try to read from the ARRAY_MODULE environment variable. If not given and ARRAY_MODULE is not set, will use numpy. If ‘cupy’ is requested, will try to ‘import cupy’. If that import fails, will revert to numpy.
- Return type:
Python module
>>> from pysnptools.util import array_module >>> xp = array_module() # will look at environment variable >>> print(xp.zeros((3))) [0. 0. 0.] >>> xp = array_module('cupy') # will try to import 'cupy' >>> print(xp.zeros((3))) [0. 0. 0.]
- pysnptools.util.asnumpy(a)¶
Given an array created with any array module, return the equivalent numpy array. (Returns a numpy array unchanged.)
>>> from pysnptools.util import asnumpy, array_module >>> xp = array_module('cupy') >>> zeros_xp = xp.zeros((3)) # will be cupy if available >>> zeros_np = asnumpy(zeros_xp) # will be numpy >>> zeros_np array([0., 0., 0.])
- pysnptools.util.create_directory_if_necessary(name, isfile=True, robust=False)¶
Create a directory for a file if the directory doesn’t already exist.
- Parameters:
name (string) – file or directory name
isfile (bool) – If True (default), the name is a file, otherwise it is a directory.
robust (bool) – If False (default), will try once to create the directory. If True, will try 25 times.
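A rough stdlib equivalent of the isfile=True case (without the retry logic) creates the file's parent directory; the helper name here is hypothetical, for illustration only.

```python
import os
import tempfile

def create_directory_if_necessary_sketch(name, isfile=True):
    """Create the directory (or, if name is a file, its parent directory) if missing."""
    directory = os.path.dirname(name) if isfile else name
    if directory:                             # '' means the current directory; nothing to do
        os.makedirs(directory, exist_ok=True)

target = os.path.join(tempfile.mkdtemp(), "sub", "dir", "data.txt")
create_directory_if_necessary_sketch(target)   # creates .../sub/dir
print(os.path.isdir(os.path.dirname(target)))  # True
```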
- pysnptools.util.format_delta(delta_seconds)¶
Format a time delta nicely.
- Parameters:
delta_seconds (number) – The number of seconds
- Return type:
string
>>> from pysnptools.util import format_delta >>> print(format_delta(86403.5)) 1 day, 0:00:03.500000
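The printed form matches Python's own datetime.timedelta string rendering, so a stdlib stand-in reproduces the example above (the library's actual implementation may differ):

```python
import datetime

def format_delta_sketch(delta_seconds):
    # str(timedelta) renders values like "1 day, 0:00:03.500000"
    return str(datetime.timedelta(seconds=delta_seconds))

print(format_delta_sketch(86403.5))   # 1 day, 0:00:03.500000
```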
- pysnptools.util.get_array_module(a)¶
Given an array, returns the array’s module, for example, numpy or cupy. Works for numpy even when cupy is not available.
>>> import numpy as np >>> from pysnptools.util import get_array_module >>> zeros_np = np.zeros((3)) >>> xp = get_array_module(zeros_np) >>> xp.ones((3)) array([1., 1., 1.])
- pysnptools.util.intersect_apply(data_list, sort_by_dataset=True, intersect_before_standardize=True, is_test=False)¶
Intersects and sorts the iids from a list of datasets, returning new versions of the datasets, all with the same iids in the same order.
- Parameters:
data_list (list) – list of datasets
sort_by_dataset (bool) – optional, If True (default), the iids are ordered according to the first non-None dataset. If False, the order is arbitrary, but consistent.
intersect_before_standardize (bool) – optional. Special code for
SnpKernel
, the class that postpones computing a kernel from SNP data. If True (default),SnpKernel
will remove any iids before SNP standardization before computing the kernel. If False, SNPs will be standardized with all iids, then the kernel will be computed, then any iids will be removed.
- Return type:
list of datasets
Here are the dataset formats understood and what is returned for each.
Dataset Format / What is Returned:
- None -> None
- A SnpReader -> a new subsetting SnpReader with adjusted iid
- A KernelReader -> a new subsetting KernelReader with adjusted iid
- A dictionary with [‘iid’] and [‘vals’] keys -> the same dictionary but with the iid and vals values adjusted
- A tuple of the form (val ndarray, iid list) -> a new tuple with the val ndarray and iid list adjusted
If the iids in all the datasets are already the same and in the same order, then the datasets are returned without change.
Notice that only dictionaries are processed in-place. Inputting a
SnpReader
and KernelReader
returns a new class of the same type (unless its iids are already ok). Inputting a tuple returns a new tuple (unless its iids are already ok).
- Example:
>>> from pysnptools.snpreader import Bed, Pheno >>> from pysnptools.kernelreader import SnpKernel >>> from pysnptools.standardizer import Unit >>> from pysnptools.util import intersect_apply >>> #Create five datasets in different formats >>> ignore_in = None >>> from pysnptools.util import example_file # Download and return local file name >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> phenofile = example_file("tests/datasets/phenSynthFrom22.23.N300.randcidorder.txt") >>> covfile = example_file("tests/datasets/all_chr.maf0.001.covariates.N300.txt") >>> kernel_in = SnpKernel(Bed(bedfile,count_A1=False),Unit()) # Create a kernel from a Bed file >>> pheno_in = Pheno(phenofile,missing="") >>> cov = Pheno(covfile,missing="").read() >>> cov_as_tuple_in = (cov.val,cov.iid) #We could do cov directly, but as an example we make it a tuple. >>> >>> # Create five new datasets with consistent iids >>> ignore_out, kernel_out, pheno_out, cov_as_tuple_out = intersect_apply([ignore_in, kernel_in, pheno_in, cov_as_tuple_in]) >>> # Print the first five iids from each dataset >>> print(ignore_out, kernel_out.iid[:5], pheno_out.iid[:5], cov_as_tuple_out[1][:5]) None [['POP1' '0'] ['POP1' '12'] ['POP1' '44'] ['POP1' '58'] ['POP1' '65']] [['POP1' '0'] ['POP1' '12'] ['POP1' '44'] ['POP1' '58'] ['POP1' '65']] [['POP1' '0'] ['POP1' '12'] ['POP1' '44'] ['POP1' '58'] ['POP1' '65']]
- pysnptools.util.intersect_ids(idslist)¶
Deprecated: Use
intersect_apply()
instead.
Takes a list of 2d string arrays of family and individual ids. These are intersected.
- Return type:
indarr, an array of size N x L, where N is the number of individuals in the intersection, and L is the number of lists in idslist, and which contains the index to use (in order) such that all people will be identical and in order across all data sets.
If one of the lists is None, it is ignored (but still has values reported in indarr, all equal to -1).
- pysnptools.util.log_in_place(name, level, time_lambda=<built-in function time>, show_log_diffs=False)¶
Create a one-argument function to write messages to. If the logging level is met, messages will appear. All messages will be on the same line and will include time.
- Example:
from pysnptools.util import log_in_place import logging import time logging.basicConfig(level=logging.INFO) with log_in_place("counting", logging.INFO) as updater: for i in range(100): updater(i) time.sleep(.1) #typically, some work -- not a delay -- goes here
Outputs 100 messages on the same line, ending with something like “counting – time=0:00:09.99, 99”
- pysnptools.util.print2(arg)¶
Make printing under Python3 look the same as under Python2.
- pysnptools.util.sub_matrix(val, row_index_list, col_index_list, order='A', dtype=<class 'numpy.float64'>, num_threads=None)¶
Efficiently creates a sub-matrix from a 2-D or 3-D ndarray.
- Parameters:
val (ndarray) – The ndarray from which to copy.
row_index_list (list of integers) – Tells which rows (and in which order) to copy into the sub-matrix
col_index_list (list of integers) – Tells which columns (and in which order) to copy into the sub-matrix
order (string or None) – {‘A’ (default), ‘C’, ‘F’}, optional – Specify the order of the sub-matrix to create. If order is ‘C’, then the returned sub-matrix will be in C-contiguous order (first index varies the fastest). If order is ‘F’, then the array will be in F-contiguous order (second index varies the fastest). If order is ‘A’, then sub-matrix may be in any order F or C.
dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for sub-matrix created.
num_threads (None or int) – optional – The number of threads with which to work. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.
- Return type:
ndarray
>>> import numpy as np >>> import pysnptools.util as pstutil >>> np.random.seed(0) # set seed so that results are deterministic >>> matrix = np.random.rand(12,7) # create a 12 x 7 ndarray >>> submatrix = pstutil.sub_matrix(matrix,[0,2,11],[6,5,4,3,2,1,0]) >>> print(int(submatrix.shape[0]),int(submatrix.shape[1])) 3 7 >>> print(matrix[2,0] == submatrix[1,6]) #The row # 2 is now #1, the column #0 is now #6. True
Note: Behind the scenes, for performance, this function calls a Rust helper function.
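Semantically, the result matches NumPy fancy indexing with np.ix_; the Rust-backed sub_matrix exists for speed and thread control. A pure-NumPy sketch reproducing the example above:

```python
import numpy as np

np.random.seed(0)                 # same seed as the example above
matrix = np.random.rand(12, 7)    # create a 12 x 7 ndarray

rows = [0, 2, 11]
cols = [6, 5, 4, 3, 2, 1, 0]
submatrix = matrix[np.ix_(rows, cols)]   # pure-NumPy equivalent of sub_matrix

print(submatrix.shape)                   # (3, 7)
print(matrix[2, 0] == submatrix[1, 6])   # True: row #2 became #1, column #0 became #6
```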
- pysnptools.util.weighted_mean(ys, weights)¶
- Parameters:
ys (ndarray) – The ndarray of values
weights (ndarray) – the weight of each value (unnormalized is fine)
- Return type:
the weighted mean
>>> import numpy as np >>> from pysnptools.util import weighted_mean >>> ys = np.array([103.664086,89.80645161,83.86888046,90.54141176]) >>> weights = np.array([2.340862423,4.982888433,0.17522245,0.098562628]) >>> float(round(weighted_mean(ys, weights),5)) 93.9487
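The computation is the standard weighted average, sum(w * y) / sum(w); a NumPy one-liner reproduces the value from the example above:

```python
import numpy as np

ys = np.array([103.664086, 89.80645161, 83.86888046, 90.54141176])
weights = np.array([2.340862423, 4.982888433, 0.17522245, 0.098562628])

# Weighted average: sum(w * y) / sum(w); the weights need not be normalized.
mean = np.sum(weights * ys) / np.sum(weights)
print(round(float(mean), 5))   # 93.9487
```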
- pysnptools.util.weighted_simple_linear_regression(xs, ys, weights)¶
- Parameters:
xs (ndarray) – The ndarray of independent values
ys (ndarray) – The ndarray of dependent values
weights (ndarray) – the weight of each case (unnormalized is fine)
- Return type:
slope, intercept, xmean, ymean
>>> import numpy as np >>> from pysnptools.util import weighted_simple_linear_regression >>> xs = np.array([53.8329911,57.49486653,60.07392197,60.21081451]) >>> ys = np.array([103.664086,89.80645161,83.86888046,90.54141176]) >>> weights = np.array([2.340862423,4.982888433,0.17522245,0.098562628]) >>> slope, intercept, xmean, ymean = weighted_simple_linear_regression(xs, ys, weights) >>> print(round(slope,5), round(intercept,5), round(xmean,5), round(ymean,5)) -3.52643 293.05586 56.46133 93.9487
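The outputs above are consistent with the closed-form weighted least-squares fit sketched below (hedged: the library's actual implementation may differ in detail): weighted means of x and y, then slope = sum(w(x-xm)(y-ym)) / sum(w(x-xm)^2) and intercept = ym - slope * xm.

```python
import numpy as np

xs = np.array([53.8329911, 57.49486653, 60.07392197, 60.21081451])
ys = np.array([103.664086, 89.80645161, 83.86888046, 90.54141176])
weights = np.array([2.340862423, 4.982888433, 0.17522245, 0.098562628])

xmean = np.sum(weights * xs) / np.sum(weights)   # weighted mean of x
ymean = np.sum(weights * ys) / np.sum(weights)   # weighted mean of y
slope = (np.sum(weights * (xs - xmean) * (ys - ymean))
         / np.sum(weights * (xs - xmean) ** 2))
intercept = ymean - slope * xmean                # line passes through the weighted means

print(round(float(slope), 5), round(float(intercept), 5))
```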
- pysnptools.util.example_file(pattern, endswith=None)¶
Returns the local location of a PySnpTools example file, downloading it if needed.
- Parameters:
pattern – The name of the example file of interest. A file name pattern may be given. All matching files will be downloaded (if needed) and the name of one will be returned.
endswith – The pattern of the file name to return. By default, when no endswith is given, the name of the first matching file will be returned.
- Return type:
string
By default, the local location will be under the system temp directory (typically controlled with the TEMP environment variable). Alternatively, the directory can be set with the PYSNPTOOLS_CACHE_HOME environment variable.
This function knows the MD5 hash of all PySnpTools example files and uses that content-based hash to decide if a file needs to be downloaded.
>>> from pysnptools.util import example_file # Download and return local file name >>> # Download the phenotype file if necessary. Return its local location. >>> pheno_fn = example_file("pysnptools/examples/toydata.phe") >>> print('The local file name is ', pheno_fn) The local file name is ...pysnptools/examples/toydata.phe >>> >>> # Download the bed,bim,&fam files if necessary. Return the location of bed file. >>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed") >>> print('The local file name is ', bedfile) The local file name is ...tests/datasets/all_chr.maf0.001.N300.bed
- pysnptools.util.snp_gen(fst, dfr, iid_count, sid_count, maf_low=0.05, maf_high=0.5, seed=0, sibs_per_family=10, freq_pop_0=0.5, chr_count=None, label_with_pop=False)¶
Generates a random
SnpData
- Parameters:
fst (float) – Degree of Population Structure, e.g. 0 (a special case), 0.005, 0.01, 0.05, 0.1
dfr (float) – Degree of Family Relatedness, the fraction of individuals belonging to a family, i.e., fracSibs, e.g. 0.0, 0.5, 0.6, 0.7, 0.8, 0.9
iid_count (int) – The number of individuals to generate. Because of rounding the actual number may be less.
sid_count (int) – The number of snps to generate.
maf_low (float) – (default .05) lower bound of uniformly-generated Minor allele frequency
maf_high (float) – (default .5) upper bound of uniformly-generated Minor allele frequency
seed (int) – (default 0) Random seed
sibs_per_family (int) – (default 10) number of siblings in each family
freq_pop_0 (float) – (default .5) Fraction of individuals in population 0 (the rest will be in population 1)
chr_count (int) – (default one chromosome per SNP) Number of chromosomes to which SNPs should be assigned. The SNPs will be assigned as evenly as possible. Chromosome names are integers starting with 1. SNP positions within a chromosome are sequential integers starting with 1.
- Return type:
SnpData
- Example:
>>> from pysnptools.util import snp_gen >>> snpdata = snp_gen(fst=.1,dfr=.5,iid_count=200,sid_count=20,maf_low=.05,seed=6) >>> print(int(snpdata.iid_count), int(snpdata.sid_count)) #because of rounding got 190 individuals 190 20
util.IntRangeSet
¶
- class pysnptools.util.IntRangeSet(*ranges_inputs)¶
Bases:
object
A class for efficiently manipulating ranges of integers (including negatives and longs) using set operations such as
union()
, intersection()
, and difference.
The class differs from the built-in set class (and from Boolean numpy arrays) because it does not need to store every element in the set, only each contiguous range of elements. It differs from other Python interval libraries (that we know of) by being specialized and optimized for integer elements.
- Example:
Here we take the union (operator “|”) of two IntRangeSets:
>>> from pysnptools.util import IntRangeSet >>> a = IntRangeSet("100:500,501:1000") # a is the set of integers from 100 to 500 (exclusive) and 501 to 1000 (exclusive) >>> b = IntRangeSet("-20,400:600") # b is the set of integers -20 and the range 400 to 600 (exclusive) >>> c = a | b # c is the union of a and b, namely -20 and 100 to 1000 (exclusive) >>> print(c) IntRangeSet('-20,100:1000')
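The space-saving idea behind IntRangeSet, store only contiguous (start, stop) ranges rather than individual elements, can be sketched in a few lines of plain Python. This is an illustrative interval-merge, not pysnptools' implementation; it reproduces the union from the example above:

```python
def merge_ranges(ranges):
    """Union (start, stop) half-open ranges into sorted, disjoint ranges."""
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1]:   # overlaps or touches the previous range
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged

# a = 100:500 and 501:1000; b = -20 (i.e., -20:-19) and 400:600
print(merge_ranges([(100, 500), (501, 1000), (-20, -19), (400, 600)]))
# [(-20, -19), (100, 1000)]
```

However many integers each range covers, only its two endpoints are stored.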
- Example:
Suppose we want to find the intron regions of a gene but we are given only the transcription region and the exon regions.
>>> from pysnptools.util import IntRangeSet >>> line = "chr15 29370 37380 29370,32358,36715 30817,32561,37380" >>> chr,trans_start,trans_last,exon_starts,exon_lasts = line.split() # split the line on white space >>> trans_start = int(trans_start) >>> trans_stop = int(trans_last) + 1 # add one to convert the inclusive "last" value into a Pythonesque exclusive "stop" value >>> int_range_set = IntRangeSet((trans_start,trans_stop)) # creates a IntRangeSet from 29370 (inclusive) to 37381 (exclusive) >>> print(int_range_set) # print at any time to see the current value IntRangeSet('29370:37381')
Parse the exon start and last lists from strings to lists of integers (converting ‘last’ to ‘stop’)
>>> exon_starts = [int(start) for start in exon_starts.strip(",").split(',')] >>> exon_stops = [int(last)+1 for last in exon_lasts.strip(",").split(',')] >>> assert len(exon_starts) == len(exon_stops)
Zip together the two lists to create an iterable of exon_start,exon_stop tuples. Then ‘set subtract’ all these ranges from int_range_set
>>> int_range_set -= zip(exon_starts,exon_stops) >>> print(int_range_set) # See what it looks like IntRangeSet('30818:32358,32562:36715')
Create the desired output by iterating through each contiguous range of integers
>>> for start, stop in int_range_set.ranges(): ... print("{0} {1} {2}".format(chr, start, stop-1)) chr15 30818 32357 chr15 32562 36714
Ranges Input
The input to the
IntRangeSet
constructor and many of its methods is a ranges input.
A ranges input is one of the following:
A comma-separated string of integers or integer ranges, e.g.,
'100:500,500:1000,2000'
An integer or integer expression, e.g.,
3
A tuple of two integers, start and stop, e.g.,
(2,8)
. stop is exclusive.
A slice with non-negative values, e.g.
slice(2,8)
A
IntRangeSet
(or any class with aranges()
method), e.g.,IntRangeSet(3)
A list or iterable (but not tuple) of ranges inputs, e.g.,
[1,6,7,(100,200)]
- Example:
Strings:
>>> a = IntRangeSet("100:500,500:1000,2000") >>> a = IntRangeSet('-10:-8,1,2,3:11') >>> a = IntRangeSet('')
The ranges in a ranges input can overlap and can be in any order.
>>> assert IntRangeSet("2000,100:1500,500:1000,2000") == IntRangeSet('100:1500,2000')
Integers and Integer Expressions:
>>> a = IntRangeSet(7) >>> a = IntRangeSet(151000000000) # longs are OK >>> a = IntRangeSet(2*3+1) # integer expressions are OK
A tuple must have exactly two integers. They represent the start (inclusive) and stop (exclusive) integers in a range.
>>> assert IntRangeSet((2,8)) == IntRangeSet('2:8') # check 'set equality' >>> assert 7 in IntRangeSet((2,8)) # The integer 7 is an element of the set >>> assert IntRangeSet((2,3)) == IntRangeSet(2)
Lists and iterables of ranges inputs:
>>> assert IntRangeSet([1,6,7,(100,200)]) == IntRangeSet('1,6:8,100:200') >>> assert IntRangeSet(range(0,100)) == IntRangeSet('0:100') >>> assert IntRangeSet([range(100,200),3,'1000:2000',[4,6],(20,30)]) == IntRangeSet('3:5,6,20:30,100:200,1000:2000')
Some methods accept zero or more ranges input as their input. This is called a ranges_inputs. For example:
>>> a = IntRangeSet() # zero ranges inputs >>> a = IntRangeSet(3) # one ranges input >>> a = IntRangeSet('3:11,5',14,100) # Three ranges inputs, a string and two integers.
Most Important Methods and Operators
Description
Method or Operator
For Details
is a ‘set equal’ to b?
a == b
is a a superset of b?
b in a
,a.issuperset(b)
,a >= b
is a a subset of b?
a <= b
,a.issubset(b)
remove i th smallest element(s)
del a[i]
get i th smallest element(s)
a[i]
is a not ‘set equal’ to b?
a != b
is a a proper superset of b?
a > b
is a a proper subset of b?
a < b
union b into a
a |= b
,a += b
,a.add(b,c)
,a.update(b,c)
union a and b
a | b
,a + b
,a.union(b,c)
intersect b into a
a &= b
,a.intersection_update(b,c)
intersect a and b
a & b
,a.intersection(b,c)
subtract b from a
a -= b
,a.difference_update(b)
,a.discard(b)
remove b from a, error if missing
a.remove(b)
a ‘set difference’ b
a - b
,a.difference(b)
iterate integers in a, from low
for element in a:
iterate integers in a, from high
for element in reversed(a):
symmetric difference into a
a ^= b
a ‘symmetric difference’ b
a ^ b
count of integer elements in a
len(a)
iterate ranges in a, from low
for start,stop in a.ranges():
count of ranges in a
a.ranges_len
index of range containing b
a.ranges_index(b)
get i th smallest range
a.ranges_getitem(i)
a as a string
str(a)
remove all elements from a
a.clear()
copy a
a.copy()
find index of b in a
a.index(b)
are a and b disjoint?
a.isdisjoint(b)
is a empty?
a.isempty
smallest element of a
a.min()
largest element of a
a.max()
sum of elements in a
a.sum()
remove and return an element
a.pop()
Examples, Tips, and Warnings
The argument to a method and the right-hand side of an operator can be a ranges input rather than
IntRangeSet
. For example,>>> big = IntRangeSet('-1000:2000') >>> print(big - '10:20,500') # We can subtract this string because it is a legal ranges input. IntRangeSet('-1000:10,20:500,501:2000')
This even works for equality testing:
>>> assert big - '10:21,500' == '-1000:10,21:500,501:2000'
The Python in operator is backwards from other operators, so the left-hand side can be any ranges input.
>>> assert 501 in big
Like most other Python libraries, you specify a range with an inclusive start integer and an exclusive stop integer.
>>> print(7 in IntRangeSet('4:8')) # includes 7 True >>> print(8 in IntRangeSet('4:8')) # excludes 8 False >>> print(8 in IntRangeSet(range(4,7))) # also excludes 8 False
- Be careful with the ranges inputs specified with tuples. Suppose we want an
IntRangeSet
containing 4,5,6,7. >>> assert IntRangeSet((4,8)) == '4:8' #OK >>> assert IntRangeSet([(4,8)]) == '4:8' # List of a tuple is OK >>> assert IntRangeSet(4,8) == '4,8' #No. This is not a tuple. It is two integer inputs, so we get only 4 and 8 >>> assert IntRangeSet([4,8]) == '4,8' #No. List of two integers, so get only 4 and 8 >>> #Illegal: IntRangeSet((4,8,10)) # tuples must be pairs
Methods and Operators
- __contains__(*ranges_inputs)¶
True exactly when the given ranges input is a subset of the IntRangeSet.
These are the same:
b in a
a.issuperset(b)
a >= b
- Example:
>>> print(3 in IntRangeSet('0:5,6:11')) True >>> print(IntRangeSet('4:7') in IntRangeSet('0:5,6:11')) False >>> '6:9' in IntRangeSet('0:5,6:11') # The left-hand side of 'in' can be any ranges input True >>> print(IntRangeSet('0:5,6:11') >= '6:9') # The right-hand side can be any ranges input True
The ‘issuperset’ method also supports unioning multiple ranges inputs.
- Example:
>>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8)) True >>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8,100)) False
Note: By definition, any set is a superset of itself.
- __delitem__(key)¶
Remove elements from the IntRangeSet by position index. Position index can be specified by an integer with negative integers counting from the end. Position indexes can also be specified with slices and a ranges input.
- Example:
Removing with an integer position index:
>>> a = IntRangeSet('100:200,1000') >>> del a[2] >>> print(a) IntRangeSet('100:102,103:200,1000') >>> del a[-1] >>> print(a) IntRangeSet('100:102,103:200')
- Example:
Removing with a slice:
>>> a = IntRangeSet('100:200,1000') >>> del a[2:11] >>> print(a) IntRangeSet('100:102,111:200,1000')
- Example:
Removing with a ranges input:
>>> a = IntRangeSet('100:200,1000') >>> del a['2:11'] >>> print(a) IntRangeSet('100:102,111:200,1000')
- __eq__(other)¶
True exactly when the IntRangeSet on the left is set equivalent to the ranges input on the right.
a == b
>>> print(IntRangeSet('0:10,12') == IntRangeSet('0:10,12')) True >>> print(IntRangeSet('0:10,12') == IntRangeSet('0:10')) False >>> print(IntRangeSet('0:10,12') == IntRangeSet('12,0:5,5:10')) True >>> print(IntRangeSet('0:10,12') == '0:10,12') # The right-hand can be any ranges input True
- __ge__(other)¶
- __getitem__(key)¶
a[i]
returns the ith integer in sorted order (origin 0) from a, an IntRangeSet>>> print(IntRangeSet('100:200,1000')[0]) 100 >>> print(IntRangeSet('100:200,1000')[10]) 110
If i is negative, the indexing goes from the end
>>> print(IntRangeSet('100:200,1000')[-1]) 1000
Python’s standard slice notation may be used and returns IntRangeSets. (Remember that the stop number in slice notation is exclusive.)
>>> print(IntRangeSet('100:200,1000')[0:10]) # Integers 0 (inclusive) to 10 (exclusive) IntRangeSet('100:110')
>>> print(IntRangeSet('100:200,1000')[0:10:2]) # Integers 0 (inclusive) to 10 (exclusive) with step 2 IntRangeSet('100,102,104,106,108')
>>> print(IntRangeSet('100:200,1000')[-3:]) # The last three integers in the IntRangeSet. IntRangeSet('198:200,1000')
An IntRangeSet can also be accessed with any ranges input.
>>> IntRangeSet('100:200,1000')['0:10,20'] IntRangeSet('100:110,120')
- __gt__(other)¶
True exactly when the IntRangeSet is a proper superset of the ranges.
a > b
- Example:
>>> print(IntRangeSet('0:5,6:11') > '7:10') # The right-hand can be any ranges input True
Note: By definition, no set is a proper superset of itself.
- __hash__ = None¶
- __iadd__(*ranges_inputs)¶
- __iand__()¶
Set the IntRangeSet to itself intersected with a ranges input.
These are the same:
a &= b
a.intersection_update(b)
- Example:
>>> a = IntRangeSet('0:5,6:11') >>> a &= '3:8' >>> print(a) IntRangeSet('3:5,6:8')
- __imul__(n)¶
Produces the union of n copies of a, where a is an IntRangeSet. Because a is a set, the result is either an empty IntRangeSet (when n is 0 or less) or the original a, unchanged.
a *= n
- Example:
>>> a = IntRangeSet('0:5,6:11') >>> a *= 5 >>> print(a) # should be unchanged IntRangeSet('0:5,6:11') >>> a *= 0 >>> print(a) # should be empty IntRangeSet('')
- __init__(*ranges_inputs)¶
Create a
IntRangeSet
.
- __ior__(*ranges_inputs)¶
- __isub__(*ranges_inputs)¶
Remove the elements of the ranges inputs from the IntRangeSet.
These are the same:
a -= b
a.difference_update(b)
a.discard(b)
remove
is almost the same except that it raises a KeyError if any element of b is not in a.a.remove(b)
- Example:
>>> a = IntRangeSet('0:5,6:11') >>> a -= '3:7' >>> print(a) IntRangeSet('0:3,7:11')
The ‘difference_update’, ‘discard’ and ‘remove’ methods also support subtracting multiple ranges inputs.
- Example:
>>> a = IntRangeSet('0:5,6:11') >>> a.difference_update('3:7','8:100') >>> print(a) IntRangeSet('0:3,7')
- __iter__()¶
Iterate, in order from smallest to largest, the integer elements of the IntRangeSet
- Example:
>>> for i in IntRangeSet('1:4,10'): ... print(i) 1 2 3 10
- __ixor__(ranges)¶
Set the IntRangeSet to contain exactly those elements that appear in either itself or the input ranges, but not both.
These are the same:
a ^= b
a.symmetric_difference_update(b)
- Example:
>>> a = IntRangeSet('0:5,6:11') >>> a ^= '3:8' >>> print(a) IntRangeSet('0:3,5,8:11')
- __le__(ranges)¶
True exactly when the IntRangeSet is a subset of the ranges.
These are the same:
a <= b
a.issubset(b)
- Example:
>>> print(IntRangeSet('0:5,6:11') <= '-1:101') # The right-hand can be any ranges input True
Note: By definition, any set is a subset of itself.
- __len__()¶
The number of integer elements in the IntRangeSet
>>> print(len(IntRangeSet('0:10,12'))) 11
Note: This is computed in time linear in the number of ranges, rather than in the number of integer elements.
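That linear-in-ranges cost follows because the length can be computed directly from the (start, stop) pairs, without enumerating elements. A sketch of the idea (assuming the internal disjoint-range layout):

```python
def ranges_len(ranges):
    # Length of a disjoint (start, stop) range list, computed in time
    # linear in the number of ranges, not the number of integer elements.
    return sum(stop - start for start, stop in ranges)

assert ranges_len([(0, 10), (12, 13)]) == 11   # matches len(IntRangeSet('0:10,12'))
```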
- __lt__(ranges)¶
True exactly when the IntRangeSet is a proper subset of the ranges.
a < b
- Example:
>>> print(IntRangeSet('0:5,6:11') < '-1:101') # The right-hand can be any ranges input True
Note: By definition, no set is a proper subset of itself.
- __mul__(n)¶
a * n
, produces the union of n shallow copies of a, where a is an IntRangeSet. Because a is a set, the result is either an empty IntRangeSet (when n is 0 or less) or a copy of the original IntRangeSet.
- __ne__(other)¶
False exactly when the IntRangeSet on the left is set equivalent to the ranges input on the right.
These are the same.
a != b
not a == b
- __or__()¶
Return the union of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.
These are the same:
a | b
a + b
a.union(b)
- Example:
>>> print(IntRangeSet('0:5,6:10') | 5) IntRangeSet('0:10')
The ‘union’ method also supports unioning multiple ranges inputs.
- Example:
>>> print(IntRangeSet('0:5,6:10').union(5,'100:200')) IntRangeSet('0:10,100:200')
- __repr__()¶
Use the standard repr(a) function to create a string representation of a, an IntRangeSet.
>>> print("Hello " + repr(IntRangeSet(2,3,4,10))) Hello IntRangeSet('2:5,10')
- __reversed__()¶
reversed(a) is a generator that produces the integer elements of a in order from largest to smallest.
- Example:
>>> for i in reversed(IntRangeSet('1:4,10')): ... print(i) 10 3 2 1
- __str__()¶
Use the standard str(a) function to create a string representation of a, an IntRangeSet.
>>> print("Hello " + str(IntRangeSet(2,3,4,10))) Hello IntRangeSet('2:5,10')
- __sub__(*ranges_inputs)¶
Return the set difference of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.
These are the same:
a - b
a.difference(b)
- Example:
>>> print(IntRangeSet('0:5,6:11') - 1) IntRangeSet('0,2:5,6:11') >>> print(IntRangeSet('0:5,6:11') - '3:100') IntRangeSet('0:3')
The ‘difference’ method also supports subtracting multiple ranges inputs.
- Example:
>>> print(IntRangeSet('0:5,6:11').difference('3:100',1)) IntRangeSet('0,2')
- __xor__(ranges)¶
Returns a new IntRangeSet with the elements that appear in either the input IntRangeSet or the input ranges, but not both.
These are the same:
a ^ b
a.symmetric_difference(b)
- Example:
>>> print(IntRangeSet('0:5,6:11') ^ '3:9') IntRangeSet('0:3,5,9:11')
- add(*ranges_inputs)¶
Union zero or more ranges inputs into the current IntRangeSet.
These are the same:
a |= b
a += b
a.add(b)
a.update(b)
- Example:
>>> a = IntRangeSet('0:5,6:10') >>> a |= 5 >>> print(a) IntRangeSet('0:10')
The ‘add’ and ‘update’ methods also support unioning multiple ranges inputs,
- Example:
>>> a = IntRangeSet('0:5,6:10') >>> a.add('5','100:200') >>> print(a) IntRangeSet('0:10,100:200')
- clear()¶
Remove all ranges from this IntRangeSet.
>>> a = IntRangeSet('0:10,12') >>> a.clear() >>> print(a) IntRangeSet('')
- copy()¶
Create a deep copy of an IntRangeSet.
- count(ranges)¶
The number of times the elements of ranges appear in the IntRangeSet. Because an IntRangeSet is a set, the count will be either 0 or 1.
>>> print(IntRangeSet('100:110,1000').count('105:107,1000')) 1
- difference(*ranges_inputs)¶
Return the set difference of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.
These are the same:
a - b
a.difference(b)
- Example:
>>> print(IntRangeSet('0:5,6:11') - 1) IntRangeSet('0,2:5,6:11') >>> print(IntRangeSet('0:5,6:11') - '3:100') IntRangeSet('0:3')
The ‘difference’ method also supports subtracting multiple ranges inputs.
- Example:
>>> print(IntRangeSet('0:5,6:11').difference('3:100',1)) IntRangeSet('0,2')
- difference_update(*ranges_inputs)¶
- discard(*ranges_inputs)¶
- index(other)¶
If x is an integer, returns the index of x in a, where a is an IntRangeSet. If x is a ranges input, returns an IntRangeSet containing the index of every integer in x. Raises an IndexError if x is not in a.
* a.index(x)
>>> print(IntRangeSet('100:110,1000').index(109)) 9 >>> print(IntRangeSet('100:110,1000').index('109,100:104')) IntRangeSet('0:4,9')
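The index computation can be sketched by walking the disjoint ranges and accumulating offsets (a toy illustration of the idea, not the library's implementation):

```python
def index_of(ranges, x):
    # Position of integer x within sorted, disjoint (start, stop) ranges.
    offset = 0
    for start, stop in ranges:
        if start <= x < stop:
            return offset + (x - start)
        offset += stop - start          # skip over this whole range
    raise IndexError(f"{x} not in set")

assert index_of([(100, 110), (1000, 1001)], 109) == 9
```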
- intersection()¶
Return the intersection of an IntRangeSet and zero or more ranges inputs. The original IntRangeSet is not changed.
These are the same:
a & b
a.intersection(b)
- Example:
>>> print(IntRangeSet('0:5,6:11') & '3:8') IntRangeSet('3:5,6:8')
The ‘intersection’ method also supports intersecting multiple ranges inputs.
- Example:
>>> print(IntRangeSet('0:5,6:11').intersection('3:8','4:7')) IntRangeSet('4,6')
- intersection_update()¶
- isdisjoint(ranges)¶
True exactly when the two sets have no integer elements in common.
- Example:
>>> print(IntRangeSet('100:110,1000').isdisjoint('900:2000')) False >>> print(IntRangeSet('100:110,1000').isdisjoint('1900:2000')) True
- property isempty¶
True exactly when the IntRangeSet is empty.
>>> print(IntRangeSet().isempty) True >>> print(IntRangeSet(4).isempty) False
- issubset(ranges)¶
True exactly when the IntRangeSet is a subset of the ranges.
These are the same:
a <= b
a.issubset(b)
- Example:
>>> print(IntRangeSet('0:5,6:11') <= '-1:101') # The right-hand can be any ranges input True
Note: By definition, any set is a subset of itself.
- issuperset(*ranges_inputs)¶
True exactly when all of the ranges inputs are subsets of the IntRangeSet.
These are the same:
b in a
a.issuperset(b)
a >= b
- Example:
>>> print(3 in IntRangeSet('0:5,6:11')) True >>> print(IntRangeSet('4:7') in IntRangeSet('0:5,6:11')) False >>> '6:9' in IntRangeSet('0:5,6:11') # The left-hand of 'in' can be any ranges input True >>> print(IntRangeSet('0:5,6:11') >= '6:9') # The right-hand of can be any ranges input True
The ‘issuperset’ method also supports unioning multiple ranges inputs.
- Example:
>>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8)) True >>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8,100)) False
Note: By definition, any set is a superset of itself.
- max()¶
The largest integer element in the IntRangeSet
>>> print(IntRangeSet('0:10,12').max()) 12
Note: This is more efficient than max(IntRangeSet(‘0:10,12’)) because it is computed in constant time rather than in time linear in the number of integer elements.
- min()¶
The smallest integer element in the IntRangeSet
- Example:
>>> print(IntRangeSet('0:10,12').min()) 0
Note: This is more efficient than
min(IntRangeSet('0:10,12'))
because it is computed in constant time rather than in time linear in the number of integer elements.
- pop()¶
Remove and return the largest integer element from the IntRangeSet. Raises KeyError if the IntRangeSet is empty.
- Example:
>>> a = IntRangeSet('0:5,6:11') >>> print(a.pop()) 10 >>> print(a) IntRangeSet('0:5,6:10')
- ranges()¶
Iterate, in order, the ranges of an IntRangeSet as (start, stop) tuples.
- Example:
>>> for start,stop in IntRangeSet('0:10,100:200').ranges(): ... print("start is {0}, stop is {1}".format(start,stop)) start is 0, stop is 10 start is 100, stop is 200
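The disjoint, sorted (start, stop) form that ranges() iterates can be produced from arbitrary overlapping pairs by a standard merge pass. A sketch of that normalization (a hypothetical helper, not the library's code):

```python
def merge_ranges(pairs):
    # Normalize possibly overlapping or adjacent (start, stop) pairs into
    # the sorted, disjoint form suggested by the ranges() iterator.
    merged = []
    for start, stop in sorted(pairs):
        if merged and start <= merged[-1][1]:     # overlaps or touches the last range
            merged[-1] = (merged[-1][0], max(merged[-1][1], stop))
        else:
            merged.append((start, stop))
    return merged

assert merge_ranges([(100, 150), (140, 200), (1000, 1001)]) == [(100, 200), (1000, 1001)]
```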
- ranges_getitem(index)¶
Returns the index-th range in the collection of ranges
>>> print(IntRangeSet('-30:-20,0:10,12').ranges_getitem(1)) (0, 10)
- ranges_index(element)¶
Returns the ranges index of the range containing the element.
>>> int_range_set = IntRangeSet('-30:-20,0:10,12') >>> index = int_range_set.ranges_index(5) >>> print(index) 1 >>> int_range_set.ranges_getitem(index) (0, 10)
- property ranges_len¶
The number of contiguous ranges in the IntRangeSet
>>> print(IntRangeSet('0:10,12').ranges_len) 2
- remove(*ranges_inputs)¶
See
IntRangeSet.__isub__()
- sum()¶
The sum of the integer elements in the IntRangeSet
>>> print(IntRangeSet('0:10,12').sum()) 57
Note: This is more efficient than
sum(IntRangeSet('0:10,12'))
because it is computed in time linear in the number of ranges, rather than in the number of integer elements.
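The linear-in-ranges sum works because each contiguous range is an arithmetic series with a closed form. A sketch of that formula over (start, stop) pairs:

```python
def ranges_sum(ranges):
    # Sum of all integers covered by disjoint (start, stop) ranges, using
    # the arithmetic-series formula per range instead of enumerating elements:
    # start + (start+1) + ... + (stop-1) == (start + stop - 1) * (stop - start) / 2
    return sum((start + stop - 1) * (stop - start) // 2 for start, stop in ranges)

assert ranges_sum([(0, 10), (12, 13)]) == 57   # matches IntRangeSet('0:10,12').sum()
```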
- symmetric_difference(ranges)¶
Returns a new IntRangeSet with the elements that appear in either the input IntRangeSet or the input ranges, but not both.
These are the same:
a ^ b
a.symmetric_difference(b)
- Example:
>>> print(IntRangeSet('0:5,6:11') ^ '3:9') IntRangeSet('0:3,5,9:11')
- symmetric_difference_update(ranges)¶
- union()¶
Return the union of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.
These are the same:
a | b
a + b
a.union(b)
- Example:
>>> print(IntRangeSet('0:5,6:10') | 5) IntRangeSet('0:10')
The ‘union’ method also supports unioning multiple ranges inputs.
- Example:
>>> print(IntRangeSet('0:5,6:10').union(5,'100:200')) IntRangeSet('0:10,100:200')
- update(*ranges_inputs)¶
Union zero or more ranges inputs into the current IntRangeSet.
These are the same:
a |= b
a += b
a.add(b)
a.update(b)
- Example:
>>> a = IntRangeSet('0:5,6:10') >>> a |= 5 >>> print(a) IntRangeSet('0:10')
The ‘add’ and ‘update’ methods also support unioning multiple ranges inputs,
- Example:
>>> a = IntRangeSet('0:5,6:10') >>> a.add('5','100:200') >>> print(a) IntRangeSet('0:10,100:200')
util.pheno
¶
Deprecated: Use Pheno instead.
- pysnptools.util.pheno.loadOnePhen(filename, i_pheno=0, missing='-9', vectorize=False)¶
Deprecated: Use Pheno instead.
Load one column of a phenotype file. Removes any rows with missing data.
- Parameters:
filename (string) – name of the file
i_pheno (int) – column to return (default 0, the first column)
missing (string) – value to treat as missing
vectorize (bool) – if true, return a 1-D vector rather than a 2-D array
- Return type:
An output dictionary
The output dictionary looks like:
‘header’ : [1] array of phenotype names (only if a header line is present in the file),
‘vals’ : [N*1] array of phenotype-data,
‘iid’ : [N*2] array of family IDs and individual IDs
- pysnptools.util.pheno.loadPhen(filename, missing='-9', famid='FID', sampid='ID')¶
Deprecated: Use Pheno instead.
Load a phenotype or covariate file. Covariates use the same file format.
- Parameters:
filename (string) – name of the file
missing (string) – value to treat as missing
famid (string) – (default ‘FID’) name of the family-ID column in the header
sampid (string) – (default ‘ID’) name of the individual-ID column in the header
- Return type:
An output dictionary
The output dictionary looks like:
‘header’ : [1] array of phenotype names (only if a header line is present in the file),
‘vals’ : [N*1] array of phenotype-data,
‘iid’ : [N*2] array of family IDs and individual IDs
util.mapreduce1
Module¶
- pysnptools.util.mapreduce1.map_reduce(input_seq, mapper=<function _identity>, reducer=<class 'list'>, input_files=None, output_files=None, name=None, runner=None, nested=None)¶
Runs a function on a sequence of inputs and runs a second function on the results. Can be nested and clusterized.
- Parameters:
input_seq (a sequence) – a sequence of inputs. The sequence must support the len function and be indexable. e.g. a list, range(100)
mapper (a function) – A function to apply to each set of inputs (optional). Defaults to the identity function.
reducer (a function that takes a sequence) – A function to turn the results from the mapper to a single value (optional). Defaults to creating a list of the results.
input_files (a list) – An optional list that tells what input files are needed. The list can contain the names of files (strings), None (ignored), or objects such as
SnpReader
’s that can self-report their input files.output_files (a list) – An optional list that tells what output files will be produced. The list can contain the names of files (strings), None (ignored), or objects such as
SnpReader
’s that can self-report their output files.name (a string) – A name to be displayed if this work is done on a cluster.
runner (
Runner
) – aRunner
, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.nested (a function) – a mapper function that is itself a map_reduce. Some runners can efficiently clusterize such nested mappers.
- Return type:
The results from the reducer.
- Example:
Square the numbers 0 to 99 and report their sum, locally:
>>> from pysnptools.util.mapreduce1 import map_reduce >>> map_reduce(range(100), ... mapper=lambda x: x*x, ... reducer=sum) 328350
Compute it again, this time run on four processors:
>>> from pysnptools.util.mapreduce1.runner import LocalMultiProc >>> map_reduce(range(100), ... mapper=lambda x: x*x, ... reducer=sum, ... runner=LocalMultiProc(4)) 328350
Compute it using named functions, again using four processors:
>>> def holder1(n,runner): ... def mapper1(x): ... return x*x ... def reducer1(sequence): ... return sum(sequence) ... return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner) >>> holder1(100,LocalMultiProc(4)) 328350
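When run locally and sequentially, the map_reduce contract itself reduces to mapping then reducing. A minimal sketch of the semantics (not the real runner machinery, which also handles files and clusters):

```python
def map_reduce_local(input_seq, mapper=lambda x: x, reducer=list):
    # A sequential sketch of the map_reduce contract: apply the mapper to
    # each input, then feed the results, as a lazy sequence, to the reducer.
    return reducer(mapper(x) for x in input_seq)

# Same computation as the documented example: sum of squares of 0..99.
assert map_reduce_local(range(100), mapper=lambda x: x * x, reducer=sum) == 328350
```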
util.mapreduce1.runner.Runner
¶
- class pysnptools.util.mapreduce1.runner.Runner¶
A Runner is an object that tells a
map_reduce()
how to run on a local machine or cluster. Examples includeLocal
andLocalMultiProc
.
util.mapreduce1.runner.Local
¶
- class pysnptools.util.mapreduce1.runner.Local(mkl_num_threads=None, logging_handler=<StreamHandler <stdout> (NOTSET)>)¶
A
Runner
that runs amap_reduce()
locally. To save memory, it will feed the results of the mapper to the reducer as those results are computed.- Constructor:
- Parameters:
mkl_num_threads (number) – (default None) Limit on the number threads used by the NumPy MKL library.
- Parameters:
logging_handler (stream) – (default stdout) Where to output logging messages.
- Example:
>>> from pysnptools.util.mapreduce1 import map_reduce >>> from pysnptools.util.mapreduce1.runner import Local >>> def holder1(n,runner): ... def mapper1(x): ... return x*x ... def reducer1(sequence): ... return sum(sequence) ... return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner) >>> holder1(100,Local()) 328350
util.mapreduce1.runner.LocalMultiProc
¶
- class pysnptools.util.mapreduce1.runner.LocalMultiProc(taskcount, mkl_num_threads=None, weights=None, taskindex_to_environ=None, just_one_process=False, logging_handler=<StreamHandler <stdout> (NOTSET)>)¶
A
Runner
that runs amap_reduce()
as multiple processes on a single machine.- Constructor:
- Parameters:
taskcount (number) – The number of processes to run on.
mkl_num_threads (number) – (default None) Limit on the number threads used by the NumPy MKL library.
- weights (array of integers) – (default None) If given, tells the relative amount of work assigned to
each task. The length of the array must be taskcount. If not given, all tasks are assigned the same amount of work.
- taskindex_to_environ (function from integers to dictionaries) – (default None). If given, this
should be function from taskindex to a dictionary. The dictionary tells how to temporarily set environment variables while the task is running. The dictionary is a mapping of variables and values (both strings).
just_one_process (bool) – (default False) Divides the work as if for multiple processes, but runs sequentially in one process. Can be useful for debugging.
logging_handler (stream) – (default stdout) Where to output logging messages.
- Example:
>>> from pysnptools.util.mapreduce1 import map_reduce >>> from pysnptools.util.mapreduce1.runner import LocalMultiProc >>> def holder1(n,runner): ... def mapper1(x): ... return x*x ... def reducer1(sequence): ... return sum(sequence) ... return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner) >>> holder1(100,LocalMultiProc(4)) 328350
util.mapreduce1.runner.LocalMultiThread
¶
- class pysnptools.util.mapreduce1.runner.LocalMultiThread(taskcount, mkl_num_threads=None, just_one_process=False)¶
A
Runner
that runs amap_reduce()
as multiple threads on a single machine.Note that Python has problems running some programs efficiently on multiple threads. (Search ‘python global interpreter lock’ for details.) If this
Runner
doesn’t work, considerLocalMultiProc
instead.- Constructor:
- Parameters:
taskcount (number) – The number of threads to run on.
- Parameters:
mkl_num_threads (number) – (default None) Limit on the number threads used by the NumPy MKL library.
- Parameters:
just_one_process (bool) – (default False) Divides the work as if for multiple threads, but runs sequentially on one thread. Can be useful for debugging.
- Example:
>>> from pysnptools.util.mapreduce1 import map_reduce >>> from pysnptools.util.mapreduce1.runner import LocalMultiThread >>> def holder1(n,runner): ... def mapper1(x): ... return x*x ... def reducer1(sequence): ... return sum(sequence) ... return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner) >>> holder1(100,LocalMultiThread(4)) 328350
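The LocalMultiThread idea can be sketched with the standard library's thread pool; as the note above warns, the global interpreter lock limits the benefit for CPU-bound mappers, so this is an illustration of the pattern rather than a performance claim:

```python
from concurrent.futures import ThreadPoolExecutor

def map_reduce_threads(input_seq, mapper, reducer, taskcount=4):
    # Run the mapper on a pool of taskcount threads, then reduce the results.
    # I/O-bound mappers benefit most; pure-Python CPU work gains little.
    with ThreadPoolExecutor(max_workers=taskcount) as pool:
        return reducer(pool.map(mapper, input_seq))

assert map_reduce_threads(range(100), lambda x: x * x, sum, taskcount=4) == 328350
```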
util.mapreduce1.runner.LocalInParts
¶
- class pysnptools.util.mapreduce1.runner.LocalInParts(taskindex, taskcount, mkl_num_threads=None, weights=None, environ=None, result_file=None, run_dir='.', temp_dir=None, logging_handler=<StreamHandler <stdout> (NOTSET)>)¶
A
Runner
that runs one piece of amap_reduce()
job locally. Partial results are saved to disk. Clustering runners andLocalMultiProc
use this runner internally.- Constructor:
- Parameters:
taskindex (number) – Which piece of work to run. When 0 to taskcount-1, does map work. When taskcount, does the reduce work.
taskcount (number) – The number of pieces into which to divide the work.
mkl_num_threads (number) – (default None) Limit on the number threads used by the NumPy MKL library.
- weights (array of integers) – (default None) If given, tells the relative amount of work assigned to
each task. The length of the array must be taskcount. If not given, all tasks are assigned the same amount of work.
- environ (dictionary of variables and values) – (default None). Temporarily assigns environment variables.
both variables and values should be strings.
result_file (string) – (default None) Where to pickle the final results. If no file is given, the final results are returned, but not saved to a file.
result_dir (string) – (default None) The directory for any result_file. Defaults to the current working directory.
temp_dir (string) – (default None) The directory for partial results. Defaults to the result_dir/.working_directory.{map_reduce’s Name}.
logging_handler (stream) – (default stdout) Where to output logging messages.
- Example:
>>> from pysnptools.util.mapreduce1 import map_reduce >>> from pysnptools.util.mapreduce1.runner import LocalInParts >>> def holder1(n,runner): ... def mapper1(x): ... return x*x ... def reducer1(sequence): ... return sum(sequence) ... return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner) >>> holder1(100,LocalInParts(0,4)) #Run part 0 of 4 and save partial results to disk as '0.4.p'. >>> holder1(100,LocalInParts(1,4)) #Run part 1 of 4 and save partial results to disk as '1.4.p'. >>> holder1(100,LocalInParts(2,4)) #Run part 2 of 4 and save partial results to disk as '2.4.p'. >>> holder1(100,LocalInParts(3,4)) #Run part 3 of 4 and save partial results to disk as '3.4.p'. >>> holder1(100,LocalInParts(4,4)) #Read all the partial results, apply the reduce function, and return the result. 328350
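The map-in-parts-then-reduce pattern that LocalInParts runs can be sketched without the on-disk pickling. The striped partitioning below is one possible scheme, chosen for illustration; the real runner may divide the work differently:

```python
def run_part(input_seq, mapper, taskindex, taskcount):
    # Map work for one task: every taskcount-th input, starting at taskindex.
    return [mapper(x) for x in list(input_seq)[taskindex::taskcount]]

def reduce_parts(parts, reducer):
    # The final "taskcount" step: combine all saved partial results.
    return reducer(x for part in parts for x in part)

# Run the four map parts, then the reduce step, all in one process.
parts = [run_part(range(100), lambda x: x * x, i, 4) for i in range(4)]
assert reduce_parts(parts, sum) == 328350
```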
util.filecache
Module¶
Tools for reading and writing files, locally or across clusters.
- pysnptools.util.filecache.ip_address()¶
- pysnptools.util.filecache.ip_address_pid()¶
Return the IP address of this computer and the process ID of the executing process.
util.filecache.FileCache
¶
- class pysnptools.util.filecache.FileCache¶
- A FileCache is a class, such as LocalCache or PeerToPeer, that:
for reading, copies a (possibly remote) file to a local disk. (If the local file already exists and is up-to-date, retrieval is typically skipped.)
for writing, copies a newly-written local file to (possibly remote) storage.
provides an assortment of file- and directory-related commands.
- Example:
Suppose all machines in a cluster can read & write the storage at ‘peertopeer1/common’. Also, suppose that ‘peertopeer1/192.168.1.105’ is stored locally on machine 192.168.1.105 but is readable by the other machines in the cluster (and so on for all the machines and their IP addresses).
>>> from pysnptools.util.filecache import PeerToPeer, ip_address >>> def id_and_path_function(): ... ip = ip_address() ... return ip, 'peertopeer1/{0}'.format(ip) >>> file_cache = PeerToPeer(common_directory='peertopeer1/common',id_and_path_function=id_and_path_function) >>> file_cache PeerToPeer('peertopeer1/common',id_and_path_function=...')
Remove anything already in remote storage
>>> file_cache.rmtree()
Create a random SNP file and store it remotely.
>>> from pysnptools.snpreader import SnpGen, Dense >>> snp_gen = SnpGen(seed=123,iid_count=100,sid_count=500) >>> with file_cache.open_write('r123.100x500.dense.txt') as local_filename: ... dense1 = Dense.write(local_filename,snp_gen.read()) >>> list(file_cache.walk()) ['r123.100x500.dense.txt']
Copy back from remote storage (if needed) and then read SNP data from local file.
>>> with file_cache.open_read('r123.100x500.dense.txt') as local_filename: ... dense2 = Dense(local_filename) ... print(dense2[:3,:3].read().val) #Read 1st 3 individuals and SNPs [[ 0. nan nan] [ 0. 0. nan] [ 0. 0. 0.]]
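The open_write/open_read contract can be sketched with a toy cache in which "local" and "remote" are the same disk, so the copy steps are no-ops. This is purely illustrative and not the FileCache API:

```python
import os
import tempfile
from contextlib import contextmanager

class TinyFileCache:
    # A toy stand-in for the FileCache contract: open_write yields a local
    # path and would upload on exit; open_read would download, then yield.
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    @contextmanager
    def open_write(self, file_name):
        yield os.path.join(self.directory, file_name)   # copy-out would go here

    @contextmanager
    def open_read(self, file_name):
        yield os.path.join(self.directory, file_name)   # copy-in would go here

cache = TinyFileCache(tempfile.mkdtemp())
with cache.open_write("file1.txt") as path:
    with open(path, "w") as fp:
        fp.write("Hello")
with cache.open_read("file1.txt") as path:
    with open(path) as fp:
        assert fp.read() == "Hello"
```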
Given an appropriate module, such as
LocalCache
orPeerToPeer
, the FileCache library provides a unified way to work with any remote storage scheme. It differs from virtual file systems, such as CIFS VFS, because:FileCache works with all operating systems and requires no operating system changes.
FileCache can take advantage of the highest performance file retrieval methods (e.g. kernel-space file systems, peer-to-peer transfers, tree copies, etc).
FileCache can take advantage of the highest performance local read and write storage (e.g. SSDs)
FileCache can work on top of CIFS VFS and any other remote storage system with an appropriate module.
The downside of FileCache is that:
Writing generally requires two statements (e.g.
open_write()
,Dense.write()
) instead of just one. Likewise, reading generally requires two statements (e.g. open_read()
,SnpReader.read()
) instead of just one.
Methods & Properties:
Every FileCache, such as
LocalCache
andPeerToPeer
, has this property:name
, and these methods:file_exists()
,getmtime()
,join()
,load()
,open_read()
,open_write()
,remove()
,rmtree()
,save()
, andwalk()
. See below for details.Details of Methods & Properties:
- file_exists(file_name)¶
Tells if there is a file with this name. (A directory with that name doesn’t count.)
- Parameters:
file_name (string) – The name of the file to look for
- Return type:
bool
- Example:
>>> from pysnptools.util.filecache import LocalCache >>> file_cache = LocalCache('localcache1') >>> file_cache.rmtree() >>> file_cache.file_exists('localcache1/file1.txt') False
>>> file_cache.save('localcache1/file1.txt','Hello') >>> file_cache.file_exists('localcache1/file1.txt') True
- getmtime(file_name)¶
Return the time that the file was last modified.
- Parameters:
file_name (string) – The name of the (possibly remote) file of interest
- Return type:
number, the number of seconds since the epoch
- join(path)¶
The
FileCache
created by appending a path to the currentFileCache
.>>> from pysnptools.util.filecache import LocalCache >>> file_cache = LocalCache('localcache1') >>> file_cache LocalCache('localcache1') >>> sub = file_cache.join('sub1') >>> sub LocalCache('localcache1/sub1')
- load(file_name, updater=None)¶
Returns the contents of a file in storage as a string.
- Parameters:
file_name (string) – The name of the (possibly remote) file to read
updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by
log_in_place()
.
- Return type:
string - what was written in the file.
>>> from pysnptools.util.filecache import LocalCache >>> file_cache = LocalCache('localcache1') >>> file_cache.rmtree() >>> file_cache.save('file1.txt','Hello') >>> file_cache.load('file1.txt') 'Hello'
- property name¶
A path-like name for this FileCache.
- Return type:
string
>>> from pysnptools.util.filecache import LocalCache >>> file_cache = LocalCache('localcache1') >>> file_cache.name 'localcache1'
- open_read(file_name, updater=None)¶
Used with a ‘with’ statement to produce a local copy of a (possibly remote) file.
- Parameters:
file_name (string) – The name of the (possibly remote) file to read
updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by
log_in_place()
.
- Return type:
a local filename to read.
- Example:
>>> from pysnptools.util.filecache import LocalCache >>> file_cache = LocalCache('localcache1') >>> file_cache.rmtree() >>> file_cache.save('file1.txt','Hello') >>> with file_cache.open_read('file1.txt') as local_filename: ... with open(local_filename,'r') as fp: ... line = fp.readline() >>> line 'Hello'
- open_write(file_name, size=0, updater=None)¶
Used with a ‘with’ statement to produce a local file name that will be copied to remote storage when the ‘with’ statement is exited.
- Parameters:
file_name (string) – The name of the (possibly remote) file to which to write.
size (number) – (default 0) if given, an error will be thrown immediately if local storage doesn’t have room for that many bytes.
updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by
log_in_place()
.
- Return type:
a local filename to write.
- Example:
>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> with file_cache.open_write('file1.txt') as local_filename:
...     with open(local_filename,'w') as fp:
...         _= fp.write('Hello')
>>> file_cache.load('file1.txt')
'Hello'
- remove(file_name, updater=None)¶
Remove a file from storage. It is an error to remove a directory this way.
- Parameters:
file_name (string) – The name of the (possibly remote) file to remove.
updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by
log_in_place()
.
- rmtree(path=None, updater=None)¶
Delete all files in this
FileCache
. It is OK if there are no files.
- Parameters:
path (string) – (Default, None, the current
FileCache
). Optional path (subdirectory, not file) to start in.
updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by
log_in_place()
.
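There is no doctest for rmtree here. On a LocalCache, which stores everything on local disk, the call amounts to deleting a directory tree while tolerating its absence. A minimal standard-library sketch of those semantics (rmtree_ok_if_missing is a hypothetical helper, not part of the pysnptools API):

```python
import os
import shutil
import tempfile

def rmtree_ok_if_missing(root, path=None):
    # Hypothetical helper: delete every file under root (or root/path);
    # like FileCache.rmtree, it is OK if there are no files at all.
    full = root if path is None else os.path.join(root, path)
    shutil.rmtree(full, ignore_errors=True)  # an absent tree raises no error

root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "sub1"))
with open(os.path.join(root, "sub1", "file1.txt"), "w") as fp:
    _ = fp.write("Hello")

rmtree_ok_if_missing(root, "sub1")          # removes the populated subtree
rmtree_ok_if_missing(root, "never_existed") # a tree that never existed is fine
print(os.path.exists(os.path.join(root, "sub1")))  # False
```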
- save(file_name, contents, size=0, updater=None)¶
Write a string to a file in storage.
- Parameters:
file_name (string) – The name of the (possibly remote) file to which to write.
contents (string) – What to write to the file.
size (number) – (Default 0.) If given, an error will be raised immediately if local storage doesn’t have room for that many bytes.
updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by
log_in_place()
.
>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> file_cache.save('file1.txt','Hello')
>>> file_cache.load('file1.txt')
'Hello'
- walk(path=None)¶
Generates the relative paths of the files in the
FileCache
. It is OK if there are no files.
- Parameters:
path (string) – (Default, None, the current
FileCache
). Optional path (subdirectory, not file) to start in.
- Return type:
a generator of strings
- Example:
>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> list(file_cache.walk())
[]
>>> file_cache.save('file1.txt','Hello')
>>> list(file_cache.walk())
['file1.txt']
>>> list(file_cache.walk('no_such_sub_path'))
[]
util.filecache.LocalCache
¶
- class pysnptools.util.filecache.LocalCache(directory)¶
A
FileCache
for working with files locally.
See
FileCache
for general examples of using FileCache.
This is the simplest
FileCache
in that it stores everything on a local disk rather than storing things remotely.
- Constructor:
- Parameters:
directory (string) – The directory under which files should be written and read.
- Example:
>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> file_cache.save('sub1/file1.txt','Hello')
>>> file_cache.file_exists('sub1/file1.txt')
True
- property name¶
A path-like name for this LocalCache.
- Return type:
string
util.filecache.PeerToPeer
¶
- class pysnptools.util.filecache.PeerToPeer(common_directory, id_and_path_function, leave_space=0)¶
A
FileCache
that allows multiple machines (or processes on one machine) to share files in a peer-to-peer fashion.
See
FileCache
for general examples of using FileCache.
This is useful for a cluster with multiple machines with persistent storage and a fast network. Every time a machine writes a file, the file stays on that machine, but information about the file’s name and location is saved to a common directory. When a new machine needs the file, it looks to the common directory for a location and then copies from that location. That second location is also saved to the common directory. Later machines that need the file will randomly pick among the available locations. When a file is removed, it is only removed from the common directory; clean-up of local copies happens later, when space is needed for other reads.
- Constructor:
- Parameters:
common_directory (
FileCache
or string) – The path or
FileCache
where each file’s name and locations (but not contents) are stored. This path must be read/writable by every machine in the cluster.
id_and_path_function (function) – A function (or lambda) that deterministically returns two values: 1. an id unique to each machine (or process) and 2. a file path to this machine’s local storage that any machine in the cluster can read.
leave_space (number) – (Default 0.) How much free space must be left on the local drive after downloads and writes.
- Example:
>>> # Suppose that every machine can access a shared 'peertopeer1' and that 'peertopeer1/IPADDRESS' is local to each machine.
>>> from pysnptools.util.filecache import PeerToPeer, ip_address
>>> def id_and_path_function():
...     ip = ip_address()
...     return ip, 'peertopeer1/{0}'.format(ip)
>>> file_cache = PeerToPeer(common_directory='peertopeer1/common',id_and_path_function=id_and_path_function)
>>> file_cache.rmtree()
>>> file_cache.save('sub1/file1.txt','Hello')
>>> file_cache.file_exists('sub1/file1.txt')
True
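The location-registry idea described above can be sketched with the standard library alone. In this sketch, publish and pick_source are hypothetical names and the JSON record format is invented for illustration; none of it is the pysnptools implementation:

```python
import json
import os
import random
import tempfile

common = tempfile.mkdtemp()  # stands in for the shared common_directory

def publish(file_key, machine_id, local_path):
    # Hypothetical: record that machine_id holds a copy of the file at local_path.
    record = os.path.join(common, file_key + ".locations.json")
    locations = []
    if os.path.exists(record):
        with open(record) as fp:
            locations = json.load(fp)
    locations.append({"id": machine_id, "path": local_path})
    with open(record, "w") as fp:
        json.dump(locations, fp)

def pick_source(file_key):
    # Hypothetical: a machine that needs the file randomly picks among
    # the locations already registered in the common directory.
    record = os.path.join(common, file_key + ".locations.json")
    with open(record) as fp:
        return random.choice(json.load(fp))

publish("file1.txt", "10.0.0.1", "peertopeer1/10.0.0.1/file1.txt")
publish("file1.txt", "10.0.0.2", "peertopeer1/10.0.0.2/file1.txt")
source = pick_source("file1.txt")
print(source["id"])  # one of '10.0.0.1' or '10.0.0.2'
```

After a copy is fetched, the real PeerToPeer also registers the new location, so later readers spread their downloads across more machines.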
- property name¶
A path-like name for this PeerToPeer.
- Return type:
string
util.filecache.Hashdown
¶
- class pysnptools.util.filecache.Hashdown(url, file_to_hash={}, directory=None, allow_unknown_files=False, trust_local_files=False, _relative_directory=None)¶
A
FileCache
for downloading files with known MD5 hashes from a URL. Unloading is not supported.
See
FileCache
for general examples of using FileCache.
- Constructor:
- Parameters:
url (string) – The URL from which to download any needed files.
file_to_hash (dictionary) – A dictionary from file names to MD5 hashes.
directory (optional, string) – Local location for files. If not given, files will be stored under the system temp directory (typically controlled with the TEMP environment variable).
allow_unknown_files (optional, bool) – By default, all requested files must be in the dictionary. If True, other files can be requested. If found under the URL, they will be downloaded and an entry will be added to the dictionary.
trust_local_files (optional, bool) – By default, when allow_unknown_files is True, unknown files will be downloaded. If trust_local_files is also True, then any local files in directory will be assumed to have the correct hash.
_relative_directory (string) – Used internally.
- Example:
>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                     file_to_hash=file_to_hash)
>>> hashdown.file_exists('pysnptools/examples/toydata.5chrom.bed')
True
>>> hashdown.load('pysnptools/examples/toydata.5chrom.fam').split('\n')[0]
'per0 per0 0 0 2 0.408848'
- static load_hashdown(filename, directory=None, allow_unknown_files=False, trust_local_files=False)¶
Load a Hashdown object from a json file.
- Parameters:
filename – name of the file to load from.
directory – Local location for files. If not given, files will be stored under the system temp directory (typically controlled with the TEMP environment variable).
allow_unknown_files – By default, all requested files must be in the dictionary. If True, other files can be requested. If found under the URL, they will be downloaded and an entry will be added to the dictionary.
trust_local_files – By default, when allow_unknown_files is True, unknown files will be downloaded. If trust_local_files is also True, then any local files in directory will be assumed to have the correct hash.
- Return type:
Hashdown
- Example:
>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown1 = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                      file_to_hash=file_to_hash)
>>> hashdown1.save_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2 = Hashdown.load_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2.file_exists('pysnptools/examples/toydata.5chrom.bed')
True
>>> hashdown2.load('pysnptools/examples/toydata.5chrom.fam').split('\n')[0]
'per0 per0 0 0 2 0.408848'
- property name¶
A path-like name for this Hashdown.
- Return type:
string
>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                     file_to_hash=file_to_hash)
>>> hashdown.name
'hashdown/9ac30da2bf589db947e91744dff0ec24'
>>> hashdown.join('pysnptools').name
'hashdown/9ac30da2bf589db947e91744dff0ec24/pysnptools'
- save_hashdown(filename)¶
Save a Hashdown object to a json file.
- Parameters:
filename – name of the file to save to.
>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown1 = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                      file_to_hash=file_to_hash)
>>> hashdown1.save_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2 = Hashdown.load_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2.file_exists('pysnptools/examples/toydata.5chrom.bed')
True
>>> hashdown2.load('pysnptools/examples/toydata.5chrom.fam').split('\n')[0]
'per0 per0 0 0 2 0.408848'
- static scan_local(local_directory, url=None, logging_level=30)¶
Bootstrap a Hashdown by recursively walking a local directory and finding the local MD5 hashes. (A local hash might be wrong if the files are out of date or have OS-dependent line endings.) Typically, you’ll then want to save the result to a JSON file and then edit that JSON file manually to remove uninteresting files.
- Parameters:
local_directory – Local directory to recursively walk
url – URL to give to the Hashdown. (It will not be checked.)
logging_level – Logging level for printing progress of the walk. (Default: logging.WARNING, i.e. 30.)
- Return type:
Hashdown
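The hashing walk that scan_local performs can be illustrated with the standard library. Here md5_walk is a hypothetical stand-in, not the pysnptools implementation; it recursively hashes every file and returns a dictionary shaped like the file_to_hash that Hashdown expects:

```python
import hashlib
import os
import tempfile

def md5_walk(local_directory):
    # Hypothetical stand-in for scan_local's bookkeeping: recursively hash
    # every file, returning {relative_path: md5_hex}.
    file_to_hash = {}
    for dirpath, _dirnames, filenames in os.walk(local_directory):
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, local_directory).replace(os.sep, "/")
            with open(full, "rb") as fp:
                file_to_hash[rel] = hashlib.md5(fp.read()).hexdigest()
    return file_to_hash

root = tempfile.mkdtemp()
with open(os.path.join(root, "toydata.fam"), "w") as fp:
    _ = fp.write("per0 per0 0 0 2 0.408848")
file_to_hash = md5_walk(root)
print(sorted(file_to_hash))  # ['toydata.fam']
```

As the docstring above warns, hashes computed this way can differ across machines when text files carry OS-dependent line endings.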