pysnptools Documentation

PySnpTools: A library for reading and manipulating genetic data.

See PySnpTools’s README.md for installation instructions, documentation, and code.

Synopsis:

  • snpreader: Efficiently read genetic PLINK formats including *.bed/bim/fam and phenotype files. Also, efficiently read parts of files and standardize data.

  • distreader (new): Efficiently work with unphased BGEN format and other diploid, biallelic distribution data. Also, efficiently read parts of files. See the Distribution IPython Notebook.

  • kernelreader: Efficiently create, read, and manipulate kernel data.

  • standardizer: Specify standardizers for snpreader.

  • kernelstandardizer: Specify standardizers for kernelreader.

  • pstreader: Generalizes snpreader and kernelreader (provides the efficiency of numpy arrays with some of the flexibility of pandas)

  • util: In one line, intersect and re-order IIDs from snpreader, kernelreader and other sources. Also, efficiently extract a submatrix from an ndarray.

  • util.IntRangeSet: Efficiently manipulate ranges of integers – for example, genetic position – with set operators including union, intersection, and set difference.

  • util.mapreduce1: Run loops in parallel on multiple processes, threads, or clusters.

  • util.filecache: Automatically copy files to and from any remote storage.

Tutorial:

From PyData Conference, Seattle


snpreader Module

Tools for reading and manipulating SNP data.

snpreader.SnpReader

class pysnptools.snpreader.SnpReader(*args, **kwargs)

A SnpReader is one of three things:

  • A class such as Bed for you to specify a file with data. For example,

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk) # prints the name of the file reader
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)
>>> snp_on_disk.sid_count # prints the number of SNPS (but doesn't read any SNP values)
1015
  • A SnpData class that holds SNP data in memory, typically after reading it from disk:

>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> snpdata1 = snp_on_disk.read() #reads the SNP values
>>> type(snpdata1.val).__name__ # The val property is an ndarray of SNP values
'ndarray'
>>> print(snpdata1) # prints the name of the file reader
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False))
>>> snpdata1.iid_count #prints the number of iids (number of individuals) in this in-memory data
300
  • A subset of any SnpReader, specified with “[ iid_index , sid_index ]”, to read only some SNP values.

    It can also be used to re-order the values.

>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> subset_on_disk = snp_on_disk[[3,4],::2] # specification for a subset of the data on disk. No SNP values are read yet.
>>> print(subset_on_disk.sid_count) # prints the number of sids in this subset (but still doesn't read any SNP values)
508
>>> print(subset_on_disk) #prints a specification of 'subset_on_disk'
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2]
>>> snpdata_subset = subset_on_disk.read() # efficiently reads the specified subset of values from the disk
>>> print(snpdata_subset) # prints the name of the file reader
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2])
>>> print((int(snpdata_subset.val.shape[0]), int(snpdata_subset.val.shape[1]))) # The dimensions of the ndarray of SNP values
(2, 508)

The SnpReaders Classes

Class     | Format            | Random Access       | Suffixes       | Write method?
----------|-------------------|---------------------|----------------|--------------
SnpData   | in-memory floats  | Yes                 | n/a            | n/a
Bed       | binary, 0,1,2     | Yes (by sid)        | .bed/.bim/.fam | Yes
Pheno     | text, floats      | No                  | .txt,.phe      | Yes
Dat       | text, floats      | No                  | .dat/.map/.fam | Yes
Ped       | text, 0,1,2       | No                  | .ped/.map      | Yes
Dense     | text, 0,1,2       | No                  | .dense.txt     | Yes
SnpNpz    | binary, floats    | No                  | .snp.npz       | Yes
SnpHdf5   | binary, floats    | Yes (by sid or iid) | .snp.hdf5      | Yes
SnpMemMap | mem-mapped floats | Yes                 | .snp.memmap    | Yes
SnpGen    | generated values  | Yes (by sid)        | n/a            | n/a

Methods & Properties:

Every SnpReader, such as Bed and SnpData, has these properties: iid, iid_count, sid, sid_count, pos and these methods: read(), iid_to_index(), sid_to_index(), read_kernel(). See below for details.

SnpData is a SnpReader so it supports the above properties and methods. In addition, it supports property SnpData.val, method SnpData.standardize(), and equality testing. See below for details.

Many of the classes, such as Bed, also provide a static Bed.write() method for writing SnpData to disk.

>>> # read from Pheno, write to Bed
>>> from pysnptools.snpreader import Pheno, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_fn = example_file("pysnptools/examples/toydata.5chrom.*", "*.bed")
>>> snpdata = Bed(bed_fn)[:,::2].read() # Read every-other SNP
>>> pstutil.create_directory_if_necessary("tempdir/everyother.bed")
>>> Bed.write("tempdir/everyother.bed",snpdata,count_A1=False)   # Write data in Bed format
Bed('tempdir/everyother.bed',count_A1=False)

iids and sids:

Individuals are identified with an iid, which is an ndarray of two strings: a family ID and an individual ID. SNP locations are identified with an sid string in an ndarray. For example:

>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk.iid[:3]) # print the first three iids
[['POP1' '0']
 ['POP1' '12']
 ['POP1' '44']]
>>> print(snp_on_disk.sid[:9]) # print the first nine sids
['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
>>> print(snp_on_disk.iid_to_index([['POP1','44'],['POP1','12']])) #Find the indexes for two iids.
[2 1]
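Conceptually, such a lookup can be pictured as a dictionary built once from the iid array and then cached. This is a hypothetical sketch in plain numpy, not pysnptools' actual implementation:

```python
import numpy as np

# A small iid array like the one printed above
iid = np.array([['POP1', '0'], ['POP1', '12'], ['POP1', '44']])

# Build the (family ID, individual ID) -> index map once
lookup = {tuple(row): index for index, row in enumerate(iid)}

def iid_to_index(iid_list):
    """Hypothetical helper: map iid pairs to their row indexes."""
    return np.array([lookup[tuple(pair)] for pair in iid_list])

print(iid_to_index([['POP1', '44'], ['POP1', '12']]))  # [2 1]
```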

Selecting and Reordering Individuals and SNPs

You often don’t want to read the SNP values for all iids and sids. You can use indexing to create a subsetting SnpReader that will read only the SNP values of interest.

SnpReaders support the indexing formats supported by ndarray plus two generalizations. Here are examples of indexing with an array of indexes, with slicing, and with an array of Booleans.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> subset_snpreader_1 = snp_on_disk[[3,-1],:] #index with an array of indexes. Negatives count from end.
>>> print((subset_snpreader_1.iid_count, subset_snpreader_1.sid_count))
(2, 1015)
>>> snpdata1 = subset_snpreader_1.read() # read just the two rows of interest from the disk
>>> subset_snpreader_2 = snp_on_disk[:,:0:-2] #index with a slice
>>> print((subset_snpreader_2.iid_count, subset_snpreader_2.sid_count))
(300, 507)
>>> boolindexes = [s.startswith('23_') for s in snp_on_disk.sid] # create a Boolean index of sids that start '23_'
>>> subset_snpreader_3 = snp_on_disk[:,boolindexes] #index with array of Booleans
>>> print((subset_snpreader_3.iid_count, subset_snpreader_3.sid_count))
(300, 24)

The first generalization over what ndarray offers is full indexing on both the iid dimension and the sid dimension, in other words, full multidimensional indexing. For example,

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> subset_snpreader_4 = snp_on_disk[[3,4],:0:-2] # index on two dimensions at once
>>> print((subset_snpreader_4.iid_count, subset_snpreader_4.sid_count))
(2, 507)
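For comparison, indexing both dimensions of a plain ndarray with arrays at once requires numpy.ix_ to get this outer-product behavior:

```python
import numpy as np

a = np.arange(20).reshape(4, 5)  # stand-in for a 4-iid x 5-sid value array
# Select rows 1 and 3 crossed with columns 0, 2, and 4
sub = a[np.ix_([1, 3], [0, 2, 4])]
print(sub.shape)  # (2, 3)
```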

The second generalization is indexing on a single integer index.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> subset_snpreader_5 = snp_on_disk[5,:] #index with single integer
>>> print((subset_snpreader_5.iid_count, subset_snpreader_5.sid_count))
(1, 1015)
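Note that this differs from plain ndarray indexing, where a single integer drops the dimension. A SnpReader keeps both dimensions; the numpy equivalent is wrapping the integer in a list:

```python
import numpy as np

a = np.zeros((300, 1015))    # stand-in for 300 iids x 1015 sids
print(a[5, :].shape)         # (1015,)  - a plain ndarray drops the iid dimension
print(a[[5], :].shape)       # (1, 1015) - list indexing keeps it, like a SnpReader
```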

Indexing is also useful when you have SNP values in memory via a SnpData instance and want to copy a subset of those values. While you could instead index directly on the SnpData.val ndarray, indexing on the SnpData instance also carries along the iid and sid information.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> snpdata1 = snp_on_disk.read() # read all SNP values into memory
>>> print(snpdata1.sid[:9]) # print the first 9 sids
['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
>>> snpdata_subset = snpdata1[:,::2].read(view_ok=True,order='A') # create a copy or view with every other sid
>>> print(snpdata_subset.sid[:9])# print the first 9 sids in the subset
['1_12' '1_10' '1_28' '1_36' '1_4' '1_11' '1_32' '1_9' '1_17']

You can apply indexing on top of indexing to specify subsets of subsets of data to read. In this example, only the SNP values for every 16th sid are actually read from the disk.

>>> # These are just SnpReaders, nothing is read from disk yet
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> half_snpreader = snp_on_disk[:,::2] # a reader for half the sids
>>> quarter_snpreader = half_snpreader[:,::2] # a reader for half of half the sids
>>> sixteenth_snpreader = quarter_snpreader[:,::2][:,::2] # a reader for half of half of half of half the sids
>>> print(sixteenth_snpreader) #Print the specification of this reader
Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[:,::2][:,::2][:,::2][:,::2]
>>> # Now we read from disk. Only values for one sid in every 16 will be read.
>>> snpdata_sixteenth = sixteenth_snpreader.read()
>>> print(snpdata_sixteenth.val[0,3])
2.0
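The composition of stacked slices is plain ndarray arithmetic: four ::2 slices select the same positions as a single ::16 stride, which can be checked with numpy alone:

```python
import numpy as np

positions = np.arange(100)
composed = positions[::2][::2][::2][::2]  # four stacked half-strides
print(np.array_equal(composed, positions[::16]))  # True
```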

When Data is Read:

SNP data can be enormous so we generally avoid reading it to the degree practical. Specifically,

  • Constructing and printing a SnpReader causes no file reading. For example, these commands read no data:

    >>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
    >>> print(snp_on_disk) # Print the Bed SnpReader specification. No data is read.
    Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)
    >>> subset_on_disk = snp_on_disk[[3,4],::2] # Construct a subsetting SnpReader. No data is read.
    >>> print(subset_on_disk) # print the subset SnpReader. No data is read.
    Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False)[[3,4],::2]
    
  • Properties and methods related to the iids and sids (to the degree practical) read just some iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once. Consider these commands:

    >>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
    >>> print(snp_on_disk.sid[:9]) # without reading any SNP values data from disk, read the sid and iid data from disk, cache it, and then print the first nine sids.
    ['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
    >>> print(snp_on_disk.sid_to_index(['1_10','1_13'])) #use the cached sid information to find the indexes of '1_10' and '1_13'. (No data is read from disk.)
    [2 9]
    
  • The only methods that read SNP values from file are read() and read_kernel() (to the degree practical). For example:

    >>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
    >>> snpdata1 = snp_on_disk.read() #read all the SNP values from disk, creating a new SnpData instance that keeps these values in memory
    >>> print(snpdata1.val[0,2])# print the SNP value for the iid with index 0 and the sid with index 2. (No data is read from disk.)
    1.0
    
  • If you request the values for only a subset of the iids or sids, only that subset (to the degree practical) will be read from disk. For example:

    >>> subset_on_disk = Bed(bedfile,count_A1=False)[[3,4],::2] # Construct a subsetting SnpReader. No data is read.
    >>> snpdata_subset = subset_on_disk.read() # from disk, read the SNP values for the iids with index 3 and 4 AND sids with even numbered indexes.
    >>> print(snpdata_subset.val[0,2]) # print the SNP value with subset iid index 0 and sid index 2 (corresponding to iid index 3 and sid index 4 in the full data). No data is read from disk.
    2.0
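The index bookkeeping in the comments above can be checked with a plain numpy array of the same shape:

```python
import numpy as np

full = np.arange(300 * 1015, dtype=np.float64).reshape(300, 1015)
# iids 3 and 4 crossed with the even-indexed sids, as in the example above
subset = full[np.ix_([3, 4], range(0, 1015, 2))]
print(subset.shape)                 # (2, 508)
print(subset[0, 2] == full[3, 4])   # subset indexes [0,2] map to full indexes [3,4]
```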
    

When Data is Re-Read and Copied:

Every time you call a SnpReader’s read() method, the SnpReader re-reads the SNP value data and returns a new in-memory SnpData (with SnpData.val property containing a new ndarray of the SNP values). Likewise, when you call the kernel() method, the SnpReader re-reads the data and returns a new kernel ndarray.

Here is an example of what not to do, because it causes all the SNP value data to be read twice.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> # The following is not recommended because it inefficiently reads all the SNP values twice.
>>> print(snp_on_disk.read().val[0,2]) # read all values into a new SnpData, print a SNP value
1.0
>>> print(snp_on_disk.read().val[0,3]) # read all values (again) into a second new SnpData, print a SNP value
2.0

Here are two efficient alternatives. First, if all the SNP values can fit in memory, read them once into a SnpData and then access that SnpData multiple times.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() # read all values into a new SnpData
>>> print(snpdata1.val[0,2]) # print a SNP value from snpdata1's in-memory ndarray
1.0
>>> print(snpdata1.val[0,3]) # print another SNP value from snpdata1's in-memory ndarray.
2.0

Second, if the SNP value data is too large to fit in memory, use subsetting to read only the SNP values of interest from disk.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> print(snp_on_disk[0,2].read().val[0,0]) #Define the subset of data and read only that subset from disk.
1.0
>>> print(snp_on_disk[0,3].read().val[0,0]) #Define a second subset of data and read only that subset from disk.
2.0

Because the in-memory SnpData class is a kind of SnpReader, you may read from it, too. Doing so creates a new SnpData instance containing a copy of the SNP values in a new ndarray.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() # read all SNP values from disk into a new SnpData
>>> print(snpdata1.val is snpdata1.val) # Do the in-memory SNP values use the same memory as themselves? Yes
True
>>> snpdata2 = snpdata1.read() # copy all the SNP values into a new ndarray in a new SnpData
>>> print(snpdata2.val is snpdata1.val) # Do the two ndarrays of in-memory SNP values use the same memory?
False

Avoiding Unwanted ndarray Allocations

You may want a subset of SNP values from an in-memory SnpData, and you may know that this subset and the original SnpData can safely share the memory of the ndarray of SNP values. For this case, read() has optional parameters called view_ok and order. If you override the defaults of "view_ok=False,order='F'" with "view_ok=True,order='A'", then read() will, if practical, return a new SnpData with an ndarray that shares memory with the original ndarray. Use these parameters with care because any change to either ndarray (for example, via SnpData.standardize()) will affect the other. Also keep in mind that read() relies on ndarray's mechanisms to decide whether to actually share memory, so it may ignore your suggestion and allocate a new ndarray anyway.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Construct a Bed SnpReader. No data is read.
>>> snpdata1 = snp_on_disk.read() # read all data from disk into a SnpData with a new ndarray
>>> column01 = snpdata1[:,0:1].read(view_ok=True,order='A') #create SnpData with the data from just the first SNP. Sharing memory is OK. The memory may be laid out in any order (that is, sid-major and iid-major are both OK).
>>> import numpy as np
>>> #print np.may_share_memory(snpdata1.val, column01.val) # Do the two ndarray's share memory? They could (but currently they won't)
>>> column201 = snpdata1[:,[2,0,1]].read(view_ok=True,order='A') #create SnpData with the data from three SNPs, permuted. Sharing memory is OK.
>>> print(np.may_share_memory(snpdata1.val, column201.val)) # Do the two ndarray's share memory? No, ndarray decided that this indexing was too complex for sharing.
False
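This behavior comes from NumPy itself: basic slicing can return a view that shares memory, while fancy (list) indexing always copies. In plain numpy terms:

```python
import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)
sliced = a[:, 0:2]        # basic slice: a view sharing memory with 'a'
fancy = a[:, [2, 0, 1]]   # fancy indexing: always a fresh copy
print(np.may_share_memory(a, sliced))  # True
print(np.may_share_memory(a, fancy))   # False
```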

The read() Method

By default, read() returns an ndarray of numpy.float64 laid out in memory in F-contiguous order (iid-index varies the fastest). You may, instead, ask for numpy.float32, for C-contiguous order, or for any order. See read() for details.

The SnpData.standardize() Method

The SnpData.standardize() method, available only on SnpData, does in-place standardization of the in-memory SNP data. By default, it applies 'Unit' standardization; that is, the values for each SNP will have mean zero and standard deviation 1.0. NaN values are then filled with zero (consequently, if there are NaN values, the final standard deviation will not be exactly 1.0). Note that, for efficiency, this method works in place, actually changing values in the ndarray. Although it works in place, for convenience it also returns itself. See SnpData.standardize() for options and details.

>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> snpdata1 = snp_on_disk.read() # read all SNP values into memory
>>> print(snpdata1) # Prints the specification for this SnpData
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False))
>>> print(snpdata1.val[0,0])
2.0
>>> snpdata1.standardize() # standardize changes the values in snpdata1.val and changes the specification.
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),Unit())
>>> print('{0:.6f}'.format(snpdata1.val[0,0]))
0.229416
>>> snpdata2 = snp_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated.
>>> print('{0:.6f}'.format(snpdata2.val[0,0]))
0.229416
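The arithmetic of 'Unit' standardization can be sketched with plain numpy. This is only an illustration of the idea; pysnptools' Unit standardizer handles details (such as zero-variance SNPs and blockwise operation) that this sketch ignores:

```python
import numpy as np

col = np.array([0.0, 1.0, 2.0, np.nan])   # one SNP's values, with one missing
standardized = (col - np.nanmean(col)) / np.nanstd(col)  # mean 0, std 1 (ignoring NaN)
standardized[np.isnan(standardized)] = 0.0  # NaNs filled with zero, i.e. the mean
print(standardized)  # approximately [-1.22, 0.0, 1.22, 0.0]
```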

The read_kernel() Method

The read_kernel() method, available on any SnpReader, returns a KernelData. The val property of the KernelData is an ndarray of the (possibly standardized) SNP values multiplied with their transposed selves. When applied to a read-from-disk SnpReader, such as Bed, the method can save memory by reading (and standardizing) the data in blocks. See read_kernel() for details.

>>> from pysnptools.standardizer import Unit
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify some data on disk in Bed format
>>> kerneldata1 = snp_on_disk.read_kernel(Unit()) #Create an in-memory kernel from the snp data on disk.
>>> print('{0:.6f}'.format(kerneldata1.val[0,0]))
901.421836
>>> kerneldata2 = snp_on_disk.read_kernel(Unit(),block_size=10) #Create an in-memory kernel from the snp data on disk, but only read 10 SNPS at a time from the disk.
>>> print('{0:.6f}'.format(kerneldata2.val[0,0]))
901.421836
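The block trick works because X·Xᵀ decomposes into a sum of per-block products. A numpy-only sketch of the idea (not pysnptools' implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 12))  # 5 iids x 12 (already standardized) sids

full = X @ X.T                    # the whole kernel at once

blocked = np.zeros((5, 5))
for start in range(0, X.shape[1], 4):   # 4 sids per block
    block = X[:, start:start + 4]
    blocked += block @ block.T          # each block adds its contribution

print(np.allclose(full, blocked))  # True
```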

Details of Methods & Properties:

as_dist(max_weight=2.0, block_size=None)

Returns a pysnptools.distreader.DistReader that turns an allele count of 0, 1, or 2 into a probability distribution of [1,0,0], [0,1,0], or [0,0,1], respectively. Any other value produces a distribution of [NaN,NaN,NaN].

Parameters:
  • max_weight (number) – optional – Tells the maximum allele count. Default of 2.

  • block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.

Return type:

DistReader

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpreader = Bed(bedfile, count_A1=False)
>>> print(snpreader[0,0].read().val)
[[2.]]
>>> distreader = snpreader.as_dist(max_weight=2)
>>> print(distreader[0,0].read().val)
[[[0. 0. 1.]]]
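The 0/1/2-to-distribution mapping can be sketched with plain numpy (an illustration of the idea only, not pysnptools' implementation):

```python
import numpy as np

counts = np.array([[0.0, 1.0, 2.0, np.nan]])   # iid x sid allele counts
dist = np.full(counts.shape + (3,), np.nan)    # iid x sid x 3 distributions
for value in (0, 1, 2):
    dist[counts == value] = np.eye(3)[value]   # 0->[1,0,0], 1->[0,1,0], 2->[0,0,1]
print(dist[0, 2])  # [0. 0. 1.]
```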
property iid

An ndarray of the iids. Each iid is an ndarray of two strings (a family ID and an individual ID) that identifies an individual.

Return type:

ndarray of strings with shape [iid_count,2]

This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk.iid[:3]) # print the first three iids
[['POP1' '0']
 ['POP1' '12']
 ['POP1' '44']]
property iid_count

number of iids

Return type:

integer

This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

iid_to_index(list)

Takes a list of iids and returns a list of index numbers

Parameters:

list – list of iids

Return type:

ndarray of int

This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk
>>> print(snp_on_disk.iid_to_index([['POP1','44'],['POP1','12']])) #Find the indexes for two iids.
[2 1]
kernel(standardizer, allowlowrank=False, block_size=10000, blocksize=None, num_threads=None)

Warning

Deprecated. Use read_kernel() instead.

Returns an ndarray of size iid_count x iid_count. The returned array contains the standardized SNP values multiplied with their transposed selves.

Parameters:
  • standardizer (Standardizer) – Specify standardization to be applied before the matrix multiply. Any Standardizer may be used. Some choices include Standardizer.Identity (do nothing), Unit (make values for each SNP have mean zero and standard deviation 1.0), and Beta.

  • block_size (int or None) – optional – Default of 10000. None means to load all. Suggested number of sids to read into memory at a time.

  • num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

ndarray of size iid_count x iid_count

Calling the method again causes the SNP values to be re-read and allocates a new ndarray.

When applied to a read-from-disk SnpReader, such as Bed, the method can save memory by reading (and standardizing) the data in blocks.

If you request the values for only a subset of the sids or iids, (to the degree practical) only that subset will be read from disk.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.standardizer import Unit
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk
>>> kernel = snp_on_disk.kernel(Unit())
>>> print(((int(kernel.shape[0]),int(kernel.shape[1])), '{0:.6f}'.format(kernel[0,0])))
((300, 300), '901.421836')
property pos

An ndarray of the position information for each sid. Each element is an ndarray of three numpy.number values (chromosome, genetic distance, basepair distance).

Return type:

ndarray of float64 with shape [sid_count, 3]

This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk.pos[:3,]) # print position information for the first three sids: #The '...' is for possible space char
[[1.         0.00800801        nan]
 [1.         0.023023   1.        ]
 [1.         0.0700701  4.        ]]
read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None, _require_float32_64=True)

Reads the SNP values and returns a SnpData (with SnpData.val property containing a new ndarray of the SNP values).

Parameters:
  • order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (sid-index varies the fastest). If order is ‘A’, then the SnpData.val ndarray may be in any order (either C-, Fortran-contiguous).

  • dtype (data-type.) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the SnpData.val ndarray. (For Bed, only, it can also be numpy.int8. Hidden option _require_float32_64 must be set to False. See Bed for an example.)

  • force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.

  • view_ok (bool) – optional – If False (default), allocates new memory for the SnpData.val ndarray. If True, if practical and reading from a SnpData, will return a new SnpData with an ndarray that shares memory with the original SnpData. Typically, you'll also wish to use "order='A'" to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray (for example, via SnpData.standardize()) will affect the other. Also keep in mind that read() relies on ndarray's mechanisms to decide whether to actually share memory, so it may ignore your suggestion and allocate a new ndarray anyway.

  • num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

SnpData

Calling the method again causes the SNP values to be re-read and creates a new in-memory SnpData with a new ndarray of SNP values.

If you request the values for only a subset of the sids or iids, (to the degree practical) only that subset will be read from disk.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile, count_A1=False) # Specify SNP data on disk
>>> snpdata1 = snp_on_disk.read() # Read all the SNP data returning a SnpData instance
>>> print(type(snpdata1.val).__name__) # The SnpData instance contains a ndarray of the data.
ndarray
>>> subset_snpdata = snp_on_disk[:,::2].read() # From the disk, read SNP values for every other sid
>>> print(subset_snpdata.val[0,0]) # Print the first SNP value in the subset
2.0
>>> subsub_snpdata = subset_snpdata[:10,:].read(order='A',view_ok=True) # Create an in-memory subset of the subset with SNP values for the first ten iids. Share memory if practical.
>>> import numpy as np
>>> # print np.may_share_memory(subset_snpdata.val, subsub_snpdata.val) # Do the two ndarray's share memory? They could. Currently they won't.
read_kernel(standardizer=None, block_size=None, order='A', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)

Returns a KernelData whose KernelData.val property is an ndarray of the standardized SNP values multiplied with their transposed selves.

Parameters:
  • standardizer (Standardizer) – (required) Specify standardization to be applied before the matrix multiply. Any Standardizer may be used. Some choices include Standardizer.Identity (do nothing), Unit (make values for each SNP have mean zero and standard deviation 1.0), and Beta.

  • block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.

Return type:

KernelData

Calling the method again causes the SNP values to be re-read and allocates a new KernelData.

When applied to a read-from-disk SnpReader, such as Bed, the method can save memory by reading (and standardizing) the data in blocks.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.standardizer import Unit
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk
>>> kerneldata1 = snp_on_disk.read_kernel(Unit())
>>> print((int(kerneldata1.iid_count), '{0:.6f}'.format(kerneldata1.val[0,0])))
(300, '901.421836')
property row_property

Defined as a zero-width array for compatibility with PstReader, but not used.

property sid

An ndarray of the sids. Each sid is a string that identifies a SNP.

Return type:

ndarray (length sid_count) of strings

This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False)
>>> print(snp_on_disk.sid[:9]) # print the first nine sids
['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
property sid_count

number of sids

Return type:

integer

This property (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

sid_to_index(list)

Takes a list of sids and returns a list of index numbers

Parameters:

list (list of strings) – list of sids

Return type:

ndarray of int

This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bedfile,count_A1=False) # Specify SNP data on disk
>>> print(snp_on_disk.sid_to_index(['1_10','1_13'])) #Find the indexes for two sids.
[2 9]
property val_shape

Tells the shape of the value for a given individual and SNP. For SnpReaders, this always returns None, meaning a single scalar value.

snpreader.Bed

class pysnptools.snpreader.Bed(filename, count_A1=None, iid=None, sid=None, pos=None, num_threads=None, skip_format_check=False, fam_filename=None, bim_filename=None, chrom_map={'MT': 26, 'X': 23, 'XY': 25, 'Y': 24})

A SnpReader for random-access reads of Bed/Bim/Fam files from disk.

See SnpReader for details and examples.

The format is described here.

Constructor:
Parameters:
  • filename (string) – The *.bed file to read. The ‘.bed’ suffix is optional. The related *.bim and *.fam files will also be read.

  • count_A1 (bool) – Tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

The following three options are never needed, but can be used to avoid reading large ‘.fam’ and ‘.bim’ files when their information is already known.

  • iid (an array of strings) – The SnpReader.iid information. If not given, reads info from ‘.fam’ file.

  • sid (an array of strings) – The SnpReader.sid information. If not given, reads info from ‘.bim’ file.

  • pos (optional, an array of strings) – The SnpReader.pos information. If not given, reads info from ‘.bim’ file.

  • num_threads (optional, int) – The number of threads with which to read data. Defaults to all available processors.

    Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

  • skip_format_check (bool) – By default (False), checks that the ‘.bed’ file starts with the expected bytes the first time any file (‘.bed’, ‘.fam’, or ‘.bim’) is opened.

  • fam_filename (optional, string) – The *.fam file to read. Defaults to the bed filename with the suffix replaced.

  • bim_filename (optional, string) – The *.bim file to read. Defaults to the bed filename with the suffix replaced.

  • chrom_map (optional, dictionary) – A dictionary from non-numeric chromosome names to numbers. Defaults to the PLINK mapping, namely, plink_chrom_map.
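The environment-variable priority for num_threads could be resolved along these lines (an illustrative helper, not part of the pysnptools API):

```python
import os

def resolve_num_threads(default=None):
    # Check the documented environment variables in priority order
    # (illustrative helper, not part of pysnptools).
    for name in ("PST_NUM_THREADS", "NUM_THREADS", "MKL_NUM_THREADS"):
        value = os.environ.get(name)
        if value is not None:
            return int(value)
    return default

os.environ.pop("PST_NUM_THREADS", None)
os.environ["NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "8"
print(resolve_num_threads())  # NUM_THREADS wins: 4
```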

Constants

  • plink_chrom_map

    A dictionary that maps ‘X’ to 23, ‘Y’ to 24, ‘XY’ to 25, and ‘MT’ to 26.

  • reverse_plink_chrom_map

    A dictionary that maps 23 to ‘X’, 24 to ‘Y’, 25 to ‘XY’, and 26 to ‘MT’.

Methods beyond SnpReader

The SnpReader.read() method returns a SnpData with a SnpData.val ndarray. By default, this ndarray is numpy.float64. Optionally, it can be numpy.float32. For Bed, however, it can also be numpy.int8 with missing values represented by -127.

When reading, any chromosome and position values of 0 (the PLINK standard for missing) will be represented in pos as NaN.

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/distributed_bed_test1_X.*","*.bed")
>>> snp_on_disk = Bed(bedfile, count_A1=False)
>>> snpdata1 = snp_on_disk.read() # Read into the default, a float64 ndarray
>>> snpdata1.val.dtype
dtype('float64')
>>> snpdata1 = snp_on_disk.read(dtype='int8',_require_float32_64=False) #Read into an 'int8' ndarray.
>>> snpdata1.val.dtype
dtype('int8')
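A plain-NumPy sketch of what the int8 encoding implies: converting values that use the -127 missing sentinel into a float array with NaN (made-up data, independent of any Bed file):

```python
import numpy as np

# Made-up int8 genotype values as Bed might return them; -127 marks missing.
val_int8 = np.array([[0, 1, -127],
                     [2, -127, 1]], dtype=np.int8)

# Convert to float64, mapping the -127 sentinel to NaN.
val_float = val_int8.astype(np.float64)
val_float[val_int8 == -127] = np.nan
print(val_float[0])  # [ 0.  1. nan]
```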
static write(filename, snpdata, count_A1=False, force_python_only=False, _require_float32_64=True, num_threads=None, reverse_chrom_map={})

Writes a SnpData to Bed format and returns the Bed.

Parameters:
  • filename (string) – the name of the file to create

  • snpdata (SnpData) – The in-memory data that should be written to disk.

  • count_A1 (bool) – Tells if it should count the number of A1 alleles (the PLINK standard) or the number of A2 alleles. False is the current default, but in the future the default will change to True.

  • force_python_only (bool) – Defaults to False. Tells to use Python code rather than faster Rust code. (Might be useful for debugging).

  • _require_float32_64 (bool) – Defaults to True. Requires that snpdata’s dtype is float32 or float64. (False is useful for writing int8 data.)

  • num_threads (int) – Maximum number of threads to use. Currently ignored and number of threads is 1.

  • reverse_chrom_map (dictionary) – Dictionary from chromosome number to chromosome string to write in the *.bim file. Defaults to empty dictionary.

Return type:

Bed

Any pos values of NaN will be written as 0, the PLINK standard for missing chromosome and position values.

>>> from pysnptools.snpreader import Pheno, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_fn = example_file("pysnptools/examples/toydata.5chrom.*", "*.bed")
>>> snpdata = Bed(bed_fn)[:,::2].read() # Read every-other SNP
>>> pstutil.create_directory_if_necessary("tempdir/everyother.bed")
>>> Bed.write("tempdir/everyother.bed",snpdata,count_A1=False)   # Write data in Bed format
Bed('tempdir/everyother.bed',count_A1=False)
>>> # Can write from an int8 array, too.
>>> import numpy as np
>>> from pysnptools.snpreader import SnpData
>>> snpdata_int = SnpData(val=np.int_(snpdata.val).astype('int8'),iid=snpdata.iid,sid=snpdata.sid,pos=snpdata.pos,_require_float32_64=False)
>>> snpdata_int.val.dtype
dtype('int8')
>>> Bed.write("tempdir/everyother.bed",snpdata_int,count_A1=False,_require_float32_64=False)
Bed('tempdir/everyother.bed',count_A1=False)

snpreader.SnpData

class pysnptools.snpreader.SnpData(iid, sid, val, pos=None, name=None, parent_string=None, copyinputs_function=None, xp=None, _require_float32_64=True)

A SnpReader for holding SNP values (or similar values) in-memory, along with related iid and sid information. It is usually created by calling the SnpReader.read() method on another SnpReader, for example, Bed. It can also be constructed directly.

See SnpReader for details and examples.

Constructor:
Parameters:
  • iid (an array of string pairs) – The SnpReader.iid information.

  • sid (an array of strings) – The SnpReader.sid information.

  • val (a 2-D array of floats) – The SNP values

  • pos (optional, an array of numeric triples) – The SnpReader.pos information

  • name (optional, string) – Information to be displayed about the origin of this data

  • copyinputs_function (optional, function) – Used internally by optional clustering code

  • xp (optional, a Python module or string) – The array module that controls val, for example, ‘numpy’ or ‘cupy’. Defaults to numpy. Can also be set with the ‘ARRAY_MODULE’ environment variable.

Example:

>>> from pysnptools.snpreader import SnpData
>>> snpdata = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]])
>>> print((snpdata.val[0,1], snpdata.iid_count, snpdata.sid_count))
(2.0, 2, 3)

Equality:

Two SnpData objects are equal if their four arrays (SnpData.val, SnpReader.iid, SnpReader.sid, and SnpReader.pos) are ‘array_equal’. (Their ‘name’ does not need to be the same). If either SnpData.val contains NaN, the objects will not be equal. However, SnpData.allclose() can be used to treat NaN as regular values.

Example:

>>> import numpy as np
>>> from pysnptools.snpreader import SnpData
>>> snpdata1 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata2 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata1 == snpdata2) #True, because all the arrays have the same values.
True
>>> print(snpdata1.val is snpdata2.val) #False, because the two arrays have different memory.
False
>>> snpdata3 = SnpData(iid=[['a','0'],['b','0']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata4 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata3 == snpdata4) #False, because the iids are different.
False
>>> snpdata5 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata6 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata5 == snpdata6) #False, because the val's contain NaN
False
>>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True

Methods beyond SnpReader

allclose(value, equal_nan=True)
Parameters:
  • value (SnpData) – Other object with which to compare.

  • equal_nan (bool) – (Default: True) Tells if NaN in SnpData.val should be treated as regular values when testing equality.

>>> import numpy as np
>>> snpdata5 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata6 = SnpData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]], pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True
>>> print(snpdata5.allclose(snpdata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal.
False
standardize(standardizer=Unit(), block_size=None, return_trained=False, force_python_only=False, num_threads=None)

Does in-place standardization of the in-memory SNP data. By default, it applies ‘Unit’ standardization, that is: the values for each SNP will have mean zero and standard deviation 1.0. NaN values are then filled with zero, the mean (consequently, if there are NaN values, the final standard deviation will not be exactly 1.0). Note that, for efficiency, this method works in-place, actually changing values in the ndarray. Although it works in place, for convenience it also returns the SnpData.

Parameters:
  • standardizer (Standardizer) – optional – Specify standardization to be applied. Any Standardizer may be used. Some choices include Unit (default, makes values for each SNP have mean zero and standard deviation 1.0) and Beta.

  • block_size (None) – optional – Deprecated.

  • return_trained (bool) – If true, returns a second value containing a constant Standardizer trained on this data.

  • force_python_only (bool) – optional – If true, will use pure Python instead of faster C++ libraries.

  • num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

SnpData (standardizes in place, but for convenience, returns ‘self’)

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snp_on_disk = Bed(bed_file,count_A1=False) # Specify some data on disk in Bed format
>>> snpdata1 = snp_on_disk.read() # read all SNP values into memory
>>> print(snpdata1) # Prints the specification for this SnpData
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False))
>>> print(snpdata1.val[0,0])
2.0
>>> snpdata1.standardize() # standardize changes the values in snpdata1.val and changes the specification.
SnpData(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),Unit())
>>> print('{0:.6f}'.format(snpdata1.val[0,0]))
0.229416
>>> snpdata2 = snp_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated.
>>> print('{0:.6f}'.format(snpdata2.val[0,0]))
0.229416
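The ‘Unit’ standardization described above can be sketched in plain NumPy. This is a conceptual sketch of the assumed semantics (per-SNP mean 0, standard deviation 1, missing values filled with the mean), not the pysnptools implementation:

```python
import numpy as np

# Toy iid-by-sid values; one missing entry.
val = np.array([[0., 2., np.nan],
                [1., 1., 1.],
                [2., 0., 0.]])

mean = np.nanmean(val, axis=0)          # per-SNP mean, ignoring NaN
std = np.nanstd(val, axis=0)            # per-SNP standard deviation
std[std == 0] = 1.0                     # guard against constant SNPs
standardized = (val - mean) / std
standardized[np.isnan(standardized)] = 0.0  # fill missing with the mean (now 0)
print(standardized.mean(axis=0))        # each column's mean is ~0
```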
train_standardizer(apply_in_place, standardizer=Unit(), force_python_only=False, num_threads=None)

Deprecated since version 0.2.23: Use standardize() with return_trained=True instead.

property val

The 2D NumPy array of floats that represents the values of the SNPs. You can get or set this property.

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:5,:].read() #read data for first 5 iids
>>> print(snpdata.val[4,100]) #print one of the SNP values
2.0

snpreader.Pheno

class pysnptools.snpreader.Pheno(input, iid_if_none=None, missing=None)

A SnpReader for reading “alternative” phenotype files from disk.

See SnpReader for general examples of using SnpReaders.

This text format is described here and looks like:

FID    IID      qt1   bmi    site
F1     1110     2.3   22.22  2
F2     2202     34.12 18.23  1
...

where the heading row is optional.
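A minimal sketch of parsing this text layout by hand (illustrative only; Pheno does this for you):

```python
import io
import numpy as np

# The same layout as the example above, with a header row present.
text = """FID IID qt1 bmi
F1 1110 2.3 22.22
F2 2202 34.12 18.23
"""

rows = [line.split() for line in io.StringIO(text) if line.strip()]
sid = rows[0][2:]                                   # phenotype names
iid = [(r[0], r[1]) for r in rows[1:]]              # (family id, individual id)
val = np.array([[float(v) for v in r[2:]] for r in rows[1:]])
print(sid, val.shape)  # ['qt1', 'bmi'] (2, 2)
```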

Constructor:
Parameters:
  • input (string) – The phenotype file to read.

  • iid_if_none (None or array of strings) – An SnpReader.iid to use if the file is empty.

  • missing (None or string) – The value in the file that represents missing data.

For backwards compatibility with older code, input can be a dictionary (instead of a filename) with these fields:

  • ‘header’ : [1] array phenotype name,

  • ‘vals’ : [N*1] array of phenotype data,

  • ‘iid’ : [N*2] array of family IDs and individual IDs

Example:

>>> from pysnptools.snpreader import Pheno, Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> pheno_file = example_file('pysnptools/examples/toydata.phe')
>>> data_on_disk = Pheno(pheno_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 1)

Methods beyond SnpReader

static write(filename, snpdata, missing='NaN', sep='\t')

Writes a SnpData to Pheno format and returns the Pheno.

Parameters:
  • filename (string) – the name of the file to create

  • snpdata (SnpData) – The in-memory data that should be written to disk.

  • missing (string) – value to treat as missing (default ‘NaN’)

  • sep (string) – the separator (default tab)

Return type:

Pheno

>>> from pysnptools.snpreader import Pheno, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read()  # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.phe")
>>> Pheno.write("tempdir/toydata10.txt",snpdata)       # Write data in Pheno format
Pheno('tempdir/toydata10.txt')

snpreader.Ped

class pysnptools.snpreader.Ped(filename, missing='0')

A SnpReader for reading Ped-formatted (and the related Map-formatted) files from disk.

See SnpReader for general examples of using SnpReaders.

This format is described here and looks like:

FAM001  1  0 0  1  2  A A  G G  A C 
FAM001  2  0 0  1  2  A A  A G  0 0 
...

The direction of the encoding from allele pair to 0,1,2 is arbitrary. That is, if the alleles are “A” and “G”, then “G G” could be 0 and “A A” could be 2 or vice versa. The pair “A G” will always be 1.
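The arbitrary encoding direction can be illustrated with a tiny helper (hypothetical, not part of pysnptools): the value counts how many alleles in the pair match whichever allele is chosen as the counted one.

```python
def encode(pair, counted_allele):
    # Count how many alleles in the pair match the counted allele.
    a, b = pair
    return int(a == counted_allele) + int(b == counted_allele)

print(encode(("A", "A"), "A"))  # 2 when "A" is the counted allele
print(encode(("A", "A"), "G"))  # 0 when "G" is the counted allele
print(encode(("A", "G"), "A"))  # heterozygous is 1 either way
```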

Constructor:
Parameters:
  • filename (string) – The Ped file to read.

  • missing (string) – The value in the file that represents missing data.

Example:

>>> from pysnptools.snpreader import Ped
>>> from pysnptools.util import example_file # Download and return local file name
>>> ped_file = example_file('pysnptools/examples/toydata.ped')
>>> data_on_disk = Ped(ped_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10000)

Methods beyond SnpReader

static write(filename, snpdata)

Writes a SnpData to Ped format. The values must be 0,1,2. The direction of the encoding to allele pairs is arbitrary. This means that if a SnpData is written in Ped format and then read back, 0’s may become 2’s and 2’s may become 0’s (1’s will stay 1’s). Returns the Ped.

Parameters:
  • filename (string) – the name of the file to create

  • snpdata (SnpData) – The in-memory data that should be written to disk.

Return type:

Ped

>>> from pysnptools.snpreader import Ped, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read()  # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.ped")
>>> Ped.write("tempdir/toydata10.ped",snpdata)            # Write data in Ped format
Ped('tempdir/toydata10.ped')

snpreader.Dat

class pysnptools.snpreader.Dat(filename, skiprows=0)

A SnpReader for reading Dat/Fam/Map-formatted files from disk.

See SnpReader for general examples of using SnpReaders.

This is a text format that can store any numeric values. (In contrast, Bed and Ped can only store 0,1,2, and missing). Its Dat files look like:

null_0          j       n       0.333   1       2
null_100        j       n       2       1       1
null_200        j       n       0       nan     1
...

Its Map and Fam files are described here.

Constructor:
Parameters:
  • filename (string) – The Dat file to read.

  • skiprows (int) – Number of lines to skip before reading. Defaults to 0.

Example:

>>> from pysnptools.snpreader import Dat
>>> from pysnptools.util import example_file # Download and return local file name
>>> dat_file = example_file("pysnptools/examples/toydata.*","*.dat")
>>> data_on_disk = Dat(dat_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10000)

Methods beyond SnpReader

static write(filename, snpdata)

Writes a SnpData to dat/fam/map format and returns the Dat.

Parameters:
  • filename (string) – the name of the file to create

  • snpdata (SnpData) – The in-memory data that should be written to disk.

Return type:

Dat

>>> from pysnptools.snpreader import Dat, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read()  # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.dat")
>>> Dat.write("tempdir/toydata10.dat",snpdata)              # Write data in dat/fam/map format
Dat('tempdir/toydata10.dat')

snpreader.Dense

class pysnptools.snpreader.Dense(filename, extract_iid_function=<function zero_family>, extract_sid_pos_function=<function zero_pos>)

A SnpReader for reading *.dense.txt (0,1,2 text files) from disk.

See SnpReader for general examples of using SnpReaders.

This format looks like:

var     4006    9570    22899   37676   41236   41978   55159   66828...
1-10004-rs12354060      22222222...
1-707348-rs12184279     222222?2...
1-724325-rs12564807     00000000...
...

where rows are sids and columns are iids.
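A sketch of decoding one row of this character-per-value layout, assuming ‘?’ marks a missing value (illustrative only; Dense does this for you):

```python
import numpy as np

# One row of values from the format above; '?' marks a missing value.
row = "222?2202"
val = np.array([float(c) if c in "012" else np.nan for c in row])
print(val)  # [ 2.  2.  2. nan  2.  2.  0.  2.]
```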

Constructor:
Parameters:
  • filename (string) – The Dense file to read.

  • extract_iid_function (a function) – A function that breaks a column id from the file into a family id and an individual id. Defaults to setting the family id to 0 and using the whole column id as the iid.

  • extract_sid_pos_function (a function) – A function that breaks a row id from the file into (chromosome, sid, genetic distance, basepair distance). Defaults to setting the SnpReader.pos information to 0 and using the whole row id as the sid.

Example:

>>> from pysnptools.snpreader import Dense
>>> from pysnptools.util import example_file # Download and return local file name
>>> dense_file = example_file("pysnptools/examples/toydata100.dense.txt")
>>> data_on_disk = Dense(dense_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 100)

Methods beyond SnpReader

static write(filename, snpdata, join_iid_function=<function just_case>, join_sid_pos_function=<function just_sid>)

Writes a SnpData to Dense format. The values must be 0,1,2 (or missing). Returns a Dense with the default extractors.

Parameters:
  • filename (string) – the name of the file to create

  • snpdata (SnpData) – The in-memory data that should be written to disk.

  • join_iid_function (a function) – function to turn a family id and individual id into a file id for columns. Defaults to ignoring the family id and using the individual id as the column id.

  • join_sid_pos_function (a function) – function to turn a sid and pos data into a file id for rows. Defaults to ignoring the SnpReader.pos information and using the sid as the row id.

Return type:

Dense

>>> from pysnptools.snpreader import Dense, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read()  # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.dense.txt")
>>> Dense.write("tempdir/toydata10.dense.txt",snpdata)        # Write data in Dense format
Dense('tempdir/toydata10.dense.txt')

snpreader.SnpHdf5

class pysnptools.snpreader.SnpHdf5(*args, **kwargs)

A SnpReader for reading *.snp.hdf5 files from disk.

See SnpReader for general examples of using SnpReaders.

The general HDF5 format is described here. The SnpHdf5 format stores val, iid, sid, and pos information in Hdf5 format.

Constructor:
Parameters:
  • filename (string) – The SnpHdf5 file to read.

Example:

>>> from pysnptools.snpreader import SnpHdf5
>>> from pysnptools.util import example_file # Download and return local file name
>>> hd5f_file = example_file("pysnptools/examples/toydata.snpmajor.snp.hdf5")
>>> data_on_disk = SnpHdf5(hd5f_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10000)

Methods beyond SnpReader

static write(filename, snpdata, hdf5_dtype=None, sid_major=True)

Writes a SnpData to SnpHdf5 format and returns the SnpHdf5.

Parameters:
  • filename (string) – the name of the file to create

  • snpdata (SnpData) – The in-memory data that should be written to disk.

  • hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.

  • sid_major (bool) – Tells if vals should be stored on disk in sid_major (default) or iid_major format.

Return type:

SnpHdf5

>>> from pysnptools.snpreader import SnpHdf5, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bedfile,count_A1=False)[:,:10].read()     # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.snp.hdf5")
>>> SnpHdf5.write("tempdir/toydata10.snp.hdf5",snpdata)        # Write data in SnpHdf5 format
SnpHdf5('tempdir/toydata10.snp.hdf5')

snpreader.SnpNpz

class pysnptools.snpreader.SnpNpz(*args, **kwargs)

A SnpReader for reading *.snp.npz files from disk.

See SnpReader for general examples of using SnpReaders.

The general NPZ format is described here. The SnpNpz format stores val, iid, sid, and pos information in NPZ format.

Constructor:
Parameters:
  • filename (string) – The SnpNpz file to read.

Example:

>>> from pysnptools.snpreader import SnpNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> npz_file = example_file('pysnptools/examples/toydata10.snp.npz')
>>> data_on_disk = SnpNpz(npz_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 10)

Methods beyond SnpReader

static write(filename, snpdata)

Writes a SnpData to SnpNpz format and returns the SnpNpz.

Parameters:
  • filename (string) – the name of the file to create

  • snpdata (SnpData) – The in-memory data that should be written to disk.

Return type:

SnpNpz

>>> from pysnptools.snpreader import SnpNpz, Bed
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpdata = Bed(bed_file,count_A1=False)[:,:10].read()     # Read first 10 snps from Bed format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.snp.npz")
>>> SnpNpz.write("tempdir/toydata10.snp.npz",snpdata)          # Write data in SnpNpz format
SnpNpz('tempdir/toydata10.snp.npz')

snpreader.SnpMemMap

class pysnptools.snpreader.SnpMemMap(*args, **kwargs)

A SnpData that keeps its data in a memory-mapped file. This allows working with data larger than fits in main memory.

See SnpData for general examples of using SnpData.

Constructor:
Parameters:

filename (string) – The *.snp.memmap file to read.

Also see SnpMemMap.empty() and SnpMemMap.write().

Example:

>>> from pysnptools.snpreader import SnpMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>> mem_map_file = example_file('pysnptools/examples/tiny.snp.memmap')
>>> snp_mem_map = SnpMemMap(mem_map_file)
>>> print(snp_mem_map.val[0,1], snp_mem_map.iid_count, snp_mem_map.sid_count)
2.0 2 3

Methods inherited from SnpData

Methods beyond SnpReader

static empty(iid, sid, filename, pos=None, order='F', dtype=<class 'numpy.float64'>)

Create an empty SnpMemMap on disk.

Parameters:
  • iid (an array of string pairs) – The SnpReader.iid information

  • sid (an array of strings) – The SnpReader.sid information

  • filename (string) – name of memory-mapped file to create

  • pos (an array of numeric triples) – optional – The additional SnpReader.pos information associated with each sid. Default: None

  • order (string or None) – {‘F’ (default), ‘C’}, optional – Specify the order of the ndarray.

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the SnpMemMap.val ndarray.

Return type:

SnpMemMap

>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.snpreader import SnpMemMap
>>> filename = "tempdir/tiny.snp.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> snp_mem_map = SnpMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64)
>>> snp_mem_map.val[:,:] = [[0.,2.,0.],[0.,1.,2.]]
>>> snp_mem_map.flush()
property filename

The name of the memory-mapped file

flush()

Flush SnpMemMap.val to disk and close the file. (If values or properties are accessed again, the file will be reopened.)

>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.snpreader import SnpMemMap
>>> filename = "tempdir/tiny.snp.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> snp_mem_map = SnpMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64)
>>> snp_mem_map.val[:,:] = [[0.,2.,0.],[0.,1.,2.]]
>>> snp_mem_map.flush()
property offset

The byte position in the file where the memory-mapped values start.

(The disk space before this is used to store SnpReader.iid, etc. information. This property is useful when interfacing with, for example, external Fortran and C matrix libraries.)
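One way such an offset can be used, sketched with numpy.memmap on a made-up file layout (illustrative header size and data; not the actual *.snp.memmap format):

```python
import os
import tempfile
import numpy as np

# Write a fake file: a 16-byte header followed by a 2x3 float64 matrix.
offset = 16
fd, filename = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"\0" * offset)                   # header placeholder
    np.arange(6, dtype=np.float64).tofile(f)  # the matrix values

# Map the values directly, skipping the header via `offset`.
mm = np.memmap(filename, dtype=np.float64, mode="r", offset=offset, shape=(2, 3))
print(mm[1, 2])  # 5.0
```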

property val

The 2D NumPy memmap array of floats that represents the values. You can get this property, but cannot set it (except with itself).

>>> from pysnptools.snpreader import SnpMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>> mem_map_file = example_file('pysnptools/examples/tiny.snp.memmap')
>>> snp_mem_map = SnpMemMap(mem_map_file)
>>> print(snp_mem_map.val[0,1])
2.0
static write(filename, snpreader, standardizer=Identity(), order='A', dtype=None, block_size=None, num_threads=None)

Writes a SnpReader to SnpMemMap format.

Parameters:
  • filename (string) – the name of the file to create

  • snpreader (SnpReader) – The data that should be written to disk.

Return type:

SnpMemMap

>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> from pysnptools.snpreader import Bed, SnpMemMap
>>> bed_file = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> bed = Bed(bed_file)
>>> pstutil.create_directory_if_necessary("tempdir/toydata.5chrom.snp.memmap")
>>> SnpMemMap.write("tempdir/toydata.5chrom.snp.memmap",bed)      # Write bed in SnpMemMap format
SnpMemMap('tempdir/toydata.5chrom.snp.memmap')

snpreader.SnpGen

class pysnptools.snpreader.SnpGen(seed, iid_count, sid_count, chrom_count=22, cache_file=None, block_size=None, sid_batch_size=None)

A SnpReader that generates deterministic “random” SNP data on the fly.

See SnpReader for general examples of using SnpReader.

Constructor:
Parameters:
  • seed (number) – The random seed that (along with block_size) determines the SNP values.

  • iid_count (number) – the number of iids (number of individuals)

  • sid_count (number) – the number of sids (number of SNPs)

  • chrom_count (number) – the number of chromosomes to generate (must be 22 or fewer)

  • cache_file (string) – (default None) If provided, tells where to cache the common iid, sid, and pos information. Using it can save time.

  • block_size (number or None) – (default None) Tells how many SNPs to generate at once. The default picks a block_size such that iid_count * block_size is about 100,000.
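The default block-size heuristic just described might look like this (the exact formula is an assumption for illustration, not taken from the pysnptools source):

```python
def default_block_size(iid_count, target=100_000):
    # Pick block_size so that iid_count * block_size is about `target`.
    return max(1, target // iid_count)

print(default_block_size(1000))  # 100
```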

Example:

>>> from pysnptools.snpreader import SnpGen
>>> #Prepare to generate data for 1000 individuals and 1,000,000 SNPs
>>> snp_gen = SnpGen(seed=332,iid_count=1000,sid_count=1000*1000)
>>> print(snp_gen.iid_count,snp_gen.sid_count)
1000 1000000
>>> snp_data = snp_gen[:,200*1000:201*1000].read() #Generate for all users and for SNPs 200K to 201K
>>> print(snp_data.val[1,1], snp_data.iid_count, snp_data.sid_count)
0.0 1000 1000
Also See:

snp_gen()

snpreader.DistributedBed

class pysnptools.snpreader.DistributedBed(storage)

A class that implements the SnpReader interface. It stores Bed-like data in pieces on storage. When you request data, it retrieves only the needed pieces.

Constructor:
Parameters:

storage (string or FileCache) – Tells where the DistributedBed data is stored. A string can be given and will be interpreted as the path to a directory. A FileCache instance can be given, which provides a method to specify cluster-distributed storage.

Example:

>>> import os
>>> from pysnptools.snpreader import DistributedBed
>>> from pysnptools.util import example_file # Download and return local file name
>>> folder = os.path.dirname(example_file('pysnptools/examples/toydataSkip10.distributedbed/*'))
>>> data_on_disk = DistributedBed(folder)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 1000)
static write(storage, snpreader, piece_per_chrom_count=1, updater=None, runner=None)

Uploads from any Bed-like data to cluster storage for efficient retrieval later. If some of the contents already exists in storage, it skips uploading that part of the contents. (To avoid this behavior, clear the storage.)

Parameters:
  • storage (string or FileCache or None.) – Tells where to store SNP data. A string can be given and will be interpreted as the path of a local directory to use for storage. (The local directory will not be automatically erased and so must be user managed.) A FileCache instance can be given, which provides a method to specify cluster-distributed storage. (FileCache’s will not be automatically erased and must be user managed.) If None, the storage will be in an automatically-erasing temporary directory. (If the TEMP environment variable is set, Python places the temp directory under it.)

  • snpreader (SnpReader) – A Bed or other SnpReader with values of 0,1,2, or missing. (Note that this differs from most other write methods that take a SnpData)

  • piece_per_chrom_count (A number) – The number of pieces in which to store the data from each chromosome. Data is split across SNPs. For example, if piece_per_chrom_count is set to 100 and 22 chromosomes are uploaded, then data will be stored in 2200 pieces. Later, when data is requested, only the pieces necessary for the request will be downloaded to local storage.

  • updater (A function or lambda) – A single argument function to write logging message to, for example, the function created by log_in_place().

  • runner (Runner) – a Runner, optional: Tells how to run. (Note that Local and LocalMultiProc are good options.) If not given, the function is run locally.

Return type:

DistributedBed

>>> from pysnptools.snpreader import DistributedBed, Bed
>>> import os
>>> import shutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> directory = 'tempdir/toydataSkip10.distributedbed'
>>> if os.path.exists(directory):
...     shutil.rmtree(directory)
>>> bedfile = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> snpreader = Bed(bedfile,count_A1=False)[:,::10]  # Read every 10 snps from Bed format
>>> DistributedBed.write(directory,snpreader,piece_per_chrom_count=5)  # Write data in DistributedBed format
DistributedBed(LocalCache('tempdir/toydataSkip10.distributedbed'))

kernelreader Module

Tools for reading and manipulating kernels. A kernel is a matrix from iid to iid that typically tells the relatedness of pairs of individuals.
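The idea can be sketched in plain NumPy: from a standardized iid-by-sid matrix X, a kernel such as K = X Xᵀ / sid_count measures pairwise similarity between individuals (a conceptual sketch with toy data, not how kernelreader computes it):

```python
import numpy as np

# Toy standardized SNP data: 5 individuals by 100 SNPs.
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 100))

# An iid-by-iid kernel: similarity of each pair of individuals.
K = X @ X.T / X.shape[1]
print(K.shape, np.allclose(K, K.T))  # (5, 5) True -- a kernel is symmetric
```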

kernelreader.KernelReader

class pysnptools.kernelreader.KernelReader(*args, **kwargs)

A KernelReader is one of three things:

  • A class such as KernelNpz for you to specify a file with data. For example,

    >>> from pysnptools.kernelreader import KernelNpz
    >>> from pysnptools.util import example_file # Download and return local file name
    >>>
    >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz')
    >>> kernel_on_disk = KernelNpz(kernel_file)
    >>> print(kernel_on_disk) # prints specification for reading from file
    KernelNpz('...pysnptools/examples/toydata.kernel.npz')
    >>> kernel_on_disk.iid_count # prints the number of iids (but doesn't read any kernel values)
    500
    
  • A KernelData class that holds kernel data in memory, typically after computing from a SnpReader or reading it from a KernelReader:

    >>> # Compute kernel from a SnpReader
    >>> from pysnptools.snpreader import Bed
    >>> from pysnptools.standardizer import Unit
    >>> from pysnptools.util import example_file # Download and return local file name
    >>>
    >>> bed_file = example_file('tests/datasets/all_chr.maf0.001.N300.*','*.bed')
    >>> snp_on_disk = Bed(bed_file,count_A1=False)
    >>> kerneldata1 = snp_on_disk.read_kernel(Unit()) #reads the SNP values and computes the kernel
    >>> type(kerneldata1.val).__name__ # The val property is an ndarray of kernel values
    'ndarray'
    >>> print(kerneldata1) # prints the specification of the in-memory kernel information
    KernelData(SnpKernel(Bed(...tests/datasets/all_chr.maf0.001.N300.bed',count_A1=False),standardizer=Unit()))
    >>> kerneldata1.iid_count #prints the number of iids (number of individuals) in this in-memory data
    300
    >>> # Read kernel from a KernelReader
    >>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz')
    >>> kernel_on_disk = KernelNpz(kernel_file)
    >>> kerneldata2 = kernel_on_disk.read() #reads the kernel values
    >>> print(kerneldata2) # prints the specification of the in-memory kernel information
    KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz'))
    >>> kerneldata2.iid_count #prints the number of iids (number of individuals) in this in-memory data
    500
    
  • A subset of any KernelReader, specified with “[ iid_index ]” (or specified with “[ iid0_index , iid1_index ]”), to read only some kernel values. It can also be used to re-order the values.

    >>> kernel_on_disk = KernelNpz(kernel_file)
    >>> subset_on_disk1 = kernel_on_disk[[3,4]] # specification for a subset of the data on disk. No kernel values are read yet.
    >>> print(subset_on_disk1.iid_count) # prints the number of iids in this subset (but still doesn't read any kernel values)
    2
    >>> print(subset_on_disk1) #prints a specification of 'subset_on_disk1'
    KernelNpz(...pysnptools/examples/toydata.kernel.npz')[[3,4],[3,4]]
    >>> kerneldata_subset = subset_on_disk1.read() # efficiently (if possible) reads the specified subset of values from the disk
    >>> print(kerneldata_subset) # prints the specification of the in-memory kernel information
    KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz')[[3,4],[3,4]])
    >>> print((int(kerneldata_subset.val.shape[0]), int(kerneldata_subset.val.shape[1]))) # The dimensions of the ndarray of kernel values
    (2, 2)
    >>> subset_on_disk2 = kernel_on_disk[[3,4],::2] # specification for a subset of the data on disk. No kernel values are read yet.
    >>> print((subset_on_disk2.iid0_count, subset_on_disk2.iid1_count))
    (2, 250)
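
The bracket notation above applies one index list to both axes unless two lists are given. A NumPy-only sketch of the equivalent indexing (an illustration, not how KernelReader is implemented):

```python
import numpy as np

k = np.arange(25, dtype=float).reshape(5, 5)  # a toy 5x5 kernel

# kernel[[3, 4]] keeps iids 3 and 4 on *both* axes
sub = k[np.ix_([3, 4], [3, 4])]
print(sub.shape)   # (2, 2)

# kernel[[3, 4], ::2] uses a different index list per axis
sub2 = k[np.ix_([3, 4], [0, 2, 4])]
print(sub2.shape)  # (2, 3)
```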
    

The KernelReaders Classes

  Class        Format     Random Access   Suffixes       Write method?
  ----------   --------ic ge-----------   ------------   -------------
  KernelData   in-memory  Yes             n/a            n/a
  KernelNpz    binary     No              .kernel.npz    Yes
  KernelHdf5   binary     Yes             .kernel.hdf5   Yes
  Identity     n/a        Yes             n/a            No
  SnpKernel    depends    depends         n/a            No

Methods & Properties:

Every square KernelReader, such as KernelNpz and KernelData, has these properties: iid, iid_count, and these methods: read(), and iid_to_index(). A square kernel is one that has the same iid list for both its rows and columns.

More generally, a KernelReader can have one iid list for its rows and a different iid list for its columns, so these properties and methods are also defined: iid0, iid1, iid0_count, iid1_count, iid0_to_index(), and iid1_to_index().

See below for details.
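
The square-kernel convention can be sketched with a hypothetical TinyKernel class (an illustration, not part of PySnpTools): a single iid property is only safe to expose when the row and column iid lists match.

```python
import numpy as np

class TinyKernel:
    """Sketch of the square-kernel convenience property (not the real class)."""
    def __init__(self, iid0, iid1, val):
        self.iid0 = np.asarray(iid0)
        self.iid1 = np.asarray(iid1)
        self.val = np.asarray(val)

    @property
    def iid(self):
        # Only a square kernel (same rows and columns) has a single iid list
        if not np.array_equal(self.iid0, self.iid1):
            raise ValueError("kernel is not square; use iid0/iid1")
        return self.iid0

square = TinyKernel([['f', 'a'], ['f', 'b']],
                    [['f', 'a'], ['f', 'b']],
                    np.eye(2))
print(len(square.iid))  # 2
```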

KernelData is a KernelReader so it supports the above properties and methods. In addition, it supports property KernelData.val, method KernelData.standardize(), and equality testing.

See below for details.

Some of the classes, such as KernelNpz, also provide a static KernelNpz.write() method for writing KernelData.

>>> # create a kernel from a Bed file and write to KernelNpz format
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> from pysnptools.standardizer import Unit
>>> import pysnptools.util as pstutil
>>> bed_file = example_file('pysnptools/examples/toydata.5chrom.bed')
>>> kerneldata = Bed(bed_file,count_A1=False).read_kernel(Unit())     # Create a kernel from the data in the Bed file
>>> pstutil.create_directory_if_necessary("tempdir/toydata.kernel.npz")
>>> KernelNpz.write("tempdir/toydata.kernel.npz",kerneldata)      # Write data in KernelNpz format
KernelNpz('tempdir/toydata.kernel.npz')

iids:

Individual are identified with an iid, which is a ndarray of two strings: a family ID and an individual ID. For example:

>>> kernel_on_disk = KernelNpz(kernel_file)
>>> print(kernel_on_disk.iid[:3]) # print the first three iids
[['per0' 'per0']
 ['per1' 'per1']
 ['per2' 'per2']]
>>> print(kernel_on_disk.iid_to_index([['per2','per2'],['per1','per1']])) #Find the indexes for two iids.
[2 1]
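
One plausible way such an iid-to-index lookup can work, sketched with plain NumPy and a dictionary (an illustration, not PySnpTools' implementation):

```python
import numpy as np

iid = np.array([['per0', 'per0'], ['per1', 'per1'], ['per2', 'per2']])

# Build a lookup table once, then reuse it for every query
lookup = {tuple(pair): i for i, pair in enumerate(iid)}

wanted = [['per2', 'per2'], ['per1', 'per1']]
indexes = np.array([lookup[tuple(pair)] for pair in wanted])
print(indexes)  # [2 1]
```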

KernelReader is a kind of PstReader. See the documentation for PstReader to learn about:

  • When Data is Read

  • When Data is Re-Read and Copied

  • Avoiding Unwanted ndarray Allocations

  • Creating Subsetting PstReaders with Indexing

The read() Method

By default, read() returns a ndarray of numpy.float64 laid out in memory in F-contiguous order (iid0-index varies the fastest). You may, instead, ask for numpy.float32 or for C-contiguous order or any order. See read() for details.
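
These order and dtype options map directly onto NumPy concepts, as this standalone sketch shows:

```python
import numpy as np

val = np.zeros((500, 500), order='F')   # F-contiguous: first index varies fastest
print(val.flags['F_CONTIGUOUS'])  # True

c_val = np.ascontiguousarray(val)       # convert to C-contiguous order
print(c_val.flags['C_CONTIGUOUS'])  # True

small = val.astype(np.float32)          # request the smaller dtype instead
print(small.dtype)  # float32
```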

The KernelData.standardize() Method

The KernelData.standardize() method, available only on KernelData, does in-place standardization of the in-memory kernel data. The method multiplies the values by a scalar factor such that the diagonal sums to iid_count. Although it works in place, for convenience it also returns itself. See KernelData.standardize() for details.

>>> kernel_on_disk = KernelNpz(kernel_file)
>>> kerneldata1 = kernel_on_disk.read() # read all kernel values into memory
>>> print(np.diag(kerneldata1.val).sum())
5000000.0
>>> kerneldata1.standardize() # standardize changes the values in kerneldata1.val
KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz'))
>>> print(np.diag(kerneldata1.val).sum())
500.0
>>> kerneldata2 = kernel_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated.
>>> print(np.diag(kerneldata2.val).sum())
500.0
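
The scaling itself is simple to sketch with NumPy (assuming the factor described above; PySnpTools' implementation may differ in details):

```python
import numpy as np

k = np.array([[4.0, 1.0],
              [1.0, 4.0]])
iid_count = k.shape[0]

# Scale so the diagonal sums to iid_count, as standardize() does
factor = iid_count / np.diag(k).sum()
k *= factor

print(np.diag(k).sum())  # 2.0
```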

Details of Methods & Properties:

property iid

A ndarray of the iids. Each iid is a ndarray of two strings (a family ID and an individual ID) that identifies an individual. Assumes the kernel is square, so will throw an exception if the row iids are different from the column iids.

Return type:

ndarray (length iid_count) of ndarray (length 2) of strings

This property (to the degree practical) reads only iid data from the disk, not kernel value data. Moreover, the iid data is read from file only once.

Example:

>>> from pysnptools.kernelreader import KernelNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>>
>>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz')
>>> kernel_on_disk = KernelNpz(kernel_file)
>>> print(kernel_on_disk.iid[:3]) # print the first three iids
[['per0' 'per0']
 ['per1' 'per1']
 ['per2' 'per2']]
property iid0

A ndarray of the row iids. See iid

property iid0_count

number of row iids. See iid_count

Return type:

integer

iid0_to_index(list)

Takes a list of row iids and returns a list of index numbers. See iid_to_index

property iid1

A ndarray of the column iids. See iid

property iid1_count

number of column iids. See iid_count

Return type:

integer

iid1_to_index(list)

Takes a list of column iids and returns a list of index numbers. See iid_to_index

property iid_count

The number of iids. Assumes the kernel is square, so will throw an exception if the row iids are different from the column iids.

Return type:

integer

This property (to the degree practical) reads only iid data from the disk, not kernel value data. Moreover, the iid data is read from file only once.

iid_to_index(list)

Takes a list of iids and returns a list of index numbers. Assumes the kernel is square, so will throw an exception if the row iids are different from the column iids.

Parameters:

list – list of iids

Return type:

ndarray of int

This method (to the degree practical) reads only iid from the disk, not kernel value data. Moreover, the iid data is read from file only once.

Example:

>>> from pysnptools.kernelreader import KernelNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> npz_file = example_file('pysnptools/examples/toydata.kernel.npz')
>>> kernel_on_disk = KernelNpz(npz_file)
>>> print(kernel_on_disk.iid_to_index([['per2','per2'],['per1','per1']])) #Find the indexes for two iids.
[2 1]
read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)

Reads the kernel values and returns a KernelData (with KernelData.val property containing a new ndarray of the kernel values).

Parameters:
  • order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid0-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (iid1-index varies the fastest). If order is ‘A’, then the KernelData.val ndarray may be in any order (either C-, Fortran-contiguous, or even discontiguous).

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the KernelData.val ndarray.

  • force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.

  • view_ok (bool) – optional – If False (default), allocates new memory for the KernelData.val’s ndarray. If True, if practical and reading from a KernelData, will return a new KernelData with a ndarray that shares memory with the original KernelData’s. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray (for example, via KernelData.standardize()) will affect the other. Also keep in mind that read() relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.

  • num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

KernelData

Calling the method again causes the kernel values to be re-read and creates a new in-memory KernelData with a new ndarray of kernel values.

If you request the values for only a subset of the iids, (to the degree practical) only that subset will be read from disk.

Example:

>>> from pysnptools.kernelreader import KernelNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> npz_file = example_file('pysnptools/examples/toydata.kernel.npz')
>>> kernel_on_disk = KernelNpz(npz_file)
>>> kerneldata1 = kernel_on_disk.read() # Read all the kernel data returning a KernelData instance
>>> print(type(kerneldata1.val).__name__) # The KernelData instance contains a ndarray of the data.
ndarray
>>> subset_kerneldata = kernel_on_disk[::2].read() # From the disk, read kernel values for every other iid
>>> print('{0:.6f}'.format(subset_kerneldata.val[0,0])) # Print the first kernel value in the subset
9923.069928
>>> subsub_kerneldata = subset_kerneldata[:10].read(order='A',view_ok=True) # Create an in-memory subset of the subset with kernel values for the first ten iids. Share memory if practical.
>>> import numpy as np
>>> #print(np.may_share_memory(subset_kerneldata.val, subsub_kerneldata.val)) # Do the two ndarray's share memory? They could. Currently they won't.
property val_shape

Tells the shape of the value for a given pair of individuals. For KernelReaders, always returns None, meaning a single scalar value.

kernelreader.KernelData

class pysnptools.kernelreader.KernelData(iid=None, iid0=None, iid1=None, val=None, name=None, parent_string=None, xp=None)

A KernelReader for holding kernel values in-memory, along with related iid information. It is usually created by calling the SnpReader.read_kernel() method on a SnpReader, for example, Bed. It can also be created by calling the KernelReader.read() method on another KernelReader, for example, KernelNpz or SnpKernel. It can also be constructed directly.

See KernelReader for details and examples.

Constructor:
Parameters:
  • iid (an array of string pairs) – The KernelReader.iid information.

  • iid0 (an array of string pairs) – The KernelReader.iid0 information.

  • iid1 (an array of strings pairs) – The KernelReader.iid1 information.

  • val (a 2-D array of floats) – The kernel values

  • name (optional, string) – Information to be displayed about the origin of this data

  • xp (optional, a Python module or string) – The array module that controls val, for example, ‘numpy’ or ‘cupy’. Defaults to numpy. Can also be set with the ‘ARRAY_MODULE’ environment variable.

If iid is provided, don’t provide iid0 and iid1. Likewise, if iid0 and iid1 are provided, don’t provide iid.

Example:

>>> from pysnptools.kernelreader import KernelData
>>> kerneldata = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]])
>>> print((kerneldata.val[0,1], kerneldata.iid_count))
(0.5, 2)

Equality:

Two KernelData objects are equal if their three arrays (iid0, iid1, and KernelData.val) are ‘array_equal’. (Their names do not need to be the same.) If either KernelData.val contains NaN, the objects will not be equal. However, KernelData.allclose() can be used to treat NaN as regular values.

Example:

>>> import numpy as np
>>> from pysnptools.kernelreader import KernelData
>>> kerneldata1 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]])
>>> kerneldata2 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]])
>>> print(kerneldata1 == kerneldata2) #True, because all the arrays have the same values.
True
>>> print(kerneldata1.val is kerneldata2.val) #False, because the two arrays have different memory.
False
>>> kerneldata3 = KernelData(iid=[['a','0'],['b','0']], val=[[1.,.5],[.5,1.]])
>>> kerneldata4 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]])
>>> print(kerneldata3 == kerneldata4) #False, because the iids are different.
False
>>> kerneldata5 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]])
>>> kerneldata6 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]])
>>> print(kerneldata5 == kerneldata6) #False, because the val's contain NaN
False
>>> print(kerneldata5.allclose(kerneldata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True
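
The NaN behavior mirrors NumPy's own comparison functions, as this standalone sketch shows:

```python
import numpy as np

a = np.array([1.0, 0.5, np.nan])
b = np.array([1.0, 0.5, np.nan])

# By default, NaN != NaN, so arrays containing NaN never compare equal
print(np.array_equal(a, b))               # False
# With equal_nan=True, matching NaNs are treated as equal values
print(np.allclose(a, b, equal_nan=True))  # True
```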

Methods beyond KernelReader

allclose(value, equal_nan=True)
Parameters:
  • value (KernelData) – Other object with which to compare.

  • equal_nan (bool) – (Default: True) Tells if NaN in KernelData.val should be treated as regular values when testing equality.

>>> import numpy as np
>>> kerneldata5 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]])
>>> kerneldata6 = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,np.nan]])
>>> print(kerneldata5.allclose(kerneldata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True
>>> print(kerneldata5.allclose(kerneldata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal.
False
standardize(standardizer=DiagKtoN(), return_trained=False, force_python_only=False, num_threads=None)

Does in-place standardization of the in-memory kernel data. The method multiplies the values by a scalar factor such that the diagonal sums to iid_count. Although it works in place, for convenience it also returns the KernelData.

Return type:

KernelData (standardizes in place, but for convenience, returns ‘self’)

>>> from pysnptools.kernelreader import KernelNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> import numpy as np
>>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz')
>>> kernel_on_disk = KernelNpz(kernel_file)
>>> kerneldata1 = kernel_on_disk.read() # read all kernel values into memory
>>> print(np.diag(kerneldata1.val).sum())
5000000.0
>>> kerneldata1.standardize() # standardize changes the values in kerneldata1.val
KernelData(KernelNpz(...pysnptools/examples/toydata.kernel.npz'))
>>> print(np.diag(kerneldata1.val).sum())
500.0
>>> kerneldata2 = kernel_on_disk.read().standardize() # Read and standardize in one expression with only one ndarray allocated.
>>> print(np.diag(kerneldata2.val).sum())
500.0
property val

The 2D NumPy array of floats that represents the values of the kernel. You can get or set this property.

>>> from pysnptools.kernelreader import KernelData
>>> kerneldata = KernelData(iid=[['fam0','iid0'],['fam0','iid1']], val=[[1.,.5],[.5,1.]])
>>> print((kerneldata.val[0,1], kerneldata.iid_count))
(0.5, 2)

kernelreader.KernelNpz

class pysnptools.kernelreader.KernelNpz(*args, **kwargs)

A KernelReader for reading *.kernel.npz files from disk.

See KernelReader for general examples of using KernelReaders.

The general NPZ format is described here. The KernelNpz format stores val, iid0, and iid1 information in NPZ format.

Constructor:
Parameters:
  • filename (string) – The KernelNpz file to read.

Example:

>>> from pysnptools.kernelreader import KernelNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> npz_file = example_file('pysnptools/examples/toydata.kernel.npz')
>>> data_on_disk = KernelNpz(npz_file)
>>> print(data_on_disk.iid_count)
500

Methods beyond KernelReader

static write(filename, kerneldata)

Writes a KernelData to KernelNpz format and returns the KernelNpz.

Parameters:
  • filename (string) – the name of the file to create

  • kerneldata (KernelData) – The in-memory data that should be written to disk.

Return type:

KernelNpz

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.standardizer import Unit
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file('pysnptools/examples/toydata.5chrom.*','*.bed')
>>> kerneldata = Bed(bed_file,count_A1=False).read_kernel(Unit())     # Create a kernel from the data in the Bed file
>>> pstutil.create_directory_if_necessary("tempdir/toydata.kernel.npz")
>>> KernelNpz.write("tempdir/toydata.kernel.npz",kerneldata)      # Write data in KernelNpz format
KernelNpz('tempdir/toydata.kernel.npz')

kernelreader.KernelHdf5

class pysnptools.kernelreader.KernelHdf5(*args, **kwargs)

A KernelReader for reading *.kernel.hdf5 files from disk.

See KernelReader for general examples of using KernelReaders.

The general HDF5 format is described here. The KernelHdf5 format stores val, iid0, and iid1 information in HDF5 format.

Constructor:
Parameters:
  • filename (string) – The KernelHdf5 file to read.

Example:

>>> from pysnptools.kernelreader import KernelHdf5
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file('pysnptools/examples/toydata.kernel.hdf5')
>>> data_on_disk = KernelHdf5(hdf5_file)
>>> print(data_on_disk.iid_count)
500

Methods beyond KernelReader

static write(filename, kerneldata, hdf5_dtype=None, sid_major=True)

Writes a KernelData to KernelHdf5 format and returns the KernelHdf5.

Parameters:
  • filename (string) – the name of the file to create

  • kerneldata (KernelData) – The in-memory data that should be written to disk.

  • hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.

  • sid_major (bool) – Tells if vals should be stored on disk in sid_major (default) or iid_major format.

Return type:

KernelHdf5

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.standardizer import Unit
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file('pysnptools/examples/toydata.5chrom.*','*.bed')
>>> kerneldata = Bed(bed_file,count_A1=False).read_kernel(Unit())     # Create a kernel from the data in the Bed file
>>> pstutil.create_directory_if_necessary("tempdir/toydata.kernel.hdf5")
>>> KernelHdf5.write("tempdir/toydata.kernel.hdf5",kerneldata)          # Write data in KernelHdf5 format
KernelHdf5('tempdir/toydata.kernel.hdf5')

kernelreader.Identity

class pysnptools.kernelreader.Identity(iid, test=None)

A KernelReader that represents an identity matrix. No memory for the values is allocated until Identity.read() is called.

See KernelReader for general examples of using KernelReaders.

Constructor:
Parameters:
  • iid (an array of string pairs) – The iids for the rows (and, if test is not given, also for the columns).

  • test (an array of string pairs) – optional – The iids for the columns.
Example:

>>> from pysnptools.kernelreader import Identity
>>> identity = Identity(iid=[['fam0','iid0'],['fam0','iid1']])
>>> print(identity.iid_count)
2
>>> print(identity.read().val) # '...' is possible space character
[[...1.  0.]
 [...0.  1.]]
>>> identity = Identity(iid=[['fam0','iid0'],['fam0','iid1'],['fam0','iid2']],test=[['fam0','iid1'],['fam0','iid3']])
>>> print((identity.iid0_count, identity.iid1_count))
(3, 2)
>>> print(identity.read().val) # '...' is possible space character
[[...0.  0.]
 [...1.  0.]
 [...0.  0.]]
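
Conceptually, the rectangular case puts a 1.0 wherever a row iid matches a column iid, as in this NumPy-only sketch (an illustration, not PySnpTools' implementation):

```python
import numpy as np

iid0 = [('fam0', 'iid0'), ('fam0', 'iid1'), ('fam0', 'iid2')]  # rows
iid1 = [('fam0', 'iid1'), ('fam0', 'iid3')]                    # columns

# Entry [i, j] is 1.0 exactly when row iid i equals column iid j
val = np.array([[1.0 if r == c else 0.0 for c in iid1] for r in iid0])
print(val)  # only the ('fam0', 'iid1') row/column pair is 1.0
```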

kernelreader.SnpKernel

class pysnptools.kernelreader.SnpKernel(snpreader, standardizer=None, block_size=None)

A KernelReader that creates a kernel from a SnpReader. No SNP data will be read until the SnpKernel.read() method is called. Use block_size to avoid ever reading all the SNP data into memory at once.

See KernelReader for general examples of using KernelReaders.

Constructor:
Parameters:
  • snpreader (SnpReader) – The SNP data

  • standardizer (Standardizer) – How the SNP data should be standardized

  • block_size (optional, int) – The number of SNPs to read at a time.

If block_size is not given, then all SNP data will be read at once.
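
The idea behind block_size can be sketched with plain NumPy: a linear kernel is a sum of per-block products, so only one block of SNPs need be in memory at a time. (An illustration only; the real implementation also applies the standardizer as it reads.)

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.standard_normal((50, 1000))  # standardized SNPs: 50 iids x 1000 sids

# Accumulate the kernel one block of SNPs at a time
block_size = 200
k = np.zeros((50, 50))
for start in range(0, z.shape[1], block_size):
    block = z[:, start:start + block_size]  # only this slice need be in memory
    k += block @ block.T

assert np.allclose(k, z @ z.T)  # same kernel as reading everything at once
```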

Example:

>>> from pysnptools.snpreader import Bed
>>> from pysnptools.standardizer import Unit
>>> from pysnptools.util import example_file # Download and return local file name
>>> bed_file = example_file('pysnptools/examples/toydata.5chrom.*','*.bed')
>>> snp_on_disk = Bed(bed_file,count_A1=False)     # A Bed file is specified, but nothing is read from disk
>>> kernel_on_disk = SnpKernel(snp_on_disk, Unit(),block_size=500)  # A kernel is specified, but nothing is read from disk
>>> print(kernel_on_disk) #Print the specification
SnpKernel(Bed(...pysnptools/examples/toydata.5chrom.bed',count_A1=False),standardizer=Unit(),block_size=500)
>>> print(kernel_on_disk.iid_count)                                  # iid information is read from disk, but not SNP data
500
>>> kerneldata = kernel_on_disk.read().standardize()                # SNPs are read and Unit standardized, 500 at a time, to create a kernel, which is then standardized
>>> print('{0:.6f}'.format(kerneldata.val[0,0]))
0.992307
property pos

The SnpReader.pos property of the SNP data.

read_snps(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)

Reads the SNP values, applies the standardizer, and returns a SnpData.

Parameters:
  • order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (sid-index varies the fastest). If order is ‘A’, then the SnpData.val ndarray may be in any order (either C-, Fortran-contiguous, or even discontiguous).

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the SnpData.val ndarray.

  • force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.

  • view_ok (bool) – optional – If False (default), allocates new memory for the SnpData.val’s ndarray. If True, if practical and reading from a SnpData, will return a new SnpData with a ndarray that shares memory with the original SnpData’s. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray (for example, via SnpData.standardize()) will affect the other. Also keep in mind that read() relies on ndarray’s mechanisms to decide whether to actually share memory and so it may ignore your suggestion and allocate a new ndarray anyway.

  • num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

SnpData

property sid

The SnpReader.sid property of the SNP data.

property sid_count

The SnpReader.sid_count property of the SNP data.

standardizer Module

standardizer.Standardizer

class pysnptools.standardizer.Standardizer

A Standardizer is a class, such as Unit or Beta, used by the SnpData.standardize() and SnpReader.read_kernel() methods to standardize SNP data.

Example:

Read and standardize SNP data.

>>> from pysnptools.standardizer import Unit
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit())
>>> print('{0:.6f}'.format(snpdata1.val[0,0]))
0.229416

Create a kernel from SNP data on disk.

>>> bedfile2 = example_file("pysnptools/examples/toydata.5chrom.*","*.bed")
>>> kerneldata = Bed(bedfile2,count_A1=False).read_kernel(Unit())
>>> print('{0:.6f}'.format(kerneldata.val[0,0]))
9923.069928

A Standardizer can also return a constant (‘trained’) SNP standardizer that can be applied to other SnpData.

>>> snp_whole = Bed(bedfile,count_A1=False)
>>> train_idx, test_idx = range(10,snp_whole.iid_count), range(0,10) #test on the first 10, train on the rest
>>> snp_train, trained_standardizer = Unit().standardize(snp_whole[train_idx,:].read(),return_trained=True)
>>> print('{0:.6f}'.format(snp_train.val[0,0]))
0.233550
>>> print(trained_standardizer.stats[0,:]) #The mean and stddev of the 1st SNP on the training data # '...' for a possible space character
[...1.94827586  0.22146953]
>>> snp_test = snp_whole[test_idx,:].read().standardize(trained_standardizer)
>>> snp_test.val[0,0]
0.23354968324845735

Standardize any Numpy array.

>>> val = Bed(bedfile,count_A1=False).read().val
>>> print(val[0,0])
2.0
>>> val = Unit().standardize(val)
>>> print('{0:.6f}'.format(val[0,0]))
0.229416

Details of Methods & Properties:

property is_constant

Tells if a standardizer’s statistics are constant. For example, Unit_Trained contains a pre-specified mean and stddev for each SNP, so Unit_Trained.is_constant is True. Unit, on the other hand, measures the mean and stddev for each SNP, so Unit.is_constant is False.

Return type:

bool

standardize(snps, block_size=None, return_trained=False, force_python_only=False, num_threads=None)

Applies standardization, in place, to SnpData (or a NumPy array). For convenience also returns the SnpData (or a NumPy array).

Parameters:
  • snps (SnpData (or a NumPy array)) – SNP values to standardize

  • block_size (None) – Not used

  • return_trained (bool) – If true, returns a second value containing a constant standardizer trained on this data.

  • force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.

  • num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

SnpData (or a NumPy array), (optional) constant Standardizer

standardizer.Unit

class pysnptools.standardizer.Unit

A Standardizer to unit standardize SNP data. For each sid, the mean of the values becomes zero with standard deviation 1. NaN values are then filled with zero, the mean. (Consequently, if there are NaN values, the final standard deviation will not be exactly 1.)
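
A rough NumPy sketch of unit standardization for a single sid (PySnpTools' exact handling of missing values and degrees of freedom may differ):

```python
import numpy as np

val = np.array([0.0, 1.0, 2.0, np.nan])  # one sid's values across iids

mean = np.nanmean(val)     # 1.0, ignoring the missing value
std = np.nanstd(val)       # stddev of [0, 1, 2], ignoring the missing value
val = (val - mean) / std   # now mean 0, stddev 1 (ignoring the NaN)
val[np.isnan(val)] = 0.0   # fill missing values with the new mean, zero

print(val)
```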

See Standardizer for more information about standardization.

>>> from pysnptools.standardizer import Unit
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit())
>>> print('{0:.6f}'.format(snpdata1.val[0,0]))
0.229416

standardizer.Beta

class pysnptools.standardizer.Beta(a, b)

A Standardizer to beta standardize SNP data.

See Standardizer for more information about standardization.

Constructor:
Parameters:
  • a (float) – The a parameter of the beta distribution

  • b (float) – The b parameter of the beta distribution

>>> from pysnptools.standardizer import Beta
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Beta(1,25))
>>> print('{0:.6f}'.format(snpdata1.val[0,0]))
0.680802

standardizer.Identity

class pysnptools.standardizer.Identity

A Standardizer that does nothing to SNP data.

See Standardizer for more information about standardization.

>>> from pysnptools.standardizer import Identity
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read()
>>> print(snpdata1.val[0,0])
2.0
>>> snpdata1 = snpdata1.standardize(Identity())
>>> print(snpdata1.val[0,0])
2.0

standardizer.DiagKtoN

class pysnptools.standardizer.DiagKtoN(deprecated_iid_count=None)

Both a Standardizer and a KernelStandardizer.

When applied to a SnpData, it multiplies the SNP values by the square root of a factor such that a kernel constructed from the SNP values will have a diagonal that sums to iid_count. This class thus standardizes the kernel before it is even constructed.

When applied to a KernelData, it multiplies the kernel values by a factor such that the kernel will have a diagonal that sums to iid_count.

See Standardizer for more information about standardization.

Constructor:
Parameters:
  • deprecated_iid_count (int) (optional) – Deprecated.

Example of DiagKtoN to SnpData:

>>> import numpy as np
>>> from pysnptools.standardizer import DiagKtoN, Unit, Identity
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()).standardize(DiagKtoN())
>>> kernel1 = snpdata1.read_kernel(Identity())
>>> print('{0:.6f}'.format(np.diag(kernel1.val).sum()))
300.000000
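The claim above, that multiplying the SNP values by the square root of a factor pre-standardizes the kernel, can be checked with plain NumPy (a sketch using a made-up random matrix, not pysnptools code):

```python
import numpy as np

rng = np.random.default_rng(0)
iid_count, sid_count = 300, 50
X = rng.standard_normal((iid_count, sid_count))  # stand-in for standardized SNP values

K = X @ X.T                            # kernel before DiagKtoN
factor = iid_count / np.diag(K).sum()  # the DiagKtoN factor
X_scaled = X * np.sqrt(factor)         # scale SNP values by sqrt(factor)

K_scaled = X_scaled @ X_scaled.T       # kernel built from the scaled SNPs
print(np.diag(K_scaled).sum())         # diagonal now sums to iid_count (about 300.0)
```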

Example of DiagKtoN to KernelData:

>>> import numpy as np
>>> from pysnptools.standardizer import DiagKtoN, Unit, Identity
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit())
>>> kernel1 = snpdata1.read_kernel(DiagKtoN(),block_size=None)
>>> print('{0:.6f}'.format(np.diag(kernel1.val).sum()))
300.000000

standardizer.UnitTrained

class pysnptools.standardizer.UnitTrained(sid, stats)

A Standardizer to unit standardize one set of SNP data based on the mean and stddev of another set of SNP data. NaN values are then filled with zero.

See Standardizer for more information about standardization.

Constructor:
Parameters:
  • sid (an array of strings) – The sids (SNP identifiers) to which the statistics apply

  • stats (ndarray of float) – The mean and stddev of each sid

>>> from pysnptools.standardizer import Unit
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> train = Bed(bedfile,count_A1=False)[1:,:].read() # read SNP values for all but the first iid
>>> _, unittrained = train.standardize(Unit(),return_trained=True) #Unit standardize and remember the mean and stddev of each sid
>>> print(unittrained.stats[:5,:]) #Print the means and stddev of the first five sids
[[...1.94983278  0.21828988]
 [...1.96989967  0.17086341]
 [...1.84280936  0.39057474]
 [...1.99665552  0.0577347 ]
 [...1.97658863  0.15120608]]
>>> test = Bed(bedfile,count_A1=False)[0,:].read() # read SNP values for the first iid
>>> test = test.standardize(unittrained) # Use the mean and stddev of the train data to unit standardize the test data.
>>> print('{0:.6f}'.format(test.val[0,0]))
0.229819
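The train/test logic above amounts to: compute each SNP's mean and standard deviation on the training rows, then apply those same statistics to the test rows, filling any NaN with zero. A plain-NumPy sketch of that behavior (toy data, not the pysnptools implementation):

```python
import numpy as np

train = np.array([[0., 0., 2.],
                  [2., 2., 0.],
                  [1., np.nan, 2.]])
test = np.array([[2., 1., np.nan]])

mean = np.nanmean(train, axis=0)   # per-sid mean, ignoring NaN
std = np.nanstd(train, axis=0)     # per-sid stddev, ignoring NaN

standardized = (test - mean) / std          # apply *training* statistics to test data
standardized = np.nan_to_num(standardized)  # NaN values filled with zero
print(standardized)
```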

standardizer.DiagKtoNTrained

class pysnptools.standardizer.DiagKtoNTrained(factor)

Both a Standardizer and a KernelStandardizer.

When applied to a SnpData, it standardizes one set of SNP data based on the statistics of another set of SNP data.

When applied to a KernelData, it standardizes one set of kernel data based on another.

See Standardizer for more information about standardization.

Constructor:
Parameters:
  • factor (float) – The number that, when multiplied into the kernel values, makes the diagonal of the kernel sum to iid_count.

kernelstandardizer Module

kernelstandardizer.KernelStandardizer

class pysnptools.kernelstandardizer.KernelStandardizer

A KernelStandardizer is a class, such as DiagKtoN and Identity, used by KernelData.standardize() to standardize kernel data. It always works in-place and returns the KernelData on which it works.

Example:

Read and standardize Kernel data.

>>> import numpy as np
>>> from pysnptools.kernelstandardizer import DiagKtoN
>>> from pysnptools.kernelreader import KernelNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>>
>>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz')
>>> kerneldata1 = KernelNpz(kernel_file).read()
>>> print(np.diag(kerneldata1.val).sum())
5000000.0
>>> kerneldata1 = kerneldata1.standardize(DiagKtoN())
>>> print(np.diag(kerneldata1.val).sum())
500.0

Standardization can also return a constant (“trained”) kernel standardizer that can be applied to other KernelData.

>>> kernel_whole = KernelNpz(kernel_file)
>>> train_idx, test_idx = range(10,kernel_whole.iid_count), range(0,10)  #test on the first 10, train on the rest
>>> kernel_train, trained_standardizer = DiagKtoN().standardize(kernel_whole[train_idx].read(),return_trained=True)
>>> print('{0:.6f}'.format(np.diag(kernel_train.val).sum()))
490.000000
>>> print('{0:.6f}'.format(trained_standardizer.factor))
0.000100
>>> kernel_whole_test = kernel_whole[:,test_idx].read().standardize(trained_standardizer)
>>> print('{0:.6f}'.format(kernel_whole_test.val[0,0]))
0.992217
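The trained standardizer's factor is just iid_count divided by the training kernel's diagonal sum, and applying it multiplies any kernel's values by that constant. A minimal NumPy sketch with toy numbers (not the pysnptools implementation):

```python
import numpy as np

train_kernel = np.diag(np.full(490, 1000.0))  # toy training kernel with trace 490000
factor = train_kernel.shape[0] / np.diag(train_kernel).sum()
print(factor)                                 # 490 / 490000 = 0.001

test_kernel = np.full((2, 2), 800.0)          # another toy kernel to standardize
print(test_kernel * factor)                   # every value scaled by the trained factor
```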

Details of Methods & Properties:

standardize(kerneldata, return_trained=False, force_python_only=False, num_threads=None)

Applies standardization, in place, to KernelData. For convenience also returns the KernelData.

Parameters:
  • kerneldata (KernelData) – kernel values to standardize

  • return_trained (bool) – If true, returns a second value containing a constant KernelStandardizer trained on this data.

  • force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.

  • num_threads (None or int) – optional – The number of threads with which to standardize data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

KernelData, (optional) constant KernelStandardizer

kernelstandardizer.DiagKtoN

class pysnptools.kernelstandardizer.DiagKtoN(deprecated_iid_count=None)

Both a Standardizer and a KernelStandardizer.

When applied to a SnpData, it multiplies the SNP values by the square root of a factor such that a kernel constructed from the SNP values will have a diagonal that sums to iid_count. This class thus standardizes the kernel before it is even constructed.

When applied to a KernelData, it multiplies the kernel values by a factor such that the kernel will have a diagonal that sums to iid_count.

See Standardizer for more information about standardization.

Constructor:
Parameters:
  • deprecated_iid_count (int) (optional) – Deprecated.

Example of DiagKtoN to SnpData:

>>> import numpy as np
>>> from pysnptools.standardizer import DiagKtoN, Unit, Identity
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit()).standardize(DiagKtoN())
>>> kernel1 = snpdata1.read_kernel(Identity())
>>> print('{0:.6f}'.format(np.diag(kernel1.val).sum()))
300.000000

Example of DiagKtoN to KernelData:

>>> import numpy as np
>>> from pysnptools.standardizer import DiagKtoN, Unit, Identity
>>> from pysnptools.snpreader import Bed
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> snpdata1 = Bed(bedfile,count_A1=False).read().standardize(Unit())
>>> kernel1 = snpdata1.read_kernel(DiagKtoN(),block_size=None)
>>> print('{0:.6f}'.format(np.diag(kernel1.val).sum()))
300.000000

kernelstandardizer.DiagKtoNTrained

class pysnptools.kernelstandardizer.DiagKtoNTrained(factor)

Both a Standardizer and a KernelStandardizer.

When applied to a SnpData, it standardizes one set of SNP data based on the statistics of another set of SNP data.

When applied to a KernelData, it standardizes one set of kernel data based on another.

See Standardizer for more information about standardization.

Constructor:
Parameters:
  • factor (float) – The number that, when multiplied into the kernel values, makes the diagonal of the kernel sum to iid_count.

kernelstandardizer.Identity

class pysnptools.kernelstandardizer.Identity

A KernelStandardizer that does nothing to kernel data.

See KernelStandardizer for more information about standardization.

>>> import numpy as np
>>> from pysnptools.kernelstandardizer import Identity as KS_Identity
>>> from pysnptools.kernelreader import KernelNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>>
>>> kernel_file = example_file('pysnptools/examples/toydata.kernel.npz')
>>> kerneldata1 = KernelNpz(kernel_file).read()
>>> print(np.diag(kerneldata1.val).sum())
5000000.0
>>> kerneldata1 = kerneldata1.standardize(KS_Identity())
>>> print(np.diag(kerneldata1.val).sum())
5000000.0

distreader Module

Tools for reading and manipulating SNP distribution data. For each individual and SNP, the data gives three probabilities, P(AA), P(AB), and P(BB), where A and B are the alleles. The probabilities should sum to 1.0. Missing data is represented by three numpy.NaN’s.
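The three probabilities per individual-SNP pair are easy to picture as the last axis of a 3-D array (a toy NumPy example, not tied to any file format):

```python
import numpy as np

# shape (iid_count, sid_count, 3): the last axis holds P(AA), P(AB), P(BB)
dist = np.array([[[0.5, 0.5, 0.0],
                  [np.nan, np.nan, np.nan]],  # missing data: three NaNs
                 [[0.0, 0.25, 0.75],
                  [1.0, 0.0, 0.0]]])

print(dist.shape)                # (2, 2, 3)
print(np.nansum(dist, axis=-1))  # observed triples sum to 1.0; the missing one sums to 0 here
```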

distreader.DistReader

class pysnptools.distreader.DistReader(*args, **kwargs)

A DistReader is one of three things:

  • A class such as Bgen for you to specify a file with data. For example,

    >>> from pysnptools.distreader import Bgen
    >>> from pysnptools.util import example_file # Download and return local file name
    >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
    >>> dist_on_disk = Bgen(bgen_file)
    >>> print(dist_on_disk) # prints the name of the file reader
    Bgen('...pysnptools/examples/2500x100.bgen')
    >>> dist_on_disk.sid_count # prints the number of SNPS (but doesn't read any SNP distribution values)
    100
    
  • A DistData class that holds SNP distribution data in memory, typically after reading it from disk:

    >>> from pysnptools.util import example_file # Download and return local file name
    >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
    >>> dist_on_disk = Bgen(bgen_file)
    >>> distdata1 = dist_on_disk.read() #reads the SNP distribution values
    >>> print(type(distdata1.val)) # The val property is an 3-D ndarray of SNP distribution values
    <class 'numpy.ndarray'>
    >>> print(distdata1) # prints the name of the in-memory SNP distribution reader.
    DistData(Bgen('...pysnptools/examples/2500x100.bgen'))
    >>> distdata1.iid_count #prints the number of iids (number of individuals) in this in-memory data
    2500
    
  • A subset of any DistReader, specified with “[ iid_index , sid_index ]”, to read only some SNP distribution values. It can also be used to re-order the values.

    >>> from pysnptools.util import example_file # Download and return local file name
    >>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
    >>> dist_on_disk = Bgen(bgen_file)
    >>> subset_on_disk = dist_on_disk[[3,4],::2] # specification for a subset of the data on disk. No SNP distribution values are read yet.
    >>> print(subset_on_disk.sid_count) # prints the number of sids in this subset (but still doesn't read any SNP distribution values)
    50
    >>> print(subset_on_disk) #prints a specification of 'subset_on_disk'
    Bgen('...pysnptools/examples/2500x100.bgen')[[3,4],::2]
    >>> distdata_subset = subset_on_disk.read() # efficiently reads the specified subset of values from the disk
    >>> print(distdata_subset) # prints the specification of the in-memory SNP distribution information
    DistData(Bgen('...pysnptools/examples/2500x100.bgen')[[3,4],::2])
    >>> print((int(distdata_subset.val.shape[0]), int(distdata_subset.val.shape[1]))) # The dimensions of the ndarray of SNP distribution values
    (2, 50)
    

The DistReader Classes

Class       Format              Random Access        Suffixes       Write method?
----------  ------------------  -------------------  -------------  -------------
DistData    in-memory floats    Yes                  n/a            n/a
Bgen        binary, floats      Yes (by sid)         *.bgen         Yes
DistNpz     binary, floats      No                   .dist.npz      Yes
DistHdf5    binary, floats      Yes (by sid or iid)  .dist.hdf5     Yes
DistMemMap  mem-mapped floats   Yes                  .dist.memmap   Yes

Methods & Properties:

Every DistReader, such as Bgen and DistData, has these properties: iid, iid_count, sid, sid_count, pos and these methods: read(), iid_to_index(), sid_to_index(), as_snp(). See below for details.

DistData is a DistReader, so it supports all the properties and methods above. In addition, it supports the DistData.val property.

Many of the classes, such as Bgen, also provide a static Bgen.write() method for writing DistReader or DistData to disk.

>>> from pysnptools.distreader import DistHdf5, Bgen
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5")
>>> distreader = DistHdf5(hdf5_file)[:,:10] # A reader for the first 10 SNPs in Hdf5 format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.bgen")
>>> Bgen.write("tempdir/toydata10.bgen",distreader)        # Write data in BGEN format
Bgen('tempdir/toydata10.bgen')

iids and sids:

same as with SnpReader

Selecting and Reordering Individuals and SNPs

same as with SnpReader

When Data is Read:

same as with SnpReader

When Data is Re-Read and Copied:

same as with SnpReader

Avoiding Unwanted ndarray Allocations

same as with SnpReader

The read() Method

By default the read() returns a 3-D ndarray of numpy.float64 laid out in memory in F-contiguous order (iid-index varies the fastest). You may, instead, ask for numpy.float32 or for C-contiguous order or any order. See read() for details.
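The difference between the orders is purely memory layout, which plain NumPy can demonstrate (this sketch allocates an empty array and reads no genetic data):

```python
import numpy as np

f_order = np.zeros((2500, 100, 3), dtype=np.float64, order='F')
c_order = np.zeros((2500, 100, 3), dtype=np.float64, order='C')

print(f_order.flags['F_CONTIGUOUS'])  # True: first index (iid) varies fastest in memory
print(c_order.flags['C_CONTIGUOUS'])  # True: last index varies fastest in memory
print(f_order.strides[0] < f_order.strides[-1])  # True: iid stride is smallest in F order
```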

Details of Methods & Properties:

as_snp(max_weight=2.0, block_size=None)

Returns a pysnptools.snpreader.SnpReader that turns the probability distribution into an expected value.

For example, if the probability distribution is [0.466804, 0.38812848, 0.14506752] and max_weight is 2, then the expected value will be (0.466804*0 + 0.38812848*0.5 + 0.14506752*1)*2 = 0.67826352.

Parameters:
  • max_weight (number) – optional – The expected values will range from 0 to max_weight. Default of 2.

  • block_size (int or None) – optional – Default of None (meaning to load all). Suggested number of sids to read into memory at a time.

Return type:

SnpReader

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> distreader = Bgen(bgen_file) # Specify distribution data on disk
>>> print(distreader[0,0].read().val)
[[[0.466804   0.38812848 0.14506752]]]
>>> snpreader = distreader.as_snp(max_weight=2)
>>> print(snpreader[0,0].read().val)
[[0.67826352]]
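The expected-value computation is a dot product of each probability triple with the weights [0, max_weight/2, max_weight] (with max_weight=2, simply the count [0, 1, 2] of the B allele). Reproducing the number above with plain NumPy:

```python
import numpy as np

dist = np.array([0.466804, 0.38812848, 0.14506752])  # P(AA), P(AB), P(BB)
max_weight = 2.0
weights = np.array([0.0, 0.5, 1.0]) * max_weight     # [0, 1, 2]
print(dist @ weights)                                # 0.67826352
```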
property iid

A ndarray of the iids. Each iid is a ndarray of two strings (a family ID and an individual ID) that identifies an individual.

Return type:

ndarray of strings with shape [iid_count,2]

This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> dist_on_disk = Bgen(bgen_file)
>>> print(dist_on_disk.iid[:3]) # print the first three iids
[['0' 'iid_0']
 ['0' 'iid_1']
 ['0' 'iid_2']]
property iid_count

number of iids

Return type:

integer

This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.

iid_to_index(list)

Takes a list of iids and returns a list of index numbers

Parameters:

list – list of iids

Return type:

ndarray of int

This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> dist_on_disk = Bgen(bgen_file) # Specify SNP data on disk
>>> print(dist_on_disk.iid_to_index([['0','iid_2'],['0','iid_1']])) #Find the indexes for two iids.
[2 1]
property pos

A ndarray of the position information for each sid. Each element is a ndarray of three numbers (chromosome, genetic distance, basepair distance).

Return type:

ndarray of float64 with shape [sid_count, 3]

This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> dist_on_disk = Bgen(bgen_file)
>>> print(dist_on_disk.pos[:4,].astype('int')) # print position information for the first four sids
[[        1         0   9270273]
 [        1         0  39900273]
 [        1         0  70530273]
 [        1         0 101160273]]
read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)

Reads the SNP values and returns a DistData (with DistData.val property containing a new 3D ndarray of the SNP distribution values).

Parameters:
  • order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (iid-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (sid-index varies the fastest). If order is ‘A’, then the DistData.val ndarray may be in any order (either C-, Fortran-contiguous).

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the DistData.val ndarray.

  • force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.

  • view_ok (bool) – optional – If False (default), allocates new memory for the DistData.val ndarray. If True, and if practical when reading from a DistData, returns a new DistData whose ndarray shares memory with the original DistData. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray will affect the other. Also keep in mind that read() relies on ndarray’s mechanisms to decide whether to actually share memory, so it may ignore your suggestion and allocate a new ndarray anyway.

  • num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

DistData

Calling the method again causes the SNP distribution values to be re-read and creates a new in-memory DistData with a new ndarray of SNP values.

If you request the values for only a subset of the sids or iids, (to the degree practical) only that subset will be read from disk.

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> dist_on_disk = Bgen(bgen_file) # Specify SNP data on disk
>>> distdata1 = dist_on_disk.read() # Read all the SNP data returning a DistData instance
>>> print(type(distdata1.val).__name__) # The DistData instance contains a ndarray of the data.
ndarray
>>> subset_distdata = dist_on_disk[:,::2].read() # From the disk, read SNP values for every other sid
>>> print(subset_distdata.val[0,0]) # Print the first SNP value in the subset
[0.466804   0.38812848 0.14506752]
>>> subsub_distdata = subset_distdata[:10,:].read(order='A',view_ok=True) # Create an in-memory subset of the subset with SNP values for the first ten iids. Share memory if practical.
>>> import numpy as np
>>> # print np.may_share_memory(subset_distdata.val, subsub_distdata.val) # Do the two ndarray's share memory? They could. Currently they won't.
property row_property

Defined as a zero-width array for compatibility with PstReader, but not used.

property sid

A ndarray of the sids. Each sid is a string that identifies a SNP.

Return type:

ndarray (length sid_count) of strings

This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> dist_on_disk = Bgen(bgen_file)
>>> print(dist_on_disk.sid[:9]) # print the first nine sids
['sid_0' 'sid_1' 'sid_2' 'sid_3' 'sid_4' 'sid_5' 'sid_6' 'sid_7' 'sid_8']
property sid_count

number of sids

Return type:

integer

This property (to the degree practical) reads only iid and sid data from the disk, not SNP distribution data. Moreover, the iid and sid data is read from file only once.

sid_to_index(list)

Takes a list of sids and returns a list of index numbers

Parameters:

list (list of strings) – list of sids

Return type:

ndarray of int

This method (to the degree practical) reads only iid and sid data from the disk, not SNP value data. Moreover, the iid and sid data is read from file only once.

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> dist_on_disk = Bgen(bgen_file) # Specify SNP data on disk
>>> print(dist_on_disk.sid_to_index(['sid_2','sid_9'])) #Find the indexes for two sids.
[2 9]
property val_shape

Tells the shape of the value for a given individual and SNP. For DistReaders, this is always 3.

distreader.Bgen

class pysnptools.distreader.Bgen(filename, iid_function=<function default_iid_function>, sid_function=<function default_sid_function>, sample=None, num_threads=None, fresh_properties=True)

A DistReader for reading *.bgen files from disk.

See DistReader for general examples of using DistReaders.

The BGEN format is described here.

Tip: The ‘gen’ in BGEN stands for ‘genetic’. The ‘gen’ in DistGen stands for ‘generate’, because it generates random (genetic) data.

Constructor:
Parameters:
  • filename (string) – The BGEN file to read.

  • iid_function (optional, function) – Function to turn a BGEN sample into a DistReader.iid. (Default: bgen.default_iid_function().)

  • sid_function (optional, function or string) – Function to turn a BGEN (SNP) id and rsid into a DistReader.sid. (Default: bgen.default_sid_function().) Can also be the string ‘id’ or ‘rsid’, which is faster than using a function.

  • sample (optional, string) – A GEN sample file. If given, overrides information in *.bgen file.

  • num_threads (optional, int) – The number of threads with which to read data. Defaults to all available processors.

    Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

  • fresh_properties (optional, bool) – When true (default), memory will be allocated for the iid, sid, and pos properties. This is safe. When false, the properties will use the Bgen’s on-disk ‘memmap’. That saves memory, but is only safe if the Bgen object is still around when the properties are used.

Example:

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/example.bgen")
>>> data_on_disk = Bgen(bgen_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 199)

Methods beyond DistReader

flush()

Close the *.bgen file for reading. (If values or properties are accessed again, the file will be reopened.)

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/example.bgen")
>>> data_on_disk = Bgen(bgen_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(500, 199)
>>> data_on_disk.flush()
static genwrite(filename, distreader, decimal_places=None, id_rsid_function=<function default_id_rsid_function>, sample_function=<function default_sample_function>, block_size=None)

Writes a DistReader to Gen format

Parameters:
  • filename (string) – the name of the file to create (a companion *.sample file, named after the filename without its .gen suffix, will also be created)

  • distreader (DistReader) – The data that should be written to disk. It can be any DistReader, for example, DistNpz, DistData, or another Bgen.

  • decimal_places – (Default: None) Number of decimal places with which to write the text numbers. None writes to full precision.

  • id_rsid_function (function) – Function to turn a DistReader.sid into a GEN (SNP) id and rsid. (Default: bgen.default_id_rsid_function().)

  • sample_function (function) – Function to turn a DistReader.iid into a GEN sample. (Default: bgen.default_sample_function().)

  • block_size (number) – The number of SNPs to read in a batch from distreader. Defaults to a block_size such that block_size * iid_count is about 100,000.

Return type:

None

>>> from pysnptools.distreader import DistHdf5, Bgen
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5")
>>> distreader = DistHdf5(hdf5_file)[:,:10] # A reader for the first 10 SNPs in Hdf5 format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.bgen")
>>> Bgen.genwrite("tempdir/toydata10.gen",distreader)        # Write data in GEN format
static write(filename, distreader, bits=16, compression=None, sample_function=<function default_sample_function>, id_rsid_function=<function default_id_rsid_function>, iid_function=<function default_iid_function>, sid_function=<function default_sid_function>, block_size=None, qctool_path=None, cleanup_temp_files=True)

Writes a DistReader to BGEN format and returns the resulting Bgen. Requires access to the 3rd party QCTool.

Parameters:
  • filename (string) – the name of the file to create

  • distreader (DistReader) – The data that should be written to disk. It can be any DistReader, for example, DistNpz, DistData, or another Bgen.

  • bits (int) – Number of bits, between 1 and 32 used to represent each 0-to-1 probability value. Default is 16. An np.float32 needs 23 bits. A np.float64 would need 52 bits, which the BGEN format doesn’t offer, so use 32.

  • compression (string) – How to compress the file. Can be None (default), ‘zlib’, or ‘zstd’.

  • sample_function (function) – Function to turn a DistReader.iid into a BGEN sample. (Default: bgen.default_sample_function().)

  • id_rsid_function (function) – Function to turn a DistReader.sid into a BGEN (SNP) id and rsid. (Default: bgen.default_id_rsid_function().)

  • iid_function (function) – Function to turn a BGEN sample into a DistReader.iid. (Default: bgen.default_iid_function().)

  • sid_function (function) – Function to turn a BGEN (SNP) id and rsid into a DistReader.sid. (Default: bgen.default_sid_function().)

  • block_size (number) – The number of SNPs to read in a batch from distreader. Defaults to a block_size such that block_size * iid_count is about 100,000.

  • qctool_path (string) – Tells the path to the 3rd party QCTool. Defaults to reading the path from environment variable QCTOOLPATH. (To use on Windows, install Ubuntu for Windows, install QCTool in Ubuntu, and then give the path as “ubuntu run <UBUNTU PATH TO QCTOOL>”.)

  • cleanup_temp_files (bool) – Tells whether to delete the temporary *.gen and *.sample files.

Return type:

Bgen

>>> from pysnptools.distreader import DistHdf5, Bgen
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5")
>>> distreader = DistHdf5(hdf5_file)[:,:10] # A reader for the first 10 SNPs in Hdf5 format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.bgen")
>>> Bgen.write("tempdir/toydata10.bgen",distreader)        # Write data in BGEN format
Bgen('tempdir/toydata10.bgen')
pysnptools.distreader.bgen.default_iid_function(sample)

The default function for turning a Bgen sample into a two-part pysnptools.distreader.DistReader.iid. If the Bgen sample contains a single comma, we split on the comma to create the iid. Otherwise, the iid will be (‘0’,sample)

>>> default_iid_function('fam0,ind0')
('fam0', 'ind0')
>>> default_iid_function('ind0')
('0', 'ind0')
pysnptools.distreader.bgen.default_sid_function(id, rsid)

The default function for turning a Bgen (SNP) id and rsid into a pysnptools.distreader.DistReader.sid. If the Bgen rsid is ‘’ or ‘0’, the sid will be the (SNP) id. Otherwise, the sid will be ‘ID,RSID’

>>> default_sid_function('SNP1','rs102343')
'SNP1,rs102343'
>>> default_sid_function('SNP1','0')
'SNP1'
pysnptools.distreader.bgen.default_sample_function(famid, indid)

The default function for turning a two-part pysnptools.distreader.DistReader.iid into a Bgen sample. If the iid’s first part (the family id) is ‘0’ or ‘’, the sample will be the iid’s 2nd part. Otherwise, the sample will be ‘FAMID,INDID’

>>> default_sample_function('fam0','ind0')
'fam0,ind0'
>>> default_sample_function('0','ind0')
'ind0'
pysnptools.distreader.bgen.default_id_rsid_function(sid)

The default function for turning a pysnptools.distreader.DistReader.sid into a Bgen (SNP) id and rsid. If the sid contains a single comma, we split on the comma to create the id and rsid. Otherwise, the (SNP) id will be the sid and the rsid will be ‘0’

>>> default_id_rsid_function('SNP1,rs102343')
('SNP1', 'rs102343')
>>> default_id_rsid_function('SNP1')
('SNP1', '0')
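The default conversions above are simple, mutually inverse string manipulations. A sketch of the documented logic (hypothetical helper names, not the library source):

```python
def iid_from_sample(sample):
    # Split on a single comma into (famid, indid); otherwise famid defaults to '0'.
    parts = sample.split(',')
    return tuple(parts) if len(parts) == 2 else ('0', sample)

def id_rsid_from_sid(sid):
    # Split on a single comma into (id, rsid); otherwise rsid defaults to '0'.
    parts = sid.split(',')
    return tuple(parts) if len(parts) == 2 else (sid, '0')

print(iid_from_sample('fam0,ind0'))  # ('fam0', 'ind0')
print(iid_from_sample('ind0'))       # ('0', 'ind0')
print(id_rsid_from_sid('SNP1'))      # ('SNP1', '0')
```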

distreader.DistData

class pysnptools.distreader.DistData(iid, sid, val, pos=None, name=None, copyinputs_function=None)

A DistReader for holding SNP distributions (or similar values) in-memory, along with related iid, sid, and pos information. It is usually created by calling the DistReader.read() method on another DistReader, for example, Bgen. It can also be constructed directly.

See DistReader for details and examples.

Constructor:
Parameters:
  • iid (an array of string pair) – The DistReader.iid information.

  • sid (an array of strings) – The DistReader.sid information.

  • val (a 3-D array of floats) – The SNP value distribution

  • pos (optional, an array of numbers) – The DistReader.pos information

  • name (optional, string) – Information to be displayed about the origin of this data

  • copyinputs_function (optional, function) – Used internally by optional clustering code

Example:

>>> from pysnptools.distreader import DistData
>>> distdata = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[.5,.5,0]]])
>>> print((distdata.val[1,1], distdata.iid_count, distdata.sid_count))
(array([0.  , 0.75, 0.25]), 2, 3)

Equality:

Two DistData objects are equal if their four arrays (DistData.val, DistReader.iid, DistReader.sid, and DistReader.pos) are ‘array_equal’. (Their ‘name’ does not need to be the same). If either DistData.val contains NaN, the objects will not be equal. However, DistData.allclose() can be used to treat NaN as regular values.

Example:

>>> import numpy as np
>>> from pysnptools.distreader import DistData
>>> snpdata1 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[.5,.5,0]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata2 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[.5,.5,0]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata1 == snpdata2) #True, because all the arrays have the same values.
True
>>> print(snpdata1.val is snpdata2.val) #False, because the two arrays have different memory.
False
>>> snpdata3 = DistData(iid=[['a','0'],['b','0']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[.5,.5,0]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata4 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[.5,.5,0]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata3 == snpdata4) #False, because the iids are different.
False
>>> snpdata5 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata6 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata5 == snpdata6) #False, because the val's contain NaN
False
>>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True

Methods beyond DistReader

allclose(value, equal_nan=True)
Parameters:
  • value (DistData) – Other object with which to compare.

  • equal_nan (bool) – (Default: True) Tells if NaN in DistData.val should be treated as regular values when testing equality.

>>> import numpy as np
>>> snpdata5 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> snpdata6 = DistData(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],
...                     val=[[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                          [[0,1.,0],[0,.75,.25],[np.nan,np.nan,np.nan]]],
...                     pos=[[0,0,0],[0,0,0],[0,0,0]])
>>> print(snpdata5.allclose(snpdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True
>>> print(snpdata5.allclose(snpdata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal.
False
property val

The 3D NumPy array of floats that represents the distribution of SNP values. You can get or set this property.

>>> from pysnptools.distreader import Bgen
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> distdata = Bgen(bgen_file)[:5,:].read() #read data for first 5 iids
>>> print(distdata.val[4,55]) #print one of the SNP values
[0.23137255 0.65342184 0.11520562]

distreader.DistHdf5

class pysnptools.distreader.DistHdf5(*args, **kwargs)

A DistReader for reading *.dist.hdf5 files from disk.

See DistReader for general examples of using DistReaders.

The general HDF5 format is described here. The DistHdf5 format stores val, iid, sid, and pos information in Hdf5 format.

Constructor:
Parameters:
  • filename (string) – The DistHdf5 file to read.

Example:

>>> from pysnptools.distreader import DistHdf5
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file("pysnptools/examples/toydata.snpmajor.dist.hdf5")
>>> data_on_disk = DistHdf5(hdf5_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(25, 10000)

Methods beyond DistReader

static write(filename, distdata, hdf5_dtype=None, sid_major=True)

Writes a DistData to DistHdf5 format and returns the DistHdf5.

Parameters:
  • filename (string) – the name of the file to create

  • distdata (DistData) – The in-memory data that should be written to disk.

  • hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.

  • sid_major – Tells if vals should be stored on disk in sid_major (default) or iid_major format.

Return type:

DistHdf5

>>> from pysnptools.distreader import DistHdf5, Bgen
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> distdata = Bgen(bgen_file)[:,:10].read()     # Read first 10 snps from BGEN format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.dist.hdf5")
>>> DistHdf5.write("tempdir/toydata10.dist.hdf5",distdata)        # Write data in DistHdf5 format
DistHdf5('tempdir/toydata10.dist.hdf5')

distreader.DistNpz

class pysnptools.distreader.DistNpz(*args, **kwargs)

A DistReader for reading *.dist.npz files from disk.

See DistReader for general examples of using DistReaders.

The general NPZ format is described here. The DistNpz format stores val, iid, sid, and pos information in NPZ format.

Constructor:
Parameters:
  • filename (string) – The DistNpz file to read.

Example:

>>> from pysnptools.distreader import DistNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> npz_file = example_file("pysnptools/examples/toydata10.dist.npz")
>>> data_on_disk = DistNpz(npz_file)
>>> print((data_on_disk.iid_count, data_on_disk.sid_count))
(25, 10)

Methods beyond DistReader

static write(filename, distdata)

Writes a DistData to DistNpz format and returns the DistNpz.

Parameters:
  • filename (string) – the name of the file to create

  • distdata (DistData) – The in-memory data that should be written to disk.

Return type:

DistNpz

>>> from pysnptools.distreader import DistNpz, DistHdf5
>>> import pysnptools.util as pstutil
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file("pysnptools/examples/toydata.iidmajor.dist.hdf5")
>>> distdata = DistHdf5(hdf5_file)[:,:10].read()     # Read first 10 snps from DistHdf5 format
>>> pstutil.create_directory_if_necessary("tempdir/toydata10.dist.npz")
>>> DistNpz.write("tempdir/toydata10.dist.npz",distdata)          # Write data in DistNpz format
DistNpz('tempdir/toydata10.dist.npz')

distreader.DistMemMap

class pysnptools.distreader.DistMemMap(*args, **kwargs)

A DistData that keeps its data in a memory-mapped file. This allows working with data larger than fits in main memory.

See DistData for general examples of using DistData.

Constructor:
Parameters:

filename (string) – The *.dist.memmap file to read.

Also see DistMemMap.empty() and DistMemMap.write().

Example:

>>> from pysnptools.distreader import DistMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>> mem_map_file = example_file("pysnptools/examples/tiny.dist.memmap")
>>> dist_mem_map = DistMemMap(mem_map_file)
>>> print(dist_mem_map.val[0,1], dist_mem_map.iid_count, dist_mem_map.sid_count)
[0.43403135 0.28289911 0.28306954] 25 10

Methods inherited from DistData

Methods beyond DistReader

static empty(iid, sid, filename, pos=None, order='F', dtype=<class 'numpy.float64'>)

Create an empty DistMemMap on disk.

Parameters:
  • iid (an array of string pairs) – The DistReader.iid information

  • sid (an array of strings) – The DistReader.sid information

  • filename (string) – name of memory-mapped file to create

  • pos (an array of numeric triples) – optional – The additional DistReader.pos information associated with each sid. Default: None

  • order (string or None) – {‘F’ (default), ‘C’}, optional – Specify the order of the ndarray.

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the DistMemMap.val ndarray.

Return type:

DistMemMap

>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.distreader import DistMemMap
>>> filename = "tempdir/tiny.dist.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> dist_mem_map = DistMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64)
>>> dist_mem_map.val[:,:,:] = [[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                            [[0,1.,0],[0,.75,.25],[.5,.5,0]]]
>>> dist_mem_map.flush()
property filename

The name of the memory-mapped file

flush()

Flush DistMemMap.val to disk and close the file. (If values or properties are accessed again, the file will be reopened.)

>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.distreader import DistMemMap
>>> filename = "tempdir/tiny.dist.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> dist_mem_map = DistMemMap.empty(iid=[['fam0','iid0'],['fam0','iid1']], sid=['snp334','snp349','snp921'],filename=filename,order="F",dtype=np.float64)
>>> dist_mem_map.val[:,:,:] = [[[.5,.5,0],[0,0,1],[.5,.5,0]],
...                            [[0,1.,0],[0,.75,.25],[.5,.5,0]]]
>>> dist_mem_map.flush()
property offset

The byte position in the file where the memory-mapped values start.

(The disk space before this is used to store DistReader.iid, etc. information. This property is useful when interfacing with, for example, external Fortran and C matrix libraries.)
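As a sketch of how such a byte offset can be used from outside code, plain NumPy can map just the value block of a file that stores metadata before the numeric data. This is an illustrative sketch, not the actual DistMemMap file layout; the header bytes and file name here are invented:

```python
# Sketch: a file with metadata first, then a raw float block. An offset
# (analogous to DistMemMap.offset) tells external code where values start.
import os
import tempfile
import numpy as np

header = b"iid/sid metadata would go here..."  # stand-in for stored metadata
vals = np.arange(6, dtype=np.float64).reshape(2, 3)

path = os.path.join(tempfile.mkdtemp(), "toy.memmap")
with open(path, "wb") as f:
    f.write(header)           # metadata first ...
    f.write(vals.tobytes())   # ... then the raw float64 block

offset = len(header)          # where the memory-mapped values start
mapped = np.memmap(path, dtype=np.float64, mode="r", offset=offset, shape=(2, 3))
print(mapped[1, 2])  # 5.0
```

An external Fortran or C matrix library could similarly seek past the first `offset` bytes and treat the rest of the file as a contiguous array.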

property val

The 3D NumPy memmap array of floats that represents the distribution of SNP values. You can get this property, but cannot set it (except with itself)

>>> from pysnptools.distreader import DistMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>> mem_map_file = example_file("pysnptools/examples/tiny.dist.memmap")
>>> dist_mem_map = DistMemMap(mem_map_file)
>>> print(dist_mem_map.val[0,1])
[0.43403135 0.28289911 0.28306954]
static write(filename, distreader, order='A', dtype=None, block_size=None, num_threads=None)

Writes a DistReader to DistMemMap format.

Parameters:
  • filename (string) – the name of the file to create

  • distreader (DistReader) – The data that should be written to disk. It can also be any distreader, for example, DistNpz, DistData, or another Bgen.

  • order (string or None) – {‘A’ (default), ‘F’, ‘C’}, optional – Specify the order of the ndarray. By default, will match the order of the input if knowable; otherwise, ‘F’

  • dtype (data-type) – {None (default), numpy.float64, numpy.float32}, optional – The data-type for the DistMemMap.val ndarray. By default, will match the dtype of the input if knowable; otherwise np.float64.

  • block_size (number) – The number of SNPs to read in a batch from distreader. Defaults to a block_size such that block_size * iid_count is about 100,000.

  • num_threads (None or int) – optional – The number of threads with which to write data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

DistMemMap

>>> import pysnptools.util as pstutil
>>> from pysnptools.distreader import Bgen, DistMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>> bgen_file = example_file("pysnptools/examples/2500x100.bgen")
>>> distreader = Bgen(bgen_file)[:,:10] #Create a reader for the first 10 SNPs
>>> pstutil.create_directory_if_necessary("tempdir/tiny.dist.memmap")
>>> DistMemMap.write("tempdir/tiny.dist.memmap",distreader)      # Write distreader in DistMemMap format
DistMemMap('tempdir/tiny.dist.memmap')

distreader.DistGen

class pysnptools.distreader.DistGen(seed, iid_count, sid_count, chrom_count=22, cache_file=None, block_size=None)

A DistReader that generates deterministic “random” distributions over SNP data on the fly.

See DistReader for general examples of using DistReader.

Tip: The ‘gen’ in DistGen stands for generate, because it generates random (genetic) data. The ‘gen’ in Bgen stands for ‘genetic’.

Constructor:
Parameters:
  • seed (number) – The random seed that (along with block_size) determines the SNP values.

  • iid_count (number) – the number of iids (number of individuals)

  • sid_count (number) – the number of sids (number of SNPs)

  • chrom_count (number) – the number of chromosomes to generate. Default is 22. Must be 22 or less.

  • cache_file (string) – (default None) If provided, tells where to cache the common iid, sid, and pos information. Using it can save time.

  • block_size (number or None) – (default None) Tells how many SNPs to generate at once. The default picks a block_size such that iid_count * block_size is about 100,000.

Example:

>>> from pysnptools.distreader import DistGen
>>> #Prepare to generate data for 1000 individuals and 1,000,000 SNPs
>>> dist_gen = DistGen(seed=332,iid_count=1000,sid_count=1000*1000)
>>> print(dist_gen.iid_count,dist_gen.sid_count)
1000 1000000
>>> dist_data = dist_gen[:,200*1000:201*1000].read() #Generate data for all individuals and for SNPs 200K to 201K
>>> print(dist_data.val[2,2], dist_data.iid_count, dist_data.sid_count)
[0.2760024  0.19624834 0.52774925] 1000 1000
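The determinism described above (the same seed yields the same values no matter which slice is read first) can be pictured with per-block seeding. This is an illustrative sketch, not PySnpTools’ actual algorithm; gen_block is a hypothetical helper:

```python
# Sketch: deriving each block's seed from a base seed plus the block index
# makes generation deterministic and independent of read order.
import numpy as np

def gen_block(base_seed, block_index, iid_count, block_size):
    rng = np.random.RandomState(base_seed + block_index)  # per-block seed
    # One (iid, snp, 3) block of distribution-like values.
    return rng.uniform(size=(iid_count, block_size, 3))

# Reading block 5 directly, or after other blocks, yields identical values.
a = gen_block(base_seed=332, block_index=5, iid_count=4, block_size=10)
_ = gen_block(base_seed=332, block_index=0, iid_count=4, block_size=10)
b = gen_block(base_seed=332, block_index=5, iid_count=4, block_size=10)
print(np.array_equal(a, b))  # True
```

This also explains why block_size (along with seed) helps determine the values: the block boundaries decide which seed generates which SNPs.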

pstreader Module

Tools for reading and manipulating matrix data along with row and column ids and properties.

pstreader.PstReader

class pysnptools.pstreader.PstReader(*args, **kwargs)

A PstReader is one of three things:

  • A PstData class that holds matrix data in memory:

    >>> import numpy as np
    >>> from pysnptools.pstreader import PstData
    >>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
    >>> type(data1.val).__name__ # The val property is an ndarray of values
    'ndarray'
    >>> data1.row_count #prints the number of rows in this in-memory data
    3
    
  • A class such as PstNpz for you to specify data in a file. For example,

    >>> from pysnptools.pstreader import PstNpz
    >>> from pysnptools.util import example_file # Download and return local file name
    >>>
    >>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz')
    >>> on_disk = PstNpz(pstnpz_file)
    >>> print(on_disk) # prints the name of the file reader
    PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz')
    >>> on_disk.col_count # prints the number of columns (but doesn't read any matrix values)
    1015
    
  • A subset of any PstReader, specified with “[ row_index , col_index ]”, to read just some values.

    >>> from pysnptools.pstreader import PstNpz
    >>> from pysnptools.util import example_file # Download and return local file name
    >>>
    >>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz')
    >>> on_disk = PstNpz(pstnpz_file)
    >>> subset_on_disk = on_disk[[3,4],::2] # specification for a subset of the data on disk. No values are read yet.
    >>> print(subset_on_disk.col_count) # prints the number of columns in this subset (but still doesn't read any values)
    508
    >>> print(subset_on_disk) #prints a specification of 'subset_on_disk'
    PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz')[[3,4],::2]
    >>> data_subset = subset_on_disk.read() # efficiently (if practical) reads the specified subset of values from the disk
    >>> print(data_subset) # prints the specification of the in-memory information
    PstData(PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz')[[3,4],::2])
    >>> int(data_subset.val.shape[0]),int(data_subset.val.shape[1]) # The dimensions of the ndarray of values
    (2, 508)
    

The PstReaders Classes

Class                  Format              Random Access   Suffixes
PstData                in-memory floats    Yes             n/a
PstNpz                 binary, floats      No              .pst.npz, .snp.npz, .kernel.npz
PstHdf5                binary, floats      Yes             .pst.hdf5, .snp.hdf5, .kernel.hdf5
PstMemMap              mem-mapped floats   Yes             .pst.memmap, .snp.memmap
various SnpReader      varies              varies          varies
various KernelReader   varies              varies          varies

A SnpReader and KernelReader are each a kind of PstReader. They have some restrictions, summarized here:

Class          val type   row type   col type   row_property type   col_property type
PstReader      float      any        any        any                 any
SnpReader      float      str,str    str        none                float,float,float
KernelReader   float      str,str    str,str    none                none

For convenience, they allow additional ways to access rows and columns.

Class          val name   row name       col name       row_property name   col_property name
PstReader      val        row            col            row_property        col_property
SnpReader      val        iid,row        sid,col        none                col_property,pos
KernelReader   val        iid0,row,iid   iid1,col,iid   none                none

Note:

A KernelReader.iid may be used when KernelReader.iid0 is equal to KernelReader.iid1
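Conceptually, the aliasing above is just alternate property names for the same underlying arrays. A minimal sketch (not the real class hierarchy; the Toy* names are invented):

```python
# Sketch: SnpReader's iid/sid behave as aliases for PstReader's row/col.
import numpy as np

class ToyPstReader:
    def __init__(self, row, col):
        self._row = np.array(row)
        self._col = np.array(col)

    @property
    def row(self):
        return self._row

    @property
    def col(self):
        return self._col

class ToySnpReader(ToyPstReader):
    @property
    def iid(self):   # alias for row (an array of string pairs)
        return self.row

    @property
    def sid(self):   # alias for col (an array of strings)
        return self.col

reader = ToySnpReader(row=[["fam0", "iid0"], ["fam0", "iid1"]], col=["snp1", "snp2"])
print(reader.iid is reader.row)  # True
```

Because the aliases return the very same arrays, code written against the generic PstReader names and code written against the SNP-specific names see identical data.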

Methods & Properties:

Every PstReader, such as PstNpz and PstData, has these properties: row, row_count, col, col_count, row_property, col_property and these methods: read(), row_to_index(), col_to_index(). See below for details.

PstData is a PstReader so it supports the above properties and methods. In addition, it supports property PstData.val and equality testing. See below for details.

Rows and Cols:

Example:

>>> from pysnptools.pstreader import PstHdf5
>>> from pysnptools.util import print2 #print bytes strings and Unicode strings the same
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file('pysnptools/examples/toydata.iidmajor.snp.hdf5')
>>> on_disk = PstHdf5(hdf5_file) # PstHdf5 can load .pst.hdf5, .snp.hdf5, and kernel.hdf5
>>> print2(on_disk.row[:3]) # print the first three rows
[['per0' 'per0']
 ['per1' 'per1']
 ['per2' 'per2']]
>>> print2(on_disk.col[:7]) # print the first seven columns
['null_0' 'null_1' 'null_2' 'null_3' 'null_4' 'null_5' 'null_6']
>>> print2(on_disk.row_to_index([[b'per2', b'per2'],[b'per1', b'per1']])) #Find the indexes for two rows.
[2 1]

When Matrix Data is Read:

Matrix data can be enormous so we generally avoid reading it to the degree practical. Specifically,

  • Constructing and printing a PstReader causes no file reading. For example, these commands read no data:

    >>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read.
    >>> print(on_disk) # Print the PstHdf5 PstReader specification. No data is read.
    PstHdf5(...pysnptools/examples/toydata.iidmajor.snp.hdf5')
    >>> subset_on_disk = on_disk[[3,4],::2] # Construct a subsetting PstReader. No data is read.
    >>> print(subset_on_disk) # print the subset PstReader. No data is read.
    PstHdf5(...pysnptools/examples/toydata.iidmajor.snp.hdf5')[[3,4],::2]
    
  • Properties and methods related to the rows and columns (to the degree practical) read only row and col data from the disk, not value data. Moreover, the row and col data is read from file only once. Consider these commands:

    >>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read.
    >>> print2(on_disk.col[:7]) # without reading any values data from disk, read the row and col data from disk, cache it, and then print the first seven cols.
    ['null_0' 'null_1' 'null_2' 'null_3' 'null_4' 'null_5' 'null_6']
    >>> print(on_disk.col_to_index([b'null_7',b'null_2'])) #use the cached col information to find the indexes of 'null_7' and 'null_2'. (No data is read from disk.)
    [7 2]
    
  • The only method that reads values from file (to the degree practical) is read(). For example:

    >>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read.
    >>> data1 = on_disk.read() #read all the values from disk, creating a new PstData instance that keeps these values in memory
    >>> print(data1.val[0,2]) # print the value for the row with index 0 and the col with index 2. (No data is read from disk.)
    2.0
    
  • If you request the values for only a subset of the rows or columns, (to the degree practical) only that subset will be read from disk. For example:

    >>> hdf5_file2 = example_file('tests/datasets/all_chr.maf0.001.N300.hdf5')
    >>> on_disk = PstHdf5(hdf5_file2)[[3,4],::2] # Construct a subsetting PstReader. No data is read.
    >>> data_subset = on_disk.read() # from disk, read the values for the rows with index 3 and 4 AND cols with even numbered indexes.
    >>> print(data_subset.val[0,2]) # print the value with subset row index 0 and col index 2 (corresponding to row index 3 and col index 4 in the full data). No data is read from disk.
    1.0
    

When Matrix Data is Re-Read and Copied:

Every time you call a PstReader’s read() method, the PstReader re-reads the value data and returns a new in-memory PstData (with PstData.val property containing a new ndarray of the values).

Here is an example of what not to do, because it causes all the matrix value data to be read twice.

>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read.
>>> # Not recommended because it inefficiently reads all the values twice.
>>> print(on_disk.read().val[0,2]) # read all values into a new PstData, print a value
2.0
>>> print(on_disk.read().val[0,3]) # read all values (again) into a second new PstData, print a value
2.0

Here are two efficient alternatives. First, if all values can all fit in memory, read them once into a PstData and then access that PstData multiple times.

>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read.
>>> data1 = on_disk.read() # read all values into a new PstData
>>> print(data1.val[0,2]) # print a value from data1's in-memory ndarray
2.0
>>> print(data1.val[0,3]) # print another value from data1's in-memory ndarray.
2.0

Second, if the value data is too large to fit in memory, use subsetting to read only the values of interest from disk.

>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read.
>>> print(on_disk[0,2].read().val[0,0]) #Define the subset of data and read only that subset from disk.
2.0
>>> print(on_disk[0,3].read().val[0,0]) #Define a second subset of data and read only that subset from disk.
2.0

Because the in-memory PstData class is a kind of PstReader, you may read from it, too. Doing so creates a new PstData instance containing a copy of the matrix values in a new ndarray.

>>> on_disk = PstHdf5(hdf5_file) # Construct a PstHdf5 PstReader. No data is read.
>>> data1 = on_disk.read() # read all matrix values from disk into a new PstData
>>> print(data1.val is data1.val) # Do the in-memory SNP values use the same memory as themselves? Yes
True
>>> data2 = data1.read() # copy all the matrix values into a new ndarray in a new PstData
>>> print(data2.val is data1.val) # Do the two ndarrays of in-memory matrix values use the same memory?
False

Avoiding Unwanted ndarray Allocations

You may want a subset of matrix values from an in-memory PstData, and you may know that this subset and the original PstData can safely share the memory of the ndarray of matrix values. For this case, read() has optional parameters called view_ok and order. If you override the defaults of “view_ok=False,order=’F’” with “view_ok=True,order=’A’”, then read() will, if practical, return a new PstData whose ndarray shares memory with the original ndarray. Use these parameters with care because any change to either ndarray will affect the other. Also keep in mind that read() relies on ndarray’s mechanisms to decide whether to actually share memory, so it may ignore your suggestion and allocate a new ndarray anyway.

>>> on_disk = PstNpz(pstnpz_file) # Construct a PstNpz PstReader. No data is read.
>>> data1 = on_disk.read() # read all data from disk into a PstData with a new ndarray
>>> column01 = data1[:,0:1].read(view_ok=True,order='A') #create PstData with the data from just the first two columns. Sharing memory is OK. The memory may be laid out in any order (that is col-major and row-major are both OK).
>>> import numpy as np
>>> #print(np.may_share_memory(data1.val, column01.val)) # Do the two ndarray's share memory? They could (but currently they won't)
>>> column201 = data1[:,[2,0,1]].read(view_ok=True,order='A') #create PstData with the data from three columns, permuted. Sharing memory is OK.
>>> print(np.may_share_memory(data1.val, column201.val)) # Do the two ndarray's share memory? No, ndarray decided that this indexing was too complex for sharing.
False

Creating Subsetting PstReaders with Indexing

You often don’t want to read the matrix values for all rows and cols. You can use indexing to create a subsetting PstReader that will read only the matrix values of interest.

PstReaders support the indexing formats supported by ndarray plus two generalizations. Here are examples of indexing with an array of indexes, with slicing, and with an array of Booleans.

>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format
>>> subset_reader_1 = on_disk[[3,-1],:] #index with an array of indexes (negatives count from end)
>>> print((subset_reader_1.row_count, subset_reader_1.col_count))
(2, 1015)
>>> data1 = subset_reader_1.read() # read just the two rows of interest from the disk
>>> subset_reader_2 = on_disk[:,:0:-2] #index with a slice
>>> print((subset_reader_2.row_count, subset_reader_2.col_count))
(300, 507)
>>> boolindexes = [s.startswith(b'23_') for s in on_disk.col] # create a Boolean index of cols that start '23_'
>>> subset_reader_3 = on_disk[:,boolindexes] #index with array of Booleans
>>> print((subset_reader_3.row_count, subset_reader_3.col_count))
(300, 24)

The first generalization over what ndarray offers is full indexing on both the row dimension and the col dimension, in other words, full multidimensional indexing. For example,

>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format
>>> subset_reader_4 = on_disk[[3,4],:0:-2] # index on two dimensions at once
>>> print((subset_reader_4.row_count, subset_reader_4.col_count))
(2, 507)

The second generalization is indexing on a single integer index.

>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format
>>> subset_reader_5 = on_disk[5,:] #index with single integer
>>> print((subset_reader_5.row_count, subset_reader_5.col_count))
(1, 1015)

Indexing is also useful when you have matrix values in memory via a PstData and want to copy a subset of those values. While you could instead index directly on the PstData.val ndarray, by indexing on the PstData instance you also get row and col information.

>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format
>>> data1 = on_disk.read() # read all matrix values into memory
>>> print2(data1.col[:9]) # print the first 9 cols
['1_12' '1_34' '1_10' '1_35' '1_28' '1_25' '1_36' '1_39' '1_4']
>>> data_subset = data1[:,::2].read(view_ok=True,order='A') # create a copy or view with every other col
>>> print2(data_subset.col[:9]) # print the first 9 cols in the subset
['1_12' '1_10' '1_28' '1_36' '1_4' '1_11' '1_32' '1_9' '1_17']

You can apply indexing on top of indexing to specify subsets of subsets of data to read. In this example, only the values for every 16th col are actually read from the disk.

>>> # These are just PstReaders, nothing is read from disk yet
>>> on_disk = PstNpz(pstnpz_file) # Specify some data on disk in PstNpz format
>>> half_reader = on_disk[:,::2] # a reader for half the cols
>>> quarter_reader = half_reader[:,::2] # a reader for half of half the cols
>>> sixteenth_reader = quarter_reader[:,::2][:,::2] # a reader for half of half of half of half the cols
>>> print(sixteenth_reader) #Print the specification of this reader
PstNpz(...tests/datasets/all_chr.maf0.001.N300.pst.npz')[:,::2][:,::2][:,::2][:,::2]
>>> # Now we read from disk. Only values for one col in every 16 will be read.
>>> data_sixteenth = sixteenth_reader.read()
>>> print(data_sixteenth.val[0,3])
2.0

The read() Method

By default, read() returns an ndarray of numpy.float64 laid out in memory in F-contiguous order (row index varies the fastest). You may, instead, ask for numpy.float32, for C-contiguous order, or for any order. See read() for details.
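What F-contiguous versus C-contiguous means for the returned array can be seen with plain NumPy. This is a sketch of the memory-layout concepts, not of read() itself:

```python
# Sketch: the two memory layouts and dtypes that read() can produce.
import numpy as np

vals = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
f_array = np.array(vals, dtype=np.float64, order="F")  # column-major: row index varies fastest
c_array = np.array(vals, dtype=np.float32, order="C")  # row-major: col index varies fastest

print(f_array.flags["F_CONTIGUOUS"])  # True
print(c_array.flags["C_CONTIGUOUS"])  # True
print(c_array.dtype)                  # float32
```

Both arrays hold the same values; only the in-memory ordering and element width differ, which matters when passing the buffer to libraries that expect a specific layout.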

Details of Methods & Properties:

property col

An ndarray of the col ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.

Return type:

ndarray (length col_count)

This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

Example:

>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.col[-1]) # print the last column id.
z
>>> data2 = PstData(row=['a','b','c'],col=[[1,0],[2,0]],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.col[-1]) # print the last column id.
[2 0]
property col_count

number of cols

Return type:

integer

This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

property col_property

An ndarray of the additional information for each col. Each element is an ndarray.

Return type:

1- or 2-dimensional ndarray (length col_count)

This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

Example:

>>> from pysnptools.pstreader import PstNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz')
>>> on_disk = PstNpz(pstnpz_file)
>>> print(on_disk.col_property[:3]) # print column information for the first three cols
[[1.         0.00800801 0.        ]
 [1.         0.023023   1.        ]
 [1.         0.0700701  4.        ]]
col_to_index(list)

Takes a list of column ids and returns a list of index numbers

Parameters:

list (list) – list of cols

Return type:

ndarray of int

This method (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

Example:

>>> from pysnptools.pstreader import PstNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz')
>>> on_disk = PstNpz(pstnpz_file) # Specify matrix data on disk
>>> print(on_disk.col_to_index([b'1_10',b'1_13'])) #Find the indexes for two cols.
[2 9]
read(order='F', dtype=<class 'numpy.float64'>, force_python_only=False, view_ok=False, num_threads=None)

Reads the matrix values and returns a PstData (with PstData.val property containing a new ndarray of the matrix values).

Parameters:
  • order (string or None) – {‘F’ (default), ‘C’, ‘A’}, optional – Specify the order of the ndarray. If order is ‘F’ (default), then the array will be in F-contiguous order (row-index varies the fastest). If order is ‘C’, then the returned array will be in C-contiguous order (col-index varies the fastest). If order is ‘A’, then the PstData.val ndarray may be in any order (either C-, Fortran-contiguous).

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the PstData.val ndarray.

  • force_python_only (bool) – optional – If False (default), may use outside library code. If True, requests that the read be done without outside library code.

  • view_ok (bool) – optional – If False (default), allocates new memory for the PstData.val ndarray. If True, and if practical and reading from a PstData, will return a new PstData with an ndarray that shares memory with the original PstData. Typically, you’ll also wish to use “order=’A’” to increase the chance that sharing will be possible. Use these parameters with care because any change to either ndarray will affect the other. Also keep in mind that read() relies on ndarray’s mechanisms to decide whether to actually share memory, so it may ignore your suggestion and allocate a new ndarray anyway.

  • num_threads (None or int) – optional – The number of threads with which to read data. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

PstData

Calling the method again causes the matrix values to be re-read and creates a new in-memory PstData with a new ndarray of matrix values.

If you request the values for only a subset of the rows or cols, (to the degree practical) only that subset will be read from disk.

Example:

>>> from pysnptools.pstreader import PstHdf5
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file('pysnptools/examples/toydata.iidmajor.snp.hdf5')
>>> on_disk = PstHdf5(hdf5_file) # Specify matrix data on disk
>>> pstdata1 = on_disk.read() # Read all the matrix data returning a PstData instance
>>> print(type(pstdata1.val).__name__) # The PstData instance contains a ndarray of the data.
ndarray
>>> subset_pstdata = on_disk[:,::2].read() # From the disk, read matrix values for every other sid
>>> print(subset_pstdata.val[0,0]) # Print the first matrix value in the subset
1.0
>>> subsub_pstdata = subset_pstdata[:10,:].read(order='A',view_ok=True) # Create an in-memory subset of the subset with matrix values for the first ten iids. Share memory if practical.
>>> import numpy as np
>>> # print(np.may_share_memory(subset_pstdata.val, subsub_pstdata.val)) # Do the two ndarray's share memory? They could. Currently they won't.
property row

A ndarray of the row ids. Each id can be anything, for example, a string, an array of two strings, a number, etc.

Return type:

ndarray (length row_count)

This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

Example:

>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.row[:2]) # print the first two row ids
['a' 'b']
>>> data2 = PstData(row=[[1,0],[2,0],[2,1]],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data2.row[:2]) # print the first two row ids
[[1 0]
 [2 0]]
property row_count

number of rows

Return type:

integer

This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

property row_property

A ndarray of the additional information for each row. Each element is a ndarray.

Return type:

1- or 2-dimensional ndarray (length row_count)

This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

Example:

>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> print(data1.row_property[:2]) # print the first two row property values
['A' 'B']
row_to_index(list)

Takes a list of row ids and returns a list of index numbers

Parameters:

list – list of rows

Return type:

ndarray of int

This method (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

Example:

>>> from pysnptools.pstreader import PstNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>> pstnpz_file = example_file('tests/datasets/all_chr.maf0.001.N300.pst.npz')
>>> on_disk = PstNpz(pstnpz_file) # Specify matrix data on disk
>>> print(on_disk.row_to_index([[b'POP1',b'44'],[b'POP1',b'12']])) #Find the indexes for two rows.
[2 1]
property shape

number of rows and number of cols

Return type:

tuple of two integers

This property (to the degree practical) reads only row and col data from the disk, not matrix value data. Moreover, the row and col data is read from file only once.

pstreader.PstData

class pysnptools.pstreader.PstData(row, col, val, row_property=None, col_property=None, name=None, parent_string=None, copyinputs_function=None)

A PstReader for holding values in-memory, along with related row and col information. It is usually created by calling the PstReader.read() method on another PstReader, for example, PstNpz. It can also be constructed directly.

See PstReader for details and examples.

Constructor:
Parameters:
  • row (an array of anything) – The PstReader.row information

  • col (an array of anything) – The PstReader.col information

  • val (a 2-D array of floats or an array of floats) – The values

  • row_property (optional, an array of anything) – Additional information associated with each row.

  • col_property (optional, an array of anything) – Additional information associated with each col.

  • name (optional, string) – Information to be display about the origin of this data

  • copyinputs_function (optional, function) – Used internally by optional clustering code

Example:

>>> from pysnptools.pstreader import PstData
>>> pstdata = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]])
>>> print(pstdata.val[0,1], pstdata.row_count, pstdata.col_count)
2.0 2 3

Equality:

Two PstData objects are equal if their five arrays (PstData.val, PstReader.row, PstReader.col, PstReader.row_property, and PstReader.col_property) are equal. (Their ‘name’ does not need to be the same). If either PstData.val contains NaN, the objects will not be equal. However, PstData.allclose() can be used to treat NaN as regular values. Any NaN’s in the other four arrays are treated as regular values.

Example:

>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> pstdata1 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]])
>>> pstdata2 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]])
>>> print(pstdata1 == pstdata2) #True, because all the arrays have the same values.
True
>>> print(pstdata1.val is pstdata2.val) #False, because the two arrays have different memory.
False
>>> pstdata3 = PstData(row=['a','b'], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]])
>>> pstdata4 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,2.]])
>>> print(pstdata3 == pstdata4) #False, because the rows have different ids.
False
>>> pstdata5 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]])
>>> pstdata6 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]])
>>> print(pstdata5 == pstdata6) #False, because the val's contain NaN
False
>>> print(pstdata5.allclose(pstdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True

Methods beyond PstReader

allclose(value, equal_nan=True)
Parameters:
  • value (PstData) – Other object with which to compare.

  • equal_nan (bool) – (Default: True) Tells if NaN in PstData.val should be treated as regular values when testing equality.

>>> import numpy as np
>>> from pysnptools.pstreader import PstData
>>> pstdata5 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]])
>>> pstdata6 = PstData(row=[['fam0','iid0'],['fam0','iid1']], col=['snp334','snp349','snp921'], val=[[0.,2.,0.],[0.,1.,np.nan]])
>>> print(pstdata5.allclose(pstdata6)) #True, if we consider the NaN as regular values, all the arrays have the same values.
True
>>> print(pstdata5.allclose(pstdata6,equal_nan=False)) #False, if we consider the NaN as special values, all the arrays are not equal.
False
property val

The 2D NumPy array of floats (or array of floats) that represents the values. You can get or set this property.

>>> from pysnptools.pstreader import PstNpz
>>> from pysnptools.util import example_file
>>>
>>> pstnpz_file = example_file('tests/datasets/little.pst.npz')
>>> pstdata = PstNpz(pstnpz_file)[:5,:].read() #read data for first 5 rows
>>> print(pstdata.val[4,5]) #print one of the values
2.0
property val_shape

Tells the shape of the value for a given row and col. If None, each value is a single scalar.

pstreader.PstHdf5

class pysnptools.pstreader.PstHdf5(filename, block_size=5000)

A PstReader for reading *.pst.hdf5 files from disk.

See PstReader for general examples of using PstReaders.

The general HDF5 format is described here. The PstHdf5 format stores val, row, col, row_property, and col_property information in Hdf5 format.

Constructor:
Parameters:
  • filename (string) – The PstHdf5 file to read.

Example:

>>> from pysnptools.pstreader import PstHdf5
>>> from pysnptools.util import example_file # Download and return local file name
>>>
>>> psthdf5_file = example_file('pysnptools/examples/toydata.iidmajor.snp.hdf5')
>>> on_disk = PstHdf5(psthdf5_file) # PstHdf5 can load .pst.hdf5, .snp.hdf5, and kernel.hdf5
>>> print(on_disk.row_count)
500

Methods beyond PstReader

flush()

Flush the connection to the HDF5 file. (If values or properties are accessed again, the file will be reopened.)

>>> import pysnptools.util as pstutil
>>> from pysnptools.pstreader import PstHdf5
>>> from pysnptools.util import example_file # Download and return local file name
>>> hdf5_file = example_file("pysnptools/examples/toydata.kernel.hdf5")
>>> reader = PstHdf5(hdf5_file)
>>> val1 = reader[0,0].read()
>>> reader.flush()
static write(filename, pstdata, hdf5_dtype=None, col_major=True)

Writes a PstData to PstHdf5 format and returns the PstHdf5.

Parameters:
  • filename (string) – the name of the file to create

  • pstdata (PstData) – The in-memory data that should be written to disk.

  • hdf5_dtype (string) – None (use the .val’s dtype) or a Hdf5 dtype, e.g. ‘f8’,’f4’,etc.

  • col_major (bool) – Tells if vals should be stored on disk in col_major (default) or row_major format.

Return type:

PstHdf5

>>> import numpy as np
>>> from pysnptools.pstreader import PstData, PstHdf5
>>> import pysnptools.util as pstutil
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> pstutil.create_directory_if_necessary("tempdir/tiny.pst.hdf5")
>>> PstHdf5.write("tempdir/tiny.pst.hdf5",data1)          # Write data in PstHdf5 format
PstHdf5('tempdir/tiny.pst.hdf5')

pstreader.PstNpz

class pysnptools.pstreader.PstNpz(filename)

A PstReader for reading *.pst.npz files from disk.

See PstReader for general examples of using PstReaders.

The general NPZ format is described here. The PstNpz format stores val, row, col, row_property, and col_property information in NPZ format.

Constructor:
Parameters:
  • filename (string) – The PstNpz file to read.

Example:

>>> from pysnptools.pstreader import PstNpz
>>> from pysnptools.util import example_file # Download and return local file name
>>>
>>> pstnpz_file = example_file('tests/datasets/little.pst.npz')
>>> data_on_disk = PstNpz(pstnpz_file)
>>> print(data_on_disk.row_count)
300

Methods beyond PstReader

static write(filename, pstdata)

Writes a PstData to PstNpz format and returns the PstNpz.

Parameters:
  • filename (string) – the name of the file to create

  • pstdata (PstData) – The in-memory data that should be written to disk.

Return type:

PstNpz

>>> import numpy as np
>>> from pysnptools.pstreader import PstData, PstNpz
>>> import pysnptools.util as pstutil
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> pstutil.create_directory_if_necessary("tempdir/tiny.pst.npz")
>>> PstNpz.write("tempdir/tiny.pst.npz",data1)          # Write data in PstNpz format
PstNpz('tempdir/tiny.pst.npz')

pstreader.PstMemMap

class pysnptools.pstreader.PstMemMap(filename)

A PstData that keeps its data in a memory-mapped file. This allows working with data larger than fits in main memory.

See PstData for general examples of using PstData.

Constructor:
Parameters:

filename (string) – The *.pst.memmap file to read.

Also see PstMemMap.empty() and PstMemMap.write().

Example:

>>> from pysnptools.pstreader import PstMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>>
>>> pst_mem_map_file = example_file('pysnptools/examples/tiny.pst.memmap')
>>> pst_mem_map = PstMemMap(pst_mem_map_file)
>>> print(pst_mem_map.val[0,1], pst_mem_map.row_count, pst_mem_map.col_count)
2.0 3 2

Methods beyond PstReader

static empty(row, col, filename, row_property=None, col_property=None, order='F', dtype=<class 'numpy.float64'>, val_shape=None)

Create an empty PstMemMap on disk.

Parameters:
  • row (an array of anything) – The PstReader.row information

  • col (an array of anything) – The PstReader.col information

  • filename (string) – name of memory-mapped file to create

  • row_property (an array of anything) – optional – The additional PstReader.row_property information associated with each row. Default: None

  • col_property (an array of anything) – optional – The additional PstReader.col_property information associated with each col. Default: None

  • order (string or None) – {‘F’ (default), ‘C’}, optional – Specify the order of the ndarray.

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for the PstMemMap.val ndarray.

  • val_shape (None or a number) – (Default: None), optional – The shape of the last dimension of PstMemMap.val. None means each value is a scalar.

Return type:

PstMemMap

>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.pstreader import PstMemMap
>>> filename = "tempdir/tiny.pst.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> pst_mem_map = PstMemMap.empty(row=['a','b','c'],col=['y','z'],filename=filename,row_property=['A','B','C'],order="F",dtype=np.float64)
>>> pst_mem_map.val[:,:] = [[1,2],[3,4],[np.nan,6]]
>>> pst_mem_map.flush()
property filename

The name of the memory-mapped file

flush()

Flush PstMemMap.val to disk and close the file. (If values or properties are accessed again, the file will be reopened.)

>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> from pysnptools.pstreader import PstMemMap
>>> filename = "tempdir/tiny.pst.memmap"
>>> pstutil.create_directory_if_necessary(filename)
>>> pst_mem_map = PstMemMap.empty(row=['a','b','c'],col=['y','z'],filename=filename,row_property=['A','B','C'],order="F",dtype=np.float64)
>>> pst_mem_map.val[:,:] = [[1,2],[3,4],[np.nan,6]]
>>> pst_mem_map.flush()
property offset

The byte position in the file where the memory-mapped values start.

property val

The NumPy memmap array of floats that represents the values. You can get this property, but cannot set it (except with itself)

>>> from pysnptools.pstreader import PstMemMap
>>> from pysnptools.util import example_file # Download and return local file name
>>>
>>> pst_mem_map_file = example_file('pysnptools/examples/tiny.pst.memmap')
>>> pst_mem_map = PstMemMap(pst_mem_map_file)
>>> print(pst_mem_map.val[0,1])
2.0
static write(filename, pstdata)

Writes a PstData to PstMemMap format and returns the PstMemMap.

Parameters:
  • filename (string) – the name of the file to create

  • pstdata (PstData) – The in-memory data that should be written to disk.

Return type:

PstMemMap

>>> import pysnptools.util as pstutil
>>> from pysnptools.pstreader import PstData, PstMemMap
>>> data1 = PstData(row=['a','b','c'],col=['y','z'],val=[[1,2],[3,4],[np.nan,6]],row_property=['A','B','C'])
>>> pstutil.create_directory_if_necessary("tempdir/tiny.pst.memmap")
>>> PstMemMap.write("tempdir/tiny.pst.memmap",data1)      # Write data1 in PstMemMap format
PstMemMap('tempdir/tiny.pst.memmap')

util Module

util

pysnptools.util.array_module(xp=None)

Find the array module to use, for example numpy or cupy.

Parameters:

xp (optional, string or Python module) – The array module to use, for example, ‘numpy’ (normal CPU-based module) or ‘cupy’ (GPU-based module). If not given, will try to read from the ARRAY_MODULE environment variable. If not given and ARRAY_MODULE is not set, will use numpy. If ‘cupy’ is requested, will try to ‘import cupy’. If that import fails, will revert to numpy.

Return type:

Python module

>>> from pysnptools.util import array_module
>>> xp = array_module() # will look at environment variable
>>> print(xp.zeros((3)))
[0. 0. 0.]
>>> xp = array_module('cupy') # will try to import 'cupy'
>>> print(xp.zeros((3)))
[0. 0. 0.]
pysnptools.util.asnumpy(a)

Given an array created with any array module, return the equivalent numpy array. (Returns a numpy array unchanged.)

>>> from pysnptools.util import asnumpy, array_module
>>> xp = array_module('cupy')
>>> zeros_xp = xp.zeros((3)) # will be cupy if available
>>> zeros_np = asnumpy(zeros_xp) # will be numpy
>>> zeros_np
array([0., 0., 0.])
pysnptools.util.create_directory_if_necessary(name, isfile=True, robust=False)

Create a directory for a file if the directory doesn’t already exist.

Parameters:
  • name (string) – file or directory name

  • isfile (bool) – If True (default), the name is a file, otherwise it is a directory.

  • robust (bool) – If False (default), will try once to create the directory. If True, will try 25 times.

pysnptools.util.format_delta(delta_seconds)

Format a time delta nicely.

Parameters:

delta_seconds (number) – The number of seconds

Return type:

string

>>> from pysnptools.util import format_delta
>>> print(format_delta(86403.5))
1 day, 0:00:03.500000
pysnptools.util.get_array_module(a)

Given an array, returns the array’s module, for example, numpy or cupy. Works for numpy even when cupy is not available.

>>> import numpy as np
>>> from pysnptools.util import get_array_module
>>> zeros_np = np.zeros((3))
>>> xp = get_array_module(zeros_np)
>>> xp.ones((3))
array([1., 1., 1.])
pysnptools.util.intersect_apply(data_list, sort_by_dataset=True, intersect_before_standardize=True, is_test=False)

Intersects and sorts the iids from a list of datasets, returning new version of the datasets with all the same iids in the same order.

Parameters:
  • data_list (list) – list of datasets

  • sort_by_dataset (bool) – optional, If True (default), the iids are ordered according to the first non-None dataset. If False, the order is arbitrary, but consistent.

  • intersect_before_standardize (bool) – optional. Special code for SnpKernel, the class that postpones computing a kernel from SNP data. If True (default), SnpKernel will remove any iids before SNP standardization before computing the kernel. If False, SNPs will be standardized with all iids, then the kernel will be computed, then any iids will be removed.

Return type:

list of datasets

Here are the dataset formats understood and what is returned for each.

Dataset Format                                    What is Returned
None                                              None
A SnpReader                                       A new subsetting SnpReader with adjusted iid
A KernelReader                                    A new subsetting KernelReader with adjusted iid
A dictionary with [‘iid’] and [‘vals’] keys       The same dictionary, but with the iid and vals values adjusted
A tuple of the form (val ndarray, iid list)       A new tuple with the val ndarray and iid list adjusted

If the iids in all the datasets are already the same and in the same order, then the datasets are returned without change.

Notice that only dictionaries are processed in-place. Inputting a SnpReader or KernelReader returns a new instance of the same type (unless its iids are already correct). Inputting a tuple returns a new tuple (unless its iids are already correct).

Example:

>>> from pysnptools.snpreader import Bed, Pheno
>>> from pysnptools.kernelreader import SnpKernel
>>> from pysnptools.standardizer import Unit
>>> from pysnptools.util import intersect_apply
>>> #Create four datasets in different formats
>>> ignore_in = None
>>> from pysnptools.util import example_file # Download and return local file name
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> phenofile = example_file("tests/datasets/phenSynthFrom22.23.N300.randcidorder.txt")
>>> covfile = example_file("tests/datasets/all_chr.maf0.001.covariates.N300.txt")
>>> kernel_in = SnpKernel(Bed(bedfile,count_A1=False),Unit()) # Create a kernel from a Bed file
>>> pheno_in = Pheno(phenofile,missing="")
>>> cov = Pheno(covfile,missing="").read()
>>> cov_as_tuple_in = (cov.val,cov.iid) #We could do cov directly, but as an example we make it a tuple.
>>>
>>> # Create four new datasets with consistent iids
>>> ignore_out, kernel_out, pheno_out, cov_as_tuple_out = intersect_apply([ignore_in, kernel_in, pheno_in, cov_as_tuple_in])
>>> # Print the first five iids from each dataset
>>> print(ignore_out, kernel_out.iid[:5], pheno_out.iid[:5], cov_as_tuple_out[1][:5])
None [['POP1' '0']
 ['POP1' '12']
 ['POP1' '44']
 ['POP1' '58']
 ['POP1' '65']] [['POP1' '0']
 ['POP1' '12']
 ['POP1' '44']
 ['POP1' '58']
 ['POP1' '65']] [['POP1' '0']
 ['POP1' '12']
 ['POP1' '44']
 ['POP1' '58']
 ['POP1' '65']]
pysnptools.util.intersect_ids(idslist)

Deprecated: Use intersect_apply() instead.

Takes a list of 2d string arrays of family and individual ids. These are intersected.

Return type:

indarr, an array of size N x L, where N is the number of individuals in the intersection and L is the number of lists in idslist. It contains the indexes to use (in order) so that all individuals will be identical and in the same order across all data sets.

If one of the lists is None, it is ignored (but values are still reported for it in indarr, all equal to -1).

pysnptools.util.log_in_place(name, level, time_lambda=<built-in function time>, show_log_diffs=False)

Create a one-argument function to write messages to. If the logging level is met, messages will appear. All messages will be written on the same line and will include the elapsed time.

Example:

from pysnptools.util import log_in_place
import logging
import time
logging.basicConfig(level=logging.INFO)

with log_in_place("counting", logging.INFO) as updater:
    for i in range(100):
        updater(i)
        time.sleep(.1) # typically, some work -- not a delay -- goes here

Outputs 100 messages on the same line, ending with something like “counting – time=0:00:09.99, 99”

pysnptools.util.print2(arg)

Make printing under Python3 look the same as under Python2.

pysnptools.util.sub_matrix(val, row_index_list, col_index_list, order='A', dtype=<class 'numpy.float64'>, num_threads=None)

Efficiently creates a sub-matrix from a 2-D or 3-D ndarray.

Parameters:
  • val (ndarray) – The ndarray from which to copy.

  • row_index_list (list of integers) – Tells which rows (and in which order) to copy into the sub-matrix

  • col_index_list (list of integers) – Tells which columns (and in which order) to copy into the sub-matrix

  • order (string or None) – {‘A’ (default), ‘C’, ‘F’}, optional – Specify the order of the sub-matrix to create. If order is ‘C’, then the returned sub-matrix will be in C-contiguous order (first index varies the fastest). If order is ‘F’, then the array will be in F-contiguous order (second index varies the fastest). If order is ‘A’, then sub-matrix may be in any order F or C.

  • dtype (data-type) – {numpy.float64 (default), numpy.float32}, optional – The data-type for sub-matrix created.

  • num_threads (None or int) – optional – The number of threads with which to work. Defaults to all available processors. Can also be set with these environment variables (listed in priority order): ‘PST_NUM_THREADS’, ‘NUM_THREADS’, ‘MKL_NUM_THREADS’.

Return type:

ndarray

>>> import numpy as np
>>> import pysnptools.util as pstutil
>>> np.random.seed(0) # set seed so that results are deterministic
>>> matrix = np.random.rand(12,7) # create a 12 x 7 ndarray
>>> submatrix = pstutil.sub_matrix(matrix,[0,2,11],[6,5,4,3,2,1,0])
>>> print(int(submatrix.shape[0]),int(submatrix.shape[1]))
3 7
>>> print(matrix[2,0] == submatrix[1,6]) #The row # 2 is now #1, the column #0 is now #6.
True

Note: Behind the scenes, for performance, this function calls a Rust helper function.

pysnptools.util.weighted_mean(ys, weights)
Parameters:
  • ys (ndarray) – The ndarray of values

  • weights (ndarray) – the weight of each value (unnormalized is fine)

Return type:

the weighted mean

>>> import numpy as np
>>> from pysnptools.util import weighted_mean
>>> ys = np.array([103.664086,89.80645161,83.86888046,90.54141176])
>>> weights = np.array([2.340862423,4.982888433,0.17522245,0.098562628])
>>> round(weighted_mean(ys, weights),5)
93.9487
pysnptools.util.weighted_simple_linear_regression(xs, ys, weights)
Parameters:
  • xs (ndarray) – The ndarray of independent values

  • ys (ndarray) – The ndarray of dependent values

  • weights (ndarray) – the weight of each case (unnormalized is fine)

Return type:

slope, intercept, xmean, ymean

>>> import numpy as np
>>> from pysnptools.util import weighted_simple_linear_regression
>>> xs = np.array([53.8329911,57.49486653,60.07392197,60.21081451])
>>> ys = np.array([103.664086,89.80645161,83.86888046,90.54141176])
>>> weights = np.array([2.340862423,4.982888433,0.17522245,0.098562628])
>>> slope, intercept, xmean, ymean = weighted_simple_linear_regression(xs, ys, weights)
>>> print(round(slope,5), round(intercept,5), round(xmean,5), round(ymean,5))
-3.52643 293.05586 56.46133 93.9487
pysnptools.util.example_file(pattern, endswith=None)

Returns the local location of a PySnpTools example file, downloading it if needed.

Parameters:
  • pattern – The name of the example file of interest. A file name pattern may be given. All matching files will be downloaded (if needed) and the name of one will be returned.

  • endswith – The pattern of the file name to return. By default, when no endswith is given, the name of the first matching file will be returned.

Return type:

string

By default, the local location will be under the system temp directory (typically controlled with the TEMP environment variable). Alternatively, the directory can be set with the PYSNPTOOLS_CACHE_HOME environment variable.

This function knows the MD5 hash of all PySnpTools example files and uses that content-based hash to decide if a file needs to be downloaded.

>>> from pysnptools.util import example_file # Download and return local file name
>>> # Download the phenotype file if necessary. Return its local location.
>>> pheno_fn = example_file("pysnptools/examples/toydata.phe")
>>> print('The local file name is ', pheno_fn)
The local file name is ...pysnptools/examples/toydata.phe
>>>
>>> # Download the bed,bim,&fam files if necessary. Return the location of bed file.
>>> bedfile = example_file("tests/datasets/all_chr.maf0.001.N300.*","*.bed")
>>> print('The local file name is ', bedfile)
The local file name is ...tests/datasets/all_chr.maf0.001.N300.bed
pysnptools.util.snp_gen(fst, dfr, iid_count, sid_count, maf_low=0.05, maf_high=0.5, seed=0, sibs_per_family=10, freq_pop_0=0.5, chr_count=None, label_with_pop=False)

Generates a random SnpData

Parameters:
  • fst (float) – Degree of Population Structure, e.g. 0 (a special case), 0.005, 0.01, 0.05, 0.1

  • dfr (float) – Degree of Family Relatedness, the fraction of individuals belonging to a family, i.e., fracSibs, e.g. 0.0, 0.5, 0.6, 0.7, 0.8, 0.9

  • iid_count (int) – The number of individuals to generate. Because of rounding the actual number may be less.

  • sid_count (int) – The number of snps to generate.

  • maf_low (float) – (default .05) lower bound of uniformly-generated Minor allele frequency

  • maf_high (float) – (default .5) upper bound of uniformly-generated Minor allele frequency

  • seed (int) – (default 0) Random seed

  • sibs_per_family (int) – (default 10) number of siblings in each family

  • freq_pop_0 (float) – (default .5) Fraction of individuals in population 0 (the rest will be in population 1)

  • chr_count (int) – (default one chromosome per SNP) Number of chromosomes to which SNPs should be assigned. The SNPs will be assigned as evenly as possible. Chromosome names are integers starting with 1. SNP positions within a chromosome are sequential integers starting with 1.

Return type:

SnpData

Example:

>>> from pysnptools.util import snp_gen
>>> snpdata = snp_gen(fst=.1,dfr=.5,iid_count=200,sid_count=20,maf_low=.05,seed=6)
>>> print(int(snpdata.iid_count), int(snpdata.sid_count)) #because of rounding got 190 individuals
190 20
Also See:

snpreader.SnpGen

util.IntRangeSet

class pysnptools.util.IntRangeSet(*ranges_inputs)

Bases: object

A class for efficiently manipulating ranges of integers (including negatives and longs) using set operations such as union(), intersection(), and difference.

The class differs from the built-in set class (and from Boolean numpy arrays) because it does not need to store every element of the set, only the start and stop of each contiguous range of elements. It differs from other Python interval libraries (that we know of) by being specialized and optimized for integer elements.

Example:

Here we take the union (operator “|”) of two IntRangeSets:

>>> from pysnptools.util import IntRangeSet
>>> a = IntRangeSet("100:500,501:1000") # a is the set of integers from 100 to 500 (exclusive) and 501 to 1000 (exclusive)
>>> b = IntRangeSet("-20,400:600")      # b is the set of integers -20 and the range 400 to 600 (exclusive)
>>> c = a | b                           # c is the union of a and b, namely -20 and 100 to 1000 (exclusive)
>>> print(c)
IntRangeSet('-20,100:1000')
Example:

Suppose we want to find the intron regions of a gene but we are given only the transcription region and the exon regions.

>>> from pysnptools.util import IntRangeSet
>>> line = "chr15   29370   37380   29370,32358,36715   30817,32561,37380"
>>> chr,trans_start,trans_last,exon_starts,exon_lasts = line.split() # split the line on white space
>>> trans_start = int(trans_start)
>>> trans_stop = int(trans_last) + 1 # add one to convert the inclusive "last" value into a Pythonesque exclusive "stop" value
>>> int_range_set = IntRangeSet((trans_start,trans_stop)) # creates a IntRangeSet from 29370 (inclusive) to 37381 (exclusive)
>>> print(int_range_set) # print at any time to see the current value
IntRangeSet('29370:37381')

Parse the exon start and last lists from strings to lists of integers (converting ‘last’ to ‘stop’)

>>> exon_starts = [int(start) for start in exon_starts.strip(",").split(',')]
>>> exon_stops = [int(last)+1 for last in exon_lasts.strip(",").split(',')]
>>> assert len(exon_starts) == len(exon_stops)

Zip together the two lists to create an iterable of exon_start,exon_stop tuples. Then ‘set subtract’ all these ranges from int_range_set

>>> int_range_set -= zip(exon_starts,exon_stops)
>>> print(int_range_set) # See what it looks like
IntRangeSet('30818:32358,32562:36715')

Create the desired output by iterating through each contiguous range of integers

>>> for start, stop in int_range_set.ranges():
...    print("{0}   {1}     {2}".format(chr, start, stop-1))
chr15   30818     32357
chr15   32562     36714

Ranges Input

The input to the IntRangeSet constructor and many of its methods is a ranges input.

A ranges input is one of the following:

  • A comma-separated string of integers or integer ranges, e.g., '100:500,500:1000,2000'

  • An integer or integer expression, e.g., 3

  • A tuple of two integers, start and stop, e.g., (2,8). stop is exclusive.

  • A slice with non-negative values, e.g. slice(2,8)

  • An IntRangeSet (or any class with a ranges() method), e.g., IntRangeSet(3)

  • A list or iterable (but not tuple) of ranges inputs, e.g., [1,6,7,(100,200)]

Example:

Strings:

>>> a = IntRangeSet("100:500,500:1000,2000")
>>> a = IntRangeSet('-10:-8,1,2,3:11')
>>> a = IntRangeSet('')

The ranges in a ranges input can overlap and can be in any order.

>>> assert IntRangeSet("2000,100:1500,500:1000,2000") == IntRangeSet('100:1500,2000')

Integers and Integer Expressions:

>>> a = IntRangeSet(7)
>>> a = IntRangeSet(151000000000) # longs are OK
>>> a = IntRangeSet(2*3+1) # integer expressions are OK

A tuple must have exactly two integers. They represent the start (inclusive) and stop (exclusive) integers in a range.

>>> assert IntRangeSet((2,8)) == IntRangeSet('2:8') # check 'set equality'
>>> assert 7 in IntRangeSet((2,8)) # The integer 7 is an element of the set
>>> assert IntRangeSet((2,3)) == IntRangeSet(2)

Lists and iterables of ranges inputs:

>>> assert IntRangeSet([1,6,7,(100,200)]) == IntRangeSet('1,6:8,100:200')
>>> assert IntRangeSet(range(0,100)) == IntRangeSet('0:100')
>>> assert IntRangeSet([range(100,200),3,'1000:2000',[4,6],(20,30)]) == IntRangeSet('3:5,6,20:30,100:200,1000:2000')

Some methods accept zero or more ranges inputs. Collectively, these are called ranges_inputs. For example:

>>> a = IntRangeSet() # zero ranges inputs
>>> a = IntRangeSet(3) # one ranges input
>>> a = IntRangeSet('3:11,5',14,100) # Three ranges inputs, a string and two integers.
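For illustration, here is a minimal parser for the string form of a ranges input. The function name parse_ranges is hypothetical, and this sketch ignores error handling that the real constructor may have:

```python
def parse_ranges(text):
    """Parse a 'start:stop,single' style string into half-open (start, stop) pairs.
    A bare integer n stands for the range (n, n+1); an empty string yields no ranges."""
    ranges = []
    for part in text.split(','):
        part = part.strip()
        if not part:
            continue
        if ':' in part:
            start, stop = part.split(':')
            ranges.append((int(start), int(stop)))
        else:
            n = int(part)
            ranges.append((n, n + 1))
    return ranges

print(parse_ranges('-10:-8,1,2,3:11'))  # [(-10, -8), (1, 2), (2, 3), (3, 11)]
```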

Most Important Methods and Operators

Description                          Method or Operator                              For Details

is a ‘set equal’ to b?               a == b                                          __eq__()
is a a superset of b?                b in a, a.issuperset(b), a >= b                 __contains__()
is a a subset of b?                  a <= b, a.issubset(b)                           __le__()
remove i th smallest element(s)      del a[i]                                        __delitem__()
get i th smallest element(s)         a[i]                                            __getitem__()
is a not ‘set equal’ to b?           a != b                                          __ne__()
is a a proper superset of b?         a > b                                           __gt__()
is a a proper subset of b?           a < b                                           __lt__()
union b into a                       a |= b, a += b, a.add(b,c), a.update(b,c)       add()
union a and b                        a | b, a + b, a.union(b,c)                      __or__()
intersect b into a                   a &= b, a.intersection_update(b,c)              __iand__()
intersect a and b                    a & b, a.intersection(b,c)                      intersection()
subtract b from a                    a -= b, a.difference_update(b), a.discard(b)    __isub__()
remove b from a, error if missing    a.remove(b)                                     remove()
a ‘set difference’ b                 a - b, a.difference(b)                          __sub__()
iterate integers in a, from low      for element in a:                               __iter__()
iterate integers in a, from high     for element in reversed(a):                     __reversed__()
symmetric difference into a          a ^= b                                          __ixor__()
a ‘symmetric difference’ b           a ^ b                                           __xor__()
count of integer elements in a       len(a)                                          __len__()
iterate ranges in a, from low        for start,stop in a.ranges():                   ranges()
count of ranges in a                 a.ranges_len                                    ranges_len
index of range containing b          a.ranges_index(b)                               ranges_index()
get i th smallest range              a.ranges_getitem(i)                             ranges_getitem()
a as a string                        str(a)                                          __str__()
remove all elements from a           a.clear()                                       clear()
copy a                               a.copy()                                        copy()
find index of b in a                 a.index(b)                                      index()
are a and b disjoint?                a.isdisjoint(b)                                 isdisjoint()
is a empty?                          a.isempty                                       isempty
smallest element of a                a.min()                                         min()
largest element of a                 a.max()                                         max()
sum of elements in a                 a.sum()                                         sum()
remove and return an element         a.pop()                                         pop()

Examples, Tips, and Warnings

The argument to a method and the right-hand side of an operator can be a ranges input rather than IntRangeSet. For example,

>>> big = IntRangeSet('-1000:2000')
>>> print(big - '10:20,500') # We can subtract this string because it is a legal ranges input.
IntRangeSet('-1000:10,20:500,501:2000')

This even works for equality testing:

>>> assert big - '10:21,500' == '-1000:10,21:500,501:2000'

The Python in operator is backwards from other operators, so the left-hand side can be any ranges input.

>>> assert 501 in big

Like most other Python libraries you specify a range with an inclusive start integer and an exclusive stop.

>>> print(7 in IntRangeSet('4:8')) # includes 7
True
>>> print(8 in IntRangeSet('4:8')) # excludes 8
False
>>> print(8 in IntRangeSet(range(4,7))) # also excludes 8
False

Be careful with ranges inputs specified with tuples. Suppose we want an IntRangeSet containing 4,5,6,7.

>>> assert IntRangeSet((4,8)) == '4:8' #OK
>>> assert IntRangeSet([(4,8)]) == '4:8' # List of a tuple is OK
>>> assert IntRangeSet(4,8) == '4,8' #No. This is not a tuple. It is two integer inputs, so we get only 4 and 8
>>> assert IntRangeSet([4,8]) == '4,8' #No. List of two integers, so get only 4 and 8
>>> #Illegal: IntRangeSet((4,8,10)) # tuples must be pairs

Methods and Operators

__contains__(*ranges_inputs)

True exactly when all the ranges inputs are subsets of the IntRangeSet.

These are the same:

  • b in a

  • a.issuperset(b)

  • a >= b

Example:

>>> print(3 in IntRangeSet('0:5,6:11'))
True
>>> print(IntRangeSet('4:7') in IntRangeSet('0:5,6:11'))
False
>>> '6:9' in IntRangeSet('0:5,6:11') # The left-hand of 'in' can be any ranges input
True
>>> print(IntRangeSet('0:5,6:11') >= '6:9') # The right-hand side can be any ranges input
True

The ‘issuperset’ method also supports unioning multiple ranges inputs.

Example:

>>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8))
True
>>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8,100))
False

Note: By definition, any set is a superset of itself.

__delitem__(key)

Remove elements from the IntRangeSet by position index. Position index can be specified by an integer with negative integers counting from the end. Position indexes can also be specified with slices and a ranges input.

Example:

Removing with an integer position index:

>>> a = IntRangeSet('100:200,1000')
>>> del a[2]
>>> print(a)
IntRangeSet('100:102,103:200,1000')
>>> del a[-1]
>>> print(a)
IntRangeSet('100:102,103:200')
Example:

Removing with a slice:

>>> a = IntRangeSet('100:200,1000')
>>> del a[2:11]
>>> print(a)
IntRangeSet('100:102,111:200,1000')
Example:

Removing with a ranges input:

>>> a = IntRangeSet('100:200,1000')
>>> del a['2:11']
>>> print(a)
IntRangeSet('100:102,111:200,1000')
__eq__(other)

True exactly when the IntRangeSet on the left is set equivalent to the ranges input on the right.

  • a == b

>>> print(IntRangeSet('0:10,12') == IntRangeSet('0:10,12'))
True
>>> print(IntRangeSet('0:10,12') == IntRangeSet('0:10'))
False
>>> print(IntRangeSet('0:10,12') == IntRangeSet('12,0:5,5:10'))
True
>>> print(IntRangeSet('0:10,12') == '0:10,12') # The right-hand can be any ranges input
True
__ge__(other)

See IntRangeSet.__contains__()

__getitem__(key)

a[i] returns the ith integer in sorted order (origin 0) from a, an IntRangeSet

>>> print(IntRangeSet('100:200,1000')[0])
100
>>> print(IntRangeSet('100:200,1000')[10])
110

If i is negative, the indexing goes from the end

>>> print(IntRangeSet('100:200,1000')[-1])
1000

Python’s standard slice notation may be used and returns IntRangeSets. (Remember that the stop number in slice notation is exclusive.)

>>> print(IntRangeSet('100:200,1000')[0:10]) # Integers 0 (inclusive) to 10 (exclusive)
IntRangeSet('100:110')
>>> print(IntRangeSet('100:200,1000')[0:10:2]) # Integers 0 (inclusive) to 10 (exclusive) with step 2
IntRangeSet('100,102,104,106,108')
>>> print(IntRangeSet('100:200,1000')[-3:]) # The last three integers in the IntRangeSet.
IntRangeSet('198:200,1000')

An IntRangeSet can also be accessed with any ranges input.

>>> IntRangeSet('100:200,1000')['0:10,20']
IntRangeSet('100:110,120')
__gt__(other)

True exactly when the IntRangeSet is a proper superset of the ranges.

  • a > b

Example:

>>> print(IntRangeSet('0:5,6:11') > '7:10') # The right-hand can be any ranges input
True

Note: By definition, no set is a proper superset of itself.

__hash__ = None
__iadd__(*ranges_inputs)

See IntRangeSet.add()

__iand__()

Set the IntRangeSet to itself intersected with the input ranges

These are the same:

  • a &= b

  • a.intersection_update(b)

Example:

>>> a = IntRangeSet('0:5,6:11')
>>> a &= '3:8'
>>> print(a)
IntRangeSet('3:5,6:8')
__imul__(n)

Produces n copies of a unioned together in a, where a is an IntRangeSet. Because a is a set, the result will either be an empty IntRangeSet (when n is 0 or less) or the original a.

  • a *= n

Example:

>>> a = IntRangeSet('0:5,6:11')
>>> a *= 5
>>> print(a) # should be unchanged
IntRangeSet('0:5,6:11')
>>> a *= 0
>>> print(a) # should be empty
IntRangeSet('')
__init__(*ranges_inputs)

Create an IntRangeSet.

__ior__(*ranges_inputs)

See IntRangeSet.add()

__isub__(*ranges_inputs)

Remove the elements of the ranges inputs from the IntRangeSet

These are the same:

  • a -= b

  • a.difference_update(b)

  • a.discard(b)

remove is almost the same except that it raises a KeyError if any element of b is not in a.

  • a.remove(b)

Example:

>>> a = IntRangeSet('0:5,6:11')
>>> a -= '3:7'
>>> print(a)
IntRangeSet('0:3,7:11')

The ‘difference_update’, ‘discard’ and ‘remove’ methods also support subtracting multiple ranges inputs.

Example:

>>> a = IntRangeSet('0:5,6:11')
>>> a.difference_update('3:7','8:100')
>>> print(a)
IntRangeSet('0:3,7')
__iter__()

Iterate, in order from smallest to largest, the integer elements of the IntRangeSet

Example:

>>> for i in IntRangeSet('1:4,10'):
...    print(i)
1
2
3
10
__ixor__(ranges)

Set the IntRangeSet to contain exactly those elements that appear in either itself or the input ranges, but not both

These are the same:

  • a ^= b

  • a.symmetric_difference_update(b)

Example:

>>> a = IntRangeSet('0:5,6:11')
>>> a ^= '3:8'
>>> print(a)
IntRangeSet('0:3,5,8:11')
__le__(ranges)

True exactly when the IntRangeSet is a subset of the ranges.

These are the same:

  • a <= b

  • a.issubset(b)

Example:

>>> print(IntRangeSet('0:5,6:11') <= '-1:101') # The right-hand can be any ranges input
True

Note: By definition, any set is a subset of itself.

__len__()

The number of integer elements in the IntRangeSet

>>> print(len(IntRangeSet('0:10,12')))
11

Note: This is computed in time linear in the number of ranges, rather than integer elements.
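That linear-in-ranges cost follows because each half-open range (start, stop) contributes exactly stop - start elements. A sketch over a plain list of disjoint ranges:

```python
def count_elements(ranges):
    # Each half-open (start, stop) range holds stop - start integers
    return sum(stop - start for start, stop in ranges)

# IntRangeSet('0:10,12') stores the ranges (0, 10) and (12, 13)
print(count_elements([(0, 10), (12, 13)]))  # 11
```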

__lt__(ranges)

True exactly when the IntRangeSet is a proper subset of the ranges.

  • a < b

Example:

>>> print(IntRangeSet('0:5,6:11') < '-1:101') # The right-hand can be any ranges input
True

Note: By definition, no set is a proper subset of itself.

__mul__(n)

a * n produces the union of n shallow copies of a, where a is an IntRangeSet. Because a is a set, the result is either an empty IntRangeSet (when n is 0 or less) or a copy of the original IntRangeSet.

  • a * n

__ne__(other)

False exactly when the IntRangeSet on the left is set equivalent to the ranges input on the right.

These are the same.

  • a != b

  • not a == b

__or__()

Return the union of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.

These are the same:

  • a | b

  • a + b

  • a.union(b)

Example:

>>> print(IntRangeSet('0:5,6:10') | 5)
IntRangeSet('0:10')

The ‘union’ method also supports unioning multiple ranges inputs.

Example:

>>> print(IntRangeSet('0:5,6:10').union(5,'100:200'))
IntRangeSet('0:10,100:200')
__repr__()

Use the standard repr(a) function to create a string representation of a, an IntRangeSet.

>>> print("Hello " + repr(IntRangeSet(2,3,4,10)))
Hello IntRangeSet('2:5,10')
__reversed__()

reversed(a) is a generator that produces the integer elements of a in order from largest to smallest.

Example:

>>> for i in reversed(IntRangeSet('1:4,10')):
...     print(i)
10
3
2
1
__str__()

Use the standard str(a) function to create a string representation of a, an IntRangeSet.

>>> print("Hello " + str(IntRangeSet(2,3,4,10)))
Hello IntRangeSet('2:5,10')
__sub__(*ranges_inputs)

Return the set difference of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.

These are the same:

  • a - b

  • a.difference(b)

Example:

>>> print(IntRangeSet('0:5,6:11') - 1)
IntRangeSet('0,2:5,6:11')
>>> print(IntRangeSet('0:5,6:11') - '3:100')
IntRangeSet('0:3')

The ‘difference’ method also supports subtracting multiple input ranges

Example:

>>> print(IntRangeSet('0:5,6:11').difference('3:100',1))
IntRangeSet('0,2')
__xor__(ranges)

Returns a new IntRangeSet with the elements that appear in either the input IntRangeSet or the input ranges, but not both.

These are the same:

  • a ^ b

  • a.symmetric_difference(b)

Example:

>>> print(IntRangeSet('0:5,6:11') ^ '3:9')
IntRangeSet('0:3,5,9:11')
add(*ranges_inputs)

Union zero or more ranges inputs into the current IntRangeSet.

These are the same:

  • a |= b

  • a += b

  • a.add(b)

  • a.update(b)

Example:

>>> a = IntRangeSet('0:5,6:10')
>>> a |= 5
>>> print(a)
IntRangeSet('0:10')

The ‘add’ and ‘update’ methods also support unioning multiple ranges inputs.

Example:

>>> a = IntRangeSet('0:5,6:10')
>>> a.add('5','100:200')
>>> print(a)
IntRangeSet('0:10,100:200')
clear()

Remove all ranges from this IntRangeSet.

>>> a = IntRangeSet('0:10,12')
>>> a.clear()
>>> print(a)
IntRangeSet('')
copy()

Create a deep copy of an IntRangeSet.

count(ranges)

The number of times that the elements of ranges appears in the IntRangeSet. Because IntRangeSet is a set, the number will be either 0 or 1.

>>> print(IntRangeSet('100:110,1000').count('105:107,1000'))
1
difference(*ranges_inputs)

Return the set difference of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.

These are the same:

  • a - b

  • a.difference(b)

Example:

>>> print(IntRangeSet('0:5,6:11') - 1)
IntRangeSet('0,2:5,6:11')
>>> print(IntRangeSet('0:5,6:11') - '3:100')
IntRangeSet('0:3')

The ‘difference’ method also supports subtracting multiple input ranges

Example:

>>> print(IntRangeSet('0:5,6:11').difference('3:100',1))
IntRangeSet('0,2')
difference_update(*ranges_inputs)

See IntRangeSet.__isub__()

discard(*ranges_inputs)

See IntRangeSet.__isub__()

index(other)

If x is an integer, returns the index of x in a, where a is an IntRangeSet. If x is a ranges input, returns an IntRangeSet of the indexes of every integer in x. Raises an IndexError if x is not in a.

  • a.index(x)

>>> print(IntRangeSet('100:110,1000').index(109))
9
>>> print(IntRangeSet('100:110,1000').index('109,100:104'))
IntRangeSet('0:4,9')
intersection()

Return the intersection of an IntRangeSet and zero or more ranges inputs. The original IntRangeSet is not changed.

These are the same:

  • a & b

  • a.intersection(b)

Example:

>>> print(IntRangeSet('0:5,6:11') & '3:8')
IntRangeSet('3:5,6:8')

The ‘intersection’ method also supports intersecting multiple ranges inputs.

Example:

>>> print(IntRangeSet('0:5,6:11').intersection('3:8','4:7'))
IntRangeSet('4,6')
intersection_update()

See IntRangeSet.__iand__()

isdisjoint(ranges)

True exactly when the two sets have no integer elements in common.

Example:

>>> print(IntRangeSet('100:110,1000').isdisjoint('900:2000'))
False
>>> print(IntRangeSet('100:110,1000').isdisjoint('1900:2000'))
True
property isempty

True exactly when the IntRangeSet is empty.

>>> print(IntRangeSet().isempty)
True
>>> print(IntRangeSet(4).isempty)
False
issubset(ranges)

True exactly when the IntRangeSet is a subset of the ranges.

These are the same:

  • a <= b

  • a.issubset(b)

Example:

>>> print(IntRangeSet('0:5,6:11') <= '-1:101') # The right-hand can be any ranges input
True

Note: By definition, any set is a subset of itself.

issuperset(*ranges_inputs)

True exactly when all the ranges inputs are subsets of the IntRangeSet.

These are the same:

  • b in a

  • a.issuperset(b)

  • a >= b

Example:

>>> print(3 in IntRangeSet('0:5,6:11'))
True
>>> print(IntRangeSet('4:7') in IntRangeSet('0:5,6:11'))
False
>>> '6:9' in IntRangeSet('0:5,6:11') # The left-hand of 'in' can be any ranges input
True
>>> print(IntRangeSet('0:5,6:11') >= '6:9') # The right-hand side can be any ranges input
True

The ‘issuperset’ method also supports unioning multiple ranges inputs.

Example:

>>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8))
True
>>> print(IntRangeSet('0:5,6:11').issuperset(4,7,8,100))
False

Note: By definition, any set is a superset of itself.

max()

The largest integer element in the IntRangeSet

>>> print(IntRangeSet('0:10,12').max())
12

Note: This is more efficient than max(IntRangeSet('0:10,12')) because it is computed in constant time rather than in time linear in the number of integer elements.

min()

The smallest integer element in the IntRangeSet

Example:

>>> print(IntRangeSet('0:10,12').min())
0

Note: This is more efficient than min(IntRangeSet('0:10,12')) because it is computed in constant time rather than in time linear in the number of integer elements.

pop()

Remove and return the largest integer element from the IntRangeSet. Raises KeyError if the IntRangeSet is empty.

Example:

>>> a = IntRangeSet('0:5,6:11')
>>> print(a.pop())
10
>>> print(a)
IntRangeSet('0:5,6:10')
ranges()

Iterate, in order, the ranges of a IntRangeSet as (start,stop) tuples.

Example:

>>> for start,stop in IntRangeSet('0:10,100:200').ranges():
...       print("start is {0}, stop is {1}".format(start,stop))
start is 0, stop is 10
start is 100, stop is 200
ranges_getitem(index)

Returns the index-th range in the collection of ranges

>>> print(IntRangeSet('-30:-20,0:10,12').ranges_getitem(1))
(0, 10)
ranges_index(element)

Returns the index of the range containing the element

>>> int_range_set = IntRangeSet('-30:-20,0:10,12')
>>> index = int_range_set.ranges_index(5)
>>> print(index)
1
>>> int_range_set.ranges_getitem(index)
(0, 10)
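Since the ranges are sorted and disjoint, locating the range that contains an element is a binary search. A sketch with the standard bisect module (illustrative, operating on a plain list of (start, stop) pairs rather than the real internal representation):

```python
from bisect import bisect_right

def ranges_index(ranges, element):
    """Index of the range containing element; raises KeyError if absent."""
    starts = [start for start, _ in ranges]
    i = bisect_right(starts, element) - 1  # last range whose start is <= element
    if i >= 0 and element < ranges[i][1]:
        return i
    raise KeyError(element)

ranges = [(-30, -20), (0, 10), (12, 13)]  # mirrors IntRangeSet('-30:-20,0:10,12')
print(ranges_index(ranges, 5))  # 1
```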
property ranges_len

The number of contiguous ranges in the IntRangeSet

>>> print(IntRangeSet('0:10,12').ranges_len)
2
remove(*ranges_inputs)

See IntRangeSet.__isub__()

sum()

The sum of the integer elements in the IntRangeSet

>>> print(IntRangeSet('0:10,12').sum())
57

Note: This is more efficient than sum(IntRangeSet('0:10,12')) because it is computed in time linear in the number of ranges, rather than the number of integer elements.
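The per-range work is constant because the integers in a half-open range (start, stop) form an arithmetic series summing to (start + stop - 1) * (stop - start) / 2. A sketch:

```python
def range_sum(ranges):
    # Sum each half-open range in O(1) with the arithmetic-series formula
    return sum((start + stop - 1) * (stop - start) // 2 for start, stop in ranges)

# IntRangeSet('0:10,12') -> (0+1+...+9) + 12 = 45 + 12 = 57
print(range_sum([(0, 10), (12, 13)]))  # 57
```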

symmetric_difference(ranges)

Returns a new IntRangeSet with the elements that appear in either the input IntRangeSet or the input ranges, but not both.

These are the same:

  • a ^ b

  • a.symmetric_difference(b)

Example:

>>> print(IntRangeSet('0:5,6:11') ^ '3:9')
IntRangeSet('0:3,5,9:11')
symmetric_difference_update(ranges)

See IntRangeSet.__ixor__()

union()

Return the union of an IntRangeSet with zero or more ranges inputs. The original IntRangeSet is not changed.

These are the same:

  • a | b

  • a + b

  • a.union(b)

Example:

>>> print(IntRangeSet('0:5,6:10') | 5)
IntRangeSet('0:10')

The ‘union’ method also supports unioning multiple ranges inputs.

Example:

>>> print(IntRangeSet('0:5,6:10').union(5,'100:200'))
IntRangeSet('0:10,100:200')
update(*ranges_inputs)

Union zero or more ranges inputs into the current IntRangeSet.

These are the same:

  • a |= b

  • a += b

  • a.add(b)

  • a.update(b)

Example:

>>> a = IntRangeSet('0:5,6:10')
>>> a |= 5
>>> print(a)
IntRangeSet('0:10')

The ‘add’ and ‘update’ methods also support unioning multiple ranges inputs.

Example:

>>> a = IntRangeSet('0:5,6:10')
>>> a.add('5','100:200')
>>> print(a)
IntRangeSet('0:10,100:200')

util.pheno

Deprecated: Use Pheno instead.

pysnptools.util.pheno.loadOnePhen(filename, i_pheno=0, missing='-9', vectorize=False)

Deprecated: Use Pheno instead.

Load one column of a phenotype file. Remove any rows with missing data

Parameters:
  • filename (string) – name of the file

  • i_pheno (int) – column to return (default ‘0’, the first column)

  • missing (string) – value to treat as missing

  • vectorize (bool) – if true, return a 1-D vector rather than a 2-D array

Return type:

An output dictionary

The output dictionary looks like:

  • ‘header’ : [1] array of phenotype names (only if a header line is present in the file),

  • ‘vals’ : [N*1] array of phenotype-data,

  • ‘iid’ : [N*2] array of family IDs and individual IDs

pysnptools.util.pheno.loadPhen(filename, missing='-9', famid='FID', sampid='ID')

Deprecated: Use Pheno instead.

Load a phenotype or covariate file. Covariates have the same file format.

Parameters:
  • filename (string) – name of the file

  • missing (string) – value to treat as missing

  • vectorize (bool) – if true, return a 1-D vector rather than a 2-D array

Return type:

An output dictionary

The output dictionary looks like:

  • ‘header’ : [1] array of phenotype names (only if a header line is present in the file),

  • ‘vals’ : [N*1] array of phenotype-data,

  • ‘iid’ : [N*2] array of family IDs and individual IDs

util.mapreduce1 Module

util.mapreduce1

pysnptools.util.mapreduce1.map_reduce(input_seq, mapper=<function _identity>, reducer=<class 'list'>, input_files=None, output_files=None, name=None, runner=None, nested=None)

Runs a function on a sequence of inputs and runs a second function on the results. Can be nested and clusterized.

Parameters:
  • input_seq (a sequence) – a sequence of inputs. The sequence must support the len function and be indexable, e.g., a list or range(100)

  • mapper (a function) – A function to apply to each set of inputs (optional). Defaults to the identity function.

  • reducer (a function that takes a sequence) – A function to turn the results from the mapper to a single value (optional). Defaults to creating a list of the results.

  • input_files (a list) – An optional list that tells what input files are needed. The list can contain the names of files (strings), None (ignored), or objects such as SnpReader’s that can self-report their input files.

  • output_files (a list) – An optional list that tells what output files will be produced. The list can contain the names of files (strings), None (ignored), or objects such as SnpReader’s that can self-report their output files.

  • name (a string) – A name to be displayed if this work is done on a cluster.

  • runner (Runner) – a Runner, optional: Tells how to run locally, multi-processor, or on a cluster. If not given, the function is run locally.

  • nested (a function) – a mapper function that is itself a map_reduce. Some runners can efficiently clusterize such nested mappers.

Return type:

The results from the reducer.

Example:

Square the numbers 0 to 99 and report their sum, locally:

>>> from pysnptools.util.mapreduce1 import map_reduce
>>> map_reduce(range(100), 
...        mapper=lambda x: x*x,
...        reducer=sum)
328350

Compute it again, this time run on four processors:

>>> from pysnptools.util.mapreduce1.runner import LocalMultiProc
>>> map_reduce(range(100),
...        mapper=lambda x: x*x,
...        reducer=sum,
...        runner=LocalMultiProc(4))
328350

Compute it using named functions, again using four processors:

>>> def holder1(n,runner):
...     def mapper1(x):
...         return x*x
...     def reducer1(sequence):
...        return sum(sequence)
...     return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner)
>>> holder1(100,LocalMultiProc(4))
328350
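Setting clustering aside, the contract of map_reduce is simply “apply the mapper to each input and hand the results to the reducer.” A minimal local sketch of that contract (an illustration, not the library's implementation):

```python
def map_reduce_local(input_seq, mapper=lambda x: x, reducer=list):
    # Feed mapper results to the reducer as a generator, so the reducer
    # can consume them without materializing the whole list first
    return reducer(mapper(x) for x in input_seq)

print(map_reduce_local(range(100), mapper=lambda x: x * x, reducer=sum))  # 328350
```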

util.mapreduce1.runner.Runner

class pysnptools.util.mapreduce1.runner.Runner

A Runner is an object that tells a map_reduce() how to run on a local machine or cluster. Examples include Local and LocalMultiProc.

util.mapreduce1.runner.Local

class pysnptools.util.mapreduce1.runner.Local(mkl_num_threads=None, logging_handler=<StreamHandler <stdout> (NOTSET)>)

A Runner that runs a map_reduce() locally. To save memory, it will feed the results of the mapper to the reducer as those results are computed.

Constructor:
Parameters:
  • mkl_num_threads (number) – (default None) Limit on the number of threads used by the NumPy MKL library.

  • logging_handler (stream) – (default stdout) Where to output logging messages.

Example:

>>> from pysnptools.util.mapreduce1 import map_reduce
>>> from pysnptools.util.mapreduce1.runner import Local
>>> def holder1(n,runner):
...     def mapper1(x):
...         return x*x
...     def reducer1(sequence):
...        return sum(sequence)
...     return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner)
>>> holder1(100,Local())
328350

util.mapreduce1.runner.LocalMultiProc

class pysnptools.util.mapreduce1.runner.LocalMultiProc(taskcount, mkl_num_threads=None, weights=None, taskindex_to_environ=None, just_one_process=False, logging_handler=<StreamHandler <stdout> (NOTSET)>)

A Runner that runs a map_reduce() as multiple processes on a single machine.

Constructor:
Parameters:
  • taskcount (number) – The number of processes to run on.

  • mkl_num_threads (number) – (default None) Limit on the number of threads used by the NumPy MKL library.

  • weights (array of integers) – (default None) If given, tells the relative amount of work assigned to each task. The length of the array must be taskcount. If not given, all tasks are assigned the same amount of work.

  • taskindex_to_environ (function from integers to dictionaries) – (default None) If given, this should be a function from taskindex to a dictionary. The dictionary tells how to temporarily set environment variables while the task is running, as a mapping of variables to values (both strings).

  • just_one_process (bool) – (default False) Divides the work as if for multiple processes, but runs sequentially in one process. Can be useful for debugging.

  • logging_handler (stream) – (default stdout) Where to output logging messages.

Example:

>>> from pysnptools.util.mapreduce1 import map_reduce
>>> from pysnptools.util.mapreduce1.runner import LocalMultiProc
>>> def holder1(n,runner):
...     def mapper1(x):
...         return x*x
...     def reducer1(sequence):
...        return sum(sequence)
...     return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner)
>>> holder1(100,LocalMultiProc(4))
328350

util.mapreduce1.runner.LocalMultiThread

class pysnptools.util.mapreduce1.runner.LocalMultiThread(taskcount, mkl_num_threads=None, just_one_process=False)

A Runner that runs a map_reduce() as multiple threads on a single machine.

Note that Python has problems running some programs efficiently on multiple threads. (Search ‘python global interpreter lock’ for details.) If this Runner doesn’t work, consider LocalMultiProc instead.

Constructor:
Parameters:
  • taskcount (number) – The number of threads to run on.

  • mkl_num_threads (number) – (default None) Limit on the number of threads used by the NumPy MKL library.

  • just_one_process (bool) – (default False) Divides the work as if for multiple threads, but runs sequentially on one thread. Can be useful for debugging.

Example:

>>> from pysnptools.util.mapreduce1 import map_reduce
>>> from pysnptools.util.mapreduce1.runner import LocalMultiThread
>>> def holder1(n,runner):
...     def mapper1(x):
...         return x*x
...     def reducer1(sequence):
...        return sum(sequence)
...     return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner)
>>> holder1(100,LocalMultiThread(4))
328350
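The same map-then-reduce shape can be approximated with the standard concurrent.futures module. This sketch is only an analogy for how a multi-threaded runner divides work, not how LocalMultiThread is implemented:

```python
from concurrent.futures import ThreadPoolExecutor

def threaded_map_reduce(input_seq, mapper, reducer, taskcount=4):
    # executor.map runs the mapper across taskcount threads and
    # yields results in input order, which the reducer then consumes
    with ThreadPoolExecutor(max_workers=taskcount) as executor:
        return reducer(executor.map(mapper, input_seq))

print(threaded_map_reduce(range(100), lambda x: x * x, sum, taskcount=4))  # 328350
```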

util.mapreduce1.runner.LocalInParts

class pysnptools.util.mapreduce1.runner.LocalInParts(taskindex, taskcount, mkl_num_threads=None, weights=None, environ=None, result_file=None, run_dir='.', temp_dir=None, logging_handler=<StreamHandler <stdout> (NOTSET)>)

A Runner that runs one piece of a map_reduce() job locally. Partial results are saved to disk. Clustering runners and LocalMultiProc use this runner internally.

Constructor:
Parameters:
  • taskindex (number) – Which piece of work to run. When 0 to taskcount-1, does map work. When taskcount, does the reduce work.

  • taskcount (number) – The number of pieces into which to divide the work.

  • mkl_num_threads (number) – (default None) Limit on the number of threads used by the NumPy MKL library.

  • weights (array of integers) – (default None) If given, tells the relative amount of work assigned to each task. The length of the array must be taskcount. If not given, all tasks are assigned the same amount of work.

  • environ (dictionary of variables and values) – (default None) Temporarily assigns environment variables while the task runs. Both variables and values should be strings.

  • result_file (string) – (default None) Where to pickle the final results. If no file is given, the final results are returned, but not saved to a file.

  • run_dir (string) – (default '.') The directory for any result_file. Defaults to the current working directory.

  • temp_dir (string) – (default None) The directory for partial results. Defaults to run_dir/.working_directory.{map_reduce’s name}.

  • logging_handler (stream) – (default stdout) Where to output logging messages.

Example:

>>> from pysnptools.util.mapreduce1 import map_reduce
>>> from pysnptools.util.mapreduce1.runner import LocalInParts
>>> def holder1(n,runner):
...     def mapper1(x):
...         return x*x
...     def reducer1(sequence):
...        return sum(sequence)
...     return map_reduce(range(n),mapper=mapper1,reducer=reducer1,runner=runner)
>>> holder1(100,LocalInParts(0,4)) #Run part 0 of 4 and save partial results to disk as '0.4.p'.
>>> holder1(100,LocalInParts(1,4)) #Run part 1 of 4 and save partial results to disk as '1.4.p'.
>>> holder1(100,LocalInParts(2,4)) #Run part 2 of 4 and save partial results to disk as '2.4.p'.
>>> holder1(100,LocalInParts(3,4)) #Run part 3 of 4 and save partial results to disk as '3.4.p'.
>>> holder1(100,LocalInParts(4,4)) #Read all the partial results, apply the reduce function, and return the result.
328350
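The run-parts-then-reduce pattern above can be sketched with the standard pickle module. The chunking scheme (round-robin by task index) and the file naming here are illustrative choices, not necessarily what LocalInParts does:

```python
import os
import pickle
import tempfile

def run_part(taskindex, taskcount, n, mapper, reducer, work_dir):
    """taskindex in 0..taskcount-1: map one slice and pickle the results.
    taskindex == taskcount: load every slice and apply the reducer."""
    if taskindex < taskcount:
        part = [mapper(x) for x in range(n) if x % taskcount == taskindex]
        with open(os.path.join(work_dir, f'{taskindex}.{taskcount}.p'), 'wb') as f:
            pickle.dump(part, f)
        return None
    results = []
    for i in range(taskcount):
        with open(os.path.join(work_dir, f'{i}.{taskcount}.p'), 'rb') as f:
            results.extend(pickle.load(f))
    return reducer(results)

with tempfile.TemporaryDirectory() as work_dir:
    for i in range(5):  # parts 0..3 do the map work; part 4 reduces
        result = run_part(i, 4, 100, lambda x: x * x, sum, work_dir)
print(result)  # 328350
```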

util.filecache Module

Tools for reading and writing files, locally or across clusters.

pysnptools.util.filecache.ip_address()
pysnptools.util.filecache.ip_address_pid()

Return the ip address of this computer and process id of the executing process.

util.filecache.FileCache

class pysnptools.util.filecache.FileCache
A FileCache is a class, such as LocalCache or PeerToPeer, that:
  • for reading, copies a (possibly remote) file to a local disk. (If the local file already exists and is up-to-date, retrieval is typically skipped.)

  • for writing, copies a newly-written local file to (possibly remote) storage.

  • provides an assortment of file- and directory-related commands.

Example:

Suppose all machines in a cluster can read & write the storage at ‘peertopeer1/common’. Also, suppose that ‘peertopeer1/192.168.1.105’ is stored locally on machine 192.168.1.105 but is readable by the other machines in the cluster (and so on for all the machines and their IP addresses).

>>> from pysnptools.util.filecache import PeerToPeer, ip_address
>>> def id_and_path_function():
...     ip = ip_address()
...     return ip, 'peertopeer1/{0}'.format(ip)
>>> file_cache = PeerToPeer(common_directory='peertopeer1/common',id_and_path_function=id_and_path_function)
>>> file_cache
PeerToPeer('peertopeer1/common',id_and_path_function=...')

Remove anything already in remote storage

>>> file_cache.rmtree()

Create a random SNP file and store it remotely.

>>> from pysnptools.snpreader import SnpGen, Dense
>>> snp_gen = SnpGen(seed=123,iid_count=100,sid_count=500)
>>> with file_cache.open_write('r123.100x500.dense.txt') as local_filename:
...    dense1 = Dense.write(local_filename,snp_gen.read())
>>> list(file_cache.walk())
['r123.100x500.dense.txt']

Copy back from remote storage (if needed) and then read SNP data from local file.

>>> with file_cache.open_read('r123.100x500.dense.txt') as local_filename:
...     dense2 = Dense(local_filename)
...     print(dense2[:3,:3].read().val) #Read 1st 3 individuals and SNPs
[[ 0. nan nan]
 [ 0.  0. nan]
 [ 0.  0.  0.]]

Given an appropriate module, such as LocalCache or PeerToPeer, the FileCache library provides a unified way to work with any remote storage scheme. It differs from virtual file systems, such as CIFS VFS, because:

  • FileCache works with all operating systems and requires no operating system changes.

  • FileCache can take advantage of the highest performance file retrieval methods (e.g. kernel-space file systems, peer-to-peer transfers, tree copies, etc).

  • FileCache can take advantage of the highest performance local read and write storage (e.g. SSDs)

  • FileCache can work on top of CIFS VFS and any other remote storage system with an appropriate module.

The downside of FileCache is that whole files must be copied to local storage before any part of them can be read.

Methods & Properties:

Every FileCache, such as LocalCache and PeerToPeer, has this property: name, and these methods: file_exists(), getmtime(), join(), load(), open_read(), open_write(), remove(), rmtree(), save(), and walk(). See below for details.
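To make the open_read/open_write contract concrete, here is a toy cache backed by an ordinary local directory. The class ToyCache is hypothetical; real FileCache subclasses such as LocalCache and PeerToPeer add remote copies, freshness checks, and the other methods listed above:

```python
import os
import shutil
import tempfile
from contextlib import contextmanager

class ToyCache:
    """A hypothetical, minimal cache with FileCache-style context managers."""

    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    @contextmanager
    def open_write(self, file_name):
        # Hand the caller a scratch path; copy it into storage on exit
        with tempfile.TemporaryDirectory() as tmp:
            local = os.path.join(tmp, os.path.basename(file_name))
            yield local
            shutil.copy(local, os.path.join(self.directory, file_name))

    @contextmanager
    def open_read(self, file_name):
        # A real FileCache would copy from remote storage here if needed
        yield os.path.join(self.directory, file_name)

cache = ToyCache('toycache1')
with cache.open_write('file1.txt') as local_filename:
    with open(local_filename, 'w') as f:
        f.write('Hello')
with cache.open_read('file1.txt') as local_filename:
    print(open(local_filename).read())  # Hello
```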

Details of Methods & Properties:

file_exists(file_name)

Tells if there is a file with this name. (A directory with the name doesn’t count.)

Parameters:

file_name (string) – The name of the file to look for

Return type:

bool

Example:

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> file_cache.file_exists('localcache1/file1.txt')
False
>>> file_cache.save('localcache1/file1.txt','Hello')
>>> file_cache.file_exists('localcache1/file1.txt')
True
getmtime(file_name)

Return the time that the file was last modified.

Parameters:

file_name (string) – The name of the (possibly remote) file of interest

Return type:

number, the number of seconds since the epoch

join(path)

Returns a new FileCache created by appending a path to the current FileCache.

Parameters:

path (string) – path to join to current FileCache

Return type:

FileCache

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache
LocalCache('localcache1')
>>> sub = file_cache.join('sub1')
>>> sub
LocalCache('localcache1/sub1')
load(file_name, updater=None)

Returns the contents of a file in storage as a string.

Parameters:
  • file_name (string) – The name of the (possibly remote) file to read

  • updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by log_in_place().

Return type:

string - what was written in the file.

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> file_cache.save('file1.txt','Hello')
>>> file_cache.load('file1.txt')
'Hello'
property name

A path-like name for this FileCache.

Return type:

string

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.name
'localcache1'
open_read(file_name, updater=None)

Used with a ‘with’ statement to produce a local copy of a (possibly remote) file.

Parameters:
  • file_name (string) – The name of the (possibly remote) file to read

  • updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by log_in_place().

Return type:

a local filename to read.

Example:

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> file_cache.save('file1.txt','Hello')
>>> with file_cache.open_read('file1.txt') as local_filename:
...     with open(local_filename,'r') as fp:
...         line = fp.readline()
>>> line
'Hello'
open_write(file_name, size=0, updater=None)

Used with a ‘with’ statement to produce a local file name that will be copied to remote storage when the ‘with’ statement is exited.

Parameters:
  • file_name (string) – The name of the (possibly remote) file to which to write.

  • size (number) – (default 0) if given, an error will be thrown immediately if local storage doesn’t have room for that many bytes.

  • updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by log_in_place().

Return type:

a local filename to which to write.

Example:

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> with file_cache.open_write('file1.txt') as local_filename:
...     with open(local_filename,'w') as fp:
...         _= fp.write('Hello')
>>> file_cache.load('file1.txt')
'Hello'
remove(file_name, updater=None)

Remove a file from storage. It is an error to remove a directory this way.

Parameters:
  • file_name (string) – The name of the (possibly remote) file to remove.

  • updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by log_in_place().

rmtree(path=None, updater=None)

Delete all files in this FileCache. It is OK if there are no files.

Parameters:
  • path (string) – (Default, None, the current FileCache). Optional path (subdirectory, not file) to start in.

  • updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by log_in_place().

save(file_name, contents, size=0, updater=None)

Write a string to a file in storage.

Parameters:
  • file_name (string) – The name of the (possibly remote) file to which to write.

  • contents (string) – What to write to the file.

  • size (number) – (default 0) if given, an error will be thrown immediately if local storage doesn’t have room for that many bytes.

  • updater (function or lambda) – (Default, None). Optional function to which status messages may be written. For example, the function created by log_in_place().

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> file_cache.save('file1.txt','Hello')
>>> file_cache.load('file1.txt')
'Hello'
walk(path=None)

Generates the relative paths of the files in the FileCache. It is OK if there are no files.

Parameters:

path (string) – (Default, None, the current FileCache). Optional path (subdirectory, not file) to start in.

Return type:

a generator of strings

Example:

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> list(file_cache.walk())
[]
>>> file_cache.save('file1.txt','Hello')
>>> list(file_cache.walk())
['file1.txt']
>>> list(file_cache.walk('no_such_sub_path'))
[]

util.filecache.LocalCache

class pysnptools.util.filecache.LocalCache(directory)

A FileCache for working with files locally.

See FileCache for general examples of using FileCache.

This is the simplest FileCache in that it stores everything on a local disk rather than storing things remotely.

Constructor:
Parameters:
  • directory (string) – The directory under which files should be written and read.

Example:

>>> from pysnptools.util.filecache import LocalCache
>>> file_cache = LocalCache('localcache1')
>>> file_cache.rmtree()
>>> file_cache.save('sub1/file1.txt','Hello')
>>> file_cache.file_exists('sub1/file1.txt')
True
property name

A path-like name for this LocalCache.

Return type:

string

util.filecache.PeerToPeer

class pysnptools.util.filecache.PeerToPeer(common_directory, id_and_path_function, leave_space=0)

A FileCache that allows multiple machines (or processes on one machine) to share files in a peer-to-peer fashion.

See FileCache for general examples of using FileCache.

This is useful for a cluster with multiple machines with persistent storage and a fast network. Every time a machine writes a file, the file stays on that machine, but information about the file’s name and location is saved to a common directory. When a new machine needs the file, it looks to the common directory for a location and then copies from that location. That second location is also saved to the common directory. Later machines that need the file will randomly pick among the available locations. When a file is removed, it is only removed from the common directory; clean up of local copies happens later when space is needed for other reads.
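
The location-tracking scheme above can be sketched in plain Python. This is a conceptual illustration only, not pysnptools’s implementation; the names and in-memory dictionary are hypothetical stand-ins for the common directory:

```python
import random

# Conceptual sketch: the common directory maps each file name to the
# list of peer locations known to hold a copy.
common = {}  # file_name -> list of peer paths

def peer_write(peer_path, file_name):
    """After a peer writes a file locally, register its location."""
    common.setdefault(file_name, []).append(peer_path)

def peer_read(my_path, file_name):
    """Copy from a randomly chosen known location, then register ours."""
    source = random.choice(common[file_name])
    # ... copy the file from `source` to `my_path` here ...
    common[file_name].append(my_path)  # this peer is now also a source
    return my_path

peer_write('peer_a/storage', 'file1.txt')
peer_read('peer_b/storage', 'file1.txt')
# Both peer_a and peer_b are now listed as sources for file1.txt,
# so later readers can pick among them.
```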

Constructor:
Parameters:
  • common_directory (FileCache or string) – The path or FileCache where each file’s name and locations (but not contents) are stored. This path must be read/writable by every machine in the cluster.

  • id_and_path_function (function) – A function (or lambda) that deterministically returns two values: 1. an id unique to each machine (or process) and 2. a file path to this machine’s local storage that any machine in the cluster can read.

  • leave_space (number) – (Default 0) How much free space must be left on the local drive after downloads and writes.

Example:

>>> #Suppose that every machine can access a shared 'peertopeer1' and that 'peertopeer1/IPADDRESS' is local to each machine.
>>> from pysnptools.util.filecache import PeerToPeer, ip_address
>>> def id_and_path_function():
...     ip = ip_address()
...     return ip, 'peertopeer1/{0}'.format(ip)
>>> file_cache = PeerToPeer(common_directory='peertopeer1/common',id_and_path_function=id_and_path_function)
>>> file_cache.rmtree()
>>> file_cache.save('sub1/file1.txt','Hello')
>>> file_cache.file_exists('sub1/file1.txt')
True
property name

A path-like name for this PeerToPeer.

Return type:

string

util.filecache.Hashdown

class pysnptools.util.filecache.Hashdown(url, file_to_hash={}, directory=None, allow_unknown_files=False, trust_local_files=False, _relative_directory=None)

A FileCache for downloading files with known MD5 hashes from a URL. Uploading is not supported.

See FileCache for general examples of using FileCache.

Constructor:
Parameters:
  • url (string) – The URL from which to download any needed files.

  • file_to_hash (dictionary) – A dictionary from file names to MD5 hashes.

  • directory (optional, string) – Local location for files. If not given, files will be stored under the system temp directory (typically controlled with the TEMP environment variable).

  • allow_unknown_files (optional, bool) – By default, all requested files must be in the dictionary. If True, other files can be requested. If found under the URL, they will be downloaded and an entry will be added to the dictionary.

  • trust_local_files (optional, bool) – By default, when allow_unknown_files is True, unknown files will be downloaded. If trust_local_files is also True, then any local files in directory will be assumed to have the correct hash.

  • _relative_directory (string) – Used internally

Example:

>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                      file_to_hash=file_to_hash)
>>> hashdown.file_exists('pysnptools/examples/toydata.5chrom.bed')
True
>>> hashdown.load('pysnptools/examples/toydata.5chrom.fam').split('\n')[0]
'per0 per0 0 0 2 0.408848'
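
The hashes in file_to_hash can be produced with Python’s standard hashlib module. A small helper like this (not part of pysnptools) yields hex digests in the required format:

```python
import hashlib

def md5_of_file(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading in chunks to bound memory."""
    md5 = hashlib.md5()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()

# A file whose entire contents are b'Hello' hashes to
# '8b1a9953c4611296a827abf8c47804d7'.
```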
static load_hashdown(filename, directory=None, allow_unknown_files=False, trust_local_files=False)

Load a Hashdown object from a json file.

Parameters:
  • filename – name of file to load from.

  • directory – Local location for files. If not given, files will be stored under the system temp directory (typically controlled with the TEMP environment variable).

  • allow_unknown_files – By default, all requested files must be in the dictionary. If True, other files can be requested. If found under the URL, they will be downloaded and an entry will be added to the dictionary.

  • trust_local_files – By default, when allow_unknown_files is True, unknown files will be downloaded. If trust_local_files is also True, then any local files in directory will be assumed to have the correct hash.

Return type:

Hashdown

>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown1 = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                      file_to_hash=file_to_hash)
>>> hashdown1.save_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2 = Hashdown.load_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2.file_exists('pysnptools/examples/toydata.5chrom.bed')
True
>>> hashdown2.load('pysnptools/examples/toydata.5chrom.fam').split('\n')[0]
'per0 per0 0 0 2 0.408848'
property name

A path-like name for this Hashdown.

Return type:

string

>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                      file_to_hash=file_to_hash)
>>> hashdown.name
'hashdown/9ac30da2bf589db947e91744dff0ec24'
>>> hashdown.join('pysnptools').name
'hashdown/9ac30da2bf589db947e91744dff0ec24/pysnptools'
save_hashdown(filename)

Save a Hashdown object to a json file.

Parameters:

filename – name of file to save to.

>>> from pysnptools.util.filecache import Hashdown
>>> file_to_hash= {'pysnptools/examples/toydata.5chrom.bed': '766f55aa716bc7bc97cad4de41a50ec3',
...                'pysnptools/examples/toydata.5chrom.bim': '6a07f96e521f9a86df7bfd7814eebcd6',
...                'pysnptools/examples/toydata.5chrom.fam': 'f4eb01f67e0738d4865fad2014af8537'}
>>> hashdown1 = Hashdown('https://github.com/fastlmm/PySnpTools/raw/cf248cbf762516540470d693532590a77c76fba2',
...                      file_to_hash=file_to_hash)
>>> hashdown1.save_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2 = Hashdown.load_hashdown('tempdir/demo.hashdown.json')
>>> hashdown2.file_exists('pysnptools/examples/toydata.5chrom.bed')
True
>>> hashdown2.load('pysnptools/examples/toydata.5chrom.fam').split('\n')[0]
'per0 per0 0 0 2 0.408848'
static scan_local(local_directory, url=None, logging_level=30)

Bootstrap a Hashdown by recursively walking a local directory and finding the local MD5 hashes. (A local hash might be wrong if the files are out of date or have OS-dependent line endings.) Typically, you’ll then want to save the result to a JSON file and then edit that JSON file manually to remove uninteresting files.

Parameters:
  • local_directory – Local directory to recursively walk

  • url – URL to give to the Hashdown. (It will not be checked.)

  • logging_level – Logging level for printing progress of the walk. Default is logging.WARNING (30).

Return type:

Hashdown

Indices and Tables