In-memory data structures¶

This module defines NumPy array classes for variant call data.

Please note, functions and command line utilities for converting variant call data from the VCF file format into NumPy arrays and HDF5 files are available from the vcfnp package.

GenotypeArray¶

class allel.model.ndarray.GenotypeArray[source]¶

Array of discrete genotype calls.

Parameters:

data : array_like, int, shape (n_variants, n_samples, ploidy)

Genotype data.

**kwargs : keyword arguments

All keyword arguments are passed through to numpy.array().

Notes

This class represents data on discrete genotype calls as a 3-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the samples genotyped, and the third dimension corresponds to the ploidy of the samples.

Each integer within the array corresponds to an allele index, where 0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, ... and -1 (or any other negative integer) is a missing allele call. A single byte integer dtype (int8) can represent up to 127 distinct alleles, which is usually sufficient. The actual alleles (i.e., the alternate nucleotide sequences) and the physical positions of the variants within the genome of an organism are stored in separate arrays, discussed elsewhere.

Arrays of this class can store either phased or unphased genotype calls. If the genotypes are phased (i.e., haplotypes have been resolved) then individual haplotypes can be extracted by converting to a HaplotypeArray then indexing the second dimension. If the genotype calls are unphased then the ordering of alleles along the third (ploidy) dimension is arbitrary. N.B., this means that an unphased diploid heterozygous call could be stored as (0, 1) or equivalently as (1, 0).

A genotype array can store genotype calls with any ploidy > 1. For haploid calls, use a HaplotypeArray. Note that genotype arrays are not capable of storing calls for samples with differing or variable ploidy.

With genotype data on large numbers of variants and/or samples, storing the genotype calls in memory as an uncompressed numpy array if integers may be impractical. For working with large arrays of genotype data, see the allel.model.chunked.GenotypeChunkedArray class, which provides an alternative implementation of this interface using chunked compressed arrays.

Examples

Instantiate a genotype array:

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> g.dtype
dtype('int8')
>>> g.ndim
3
>>> g.shape
(3, 2, 2)
>>> g.n_variants
3
>>> g.n_samples
2
>>> g.ploidy
2

Genotype calls for a single variant at all samples can be obtained by indexing the first dimension, e.g.:

>>> g[1]
array([[0, 1],
       [1, 1]], dtype=int8)

Genotype calls for a single sample at all variants can be obtained by indexing the second dimension, e.g.:

>>> g[:, 1]
array([[ 0,  1],
       [ 1,  1],
       [-1, -1]], dtype=int8)

A genotype call for a single sample at a single variant can be obtained by indexing the first and second dimensions, e.g.:

>>> g[1, 0]
array([0, 1], dtype=int8)

A genotype array can store polyploid calls, e.g.:

>>> g = allel.GenotypeArray([[[0, 0, 0], [0, 0, 1]],
...                          [[0, 1, 1], [1, 1, 1]],
...                          [[0, 1, 2], [-1, -1, -1]]],
...                         dtype='i1')
>>> g.ploidy
3

n_variants¶: Number of variants (length of first array dimension).

n_samples¶: Number of samples (length of second array dimension).

ploidy¶: Sample ploidy (length of third array dimension).

mask¶

A boolean mask associated with this genotype array, indicating genotype calls that should be filtered (i.e., excluded) from genotype and allele counting operations.

Notes

This is a lightweight genotype call mask and not a mask in the sense of a numpy masked array. This means that the mask will only be taken into account by the genotype and allele counting methods of this class, and is ignored by any of the generic methods on the ndarray class or by any numpy ufuncs.

Note also that the mask may not survive any slicing, indexing or other subsetting procedures (e.g., call to numpy.compress() or numpy.take()). I.e., the mask will have to be similarly indexed then reapplied. The only exceptions are simple slicing operations that preserve the dimensionality and ploidy of the array, and the subset() method, both of which will preserve the mask if present.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> g.count_called()
5
>>> g.count_alleles()
AlleleCountsArray((3, 3), dtype=int32)
[[3 1 0]
 [1 3 0]
 [1 0 1]]
>>> mask = [[True, False], [False, True], [False, False]]
>>> g.mask = mask
>>> g.count_called()
3
>>> g.count_alleles()
AlleleCountsArray((3, 3), dtype=int32)
[[1 1 0]
 [1 1 0]
 [1 0 1]]

fill_masked(value=-1, mask=None, copy=True)[source]¶

Fill masked genotype calls with a given value.

Parameters:

value : int, optional

The fill value.

mask : array_like, bool, shape (n_variants, n_samples), optional

A boolean array where True elements indicate genotype calls to be filled. If not provided, value of the mask property will be used.

copy : bool, optional

If False, modify the array in place.

Returns:

g : GenotypeArray

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> mask = [[True, False], [False, True], [False, False]]
>>> g.mask = mask
>>> g.fill_masked()
GenotypeArray((3, 2, 2), dtype=int8)
[[[-1 -1]
  [ 0  1]]
 [[ 0  1]
  [-1 -1]]
 [[ 0  2]
  [-1 -1]]]

subset(sel0=None, sel1=None)[source]¶

Make a sub-selection of variants and samples.

Parameters:

sel0 : array_like

Boolean array or list of indices selecting variants.

sel0 : array_like

Boolean array or list of indices selecting samples.

Returns:

out : GenotypeArray

See also

numpy.take, numpy.compress

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1], [1, 1]],
...                          [[0, 1], [1, 1], [1, 2]],
...                          [[0, 2], [-1, -1], [-1, -1]]])
>>> g.subset([0, 1], [0, 2])
GenotypeArray((2, 2, 2), dtype=int64)
[[[0 0]
  [1 1]]
 [[0 1]
  [1 2]]]

is_called()[source]¶

Find non-missing genotype calls.

Returns:

out : ndarray, bool, shape (n_variants, n_samples)

Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_called()
array([[ True,  True],
       [ True,  True],
       [ True, False]], dtype=bool)

is_missing()[source]¶

Find missing genotype calls.

Returns:

out : ndarray, bool, shape (n_variants, n_samples)

Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_missing()
array([[False, False],
       [False, False],
       [False,  True]], dtype=bool)

is_hom(allele=None)[source]¶

Find genotype calls that are homozygous.

Parameters:

allele : int, optional

Allele index.

Returns:

out : ndarray, bool, shape (n_variants, n_samples)

Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.is_hom()
array([[ True, False],
       [False,  True],
       [ True, False]], dtype=bool)
>>> g.is_hom(allele=1)
array([[False, False],
       [False,  True],
       [False, False]], dtype=bool)

is_hom_ref()[source]¶

Find genotype calls that are homozygous for the reference allele.

Returns:

out : ndarray, bool, shape (n_variants, n_samples)

Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_hom_ref()
array([[ True, False],
       [False, False],
       [False, False]], dtype=bool)

is_hom_alt()[source]¶

Find genotype calls that are homozygous for any alternate (i.e., non-reference) allele.

Returns:

out : ndarray, bool, shape (n_variants, n_samples)

Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.is_hom_alt()
array([[False, False],
       [False,  True],
       [ True, False]], dtype=bool)

is_het(allele=None)[source]¶

Find genotype calls that are heterozygous.

Returns:

out : ndarray, bool, shape (n_variants, n_samples)

Array where elements are True if the genotype call matches the condition.

allele : int, optional

Heterozygous allele.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_het()
array([[False,  True],
       [ True, False],
       [ True, False]], dtype=bool)
>>> g.is_het(2)
array([[False, False],
       [False, False],
       [ True, False]], dtype=bool)

is_call(call)[source]¶

Find genotypes with a given call.

Parameters:

call : array_like, int, shape (ploidy,)

The genotype call to find.

Returns:

out : ndarray, bool, shape (n_variants, n_samples)

Array where elements are True if the genotype is call.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_call((0, 2))
array([[False, False],
       [False, False],
       [ True, False]], dtype=bool)

count_called(axis=None)[source]¶

count_missing(axis=None)[source]¶

count_hom(allele=None, axis=None)[source]¶

count_hom_ref(axis=None)[source]¶

count_hom_alt(axis=None)[source]¶

count_het(allele=None, axis=None)[source]¶

count_call(call, axis=None)[source]¶

count_alleles(max_allele=None, subpop=None)[source]¶

Count the number of calls of each allele per variant.

Parameters:

max_allele : int, optional

The highest allele index to count. Alleles above this will be ignored.

subpop : sequence of ints, optional

Indices of samples to include in count.

Returns:

ac : AlleleCountsArray

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.count_alleles()
AlleleCountsArray((3, 3), dtype=int32)
[[3 1 0]
 [1 2 1]
 [0 0 2]]
>>> g.count_alleles(max_allele=1)
AlleleCountsArray((3, 2), dtype=int32)
[[3 1]
 [1 2]
 [0 0]]

count_alleles_subpops(subpops, max_allele=None)[source]¶

Count alleles for multiple subpopulations simultaneously.

Parameters:

subpops : dict (string -> sequence of ints)

Mapping of subpopulation names to sample indices.

max_allele : int, optional

The highest allele index to count. Alleles above this will be ignored.

Returns:

out : dict (string -> AlleleCountsArray)

A mapping of subpopulation names to allele counts arrays.

map_alleles(mapping, copy=True)[source]¶

Transform alleles via a mapping.

Parameters:

mapping : ndarray, int8, shape (n_variants, max_allele)

An array defining the allele mapping for each variant.

copy : bool, optional

If True, return a new array; if False, apply mapping in place (only applies for arrays with dtype int8; all other dtypes require a copy).

Returns:

gm : GenotypeArray

See also

create_allele_mapping

Notes

For arrays with dtype int8 an optimised implementation is used which is faster and uses far less memory. It is recommended to convert arrays to dtype int8 where possible before calling this method.

Examples

>>> import allel
>>> import numpy as np
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[1, 2], [2, 1]],
...                          [[2, 2], [-1, -1]]], dtype='i1')
>>> mapping = np.array([[1, 2, 0],
...                     [2, 0, 1],
...                     [2, 1, 0],
...                     [0, 2, 1]], dtype='i1')
>>> g.map_alleles(mapping)
GenotypeArray((4, 2, 2), dtype=int8)
[[[ 1  1]
  [ 1  2]]
 [[ 2  1]
  [ 0  0]]
 [[ 1  0]
  [ 0  1]]
 [[ 1  1]
  [-1 -1]]]

to_haplotypes(copy=False)[source]¶

Reshape a genotype array to view it as haplotypes by dropping the ploidy dimension.

Returns:

h : HaplotypeArray, shape (n_variants, n_samples * ploidy)

Haplotype array.

copy : bool, optional

If True, make a copy of the data.

Notes

If genotype calls are unphased, the haplotypes returned by this function will bear no resemblance to the true haplotypes.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.to_haplotypes()
HaplotypeArray((3, 4), dtype=int64)
[[ 0  0  0  1]
 [ 0  1  1  1]
 [ 0  2 -1 -1]]

to_n_ref(fill=0, dtype='i1')[source]¶

Transform each genotype call into the number of reference alleles.

Parameters:

fill : int, optional

Use this value to represent missing calls.

Returns:

out : ndarray, int, shape (n_variants, n_samples)

Array of ref alleles per genotype call.

Notes

By default this function returns 0 for missing genotype calls and for homozygous non-reference genotype calls. Use the fill argument to change how missing calls are represented.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_n_ref()
array([[2, 1],
       [1, 0],
       [0, 0]], dtype=int8)
>>> g.to_n_ref(fill=-1)
array([[ 2,  1],
       [ 1,  0],
       [ 0, -1]], dtype=int8)

to_n_alt(fill=0, dtype='i1')[source]¶

Transform each genotype call into the number of non-reference alleles.

Parameters:

fill : int, optional

Use this value to represent missing calls.

Returns:

out : ndarray, int, shape (n_variants, n_samples)

Array of non-ref alleles per genotype call.

Notes

This function simply counts the number of non-reference alleles, it makes no distinction between different alternate alleles.

By default this function returns 0 for missing genotype calls and for homozygous reference genotype calls. Use the fill argument to change how missing calls are represented.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_n_alt()
array([[0, 1],
       [1, 2],
       [2, 0]], dtype=int8)
>>> g.to_n_alt(fill=-1)
array([[ 0,  1],
       [ 1,  2],
       [ 2, -1]], dtype=int8)

to_allele_counts(alleles=None)[source]¶

Transform genotype calls into allele counts per call.

Parameters:

alleles : sequence of ints, optional

If not None, count only the given alleles. (By default, count all alleles.)

Returns:

out : ndarray, uint8, shape (n_variants, n_samples, len(alleles))

Array of allele counts per call.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_allele_counts()
array([[[2, 0, 0],
        [1, 1, 0]],
       [[1, 0, 1],
        [0, 2, 0]],
       [[0, 0, 2],
        [0, 0, 0]]], dtype=uint8)
>>> g.to_allele_counts(alleles=(0, 1))
array([[[2, 0],
        [1, 1]],
       [[1, 0],
        [0, 2]],
       [[0, 0],
        [0, 0]]], dtype=uint8)

to_packed(boundscheck=True)[source]¶

Pack diploid genotypes into a single byte for each genotype, using the left-most 4 bits for the first allele and the right-most 4 bits for the second allele. Allows single byte encoding of diploid genotypes for variants with up to 15 alleles.

Parameters:

boundscheck : bool, optional

If False, do not check that minimum and maximum alleles are compatible with bit-packing.

Returns:

packed : ndarray, uint8, shape (n_variants, n_samples)

Bit-packed genotype array.

Notes

If a mask has been set, it is ignored by this function.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]], dtype='i1')
>>> g.to_packed()
array([[  0,   1],
       [  2,  17],
       [ 34, 239]], dtype=uint8)

static from_packed(packed)[source]¶

Unpack diploid genotypes that have been bit-packed into single bytes.

Parameters:

packed : ndarray, uint8, shape (n_variants, n_samples)

Bit-packed diploid genotype array.

Returns:

g : GenotypeArray, shape (n_variants, n_samples, 2)

Genotype array.

Examples

>>> import allel
>>> import numpy as np
>>> packed = np.array([[0, 1],
...                    [2, 17],
...                    [34, 239]], dtype='u1')
>>> allel.GenotypeArray.from_packed(packed)
GenotypeArray((3, 2, 2), dtype=int8)
[[[ 0  0]
  [ 0  1]]
 [[ 0  2]
  [ 1  1]]
 [[ 2  2]
  [-1 -1]]]

to_sparse(format='csr', **kwargs)[source]¶

Convert into a sparse matrix.

Parameters:

format : {‘coo’, ‘csc’, ‘csr’, ‘dia’, ‘dok’, ‘lil’}

Sparse matrix format.

kwargs : keyword arguments

Passed through to sparse matrix constructor.

Returns:

m : scipy.sparse.spmatrix

Sparse matrix

Notes

If a mask has been set, it is ignored by this function.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 1], [0, 1]],
...                          [[1, 1], [0, 0]],
...                          [[0, 0], [-1, -1]]], dtype='i1')
>>> m = g.to_sparse(format='csr')
>>> m
<4x4 sparse matrix of type '<class 'numpy.int8'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> m.data
array([ 1,  1,  1,  1, -1, -1], dtype=int8)
>>> m.indices
array([1, 3, 0, 1, 2, 3], dtype=int32)
>>> m.indptr
array([0, 0, 2, 4, 6], dtype=int32)

static from_sparse(m, ploidy, order=None, out=None)[source]¶

Construct a genotype array from a sparse matrix.

Parameters:

m : scipy.sparse.spmatrix

Sparse matrix

ploidy : int

The sample ploidy.

order : {‘C’, ‘F’}, optional

Whether to store data in C (row-major) or Fortran (column-major) order in memory.

out : ndarray, shape (n_variants, n_samples), optional

Use this array as the output buffer.

Returns:

g : GenotypeArray, shape (n_variants, n_samples, ploidy)

Genotype array.

Examples

>>> import allel
>>> import numpy as np
>>> import scipy.sparse
>>> data = np.array([ 1,  1,  1,  1, -1, -1], dtype=np.int8)
>>> indices = np.array([1, 3, 0, 1, 2, 3], dtype=np.int32)
>>> indptr = np.array([0, 0, 2, 4, 6], dtype=np.int32)
>>> m = scipy.sparse.csr_matrix((data, indices, indptr))
>>> g = allel.GenotypeArray.from_sparse(m, ploidy=2)
>>> g
GenotypeArray((4, 2, 2), dtype=int8)
[[[ 0  0]
  [ 0  0]]
 [[ 0  1]
  [ 0  1]]
 [[ 1  1]
  [ 0  0]]
 [[ 0  0]
  [-1 -1]]]

to_gt(phased=False, max_allele=None)[source]¶

Convert genotype calls to VCF-style string representation.

Parameters:

phased : bool, optional

Determines separator.

max_allele : int, optional

Manually specify max allele index.

Returns:

gt : ndarray, string, shape (n_variants, n_samples)

Notes

If a mask has been set, it is ignored by this function.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[1, 2], [2, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_gt()
chararray([[b'0/0', b'0/1'],
       [b'0/2', b'1/1'],
       [b'1/2', b'2/1'],
       [b'2/2', b'./.']],
      dtype='|S3')
>>> g.to_gt(phased=True)
chararray([[b'0|0', b'0|1'],
       [b'0|2', b'1|1'],
       [b'1|2', b'2|1'],
       [b'2|2', b'.|.']],
      dtype='|S3')

haploidify_samples()[source]¶

Construct a pseudo-haplotype for each sample by randomly selecting an allele from each genotype call.

Returns:	h : HaplotypeArray

Notes

If a mask has been set, it is ignored by this function.

Examples

>>> import allel
>>> import numpy as np
>>> np.random.seed(42)
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[1, 2], [2, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.haploidify_samples()
HaplotypeArray((4, 2), dtype=int64)
[[ 0  1]
 [ 0  1]
 [ 1  1]
 [ 2 -1]]
>>> g = allel.GenotypeArray([[[0, 0, 0], [0, 0, 1]],
...                          [[0, 1, 1], [1, 1, 1]],
...                          [[0, 1, 2], [-1, -1, -1]]])
>>> g.haploidify_samples()
HaplotypeArray((3, 2), dtype=int64)
[[ 0  0]
 [ 1  1]
 [ 2 -1]]

vstack(*others)¶: Stack arrays in sequence vertically (row wise).

hstack(*others)¶: Stack arrays in sequence horizontally (column wise).

HaplotypeArray¶

class allel.model.ndarray.HaplotypeArray[source]¶

Array of haplotypes.

Parameters:

data : array_like, int, shape (n_variants, n_haplotypes)

Haplotype data.

**kwargs : keyword arguments

All keyword arguments are passed through to numpy.array().

Notes

This class represents haplotype data as a 2-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the haplotypes.

Each integer within the array corresponds to an allele index, where 0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, ... and -1 (or any other negative integer) is a missing allele call.

If adjacent haplotypes originate from the same sample, then a haplotype array can also be viewed as a genotype array. However, this is not a requirement.

With data on large numbers of variants and/or haplotypes, storing the data in memory as an uncompressed numpy array if integers may be impractical. For working with large arrays of haplotype data, see the allel.model.chunked.HaplotypeChunkedArray class, which provides an alternative implementation of this interface using chunked compressed arrays.

Examples

Instantiate a haplotype array:

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> h.dtype
dtype('int8')
>>> h.ndim
2
>>> h.shape
(3, 4)
>>> h.n_variants
3
>>> h.n_haplotypes
4

Allele calls for a single variant at all haplotypes can be obtained by indexing the first dimension, e.g.:

>>> h[1]
array([0, 1, 1, 1], dtype=int8)

A single haplotype can be obtained by indexing the second dimension, e.g.:

>>> h[:, 1]
array([0, 1, 2], dtype=int8)

An allele call for a single haplotype at a single variant can be obtained by indexing the first and second dimensions, e.g.:

>>> h[1, 0]
0

View haplotypes as diploid genotypes:

>>> h.to_genotypes(ploidy=2)
GenotypeArray((3, 2, 2), dtype=int8)
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

n_variants¶: Number of variants (length of first dimension).

n_haplotypes¶: Number of haplotypes (length of second dimension).

subset(sel0=None, sel1=None)[source]¶

Make a sub-selection of variants and haplotypes.

Parameters:

sel0 : array_like

Boolean array or list of indices selecting variants.

sel1 : array_like

Boolean array or list of indices selecting haplotypes.

Returns:

out : HaplotypeArray

See also

numpy.take, numpy.compress

is_called()[source]¶

is_missing()[source]¶

is_ref()[source]¶

is_alt(allele=None)[source]¶

is_call(allele)[source]¶

count_called(axis=None)[source]¶

count_missing(axis=None)[source]¶

count_ref(axis=None)[source]¶

count_alt(axis=None)[source]¶

count_call(allele, axis=None)[source]¶

count_alleles(max_allele=None, subpop=None)[source]¶

Count the number of calls of each allele per variant.

Parameters:

max_allele : int, optional

The highest allele index to count. Alleles greater than this index will be ignored.

subpop : array_like, int, optional

Indices of haplotypes to include.

Returns:

ac : AlleleCountsArray, int, shape (n_variants, n_alleles)

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> ac = h.count_alleles()
>>> ac
AlleleCountsArray((3, 3), dtype=int32)
[[3 1 0]
 [1 3 0]
 [1 0 1]]

count_alleles_subpops(subpops, max_allele=None)[source]¶

Count alleles for multiple subpopulations simultaneously.

Parameters:

subpops : dict (string -> sequence of ints)

Mapping of subpopulation names to sample indices.

max_allele : int, optional

The highest allele index to count. Alleles above this will be ignored.

Returns:

out : dict (string -> AlleleCountsArray)

A mapping of subpopulation names to allele counts arrays.

map_alleles(mapping, copy=True)[source]¶

Transform alleles via a mapping.

Parameters:

mapping : ndarray, int8, shape (n_variants, max_allele)

An array defining the allele mapping for each variant.

copy : bool, optional

If True, return a new array; if False, apply mapping in place (only applies for arrays with dtype int8; all other dtypes require a copy).

Returns:

hm : HaplotypeArray

See also

create_allele_mapping

Notes

For arrays with dtype int8 an optimised implementation is used which is faster and uses far less memory. It is recommended to convert arrays to dtype int8 where possible before calling this method.

Examples

>>> import allel
>>> import numpy as np
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> mapping = np.array([[1, 2, 0],
...                     [2, 0, 1],
...                     [2, 1, 0]], dtype='i1')
>>> h.map_alleles(mapping)
HaplotypeArray((3, 4), dtype=int8)
[[ 1  1  1  2]
 [ 2  0  0  0]
 [ 2  0 -1 -1]]

to_genotypes(ploidy, copy=False)[source]¶

Reshape a haplotype array to view it as genotypes by restoring the ploidy dimension.

Parameters:

ploidy : int

The sample ploidy.

Returns:

g : ndarray, int, shape (n_variants, n_samples, ploidy)

Genotype array (sharing same underlying buffer).

copy : bool, optional

If True, copy the data.

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> h.to_genotypes(ploidy=2)
GenotypeArray((3, 2, 2), dtype=int8)
[[[ 0  0]
  [ 0  1]]
 [[ 0  1]
  [ 1  1]]
 [[ 0  2]
  [-1 -1]]]

to_sparse(format='csr', **kwargs)[source]¶

Convert into a sparse matrix.

Parameters:

format : {‘coo’, ‘csc’, ‘csr’, ‘dia’, ‘dok’, ‘lil’}

Sparse matrix format.

kwargs : keyword arguments

Passed through to sparse matrix constructor.

Returns:

m : scipy.sparse.spmatrix

Sparse matrix

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 0],
...                           [0, 1, 0, 1],
...                           [1, 1, 0, 0],
...                           [0, 0, -1, -1]], dtype='i1')
>>> m = h.to_sparse(format='csr')
>>> m
<4x4 sparse matrix of type '<class 'numpy.int8'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> m.data
array([ 1,  1,  1,  1, -1, -1], dtype=int8)
>>> m.indices
array([1, 3, 0, 1, 2, 3], dtype=int32)
>>> m.indptr
array([0, 0, 2, 4, 6], dtype=int32)

static from_sparse(m, order=None, out=None)[source]¶

Construct a haplotype array from a sparse matrix.

Parameters:

m : scipy.sparse.spmatrix

Sparse matrix

order : {‘C’, ‘F’}, optional

Whether to store data in C (row-major) or Fortran (column-major) order in memory.

out : ndarray, shape (n_variants, n_samples), optional

Use this array as the output buffer.

Returns:

h : HaplotypeArray, shape (n_variants, n_haplotypes)

Haplotype array.

Examples

>>> import allel
>>> import numpy as np
>>> import scipy.sparse
>>> data = np.array([ 1,  1,  1,  1, -1, -1], dtype=np.int8)
>>> indices = np.array([1, 3, 0, 1, 2, 3], dtype=np.int32)
>>> indptr = np.array([0, 0, 2, 4, 6], dtype=np.int32)
>>> m = scipy.sparse.csr_matrix((data, indices, indptr))
>>> h = allel.HaplotypeArray.from_sparse(m)
>>> h
HaplotypeArray((4, 4), dtype=int8)
[[ 0  0  0  0]
 [ 0  1  0  1]
 [ 1  1  0  0]
 [ 0  0 -1 -1]]

prefix_argsort()[source]¶: Return indices that would sort the haplotypes by prefix.

distinct()[source]¶: Return sets of indices for each distinct haplotype.

distinct_counts()[source]¶: Return counts for each distinct haplotype.

distinct_frequencies()[source]¶: Return frequencies for each distinct haplotype.

vstack(*others)¶: Stack arrays in sequence vertically (row wise).

hstack(*others)¶: Stack arrays in sequence horizontally (column wise).

AlleleCountsArray¶

class allel.model.ndarray.AlleleCountsArray[source]¶

Array of allele counts.

Parameters:

data : array_like, int, shape (n_variants, n_alleles)

Allele counts data.

**kwargs : keyword arguments

All keyword arguments are passed through to numpy.array().

Notes

This class represents allele counts as a 2-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the alleles counted.

Examples

Obtain allele counts from a genotype array:

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> ac = g.count_alleles()
>>> ac
AlleleCountsArray((3, 3), dtype=int32)
[[3 1 0]
 [1 3 0]
 [1 0 1]]
>>> ac.dtype
dtype('int32')
>>> ac.shape
(3, 3)
>>> ac.n_variants
3
>>> ac.n_alleles
3

Allele counts for a single variant can be obtained by indexing the first dimension, e.g.:

>>> ac[1]
array([1, 3, 0], dtype=int32)

Allele counts for a specific allele can be obtained by indexing the second dimension, e.g., reference allele counts:

>>> ac[:, 0]
array([3, 1, 1], dtype=int32)

Calculate the total number of alleles called for each variant:

>>> import numpy as np
>>> n = np.sum(ac, axis=1)
>>> n
array([4, 4, 2])

n_variants¶: Number of variants (length of first array dimension).

n_alleles¶: Number of alleles (length of second array dimension).

max_allele()[source]¶

Return the highest allele index for each variant.

Returns:

n : ndarray, int, shape (n_variants,)

Allele index array.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.max_allele()
array([1, 2, 2], dtype=int8)

allelism()[source]¶

Determine the number of distinct alleles observed for each variant.

Returns:

n : ndarray, int, shape (n_variants,)

Allelism array.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.allelism()
array([2, 3, 1])

is_variant()[source]¶

Find variants with at least one non-reference allele call.

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_variant()
array([False,  True,  True,  True], dtype=bool)

is_non_variant()[source]¶

Find variants with no non-reference allele calls.

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_non_variant()
array([ True, False, False, False], dtype=bool)

is_segregating()[source]¶

Find segregating variants (where more than one allele is observed).

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_segregating()
array([False,  True,  True, False], dtype=bool)

is_non_segregating(allele=None)[source]¶

Find non-segregating variants (where at most one allele is observed).

Parameters:

allele : int, optional

Allele index.

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_non_segregating()
array([ True, False, False,  True], dtype=bool)
>>> ac.is_non_segregating(allele=2)
array([False, False, False,  True], dtype=bool)

is_singleton(allele)[source]¶

Find variants with a single call for the given allele.

Parameters:

allele : int, optional

Allele index.

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[1, 1], [1, 2]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_singleton(allele=1)
array([False,  True, False, False], dtype=bool)
>>> ac.is_singleton(allele=2)
array([False, False,  True, False], dtype=bool)

is_doubleton(allele)[source]¶

Find variants with exactly two calls for the given allele.

Parameters:

allele : int, optional

Allele index.

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [1, 1]],
...                          [[1, 1], [1, 2]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_doubleton(allele=1)
array([False,  True, False, False], dtype=bool)
>>> ac.is_doubleton(allele=2)
array([False, False, False,  True], dtype=bool)

is_biallelic()[source]¶

Find biallelic variants.

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

is_biallelic_01(min_mac=None)[source]¶

Find variants biallelic for the reference (0) and first alternate (1) allele.

Parameters:

min_mac : int, optional

Minimum minor allele count.

Returns:

out : ndarray, bool, shape (n_variants,)

Boolean array where elements are True if variant matches the condition.

count_variant()[source]¶

count_non_variant()[source]¶

count_segregating()[source]¶

count_non_segregating(allele=None)[source]¶

count_singleton(allele=1)[source]¶

count_doubleton(allele=1)[source]¶

to_frequencies(fill=nan)[source]¶

Compute allele frequencies.

Parameters:

fill : float, optional

Value to use when number of allele calls is 0.

Returns:

af : ndarray, float, shape (n_variants, n_alleles)

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.to_frequencies()
array([[ 0.75,  0.25,  0.  ],
       [ 0.25,  0.5 ,  0.25],
       [ 0.  ,  0.  ,  1.  ]])

map_alleles(mapping)[source]¶

Transform alleles via a mapping.

Parameters:

mapping : ndarray, int8, shape (n_variants, max_allele)

An array defining the allele mapping for each variant.

Returns:

ac : AlleleCountsArray

See also

create_allele_mapping

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac
AlleleCountsArray((4, 3), dtype=int32)
[[4 0 0]
 [3 1 0]
 [1 2 1]
 [0 0 2]]
>>> mapping = [[1, 0, 2],
...            [1, 0, 2],
...            [2, 1, 0],
...            [1, 2, 0]]
>>> ac.map_alleles(mapping)
AlleleCountsArray((4, 3), dtype=int64)
[[0 4 0]
 [1 3 0]
 [1 2 1]
 [2 0 0]]

vstack(*others)¶: Stack arrays in sequence vertically (row wise).

hstack(*others)¶: Stack arrays in sequence horizontally (column wise).

VariantTable¶

class allel.model.ndarray.VariantTable[source]¶

Table (catalogue) of variants.

Parameters:

data : array_like, structured, shape (n_variants,)

Variant records.

index : string or pair of strings, optional

Names of columns to use for positional index, e.g., ‘POS’ if table contains a ‘POS’ column and records from a single chromosome/contig, or (‘CHROM’, ‘POS’) if table contains records from multiple chromosomes/contigs.

**kwargs : keyword arguments, optional

Further keyword arguments are passed through to numpy.rec.array().

Examples

Instantiate a table from existing data:

>>> import allel
>>> records = [[b'chr1', 2, 35, 4.5, (1, 2)],
...            [b'chr1', 7, 12, 6.7, (3, 4)],
...            [b'chr2', 3, 78, 1.2, (5, 6)],
...            [b'chr2', 9, 22, 4.4, (7, 8)],
...            [b'chr3', 6, 99, 2.8, (9, 10)]]
>>> dtype = [('CHROM', 'S4'),
...          ('POS', 'u4'),
...          ('DP', int),
...          ('QD', float),
...          ('AC', (int, 2))]
>>> vt = allel.VariantTable(records, dtype=dtype,
...                         index=('CHROM', 'POS'))
>>> vt.names
('CHROM', 'POS', 'DP', 'QD', 'AC')
>>> vt.n_variants
5

Access a column:

>>> vt['DP']
array([35, 12, 78, 22, 99])

Access multiple columns:

>>> vt[['DP', 'QD']]  
VariantTable((5,), dtype=(numpy.record, [('DP', '<i8'), ('QD', '<f8...
[(35, 4.5) (12, 6.7) (78, 1.2) (22, 4.4) (99, 2.8)]

Access a row:

>>> vt[2]
(b'chr2', 3, 78, 1.2, array([5, 6]))

Access multiple rows:

>>> vt[2:4]  
VariantTable((2,), dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '...
[(b'chr2', 3, 78, 1.2, array([5, 6])) (b'chr2', 9, 22, 4.4, array([...

Evaluate expressions against the table:

>>> vt.eval('DP > 30')
array([ True, False,  True, False,  True], dtype=bool)
>>> vt.eval('(DP > 30) & (QD > 4)')
array([ True, False, False, False, False], dtype=bool)
>>> vt.eval('DP * 2')
array([ 70,  24, 156,  44, 198])

Query the table:

>>> vt.query('DP > 30')  
VariantTable((3,), dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '...
[(b'chr1', 2, 35, 4.5, array([1, 2])) (b'chr2', 3, 78, 1.2, array([...
 (b'chr3', 6, 99, 2.8, array([ 9, 10]))]
>>> vt.query('(DP > 30) & (QD > 4)')  
VariantTable((1,), dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '...
[(b'chr1', 2, 35, 4.5, array([1, 2]))]

Use the index to query variants:

>>> vt.query_region(b'chr2', 1, 10)  
VariantTable((2,), dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '...
[(b'chr2', 3, 78, 1.2, array([5, 6])) (b'chr2', 9, 22, 4.4, array([...

n_variants¶: Number of variants (length of first dimension).

names¶: Column names.

eval(expression, vm='python')¶

Evaluate an expression against the table columns.

Parameters:

expression : string

Expression to evaluate.

vm : {‘numexpr’, ‘python’}

Virtual machine to use.

Returns:

result : ndarray

query(expression, vm='python')¶

Evaluate expression and then use it to extract rows from the table.

Parameters:

expression : string

Expression to evaluate.

vm : {‘numexpr’, ‘python’}

Virtual machine to use.

Returns:

result : structured array

query_position(chrom=None, position=None)[source]¶

Query the table, returning row or rows matching the given genomic position.

Parameters:

chrom : string, optional

Chromosome/contig.

position : int, optional

Position (1-based).

Returns:

result : row or VariantTable

query_region(chrom=None, start=None, stop=None)[source]¶

Query the table, returning row or rows within the given genomic region.

Parameters:

chrom : string, optional

Chromosome/contig.

start : int, optional

Region start position (1-based).

stop : int, optional

Region stop position (1-based).

Returns:

result : VariantTable

to_vcf(path, rename=None, number=None, description=None, fill=None, write_header=True)[source]¶

Write to a variant call format (VCF) file.

Parameters:

path : string

File path.

rename : dict, optional

Rename these columns in the VCF.

number : dict, optional

Override the number specified in INFO headers.

description : dict, optional

Descriptions for the INFO and FILTER headers.

fill : dict, optional

Fill values used for missing data in the table.

Examples

Setup a variant table to write out:

>>> import allel
>>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
>>> pos = [2, 6, 3, 8, 1]
>>> id = ['a', 'b', 'c', 'd', 'e']
>>> ref = [b'A', b'C', b'T', b'G', b'N']
>>> alt = [(b'T', b'.'),
...        (b'G', b'.'),
...        (b'A', b'C'),
...        (b'C', b'A'),
...        (b'X', b'.')]
>>> qual = [1.2, 2.3, 3.4, 4.5, 5.6]
>>> filter_qd = [True, True, True, False, False]
>>> filter_dp = [True, False, True, False, False]
>>> dp = [12, 23, 34, 45, 56]
>>> qd = [12.3, 23.4, 34.5, 45.6, 56.7]
>>> flg = [True, False, True, False, True]
>>> ac = [(1, -1), (3, -1), (5, 6), (7, 8), (9, -1)]
>>> xx = [(1.2, 2.3), (3.4, 4.5), (5.6, 6.7), (7.8, 8.9),
...       (9.0, 9.9)]
>>> columns = [chrom, pos, id, ref, alt, qual, filter_dp,
...            filter_qd, dp, qd, flg, ac, xx]
>>> records = list(zip(*columns))
>>> dtype = [('chrom', 'S4'),
...          ('pos', 'u4'),
...          ('ID', 'S1'),
...          ('ref', 'S1'),
...          ('alt', ('S1', 2)),
...          ('qual', 'f4'),
...          ('filter_dp', bool),
...          ('filter_qd', bool),
...          ('dp', int),
...          ('qd', float),
...          ('flg', bool),
...          ('ac', (int, 2)),
...          ('xx', (float, 2))]
>>> vt = allel.VariantTable(records, dtype=dtype)

Now write out to VCF and inspect the result:

>>> rename = {'dp': 'DP', 'qd': 'QD', 'filter_qd': 'QD'}
>>> fill = {'ALT': b'.', 'ac': -1}
>>> number = {'ac': 'A'}
>>> description = {'ac': 'Allele counts', 'filter_dp': 'Low depth'}
>>> vt.to_vcf('example.vcf', rename=rename, fill=fill,
...           number=number, description=description)
>>> print(open('example.vcf').read())  
##fileformat=VCFv4.1
##fileDate=...
##source=...
##INFO=<ID=DP,Number=1,Type=Integer,Description="">
##INFO=<ID=QD,Number=1,Type=Float,Description="">
##INFO=<ID=ac,Number=A,Type=Integer,Description="Allele counts">
##INFO=<ID=flg,Number=0,Type=Flag,Description="">
##INFO=<ID=xx,Number=2,Type=Float,Description="">
##FILTER=<ID=QD,Description="">
##FILTER=<ID=dp,Description="Low depth">
#CHROM      POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1        2       a       A       T       1.2     QD;dp   DP=12;QD=12.3;ac=1;flg;xx=...
chr1        6       b       C       G       2.3     QD      DP=23;QD=23.4;ac=3;xx=3.4,4.5
chr2        3       c       T       A,C     3.4     QD;dp   DP=34;QD=34.5;ac=5,6;flg;x...
chr2        8       d       G       C,A     4.5     PASS    DP=45;QD=45.6;ac=7,8;xx=7...
chr3        1       e       N       X       5.6     PASS    DP=56;QD=56.7;ac=9;flg;xx=...

FeatureTable¶

class allel.model.ndarray.FeatureTable[source]¶

Table of genomic features (e.g., genes, exons, etc.).

Parameters:

data : array_like, structured, shape (n_variants,)

Variant records.

index : pair or triplet of strings, optional

Names of columns to use for positional index, e.g., (‘start’, ‘stop’) if table contains ‘start’ and ‘stop’ columns and records from a single chromosome/contig, or (‘seqid’, ‘start’, ‘end’) if table contains records from multiple chromosomes/contigs.

**kwargs : keyword arguments, optional

Further keyword arguments are passed through to numpy.rec.array().

n_features¶: Number of features (length of first dimension).

names¶: Column names.

eval(expression, vm='python')¶

Evaluate an expression against the table columns.

Parameters:

expression : string

Expression to evaluate.

vm : {‘numexpr’, ‘python’}

Virtual machine to use.

Returns:

result : ndarray

query(expression, vm='python')¶

Evaluate expression and then use it to extract rows from the table.

Parameters:

expression : string

Expression to evaluate.

vm : {‘numexpr’, ‘python’}

Virtual machine to use.

Returns:

result : structured array

static from_gff3(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', dtype=None)[source]¶

Read a feature table from a GFF3 format file.

Parameters:

path : string

File path.

attributes : list of strings, optional

List of columns to extract from the “attributes” field.

region : string, optional

Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.

score_fill : object, optional

Value to use where score field has a missing value.

phase_fill : object, optional

Value to use where phase field has a missing value.

attributes_fill : object or list of objects, optional

Value(s) to use where attribute field(s) have a missing value.

dtype : numpy dtype, optional

Manually specify a dtype.

Returns:

ft : FeatureTable

to_mask(size, start_name='start', stop_name='end')[source]¶

Construct a mask array where elements are True if the fall within features in the table.

Parameters:

size : int

Size of chromosome/contig.

start_name : string, optional

Name of column with start coordinates.

stop_name : string, optional

Name of column with stop coordinates.

Returns:

mask : ndarray, bool

SortedIndex¶

class allel.model.ndarray.SortedIndex[source]¶

Index of sorted values, e.g., positions from a single chromosome or contig.

Parameters:

data : array_like

Values in ascending order.

**kwargs : keyword arguments

All keyword arguments are passed through to numpy.array().

Notes

Values must be given in ascending order, although duplicate values may be present (i.e., values must be monotonically increasing).

Examples

>>> import allel
>>> idx = allel.SortedIndex([2, 5, 14, 15, 42, 42, 77], dtype='i4')
>>> idx.dtype
dtype('int32')
>>> idx.ndim
1
>>> idx.shape
(7,)
>>> idx.is_unique
False

is_unique¶: True if no duplicate entries.

locate_key(key)[source]¶

Get index location for the requested key.

Parameters:

key : int

Value to locate.

Returns:

loc : int or slice

Location of key (will be slice if there are duplicate entries).

Examples

>>> import allel
>>> idx = allel.SortedIndex([3, 6, 6, 11])
>>> idx.locate_key(3)
0
>>> idx.locate_key(11)
3
>>> idx.locate_key(6)
slice(1, 3, None)
>>> try:
...     idx.locate_key(2)
... except KeyError as e:
...     print(e)
...
2

locate_keys(keys, strict=True)[source]¶

Get index locations for the requested keys.

Parameters:

keys : array_like, int

Array of keys to locate.

strict : bool, optional

If True, raise KeyError if any keys are not found in the index.

Returns:

loc : ndarray, bool

Boolean array with location of values.

Examples

>>> import allel
>>> idx1 = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx2 = allel.SortedIndex([4, 6, 20, 39])
>>> loc = idx1.locate_keys(idx2, strict=False)
>>> loc
array([False,  True, False,  True, False], dtype=bool)
>>> idx1[loc]
SortedIndex((2,), dtype=int64)
[ 6 20]

locate_intersection(other)[source]¶

Locate the intersection with another array.

Parameters:

other : array_like, int

Array of values to intersect.

Returns:

loc : ndarray, bool

Boolean array with location of intersection.

loc_other : ndarray, bool

Boolean array with location in other of intersection.

Examples

>>> import allel
>>> idx1 = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx2 = allel.SortedIndex([4, 6, 20, 39])
>>> loc1, loc2 = idx1.locate_intersection(idx2)
>>> loc1
array([False,  True, False,  True, False], dtype=bool)
>>> loc2
array([False,  True,  True, False], dtype=bool)
>>> idx1[loc1]
SortedIndex((2,), dtype=int64)
[ 6 20]
>>> idx2[loc2]
SortedIndex((2,), dtype=int64)
[ 6 20]

intersect(other)[source]¶

Intersect with other sorted index.

Parameters:

other : array_like, int

Array of values to intersect with.

Returns:

out : SortedIndex

Values in common.

Examples

>>> import allel
>>> idx1 = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx2 = allel.SortedIndex([4, 6, 20, 39])
>>> idx1.intersect(idx2)
SortedIndex((2,), dtype=int64)
[ 6 20]

locate_range(start=None, stop=None)[source]¶

Locate slice of index containing all entries within start and stop values inclusive.

Parameters:

start : int, optional

Start value.

stop : int, optional

Stop value.

Returns:

loc : slice

Slice object.

Examples

>>> import allel
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> loc = idx.locate_range(4, 32)
>>> loc
slice(1, 4, None)
>>> idx[loc]
SortedIndex((3,), dtype=int64)
[ 6 11 20]

intersect_range(start=None, stop=None)[source]¶

Intersect with range defined by start and stop values inclusive.

Parameters:

start : int, optional

Start value.

stop : int, optional

Stop value.

Returns:

idx : SortedIndex

Examples

>>> import allel
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx.intersect_range(4, 32)
SortedIndex((3,), dtype=int64)
[ 6 11 20]

locate_ranges(starts, stops, strict=True)[source]¶

Locate items within the given ranges.

Parameters:

starts : array_like, int

Range start values.

stops : array_like, int

Range stop values.

strict : bool, optional

If True, raise KeyError if any ranges contain no entries.

Returns:

loc : ndarray, bool

Boolean array with location of entries found.

Examples

>>> import allel
>>> import numpy as np
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35],
...                    [100, 120]])
>>> starts = ranges[:, 0]
>>> stops = ranges[:, 1]
>>> loc = idx.locate_ranges(starts, stops, strict=False)
>>> loc
array([False,  True,  True, False,  True], dtype=bool)
>>> idx[loc]
SortedIndex((3,), dtype=int64)
[ 6 11 35]

locate_intersection_ranges(starts, stops)[source]¶

Locate the intersection with a set of ranges.

Parameters:

starts : array_like, int

Range start values.

stops : array_like, int

Range stop values.

Returns:

loc : ndarray, bool

Boolean array with location of entries found.

loc_ranges : ndarray, bool

Boolean array with location of ranges containing one or more entries.

Examples

>>> import allel
>>> import numpy as np
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35],
...                    [100, 120]])
>>> starts = ranges[:, 0]
>>> stops = ranges[:, 1]
>>> loc, loc_ranges = idx.locate_intersection_ranges(starts, stops)
>>> loc
array([False,  True,  True, False,  True], dtype=bool)
>>> loc_ranges
array([False,  True, False,  True, False], dtype=bool)
>>> idx[loc]
SortedIndex((3,), dtype=int64)
[ 6 11 35]
>>> ranges[loc_ranges]
array([[ 6, 17],
       [31, 35]])

intersect_ranges(starts, stops)[source]¶

Intersect with a set of ranges.

Parameters:

starts : array_like, int

Range start values.

stops : array_like, int

Range stop values.

Returns:

idx : SortedIndex

Examples

>>> import allel
>>> import numpy as np
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35],
...                    [100, 120]])
>>> starts = ranges[:, 0]
>>> stops = ranges[:, 1]
>>> idx.intersect_ranges(starts, stops)
SortedIndex((3,), dtype=int64)
[ 6 11 35]

UniqueIndex¶

class allel.model.ndarray.UniqueIndex[source]¶

Array of unique values (e.g., variant or sample identifiers).

Parameters:

data : array_like

Values.

**kwargs : keyword arguments

All keyword arguments are passed through to numpy.array().

Notes

This class represents an arbitrary set of unique values, e.g., sample or variant identifiers.

There is no need for values to be sorted. However, all values must be unique within the array, and must be hashable objects.

Examples

>>> import allel
>>> idx = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx.dtype
dtype('<U1')
>>> idx.ndim
1
>>> idx.shape
(4,)

locate_key(key)[source]¶

Get index location for the requested key.

Parameters:

key : object

Key to locate.

Returns:

loc : int

Location of key.

Examples

>>> import allel
>>> idx = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx.locate_key('A')
0
>>> idx.locate_key('B')
2
>>> try:
...     idx.locate_key('X')
... except KeyError as e:
...     print(e)
...
'X'

locate_keys(keys, strict=True)[source]¶

Get index locations for the requested keys.

Parameters:

keys : array_like

Array of keys to locate.

strict : bool, optional

If True, raise KeyError if any keys are not found in the index.

Returns:

loc : ndarray, bool

Boolean array with location of keys.

Examples

>>> import allel
>>> idx = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx.locate_keys(['F', 'C'])
array([False,  True, False,  True], dtype=bool)
>>> idx.locate_keys(['X', 'F', 'G', 'C', 'Z'], strict=False)
array([False,  True, False,  True], dtype=bool)

locate_intersection(other)[source]¶

Locate the intersection with another array.

Parameters:

other : array_like

Array to intersect.

Returns:

loc : ndarray, bool

Boolean array with location of intersection.

loc_other : ndarray, bool

Boolean array with location in other of intersection.

Examples

>>> import allel
>>> idx1 = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx2 = allel.UniqueIndex(['X', 'F', 'G', 'C', 'Z'])
>>> loc1, loc2 = idx1.locate_intersection(idx2)
>>> loc1
array([False,  True, False,  True], dtype=bool)
>>> loc2
array([False,  True, False,  True, False], dtype=bool)
>>> idx1[loc1]
UniqueIndex((2,), dtype=<U1)
['C' 'F']
>>> idx2[loc2]
UniqueIndex((2,), dtype=<U1)
['F' 'C']

intersect(other)[source]¶

Intersect with other.

Parameters:

other : array_like

Array to intersect.

Returns:

out : UniqueIndex

Examples

>>> import allel
>>> idx1 = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx2 = allel.UniqueIndex(['X', 'F', 'G', 'C', 'Z'])
>>> idx1.intersect(idx2)
UniqueIndex((2,), dtype=<U1)
['C' 'F']
>>> idx2.intersect(idx1)
UniqueIndex((2,), dtype=<U1)
['F' 'C']

SortedMultiIndex¶

class allel.model.ndarray.SortedMultiIndex(l1, l2, copy=False)[source]¶

Two-level index of sorted values, e.g., variant positions from two or more chromosomes/contigs.

Parameters:

l1 : array_like

First level values in ascending order.

l2 : array_like

Second level values, in ascending order within each sub-level.

copy : bool, optional

If True, inputs will be copied into new arrays.

Examples

>>> import allel
>>> chrom = ['chr1', 'chr1', 'chr2', 'chr2', 'chr2', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.SortedMultiIndex(chrom, pos)
>>> len(idx)
6

locate_key(k1, k2=None)[source]¶

Get index location for the requested key.

Parameters:

k1 : object

Level 1 key.

k2 : object, optional

Level 2 key.

Returns:

loc : int or slice

Location of requested key (will be slice if there are duplicate entries).

Examples

>>> import allel
>>> chrom = ['chr1', 'chr1', 'chr2', 'chr2', 'chr2', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.SortedMultiIndex(chrom, pos)
>>> idx.locate_key('chr1')
slice(0, 2, None)
>>> idx.locate_key('chr1', 4)
1
>>> idx.locate_key('chr2', 5)
slice(3, 5, None)
>>> try:
...     idx.locate_key('chr3', 4)
... except KeyError as e:
...     print(e)
...
('chr3', 4)

locate_range(k1, start=None, stop=None)[source]¶

Locate slice of index containing all entries within the range key:start-stop inclusive.

Parameters:

key : object

Level 1 key value.

start : object, optional

Level 2 start value.

stop : object, optional

Level 2 stop value.

Returns:

loc : slice

Slice object.

Examples

>>> import allel
>>> chrom = ['chr1', 'chr1', 'chr2', 'chr2', 'chr2', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.SortedMultiIndex(chrom, pos)
>>> idx.locate_range('chr1')
slice(0, 2, None)
>>> idx.locate_range('chr1', 1, 4)
slice(0, 2, None)
>>> idx.locate_range('chr2', 3, 7)
slice(3, 5, None)
>>> try:
...     idx.locate_range('chr3', 4, 9)
... except KeyError as e:
...     print(e)
('chr3', 4, 9)

Utility functions¶

allel.model.ndarray.create_allele_mapping(ref, alt, alleles, dtype='i1')[source]¶

Create an array mapping variant alleles into a different allele index system.

Parameters:

ref : array_like, S1, shape (n_variants,)

Reference alleles.

alt : array_like, S1, shape (n_variants, n_alt_alleles)

Alternate alleles.

alleles : array_like, S1, shape (n_variants, n_alleles)

Alleles defining the new allele indexing.

Returns:

mapping : ndarray, int8, shape (n_variants, n_alt_alleles + 1)

Examples

Example with biallelic variants:

>>> import allel
>>> from allel.model.ndarray import create_allele_mapping
>>> ref = [b'A', b'C', b'T', b'G']
>>> alt = [b'T', b'G', b'C', b'A']
>>> alleles = [[b'A', b'T'],  # no transformation
...            [b'G', b'C'],  # swap
...            [b'T', b'A'],  # 1 missing
...            [b'A', b'C']]  # 1 missing
>>> mapping = create_allele_mapping(ref, alt, alleles)
>>> mapping
array([[ 0,  1],
       [ 1,  0],
       [ 0, -1],
       [-1,  0]], dtype=int8)

Example with multiallelic variants:

>>> ref = [b'A', b'C', b'T']
>>> alt = [[b'T', b'G'],
...        [b'A', b'T'],
...        [b'G', b'.']]
>>> alleles = [[b'A', b'T'],
...            [b'C', b'T'],
...            [b'G', b'A']]
>>> mapping = create_allele_mapping(ref, alt, alleles)
>>> mapping
array([[ 0,  1, -1],
       [ 0, -1,  1],
       [-1,  0, -1]], dtype=int8)

allel.model.ndarray.locate_fixed_differences(ac1, ac2)[source]¶

Locate variants with no shared alleles between two populations.

Parameters:

ac1 : array_like, int, shape (n_variants, n_alleles)

Allele counts array from the first population.

ac2 : array_like, int, shape (n_variants, n_alleles)

Allele counts array from the second population.

Returns:

loc : ndarray, bool, shape (n_variants,)

See also

allel.stats.diversity.windowed_df

Examples

>>> import allel
>>> from allel.model.ndarray import locate_fixed_differences
>>> g = allel.GenotypeArray([[[0, 0], [0, 0], [1, 1], [1, 1]],
...                          [[0, 1], [0, 1], [0, 1], [0, 1]],
...                          [[0, 1], [0, 1], [1, 1], [1, 1]],
...                          [[0, 0], [0, 0], [1, 1], [2, 2]],
...                          [[0, 0], [-1, -1], [1, 1], [-1, -1]]])
>>> ac1 = g.count_alleles(subpop=[0, 1])
>>> ac2 = g.count_alleles(subpop=[2, 3])
>>> loc_df = locate_fixed_differences(ac1, ac2)
>>> loc_df
array([ True, False, False,  True,  True], dtype=bool)

allel.model.ndarray.locate_private_alleles(*acs)[source]¶

Locate alleles that are found only in a single population.

Parameters:

*acs : array_like, int, shape (n_variants, n_alleles)

Allele counts arrays from each population.

Returns:

loc : ndarray, bool, shape (n_variants, n_alleles)

Boolean array where elements are True if allele is private to a single population.

Examples

>>> import allel
>>> from allel.model.ndarray import locate_private_alleles
>>> g = allel.GenotypeArray([[[0, 0], [0, 0], [1, 1], [1, 1]],
...                          [[0, 1], [0, 1], [0, 1], [0, 1]],
...                          [[0, 1], [0, 1], [1, 1], [1, 1]],
...                          [[0, 0], [0, 0], [1, 1], [2, 2]],
...                          [[0, 0], [-1, -1], [1, 1], [-1, -1]]])
>>> ac1 = g.count_alleles(subpop=[0, 1])
>>> ac2 = g.count_alleles(subpop=[2])
>>> ac3 = g.count_alleles(subpop=[3])
>>> loc_private_alleles = locate_private_alleles(ac1, ac2, ac3)
>>> loc_private_alleles
array([[ True, False, False],
       [False, False, False],
       [ True, False, False],
       [ True,  True,  True],
       [ True,  True, False]], dtype=bool)
>>> loc_private_variants = np.any(loc_private_alleles, axis=1)
>>> loc_private_variants
array([ True, False,  True,  True,  True], dtype=bool)

allel.model.ndarray.sample_to_haplotype_selection(indices, ploidy)[source]¶