In-memory data structures¶

GenotypeArray¶

class allel.GenotypeArray(data, copy=False, **kwargs)[source]¶

Array of discrete genotype calls for a matrix of variants and samples.

Parameters:	data : array_like, int, shape (n_variants, n_samples, ploidy) Genotype data. copy : bool, optional If True, make a copy of data. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

See also

Genotypes, GenotypeVector, HaplotypeArray, GenotypeAlleleCountsArray

Notes

This class represents data on discrete genotype calls as a 3-dimensional numpy array of integers. By convention, the first dimension corresponds to the variants genotyped, the second dimension corresponds to the samples genotyped, and the third dimension corresponds to the ploidy of the samples.

Each integer within the array corresponds to an allele index, where 0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, … and -1 (or any other negative integer) is a missing allele call. A single byte integer dtype (int8) can represent up to 127 distinct alleles, which is usually sufficient. The actual alleles (i.e., the alternate nucleotide sequences) and the physical positions of the variants within the genome of an organism are stored in separate arrays, discussed elsewhere.

Arrays of this class can store either phased or unphased genotype calls. If the genotypes are phased (i.e., haplotypes have been resolved) then individual haplotypes can be extracted by converting to a HaplotypeArray then indexing the second dimension. If the genotype calls are unphased then the ordering of alleles along the third (ploidy) dimension is arbitrary. N.B., this means that an unphased diploid heterozygous call could be stored as (0, 1) or equivalently as (1, 0).

A genotype array can store genotype calls with any ploidy > 1. For haploid calls, use a HaplotypeArray. Note that genotype arrays are not capable of storing calls for samples with differing or variable ploidy, see GenotypeAlleleCountsArray instead.

With genotype data on large numbers of variants and/or samples, storing the genotype calls in memory as an uncompressed numpy array if integers may be impractical. For working with large arrays of genotype data, see the allel.model.chunked and allel.model.dask modules.

Examples

Instantiate a genotype array:

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> g.dtype
dtype('int8')
>>> g.ndim
3
>>> g.shape
(3, 2, 2)
>>> g.n_variants
3
>>> g.n_samples
2
>>> g.ploidy
2
>>> g
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0/0 0/1
0/1 1/1
0/2 ./.

Genotype calls for a single variant at all samples can be obtained by indexing the first dimension, e.g.:

>>> g[1]
<GenotypeVector shape=(2, 2) dtype=int8>
0/1 1/1

Genotype calls for a single sample at all variants can be obtained by indexing the second dimension, e.g.:

>>> g[:, 1]
<GenotypeVector shape=(3, 2) dtype=int8>
0/1 1/1 ./.

A genotype call for a single sample at a single variant can be obtained by indexing the first and second dimensions, e.g.:

>>> g[1, 0]
array([0, 1], dtype=int8)

A genotype array can store polyploid calls, e.g.:

>>> g = allel.GenotypeArray([[[0, 0, 0], [0, 0, 1]],
...                          [[0, 1, 1], [1, 1, 1]],
...                          [[0, 1, 2], [-1, -1, -1]]],
...                         dtype='i1')
>>> g.ploidy
3
>>> g
<GenotypeArray shape=(3, 2, 3) dtype=int8>
0/0/0 0/0/1
0/1/1 1/1/1
0/1/2 ././.

n_variants¶: Number of variants.

n_samples¶: Number of samples.

ploidy¶: Sample ploidy.

n_calls¶: Total number of genotype calls.

n_allele_calls¶: Total number of allele calls.

count_alleles(self, max_allele=None, subpop=None)[source]¶

Count the number of calls of each allele per variant.

Parameters:	max_allele : int, optional The highest allele index to count. Alleles above this will be ignored. subpop : sequence of ints, optional Indices of samples to include in count.
Returns:	ac : AlleleCountsArray

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.count_alleles()
<AlleleCountsArray shape=(3, 3) dtype=int32>
3 1 0
1 2 1
0 0 2
>>> g.count_alleles(max_allele=1)
<AlleleCountsArray shape=(3, 2) dtype=int32>
3 1
1 2
0 0

count_alleles_subpops(self, subpops, max_allele=None)[source]¶

Count alleles for multiple subpopulations simultaneously.

Parameters:	subpops : dict (string -> sequence of ints) Mapping of subpopulation names to sample indices. max_allele : int, optional The highest allele index to count. Alleles above this will be ignored.
Returns:	out : dict (string -> AlleleCountsArray) A mapping of subpopulation names to allele counts arrays.

to_packed(self, boundscheck=True)[source]¶

Pack diploid genotypes into a single byte for each genotype, using the left-most 4 bits for the first allele and the right-most 4 bits for the second allele. Allows single byte encoding of diploid genotypes for variants with up to 15 alleles.

Parameters:	boundscheck : bool, optional If False, do not check that minimum and maximum alleles are compatible with bit-packing.
Returns:	packed : ndarray, uint8, shape (n_variants, n_samples) Bit-packed genotype array.

Notes

If a mask has been set, it is ignored by this function.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]], dtype='i1')
>>> g.to_packed()
array([[  0,   1],
       [  2,  17],
       [ 34, 239]], dtype=uint8)

classmethod from_packed(cls, packed)[source]¶

Unpack diploid genotypes that have been bit-packed into single bytes.

Parameters:	packed : ndarray, uint8, shape (n_variants, n_samples) Bit-packed diploid genotype array.
Returns:	g : GenotypeArray, shape (n_variants, n_samples, 2) Genotype array.

Examples

>>> import allel
>>> import numpy as np
>>> packed = np.array([[0, 1],
...                    [2, 17],
...                    [34, 239]], dtype='u1')
>>> allel.GenotypeArray.from_packed(packed)
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0/0 0/1
0/2 1/1
2/2 ./.

to_sparse(self, format='csr', **kwargs)[source]¶

Convert into a sparse matrix.

Parameters:	format : {‘coo’, ‘csc’, ‘csr’, ‘dia’, ‘dok’, ‘lil’} Sparse matrix format. kwargs : keyword arguments Passed through to sparse matrix constructor.
Returns:	m : scipy.sparse.spmatrix Sparse matrix

Notes

If a mask has been set, it is ignored by this function.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 1], [0, 1]],
...                          [[1, 1], [0, 0]],
...                          [[0, 0], [-1, -1]]], dtype='i1')
>>> m = g.to_sparse(format='csr')
>>> m
<4x4 sparse matrix of type '<class 'numpy.int8'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> m.data
array([ 1,  1,  1,  1, -1, -1], dtype=int8)
>>> m.indices
array([1, 3, 0, 1, 2, 3], dtype=int32)
>>> m.indptr
array([0, 0, 2, 4, 6], dtype=int32)

static from_sparse(m, ploidy, order=None, out=None)[source]¶

Construct a genotype array from a sparse matrix.

Parameters:	m : scipy.sparse.spmatrix Sparse matrix ploidy : int The sample ploidy. order : {‘C’, ‘F’}, optional Whether to store data in C (row-major) or Fortran (column-major) order in memory. out : ndarray, shape (n_variants, n_samples), optional Use this array as the output buffer.
Returns:	g : GenotypeArray, shape (n_variants, n_samples, ploidy) Genotype array.

Examples

>>> import allel
>>> import numpy as np
>>> import scipy.sparse
>>> data = np.array([ 1,  1,  1,  1, -1, -1], dtype=np.int8)
>>> indices = np.array([1, 3, 0, 1, 2, 3], dtype=np.int32)
>>> indptr = np.array([0, 0, 2, 4, 6], dtype=np.int32)
>>> m = scipy.sparse.csr_matrix((data, indices, indptr))
>>> g = allel.GenotypeArray.from_sparse(m, ploidy=2)
>>> g
<GenotypeArray shape=(4, 2, 2) dtype=int8>
0/0 0/0
0/1 0/1
1/1 0/0
0/0 ./.

haploidify_samples(self)[source]¶

Construct a pseudo-haplotype for each sample by randomly selecting an allele from each genotype call.

Returns:	h : HaplotypeArray

Notes

If a mask has been set, it is ignored by this function.

Examples

>>> import allel
>>> import numpy as np
>>> np.random.seed(42)
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[1, 2], [2, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.haploidify_samples()
<HaplotypeArray shape=(4, 2) dtype=int64>
0 1
0 1
1 1
2 .
>>> g = allel.GenotypeArray([[[0, 0, 0], [0, 0, 1]],
...                          [[0, 1, 1], [1, 1, 1]],
...                          [[0, 1, 2], [-1, -1, -1]]])
>>> g.haploidify_samples()
<HaplotypeArray shape=(3, 2) dtype=int64>
0 0
1 1
2 .

subset(self, sel0=None, sel1=None)[source]¶

Make a sub-selection of variants and samples.

Parameters:	sel0 : array_like Boolean array or array of indices selecting variants. sel1 : array_like Boolean array or array of indices selecting samples.
Returns:	out : GenotypeArray

See also

Genotypes.take, Genotypes.compress

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1], [1, 1]],
...                          [[0, 1], [1, 1], [1, 2]],
...                          [[0, 2], [-1, -1], [-1, -1]]])
>>> g.subset([0, 1], [0, 2])
<GenotypeArray shape=(2, 2, 2) dtype=int64>
0/0 1/1
0/1 1/2

GenotypeVector¶

class allel.GenotypeVector(data, copy=False, **kwargs)[source]¶

Array of genotype calls for a sequence of variants or samples.

Parameters:	data : array_like, int, shape (n_variants, n_samples, ploidy) Genotype data. copy : bool, optional If True, make a copy of data. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

See also

Genotypes, GenotypeArray, HaplotypeArray, GenotypeAlleleCountsArray

n_calls¶: Total number of genotype calls.

ploidy¶: Sample ploidy.

n_allele_calls¶: Total number of allele calls.

Genotypes¶

Methods available on both GenotypeArray and GenotypeVector classes:

class allel.Genotypes(data, copy=False, **kwargs)[source]¶

Base class for wrapping a NumPy array of genotype calls.

See also

GenotypeArray, GenotypeVector

mask¶

A boolean mask, indicating genotype calls that should be filtered (i.e., excluded) from genotype and allele counting operations.

Notes

This is a lightweight genotype call mask and not a mask in the sense of a numpy masked array. This means that the mask will only be taken into account by the genotype and allele counting methods of this class, and is ignored by any of the generic methods on the ndarray class or by any numpy ufuncs.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> g
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0/0 0/1
0/1 1/1
0/2 ./.
>>> g.count_called()
5
>>> g.count_alleles()
<AlleleCountsArray shape=(3, 3) dtype=int32>
3 1 0
1 3 0
1 0 1
>>> v = g[:, 1]
>>> v
<GenotypeVector shape=(3, 2) dtype=int8>
0/1 1/1 ./.
>>> v.is_called()
array([ True,  True, False])
>>> mask = [[True, False], [False, True], [False, False]]
>>> g.mask = mask
>>> g
<GenotypeArray shape=(3, 2, 2) dtype=int8>
./. 0/1
0/1 ./.
0/2 ./.
>>> g.count_called()
3
>>> g.count_alleles()
<AlleleCountsArray shape=(3, 3) dtype=int32>
1 1 0
1 1 0
1 0 1
>>> v = g[:, 1]
>>> v
<GenotypeVector shape=(3, 2) dtype=int8>
0/1 ./. ./.
>>> v.is_called()
array([ True, False, False])

is_phased¶

A Boolean array indicating which genotype calls are phased and which are not.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> g
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0/0 0/1
0/1 1/1
0/2 ./.
>>> g.is_phased = [[True, True], [False, True], [False, False]]
>>> g
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0|0 0|1
0/1 1|1
0/2 ./.

fill_masked(self, value=-1, copy=True)[source]¶

Fill masked genotype calls with a given value.

Parameters:	value : int, optional The fill value. copy : bool, optional If False, modify the array in place.
Returns:	g : GenotypeArray

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> mask = [[True, False], [False, True], [False, False]]
>>> g.mask = mask
>>> g.fill_masked().values
array([[[-1, -1],
        [ 0,  1]],
       [[ 0,  1],
        [-1, -1]],
       [[ 0,  2],
        [-1, -1]]], dtype=int8)

is_called(self)[source]¶

Find non-missing genotype calls.

Returns:	out : ndarray, bool, shape (n_variants, n_samples) Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_called()
array([[ True,  True],
       [ True,  True],
       [ True, False]])
>>> v = g[:, 1]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/1 1/1 ./.
>>> v.is_called()
array([ True,  True, False])

is_missing(self)[source]¶

Find missing genotype calls.

Returns:	out : ndarray, bool, shape (n_variants, n_samples) Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_missing()
array([[False, False],
       [False, False],
       [False,  True]])
>>> v = g[:, 1]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/1 1/1 ./.
>>> v.is_missing()
array([False, False,  True])

is_hom(self, allele=None)[source]¶

Find genotype calls that are homozygous.

Parameters:	allele : int, optional Allele index.
Returns:	out : ndarray, bool, shape (n_variants, n_samples) Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.is_hom()
array([[ True, False],
       [False,  True],
       [ True, False]])
>>> g.is_hom(allele=1)
array([[False, False],
       [False,  True],
       [False, False]])
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/1 2/2
>>> v.is_hom()
array([ True, False,  True])

is_hom_ref(self)[source]¶

Find genotype calls that are homozygous for the reference allele.

Returns:	out : ndarray, bool, shape (n_variants, n_samples) Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_hom_ref()
array([[ True, False],
       [False, False],
       [False, False]])
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/1 0/2
>>> v.is_hom_ref()
array([ True, False, False])

is_hom_alt(self)[source]¶

Find genotype calls that are homozygous for any alternate (i.e., non-reference) allele.

Returns:	out : ndarray, bool, shape (n_variants, n_samples) Array where elements are True if the genotype call matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.is_hom_alt()
array([[False, False],
       [False,  True],
       [ True, False]])
>>> v = g[:, 1]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/1 1/1 ./.
>>> v.is_hom_alt()
array([False,  True, False])

is_het(self, allele=None)[source]¶

Find genotype calls that are heterozygous.

Returns:	out : ndarray, bool, shape (n_variants, n_samples) Array where elements are True if the genotype call matches the condition. allele : int, optional Heterozygous allele.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_het()
array([[False,  True],
       [ True, False],
       [ True, False]])
>>> g.is_het(2)
array([[False, False],
       [False, False],
       [ True, False]])
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/1 0/2
>>> v.is_het()
array([False,  True,  True])

is_call(self, call)[source]¶

Locate genotypes with a given call.

Parameters:	call : array_like, int, shape (ploidy,) The genotype call to find.
Returns:	out : ndarray, bool, shape (n_variants, n_samples) Array where elements are True if the genotype is call.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.is_call((0, 2))
array([[False, False],
       [False, False],
       [ True, False]])
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/1 0/2
>>> v.is_call((0, 2))
array([False, False,  True])

count_called(self, axis=None)[source]¶

Count called genotypes.

Parameters:	axis : int, optional Axis over which to count, or None to perform overall count.

count_missing(self, axis=None)[source]¶

Count missing genotypes.

Parameters:	axis : int, optional Axis over which to count, or None to perform overall count.

count_hom(self, allele=None, axis=None)[source]¶

Count homozygous genotypes.

Parameters:	allele : int, optional Allele index. axis : int, optional Axis over which to count, or None to perform overall count.

count_hom_ref(self, axis=None)[source]¶

Count homozygous reference genotypes.

Parameters:	axis : int, optional Axis over which to count, or None to perform overall count.

count_hom_alt(self, axis=None)[source]¶

Count homozygous alternate genotypes.

Parameters:	axis : int, optional Axis over which to count, or None to perform overall count.

count_het(self, allele=None, axis=None)[source]¶

Count heterozygous genotypes.

Parameters:	allele : int, optional Allele index. axis : int, optional Axis over which to count, or None to perform overall count.

count_call(self, call, axis=None)[source]¶

Count genotypes with a given call.

Parameters:	call : array_like, int, shape (ploidy,) The genotype call to find. axis : int, optional Axis over which to count, or None to perform overall count.

to_n_ref(self, fill=0, dtype='i1')[source]¶

Transform each genotype call into the number of reference alleles.

Parameters:	fill : int, optional Use this value to represent missing calls. dtype : dtype, optional Output dtype.
Returns:	out : ndarray, int8, shape (n_variants, n_samples) Array of ref alleles per genotype call.

Notes

By default this function returns 0 for missing genotype calls and for homozygous non-reference genotype calls. Use the fill argument to change how missing calls are represented.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_n_ref()
array([[2, 1],
       [1, 0],
       [0, 0]], dtype=int8)
>>> g.to_n_ref(fill=-1)
array([[ 2,  1],
       [ 1,  0],
       [ 0, -1]], dtype=int8)
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/2 2/2
>>> v.to_n_ref()
array([2, 1, 0], dtype=int8)

to_n_alt(self, fill=0, dtype='i1')[source]¶

Transform each genotype call into the number of non-reference alleles.

Parameters:	fill : int, optional Use this value to represent missing calls. dtype : dtype, optional Output dtype.
Returns:	out : ndarray, int, shape (n_variants, n_samples) Array of non-ref alleles per genotype call.

Notes

This function simply counts the number of non-reference alleles, it makes no distinction between different alternate alleles.

By default this function returns 0 for missing genotype calls and for homozygous reference genotype calls. Use the fill argument to change how missing calls are represented.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_n_alt()
array([[0, 1],
       [1, 2],
       [2, 0]], dtype=int8)
>>> g.to_n_alt(fill=-1)
array([[ 0,  1],
       [ 1,  2],
       [ 2, -1]], dtype=int8)
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/2 2/2
>>> v.to_n_alt()
array([0, 1, 2], dtype=int8)

to_allele_counts(self, max_allele=None, dtype='u1')[source]¶

Transform genotype calls into allele counts per call.

Parameters:	max_allele : int, optional Highest allele index. Provide this value to speed up computation. dtype : dtype, optional Output dtype.
Returns:	out : ndarray, uint8, shape (n_variants, n_samples, len(alleles)) Array of allele counts per call.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_allele_counts()
<GenotypeAlleleCountsArray shape=(3, 2, 3) dtype=uint8>
2:0:0 1:1:0
1:0:1 0:2:0
0:0:2 0:0:0
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/2 2/2
>>> v.to_allele_counts()
<GenotypeAlleleCountsVector shape=(3, 3) dtype=uint8>
2:0:0 1:0:1 0:0:2

to_gt(self, max_allele=None)[source]¶

Convert genotype calls to VCF-style string representation.

Returns:	gt : ndarray, string, shape (n_variants, n_samples)

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[1, 2], [2, 1]],
...                          [[2, 2], [-1, -1]]])
>>> g.to_gt()
chararray([[b'0/0', b'0/1'],
       [b'0/2', b'1/1'],
       [b'1/2', b'2/1'],
       [b'2/2', b'./.']],
      dtype='|S3')
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(4, 2) dtype=int64>
0/0 0/2 1/2 2/2
>>> v.to_gt()
chararray([b'0/0', b'0/2', b'1/2', b'2/2'],
      dtype='|S3')
>>> g.is_phased = np.ones(g.shape[:-1])
>>> g.to_gt()
chararray([[b'0|0', b'0|1'],
       [b'0|2', b'1|1'],
       [b'1|2', b'2|1'],
       [b'2|2', b'.|.']],
      dtype='|S3')
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(4, 2) dtype=int64>
0|0 0|2 1|2 2|2
>>> v.to_gt()
chararray([b'0|0', b'0|2', b'1|2', b'2|2'],
      dtype='|S3')

map_alleles(self, mapping, copy=True)[source]¶

Transform alleles via a mapping.

Parameters:	mapping : ndarray, int8, shape (n_variants, max_allele) An array defining the allele mapping for each variant. copy : bool, optional If True, return a new array; if False, apply mapping in place (only applies for arrays with dtype int8; all other dtypes require a copy).
Returns:	gm : GenotypeArray

See also

create_allele_mapping

Notes

If a mask has been set, it is ignored by this function.

For arrays with dtype int8 an optimised implementation is used which is faster and uses far less memory. It is recommended to convert arrays to dtype int8 where possible before calling this method.

Examples

>>> import allel
>>> import numpy as np
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[1, 2], [2, 1]],
...                          [[2, 2], [-1, -1]]], dtype='i1')
>>> mapping = np.array([[1, 2, 0],
...                     [2, 0, 1],
...                     [2, 1, 0],
...                     [0, 2, 1]], dtype='i1')
>>> g.map_alleles(mapping)
<GenotypeArray shape=(4, 2, 2) dtype=int8>
1/1 1/2
2/1 0/0
1/0 0/1
1/1 ./.
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(4, 2) dtype=int8>
0/0 0/2 1/2 2/2
>>> v.map_alleles(mapping)
<GenotypeVector shape=(4, 2) dtype=int8>
1/1 2/1 1/0 1/1

compress(self, condition, axis=0, out=None)[source]¶

Return selected slices of an array along given axis.

Parameters:

condition : array_like, bool: Array that selects which entries to return. N.B., if len(condition) is less than the size of the given axis, then output is truncated to the length of the condition array.
axis : int, optional: Axis along which to take slices. If None, work on the flattened array.
out : ndarray, optional: Output array. Its type is preserved and it must be of the right shape to hold the output.

Returns:

out : Genotypes: A copy of the array without the slices along axis for which condition is false.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.compress([True, False, True], axis=0)
<GenotypeArray shape=(2, 2, 2) dtype=int64>
0/0 0/1
0/2 ./.
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/1 0/2
>>> v.compress([True, False, True], axis=0)
<GenotypeVector shape=(2, 2) dtype=int64>
0/0 0/2

take(self, indices, axis=0, out=None, mode='raise')[source]¶

Take elements from an array along an axis.

This function does the same thing as “fancy” indexing (indexing arrays using arrays); however, it can be easier to use if you need elements along a given axis.

Parameters:

indices : array_like

The indices of the values to extract.

axis : int, optional

The axis over which to select values.

out : ndarray, optional

If provided, the result will be placed in this array. It should be of the appropriate shape and dtype.

mode : {‘raise’, ‘wrap’, ‘clip’}, optional

Specifies how out-of-bounds indices will behave.

‘raise’ – raise an error (default)
‘wrap’ – wrap around
‘clip’ – clip to the range

‘clip’ mode means that all indices that are too large are replaced by the index that addresses the last element along that axis. Note that this disables indexing with negative numbers.

Returns:

subarray : ndarray

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.take([0, 2], axis=0)
<GenotypeArray shape=(2, 2, 2) dtype=int64>
0/0 0/1
0/2 ./.
>>> v = g[:, 0]
>>> v
<GenotypeVector shape=(3, 2) dtype=int64>
0/0 0/1 0/2
>>> v.take([0, 2], axis=0)
<GenotypeVector shape=(2, 2) dtype=int64>
0/0 0/2

concatenate(self, others, axis=0)[source]¶

Join a sequence of arrays along an existing axis.

Parameters:	others : sequence of array_like The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default). axis : int, optional The axis along which the arrays will be joined. Default is 0.
Returns:	res : ndarray The concatenated array.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]])
>>> g.concatenate([g], axis=0)
<GenotypeArray shape=(6, 2, 2) dtype=int64>
0/0 0/1
0/1 1/1
0/2 ./.
0/0 0/1
0/1 1/1
0/2 ./.
>>> g.concatenate([g], axis=1)
<GenotypeArray shape=(3, 4, 2) dtype=int64>
0/0 0/1 0/0 0/1
0/1 1/1 0/1 1/1
0/2 ./. 0/2 ./.
>>> v1 = g[:, 0]
>>> v2 = g[:, 1]
>>> v1.concatenate([v2], axis=0)
<GenotypeVector shape=(6, 2) dtype=int64>
0/0 0/1 0/2 0/1 1/1 ./.

HaplotypeArray¶

class allel.HaplotypeArray(data, copy=False, **kwargs)[source]¶

Array of haplotypes.

Parameters:	data : array_like, int, shape (n_variants, n_haplotypes) Haplotype data. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

See also

Genotypes, GenotypeArray, GenotypeVector

Notes

This class represents haplotype data as a 2-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the haplotypes.

Each integer within the array corresponds to an allele index, where 0 is the reference allele, 1 is the first alternate allele, 2 is the second alternate allele, … and -1 (or any other negative integer) is a missing allele call.

If adjacent haplotypes originate from the same sample, then a haplotype array can also be viewed as a genotype array. However, this is not a requirement.

With data on large numbers of variants and/or haplotypes, storing the data in memory as an uncompressed numpy array if integers may be impractical. For working with large arrays of haplotype data, see the allel.model.chunked and allel.model.dask modules.

Examples

Instantiate a haplotype array:

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> h.dtype
dtype('int8')
>>> h.ndim
2
>>> h.shape
(3, 4)
>>> h.n_variants
3
>>> h.n_haplotypes
4
>>> h
<HaplotypeArray shape=(3, 4) dtype=int8>
0 0 0 1
0 1 1 1
0 2 . .

Allele calls for a single variant at all haplotypes can be obtained by indexing the first dimension, e.g.:

>>> h[1]
array([0, 1, 1, 1], dtype=int8)

A single haplotype can be obtained by indexing the second dimension, e.g.:

>>> h[:, 1]
array([0, 1, 2], dtype=int8)

An allele call for a single haplotype at a single variant can be obtained by indexing the first and second dimensions, e.g.:

>>> h[1, 0]
0

View haplotypes as diploid genotypes:

>>> h.to_genotypes(ploidy=2)
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0/0 0/1
0/1 1/1
0/2 ./.

n_variants¶: Number of variants.

n_haplotypes¶: Number of haplotypes.

subset(self, sel0=None, sel1=None)[source]¶

Make a sub-selection of variants and haplotypes.

Parameters:	sel0 : array_like Boolean array or array of indices selecting variants. sel1 : array_like Boolean array or array of indices selecting haplotypes.
Returns:	out : HaplotypeArray

See also

HaplotypeArray.take, HaplotypeArray.compress

is_called(self)[source]¶

is_missing(self)[source]¶

is_ref(self)[source]¶

is_alt(self, allele=None)[source]¶

is_call(self, allele)[source]¶

count_called(self, axis=None)[source]¶

count_missing(self, axis=None)[source]¶

count_ref(self, axis=None)[source]¶

count_alt(self, axis=None)[source]¶

count_call(self, allele, axis=None)[source]¶

count_alleles(self, max_allele=None, subpop=None)[source]¶

Count the number of calls of each allele per variant.

Parameters:	max_allele : int, optional The highest allele index to count. Alleles greater than this index will be ignored. subpop : array_like, int, optional Indices of haplotypes to include.
Returns:	ac : AlleleCountsArray, int, shape (n_variants, n_alleles)

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> ac = h.count_alleles()
>>> ac
<AlleleCountsArray shape=(3, 3) dtype=int32>
3 1 0
1 3 0
1 0 1

count_alleles_subpops(self, subpops, max_allele=None)[source]¶

Count alleles for multiple subpopulations simultaneously.

Parameters:	subpops : dict (string -> sequence of ints) Mapping of subpopulation names to sample indices. max_allele : int, optional The highest allele index to count. Alleles above this will be ignored.
Returns:	out : dict (string -> AlleleCountsArray) A mapping of subpopulation names to allele counts arrays.

map_alleles(self, mapping, copy=True)[source]¶

Transform alleles via a mapping.

Parameters:	mapping : ndarray, int8, shape (n_variants, max_allele) An array defining the allele mapping for each variant. copy : bool, optional If True, return a new array; if False, apply mapping in place (only applies for arrays with dtype int8; all other dtypes require a copy).
Returns:	hm : HaplotypeArray

See also

allel.model.util.create_allele_mapping

Notes

For arrays with dtype int8 an optimised implementation is used which is faster and uses far less memory. It is recommended to convert arrays to dtype int8 where possible before calling this method.

Examples

>>> import allel
>>> import numpy as np
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> mapping = np.array([[1, 2, 0],
...                     [2, 0, 1],
...                     [2, 1, 0]], dtype='i1')
>>> h.map_alleles(mapping)
<HaplotypeArray shape=(3, 4) dtype=int8>
1 1 1 2
2 0 0 0
2 0 . .

to_genotypes(self, ploidy, copy=False)[source]¶

Reshape a haplotype array to view it as genotypes by restoring the ploidy dimension.

Parameters:	ploidy : int The sample ploidy. copy : bool, optional If True, make a copy of data.
Returns:	g : ndarray, int, shape (n_variants, n_samples, ploidy) Genotype array (sharing same underlying buffer). copy : bool, optional If True, copy the data.

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> h.to_genotypes(ploidy=2)
<GenotypeArray shape=(3, 2, 2) dtype=int8>
0/0 0/1
0/1 1/1
0/2 ./.

to_sparse(self, format='csr', **kwargs)[source]¶

Convert into a sparse matrix.

Parameters:	format : {‘coo’, ‘csc’, ‘csr’, ‘dia’, ‘dok’, ‘lil’} Sparse matrix format. kwargs : keyword arguments Passed through to sparse matrix constructor.
Returns:	m : scipy.sparse.spmatrix Sparse matrix

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 0],
...                           [0, 1, 0, 1],
...                           [1, 1, 0, 0],
...                           [0, 0, -1, -1]], dtype='i1')
>>> m = h.to_sparse(format='csr')
>>> m
<4x4 sparse matrix of type '<class 'numpy.int8'>'
    with 6 stored elements in Compressed Sparse Row format>
>>> m.data
array([ 1,  1,  1,  1, -1, -1], dtype=int8)
>>> m.indices
array([1, 3, 0, 1, 2, 3], dtype=int32)
>>> m.indptr
array([0, 0, 2, 4, 6], dtype=int32)

static from_sparse(m, order=None, out=None)[source]¶

Construct a haplotype array from a sparse matrix.

Parameters:	m : scipy.sparse.spmatrix Sparse matrix order : {‘C’, ‘F’}, optional Whether to store data in C (row-major) or Fortran (column-major) order in memory. out : ndarray, shape (n_variants, n_samples), optional Use this array as the output buffer.
Returns:	h : HaplotypeArray, shape (n_variants, n_haplotypes) Haplotype array.

Examples

>>> import allel
>>> import numpy as np
>>> import scipy.sparse
>>> data = np.array([ 1,  1,  1,  1, -1, -1], dtype=np.int8)
>>> indices = np.array([1, 3, 0, 1, 2, 3], dtype=np.int32)
>>> indptr = np.array([0, 0, 2, 4, 6], dtype=np.int32)
>>> m = scipy.sparse.csr_matrix((data, indices, indptr))
>>> h = allel.HaplotypeArray.from_sparse(m)
>>> h
<HaplotypeArray shape=(4, 4) dtype=int8>
0 0 0 0
0 1 0 1
1 1 0 0
0 0 . .

prefix_argsort(self)[source]¶: Return indices that would sort the haplotypes by prefix.

distinct(self)[source]¶: Return sets of indices for each distinct haplotype.

distinct_counts(self)[source]¶: Return counts for each distinct haplotype.

distinct_frequencies(self)[source]¶: Return frequencies for each distinct haplotype.

compress(self, condition, axis=0, out=None)[source]¶

Return selected slices of an array along given axis.

Parameters:

condition : array_like, bool: Array that selects which entries to return. N.B., if len(condition) is less than the size of the given axis, then output is truncated to the length of the condition array.
axis : int, optional: Axis along which to take slices. If None, work on the flattened array.
out : ndarray, optional: Output array. Its type is preserved and it must be of the right shape to hold the output.

Returns:

out : HaplotypeArray: A copy of the array without the slices along axis for which condition is false.

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> h.compress([True, False, True], axis=0)
<HaplotypeArray shape=(2, 4) dtype=int8>
0 0 0 1
0 2 . .
>>> h.compress([True, False, True, False], axis=1)
<HaplotypeArray shape=(3, 2) dtype=int8>
0 0
0 1
0 .

take(self, indices, axis=0, out=None, mode='raise')[source]¶

Take elements from an array along an axis.

This function does the same thing as “fancy” indexing (indexing arrays using arrays); however, it can be easier to use if you need elements along a given axis.

Parameters:

indices : array_like

The indices of the values to extract.

axis : int, optional

The axis over which to select values.

out : ndarray, optional

If provided, the result will be placed in this array. It should be of the appropriate shape and dtype.

mode : {‘raise’, ‘wrap’, ‘clip’}, optional

Specifies how out-of-bounds indices will behave.

‘raise’ – raise an error (default)
‘wrap’ – wrap around
‘clip’ – clip to the range

‘clip’ mode means that all indices that are too large are replaced by the index that addresses the last element along that axis. Note that this disables indexing with negative numbers.

Returns:

subarray : ndarray

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> h.take([0, 2], axis=0)
<HaplotypeArray shape=(2, 4) dtype=int8>
0 0 0 1
0 2 . .
>>> h.take([0, 2], axis=1)
<HaplotypeArray shape=(3, 2) dtype=int8>
0 0
0 1
0 .

subset(self, sel0=None, sel1=None)[source]

Make a sub-selection of variants and haplotypes.

Parameters:	sel0 : array_like Boolean array or array of indices selecting variants. sel1 : array_like Boolean array or array of indices selecting haplotypes.
Returns:	out : HaplotypeArray

See also

HaplotypeArray.take, HaplotypeArray.compress

concatenate(self, others, axis=0)[source]¶

Join a sequence of arrays along an existing axis.

Parameters:	others : sequence of array_like The arrays must have the same shape, except in the dimension corresponding to axis (the first, by default). axis : int, optional The axis along which the arrays will be joined. Default is 0.
Returns:	res : ndarray The concatenated array.

Examples

>>> import allel
>>> h = allel.HaplotypeArray([[0, 0, 0, 1],
...                           [0, 1, 1, 1],
...                           [0, 2, -1, -1]], dtype='i1')
>>> h.concatenate([h], axis=0)
<HaplotypeArray shape=(6, 4) dtype=int8>
0 0 0 1
0 1 1 1
0 2 . .
0 0 0 1
0 1 1 1
0 2 . .
>>> h.concatenate([h], axis=1)
<HaplotypeArray shape=(3, 8) dtype=int8>
0 0 0 1 0 0 0 1
0 1 1 1 0 1 1 1
0 2 . . 0 2 . .

AlleleCountsArray¶

class allel.AlleleCountsArray(data, copy=False, **kwargs)[source]¶

Array of allele counts.

Parameters:	data : array_like, int, shape (n_variants, n_alleles) Allele counts data. copy : bool, optional If True, make a copy of data. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

Notes

This class represents allele counts as a 2-dimensional numpy array of integers. By convention the first dimension corresponds to the variants genotyped, the second dimension corresponds to the alleles counted. The integer recorded is the total count of the given allele at the given variant, across all individuals.

Examples

Obtain allele counts from a genotype array:

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 1], [1, 1]],
...                          [[0, 2], [-1, -1]]], dtype='i1')
>>> ac = g.count_alleles()
>>> ac
<AlleleCountsArray shape=(3, 3) dtype=int32>
3 1 0
1 3 0
1 0 1
>>> ac.dtype
dtype('int32')
>>> ac.shape
(3, 3)
>>> ac.n_variants
3
>>> ac.n_alleles
3

Allele counts for a single variant can be obtained by indexing the first dimension, e.g.:

>>> ac[1]
array([1, 3, 0], dtype=int32)

Allele counts for a specific allele can be obtained by indexing the second dimension, e.g., reference allele counts:

>>> ac[:, 0]
array([3, 1, 1], dtype=int32)

Calculate the total number of alleles called for each variant:

>>> import numpy as np
>>> an = np.sum(ac, axis=1)
>>> an
array([4, 4, 2])

Add allele counts from two populations:

>>> ac + ac
<AlleleCountsArray shape=(3, 3) dtype=int32>
6 2 0
2 6 0
2 0 2

n_variants¶: Number of variants.

n_alleles¶: Number of alleles.

max_allele(self)[source]¶

Return the highest allele index for each variant.

Returns:	n : ndarray, int, shape (n_variants,) Allele index array.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.max_allele()
array([1, 2, 2], dtype=int8)

allelism(self)[source]¶

Determine the number of distinct alleles observed for each variant.

Returns:	n : ndarray, int, shape (n_variants,) Allelism array.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.allelism()
array([2, 3, 1])

is_variant(self)[source]¶

Find variants with at least one non-reference allele call.

Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_variant()
array([False,  True,  True,  True])

is_non_variant(self)[source]¶

Find variants with no non-reference allele calls.

Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_non_variant()
array([ True, False, False, False])

is_segregating(self)[source]¶

Find segregating variants (where more than one allele is observed).

Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_segregating()
array([False,  True,  True, False])

is_non_segregating(self, allele=None)[source]¶

Find non-segregating variants (where at most one allele is observed).

Parameters:	allele : int, optional Allele index.
Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_non_segregating()
array([ True, False, False,  True])
>>> ac.is_non_segregating(allele=2)
array([False, False, False,  True])

is_singleton(self, allele)[source]¶

Find variants with a single call for the given allele.

Parameters:	allele : int, optional Allele index.
Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[1, 1], [1, 2]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_singleton(allele=1)
array([False,  True, False, False])
>>> ac.is_singleton(allele=2)
array([False, False,  True, False])

is_doubleton(self, allele)[source]¶

Find variants with exactly two calls for the given allele.

Parameters:	allele : int, optional Allele index.
Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [1, 1]],
...                          [[1, 1], [1, 2]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.is_doubleton(allele=1)
array([False,  True, False, False])
>>> ac.is_doubleton(allele=2)
array([False, False, False,  True])

is_biallelic(self)[source]¶

Find biallelic variants.

Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

is_biallelic_01(self, min_mac=None)[source]¶

Find variants biallelic for the reference (0) and first alternate (1) allele.

Parameters:	min_mac : int, optional Minimum minor allele count.
Returns:	out : ndarray, bool, shape (n_variants,) Boolean array where elements are True if variant matches the condition.

count_variant(self)[source]¶

count_non_variant(self)[source]¶

count_segregating(self)[source]¶

count_non_segregating(self, allele=None)[source]¶

count_singleton(self, allele=1)[source]¶

count_doubleton(self, allele=1)[source]¶

to_frequencies(self, fill=nan)[source]¶

Compute allele frequencies.

Parameters:	fill : float, optional Value to use when number of allele calls is 0.
Returns:	af : ndarray, float, shape (n_variants, n_alleles)

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac.to_frequencies()
array([[0.75, 0.25, 0.  ],
       [0.25, 0.5 , 0.25],
       [0.  , 0.  , 1.  ]])

map_alleles(self, mapping, max_allele=None)[source]¶

Transform alleles via a mapping.

Parameters:	mapping : ndarray, int8, shape (n_variants, max_allele) An array defining the allele mapping for each variant. max_allele : int, optional Highest allele index expected in the output. If not provided will be determined from maximum value in mapping.
Returns:	ac : AlleleCountsArray

See also

create_allele_mapping

Examples

>>> import allel
>>> g = allel.GenotypeArray([[[0, 0], [0, 0]],
...                          [[0, 0], [0, 1]],
...                          [[0, 2], [1, 1]],
...                          [[2, 2], [-1, -1]]])
>>> ac = g.count_alleles()
>>> ac
<AlleleCountsArray shape=(4, 3) dtype=int32>
4 0 0
3 1 0
1 2 1
0 0 2
>>> mapping = [[1, 0, 2],
...            [1, 0, 2],
...            [2, 1, 0],
...            [1, 2, 0]]
>>> ac.map_alleles(mapping)
<AlleleCountsArray shape=(4, 3) dtype=int32>
0 4 0
1 3 0
1 2 1
2 0 0

compress(self, condition, axis=0, out=None)[source]¶

take(self, indices, axis=0, out=None, mode='raise')[source]¶

concatenate(self, others, axis=0)[source]¶

GenotypeAlleleCountsArray¶

class allel.GenotypeAlleleCountsArray(data, copy=False, **kwargs)[source]¶

Array of genotype calls for a matrix of variants and samples, stored as allele counts per call.

Parameters:	data : array_like, int, shape (n_variants, n_samples, ploidy) Genotype data. copy : bool, optional If True, make a copy of data. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

See also

GenotypeAlleleCounts, GenotypeAlleleCountsVector, GenotypeArray

Notes

This class provides an alternative representation of genotype calls, allowing for variable copy number (effective ploidy) between chromosomes and/or genome regions. Genotype calls are represented as a 3-dimensional array of integers. By convention, the first dimension corresponds to the variants genotyped, the second dimension corresponds to the samples genotyped, and the third dimension corresponds to the alleles genotyped in index order. Each integer in the array records the count for the given allele at the given variant in the given sample.

Examples

Instantiate an array:

>>> import allel
>>> g = allel.GenotypeAlleleCountsArray([[[2, 0, 0], [0, 2, 0]],
...                                      [[1, 1, 0], [0, 1, 1]],
...                                      [[0, 0, 4], [0, 0, 0]]],
...                                     dtype='u1')
>>> g.dtype
dtype('uint8')
>>> g.ndim
3
>>> g.shape
(3, 2, 3)
>>> g.n_variants
3
>>> g.n_samples
2
>>> g.n_alleles
3
>>> g
<GenotypeAlleleCountsArray shape=(3, 2, 3) dtype=uint8>
2:0:0 0:2:0
1:1:0 0:1:1
0:0:4 0:0:0

Genotype calls for a single variant at all samples can be obtained by indexing the first dimension, e.g.:

>>> g[1]
<GenotypeAlleleCountsVector shape=(2, 3) dtype=uint8>
1:1:0 0:1:1

Genotype calls for a single sample at all variants can be obtained by indexing the second dimension, e.g.:

>>> g[:, 1]
<GenotypeAlleleCountsVector shape=(3, 3) dtype=uint8>
0:2:0 0:1:1 0:0:0

A genotype call for a single sample at a single variant can be obtained by indexing the first and second dimensions, e.g.:

>>> g[1, 0]
array([1, 1, 0], dtype=uint8)

Copy number (effective ploidy) may vary between calls:

>>> cn = g.sum(axis=2)
>>> cn
array([[2, 2],
       [2, 2],
       [4, 0]], dtype=uint64)

n_variants¶: Number of variants.

n_samples¶: Number of samples.

n_alleles¶: Number of alleles.

count_alleles(self, subpop=None)[source]¶

subset(self, sel0=None, sel1=None)[source]¶

GenotypeAlleleCountsVector¶

class allel.GenotypeAlleleCountsVector(data, copy=False, **kwargs)[source]¶

Array of genotype calls for a sequence of variants or samples, stored as allele counts per call.

Parameters:	data : array_like, int, shape (n_variants, n_samples, ploidy) Genotype data. copy : bool, optional If True, make a copy of data. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

See also

GenotypeAlleleCounts, GenotypeAlleleCountsArray, GenotypeVector

n_calls¶: Number of variants.

n_alleles¶: Number of alleles.

GenotypeAlleleCounts¶

Methods available on both GenotypeAlleleCountsArray and GenotypeAlleleCountsVector classes:

class allel.GenotypeAlleleCounts(data, copy=False, **kwargs)[source]¶

Base class for wrapping a NumPy array of genotype calls, stored as allele counts per call.

See also

GenotypeAlleleCountsArray, GenotypeAlleleCountsVector

is_called(self)[source]¶

is_missing(self)[source]¶

is_hom(self, allele=None)[source]¶

is_hom_ref(self)[source]¶

is_hom_alt(self)[source]¶

is_het(self, allele=None)[source]¶

compress(self, condition, axis=0, out=None)[source]¶

take(self, indices, axis=0, out=None, mode='raise')[source]¶

concatenate(self, others, axis=0)[source]¶

VariantTable¶

class allel.VariantTable(data, index=None, copy=False, **kwargs)[source]¶

Table (catalogue) of variants.

Parameters:

data : array_like, structured, shape (n_variants,): Variant records.
index : string or pair of strings, optional: Names of columns to use for positional index, e.g., ‘POS’ if table contains a ‘POS’ column and records from a single chromosome/contig, or (‘CHROM’, ‘POS’) if table contains records from multiple chromosomes/contigs.
**kwargs : keyword arguments, optional: Further keyword arguments are passed through to numpy.rec.array().

Examples

Instantiate a table from existing data:

>>> import allel
>>> records = [[b'chr1', 2, 35, 4.5, (1, 2)],
...            [b'chr1', 7, 12, 6.7, (3, 4)],
...            [b'chr2', 3, 78, 1.2, (5, 6)],
...            [b'chr2', 9, 22, 4.4, (7, 8)],
...            [b'chr3', 6, 99, 2.8, (9, 10)]]
>>> dtype = [('CHROM', 'S4'),
...          ('POS', 'u4'),
...          ('DP', int),
...          ('QD', float),
...          ('AC', (int, 2))]
>>> vt = allel.VariantTable(records, dtype=dtype,
...                         index=('CHROM', 'POS'))
>>> vt.names
('CHROM', 'POS', 'DP', 'QD', 'AC')
>>> vt.n_variants
5

Access a column:

>>> vt['DP']
array([35, 12, 78, 22, 99])

Access multiple columns:

>>> vt[['DP', 'QD']]
<VariantTable shape=(5,) dtype=(numpy.record, [('DP', '<i8'), ('QD', '<f8')])>
[(35, 4.5) (12, 6.7) (78, 1.2) (22, 4.4) (99, 2.8)]

Access a row:

>>> vt[2]
(b'chr2', 3, 78, 1.2, [5, 6])

Access multiple rows:

>>> vt[2:4]
<VariantTable shape=(2,) dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<u4'), ...
[(b'chr2', 3, 78, 1.2, [5, 6]) (b'chr2', 9, 22, 4.4, ...

Evaluate expressions against the table:

>>> vt.eval('DP > 30')
array([ True, False,  True, False,  True])
>>> vt.eval('(DP > 30) & (QD > 4)')
array([ True, False, False, False, False])
>>> vt.eval('DP * 2')
array([ 70,  24, 156,  44, 198])

Query the table:

>>> vt.query('DP > 30')
<VariantTable shape=(3,) dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<u4'), ...
[(b'chr1', 2, 35,  4.5, [ 1,  2]) (b'chr2', 3, 78,  1.2, ...
 (b'chr3', 6, 99,  2.8, [ 9, 10])]
>>> vt.query('(DP > 30) & (QD > 4)')
<VariantTable shape=(1,) dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<u4'), ...
[(b'chr1', 2, 35, 4.5, [1, 2])]

Use the index to query variants:

>>> vt.query_region(b'chr2', 1, 10)
<VariantTable shape=(2,) dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<u4'), ...
[(b'chr2', 3, 78,  1.2, [5, 6]) (b'chr2', 9, 22,  4.4, ...

n_variants¶: Number of variants (length of first dimension).

names¶

eval(self, expression, vm='python')¶

Evaluate an expression against the table columns.

Parameters:	expression : string Expression to evaluate. vm : {‘numexpr’, ‘python’} Virtual machine to use.
Returns:	result : ndarray

query(self, expression, vm='python')¶

Evaluate expression and then use it to extract rows from the table.

Parameters:	expression : string Expression to evaluate. vm : {‘numexpr’, ‘python’} Virtual machine to use.
Returns:	result : structured array

query_position(self, chrom=None, position=None)[source]¶

Query the table, returning row or rows matching the given genomic position.

Parameters:	chrom : string, optional Chromosome/contig. position : int, optional Position (1-based).
Returns:	result : row or VariantTable

query_region(self, chrom=None, start=None, stop=None)[source]¶

Query the table, returning row or rows within the given genomic region.

Parameters:	chrom : string, optional Chromosome/contig. start : int, optional Region start position (1-based). stop : int, optional Region stop position (1-based).
Returns:	result : VariantTable

to_vcf(self, path, rename=None, number=None, description=None, fill=None, write_header=True)[source]¶

Write to a variant call format (VCF) file.

Parameters:	path : string File path. rename : dict, optional Rename these columns in the VCF. number : dict, optional Override the number specified in INFO headers. description : dict, optional Descriptions for the INFO and FILTER headers. fill : dict, optional Fill values used for missing data in the table. write_header : bool, optional If True write VCF header.

Examples

Setup a variant table to write out:

>>> import allel
>>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3']
>>> pos = [2, 6, 3, 8, 1]
>>> ids = ['a', 'b', 'c', 'd', 'e']
>>> ref = [b'A', b'C', b'T', b'G', b'N']
>>> alt = [(b'T', b'.'),
...        (b'G', b'.'),
...        (b'A', b'C'),
...        (b'C', b'A'),
...        (b'X', b'.')]
>>> qual = [1.2, 2.3, 3.4, 4.5, 5.6]
>>> filter_qd = [True, True, True, False, False]
>>> filter_dp = [True, False, True, False, False]
>>> dp = [12, 23, 34, 45, 56]
>>> qd = [12.3, 23.4, 34.5, 45.6, 56.7]
>>> flg = [True, False, True, False, True]
>>> ac = [(1, -1), (3, -1), (5, 6), (7, 8), (9, -1)]
>>> xx = [(1.2, 2.3), (3.4, 4.5), (5.6, 6.7), (7.8, 8.9),
...       (9.0, 9.9)]
>>> columns = [chrom, pos, ids, ref, alt, qual, filter_dp,
...            filter_qd, dp, qd, flg, ac, xx]
>>> records = list(zip(*columns))
>>> dtype = [('CHROM', 'S4'),
...          ('POS', 'u4'),
...          ('ID', 'S1'),
...          ('REF', 'S1'),
...          ('ALT', ('S1', 2)),
...          ('qual', 'f4'),
...          ('filter_dp', bool),
...          ('filter_qd', bool),
...          ('dp', int),
...          ('qd', float),
...          ('flg', bool),
...          ('ac', (int, 2)),
...          ('xx', (float, 2))]
>>> vt = allel.VariantTable(records, dtype=dtype)

Now write out to VCF and inspect the result:

>>> rename = {'dp': 'DP', 'qd': 'QD', 'filter_qd': 'QD'}
>>> fill = {'ALT': b'.', 'ac': -1}
>>> number = {'ac': 'A'}
>>> description = {'ac': 'Allele counts', 'filter_dp': 'Low depth'}
>>> vt.to_vcf('example.vcf', rename=rename, fill=fill,
...           number=number, description=description)
>>> print(open('example.vcf').read())
##fileformat=VCFv4.1
##fileDate=...
##source=...
##INFO=<ID=DP,Number=1,Type=Integer,Description="">
##INFO=<ID=QD,Number=1,Type=Float,Description="">
##INFO=<ID=ac,Number=A,Type=Integer,Description="Allele counts">
##INFO=<ID=flg,Number=0,Type=Flag,Description="">
##INFO=<ID=xx,Number=2,Type=Float,Description="">
##FILTER=<ID=QD,Description="">
##FILTER=<ID=dp,Description="Low depth">
#CHROM      POS     ID      REF     ALT     QUAL    FILTER  INFO
chr1        2       a       A       T       1.2     QD;dp   DP=12;QD=12.3;ac=1;flg;xx=...
chr1        6       b       C       G       2.3     QD      DP=23;QD=23.4;ac=3;xx=3.4,4.5
chr2        3       c       T       A,C     3.4     QD;dp   DP=34;QD=34.5;ac=5,6;flg;x...
chr2        8       d       G       C,A     4.5     PASS    DP=45;QD=45.6;ac=7,8;xx=7...
chr3        1       e       N       X       5.6     PASS    DP=56;QD=56.7;ac=9;flg;xx=...

FeatureTable¶

class allel.FeatureTable(data, copy=False, **kwargs)[source]¶

Table of genomic features (e.g., genes, exons, etc.).

Parameters:	data : array_like, structured, shape (n_variants,) Variant records. copy : bool, optional If True, make a copy of data. **kwargs : keyword arguments, optional Further keyword arguments are passed through to `numpy.rec.array()`.

n_features¶: Number of features (length of first dimension).

names¶

eval(self, expression, vm='python')¶

Evaluate an expression against the table columns.

Parameters:	expression : string Expression to evaluate. vm : {‘numexpr’, ‘python’} Virtual machine to use.
Returns:	result : ndarray

query(self, expression, vm='python')¶

Evaluate expression and then use it to extract rows from the table.

Parameters:	expression : string Expression to evaluate. vm : {‘numexpr’, ‘python’} Virtual machine to use.
Returns:	result : structured array

static from_gff3(path, attributes=None, region=None, score_fill=-1, phase_fill=-1, attributes_fill='.', dtype=None)[source]¶

Read a feature table from a GFF3 format file.

Parameters:

path : string: File path.
attributes : list of strings, optional: List of columns to extract from the “attributes” field.
region : string, optional: Genome region to extract. If given, file must be position sorted, bgzipped and tabix indexed. Tabix must also be installed and on the system path.
score_fill : int, optional: Value to use where score field has a missing value.
phase_fill : int, optional: Value to use where phase field has a missing value.
attributes_fill : object or list of objects, optional: Value(s) to use where attribute field(s) have a missing value.
dtype : numpy dtype, optional: Manually specify a dtype.

Returns:

ft : FeatureTable

to_mask(self, size, start_name='start', stop_name='end')[source]¶

Construct a mask array where elements are True if the fall within features in the table.

Parameters:	size : int Size of chromosome/contig. start_name : string, optional Name of column with start coordinates. stop_name : string, optional Name of column with stop coordinates.
Returns:	mask : ndarray, bool

SortedIndex¶

class allel.SortedIndex(data, copy=False, **kwargs)[source]¶

Index of sorted values, e.g., positions from a single chromosome or contig.

Parameters:	data : array_like Values in ascending order. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

See also

SortedMultiIndex, UniqueIndex

Notes

Values must be given in ascending order, although duplicate values may be present (i.e., values must be monotonically increasing).

Examples

>>> import allel
>>> idx = allel.SortedIndex([2, 5, 8, 14, 15, 23, 42, 42, 61, 77, 103], dtype='i4')
>>> idx
<SortedIndex shape=(11,) dtype=int32>
[2, 5, 8, 14, 15, ..., 42, 42, 61, 77, 103]
>>> idx.dtype
dtype('int32')
>>> idx.ndim
1
>>> idx.shape
(11,)
>>> idx.is_unique
False

is_unique¶: True if no duplicate entries.

locate_key(self, key)[source]¶

Get index location for the requested key.

Parameters:	key : object Value to locate.
Returns:	loc : int or slice Location of key (will be slice if there are duplicate entries).

Examples

>>> import allel
>>> idx = allel.SortedIndex([3, 6, 6, 11])
>>> idx.locate_key(3)
0
>>> idx.locate_key(11)
3
>>> idx.locate_key(6)
slice(1, 3, None)
>>> try:
...     idx.locate_key(2)
... except KeyError as e:
...     print(e)
...
2

locate_keys(self, keys, strict=True)[source]¶

Get index locations for the requested keys.

Parameters:	keys : array_like Array of keys to locate. strict : bool, optional If True, raise KeyError if any keys are not found in the index.
Returns:	loc : ndarray, bool Boolean array with location of values.

Examples

>>> import allel
>>> idx1 = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx2 = allel.SortedIndex([4, 6, 20, 39])
>>> loc = idx1.locate_keys(idx2, strict=False)
>>> loc
array([False,  True, False,  True, False])
>>> idx1[loc]
<SortedIndex shape=(2,) dtype=int64>
[6, 20]

locate_intersection(self, other)[source]¶

Locate the intersection with another array.

Parameters:	other : array_like, int Array of values to intersect.
Returns:	loc : ndarray, bool Boolean array with location of intersection. loc_other : ndarray, bool Boolean array with location in other of intersection.

Examples

>>> import allel
>>> idx1 = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx2 = allel.SortedIndex([4, 6, 20, 39])
>>> loc1, loc2 = idx1.locate_intersection(idx2)
>>> loc1
array([False,  True, False,  True, False])
>>> loc2
array([False,  True,  True, False])
>>> idx1[loc1]
<SortedIndex shape=(2,) dtype=int64>
[6, 20]
>>> idx2[loc2]
<SortedIndex shape=(2,) dtype=int64>
[6, 20]

intersect(self, other)[source]¶

Intersect with other sorted index.

Parameters:	other : array_like, int Array of values to intersect with.
Returns:	out : SortedIndex Values in common.

Examples

>>> import allel
>>> idx1 = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx2 = allel.SortedIndex([4, 6, 20, 39])
>>> idx1.intersect(idx2)
<SortedIndex shape=(2,) dtype=int64>
[6, 20]

locate_range(self, start=None, stop=None)[source]¶

Locate slice of index containing all entries within start and stop values inclusive.

Parameters:	start : int, optional Start value. stop : int, optional Stop value.
Returns:	loc : slice Slice object.

Examples

>>> import allel
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> loc = idx.locate_range(4, 32)
>>> loc
slice(1, 4, None)
>>> idx[loc]
<SortedIndex shape=(3,) dtype=int64>
[6, 11, 20]

intersect_range(self, start=None, stop=None)[source]¶

Intersect with range defined by start and stop values inclusive.

Parameters:	start : int, optional Start value. stop : int, optional Stop value.
Returns:	idx : SortedIndex

Examples

>>> import allel
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> idx.intersect_range(4, 32)
<SortedIndex shape=(3,) dtype=int64>
[6, 11, 20]

locate_ranges(self, starts, stops, strict=True)[source]¶

Locate items within the given ranges.

Parameters:	starts : array_like, int Range start values. stops : array_like, int Range stop values. strict : bool, optional If True, raise KeyError if any ranges contain no entries.
Returns:	loc : ndarray, bool Boolean array with location of entries found.

Examples

>>> import allel
>>> import numpy as np
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35],
...                    [100, 120]])
>>> starts = ranges[:, 0]
>>> stops = ranges[:, 1]
>>> loc = idx.locate_ranges(starts, stops, strict=False)
>>> loc
array([False,  True,  True, False,  True])
>>> idx[loc]
<SortedIndex shape=(3,) dtype=int64>
[6, 11, 35]

locate_intersection_ranges(self, starts, stops)[source]¶

Locate the intersection with a set of ranges.

Parameters:	starts : array_like, int Range start values. stops : array_like, int Range stop values.
Returns:	loc : ndarray, bool Boolean array with location of entries found. loc_ranges : ndarray, bool Boolean array with location of ranges containing one or more entries.

Examples

>>> import allel
>>> import numpy as np
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35],
...                    [100, 120]])
>>> starts = ranges[:, 0]
>>> stops = ranges[:, 1]
>>> loc, loc_ranges = idx.locate_intersection_ranges(starts, stops)
>>> loc
array([False,  True,  True, False,  True])
>>> loc_ranges
array([False,  True, False,  True, False])
>>> idx[loc]
<SortedIndex shape=(3,) dtype=int64>
[6, 11, 35]
>>> ranges[loc_ranges]
array([[ 6, 17],
       [31, 35]])

intersect_ranges(self, starts, stops)[source]¶

Intersect with a set of ranges.

Parameters:	starts : array_like, int Range start values. stops : array_like, int Range stop values.
Returns:	idx : SortedIndex

Examples

>>> import allel
>>> import numpy as np
>>> idx = allel.SortedIndex([3, 6, 11, 20, 35])
>>> ranges = np.array([[0, 2], [6, 17], [12, 15], [31, 35],
...                    [100, 120]])
>>> starts = ranges[:, 0]
>>> stops = ranges[:, 1]
>>> idx.intersect_ranges(starts, stops)
<SortedIndex shape=(3,) dtype=int64>
[6, 11, 35]

ChromPosIndex¶

class allel.ChromPosIndex(chrom, pos, copy=False)[source]¶

Two-level index of variant positions from two or more chromosomes/contigs.

Parameters:	chrom : array_like Chromosome values. pos : array_like Position values, in ascending order within each chromosome. copy : bool, optional If True, inputs will be copied into new arrays.

See also

SortedIndex, UniqueIndex

Notes

Chromosomes do not need to be sorted, but all values for a given chromosome must be grouped together, and all positions for a given chromosome must be sorted in ascending order.

Examples

>>> import allel
>>> chrom = ['chr2', 'chr2', 'chr1', 'chr1', 'chr1', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.ChromPosIndex(chrom, pos)
>>> idx
<ChromPosIndex shape=(6,), dtype=<U4/int64>
chr2:1 chr2:4 chr1:2 chr1:5 chr1:5 chr3:3
>>> len(idx)
6

locate_key(self, chrom, pos=None)[source]¶

Get index location for the requested key.

Parameters:	chrom : object Chromosome or contig. pos : int, optional Position within chromosome or contig.
Returns:	loc : int or slice Location of requested key (will be slice if there are duplicate entries).

Examples

>>> import allel
>>> chrom = ['chr2', 'chr2', 'chr1', 'chr1', 'chr1', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.ChromPosIndex(chrom, pos)
>>> idx.locate_key('chr1')
slice(2, 5, None)
>>> idx.locate_key('chr2', 4)
1
>>> idx.locate_key('chr1', 5)
slice(3, 5, None)
>>> try:
...     idx.locate_key('chr3', 4)
... except KeyError as e:
...     print(e)
...
('chr3', 4)

locate_range(self, chrom, start=None, stop=None)[source]¶

Locate slice of index containing all entries within the range key:start-stop inclusive.

Parameters:	chrom : object Chromosome or contig. start : int, optional Position start value. stop : int, optional Position stop value.
Returns:	loc : slice Slice object.

Examples

>>> import allel
>>> chrom = ['chr2', 'chr2', 'chr1', 'chr1', 'chr1', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.ChromPosIndex(chrom, pos)
>>> idx.locate_range('chr1')
slice(2, 5, None)
>>> idx.locate_range('chr2', 1, 4)
slice(0, 2, None)
>>> idx.locate_range('chr1', 3, 7)
slice(3, 5, None)
>>> try:
...     idx.locate_range('chr3', 4, 9)
... except KeyError as e:
...     print(e)
('chr3', 4, 9)

SortedMultiIndex¶

class allel.SortedMultiIndex(l1, l2, copy=False)[source]¶

Two-level index of sorted values, e.g., variant positions from two or more chromosomes/contigs.

Parameters:	l1 : array_like First level values in ascending order. l2 : array_like Second level values, in ascending order within each sub-level. copy : bool, optional If True, inputs will be copied into new arrays.

See also

SortedIndex, UniqueIndex

Examples

>>> import allel
>>> chrom = ['chr1', 'chr1', 'chr2', 'chr2', 'chr2', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.SortedMultiIndex(chrom, pos)
>>> idx
<SortedMultiIndex shape=(6,), dtype=<U4/int64>
chr1:1 chr1:4 chr2:2 chr2:5 chr2:5 chr3:3
>>> len(idx)
6

locate_key(self, k1, k2=None)[source]¶

Get index location for the requested key.

Parameters:	k1 : object Level 1 key. k2 : object, optional Level 2 key.
Returns:	loc : int or slice Location of requested key (will be slice if there are duplicate entries).

Examples

>>> import allel
>>> chrom = ['chr1', 'chr1', 'chr2', 'chr2', 'chr2', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.SortedMultiIndex(chrom, pos)
>>> idx.locate_key('chr1')
slice(0, 2, None)
>>> idx.locate_key('chr1', 4)
1
>>> idx.locate_key('chr2', 5)
slice(3, 5, None)
>>> try:
...     idx.locate_key('chr3', 4)
... except KeyError as e:
...     print(e)
...
('chr3', 4)

locate_range(self, key, start=None, stop=None)[source]¶

Locate slice of index containing all entries within the range key:start-stop inclusive.

Parameters:	key : object Level 1 key value. start : object, optional Level 2 start value. stop : object, optional Level 2 stop value.
Returns:	loc : slice Slice object.

Examples

>>> import allel
>>> chrom = ['chr1', 'chr1', 'chr2', 'chr2', 'chr2', 'chr3']
>>> pos = [1, 4, 2, 5, 5, 3]
>>> idx = allel.SortedMultiIndex(chrom, pos)
>>> idx.locate_range('chr1')
slice(0, 2, None)
>>> idx.locate_range('chr1', 1, 4)
slice(0, 2, None)
>>> idx.locate_range('chr2', 3, 7)
slice(3, 5, None)
>>> try:
...     idx.locate_range('chr3', 4, 9)
... except KeyError as e:
...     print(e)
('chr3', 4, 9)

UniqueIndex¶

class allel.UniqueIndex(data, copy=False, dtype=<type 'object'>, **kwargs)[source]¶

Array of unique values (e.g., variant or sample identifiers).

Parameters:	data : array_like Values. **kwargs : keyword arguments All keyword arguments are passed through to `numpy.array()`.

See also

SortedIndex, SortedMultiIndex

Notes

This class represents an arbitrary set of unique values, e.g., sample or variant identifiers.

There is no need for values to be sorted. However, all values must be unique within the array, and must be hashable objects.

Examples

>>> import allel
>>> idx = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx
<UniqueIndex shape=(4,) dtype=object>
['A', 'C', 'B', 'F']
>>> idx.dtype
dtype('O')
>>> idx.ndim
1
>>> idx.shape
(4,)

locate_key(self, key)[source]¶

Get index location for the requested key.

Parameters:	key : object Key to locate.
Returns:	loc : int Location of key.

Examples

>>> import allel
>>> idx = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx.locate_key('A')
0
>>> idx.locate_key('B')
2
>>> try:
...     idx.locate_key('X')
... except KeyError as e:
...     print(e)
...
'X'

locate_keys(self, keys, strict=True)[source]¶

Get index locations for the requested keys.

Parameters:	keys : array_like Array of keys to locate. strict : bool, optional If True, raise KeyError if any keys are not found in the index.
Returns:	loc : ndarray, bool Boolean array with location of keys.

Examples

>>> import allel
>>> idx = allel.UniqueIndex(['A', 'C', 'B', 'F'])
>>> idx.locate_keys(['F', 'C'])
array([False,  True, False,  True])
>>> idx.locate_keys(['X', 'F', 'G', 'C', 'Z'], strict=False)
array([False,  True, False,  True])

locate_intersection(self, other)[source]¶

Locate the intersection with another array.

Parameters:	other : array_like Array to intersect.
Returns:	loc : ndarray, bool Boolean array with location of intersection. loc_other : ndarray, bool Boolean array with location in other of intersection.

Examples

>>> import allel
>>> idx1 = allel.UniqueIndex(['A', 'C', 'B', 'F'], dtype=object)
>>> idx2 = allel.UniqueIndex(['X', 'F', 'G', 'C', 'Z'], dtype=object)
>>> loc1, loc2 = idx1.locate_intersection(idx2)
>>> loc1
array([False,  True, False,  True])
>>> loc2
array([False,  True, False,  True, False])
>>> idx1[loc1]
<UniqueIndex shape=(2,) dtype=object>
['C', 'F']
>>> idx2[loc2]
<UniqueIndex shape=(2,) dtype=object>
['F', 'C']

intersect(self, other)[source]¶

Intersect with other.

Parameters:	other : array_like Array to intersect.
Returns:	out : UniqueIndex

Examples

>>> import allel
>>> idx1 = allel.UniqueIndex(['A', 'C', 'B', 'F'], dtype=object)
>>> idx2 = allel.UniqueIndex(['X', 'F', 'G', 'C', 'Z'], dtype=object)
>>> idx1.intersect(idx2)
<UniqueIndex shape=(2,) dtype=object>
['C', 'F']
>>> idx2.intersect(idx1)
<UniqueIndex shape=(2,) dtype=object>
['F', 'C']