Statistics¶
This module defines statistical functions for use with variant call data.
Windowed statistics¶
- allel.stats.windowed_statistic(pos, values, window, statistic, start=None, stop=None)[source]¶
Compute a statistic in non-overlapping windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_variants,)
Positions array (1-based).
values : array_like, shape (n_variants,)
Values to be summarised within each window.
window : int
Window size.
statistic : string or function
Statistic to compute.
start : int, optional
Start position of first window (1-based).
stop : int, optional
Stop position of last window (1-based).
Returns: s : ndarray, shape (n_windows,)
Computed statistic.
edges : ndarray, shape (n_windows + 1,)
Window edge positions.
- allel.stats.windowed_nnz(pos, b, window, start=None, stop=None)[source]¶
Count nonzero (i.e., True) elements in non-overlapping windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_variants,)
Positions array.
b : array_like, bool, shape (n_variants,) or (n_variants, n_samples)
Boolean array.
window : int
Window size.
start : int, optional
Start position of first window (1-based).
stop : int, optional
Stop position of last window (1-based).
Returns: counts : ndarray, int, shape (n_windows,) or (n_windows, n_samples)
Counts array.
edges : ndarray, shape (n_windows + 1,)
Window edge positions.
Examples
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 1], [0, 1]], ... [[1, 1], [1, 2]], ... [[2, 2], [-1, -1]]], dtype='i1') >>> pos = allel.PositionIndex([2, 14, 15, 27]) >>> b = g.is_variant() >>> counts, edges = allel.windowed_nnz(pos, b, window=10) >>> edges array([ 2, 12, 22, 32]) >>> counts array([0, 2, 1]) >>> counts, edges = allel.windowed_nnz(pos, b, window=10, start=1, ... stop=27) >>> edges array([ 1, 11, 21, 27]) >>> counts array([0, 2, 1]) >>> b = g.is_het() >>> counts, edges = allel.windowed_nnz(pos, b, window=10) >>> edges array([ 2, 12, 22, 32]) >>> counts array([[0, 0], [1, 2], [0, 0]])
- allel.stats.windowed_mean_per_base(pos, values, window, start=None, stop=None, is_accessible=None, fill=nan)[source]¶
Calculate the mean per genome position of the given values in non-overlapping windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_variants,)
Positions array.
values : array_like, shape (n_variants,)
Values to be summarised within each window.
window : int
Window size.
start : int, optional
Start position of first window (1-based).
stop : int, optional
Stop position of last window (1-based).
is_accessible : array_like, bool, shape (len(contig),), optional
Accessibility mask. If provided, the size of each window will be calculated as the number of accessible positions, rather than the absolute window size.
fill : optional
Fill value to use if window size is zero.
Returns: m : ndarray, shape (n_windows,)
Mean per base of values in each window.
edges : ndarray, shape (n_windows + 1,)
Window edge positions.
widths : array_like, int, shape (n_windows,)
Number of accessible positions in each window.
- allel.stats.windowed_nnz_per_base(pos, b, window, start=None, stop=None, is_accessible=None, fill=nan)[source]¶
Calculate the per-base-pair density of nonzero (i.e., True) elements in non-overlapping windows over a single chromosome/contig.
Parameters: pos : array_like, int, shape (n_variants,)
Positions array.
b : array_like, bool, shape (n_variants,) or (n_variants, n_samples)
Boolean array.
window : int
Window size.
start : int, optional
Start position of first window (1-based).
stop : int, optional
Stop position of last window (1-based).
is_accessible : array_like, bool, shape (len(contig),), optional
Accessibility mask. If provided, the size of each window will be calculated as the number of accessible positions, rather than the absolute window size.
fill : optional
Fill value to use if window size is zero.
Returns: densities : ndarray, shape (n_windows,) or (n_windows, n_samples)
Per base density of True values in each window.
edges : ndarray, shape (n_windows + 1,)
Window edge positions.
counts : ndarray, int, shape (n_windows,) or (n_windows, n_samples)
Counts array.
widths : array_like, int, shape (n_windows,)
Number of accessible positions in each window.
Examples
Assuming all positions are accessible:
>>> import allel >>> g = allel.GenotypeArray([[[0, 0], [0, 0]], ... [[0, 1], [0, 1]], ... [[1, 1], [1, 2]], ... [[2, 2], [-1, -1]]], dtype='i1') >>> pos = allel.PositionIndex([2, 14, 15, 27]) >>> b = g.is_variant() >>> densities, edges, counts, widths = allel.windowed_nnz_per_base( ... pos, b, window=10 ... ) >>> edges array([ 2, 12, 22, 32]) >>> widths array([10, 10, 11]) >>> counts array([0, 2, 1]) >>> densities array([ 0. , 0.2 , 0.09090909])
Density calculations can take into account the number of accessible positions within each window, e.g.:
>>> is_accessible = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ... 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, ... 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], ... dtype=bool) >>> densities, edges, counts, widths = allel.windowed_nnz_per_base( ... pos, b, start=1, stop=31, window=10, ... is_accessible=is_accessible, fill=np.nan ... ) >>> edges array([ 1, 11, 21, 31]) >>> widths array([ 0, 5, 11]) >>> counts array([0, 2, 1]) >>> densities array([ nan, 0.4 , 0.09090909])
- allel.stats.windowed_nucleotide_diversity(g, pos, window, start=None, stop=None, is_accessible=None, fill=nan)[source]¶
Calculate nucleotide diversity in non-overlapping windows over a single chromosome/contig.
Parameters: g : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype array.
pos : array_like, int, shape (n_variants,)
Positions array.
window : int
Window size.
start : int, optional
Start position of first window (1-based).
stop : int, optional
Stop position of last window (1-based).
is_accessible : array_like, bool, shape (len(contig),), optional
Accessibility mask. If provided, the size of each window will be calculated as the number of accessible positions, rather than the absolute window size.
fill : optional
Fill value to use if window size is zero.
Returns: pi : ndarray, shape (n_windows,)
Nucleotide diversity in each window.
edges : ndarray, shape (n_windows + 1,)
Window edge positions.
widths : array_like, int, shape (n_windows,)
Number of accessible positions in each window.