Chunked arrays¶
This module provides alternative implementations of array and table
classes defined in the allel.model.ndarray
module, using
chunked arrays for data storage. Chunked arrays can be compressed and
optionally stored on disk, providing a means for working with data too
large to fit uncompressed in main memory.
Either HDF5 (via h5py) or bcolz can be used as the underlying storage
layer. Choice of storage layer can be made via the storage keyword
argument which all class methods accept. This argument can either be
a string identifying one of the predefined storage layer
configurations, or an object implementing the chunked storage API. For more
information about controlling storage see the allel.chunked
module.
GenotypeChunkedArray¶
-
class
allel.model.chunked.
GenotypeChunkedArray
(data)[source]¶ Alternative implementation of the
allel.model.ndarray.GenotypeArray
class, wrapping a chunked array as the backing store.Parameters: data : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.
Examples
Wrap an HDF5 dataset:
>>> import h5py >>> with h5py.File('callset.h5', mode='w') as h5f: ... h5g = h5f.create_group('/3L/calldata') ... h5g.create_dataset('genotype', ... data=[[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]], ... dtype='i1', ... chunks=(2, 2, 2)) ... <HDF5 dataset "genotype": shape (3, 2, 2), type "|i1"> >>> import allel >>> callset = h5py.File('callset.h5', mode='r') >>> g = allel.GenotypeChunkedArray(callset['/3L/calldata/genotype']) >>> g GenotypeChunkedArray((3, 2, 2), int8, nbytes=12, cbytes=16, cratio=0.8, shuffle=False, chunks=(2, 2, 2), data=h5py._hl.dataset.Dataset) >>> g.data <HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">
Obtain a numpy array by slicing, e.g.:
>>> g[:] GenotypeArray((3, 2, 2), dtype=int8) [[[ 0 0] [ 0 1]] [[ 0 1] [ 1 1]] [[ 0 2] [-1 -1]]]
Note that most methods will return a chunked array, using whatever chunked storage is set as default (bcolz carray) or specified directly via the storage keyword argument. E.g.:
>>> g.copy() GenotypeChunkedArray((3, 2, 2), int8, nbytes=12, cbytes=16.0K, cratio=0.0, cname=blosclz, clevel=5, shuffle=True, chunks=(4096, 2, 2), data=bcolz.carray_ext.carray) >>> g.copy(storage='hdf5mem_zlib1') GenotypeChunkedArray((3, 2, 2), int8, nbytes=12, cbytes=4.5K, cratio=0.0, cname=gzip, clevel=1, shuffle=False, chunks=(262144, 2, 2), data=h5py._hl.dataset.Dataset)
HaplotypeChunkedArray¶
-
class
allel.model.chunked.
HaplotypeChunkedArray
(data)[source]¶ Alternative implementation of the
allel.model.ndarray.HaplotypeArray
class, using a chunked array as the backing store.Parameters: data : array_like, int, shape (n_variants, n_haplotypes)
Haplotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.
AlleleCountsChunkedArray¶
-
class
allel.model.chunked.
AlleleCountsChunkedArray
(data)[source]¶ Alternative implementation of the
allel.model.ndarray.AlleleCountsArray
class, using a chunked array as the backing store.Parameters: data : array_like, int, shape (n_variants, n_alleles)
Allele counts data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.
VariantChunkedTable¶
-
class
allel.model.chunked.
VariantChunkedTable
(data, names=None, index=None)[source]¶ Alternative implementation of the
allel.model.ndarray.VariantTable
class, using a chunked table as the backing store.Parameters: data: table_like
Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.
names : sequence of strings
Column names.
Examples
Wrap columns stored as datasets within an HDF5 group:
>>> import h5py >>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3'] >>> pos = [2, 7, 3, 9, 6] >>> dp = [35, 12, 78, 22, 99] >>> qd = [4.5, 6.7, 1.2, 4.4, 2.8] >>> ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)] >>> with h5py.File('callset.h5', mode='w') as h5f: ... h5g = h5f.create_group('/3L/variants') ... h5g.create_dataset('CHROM', data=chrom, chunks=True) ... h5g.create_dataset('POS', data=pos, chunks=True) ... h5g.create_dataset('DP', data=dp, chunks=True) ... h5g.create_dataset('QD', data=qd, chunks=True) ... h5g.create_dataset('AC', data=ac, chunks=True) ... <HDF5 dataset "CHROM": shape (5,), type "|S4"> <HDF5 dataset "POS": shape (5,), type "<i8"> <HDF5 dataset "DP": shape (5,), type "<i8"> <HDF5 dataset "QD": shape (5,), type "<f8"> <HDF5 dataset "AC": shape (5, 2), type "<i8"> >>> import allel >>> callset = h5py.File('callset.h5', mode='r') >>> vt = allel.VariantChunkedTable(callset['/3L/variants'], ... names=['CHROM', 'POS', 'AC', 'QD', 'DP']) >>> vt VariantChunkedTable(5, nbytes=220, cbytes=220, cratio=1.0, data=h5py._hl.group.Group)
Obtain a single row:
>>> vt[0] row(CHROM=b'chr1', POS=2, AC=array([1, 2]), QD=4.5, DP=35)
Obtain a numpy array by slicing:
>>> vt[:] VariantTable((5,), dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ... [(b'chr1', 2, [1, 2], 4.5, 35) (b'chr1', 7, [3, 4], 6.7, 12) (b'chr2', 3, [5, 6], 1.2, 78) (b'chr2', 9, [7, 8], 4.4, 22) (b'chr3', 6, [9, 10], 2.8, 99)]
Access a subset of columns:
>>> vt[['CHROM', 'POS']] VariantChunkedTable(5, nbytes=60, cbytes=60, cratio=1.0, data=builtins.list)
Note that most methods will return a chunked table, using whatever chunked storage is set as default (bcolz ctable) or specified directly via the storage keyword argument. E.g.:
>>> vt.copy() VariantChunkedTable(5, nbytes=220, cbytes=80.0K, cratio=0.0, data=bcolz.ctable.ctable) >>> vt.copy(storage='hdf5mem_zlib1') VariantChunkedTable(5, nbytes=220, cbytes=22.5K, cratio=0.0, data=h5py._hl.files.File)
FeatureChunkedTable¶
-
class
allel.model.chunked.
FeatureChunkedTable
(data, names=None)[source]¶ Alternative implementation of the
allel.model.ndarray.FeatureTable
class, using a chunked table as the backing store.Parameters: data: table_like
Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.
names : sequence of strings
Column names.
AlleleCountsChunkedTable¶
-
class
allel.model.chunked.
FeatureChunkedTable
(data, names=None)[source] Alternative implementation of the
allel.model.ndarray.FeatureTable
class, using a chunked table as the backing store.Parameters: data: table_like
Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.
names : sequence of strings
Column names.