Chunked arrays¶
This module is maintained for backwards-compatibility, however it is recommended
to migrate to use the equivalent classes from the allel.model.dask
module
where possible.
This module provides alternative implementations of array and table
classes defined in the allel.model.ndarray
module, using
chunked arrays for data storage. Chunked arrays can be compressed and
optionally stored on disk, providing a means for working with data too
large to fit uncompressed in main memory.
Either Zarr, HDF5 (via h5py)
or bcolz can be used as the underlying storage
layer. Choice of storage layer can be made via the storage keyword
argument which all class methods accept. This argument can either be
a string identifying one of the predefined storage layer
configurations, or an object implementing the chunked storage API. For more
information about controlling storage see the allel.chunked
module.
GenotypeChunkedArray¶
-
class
allel.
GenotypeChunkedArray
(data)[source]¶ Alternative implementation of the
allel.model.ndarray.GenotypeArray
class, wrapping a chunked array as the backing store.Parameters: - data : array_like, int, shape (n_variants, n_samples, ploidy)
Genotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.
Examples
Wrap an HDF5 dataset:
>>> import h5py >>> with h5py.File('callset.h5', mode='w') as h5f: ... h5g = h5f.create_group('/3L/calldata') ... h5g.create_dataset('genotype', ... data=[[[0, 0], [0, 1]], ... [[0, 1], [1, 1]], ... [[0, 2], [-1, -1]]], ... dtype='i1', chunks=(2, 2, 2), ... compression='gzip', compression_opts=1) ... <HDF5 dataset "genotype": shape (3, 2, 2), type "|i1"> >>> import allel >>> callset = h5py.File('callset.h5', mode='r') >>> g = allel.GenotypeChunkedArray(callset['/3L/calldata/genotype']) >>> g <GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(2, 2, 2) nbytes=12 cbytes=30 cratio=0.4 compression=gzip compression_opts=1 values=h5py._hl.dataset.Dataset> >>> g.values <HDF5 dataset "genotype": shape (3, 2, 2), type "|i1">
Obtain a numpy array by slicing, e.g.:
>>> g[:] <GenotypeArray shape=(3, 2, 2) dtype=int8> 0/0 0/1 0/1 1/1 0/2 ./.
Note that most methods will return a chunked array, using whatever chunked storage is set as default (bcolz carray) or specified directly via the storage keyword argument. E.g.:
>>> g.copy() <GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(3, 2, 2) ... >>> g.copy(storage='bcolzmem') <GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(4096, 2, 2) ... >>> g.copy(storage='hdf5mem_zlib1') <GenotypeChunkedArray shape=(3, 2, 2) dtype=int8 chunks=(3, 2, 2) ...
HaplotypeChunkedArray¶
-
class
allel.
HaplotypeChunkedArray
(data)[source]¶ Alternative implementation of the
allel.model.ndarray.HaplotypeArray
class, using a chunked array as the backing store.Parameters: - data : array_like, int, shape (n_variants, n_haplotypes)
Haplotype data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.
AlleleCountsChunkedArray¶
-
class
allel.
AlleleCountsChunkedArray
(data)[source]¶ Alternative implementation of the
allel.model.ndarray.AlleleCountsArray
class, using a chunked array as the backing store.Parameters: - data : array_like, int, shape (n_variants, n_alleles)
Allele counts data to be wrapped. May be a bcolz carray, h5py dataset, or anything providing a similar interface.
VariantChunkedTable¶
-
class
allel.
VariantChunkedTable
(data, names=None, index=None)[source]¶ Alternative implementation of the
allel.model.ndarray.VariantTable
class, using a chunked table as the backing store.Parameters: - data: table_like
Data to be wrapped. May be a tuple or list of columns (array-like), a dict mapping names to columns, a bcolz ctable, h5py group, numpy recarray, or anything providing a similar interface.
- names : sequence of strings
Column names.
Examples
Wrap columns stored as datasets within an HDF5 group:
>>> import h5py >>> chrom = [b'chr1', b'chr1', b'chr2', b'chr2', b'chr3'] >>> pos = [2, 7, 3, 9, 6] >>> dp = [35, 12, 78, 22, 99] >>> qd = [4.5, 6.7, 1.2, 4.4, 2.8] >>> ac = [(1, 2), (3, 4), (5, 6), (7, 8), (9, 10)] >>> with h5py.File('callset.h5', mode='w') as h5f: ... h5g = h5f.create_group('/3L/variants') ... h5g.create_dataset('CHROM', data=chrom, chunks=True) ... h5g.create_dataset('POS', data=pos, chunks=True) ... h5g.create_dataset('DP', data=dp, chunks=True) ... h5g.create_dataset('QD', data=qd, chunks=True) ... h5g.create_dataset('AC', data=ac, chunks=True) ... <HDF5 dataset "CHROM": shape (5,), type "|S4"> <HDF5 dataset "POS": shape (5,), type "<i8"> <HDF5 dataset "DP": shape (5,), type "<i8"> <HDF5 dataset "QD": shape (5,), type "<f8"> <HDF5 dataset "AC": shape (5, 2), type "<i8"> >>> import allel >>> callset = h5py.File('callset.h5', mode='r') >>> vt = allel.VariantChunkedTable(callset['/3L/variants'], ... names=['CHROM', 'POS', 'AC', 'QD', 'DP']) >>> vt <VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ... ...
Obtain a single row:
>>> vt[0] row(CHROM=b'chr1', POS=2, AC=array([1, 2]), QD=4.5, DP=35)
Obtain a numpy array by slicing:
>>> vt[:] <VariantTable shape=(5,) dtype=(numpy.record, [('CHROM', 'S4'), ('POS', '<i8'), ... [(b'chr1', 2, [ 1, 2], 4.5, 35) (b'chr1', 7, [ 3, 4], 6.7, 12) (b'chr2', 3, [ 5, 6], 1.2, 78) (b'chr2', 9, [ 7, 8], 4.4, 22) (b'chr3', 6, [ 9, 10], 2.8, 99)]
Access a subset of columns:
>>> vt[['CHROM', 'POS']] <VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8')] nbytes=60 cbytes=60 cratio=1.0 values=builtins.list>
Note that most methods will return a chunked table, using whatever chunked storage is set as default (bcolz ctable) or specified directly via the storage keyword argument. E.g.:
>>> vt.copy() <VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ... nbytes=220 cbytes=... cratio=... values=allel.chunked.storage_zarr.ZarrTable> >>> vt.copy(storage='bcolzmem') <VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ... nbytes=220 cbytes=... cratio=... values=bcolz.ctable.ctable> >>> vt.copy(storage='hdf5mem_zlib1') <VariantChunkedTable shape=(5,) dtype=[('CHROM', 'S4'), ('POS', '<i8'), ('AC', ... nbytes=220 cbytes=... cratio=... values=h5py._hl.files.File>