| Name | Last modified | Size | Description |
|---|---|---|---|
| Parent Directory | | - | |
| README.html | 2020-02-14 11:14 | 297K | |
| README.ipynb | 2020-02-14 11:20 | 30K | |
| bigwig/ | 2020-02-18 14:33 | - | |
| dataset/ | 2020-02-18 14:33 | - | |
| metadata/ | 2020-02-18 14:33 | - | |
| study/ | 2020-02-18 14:34 | - | |
(See below for more details)
import pathlib
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
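# Note: pd.read_msgpack was removed in pandas 1.0, so the read_msgpack calls here need pandas < 1.0
mc_df = pd.read_msgpack('study/mC/MOp_clustering/ALL/CellMetadata.AfterQC.msg')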
mc_df['Region'].value_counts()
mc_df.shape[0]
atac_df = pd.read_msgpack('study/ATAC/MOp.snATAC-seq.AnalysisResult.msg')
atac_df['region'].value_counts()
atac_df.shape[0]
The methylation count matrix is saved in the MCDS file format, a structured HDF5 format defined by the xarray package (using netCDF). The key reason to use xarray is its ability to handle N-D labeled arrays, so we can add new dimensions for count type (mc count, cov count) or methylation type (mCH, mCG) instead of saving a bunch of separate files.
import xarray as xr
mcds = xr.open_dataset('./dataset/mC/3C-171206.mcds')
mcds
# get a 2D matrix and save to other formats
count_table = mcds['gene_da'].sel(mc_type='CHN', count_type='cov').squeeze().to_pandas()
count_table.head()
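The label-based selection used above can be tried without the data files. The following is a minimal sketch on toy data, assuming a `gene_da` array with the dimensions `cell`, `gene`, `mc_type` and `count_type`; the cell names, gene names and counts are made up for illustration and are not taken from the real MCDS files.

```python
import numpy as np
import xarray as xr

# Toy data only: 3 cells x 2 genes x 2 methylation types.
# Coverage values are even numbers; methylated counts are half of the coverage.
cov = np.arange(1, 13).reshape(3, 2, 2) * 2
mc = cov // 2

# Stack mc and cov along a new, labeled "count_type" dimension.
gene_da = xr.DataArray(
    np.stack([mc, cov], axis=-1),
    dims=["cell", "gene", "mc_type", "count_type"],
    coords={
        "cell": ["cell_1", "cell_2", "cell_3"],      # hypothetical cell IDs
        "gene": ["gene_a", "gene_b"],                # hypothetical gene IDs
        "mc_type": ["CHN", "CGN"],
        "count_type": ["mc", "cov"],
    },
    name="gene_da",
)

# Select a 2-D cell-by-gene table by label, as in the cell above.
chn_cov = gene_da.sel(mc_type="CHN", count_type="cov").to_pandas()

# Because both count types live in one array, the methylation fraction
# is a single labeled division over the remaining dimensions.
frac = gene_da.sel(count_type="mc") / gene_da.sel(count_type="cov")
```

Here `chn_cov` is a pandas DataFrame indexed by cell with one column per gene, and `frac` keeps the `cell`, `gene` and `mc_type` dimensions.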
# for arbitrary cell combinations and parallel computation
# see here for xarray parallel computation using the dask package http://xarray.pydata.org/en/stable/dask.html
select_cells = mc_df.sample(1000).index
mcds_select = xr.open_mfdataset('./dataset/mC/*.mcds',
                                combine='nested',
                                concat_dim='cell')\
    .sel(cell=select_cells).chunk(dict(cell=1000))
mcds_select
# export every MCDS file as a set of 2-D csv matrices
mcds_list = list(pathlib.Path('./dataset/mC/').glob('*mcds'))
for mcds_path in mcds_list:
    mcds_name = mcds_path.name.split('.')[0]
    # open the file for this iteration, so each csv holds its own data
    with xr.open_dataset(mcds_path) as ds:
        for feature_type in ['gene', 'chrom100k']:
            for mc_type in ['CHN', 'CGN']:
                for count_type in ['cov', 'mc']:
                    path = f'./dataset/mC/2-D_csv_matrix/{mcds_name}_{feature_type}_{mc_type}_{count_type}.csv.gz'
                    ds[f'{feature_type}_da'].sel(mc_type=mc_type, count_type=count_type)\
                        .squeeze()\
                        .to_pandas()\
                        .to_csv(path)
                    print(path, 'saved')
The ATAC count matrix is saved in the SNAP file format, a structured HDF5 format defined by Snaptools, which was developed by Rongxin Fang from the Ren Lab. See the Snaptools and SnapATAC GitHub pages for more details.
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap
# 5 kb fragment count for each cell, stored in COO sparse matrix format.
# Note that the index uses the R convention, starting from 1
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/GM
# 5 kb fragment count for each cell, stored in COO sparse matrix format.
# Note that the index uses the R convention, starting from 1
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/FM
# gene count for each cell, stored in COO sparse matrix format.
# Note that the index uses the R convention, starting from 1
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/GM
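Because the COO indices start from 1, they need to be shifted by one before use in Python. The following is a minimal sketch of that conversion in plain Python, assuming the matrix is stored as three parallel arrays of row index, column index and count. The function name `coo_to_dense` and the toy values are hypothetical; check the `h5ls` output above for the dataset names actually present in the file.

```python
def coo_to_dense(row_idx, col_idx, counts, n_rows, n_cols):
    """Convert 1-based COO triplets into a dense list-of-lists matrix."""
    dense = [[0] * n_cols for _ in range(n_rows)]
    for i, j, c in zip(row_idx, col_idx, counts):
        # Shift the R-style 1-based indices to Python's 0-based indices.
        dense[i - 1][j - 1] += c
    return dense


# Toy example: 3 cells (rows) x 2 features (columns), made-up counts.
dense = coo_to_dense(
    row_idx=[1, 1, 3],
    col_idx=[1, 2, 2],
    counts=[5, 1, 2],
    n_rows=3,
    n_cols=2,
)
```

For real data a sparse matrix library is a better fit than a dense list; the point here is only the index shift.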
Analysis code for single-modality clustering and further analysis, provided for reproducibility
%%javascript
IPython.notebook.save_notebook()
!jupyter nbconvert --to HTML README.ipynb