MOp Epigenomic Data¶

File location summary¶

(See below for more details)

Dropbox URL¶

https://www.dropbox.com/home/BICCN%20MiniBrain%20JointDataAnalysis/EckerRen_Mouse_MOp_methylation%2BATAC

mC¶

Dataset (raw count matrices for genes and chromosome 100 kb bins):
- ./dataset/mC/*mcds # stored in the HDF5-based MCDS format, see below for details
- ./dataset/mC/2-D_csv_matrix # cell-by-feature matrices extracted from *.mcds; same information, saved as 2-D CSV files
Metadata:
- ./metadata/mc/MOp_Metadata.tsv.gz
Analysis Result:
- methylation cluster assignment is finalized
- ./study/mC/MOp_clustering/MOp.snmC-seq.AnalysisResult.csv.gz

ATAC¶

Dataset (raw count matrices for genes and chromosome 100 kb bins):
- ./dataset/ATAC/*snap # stored in the HDF5-based SNAP format, see below for details
Metadata:
- ./metadata/atac/CEMBA_MOp.barcode.txt
Analysis Result:
- ATAC cluster assignment is finalized
- ./study/ATAC/MOp.snATAC-seq.AnalysisResult.csv.gz
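
The analysis result tables listed above are gzip-compressed CSV files, so they can be read directly with pandas. The snippet below is a minimal sketch, not the authors' loading code: `load_analysis_result` is a hypothetical helper, it assumes the first column holds the cell ID, and the toy file with its `cluster` column only stands in for the real tables, whose columns are not described here.

```python
import gzip
import tempfile
from pathlib import Path

import pandas as pd


def load_analysis_result(path):
    """Read a gzipped CSV table (hypothetical helper).

    Assumes the first column holds the cell ID; pandas infers the
    gzip compression from the ".gz" suffix.
    """
    return pd.read_csv(path, index_col=0)


# Toy stand-in for a real *.AnalysisResult.csv.gz file; the column
# names are invented for illustration only.
with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = Path(tmp_dir) / "toy.AnalysisResult.csv.gz"
    with gzip.open(toy_path, "wt") as handle:
        handle.write("cell,cluster\ncell_a,c1\ncell_b,c2\n")
    toy_df = load_analysis_result(toy_path)

print(toy_df)
```

For the real data, the same call would be pointed at one of the `./study/...AnalysisResult.csv.gz` paths above, after checking which column actually carries the cell ID.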

Data Included¶

snmC-seq2 Data¶

Brain regions (cells per region): 2C (2214), 3C (2473), 4B (2881), 5D (2373)
Total cells passing QC: 9941

import pathlib
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

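# pd.read_msgpack was deprecated in pandas 0.25 and removed in 1.0, so this cell needs pandas < 1.0
mc_df = pd.read_msgpack('study/mC/MOp_clustering/ALL/CellMetadata.AfterQC.msg')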

mc_df['Region'].value_counts()

4B    2881
3C    2473
5D    2373
2C    2214
Name: Region, dtype: int64

mc_df.shape[0]

9941

snATAC-seq Data¶

Brain regions (cells per region): 2C (19965), 3C (17662), 4B (23266), 5D (20303)
Total cells passing QC: 81196

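# pd.read_msgpack was deprecated in pandas 0.25 and removed in 1.0, so this cell needs pandas < 1.0
atac_df = pd.read_msgpack('study/ATAC/MOp.snATAC-seq.AnalysisResult.msg')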

atac_df['region'].value_counts()

4B    23266
5D    20303
2C    19965
3C    17662
Name: region, dtype: int64

atac_df.shape[0]

81196
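
As a quick consistency check, the per-region counts reported for both modalities can be summed and compared with the stated totals. This is only an arithmetic sketch using the numbers printed above; the dictionary names are illustrative.

```python
# Cells passing QC per brain region, copied from the outputs above.
mc_cells = {"2C": 2214, "3C": 2473, "4B": 2881, "5D": 2373}
atac_cells = {"2C": 19965, "3C": 17662, "4B": 23266, "5D": 20303}

mc_total = sum(mc_cells.values())
atac_total = sum(atac_cells.values())

# Both sums should match the reported totals.
print(mc_total == 9941, atac_total == 81196)

# Share of each region within a modality, as a fraction of the total.
mc_share = {region: n / mc_total for region, n in mc_cells.items()}
atac_share = {region: n / atac_total for region, n in atac_cells.items()}
```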

Directory contents¶

./Dataset¶

mC¶

The methylation count matrix is saved in the MCDS file format, a structured HDF5 format defined by the xarray package (using netCDF). The key reason to use xarray is its ability to handle N-D labeled arrays, so we can add new dimensions for count type (mc count, cov count) or methylation type (mCH, mCG) instead of saving a bunch of separate files.

xarray http://xarray.pydata.org/en/stable/

Contact¶

Hanqing Liu hanliu@salk.edu

import xarray as xr

mcds = xr.open_dataset('./dataset/mC/3C-171206.mcds')
mcds

<xarray.Dataset>
Dimensions:              (cell: 1202, chrom100k: 27269, count_type: 2, gene: 55487, mc_type: 2, strand_type: 1)
Coordinates:
  * cell                 (cell) object '3C_M_1015' '3C_M_0' ... '3C_M_997'
  * gene                 (gene) object 'ENSMUSG00000102693.1' ... 'ENSMUSG00000064372.1'
  * count_type           (count_type) object 'mc' 'cov'
  * strand_type          (strand_type) object 'both'
  * mc_type              (mc_type) object 'CGN' 'CHN'
  * chrom100k            (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268
    chrom100k_chrom      (chrom100k) object ...
    chrom100k_bin_start  (chrom100k) int64 ...
    chrom100k_bin_end    (chrom100k) int64 ...
    gene_chrom           (gene) object ...
    gene_start           (gene) int64 ...
    gene_end             (gene) int64 ...
Data variables:
    gene_da              (cell, gene, mc_type, strand_type, count_type) uint16 ...
    chrom100k_da         (cell, chrom100k, mc_type, strand_type, count_type) uint16 ...
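
The `chrom100k` dimension carries `chrom100k_bin_start` and `chrom100k_bin_end` coordinates for each 100 kb bin. The sketch below shows how a genomic position could be mapped to its bin interval. It assumes fixed-width, half-open bins that start at position 0 of each chromosome; that layout is an assumption, not something the dataset listing confirms, and `bin_interval` is a hypothetical helper.

```python
BIN_SIZE = 100_000  # assumed fixed bin width in base pairs


def bin_interval(position, bin_size=BIN_SIZE):
    """Return (bin_start, bin_end) of the bin containing a 0-based position.

    Assumes half-open bins [start, end) that begin at 0 on every
    chromosome; the last bin of a real chromosome may be shorter.
    """
    if position < 0:
        raise ValueError("position must be non-negative")
    start = (position // bin_size) * bin_size
    return start, start + bin_size


print(bin_interval(250_000))  # (200000, 300000)
```

Under these assumptions, a lookup in the real data would match the chromosome name against `chrom100k_chrom` and the computed start against `chrom100k_bin_start`.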

# get a 2D matrix and save to other formats
count_table = mcds['gene_da'].sel(mc_type='CHN', count_type='cov').squeeze().to_pandas()
count_table.head()
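
Because the `mc` and `cov` counts sit on the same `count_type` dimension, a methylation fraction can be computed by selecting both and dividing. The sketch below builds a small synthetic dataset with the same dimension names as the MCDS file, so it runs without the real data; the cell and gene labels and the random counts are invented for illustration.

```python
import numpy as np
import xarray as xr

rng = np.random.default_rng(0)
cells = ["toy_cell_0", "toy_cell_1", "toy_cell_2"]
genes = ["toy_gene_a", "toy_gene_b"]

# Synthetic counts with shape (cell, gene, mc_type, strand_type).
cov = rng.integers(1, 50, size=(3, 2, 2, 1)).astype("uint16")
mc = (cov // 2).astype("uint16")  # toy rule: at most half the reads are methylated
counts = np.stack([mc, cov], axis=-1)  # last axis becomes count_type

toy = xr.Dataset(
    {"gene_da": (("cell", "gene", "mc_type", "strand_type", "count_type"), counts)},
    coords={
        "cell": cells,
        "gene": genes,
        "mc_type": ["CGN", "CHN"],
        "strand_type": ["both"],
        "count_type": ["mc", "cov"],
    },
)

# Cast to float before dividing so uint16 arithmetic cannot overflow or truncate.
gene_da = toy["gene_da"].astype("float64")
mc_counts = gene_da.sel(count_type="mc")
cov_counts = gene_da.sel(count_type="cov")

# Features with zero coverage become NaN instead of raising a division warning.
frac = (mc_counts / cov_counts.where(cov_counts > 0)).squeeze("strand_type")
chn_frac = frac.sel(mc_type="CHN").to_pandas()  # cell-by-gene DataFrame
print(chn_frac)
```

With the real file, `toy` would be replaced by the `mcds` object opened above. Whether a raw ratio is an appropriate estimate for low-coverage features is a separate analysis decision.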

# for arbitrary cell combinations and parallel computation
# see here for xarray parallel computation using the dask package: http://xarray.pydata.org/en/stable/dask.html

select_cells = mc_df.sample(1000).index

mcds_select = xr.open_mfdataset('./dataset/mC/*.mcds', 
                                combine='nested', 
                                concat_dim='cell')\
                .sel(cell=select_cells).chunk(dict(cell=1000))
mcds_select

[1, 1, 1, 1, 1, 1, 1, 1, 1]

<xarray.Dataset>
Dimensions:              (cell: 1000, chrom100k: 27269, count_type: 2, gene: 55487, mc_type: 2, strand_type: 1)
Coordinates:
  * mc_type              (mc_type) object 'CGN' 'CHN'
  * gene                 (gene) object 'ENSMUSG00000102693.1' ... 'ENSMUSG00000064372.1'
  * count_type           (count_type) object 'mc' 'cov'
  * strand_type          (strand_type) object 'both'
  * chrom100k            (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268
    chrom100k_chrom      (chrom100k) object dask.array<shape=(27269,), chunksize=(27269,)>
    chrom100k_bin_start  (chrom100k) int64 dask.array<shape=(27269,), chunksize=(27269,)>
    chrom100k_bin_end    (chrom100k) int64 dask.array<shape=(27269,), chunksize=(27269,)>
    gene_chrom           (gene) object dask.array<shape=(55487,), chunksize=(55487,)>
    gene_start           (gene) int64 dask.array<shape=(55487,), chunksize=(55487,)>
    gene_end             (gene) int64 dask.array<shape=(55487,), chunksize=(55487,)>
  * cell                 (cell) object '5D_M_2339' '2C_M_1313' ... '4B_M_694'
Data variables:
    gene_da              (cell, gene, mc_type, strand_type, count_type) uint16 dask.array<shape=(1000, 55487, 2, 1, 2), chunksize=(1000, 55487, 2, 1, 2)>
    chrom100k_da         (cell, chrom100k, mc_type, strand_type, count_type) uint16 dask.array<shape=(1000, 27269, 2, 1, 2), chunksize=(1000, 27269, 2, 1, 2)>

Save 2-D csv files¶

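mcds_list = list(pathlib.Path('./dataset/mC/').glob('*.mcds'))
out_dir = pathlib.Path('./dataset/mC/2-D_csv_matrix')
out_dir.mkdir(exist_ok=True)

for mcds_path in mcds_list:
    mcds_name = mcds_path.name.split('.')[0]
    # open each file in turn so every CSV holds the cells of its own MCDS file
    with xr.open_dataset(mcds_path) as mcds:
        for feature_type in ['gene', 'chrom100k']:
            for mc_type in ['CHN', 'CGN']:
                for count_type in ['cov', 'mc']:
                    path = out_dir / f'{mcds_name}_{feature_type}_{mc_type}_{count_type}.csv.gz'
                    # select one 2-D (cell x feature) slice and write it out
                    mcds[f'{feature_type}_da'].sel(mc_type=mc_type, count_type=count_type)\
                                              .squeeze()\
                                              .to_pandas()\
                                              .to_csv(path)
                    print(path, 'saved')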

ATAC¶

The ATAC count matrix is saved in the SNAP file format, a structured HDF5 format defined by Snaptools, developed by Rongxing Fang from the Ren Lab. See the Snaptools and SnapATAC GitHub pages for more details:

SnapATAC: https://github.com/r3fang/SnapATAC
Snaptools: https://github.com/r3fang/SnapTools

Contact¶

Yang Li yal054@UCSD.EDU
Rongxing Fang r3fang@eng.ucsd.edu

!h5ls ./dataset/ATAC/CEMBA171206_3C.snap

AM                       Group
BD                       Group
FM                       Group
GM                       Group
HD                       Group
PM                       Group

# fragments for each cell: barcodePos and barcodeLen are stored per barcode,
# fragChrom, fragStart and fragLen are stored per fragment.
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/FM

barcodeLen               Dataset {15731}
barcodePos               Dataset {15731}
fragChrom                Dataset {71428971/Inf}
fragLen                  Dataset {71428971/Inf}
fragStart                Dataset {71428971/Inf}

# gene count for each cell, stored in COO sparse matrix format.
# The 5 kb bin counts for each cell are stored in the same COO format.
# Note that the indices follow the R convention and start from 1.
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/GM

count                    Dataset {29154507}
idx                      Dataset {29154507}
idy                      Dataset {29154507}
name                     Dataset {53278}
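
Since the `count`, `idx` and `idy` datasets form COO triplets with 1-based indices, they need to be shifted by one before use in Python. The sketch below converts toy triplets into a dense matrix with NumPy. It assumes `idx` indexes rows (cells) and `idy` indexes columns (genes), which should be checked against the Snaptools documentation; `coo_to_dense` is a hypothetical helper and the toy values are invented.

```python
import numpy as np


def coo_to_dense(idx, idy, count, n_rows, n_cols):
    """Build a dense matrix from COO triplets with 1-based indices.

    Assumes idx holds row indices and idy holds column indices.
    Duplicate (row, column) pairs are summed.
    """
    rows = np.asarray(idx) - 1  # shift from the R convention to 0-based
    cols = np.asarray(idy) - 1
    dense = np.zeros((n_rows, n_cols), dtype=np.int64)
    np.add.at(dense, (rows, cols), np.asarray(count))
    return dense


# Toy triplets: three non-zero entries in a 3 x 4 matrix.
toy_idx = [1, 1, 3]
toy_idy = [2, 4, 1]
toy_count = [5, 1, 7]
toy_matrix = coo_to_dense(toy_idx, toy_idy, toy_count, n_rows=3, n_cols=4)
print(toy_matrix)
```

A dense array is only practical for small toy inputs. For the real matrix, with roughly 29 million non-zero entries, a sparse container would be the more sensible target.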

./metadata¶

mC¶

Cell Metadata: MOp_Metadata.tsv.gz

ATAC¶

Cell Metadata: CEMBA_MOp.barcode.txt

./study¶

Analysis code for single-modality clustering and downstream analysis, provided for reproducibility.

Output of count_table.head() (gene-level CHN coverage counts; middle columns elided):

gene	ENSMUSG00000102693.1	ENSMUSG00000064842.1	ENSMUSG00000051951.5	ENSMUSG00000102851.1	ENSMUSG00000103377.1	ENSMUSG00000104017.1	ENSMUSG00000103025.1	ENSMUSG00000089699.1	ENSMUSG00000103201.1	ENSMUSG00000103147.1	...	ENSMUSG00000064363.1	ENSMUSG00000064364.1	ENSMUSG00000064365.1	ENSMUSG00000064366.1	ENSMUSG00000064367.1	ENSMUSG00000064368.1	ENSMUSG00000064369.1	ENSMUSG00000064370.1	ENSMUSG00000064371.1	ENSMUSG00000064372.1
cell
3C_M_1015	0	0	6712	0	83	0	17	635	45	0	...	50	6	0	0	0	29	0	84	0	0
3C_M_0	0	0	4981	32	44	0	30	401	0	28	...	0	0	0	0	36	0	0	142	0	0
3C_M_1005	0	0	3481	0	37	0	29	483	22	0	...	0	0	0	0	122	8	0	53	17	18
3C_M_1	0	0	7149	32	28	85	0	464	0	45	...	5	13	18	14	206	21	16	251	19	14
3C_M_1004	35	0	7919	16	35	94	34	645	110	57	...	21	9	0	0	30	0	0	70	26	30