Index of /biccn/grant/u19_cemba/cemba/epigenome/sncell/ATACseq/mouse/processed/analysis/EckerRen_Mouse_MOp_methylation_ATAC

Name              Last modified      Size
Parent Directory                     -
README.html       2020-02-14 11:14   297K
README.ipynb      2020-02-14 11:20   30K
bigwig/           2020-02-18 14:33   -
dataset/          2020-02-18 14:33   -
metadata/         2020-02-18 14:33   -
study/            2020-02-18 14:34   -

README

MOp Epigenomic Data

File location summary

(See below for more details)

Dropbox URL

https://www.dropbox.com/home/BICCN%20MiniBrain%20JointDataAnalysis/EckerRen_Mouse_MOp_methylation%2BATAC

mC

  • Dataset (raw count matrices for genes and chromosome 100 kb bins):
    • ./dataset/mC/*mcds # stored in the HDF5-based MCDS format; see below for details
    • ./dataset/mC/2-D_csv_matrix # cell-by-feature matrices extracted from the *.mcds files; same information, saved as 2-D CSV files
  • Metadata:
    • ./metadata/mc/MOp_Metadata.tsv.gz
  • Analysis Result:
    • methylation cluster assignment is finalized
    • ./study/mC/MOp_clustering/MOp.snmC-seq.AnalysisResult.csv.gz

ATAC

  • Dataset (raw count matrices for genes and chromosome 100 kb bins):
    • ./dataset/ATAC/*snap # stored in the HDF5-based SNAP format; see below for details
  • Metadata:
    • ./metadata/atac/CEMBA_MOp.barcode.txt
  • Analysis Result:
    • ATAC cluster assignment is finalized
    • ./study/ATAC/MOp.snATAC-seq.AnalysisResult.csv.gz

Data Included

snmC-seq2 Data

  • Brain regions: 2C (2214), 3C (2473), 4B (2881), 5D (2373)
  • Total cells passing QC: 9941
In [1]:
import pathlib
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
In [2]:
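# pd.read_msgpack was deprecated in pandas 0.25 and removed in pandas 1.0, so this cell needs pandas < 1.0
mc_df = pd.read_msgpack('study/mC/MOp_clustering/ALL/CellMetadata.AfterQC.msg')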
In [3]:
mc_df['Region'].value_counts()
Out[3]:
4B    2881
3C    2473
5D    2373
2C    2214
Name: Region, dtype: int64
In [4]:
mc_df.shape[0]
Out[4]:
9941

snATAC-seq Data

  • Brain regions: 2C (19965), 3C (17662), 4B (23266), 5D (20303)
  • Total cells passing QC: 81196
In [5]:
# requires pandas < 1.0 (pd.read_msgpack was removed in pandas 1.0)
atac_df = pd.read_msgpack('study/ATAC/MOp.snATAC-seq.AnalysisResult.msg')
In [6]:
atac_df['region'].value_counts()
Out[6]:
4B    23266
5D    20303
2C    19965
3C    17662
Name: region, dtype: int64
In [7]:
atac_df.shape[0]
Out[7]:
81196
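
As a quick consistency check, the per-region counts quoted above should add up to the reported totals. The sketch below uses only numbers copied from this README; the variable names are illustrative and not part of the original notebook.

```python
# Per-region cell counts copied from the two summaries above.
mc_region_counts = {"2C": 2214, "3C": 2473, "4B": 2881, "5D": 2373}
atac_region_counts = {"2C": 19965, "3C": 17662, "4B": 23266, "5D": 20303}

mc_total = sum(mc_region_counts.values())
atac_total = sum(atac_region_counts.values())

# Both sums should match the totals reported for cells passing QC.
assert mc_total == 9941
assert atac_total == 81196

# Share of each region within the snmC-seq2 data, as a rounded percentage.
mc_share = {
    region: round(100 * n / mc_total, 1)
    for region, n in mc_region_counts.items()
}
print(mc_share)
```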

Directory contents

./Dataset

mC

The methylation count matrix is saved in the MCDS file format, a structured HDF5 format defined by the xarray package (using netCDF). The key reason for using xarray is its ability to handle N-D labeled arrays: new dimensions can be added for count type (mc count, cov count) or methylation type (mCH, mCG) instead of saving many separate files.
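
To illustrate the labeled N-D idea, here is a minimal sketch that builds a toy dataset with the same dimension names as the MCDS files. The cell and gene labels and the random counts are made up for illustration, and the `strand_type` dimension is left out for brevity.

```python
import numpy as np
import xarray as xr

# Toy labels; the real files hold thousands of cells and genes.
cells = ["cell_a", "cell_b", "cell_c"]
genes = ["gene_1", "gene_2"]
mc_types = ["CGN", "CHN"]
count_types = ["mc", "cov"]

rng = np.random.default_rng(0)
cov = rng.integers(1, 50, size=(len(cells), len(genes), len(mc_types)))
mc = rng.integers(0, cov + 1)  # methylated count never exceeds coverage

# Stack mc and cov along a new trailing axis: this is the count_type dimension.
data = np.stack([mc, cov], axis=-1).astype("uint16")

toy = xr.Dataset(
    {"gene_da": (("cell", "gene", "mc_type", "count_type"), data)},
    coords={
        "cell": cells,
        "gene": genes,
        "mc_type": mc_types,
        "count_type": count_types,
    },
)

# Selecting by label collapses the extra dimensions to a 2-D cell-by-gene slice.
chn_cov = toy["gene_da"].sel(mc_type="CHN", count_type="cov")
print(chn_cov.dims, chn_cov.shape)
```

One array holds every combination of methylation type and count type, so a single `.sel(...)` call replaces looking up one of several separate files.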

Contact
  • Hanqing Liu hanliu@salk.edu
In [8]:
import xarray as xr
In [9]:
mcds = xr.open_dataset('./dataset/mC/3C-171206.mcds')
mcds
Out[9]:
<xarray.Dataset>
Dimensions:              (cell: 1202, chrom100k: 27269, count_type: 2, gene: 55487, mc_type: 2, strand_type: 1)
Coordinates:
  * cell                 (cell) object '3C_M_1015' '3C_M_0' ... '3C_M_997'
  * gene                 (gene) object 'ENSMUSG00000102693.1' ... 'ENSMUSG00000064372.1'
  * count_type           (count_type) object 'mc' 'cov'
  * strand_type          (strand_type) object 'both'
  * mc_type              (mc_type) object 'CGN' 'CHN'
  * chrom100k            (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268
    chrom100k_chrom      (chrom100k) object ...
    chrom100k_bin_start  (chrom100k) int64 ...
    chrom100k_bin_end    (chrom100k) int64 ...
    gene_chrom           (gene) object ...
    gene_start           (gene) int64 ...
    gene_end             (gene) int64 ...
Data variables:
    gene_da              (cell, gene, mc_type, strand_type, count_type) uint16 ...
    chrom100k_da         (cell, chrom100k, mc_type, strand_type, count_type) uint16 ...
In [10]:
# get a 2D matrix and save to other formats
count_table = mcds['gene_da'].sel(mc_type='CHN', count_type='cov').squeeze().to_pandas()
count_table.head()
Out[10]:
gene ENSMUSG00000102693.1 ENSMUSG00000064842.1 ENSMUSG00000051951.5 ENSMUSG00000102851.1 ENSMUSG00000103377.1 ENSMUSG00000104017.1 ENSMUSG00000103025.1 ENSMUSG00000089699.1 ENSMUSG00000103201.1 ENSMUSG00000103147.1 ... ENSMUSG00000064363.1 ENSMUSG00000064364.1 ENSMUSG00000064365.1 ENSMUSG00000064366.1 ENSMUSG00000064367.1 ENSMUSG00000064368.1 ENSMUSG00000064369.1 ENSMUSG00000064370.1 ENSMUSG00000064371.1 ENSMUSG00000064372.1
cell
3C_M_1015 0 0 6712 0 83 0 17 635 45 0 ... 50 6 0 0 0 29 0 84 0 0
3C_M_0 0 0 4981 32 44 0 30 401 0 28 ... 0 0 0 0 36 0 0 142 0 0
3C_M_1005 0 0 3481 0 37 0 29 483 22 0 ... 0 0 0 0 122 8 0 53 17 18
3C_M_1 0 0 7149 32 28 85 0 464 0 45 ... 5 13 18 14 206 21 16 251 19 14
3C_M_1004 35 0 7919 16 35 94 34 645 110 57 ... 21 9 0 0 30 0 0 70 26 30

5 rows × 55487 columns
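
The `mc` and `cov` tables are commonly combined into a per-cell methylation fraction (mc / cov). This README does not prescribe how to do that, so the following is only a sketch on toy data: the small DataFrames stand in for the output of `.to_pandas()`, and `min_cov` is a hypothetical coverage threshold.

```python
import numpy as np
import pandas as pd

cells = ["cell_a", "cell_b"]
genes = ["gene_1", "gene_2", "gene_3"]

# Toy stand-ins for the 'cov' and 'mc' slices of gene_da.
cov = pd.DataFrame([[10, 0, 4], [20, 5, 0]], index=cells, columns=genes)
mc = pd.DataFrame([[2, 0, 1], [5, 5, 0]], index=cells, columns=genes)

min_cov = 1  # hypothetical threshold: entries with lower coverage become NaN

# Mask low-coverage entries first so that a 0 / 0 division never occurs.
enough_cov = cov >= min_cov
frac = mc.where(enough_cov) / cov.where(enough_cov)
print(frac)
```

Uncovered features come out as NaN rather than 0, which keeps "no data" distinct from "unmethylated".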

In [11]:
# for arbitrary cell combinations and parallel computation
# see here for xarray parallel computation using the dask package: http://xarray.pydata.org/en/stable/dask.html

select_cells = mc_df.sample(1000).index

mcds_select = xr.open_mfdataset('./dataset/mC/*.mcds', 
                                combine='nested', 
                                concat_dim='cell')\
                .sel(cell=select_cells).chunk(dict(cell=1000))
mcds_select
[1, 1, 1, 1, 1, 1, 1, 1, 1]
Out[11]:
<xarray.Dataset>
Dimensions:              (cell: 1000, chrom100k: 27269, count_type: 2, gene: 55487, mc_type: 2, strand_type: 1)
Coordinates:
  * mc_type              (mc_type) object 'CGN' 'CHN'
  * gene                 (gene) object 'ENSMUSG00000102693.1' ... 'ENSMUSG00000064372.1'
  * count_type           (count_type) object 'mc' 'cov'
  * strand_type          (strand_type) object 'both'
  * chrom100k            (chrom100k) int64 0 1 2 3 4 ... 27265 27266 27267 27268
    chrom100k_chrom      (chrom100k) object dask.array<shape=(27269,), chunksize=(27269,)>
    chrom100k_bin_start  (chrom100k) int64 dask.array<shape=(27269,), chunksize=(27269,)>
    chrom100k_bin_end    (chrom100k) int64 dask.array<shape=(27269,), chunksize=(27269,)>
    gene_chrom           (gene) object dask.array<shape=(55487,), chunksize=(55487,)>
    gene_start           (gene) int64 dask.array<shape=(55487,), chunksize=(55487,)>
    gene_end             (gene) int64 dask.array<shape=(55487,), chunksize=(55487,)>
  * cell                 (cell) object '5D_M_2339' '2C_M_1313' ... '4B_M_694'
Data variables:
    gene_da              (cell, gene, mc_type, strand_type, count_type) uint16 dask.array<shape=(1000, 55487, 2, 1, 2), chunksize=(1000, 55487, 2, 1, 2)>
    chrom100k_da         (cell, chrom100k, mc_type, strand_type, count_type) uint16 dask.array<shape=(1000, 27269, 2, 1, 2), chunksize=(1000, 27269, 2, 1, 2)>
Save 2-D csv files
In [12]:
mcds_list = list(pathlib.Path('./dataset/mC/').glob('*mcds'))
In [ ]:
for mcds_path in mcds_list:
    # open the dataset that belongs to this path
    this_mcds = xr.open_dataset(mcds_path)
    mcds_name = mcds_path.name.split('.')[0]
    for feature_type in ['gene', 'chrom100k']:
        for mc_type in ['CHN', 'CGN']:
            for count_type in ['cov', 'mc']:
                path = f'./dataset/mC/2-D_csv_matrix/{mcds_name}_{feature_type}_{mc_type}_{count_type}.csv.gz'
                # select one 2-D cell-by-feature slice and write it as gzipped csv
                this_mcds[f'{feature_type}_da'].sel(mc_type=mc_type, count_type=count_type)\
                                               .squeeze()\
                                               .to_pandas()\
                                               .to_csv(path)
                print(path, 'saved')
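
The naming scheme used by the loop above gives eight CSV files per MCDS file (2 feature types × 2 methylation types × 2 count types). A small sketch that lists the expected file names for one input, using the example file name from In [9]; nothing is read or written here.

```python
import itertools
import pathlib

out_dir = pathlib.Path("./dataset/mC/2-D_csv_matrix")
mcds_path = pathlib.Path("./dataset/mC/3C-171206.mcds")  # example file name from In [9]
mcds_name = mcds_path.name.split(".")[0]

# One output file per (feature type, methylation type, count type) combination.
expected = [
    out_dir / f"{mcds_name}_{feature_type}_{mc_type}_{count_type}.csv.gz"
    for feature_type, mc_type, count_type in itertools.product(
        ["gene", "chrom100k"], ["CHN", "CGN"], ["cov", "mc"]
    )
]

for path in expected:
    print(path)
```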

ATAC

The ATAC count matrix is saved in the SNAP file format, a structured HDF5 format defined by Snaptools, developed by Rongxing Fang from the Ren Lab. See the Snaptools and SnapATAC GitHub pages for more details.

Contact
  • Yang Li yal054@UCSD.EDU
  • Rongxing Fang r3fang@eng.ucsd.edu
In [13]:
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap
AM                       Group
BD                       Group
FM                       Group
GM                       Group
HD                       Group
PM                       Group
In [14]:
# 5 kb bin counts for each cell are stored in COO sparse matrix format.
# Note that the indices follow the R convention and start from 1.
# (The command below lists the GM gene-matrix group, so its output matches In [16].)
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/GM
count                    Dataset {29154507}
idx                      Dataset {29154507}
idy                      Dataset {29154507}
name                     Dataset {53278}
In [15]:
# Fragments for each cell (FM group): per-fragment chromosome, start, and length arrays,
# plus per-barcode position and length entries (barcodePos, barcodeLen).
# Note that the indices follow the R convention and start from 1.
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/FM
barcodeLen               Dataset {15731}
barcodePos               Dataset {15731}
fragChrom                Dataset {71428971/Inf}
fragLen                  Dataset {71428971/Inf}
fragStart                Dataset {71428971/Inf}
In [16]:
# Gene counts for each cell, stored in COO sparse matrix format.
# Note that the indices follow the R convention and start from 1.
!h5ls ./dataset/ATAC/CEMBA171206_3C.snap/GM
count                    Dataset {29154507}
idx                      Dataset {29154507}
idy                      Dataset {29154507}
name                     Dataset {53278}
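
Because the indices are 1-based, they need to be shifted by one before building a sparse matrix in Python. The sketch below uses small made-up arrays in place of the `idx`, `idy`, and `count` datasets. It assumes that `idx` indexes cells (rows) and `idy` indexes features (columns); that assumption is not confirmed by this README and should be checked against the Snaptools documentation. In practice the arrays would be read from the HDF5 group, for example with h5py.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the idx, idy, and count datasets (1-based indices).
idx = np.array([1, 1, 3])    # assumed: cell (row) index
idy = np.array([2, 4, 1])    # assumed: feature (column) index
count = np.array([5, 1, 2])
n_cells, n_features = 3, 4   # toy matrix shape

# Shift to 0-based indices, build the COO matrix, then convert to CSR
# for efficient row (cell) slicing.
matrix = sparse.coo_matrix(
    (count, (idx - 1, idy - 1)), shape=(n_cells, n_features)
).tocsr()

print(matrix.toarray())
```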

./metadata

mC

  • Cell Metadata: MOp_Metadata.tsv.gz

ATAC

  • Cell Metadata: CEMBA_MOp.barcode.txt

./study

Analysis code for single-modality clustering and downstream analysis, provided for reproducibility.
