Reference maping using scGen

In this tutorial, we are going to build a reference atlas using scGen and also map two new query datasets on the top of the reference atlas.

Note: scGen requires cell-type labels for data integration. The method outputs both corrected gene expression and also latent space.

[1]:
import os
import sys
sys.path.insert(0, "../")

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)
warnings.simplefilter(action='ignore', category=UserWarning)
[2]:
import scanpy as sc
import scarches as sca
from scarches.dataset.trvae.data_handling import remove_sparsity
import matplotlib.pyplot as plt
import numpy as np
import gdown
[3]:
sc.settings.set_figure_params(dpi=200, frameon=False)
sc.set_figure_params(dpi=200)
sc.set_figure_params(figsize=(4, 4))

set relevant anndata.obs labels and training hyperparameters

[4]:
condition_key = 'study'
cell_type_key = 'cell_type'
target_conditions = ['Pancreas CelSeq2', 'Pancreas SS2']


epoch = 50

early_stopping_kwargs = {
    "early_stopping_metric": "val_loss",
    "patience": 20,
    "threshold": 0,
    "reduce_lr": True,
    "lr_patience": 13,
    "lr_factor": 0.1,
}

Download Dataset and split into reference dataset and query dataset

[5]:
url = 'https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd'
output = 'pancreas.h5ad'
gdown.download(url, output, quiet=False)
Downloading...
From: https://drive.google.com/uc?id=1ehxgfHTsMZXy6YzlFKGJOsBKQ5rrvMnd
To: /home/mo/projects/scarches/notebooks/pancreas.h5ad
126MB [00:01, 102MB/s]
[5]:
'pancreas.h5ad'

Imortant note : scGen requires normalized and log-transformed data in ``adata.X``

[6]:
adata = sc.read('pancreas.h5ad')

original uncorrected data UMAP

[7]:
sc.pp.neighbors(adata)
WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
[8]:
sc.tl.umap(adata)
[9]:
sc.pl.umap(adata, color=['study', 'cell_type'],
           frameon=False, wspace=0.6)
_images/scgen_map_query_14_0.png

As observed above these Pancreas studies are seperated accroding to source of study

Here we use the CelSeq2 and SS2 studies as query data and the other 3 studies to build as reference atlas.

[10]:
adata = remove_sparsity(adata) # remove sparsity
source_adata = adata[~adata.obs[condition_key].isin(target_conditions)].copy()
target_adata = adata[adata.obs[condition_key].isin(target_conditions)].copy()

Create scGen model and train it on reference dataset

Create the scgen model instance

[11]:
network = sca.models.scgen(adata = source_adata,
                           hidden_layer_sizes=[256,128])

INITIALIZING NEW NETWORK..............
Encoder Architecture:
        Input Layer in, out: 1000 256
        Hidden Layer 1 in/out: 256 128
        Mean/Var Layer in/out: 128 10
Decoder Architecture:
        First Layer in, out 10 128
        Hidden Layer 1 in/out: 128 256
        Output Layer in/out:  256 1000

[12]:
network.train(n_epochs=epoch, early_stopping_kwargs = early_stopping_kwargs)
 |████████████████████| 100.0%  - epoch_loss: 1.9351395184 - val_loss: 1.8963928728
Saving best state of network...
Best State was in Epoch 87

Correct batches in reference data

This function returns corrected gene expression in adata.X, raw uncorrected data in adata.obsm["original_data"]. Also it returns uncorrected data in adata.layers["original_data"].

The low-dimensional corrected latent space in adata.obsm["latent_corrected"]

[13]:
corrected_reference_adata = network.batch_removal(source_adata, batch_key="study", cell_label_key="cell_type",return_latent=True)

Corrected gene expression

[14]:
sc.pp.neighbors(corrected_reference_adata)
sc.tl.umap(corrected_reference_adata)
sc.pl.umap(corrected_reference_adata, color=["study", "cell_type"], wspace=.5, frameon=False)
WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
... storing 'cell_type' as categorical
_images/scgen_map_query_26_1.png

We an also use low-dim corrected reference data

[15]:
sc.pp.neighbors(corrected_reference_adata, use_rep="latent_corrected")
sc.tl.umap(corrected_reference_adata)
sc.pl.umap(corrected_reference_adata,
           color=['study', 'cell_type'],
           frameon=False,
           wspace=0.6,
           )
_images/scgen_map_query_28_0.png

After training, the model can be saved for later use and projection of new query studies

[16]:
ref_path = './ref_model/'
network.save(ref_path, overwrite=True)
[17]:
os.getcwd()
[17]:
'/home/mo/projects/scarches/notebooks'

Project query on top of the reference

query data needs to be preprocessed same way as reference data with same genes

This function need pretrained reference model, corrected gene expression from reference data and incorrected query data

[18]:
# here we pass the saved model from a file to the map query
integrated_query = sca.models.scgen.map_query_data(reference_model = network,
                                                   corrected_reference = corrected_reference_adata,
                                                   query = target_adata,
                                                   batch_key = 'study',
                                                   return_latent=True)

INITIALIZING NEW NETWORK..............
Encoder Architecture:
        Input Layer in, out: 1000 256
        Hidden Layer 1 in/out: 256 128
        Mean/Var Layer in/out: 128 10
Decoder Architecture:
        First Layer in, out 10 128
        Hidden Layer 1 in/out: 128 256
        Output Layer in/out:  256 1000

Plot the latent space of integrated query and reference

[19]:
sc.pp.neighbors(integrated_query, use_rep="latent_corrected")
sc.tl.umap(integrated_query)
sc.pl.umap(
    integrated_query,
    color=["study", "cell_type"],
    frameon=False,
    wspace=0.6)
... storing 'batch' as categorical
... storing 'study' as categorical
... storing 'cell_type' as categorical
... storing 'original_batch' as categorical
_images/scgen_map_query_37_1.png

Plot corrected gene expression space of integrated query and reference

[20]:
sc.pp.neighbors(integrated_query)
sc.tl.umap(integrated_query)
sc.pl.umap(
    integrated_query,
    color=["study", "cell_type"],
    frameon=False,
    wspace=0.6)
WARNING: You’re trying to run this on 1000 dimensions of `.X`, if you really want this, set `use_rep='X'`.
         Falling back to preprocessing with `sc.pp.pca` and default params.
_images/scgen_map_query_39_1.png