Single-cell architecture surgery (scArches) is a package for reference-based analysis of single-cell data.

  • (30.11.2022) We have added scPoli to the scArches code base. scPoli enables population-level integration and multi-scale analyses of cells and samples.

  • (22.10.2022) We have added mvTCR and SageNet, which enable mapping of multimodal immune profiling data (TCR + scRNA-seq) and mapping of scRNA-seq data to spatial atlases, respectively.

  • (7.07.2022) We have added treeArches to the scArches code base. treeArches enables building cell-type hierarchies to identify novel states (e.g., disease, subpopulations) in the query data when mapped to the reference. See tutorials here.

  • (6.02.2022) We have added expiMap to the scArches code base. expiMap allows interpretable reference mapping. Try it in the tutorials section.

What is scArches?

scArches allows analysis of your single-cell query data by integrating it into a reference atlas. To map your data, you need an integrated atlas built with one of the reference-building methods that scArches supports for different applications, including:

  • scVI (Lopez et al., 2018): Requires access to raw count values for data integration and assumes a count distribution for the data (NB, ZINB, Poisson).

  • trVAE (Lotfollahi et al., 2020): Supports either normalized log-transformed data or count data as input and applies an additional MMD loss for better merging in the latent space.

  • scANVI (Xu et al., 2019): Needs cell-type labels for the reference data. Your query data can be either unlabeled or labeled. If the query data is unlabeled, you can also use this method to classify your query cells using the reference labels.

  • scGen (Lotfollahi et al., 2019): This method requires cell-type labels for both reference building and mapping. Reference mapping with this method relies solely on the integrated reference and requires no fine-tuning.

  • expiMap (Lotfollahi*, Rybakov* et al., 2023): This method takes prior knowledge from gene-set databases or from users, allowing you to analyze your query data in the context of known gene programs.

  • totalVI (Gayoso et al., 2019): This model can be used to build multi-modal CITE-seq reference atlases.

  • treeArches (Michielsen*, Lotfollahi* et al., 2022): This model builds a hierarchical tree of cell types in the reference atlas. When mapping the query data, it can annotate cells and also identify novel cell states and populations present in the query data.

  • SageNet (Heidari et al., 2022): This model allows construction of a spatial atlas by mapping dissociated query single cells/spots (e.g., from scRNA-seq or Visium datasets) into a common coordinate framework using one or more spatially resolved reference datasets.

  • mvTCR (Drost et al., 2022): This model integrates T-cell receptor (TCR, treated as a sequence) and scRNA-seq datasets across multiple donors into a joint representation that captures information from both modalities.

  • scPoli (De Donno et al., 2022): This model allows integration of scRNA-seq datasets, prototype-based label transfer, and reference mapping. scPoli learns both sample embeddings and integrated cell embeddings, giving the user a multi-scale view of the data, which is especially useful when there are many samples to integrate.

Which model to choose?

  • If your reference data is labeled (cell-type labels) and you have an unlabeled or labeled query, then use scArches scANVI, treeArches or scPoli.

  • If your reference data is labeled (cell-type labels) and you have a labeled query, then use scGen.

  • If your reference and query are unlabeled, our preferred model is scArches scVI. If it does not work for you, try scArches trVAE, which gives better integration but is a bit slower.

  • If you have CITE-seq data and want to integrate scRNA-seq data as a query and impute the missing proteins in that query, then use scArches totalVI.

  • If you have scRNA-seq data and want to analyze it in the context of gene programs, for example to ask which pathways changed after a disease or which genes drive a new disease state in the query and set it apart from other states, then use expiMap.

  • If you want to build a cellular hierarchy, continuously update it using new query datasets, and see how your query populations compare to the original hierarchy to identify new subpopulations or disease states in your query, then use treeArches.

  • If you have scRNA-seq data and want to map it to a reference spatial atlas to infer spatial locations and perform cell-cell interaction analysis, then use SageNet.

  • If you have many samples to integrate and are interested in obtaining a sample embedding space in addition to integrated cell embeddings, then use scPoli.
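
The guidelines above can be sketched as a small lookup helper. This is only an illustration of the decision logic in this list: `Scenario` and `suggest_models` are hypothetical names, they are not part of the scArches API, and the returned strings are plain model names rather than scArches classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Scenario:
    """Hypothetical description of a mapping task (not a scArches object)."""
    reference_labeled: bool = False      # reference has cell-type labels
    query_labeled: bool = False          # query has cell-type labels
    has_cite_seq: bool = False           # CITE-seq reference, impute proteins in query
    wants_gene_programs: bool = False    # analyze query in terms of gene programs
    wants_hierarchy: bool = False        # build and update a cell-type hierarchy
    wants_spatial_mapping: bool = False  # map scRNA-seq to a spatial atlas
    many_samples: bool = False           # want sample embeddings as well


def suggest_models(s: Scenario) -> List[str]:
    """Return candidate model names following the guidelines in this section."""
    suggestions: List[str] = []

    # Task-specific recommendations come first.
    if s.has_cite_seq:
        suggestions.append("totalVI")
    if s.wants_gene_programs:
        suggestions.append("expiMap")
    if s.wants_hierarchy:
        suggestions.append("treeArches")
    if s.wants_spatial_mapping:
        suggestions.append("SageNet")
    if s.many_samples:
        suggestions.append("scPoli")

    # Label-based recommendations.
    if s.reference_labeled:
        suggestions += ["scANVI", "treeArches", "scPoli"]
        if s.query_labeled:
            suggestions.append("scGen")
    elif not s.query_labeled:
        # Neither reference nor query is labeled: scVI first, trVAE as fallback.
        suggestions += ["scVI", "trVAE"]

    # Remove duplicates while keeping the first occurrence of each name.
    return list(dict.fromkeys(suggestions))


if __name__ == "__main__":
    print(suggest_models(Scenario(reference_labeled=True, query_labeled=True)))
```

The helper only returns candidates to try. The case of an unlabeled reference with a labeled query is not covered by the guidelines, so the sketch returns no label-based suggestion for it.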

Where to start?

To get a sense of how the model works, please go through this tutorial. To find out how to construct, share, or use pre-trained models, see the example sections.


If scArches is useful in your research, please consider citing the preprint.