# Estimating Heterogeneity of scRNA-seq Data With rhype

A practical use of hypergraphs

During my masters, I wrote my thesis on ‘The Applications of Hypergraphs in the Analysis of Biological Systems’. From this, I took my love of R and wrote the rhype package, read my article introducing it. Here I’d like to talk about one application of hypergraphs from the field of genetics (where I have found myself for my PhD) that came from two people I worked with closely during my masters, specifically how it can be done very easily using rhype.

Hypergraphs are mathematical objects used to represent higher order interactions within a system. There are many places where they can be applied across all of science, one being the field of single cell RNA sequencing (scRNA-seq). The details of scRNA-seq are not gone into in this post (not least because I am wholly unqualified to explain it), for a more in depth explanation please read this article. From a hypergraphical perspective, each cell can be viewed as a vertex and each gene as a hyperedge, with each cell being contained in a hyperedge if that cell expressed the given gene. This relationship can then be further abstracted using hypergraphs with real coefficients, where each vertex is connected to a hyperedge (gene) with a weight according to how many times the cell transcribed the gene. This is all starting to get quite complicated, luckily the code for this is very simple, lets take a look at an example.

You can download the data I am using here, it is from Patrick Stumpf’s paper Mapping Biology From Mouse to Man Using Transfer Learning. First of all, load in the data.

library(Matrix)

load("./scRNA-seq_example.RData")

You can see in your environment window three different objects counts, plot_coords and identity that we will discuss during the post. The first and most important of these is the counts matrix counts, this is the table of gene counts. It is stored in a sparse matrix format (hence why we needed to load up the Matrix package) as it is very big, we can see a small section of the matrix using

as.matrix(counts[1:5,3:4])
##                                                  0610007P14Rik 0610009B22Rik
## M1_TBM_PCR2.4_altered_out_gene_exon_AGGCCATTGGGC             2             3
## M1_TBM_PCR2.4_altered_out_gene_exon_GCTGGATCGCTT             0             0
## M1_TBM_PCR2.4_altered_out_gene_exon_CGACTTATCTTC             2             1
## M1_TBM_PCR2.4_altered_out_gene_exon_CTAATCTGTTAG             0             1
## M1_TBM_PCR2.4_altered_out_gene_exon_AGTTATGGTTCG             1             1

Fortunately, this matrix is actually the same as the incidence matrix of the underlying hypergraph, this means to make a hypergraph the only code we need is

library(rhype)

counts_hypergraph <- hype_from_inc_mat(counts, real_coef = TRUE)
counts_hypergraph
## Hypergraph Object:
##     Number of vertices:  5504
##     Number of hyperedges:  16519
##     Oriented:  FALSE    Directed:  FALSE
##     Real Coefficients:  TRUE    Weighted:  FALSE

The question now is what can we do with this hypergraph? How can it actually help us in the real world? This is where a paper by Raffaella Mulas and Michael Casey comes in called Estimating cellular redundancy in networks of genetic expression. This uses spectral hypergraph theory to create an estimate of cellular redundancy in scRNA-seq data. Before we dive into the spectral theory and calculations, in their paper they normalise their scRNA-seq data per cell, so lets do that and then remake our hypergraph

#Normalising counts so the rows sum to 1
norm_counts <- apply(counts, 1, $$x) x/sum(x)) #Reorienting the matrix so cells are rows again norm_counts <- t(norm_counts) #Creating the hypergraph norm_counts_hypergraph <- hype_from_inc_mat(norm_counts, real_coef = TRUE) The essence of spectral hypergraph theory is that hypergraphs can be represented as matrices and square matrices can be represented with eigenvalues. If we choose our matrix representation well, these eigenvalues will contain important information about the hypergraph. The paper describes how the largest eigenvalue of the hypergraph normalised laplacians will encode information about the cellular redundancy of the scRNA-seq data. This largest eigenvalue can be calculated with one command. ev <- spectra( #The hypergraph to find the spectra of hype = norm_counts_hypergraph, #The matrix to use to find the spectra matrix = "vert_norm_lap_mat", #The number of eigenvalues to look for n = 1 ) evvalues ## [1] 2190.23+0i This gives a complex number between \(1$$ and $$N$$, where $$N$$ is the number of vertices in the hypergraph (or the number of cells in the scRNA-seq dataset). In the paper they normalise this to the range $$[\frac{1}{N}, 1]$$ by dividing the eignevalue, $$\lambda$$, by $$N$$, $$\frac{\lambda}{N}$$. For this post we will be using a different normalisation formula to account for samples with different numbers of cells. Using $$\frac{\lambda -1}{N-1}$$ maps the eigenvalue to the range $$[0,1]$$. We can create and compare these two formulae (there is currently no wrapper function to find the number of vertices in a hypergraph, so for now we use the object function <hypergraph>$get_numv()). ev_norm1 <- function(ev) { ev <- Re(ev$values)
N <- norm_counts_hypergraph$get_numv() ev/N } ev_norm2 <- function(ev) { ev <- Re(ev$values)
N <- norm_counts_hypergraph\$get_numv()
(ev-1)/(N-1)
}

ev_norm1(ev)
## [1] 0.3979343
ev_norm2(ev)
## [1] 0.3978249

That is it! As you can see these two normalisation methods gave very similar answers and both can be used as a measure of cellular redundancy, or homogeneity, as described in the paper. But calculating one number isn’t really much of a post, so lets do some comparisons.

Our data set contains cells of multiple different types, these types can be found in the identity object. We can see what these look like using the plotting coordinates in the plot_coords variable.

Instead of making a hypergraph for the whole dataset, we can make a hypergraph just for each cell type and compare homogeneity between cell types.

hypergraphs <- list()

for (i in 1:14) {
cell_type = levels(identity)[i]
new_counts <- norm_counts[identity == cell_type,]
hypergraphs[[i]] <- hype_from_inc_mat(new_counts, real_coef = TRUE)
}

names(hypergraphs) <- levels(identity)

We can then calculate the largest eigenvalue for each of these hypergraphs

evs <- list()

for (i in 1:14) {
evs[[i]] <- spectra(
hype = hypergraphs[[i]],
matrix = "vert_norm_lap_mat",
n = 1
)
}

cell_red_scores <- unlist(lapply(evs, ev_norm2))

Lets compare these to one another

The graph above ranks the cell types from most heterogeneous (Pre-B) to least heterogeneous (Erythroblasts). As scientists we like to put confidence intervals on quantities. This is certainly possible for the cellular redundancy score too, but it is a whole other discussion in its own right, so keep an eye out for another post.

##### Hugh Warden
###### PhD Student - Human Genetics Unit

I use machine learning and cell painting to look at how cancer affects the morphology of cells.