Tutorial 6: Atera: Breast Tumor

Analyze 10x Atera breast tumor spatial transcriptomics data with SpacGPA.

The Atera data folder used in this tutorial contains three required files: cell_feature_matrix.h5, cells.csv.gz, and cell_boundaries.csv.gz.

[1]:

# Import SpacGPA and other required packages.
import SpacGPA as sg
import scanpy as sc
import matplotlib.pyplot as plt
import os
import warnings
warnings.filterwarnings('ignore', message=r'.*Variable names are not unique.*')

[2]:

# Set the working directory to your local path.
workdir = '..'
os.chdir(workdir)

Part 1: Gene program analysis via SpacGPA

[3]:

# Load Atera data.
adata = sg.load_atera('data/Atera/Breast_Tumor')
print(adata)

AnnData object with n_obs × n_vars = 170057 × 18028
    obs: 'x_centroid', 'y_centroid', 'transcript_counts', 'control_probe_counts', 'genomic_control_counts', 'control_codeword_counts', 'unassigned_codeword_counts', 'deprecated_codeword_counts', 'total_counts', 'cell_area', 'nucleus_area', 'nucleus_count', 'segmentation_method', 'polygon'
    var: 'gene_ids', 'feature_types', 'genome'
    obsm: 'spatial'

[4]:

# Preprocessing: library-size normalize and log1p-transform.
sc.pp.filter_cells(adata, min_counts=1)
sc.pp.normalize_total(adata, target_sum = 1e4)
sc.pp.log1p(adata)
sc.pp.filter_cells(adata, min_genes = 200)
sc.pp.filter_genes(adata, min_cells = 10)
print(adata.X.shape)

(165301, 18022)

[5]:

# Construct the co-expression network using SpacGPA (Gaussian graphical model).
ggm = sg.create_ggm(adata, project_name = 'Breast Tumor')

Please Normalize The Expression Matrix Before Running!
Loading Data.

Running all calculations on GPU (if available).
Using single precision (float32) for all calculations.
Using chunk size of 5000 for efficient computation

Computing the number of cells co-expressing each gene pair...
Computing covariance matrix...
Computing Pearson correlation matrix...

Calculating partial correlations in 8120 iterations.
Number of genes randomly selected in each iteration: 2000
Iteration: 8120/8120, Avg loop time: 0.0372 s, Elapsed time: 5.15 min, Estimated time left: 0.00 min.
All iterations completed.

Performing FDR control...
Randomly redistribute the expression distribution of input genes...

Calculate correlation between genes after redistribution...

Computing the number of cells co-expressing each gene pair...
Computing covariance matrix...
Computing Pearson correlation matrix...

Calculating partial correlations in 8120 iterations.
Number of genes randomly selected in each iteration: 2000
Iteration: 8120/8120, Avg loop time: 0.0371 s, Elapsed time: 5.12 min, Estimated time left: 0.00 min.
All iterations completed.

Summarizing the FDR Statistics...
FDR control completed.
Current Pcor threshold: 0.020
Minimum Pcor threshold with FDR <= 0.05: 0.010
FDR at pcor=0.02: 0.00e+00

Task completed. Resources released.

[6]:

# Show statistically significant co-expression gene pairs.
print(ggm.SigEdges.head(5))

    GeneA   GeneB      Pcor  SamplingTime         r  Cell_num_A  Cell_num_B  \
0  ABCA10    AASS  0.021785           110  0.140313       17773       22553
1   ABCA6    AASS  0.026992            94  0.161891       30479       22553
2   ABCA6  ABCA10  0.170752           104  0.504233       30479       17773
3   ABCA8  ABCA10  0.096947           132  0.373468       21301       17773
4   ABCA8   ABCA6  0.097767           109  0.433015       21301       30479

   Cell_num_coexpressed       Project
0                  4225  Breast Tumor
1                  6462  Breast Tumor
2                 10677  Breast Tumor
3                  7385  Breast Tumor
4                 11000  Breast Tumor

[7]:

# Identify gene programs using the MCL-Hub algorithm (set inflation to 2).
ggm.find_modules(method = 'mcl-hub', inflation = 2)


Find modules using MCL-Hub...
Current Pcor: 0.02
Total significantly co-expressed gene pairs: 64309
Iteration: 29, Max change: 0.00000000
Converged at iteration 29.

2 modules were removed due to their linear or radial topology structure out of 108 modules.

[8]:

# Inspect the top 5 identified gene programs.
print(ggm.modules_summary.head(5))

  module_id  size  num_genes_degree_ge_2  \
0        M1   306                    239
1        M2   287                    251
2        M3   241                    207
3        M4   220                     32
4        M5   196                    161

                                           all_genes
0  GABRP, PROM1, CCL28, KRT15, KRT23, KIT, KLK5, ...
1  TRAC, CD3E, ACAP1, TMC8, CD52, TRBC2, ZAP70, S...
2  TNS4, COL17A1, MYLK, TP63, TAGLN, LAMB3, LTBP2...
3  SMG1, PKD1, EME2, SULT1A1, SNX29, MAPK8IP3, C1...
4  BTNL9, KDR, SHANK3, EXOC3L2, FLT1, RGCC, APOLD...

[9]:

# Visualize the subnetwork of program M3.
ggm.module_network_plot(module_id = 'M3', seed = 2, layout_iterations = 60)
# Fix layout randomness for reproducibility via set seed.

../_images/tutorials_Tutorial_6_Atera_Breast_Tumor_11_0.png

[10]:

# Gene Ontology (GO) enrichment analysis with BH FDR control and p-value threshold 0.05.
ggm.go_enrichment_analysis(species = 'human', padjust_method = 'BH', pvalue_cutoff = 0.05)


Reading GO term information for |human|...

Start GO enrichment analysis ...
Found 66 significant enriched GO terms in M106
GO enrichment analysis completed. Found 7609 significant enriched GO terms total.

[11]:

# Visualize top enriched GO terms for top 10 identified programs.
program_list = ggm.modules_summary['module_id'].tolist()
ggm.module_go_enrichment_plot(shown_modules = program_list[:10], go_per_module = 1)

../_images/tutorials_Tutorial_6_Atera_Breast_Tumor_13_0.png

[12]:

# Visualize the M3 network with nodes highlighted by a selected GO term.
M3_GO_Enrich = ggm.go_enrichment[ggm.go_enrichment['module_id'] == 'M3']
print(M3_GO_Enrich.iloc[:3, :6])
ggm.module_network_plot(module_id = 'M3', highlight_anno = 'focal adhesion', seed = 2, layout_iterations = 60)

    module_id  module_size  go_rank       go_id         go_category  \
265        M3          241        1  GO:0005925  cellular_component
266        M3          241        2  GO:0001725  cellular_component
267        M3          241        3  GO:0030018  cellular_component

            go_term
265  focal adhesion
266    stress fiber
267          Z disc

../_images/tutorials_Tutorial_6_Atera_Breast_Tumor_14_1.png

[13]:

# Print a summary of the GGM analysis.
print(ggm)

View of ggm object: Breast Tumor
MetaInfo:
  Gene Number: 18022
  Sample Number: 165301
  Pcor Threshold: 0.02

Results:
  SigEdges: DataFrame with 64309 significant gene pairs
  modules: 106 modules with 6008 genes
  modules_summary: DataFrame with 106 rows
  FDR: Exists

[14]:

# Save the GGM object to HDF5 for later reuse.
sg.save_ggm(ggm, 'data/Atera_Breast_Tumor.ggm.h5')

Part 2: Cell annotation based on program expression

[15]:

# Compute per-cell expression scores of each gene program.
sg.calculate_module_expression(adata, ggm)


Calculating module expression using top 30 genes...

Calculating gene weights based on degree...
Storing module information in adata.uns['module_info']...

Assigning colors to 106 modules...

Total 106 modules' average expression calculated and stored in adata.obs and adata.obsm

[16]:

# Visualize the spatial distribution of the top 20 program-expression scores.
plt.rcParams['figure.figsize'] = (7, 4)
program_list = ggm.modules_summary['module_id'] + '_exp'
sc.pl.spatial(adata, spot_size = 30, color = program_list[:20], cmap = 'Reds', ncols = 5)

../_images/tutorials_Tutorial_6_Atera_Breast_Tumor_19_0.png

[17]:

# Visualize the expression of M1 with cell boundary polygons.
sg.pl.spatial_polygon(adata, color = 'M1_exp', figsize = (40, 20), background = 'black')

../_images/tutorials_Tutorial_6_Atera_Breast_Tumor_20_0.png

[18]:

# Visualize the expression of M6 with cell boundary polygons.
sg.pl.spatial_polygon(adata, color = 'M6_exp', figsize = (40, 20), background = 'black')

../_images/tutorials_Tutorial_6_Atera_Breast_Tumor_21_0.png

[19]:

# Save the annotated AnnData object.
adata.write('data/Atera_Breast_Tumor_ggm_anno.h5ad')