Quality Control on in-silico spatial proteomics with Harpy

Quality Control on in-silico spatial proteomics with Harpy#

Introduction#

This noteboook outlines some quality controls steps for highly-multiplexed spatial proteomics data using the Harpy package. Most steps are similar to this good resource on Imaging Mass Cytometry data analysis in R from the Bodenmiller lab. Here, an artificial in-silico dataset is used to demonstrate the quality control steps.

The levels of quality control are:

  • Image level: to give an overview of the quality of the images in the dataset.

  • Segmentation level: to give an overview of the quality of the segmentation in the dataset.

  • Cell level: to give an overview of the quality of the calculated features per cell in the dataset.

We start by loading in an example dataset and visualising the data:

import harpy
import matplotlib.pyplot as plt
import spatialdata_plot  # noqa
import scanpy as sc
from harpy.datasets import multisample_blobs

plt.viridis()

sdata = multisample_blobs(n_samples=4)
sdata
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload
INFO     no axes information specified in the object, setting `dims` to: ('c', 'y', 'x')                           
INFO     no axes information specified in the object, setting `dims` to: ('y', 'x')
INFO     no axes information specified in the object, setting `dims` to: ('c', 'y', 'x')                           
INFO     no axes information specified in the object, setting `dims` to: ('y', 'x')
INFO     no axes information specified in the object, setting `dims` to: ('c', 'y', 'x')                           
INFO     no axes information specified in the object, setting `dims` to: ('y', 'x')
INFO     no axes information specified in the object, setting `dims` to: ('c', 'y', 'x')                           
INFO     no axes information specified in the object, setting `dims` to: ('y', 'x')
SpatialData object
├── Images
│     ├── 'sample_0_image': DataArray[cyx] (11, 512, 512)
│     ├── 'sample_1_image': DataArray[cyx] (11, 512, 512)
│     ├── 'sample_2_image': DataArray[cyx] (11, 512, 512)
│     └── 'sample_3_image': DataArray[cyx] (11, 512, 512)
├── Labels
│     ├── 'sample_0_labels': DataArray[yx] (512, 512)
│     ├── 'sample_1_labels': DataArray[yx] (512, 512)
│     ├── 'sample_2_labels': DataArray[yx] (512, 512)
│     └── 'sample_3_labels': DataArray[yx] (512, 512)
├── Points
│     ├── 'sample_0_points': DataFrame with shape: (<Delayed>, 2) (2D points)
│     ├── 'sample_1_points': DataFrame with shape: (<Delayed>, 2) (2D points)
│     ├── 'sample_2_points': DataFrame with shape: (<Delayed>, 2) (2D points)
│     └── 'sample_3_points': DataFrame with shape: (<Delayed>, 2) (2D points)
└── Tables
      ├── 'sample_0_table': AnnData (20, 11)
      ├── 'sample_1_table': AnnData (20, 11)
      ├── 'sample_2_table': AnnData (20, 11)
      ├── 'sample_3_table': AnnData (20, 11)
      └── 'table': AnnData (80, 11)
with coordinate systems:
    ▸ 'sample_0', with elements:
        sample_0_image (Images), sample_0_labels (Labels), sample_0_points (Points)
    ▸ 'sample_1', with elements:
        sample_1_image (Images), sample_1_labels (Labels), sample_1_points (Points)
    ▸ 'sample_2', with elements:
        sample_2_image (Images), sample_2_labels (Labels), sample_2_points (Points)
    ▸ 'sample_3', with elements:
        sample_3_image (Images), sample_3_labels (Labels), sample_3_points (Points)
<Figure size 640x480 with 0 Axes>
table = sdata["table"]
table.to_df()
nucleus lineage_0 lineage_1 lineage_2 lineage_3 lineage_4 lineage_5 lineage_6 lineage_7 lineage_8 lineage_9
cells
1 130579.074583 73143.384718 90176.329291 78887.342227 83547.398858 0.000000 85104.954042 89203.618952 71052.452962 166453.784109 79783.723973
2 97747.257070 44127.352051 54403.315415 47592.677537 50404.086393 0.000000 51343.758329 103611.315408 42865.894408 48557.293269 48133.464008
3 129783.693347 72871.920340 89159.713324 77997.994298 82605.515103 0.000000 132335.765591 88197.968976 70251.432797 79581.243727 78884.270554
4 126967.536598 69764.494608 86010.594981 159287.142678 79687.890840 0.000000 81180.370216 85082.819414 67770.154343 76768.146443 76776.118424
5 107214.353348 52460.254002 64676.705321 56579.963133 124573.992918 0.000000 61039.388909 63979.053282 50960.585762 57726.734557 57222.870407
... ... ... ... ... ... ... ... ... ... ... ...
16 161696.495625 84635.637309 175638.724282 105850.500384 102504.128585 88371.831230 106509.552476 97690.881838 97966.012854 96128.415489 0.000000
17 163295.271990 86338.942986 93873.258186 107784.066049 104567.040494 90150.328405 108653.074179 188483.571632 99937.594462 98063.015157 0.000000
18 153201.459416 75869.389265 82490.085245 94714.053483 91887.127930 80816.560758 95477.684746 87572.419579 175560.397181 86171.787748 0.000000
19 154421.629069 153238.714827 83799.306745 96217.284748 93345.492327 80475.900902 96993.035800 88962.304123 89212.852472 88883.535945 0.000000
20 161275.579840 84186.643485 175348.640010 105097.641133 101960.342041 87903.016819 105944.517075 97172.629669 97446.301109 95618.452236 0.000000

80 rows × 11 columns

table.var
cycle
nucleus 0
lineage_0 0
lineage_1 1
lineage_2 1
lineage_3 2
lineage_4 2
lineage_5 3
lineage_6 3
lineage_7 4
lineage_8 4
lineage_9 5
table.obs
instance_id region fov_labels cell_ID phenotype area eccentricity major_axis_length minor_axis_length perimeter centroid-0 centroid-1 convex_area equivalent_diameter _major_minor_axis_ratio _perim_square_over_area _major_axis_equiv_diam_ratio _convex_hull_resid _centroid_dif
cells
1 1 by sample_0_labels 1 8 2393.0 0.000000 55.199096 55.199096 180.852814 482.000000 410.000000 2433.0 55.198390 1.000000 13.668090 1.000013 0.016441 0.000000
2 2 by sample_0_labels 2 6 1489.0 0.840912 60.858245 32.934784 160.083261 161.000000 12.721961 1519.0 43.541402 1.847841 17.210645 1.397710 0.019750 0.003520
3 3 by sample_0_labels 3 5 2332.0 0.296286 55.821777 53.315330 179.338095 369.522298 388.423671 2375.0 54.490317 1.047012 13.791661 1.024435 0.018105 0.001060
4 4 by sample_0_labels 4 2 2352.0 0.249266 55.649195 53.892628 180.852814 64.180697 106.579082 2397.0 54.723481 1.032594 13.906352 1.016916 0.018773 0.001168
5 5 by sample_0_labels 5 3 1706.0 0.765745 59.549068 38.298639 166.083261 216.000000 495.966002 1734.0 46.606294 1.554861 16.168611 1.277704 0.016148 0.002416
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
16 16 by sample_3_labels 16 1 2856.0 0.198505 60.937241 59.724580 198.409163 328.041317 117.297969 2911.0 60.302339 1.020304 13.783682 1.010529 0.018894 0.001298
17 17 by sample_3_labels 17 6 2885.0 0.000000 60.607577 60.607577 197.823376 245.000000 470.000000 2933.0 60.607723 1.000000 13.564675 0.999998 0.016365 0.000000
18 18 by sample_3_labels 18 7 2616.0 0.508391 62.616366 53.920614 194.793939 110.995413 406.372706 2680.0 57.713037 1.161270 14.504847 1.084961 0.023881 0.003410
19 19 by sample_3_labels 19 0 2665.0 0.466229 62.247488 55.068120 193.036580 227.496060 153.503940 2703.0 58.251038 1.130372 13.982409 1.068607 0.014058 0.001451
20 20 by sample_3_labels 20 1 2844.0 0.237913 61.101562 59.347125 196.509668 27.427215 169.000000 2890.0 60.175520 1.029562 13.578077 1.015389 0.015917 0.000612

80 rows × 19 columns

selected_markers = table.var_names[:3].tolist()
selected_markers
['nucleus', 'lineage_0', 'lineage_1']
sdata.pl.render_images(channel=selected_markers).pl.render_labels().pl.show()
../../_images/bd1871a1bbefe5281e8025fbf6de9b66453dc4b5b72975accc9a53e2c7250ba8.png

Image level quality control#

table.var["cycle"]
nucleus      0
lineage_0    0
lineage_1    1
lineage_2    1
lineage_3    2
lineage_4    2
lineage_5    3
lineage_6    3
lineage_7    4
lineage_8    4
lineage_9    5
Name: cycle, dtype: int64
df = harpy.pl.calculate_snr_ratio(sdata, cycles="cycle")
df
image cycle channel snr signal
0 sample_0_image 0 nucleus 3.287837 73.395618
1 sample_0_image 0 lineage_0 3.350054 60.223873
2 sample_0_image 1 lineage_1 3.341242 71.868568
3 sample_0_image 1 lineage_2 3.339835 62.863215
4 sample_0_image 2 lineage_3 3.341236 66.354626
5 sample_0_image 2 lineage_4 NaN NaN
6 sample_0_image 3 lineage_5 3.328766 68.452983
7 sample_0_image 3 lineage_6 3.333190 71.719604
8 sample_0_image 4 lineage_7 3.341090 56.638927
9 sample_0_image 4 lineage_8 3.346352 65.247809
10 sample_0_image 5 lineage_9 3.351038 65.531110
11 sample_1_image 0 nucleus 3.317217 68.429706
12 sample_1_image 0 lineage_0 3.354614 57.989819
13 sample_1_image 1 lineage_1 3.347276 68.978197
14 sample_1_image 1 lineage_2 3.353919 63.870013
15 sample_1_image 2 lineage_3 3.345361 55.801995
16 sample_1_image 2 lineage_4 3.340518 72.187184
17 sample_1_image 3 lineage_5 3.348367 66.388941
18 sample_1_image 3 lineage_6 3.345955 63.479419
19 sample_1_image 4 lineage_7 3.345319 63.756732
20 sample_1_image 4 lineage_8 3.354425 67.277019
21 sample_1_image 5 lineage_9 3.351014 68.710552
22 sample_2_image 0 nucleus 3.285106 72.556726
23 sample_2_image 0 lineage_0 3.346338 66.941166
24 sample_2_image 1 lineage_1 3.348429 58.152628
25 sample_2_image 1 lineage_2 3.354761 64.411135
26 sample_2_image 2 lineage_3 3.342676 62.753927
27 sample_2_image 2 lineage_4 3.343551 62.706414
28 sample_2_image 3 lineage_5 3.346515 65.341556
29 sample_2_image 3 lineage_6 3.346694 68.759239
30 sample_2_image 4 lineage_7 3.340821 65.563372
31 sample_2_image 4 lineage_8 3.342025 63.964066
32 sample_2_image 5 lineage_9 3.342700 61.233602
33 sample_3_image 0 nucleus 3.409306 74.062011
34 sample_3_image 0 lineage_0 3.354025 59.879868
35 sample_3_image 1 lineage_1 3.355139 65.147514
36 sample_3_image 1 lineage_2 3.346811 72.394797
37 sample_3_image 2 lineage_3 3.345158 69.136590
38 sample_3_image 2 lineage_4 3.347905 60.523140
39 sample_3_image 3 lineage_5 3.344262 71.837665
40 sample_3_image 3 lineage_6 3.349851 66.981595
41 sample_3_image 4 lineage_7 3.344799 66.048378
42 sample_3_image 4 lineage_8 3.351530 66.949340
43 sample_3_image 5 lineage_9 NaN NaN
df = df.groupby(["image", "channel"]).mean(numeric_only=True)
harpy.pl.snr_ratio(sdata, color="cycle")
<Axes: xlabel='Signal intensity', ylabel='Signal-to-noise ratio'>
../../_images/ba1ad37cba436b8c0ee982da170a5f816310c43d60cb497fd78234c6081559ad.png
harpy.pl.group_snr_ratio(sdata, groupby=["image", "channel"], color="cycle")
array([[<Axes: title={'center': 'sample_0_image'}, xlabel='Signal intensity', ylabel='Signal-to-noise ratio'>,
        <Axes: title={'center': 'sample_1_image'}, xlabel='Signal intensity', ylabel='Signal-to-noise ratio'>],
       [<Axes: title={'center': 'sample_2_image'}, xlabel='Signal intensity', ylabel='Signal-to-noise ratio'>,
        <Axes: title={'center': 'sample_3_image'}, xlabel='Signal intensity', ylabel='Signal-to-noise ratio'>]],
      dtype=object)
../../_images/d3119b2ce0caf56da15a243d9959cfd2b4d3acf6feaf2c24471672a780e49124.png
harpy.pl.snr_ratio(sdata, signal_threshold=2)
<Axes: xlabel='Signal intensity', ylabel='Signal-to-noise ratio'>
../../_images/7639e6a1ec1e3c38ad6d239249255b473e020899b73379083d01a5bbc6e83743.png

The plots above summarize all samples together. It would be interesting to look without cell segmentation bias across all channels and samples. There are multiple ways to aggregate the channel signal on an image level to a heatmap. One way is to create heatmaps using the SNR and signal values calculated above. This again depends on the unsupervised Otsu thresholding and is not a very good proxy of a good cell segmentation.

When showing the heatmap you could use the z_score or standard_scale options, but it’s also interesting not to transform the data too much in order to still visualize the outliers.

harpy.pl.signal_clustermap(sdata, signal_threshold=2, figsize=(12, 10))
<seaborn.matrix.ClusterGrid at 0x37bb615d0>
../../_images/13009915e354519d451e2c8488fe8e11c14a9664b17bc56f6b509a5a44377558.png
harpy.pl.snr_clustermap(sdata, signal_threshold=2, figsize=(12, 10))
<seaborn.matrix.ClusterGrid at 0x37cd5d600>
../../_images/999abef90c7977952e63c9fa2e2a555906c8600e7a3ffb25c0820ffc7c7a4ae5.png

Another way is to normalize the image with a quartile normalization. The min and max quartile can greatly influence how the ends of the intensity distributions. Here we clip the signal below and above the 5th and 95th percentile. We also apply an arcsinh transformation to each channel against extreme outliers to make the heatmap more interpretable.

df_norm = harpy.pl.calculate_mean_norm(
    sdata, overwrite=True, q_min=5, q_max=95, c_mask=selected_markers[0]
)
df_norm
lineage_0 lineage_1 lineage_2 lineage_3 lineage_4 lineage_5 lineage_6 lineage_7 lineage_8 lineage_9
image_name
sample_0_image 0.331322 0.395357 0.350663 0.368848 0.000000 0.379894 0.395442 0.316976 0.359809 0.359169
sample_1_image 0.325366 0.387694 0.357238 0.318276 0.402458 0.372283 0.359916 0.361338 0.373854 0.383336
sample_2_image 0.376520 0.330558 0.358838 0.357882 0.356515 0.370799 0.385876 0.373850 0.362120 0.349794
sample_3_image 0.323949 0.350711 0.391374 0.378530 0.332451 0.391408 0.365116 0.363468 0.362484 0.000000
# harpy.pl.clustermap(df_norm, row_colors=harpy.pl.make_cols_colors(df_metadata), figsize=(12, 10))
harpy.pl.clustermap(
    df_norm, col_colors=harpy.pl.make_cols_colors(table.var), figsize=(12, 10)
)
<seaborn.matrix.ClusterGrid at 0x37ce4b4f0>
../../_images/6be94b1a8eb4f69fb8ae4c53c9da0dfba81039c27910b132e74286f331d7a8a7.png

Segmentation level quality control#

harpy.pl.segmentation_coverage(sdata)
<Axes: xlabel='Percentile of covered area', ylabel='name'>
../../_images/13f6208cd6898f3c2dcd24ac5b8c55955fd87df8e46263fef917863ed8dd9d3a.png
sdata.pl.render_images(channel=selected_markers).pl.render_labels().pl.show(
    coordinate_systems=["sample_1", "sample_2"]
)
../../_images/a835cb1ee88cf6432171bc3c8d19450e2a499a63ee1bd64a332c0713596ae6fd.png
harpy.pl.segmentation_size_boxplot(sdata, sample_key="fov_labels")
<Axes: title={'center': 'area'}>
../../_images/64af11a67858d98cf3b2f3e970e8f69726c3080e603568576ee9457e2831d601.png
table.obs["area"].describe()
count      80.000000
mean     2349.425000
std       414.222095
min      1489.000000
25%      2069.000000
50%      2437.000000
75%      2593.500000
max      2890.000000
Name: area, dtype: float64
harpy.pl.calculate_segments_per_area(sdata, sample_key="fov_labels")
fov_labels cells_per_mm2
fov_labels
sample_0_labels 20 1000
sample_1_labels 20 1000
sample_2_labels 20 1000
sample_3_labels 20 1000
harpy.pl.segments_per_area(sdata, sample_key="fov_labels")
<Axes: xlabel='fov_labels'>
../../_images/70838ea967b7ed6c3d41a5194a7b7739260e47f11f1f5b806c7e7a220d975af4.png

Cell level quality control#

import numpy as np

table.layers["arcsinh"] = np.arcsinh(table.to_df())
table
AnnData object with n_obs × n_vars = 80 × 11
    obs: 'instance_id', 'region', 'fov_labels', 'cell_ID', 'phenotype', 'area', 'eccentricity', 'major_axis_length', 'minor_axis_length', 'perimeter', 'centroid-0', 'centroid-1', 'convex_area', 'equivalent_diameter', '_major_minor_axis_ratio', '_perim_square_over_area', '_major_axis_equiv_diam_ratio', '_convex_hull_resid', '_centroid_dif'
    var: 'cycle'
    layers: 'arcsinh'
table.to_df()
nucleus lineage_0 lineage_1 lineage_2 lineage_3 lineage_4 lineage_5 lineage_6 lineage_7 lineage_8 lineage_9
cells
1 130579.074583 73143.384718 90176.329291 78887.342227 83547.398858 0.000000 85104.954042 89203.618952 71052.452962 166453.784109 79783.723973
2 97747.257070 44127.352051 54403.315415 47592.677537 50404.086393 0.000000 51343.758329 103611.315408 42865.894408 48557.293269 48133.464008
3 129783.693347 72871.920340 89159.713324 77997.994298 82605.515103 0.000000 132335.765591 88197.968976 70251.432797 79581.243727 78884.270554
4 126967.536598 69764.494608 86010.594981 159287.142678 79687.890840 0.000000 81180.370216 85082.819414 67770.154343 76768.146443 76776.118424
5 107214.353348 52460.254002 64676.705321 56579.963133 124573.992918 0.000000 61039.388909 63979.053282 50960.585762 57726.734557 57222.870407
... ... ... ... ... ... ... ... ... ... ... ...
16 161696.495625 84635.637309 175638.724282 105850.500384 102504.128585 88371.831230 106509.552476 97690.881838 97966.012854 96128.415489 0.000000
17 163295.271990 86338.942986 93873.258186 107784.066049 104567.040494 90150.328405 108653.074179 188483.571632 99937.594462 98063.015157 0.000000
18 153201.459416 75869.389265 82490.085245 94714.053483 91887.127930 80816.560758 95477.684746 87572.419579 175560.397181 86171.787748 0.000000
19 154421.629069 153238.714827 83799.306745 96217.284748 93345.492327 80475.900902 96993.035800 88962.304123 89212.852472 88883.535945 0.000000
20 161275.579840 84186.643485 175348.640010 105097.641133 101960.342041 87903.016819 105944.517075 97172.629669 97446.301109 95618.452236 0.000000

80 rows × 11 columns

sc.tl.dendrogram(table, groupby="fov_labels")
sc.pl.heatmap(
    table,
    groupby="fov_labels",
    # var_names=used_var_names,
    var_names=table.var_names,
    swap_axes=True,
    dendrogram=True,
    figsize=(12, 10),
)
../../_images/934e437df789667864edea7dd9d9dfaabae9e23e6b229a0cf4471330d08164ff.png
harpy.pl.ridgeplot_channel_sample(table, y="fov_labels", value_vars=selected_markers)
../../_images/13a8ec24c72a6eaf585531f0b21b7f0dd401dc6bcd4a46b543606fe4a5bce4f8.png ../../_images/45ba1a33f91078ad9770c7221e431fb369e016b7f77c4b792ebac18e63288c02.png ../../_images/849193b01448b7617ff1585faff554f4ae3b6a290de770e942165a7518b37ba7.png
sc.pp.neighbors(table, n_neighbors=10, random_state=42)
sc.tl.umap(table, random_state=42)
markers = ["fov_labels", "area"]
fig, axes = plt.subplots(len(markers), 1, figsize=(10, 30))
for c, axs in zip(markers, axes, strict=False):
    sc.pl.umap(table, color=c, ax=axs, show=False)
../../_images/56a5cd1756da8133dcc6094bbf658d65267a73f077c69b53e630f4ee4bbd694c.png
markers = selected_markers
fig, axes = plt.subplots(1, len(markers), figsize=(5 * len(markers), 10))
for c, axs in zip(markers, axes, strict=False):
    sc.pl.umap(table, color=c, ax=axs, show=False)
../../_images/f35d288565d51c6f553f416a7b81458b1f266556e58bb89b5c9ca4144bb66d5f.png