harpy.io.read_transcripts

harpy.io.read_transcripts(sdata, path_count_matrix, transform_matrix=None, pixel_size=None, output_points_name='transcripts', overwrite=False, column_x=0, column_y=1, column_z=None, column_gene=3, column_midcount=None, delimiter=',', header=None, comment=None, crd=None, to_coordinate_system='global', to_micron_coordinate_system=None, filter_gene_names=None, blocksize='64MB')

Reads transcript information from a file in which each row lists the x and y coordinates and the gene name of a transcript.

If a transform matrix is provided, an affine transformation from micron to pixels is applied to the transcript coordinates. The transformation is applied to the dask dataframe before it is added to sdata. The SpatialData object is augmented with a points element named output_points_name that contains the transcripts.
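To make the coordinate handling concrete, the following minimal sketch applies a 3x3 affine matrix to micron coordinates with NumPy and then inverts it to recover microns. It only illustrates the arithmetic and is not harpy's implementation; the scale and offset values are made up for illustration.

```python
import numpy as np

# Hypothetical values: 2 pixels per micron, plus an offset of (10, 20) pixels.
transform_matrix = np.array(
    [
        [2.0, 0.0, 10.0],  # Sx, 0, Tx
        [0.0, 2.0, 20.0],  # 0, Sy, Ty
        [0.0, 0.0, 1.0],
    ]
)

# Transcript positions in micron, one (x, y) pair per row.
xy_micron = np.array([[0.0, 0.0], [5.0, 2.5]])

# Append a column of ones (homogeneous coordinates) and apply the matrix.
xy1_micron = np.column_stack([xy_micron, np.ones(len(xy_micron))])
xy_pixel = (xy1_micron @ transform_matrix.T)[:, :2]

# The inverse matrix maps pixels back to micron.
inverse = np.linalg.inv(transform_matrix)
xy1_pixel = np.column_stack([xy_pixel, np.ones(len(xy_pixel))])
xy_back = (xy1_pixel @ inverse.T)[:, :2]
```

Here the point (5.0, 2.5) in micron maps to (20.0, 25.0) in pixels, and applying the inverse returns the original micron coordinates.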

Parameters:
  • sdata (SpatialData) – The SpatialData object to which the transcripts will be added.

  • path_count_matrix (str | Path) – Path to a .parquet or .csv file containing the transcript information. Each row should contain an x coordinate (column_x), a y coordinate (column_y) and a gene name (column_gene). Optionally, a count column (see column_midcount) can be provided.

  • transform_matrix (str | Path | ndarray | None (default: None)) –

    A 3x3 matrix, given as a numpy array, that defines the affine transformation from micron to pixels. It is applied to the x and y coordinates of the transcripts before they are added to sdata as a points element in the coordinate system to_coordinate_system. In that case, to_coordinate_system is the intrinsic pixel coordinate system.

    Example transform matrix:

    Sx  0   Tx
    0   Sy  Ty
    0   0   1

    If no transform matrix is specified, transform_matrix defaults to the identity matrix, and the transcript coordinates are assumed to be in pixels already. If transform_matrix is given as a path to a file, the file is read via numpy.genfromtxt. If to_micron_coordinate_system is specified and transform_matrix is not the identity matrix, a micron coordinate system is added at to_micron_coordinate_system, with the inverse of transform_matrix defined on it (the inverse is the transformation from pixels to micron).

  • pixel_size (int | None (default: None)) – Size of the pixels in micron. This can only be specified if ‘transform_matrix’ is the identity matrix or ‘None’ (i.e. the transcripts are already in pixels). If ‘to_micron_coordinate_system’ is specified, a micron coordinate system is added; otherwise this parameter is ignored.

  • output_points_name (str (default: 'transcripts')) – Name of the points element of the SpatialData object to which the transcripts will be added.

  • overwrite (bool (default: False)) – If True, overwrites the points element output_points_name if it already exists.

  • column_x (int (default: 0)) – Column index of the X coordinate in the count matrix.

  • column_y (int (default: 1)) – Column index of the Y coordinate in the count matrix.

  • column_z (int | None (default: None)) – Column index of the Z coordinate in the count matrix.

  • column_gene (int (default: 3)) – Column index of the gene information in the count matrix.

  • column_midcount (int | None (default: None)) – Specifies the column index that contains the count of how many times the gene is detected at that particular location. Ignored when set to None.

  • delimiter (str (default: ',')) – Delimiter used to separate values in the .csv file. Ignored if path_count_matrix is a .parquet file.

  • header (int | None (default: None)) – Row number to use as the header in the .csv file. If None, no header is used. Ignored if path_count_matrix is a .parquet file.

  • comment (str | None (default: None)) – Character indicating that the remainder of line should not be parsed. If found at the beginning of a line, the line will be ignored altogether. This parameter must be a single character. Ignored if path_count_matrix is a .parquet file.

  • crd (tuple[int, int, int, int] | None (default: None)) – The coordinates (in pixels) for the region of interest in the format (xmin, xmax, ymin, ymax). If None, all transcripts are considered.

  • to_coordinate_system (str (default: 'global')) – Name of the intrinsic (pixel) coordinate system to which output_points_name will be added.

  • to_micron_coordinate_system (str | None (default: None)) – Name of the micron coordinate system to which output_points_name will be added. Only used if ‘transform_matrix’ is not the identity matrix or ‘pixel_size’ is not ‘None’.

  • filter_gene_names (str | list[str] | None (default: None)) – Gene names to filter out (matched via str.contains), typically control genes that were added and that you don’t want to use. Filtering is case insensitive.

  • blocksize (str (default: '64MB')) – Block size of the partitions of the dask dataframe stored as output_points_name in sdata.
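The interplay of the column indices, crd and filter_gene_names can be illustrated with a small standard-library sketch. This is not how harpy processes the file (it works on a dask dataframe); the file content is hypothetical, the crd bounds are assumed to be inclusive, and a plain substring test stands in for str.contains.

```python
import csv
import io

# Hypothetical file content: x, y, z, gene (matching the default column indices).
raw = """100,200,0,GeneA
150,250,0,NegControl1
900,950,0,GeneB
120,220,0,GeneA
"""

column_x, column_y, column_gene = 0, 1, 3
crd = (0, 500, 0, 500)  # (xmin, xmax, ymin, ymax) in pixels
filter_gene_names = ["negcontrol"]  # matched case-insensitively

xmin, xmax, ymin, ymax = crd
kept = []
for row in csv.reader(io.StringIO(raw), delimiter=","):
    x, y, gene = float(row[column_x]), float(row[column_y]), row[column_gene]
    # Assumption: the region bounds are treated as inclusive in this sketch.
    inside = xmin <= x <= xmax and ymin <= y <= ymax
    # Case-insensitive substring match, standing in for str.contains.
    is_filtered = any(name.lower() in gene.lower() for name in filter_gene_names)
    if inside and not is_filtered:
        kept.append((x, y, gene))
```

In this sketch the control gene and the transcript outside the region of interest are dropped, leaving the two GeneA rows.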

Raises:

ValueError – If pixel_size is not None, and transform_matrix is not equal to the identity matrix.
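The constraint behind this error can be checked up front. The helper below is a hypothetical sketch that mirrors the documented rule; its name and error message are assumptions and do not come from harpy.

```python
import numpy as np


def check_pixel_size(transform_matrix, pixel_size):
    """Hypothetical helper mirroring the documented constraint."""
    # A missing transform matrix is documented to mean the identity matrix.
    if transform_matrix is None:
        transform_matrix = np.eye(3)
    is_identity = np.array_equal(np.asarray(transform_matrix), np.eye(3))
    if pixel_size is not None and not is_identity:
        raise ValueError(
            "pixel_size can only be set when transform_matrix is None or the identity."
        )
```

Calling check_pixel_size(None, 1) passes silently, whereas combining a non-identity matrix with a pixel size raises ValueError.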

Return type:

SpatialData

Returns:

The updated SpatialData object containing the transcripts.
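A call might look like the following sketch. It assumes the package is importable as harpy and that sdata is an existing SpatialData object; the file content, the matrix values and the helper name add_transcripts are made up for illustration. The harpy import sits inside the helper so that the setup code runs even where harpy is not installed.

```python
import tempfile
from pathlib import Path

import numpy as np

# Write a tiny comma-separated file with a header row: x, y, gene (in micron).
tmp_dir = Path(tempfile.mkdtemp())
csv_path = tmp_dir / "transcripts.csv"
csv_path.write_text("x,y,gene\n1.0,2.0,GeneA\n3.5,4.0,GeneB\n")

# Hypothetical micron -> pixel transform (2 pixels per micron, no offset).
transform_matrix = np.array(
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]
)


def add_transcripts(sdata):
    """Hypothetical wrapper: add the transcripts in csv_path to sdata."""
    import harpy as hp  # assumption: the package is importable under this name

    return hp.io.read_transcripts(
        sdata,
        path_count_matrix=csv_path,
        transform_matrix=transform_matrix,
        output_points_name="transcripts",
        column_x=0,
        column_y=1,
        column_gene=2,  # the gene name is the third column in this file
        header=0,  # the first row holds the column names
        to_coordinate_system="global",
        to_micron_coordinate_system="micron",
        overwrite=True,
    )
```

Because transform_matrix is not the identity here, pixel_size is left unset, in line with the ValueError described above.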