Skip to contents

Differential transcription usage testing for single cell data, using colLabels as cluster labels.

Usage

sc_DTU_analysis(
  sce,
  gene_col = "gene_id",
  min_count = 15,
  threads = 1,
  method = "trascript usage permutation",
  permuations = 1000
)

Arguments

sce

The SingleCellExperiment object, with transcript counts in the counts slot and cluster labels in the colLabels slot.

gene_col

The column name in the rowData slot of sce that contains the gene ID / name. Default is "gene_id".

min_count

The minimum total counts for a transcript to be tested.

threads

Number of threads to use for parallel processing.

method

The method to use for testing, listed in details.

permuations

Number of permutations for permutation methods.

Value

a tibble containing the following columns:

p.value

- the raw p-value

adj.p.value

- multiple testing adjusted p-value

cluster

- the cluster where DTU was observed

transcript

- rowname of sce, the DTU isoform

transcript_usage

- the transcript usage of the isoform in the cluster

Additional columns from method = "trascript usage permutation":

transcript_usage_elsewhere

- transcript usage in other clusters

usage_difference

- the difference between the two transcript usage

permuted_var

- the variance of usage difference in the permuted data

Additional columns from method = "chisq":

X_value

- the test statistic

df

- the degrees of freedom

expected_usage

- the expected usage (mean across all clusters)

usage_difference

- the difference between the observed and expected usage

The table is sorted by P-values.

Details

Genes with more than 2 isoforms expressing more than min_count counts are selected for testing with one of the following methods:

trascript usage permutation

Transcript usage are taken as the test statistic, cluster labels are permuted to generate a null distribution.

chisq

Chi-square test of the transcript count matrix for each gene.

Adjusted P-values were calculated by Benjamini–Hochberg correction.

Examples

outdir <- tempfile()
dir.create(outdir)
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
  filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
  destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
  filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
  destname = genome_fa, remove = FALSE
)

sce <- FLAMES::sc_long_pipeline(
  genome_fa = genome_fa,
  fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"),
  annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
  outdir = outdir,
  barcodes_file = bc_allow,
  config_file = create_config(outdir)
)
#> Writing configuration parameters to:  /tmp/RtmpltU9Yg/file92d2327584ae/config_file_37586.json 
#> 04:17:11 AM Thu Apr 17 2025 Start running
#> 04:17:11 AM Thu Apr 17 2025 Demultiplexing using flexiplex...
#> Matching cell barcodes...
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 8
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpltU9Yg/file92d2327584ae/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> 04:17:11 AM Thu Apr 17 2025 Demultiplex done
#> Running FLAMES pipeline...
#> #### Input parameters:
#> {
#>   "comment": ["this is the default config for nanopore single cell long read data using 10x 3'end kit. use splice annotation in alignment. "],
#>   "pipeline_parameters": {
#>     "seed": [2022],
#>     "threads": [8],
#>     "do_barcode_demultiplex": [true],
#>     "do_gene_quantification": [true],
#>     "do_genome_alignment": [true],
#>     "do_isoform_identification": [true],
#>     "bambu_isoform_identification": [false],
#>     "multithread_isoform_identification": [false],
#>     "do_read_realignment": [true],
#>     "do_transcript_quantification": [true],
#>     "oarfish_quantification": [true]
#>   },
#>   "barcode_parameters": {
#>     "max_bc_editdistance": [2],
#>     "max_flank_editdistance": [8],
#>     "pattern": {
#>       "primer": ["CTACACGACGCTCTTCCGATCT"],
#>       "BC": ["NNNNNNNNNNNNNNNN"],
#>       "UMI": ["NNNNNNNNNNNN"],
#>       "polyT": ["TTTTTTTTT"]
#>     },
#>     "strand": ["-"],
#>     "TSO_seq": ["AAGCAGTGGTATCAACGCAGAGTACATGGG"],
#>     "TSO_prime": [5],
#>     "cutadapt_minimum_length": [10],
#>     "full_length_only": [false]
#>   },
#>   "isoform_parameters": {
#>     "generate_raw_isoform": [false],
#>     "max_dist": [10],
#>     "max_ts_dist": [100],
#>     "max_splice_match_dist": [10],
#>     "min_fl_exon_len": [40],
#>     "max_site_per_splice": [3],
#>     "min_sup_cnt": [5],
#>     "min_cnt_pct": [0.001],
#>     "min_sup_pct": [0.2],
#>     "bambu_ndr": [0.5],
#>     "bambu_verbose": [false],
#>     "bambu_trust_reference": [true],
#>     "strand_specific": [0],
#>     "remove_incomp_reads": [4],
#>     "downsample_ratio": [1]
#>   },
#>   "alignment_parameters": {
#>     "use_junctions": [true],
#>     "no_flank": [false]
#>   },
#>   "realign_parameters": {
#>     "use_annotation": [true]
#>   },
#>   "transcript_counting": {
#>     "min_tr_coverage": [0.4],
#>     "min_read_coverage": [0.4]
#>   }
#> } 
#> gene annotation: /__w/_temp/Library/FLAMES/extdata/rps24.gtf.gz 
#> genome fasta: /tmp/RtmpltU9Yg/file92d2327584ae/rps24.fa 
#> input fastq: /tmp/RtmpltU9Yg/file92d2327584ae/matched_reads.fastq 
#> output directory: /tmp/RtmpltU9Yg/file92d2327584ae 
#> minimap2 path: 
#> k8 path: 
#> #### Aligning reads to genome using minimap2
#> 04:17:11 AM Thu Apr 17 2025 minimap2_align
#> Sorting BAM files with 8 threads...
#> Indexing bam files
#> 04:17:11 AM Thu Apr 17 2025 Start gene quantification and UMI deduplication
#> 04:17:11 AM Thu Apr 17 2025 quantify genes 
#> Found genome alignment file(s): 	align2genome.bam
#> 04:17:11 AM Thu Apr 17 2025 Gene quantification and UMI deduplication done!
#> 04:17:11 AM Thu Apr 17 2025 Start isoform identificaiton
#> 04:17:11 AM Thu Apr 17 2025 find_isoform
#> 04:17:11 AM Thu Apr 17 2025 Isoform identificaiton done
#> #### Realigning deduplicated reads to transcript using minimap2
#> Running minimap2 with args: --eqx -N 100 -ax map-ont -y -t 8 
#> 04:17:11 AM Thu Apr 17 2025 minimap2_realign
#> Sorting by CB for oarfish quantifaction 
#> Sorting BAM file with 8 threads...
#> #### Generating transcript count matrix
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: The "phase" metadata column contains non-NA values for features of type
#>   stop_codon. This information was ignored.
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1)
SingleCellExperiment::colLabels(sce) <- group_anno$groups
# DTU with permutation testing:
sc_DTU_analysis(sce, min_count = 1, method = "trascript usage permutation")
#> Filtering for genes with at least 2 isforms expressing more than 1 counts ...
#> 	1 gene(s), 9 transcript(s) left.
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#> 
#> # A tibble: 18 × 9
#>    transcript   cluster transcript_usage transcript_usage_els…¹ usage_difference
#>    <chr>        <fct>              <dbl>                  <dbl>            <dbl>
#>  1 ENSMUST0000… FALSE            0.570                  0.746          -0.176   
#>  2 ENSMUST0000… TRUE             0.746                  0.570           0.176   
#>  3 ENSMUSG0000… FALSE            0.130                  0.0750          0.0555  
#>  4 ENSMUSG0000… FALSE            0.0538                 0.0208          0.0329  
#>  5 ENSMUSG0000… FALSE            0.0860                 0.0388          0.0473  
#>  6 ENSMUST0000… FALSE            0.0215                 0.00758         0.0139  
#>  7 ENSMUST0000… FALSE            0.0201                 0.00599         0.0141  
#>  8 ENSMUSG0000… TRUE             0.0750                 0.130          -0.0555  
#>  9 ENSMUSG0000… TRUE             0.0208                 0.0538         -0.0329  
#> 10 ENSMUSG0000… TRUE             0.0388                 0.0860         -0.0473  
#> 11 ENSMUST0000… TRUE             0.00758                0.0215         -0.0139  
#> 12 ENSMUST0000… TRUE             0.00599                0.0201         -0.0141  
#> 13 ENSMUSG0000… FALSE            0.0538                 0.0455          0.00830 
#> 14 ENSMUSG0000… TRUE             0.0455                 0.0538         -0.00830 
#> 15 ENSMUSG0000… FALSE            0.0108                 0.00758         0.00318 
#> 16 ENSMUSG0000… FALSE            0.0538                 0.0530          0.000732
#> 17 ENSMUSG0000… TRUE             0.00758                0.0108         -0.00318 
#> 18 ENSMUSG0000… TRUE             0.0530                 0.0538         -0.000732
#> # ℹ abbreviated name: ¹​transcript_usage_elsewhere
#> # ℹ 4 more variables: p.value <dbl>, permuted_var <dbl>, adj.p.value <dbl>,
#> #   gene_id <chr>
# now try with chisq:
sc_DTU_analysis(sce, min_count = 1, method = "chisq")
#> Filtering for genes with at least 2 detected isforms ...
#> 	9 isoform(s) left.
#> Aggregating counts by cluster labels ...
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#> 
#> # A tibble: 1 × 10
#>   gene  X_value    df transcript cluster p.value expected_usage transcript_usage
#>   <chr>   <dbl> <int> <chr>      <chr>     <dbl>          <dbl>            <dbl>
#> 1 ENSM…    13.7     8 ENSMUST00… FALSE    0.0888          0.700            0.570
#> # ℹ 2 more variables: usage_difference <dbl>, adj.p.value <dbl>