Differential transcription usage testing for single cell data, using
colLabels
as cluster labels.
Usage
sc_DTU_analysis(
sce,
gene_col = "gene_id",
min_count = 15,
threads = 1,
method = "trascript usage permutation",
permuations = 1000
)
Arguments
- sce
The
SingleCellExperiment
object, with transcript counts in thecounts
slot and cluster labels in thecolLabels
slot.- gene_col
The column name in the
rowData
slot ofsce
that contains the gene ID / name. Default is"gene_id"
.- min_count
The minimum total counts for a transcript to be tested.
- threads
Number of threads to use for parallel processing.
- method
The method to use for testing, listed in
details
.- permuations
Number of permutations for permutation methods.
Value
a tibble
containing the following columns:
- p.value
- the raw p-value
- adj.p.value
- multiple testing adjusted p-value
- cluster
- the cluster where DTU was observed
- transcript
- rowname of
sce
, the DTU isoform- transcript_usage
- the transcript usage of the isoform in the cluster
Additional columns from method = "trascript usage permutation"
:
- transcript_usage_elsewhere
- transcript usage in other clusters
- usage_difference
- the difference between the two transcript usage
- permuted_var
- the variance of usage difference in the permuted data
Additional columns from method = "chisq"
:
- X_value
- the test statistic
- df
- the degrees of freedom
- expected_usage
- the expected usage (mean across all clusters)
- usage_difference
- the difference between the observed and expected usage
The table is sorted by P-values.
Details
Genes with more than 2 isoforms expressing more than min_count
counts
are selected for testing with one of the following methods:
- trascript usage permutation
Transcript usage are taken as the test statistic, cluster labels are permuted to generate a null distribution.
- chisq
Chi-square test of the transcript count matrix for each gene.
Adjusted P-values were calculated by Benjamini–Hochberg correction.
Examples
outdir <- tempfile()
dir.create(outdir)
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
destname = genome_fa, remove = FALSE
)
sce <- FLAMES::sc_long_pipeline(
genome_fa = genome_fa,
fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"),
annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
outdir = outdir,
barcodes_file = bc_allow,
config_file = create_config(outdir)
)
#> Writing configuration parameters to: /tmp/RtmpgpEV0i/file25bf39375faf/config_file_9663.json
#> Configured steps:
#> barcode_demultiplex: TRUE
#> genome_alignment: TRUE
#> gene_quantification: TRUE
#> isoform_identification: TRUE
#> read_realignment: TRUE
#> transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
#> Running step: barcode_demultiplex
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 8
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpgpEV0i/file25bf39375faf/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> Running step: genome_alignment
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /tmp/RtmpgpEV0i/file25bf39375faf/matched_reads.fastq -> /tmp/RtmpgpEV0i/file25bf39375faf/align2genome.bam
#> Your fastq file appears to have tags, but you did not provide the -y option to minimap2 to include the tags in the output.
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 8 threads...
#> Indexing bam files
#> Running step: gene_quantification
#> 03:23:26 AM Wed May 21 2025 quantify genes
#> Found genome alignment file(s): align2genome.bam
#> Running step: isoform_identification
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Running step: read_realignment
#> Checking for fastq file(s) /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf39375faf/matched_reads.fastq
#> files found
#> Checking for fastq file(s) /tmp/RtmpgpEV0i/file25bf39375faf/matched_reads_dedup.fastq
#> files found
#> Realigning sample /tmp/RtmpgpEV0i/file25bf39375faf/matched_reads_dedup.fastq -> /tmp/RtmpgpEV0i/file25bf39375faf/realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 8 with CB threads...
#> Running step: transcript_quantification
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Pipeline saved to /tmp/RtmpgpEV0i/file25bf39375faf/pipeline.rds
group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1)
SingleCellExperiment::colLabels(sce) <- group_anno$groups
# DTU with permutation testing:
sc_DTU_analysis(sce, min_count = 1, method = "trascript usage permutation")
#> Filtering for genes with at least 2 isforms expressing more than 1 counts ...
#> 1 gene(s), 8 transcript(s) left.
#>
|
| | 0%
|
|======================================================================| 100%
#>
#> # A tibble: 16 × 9
#> transcript cluster transcript_usage transcript_usage_els…¹ usage_difference
#> <chr> <fct> <dbl> <dbl> <dbl>
#> 1 ENSMUSG0000… FALSE 0.0909 0.0261 0.0648
#> 2 ENSMUSG0000… TRUE 0.0261 0.0909 -0.0648
#> 3 ENSMUST0000… FALSE 0.608 0.755 -0.147
#> 4 ENSMUST0000… TRUE 0.755 0.608 0.147
#> 5 ENSMUSG0000… FALSE 0.0568 0.0201 0.0367
#> 6 ENSMUSG0000… TRUE 0.0201 0.0568 -0.0367
#> 7 ENSMUSG0000… FALSE 0.119 0.0908 0.0285
#> 8 ENSMUSG0000… FALSE 0.0341 0.0234 0.0106
#> 9 ENSMUSG0000… FALSE 0.0568 0.0446 0.0122
#> 10 ENSMUSG0000… TRUE 0.0908 0.119 -0.0285
#> 11 ENSMUSG0000… TRUE 0.0234 0.0341 -0.0106
#> 12 ENSMUSG0000… TRUE 0.0446 0.0568 -0.0122
#> 13 ENSMUSG0000… FALSE 0.0227 0.0288 -0.00607
#> 14 ENSMUSG0000… TRUE 0.0288 0.0227 0.00607
#> 15 ENSMUST0000… FALSE 0.0114 0.0111 0.000224
#> 16 ENSMUST0000… TRUE 0.0111 0.0114 -0.000224
#> # ℹ abbreviated name: ¹transcript_usage_elsewhere
#> # ℹ 4 more variables: p.value <dbl>, permuted_var <dbl>, adj.p.value <dbl>,
#> # gene_id <chr>
# now try with chisq:
sc_DTU_analysis(sce, min_count = 1, method = "chisq")
#> Filtering for genes with at least 2 detected isforms ...
#> 8 isoform(s) left.
#> Aggregating counts by cluster labels ...
#>
|
| | 0%
|
|======================================================================| 100%
#>
#> # A tibble: 1 × 10
#> gene X_value df transcript cluster p.value expected_usage transcript_usage
#> <chr> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 ENSM… 12.8 7 ENSMUSG00… FALSE 0.0767 0.0421 0.0909
#> # ℹ 2 more variables: usage_difference <dbl>, adj.p.value <dbl>