Differential transcription usage testing for single cell data, using
colLabels
as cluster labels.
Usage
sc_DTU_analysis(
sce,
gene_col = "gene_id",
min_count = 15,
threads = 1,
method = "trascript usage permutation",
permuations = 1000
)
Arguments
- sce
The
SingleCellExperiment
object, with transcript counts in thecounts
slot and cluster labels in thecolLabels
slot.- gene_col
The column name in the
rowData
slot ofsce
that contains the gene ID / name. Default is"gene_id"
.- min_count
The minimum total counts for a transcript to be tested.
- threads
Number of threads to use for parallel processing.
- method
The method to use for testing, listed in
details
.- permuations
Number of permutations for permutation methods.
Value
a tibble
containing the following columns:
- p.value
- the raw p-value
- adj.p.value
- multiple testing adjusted p-value
- cluster
- the cluster where DTU was observed
- transcript
- rowname of
sce
, the DTU isoform- transcript_usage
- the transcript usage of the isoform in the cluster
Additional columns from method = "trascript usage permutation"
:
- transcript_usage_elsewhere
- transcript usage in other clusters
- usage_difference
- the difference between the two transcript usage
- permuted_var
- the variance of usage difference in the permuted data
Additional columns from method = "chisq"
:
- X_value
- the test statistic
- df
- the degrees of freedom
- expected_usage
- the expected usage (mean across all clusters)
- usage_difference
- the difference between the observed and expected usage
The table is sorted by P-values.
Details
Genes with more than 2 isoforms expressing more than min_count
counts
are selected for testing with one of the following methods:
- trascript usage permutation
Transcript usage are taken as the test statistic, cluster labels are permuted to generate a null distribution.
- chisq
Chi-square test of the transcript count matrix for each gene.
Adjusted P-values were calculated by Benjamini–Hochberg correction.
Examples
outdir <- tempfile()
dir.create(outdir)
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
destname = genome_fa, remove = FALSE
)
sce <- FLAMES::sc_long_pipeline(
genome_fa = genome_fa,
fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"),
annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
outdir = outdir,
barcodes_file = bc_allow,
config_file = create_config(outdir)
)
#> Writing configuration parameters to: /tmp/RtmpltU9Yg/file92d2327584ae/config_file_37586.json
#> 04:17:11 AM Thu Apr 17 2025 Start running
#> 04:17:11 AM Thu Apr 17 2025 Demultiplexing using flexiplex...
#> Matching cell barcodes...
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be replaced
#> Setting number of threads to 8
#> Search pattern:
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/RtmpltU9Yg/file92d2327584ae/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> 04:17:11 AM Thu Apr 17 2025 Demultiplex done
#> Running FLAMES pipeline...
#> #### Input parameters:
#> {
#> "comment": ["this is the default config for nanopore single cell long read data using 10x 3'end kit. use splice annotation in alignment. "],
#> "pipeline_parameters": {
#> "seed": [2022],
#> "threads": [8],
#> "do_barcode_demultiplex": [true],
#> "do_gene_quantification": [true],
#> "do_genome_alignment": [true],
#> "do_isoform_identification": [true],
#> "bambu_isoform_identification": [false],
#> "multithread_isoform_identification": [false],
#> "do_read_realignment": [true],
#> "do_transcript_quantification": [true],
#> "oarfish_quantification": [true]
#> },
#> "barcode_parameters": {
#> "max_bc_editdistance": [2],
#> "max_flank_editdistance": [8],
#> "pattern": {
#> "primer": ["CTACACGACGCTCTTCCGATCT"],
#> "BC": ["NNNNNNNNNNNNNNNN"],
#> "UMI": ["NNNNNNNNNNNN"],
#> "polyT": ["TTTTTTTTT"]
#> },
#> "strand": ["-"],
#> "TSO_seq": ["AAGCAGTGGTATCAACGCAGAGTACATGGG"],
#> "TSO_prime": [5],
#> "cutadapt_minimum_length": [10],
#> "full_length_only": [false]
#> },
#> "isoform_parameters": {
#> "generate_raw_isoform": [false],
#> "max_dist": [10],
#> "max_ts_dist": [100],
#> "max_splice_match_dist": [10],
#> "min_fl_exon_len": [40],
#> "max_site_per_splice": [3],
#> "min_sup_cnt": [5],
#> "min_cnt_pct": [0.001],
#> "min_sup_pct": [0.2],
#> "bambu_ndr": [0.5],
#> "bambu_verbose": [false],
#> "bambu_trust_reference": [true],
#> "strand_specific": [0],
#> "remove_incomp_reads": [4],
#> "downsample_ratio": [1]
#> },
#> "alignment_parameters": {
#> "use_junctions": [true],
#> "no_flank": [false]
#> },
#> "realign_parameters": {
#> "use_annotation": [true]
#> },
#> "transcript_counting": {
#> "min_tr_coverage": [0.4],
#> "min_read_coverage": [0.4]
#> }
#> }
#> gene annotation: /__w/_temp/Library/FLAMES/extdata/rps24.gtf.gz
#> genome fasta: /tmp/RtmpltU9Yg/file92d2327584ae/rps24.fa
#> input fastq: /tmp/RtmpltU9Yg/file92d2327584ae/matched_reads.fastq
#> output directory: /tmp/RtmpltU9Yg/file92d2327584ae
#> minimap2 path:
#> k8 path:
#> #### Aligning reads to genome using minimap2
#> 04:17:11 AM Thu Apr 17 2025 minimap2_align
#> Sorting BAM files with 8 threads...
#> Indexing bam files
#> 04:17:11 AM Thu Apr 17 2025 Start gene quantification and UMI deduplication
#> 04:17:11 AM Thu Apr 17 2025 quantify genes
#> Found genome alignment file(s): align2genome.bam
#> 04:17:11 AM Thu Apr 17 2025 Gene quantification and UMI deduplication done!
#> 04:17:11 AM Thu Apr 17 2025 Start isoform identificaiton
#> 04:17:11 AM Thu Apr 17 2025 find_isoform
#> 04:17:11 AM Thu Apr 17 2025 Isoform identificaiton done
#> #### Realigning deduplicated reads to transcript using minimap2
#> Running minimap2 with args: --eqx -N 100 -ax map-ont -y -t 8
#> 04:17:11 AM Thu Apr 17 2025 minimap2_realign
#> Sorting by CB for oarfish quantifaction
#> Sorting BAM file with 8 threads...
#> #### Generating transcript count matrix
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: The "phase" metadata column contains non-NA values for features of type
#> stop_codon. This information was ignored.
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
group_anno <- data.frame(barcode_seq = colnames(sce), groups = SingleCellExperiment::counts(sce)["ENSMUST00000169826.2", ] > 1)
SingleCellExperiment::colLabels(sce) <- group_anno$groups
# DTU with permutation testing:
sc_DTU_analysis(sce, min_count = 1, method = "trascript usage permutation")
#> Filtering for genes with at least 2 isforms expressing more than 1 counts ...
#> 1 gene(s), 9 transcript(s) left.
#>
|
| | 0%
|
|======================================================================| 100%
#>
#> # A tibble: 18 × 9
#> transcript cluster transcript_usage transcript_usage_els…¹ usage_difference
#> <chr> <fct> <dbl> <dbl> <dbl>
#> 1 ENSMUST0000… FALSE 0.570 0.746 -0.176
#> 2 ENSMUST0000… TRUE 0.746 0.570 0.176
#> 3 ENSMUSG0000… FALSE 0.130 0.0750 0.0555
#> 4 ENSMUSG0000… FALSE 0.0538 0.0208 0.0329
#> 5 ENSMUSG0000… FALSE 0.0860 0.0388 0.0473
#> 6 ENSMUST0000… FALSE 0.0215 0.00758 0.0139
#> 7 ENSMUST0000… FALSE 0.0201 0.00599 0.0141
#> 8 ENSMUSG0000… TRUE 0.0750 0.130 -0.0555
#> 9 ENSMUSG0000… TRUE 0.0208 0.0538 -0.0329
#> 10 ENSMUSG0000… TRUE 0.0388 0.0860 -0.0473
#> 11 ENSMUST0000… TRUE 0.00758 0.0215 -0.0139
#> 12 ENSMUST0000… TRUE 0.00599 0.0201 -0.0141
#> 13 ENSMUSG0000… FALSE 0.0538 0.0455 0.00830
#> 14 ENSMUSG0000… TRUE 0.0455 0.0538 -0.00830
#> 15 ENSMUSG0000… FALSE 0.0108 0.00758 0.00318
#> 16 ENSMUSG0000… FALSE 0.0538 0.0530 0.000732
#> 17 ENSMUSG0000… TRUE 0.00758 0.0108 -0.00318
#> 18 ENSMUSG0000… TRUE 0.0530 0.0538 -0.000732
#> # ℹ abbreviated name: ¹transcript_usage_elsewhere
#> # ℹ 4 more variables: p.value <dbl>, permuted_var <dbl>, adj.p.value <dbl>,
#> # gene_id <chr>
# now try with chisq:
sc_DTU_analysis(sce, min_count = 1, method = "chisq")
#> Filtering for genes with at least 2 detected isforms ...
#> 9 isoform(s) left.
#> Aggregating counts by cluster labels ...
#>
|
| | 0%
|
|======================================================================| 100%
#>
#> # A tibble: 1 × 10
#> gene X_value df transcript cluster p.value expected_usage transcript_usage
#> <chr> <dbl> <int> <chr> <chr> <dbl> <dbl> <dbl>
#> 1 ENSM… 13.7 8 ENSMUST00… FALSE 0.0888 0.700 0.570
#> # ℹ 2 more variables: usage_difference <dbl>, adj.p.value <dbl>