Pipeline for Single Cell Data

Semi-supervised isofrom detection and annotation for long read data. This variant is meant for single sample scRNA-seq data. Specific parameters can be configured in the config file (see create_config), input files are specified via arguments.

Usage

SingleCellPipeline(
  config_file,
  outdir,
  fastq,
  annotation,
  genome_fa,
  genome_mmi,
  minimap2,
  samtools,
  barcodes_file,
  expect_cell_number,
  controllers
)

Arguments

config_file: Path to the JSON configuration file. See create_config for creating one.
outdir: Path to the output directory. If it does not exist, it will be created.
fastq: Path to the FASTQ file or a directory containing FASTQ files. Each file will be processed as an individual sample.
annotation: The file path to the annotation file in GFF3 / GTF format.
genome_fa: The file path to the reference genome in FASTA format.
genome_mmi: (optional) The file path to minimap2's index reference genome.
minimap2: (optional) The path to the minimap2 binary. If not provided, FLAMES will use a copy from bioconda via basilisk.
samtools: (optional) The path to the samtools binary. If not provided, FLAMES will use a copy from bioconda via basilisk.
barcodes_file: The file with expected cell barcodes, with each barcode on a new line.
expect_cell_number: The expected number of cells in the sample. This is used if barcodes_file is not provided. See BLAZE for more details.
controllers: (optional, experimental) A crew_class_controller object for running certain steps

Value

A FLAMES.SingleCellPipeline object. The pipeline can be run using run_FLAMES(pipeline). The results can be accessed with experiment(pipeline). The pipeline also outputs a number of output files into the given outdir directory. Some of these output files include:

matched_reads.fastq: - fastq file with reads demultiplexed
align2genome.bam: - sorted BAM file with reads aligned to genome
matched_reads_dedup.fastq: - demultiplexed and UMI-deduplicated fastq file
transcript_assembly.fa: - transcript sequence from the isoforms
isoform_annotated.filtered.gff3: - isoforms in gff3 format (also contained in the SingleCellExperiment)
realign2transcript.bam: - sorted realigned BAM file using the transcript_assembly.fa as reference

Details

By default the pipeline starts with demultiplexing the input fastq data. If the cell barcodes are known apriori (e.g. via coupled short-read sequencing), the barcodes_file argument can be used to specify a file containing the cell barcodes, and a modified Rcpp version of flexiplex will be used; otherwise, expect_cell_number need to be provided, and BLAZE will be used to generate the cell barcodes. The pipeline then aligns the reads to the genome using minimap2. The alignment is then used for isoform detection (either using FLAMES or bambu, can be configured). The reads are then realigned to the detected isoforms. Finally, a transcript count matrix is generated (either using FLAMES's simplistic counting or oarfish's Expectation Maximization algorithm, can be configured). The results can be accssed with experiment(pipeline). If the pipeline errored out / new steps were configured, it can be resumed by calling resume_FLAMES(pipeline)

Examples

outdir <- tempfile()
dir.create(outdir)
bc_allow <- file.path(outdir, "bc_allow.tsv")
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
  filename = system.file("extdata", "bc_allow.tsv.gz", package = "FLAMES"),
  destname = bc_allow, remove = FALSE
)
R.utils::gunzip(
  filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
  destname = genome_fa, remove = FALSE
)
ppl <- SingleCellPipeline(
  config_file = create_config(outdir, gene_quantification = FALSE),
  outdir = outdir,
  fastq = system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES"),
  annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
  genome_fa = genome_fa,
  barcodes_file = bc_allow
)
#> Writing configuration parameters to:  /tmp/Rtmp3Wjsxa/file358a973a394/config_file_13706.json 
#> Configured steps: 
#> 	barcode_demultiplex: TRUE
#> 	genome_alignment: TRUE
#> 	gene_quantification: TRUE
#> 	isoform_identification: TRUE
#> 	read_realignment: TRUE
#> 	transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_FLAMES(ppl)
#> ── Running step: barcode_demultiplex @ Fri Jul  4 03:18:46 2025 ────────────────
#> FLEXIPLEX 0.96.2
#> Setting max barcode edit distance to 2
#> Setting max flanking sequence edit distance to 8
#> Setting read IDs to be  replaced
#> Setting number of threads to 8
#> Search pattern: 
#> primer: CTACACGACGCTCTTCCGATCT
#> BC: NNNNNNNNNNNNNNNN
#> UMI: NNNNNNNNNNNN
#> polyT: TTTTTTTTT
#> Setting known barcodes from /tmp/Rtmp3Wjsxa/file358a973a394/bc_allow.tsv
#> Number of known barcodes: 143
#> Processing file: /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> Searching for barcodes...
#> Number of reads processed: 393
#> Number of reads where at least one barcode was found: 368
#> Number of reads with exactly one barcode match: 364
#> Number of chimera reads: 1
#> All done!
#> ── Running step: genome_alignment @ Fri Jul  4 03:18:46 2025 ───────────────────
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample /tmp/Rtmp3Wjsxa/file358a973a394/matched_reads.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a973a394/align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 8 threads...
#> Indexing bam files
#> ── Running step: gene_quantification @ Fri Jul  4 03:18:47 2025 ────────────────
#> 03:18:47 AM Fri Jul 04 2025 quantify genes 
#> Using BAM(s): /tmp/Rtmp3Wjsxa/file358a973a394/align2genome.bam
#> ── Running step: isoform_identification @ Fri Jul  4 03:18:47 2025 ─────────────
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> ── Running step: read_realignment @ Fri Jul  4 03:18:48 2025 ───────────────────
#> Checking for fastq file(s) /__w/_temp/Library/FLAMES/extdata/fastq/musc_rps24.fastq.gz
#> 	files found
#> Checking for fastq file(s) /tmp/Rtmp3Wjsxa/file358a973a394/matched_reads.fastq.gz
#> 	files found
#> Checking for fastq file(s) /tmp/Rtmp3Wjsxa/file358a973a394/matched_reads_dedup.fastq.gz
#> 	files found
#> Realigning sample /tmp/Rtmp3Wjsxa/file358a973a394/matched_reads_dedup.fastq.gz -> /tmp/Rtmp3Wjsxa/file358a973a394/realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by 8 with CB threads...
#> ── Running step: transcript_quantification @ Fri Jul  4 03:18:48 2025 ──────────
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Import genomic features from the file as a GRanges object ... 
#> OK
#> Prepare the 'metadata' data frame ... 
#> OK
#> Make the TxDb object ... 
#> Warning: genome version information is not available for this TxDb object
#> OK
experiment(ppl)
#> class: SingleCellExperiment 
#> dim: 10 137 
#> metadata(0):
#> assays(1): counts
#> rownames(10): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#>   ENSMUSG00000025290.17_19_5159_8 ENSMUST00000225023.1
#> rowData names(2): transcript_id gene_id
#> colnames(137): AACTCTTGTCACCTAA AACCATGAGTCGTTTG ... TTGTAGGTCAGTGTTG
#>   TTTATGCAGACTAGAT
#> colData names(0):
#> reducedDimNames(0):
#> mainExpName: NULL
#> altExpNames(0):

Usage

Arguments

Value

Details

See also

Examples