Semi-supervised isofrom detection and annotation for long read data.
This variant is meant for bulk samples. Specific parameters can be configured in
the config file (see create_config
), input files are specified via
arguments.
Arguments
- config_file
Path to the JSON configuration file. See
create_config
for creating one.- outdir
Path to the output directory. If it does not exist, it will be created.
- fastq
Path to the FASTQ file or a directory containing FASTQ files. Each file will be processed as an individual sample.
- annotation
The file path to the annotation file in GFF3 / GTF format.
- genome_fa
The file path to the reference genome in FASTA format.
- minimap2
(optional) The path to the minimap2 binary. If not provided, FLAMES will use a copy from bioconda via
basilisk
.- samtools
(optional) The path to the samtools binary. If not provided, FLAMES will use a copy from bioconda via
basilisk
. provided, FLAMES will use a copy from bioconda viabasilisk
.
Value
A FLAMES.Pipeline
object. The pipeline could be run using run_FLAMES
, and / or resumed using resume_FLAMES
.
Details
By default FLAMES use minimap2 for read alignment. After the genome alignment step (do_genome_align
), FLAMES summarizes the alignment for each read by grouping reads
with similar splice junctions to get a raw isoform annotation (do_isoform_id
). The raw isoform
annotation is compared against the reference annotation to correct potential splice site
and transcript start/end errors. Transcripts that have similar splice junctions
and transcript start/end to the reference transcript are merged with the
reference. This process will also collapse isoforms that are likely to be truncated
transcripts. If isoform_id_bambu
is set to TRUE
, bambu::bambu
will be used to generate the updated annotations.
Next is the read realignment step (do_read_realign
), where the sequence of each transcript from the update annotation is extracted, and
the reads are realigned to this updated transcript_assembly.fa
by minimap2. The
transcripts with only a few full-length aligned reads are discarded.
The reads are assigned to transcripts based on both alignment score, fractions of
reads aligned and transcript coverage. Reads that cannot be uniquely assigned to
transcripts or have low transcript coverage are discarded. The UMI transcript
count matrix is generated by collapsing the reads with the same UMI in a similar
way to what is done for short-read scRNA-seq data, but allowing for an edit distance
of up to 2 by default. Most of the parameters, such as the minimal distance to splice site and minimal percentage of transcript coverage
can be modified by the JSON configuration file (config_file
).
See also
create_config
for creating a configuration file,
SingleCellPipeline
for single cell pipelines,
MultiSampleSCPipeline
for multi sample single cell pipelines.
Examples
outdir <- tempfile()
dir.create(outdir)
# simulate 3 samples via sampling
reads <- ShortRead::readFastq(
system.file("extdata", "fastq", "musc_rps24.fastq.gz", package = "FLAMES")
)
dir.create(file.path(outdir, "fastq"))
ShortRead::writeFastq(reads[1:100],
file.path(outdir, "fastq/sample1.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads[1:100],
file.path(outdir, "fastq/sample2.fq.gz"), mode = "w", full = FALSE)
reads <- reads[-(1:100)]
ShortRead::writeFastq(reads,
file.path(outdir, "fastq/sample3.fq.gz"), mode = "w", full = FALSE)
# prepare the reference genome
genome_fa <- file.path(outdir, "rps24.fa")
R.utils::gunzip(
filename = system.file("extdata", "rps24.fa.gz", package = "FLAMES"),
destname = genome_fa, remove = FALSE
)
ppl <- BulkPipeline(
fastq = c(
"sample1" = file.path(outdir, "fastq", "sample1.fq.gz"),
"sample2" = file.path(outdir, "fastq", "sample2.fq.gz"),
"sample3" = file.path(outdir, "fastq", "sample3.fq.gz")
),
annotation = system.file("extdata", "rps24.gtf.gz", package = "FLAMES"),
genome_fa = genome_fa,
config_file = create_config(outdir, type = "sc_3end", threads = 1, no_flank = TRUE),
outdir = outdir
)
#> Writing configuration parameters to: /tmp/RtmpgpEV0i/file25bf79ffabf3/config_file_9663.json
#> Configured steps:
#> genome_alignment: TRUE
#> isoform_identification: TRUE
#> read_realignment: TRUE
#> transcript_quantification: TRUE
#> samtools not found, will use Rsamtools package instead
ppl <- run_FLAMES(ppl) # run the pipeline
#> Running step: genome_alignment
#> Creating junction bed file from GFF3 annotation.
#> Aligning sample sample1 -> /tmp/RtmpgpEV0i/file25bf79ffabf3/sample1_align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample sample2 -> /tmp/RtmpgpEV0i/file25bf79ffabf3/sample2_align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Aligning sample sample3 -> /tmp/RtmpgpEV0i/file25bf79ffabf3/sample3_align2genome.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Sorting BAM files by genome coordinates with 1 threads...
#> Indexing bam files
#> Running step: isoform_identification
#> Import genomic features from the file as a GRanges object ...
#> OK
#> Prepare the 'metadata' data frame ...
#> OK
#> Make the TxDb object ...
#> Warning: genome version information is not available for this TxDb object
#> OK
#> Running step: read_realignment
#> Realigning sample sample1 -> /tmp/RtmpgpEV0i/file25bf79ffabf3/sample1_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Skipped sorting BAM files.
#> Realigning sample sample2 -> /tmp/RtmpgpEV0i/file25bf79ffabf3/sample2_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Skipped sorting BAM files.
#> Realigning sample sample3 -> /tmp/RtmpgpEV0i/file25bf79ffabf3/sample3_realign2transcript.bam
#> Warning: samtools not found, using Rsamtools instead, this could be slower and might fail for large BAM files.
#> Skipped sorting BAM files.
#> Running step: transcript_quantification
experiment(ppl) # get the result as SummarizedExperiment
#> class: SummarizedExperiment
#> dim: 10 3
#> metadata(0):
#> assays(1): counts
#> rownames(10): ENSMUST00000169826.2 ENSMUSG00000025290.17_19_5159_1 ...
#> ENSMUSG00000025290.17_19_5159_8 ENSMUST00000225023.1
#> rowData names(0):
#> colnames(3): sample1 sample2 sample3
#> colData names(0):