RNA-Seq is a technique for studying transcriptomes using next-generation sequencing technologies (see also Transcriptomics technologies). It depends heavily on bioinformatics tools developed to support the different steps of the process. Listed here are some of the principal tools commonly employed, along with links to some important web resources.
Design is a fundamental step of any RNA-Seq experiment. Important questions, such as the sequencing depth/coverage and the number of biological or technical replicates, must be carefully considered. See also: Design review.
Quality assessment of raw data is the first step of the RNA-Seq bioinformatics pipeline. It is often necessary to filter the data by removing low-quality sequences or bases (trimming), adapters, contaminants and overrepresented sequences, or by correcting errors, to ensure a coherent final result. See also: articles about common next-generation sequencing problems.
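The trimming step mentioned above can be illustrated with a minimal sketch. It assumes FASTQ quality strings in Phred+33 encoding and a fixed quality threshold; the function name `trim_3prime` and the default threshold of 20 are hypothetical choices for illustration, and dedicated trimming tools use more elaborate strategies.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Trim low-quality bases from the 3' end of a read.

    seq    -- nucleotide string
    qual   -- FASTQ quality string (Phred+33 encoding is assumed)
    min_q  -- hypothetical quality threshold; bases below it are removed
    """
    end = len(seq)
    # Walk back from the 3' end while the base quality is below the threshold.
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


read = "ACGTACGTAC"
quals = "IIIIIIII#!"  # 'I' = Q40, '#' = Q2, '!' = Q0
trimmed_seq, trimmed_qual = trim_3prime(read, quals)
print(trimmed_seq, trimmed_qual)
```

Only the two low-quality bases at the 3' end are removed here; a read consisting entirely of low-quality bases would be trimmed to an empty string and would normally be discarded.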
Improving RNA-Seq quality by correcting bias is a complex subject. Each RNA-Seq protocol introduces specific types of bias, and each step of the process (such as the sequencing technology used) can generate some sort of noise or error. Furthermore, even the species under investigation and the biological context of the samples can influence the results and introduce bias.
Many sources of bias have already been reported: GC content and PCR enrichment, rRNA depletion, errors produced during sequencing, and the priming of reverse transcription with random hexamers.
Different tools have been developed to address each of the detected errors.
Recent sequencing technologies normally require DNA samples to be amplified via polymerase chain reaction (PCR). Amplification often generates chimeric elements (especially of ribosomal origin), that is, sequences formed from two or more original sequences joined together.
Characterization of high-throughput sequencing errors and their eventual correction.
Further tasks performed before alignment, namely the merging of paired reads.
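The idea of merging paired reads can be sketched as follows. The example assumes that the two reads of a pair come from opposite strands of the same fragment and overlap exactly; the function names and the `min_overlap` parameter are hypothetical. Real mergers typically tolerate mismatches and take base qualities into account, which this sketch does not.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(read1, read2, min_overlap=5):
    """Merge read1 with the reverse complement of read2.

    Requires an exact overlap of at least min_overlap bases between the
    3' end of read1 and the start of the reverse-complemented mate.
    Returns the merged fragment, or None if no such overlap exists.
    """
    mate = reverse_complement(read2)
    # Try the longest possible overlap first.
    for k in range(min(len(read1), len(mate)), min_overlap - 1, -1):
        if read1[-k:] == mate[:k]:
            return read1 + mate[k:]
    return None


fragment = "ACGTTGCAAGGCTTAACCGG"
forward_read = fragment[:12]
reverse_read = reverse_complement(fragment[6:])
merged = merge_pair(forward_read, reverse_read)
print(merged == fragment)
```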
After quality control, the first step of RNA-Seq analysis involves alignment (RNA-Seq alignment) of the sequenced reads to a reference genome (if available) or to a transcriptome database. See also List of sequence alignment software.
Short aligners can align continuous reads (those without gaps resulting from splicing) to a reference genome. There are essentially two types: 1) those based on the Burrows-Wheeler transform, such as Bowtie and BWA, and 2) those based on seed-and-extend methods using the Needleman-Wunsch or Smith-Waterman algorithms. The first group (Bowtie and BWA) is many times faster; however, some tools of the second group tend to be more sensitive, generating more correctly aligned reads. See also a comparative study of short aligners.
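The principle behind the first group can be sketched as follows. The example builds the Burrows-Wheeler transform of a short reference by sorting its rotations and then counts exact occurrences of a pattern by backward search. It is a toy illustration under simplifying assumptions, not the implementation used by Bowtie or BWA: the function names `bwt` and `count_occurrences` are hypothetical, the construction is naive, and only exact matches are handled.

```python
def bwt(text):
    """Burrows-Wheeler transform via sorted rotations (naive construction)."""
    text += "$"  # sentinel, sorts before A, C, G and T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def count_occurrences(bwt_str, pattern):
    """Count exact matches of pattern in the original text by backward search."""
    # first_col_start[c]: number of characters in the text smaller than c,
    # i.e. the row where the block of c begins in the sorted first column.
    first_col_start = {}
    total = 0
    for c in sorted(set(bwt_str)):
        first_col_start[c] = total
        total += bwt_str.count(c)

    def occ(c, i):
        # Occurrences of c in bwt_str[:i]; real indexes precompute this.
        return bwt_str[:i].count(c)

    lo, hi = 0, len(bwt_str)  # half-open range of matching rows
    for c in reversed(pattern):
        if c not in first_col_start:
            return 0
        lo = first_col_start[c] + occ(c, lo)
        hi = first_col_start[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo


reference = "ACGTACGTTACG"
index = bwt(reference)
print(count_occurrences(index, "ACG"))
```

The search processes the pattern from its last character to its first and never scans the reference itself, which is what makes index-based aligners fast on large genomes.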
Many reads span exon-exon junctions and cannot be aligned directly by short aligners, so dedicated spliced aligners were needed. Some spliced aligners first use short aligners to align unspliced/continuous reads (exon-first approach) and then follow a different strategy to align the remaining reads, which contain spliced regions: normally these reads are split into smaller segments that are mapped independently.
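The exon-first idea can be illustrated with a toy sketch. It first tries to place the read contiguously and, failing that, splits it into two segments that are located independently. The function name `align_spliced`, the `min_anchor` parameter and the example sequences are hypothetical, matching is exact, and only the first occurrence of the left segment is considered; actual spliced aligners are considerably more elaborate.

```python
def align_spliced(read, genome, min_anchor=4):
    """Return a list of (genome_start, length) blocks, or None if unplaced."""
    # Exon-first step: try to place the whole read without a gap.
    pos = genome.find(read)
    if pos != -1:
        return [(pos, len(read))]
    # Otherwise try each split point and map the two segments independently,
    # requiring the right segment to lie downstream of the left one.
    for cut in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:cut], read[cut:]
        left_pos = genome.find(left)
        if left_pos == -1:
            continue
        right_pos = genome.find(right, left_pos + cut + 1)
        if right_pos != -1:
            return [(left_pos, cut), (right_pos, len(right))]
    return None


# Made-up example: two exons separated by a 17-base intron.
exon1 = "ATGGCCATTC"
intron = "GTAAGTCCCCCTTTTAG"
exon2 = "TAATGGGCCGC"
genome = "TTTT" + exon1 + intron + exon2 + "AAAA"

junction_read = exon1[-6:] + exon2[:6]  # spans the exon-exon junction
blocks = align_spliced(junction_read, genome)
print(blocks)
```

The gap between the two returned blocks corresponds to the intron that the read skips.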
In this case, the detection of splice junctions relies on databases of known junctions. Tools of this type cannot identify new splice junctions. Some of these data come from other expression methods, such as expressed sequence tags (EST).
De novo splice aligners allow the detection of new splice junctions without the need for previously annotated information (some of these tools offer annotation as a supplementary option). See also De novo Splice Aligners.
These tools perform normalization and calculate the abundance of each gene expressed in a sample. RPKM, FPKM and TPM are some of the units used to quantify expression (see the RPKM-FPKM-TPMs video).
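The units named above can be computed from raw counts and gene lengths as in the following sketch. RPKM is reads per kilobase of transcript per million mapped reads; FPKM uses the same formula but counts fragments (read pairs) instead of reads; TPM normalizes by length first and then scales so that the values in a sample sum to one million. The gene names, counts and lengths are made-up illustrative values.

```python
def rpkm(counts, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total_reads = sum(counts.values())
    return {
        gene: counts[gene] * 1e9 / (total_reads * lengths_bp[gene])
        for gene in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalize first, then scale to 1e6."""
    rate = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    rate_sum = sum(rate.values())
    return {gene: rate[gene] * 1e6 / rate_sum for gene in counts}


# Hypothetical sample: read counts and gene lengths in base pairs.
counts = {"geneA": 100, "geneB": 300, "geneC": 600}
lengths = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(rpkm_values)
print(tpm_values)
```

In this example geneA and geneB receive the same value in both units, because geneB has three times the reads but is also three times as long.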
Some software is also designed to study the variability of gene expression between samples (differential expression). Quantitative and differential studies depend largely on the quality of read alignment and the accuracy of isoform reconstruction. Several studies comparing differential expression methods are available.
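The simplest summary of an expression difference between two conditions is a log2 fold change, sketched below. This is only an illustration under stated assumptions: the function name, the pseudocount and the example values are hypothetical, the inputs are assumed to be already normalized, and no statistical test is performed, whereas dedicated differential expression software fits statistical models to the counts and assesses significance.

```python
import math


def log2_fold_change(control, treated, pseudocount=1.0):
    """Log2 ratio of mean expression (treated vs. control).

    control, treated -- lists of normalized expression values for one gene
    pseudocount      -- hypothetical constant that avoids division by zero
    """
    mean_control = sum(control) / len(control)
    mean_treated = sum(treated) / len(treated)
    return math.log2((mean_treated + pseudocount) / (mean_control + pseudocount))


# Made-up normalized values for a single gene in two replicates per condition.
change = log2_fold_change(control=[7.0, 7.0], treated=[31.0, 31.0])
print(change)
```

A value of 2 corresponds to a fourfold increase and a value of 0 to no change; negative values indicate lower expression in the treated condition.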
Genome rearrangements resulting from diseases such as cancer can produce aberrant genetic modifications such as fusions or translocations. Identifying these modifications plays an important role in carcinogenesis studies.
Single cell sequencing. The traditional RNA-Seq methodology is commonly known as “bulk RNA-Seq”: RNA is extracted from a group of cells or tissues rather than from individual cells, as is done in single-cell methods. Some tools available for bulk RNA-Seq are also applied to single-cell analysis; however, new algorithms were developed to address the specific characteristics of this technique. See also: Comparative analysis of single-cell RNA-sequencing methods. A list of single-cell tools can be found at Awesome Single Cell.
These simulators generate in silico reads and are useful for comparing and testing the efficiency of algorithms developed to handle RNA-Seq data. Moreover, some of them make it possible to analyse and model RNA-Seq protocols. See also Genetic Simulation Resources and some discussion about simulation at Biostars.
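What such a simulator does can be sketched in a few lines. The example samples reads uniformly from a single transcript and introduces substitution errors at a fixed rate. The function name, parameters and the uniform sampling model are assumptions made for illustration; published simulators also model features such as expression levels, fragment sizes and protocol-specific biases.

```python
import random


def simulate_reads(transcript, n_reads, read_len, error_rate=0.01, seed=0):
    """Sample reads uniformly from a transcript with random substitutions.

    Returns a list of (start_position, read_sequence) tuples so that the
    true origin of every read is known when evaluating an aligner.
    """
    rng = random.Random(seed)  # fixed seed for reproducible output
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(transcript) - read_len + 1)
        bases = list(transcript[start:start + read_len])
        for i, base in enumerate(bases):
            if rng.random() < error_rate:
                # Substitute with one of the three other nucleotides.
                bases[i] = rng.choice([b for b in "ACGT" if b != base])
        reads.append((start, "".join(bases)))
    return reads


transcript = "ACGT" * 25  # made-up 100-base transcript
simulated = simulate_reads(transcript, n_reads=10, read_len=20)
for start, read in simulated[:3]:
    print(start, read)
```

Because the true position of each read is recorded, the output can be used to check how many reads an alignment tool places correctly.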
The transcriptome is the total population of RNAs expressed in one cell or group of cells, including non-coding and protein-coding RNAs.
There are two types of approaches to assembling transcriptomes. Genome-guided methods use a reference genome (if possible a finished, high-quality genome) as a template to align reads and assemble them into transcripts. Genome-independent methods do not require a reference genome and are normally used when a genome is not available. In this case, reads are assembled directly into transcripts. Some important comparative studies have already been published.
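The genome-independent approach can be illustrated with a small sketch. It assumes a de Bruijn graph, a data structure commonly used by assemblers of this kind although it is not described in the text above, and it follows only unambiguous paths. The function names and the example transcript are hypothetical; real assemblers must also handle sequencing errors, alternative isoforms and uneven coverage.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_contig(graph, start):
    """Extend a contig from start while the path is unambiguous."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop on cycles
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Made-up transcript and overlapping error-free reads derived from it.
transcript = "ATGGCGTGCAATCC"
reads = [transcript[i:i + 8] for i in range(0, 7, 2)]

graph = de_bruijn_graph(reads, k=5)
contig = walk_contig(graph, start=transcript[:4])
print(contig == transcript)
```

The walk stops as soon as a node has zero or several successors; in a real transcriptome such branch points often correspond to alternative splicing or to repeated sequence.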