0_based and 1_based

0_based numbering: the initial element of a sequence is assigned the
index 0;


1_based numbering: the initial element of a sequence is assigned the
index 1;

Extended miRDeep2 tutorial with step by step instructions. It will cover
the mapper.pl for preprocessing and mapping, the miRDeep2.pl for de-novo
prediction and the quantifier.pl for expression profiling.

BAM/BED data are 0-based and SAM data are 1-based.


This tutorial comes with no warranty and demands common sense of the
reader. I am not responsible for any damage that happens to your
computer by using this tutorial. For comments or questions just create
an ‘issue’ here


Apparently you will need a github account for that.

An example shows below:


This is a step by step guide for a full small RNA sequencing data
analysis using the miRDeep2 package and its patched files. The first
part will describe the general workflow to do de-novo miRNA predictions
based on a small RNAseq data seq and the second part will focus on
expression analysis with the quantification module.

The screenshot came from a sam file. The sequence
(10.4|823|29663|59.364374575|+|61.9047619048|73.3|1) was mapped to mm10,
starting from 3259346 in chr10. As sam file is 1_base, which means
first locus 3259346 in chr10 is the first nucleotide “A” of the

Installation instructions

If you haven’t installed them yet you can obtain the main package
the patched files
clicking on ‘Clone or download’ and then on ‘Download Zip’. Extract the
zipped files and then open a command line window. If you have git
installed you can obtain the packages also directly from the command
line by typing

·git clone


git clone

To install the miRDeep2 package enter the directory to which the package
was extracted to. If you extracted the folder on the Desktop then typing

cd ~/Desktop/mirdeep2

will bring you to the mirdeep2 folder. Then typing

perl install.pl

will start the installer and download and install third party software.
In particular. bowtie version 1, RNAfold, randfold and the perl packages
PDF:API and TTF will be installed. Please follow the instructions on the
screen. When mirdeep2 was installed successfully please open a new
terminal window and just type


If you see the miRDeep2 usage instructions on the screen you can
continue to install the patch. Otherwise something went wrong during the
installation. In case everything worked fine you can now enter the
directory of the mirdeep2_patch by typing

cd ~/Desktop/mirdeep2_patch

if you extracted the patched file to the Desktop. Typing

bash patchme.sh

will add the patched files to your mirdeep2 installation. After that we
can start with some miRNA data analysis. Download the miRBase reference
files for version 21 by typing

mirbase.pl 21

This will download the hairpin.fa.gz and mature.fa.gz for version 21 to
directory ~/mirbase/21/

(The ~ sign will be expanded by your computer to your home directory).

If you want the gff files as well then you need to type

mirbase.pl 21 1

The second argument can be anything but 0 which will tell the script to
also get the gff files from mirbase. For miRNA quantification we next
extract the miRNAs for our species of interest. For that you need to
know the 3-letter code of miRBase for your species. For humans this will
be ‘hsa’ and mouse would be mmu. To extract the mature sequences from
the mirbase file we downloaded before you just need to type

extract_miRNAs.pl ~/mirbase/21/mature.fa.gz hsa > mature_ref.fa

and to get the hairpin sequences by typing

extract_miRNAs.pl ~/mirbase/21/hairpin.fa.gz hsa > hairpin_ref.fa

For running miRDeep2 for de-novo miRNA prediction it is beneficial to
supply also mature miRNAs from related species. You could for example
use mouse ‘mmu’ and chimp ‘ptr’ as related species. For extracting those
miRNAs you can type

extract_miRNAs.pl ~/mirbase/21/mature.fa.gz mmu,ptr >

图片 1

Tutorial files

In case you want to follow the tutorial with example data you can get
the files from

unzip drmirdeep.github.io-master.zip

change to the unzipped directory with

cd drmirdeep.github.io-master

and continue with the tutorial.

However, after I converted this to bed file using bamToBed, the iniial
locus of the region converts from 3259346 to 3259345. And as Bed file is
0-based, so the second locus in chr10 is the first nucleotide “A”.

The data and what else you will need

Usually you will have gotten a small RNA sequencing data file from a
collaborator that wants you to analyze the data file. Before you can
start with any kind of analysis you should either already know the small
RNA sequencing adapter that was used for the sequencing of the sample or
ask your collaborator to sent it to you. If you don’t clip the adapter
then the majority of the reads having an adapter are likely not to be
aligned to anywhere.

Once you know the adapter sequence you should do a simple check to see
how many of your sequences contain the adapter. This you can do by

grep -c TGGAATTC example_small_rna_file.fastq

where ‘TGGAATTC’ are the first 8 nucleotides of the adapter that has
been used for this sample. Replace it with your own sequencing adapter.
MicroRNAs have a mean length of 22 nucleotides in animals so if you have
sequenced one of those it will likely have the sequencing adapter
attached to it. If the resulting number of sequences with an adapter is
around 70% of the number of your input sequences the data set can be
considered as reasonably good. Note: In case that only adapters have
been sequenced predominantly you will also get a high number which is
obviously not good. If you only get 10% of sequences with an adpater
then likely something went wrong during the sequnecing library
preparation or your sample doesn’t contain too many small RNAs. For
novel miRNA prediction we need to map the reads against a reference
database which has to be indexed by bowtie 1. For this we take a
reference database file, lets call it refdb.fa (This can be a genome
file or simply a file with scaffolds) and build a bowtie index by typing

bowtie-build refdb.fa refdb.fa

The first argument is here the actual file to index, the second argument
is the prefix for the bowtie index files. You can name it differently
but for ease of use I use the same name as my reference database file.
Depending on the input file size it can take several hours (for the
human genome for example) to be indexed. However, the bowtie website has
already some prebuild index files for download. If you decide to
download index files you will also need to download the fasta file with
which the index was build. Otherwise the results in the miRDeep2
prediction will be not reliable.

图片 2

Data preprocessing for novel miRNA prediction

Since the miRDeep2 package was designed as a complete solution for miRNA
prediction and quantification it also contains data preprocessing
routines that will also clip the sequencing adapter. The main function
of the mapping module is the mapping of the preprocessed reads file to
reference database. The reference database is typically an annotated
genome sequence but can also be simply a scaffold assembly if no genome
is available. The scaffolds itself however should be at least 200
nucleotides long so that a sane miRNA precursor plus some flanking
region fits into it. Apart from clipping adapters the module does sanity
checks on your sequencing reads and also collapse read sequences to
reduce the file size which will save computing time.

mapper.pl example_small_rna_file.fastq -e -h -i -j -k TGGAATTC -l
18 -m -p refdb.fa -s reads_collapsed.fa -t reads_vs_refdb.arf -v -o

What does this command do? The first argument needs to be your
sequencing file. Typically, this will be a fastq file. The format of the
fastq file is designated by specifying option ‘-e’. If your file is in
fasta format already you specify option ‘-c’ instead. If your reads file
is not in fasta format you need to specify option ‘-h’ which advises the
mapper module to parse your file to fasta format. Option -i will convert
RNA to DNA and option ‘-j’ will remove sequences that contain characters
other than ACGTN.

Now comes the actual adapter clipping which is only done if a adapter
sequence is given by option -k. Only the first 6 nucleotides of this
sequence will be used to search for an exact match in the sequencing
reads. Option ‘-m’ will collapse the reads to remove redundancy and
decrease the file size. A sequnecing read seen 10 times in your raw file
will occur only once in the collapsed file and have a _x10 in its

After that the reads will be mapped to the given refence genome which
index file was specified by option ‘-p’. Option -s indicates the
preprocessed read file name which is output by the mapper module and
option -t is the file name of the read mappings to the reference
database (‘refdb.fa’) in miRDeep2’s arf format. A mapping file in arf
format can be easily obtained from a standard bowtie 1 output file (This
is NOT in ‘sam’ format but a proprietary bowtie text file format) by

convert_bowtie_output.pl reads_vs_refdb.bwt >

However, if you used the mapping module then the mapped output file is
already in arf format.

To double check, I extracted the the sequence of this region
(chr10:3259345-3259372) in bed file as shown below. Comparing with the
sequence (AAGGGGCTGGACTTGCATGCCATGGAT) in the sam file, we can know that
the sequence indeed start from the second locus.

Identification of known and novel miRNAs

For predicting novel miRNAs the miRDeep2 module from the package is
called with a collapsed reads file and a reference genome file in fasta
format. For better prediction results reference files of miRNAs and
related miRNAs should be given since miRDeep2 considers predicted miRNAs
with conserved seeds in other species more reliable that miRNAs with
non-conserved seeds.

miRDeep2.pl reads_collapsed.fa refdb.fa reads_vs_refdb.arf
mature_ref.fa mature_other.fa hairpin_ref.fa -t hsa 2>report.log

图片 3

Data preprocessing for miRNA expression profiling