Aligning Data

In this lesson:

A brief background on alignment
Creating an index for alignment
Aligning fastq files
Post-alignment processing

But First…

Since the queues may or may not be congested, we’ll go ahead and get some alignment jobs running, so we can hopefully see the results by the end of the lesson. Later in the lesson, we’ll go over in detail what we’re doing here:

These files will be bigger than the ones we’ve worked with before, so let’s work in our individual scratch spaces

$ cd /work/users/t/r/tristand/

Remember to use your own info instead of ‘t/r/tristand’. We’ll now copy some data:

$ cp -r /proj/seq/data/carpentry/align/ .
$ cd align/
$ ls

set a few variable, and submit two alignment runs.

$ module load bbmap
$ refpath=/proj/seq/data/dm6_UCSC/Sequence/BBMapIndex/
$ base=SRR5457511_1
$ sbatch -J $base slurm_bbmap_basic.sh $base $refpath

By using a variable for the “base”, we can simply change the value of that variable and submit a second alignment with the same command”

$ base=SRR5457513_1
$ sbatch -J $base slurm_bbmap_basic.sh $base $refpath

A lot of these may take some time to queue, especially if we’re all submitting at once. Then a several minutes to run, so we’ll go ahead and submit a few and look at the results once all have finished.

What is Alignment?

The typical Illumina HiSeq type run can produce 100s of millions of reads, usually from 50-150bp long. These reads are from the ends of DNA fragments which are typically 400-600bp long. These fragments either came directly from harvesting genomic DNA, or the conversion of RNA from cells into cDNA.

Alignment is the task of figuring out where these reads, and consequently the fragments they were sequenced from, originated in the original cells’ genome. This is the gateway step for many types of analyses, from ChIP-seq, RNA-seq, variant calling, and more specialized types of -seq.

Complications

The task of locating these reads in the genome faces a number of difficulties, major ones include:

Sequencing errors - a sequencing quality score of 30 corresponds to an estimated error of 1 in 1000, q40 1 in 10000. A typical run has average quality scores in the mid 30s, but over millions of reads statisically they will crop up.
Reference genome - you can only align as well as your reference genome is accurate. More polished references like human or mouse will have less issues than less studied organisms.
Natural variantion - individual organisms will have SNPs and indels, the later of which can be particularly problematic.
Computational power - trying to match millions of reads to a full genome is a very large task.

Indexing

Most alignment algorithms use some form of indexing - essentially a way to quickly ‘bookmark’ sections of a genome.

A commonly used technique is the Burrows-Wheeler algorithm. This algorithm is used in lossless compression. It is particularly good at finding reoccurring substrings in text and reducing the data to simpler representions. It was adapted for use in alignment, but essentially reversing its usual usage. The index created by B-W aligners is a reduced representation of the genome, which the program can quickly check parts of a read against.

Burrows-Wheeler is used in a number of common alignment programs, like BWA and Bowtie (the B and W reference the algorithm name).

However, we’ll be using bbmap, which has a slightly different algorithm which rapidly finds k-mer matches. But it too needs to create an index, which is used in a similar manner.

These days, which aligner you use will probably depend more on external factors than “how good” a particular aligner is. Nearly all the major ones differ in various comparisons by a few percent. And this is a good time to introduce a bioinformatics rule of thumb:

Most algorithms tend to agree on the obvious stuff and tend to only differ in edge cases

Now, those edge cases may be important to your particular experiment, so that is one reason you find certain analyses using one aligner over another. Another major reason is simply a lab standardized on the aligner the first person in the lab used to align sequencing data. You may also want to use the same aligner used in a paper that does the same or similar experiment you are conducting if you wish to compare results.

Data Organization

As we’ve seen, things can get quite messy once you start processing and analyzing a lot of files. Once you get more comfortable with paths you can start to have your commands and programs access data from different directories and write out results to a different set of directories. For now we’ll just continue to use the approach of accessing and writing to the current directory, then sort out files afterwards.

If you recall, the fastq files we produced last lesson were all in the trim directory, and no separation of the raw fastq files from the trimmed ones. How you organize your data is up to you, though remember someone else may need to reconstruct your analysis later - that someone may even be you a few months or years in the future.

Usually it’s a good idea to keep the files from each stage of processing in their own directories, for example you may have several directories like:

fastq/
fastqc/
trimmed/
align/

You could have it all organized from the beginning, using relative paths to have each stage (QC, trimming, aligning, counting) write directly to the relevant directories. Or at each stage you just access files in the current “working” directory and then move all those files to a new directory.

(we’re not actually doing this, it’s example/hypothetical set of commands)

Imagine we have all our raw fastq files in our current directory, and then we trimmed them, and wanted to put the trimmed versions in a new directory and we included the string “trim” in each of the resulting trimmed fastqs.

$ mkdir trimmed
$ mv *trim*fastq.gz trimmed/
$ mv trimmed/ ..
$ cd ../trimmed/

In the last three commands, we move the new trimmed directory up one level and then enter it, and lastly make sure all the files we want are here.

Making a reference with BBmap

Let’s pretend we have loaded so many modules we’re worried there may be a conflict:

$ module load bowtie
$ module list

Now we’ll clear all loaded modules:

$ module purge
$ module list

module purge removes all currently loaded modules. We could also use module unload or module remove to remove a specific module.

Now let’s cleanly load bbmap:

$ module load bbmap
$ module list

Currently Loaded Modules:
  1) samtools/1.9   2) pigz/2.3.4   3) bbmap/38.41

As might have noticed earlier, extra modules appeared in addition to bbmap. The BBmap package makes use of other modules, and ITS has set up the system so these modules are loaded automatically. Here’s another reason for using the modules system, some packages may depend on very specific versions of other packages. Samtools is a module we’ll be using later - it has a lot of utilities for working with alignment files. pigz is module that let’s the BBmap programs access compressed files.

Making a reference is fairly straight forward with bbmap.

First let’s get a genome to use - usually these aren’t just sitting around but we’ll grab one from a set we maintain on longleaf.

$ cp /proj/seq/data/dm6_UCSC/Sequence/WholeGenomeFasta/genome.fa .

genome.fa is a bit generic of a name, but as we’ll see in a bit, the set of genomic data we store in /proj/seq/data follows a structure to make it easy to automate access. We can rename this file to remind us of what it is.

$ mv genome.fa dm6_UCSC.genome.fa

We can look at bbmap’s help to get a basic idea of its use: (here we are piping the output to less to make it easier to look over)

$ bbmap.sh -h | less

From the help, we can see the instructions for creating an index are:
To index: bbmap.sh ref=<reference fasta>

So, we can construction this slurm submission:

$ sbatch -o dm6_indexing.out --wrap="bbmap.sh ref=dm6_UCSC.genome.fa"

As we saw with its bbduk trimming sister program, bbmap provides useful information in its logging:

$ less dm6_indexing.out

BBmap will by default write its indexing files to a directory called ref in the current directory. In fact, when aligning, if you don’t specify a reference, bbmap will look for the ref by default. Because I’ve setup the alignment script to point to a directory containing a ref directory, let’s move the index we just made into a newly made directory. Why we do this will hopefully become clear later.

$ mkdir index_dm6
$ mv ref/ index_dm6/

Premade references

Since indexes for each reference genome only need to be made once, we’ve provided a number of common reference genomes for some of the most commonly used aligners. In fact that’s what we used to get our alignments going.

$ ls /proj/seq/data/

Let’s look in the dm6_UCSC directory, which is a build of the Drosophila melanogaster genome.

$ ls /proj/seq/data/dm6_UCSC/

That Sequence directory looks promising

$ ls /proj/seq/data/dm6_UCSC/Sequence/

Here, we’ve provided the premade indexes for several common alignment programs

$ ls /proj/seq/data/dm6_UCSC/Sequence/BBMapIndex/

Now we see the ref directory that BBmap likes to see! We can give the above path to BBmap and it will automatically find the ref directory there.

Alignment

At the beginning of the lesson, we used a simple alignment script I made using bbmap, let’s take a look at it:

$ less slurm_bbmap_basic.sh

There’s a few things to take note of:

base=$1
refpath=$2

I’ve written the script to take two arguments, once is a ‘basename’ of a file, and the other is the path to the reference index.

#SBATCH -n 4
...
threads=4

I request 4 nodes with the -n option for sbatch, and I’ve told bbmap to use 4 threads - these should be equal. Typically I’d use 8 to 16 nodes (eg ‘-n 12’ and ‘threads=12’) which will take longer to queue, but run faster overall

#SBATCH --mem 8g
...
-Xmx8g

We’re going to request 8 gigs of memory from slurm, and you see below we’re telling bbmap to use the same amount. For mammalian sized genomes, 32g is usually a good choice.

in1=${base}.fastq.gz

in1=<file_path> tells bbmap what fastq file to use. If we had paired end data in 2 separate files (which is how data is released by default), we’d have to specify the ‘R2’ file using in2=

path=$refpath the “path=” parameter points bbmap to an existing bbmap index. If it is not set, bbmap looks for a ‘ref/’ directory in the current directory as mentioned above.

out= specifies the name of the file we want to write the bam (aka alignment file) to.

Wait for at least one run from everyone to finish

The bbmap log files are very useful (not all aligners make such convenient summaries), it reports all sorts of info on how well the alignments we to standard out:
(note the job ID, the number after SRRxxxxxx, will differ for everyone, so try using tab completion)

$ less bbmap.SRR5457511_1.57125268.err

That’s a lot of info. Notice anything in particular?

Now say you setup a lot of fastq files to be aligned overnight, and want a quick check to see how they did. We can use grep to pull out specific lines from all the .err files at once:

$ grep "mapped" *err

When aligning paired end reads, the alignment information section will be repeated, once for Read 1, and Read 2, plus a combined section. When dealing with paired end data aligned with ‘bbmap’, I find the following to be useful, because it shows how many of R1 and R2 mapped within typical fragment size of each other.

$ grep "mated pair" *err

Samtools

Samtools is a package with a number of subcommands, which you use similar to the module command.

$ samtools

Shows us a ton of options, but its mostly important to notice this first part:

Usage:   samtools <command> [options]

Let’s sort one of the bam files. If we type in one of the subcommands, it’ll give us a list of its individual options:

$ samtools sort

Take note of this one:

  -o FILE    Write final output to FILE rather than standard output

I mentioned in a previous class a lot of bioinformatics tools are designed to have their results piped from one to another. So by default samtools sort is not going to save the sorted bam file, but stream it to standard out. Instead, we want it in a file, and the sort subcommand gives us that option with the -o option.

So, to sort a bam you can submit the below to the queue:

$ sbatch -o bam_sort.out --wrap="samtools sort SRR5457511_1.bam -o SRR5457511_1.sorted.bam"

Once the sorting is finished, we have the file SRR5457511_1.sorted.bam - let’s check the file sizes:

$ du -shc *bam

157M	SRR5457511_1.bam
106M	SRR5457511_1.sorted.bam

The sorted files will get a bit smaller - this is just because they make a little more efficient use of the file structure.

A lot of the programs that require sorted bams also need an index file of the bam. Note, this is not the same sort of index as aligning needs - this is a more traditional index.

$ sbatch -o bam_index.out --wrap="samtools index SRR5457511_1.sorted.bam"

Practice

For homework(!) you can try sorting and indexing the other files. Also think about how you might put these in scripts, or perhaps just one script to get all your aligning and sorting done in one submission to the queue.