Genomic Analyses
Basic tools for processing and analyzing genomic sequencing (next-generation sequencing, NGS) data come preinstalled in a container environment.
The path to the container on Komondor is: /opt/software/packages/containers/datascience_genomics.sif.
The definition file (datascience_genomics.def) is also available in the same path next to the .sif file.
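A quick way to confirm that the container is usable is to inspect its metadata (a minimal check; singularity inspect prints the image's labels and build information):
$ module load singularity
$ singularity inspect /opt/software/packages/containers/datascience_genomics.sif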
Available tools
Open-source tools that are currently available in the container:
Quality control of sequencing data: FastQC, MultiQC
Trimming and cleaning sequences: Cutadapt, Trimmomatic, Trim Galore
Sequence read mapping tools: BWA, Bowtie, Bowtie 2, STAR, BBMap, HISAT2, Bismark
Manipulation of sequences and alignment data: Samtools, Picard
Peak calling: MACS2
Motif analysis: HOMER
Genome arithmetic: Bedtools
Gene expression analysis: Salmon, DESeq2
Genetic variant identification and analysis: BCFtools, VCFtools, freebayes
Chromosome conformation capture analysis: HiCUP, HiCExplorer, HiC-Pro
Genome assembly: ABySS, MEGAHIT, Velvet, SPAdes, Canu, Tigmint
Genome analysis toolkit: GATK4
Data analysis and scripting: R, Python, Julia
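To see which versions are installed, the tools can be queried directly through the container (a sketch; most, though not all, of the tools above accept a --version flag):
$ module load singularity
$ SIF=/opt/software/packages/containers/datascience_genomics.sif
$ for tool in fastqc samtools bcftools bowtie2; do singularity exec ${SIF} ${tool} --version; done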
Important
Note that authors usually request a citation in the publication if their software tool was used in the published work. Please always check the terms of use and the citation request on the tool’s website.
Scheduled analyses (sbatch)
Using the container, bioinformatics pipeline scripts can be run as scheduled Slurm jobs.
Example: Bowtie2 alignment
In this example, suppose we have paired-end NGS data from an experiment that we want to align to the human reference genome.
We have the reference genome prepared (indexed) in the reference directory (on the PROJECT filesystem). We have the two NGS fastq (read1 and read2) files in the data directory and we created an output directory for the alignment files (on the faster SCRATCH filesystem).
$ ls -1 /project/my_project/reference
hs_grch38.fa.gz
hs_grch38.fa.gz.1.bt2
hs_grch38.fa.gz.2.bt2
hs_grch38.fa.gz.3.bt2
hs_grch38.fa.gz.4.bt2
hs_grch38.fa.gz.rev.1.bt2
hs_grch38.fa.gz.rev.2.bt2
$ ls -1 /scratch/my_project/fastq
my_sample_R1.fastq.gz
my_sample_R2.fastq.gz
$ ls -1 /scratch/my_project/sam
$
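For completeness: if the Bowtie 2 index did not exist yet, it could be built from the FASTA file with bowtie2-build inside the container (a sketch; building a full human index is resource-intensive and should itself be run as a Slurm job):
$ module load singularity
$ singularity exec -B /project:/project /opt/software/packages/containers/datascience_genomics.sif \
      bowtie2-build --threads 8 /project/my_project/reference/hs_grch38.fa.gz /project/my_project/reference/hs_grch38.fa.gz
The second argument is the index basename, which is why the .bt2 files in the listing above all share the hs_grch38.fa.gz prefix.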
Now we prepare an sbatch script that performs the alignment with bowtie2, which is already installed in the genomics container (/opt/software/packages/containers/datascience_genomics.sif). We allocate 16 processor cores for the alignment and set bowtie2 to use 16 threads accordingly. Note that we mount both the PROJECT and SCRATCH file systems for the container, since we have input data in both locations. As configured in the script, bowtie2’s log will be written to the “slurm-%j__bowtie2.out” file (%j is replaced by the Slurm job ID).
$ cat ./alignment.sbatch
#!/usr/bin/env bash
#SBATCH --job-name=alignment
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2000
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j__bowtie2.out
# input data (fastq files)
READ1=/scratch/my_project/fastq/my_sample_R1.fastq.gz
READ2=/scratch/my_project/fastq/my_sample_R2.fastq.gz
# output data (sam file)
SAM_OUT=/scratch/my_project/sam/my_sample.sam
# reference genome index location
REF_IDX=/project/my_project/reference/hs_grch38.fa.gz
# loading Singularity
module load singularity
# Singularity options
# the genomic analysis tools are installed in the datascience_genomics.sif container
# -B option specifies the file systems (with our data) we want to mount inside the container
CONTAINER_CALL="singularity exec -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
# running bowtie2 alignment (on 16 processor cores, as set in the SBATCH options above) using the container
${CONTAINER_CALL} bowtie2 --threads ${SLURM_CPUS_PER_TASK} -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
We can submit the sbatch script to Slurm using the following command:
$ sbatch ./alignment.sbatch
Submitted batch job 1234567
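While the job is pending or running, its state can be checked with the standard Slurm commands:
$ squeue -u $USER
$ sacct -j 1234567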
If the job completes successfully, the alignment (sam) file has been created and the bowtie2 log shows the alignment stats:
$ ls -1 /scratch/my_project/sam
my_sample.sam
$ cat ./slurm-1234567__bowtie2.out
50000 reads; of these:
50000 (100.00%) were paired; of these:
10646 (21.29%) aligned concordantly 0 times
33390 (66.78%) aligned concordantly exactly 1 time
5964 (11.93%) aligned concordantly >1 times
----
10646 pairs aligned concordantly 0 times; of these:
5243 (49.25%) aligned discordantly 1 time
----
5403 pairs aligned 0 times concordantly or discordantly; of these:
10806 mates make up the pairs; of these:
6544 (60.56%) aligned 0 times
1890 (17.49%) aligned exactly 1 time
2372 (21.95%) aligned >1 times
93.46% overall alignment rate
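A typical next step (not covered by the script above) is converting the SAM output to a sorted, indexed BAM file with samtools from the same container; a sketch, run here interactively for brevity:
$ module load singularity
$ CONTAINER_CALL="singularity exec -B /scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
$ ${CONTAINER_CALL} samtools sort -@ 8 -o /scratch/my_project/sam/my_sample.sorted.bam /scratch/my_project/sam/my_sample.sam
$ ${CONTAINER_CALL} samtools index /scratch/my_project/sam/my_sample.sorted.bam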
Example: Multiple alignments in parallel (job array)
In this example, the task is similar to the previous one, but now we have three samples instead of one, and we want to analyse them in parallel.
We have the reference (indexed), the input files (three paired-end NGS data files) and the output directory:
$ ls -1 /project/my_project/reference
hs_grch38.fa.gz
hs_grch38.fa.gz.1.bt2
hs_grch38.fa.gz.2.bt2
hs_grch38.fa.gz.3.bt2
hs_grch38.fa.gz.4.bt2
hs_grch38.fa.gz.rev.1.bt2
hs_grch38.fa.gz.rev.2.bt2
$ ls -1 /scratch/my_project/fastq
sample1_R1.fastq.gz
sample1_R2.fastq.gz
sample2_R1.fastq.gz
sample2_R2.fastq.gz
sample3_R1.fastq.gz
sample3_R2.fastq.gz
$ ls -1 /scratch/my_project/sam
$
The simplest method to run the alignments in parallel is a job array. With the following settings, the sbatch script will be run in parallel for all three samples as separate jobs. The variable $SLURM_ARRAY_TASK_ID can be used to distinguish the runs and to select the sample name individually for each run. The range of the array task IDs is set with the --array sbatch option. (See also the “Job Array” section in “Basic Usage / Advanced Job Definitions”.)
$ cat ./alignment_array.sbatch
#!/usr/bin/env bash
#SBATCH --job-name=alignment
#SBATCH --partition=cpu
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem-per-cpu=2000
#SBATCH --array=0-2
#SBATCH --time=01:00:00
#SBATCH --output=slurm-%j_%a__bowtie2.out
# loading Singularity
module load singularity
# Singularity options
# the genomic analysis tools are installed in the datascience_genomics.sif container
# -B option specifies the file systems (with our data) we want to mount inside the container
CONTAINER_CALL="singularity exec -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
# we want to run the alignment for three samples in parallel
samples=(sample1 sample2 sample3)
# current run (sample)
# we use the index of the current task within the job array to select the sample name
run=${samples[$SLURM_ARRAY_TASK_ID]}
# input data (fastq files)
READ1=/scratch/my_project/fastq/${run}_R1.fastq.gz
READ2=/scratch/my_project/fastq/${run}_R2.fastq.gz
# output data (sam file)
SAM_OUT=/scratch/my_project/sam/${run}.sam
# reference genome index location
REF_IDX=/project/my_project/reference/hs_grch38.fa.gz
# running bowtie2 alignment (on 16 processor cores, as set in the SBATCH options above) using the container
srun ${CONTAINER_CALL} bowtie2 --threads ${SLURM_CPUS_PER_TASK} -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
Submitting the sbatch script to Slurm:
$ sbatch ./alignment_array.sbatch
Submitted batch job 1234567
On a successful run, we will have the output (sam) file for each sample, and a separate output log for each alignment:
$ ls -1 /scratch/my_project/sam
sample1.sam
sample2.sam
sample3.sam
$ ls -1 ./slurm-*__bowtie2.out
slurm-1234567_2__bowtie2.out
slurm-1234568_0__bowtie2.out
slurm-1234569_1__bowtie2.out
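For larger cohorts, hard-coding the sample names in the script becomes impractical. A common variant (a sketch, assuming a hypothetical samples.txt file with one sample name per line) selects the name by array index instead:
# replace the samples=(...) and run=... lines of the script with:
# task IDs start at 0, sed line numbers at 1, hence the +1
run=$(sed -n "$((SLURM_ARRAY_TASK_ID + 1))p" /scratch/my_project/samples.txt)
The --array range can then be sized to the file, e.g. --array=0-$(( $(wc -l < /scratch/my_project/samples.txt) - 1 )) on the sbatch command line.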
Useful configurations
Links to Slurm configurations that can help organize and efficiently run bioinformatics pipelines that consist of many steps and/or parallel tasks:
Job arrays (mentioned above): Submitting and managing collections of similar jobs quickly and easily - https://slurm.schedmd.com/job_array.html
Packed jobs: Running multiple job steps concurrently within a single job allocation - https://slurm.schedmd.com/srun.html#OPT_exclusive
Job dependencies: Organizing sequential compute jobs that have to run in order (see the sketch after this list) - https://slurm.schedmd.com/sbatch.html#OPT_dependency
Heterogeneous jobs: Submitting and managing jobs with different resources for different pieces of the job - https://slurm.schedmd.com/heterogeneous_jobs.html
Multiple programs configuration: Run a job with different programs and different arguments for each task - https://slurm.schedmd.com/srun.html#OPT_multi-prog
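As a brief illustration of job dependencies (a sketch; postprocess.sbatch is a hypothetical follow-up script), the alignment job's ID can be captured with --parsable and a downstream step chained to its successful completion:
$ JOB_ID=$(sbatch --parsable ./alignment.sbatch)
$ sbatch --dependency=afterok:${JOB_ID} ./postprocess.sbatch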
Interactive analyses
The genomics container can also be launched interactively on the command line or on JupyterHub (https://jupyter.hpc.kifu.hu).
Analysis setup
Let’s say we want to run just a short test alignment using the same setup (reference, data files, task) described in the “Example: Bowtie2 alignment” section above:
$ REF_IDX=/project/my_project/reference/hs_grch38.fa.gz
$ READ1=/scratch/my_project/fastq/my_sample_R1.fastq.gz
$ READ2=/scratch/my_project/fastq/my_sample_R2.fastq.gz
$ SAM_OUT=/scratch/my_project/sam/my_sample.sam
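Before launching anything, it is worth checking that the variables resolve to existing files:
$ ls -l ${READ1} ${READ2}
$ ls -1 ${REF_IDX}*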
Execute the analysis interactively
Instead of preparing and submitting an sbatch script, we can also do the alignment interactively on the command line using the container (provided that the task is really short and does not use too many resources):
$ module load singularity
$ CONTAINER_CALL="singularity exec -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
$ srun --ntasks=1 --cpus-per-task=16 --mem-per-cpu=2000 --pty ${CONTAINER_CALL} bowtie2 --threads 16 -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
Execute the analysis in the container shell
We can also load the genomics container for full interactive use and issue commands directly in the container’s shell:
$ module load singularity
$ CONTAINER_SHELL="singularity shell -B /project:/project,/scratch:/scratch /opt/software/packages/containers/datascience_genomics.sif"
$ srun --ntasks=1 --cpus-per-task=16 --mem-per-cpu=2000 --pty $CONTAINER_SHELL
singularity> bowtie2 --version
/opt/conda/bin/bowtie2-align-s version 2.5.4
...
singularity> realpath $READ1
/scratch/my_project/fastq/my_sample_R1.fastq.gz
singularity> bowtie2 --threads 16 -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}
50000 reads; of these:
50000 (100.00%) were paired; of these:
...
singularity> ls /scratch/my_project/sam
my_sample.sam
singularity> exit
$
Interactive use on the JupyterHub
Log in to JupyterHub and select datascience-genomics from the container list when starting a new server. You can use the bash kernel to create a notebook for the analysis, or use the bash terminal to run the tools directly.
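In a terminal or a bash-kernel notebook started this way, the container's tools are expected to be directly on the PATH, so the singularity prefix is not needed (a sketch, reusing the variables from the “Analysis setup” section above):
$ bowtie2 --version
$ bowtie2 --threads 4 -x ${REF_IDX} -1 ${READ1} -2 ${READ2} -S ${SAM_OUT}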