How to Download Genomes and Annotations

Starting a genomics project often means facing a significant initial hurdle: selecting and preparing the correct reference genome and annotation files. The sheer number of genome versions (like GRCm38 vs GRCm39 for mouse) and their corresponding annotation releases (e.g., Ensembl release 102) can feel overwhelming. These differences matter because even slight changes in coordinates can invalidate downstream analyses.

To ensure reproducible and compatible results, the golden rule is to be consistent with specific genome and annotation versions across your entire project. Furthermore, the raw files you download often require downstream processing—such as adding chromosome prefixes or converting file formats—to be properly integrated into various popular bioinformatics tools.
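One way to keep versions consistent is to define them once and derive every download URL from them. The sketch below is a minimal illustration; the variable names are hypothetical, and the URL layout is inferred only from the two Ensembl URLs used later in this guide, so other species or releases may need a different pattern.

```shell
#!/bin/bash
# Hypothetical variable names; values mirror the mouse GRCm38 / release 102 example in this guide.
SPECIES="mus_musculus"
ASSEMBLY="GRCm38"
RELEASE="102"

# Capitalize the first letter of the species name for the file names (Mus_musculus).
SPECIES_CAP="$(printf '%s' "$SPECIES" | awk '{ print toupper(substr($0, 1, 1)) substr($0, 2) }')"

# Assumed layout, based on the URLs used in this guide.
BASE_URL="https://ftp.ensembl.org/pub/release-${RELEASE}"
FASTA_URL="${BASE_URL}/fasta/${SPECIES}/dna/${SPECIES_CAP}.${ASSEMBLY}.dna.primary_assembly.fa.gz"
GTF_URL="${BASE_URL}/gtf/${SPECIES}/${SPECIES_CAP}.${ASSEMBLY}.${RELEASE}.gtf.gz"

echo "$FASTA_URL"
echo "$GTF_URL"
```

Because the release number appears in only one place, the FASTA and GTF cannot silently drift to different releases.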

Several reputable sources provide genome assemblies and annotations, including Ensembl, the UCSC Genome Browser, and NCBI. The workflow below is designed specifically for Ensembl data and is organized into three major steps that get the genome and annotation data ready for your analysis. Running a single, version-controlled script reduces the manual download errors, inconsistent file naming, and coordinate mismatches that often plague genomics pipelines. It also means that anyone, including your future self, can recreate your starting data files simply by running the same script. This standardization is critical for collaborative projects and for building robust, reliable analytical pipelines. Here are the three major steps:

1. Download and Prepare the Genome Sequence (FASTA)

This first stage is about securing and cleaning the actual DNA sequence file. This means downloading the large, compressed reference genome in FASTA format directly from the Ensembl server. We then standardize chromosome names by optionally adding the "chr" prefix (Ensembl names chromosomes 1, 2, X, while many tools expect chr1, chr2, chrX), remembering to use the same naming across all files. Finally, we scan the corrected genome file to measure the exact length of every chromosome, creating a simple, two-column tab-separated file (chromosome name, length) that is essential for coordinate validation and defining boundaries in downstream programs.
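The two operations in this step can be tried on a toy FASTA before running them on a full genome. This is a minimal sketch: the sequences and file names are made up, and the sizes are computed by summing line lengths, which avoids holding a whole chromosome in memory.

```shell
#!/bin/bash
set -euo pipefail
workdir="$(mktemp -d)"

# Toy FASTA with Ensembl-style headers (sequence name followed by a description).
cat > "$workdir/toy.fa" <<'EOF'
>1 dna:chromosome chromosome:TOY:1:1:10:1 REF
ACGTAC
GTAC
>MT dna:chromosome chromosome:TOY:MT:1:4:1 REF
ACGT
EOF

# Keep only the sequence name and add the "chr" prefix.
awk '/^>/ { print ">chr" substr($1, 2); next } { print }' \
  "$workdir/toy.fa" > "$workdir/toy.chr.fa"

# Sum the sequence line lengths for each header to get name<TAB>length.
awk '/^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (name != "") print name "\t" len }' \
  "$workdir/toy.chr.fa" > "$workdir/toy.chr.sizes"

cat "$workdir/toy.chr.sizes"
```

For this toy input the sizes file contains two rows: chr1 with length 10 and chrMT with length 4.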

2. Download GTF files and generate BED from GTF

For the gene annotation, we first download the compressed GTF file for the matching Ensembl release. As with the genome sequence, we optionally adjust the chromosome names in this file by adding the "chr" prefix to maintain internal consistency. The raw GTF is then filtered to keep only the gene entries, and essential attributes such as gene ID, name, and biotype are extracted. Crucially, we convert the coordinates from the GTF's 1-based, inclusive system to BED's 0-based, half-open system by subtracting 1 from the start and leaving the end unchanged. This conversion matters because the simpler, column-based BED format is the lingua franca for many downstream bioinformatics tools, such as bedtools, and is much easier to parse for coordinate-based operations than the attribute-heavy GTF format. The result is a clean, simple table that analysis tools can read directly.
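The coordinate conversion is easy to get wrong by one, so it helps to check it on a record whose answer is obvious. The sketch below uses made-up GTF records and a hypothetical helper function, `attr`, written in portable awk (the full script later in this guide instead relies on a GNU awk feature for the same job).

```shell
#!/bin/bash
set -euo pipefail
workdir="$(mktemp -d)"

# Made-up GTF records (tab-separated); only the two "gene" rows should survive.
{
  printf '1\ttoy\tgene\t101\t200\t.\t+\t.\tgene_id "GENE0001"; gene_name "ToyA"; gene_biotype "protein_coding";\n'
  printf '1\ttoy\texon\t101\t150\t.\t+\t.\tgene_id "GENE0001"; gene_name "ToyA"; gene_biotype "protein_coding";\n'
  printf '2\ttoy\tgene\t501\t1000\t.\t-\t.\tgene_id "GENE0002"; gene_name "ToyB"; gene_biotype "lncRNA";\n'
} > "$workdir/toy.gtf"

awk -F'\t' '
  # attr(key): return the quoted value of one GTF attribute, or "NA" if absent.
  function attr(key,    s) {
    if (!match($9, key " \"[^\"]+\"")) return "NA"
    s = substr($9, RSTART, RLENGTH)
    sub("^" key " \"", "", s)
    sub("\"$", "", s)
    return s
  }
  # BED start = GTF start - 1; BED end = GTF end; length = end - start + 1.
  $3 == "gene" {
    printf "chr%s\t%d\t%d\t%s\t%d\t%s\t%s\t%s\n",
      $1, $4 - 1, $5, attr("gene_id"), $5 - $4 + 1, $7, attr("gene_name"), attr("gene_biotype")
  }' "$workdir/toy.gtf" > "$workdir/toy.genes.bed"

cat "$workdir/toy.genes.bed"
```

A gene spanning GTF positions 101 to 200 covers 100 bases, and in BED it becomes start 100, end 200, so end minus start still equals the gene length.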

3. Define Transcription Start Site (TSS) Regions

This final stage prepares specific regions of interest, often used in ChIP-seq and ATAC-seq analyses to study gene regulation. We begin by identifying each gene's Transcription Start Site (TSS), whose position depends on the gene's strand: the start coordinate for forward-strand genes and the end coordinate for reverse-strand genes. Using the TSS as the anchor, we then calculate three common regulatory windows: an asymmetric window spanning 1 kb upstream and 0.5 kb downstream, and two symmetric windows of ±2 kb and ±5 kb, while setting any negative start coordinate to zero. Finally, we filter these new TSS region files to include only the genes classified as protein-coding (biotype "protein_coding"), focusing the regulatory analysis on protein-coding genes.
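The strand logic can be sketched on a three-gene toy table. The genes and the helper name `tss_window` are hypothetical; the column layout (chr, start, end, gene_id, length, strand, gene_name, gene_biotype) follows the gene table described in step 2.

```shell
#!/bin/bash
set -euo pipefail
workdir="$(mktemp -d)"

# Made-up genes in the 8-column layout from step 2.
{
  printf 'chr1\t5000\t9000\tGENE0001\t4000\t+\tToyA\tprotein_coding\n'
  printf 'chr1\t200\t800\tGENE0002\t600\t-\tToyB\tlncRNA\n'
  printf 'chr2\t300\t700\tGENE0003\t400\t+\tToyC\tprotein_coding\n'
} > "$workdir/genes.bed"

# tss_window UP DOWN FILE: print a strand-aware window around each TSS.
# On the minus strand, upstream means higher coordinates, so UP and DOWN swap sides.
tss_window() {
  awk -F'\t' -v OFS='\t' -v up="$1" -v down="$2" '{
    if ($6 == "+") { tss = $2; s = tss - up;   e = tss + down }
    else           { tss = $3; s = tss - down; e = tss + up }
    if (s < 0) s = 0
    print $1, s, e, $4, $5, $6, $7, $8
  }' "$3"
}

tss_window 1000 500 "$workdir/genes.bed" > "$workdir/tss.1000up_500down.bed"

# Keep only protein-coding genes (column 8).
awk -F'\t' '$8 == "protein_coding"' "$workdir/tss.1000up_500down.bed" \
  > "$workdir/tss.1000up_500down.pc.bed"

cat "$workdir/tss.1000up_500down.bed"
```

The forward-strand gene starting at 5000 gets the window 4000 to 5500, the reverse-strand gene ending at 800 gets 300 to 1800, and the gene starting at 300 has its window start set to 0 rather than -700.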

Finally, here is the complete bash script designed to automatically download and process the files discussed in this guide. Please feel free to edit the script to download your desired genome and annotation files from Ensembl.

#!/bin/bash

# Script to download and process the Ensembl mouse genome assembly (mm10/GRCm38, release 102)
# Downloads the genome FASTA and GTF annotation, then creates gene-related BED files

set -euo pipefail  # Exit on any error, unset variable, or failed pipeline stage

# ==============================================================================
# Download Ensembl mouse genome assembly
# ==============================================================================
wget https://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

# Add chr prefix and keep only the sequence name in each header
zcat Mus_musculus.GRCm38.dna.primary_assembly.fa.gz | \
  awk '/^>/ {
    print ">chr" substr($1, 2)
    next
  }
  { print }' > mm10.GRCm38.r102.chr.fa

# Generate chromosome sizes file (name <TAB> length) by summing line lengths per sequence
awk '/^>/ {
  if (chr != "") print chr "\t" len
  chr = substr($1, 2)
  len = 0
  next
}
{ len += length($0) }
END {
  if (chr != "") print chr "\t" len
}' mm10.GRCm38.r102.chr.fa > mm10.GRCm38.r102.chr.sizes

# ==============================================================================
# Download and process annotation GTF file
# ==============================================================================
wget https://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
gunzip Mus_musculus.GRCm38.102.gtf.gz

# Add chr prefix to GTF (skipping the "#!" header lines)
grep -v '^#!' Mus_musculus.GRCm38.102.gtf | \
  awk '{ print "chr" $0 }' > mm10.GRCm38.102.chr.gtf

# Extract gene information (chr, start, end, gene_id, length, strand, gene_name, gene_biotype)
# Start is converted to 0-based for BED; length is end - start + 1 in 1-based GTF coordinates.
# The three-argument match() is a GNU awk extension, so gawk is called explicitly.
gawk -F'\t' '$3=="gene" {
    match($9, /gene_id "([^"]+)"/, a)
    gene_id = a[1]
    match($9, /gene_name "([^"]+)"/, b)
    gene_name = b[1]
    match($9, /gene_biotype "([^"]+)"/, c)
    gene_biotype = c[1]
    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", \
      $1, $4-1, $5, gene_id, $5-$4+1, $7, gene_name, gene_biotype
  }' mm10.GRCm38.102.chr.gtf > all_genes.55487.bed

# ==============================================================================
# Create TSS (Transcription Start Site) windows
# ==============================================================================
# The TSS is column 2 on the + strand and column 3 on the - strand.
# Window starts are set to 0 if negative; window ends are not capped at the chromosome length.

# TSS window: 1 kb upstream, 0.5 kb downstream (strand-aware)
awk -F'\t' '{
  if ($6 == "+") {
    tss = $2
    window_start = tss - 1000
    window_end = tss + 500
  } else {
    tss = $3
    window_start = tss - 500
    window_end = tss + 1000
  }

  if (window_start < 0) window_start = 0
  printf "%s\t%d\t%d\t%s\t%s\t%s\t%s\t%s\n", \
    $1, window_start, window_end, $4, $5, $6, $7, $8
}' all_genes.55487.bed > all_genes.55487_tss.1000up_500down.bed

# TSS ±2 kb window
awk -F'\t' '{
  if ($6 == "+") {
    tss = $2
  } else {
    tss = $3
  }
  window_start = tss - 2000
  window_end = tss + 2000
  if (window_start < 0) window_start = 0
  printf "%s\t%d\t%d\t%s\t%s\t%s\t%s\t%s\n", \
    $1, window_start, window_end, $4, $5, $6, $7, $8
}' all_genes.55487.bed > all_genes.55487_tss.2000bp.bed

# TSS ±5 kb window
awk -F'\t' '{
  if ($6 == "+") {
    tss = $2
  } else {
    tss = $3
  }
  window_start = tss - 5000
  window_end = tss + 5000
  if (window_start < 0) window_start = 0
  printf "%s\t%d\t%d\t%s\t%s\t%s\t%s\t%s\n", \
    $1, window_start, window_end, $4, $5, $6, $7, $8
}' all_genes.55487.bed > all_genes.55487_tss.5000bp.bed

# Extract protein-coding genes only (column 8 is gene_biotype)
for f in *tss*bed; do
  awk -F'\t' '$8 == "protein_coding"' "$f" > "${f}_pc"
done
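
Once the files exist, the chromosome sizes file can be used for the coordinate validation mentioned in step 1. The sketch below is one possible check, shown on made-up data with a hypothetical helper named `check_bed`: it reports BED records whose chromosome name is missing from the sizes file (for example, a forgotten "chr" prefix) or whose end lies beyond the chromosome length. The second case is worth checking because the script above sets negative window starts to zero but does not cap window ends.

```shell
#!/bin/bash
set -euo pipefail
workdir="$(mktemp -d)"

# Toy sizes file and BED file (made-up values).
printf 'chr1\t1000\n' > "$workdir/toy.sizes"
{
  printf 'chr1\t0\t500\tGENE0001\n'
  printf 'chr1\t900\t1200\tGENE0002\n'
  printf '1\t10\t20\tGENE0003\n'
} > "$workdir/toy.bed"

# check_bed SIZES BED: print problem records, prefixed with the reason.
check_bed() {
  awk -F'\t' '
    NR == FNR             { size[$1] = $2; next }
    !($1 in size)         { print "unknown_chrom\t" $0; next }
    $3 + 0 > size[$1] + 0 { print "past_end\t" $0 }
  ' "$1" "$2"
}

check_bed "$workdir/toy.sizes" "$workdir/toy.bed"
```

On the toy data this flags GENE0002 as past_end and GENE0003 as unknown_chrom. On real output, the same function could be pointed at mm10.GRCm38.r102.chr.sizes and any of the BED files produced above; an empty result means no problems were found.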