Starting a genomics project often means facing a significant initial hurdle: selecting and preparing the correct reference genome and annotation files. The sheer number of genome versions (like GRCm38 vs GRCm39 for mouse) and their corresponding annotation releases (e.g., Ensembl release 102) can feel overwhelming. These differences matter because even slight changes in coordinates can invalidate downstream analyses.
To ensure reproducible and compatible results, the golden rule is to use one specific genome version and one matching annotation release across your entire project. Furthermore, the raw files you download often need further processing, such as adding chromosome prefixes or converting file formats, before popular bioinformatics tools can use them.
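One way to keep versions consistent is to define the release and assembly once and build every download URL from them. The snippet below is a minimal sketch: the variable names are hypothetical, and the URL layout is simply the one used by the mouse GRCm38 release 102 links in this guide's script. Other species or releases may use different file names (for example, some species have no `primary_assembly` file), so check the Ensembl FTP listing before relying on it.

```shell
# Hypothetical variable names; values taken from the script in this guide
RELEASE=102
ASSEMBLY=GRCm38
SPECIES_DIR=mus_musculus     # directory name on the FTP server
SPECIES_FILE=Mus_musculus    # file-name prefix

BASE="https://ftp.ensembl.org/pub/release-${RELEASE}"
FASTA_URL="${BASE}/fasta/${SPECIES_DIR}/dna/${SPECIES_FILE}.${ASSEMBLY}.dna.primary_assembly.fa.gz"
GTF_URL="${BASE}/gtf/${SPECIES_DIR}/${SPECIES_FILE}.${ASSEMBLY}.${RELEASE}.gtf.gz"

# Print the URLs so they can be inspected (or passed to wget)
echo "$FASTA_URL"
echo "$GTF_URL"
```

Because both URLs derive from the same `RELEASE` and `ASSEMBLY` values, the FASTA and GTF cannot drift apart by accident.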
Several institutions, including Ensembl, the UCSC Genome Browser, and NCBI, provide genome assemblies and annotations. The workflow below is designed specifically for Ensembl data and outlines the three major stages for getting the genome and annotation data ready for analysis. By running a single, version-controlled script, you avoid the manual download errors, inconsistent file naming, and coordinate mismatches that often plague genomics pipelines. Anyone, including your future self, can recreate your starting data files simply by running the same script. This standardization is critical for collaborative projects and for building robust, reliable analytical pipelines. Here are the three major steps:
1. Download and Prepare the Genome Sequence (FASTA)
This first stage secures and cleans the DNA sequence file itself. We download the large, compressed reference genome in FASTA format directly from the Ensembl server. We then standardize chromosome names by optionally adding the "chr" prefix for compatibility with tools that expect it, remembering to use the same naming in every file. Finally, we scan the renamed genome file to measure the exact length of every chromosome, creating a simple two-column, tab-separated file (name, length) that is essential for coordinate validation and for defining boundaries in downstream programs.
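The renaming and length-measuring steps can be illustrated on a tiny, made-up FASTA file. This is only a sketch: the file names, the `workdir` scratch directory, and the toy sequences are invented for illustration, and the real genome is far larger.

```shell
set -e
workdir=$(mktemp -d)   # hypothetical scratch directory

# Toy FASTA with Ensembl-style headers (no "chr" prefix)
cat > "$workdir/toy.fa" <<'EOF'
>1 dna:chromosome chromosome:TOY:1:1:10:1 REF
ACGTA
CGTAC
>MT dna:chromosome chromosome:TOY:MT:1:4:1 REF
ACGT
EOF

# Add the "chr" prefix and keep only the sequence name in each header
awk '/^>/ { print ">chr" substr($1, 2); next } { print }' \
    "$workdir/toy.fa" > "$workdir/toy.chr.fa"

# Sum the line lengths per sequence to get name<TAB>length
awk '/^>/ { if (name != "") print name "\t" len; name = substr($1, 2); len = 0; next }
     { len += length($0) }
     END { if (name != "") print name "\t" len }' \
    "$workdir/toy.chr.fa" > "$workdir/toy.chr.sizes"

cat "$workdir/toy.chr.sizes"
```

Note that a plain prefix turns Ensembl's `MT` into `chrMT`, whereas UCSC names the mitochondrial chromosome `chrM`. If your other files follow UCSC naming, that one name needs an explicit rename.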
2. Download GTF files and generate BED from GTF
For the gene annotation, we first download the compressed GTF file for the matching Ensembl release. As with the genome sequence, we optionally add the "chr" prefix to the chromosome names in this file to maintain internal consistency. The raw GTF is then filtered down to the gene entries, and the essential attributes (gene ID, gene name, and biotype) are extracted along with the gene length. Crucially, we convert the coordinates from the GTF's 1-based, inclusive system to BED's 0-based, half-open system: the start is decremented by one and the end is left unchanged. This conversion matters because the simpler, column-based BED format is the lingua franca of many downstream tools, such as bedtools, and is much easier to parse for coordinate-based operations than the attribute-heavy GTF format.
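The coordinate conversion can be checked on a few made-up GTF records. The sketch below uses invented gene IDs and coordinates, and a small helper function `attr` (hypothetical, written with POSIX awk only) to pull values out of the attribute column.

```shell
set -e
workdir=$(mktemp -d)   # hypothetical scratch directory

# Toy GTF: two gene records and one exon record (all values invented)
{
  printf '1\ttoy\tgene\t101\t200\t.\t+\t.\tgene_id "GENE0001"; gene_name "ToyA"; gene_biotype "protein_coding";\n'
  printf '1\ttoy\texon\t101\t150\t.\t+\t.\tgene_id "GENE0001"; gene_name "ToyA"; gene_biotype "protein_coding";\n'
  printf '1\ttoy\tgene\t501\t650\t.\t-\t.\tgene_id "GENE0002"; gene_name "ToyB"; gene_biotype "lncRNA";\n'
} > "$workdir/toy.gtf"

awk -F'\t' '
# Return the quoted value of attribute "key" from column 9, or "NA" if absent
function attr(key,    re, s) {
    re = key " \"[^\"]+\""
    if (match($9, re)) {
        s = substr($9, RSTART, RLENGTH)
        sub(/^[^"]*"/, "", s)
        sub(/"$/, "", s)
        return s
    }
    return "NA"
}
$3 == "gene" {
    # BED start = GTF start - 1; BED end = GTF end; length = end - start + 1
    printf "chr%s\t%d\t%d\t%s\t%d\t%s\t%s\t%s\n",
        $1, $4 - 1, $5, attr("gene_id"), $5 - $4 + 1, $7,
        attr("gene_name"), attr("gene_biotype")
}' "$workdir/toy.gtf" > "$workdir/toy.genes.bed"

cat "$workdir/toy.genes.bed"
```

A gene spanning GTF positions 101 to 200 becomes the BED interval 100 to 200, and its length (100 bp) equals BED end minus BED start, which is a quick way to verify the conversion.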
3. Define Transcription Start Site (TSS) Regions
This final stage prepares specific regions of interest, often used in ChIP-seq and ATAC-seq analyses to study gene regulation. We begin by identifying each gene's Transcription Start Site (TSS), whose position depends on the gene's strand: the gene start on the forward strand, the gene end on the reverse strand. Using the TSS as the anchor, we then calculate three common regulatory windows: an asymmetrical window spanning 1 kb upstream and 0.5 kb downstream, and two symmetrical windows extending 2 kb and 5 kb on either side. Any window start that would fall below zero is clamped to zero. Finally, we filter these new TSS region files to include only the genes whose biotype is "protein_coding", focusing the regulatory analysis on protein-coding genes.
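The strand-aware window logic can be sketched on a toy gene table with the same eight columns as the gene BED file (chromosome, start, end, gene ID, length, strand, name, biotype). The records and the `up`/`down` variable names are invented for illustration; the TSS convention mirrors the one described above (column 2 for `+`, column 3 for `-`).

```shell
set -e
workdir=$(mktemp -d)   # hypothetical scratch directory

# Toy gene BED: one forward-strand gene near the chromosome start, one reverse-strand gene
printf 'chr1\t100\t200\tGENE0001\t100\t+\tToyA\tprotein_coding\n'  > "$workdir/toy.genes.bed"
printf 'chr1\t5000\t6500\tGENE0002\t1500\t-\tToyB\tlncRNA\n'      >> "$workdir/toy.genes.bed"

# 1 kb upstream / 0.5 kb downstream of the TSS, relative to the gene's strand
awk -F'\t' -v OFS='\t' -v up=1000 -v down=500 '{
    if ($6 == "+") { tss = $2; w_start = tss - up;   w_end = tss + down }
    else           { tss = $3; w_start = tss - down; w_end = tss + up }
    if (w_start < 0) w_start = 0          # clamp to the chromosome start
    print $1, w_start, w_end, $4, $5, $6, $7, $8
}' "$workdir/toy.genes.bed" > "$workdir/toy.tss.bed"

# Keep protein-coding genes only
awk -F'\t' '$8 == "protein_coding"' "$workdir/toy.tss.bed" > "$workdir/toy.tss.pc.bed"

cat "$workdir/toy.tss.bed"
```

For the reverse-strand gene, "upstream" means higher coordinates, so the 1 kb arm is added to the TSS and the 0.5 kb arm is subtracted. The forward-strand gene shows the clamping: its window would start at -900 and is set to 0 instead.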
Finally, here is the complete bash script designed to automatically download and process the files discussed in this guide. Please feel free to edit the script to download your desired genome and annotation files from Ensembl.
```bash
#!/bin/bash

# Script to download and process the Ensembl mouse genome assembly (mm10/GRCm38, release 102)
# It downloads the genome FASTA and GTF annotation, then creates various gene-related BED files
# Note: the three-argument match() used below is a GNU awk (gawk) extension

set -e  # Exit on any error

# ==============================================================================
# Download Ensembl mouse genome assembly
# ==============================================================================
wget https://ftp.ensembl.org/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.primary_assembly.fa.gz

# Add chr prefix and keep only the sequence name in each header
zcat Mus_musculus.GRCm38.dna.primary_assembly.fa.gz | \
    awk '/^>/ {
        print ">chr" substr($1, 2)
        next
    }
    { print }' > mm10.GRCm38.r102.chr.fa

# Generate chromosome sizes file (name <TAB> length)
# Lengths are summed line by line so a whole chromosome is never held in memory
awk '/^>/ {
    if (chr != "") print chr "\t" len
    chr = substr($1, 2)
    len = 0
    next
}
{ len += length($0) }
END {
    if (chr != "") print chr "\t" len
}' mm10.GRCm38.r102.chr.fa > mm10.GRCm38.r102.chr.sizes

# ==============================================================================
# Download and process annotation GTF file
# ==============================================================================
wget https://ftp.ensembl.org/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.gtf.gz
gunzip Mus_musculus.GRCm38.102.gtf.gz

# Add chr prefix to GTF (header lines starting with "#!" are dropped)
grep -v '^#!' Mus_musculus.GRCm38.102.gtf | \
    awk '{ print "chr" $0 }' > mm10.GRCm38.102.chr.gtf

# Extract gene information (chr, start, end, gene_id, length, strand, gene_name, gene_biotype)
# BED start is 0-based ($4 - 1); gene length is end - start + 1 in 1-based GTF coordinates
awk -F'\t' '$3 == "gene" {
    match($9, /gene_id "([^"]+)"/, a)
    gene_id = a[1]
    match($9, /gene_name "([^"]+)"/, b)
    gene_name = b[1]
    match($9, /gene_biotype "([^"]+)"/, c)
    gene_biotype = c[1]
    printf "%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n", \
        $1, $4-1, $5, gene_id, $5-$4+1, $7, gene_name, gene_biotype
}' mm10.GRCm38.102.chr.gtf > all_genes.55487.bed

# ==============================================================================
# Create TSS (Transcription Start Site) windows
# ==============================================================================

# TSS window: 1 kb upstream, 0.5 kb downstream (relative to the gene's strand)
awk -F'\t' '{
    if ($6 == "+") {
        tss = $2
        window_start = tss - 1000
        window_end = tss + 500
    } else {
        tss = $3
        window_start = tss - 500
        window_end = tss + 1000
    }

    if (window_start < 0) window_start = 0
    printf "%s\t%d\t%d\t%s\t%s\t%s\t%s\t%s\n", \
        $1, window_start, window_end, $4, $5, $6, $7, $8
}' all_genes.55487.bed > all_genes.55487_tss.1000up_500down.bed

# TSS ±2 kb window
awk -F'\t' '{
    if ($6 == "+") {
        tss = $2
    } else {
        tss = $3
    }
    window_start = tss - 2000
    window_end = tss + 2000
    if (window_start < 0) window_start = 0
    printf "%s\t%d\t%d\t%s\t%s\t%s\t%s\t%s\n", \
        $1, window_start, window_end, $4, $5, $6, $7, $8
}' all_genes.55487.bed > all_genes.55487_tss.2000bp.bed

# TSS ±5 kb window
awk -F'\t' '{
    if ($6 == "+") {
        tss = $2
    } else {
        tss = $3
    }
    window_start = tss - 5000
    window_end = tss + 5000
    if (window_start < 0) window_start = 0
    printf "%s\t%d\t%d\t%s\t%s\t%s\t%s\t%s\n", \
        $1, window_start, window_end, $4, $5, $6, $7, $8
}' all_genes.55487.bed > all_genes.55487_tss.5000bp.bed

# Extract protein-coding genes only
for f in *tss*bed; do
    awk -F'\t' '$8 == "protein_coding"' "$f" > "${f}_pc"
done
```
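Once the files exist, the chromosome sizes file can be used for the coordinate validation mentioned in step 1. The sketch below runs on invented toy data and uses hypothetical file names. It flags intervals whose chromosome name is missing from the sizes file and intervals that extend past the chromosome end, which can happen because the TSS windows are only clamped at zero.

```shell
set -e
workdir=$(mktemp -d)   # hypothetical scratch directory

# Toy chromosome sizes and toy intervals (all values invented)
printf 'chr1\t1000\nchrMT\t50\n' > "$workdir/toy.sizes"
printf 'chr1\t0\t600\tg1\nchr1\t900\t1200\tg2\nchrM\t0\t10\tg3\n' > "$workdir/toy.bed"

awk -F'\t' -v OFS='\t' '
    NR == FNR { size[$1] = $2; next }                      # first file: load chromosome sizes
    !($1 in size) { print $1, $2, $3, $4, "unknown_chrom"; next }
    $3 + 0 > size[$1] + 0 { print $1, $2, $3, $4, "end_exceeds_chrom_size" }
' "$workdir/toy.sizes" "$workdir/toy.bed" > "$workdir/problems.tsv"

cat "$workdir/problems.tsv"
```

An empty `problems.tsv` suggests that the naming convention and coordinates agree between the two files. Any reported line points to either a prefix mismatch (such as `chrM` versus `chrMT`) or a window that should be trimmed to the chromosome length.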