Sequencing Reference

Basic Rules for References

All genomes and references go in the /public directory, which sits on a SATA SSD with 1 TB of capacity.
If you build an index, always put it in the corresponding directory.

E.g., if you build a hisat2 index from the mm10 genome, put the index folder under the hisat2 directory and name it mm10.
Always attach a file describing the parameters or the data source link (a shell script is preferred) in the same directory. Also reflect the key parameters briefly in the folder name.

E.g., if you build a STAR index for T2T with custom parameters, the folder name should look like T2T_CHM13v2.0_149.
After building an index or downloading a genome, please record it on this website. Then ask the administrator to change the directory owner to root and the permissions to 755.
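
The rules above can be sketched as a small helper. This is only an illustration under assumptions: the function name `stage_index` and the `PUBLIC_ROOT` variable are hypothetical and not part of the existing setup, and the ownership change is left as a comment because it needs administrator rights.

```shell
#!/bin/bash
# Hypothetical helper: stage_index TOOL NAME BUILD_SCRIPT
# Creates <root>/reference/TOOL/NAME and stores the build script next to the index.
PUBLIC_ROOT="${PUBLIC_ROOT:-/public}"   # assumption: can be overridden for a dry run

stage_index() {
    local tool="$1" name="$2" script="$3"
    local dest="${PUBLIC_ROOT}/reference/${tool}/${name}"
    mkdir -p "${dest}" || return 1
    # Keep the parameters / data source description together with the index
    cp "${script}" "${dest}/" || return 1
    echo "${dest}"
}

# Afterwards, the administrator would run something like:
#   chown -R root /public/reference/STAR/T2T_CHM13v2.0_149
#   chmod -R 755  /public/reference/STAR/T2T_CHM13v2.0_149
```

For example, `stage_index STAR T2T_CHM13v2.0_149 ./STAR_T2T_149.sh` would create the index folder and copy the (hypothetical) build script into it.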

Directory Structure

/public
├─ reference/
│  ├─ genomes
│  │  ├─ mm10
│  │  │  ├─ .fa.gz
│  │  │  ├─ .gtf.gz
│  │  │  ├─ RefSeq.bed
│  │  │  └─ RepeatMasker.bed
│  │  ├─ hg38
│  │  │  └─ ...
│  │  └─ ...
│  ├─ STAR
│  ├─ bwa
│  ├─ hisat2
│  ├─ bowtie2
│  ├─ cellranger
│  └─ rsem
└─ ...
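
A minimal sketch of how the skeleton shown above could be created on a fresh disk. The function name `make_reference_tree` is hypothetical; the subdirectory names are taken from the tree.

```shell
#!/bin/bash
# Hypothetical helper: make_reference_tree ROOT
# Creates ROOT/reference/{genomes,STAR,bwa,hisat2,bowtie2,cellranger,rsem}.
make_reference_tree() {
    local root="$1"
    local d
    for d in genomes STAR bwa hisat2 bowtie2 cellranger rsem; do
        mkdir -p "${root}/reference/${d}" || return 1
    done
}
```

Calling `make_reference_tree /public` is safe to repeat, since `mkdir -p` leaves existing directories untouched.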

Genome	.fa	.gtf	RefSeq	RepeatMasker	STAR
`T2T-CHM13v2.0`
`hg38`
`hg19`
`mm39`
`mm10`	✅	✅	✅	✅	✅
`mm9`
`rn6`

Common Storage Requirement

FASTA + GTF: 4G, STAR: 29G, bwa: 5G, bowtie2: 4G, cellranger: 15G, rsem: 1G
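
Because a single STAR index takes roughly 29G of a 1 TB disk, it may be worth checking free space before starting a build. The sketch below is an assumption about how one might do that with `df`; the function name `has_free_gib` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: has_free_gib DIR NEEDED_GIB
# Succeeds (exit 0) if the filesystem holding DIR has at least NEEDED_GIB GiB free.
has_free_gib() {
    local dir="$1" needed_gib="$2"
    local avail_kb
    # df -P prints POSIX single-line records; -k reports 1024-byte blocks;
    # column 4 of the data row is the available space.
    avail_kb="$(df -Pk "${dir}" | awk 'NR==2 {print $4}')"
    [ "${avail_kb}" -ge $(( needed_gib * 1024 * 1024 )) ]
}
```

A build script could then start with `has_free_gib /public 29 || { echo "Not enough space"; exit 1; }`.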

Download Genomes

Ensembl FTP site

download_mm10_ensembl.sh
#!/bin/bash

# mm10, Ensembl release-102
# https://www.ensembl.org/info/data/ftp/rsync.html
# Insert ensembl into the path after the domain name before you paste. Note for downloading from ensemblgenomes sites use ensemblgenomes rather than ensembl

TARGET_DIR= # (1)!

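# Abort early if TARGET_DIR was left empty
: "${TARGET_DIR:?Set TARGET_DIR before running this script}"
mkdir -p "${TARGET_DIR}"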

# DNA
rsync -avP rsync://ftp.ensembl.org/ensembl/pub/release-102/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.toplevel.fa.gz ${TARGET_DIR} # (2)!

# GTF
rsync -avP rsync://ftp.ensembl.org/ensembl/pub/release-102/gtf/mus_musculus/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gtf.gz ${TARGET_DIR}

(1) Where you wish to store files.
(2) Insert ensembl into the path after the domain name before you paste. Note: for downloading from ensemblgenomes sites, use ensemblgenomes rather than ensembl.
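
Interrupted transfers can leave truncated archives, so a quick integrity check after downloading may save a failed index build later. The sketch below uses `gzip -t`, which tests a compressed file without extracting it; the function name `check_gz` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: check_gz FILE...
# Returns non-zero if any of the given .gz files fails the gzip integrity test.
check_gz() {
    local f status=0
    for f in "$@"; do
        if gzip -t "${f}" 2>/dev/null; then
            echo "OK      ${f}"
        else
            echo "CORRUPT ${f}" >&2
            status=1
        fi
    done
    return "${status}"
}
```

For example, `check_gz "${TARGET_DIR}"/*.gz` could be appended to the end of the download script.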

Generate Reference Files

STAR

STAR_mm10_ensembl_149.sh
#!/bin/bash

# Hongjiang Liu 12/08/23
# mm10, Ensembl release-102
# STAR

STAR_OUTPUT_DIR=/public/reference/STAR/mm10_149
# Genome files live under reference/genomes (see Directory Structure)
GENOME_FILE=/public/reference/genomes/mm10/Mus_musculus.GRCm38.dna.toplevel.fa.gz
GTF_FILE=/public/reference/genomes/mm10/Mus_musculus.GRCm38.102.chr_patch_hapl_scaff.gtf.gz
# SJOB_OVERHANG should be `read length - 1`
SJOB_OVERHANG=149
# CPU Threads
THREAD=16

mkdir -p ${STAR_OUTPUT_DIR}

# Unzip files

echo "Unzipping files"
gzip -cd ${GENOME_FILE} > ${STAR_OUTPUT_DIR}/temp.fa
gzip -cd ${GTF_FILE} > ${STAR_OUTPUT_DIR}/temp.gtf

# Build the STAR genome index

STAR --runThreadN ${THREAD} \
--runMode genomeGenerate \
--genomeDir ${STAR_OUTPUT_DIR} \
--genomeFastaFiles ${STAR_OUTPUT_DIR}/temp.fa \
--sjdbGTFfile ${STAR_OUTPUT_DIR}/temp.gtf \
--sjdbOverhang ${SJOB_OVERHANG} \
--limitGenomeGenerateRAM -1

echo "Deleting temp files"
rm ${STAR_OUTPUT_DIR}/temp.fa
rm ${STAR_OUTPUT_DIR}/temp.gtf
echo "Finished"
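
After the build finishes, a quick sanity check of the output folder can catch a run that died partway through. The sketch below assumes that a completed `genomeGenerate` run leaves at least the files `Genome`, `SA`, `SAindex`, and `genomeParameters.txt` in the index directory; the function name `check_star_index` is hypothetical, and the check does not validate the contents of those files.

```shell
#!/bin/bash
# Hypothetical helper: check_star_index INDEX_DIR
# Fails if any of the core STAR index files is missing or empty.
check_star_index() {
    local dir="$1" f
    for f in Genome SA SAindex genomeParameters.txt; do
        if [ ! -s "${dir}/${f}" ]; then
            echo "Missing or empty: ${dir}/${f}" >&2
            return 1
        fi
    done
    echo "STAR index looks complete: ${dir}"
}
```

For example: `check_star_index /public/reference/STAR/mm10_149`.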