This page contains the scripts written for the paper cited below. To cite this work, please use the following reference:

Minot, S., Wu, G., Lewis, J., Bushman, F.D. (2012) Conservation of Gene Cassettes Among Diverse Viruses of the Human Gut. PLoS One


A BASH script that will iteratively assemble a set of paired-end metagenomic reads

Download script

This script requires three command-line arguments, in this order: the paired-end read 1 file, the paired-end read 2 file, and the name of the output folder. They are specified as follows:

>bash iterative_assembly.sh pe_read1.fastq pe_read2.fastq outputname

This script also requires that the following programs are installed and available on the PATH:

It also requires that a set of supporting scripts be present in (or linked into) the folder where the assembly is run, that is, the same folder that contains iterative_assembly.sh or a link to it.
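Both requirements above can be checked before starting a long assembly. The following is a minimal Python sketch, not part of the original scripts: the function names missing_programs and missing_scripts are hypothetical, and the program and script names passed to them are placeholders that must be replaced with the ones iterative_assembly.sh actually needs.

```python
import os
import shutil


def missing_programs(names):
    """Return the program names that cannot be found on the PATH."""
    return [name for name in names if shutil.which(name) is None]


def missing_scripts(folder, names):
    """Return the file names that are neither present in nor linked into folder."""
    return [name for name in names
            if not os.path.exists(os.path.join(folder, name))]


if __name__ == "__main__":
    # Placeholder lists (assumptions): fill in the real program and
    # supporting-script names required by iterative_assembly.sh.
    required_programs = ["bash"]
    required_scripts = ["iterative_assembly.sh"]

    for label, missing in (
        ("Not found on PATH", missing_programs(required_programs)),
        ("Not found in current folder", missing_scripts(".", required_scripts)),
    ):
        if missing:
            print(label + ": " + ", ".join(missing))
```

Running this from the assembly folder prints one line per kind of missing dependency and prints nothing when everything in the two lists is found.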

Protein Cassette Discovery

Unless otherwise noted, all of the BASH wrappers used below take a single input, passed as the first positional argument ($1).

  1. If it does not already exist, make a table with the length of each contig from which the ORFs will be predicted. The python script fasta_name_len_table.py used above is appropriate for this. The table must be tab-delimited, with the contig name in the first column and the length in the third column.

  2. Once glimmer is installed and tigr-glimmer is on your PATH, use glimmer-wrapper.sh to run glimmer. The wrapper uses the translateFasta.R script, so you will need R installed. Place translateFasta.R in a folder of your choosing and update glimmer-wrapper.sh to reflect this location.

    1. This will make a file with the ORF sequences: *.fastp

    2. And a file with the location of those ORFs on each contig: *.predict.formatted

  3. With ORFs in hand, group them into clusters using uclust.R

    1. This will make a file with the cluster into which each ORF has been placed: *.clstr.tsv

  4. The following files will have been generated:

    1. A table with the name and length of each contig, for example test.length.table

    2. A table with the location of each ORF on each contig, for example test.predict.formatted

    3. A table with the cluster that each ORF has been assigned to, for example test.clstr.tsv

  5. To discover protein cassettes from the clustered ORFs, run protein_cassette.R. It requires the files listed above, along with a base name for the output files (for example outfp), specified in the following manner.

    1. From within R:

      module.wrapper(len.table='test.length.table', cluster.table='test.clstr.tsv', orf.pos='test.predict.formatted', fo.base='outfp')
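
Before calling module.wrapper, it can help to confirm that the length table matches the format described in step 1 (tab-delimited, contig name in the first column, length in the third). The following is a minimal Python sketch of such a check, not part of the original scripts: the function name read_length_table is hypothetical, and the sketch assumes the table has no header row and ignores the contents of the second column, which this page does not describe.

```python
import sys


def read_length_table(path):
    """Parse a contig length table into a {contig_name: length} dict.

    Raises ValueError if a line has fewer than three tab-separated
    columns or if the third column is not an integer.
    Assumption: the table has no header row.
    """
    lengths = {}
    with open(path) as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            fields = line.split("\t")
            if len(fields) < 3:
                raise ValueError(
                    "line %d: expected at least 3 tab-separated columns, found %d"
                    % (line_number, len(fields))
                )
            try:
                length = int(fields[2])
            except ValueError:
                raise ValueError(
                    "line %d: third column is not an integer: %r"
                    % (line_number, fields[2])
                )
            lengths[fields[0]] = length
    return lengths


if __name__ == "__main__" and len(sys.argv) == 2:
    table = read_length_table(sys.argv[1])
    print("%d contigs read from %s" % (len(table), sys.argv[1]))
```

Saved under a name of your choosing, the sketch takes the table as its only argument, for example test.length.table, and reports either the number of contigs read or the first malformed line.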