Unified framework for metadata management and state-of-the-art analyses
Curation (aqua) and analysis (yellow) tasks are intrinsically related (overlap region) in next-generation sequencing studies: sample handling and sequence production are multistep processes, and careful metadata tracking and management are required for downstream analyses and publication preparation. The OMMS supports user input of project metadata, automated creation of consistently named and enumerated unique identifiers for specimens, samples and sequence production information, and straightforward integration with bioinformatics utilities. Spreadsheets can be generated for structured data extraction and local download. Standard input and output of the executables used here are stored in automatically generated files and directories.
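The naming scheme behind the enumerated unique identifiers is not described here. As a minimal sketch only, the following shows one way consistently named, enumerated identifiers could be produced for specimens, samples and sequence runs. The prefixes (SPC, SMP, SEQ), the zero-padded width and the `UIDRegistry` class are hypothetical and are not taken from the OMMS.

```python
from collections import defaultdict


class UIDRegistry:
    """Hypothetical generator of enumerated identifiers (not the OMMS scheme)."""

    def __init__(self, width=3):
        self.width = width
        # One counter per (kind, parent) pair, so numbering restarts under each parent.
        self._counters = defaultdict(int)

    def new_uid(self, kind, parent=None):
        self._counters[(kind, parent)] += 1
        label = f"{kind}{self._counters[(kind, parent)]:0{self.width}d}"
        # Child identifiers embed the parent identifier to keep the lineage visible.
        return f"{parent}-{label}" if parent else label


registry = UIDRegistry()
specimen_uid = registry.new_uid("SPC")
sample_uid_1 = registry.new_uid("SMP", parent=specimen_uid)
sample_uid_2 = registry.new_uid("SMP", parent=specimen_uid)
run_uid = registry.new_uid("SEQ", parent=sample_uid_1)
print(specimen_uid, sample_uid_1, sample_uid_2, run_uid)
```

Embedding the parent identifier in each child is one design choice among several; it makes the specimen-to-run lineage readable from the identifier alone, at the cost of longer names.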
The OMMS is point-and-click intuitive. Once the user is logged in, the OMMS displays the main page with the three portals: MetaData, Analysis and Results. The MetaData portal contains the Specimen Info, Sample Processing and Sequence MetaInfo tables. Each of these tables allows de novo creation and/or update of a metadata entry, as well as viewing information in any of the tables (using the Consult function). Spreadsheets can be generated for all samples and/or individual entries while in the Consult mode.
The Analysis portal offers four options: Select Input, BLAST, Bowtie 2 and TopHat. These are accessed via the Sequence MetaInfo table. Select Input points to a sequence flat file (fastq- or fasta-formatted) that will be used as standard input in the automatically generated analysis directory. Once the input file is selected, the user can run any of the programs in the Analysis portal.
For BLAST and Bowtie 2 analyses, the appropriate databases and indices must be built beforehand. The OMMS supports the blastn, blastx and blastp programs, as well as the selection of database, E-value and various output formats. The OMMS implementation of Bowtie 2 uses default parameters (in the -n alignment mode) to generate standard output and error (i.e., timing data for a run, error and sam files). In non-default cases, users can select the number of processors allocated for a run. Our use of the TopHat program allows selection of an index and subsequent Cufflinks analysis using the accepted_hits.bam file (from TopHat).
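For readers who want to reproduce these steps outside the interface, the sketch below assembles command lines for building a BLAST database and a Bowtie 2 index and for running both programs. It is not the OMMS code. The helper names, the database and index names (ckluyveri_db, saureus_n315), the reference file names and the output file names are assumptions made for illustration; the command-line flags are those of the standard BLAST+ and Bowtie 2 tools. The commands are only constructed, not executed.

```python
def build_reference_commands(fasta, name):
    """Commands to build a nucleotide BLAST database and a Bowtie 2 index."""
    makeblastdb = ["makeblastdb", "-in", fasta, "-dbtype", "nucl", "-out", name]
    bowtie2_build = ["bowtie2-build", fasta, name]
    return makeblastdb, bowtie2_build


def blast_command(program, query, db, evalue=1e-5, outfmt=6, out="blast_hits.tsv"):
    """Command for one of the three BLAST programs named in the text."""
    if program not in {"blastn", "blastx", "blastp"}:
        raise ValueError(f"unsupported BLAST program: {program}")
    return [program, "-query", query, "-db", db,
            "-evalue", str(evalue), "-outfmt", str(outfmt), "-out", out]


def bowtie2_command(index, reads, sam_out, mate2=None, threads=1):
    """Bowtie 2 command; single-end unless a second mate file is given."""
    cmd = ["bowtie2", "-p", str(threads), "-x", index]
    if mate2 is None:
        cmd += ["-U", reads]                 # unpaired reads
    else:
        cmd += ["-1", reads, "-2", mate2]    # paired-end reads
    return cmd + ["-S", sam_out]


# Hypothetical reference and output names; the input reads are the bundled test files.
db_cmd, _ = build_reference_commands("NC_011837.fa", "ckluyveri_db")
_, index_cmd = build_reference_commands("NC_002745.fa", "saureus_n315")
blast_cmd = blast_command("blastn", "HsStoo_BLAST500.fa", "ckluyveri_db")
paired_cmd = bowtie2_command("saureus_n315", "HsStoo_Mate1.fq", "paired.sam",
                             mate2="HsStoo_Mate2.fq", threads=4)
for cmd in (db_cmd, index_cmd, blast_cmd, paired_cmd):
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` on a machine where the corresponding tool is installed. Keeping the arguments as a list avoids shell quoting problems with file names.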
The Results portal is used to access standard output from the integrated executables (e.g., BLAST, Bowtie 2). After entering the Results portal, the OMMS displays the Analysis Type and Configuration table, where a Sequence Run ID entry is associated with all analyses performed with that dataset. Output associated with a Sequence Run ID can be downloaded by clicking the desired results set.
Downloading results and metadata files to generate custom tables and spreadsheets
To download results (from the integrated BLAST, Bowtie 2, etc. programs), click on the Results portal on the main page, and select the Sequence Run ID and desired analysis number. To download a spreadsheet summary of entries in a table (Specimen Info, Sample Processing, Sequence MetaInfo), click on the MetaData portal on the main page, select the table of interest (e.g., Specimen Info), and click on Consult. In Consult mode, select Spreadsheet for Foo Table (Foo standing for the selected table), then open and save the result as a tab-delimited file.
To download the full history of an entry, click on the MetaData portal, and choose the table of interest (e.g., Specimen Info), and select Consult. While in Consult mode, select the relevant Specimen UID, and click on Spreadsheet to generate a full history. The downloaded history can be opened, read and saved locally as a tab-delimited file.
To mix and match fields across tables (Specimen Info, Sample Processing and Sequence MetaInfo, all accessed via the MetaData portal), first select the Sequence MetaInfo table, and click Consult. The user will see the clickable Spreadsheet from Specimen Info, Sample Processing and Sequence MetaInfo options. After clicking the desired option, tailor the spreadsheet by choosing the fields of interest, and download it locally.
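Conceptually, mixing fields across the three tables amounts to following the links from a sequence run to its sample and specimen, then keeping only the chosen columns. The sketch below illustrates that idea with in-memory records and tab-delimited output. All field names, identifiers and values are invented for illustration, and the OMMS storage layer is not shown here.

```python
import csv
import io

# Hypothetical records standing in for the three metadata tables.
specimen_info = {"SPC001": {"organism": "Homo sapiens", "tissue": "stool"}}
sample_processing = {"SMP001": {"specimen_uid": "SPC001", "protocol": "DNA extraction"}}
sequence_metainfo = [
    {"sequence_run_id": "SEQ001", "sample_uid": "SMP001", "platform": "Illumina"},
]


def build_rows(fields):
    """Join run -> sample -> specimen and keep only the requested fields."""
    rows = []
    for run in sequence_metainfo:
        sample = sample_processing[run["sample_uid"]]
        specimen = specimen_info[sample["specimen_uid"]]
        merged = {**specimen, **sample, **run}
        rows.append({field: merged[field] for field in fields})
    return rows


def to_tsv(rows, fields):
    """Serialize rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


chosen_fields = ["sequence_run_id", "organism", "protocol"]
tsv_text = to_tsv(build_rows(chosen_fields), chosen_fields)
print(tsv_text, end="")
```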
To download metadata-associated notes (in the Specimen Info, Sample Processing, Sequence MetaInfo tables), click on the MetaData portal on the main page, and choose the table of interest (e.g., Specimen Info), and click on Consult. While in Consult mode, select the unique identifier (e.g., Specimen UID), and under Experimentalist Notes, click on the desired file.
Example workflows with integrated bioinformatics utilities
Users can validate the OMMS integration with Bowtie 2 and BLAST using example input sequence files (derived from larger read sets) from a Human Microbiome study (GenBank accession number: SRX025177). Fastq- and fasta-formatted sequences were sampled from a short read archive file (SRR063480; Illumina reads from a human microbiome stool sample). Sequence files were extracted using the NCBI SRA Toolkit (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software) and then processed using in-house and Fastx tools. The steps were: 1) convert the sra file (using the fastq-dump tool from the SRA Toolkit) to separate mate pairs in fastq-formatted files; 2) quality-filter the sequences (using the Fastx tool fastq_quality_filter, with a minimum quality score of 20 and at least 65% of bases in a read meeting this minimum); 3) sub-sample the larger read files. These test datasets (sequence files) are included in the distribution (in the Test_DataSets directory) to facilitate testing of the installation and workflow development (using the Tuxedo Suite and BLAST programs):
- HsStoo_onePerc_Mate1.fq; fastq-formatted file of mate pair 1 (representing 1% of the quality-filtered SRR063480 read archive) to test and run Bowtie 2 in single-end mode;
- HsStoo_Mate1.fq, HsStoo_Mate2.fq; fastq-formatted files of corresponding mate pairs (representing the first 1000 mate pair sequences of the SRR063480 read archive) to test and run Bowtie 2 in paired-end mode;
- HsStoo_BLAST500.fa; fasta-formatted file of 500 reads randomly sampled from the mate pair 1 file (from SRR063480) to test and run BLAST.
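The quality-filtering and sub-sampling steps used to prepare these files can be illustrated in a few lines. The sketch below is not the in-house tooling and does not call fastq_quality_filter; it re-implements the stated criterion (at least 65% of bases with quality 20 or higher) in plain Python. The Phred+33 quality encoding, the random seeds and the toy reads are assumptions made for illustration.

```python
import random


def parse_fastq(lines):
    """Yield (header, sequence, quality) tuples from 4-line fastq records."""
    it = iter(lines)
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header.strip(), seq.strip(), qual.strip()


def passes_filter(qual, min_q=20, min_percent=65, offset=33):
    """True if at least min_percent of bases have quality >= min_q (Phred+33 assumed)."""
    if not qual:
        return False
    good = sum(1 for ch in qual if ord(ch) - offset >= min_q)
    return good * 100 >= min_percent * len(qual)


def subsample_fraction(records, fraction, seed=0):
    """Keep each record independently with the given probability (e.g., 0.01 for 1%)."""
    rng = random.Random(seed)
    return [rec for rec in records if rng.random() < fraction]


def sample_fixed(records, k, seed=0):
    """Draw exactly k records at random without replacement (e.g., k=500)."""
    return random.Random(seed).sample(records, k)


def to_fasta(records):
    """Convert fastq records to fasta text, dropping the quality strings."""
    return "".join(f">{header[1:]}\n{seq}\n" for header, seq, _ in records)


# Toy reads: 'I' encodes quality 40 and '#' encodes quality 2 under Phred+33.
fastq_text = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\nII##\n@r3\nACGT\n+\nIII#\n"
records = list(parse_fastq(fastq_text.splitlines()))
kept = [rec for rec in records if passes_filter(rec[2])]
print(to_fasta(kept), end="")
```

In this toy input, r2 is discarded because only 50% of its bases reach quality 20, while r1 (100%) and r3 (75%) are retained.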
To validate the OMMS interoperability with the TopHat and Cufflinks programs for analyzing and assembling transcripts and estimating relative expression levels, one percent of the fastq-formatted reads from an RNA-Seq file (SRR023838) from a recent study (SRP001119) were subsampled using in-house tools. The resulting dataset is:
- SRR023838_RNASeq.fq (fastq-formatted RNA-Seq data).
The human microbiome HsStoo example input sequence files were used to test our implementation of Bowtie 2 with an index built from Staphylococcus aureus subsp. aureus N315 (RefSeq NC_002745.2) and BLAST with a database built from Clostridium kluyveri NBRC 12016 (RefSeq NC_011837.1). To verify interoperability of the OMMS with TopHat and Cufflinks (using the SRR023838_RNASeq.fq input), the UCSC distribution of the reference human genome (hg19) was used as an index (ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz).
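The TopHat-to-Cufflinks hand-off used in this test can be sketched as two chained commands, with Cufflinks reading the accepted_hits.bam file that TopHat writes to its output directory. The sketch below only constructs the command lines. The function name, the output directory layout, the thread count and the index path inside the unpacked hg19 archive are assumptions made for illustration and may differ from a given installation.

```python
import posixpath


def tophat_cufflinks_commands(index_base, reads, out_dir, threads=1):
    """Build a TopHat command and the Cufflinks command that consumes its output."""
    tophat_dir = posixpath.join(out_dir, "tophat")
    cufflinks_dir = posixpath.join(out_dir, "cufflinks")
    tophat_cmd = ["tophat", "-p", str(threads), "-o", tophat_dir, index_base, reads]
    # TopHat writes its alignments to accepted_hits.bam in its output directory.
    hits = posixpath.join(tophat_dir, "accepted_hits.bam")
    cufflinks_cmd = ["cufflinks", "-p", str(threads), "-o", cufflinks_dir, hits]
    return tophat_cmd, cufflinks_cmd


# Hypothetical index location after unpacking the hg19 archive.
hg19_index = "Homo_sapiens/UCSC/hg19/Sequence/Bowtie2Index/genome"
tophat_cmd, cufflinks_cmd = tophat_cufflinks_commands(
    hg19_index, "SRR023838_RNASeq.fq", "analysis_001", threads=2
)
print(" ".join(tophat_cmd))
print(" ".join(cufflinks_cmd))
```

The Cufflinks command should be run only after TopHat has finished successfully, for example by calling `subprocess.run(tophat_cmd, check=True)` before the second command.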