Abstract

Next-generation sequencing projects have underappreciated information management tasks requiring detailed attention to specimen curation, nucleic acid sample preparation and sequence production methods required for downstream data processing, comparison, interpretation, sharing and reuse. The few existing metadata management tools for genome-based studies provide loose curatorial frameworks for experimentalists to store and manage idiosyncratic, project-specific information, typically offering no automation supporting unified naming and numbering conventions for large numbers of samples. Moreover, existing tools are not readily interfaced with bioinformatics executables (e.g., BLAST, Bowtie2, custom pipelines).

Our application, the Omics Metadata Management Software (OMMS), answers both needs, empowering experimentalists to generate intuitive, consistent metadata, and to perform bioinformatics analyses and information management tasks via an intuitive web-based interface. Several use cases with short-read sequence datasets are provided to showcase the full functionality of the OMMS, from metadata curation tasks, to bioinformatics analyses and results management and downloading.

The OMMS can be implemented as a standalone package for individual laboratories, or can be configured for web-based deployment supporting geographically dispersed projects. Our software was developed with open-source bundles, and is flexible, extensible, and easily installed and run.

Keywords: Bioinformatics, relational database management system, omics, next-generation sequencing, biological curation, open-source programs

Omics Metadata Management Software

1 The University of New Mexico, Computer Science Department

2 Sandia National Laboratories, Systems Advanced Concepts Engineering

Martha Perez-Arriaga1, Amy J. Powell2

Last updated by AJP on October 5, 2014

I. Background

Our software supports scalable, semi-automated management of information associated with specimens (e.g., host tissue, microhabitat), sample preparation (e.g., methods for nucleic acid isolation and handling) and sequencing (e.g., sequencing platform and provider, read characteristics, library preparation method, data file names) in next-generation sequencing projects, where a host (animal or plant) is central to experimental design. This package was developed in Linux environments [CentOS release 5.8 (Final) and Red Hat Enterprise Linux Workstation release 6.5 (Santiago)] for the RAPid Threat ORganism Recognition (RAPTOR) Grand Challenge at Sandia National Laboratories. The OMMS can be integrated with state-of-the-art bioinformatics utilities, such as BLAST and programs in the Tuxedo Suite.

II. Function

This application enables user-driven management of specimen, sample and sequence-production metainformation. The main functionality of our software resides in three tables: “Specimen Info,” “Sample Processing” and “Sequence Metainfo.” Each of these tables facilitates advanced experiment curation by generating unique identifiers and directory names for information entered by users, providing a web-based platform for entry and storage of project-specific information in consistent, persistent defined data structures. At its core, the OMMS:

  1. Instantiates a Linux-based file system and SQL relational database structures;
  2. Generates spreadsheets within and across each of the tables;
  3. Supports uploading/downloading of metadata, input and output;
  4. Integrates with state-of-the-art bioinformatics tools, allowing defined thresholds, and tailoring output formats of integrated utilities;
  5. Stores results files (i.e., standard output and error from integrated executables).

III. Required Base Software & Integrating Bioinformatics Tools

The LAMP bundle (Linux, Apache, MySQL, PHP) was used to develop and test the OMMS; Red Hat Enterprise Linux Workstation release 6.5 (Santiago) was employed in the final stages, and CentOS in initial phases. We thus expect this distribution will run in diverse Linux environments without extensive cross-platform customization. The yum package management utility was used to install the base software packages (items 1-5 below) in our final Red Hat Linux testing environment. Hardware development specifications were: Intel Core i7, CPU Q840 @ 1.87 GHz, 8 GB RAM. These base software packages must be installed and configured BEFORE running the OMMS:

  1. httpd-2.2.15-31 (This is server software from the Apache HTTP Server Project); (http://httpd.apache.org/download.cgi)
  2. firefox-24.8.0 (This is browser software); (http://www.mozilla.org/en-US/firefox/new/)
  3. mysql-5.1.73 (This is relational database server software); (http://dev.mysql.com/downloads/mysql/5.1.html)
  4. php-5.3.3 (cli) (built: July 15, 2014) (This is a server-side scripting language designed for web development); (http://us.php.net/releases/index.php)
  5. phpMyAdmin-4.0.10.3 (This is a PHP-based tool facilitating MySQL administration over the web); (http://www.phpmyadmin.net/home_page/index.php)
For integrated bioinformatics functionality, install and configure the following tools (in accordance with native documentation) AFTER installation of the required base software (detailed above):
  1. ncbi-blast-2.2.27+ (http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE_TYPE=BlastNews; and relevant databases)
  2. bowtie2-2.0.0 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml; and desired indices)
  3. tophat-2.0.5 (http://ccb.jhu.edu/software/tophat/index.shtml )
  4. cufflinks-2.0.2 (http://cufflinks.cbcb.umd.edu/)
  5. samtools-0.1.18 (http://samtools.sourceforge.net/)
  6. boost_1_51_0 C++ Libraries (http://www.boost.org/users/history/version_1_51_0.html)
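
Before running the installer, it can help to confirm that the prerequisites above are actually on the PATH. The following is a minimal sketch, not part of the OMMS distribution: the function name `missing_tools` is hypothetical, and the command names in the final line are assumptions based on the executables these packages usually provide (they may differ on your system).

```shell
#!/bin/sh
# Print the name of every listed command that is NOT found on PATH.
# (missing_tools is a hypothetical helper, not shipped with the OMMS.)
missing_tools() {
    for tool in "$@"; do
        # "command -v" exits non-zero when the name cannot be resolved
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# Assumed executable names for the base software and integrated tools.
missing_tools httpd mysql php firefox blastn bowtie2 tophat cufflinks samtools
```

An empty result suggests every listed command was found; any name printed should be installed (or added to the PATH) first. The check says nothing about versions or about server daemons being configured.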

IV. OMMS Installation And Configuring

Nota bene: Linux commands given in parentheses were executed in CentOS release 5.9 (Final) and Red Hat Enterprise Linux Workstation release 6.5 (Santiago). The “$” prompt indicates the standard user mode, while the “#” connotes root-user (super-user) privileges. Enter required locations (on the file system) for the OMMS-related bioinformatics databases (detailed below).

  1. After downloading the OMMS to your Desktop, open a terminal window, su to root, and copy the tar archive to /usr/local/bin from the Desktop (#cp ~/Desktop/OMMS_0.9.1_tar.gz /usr/local/bin/). The expected OMMS file size is 16MB (#ls -lah).
  2. Verify that the package is in /usr/local/bin, and extract the OMMS archive (#tar xvzf OMMS_0.9.1_tar.gz).
  3. Change directory to OMMS (#cd OMMS), and get a directory listing (#ls -lah) to confirm the contents of the extracted archive; assuming successful extraction, the following items will be present: DBOMMS, installer.sh script, LICENSE, OMMS, README, TestDataSets and uninstaller.sh script.
  4. Run the installer (# ./installer.sh), and either accept or change the defaults accordingly, entering the desired directory locations, tailored according to the partitioning and directory structures of the particular host.
  5. Enter the MySQL directory to access the database DBOMMS. The default location is /var/lib/mysql/.
  6. Enter the user to access databases in MySQL; the OMMS default is < root >.
  7. Enter the password for the user to access databases in MySQL; here we suggest entering no password at all (just hit the enter key at this step), and if desired, returning later for advanced configuring in accordance with MySQL documentation. These configuration data will be stored in #/var/www/html/OMMS/config.inc.
  8. Enter the localhost; the OMMS default is < localhost >; this information will be stored in #/var/www/html/OMMS/configUser.inc, and ensures that authorized users are actively logged into the software.
  9. Enter the location for the Analysis Directory; on our system: /data1/analysis/ (this location will contain the analysis input and output files; you must manually enter your preferred path at this step).
  10. Enter the location for Sequences Directory; on our system: /data1/seqs/ (this location will contain the sequence input files; you must manually enter your preferred path at this step).
  11. Enter the location of BLAST databases; on our system (you must manually enter your preferred path at this step): /usr/bin/blast/db/.
  12. Enter the location for Bowtie 2 indices; on our system: /usr/bin/bowtie2/index/ (again, details for these, and similar configuration steps can be found in the #/var/www/html/OMMS/configDir.inc file. This configuration information is crucial for the proper functioning of the Analysis and Results portals).
  13. The OMMS installer will scan for bioinformatics utilities that are to be integrated with the Analysis portal (BLAST, Bowtie 2, TopHat, Cufflinks, etc.).
  14. Start the Apache (httpd) and MySQL (mysqld) server daemons (#service httpd start; # service mysqld start).
  15. Start MySQL (#mysql) and list the databases (mysql> show databases;). Use the “DBOMMS” database (mysql> use DBOMMS;). List the tables in this database (mysql> show tables;). Populate the table “user” in the database DBOMMS for each user of the OMMS (mysql> INSERT INTO DBOMMS.user VALUES ('name1', MD5('password1'));). Alternatively, in phpMyAdmin, select the following: database: DBOMMS; table: “user.” To finish, select “Insert,” enter the desired username (e.g., “name1”), and choose the MD5 hash function to enter a password (e.g., “password1”).
  16. Verify that the OMMS web server is running by pointing a Mozilla Firefox browser to http://localhost/OMMS.
  17. The OMMS is now ready for use at http://localhost/OMMS/. Please note that the OMMS can be tailored further and extended. Instructions given here are for a “quick start” generic installation and configuration.
  18. Verify successful installation of the OMMS by confirming the presence of the OMMS in the default path of the Apache HTTP Server software (#cd /var/www/html/OMMS).
  19. For further details, please refer to “THE PUBLICATION REFERENCE HERE” (insert upon acceptance).
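
When several OMMS users must be added in step 15, the INSERT statement can be generated rather than typed by hand. The sketch below is only an illustration: the function name `omms_user_sql` is hypothetical, and it assumes the “user” table takes exactly the two values shown in step 15 (user name and MD5-hashed password). It refuses quotes and backslashes instead of attempting SQL escaping.

```shell
#!/bin/sh
# Emit one INSERT statement for the DBOMMS "user" table.
# (omms_user_sql is a hypothetical helper; the two-value layout is
# taken from the example statement in step 15.)
omms_user_sql() {
    name=$1
    password=$2
    # Reject single quotes and backslashes rather than try to escape them.
    case "$name$password" in
        *"'"*|*'\'*)
            echo "unsupported character in name or password" >&2
            return 1
            ;;
    esac
    printf "INSERT INTO DBOMMS.user VALUES ('%s', MD5('%s'));\n" "$name" "$password"
}

# Prints the statement only; nothing is sent to MySQL here.
omms_user_sql name1 password1
```

The printed statement can be pasted at the mysql> prompt, or piped to the client (for example, `omms_user_sql name1 password1 | mysql -u root`), assuming the MySQL user and empty password chosen in steps 6 and 7.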

V. Uninstalling The OMMS

  1. Verify the existence of the uninstaller script in the OMMS distribution (#ls -lah /usr/local/bin/OMMS/uninstaller.sh).
  2. Run the uninstaller (# ./uninstaller.sh) to remove the OMMS from the file system.

VI. Using The OMMS With Integrated Bioinformatics Tools

The OMMS is point-and-click intuitive. Once the user is logged in, the OMMS displays the main page with the three portals: “MetaData,” “Analysis” and “Results.” The “MetaData” portal contains the “Specimen Info,” “Sample Processing” and “Sequence MetaInfo” tables. Each of these tables allows de novo creation and/or update of a metadata entry, as well as viewing information in any of the tables (using the “Consult” function). Spreadsheets can be generated for all samples and/or individual entries while in the “Consult” mode.

The Analysis portal has the following options: “Select Input,” “BLAST,” “Bowtie2” and “TopHat.” These programs can be accessed via the “Sequence MetaInfo” table. “Select Input” points to a sequence flat file (usually fastq or fasta format) that will be used as standard input in the automatically-generated analysis directory. After the input file is selected, the user can avail themselves of any of the options in the “Analysis” portal.

For BLAST and Bowtie 2 analyses, the appropriate databases and indices must be built. The OMMS supports the blastn, blastx and blastp programs, as well as the selection of database, E-value threshold, and various output formats. The OMMS implementation of Bowtie 2 uses default parameters (in the -n alignment mode) to generate standard output and error (i.e., timing data for a run, error and sam files). In non-default cases, users can select the number of processors allocated for a job. Our use of the TopHat program allows selection of index and subsequent Cufflinks analysis using the “accepted_hits.bam” file (from TopHat).
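
The databases and indices mentioned above are built with the tools' own utilities, outside the OMMS. As a rough sketch, the function below (hypothetical name `build_reference`) wraps the standard `makeblastdb` and `bowtie2-build` commands; the FASTA file name and index name in the example call are placeholders, and the output directories are the example locations entered during installation. With `DRY_RUN` set, the commands are printed rather than executed.

```shell
#!/bin/sh
# Build a nucleotide BLAST database and a Bowtie 2 index from one
# reference FASTA file. (build_reference is a hypothetical helper.)
build_reference() {
    fasta=$1      # reference sequences in FASTA format
    name=$2       # base name for the database and the index
    blast_dir=$3  # BLAST database location given to the installer
    bt2_dir=$4    # Bowtie 2 index location given to the installer
    run=${DRY_RUN:+echo}   # when DRY_RUN is set, print instead of run
    # Use "-dbtype prot" instead for protein databases (blastx, blastp).
    $run makeblastdb -in "$fasta" -dbtype nucl -out "$blast_dir/$name"
    $run bowtie2-build "$fasta" "$bt2_dir/$name"
}

# Dry run with placeholder names; prints the two commands.
(DRY_RUN=1; build_reference reference.fa myref /usr/bin/blast/db /usr/bin/bowtie2/index)
```

Unset `DRY_RUN` to run the builds for real; writing to the directories shown will typically require root privileges.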

The “Results” portal is used to access standard output from any of the integrated executables (e.g., BLAST, Bowtie 2). After entering the “Results” portal, the OMMS displays the “Analysis Type” and “Configuration” table, where a “Sequence Run ID” entry is associated with all analyses performed on that dataset. All output associated with a “Sequence Run ID” can be downloaded by clicking the results set of interest.

VII. Downloading Results

To download results files (from BLAST, Bowtie 2, etc.), click on the “Results” portal on the main page, and select the “Sequence Run ID” and analysis number of interest. To download a spreadsheet summary of entries in a table (“Specimen Info,” “Sample Processing,” “Sequence MetaInfo”), click on the “MetaData” portal on the main page, and select the table of interest (e.g., “Specimen Info”) and click on “Consult.” While in “Consult” mode, select “Spreadsheet for ‘Foo’ Table,” opening and saving as a tab-delimited file.

To download the full history of an entry, click on the “MetaData” portal and choose the table of interest (e.g., “Specimen Info”), and select “Consult.” While in “Consult” mode, select the relevant “Specimen UID,” and click on the “Spreadsheet” to generate a full history. The downloaded history can be opened, read and locally saved as a tab-delimited file.

To mix and match fields across tables (“Specimen Info,” “Sample Processing,” “Sequence MetaInfo” that are accessed via the “MetaData” portal), first select the “Sequence MetaInfo” table, and click on “Consult.” The user will see the “Spreadsheet from Specimen Info, Sample Processing and Sequence MetaInfo” clickable options. After clicking the desired option, tailor the spreadsheet by choosing the fields of interest, and download locally as a tab-delimited file.

To download metadata-associated notes (in the “Specimen Info,” “Sample Processing,” “Sequence MetaInfo” tables), click on the “MetaData” portal on the main page, and choose the table of interest (e.g., “Specimen Info”) and click on “Consult.” While in “Consult” mode, select the unique identifier (e.g., “Specimen UID”), under “Experimentalist Notes” click on the text file name to download.
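
Because the downloaded spreadsheets are tab-delimited text, they can be post-processed with standard command-line tools. The sketch below pulls one named column out of such a file; the function name `tsv_column` is hypothetical, and it assumes the first line of the export is a header row whose labels match the field names shown in the interface (e.g., “Specimen UID”), which should be checked against an actual download.

```shell
#!/bin/sh
# Print the values of one named column from a tab-delimited file whose
# first line is assumed to be a header row.
# (tsv_column is a hypothetical helper, not part of the OMMS.)
tsv_column() {
    column=$1
    file=$2
    awk -F '\t' -v want="$column" '
        # Header row: remember the position of the requested column.
        NR == 1 { for (i = 1; i <= NF; i++) if ($i == want) idx = i; next }
        # Data rows: print that column, if it was found.
        idx     { print $idx }
    ' "$file"
}
```

For example, `tsv_column "Specimen UID" specimen_info.txt` would list the identifiers in a saved “Specimen Info” spreadsheet (the file name here is a placeholder). If the requested column is not present, nothing is printed.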

VIII. Test Datasets

To validate OMMS integrated function with Bowtie2 and BLAST, example input sequence files were derived from larger read sets from a Human Microbiome study (GenBank accession number SRX025177). Fastq- and fasta-formatted sequences were sampled from a short read archive file (SRR063480; Illumina reads from a human microbiome stool sample). Sequence files were extracted using the NCBI SRA Toolkit (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software), and the data were processed using in-house and Fastx tools to: 1) convert the sra file (using the “fastq-dump” tool from the SRA Toolkit) to separate mate pairs in fastq-formatted files; 2) quality filter sequences (using the Fastx tool “fastq_quality_filter,” with a minimum quality score of 20, and at least 65% of bases in a read with this minimum quality score); 3) sub-sample the larger read files. These test datasets (sequence files) are included in the distribution (in the “Test_DataSets” directory) to facilitate testing of Tuxedo Suite programs and BLAST:

  1. “HsStoo_onePerc_Mate1.fq”; fastq-formatted file of mate pair 1 (representing 1% of the quality filtered SRR063480 read archive) to test Bowtie2 in single-end mode;
  2. “HsStoo_Mate1.fq,” “HsStoo_Mate2.fq”; fastq-formatted files of corresponding mate pairs (representing the first 1000 mate pair sequences of the SRR063480 read archive) to test Bowtie2 in paired-end mode;
  3. “HsStoo_BLAST500.fa”; fasta-formatted file of 500 reads randomly sampled from mate pair 1 file (from SRR063480) to test BLAST.

To validate the OMMS interoperability with TopHat and Cufflinks, one percent of fastq-formatted reads from an RNASeq file (SRR023838) from a recent study (SRP001119) were subsampled using in-house tools. The resulting dataset is:

  1. “SRR023838_RNASeq.fq” (fastq-formatted RNASeq data).
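
The in-house sub-sampling tools are not part of this distribution. As an illustration only, taking the first N records of a fastq file (as was done for the first 1000 mate pairs in the paired-end test files) can be approximated with standard utilities, assuming the common four-lines-per-record layout. The function name `first_n_reads` and all file names below are hypothetical, and the commented commands show typical invocations of the tools named above rather than the exact commands used to build the test datasets.

```shell
#!/bin/sh
# Typical invocations of the external tools (not executed here):
#   fastq-dump --split-files SRR063480.sra
#   fastq_quality_filter -q 20 -p 65 -i SRR063480_1.fastq -o Mate1_filtered.fq

# Keep the first N records of a fastq file, assuming four lines per
# record (no wrapped sequence or quality lines).
# (first_n_reads is a hypothetical helper, not an OMMS tool.)
first_n_reads() {
    n=$1
    fastq=$2
    head -n "$(( $n * 4 ))" "$fastq"
}
```

For instance, `first_n_reads 1000 Mate1_filtered.fq > Mate1_first1000.fq`, repeated on the second mate file, keeps corresponding pairs only if both files list the reads in the same order.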

The human microbiome “HsStoo” example input sequence files were used to test our implementation of Bowtie 2 with an index built from Staphylococcus aureus subsp. aureus N315 (RefSeq NC_002745.2) and BLAST with a database built from Clostridium kluyveri NBRC 12016 (RefSeq NC_011837.1). To validate interoperability of the OMMS with TopHat and Cufflinks (using the “SRR023838_RNASeq.fq” input), the UCSC distribution of the reference human genome (hg19) was used as an index (ftp://igenome:G3nom3s4u@ussd-ftp.illumina.com/Homo_sapiens/UCSC/hg19/Homo_sapiens_UCSC_hg19.tar.gz).

IX. Support

Our software is an open-source database schema designed to support metadata curation tasks for next-generation sequencing projects, and should not be regarded as a complete, commercial-grade, bug-free application. Please contact Martha or me to make suggestions or report bugs. Enjoy!

X. Acknowledgements

Sincerest thanks to the following people and R & D entities for assisting with the development and/or testing of the OMMS:

  • Professor Susan Atlas (The Center for Advanced Research Computing, The University of New Mexico)
  • Mr. Steven C. Arroyo (Sandia National Laboratories)
  • Professor Gavin C. Conant (Division of Animal Sciences; Informatics Institute, The University of Missouri-Columbia)
  • Dr. Michael W. Folsom (Sandia National Laboratories)
  • Professor Melanie Moses (Department of Computer Science, University of New Mexico)
  • Professor Donald O. Natvig (Department of Biology, The University of New Mexico)
  • Mr. Brian D. Nelson (Network-Centric Security Systems Design, Sandia National Laboratories)
  • Ms. Susan Wilson (The Center for Advanced Research Computing, The University of New Mexico)
  • The University of New Mexico Center for Advanced Research Computing (https://www.hpc.unm.edu/)