"Concise industry news from the US pharmaceutical industry..."
So Much Data… so Little Time

Incogen | www.incogen.com

Introduction

The Internet has put the combined efforts, experience and knowledge of entire research communities literally at the fingertips of scientists. These resources offer potentially enormous value toward critical life science research such as, for example, the identification of cancer biomarkers. However, they remain vastly underutilized because of a lack of tools and technologies that allow researchers easy access to those assets. Furthermore, information from these diverse resources is rarely – if ever – complete, curated, or semantically homogeneous with the information produced in-house by the researcher.

The U.S. spent more than $2.1 trillion on health care, or approximately $6,700 per person, in 2006 [1]. Health spending rose almost 9 percent in 2006 and continues to be a substantial burden on the federal government and individuals. As the cost of health care continues to rise, the need to reduce expenses through new approaches, such as system-based, integrative research and personalized medicine, is growing tremendously. However, these new approaches also bring with them vastly complex research challenges that must be overcome before their benefits can be realized.

To emphasize this point we can examine recent advances in genomics and proteomics. While these advances have greatly increased our understanding of the molecular basis for functions of organisms, researchers have discovered that the characterization of single genes or proteins has provided only limited insight and benefits toward early diagnoses, improved subtyping and prognoses, and treatment of diseases such as cancer.

In order to understand the intricate web of interactions that make up the biological functioning of life, we must try to decipher how a gene or protein fits into a dynamic environment with thousands of other genes and proteins. The interpretation of these dynamic systems is vastly more complex than, for example, sequencing the human genome, which is linear and, for the purpose of sequencing, static. A comprehensive understanding of biological phenomena that can ultimately lead to health care benefits can only be achieved through the melding of information and insights from technologies that characterize genes and proteins at many levels, such as sequence, transcription, regulation, structure, function, kinetics, and localization.

The collection, integration, and analysis of these diverse sets of data represent a substantial undertaking, both technical and managerial, that requires resources not available to most investigators. Therefore, most life science projects present researchers with the daunting task of collecting, integrating, and analyzing distributed, heterogeneous sets of data in various states of integrity and completeness, while often using inefficient, ad hoc methods that do not sufficiently address critical issues such as data integrity, completeness, or semantic heterogeneity.

Each of these issues presents an enormous and diverse challenge and is unlikely to be solved by a single company, institution, or even a consortium. As the field continues to move in a direction that requires effective integration of diverse knowledge bases and technologies, there is an increasing need for robust, sophisticated, and commercial-strength bioinformatics frameworks that enable teams of multidisciplinary researchers to meld their distinct expertise and apply their contributions efficiently toward a comprehensive understanding of biological phenomena. The report of the Working Group on Biomedical Computing of the Advisory Committee to the Director of the National Institutes of Health on the Biomedical Information Science and Technology Initiative (BISTI Report) states the fundamental objective as follows: “… biomedical researchers need, first of all, the expertise to marry information technology to biology in a productive way. New hardware and software will be needed, together with deepened support and collaboration from experts in allied fields. Inevitably, those needs will grow as biology moves increasingly from a bench-based to a computer-based science, as models replace some experiments and complement others…. The overarching need is for an intellectual fusion of biomedicine and information technology.”

Current Technology Landscape

Over the last decade, advances in technologies such as DNA sequencing and the measurement of gene and protein expression levels have resulted in an increase in data production that has easily outpaced the evolution of computational processing power predicted by Moore’s law. Furthermore, the combination of these data toward clinical applications, for example using protein mass spectrometry and gene expression microarray technologies for biomarker discovery, has extraordinary potential for diagnostic, prognostic, and treatment strategies for complex diseases such as cancer, which require sensitive diagnostic tools and flexible treatment strategies. Unfortunately, the mass spectrometry and microarray data analysis options currently available to researchers often require improvised combinations of tools provided by instrument manufacturers, third parties, and in-house development teams. These strategies are inefficient, inflexible and, in general, fall short of allowing researchers to fully exploit the data available to them. Furthermore, the lack of tools to meaningfully integrate heterogeneous data types has prevented researchers from efficiently linking genes, their products, and pathways. This integration is critical in attempting to elucidate the network of structural, regulatory, and dynamic interactions, thereby giving researchers the ability to obtain a more comprehensive understanding of the disease and, ultimately, to develop effective treatment strategies.

It is not surprising, therefore, that a great deal of effort (technical and marketing) has surrounded the field of data integration in life sciences in the last decade [2-9]. Consequently, our industry has witnessed the rise of many, and fall of most, approaches to data integration. Most of the early integration efforts have focused on integrating genomic sequence data and the associated metadata, such as annotations, results from various analyses, sample information, etc. More recent efforts have focused on integrating heterogeneous data types [10-15]. INCOGEN has closely followed the evolution of those approaches and has actively contributed to and led collaborative efforts toward data integration solutions [16-21]. During this time, we have identified three barriers that stand between scientists and their ability to derive the full benefits of their research:

Barrier 1: Lack of tools that combine data preprocessing with downstream analysis:

Currently, molecular profiling techniques are limited by 1) an inadequate understanding and control of the physical processes involved in source preparation and 2) the irreproducibility of the complex surface chemistry used to isolate the molecules of interest. These sample-induced variations may mask the expected correlation between the concentration of an analyte and its corresponding expression intensity. A comprehensive understanding of the technology and optimized data preprocessing are required to address these issues before they negatively impact subsequent analysis and results [22-24]. Unfortunately, analysis software provided by hardware manufacturers is often inadequate or inflexible and does not provide a means of developing a complete, optimized analysis workflow that includes signal processing, feature selection, and classification. Notably in the gene expression analysis community, several large-scale bioinformatics efforts (caArray [25], CGAP [26]) have been launched to provide database repositories with access to suites of high-level statistical analysis tools for pattern recognition and visualization (e.g., caWorkbench [27]) or the construction of flexible analysis pipelines (e.g., GenePattern [28]). However, despite the usefulness of these data management and analysis portals, they do not provide access to the low-level signal processing tools, which are usually only supplied with the hardware (e.g., ScanArrayExpress [29]) or wrapped into proprietary software (e.g., GenePix [30], ArrayVision [31]). During INCOGEN’s work toward the development of diagnostic classifiers for mass spectrometry and gene expression data, we demonstrated that it is critical to fully integrate low-level signal processing with high-level statistical inference methods to discover robust discriminating patterns and identify diagnostically significant variables, as well as to gain a more complete understanding of the system under study [22, 32-35].
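To make the idea of a single workflow spanning signal processing, feature selection, and classification concrete, the following is a minimal sketch in Python. It is an illustration only and does not represent INCOGEN's software or any of the tools cited above: the function names, the crude minimum-value baseline correction, the mean-difference feature ranking, and the nearest-centroid classifier are all simplifying assumptions chosen so that the example fits in a few lines.

```python
from statistics import mean


def preprocess(spectrum):
    """Low-level step (assumed): subtract a crude baseline (the minimum
    intensity) and scale the result to unit total intensity."""
    baseline = min(spectrum)
    shifted = [x - baseline for x in spectrum]
    total = sum(shifted) or 1.0  # avoid dividing by zero for flat spectra
    return [x / total for x in shifted]


def select_features(samples, labels, k):
    """Rank features by the absolute difference between the two class means
    and return the indices of the k highest-ranked features."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scores = []
    for j in range(len(samples[0])):
        a = mean(s[j] for s, y in zip(samples, labels) if y == classes[0])
        b = mean(s[j] for s, y in zip(samples, labels) if y == classes[1])
        scores.append((abs(a - b), j))
    return sorted(j for _, j in sorted(scores, reverse=True)[:k])


def fit_centroids(samples, labels, features):
    """Compute one centroid per class over the selected features."""
    centroids = {}
    for c in set(labels):
        rows = [[s[j] for j in features] for s, y in zip(samples, labels) if y == c]
        centroids[c] = [mean(col) for col in zip(*rows)]
    return centroids


def classify(sample, centroids, features):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    x = [sample[j] for j in features]
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def run_workflow(raw, labels, new_raw, k=2):
    """Chain the steps so new data receives the same preprocessing as training data."""
    processed = [preprocess(s) for s in raw]
    features = select_features(processed, labels, k)
    centroids = fit_centroids(processed, labels, features)
    return classify(preprocess(new_raw), centroids, features), features


if __name__ == "__main__":
    # Hypothetical spectra that differ in baseline offset as well as in signal.
    raw = [[10, 11, 10, 10], [20, 21, 20, 20], [10, 10, 14, 10], [30, 30, 34, 30]]
    labels = ["control", "control", "case", "case"]
    print(run_workflow(raw, labels, [50, 50, 53, 50]))
```

The point of chaining the steps in one function is that a change to the preprocessing is automatically propagated to feature selection and classification, which is difficult to guarantee when the low-level step lives in separate, closed vendor software.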

Barrier 2: Lack of tools to meaningfully integrate genomic and proteomic data:

Over the last decade, techniques for the independent analysis of genomic and proteomic data have been the focus of extensive research, and many analysis methods have been developed and applied to these areas. More recently, the need to meaningfully combine these types of information has been recognized as a critical step towards a comprehensive understanding of complex biological questions [36-43]. While it is becoming increasingly common for investigators to gather parallel measurements for complementary gene expression and protein data, the integration of these heterogeneous data sets for combined computational analysis has not often been attempted because of the lack of necessary tools. A recent example project conducted at INCOGEN involved the analysis of breast cancer data that was generated using both the mass spectrometry and microarray platforms [44]. The results from the pilot study have shown the significant amount of diagnostically relevant knowledge that can be gained from meaningfully combining data from the microarray and mass spectrometry platforms and the subsequent incorporation of biological knowledge from existing resources. However, many components of the analysis had to be performed manually because of a lack of adequate tools.
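As a rough illustration of what combining the two platforms involves at the most basic level, the Python sketch below joins per-sample gene expression and protein measurements on a shared sample identifier and reports the samples that appear on only one platform. The data layout, the function name, and the "gene:" and "protein:" prefixes are assumptions made for this example; they are not taken from the pilot study or from any INCOGEN tool.

```python
def integrate(expression, proteomics):
    """Join measurements from two platforms on sample ID.

    Both arguments map a sample ID to a dict of {feature name: value}.
    Returns (merged, unmatched): merged holds one combined record per sample
    measured on both platforms; unmatched lists the sample IDs found on only
    one platform, so incomplete data is made visible rather than dropped silently.
    """
    shared = sorted(expression.keys() & proteomics.keys())
    merged = {}
    for sample_id in shared:
        record = {f"gene:{name}": value for name, value in expression[sample_id].items()}
        record.update(
            {f"protein:{name}": value for name, value in proteomics[sample_id].items()}
        )
        merged[sample_id] = record
    unmatched = sorted(expression.keys() ^ proteomics.keys())
    return merged, unmatched


if __name__ == "__main__":
    # Hypothetical sample IDs and feature names.
    expression = {"S1": {"GENE_A": 2.5}, "S2": {"GENE_A": 1.0}}
    proteomics = {"S1": {"PEAK_1": 0.4}, "S3": {"PEAK_1": 0.9}}
    merged, unmatched = integrate(expression, proteomics)
    print(merged)
    print(unmatched)
```

Prefixing the feature names keeps a gene and a protein that share a name from overwriting each other in the combined record. A real integration would also have to reconcile identifiers, units, and annotations across platforms, which is the harder part of the problem described here.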

Barrier 3: Lack of tools to promote interdisciplinary research:

A comprehensive understanding of biological phenomena can only be achieved through the melding of information and insights from technologies that characterize genes and proteins at many levels, such as sequence, transcription, regulation, structure, function, kinetics, localization, etc. This integration of knowledge requires a departure from conventional approaches toward life science research and is only possible by combining state-of-the-art technologies and enabling knowledge exchange from traditionally divergent fields such as molecular biology, clinical research, computational science, physics, statistics, and hardware engineering [38]. Currently, a platform does not exist that allows researchers from multiple disciplines to contribute their knowledge toward a common research goal.

Future of Bioinformatics Software

The ideal solution to these barriers would require a statistically robust experimental design defined a priori, as well as reliable, interoperable, and user-friendly software and homogeneous formats for data representation and exchange. While this ideal is a laudable goal for the life science informatics community and has formed the basis for efforts such as the now-inactive Interoperable Informatics Infrastructure Consortium (I3C) and NCI’s cancer Biomedical Informatics Grid (caBIG), many of the required components do not yet exist and are likely to take years or even decades to develop or become established.

During the last decade, INCOGEN has played a leading role in providing tools that enable scientists to conduct effective research in the absence of the described “ideal” solution. Software such as GenePort [45] and VIBE [46] has been designed to help multidisciplinary teams of researchers meaningfully integrate their science and expertise toward achieving their research goals. To ensure that the tools are robust and useful for applications such as cancer diagnosis, the software development projects were carried out alongside the actual cancer research. This approach requires a level of coordination among researchers from many disciplines that is rarely achieved but essential. Furthermore, while most scientists agree that an integrated, systems approach is fundamentally necessary to fully understand the biological foundations of life, proponents of such approaches are often vague about how they will overcome the challenges of data overload and meaningful integration of large heterogeneous data sets. INCOGEN’s multidisciplinary partnerships and complementary expertise have proven successful in approaching these complex challenges. It is crucial that software companies follow similar approaches in the future, so that robust and effective software platforms can be developed. This combination of bioinformatics and clinical expertise is vital for the development of tools that meaningfully integrate data, knowledge, and expertise from various disciplines toward the systematic understanding of biology and, ultimately, widespread benefits in health care.

References

  • http://www.cms.hhs.gov/NationalHealthExpendData
  • www.netgenics.com.
  • www.accelrys.com.
  • www.lionbiosciences.com.
  • www.synergy.com.
  • http://www.avaki.com/solutions/solutions_life_sciences.html.
  • discoveryhub.html.
  • www.inforsenseinc.com.
  • http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+srsq2+-noSession.
  • M. Sasinowski, H.S., K. Miller, R. Castillo, D. Coppit, D.I. Malyarenko, M. Tracy, R.R. Drake, O.J. Semmes, A Visual Programming Platform for Diagnostic Workflows Based on a Combination of Protein and Gene Expression Profiles. Proceedings of the Early Detection Research Network Annual Meeting, 2008.
  • Huang, H., et al., Integration of bioinformatics resources for functional analysis of gene expression and proteomic data. Front Biosci, 2007. 12: p. 5071 - 5088.
  • Ng, A., et al., pSTIING: a 'systems' approach towards integrating signalling pathways, interaction and transcriptional regulatory networks in inflammation and cancer. Nucleic Acids Res, 2006. 1(34:D): p. 527 - 534.
  • Bauer, M. and M. Ueffing, Reverse genetics for proteomics: from proteomic discovery to scientific content. J Neural Transm, 2006. 8(113): p. 1033 - 1040.
  • Fagan, A., A.C. Culhane, and D.G. Higgins, A multivariate analysis approach to the integration of proteomics and gene expression data. Proteomics, 2007. 7(13): p. 2162 - 2171.
  • Waters, M., et al., CEBS--Chemical Effects in Biological Systems: a public data repository integrating study design and toxicity data with microarray and proteomics data. Nucleic Acids Res, 2008. 1(36:D): p. 892 - 900.
  • Sasinowski, M. Interoperable Informatics Infrastructure Consortium (I3C) Demonstration and Prototype Description. in Biotechnology Industry Organization (BIO 2001) Annual Convention. 2001.
  • Sasinowski, M. and E. Neumann. Towards Convergence Across the Life Sciences. in Intelligent Systems for Molecular Biology Meeting. 2001.
  • I3C, Interoperable Informatics Infrastructure Consortium (I3C) Road Map. 2002.
  • Lin, S.M., et al., What is mzXML Good For? Expert Review Proteomics, 2005.
  • Sasinowski, M., VIBE-MS: A visual programming software platform for creation, management, and optimization of mass spectrometry analysis workflows for cancer diagnosis. caBIG 2006 Annual Meeting, 2006.
  • M. Sasinowski, H.S., K. Miller, R. Castillo, D. Coppit, D.I. Malyarenko, M. Tracy, R.R. Drake, O.J. Semmes, A Visual Programming Platform for Diagnostic Workflows Based on a Combination of Protein and Gene Expression Profiles. Proceedings of the Early Detection Research Network Annual Meeting, 2008.
  • Malyarenko, D.I., et al., Enhancement of sensitivity and resolution of SELDI TOF-MS records for serum peptides using time series analysis techniques. Clin Chem, 2005. 51(1): p. 1-2.
  • Baggerly, K.A., J.S. Morris, and K.R. Coombes, Reproducibility of SELDI-TOF protein patterns in serum: comparing data sets from different experiments. Bioinf Adv Access, 2004(Jan): p. 1-9.
  • Morris, J.S., et al., Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics, 2005. 21(9): p. 1764-1775.
  • http://caarray.nci.nih.gov.
  • http://cgap.nci.nih.gov/.
  • http://amdec-bioinfo.cu-genome.org/html/caWorkBench.htm.
  • http://cabigcvs.nci.nih.gov/viewcvs/viewcvs.cgi/genepattern/.
  • P10361_ScanArray_Brochure.pdf.
  • http://www.moleculardevices.com/pages/software/gn_genepix_pro.html.
  • http://imagingresearch.com/products/ARV.asp.
  • Malyarenko, D.I., et al., Resampling and deconvolution of linear time-of-flight records over 100 kDa mass range for enhanced profiling of protein modifications. Rapid Commun Mass Spectrom, 2002.
  • Malyarenko, D.I., et al., Optimization of parameters for output-energy target filtering to enhance resolution of linear time-of-flight data from peptide profiling experiments. J Am Soc Mass Spectrom, 2005.
  • INCOGEN, NIH SBIR Phase II Final Report. INCOGEN Technical Report 3-31-07 2007.
  • Tracy, M.B., Chen, H., Weaver, D.M., Malyarenko, D.I., Sasinowski, M., Cazares, L.H., Drake, R.R., Semmes, J.O., Tracy, E.R., Cooke, W.E., Precision Enhancement of MALDI-TOF-MS Using High Resolution Peak Detection and Label-Free Alignment. Proteomics, 2008.
  • Kitano, H., Systems biology: a brief overview. Science, 2002. 295: p. 1662-1664.
  • Hood, L., Systems biology: integrating technology, biology, and computation. Mechanisms of Ageing and Development, 2003. 124: p. 9-16.
  • Zerhouni, E., The NIH Roadmap. Science, 2003. 302: p. 62-72.
  • Hanash, S., Integrated global profiling of cancer. Nature Reviews – Cancer, 2004. 4: p. 638-643.
  • Waters, K.M., J.G. Pounds, and B.D. Thrall, Data merging for integrated microarray and proteomics analysis. Brief Funct Genomic Proteomic, 2006. 12(5(4)): p. 261 - 272.
  • Ouzounian, M., et al., Predict, prevent and personalize: Genomic and proteomic approaches to cardiovascular medicine. Can J Cardiol, 2007. 8(23 Suppl A): p. 28A - 33A.
  • Ou, K., et al., Novel Breast Cancer Biomarkers Identified by Integrative Proteomic and Gene Expression Mapping. J Proteome Res, 2008. 3: p. Epub ahead of print.
  • Daemen, A., et al., Integrating microarray and proteomics data to predict the response on cetuximab in patients with rectal cancer. Pac Symp Biocomput, 2008: p. 166 - 177.
  • Sasinowski, M., Miller, K., Miller, J., Sasinowska, H., Castillo, R., Coppit, D., Malyarenko, D., Tracy, M., Chen, H., Tracy, E.R., Cooke, W.E., Manos, D.M., Bunai, T., Semmes, O.J., Drake, R.R., A Visual Programming Platform and Comprehensive Experimental Metadata Model for Diagnostic Workflows. Proceedings of the 56th ASMS Conference on Mass Spectrometry.
  • http://www.incogen.com/geneport
  • http://www.incogen.com/vibe
