
Introduction
The Internet has put the combined efforts, experience, and knowledge of entire research communities at the fingertips of scientists. These resources offer potentially enormous value toward critical life science research, such as the identification of cancer biomarkers. However, they remain vastly underutilized because of a lack of tools and technologies that give researchers easy access to them. Furthermore, information from these diverse resources is rarely, if ever, complete, curated, or semantically homogeneous with the information produced in-house by the researcher.
The U.S. spent more than $2.1 trillion on health care, or approximately $6,700 per person, in 2006 [1]. Health spending rose almost 9 percent in 2006 and continues to be a substantial burden on the federal government and individuals. As the cost of health care continues to rise, the need for reducing expenses through new approaches, such as system-based, integrative research and personalized medicine, is growing tremendously. However, these new approaches also bring vastly complex research challenges that must be overcome before their benefits can be realized.
To emphasize this point we can examine recent advances in genomics and proteomics. While these advances have greatly increased our understanding of the molecular basis for functions of organisms, researchers have discovered that the characterization of single genes or proteins has provided only limited insight and benefits toward early diagnoses, improved subtyping and prognoses, and treatment of diseases such as cancer.
In order to understand the intricate web of interactions that make up the biological functioning of life, we must try to decipher how a gene or protein fits into a dynamic environment with thousands of other genes and proteins. The interpretation of these dynamic systems is vastly more complex than, for example, sequencing the human genome, which is linear and, for the purpose of sequencing, static. A comprehensive understanding of biological phenomena that can ultimately lead to health care benefits can only be achieved through the melding of information and insights from technologies that characterize genes and proteins at many levels, such as sequence, transcription, regulation, structure, function, kinetics, and localization.
The collection, integration, and analysis of these diverse sets of data represent a substantial undertaking, both technical and managerial, that requires resources not available to most investigators. Therefore, most life science projects present researchers with the daunting task of collecting, integrating, and analyzing distributed, heterogeneous sets of data in various states of integrity and completeness, while often using inefficient, ad hoc methods that do not sufficiently address critical issues such as data integrity, completeness, or semantic heterogeneity.
Each of these issues presents an enormous and diverse challenge and is unlikely to be solved by a single company, institution, or even a consortium. As the field continues to move in a direction that requires effective integration of diverse knowledge bases and technologies, there is an increasing need for robust, sophisticated, and commercial-strength bioinformatics frameworks that enable teams of multidisciplinary researchers to meld their distinct expertise and apply their contributions efficiently toward a comprehensive understanding of biological phenomena. The report of the Working Group on Biomedical Computing of the Advisory Committee to the Director of the National Institutes of Health on the Biomedical Information Science and Technology Initiative (BISTI Report) puts forward this fundamental objective: “… biomedical researchers need, first of all, the expertise to marry information technology to biology in a productive way. New hardware and software will be needed, together with deepened support and collaboration from experts in allied fields. Inevitably, those needs will grow as biology moves increasingly from a bench-based to a computer-based science, as models replace some experiments and complement others…. The overarching need is for an intellectual fusion of biomedicine and information technology.”
Current Technology Landscape
Over the last decade, advances in technologies such as DNA sequencing and the measurement of gene and protein expression levels have resulted in an increase in data production that has easily outpaced the evolution of computational processing power predicted by Moore’s law. Furthermore, the combination of these data toward clinical applications, for example using protein mass spectrometry and gene expression microarray technologies toward biomarker discovery, has extraordinary potential for diagnostic, prognostic, and treatment strategies for complex diseases such as cancer, which require sensitive diagnostic tools for prognosis and the development of flexible treatment strategies. Unfortunately, the mass spectrometry and microarray data analysis options currently available to researchers often require improvised combinations of tools provided by instrument manufacturers, third parties, and in-house development teams. These strategies are inefficient and inflexible and, in general, fall short of allowing researchers to fully exploit the data available to them. In addition, the lack of tools to meaningfully integrate heterogeneous data types has prevented researchers from efficiently linking genes, their products, and pathways. This integration is critical for elucidating the network of structural, regulatory, and dynamic interactions, which in turn gives researchers a more comprehensive understanding of the disease and, ultimately, the ability to develop effective treatment strategies.
It is not surprising, therefore, that a great deal of effort (technical and marketing) has surrounded the field of data integration in life sciences in the last decade [2-9]. Consequently, our industry has witnessed the rise of many, and the fall of most, approaches to data integration. Most of the early integration efforts focused on integrating genomic sequence data and the associated metadata, such as annotations, results from various analyses, and sample information. More recent efforts have focused on integrating heterogeneous data types [10-15]. INCOGEN has closely followed the evolution of those approaches and has actively contributed to and led collaborative efforts toward data integration solutions [16-21]. During this time, we have identified three barriers that stand between scientists and their ability to derive the full benefits of their research:
Barrier 1: Lack of tools that combine data preprocessing with downstream analysis:
Currently, molecular profiling techniques are limited by 1) an inadequate understanding and control of the physical processes involved in source preparation and 2) the irreproducibility of the complex surface chemistry used to isolate the molecules of interest. These sample-induced variations may mask the expected correlation between the concentration of an analyte and its corresponding expression intensity. A comprehensive understanding of the technology and optimized data preprocessing are required to address these issues before they negatively impact subsequent analysis and results [22-24]. Unfortunately, analysis software provided by hardware manufacturers is often inadequate or inflexible and does not provide a means for developing a complete, optimized analysis workflow that includes signal processing, feature selection, and classification. In the gene expression analysis community in particular, several large-scale bioinformatics efforts (caArray [25], CGAP [26]) have been launched to provide database repositories with access to suites of high-level statistical analysis tools for pattern recognition and visualization (e.g., caWorkbench [27]) or the construction of flexible analysis pipelines (e.g., GenePattern [28]). However, despite the usefulness of these data management and analysis portals, they do not provide access to the low-level signal processing tools, which are usually only supplied with the hardware (e.g., ScanArrayExpress [29]) or wrapped into proprietary software (e.g., GenePix [30], ArrayVision [31]). During INCOGEN’s work toward the development of diagnostic classifiers for mass spectrometry and gene expression data, we demonstrated that it is critical to fully integrate low-level signal processing with high-level statistical inference methods in order to discover robust discriminating patterns and identify diagnostically significant variables, as well as to gain a more complete understanding of the system under study [22, 32-35].
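To make the idea of a single workflow spanning signal processing, feature selection, and classification concrete, the following Python sketch chains the three stages on toy data. It is only an illustration under stated assumptions and not INCOGEN's method: the baseline subtraction, t-score ranking, and nearest-centroid classifier are simple stand-ins chosen for brevity, and all function names and intensity values are hypothetical.

```python
from statistics import mean, stdev

def preprocess(spectrum):
    """Low-level step: subtract a crude baseline (the minimum intensity)
    and scale so that the intensities sum to 1."""
    baseline = min(spectrum)
    shifted = [x - baseline for x in spectrum]
    total = sum(shifted) or 1.0  # avoid division by zero for flat spectra
    return [x / total for x in shifted]

def t_score(a, b):
    """Welch-style t statistic for one feature measured in two groups.
    Each group needs at least two samples."""
    se = (stdev(a) ** 2 / len(a) + stdev(b) ** 2 / len(b)) ** 0.5
    return abs(mean(a) - mean(b)) / se if se else 0.0

def select_features(samples, labels, k):
    """Rank features by t score between the two classes; keep the top k indices."""
    first, second = sorted(set(labels))
    group_a = [s for s, y in zip(samples, labels) if y == first]
    group_b = [s for s, y in zip(samples, labels) if y == second]
    scores = []
    for j in range(len(samples[0])):
        score = t_score([s[j] for s in group_a], [s[j] for s in group_b])
        scores.append((score, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]

def fit_centroids(samples, labels, features):
    """High-level step: one mean vector per class over the selected features."""
    centroids = {}
    for c in set(labels):
        rows = [s for s, y in zip(samples, labels) if y == c]
        centroids[c] = [mean(r[j] for r in rows) for j in features]
    return centroids

def classify(raw_spectrum, centroids, features):
    """Apply the same preprocessing to a new spectrum, then pick the nearest centroid."""
    processed = preprocess(raw_spectrum)
    point = [processed[j] for j in features]
    def squared_distance(c):
        return sum((p - q) ** 2 for p, q in zip(point, centroids[c]))
    return min(centroids, key=squared_distance)

# Hypothetical raw intensities; the first two samples differ only by an offset.
raw = [
    [12.0, 30.0, 11.0, 15.0],
    [22.0, 41.0, 21.0, 26.0],
    [5.0, 9.0, 6.0, 20.0],
    [15.0, 18.0, 17.0, 31.0],
]
labels = ["tumor", "tumor", "normal", "normal"]

processed = [preprocess(s) for s in raw]
features = select_features(processed, labels, k=2)
centroids = fit_centroids(processed, labels, features)
print(classify([32.0, 52.0, 31.0, 36.0], centroids, features))
```

Because preprocessing sits inside the same code path as classification, a change to the baseline or normalization step is automatically reflected in feature selection and in every later prediction, which is the kind of coupling that separate vendor and analysis tools make difficult.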
Barrier 2: Lack of tools to meaningfully integrate genomic and proteomic data:
Over the last decade, techniques for the independent analysis of genomic and proteomic data have been the focus of extensive research, and many analysis methods have been developed and applied to these areas. More recently, the need to meaningfully combine these types of information has been recognized as a critical step toward a comprehensive understanding of complex biological questions [36-43]. While it is becoming increasingly common for investigators to gather parallel measurements for complementary gene expression and protein data, the integration of these heterogeneous data sets for combined computational analysis has rarely been attempted because the necessary tools are lacking. A recent example project conducted at INCOGEN involved the analysis of breast cancer data generated on both the mass spectrometry and microarray platforms [44]. The results of the pilot study showed that a significant amount of diagnostically relevant knowledge can be gained by meaningfully combining data from the microarray and mass spectrometry platforms and subsequently incorporating biological knowledge from existing resources. However, many components of the analysis had to be performed manually because of a lack of adequate tools.
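As a rough illustration of what combining parallel measurements for joint analysis can involve at the data level, the Python sketch below joins two platforms on shared sample identifiers and standardizes each feature so that the platforms share a common scale. This is a minimal sketch under assumptions, not the procedure used in the pilot study [44]: the sample IDs, feature names, values, and the choice of z-scoring are all hypothetical, and every sample is assumed to report every feature of its platform.

```python
from statistics import mean, pstdev

def zscore_columns(table):
    """Standardize each feature across samples (zero mean, unit spread).
    Assumes every sample row contains every feature."""
    feature_names = sorted({f for row in table.values() for f in row})
    stats = {}
    for f in feature_names:
        values = [row[f] for row in table.values()]
        stats[f] = (mean(values), pstdev(values) or 1.0)  # guard constant features
    return {
        sample: {f: (row[f] - stats[f][0]) / stats[f][1] for f in feature_names}
        for sample, row in table.items()
    }

def integrate(expression, proteomics):
    """Join two platforms on shared sample IDs.
    Returns the merged table and the sample IDs found on only one platform."""
    shared = sorted(set(expression) & set(proteomics))
    expr_z = zscore_columns({s: expression[s] for s in shared})
    prot_z = zscore_columns({s: proteomics[s] for s in shared})
    merged = {}
    for s in shared:
        # Prefix feature names so the platform of origin stays visible.
        row = {f"gene:{name}": value for name, value in expr_z[s].items()}
        row.update({f"ms:{name}": value for name, value in prot_z[s].items()})
        merged[s] = row
    unmatched = sorted(set(expression) ^ set(proteomics))
    return merged, unmatched

# Hypothetical toy measurements keyed by sample ID.
expression = {
    "S1": {"GENE_A": 8.0, "GENE_B": 2.0},
    "S2": {"GENE_A": 4.0, "GENE_B": 6.0},
    "S3": {"GENE_A": 6.0, "GENE_B": 4.0},
}
proteomics = {
    "S1": {"peak_4300": 120.0},
    "S2": {"peak_4300": 80.0},
    "S4": {"peak_4300": 100.0},
}

merged, unmatched = integrate(expression, proteomics)
print(sorted(merged), unmatched)
```

Reporting the unmatched samples explicitly, rather than silently dropping them, keeps the completeness of the combined data set visible to the analyst.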
Barrier 3: Lack of tools to promote interdisciplinary research:
A comprehensive understanding of biological phenomena can only be achieved through the melding of information and insights from technologies that characterize genes and proteins at many levels, such as sequence, transcription, regulation, structure, function, kinetics, and localization. This integration of knowledge requires a departure from conventional approaches to life science research and is only possible by combining state-of-the-art technologies and enabling knowledge exchange among traditionally divergent fields such as molecular biology, clinical research, computational science, physics, statistics, and hardware engineering [38]. Currently, no platform exists that allows researchers from multiple disciplines to contribute their knowledge toward a common research goal.
Future of Bioinformatics Software
The ideal solution to address these barriers would require a statistically robust experimental project design, defined a priori, as well as reliable, interoperable, and user-friendly software and homogeneous formats for data content representation and exchange. While this ideal solution is a laudable goal for the life science informatics community and has formed the basis for efforts such as the now-inactive Interoperable Informatics Infrastructure Consortium (I3C) and NCI’s cancer Biomedical Informatics Grid (caBIG), many of the required components do not yet exist and are likely to take years or even decades to develop or become established.
During the last decade, INCOGEN has played a leading role in providing tools that enable scientists to conduct effective research in the absence of the described “ideal” solution. Software such as GenePort [45] and VIBE [46] has been designed to help multidisciplinary teams of researchers meaningfully integrate their science and expertise toward achieving their research goals. To ensure that the resulting tools are robust and useful for applications such as cancer diagnosis, the software development projects were carried out alongside the actual cancer research. This approach requires a level of coordination among researchers from many disciplines that is rarely achieved but is essential. Furthermore, while most scientists agree that an integrated, systems approach is fundamentally necessary to fully understand the biological foundations of life, proponents of such approaches are often vague about how they will overcome the challenges of data overload and the meaningful integration of large heterogeneous data sets. INCOGEN’s multidisciplinary partnerships and complementary expertise have proven successful in approaching these complex challenges. It is crucial that, in the future, software companies follow similar approaches that can lead to the development of robust and effective software platforms. This combination of bioinformatics and clinical expertise is vital for the development of tools that meaningfully integrate data, knowledge, and expertise from various disciplines toward the systematic understanding of biology and, ultimately, widespread benefits in health care.
References