OmicABEL: Story of a successful interdisciplinary collaboration; Diego Fabregat-Traver (AICES, RWTH Aachen, Germany), Yurii S. Aulchenko (SD RAS, Novosibirsk, Russia), Paolo Bientinesi (AICES, RWTH Aachen, Germany)
We discuss the main hurdles and results of a four-year-long collaboration between a group of computer scientists and a group of computational biologists. Despite the difficulties inherent to every interdisciplinary collaboration, we can safely affirm that ours is a story of success: we recently released OmicABEL, a high-performance library for large-scale genome-wide association studies (GWAS) that attains remarkable speedups over state-of-the-art tools. Most importantly, we attracted more research groups and paved the way for further joint research.
Due to the boost in the amount of available genomic data, computational biologists are eager to perform GWAS analyses of ever-increasing size: a typical large analysis involves the computation of billions of generalized least-squares (GLS) problems and the processing of terabytes of data. Unfortunately, neither general nor domain-specific libraries provide a viable solution, since both require unreasonable computing resources and time to completion. Our main goal has thus been the development of efficient and scalable algorithms and routines for the computation of large-scale GWAS.
The main limitation of available tools is that they are based on the traditional black-box approach, in which the individual GLS problems (or small subsets of them) are addressed in isolation. Instead, the key to a viable solver is to consider the entire grid of problems as a whole, taking advantage of the specific correlation among them, and to uncover and exploit as much domain knowledge as possible (e.g., the special structure of matrices). Based on this principle, our application-tailored algorithms lower the computational cost by orders of magnitude. By combining the efficiency of our algorithms with high-performance techniques to attain scalability, and with out-of-core techniques to effectively manage datasets residing on disk, our library outperforms state-of-the-art software by factors of up to 1000. Analyses that previously required the use of supercomputers can now be performed on one multi-core node in a matter of hours.
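To illustrate the difference between the black-box approach and treating the grid of problems as a whole, the following is a minimal sketch, not OmicABEL's actual implementation. It assumes (this is not stated above) that every problem has the standard GLS form b_i = (X_i^T M^-1 X_i)^-1 X_i^T M^-1 y, and that all problems share the same symmetric positive definite matrix M and the same vector y, differing only in the design matrix X_i. The function names `gls_isolated` and `gls_grid_shared` are hypothetical.

```python
import numpy as np


def gls_isolated(M, X, y):
    """Black-box style: solve one GLS problem with no reuse across problems.

    Computes b = (X^T M^-1 X)^-1 X^T M^-1 y, paying for solves with M each call.
    """
    Minv_X = np.linalg.solve(M, X)
    Minv_y = np.linalg.solve(M, y)
    return np.linalg.solve(X.T @ Minv_X, X.T @ Minv_y)


def gls_grid_shared(M, X_list, y):
    """Grid-aware style: factor the shared matrix M once and reuse it.

    Assumption (for illustration only): all problems share M and y, and
    every X_i has the same number of columns p.
    """
    p = X_list[0].shape[1]

    # M = L L^T is computed a single time for the whole grid.
    L = np.linalg.cholesky(M)

    # Whiten y once. A dedicated triangular solver would be preferable here;
    # np.linalg.solve is used only to keep the sketch NumPy-only.
    y_w = np.linalg.solve(L, y)

    # Whiten all design matrices in one large solve instead of many small ones.
    X_w_all = np.linalg.solve(L, np.hstack(X_list))

    betas = []
    for i in range(len(X_list)):
        X_w = X_w_all[:, i * p:(i + 1) * p]
        # Since X^T M^-1 X = X_w^T X_w, the small p-by-p normal equations
        # on the whitened data give the GLS solution.
        betas.append(np.linalg.solve(X_w.T @ X_w, X_w.T @ y_w))
    return np.array(betas)
```

Both functions return the same coefficients; the second moves the expensive work on M out of the per-problem loop, which is the kind of cross-problem reuse the paragraph above refers to. A production solver would additionally stream the X_i from disk in blocks rather than holding them all in memory.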
While OmicABEL is a remarkable result, the outcome of our collaboration goes beyond it. The collaboration turned into a breeding ground, with multiple research groups involved in the development of a variety of projects. Finally, we highlight the most important lesson we learned from this collaboration: there is a huge disconnect between the field of high-performance computing and the application domains. When both worlds come together, however, the quality of the outcome improves dramatically. In order to advance in the field of simulation sciences, it is imperative that experts from different fields sit together and work on filling this gap. Close collaboration and constant communication are crucial to overcome language barriers and to achieve satisfactory results.