Background: Most biocomputing pipelines are run on clusters of computers. Each type of cluster has its own API (application programming interface), which defines how a program must submit, describe and monitor the jobs it runs on the cluster. It is sometimes desirable to run the same pipeline on different types of cluster, for example when:

- different labs are collaborating but do not use the same type of cluster;
- a pipeline is released to other labs as open-source or commercial software;
- a lab has access to multiple types of cluster and wants to choose between them for scaling, cost or other reasons;
- a lab is migrating its infrastructure from one cluster type to another;
- during testing or travel, it is desirable to run on a single computer.

However, because each type of cluster has its own API, code that runs jobs on one type of cluster must be rewritten before it can run on a different type. To resolve this problem, we created a software module that generalizes the submission of pipelines across computing environments, including local compute, clouds and clusters.

Results: HPCI (High Performance Computing Interface) is a Perl module that provides the interface to a standardized generic cluster. HPCI accepts a parameter specifying the cluster type and uses it to load a driver, HPCD::&lt;cluster&gt;, which translates the abstract HPCI interface into the specific software interface of that cluster. Simply by changing the cluster parameter, the same pipeline can be run on a different type of cluster with no other changes.

Conclusion: The HPCI module assists in writing Perl programs that can run in different lab environments, with different site-configuration requirements and different types of hardware clusters.
Rather than re-writing portions of the program, it is only necessary to change a configuration file. Using HPCI, an application can manage collections of jobs to be run, specify ordering dependencies, detect the success or failure of jobs, and automatically retry failed jobs (allowing for a changed configuration, such as when the original attempt specified an inadequate memory allotment).

Keywords: portability; cluster; environment; pipeline
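The workflow described above can be sketched in Perl as follows. This is a minimal illustration based on the interface described in the abstract; the stage names, commands and parameter names (`base_dir`, `resources_required`) are illustrative, and the exact method signatures should be checked against the HPCI documentation on CPAN:

```perl
use strict;
use warnings;
use HPCI;

# Create a job group; changing only the 'cluster' value (e.g. to
# 'Slurm' or a local-execution driver) retargets the whole pipeline,
# since HPCI loads the matching HPCD::<cluster> driver internally.
my $group = HPCI->group(
    cluster  => 'SGE',
    base_dir => 'scratch',        # illustrative: working directory for logs
    name     => 'example_pipeline',
);

# Define two jobs ("stages") with an ordering dependency between them.
my $align = $group->stage(
    name               => 'align',
    command            => 'bwa mem ref.fa reads.fq > aligned.sam',
    resources_required => { mem => '4G' },   # illustrative resource request
);

my $sort = $group->stage(
    name    => 'sort',
    command => 'samtools sort -o aligned.bam aligned.sam',
);

# 'sort' must not start until 'align' has completed successfully.
$group->add_deps( pre_req => $align, dep => $sort );

# Run the group; HPCI submits jobs, monitors completion, and can retry
# failed stages (e.g. with a larger memory allotment) per configuration.
my $results = $group->execute;
```

Because the cluster type is a single parameter, it can be read from a configuration file, which is what allows the same program to move between labs and cluster types without code changes.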