PyBDA¶
A command line tool for the analysis of big biological data sets on distributed HPC clusters.
About¶
Welcome to PyBDA.
PyBDA is a Python library and command line tool for big data analytics and machine learning.
In order to make PyBDA scale to big data sets, we use Apache [Spark]’s DataFrame API which, if developed against, automatically distributes data across the nodes of a high-performance cluster and performs expensive machine learning computations in parallel. For scheduling, PyBDA uses [Snakemake] to automatically execute pipelines of jobs. Specifically, PyBDA first builds a DAG of the methods/jobs you want to execute in succession (e.g. dimensionality reduction followed by clustering) and then computes every method by traversing the DAG. When a job finishes successfully, PyBDA writes results and plots, and creates some statistics. If a job fails, PyBDA reports which method failed and where (owing to Snakemake’s scheduling), so that the same pipeline can effortlessly be continued from the point where it failed last time.
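To give an idea of what this looks like in code, here is a minimal, hedged sketch of the Spark DataFrame API that PyBDA builds on; this is not PyBDA’s internal code, and the session settings and the grouping column are chosen only for illustration (file and column names follow the example config below):

# Minimal sketch of the Spark DataFrame API that PyBDA builds on; not PyBDA's
# internal code. File and column names follow the example config further below.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On a cluster the session would connect to the Spark master; "local[*]" runs locally.
spark = SparkSession.builder.master("local[*]").appName("sketch").getOrCreate()

# The DataFrame is partitioned across the executors, so transformations and
# aggregations like the one below are computed in parallel on the worker nodes.
df = spark.read.csv("data/single_cell_imaging_data.tsv",
                    sep="\t", header=True, inferSchema=True)
df.groupBy("is_infected").agg(F.count("*").alias("n")).show()

spark.stop()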
PyBDA supports multiple machine learning methods that scale to big data sets, which we either implemented from scratch or interface from [MLLib] (a short, hedged sketch of the underlying MLlib calls is shown after this list):
- dimensionality reduction using PCA, factor analysis, kPCA, linear discriminant analysis and ICA,
- clustering using k-means and Gaussian mixture models,
- supervised learning using generalized linear regression models, random forests and gradient boosting.
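As a rough illustration of how such methods map onto [MLLib], below is a hedged sketch of the PCA and k-means calls in pyspark.ml. PyBDA wraps calls of this kind, but the exact code and the feature column names here are assumptions for this example, and df is a Spark DataFrame as read in the sketch above:

# Hedged sketch of the pyspark.ml calls that PCA and k-means map to; the feature
# column names are hypothetical and 'df' is a Spark DataFrame as read above.
from pyspark.ml.feature import VectorAssembler, PCA
from pyspark.ml.clustering import KMeans

# Collect the numeric feature columns into a single vector column.
assembler = VectorAssembler(inputCols=["feature_1", "feature_2", "feature_3"],
                            outputCol="features")
assembled = assembler.transform(df)

# Dimensionality reduction into a 5-dimensional latent space.
pca_model = PCA(k=5, inputCol="features", outputCol="pca_features").fit(assembled)
embedded = pca_model.transform(assembled)

# k-means clustering with 50 cluster centers on the embedded space.
kmeans_model = KMeans(k=50, featuresCol="pca_features").fit(embedded)
print(kmeans_model.computeCost(embedded))  # within-cluster sum of squared distances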
The package is actively developed. If you want, you can also contribute, for instance by adding new features or methods: fork us on GitHub.
Dependencies¶
- Apache Spark == 2.4.0
- Python == 3.6
- Linux or MacOS
Example¶
To run PyBDA you only need to provide a config file and, optionally, the IP of a Spark cluster (otherwise you can just call PyBDA locally using local).
The config file for several machine learning tasks might look like this:
spark: spark-submit
infile: data/single_cell_imaging_data.tsv
predict: data/single_cell_imaging_data.tsv
outfolder: data/results
meta: data/meta_columns.tsv
features: data/feature_columns.tsv
dimension_reduction: pca
n_components: 5
clustering: kmeans
n_centers: 50, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200
regression: forest
family: binomial
response: is_infected
sparkparams:
- "--driver-memory=3G"
- "--executor-memory=6G"
debug: true
The above configuration tells PyBDA to do several things:
- first use a PCA to embed the data into a 5-dimensional latent space,
- do a k-means clustering with different numbers of cluster centers on that space,
- fit a random forest to the response called is_infected using a binomial family (a hedged sketch of the corresponding Spark call follows this list),
- give the Spark driver 3G of memory and the executors 6G,
- print debug information.
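For the random forest step, a binomial family corresponds to a binary classification forest in Spark. As a hedged sketch (not PyBDA’s actual pipeline code; the wiring of columns and the number of trees are assumptions for illustration, reusing the assembled DataFrame from the sketch above), it looks roughly like this:

# Hedged sketch of the Spark call a forest with a binomial family roughly maps to;
# PyBDA's actual pipeline code may differ. 'assembled' is the DataFrame with the
# assembled feature vector from the sketch above, 'is_infected' the response column.
from pyspark.ml.classification import RandomForestClassifier

rf = RandomForestClassifier(labelCol="is_infected", featuresCol="features",
                            numTrees=100)  # numTrees chosen only for illustration
rf_model = rf.fit(assembled)
predictions = rf_model.transform(assembled)
predictions.select("is_infected", "prediction", "probability").show(5)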
You call PyBDA like this:
pybda run data/pybda-usecase.config local
where local tells PyBDA to just use your desktop as the Spark cluster.
Every PyBDA call creates several result files and figures. For instance, we should check the performance of the forest:
family     response     accuracy  f1                  precision           recall
binomial   is_infected  0.8236    0.8231143143597965  0.8271935801788475  0.8236
For the PCA, we for instance create a biplot; it is always informative to look at these.
For the consecutive clustering, PyBDA generates several plots as well, two of which are shown below.
References¶
[Snakemake] Köster, Johannes, and Sven Rahmann. “Snakemake—a scalable bioinformatics workflow engine.” Bioinformatics 28.19 (2012): 2520-2522.
[Spark] Zaharia, Matei, et al. “Apache Spark: a unified engine for big data processing.” Communications of the ACM 59.11 (2016): 56-65.
[MLLib] Meng, Xiangrui, et al. “MLlib: Machine Learning in Apache Spark.” The Journal of Machine Learning Research 17.1 (2016): 1235-1241.