
A command-line tool for analyzing big biological data sets on distributed HPC clusters.


Welcome to PyBDA.

PyBDA is a Python library and command line tool for big data analytics and machine learning.

To scale to big data sets, PyBDA uses Apache [Spark]’s DataFrame API, which automatically distributes data to the nodes of a high-performance cluster and runs expensive machine learning tasks in parallel. For scheduling, PyBDA uses [Snakemake] to automatically execute pipelines of jobs. In particular, PyBDA first builds a DAG of the methods/jobs you want to execute in succession (e.g. dimensionality reduction followed by clustering) and then computes every method by traversing the DAG. When a job succeeds, PyBDA writes results and plots and creates some statistics. If a job fails, PyBDA reports where and which method failed (owing to Snakemake’s scheduling), so that the same pipeline can be continued from where it failed the last time.
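
The scheduling idea described above can be illustrated with a small, self-contained sketch. It is not PyBDA's or Snakemake's actual implementation: the function names `topological_order` and `run_pipeline`, the job names, and the `done` bookkeeping are all hypothetical and only show how a DAG of jobs might be traversed and how a failed run could be resumed.

```python
from collections import deque


def topological_order(dependencies):
    """Return jobs in an order where every job follows its prerequisites.

    `dependencies` maps a job name to the list of jobs it depends on.
    """
    # Count unmet prerequisites per job and record which jobs wait on it.
    pending = {job: len(deps) for job, deps in dependencies.items()}
    dependents = {job: [] for job in dependencies}
    for job, deps in dependencies.items():
        for dep in deps:
            dependents[dep].append(job)

    ready = deque(sorted(job for job, n in pending.items() if n == 0))
    order = []
    while ready:
        job = ready.popleft()
        order.append(job)
        for child in sorted(dependents[job]):
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)
    if len(order) != len(dependencies):
        raise ValueError("dependency graph contains a cycle")
    return order


def run_pipeline(dependencies, actions, done=None):
    """Run jobs in dependency order, skipping those already in `done`.

    Returns (done, failed_job); failed_job is None if every job succeeded.
    """
    done = set() if done is None else set(done)
    for job in topological_order(dependencies):
        if job in done:
            continue  # finished in an earlier run, nothing to redo
        try:
            actions[job]()
        except Exception:
            return done, job  # report which job stopped the pipeline
        done.add(job)
    return done, None


# Hypothetical two-step pipeline: dimensionality reduction, then clustering.
dependencies = {"pca": [], "kmeans": ["pca"]}
attempts = {"kmeans": 0}


def flaky_kmeans():
    # Simulate a job that fails on its first attempt only.
    attempts["kmeans"] += 1
    if attempts["kmeans"] == 1:
        raise RuntimeError("simulated failure")


actions = {"pca": lambda: None, "kmeans": flaky_kmeans}
done, failed = run_pipeline(dependencies, actions)
print("first run:", sorted(done), "failed:", failed)
done, failed = run_pipeline(dependencies, actions, done)
print("second run:", sorted(done), "failed:", failed)
```

In this sketch the second call skips `pca` because it is already recorded as done, which mirrors the resume-from-failure behaviour described above.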

PyBDA supports multiple machine learning methods that scale to big data sets, which we either implemented entirely from scratch or interface from [MLLib]:

  • dimensionality reduction using PCA, factor analysis, kPCA, linear discriminant analysis and ICA,
  • clustering using k-means and Gaussian mixture models,
  • supervised learning using generalized linear regression models, random forests and gradient boosting.

The package is actively developed. If you want to, you can also contribute, for instance by adding new features or methods: fork us on GitHub.


  • Apache Spark == 2.4.0
  • Python == 3.6
  • Linux or macOS


To run PyBDA you only need to provide a config file and, if possible, the IP of a Spark cluster (otherwise you can run PyBDA locally by passing local). The config file for several machine learning tasks might look like this:

Example of a configuration file.
spark: spark-submit
infile: data/single_cell_imaging_data.tsv
predict: data/single_cell_imaging_data.tsv
outfolder: data/results
meta: data/meta_columns.tsv
features: data/feature_columns.tsv
dimension_reduction: pca
n_components: 5
clustering: kmeans
n_centers: 50, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200
regression: forest
family: binomial
response: is_infected
sparkparams:
  - "--driver-memory=3G"
  - "--executor-memory=6G"
debug: true

The above configuration would tell PyBDA to execute multiple things:

  • first use a PCA to embed the data into a 5-dimensional latent space,
  • do a k-means clustering with different numbers of cluster centers on that space,
  • fit a random forest to the response called is_infected and use a binomial family,
  • give the Spark driver 3 GB of memory and the executor 6 GB,
  • print debug information.
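
To make the mapping from configuration keys to work more concrete, here is a minimal sketch that turns the method-related entries of such a config into a list of planned steps. The function `plan_steps` is hypothetical and not part of PyBDA; the config is written out as a plain Python dict rather than parsed from the file, and the assumption that each value in n_centers leads to its own k-means fit is an illustration of "different numbers of cluster centers", not a statement about PyBDA's internals.

```python
def plan_steps(config):
    """Translate a parsed config (a plain dict) into an ordered list of steps."""
    steps = []
    if "dimension_reduction" in config:
        steps.append((config["dimension_reduction"],
                      {"n_components": int(config["n_components"])}))
    if "clustering" in config:
        # A comma-separated value requests one fit per number of centers.
        centers = [int(k) for k in str(config["n_centers"]).split(",")]
        for k in centers:
            steps.append((config["clustering"], {"n_centers": k}))
    if "regression" in config:
        steps.append((config["regression"],
                      {"family": config["family"],
                       "response": config["response"]}))
    return steps


# The method-related keys from the example configuration, as a dict.
config = {
    "dimension_reduction": "pca",
    "n_components": 5,
    "clustering": "kmeans",
    "n_centers": "50, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200",
    "regression": "forest",
    "family": "binomial",
    "response": "is_infected",
}

for method, params in plan_steps(config):
    print(method, params)
```

With the values above, the sketch lists one PCA step, twelve k-means steps (one per entry in n_centers) and one random forest step.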

You call PyBDA like this:

pybda run data/pybda-usecase.config local

where local tells PyBDA to use your desktop as the Spark cluster.

Every PyBDA call creates several files and figures. For instance, we can check the performance of the forest:

Performance statistics of the random forest.
family	response	accuracy	f1	precision	recall
binomial	is_infected	0.8236	0.8231143143597965	0.8271935801788475	0.8236
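
In the table above the recall equals the accuracy, which is what one would expect if precision, recall and F1 are averaged over classes with weights proportional to class size. Whether PyBDA computes them exactly this way is an assumption; the sketch below only shows how such weighted metrics can be derived from a confusion matrix. The function `weighted_metrics` and the confusion matrix are hypothetical and not taken from PyBDA or from the data set.

```python
def weighted_metrics(confusion):
    """Compute accuracy and support-weighted precision, recall and F1.

    `confusion[i][j]` counts samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(n_classes))
    precision = recall = f1 = 0.0
    for i in range(n_classes):
        support = sum(confusion[i])  # samples whose true class is i
        predicted = sum(confusion[j][i] for j in range(n_classes))  # predicted as i
        p = confusion[i][i] / predicted if predicted else 0.0
        r = confusion[i][i] / support if support else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = support / total  # class share of all samples
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": correct / total, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical binary confusion matrix (rows: true class, columns: predicted).
example = [[40, 10],
           [5, 45]]
for name, value in weighted_metrics(example).items():
    print(name, round(value, 4))
```

Because each class recall is weighted by that class's share of the samples, the weighted recall reduces to the fraction of correctly classified samples, i.e. the accuracy.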

For the PCA, PyBDA creates, for instance, a biplot. It’s always informative to look at these:


PCA biplot of the single-cell imaging data.

For the subsequent clustering, two of the generated plots are shown below:


Number of clusters vs explained variance and BIC.


Distribution of the number of cells per cluster (component).