Getting started¶
Getting started walks you through the installation of koios and to submit your first program. See the examples to get a better overview that koios has to offer.
Installation¶
Installing PyBDA is easy.
Make sure to have
python3
installed. PyBDA does not support previous versions. The best way to do that is to download and install anaconda and create a virtual environment like this:conda create -y -n pybda python=3.6 source activate pybda
Note
Make sure to use Python version 3.6 since some dependencies PyBDA uses so far don’t work with versions 3.7 or 3.8.
I recommend installing PyBDA from Bioconda:
conda install -c bioconda pybda
Download Spark from Apache Spark (use the prebuilt for Apache Hadoop package type) and install the unpacked folder into a custom path like
`/opt/local/spark`
. Put an alias into your`.bashrc`
(or whatever shell you are using)echo "alias spark-submit='opt/local/spark/bin/spark-submit'" >> .bashrc
That is it.
Usage¶
Using PyBDA requires providing two things:
- a config file that specifies the methods you want to use, paths to files, and parameters,
- and the
IP
to a running spark-cluster which runs the algorithms and methods to be executed. If no cluster environment is available you can also run PyBDA locally. This, of course, somehow limits what PyBDA can do for you, since it’s real strength lies in distributed computation.
Config¶
Running PyBDA requires a yaml
configuration file that specifies several key-value pairs.
The config file consists of
- general arguments, such as file names,
- method specific arguments,
- arguments for Apache Spark.
General arguments¶
The following table shows the arguments that are mandatory and need to be set in every application.
Parameter | Explanation |
---|---|
spark |
path to Apache spark spark-submit executable |
infile |
tab-separated input file to use for any of the methods |
outfolder |
folder where all results are written to. |
meta |
names of the columns that represent meta information (“n”-separated) |
features |
names of the columns that represent numerical features, i.e. columns that are used for analysis (“n”-separated). |
sparkparams |
specifies parameters that are handed over to Apache Spark (which we cover in the section below) |
Method specific arguments¶
The following tables show the arguments required for the single methods, i.e. dimension reduction, clustering and regression.
Parameter | Argument | Explanation |
---|---|---|
Dimension reduction | ||
dimension_reduction |
factor_analysis /pca /kpca /lda /ica |
Specifies which method to use for dimension reduction |
n_components |
e.g 2,3,4 or 2 |
Comma-separated list of integers specifying the number of variables in the lower dimensional space to use per reduction |
response |
(only for lda ) |
Name of column in infile that is the response. Only required for linear discriminant analysis. |
Clustering | ||
clustering |
kmeans /gmm |
Specifies which method to use for clustering |
n_centers |
e.g 2,3,4 or 2 |
Comma-separated list of integers specifying the number of clusters to use per cluystering |
Regression | ||
regression |
glm /forest /gbm |
Specifies which method to use for regression |
response |
Name of column in infile that is the response |
|
family |
gaussian /binomial |
Distribution family of the response variable |
The abbreveations of the methods are explained in the following list.
factor_analysis
for factor analysis,forest
for random forests,gbm
for stochastic gradient boosting,glm
for generalized linear regression models,gmm
for Gaussian mixture models,ica
for independent component analysis,lda
for linear discriminant analysis,kmeans
for K-means,kpca
for kernel principal component analysis using Fourier features to approximate the kernel map,pca
for principal component analysis.
Example¶
For instance, consider the config file below:
spark: spark-submit
infile: data/single_cell_imaging_data.tsv
predict: data/single_cell_imaging_data.tsv
outfolder: data/results
meta: data/meta_columns.tsv
features: data/feature_columns.tsv
dimension_reduction: pca
n_components: 5
clustering: kmeans
n_centers: 50, 100, 110, 120, 130, 140, 150, 160, 170, 180, 190, 200
regression: forest
family: binomial
response: is_infected
sparkparams:
- "--driver-memory=3G"
- "--executor-memory=6G"
debug: true
It would execute the following jobs:
- dimension reduction (PCA) on the input file with 5 components,
- clustering on the result of the dimensionality reduction with multiple cluster centers (
k
-means), - binomial-family random forest (i.e. logistic) on the input file with response is_infected and features from
data/feature_columns.tsv
In addition we would allow Spark to use 3G driver memory, 6G executor memory and set the configuration variable spark.driver.maxResultSize
to 3G (all configurations can be found here).
Note
PyBDA first parses through the config file and builds a DAG of the methods that should be executed. If it finds dimensionality reduction and clustering, it will first embed the data in a lower dimensional space und use the result of this for clustering (i.e. in order to remove correlated features). The same does not happen with regression.
Spark parameters¶
The Spark documentation
for submitting applications provides details which arguments are valid here. You provide them as list in the yaml file as key sparkparams
:
Below, the most important two are listed:
"--driver-memory=xG"
- Amount of memory to use for the driver process in gigabyte, i.e. where SparkContext is initialized.
"--executor-memory=xG"
- Amount of memory to use per executor process in giga byte.
"--conf spark.driver.maxResultSize=3G"
- Limit of total size of serialized results of all partitions for each Spark action.
Spark¶
In order for PyBDA to work you need to have a working
standalone spark environment set up, running and listening to some IP
.
You can find a good introduction
here on how
to start the standalone Spark cluster. Alternatively, as mentioned above, a desktop PC suffices as well, but will limit what PyBDA can do for you.
We assume that you know how to use Apache Spark and start a cluster. However, for the sake of demonstration the next two sections give a short introduction how Spark clusters are set up.
Local Spark context¶
On a local resource, such as a laptop or desktop computer, there is no need to start a Spark cluster.
In such a scenario the IP
PyBDA requires for submitting jobs is just called local
.
Alternatively, you can always simulate a cluster environment. You start the Spark environment using:
$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-slave.sh <IP>
where $SPARK_HOME
is the installation path of Spark and IP
the IP to which we will submit jobs.
When calling start-master.sh
Spark will log the IP
it uses. Thus you need to have a look there to find it. Usually the line looks something like:
2019-01-23 21:57:29 INFO Master:54 - Starting Spark master at spark://<COMPUTERNAME>:7077
In the above case the IP
is spark://<COMPUTERNAME>:7077
. Thus you start the slave using
$SPARK_HOME/sbin/start-slave.sh spark://<COMPUTERNAME>:7077
That is it.
Cluster environment¶
If you are working on a cluster, you can use sparkhpc
to set up a Spark instance (find the documenation here).
Note
If you want to use sparkhpc
, please read its documentation to understand how Spark clusters are started.
Sparkhpc can be used to start a standalone cluster on an LSF/SGE high-performance computing environment. In order for them to work make sure to have openmpi and Java installed. Sparkhpc installs with PyBDA, but in case it didn’t just reinstall it:
pip install sparkhpc
Sparkhpc helps you setting up spark clusters for LSF and Slurm cluster environments. If you have one of those start a Spark cluster, for instance, using:
sparkcluster start --memory-per-executor 50000 --memory-per-core 10000 --walltime 4:00 --cores-per-executor 5 2 &
sparkcluster launch &
Warning
For your own cluster, you should modify the number of workers, nodes, cores and memory.
In the above call we would request 2 nodes with 5 cores each. Every core would receive 10G of memory, while the entuire executor would receive 50G of memory.
After the job has started, you need to call
sparkcluster info
in order to receive the Spark IP
.
Calling¶
If you made it thus far, you successfully
- modified the config file,
- started a Spark standalone cluster and have the
IP
to which the Spark cluster listens.
Now we can finally start our application.
For dimension reduction:
pybda dimension-reduction data/pybda-usecase.config IP
For clustering:
pybda clustering data/pybda-usecase.config IP
For regression:
pybda regression data/pybda-usecase.config IP
And, finally, if you want to execute all methods (i.e., regression/clustering/dimension reduction/…) you would call PyBDA with a run
argument:
pybda run data/pybda-usecase.config IP
In all cases, the methods create tsv
files, plots and statistics.
References¶
Murphy, Kevin P. Machine learning: a probabilistic perspective. MIT press (2012).
Breiman, Leo. “Random forests.” Machine learning 45.1 (2001): 5-32.
Friedman, Jerome H. “Stochastic gradient boosting.” Computational Statistics & Data Analysis 38.4 (2002): 367-378.
Trevor, Hastie, Tibshirani Robert, and Friedman JH. “The elements of statistical learning: data mining, inference, and prediction.” (2009).
Hyvärinen, Aapo, Juha Karhunen, and Erkki Oja. Independent component analysis. Vol. 46. John Wiley & Sons (2004).
Köster, Johannes, and Sven Rahmann. “Snakemake—a scalable bioinformatics workflow engine.” Bioinformatics 28.19 (2012): 2520-2522.
Meng, Xiangrui, et al. “MLlib: Machine Learning in Apache Spark.” The Journal of Machine Learning Research 17.1 (2016): 1235-1241
Rahimi, Ali, and Benjamin Recht. “Random features for large-scale kernel machines.” Advances in Neural Information Processing Systems (2008).
Zaharia, Matei, et al. “Apache Spark: a unified engine for big data processing.” Communications of the ACM 59.11 (2016): 56-65.