Regression

PyBDA supports several methods for regression. Here, we show how random forests and gradient boosting machines can be used to predict a response variable from a set of covariates. We use a single-cell imaging data set to predict whether or not a cell is infected by a pathogen.

We start by activating our environment:

[1]:
source ~/miniconda3/bin/activate pybda
(pybda)

To fit the two models, we can use a config file that is already provided in the data folder. This should do the trick:

[2]:
cd data
(pybda)
[3]:
cat pybda-usecase-regression.config
spark: spark-submit
infile: single_cell_imaging_data.tsv
predict: single_cell_imaging_data.tsv
outfolder: results
meta: meta_columns.tsv
features: feature_columns.tsv
regression: forest, gbm
family: binomial
response: is_infected
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true
(pybda)

The config file above will do the following:

  • fit a random forest and a gradient boosting model,
  • regress the response column on the features listed in feature_columns.tsv,
  • use a binomial response family,
  • predict the response with the fitted models on the data set given in predict,
  • give both the Spark driver and the executor 1G of memory,
  • write the results to results,
  • print debug information.

So, a brief file like this is enough!
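Before running a job, it can help to check that a config file contains everything a regression run needs. The following is a minimal, illustrative sketch of such a check using only the Python standard library; the parser, the `parse_config`/`check_config` names, and the set of required keys are assumptions based on the config shown above, not part of PyBDA itself (PyBDA does its own validation).

```python
# Minimal sketch of validating a PyBDA-style config file (hypothetical
# helpers; PyBDA performs its own, more thorough validation).

REQUIRED_KEYS = {"spark", "infile", "outfolder", "meta", "features",
                 "regression", "family", "response"}

def parse_config(text):
    """Parse simple 'key: value' lines; '- item' lines are collected
    into a list under the most recent key (e.g. sparkparams)."""
    config, current_key = {}, None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("- "):
            config[current_key].append(stripped[2:].strip('"'))
        else:
            key, _, value = stripped.partition(":")
            current_key = key.strip()
            config[current_key] = value.strip() or []
    return config

def check_config(config):
    """Raise if any key required for a regression run is missing."""
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"missing config keys: {sorted(missing)}")
    return config
```

For example, `check_config(parse_config(open("pybda-usecase-regression.config").read()))` would pass for the file above and raise if, say, the `response` key were forgotten.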

We then call PyBDA like this:

[4]:
pybda regression pybda-usecase-regression.config local
Checking command line arguments for method: regression
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_09/gbm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_09/forest_from_single_cell_imaging_data.tsv)

Building DAG of jobs...
[2019-08-09 00:15:53,204 - WARNING - snakemake.logging]: Building DAG of jobs...
Using shell: /bin/bash
[2019-08-09 00:15:53,219 - WARNING - snakemake.logging]: Using shell: /bin/bash
Provided cores: 1
[2019-08-09 00:15:53,219 - WARNING - snakemake.logging]: Provided cores: 1
Rules claiming more threads will be scaled down.
[2019-08-09 00:15:53,219 - WARNING - snakemake.logging]: Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       forest
        1       gbm
        2
[2019-08-09 00:15:53,220 - WARNING - snakemake.logging]: Job counts:
        count   jobs
        1       forest
        1       gbm
        2

[2019-08-09 00:15:53,221 - INFO - snakemake.logging]:
[Fri Aug  9 00:15:53 2019]
[2019-08-09 00:15:53,222 - INFO - snakemake.logging]: [Fri Aug  9 00:15:53 2019]
rule gbm:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_09/gbm_from_single_cell_imaging_data-statistics.tsv
    jobid: 0
[2019-08-09 00:15:53,222 - INFO - snakemake.logging]: rule gbm:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_09/gbm_from_single_cell_imaging_data-statistics.tsv
    jobid: 0

[2019-08-09 00:15:53,222 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_09/gbm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_09/forest_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       gbm
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/gbm.py --predict single_cell_imaging_data.tsv single_cell_imaging_data.tsv meta_columns.tsv feature_columns.tsv is_infected binomial results/2019_08_09/gbm_from_single_cell_imaging_data > results/2019_08_09/gbm_from_single_cell_imaging_data-spark.log
[Fri Aug  9 00:17:17 2019]
[2019-08-09 00:17:17,838 - INFO - snakemake.logging]: [Fri Aug  9 00:17:17 2019]
Finished job 0.
[2019-08-09 00:17:17,838 - INFO - snakemake.logging]: Finished job 0.
1 of 2 steps (50%) done
[2019-08-09 00:17:17,838 - INFO - snakemake.logging]: 1 of 2 steps (50%) done

[2019-08-09 00:17:17,839 - INFO - snakemake.logging]:
[Fri Aug  9 00:17:17 2019]
[2019-08-09 00:17:17,839 - INFO - snakemake.logging]: [Fri Aug  9 00:17:17 2019]
rule forest:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_09/forest_from_single_cell_imaging_data-statistics.tsv
    jobid: 1
[2019-08-09 00:17:17,839 - INFO - snakemake.logging]: rule forest:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_09/forest_from_single_cell_imaging_data-statistics.tsv
    jobid: 1

[2019-08-09 00:17:17,839 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_09/gbm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_09/forest_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       forest
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/forest.py --predict single_cell_imaging_data.tsv single_cell_imaging_data.tsv meta_columns.tsv feature_columns.tsv is_infected binomial results/2019_08_09/forest_from_single_cell_imaging_data > results/2019_08_09/forest_from_single_cell_imaging_data-spark.log
[Fri Aug  9 00:17:58 2019]
[2019-08-09 00:17:58,124 - INFO - snakemake.logging]: [Fri Aug  9 00:17:58 2019]
Finished job 1.
[2019-08-09 00:17:58,124 - INFO - snakemake.logging]: Finished job 1.
2 of 2 steps (100%) done
[2019-08-09 00:17:58,124 - INFO - snakemake.logging]: 2 of 2 steps (100%) done
Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-09T001553.143310.snakemake.log
[2019-08-09 00:17:58,125 - WARNING - snakemake.logging]: Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-09T001553.143310.snakemake.log
(pybda)
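The `Submitting job` lines in the output show how the config values end up in the actual `spark-submit` calls. The sketch below reconstructs that command line from a parsed config; it mirrors the argument order of the logged commands, but the `submit_command` helper, the script path, and the omitted log redirection are illustrative assumptions, not PyBDA internals.

```python
# Illustrative reconstruction of the submitted spark-submit command
# (hypothetical helper; compare with the "Submitting job" lines above).
# The real call uses the full path to the PyBDA script and redirects
# output to a -spark.log file, which is omitted here.

def submit_command(method, config, master="local",
                   outdir="results/2019_08_09"):
    # Output prefix: <outdir>/<method>_from_<infile without extension>
    out = f"{outdir}/{method}_from_{config['infile'].rsplit('.', 1)[0]}"
    parts = [config["spark"], "--master", master,
             *config["sparkparams"],        # e.g. --driver-memory=1G
             f"{method}.py",
             "--predict", config["predict"],
             config["infile"], config["meta"], config["features"],
             config["response"], config["family"], out]
    return " ".join(parts)
```

Called with `method="gbm"` and the config from above, this yields the same argument sequence as the first submitted job.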

That’s it! The call automatically executes the jobs defined in the config file. After both jobs have run, we should check the plots and statistics. Let’s see what we got:

[5]:
cd results
ls -lgG *
(pybda) total 13832
-rw-rw-r-- 1    2909 Aug  9 00:17 forest_from_single_cell_imaging_data.log
-rw-r--r-- 1 5320871 Aug  9 00:17 forest_from_single_cell_imaging_data-predicted.tsv
-rw-rw-r-- 1  406579 Aug  9 00:17 forest_from_single_cell_imaging_data-spark.log
-rw-rw-r-- 1     118 Aug  9 00:17 forest_from_single_cell_imaging_data-statistics.tsv
-rw-rw-r-- 1    2903 Aug  9 00:17 gbm_from_single_cell_imaging_data.log
-rw-r--r-- 1 5323224 Aug  9 00:17 gbm_from_single_cell_imaging_data-predicted.tsv
-rw-rw-r-- 1 3084636 Aug  9 00:17 gbm_from_single_cell_imaging_data-spark.log
-rw-rw-r-- 1     130 Aug  9 00:17 gbm_from_single_cell_imaging_data-statistics.tsv
(pybda)

Let’s compare how well the two methods performed:

[6]:
cat */gbm_from_single_cell_imaging_data-statistics.tsv
family  response        accuracy        f1      precision       recall
binomial        is_infected     0.9349  0.9348907798833392      0.9351464843746091      0.9349000000000001
(pybda)
[7]:
cat */forest_from_single_cell_imaging_data-statistics.tsv
family  response        accuracy        f1      precision       recall
binomial        is_infected     0.8236  0.8231143143597965      0.8271935801788475      0.8236
(pybda)
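The columns in the statistics files follow the standard definitions for binary classification. The sketch below computes them from a confusion matrix, using plain (unweighted) definitions; Spark reports weighted variants, so its numbers can differ slightly, and the counts in the example are made up for illustration.

```python
# Standard binary-classification metrics from confusion-matrix counts
# (tp = true positives, fp = false positives, fn = false negatives,
# tn = true negatives). Spark computes weighted variants of these.

def binary_metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)          # of predicted positives, fraction correct
    recall = tp / (tp + fn)             # of actual positives, fraction found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return accuracy, precision, recall, f1
```

For instance, with 90 true positives, 10 false positives, 10 false negatives and 90 true negatives, all four metrics come out to 0.9, which matches the pattern in the tables above where accuracy, F1, precision and recall lie close together.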

The GBM performed considerably better than the random forest. That is hardly surprising: the data set is very noisy, so recursively training on the errors of the previous learner should be advantageous.

PyBDA creates plenty of other files worth checking out. For instance, we should always have a look at the log files:

[8]:
cat */gbm_from_single_cell_imaging_data.log
[2019-08-09 00:15:55,705 - INFO - pybda.spark_session]: Initializing pyspark session
[2019-08-09 00:15:57,046 - INFO - pybda.spark_session]: Config: spark.master, value: local
[2019-08-09 00:15:57,046 - INFO - pybda.spark_session]: Config: spark.driver.memory, value: 1G
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.app.name, value: gbm.py
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.driver.port, value: 39021
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.rdd.compress, value: True
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.app.id, value: local-1565302556519
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.serializer.objectStreamReset, value: 100
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.driver.host, value: 192.168.1.33
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.executor.id, value: driver
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Config: spark.submit.deployMode, value: client
[2019-08-09 00:15:57,047 - INFO - pybda.spark_session]: Openened spark context at: Fri Aug  9 00:15:57 2019
[2019-08-09 00:15:57,066 - INFO - pybda.io.io]: Reading tsv: single_cell_imaging_data.tsv
[2019-08-09 00:16:01,018 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-09 00:16:02,010 - INFO - pybda.spark.features]: Assembling column to feature vector
[2019-08-09 00:16:02,183 - INFO - pybda.spark.features]: Dropping redundant columns
[2019-08-09 00:16:02,203 - INFO - pybda.ensemble]: Fitting forest with family='binomial'
[2019-08-09 00:16:05,686 - INFO - pybda.decorators]: function: '_balance' took: 3.4832 sec
[2019-08-09 00:17:03,806 - INFO - pybda.decorators]: function: '_fit' took: 61.6037 sec
[2019-08-09 00:17:12,037 - INFO - pybda.fit.ensemble_fit]: Writing regression statistics
[2019-08-09 00:17:12,038 - INFO - pybda.io.io]: Reading tsv: single_cell_imaging_data.tsv
[2019-08-09 00:17:12,286 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-09 00:17:12,866 - INFO - pybda.spark.features]: Assembling column to feature vector
[2019-08-09 00:17:12,975 - INFO - pybda.decorators]: function: 'predict' took: 0.0645 sec
[2019-08-09 00:17:12,988 - INFO - pybda.spark.features]: Dropping column 'features'
[2019-08-09 00:17:12,994 - INFO - pybda.spark.features]: Dropping column 'rawPrediction'
[2019-08-09 00:17:13,002 - INFO - pybda.spark.features]: Splitting vector columns: probability
[2019-08-09 00:17:13,531 - INFO - pybda.io.io]: Writing tsv: results/2019_08_09/gbm_from_single_cell_imaging_data-predicted
[2019-08-09 00:17:16,616 - INFO - pybda.spark_session]: Stopping Spark context
[2019-08-09 00:17:16,616 - INFO - pybda.spark_session]: Closed spark context at: Fri Aug  9 00:17:16 2019
[2019-08-09 00:17:16,616 - INFO - pybda.spark_session]: Computation took: 79
(pybda)

Furthermore, the Spark log file is often important to look at when a method fails:

[9]:
head */gbm_from_single_cell_imaging_data-spark.log
2019-08-09 00:15:54 WARN  Utils:66 - Your hostname, hoto resolves to a loopback address: 127.0.1.1; using 192.168.1.33 instead (on interface wlp2s0)
2019-08-09 00:15:54 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-09 00:15:54 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-09 00:15:55 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-08-09 00:15:55 INFO  SparkContext:54 - Submitted application: gbm.py
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing view acls to: simon
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing modify acls to: simon
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing view acls groups to:
2019-08-09 00:15:55 INFO  SecurityManager:54 - Changing modify acls groups to:
2019-08-09 00:15:55 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(simon); groups with view permissions: Set(); users  with modify permissions: Set(simon); groups with modify permissions: Set()
(pybda)