Combining multiple tasks at once

Often we want to combine several methods in a single analysis. This notebook shows how it is done! Here, we run dimension reduction, clustering, and regression together from one simple config file.

We start by loading our designated pybda environment:

[1]:
source ~/miniconda3/bin/activate pybda
(pybda)

To run combinations of methods and models, we simply need to list them all in the same config file. We deposited one in the data folder of pybda:

[2]:
cd data
(pybda)

[3]:
cat pybda-usecase-dimred+clustering+regression.config
spark: spark-submit
infile: single_cell_imaging_data.tsv
outfolder: results
meta: meta_columns.tsv
features: feature_columns.tsv
dimension_reduction: pca, ica
n_components: 5
clustering: kmeans, gmm
n_centers: 50, 100
regression: forest, glm
response: is_infected
family: binomial
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true
(pybda)

With the config file above, pybda will do the following:

  • fit a PCA and ICA to single_cell_imaging_data.tsv using 5 components,
  • on each of the two results (PCA and ICA), fit a \(k\)-means and a GMM clustering, once with 50 and once with 100 cluster centers,
  • regress the response column is_infected on the features in feature_columns.tsv using a random forest and a GLM,
  • treat the response as binomial (family: binomial),
  • give the Spark driver 1G of memory and the executor 1G of memory,
  • write the results to results,
  • print debug information.

That’s all we need to do!
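
To make the combinations implied by the config concrete, the sketch below enumerates the outputs that follow from the three method lists, using the <method>_from_<input> naming pattern that appears in pybda's rule tree. It is an illustration only, not pybda's implementation: the dict config and the helper planned_outputs are hypothetical names, and the config is written as a plain Python dict so that only the standard library is needed.

```python
# Hypothetical illustration: enumerate the outputs implied by the config.
# It mirrors the naming seen in pybda's rule tree; it is not pybda code.
config = {
    "infile": "single_cell_imaging_data.tsv",
    "dimension_reduction": ["pca", "ica"],
    "clustering": ["kmeans", "gmm"],
    "regression": ["forest", "glm"],
}


def planned_outputs(cfg):
    """Return output names following the '<method>_from_<input>' pattern."""
    stem = cfg["infile"].rsplit(".", 1)[0]
    outputs = []
    # Regressions are fit on the original data.
    for reg in cfg["regression"]:
        outputs.append(f"{reg}_from_{stem}")
    # Every clustering method runs on the result of every dimension reduction.
    for dimred in cfg["dimension_reduction"]:
        reduced = f"{dimred}_from_{stem}"
        outputs.append(reduced)
        for clust in cfg["clustering"]:
            outputs.append(f"{clust}_from_{reduced}")
    return outputs


if __name__ == "__main__":
    for name in planned_outputs(config):
        print(name)
```

The sketch yields eight names: two regressions, two dimension reductions, and four clusterings.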

We then call pybda from the command line. Usually we would call pybda with a specific target (i.e., clustering, dimension-reduction, or regression) so that we do not run everything. In this case, however, we want to execute everything, so we call it with run.

[4]:
pybda run pybda-usecase-dimred+clustering+regression.config local
Checking command line arguments for method: regression
Checking command line arguments for method: dimension_reduction
Checking command line arguments for method: clustering
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)

Building DAG of jobs...
[2019-08-15 20:35:00,248 - WARNING - snakemake.logging]: Building DAG of jobs...
Using shell: /bin/bash
[2019-08-15 20:35:00,265 - WARNING - snakemake.logging]: Using shell: /bin/bash
Provided cores: 1
[2019-08-15 20:35:00,265 - WARNING - snakemake.logging]: Provided cores: 1
Rules claiming more threads will be scaled down.
[2019-08-15 20:35:00,265 - WARNING - snakemake.logging]: Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       forest
        1       glm
        1       gmm
        1       ica
        1       kmeans
        1       pca
        6
[2019-08-15 20:35:00,266 - WARNING - snakemake.logging]: Job counts:
        count   jobs
        1       forest
        1       glm
        1       gmm
        1       ica
        1       kmeans
        1       pca
        6

[2019-08-15 20:35:00,266 - INFO - snakemake.logging]:
[Thu Aug 15 20:35:00 2019]
[2019-08-15 20:35:00,267 - INFO - snakemake.logging]: [Thu Aug 15 20:35:00 2019]
rule pca:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-plot
    jobid: 0
[2019-08-15 20:35:00,267 - INFO - snakemake.logging]: rule pca:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-plot
    jobid: 0

[2019-08-15 20:35:00,267 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       pca
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/pca.py 5 single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/pca_from_single_cell_imaging_data > results/2019_08_15/pca_from_single_cell_imaging_data-spark.log 0m
Traceback (most recent call last):
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
    raise EOFError
EOFError
[Thu Aug 15 20:36:09 2019]
[2019-08-15 20:36:09,391 - INFO - snakemake.logging]: [Thu Aug 15 20:36:09 2019]
Finished job 0.
[2019-08-15 20:36:09,392 - INFO - snakemake.logging]: Finished job 0.
1 of 6 steps (17%) done
[2019-08-15 20:36:09,392 - INFO - snakemake.logging]: 1 of 6 steps (17%) done

[2019-08-15 20:36:09,392 - INFO - snakemake.logging]:
[Thu Aug 15 20:36:09 2019]
[2019-08-15 20:36:09,392 - INFO - snakemake.logging]: [Thu Aug 15 20:36:09 2019]
rule ica:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-plot
    jobid: 3
[2019-08-15 20:36:09,393 - INFO - snakemake.logging]: rule ica:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-plot
    jobid: 3

[2019-08-15 20:36:09,393 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       ica
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/ica.py 5 single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/ica_from_single_cell_imaging_data > results/2019_08_15/ica_from_single_cell_imaging_data-spark.log 0m
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
Traceback (most recent call last):
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
    if read_int(infile) == SpecialLengths.END_OF_STREAM:
  File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
    raise EOFError
EOFError
[Thu Aug 15 20:41:10 2019]
[2019-08-15 20:41:10,231 - INFO - snakemake.logging]: [Thu Aug 15 20:41:10 2019]
Finished job 3.
[2019-08-15 20:41:10,232 - INFO - snakemake.logging]: Finished job 3.
2 of 6 steps (33%) done
[2019-08-15 20:41:10,232 - INFO - snakemake.logging]: 2 of 6 steps (33%) done

[2019-08-15 20:41:10,233 - INFO - snakemake.logging]:
[Thu Aug 15 20:41:10 2019]
[2019-08-15 20:41:10,233 - INFO - snakemake.logging]: [Thu Aug 15 20:41:10 2019]
rule kmeans:
    input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
    output: results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K100-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K100-clusters
    jobid: 1
[2019-08-15 20:41:10,233 - INFO - snakemake.logging]: rule kmeans:
    input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
    output: results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K100-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K100-clusters
    jobid: 1

[2019-08-15 20:41:10,233 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       kmeans
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/kmeans.py 50,100 results/2019_08_15/ica_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data > results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-spark.log 0m
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/kmeans.py 50,100 results/2019_08_15/pca_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data > results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-spark.log 0m
[Thu Aug 15 20:42:10 2019]
[2019-08-15 20:42:10,216 - INFO - snakemake.logging]: [Thu Aug 15 20:42:10 2019]
Finished job 1.
[2019-08-15 20:42:10,216 - INFO - snakemake.logging]: Finished job 1.
3 of 6 steps (50%) done
[2019-08-15 20:42:10,216 - INFO - snakemake.logging]: 3 of 6 steps (50%) done

[2019-08-15 20:42:10,217 - INFO - snakemake.logging]:
[Thu Aug 15 20:42:10 2019]
[2019-08-15 20:42:10,217 - INFO - snakemake.logging]: [Thu Aug 15 20:42:10 2019]
rule gmm:
    input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
    output: results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K100-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K100-components
    jobid: 5
[2019-08-15 20:42:10,217 - INFO - snakemake.logging]: rule gmm:
    input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
    output: results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K100-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K100-components
    jobid: 5

[2019-08-15 20:42:10,217 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       gmm
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/gmm.py 50,100 results/2019_08_15/ica_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data > results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-spark.log 0m
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/gmm.py 50,100 results/2019_08_15/pca_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data > results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-spark.log 0m
[Thu Aug 15 20:45:47 2019]
[2019-08-15 20:45:47,512 - INFO - snakemake.logging]: [Thu Aug 15 20:45:47 2019]
Finished job 5.
[2019-08-15 20:45:47,512 - INFO - snakemake.logging]: Finished job 5.
4 of 6 steps (67%) done
[2019-08-15 20:45:47,512 - INFO - snakemake.logging]: 4 of 6 steps (67%) done

[2019-08-15 20:45:47,513 - INFO - snakemake.logging]:
[Thu Aug 15 20:45:47 2019]
[2019-08-15 20:45:47,513 - INFO - snakemake.logging]: [Thu Aug 15 20:45:47 2019]
rule forest:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/forest_from_single_cell_imaging_data-statistics.tsv
    jobid: 4
[2019-08-15 20:45:47,513 - INFO - snakemake.logging]: rule forest:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/forest_from_single_cell_imaging_data-statistics.tsv
    jobid: 4

[2019-08-15 20:45:47,513 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       forest
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/forest.py --predict None single_cell_imaging_data.tsv meta_columns.tsv feature_columns.tsv is_infected binomial results/2019_08_15/forest_from_single_cell_imaging_data > results/2019_08_15/forest_from_single_cell_imaging_data-spark.log 0m
[Thu Aug 15 20:46:15 2019]
[2019-08-15 20:46:15,981 - INFO - snakemake.logging]: [Thu Aug 15 20:46:15 2019]
Finished job 4.
[2019-08-15 20:46:15,981 - INFO - snakemake.logging]: Finished job 4.
5 of 6 steps (83%) done
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]: 5 of 6 steps (83%) done

[2019-08-15 20:46:15,982 - INFO - snakemake.logging]:
[Thu Aug 15 20:46:15 2019]
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]: [Thu Aug 15 20:46:15 2019]
rule glm:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/glm_from_single_cell_imaging_data-table.tsv, results/2019_08_15/glm_from_single_cell_imaging_data-statistics.tsv
    jobid: 2
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]: rule glm:
    input: single_cell_imaging_data.tsv
    output: results/2019_08_15/glm_from_single_cell_imaging_data-table.tsv, results/2019_08_15/glm_from_single_cell_imaging_data-statistics.tsv
    jobid: 2

[2019-08-15 20:46:15,982 - INFO - snakemake.logging]:
 Printing rule tree:
 -> _ (, single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
         -> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
         -> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
                 -> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)

Job counts:
        count   jobs
        1       glm
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/glm.py --predict None single_cell_imaging_data.tsv meta_columns.tsv feature_columns.tsv is_infected binomial results/2019_08_15/glm_from_single_cell_imaging_data > results/2019_08_15/glm_from_single_cell_imaging_data-spark.log 0m
[Thu Aug 15 20:46:45 2019]
[2019-08-15 20:46:45,434 - INFO - snakemake.logging]: [Thu Aug 15 20:46:45 2019]
Finished job 2.
[2019-08-15 20:46:45,434 - INFO - snakemake.logging]: Finished job 2.
6 of 6 steps (100%) done
[2019-08-15 20:46:45,434 - INFO - snakemake.logging]: 6 of 6 steps (100%) done
Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-15T203500.185751.snakemake.log
[2019-08-15 20:46:45,435 - WARNING - snakemake.logging]: Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-15T203500.185751.snakemake.log
(pybda)

That’s it! After pybda finishes, we should check which files were created. For instance, since we wanted to use a PCA as input for \(k\)-means, two of the created files would be:

  • pca_from_single_cell_imaging_data.tsv and
  • kmeans_from_pca_from_single_cell_imaging_data.tsv

[5]:
cd results
ls -lgG * | grep kmeans_from_pca_from_single_cell_imaging_data
(pybda) drwxrwxr-x 4    4096 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data
-rw-rw-r-- 1   33389 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.eps
-rw-rw-r-- 1   10937 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.pdf
-rw-rw-r-- 1   94000 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.png
-rw-rw-r-- 1   43122 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.svg
-rw-rw-r-- 1    6668 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data.log
-rw-rw-r-- 1   20587 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.eps
-rw-rw-r-- 1   11827 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.pdf
-rw-rw-r-- 1  223485 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.png
-rw-rw-r-- 1   28460 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.svg
-rw-rw-r-- 1     220 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.tsv
-rw-rw-r-- 1  893509 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-spark.log
-rw-rw-r-- 1      34 Aug 15 20:41 kmeans_from_pca_from_single_cell_imaging_data-total_sse.tsv
drwxrwxr-x 2    4096 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-transformed-K100-clusters
drwxrwxr-x 2    4096 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-transformed-K50-clusters
(pybda)
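
The same check can be scripted. The following is a minimal sketch, assuming the layout seen in this run, where outputs sit in a dated subfolder of the results folder (here results/2019_08_15). The function name outputs_with_prefix is hypothetical and only the Python standard library is used.

```python
from pathlib import Path


def outputs_with_prefix(results_dir, prefix):
    """Return the sorted names of all entries one level below results_dir
    whose name starts with prefix."""
    root = Path(results_dir)
    # Assumption: outputs are written into a dated subfolder, so we look
    # exactly one directory level down.
    return sorted(p.name for p in root.glob(f"*/{prefix}*"))


if __name__ == "__main__":
    for name in outputs_with_prefix(
        "results", "kmeans_from_pca_from_single_cell_imaging_data"
    ):
        print(name)
```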

Since we should always check log files, let’s have a look at one. For instance, since we wanted to run ICA followed by \(k\)-means, we can check whether pybda chained the two steps correctly:

[6]:
cat */kmeans_from_ica_from_single_cell_imaging_data.log | tail -n40 | head -n5
[2019-08-15 20:41:17,066 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-15 20:41:17,351 - INFO - pybda.spark.features]: Assembling column to feature vector
[2019-08-15 20:41:17,351 - INFO - pybda.spark.features]: Found columns with prefix f_ from previous computation: f_0    f_1     f_2     f_3     f_4. Preferring these columns as features
[2019-08-15 20:41:17,455 - INFO - pybda.spark.features]: Dropping redundant columns
[2019-08-15 20:41:18,203 - INFO - pybda.spark.dataframe]: Using data with n=10000 and p=5
(pybda)
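
For longer logs, filtering by logger name can be easier than combining tail and head. Below is a minimal sketch, assuming every relevant line follows the "[timestamp - LEVEL - logger]: message" layout seen in the output above. The names LOG_LINE, parse_log_line and messages_from are hypothetical helpers, not part of pybda.

```python
import re

# Layout assumed from the log excerpt: "[<timestamp> - <LEVEL> - <logger>]: <message>"
LOG_LINE = re.compile(
    r"^\[(?P<time>[^\]]+?) - (?P<level>[A-Z]+) - (?P<logger>[^\]]+)\]: ?(?P<message>.*)$"
)


def parse_log_line(line):
    """Split one log line into its fields; return None if it does not match."""
    match = LOG_LINE.match(line.rstrip("\n"))
    return match.groupdict() if match else None


def messages_from(lines, logger):
    """Collect the messages emitted by one logger, e.g. 'pybda.spark.features'."""
    records = (parse_log_line(line) for line in lines)
    return [rec["message"] for rec in records if rec and rec["logger"] == logger]


if __name__ == "__main__":
    example = [
        "[2019-08-15 20:41:17,066 - INFO - pybda.spark.features]: Casting columns to double.",
        "[2019-08-15 20:41:18,203 - INFO - pybda.spark.dataframe]: Using data with n=10000 and p=5",
    ]
    print(messages_from(example, "pybda.spark.dataframe"))
```

Lines that do not follow this layout, such as Python tracebacks, are skipped.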

The log shows that, when executing \(k\)-means, pybda recognized that a dimension reduction had been run before (Found columns with prefix f_ from previous computation) and therefore clusters on these columns rather than on the original data. The same should hold, for instance, for PCA followed by the Gaussian mixture model:

[7]:
cat */gmm_from_pca_from_single_cell_imaging_data.log | tail -n40 | head -n5
[2019-08-15 20:44:02,430 - INFO - pybda.io.io]: Reading tsv: results/2019_08_15/pca_from_single_cell_imaging_data.tsv
[2019-08-15 20:44:05,672 - INFO - pybda.spark.features]: Found columns with prefix f_ from previous computation: f_0    f_1     f_2     f_3     f_4. Preferring these columns as features
[2019-08-15 20:44:05,672 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-15 20:44:05,946 - INFO - pybda.spark.features]: Assembling column to feature vector
(pybda)
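
The behavior described by these log messages can be pictured with a small sketch. It only illustrates the rule "prefer columns with prefix f_ if a previous step produced them" and is not pybda's actual implementation. The function select_feature_columns and the example column names area and intensity are hypothetical.

```python
def select_feature_columns(columns, requested_features, prefix="f_"):
    """Pick the columns to use as features.

    If the data already contains columns starting with `prefix` (as written
    by an earlier dimension reduction), use those. Otherwise fall back to
    the requested feature columns that are present in the data.
    """
    derived = [c for c in columns if c.startswith(prefix)]
    if derived:
        return derived
    return [c for c in requested_features if c in columns]


if __name__ == "__main__":
    # Hypothetical column names, for illustration only.
    reduced = ["is_infected", "f_0", "f_1", "f_2", "f_3", "f_4"]
    raw = ["is_infected", "area", "intensity"]
    print(select_feature_columns(reduced, ["area", "intensity"]))
    print(select_feature_columns(raw, ["area", "intensity"]))
```

With five components from PCA or ICA, the first call selects five columns, in line with the p=5 reported in the \(k\)-means log.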