Combining multiple tasks at once
Often we want to apply several methods in one go. This notebook shows how it is done: we combine dimension reduction, clustering, and regression in a single run, using one simple config file.
We start by loading our designated pybda environment:
[1]:
source ~/miniconda3/bin/activate pybda
(pybda)
To run combinations of methods and models, we simply list them all in the same config file. We deposited one in the data folder of pybda:
[2]:
cd data
(pybda)
[3]:
cat pybda-usecase-dimred+clustering+regression.config
spark: spark-submit
infile: single_cell_imaging_data.tsv
outfolder: results
meta: meta_columns.tsv
features: feature_columns.tsv
dimension_reduction: pca, ica
n_components: 5
clustering: kmeans, gmm
n_centers: 50, 100
regression: forest, glm
response: is_infected
family: binomial
sparkparams:
- "--driver-memory=1G"
- "--executor-memory=1G"
debug: true
(pybda)
With the config file above, we will do the following:
- fit a PCA and an ICA to single_cell_imaging_data.tsv using 5 components,
- on each of the two results of PCA and ICA, do a \(k\)-means and a GMM clustering, each with 50 and with 100 cluster centers,
- regress the response column (is_infected) on the features in feature_columns.tsv using a random forest and a GLM,
- use a binomial family for the response,
- give the Spark driver 1G of memory and the executor 1G of memory,
- write the results to results,
- print debug information.
That’s all we need to do!
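To make the combinations concrete, here is a minimal sketch that enumerates the output names we would expect from this config. It assumes the method_from_input naming pattern that pybda prints in its rule tree; the helper expected_outputs is hypothetical and written only for illustration, it is not part of pybda.

```python
# Values copied from the config file shown above.
infile = "single_cell_imaging_data.tsv"
dimension_reduction = ["pca", "ica"]
clustering = ["kmeans", "gmm"]
regression = ["forest", "glm"]


def expected_outputs(infile, dimension_reduction, clustering, regression):
    """Return output base names following the '<method>_from_<input>' pattern."""
    stem = infile.rsplit(".", 1)[0]  # drop the '.tsv' extension
    names = []
    # Regression methods are fit on the original input file.
    names += [f"{method}_from_{stem}" for method in regression]
    for reduction in dimension_reduction:
        reduced = f"{reduction}_from_{stem}"
        names.append(reduced)
        # Every clustering method is applied to every reduced data set.
        names += [f"{method}_from_{reduced}" for method in clustering]
    return names


for name in expected_outputs(infile, dimension_reduction, clustering, regression):
    print(name)
```

The sketch only builds names. Which files are actually written, and with which suffixes, is decided by pybda itself.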
We then call pybda from the command line. Usually we would call pybda with a specific target (i.e., clustering, dimension-reduction, or regression) so that we do not run everything. In this case, however, we want to execute everything, so we call it with run.
[4]:
pybda run pybda-usecase-dimred+clustering+regression.config local
Checking command line arguments for method: regression
Checking command line arguments for method: dimension_reduction
Checking command line arguments for method: clustering
Printing rule tree:
-> _ (, single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)
Building DAG of jobs...
[2019-08-15 20:35:00,248 - WARNING - snakemake.logging]: Building DAG of jobs...
Using shell: /bin/bash
[2019-08-15 20:35:00,265 - WARNING - snakemake.logging]: Using shell: /bin/bash
Provided cores: 1
[2019-08-15 20:35:00,265 - WARNING - snakemake.logging]: Provided cores: 1
Rules claiming more threads will be scaled down.
[2019-08-15 20:35:00,265 - WARNING - snakemake.logging]: Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 forest
1 glm
1 gmm
1 ica
1 kmeans
1 pca
6
[2019-08-15 20:35:00,266 - WARNING - snakemake.logging]: Job counts:
count jobs
1 forest
1 glm
1 gmm
1 ica
1 kmeans
1 pca
6
[2019-08-15 20:35:00,266 - INFO - snakemake.logging]:
[Thu Aug 15 20:35:00 2019]
[2019-08-15 20:35:00,267 - INFO - snakemake.logging]: [Thu Aug 15 20:35:00 2019]
rule pca:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-plot
jobid: 0
[2019-08-15 20:35:00,267 - INFO - snakemake.logging]: rule pca:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/pca_from_single_cell_imaging_data-plot
jobid: 0
[2019-08-15 20:35:00,267 - INFO - snakemake.logging]:
Printing rule tree:
-> _ (, single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)
Job counts:
count jobs
1 pca
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/pca.py 5 single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/pca_from_single_cell_imaging_data > results/2019_08_15/pca_from_single_cell_imaging_data-spark.log
Traceback (most recent call last):
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
[Thu Aug 15 20:36:09 2019]
[2019-08-15 20:36:09,391 - INFO - snakemake.logging]: [Thu Aug 15 20:36:09 2019]
Finished job 0.
[2019-08-15 20:36:09,392 - INFO - snakemake.logging]: Finished job 0.
1 of 6 steps (17%) done
[2019-08-15 20:36:09,392 - INFO - snakemake.logging]: 1 of 6 steps (17%) done
[2019-08-15 20:36:09,392 - INFO - snakemake.logging]:
[Thu Aug 15 20:36:09 2019]
[2019-08-15 20:36:09,392 - INFO - snakemake.logging]: [Thu Aug 15 20:36:09 2019]
rule ica:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-plot
jobid: 3
[2019-08-15 20:36:09,393 - INFO - snakemake.logging]: rule ica:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-loadings.tsv, results/2019_08_15/ica_from_single_cell_imaging_data-plot
jobid: 3
[2019-08-15 20:36:09,393 - INFO - snakemake.logging]:
Printing rule tree:
-> _ (, single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)
Job counts:
count jobs
1 ica
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/ica.py 5 single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/ica_from_single_cell_imaging_data > results/2019_08_15/ica_from_single_cell_imaging_data-spark.log
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/mllib/linalg/__init__.py:291: ComplexWarning: Casting complex values to real discards the imaginary part
Traceback (most recent call last):
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 170, in manager
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/daemon.py", line 73, in worker
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 397, in main
if read_int(infile) == SpecialLengths.END_OF_STREAM:
File "/opt/local/spark/spark/python/lib/pyspark.zip/pyspark/serializers.py", line 714, in read_int
raise EOFError
EOFError
[Thu Aug 15 20:41:10 2019]
[2019-08-15 20:41:10,231 - INFO - snakemake.logging]: [Thu Aug 15 20:41:10 2019]
Finished job 3.
[2019-08-15 20:41:10,232 - INFO - snakemake.logging]: Finished job 3.
2 of 6 steps (33%) done
[2019-08-15 20:41:10,232 - INFO - snakemake.logging]: 2 of 6 steps (33%) done
[2019-08-15 20:41:10,233 - INFO - snakemake.logging]:
[Thu Aug 15 20:41:10 2019]
[2019-08-15 20:41:10,233 - INFO - snakemake.logging]: [Thu Aug 15 20:41:10 2019]
rule kmeans:
input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
output: results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K100-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K100-clusters
jobid: 1
[2019-08-15 20:41:10,233 - INFO - snakemake.logging]: rule kmeans:
input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
output: results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-transformed-K100-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K50-clusters, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-transformed-K100-clusters
jobid: 1
[2019-08-15 20:41:10,233 - INFO - snakemake.logging]:
Printing rule tree:
-> _ (, single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)
Job counts:
count jobs
1 kmeans
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/kmeans.py 50,100 results/2019_08_15/ica_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data > results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data-spark.log
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/kmeans.py 50,100 results/2019_08_15/pca_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data > results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data-spark.log
[Thu Aug 15 20:42:10 2019]
[2019-08-15 20:42:10,216 - INFO - snakemake.logging]: [Thu Aug 15 20:42:10 2019]
Finished job 1.
[2019-08-15 20:42:10,216 - INFO - snakemake.logging]: Finished job 1.
3 of 6 steps (50%) done
[2019-08-15 20:42:10,216 - INFO - snakemake.logging]: 3 of 6 steps (50%) done
[2019-08-15 20:42:10,217 - INFO - snakemake.logging]:
[Thu Aug 15 20:42:10 2019]
[2019-08-15 20:42:10,217 - INFO - snakemake.logging]: [Thu Aug 15 20:42:10 2019]
rule gmm:
input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
output: results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K100-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K100-components
jobid: 5
[2019-08-15 20:42:10,217 - INFO - snakemake.logging]: rule gmm:
input: results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv
output: results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.png, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.pdf, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.eps, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.svg, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-profile.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-transformed-K100-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K50-components, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-transformed-K100-components
jobid: 5
[2019-08-15 20:42:10,217 - INFO - snakemake.logging]:
Printing rule tree:
-> _ (, single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)
Job counts:
count jobs
1 gmm
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/gmm.py 50,100 results/2019_08_15/ica_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data > results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data-spark.log
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/gmm.py 50,100 results/2019_08_15/pca_from_single_cell_imaging_data.tsv feature_columns.tsv results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data > results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data-spark.log
[Thu Aug 15 20:45:47 2019]
[2019-08-15 20:45:47,512 - INFO - snakemake.logging]: [Thu Aug 15 20:45:47 2019]
Finished job 5.
[2019-08-15 20:45:47,512 - INFO - snakemake.logging]: Finished job 5.
4 of 6 steps (67%) done
[2019-08-15 20:45:47,512 - INFO - snakemake.logging]: 4 of 6 steps (67%) done
[2019-08-15 20:45:47,513 - INFO - snakemake.logging]:
[Thu Aug 15 20:45:47 2019]
[2019-08-15 20:45:47,513 - INFO - snakemake.logging]: [Thu Aug 15 20:45:47 2019]
rule forest:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/forest_from_single_cell_imaging_data-statistics.tsv
jobid: 4
[2019-08-15 20:45:47,513 - INFO - snakemake.logging]: rule forest:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/forest_from_single_cell_imaging_data-statistics.tsv
jobid: 4
[2019-08-15 20:45:47,513 - INFO - snakemake.logging]:
Printing rule tree:
-> _ (, single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)
Job counts:
count jobs
1 forest
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/forest.py --predict None single_cell_imaging_data.tsv meta_columns.tsv feature_columns.tsv is_infected binomial results/2019_08_15/forest_from_single_cell_imaging_data > results/2019_08_15/forest_from_single_cell_imaging_data-spark.log
[Thu Aug 15 20:46:15 2019]
[2019-08-15 20:46:15,981 - INFO - snakemake.logging]: [Thu Aug 15 20:46:15 2019]
Finished job 4.
[2019-08-15 20:46:15,981 - INFO - snakemake.logging]: Finished job 4.
5 of 6 steps (83%) done
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]: 5 of 6 steps (83%) done
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]:
[Thu Aug 15 20:46:15 2019]
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]: [Thu Aug 15 20:46:15 2019]
rule glm:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/glm_from_single_cell_imaging_data-table.tsv, results/2019_08_15/glm_from_single_cell_imaging_data-statistics.tsv
jobid: 2
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]: rule glm:
input: single_cell_imaging_data.tsv
output: results/2019_08_15/glm_from_single_cell_imaging_data-table.tsv, results/2019_08_15/glm_from_single_cell_imaging_data-statistics.tsv
jobid: 2
[2019-08-15 20:46:15,982 - INFO - snakemake.logging]:
Printing rule tree:
-> _ (, single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/glm_from_single_cell_imaging_data.tsv)
-> regression (single_cell_imaging_data.tsv, results/2019_08_15/forest_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_ica_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/ica_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_ica_from_single_cell_imaging_data.tsv)
-> dimension_reduction (single_cell_imaging_data.tsv, results/2019_08_15/pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/gmm_from_pca_from_single_cell_imaging_data.tsv)
-> clustering (results/2019_08_15/pca_from_single_cell_imaging_data.tsv, results/2019_08_15/kmeans_from_pca_from_single_cell_imaging_data.tsv)
Job counts:
count jobs
1 glm
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/glm.py --predict None single_cell_imaging_data.tsv meta_columns.tsv feature_columns.tsv is_infected binomial results/2019_08_15/glm_from_single_cell_imaging_data > results/2019_08_15/glm_from_single_cell_imaging_data-spark.log
[Thu Aug 15 20:46:45 2019]
[2019-08-15 20:46:45,434 - INFO - snakemake.logging]: [Thu Aug 15 20:46:45 2019]
Finished job 2.
[2019-08-15 20:46:45,434 - INFO - snakemake.logging]: Finished job 2.
6 of 6 steps (100%) done
[2019-08-15 20:46:45,434 - INFO - snakemake.logging]: 6 of 6 steps (100%) done
Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-15T203500.185751.snakemake.log
[2019-08-15 20:46:45,435 - WARNING - snakemake.logging]: Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-15T203500.185751.snakemake.log
(pybda)
That’s it! After pybda finishes, we should check which files we got. For instance, since we wanted to use a PCA as input for \(k\)-means, two of the created files would be:
- pca_from_single_cell_imaging_data.tsv and
- kmeans_from_pca_from_single_cell_imaging_data.tsv
[5]:
cd results
ls -lgG * | grep kmeans_from_pca_from_single_cell_imaging_data
(pybda) drwxrwxr-x 4 4096 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data
-rw-rw-r-- 1 33389 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.eps
-rw-rw-r-- 1 10937 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.pdf
-rw-rw-r-- 1 94000 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.png
-rw-rw-r-- 1 43122 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-cluster_sizes-histogram.svg
-rw-rw-r-- 1 6668 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data.log
-rw-rw-r-- 1 20587 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.eps
-rw-rw-r-- 1 11827 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.pdf
-rw-rw-r-- 1 223485 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.png
-rw-rw-r-- 1 28460 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.svg
-rw-rw-r-- 1 220 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-profile.tsv
-rw-rw-r-- 1 893509 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-spark.log
-rw-rw-r-- 1 34 Aug 15 20:41 kmeans_from_pca_from_single_cell_imaging_data-total_sse.tsv
drwxrwxr-x 2 4096 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-transformed-K100-clusters
drwxrwxr-x 2 4096 Aug 15 20:42 kmeans_from_pca_from_single_cell_imaging_data-transformed-K50-clusters
(pybda)
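The same listing can be produced from Python, which is handy when the check should be part of a script. This is a minimal sketch under the assumption that the results sit in a dated sub-folder of the output folder, as in the run above; outputs_for is a hypothetical helper, not a pybda function.

```python
from pathlib import Path


def outputs_for(results_dir, prefix):
    """List entries one level below results_dir whose names start with prefix."""
    root = Path(results_dir)
    # pybda wrote into a dated sub-folder (results/2019_08_15 in the run above),
    # so we match one directory level below results_dir.
    return sorted(p.name for p in root.glob(f"*/{prefix}*"))


if __name__ == "__main__":
    # Run from the data folder, i.e. the folder that contains results.
    prefix = "kmeans_from_pca_from_single_cell_imaging_data"
    for name in outputs_for("results", prefix):
        print(name)
```

If the folder does not exist or nothing matches, the function returns an empty list instead of raising an error.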
Since we should always check log files, let’s have a look at one. For instance, since we wanted to run ICA followed by \(k\)-means, we can check whether pybda did everything correctly:
[6]:
cat */kmeans_from_ica_from_single_cell_imaging_data.log | tail -n40 | head -n5
[2019-08-15 20:41:17,066 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-15 20:41:17,351 - INFO - pybda.spark.features]: Assembling column to feature vector
[2019-08-15 20:41:17,351 - INFO - pybda.spark.features]: Found columns with prefix f_ from previous computation: f_0 f_1 f_2 f_3 f_4. Preferring these columns as features
[2019-08-15 20:41:17,455 - INFO - pybda.spark.features]: Dropping redundant columns
[2019-08-15 20:41:18,203 - INFO - pybda.spark.dataframe]: Using data with n=10000 and p=5
(pybda)
Above we see that pybda, when executing the \(k\)-means, recognized that we had run a dimension reduction before (Found columns with prefix f_ from previous computation) and thus uses these columns for the clustering instead of the original data. The same should hold, for instance, for PCA and the Gaussian mixture model:
[7]:
cat */gmm_from_pca_from_single_cell_imaging_data.log | tail -n40 | head -n5
[2019-08-15 20:44:02,430 - INFO - pybda.io.io]: Reading tsv: results/2019_08_15/pca_from_single_cell_imaging_data.tsv
[2019-08-15 20:44:05,672 - INFO - pybda.spark.features]: Found columns with prefix f_ from previous computation: f_0 f_1 f_2 f_3 f_4.
Preferring these columns as features/
[2019-08-15 20:44:05,672 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-15 20:44:05,946 - INFO - pybda.spark.features]: Assembling column to feature vector
(pybda)
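Reading the two log excerpts above by eye works for a single run; the same check can be scripted. The following minimal sketch extracts the reduced feature columns from a log message. The sample line is copied from the \(k\)-means log above; the regular expression and the helper reduced_features are assumptions made for illustration and rely on the message wording staying as shown.

```python
import re

# Sample line copied from the k-means log excerpt above.
log_line = (
    "[2019-08-15 20:41:17,351 - INFO - pybda.spark.features]: "
    "Found columns with prefix f_ from previous computation: "
    "f_0 f_1 f_2 f_3 f_4. Preferring these columns as features"
)

# Capture the space-separated f_<index> names that follow the message.
PATTERN = re.compile(
    r"Found columns with prefix f_ from previous computation: ((?:f_\d+ ?)+)"
)


def reduced_features(log_text):
    """Return the f_ columns reported in log_text, or an empty list if none."""
    match = PATTERN.search(log_text)
    return match.group(1).split() if match else []


print(reduced_features(log_line))
```

For the sample line this yields five column names, which agrees with n_components: 5 in the config and with p=5 in the log.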