Clustering and mixture models

The following notebook shows how PyBDA can be used for a simple clustering task. We use the iris data, because it is fairly easy to cluster and thus serves as a sanity check that everything works as expected. We’ll fit a \(k\)-means clustering and compare it to a Gaussian mixture model (GMM). \(k\)-means makes fairly stringent assumptions about the data, i.e. spherical clusters of equal variance, while the GMM estimates the covariances from the data.
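As a reminder, the two objectives differ as follows (standard textbook definitions, not PyBDA-specific notation): \(k\)-means minimizes the within-cluster sum of squares, while a GMM maximizes the likelihood of a mixture of Gaussians,

\[ \min_{C_1, \dots, C_K} \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2, \qquad p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k). \]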

We start by activating our environment:

[1]:
source ~/miniconda3/bin/activate pybda
(pybda)

In order to run the two clusterings, we merely need to set up a short config file listing the two method names. A file that does the trick is already provided in the data folder:

[2]:
cd data
(pybda)
[3]:
cat pybda-usecase-clustering.config
spark: spark-submit
infile: iris.tsv
outfolder: results
meta: iris_meta_columns.tsv
features: iris_feature_columns.tsv
clustering: kmeans,gmm
n_centers: 3, 5, 10
sparkparams:
  - "--driver-memory=1G"
  - "--executor-memory=1G"
debug: true
(pybda)

With the config above, PyBDA will do the following:

  • do clustering/mixture model with 3, 5 and 10 cluster centers on the features provided in iris_feature_columns.tsv,
  • give the Spark driver 1G of memory and the executor 1G of memory,
  • write the results to results,
  • print debug information.

This totals six model fits from two different methods, with minimal coding effort.
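The fan-out can be sketched as follows (a hypothetical illustration of what the config expands to, not PyBDA’s internal code):

```python
from itertools import product

methods = ["kmeans", "gmm"]   # from the `clustering` config key
n_centers = [3, 5, 10]        # from the `n_centers` config key

# PyBDA fans the config out into one fit per (method, k) combination
fits = list(product(methods, n_centers))
print(len(fits))  # -> 6
```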

Having the parameters set, we can call PyBDA:

[4]:
pybda clustering pybda-usecase-clustering.config local | head -n 10
Building DAG of jobs...
[2019-08-08 22:22:30,284 - WARNING - snakemake.logging]: Building DAG of jobs...
Using shell: /bin/bash
[2019-08-08 22:22:30,296 - WARNING - snakemake.logging]: Using shell: /bin/bash
Provided cores: 1
[2019-08-08 22:22:30,296 - WARNING - snakemake.logging]: Provided cores: 1
Rules claiming more threads will be scaled down.
[2019-08-08 22:22:30,296 - WARNING - snakemake.logging]: Rules claiming more threads will be scaled down.
Job counts:
        count   jobs
        1       gmm
        1       kmeans
        2
[2019-08-08 22:22:30,297 - WARNING - snakemake.logging]: Job counts:
        count   jobs
        1       gmm
        1       kmeans
        2

[2019-08-08 22:22:30,297 - INFO - snakemake.logging]:
[Thu Aug  8 22:22:30 2019]
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]: [Thu Aug  8 22:22:30 2019]
rule gmm:
    input: iris.tsv
    output: results/2019_08_08/gmm_from_iris, results/2019_08_08/gmm_from_iris-profile.png, results/2019_08_08/gmm_from_iris-profile.pdf, results/2019_08_08/gmm_from_iris-profile.eps, results/2019_08_08/gmm_from_iris-profile.svg, results/2019_08_08/gmm_from_iris-profile.tsv, results/2019_08_08/gmm_from_iris-transformed-K3-components, results/2019_08_08/gmm_from_iris-transformed-K5-components, results/2019_08_08/gmm_from_iris-transformed-K10-components
    jobid: 0
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]: rule gmm:
    input: iris.tsv
    output: results/2019_08_08/gmm_from_iris, results/2019_08_08/gmm_from_iris-profile.png, results/2019_08_08/gmm_from_iris-profile.pdf, results/2019_08_08/gmm_from_iris-profile.eps, results/2019_08_08/gmm_from_iris-profile.svg, results/2019_08_08/gmm_from_iris-profile.tsv, results/2019_08_08/gmm_from_iris-transformed-K3-components, results/2019_08_08/gmm_from_iris-transformed-K5-components, results/2019_08_08/gmm_from_iris-transformed-K10-components
    jobid: 0

[2019-08-08 22:22:30,297 - INFO - snakemake.logging]:
Job counts:
        count   jobs
        1       gmm
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/gmm.py 3,5,10 iris.tsv iris_feature_columns.tsv results/2019_08_08/gmm_from_iris > results/2019_08_08/gmm_from_iris-spark.log 0m
 Printing rule tree:
 -> _ (, iris.tsv)
         -> clustering (iris.tsv, results/2019_08_08/gmm_from_iris.tsv)
         -> clustering (iris.tsv, results/2019_08_08/kmeans_from_iris.tsv)

[Thu Aug  8 22:22:59 2019]
[2019-08-08 22:22:59,844 - INFO - snakemake.logging]: [Thu Aug  8 22:22:59 2019]
Finished job 0.
[2019-08-08 22:22:59,844 - INFO - snakemake.logging]: Finished job 0.
1 of 2 steps (50%) done
[2019-08-08 22:22:59,845 - INFO - snakemake.logging]: 1 of 2 steps (50%) done

[2019-08-08 22:22:59,847 - INFO - snakemake.logging]:
[Thu Aug  8 22:22:59 2019]
[2019-08-08 22:22:59,848 - INFO - snakemake.logging]: [Thu Aug  8 22:22:59 2019]
rule kmeans:
    input: iris.tsv
    output: results/2019_08_08/kmeans_from_iris, results/2019_08_08/kmeans_from_iris-profile.png, results/2019_08_08/kmeans_from_iris-profile.pdf, results/2019_08_08/kmeans_from_iris-profile.eps, results/2019_08_08/kmeans_from_iris-profile.svg, results/2019_08_08/kmeans_from_iris-profile.tsv, results/2019_08_08/kmeans_from_iris-transformed-K3-clusters, results/2019_08_08/kmeans_from_iris-transformed-K5-clusters, results/2019_08_08/kmeans_from_iris-transformed-K10-clusters
    jobid: 1
[2019-08-08 22:22:59,848 - INFO - snakemake.logging]: rule kmeans:
    input: iris.tsv
    output: results/2019_08_08/kmeans_from_iris, results/2019_08_08/kmeans_from_iris-profile.png, results/2019_08_08/kmeans_from_iris-profile.pdf, results/2019_08_08/kmeans_from_iris-profile.eps, results/2019_08_08/kmeans_from_iris-profile.svg, results/2019_08_08/kmeans_from_iris-profile.tsv, results/2019_08_08/kmeans_from_iris-transformed-K3-clusters, results/2019_08_08/kmeans_from_iris-transformed-K5-clusters, results/2019_08_08/kmeans_from_iris-transformed-K10-clusters
    jobid: 1

[2019-08-08 22:22:59,850 - INFO - snakemake.logging]:
Job counts:
        count   jobs
        1       kmeans
        1
 Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/kmeans.py 3,5,10 iris.tsv iris_feature_columns.tsv results/2019_08_08/kmeans_from_iris > results/2019_08_08/kmeans_from_iris-spark.log 0m
 Printing rule tree:
 -> _ (, iris.tsv)
         -> clustering (iris.tsv, results/2019_08_08/gmm_from_iris.tsv)
[Thu Aug  8 22:23:31 2019]
[2019-08-08 22:23:31,590 - INFO - snakemake.logging]: [Thu Aug  8 22:23:31 2019]
Finished job 1.
[2019-08-08 22:23:31,590 - INFO - snakemake.logging]: Finished job 1.
2 of 2 steps (100%) done
[2019-08-08 22:23:31,590 - INFO - snakemake.logging]: 2 of 2 steps (100%) done
Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-08T222230.229745.snakemake.log
[2019-08-08 22:23:31,590 - WARNING - snakemake.logging]: Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-08T222230.229745.snakemake.log
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe
(pybda)

The call automatically executes the jobs defined in the config. (The BrokenPipeError at the end of the output is harmless; it is only caused by piping the output through head.) After both jobs have run, we should check the plots and statistics. Let’s see what we got:

[5]:
cd results
ls -lgG *
(pybda) total 3196
drwxrwxr-x 5    4096 Aug  8 22:22 gmm_from_iris
-rw-rw-r-- 1   41630 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.eps
-rw-rw-r-- 1   11913 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.pdf
-rw-rw-r-- 1   94191 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.png
-rw-rw-r-- 1   54900 Aug  8 22:22 gmm_from_iris-cluster_sizes-histogram.svg
-rw-rw-r-- 1    6332 Aug  8 22:22 gmm_from_iris.log
-rw-rw-r-- 1   22803 Aug  8 22:22 gmm_from_iris-profile.eps
-rw-rw-r-- 1   13040 Aug  8 22:22 gmm_from_iris-profile.pdf
-rw-rw-r-- 1  218232 Aug  8 22:22 gmm_from_iris-profile.png
-rw-rw-r-- 1   31619 Aug  8 22:22 gmm_from_iris-profile.svg
-rw-rw-r-- 1     137 Aug  8 22:22 gmm_from_iris-profile.tsv
-rw-rw-r-- 1 1288436 Aug  8 22:22 gmm_from_iris-spark.log
drwxrwxr-x 2    4096 Aug  8 22:22 gmm_from_iris-transformed-K10-components
drwxrwxr-x 2    4096 Aug  8 22:22 gmm_from_iris-transformed-K3-components
drwxrwxr-x 2    4096 Aug  8 22:22 gmm_from_iris-transformed-K5-components
drwxrwxr-x 5    4096 Aug  8 22:23 kmeans_from_iris
-rw-rw-r-- 1   42077 Aug  8 22:23 kmeans_from_iris-cluster_sizes-histogram.eps
-rw-rw-r-- 1   12123 Aug  8 22:23 kmeans_from_iris-cluster_sizes-histogram.pdf
-rw-rw-r-- 1   95647 Aug  8 22:23 kmeans_from_iris-cluster_sizes-histogram.png
-rw-rw-r-- 1   55531 Aug  8 22:23 kmeans_from_iris-cluster_sizes-histogram.svg
-rw-rw-r-- 1    6699 Aug  8 22:23 kmeans_from_iris.log
-rw-rw-r-- 1   21507 Aug  8 22:23 kmeans_from_iris-profile.eps
-rw-rw-r-- 1   11786 Aug  8 22:23 kmeans_from_iris-profile.pdf
-rw-rw-r-- 1  211541 Aug  8 22:23 kmeans_from_iris-profile.png
-rw-rw-r-- 1   29644 Aug  8 22:23 kmeans_from_iris-profile.svg
-rw-rw-r-- 1     296 Aug  8 22:23 kmeans_from_iris-profile.tsv
-rw-rw-r-- 1  920657 Aug  8 22:23 kmeans_from_iris-spark.log
-rw-rw-r-- 1      33 Aug  8 22:23 kmeans_from_iris-total_sse.tsv
drwxrwxr-x 2    4096 Aug  8 22:23 kmeans_from_iris-transformed-K10-clusters
drwxrwxr-x 2    4096 Aug  8 22:23 kmeans_from_iris-transformed-K3-clusters
drwxrwxr-x 2    4096 Aug  8 22:23 kmeans_from_iris-transformed-K5-clusters
(pybda)

Finally, let’s check how many clusters/components are recommended for each method:

[6]:
cat */kmeans_from_iris-profile.tsv
k       within_cluster_variance explained_variance      total_variance  BIC
3       78.85566447695781       0.8842690519576577      681.3705911067716       143.99392330020913
5       46.71230004910447       0.931443621637341       681.3705911067716       151.93564122512583
10      32.49992444218044       0.9523021321049536      681.3705911067716       237.93597150012693
(pybda)
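The columns of this profile appear to be related by explained variance \(= 1 -\) within-cluster variance \(/\) total variance; a quick sanity check with the \(k = 3\) row from above:

```python
# Relation between the columns of kmeans_from_iris-profile.tsv:
# explained_variance = 1 - within_cluster_variance / total_variance
within = 78.85566447695781   # within_cluster_variance for k = 3
total = 681.3705911067716    # total_variance (same for every k)
explained = 1 - within / total
print(round(explained, 6))   # -> 0.884269, matching the table
```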
[7]:
cat */gmm_from_iris-profile.tsv
k       loglik  BIC
3       -189.53852954643244     599.5450120331002
5       -154.60572954901806     679.998470861159
10      -50.516420858075136     847.6175005364923
(pybda)
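The GMM’s BIC column is consistent with the usual definition \(\mathrm{BIC} = m \ln n - 2 \ln \hat{L}\), where \(m\) is the number of free parameters of a \(K\)-component full-covariance Gaussian mixture in \(d\) dimensions. A quick check against the table above (the parameter count is our reconstruction, not taken from the PyBDA sources):

```python
import math

# BIC = m * ln(n) - 2 * loglik, with the free-parameter count
#   m = (K - 1) + K*d + K*d*(d + 1)/2   (weights + means + covariances)
n, d = 150, 4  # iris: 150 samples, 4 features
logliks = {3: -189.53852954643244, 5: -154.60572954901806, 10: -50.516420858075136}
bics = {}
for k, loglik in logliks.items():
    m = (k - 1) + k * d + k * d * (d + 1) // 2
    bics[k] = m * math.log(n) - 2 * loglik
    print(k, round(bics[k], 3))  # reproduces the BIC column above
```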

So in both cases the BIC is lowest for three clusters! Just as expected from the iris data. Nice!
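Picking the best \(k\) can also be done programmatically by parsing the profile TSV and taking the row with the lowest BIC. A small sketch using only the standard library (the profile values are copied from the output above, truncated for readability):

```python
import csv
import io

# Stand-in for open("kmeans_from_iris-profile.tsv"); values from the output above
kmeans_profile = """k\twithin_cluster_variance\texplained_variance\ttotal_variance\tBIC
3\t78.8557\t0.8843\t681.3706\t143.9939
5\t46.7123\t0.9314\t681.3706\t151.9356
10\t32.4999\t0.9523\t681.3706\t237.9360"""

rows = list(csv.DictReader(io.StringIO(kmeans_profile), delimiter="\t"))
best = min(rows, key=lambda row: float(row["BIC"]))  # lowest BIC wins
print(best["k"])  # -> 3
```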

There’s plenty of other files and plots available to check out, though! For instance, we should always look at the log files we created to verify the parameters that were used and what was actually computed:

[8]:
cat */gmm_from_iris.log
[2019-08-08 22:22:33,019 - INFO - pybda.spark_session]: Initializing pyspark session
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.master, value: local
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.memory, value: 1G
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.rdd.compress, value: True
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.serializer.objectStreamReset, value: 100
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.app.name, value: gmm.py
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.executor.id, value: driver
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.port, value: 37957
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.app.id, value: local-1565295753672
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.submit.deployMode, value: client
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.host, value: 192.168.1.33
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Openened spark context at: Thu Aug  8 22:22:34 2019
[2019-08-08 22:22:34,101 - INFO - pybda.io.io]: Reading tsv: iris.tsv
[2019-08-08 22:22:37,724 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-08 22:22:37,974 - INFO - pybda.spark.features]: Assembling column to feature vector
[2019-08-08 22:22:38,080 - INFO - pybda.spark.features]: Dropping redundant columns
[2019-08-08 22:22:38,816 - INFO - pybda.spark.dataframe]: Using data with n=150 and p=4
[2019-08-08 22:22:38,822 - INFO - __main__]: Clustering with K: 3
[2019-08-08 22:22:42,892 - INFO - pybda.decorators]: function: '_fit_one' took: 4.0702 sec
[2019-08-08 22:22:42,893 - INFO - pybda.fit.clustering_fit]: Writing cluster fit to: results/2019_08_08/gmm_from_iris/gmm-fit-K3
[2019-08-08 22:22:43,828 - INFO - pybda.fit.clustering_fit]: Writing cluster size file to: results/2019_08_08/gmm_from_iris/gmm-fit-K3_cluster_sizes.tsv
[2019-08-08 22:22:43,833 - INFO - pybda.fit.gmm_fit]: Writing cluster weights/means/variances
[2019-08-08 22:22:43,982 - INFO - pybda.fit.gmm_fit]: Writing LogLik and BIC to: results/2019_08_08/gmm_from_iris/gmm-fit-K3_statistics.tsv
[2019-08-08 22:22:43,982 - INFO - __main__]: Clustering with K: 5
[2019-08-08 22:22:46,301 - INFO - pybda.decorators]: function: '_fit_one' took: 2.3183 sec
[2019-08-08 22:22:46,301 - INFO - pybda.fit.clustering_fit]: Writing cluster fit to: results/2019_08_08/gmm_from_iris/gmm-fit-K5
[2019-08-08 22:22:46,485 - INFO - pybda.fit.clustering_fit]: Writing cluster size file to: results/2019_08_08/gmm_from_iris/gmm-fit-K5_cluster_sizes.tsv
[2019-08-08 22:22:46,488 - INFO - pybda.fit.gmm_fit]: Writing cluster weights/means/variances
[2019-08-08 22:22:46,590 - INFO - pybda.fit.gmm_fit]: Writing LogLik and BIC to: results/2019_08_08/gmm_from_iris/gmm-fit-K5_statistics.tsv
[2019-08-08 22:22:46,590 - INFO - __main__]: Clustering with K: 10
[2019-08-08 22:22:48,165 - INFO - pybda.decorators]: function: '_fit_one' took: 1.5755 sec
[2019-08-08 22:22:48,166 - INFO - pybda.fit.clustering_fit]: Writing cluster fit to: results/2019_08_08/gmm_from_iris/gmm-fit-K10
[2019-08-08 22:22:48,347 - INFO - pybda.fit.clustering_fit]: Writing cluster size file to: results/2019_08_08/gmm_from_iris/gmm-fit-K10_cluster_sizes.tsv
[2019-08-08 22:22:48,352 - INFO - pybda.fit.gmm_fit]: Writing cluster weights/means/variances
[2019-08-08 22:22:48,452 - INFO - pybda.fit.gmm_fit]: Writing LogLik and BIC to: results/2019_08_08/gmm_from_iris/gmm-fit-K10_statistics.tsv
[2019-08-08 22:22:48,452 - INFO - pybda.fit.clustering_fit_profile]: Writing fit profile to results/2019_08_08/gmm_from_iris-profile.tsv
[2019-08-08 22:22:48,454 - INFO - pybda.fit.clustering_fit_profile]: 3
[2019-08-08 22:22:48,462 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.png
[2019-08-08 22:22:51,947 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.png
[2019-08-08 22:22:53,737 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.pdf
[2019-08-08 22:22:54,015 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.pdf
[2019-08-08 22:22:54,398 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.svg
[2019-08-08 22:22:54,654 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.svg
[2019-08-08 22:22:55,080 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.eps
[2019-08-08 22:22:55,348 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.eps
[2019-08-08 22:22:55,702 - INFO - pybda.decorators]: function: 'fit' took: 17.6145 sec
[2019-08-08 22:22:55,714 - INFO - pybda.fit.gmm_transformed]: Writing components to: results/2019_08_08/gmm_from_iris-transformed-K3-components
[2019-08-08 22:22:55,715 - INFO - pybda.spark.features]: Splitting vector columns: features
[2019-08-08 22:22:55,863 - INFO - pybda.spark.features]: Splitting vector columns: responsibilities
[2019-08-08 22:22:57,157 - INFO - pybda.fit.gmm_transformed]: Writing components to: results/2019_08_08/gmm_from_iris-transformed-K5-components
[2019-08-08 22:22:57,161 - INFO - pybda.spark.features]: Splitting vector columns: features
[2019-08-08 22:22:57,305 - INFO - pybda.spark.features]: Splitting vector columns: responsibilities
[2019-08-08 22:22:57,833 - INFO - pybda.fit.gmm_transformed]: Writing components to: results/2019_08_08/gmm_from_iris-transformed-K10-components
[2019-08-08 22:22:57,834 - INFO - pybda.spark.features]: Splitting vector columns: features
[2019-08-08 22:22:57,951 - INFO - pybda.spark.features]: Splitting vector columns: responsibilities
[2019-08-08 22:22:58,572 - INFO - pybda.spark_session]: Stopping Spark context
[2019-08-08 22:22:58,572 - INFO - pybda.spark_session]: Closed spark context at: Thu Aug  8 22:22:58 2019
[2019-08-08 22:22:58,572 - INFO - pybda.spark_session]: Computation took: 24
(pybda)

Furthermore, the Spark log file is often important to look at when a method fails:

[10]:
head */gmm_from_iris-spark.log
2019-08-08 22:22:31 WARN  Utils:66 - Your hostname, hoto resolves to a loopback address: 127.0.1.1; using 192.168.1.33 instead (on interface wlp2s0)
2019-08-08 22:22:31 WARN  Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-08 22:22:31 WARN  NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-08 22:22:33 INFO  SparkContext:54 - Running Spark version 2.4.0
2019-08-08 22:22:33 INFO  SparkContext:54 - Submitted application: gmm.py
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing view acls to: simon
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing modify acls to: simon
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing view acls groups to:
2019-08-08 22:22:33 INFO  SecurityManager:54 - Changing modify acls groups to:
2019-08-08 22:22:33 INFO  SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(simon); groups with view permissions: Set(); users  with modify permissions: Set(simon); groups with modify permissions: Set()
(pybda)