Clustering and mixture models
The following notebook
shows how PyBDA can be used for a simple clustering task. We use the iris
data, because it is easy to cluster and serves as a sanity check that everything works as expected. We'll fit a \(k\)-means clustering and compare it to a Gaussian mixture model (GMM). \(k\)-means makes fairly stringent assumptions about the data, namely spherical clusters of equal variance, while the GMM estimates the variances from the data.
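To make the contrast concrete, here are the standard textbook objectives (not specific to PyBDA's implementation): \(k\)-means minimizes the within-cluster sum of squares

\[\min_{C_1,\dots,C_K} \sum_{k=1}^{K}\sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2,\]

which is equivalent to fitting a mixture of spherical Gaussians with equal, fixed variance, whereas a GMM maximizes the log-likelihood

\[\log L = \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k),\]

estimating the weights \(\pi_k\), means \(\mu_k\) and covariances \(\Sigma_k\) from the data.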
We start by activating our environment:
[1]:
source ~/miniconda3/bin/activate pybda
(pybda)
In order to run the two clusterings, we merely need to set up a short config file with the two method names. A file that does the trick is already provided in the data
folder:
[2]:
cd data
(pybda)
[3]:
cat pybda-usecase-clustering.config
spark: spark-submit
infile: iris.tsv
outfolder: results
meta: iris_meta_columns.tsv
features: iris_feature_columns.tsv
clustering: kmeans,gmm
n_centers: 3, 5, 10
sparkparams:
- "--driver-memory=1G"
- "--executor-memory=1G"
debug: true
(pybda)
With the config above, PyBDA will:

- fit each clustering/mixture model with 3, 5 and 10 cluster centers on the features listed in iris_feature_columns.tsv,
- give the Spark driver 1G of memory and the executor 1G of memory,
- write the results to results,
- print debug information.

This totals 6 clusterings from two different methods with minimal coding effort.
Having the parameters set, we can call PyBDA:
[4]:
pybda clustering pybda-usecase-clustering.config local | head -n 10
Building DAG of jobs...
[2019-08-08 22:22:30,284 - WARNING - snakemake.logging]: Building DAG of jobs...
Using shell: /bin/bash
[2019-08-08 22:22:30,296 - WARNING - snakemake.logging]: Using shell: /bin/bash
Provided cores: 1
[2019-08-08 22:22:30,296 - WARNING - snakemake.logging]: Provided cores: 1
Rules claiming more threads will be scaled down.
[2019-08-08 22:22:30,296 - WARNING - snakemake.logging]: Rules claiming more threads will be scaled down.
Job counts:
count jobs
1 gmm
1 kmeans
2
[2019-08-08 22:22:30,297 - WARNING - snakemake.logging]: Job counts:
count jobs
1 gmm
1 kmeans
2
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]:
[Thu Aug 8 22:22:30 2019]
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]: [Thu Aug 8 22:22:30 2019]
rule gmm:
input: iris.tsv
output: results/2019_08_08/gmm_from_iris, results/2019_08_08/gmm_from_iris-profile.png, results/2019_08_08/gmm_from_iris-profile.pdf, results/2019_08_08/gmm_from_iris-profile.eps, results/2019_08_08/gmm_from_iris-profile.svg, results/2019_08_08/gmm_from_iris-profile.tsv, results/2019_08_08/gmm_from_iris-transformed-K3-components, results/2019_08_08/gmm_from_iris-transformed-K5-components, results/2019_08_08/gmm_from_iris-transformed-K10-components
jobid: 0
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]: rule gmm:
input: iris.tsv
output: results/2019_08_08/gmm_from_iris, results/2019_08_08/gmm_from_iris-profile.png, results/2019_08_08/gmm_from_iris-profile.pdf, results/2019_08_08/gmm_from_iris-profile.eps, results/2019_08_08/gmm_from_iris-profile.svg, results/2019_08_08/gmm_from_iris-profile.tsv, results/2019_08_08/gmm_from_iris-transformed-K3-components, results/2019_08_08/gmm_from_iris-transformed-K5-components, results/2019_08_08/gmm_from_iris-transformed-K10-components
jobid: 0
[2019-08-08 22:22:30,297 - INFO - snakemake.logging]:
Job counts:
count jobs
1 gmm
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/gmm.py 3,5,10 iris.tsv iris_feature_columns.tsv results/2019_08_08/gmm_from_iris > results/2019_08_08/gmm_from_iris-spark.log
Printing rule tree:
-> _ (, iris.tsv)
-> clustering (iris.tsv, results/2019_08_08/gmm_from_iris.tsv)
-> clustering (iris.tsv, results/2019_08_08/kmeans_from_iris.tsv)
[Thu Aug 8 22:22:59 2019]
[2019-08-08 22:22:59,844 - INFO - snakemake.logging]: [Thu Aug 8 22:22:59 2019]
Finished job 0.
[2019-08-08 22:22:59,844 - INFO - snakemake.logging]: Finished job 0.
1 of 2 steps (50%) done
[2019-08-08 22:22:59,845 - INFO - snakemake.logging]: 1 of 2 steps (50%) done
[2019-08-08 22:22:59,847 - INFO - snakemake.logging]:
[Thu Aug 8 22:22:59 2019]
[2019-08-08 22:22:59,848 - INFO - snakemake.logging]: [Thu Aug 8 22:22:59 2019]
rule kmeans:
input: iris.tsv
output: results/2019_08_08/kmeans_from_iris, results/2019_08_08/kmeans_from_iris-profile.png, results/2019_08_08/kmeans_from_iris-profile.pdf, results/2019_08_08/kmeans_from_iris-profile.eps, results/2019_08_08/kmeans_from_iris-profile.svg, results/2019_08_08/kmeans_from_iris-profile.tsv, results/2019_08_08/kmeans_from_iris-transformed-K3-clusters, results/2019_08_08/kmeans_from_iris-transformed-K5-clusters, results/2019_08_08/kmeans_from_iris-transformed-K10-clusters
jobid: 1
[2019-08-08 22:22:59,848 - INFO - snakemake.logging]: rule kmeans:
input: iris.tsv
output: results/2019_08_08/kmeans_from_iris, results/2019_08_08/kmeans_from_iris-profile.png, results/2019_08_08/kmeans_from_iris-profile.pdf, results/2019_08_08/kmeans_from_iris-profile.eps, results/2019_08_08/kmeans_from_iris-profile.svg, results/2019_08_08/kmeans_from_iris-profile.tsv, results/2019_08_08/kmeans_from_iris-transformed-K3-clusters, results/2019_08_08/kmeans_from_iris-transformed-K5-clusters, results/2019_08_08/kmeans_from_iris-transformed-K10-clusters
jobid: 1
[2019-08-08 22:22:59,850 - INFO - snakemake.logging]:
Job counts:
count jobs
1 kmeans
1
Submitting job spark-submit --master local --driver-memory=1G --executor-memory=1G /home/simon/PROJECTS/pybda/pybda/kmeans.py 3,5,10 iris.tsv iris_feature_columns.tsv results/2019_08_08/kmeans_from_iris > results/2019_08_08/kmeans_from_iris-spark.log
Printing rule tree:
-> _ (, iris.tsv)
-> clustering (iris.tsv, results/2019_08_08/gmm_from_iris.tsv)
[Thu Aug 8 22:23:31 2019]
[2019-08-08 22:23:31,590 - INFO - snakemake.logging]: [Thu Aug 8 22:23:31 2019]
Finished job 1.
[2019-08-08 22:23:31,590 - INFO - snakemake.logging]: Finished job 1.
2 of 2 steps (100%) done
[2019-08-08 22:23:31,590 - INFO - snakemake.logging]: 2 of 2 steps (100%) done
Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-08T222230.229745.snakemake.log
[2019-08-08 22:23:31,590 - WARNING - snakemake.logging]: Complete log: /home/simon/PROJECTS/pybda/data/.snakemake/log/2019-08-08T222230.229745.snakemake.log
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='UTF-8'>
BrokenPipeError: [Errno 32] Broken pipe
(pybda)
The call automatically executes the jobs defined in the config. Once both jobs have run, we should check the plots and statistics. Let’s see what we got:
[5]:
cd results
ls -lgG *
(pybda) total 3196
drwxrwxr-x 5 4096 Aug 8 22:22 gmm_from_iris
-rw-rw-r-- 1 41630 Aug 8 22:22 gmm_from_iris-cluster_sizes-histogram.eps
-rw-rw-r-- 1 11913 Aug 8 22:22 gmm_from_iris-cluster_sizes-histogram.pdf
-rw-rw-r-- 1 94191 Aug 8 22:22 gmm_from_iris-cluster_sizes-histogram.png
-rw-rw-r-- 1 54900 Aug 8 22:22 gmm_from_iris-cluster_sizes-histogram.svg
-rw-rw-r-- 1 6332 Aug 8 22:22 gmm_from_iris.log
-rw-rw-r-- 1 22803 Aug 8 22:22 gmm_from_iris-profile.eps
-rw-rw-r-- 1 13040 Aug 8 22:22 gmm_from_iris-profile.pdf
-rw-rw-r-- 1 218232 Aug 8 22:22 gmm_from_iris-profile.png
-rw-rw-r-- 1 31619 Aug 8 22:22 gmm_from_iris-profile.svg
-rw-rw-r-- 1 137 Aug 8 22:22 gmm_from_iris-profile.tsv
-rw-rw-r-- 1 1288436 Aug 8 22:22 gmm_from_iris-spark.log
drwxrwxr-x 2 4096 Aug 8 22:22 gmm_from_iris-transformed-K10-components
drwxrwxr-x 2 4096 Aug 8 22:22 gmm_from_iris-transformed-K3-components
drwxrwxr-x 2 4096 Aug 8 22:22 gmm_from_iris-transformed-K5-components
drwxrwxr-x 5 4096 Aug 8 22:23 kmeans_from_iris
-rw-rw-r-- 1 42077 Aug 8 22:23 kmeans_from_iris-cluster_sizes-histogram.eps
-rw-rw-r-- 1 12123 Aug 8 22:23 kmeans_from_iris-cluster_sizes-histogram.pdf
-rw-rw-r-- 1 95647 Aug 8 22:23 kmeans_from_iris-cluster_sizes-histogram.png
-rw-rw-r-- 1 55531 Aug 8 22:23 kmeans_from_iris-cluster_sizes-histogram.svg
-rw-rw-r-- 1 6699 Aug 8 22:23 kmeans_from_iris.log
-rw-rw-r-- 1 21507 Aug 8 22:23 kmeans_from_iris-profile.eps
-rw-rw-r-- 1 11786 Aug 8 22:23 kmeans_from_iris-profile.pdf
-rw-rw-r-- 1 211541 Aug 8 22:23 kmeans_from_iris-profile.png
-rw-rw-r-- 1 29644 Aug 8 22:23 kmeans_from_iris-profile.svg
-rw-rw-r-- 1 296 Aug 8 22:23 kmeans_from_iris-profile.tsv
-rw-rw-r-- 1 920657 Aug 8 22:23 kmeans_from_iris-spark.log
-rw-rw-r-- 1 33 Aug 8 22:23 kmeans_from_iris-total_sse.tsv
drwxrwxr-x 2 4096 Aug 8 22:23 kmeans_from_iris-transformed-K10-clusters
drwxrwxr-x 2 4096 Aug 8 22:23 kmeans_from_iris-transformed-K3-clusters
drwxrwxr-x 2 4096 Aug 8 22:23 kmeans_from_iris-transformed-K5-clusters
(pybda)
Finally, let’s check how many clusters/components each method recommends:
[6]:
cat */kmeans_from_iris-profile.tsv
k within_cluster_variance explained_variance total_variance BIC
3 78.85566447695781 0.8842690519576577 681.3705911067716 143.99392330020913
5 46.71230004910447 0.931443621637341 681.3705911067716 151.93564122512583
10 32.49992444218044 0.9523021321049536 681.3705911067716 237.93597150012693
(pybda)
[7]:
cat */gmm_from_iris-profile.tsv
k loglik BIC
3 -189.53852954643244 599.5450120331002
5 -154.60572954901806 679.998470861159
10 -50.516420858075136 847.6175005364923
(pybda)
In both cases the BIC is minimized at three clusters, just as expected for the iris
data. Nice!
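Since the BIC is the last column of both profile files, selecting the best \(k\) can also be scripted. The following one-liner is a generic sketch, not a PyBDA feature; it assumes a header line followed by one row per \(k\), as in the tables above:

```shell
# Pick the k whose BIC (last column) is smallest; skip the header line.
best_k=$(awk 'NR > 1 && (NR == 2 || $NF < min) { min = $NF; k = $1 } END { print k }' \
    */gmm_from_iris-profile.tsv)
echo "best k by BIC: ${best_k}"
```

The same command works on the \(k\)-means profile, since its BIC is also the last column.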
There are plenty of other files and plots to check out, though! For instance, we should always inspect the log
files we created to verify the parameters and what was actually computed:
[8]:
cat */gmm_from_iris.log
[2019-08-08 22:22:33,019 - INFO - pybda.spark_session]: Initializing pyspark session
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.master, value: local
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.memory, value: 1G
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.rdd.compress, value: True
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.serializer.objectStreamReset, value: 100
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.app.name, value: gmm.py
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.executor.id, value: driver
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.port, value: 37957
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.app.id, value: local-1565295753672
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.submit.deployMode, value: client
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Config: spark.driver.host, value: 192.168.1.33
[2019-08-08 22:22:34,089 - INFO - pybda.spark_session]: Openened spark context at: Thu Aug 8 22:22:34 2019
[2019-08-08 22:22:34,101 - INFO - pybda.io.io]: Reading tsv: iris.tsv
[2019-08-08 22:22:37,724 - INFO - pybda.spark.features]: Casting columns to double.
[2019-08-08 22:22:37,974 - INFO - pybda.spark.features]: Assembling column to feature vector
[2019-08-08 22:22:38,080 - INFO - pybda.spark.features]: Dropping redundant columns
[2019-08-08 22:22:38,816 - INFO - pybda.spark.dataframe]: Using data with n=150 and p=4
[2019-08-08 22:22:38,822 - INFO - __main__]: Clustering with K: 3
[2019-08-08 22:22:42,892 - INFO - pybda.decorators]: function: '_fit_one' took: 4.0702 sec
[2019-08-08 22:22:42,893 - INFO - pybda.fit.clustering_fit]: Writing cluster fit to: results/2019_08_08/gmm_from_iris/gmm-fit-K3
[2019-08-08 22:22:43,828 - INFO - pybda.fit.clustering_fit]: Writing cluster size file to: results/2019_08_08/gmm_from_iris/gmm-fit-K3_cluster_sizes.tsv
[2019-08-08 22:22:43,833 - INFO - pybda.fit.gmm_fit]: Writing cluster weights/means/variances
[2019-08-08 22:22:43,982 - INFO - pybda.fit.gmm_fit]: Writing LogLik and BIC to: results/2019_08_08/gmm_from_iris/gmm-fit-K3_statistics.tsv
[2019-08-08 22:22:43,982 - INFO - __main__]: Clustering with K: 5
[2019-08-08 22:22:46,301 - INFO - pybda.decorators]: function: '_fit_one' took: 2.3183 sec
[2019-08-08 22:22:46,301 - INFO - pybda.fit.clustering_fit]: Writing cluster fit to: results/2019_08_08/gmm_from_iris/gmm-fit-K5
[2019-08-08 22:22:46,485 - INFO - pybda.fit.clustering_fit]: Writing cluster size file to: results/2019_08_08/gmm_from_iris/gmm-fit-K5_cluster_sizes.tsv
[2019-08-08 22:22:46,488 - INFO - pybda.fit.gmm_fit]: Writing cluster weights/means/variances
[2019-08-08 22:22:46,590 - INFO - pybda.fit.gmm_fit]: Writing LogLik and BIC to: results/2019_08_08/gmm_from_iris/gmm-fit-K5_statistics.tsv
[2019-08-08 22:22:46,590 - INFO - __main__]: Clustering with K: 10
[2019-08-08 22:22:48,165 - INFO - pybda.decorators]: function: '_fit_one' took: 1.5755 sec
[2019-08-08 22:22:48,166 - INFO - pybda.fit.clustering_fit]: Writing cluster fit to: results/2019_08_08/gmm_from_iris/gmm-fit-K10
[2019-08-08 22:22:48,347 - INFO - pybda.fit.clustering_fit]: Writing cluster size file to: results/2019_08_08/gmm_from_iris/gmm-fit-K10_cluster_sizes.tsv
[2019-08-08 22:22:48,352 - INFO - pybda.fit.gmm_fit]: Writing cluster weights/means/variances
[2019-08-08 22:22:48,452 - INFO - pybda.fit.gmm_fit]: Writing LogLik and BIC to: results/2019_08_08/gmm_from_iris/gmm-fit-K10_statistics.tsv
[2019-08-08 22:22:48,452 - INFO - pybda.fit.clustering_fit_profile]: Writing fit profile to results/2019_08_08/gmm_from_iris-profile.tsv
[2019-08-08 22:22:48,454 - INFO - pybda.fit.clustering_fit_profile]: 3
[2019-08-08 22:22:48,462 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.png
[2019-08-08 22:22:51,947 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.png
[2019-08-08 22:22:53,737 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.pdf
[2019-08-08 22:22:54,015 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.pdf
[2019-08-08 22:22:54,398 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.svg
[2019-08-08 22:22:54,654 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.svg
[2019-08-08 22:22:55,080 - INFO - pybda.fit.gmm_fit_profile]: Plotting profile to: results/2019_08_08/gmm_from_iris-profile.eps
[2019-08-08 22:22:55,348 - INFO - pybda.fit.clustering_fit_profile]: Plotting cluster sizes to: results/2019_08_08/gmm_from_iris-cluster_sizes-histogram.eps
[2019-08-08 22:22:55,702 - INFO - pybda.decorators]: function: 'fit' took: 17.6145 sec
[2019-08-08 22:22:55,714 - INFO - pybda.fit.gmm_transformed]: Writing components to: results/2019_08_08/gmm_from_iris-transformed-K3-components
[2019-08-08 22:22:55,715 - INFO - pybda.spark.features]: Splitting vector columns: features
[2019-08-08 22:22:55,863 - INFO - pybda.spark.features]: Splitting vector columns: responsibilities
[2019-08-08 22:22:57,157 - INFO - pybda.fit.gmm_transformed]: Writing components to: results/2019_08_08/gmm_from_iris-transformed-K5-components
[2019-08-08 22:22:57,161 - INFO - pybda.spark.features]: Splitting vector columns: features
[2019-08-08 22:22:57,305 - INFO - pybda.spark.features]: Splitting vector columns: responsibilities
[2019-08-08 22:22:57,833 - INFO - pybda.fit.gmm_transformed]: Writing components to: results/2019_08_08/gmm_from_iris-transformed-K10-components
[2019-08-08 22:22:57,834 - INFO - pybda.spark.features]: Splitting vector columns: features
[2019-08-08 22:22:57,951 - INFO - pybda.spark.features]: Splitting vector columns: responsibilities
[2019-08-08 22:22:58,572 - INFO - pybda.spark_session]: Stopping Spark context
[2019-08-08 22:22:58,572 - INFO - pybda.spark_session]: Closed spark context at: Thu Aug 8 22:22:58 2019
[2019-08-08 22:22:58,572 - INFO - pybda.spark_session]: Computation took: 24
(pybda)
Furthermore, the Spark log
file is important to look at when a method fails:
[10]:
head */gmm_from_iris-spark.log
2019-08-08 22:22:31 WARN Utils:66 - Your hostname, hoto resolves to a loopback address: 127.0.1.1; using 192.168.1.33 instead (on interface wlp2s0)
2019-08-08 22:22:31 WARN Utils:66 - Set SPARK_LOCAL_IP if you need to bind to another address
2019-08-08 22:22:31 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2019-08-08 22:22:33 INFO SparkContext:54 - Running Spark version 2.4.0
2019-08-08 22:22:33 INFO SparkContext:54 - Submitted application: gmm.py
2019-08-08 22:22:33 INFO SecurityManager:54 - Changing view acls to: simon
2019-08-08 22:22:33 INFO SecurityManager:54 - Changing modify acls to: simon
2019-08-08 22:22:33 INFO SecurityManager:54 - Changing view acls groups to:
2019-08-08 22:22:33 INFO SecurityManager:54 - Changing modify acls groups to:
2019-08-08 22:22:33 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(simon); groups with view permissions: Set(); users with modify permissions: Set(simon); groups with modify permissions: Set()
(pybda)
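When a run does fail, scanning the Spark log for errors and stack traces usually narrows things down quickly. This is a generic grep, not a PyBDA feature:

```shell
# Print error/exception lines from the Spark log, or a note if there are none.
grep -iE 'error|exception' */gmm_from_iris-spark.log || echo "no errors found"
```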