From-Scratch EM Algorithm for GMM Matches scikit-learn on UMAP-Reduced Text Data #31216
I implemented the EM algorithm for multivariate Gaussian Mixture Models from scratch and benchmarked it against `sklearn.mixture.GaussianMixture`. On a UMAP-reduced version of a high-dimensional text dataset, the results aligned almost perfectly:

- Matching mixing weights, means, and covariances
- Adjusted Rand Index = 1.0000
- Component assignments match after greedy alignment via L2 distance

The implementation is object-oriented, numerically stable (with covariance regularization), and tracks parameter convergence across iterations. A direct comparison to scikit-learn is included.

Notebook:
Core class:

Note: the convergence only matches this closely after dimensionality reduction with UMAP. On raw high-dimensional data, convergence is more sensitive to initialization.

Happy to share this as a learning tool or a discussion starter around reproducibility and clustering convergence diagnostics.
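For readers who want to see the core mechanics being described, here is a minimal sketch of EM for a multivariate GMM with the covariance regularization mentioned above. This is an illustrative stand-alone function, not the object-oriented class from the linked notebook; the function name `em_gmm`, the deterministic initialization, and the `reg` parameter are my own choices for the sketch.

```python
import numpy as np

def em_gmm(X, k, n_iter=200, reg=1e-6):
    """Minimal EM for a multivariate GMM (illustrative sketch only).

    reg adds a small ridge to each covariance for numerical stability,
    matching the 'covariance regularization' idea in the post.
    """
    n, d = X.shape
    # Simple deterministic init (evenly spaced points); k-means init is more robust.
    means = X[np.linspace(0, n - 1, k).astype(int)].copy()
    covs = np.array([np.cov(X.T) + reg * np.eye(d) for _ in range(k)])
    weights = np.full(k, 1.0 / k)

    for _ in range(n_iter):
        # E-step: log responsibilities from Gaussian log-densities.
        log_r = np.empty((n, k))
        for j in range(k):
            diff = X - means[j]
            _, logdet = np.linalg.slogdet(covs[j])
            maha = np.einsum("ni,ij,nj->n", diff, np.linalg.inv(covs[j]), diff)
            log_r[:, j] = (np.log(weights[j])
                           - 0.5 * (d * np.log(2 * np.pi) + logdet + maha))
        # Normalize in log space (log-sum-exp trick) for stability.
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)

        # M-step: weighted MLE updates for weights, means, covariances.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - means[j]
            covs[j] = (r[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)

    return weights, means, covs, r.argmax(axis=1)
```

On well-separated data (as UMAP tends to produce), this converges to essentially the same parameters as `GaussianMixture`, up to a permutation of the components.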
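One detail worth noting: the Adjusted Rand Index is already invariant to label permutation, so the greedy L2 alignment is really needed when comparing parameters (weights, means, covariances) component-by-component. A hypothetical helper illustrating that alignment step might look like this; `greedy_align` is my own name, not from the posted code:

```python
import numpy as np

def greedy_align(means_a, means_b):
    """Greedily match each component in A to its nearest unmatched
    component in B by L2 distance between means.

    Returns a dict mapping component index in A -> component index in B.
    (Illustrative helper; the posted implementation may differ.)
    """
    k = len(means_a)
    unmatched = list(range(k))
    mapping = {}
    for i in range(k):
        dists = [np.linalg.norm(means_a[i] - means_b[j]) for j in unmatched]
        j = unmatched[int(np.argmin(dists))]
        mapping[i] = j
        unmatched.remove(j)
    return mapping
```

After aligning, you can permute one model's components and compare weights, means, and covariances entry-wise; greedy matching is not guaranteed optimal in general (the Hungarian algorithm via `scipy.optimize.linear_sum_assignment` is), but it works well when components are well separated.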