kaydotdev/stochastic-quantization


Robust and Scalable Stochastic Quasi-Gradient Clustering

Stochastic Quasi-Gradient Clustering

This repository presents an implementation of Stochastic Quasi-Gradient Clustering (alternatively termed Stochastic Quantization within the machine learning domain), a robust and scalable alternative to existing K-means solvers. The algorithm is specifically designed to handle large-scale datasets while optimizing memory utilization during computation. This implementation investigates the application of the algorithm to high-dimensional unsupervised and semi-supervised learning tasks. The repository contains both a Python package for reproducing experimental results and a LaTeX manuscript documenting the theoretical and experimental outcomes of the algorithm. The Python package continues to evolve independently of the research documentation; therefore, to reproduce specific results presented in the paper, researchers should refer to the commit hash mentioned in the description.
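For intuition about the general idea, the following dependency-free sketch shows what a stochastic gradient step on the K-means-type objective (squared distance from a random sample to its nearest center) can look like. It is not the implementation shipped in this repository: the function names, the initialization, and the `1 / (step + 1)` step-size schedule are assumptions chosen for illustration, and the constant factor 2 of the gradient is absorbed into the step size.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def stochastic_quantization_sketch(samples, n_clusters, n_steps=2000, seed=0):
    """Illustrative stochastic gradient clustering loop (hypothetical helper).

    Only one sample is touched per step, so the same loop could consume
    a data stream instead of an in-memory list.
    """
    rng = random.Random(seed)
    # Assumption: initialize the centers with randomly chosen samples.
    centers = [list(point) for point in rng.sample(samples, n_clusters)]
    for step in range(1, n_steps + 1):
        xi = rng.choice(samples)  # draw one random sample
        # Index of the center nearest to the drawn sample.
        k = min(range(n_clusters), key=lambda j: squared_distance(centers[j], xi))
        rho = 1.0 / (step + 1)  # assumed diminishing step size, at most 0.5
        # Move the nearest center toward the sample, i.e. against the
        # gradient (y_k - xi) of the squared distance (factor 2 absorbed in rho).
        centers[k] = [y - rho * (y - x) for y, x in zip(centers[k], xi)]
    return centers
```

Because each update is a convex combination of the current center and one sample, the centers always stay inside the bounding box of the data, and memory use is independent of the number of samples processed.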

Example

The Python implementation of the algorithm has a scikit-learn-friendly API, thus enabling its integration into the Pipeline sequence of built-in data transformers.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

import sqc

# Load the Iris dataset
X, _ = load_iris(return_X_y=True)

# Create an optimizer for SQG-clustering
optimizer = sqc.SGDOptimizer()

# Create and fit a pipeline with preprocessing and SQG-clustering
pipeline = Pipeline(
    [
        ("scaler", StandardScaler()),  # Scale features to have mean=0 and variance=1
        ("sqc", sqc.StochasticQuantization(optimizer, n_clusters=3)),
    ]
).fit(X)

# Get the cluster labels
labels = pipeline.predict(X)
```
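The labels returned by `pipeline.predict` are arbitrary cluster indices, so they cannot be compared to the Iris species codes element by element. One simple way to check them against known classes is cluster purity. The helper below is a minimal sketch and not part of the `sqc` package; the name `cluster_purity` is hypothetical.

```python
from collections import Counter


def cluster_purity(cluster_labels, true_labels):
    """Fraction of samples that belong to the majority class of their cluster."""
    if len(cluster_labels) != len(true_labels):
        raise ValueError("label sequences must have the same length")
    if len(cluster_labels) == 0:
        raise ValueError("label sequences must not be empty")
    # Count the true classes observed inside every cluster.
    per_cluster = {}
    for cluster, truth in zip(cluster_labels, true_labels):
        per_cluster.setdefault(cluster, Counter())[truth] += 1
    # Each cluster contributes the size of its most frequent true class.
    majority_total = sum(max(counts.values()) for counts in per_cluster.values())
    return majority_total / len(cluster_labels)
```

With the example above, this could be called as `cluster_purity(labels, y)`, where `y` is the second array returned by `load_iris(return_X_y=True)`, assuming `predict` returns one label per sample. A purity of 1.0 means every cluster contains a single species.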

Research Articles

Robust Clustering on High-Dimensional Data with Stochastic Quantization

by Anton Kozyriev¹, Vladimir Norkin¹,²

¹ Igor Sikorsky Kyiv Polytechnic Institute, National Technical University of Ukraine, Kyiv, 03056, Ukraine
² V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, 03178, Ukraine

Published in the International Scientific Technical Journal "Problems of Control and Informatics". This paper addresses the inherent limitations of traditional vector quantization (clustering) algorithms, particularly K-means and its variant K-means++, and investigates the stochastic quantization (SQ) algorithm as a scalable alternative methodology for high-dimensional unsupervised and semi-supervised learning problems.

Latest commit hash

ed22ae0b5507564d917b57d4cbdea952cc134d77

Citation

```bibtex
@article{Kozyriev_Norkin_2025,
  title        = {Robust clustering on high-dimensional data with stochastic quantization},
  author       = {Kozyriev, Anton and Norkin, Vladimir},
  year         = {2025},
  month        = {Feb.},
  journal      = {International Scientific Technical Journal "Problems of Control and Informatics"},
  volume       = {70},
  number       = {1},
  pages        = {32–48},
  doi          = {10.34229/1028-0979-2025-1-3},
  url          = {https://jais.net.ua/index.php/files/article/view/438},
  abstractnote = {This paper addresses the limitations of traditional vector quantization (clustering) algorithms, particularly K-means and its variant K-means++, and explores the stochastic quantization (SQ) algorithm as a scalable alternative for high-dimensional unsupervised and semi-supervised learning problems. Some traditional clustering algorithms suffer from inefficient memory utilization during computation, necessitating the loading of all data samples into memory, which becomes impractical for large-scale datasets. While variants such as mini-batch K-means partially mitigate this issue by reducing memory usage, they lack robust theoretical convergence guarantees due to the non-convex nature of clustering problems. In contrast, SQ-algorithm provides strong theoretical convergence guarantees, making it a robust alternative for clustering tasks. We demonstrate the computational efficiency and rapid convergence of the algorithm on an image classification problem with partially labeled data. To address the challenge of high dimensionality, we trained Triplet Network to encode images into low-dimensional representations in a latent space, which serve as a basis for comparing the efficiency of both SQ-algorithm and traditional quantization algorithm.},
}
```

Getting Started

Before working with the source code, it is important to note that the Python package in the repository is intended SOLELY FOR EXPERIMENTAL PURPOSES and is not production-ready. To proceed with this project, follow the instructions below to configure your environment, install the necessary dependencies, and execute the code to reproduce the results presented in the paper.

Dependencies

The installation process requires the Conda package manager for managing third-party dependencies and virtual environments. A step-by-step guide on installing the CLI tool is available on the official website. The third-party dependencies used are listed in the environment.yml file, with the corresponding licenses in the NOTICES file.

Installation

Clone the repository (alternatively, you can download the source code as a zip archive):

```shell
git clone https://github.com/kaydotdev/stochastic-quantization.git
cd stochastic-quantization
```

Then, create a Conda virtual environment and activate it:

```shell
conda env create -f environment.yml
conda activate stochastic-quantization
```
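After activating the environment, a quick sanity check is to confirm that the expected modules can be found by the interpreter. The sketch below uses only the standard library; the helper name `missing_modules` and the module list are assumptions, so adjust the list to the packages actually named in environment.yml.

```python
import importlib.util


def missing_modules(module_names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # Hypothetical list: replace it with the packages from environment.yml.
    required = ["numpy", "sklearn"]
    missing = missing_modules(required)
    print("Missing modules:", ", ".join(missing) if missing else "none")
```

`importlib.util.find_spec` only locates a module without importing it, so the check stays fast even for heavy dependencies.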

Reproducing the Results

Use the following command to install the core sq package with third-party dependencies, run the test suite, compile LaTeX files, and generate results:

```shell
make all
```

Produced figures and other artifacts (except compiled LaTeX files) will be stored in the results directory. Optionally, use the following command to perform the actions above without LaTeX file compilation:

```shell
make -C code all
```

To automatically remove all generated results and compiled LaTeX files produced by scripts, use the following command:

```shell
make clean
```

License

This repository contains both software (source code) and an academic manuscript. Different licensing terms apply to these components as follows:

Source Code: All source code contained in this repository, unless otherwise specified, is licensed under the MIT License. The full text of the MIT License can be found in the file LICENSE.code.md in the code directory.
Academic Manuscript: The academic manuscript, including all LaTeX source files and associated content (e.g., figures), is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0). The full text of the CC BY-NC-ND 4.0 License can be found in the file LICENSE.manuscript.md in the manuscript directory.
