DOI: 10.1109/DCC.2006.13 - Corpus ID: 12311412
Compression and machine learning: a new perspective on feature space vectors
@article{Sculley2006CompressionAM,
  title   = {Compression and machine learning: a new perspective on feature space vectors},
  author  = {D. Sculley and Carla E. Brodley},
  journal = {Data Compression Conference (DCC'06)},
  year    = {2006},
  pages   = {332-341},
  url     = {https://api.semanticscholar.org/CorpusID:12311412}
}
- Published in Data Compression Conference, 28 March 2006
- Computer Science
Compression-based methods are not a "parameter free" magic bullet for feature selection and data representation, but are instead concrete similarity measures within defined feature spaces, and are therefore akin to explicit feature vector models used in standard machine learning algorithms.
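The paper's central claim, that compression-based similarity is a concrete similarity measure rather than a parameter-free magic bullet, is usually operationalized in this literature via the normalized compression distance (NCD). A minimal sketch, assuming Python's zlib as the stand-in compressor; the helper names are illustrative, not from the paper:

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed length in bytes; zlib stands in for the compressor C."""
    return len(zlib.compress(data, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance:
    (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = csize(x), csize(y)
    return (csize(x + y) - min(cx, cy)) / max(cx, cy)

# Texts that share structure compress well together, so their NCD is lower.
a = b"compression models can measure similarity between strings " * 20
b = b"compression models can measure similarity between strings! " * 20
c = b"the quick brown fox jumps over the lazy dog at midnight " * 20
assert ncd(a, b) < ncd(a, c)
```

zlib is just one choice of compressor; the works cited below use gzip, bzip2, or PPM variants, and the paper's point is that each choice implicitly fixes a feature space.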
Topics
Compression Algorithm, Similarity Measure, Machine Learning, Parameter-free, Kolmogorov Complexity, Data Representation, Clusters, Classification, Feature Selection, Compression
132 Citations
Text Mining Using Data Compression Models
- Andrej Bratko
- 2012
Computer Science
A compression-based method for instance selection that extracts a diverse, representative subset of documents from a larger collection, useful for initializing k-means clustering and as a pool-based active learning strategy for supervised training of text classifiers.
Compression-Based Data Mining
- Eamonn J. Keogh, L. Keogh, J. Handley
- 2009
Computer Science
Encyclopedia of Data Warehousing and Mining
Compression-based data mining is a universal approach to clustering, classification, dimensionality reduction, and anomaly detection. It is motivated by results in bioinformatics, learning, and…
Compressive Feature Learning
- Hristo S. Paskov, Robert West, John C. Mitchell, T. Hastie
- 2013
Computer Science
This paper addresses unsupervised feature learning for text data by using a dictionary-based compression scheme to extract a succinct feature set, finding a set of word k-grams that minimizes the cost of reconstructing the text losslessly.
An Efficient Algorithm for Large Scale Compressive Feature Learning
- Hristo S. Paskov, John C. Mitchell, T. Hastie
- 2014
Computer Science
The recently proposed Compressive Feature Learning (CFL) framework is expanded; it is shown that CFL is NP-complete, and a novel, efficient approximation algorithm is provided, based on a homotopy that transforms a convex relaxation of CFL into the original problem.
Text Classification Using Compression-Based Dissimilarity Measures
- D. Coutinho, Mário A. T. Figueiredo
- 2015
Computer Science
Experimental evaluation of the proposed efficient methods for text classification, based on information-theoretic dissimilarity measures, reveals that they approximate, and sometimes even outperform, previous state-of-the-art techniques, despite being much simpler in the sense that they require no text pre-processing or feature engineering.
Text Classification with Compression Algorithms
- A. Zippo
- 2012
Computer Science, Mathematics
A kernel function is defined that estimates the similarity between two objects from their compressed lengths; this is important because compression algorithms can detect arbitrarily long dependencies within text strings.
PyLZJD: An Easy to Use Tool for Machine Learning
- Edward Raff, Joe Aurelio, Charles K. Nicholas
- 2019
Computer Science
SciPy
PyLZJD is introduced, a library that implements LZJD in a manner meant to be easy to use and apply for novice practitioners, with examples of how to use it on problems of disparate data types.
Construction of Efficient V-Gram Dictionary for Sequential Data Analysis
- Igor Kuralenok, Natalia Starikova, Aleksandr Khvorov, J. Serdyuk
- 2018
Computer Science
A new method for constructing an optimal feature set from sequential data, which builds a dictionary of variable-length n-grams based on the minimum description length principle and shows competitive results on standard text classification collections without using the text structure.
Authorship Verification based on Compression-Models
- Oren Halvani, Christian Winter, L. Graner
- 2017
Computer Science
This work proposes an intrinsic AV method that yields competitive results compared to a number of current state-of-the-art approaches based on support vector machines or neural networks, and can handle complicated AV cases where the questioned and the reference document are unrelated in terms of topic or genre.
On the Usefulness of Compression Models for Authorship Verification
- Oren Halvani, Christian Winter, L. Graner
- 2017
Computer Science
This work proposes an intrinsic AV method that yields competitive results compared to a number of current state-of-the-art approaches based on support vector machines or neural networks, and can handle complicated AV cases where the questioned and the reference document are unrelated in terms of topic or genre.
31 References
Clustering by compression
- Rudi Cilibrasi, P. Vitányi
- 2005
Computer Science
Evidence of successful application in areas as diverse as genomics, virology, languages, literature, music, handwritten digits, astronomy, and combinations of objects from completely different domains, using statistical, dictionary, and block sorting compressors is reported.
Text categorization using compression models
- E. Frank, Chang Chui, I. Witten
- 2000
Computer Science
Proceedings DCC 2000. Data Compression Conference
Text categorization is the assignment of natural language texts to predefined categories based on their content; compression models provide an overall judgement on the document as a whole, rather than discarding information by pre-selecting features.
The similarity metric
- Ming Li, Xin Chen, Xin Li, Bin Ma, P. Vitányi
- 2004
Mathematics, Computer Science
A new "normalized information distance" is proposed, based on the noncomputable notion of Kolmogorov complexity, and it is demonstrated that it is a metric and called the similarity metric.
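For reference, the distance proposed in this paper is defined from Kolmogorov complexity, and its practical approximation (the normalized compression distance, used throughout the citing works above) replaces K with the length C reported by a real compressor:

```latex
\mathrm{NID}(x,y) = \frac{\max\{K(x \mid y^{*}),\, K(y \mid x^{*})\}}{\max\{K(x),\, K(y)\}},
\qquad
\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x), C(y)\}}{\max\{C(x), C(y)\}}.
```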
Introduction to Information Theory and Data Compression
- D. Hankerson, Peter F. Johnson, G. Harris
- 1998
Computer Science
This pioneering textbook serves two independent courses-in information theory and in data compression-and also proves valuable for independent study and as a reference.
Spam Filtering Using Compression Models
- Andrej Bratko
- 2005
Computer Science
This paper summarizes the experiments for the TREC 2005 spam track, in which the use of adaptive statistical data compression models is considered for the spam filtering task, and presents experimental results indicating that compression models perform well in comparison to established spam filters.
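The mechanism behind such filters can be illustrated with an off-the-shelf compressor: a message is assigned to the class whose training corpus it extends most cheaply. This is a toy sketch with zlib and invented corpora, not Bratko's adaptive PPM/DMC models:

```python
import zlib

def csize(data: bytes) -> int:
    """Compressed length in bytes."""
    return len(zlib.compress(data, 9))

def extra_bytes(corpus: bytes, msg: bytes) -> int:
    # Bytes needed to encode msg on top of the class corpus: a rough
    # proxy for the cross-entropy of msg under a model of that class.
    return csize(corpus + msg) - csize(corpus)

ham_corpus = b"meeting agenda quarterly report project schedule review " * 40
spam_corpus = b"win free money now click here claim your cash prize " * 40
msg = b"click here now to claim your free cash prize"

label = "spam" if extra_bytes(spam_corpus, msg) < extra_bytes(ham_corpus, msg) else "ham"
print(label)
```

A real filter would use an adaptive model (PPM, DMC) updated per character rather than recompressing the corpus, but the decision rule is the same minimum-description-length comparison.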
Text mining: a new frontier for lossless compression
- I. Witten, Zane Bray, M. Mahoui, W. Teahan
- 1999
Computer Science
Proceedings DCC'99 Data Compression Conference…
This paper aims to promote text compression as a key technology for text mining, allowing databases to be created from formatted tables such as stock-market information on Web pages.
Kernel Methods for Pattern Analysis
- J. Shawe-Taylor, N. Cristianini
- 2004
Computer Science, Mathematics
This book provides an easy introduction for students and researchers to the growing field of kernel-based pattern analysis, demonstrating with examples how to handcraft an algorithm or a kernel for a new specific application, and covering all the necessary conceptual and mathematical tools to do so.
A repetition based measure for verification of text collections and for text categorization
- D. Khmelev, W. Teahan
- 2003
Computer Science
The results show that the method outperforms SVM at multi-class categorization, and interestingly, that results correlate strongly with compression-based methods.
Towards parameter-free data mining
- Eamonn J. Keogh, Stefano Lonardi, C. Ratanamahatana
- 2004
Computer Science
This work shows that recent results in bioinformatics and computational theory hold great promise for a parameter-free data-mining paradigm, and that this approach is competitive with or superior to state-of-the-art approaches in anomaly/interestingness detection, classification, and clustering, with empirical tests on time series, DNA, text, and video datasets.
Data Compression Using Adaptive Coding and Partial String Matching
- J. Cleary, I. Witten
- 1984
Computer Science
This paper describes how the conflict can be resolved with partial string matching, and reports experimental results which show that mixed-case English text can be coded in as little as 2.2 bits/character with no prior knowledge of the source.

