jvicentem/big_mrmr

Python 3.7 | Open In Colab

Maximum Relevance Minimum Redundancy for big datasets.

This Python library uses Spark and Cython to speed up the calculations on big datasets.

This post on the Uber Engineering blog (https://eng.uber.com/optimal-feature-discovery-ml/) inspired me to develop this library.

To learn more about MRMR, read that post or check the useful links at the bottom of this README.

Below you will find the documentation and some code examples.

Documentation

The main class you will need is imported as follows:

from mrmr import MRMR

More specifically, you will need to instantiate an object of the MRMR class and then invoke its mrmr() function (i.e. MRMR(...).mrmr()).

MRMR class

You can configure the MRMR calculation via the instantiation parameters of the MRMR class. Here are the parameters:

df : pyspark.sql.dataframe.DataFrame : The input Apache Spark DataFrame with the data.

target : string : Name of your target column. This column must be included in the input dataframe.

k (Default value 10) : int : Number of features extracted. More specifically, this sets the number of features with the highest MRMR. The default value is 10, but if you have fewer variables, set a lower value here.

subset (Default value []) : List[string] : List of column names to use. These columns must be included in the input dataframe. The default value is an empty list, which means all columns are used.

cont_vars (Default value []) : List[string] : List of columns that are continuous variables. The columns in this list will have their values replaced by their decile, converting them into discrete variables. These columns must be included in the input dataframe. Consider this a simple discretization; you may want to do your own discretization based on the meaning of your data.

replace_na (Default value False) : bool : True if you want NA/Null/None values to be replaced. Numerical variables will get the value -1 when Null, and string variables the value "Null".

optimal_k (Default value False) : bool : True if you want to find the subset of k columns that, on average, yields the highest MRMR. If False, the MRMR of each column is calculated using all the columns in the dataframe (or the ones specified in the subset parameter).

What is the difference? If you look at the MRMR formula (sketched after this parameter list), you will see that the mutual information of a variable is calculated taking into account all the other variables. So, when optimal_k is True, only the variables in a sample of size k are taken into account; when False, all variables are.

This means that when optimal_k is True, the mrmr() method will return a DataFrame with samples of variables of size k and their average MRMR. When optimal_k is False, it returns a DataFrame with the name of each variable and its MRMR (only the top k variables are returned).

top_best_solutions (Default value 5) : int : Number of best variable subsets returned when optimal_k is True. When optimal_k is False, this parameter is ignored.

must_included_vars (Default value []) : List[string] : List of names of columns that must be included in the solutions. These columns must be included in the input dataframe and in the subset parameter, if used.

max_mins (Default value None) : float : If not None, the calculation will end after that number of minutes. The countdown starts at the beginning of the heaviest part of the code: the calculation of the optimal columns according to MRMR. It is advisable to set this parameter, because depending on the number of columns in your DataFrame and on the value of k, exploring all the possible solutions may take a very long time.

cache_or_checkp (Default value None) : string : If not None, the Spark DataFrame will be cached when this value equals cache, and checkpointed when it equals checkpoint. Take into account that if you want to checkpoint the Spark DataFrame, you will need to set a checkpoint directory (see the examples, or the sketch under Performance considerations below).

seed (Default value 16121993) : int : Random seed value to get reproducible results. This is used only when optimal_k equals True; otherwise it is ignored.
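
For reference, one common formulation of the MRMR score of a candidate variable x_j, given the target c and the set S of other variables under consideration, is the relevance-minus-redundancy criterion below. This is the classic criterion from the feature-selection literature; the exact variant this library computes (for instance, one based on the adjusted mutual information linked at the bottom of this README) may differ:

\mathrm{MRMR}(x_j) \;=\; \underbrace{I(x_j;\, c)}_{\text{relevance}} \;-\; \underbrace{\frac{1}{|S|} \sum_{x_i \in S} I(x_j;\, x_i)}_{\text{redundancy}}

Putting the parameters above together, here is a minimal end-to-end sketch. The parameter names are the ones documented in this README, but the keyword-argument style, the toy data and the Spark setup are illustrative assumptions, not the library's exact interface:

from pyspark.sql import SparkSession

from mrmr import MRMR

spark = SparkSession.builder.appName("big_mrmr_example").getOrCreate()

# Illustrative toy data: two continuous features, one categorical, one target.
df = spark.createDataFrame(
    [(1.2, 3.4, "a", 0), (2.3, 1.1, "b", 1), (0.5, 2.2, "a", 0), (3.1, 0.7, "b", 1)],
    ["cont_1", "cont_2", "cat_1", "label"],
)

result = MRMR(
    df=df,                           # assumed keyword form; see the docs above
    target="label",
    k=2,                             # fewer features than the default 10
    cont_vars=["cont_1", "cont_2"],  # replaced by their deciles internally
    replace_na=True,                 # -1 / "Null" imputation as described above
    optimal_k=False,                 # score each column against all the others
    max_mins=5.0,                    # stop the heavy search after 5 minutes
).mrmr()                             # returns a Pandas DataFrame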

mrmr function

mrmr : Returns a Pandas DataFrame. When optimal_k is True, the mrmr() method returns a DataFrame with samples of variables of size k and their average MRMR. When optimal_k is False, it returns a DataFrame with the name of each variable and its MRMR (only the top k variables are returned).
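
Since the return value is a plain Pandas DataFrame in both modes, ordinary Pandas inspection applies. The exact column names depend on the mode and version, so check them on your own run rather than relying on this sketch:

result = MRMR(df=df, target="label", k=2, optimal_k=False).mrmr()

# With optimal_k=False: one row per selected variable and its MRMR score.
# With optimal_k=True: one row per sampled subset of size k and its average MRMR.
print(result.columns)  # inspect the actual column names
print(result.head())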

Performance considerations

It is highly advisable to checkpoint or store the input DataFrame before using it in the mrmr function. This way, Spark will forget about the previous DAG and calculations will be faster. In the examples, you will see that the input DataFrame is stored in Parquet format before being used, as sketched below.
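
For example, a Parquet round trip (or setting a checkpoint directory so that cache_or_checkp="checkpoint" can be used) truncates the lineage before the expensive part. The paths below are placeholders:

# Write the prepared DataFrame to Parquet and read it back, so Spark starts
# the MRMR search from a short, fresh lineage instead of the whole prior DAG.
df.write.mode("overwrite").parquet("/tmp/mrmr_input.parquet")
df = spark.read.parquet("/tmp/mrmr_input.parquet")

# Alternatively, set a checkpoint directory; cache_or_checkp="checkpoint" needs it.
spark.sparkContext.setCheckpointDir("/tmp/spark_checkpoints")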

When optimal_k is False, the process is much lighter than when it is True. As a reference, with optimal_k set to False I was able to run the algorithm in 36 minutes on a ~10 GB dataset with a 2015 laptop (2 cores, 4 threads with hyperthreading, 16 GB RAM, HDD).

When optimal_k is True, the runtime depends on the size of the dataframe as well as on the number of combinations to try. If you run into problems, consider randomly sampling your DataFrame in order to work with a smaller one.

If you still struggle with performance, consider creating a Spark session that runs on a cluster, as in the sketch below.
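
As a rough sketch of both mitigations (the sampling fraction, seed and master URL are placeholders, not recommendations):

from pyspark.sql import SparkSession

# Down-sample to roughly 10% of rows before an optimal_k=True search.
small_df = df.sample(fraction=0.1, seed=16121993)

# Or build the Spark session against a cluster instead of local mode.
spark = (
    SparkSession.builder
    .master("spark://your-cluster-host:7077")  # placeholder cluster URL
    .appName("big_mrmr_cluster")
    .getOrCreate()
)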

Examples

  • Optimal C example: Open In Colab

  • Optimal K example: Open In Colab

Useful links

https://en.wikipedia.org/wiki/Adjusted_mutual_information

Vinh, Epps, and Bailey (2010). Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance. JMLR.

https://jmlr.csail.mit.edu/papers/volume11/vinh10a/vinh10a.pdf
