netropy


Installation

You can install the released version of netropy from CRAN with:

install.packages("netropy")

The development version can be installed from GitHub with:

# install.packages("devtools")
devtools::install_github("termehs/netropy")

Statistical Entropy Analysis of Network Data

Multivariate entropy analysis is a general statistical method for analyzing and finding dependence structure in data consisting of repeated observations of variables with a common domain and with discrete finite range spaces. Only nominal scale is required for each variable, so only the size of the variable's range space is important but not its actual values. Variables on ordinal or numerical scales, even continuous numerical scales, can be used, but they should be aggregated so that their ranges match the number of available repeated observations. By investigating the frequencies of occurrences of joint variable outcomes, complicated dependence structures, partial independence and conditional independence as well as redundancies and functional dependence can be found.
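For reference, the entropies used throughout are Shannon entropies estimated from observed frequencies. A minimal sketch of this estimate, separate from the package's own functions and assuming base-2 logarithms (bits):

## Sketch (not part of the netropy API): plug-in Shannon entropy in
## bits of one discrete variable, estimated from relative frequencies
entropy_bits <- function(x) {
  p <- table(x) / length(x)   # relative frequencies of the outcomes
  -sum(p * log2(p))           # H(X) = -sum_x p(x) log2 p(x)
}
entropy_bits(c(0, 0, 1, 1, 1, 2))
#> [1] 1.459148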

This package introduces these entropy tools in the context of network data. Brief descriptions of the various functions implemented in the package are given in the following, but more details are provided in the package vignettes and the references listed.

library('netropy')

Loading Internal Data

The different entropy tools are explained and illustrated by exploring data from a network study of a corporate law firm, which has previously been analysed by several authors (link). The data set is included in the package as a list with objects representing adjacency matrices for each of the three networks advice (directed), friendship (directed) and co-work (undirected), together with a data frame comprising 8 attributes on each of the 71 lawyers.

To load the data, extract each object and assign the correct names to them:

data(lawdata)
adj.advice <- lawdata[[1]]
adj.friend <- lawdata[[2]]
adj.cowork <- lawdata[[3]]
df.att     <- lawdata[[4]]

Variable Domains and Data Editing

A requirement for the applicability of these entropy tools is the specification of discrete variables with finite range spaces on the same domain: either node attributes/vertex variables, edges/dyad variables or triad variables. These can be either observed or transformed, as shown in the following using the above example data set.

We have 8 vertex variables with 71 observations, two of which (years and age) are numerical and need categorization based on their cumulative distributions. This categorization is described in detail in the vignette "variable domains and data editing". Here we just show the new dataframe created (note that the variable senior is omitted as it only comprises unique values, and that we edit all variables to start from 0):

att.var <-
  data.frame(
    status    = df.att$status - 1,
    gender    = df.att$gender,
    office    = df.att$office - 1,
    years     = ifelse(df.att$years <= 3, 0, ifelse(df.att$years <= 13, 1, 2)),
    age       = ifelse(df.att$age <= 35, 0, ifelse(df.att$age <= 45, 1, 2)),
    practice  = df.att$practice,
    lawschool = df.att$lawschool - 1
  )
head(att.var)
#>   status gender office years age practice lawschool
#> 1      0      1      0     2   2        1         0
#> 2      0      1      0     2   2        0         0
#> 3      0      1      1     1   2        1         0
#> 4      0      1      0     2   2        0         2
#> 5      0      1      1     2   2        1         1
#> 6      0      1      1     2   2        1         0

These vertex variables can be transformed into dyad variables by using the function get_dyad_var(). Observed node attributes in the dataframe att.var are then transformed into pairs of individual attributes. For example, status with binary outcomes is transformed into dyads having 4 possible outcomes $(0,0), (0,1), (1,0), (1,1)$:

dyad.status    <- get_dyad_var(att.var$status,    type = 'att')
dyad.gender    <- get_dyad_var(att.var$gender,    type = 'att')
dyad.office    <- get_dyad_var(att.var$office,    type = 'att')
dyad.years     <- get_dyad_var(att.var$years,     type = 'att')
dyad.age       <- get_dyad_var(att.var$age,       type = 'att')
dyad.practice  <- get_dyad_var(att.var$practice,  type = 'att')
dyad.lawschool <- get_dyad_var(att.var$lawschool, type = 'att')

Similarly, dyad variables can be created based on observed ties. For the undirected edges, we use indicator variables read directly from the adjacency matrix for the dyad in question, while for the directed ones (advice and friendship) we have pairs of indicators representing sending and receiving ties with 4 possible outcomes:

dyad.cwk <- get_dyad_var(adj.cowork, type = 'tie')
dyad.adv <- get_dyad_var(adj.advice, type = 'tie')
dyad.frn <- get_dyad_var(adj.friend, type = 'tie')

All 10 dyad variables are merged into one data frame for subsequent entropy analysis:

dyad.var <-
  data.frame(cbind(
    status    = dyad.status$var,
    gender    = dyad.gender$var,
    office    = dyad.office$var,
    years     = dyad.years$var,
    age       = dyad.age$var,
    practice  = dyad.practice$var,
    lawschool = dyad.lawschool$var,
    cowork    = dyad.cwk$var,
    advice    = dyad.adv$var,
    friend    = dyad.frn$var
  ))
head(dyad.var)
#>   status gender office years age practice lawschool cowork advice friend
#> 1      3      3      0     8   8        1         0      0      3      2
#> 2      3      3      3     5   8        3         0      0      0      0
#> 3      3      3      3     5   8        2         0      0      1      0
#> 4      3      3      0     8   8        1         6      0      1      2
#> 5      3      3      0     8   8        0         6      0      1      1
#> 6      3      3      1     7   8        1         6      0      1      1
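Each row of dyad.var corresponds to one dyad, i.e. one unordered pair of lawyers, so with 71 nodes the dataframe should have $\binom{71}{2} = 2485$ rows. A quick sanity check, assuming dyads are stored unordered:

choose(71, 2)    # number of unordered pairs among 71 nodes
#> [1] 2485
nrow(dyad.var)   # should match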

A similar function get_triad_var() is implemented for transforming vertex variables and different relation types into triad variables. This is described in more detail in the vignette "variable domains and data editing".
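By analogy with get_dyad_var() above, the corresponding calls look roughly as follows (a sketch only; see the vignette for the exact interface and the structure of the output):

# a vertex attribute as a triad variable, analogous to the dyadic case
triad.status <- get_triad_var(att.var$status, type = 'att')
# observed ties as a triad variable
triad.cwk    <- get_triad_var(adj.cowork,     type = 'tie')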

Univariate, Bivariate and Trivariate Entropies

The function entropy_bivar() computes the bivariate entropies of all pairs of variables in the dataframe. The output is given as an upper triangular matrix with cells giving the bivariate entropies of row and column variables. The diagonal thus gives the univariate entropies for each variable in the dataframe:

H2 <- entropy_bivar(dyad.var)
H2
#>           status gender office years   age practice lawschool cowork advice friend
#> status     1.493  2.868  3.640 3.370 3.912    3.453     4.363  2.092  2.687  2.324
#> gender        NA  1.547  3.758 3.939 4.274    3.506     4.439  2.158  2.785  2.415
#> office        NA     NA  2.239 4.828 4.901    4.154     5.058  2.792  3.388  3.044
#> years         NA     NA     NA 2.671 4.857    4.582     5.422  3.268  3.868  3.483
#> age           NA     NA     NA    NA 2.801    4.743     5.347  3.411  4.028  3.637
#> practice      NA     NA     NA    NA    NA    1.962     4.880  2.530  3.127  2.831
#> lawschool     NA     NA     NA    NA    NA       NA     2.953  3.567  4.186  3.812
#> cowork        NA     NA     NA    NA    NA       NA        NA  0.615  1.687  1.456
#> advice        NA     NA     NA    NA    NA       NA        NA     NA  1.248  1.953
#> friend        NA     NA     NA    NA    NA       NA        NA     NA     NA  0.881
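Each cell is the entropy of the joint frequency table of the two variables. A minimal sketch of that estimate, mirroring the univariate sketch above (in bits; not the package's internal code):

## Sketch: plug-in entropy of the joint frequency table of two
## discrete variables, in bits
entropy2_bits <- function(x, y) {
  p <- as.vector(table(x, y)) / length(x)  # joint relative frequencies
  p <- p[p > 0]                            # drop empty cells
  -sum(p * log2(p))
}
entropy2_bits(dyad.var$status, dyad.var$gender)  # compare with H2["status", "gender"]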

Bivariate entropies can be used to detect redundant variables that should be omitted from the dataframe for further analysis. This occurs when the univariate entropy for a variable is equal to the bivariate entropies for pairs including that variable. As seen above, the dataframe dyad.var has no redundant variables. This can also be checked using the function redundancy(), which yields a binary matrix as output indicating which row and column variables hold the same information:

redundancy(dyad.var)
#> no redundant variables
#> NULL
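A minimal sketch of this criterion applied directly to H2, assuming redundancy() flags pairs whose bivariate entropy equals one of the two univariate entropies, as described above:

## Sketch: flag pairs where H(X,Y) equals H(X) or H(Y), i.e. where
## one variable is functionally determined by the other
H1 <- diag(H2)
for (i in 1:(ncol(H2) - 1)) {
  for (j in (i + 1):ncol(H2)) {
    if (isTRUE(all.equal(H2[i, j], H1[i])) ||
        isTRUE(all.equal(H2[i, j], H1[j])))
      cat(colnames(H2)[i], "and", colnames(H2)[j], "are redundant\n")
  }
}
## prints nothing here, consistent with 'no redundant variables'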

More examples of using the function redundancy() are given in the vignette "univariate, bivariate and trivariate entropies".

Trivariate entropies can be computed using the function entropy_trivar(), which returns a dataframe with the first three columns representing possible triples of variables V1, V2, and V3 from the dataframe in question, and their entropies H(V1,V2,V3) as the fourth column. We illustrate this on the dataframe dyad.var:

H3 <- entropy_trivar(dyad.var)
head(H3, 10)  # view the first 10 rows of the dataframe
#>        V1     V2        V3 H(V1,V2,V3)
#> 1  status gender    office       4.938
#> 2  status gender     years       4.609
#> 3  status gender       age       5.129
#> 4  status gender  practice       4.810
#> 5  status gender lawschool       5.664
#> 6  status gender    cowork       3.464
#> 7  status gender    advice       4.048
#> 8  status gender    friend       3.685
#> 9  status office     years       5.321
#> 10 status office       age       5.721

Joint Entropy and Association Graphs

Joint entropy is a non-negative measure of association between pairs of variables. It is equal to 0 if and only if the two variables are completely independent of each other.

The function joint_entropy() computes the joint entropies between all pairs of variables in a given dataframe and returns a list consisting of the upper triangular joint entropy matrix (univariate entropies in the diagonal) and a dataframe giving the frequency distribution of unique joint entropy values. A function argument specifies the precision, given as a number of decimals, for which the frequency distribution of unique entropy values is created (default is 3). Applying the function to the dataframe dyad.var with two decimals:

J <- joint_entropy(dyad.var, 2)
J$matrix
#>           status gender office years  age practice lawschool cowork advice friend
#> status      1.49   0.17   0.09  0.79 0.38     0.00      0.08   0.02   0.05   0.05
#> gender        NA   1.55   0.03  0.28 0.07     0.00      0.06   0.00   0.01   0.01
#> office        NA     NA   2.24  0.08 0.14     0.05      0.13   0.06   0.10   0.08
#> years         NA     NA     NA  2.67 0.61     0.05      0.20   0.02   0.05   0.07
#> age           NA     NA     NA    NA 2.80     0.02      0.41   0.01   0.02   0.05
#> practice      NA     NA     NA    NA   NA     1.96      0.04   0.05   0.08   0.01
#> lawschool     NA     NA     NA    NA   NA       NA      2.95   0.00   0.01   0.02
#> cowork        NA     NA     NA    NA   NA       NA        NA   0.62   0.18   0.04
#> advice        NA     NA     NA    NA   NA       NA        NA     NA   1.25   0.18
#> friend        NA     NA     NA    NA   NA       NA        NA     NA     NA   0.88
J$freq
#>       j #(J = j) #(J >= j)
#> 1  0.79        1         1
#> 2  0.61        1         2
#> 3  0.41        1         3
#> 4  0.38        1         4
#> 5  0.28        1         5
#> 6   0.2        1         6
#> 7  0.18        2         8
#> 8  0.17        1         9
#> 9  0.14        1        10
#> 10 0.13        1        11
#> 11  0.1        1        12
#> 12 0.09        1        13
#> 13 0.08        4        17
#> 14 0.07        2        19
#> 15 0.06        2        21
#> 16 0.05        7        28
#> 17 0.04        2        30
#> 18 0.03        1        31
#> 19 0.02        5        36
#> 20 0.01        5        41
#> 21    0        4        45

As seen, the strongest association is between the variables status and years, with a joint entropy value of 0.79. We have independence (joint entropy value of 0) between four pairs of variables: (status, practice), (practice, gender), (cowork, gender), and (cowork, lawschool).
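These values are consistent with the standard identity $J(X,Y) = H(X) + H(Y) - H(X,Y)$, so they can be recovered from the univariate and bivariate entropies computed earlier. A quick check for the strongest pair (a sketch relying on the rounded values printed above):

## Sketch: J(status, years) recovered from H2 as
## H(status) + H(years) - H(status, years)
H2["status", "status"] + H2["years", "years"] - H2["status", "years"]
#> [1] 0.794   # rounds to the 0.79 reported in J$matrix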

These results can be illustrated in an association graph using the function assoc_graph(), which returns a ggraph object in which nodes represent variables and links represent strength of association (thicker links indicate stronger dependence). To use the function we need to load the ggraph library and to choose a threshold on which the drawn graph is based. We set it to 0.15 so that we only visualize the strongest associations:

library(ggraph)
assoc_graph(dyad.var, 0.15)

Given this threshold, we see isolated and disconnected nodes representing independent variables. We note strong dependence between the three dyadic variables status, years and age, but also a somewhat strong dependence among the three variables lawschool, years and age, and the three variables status, years and gender. The association graph can also be interpreted as a tendency for the relations cowork and friend to be independent conditionally on the relation advice, that is, any dependence between the dyad variables cowork and friend is explained by advice.

A threshold that gives a graph with reasonably many small independent or conditionally independent subsets of variables can be considered to represent a multivariate model for further testing.

More details and examples of joint entropies and association graphs are given in the vignette "joint entropies and association graphs".

Prediction Power Based on Expected Conditional Entropies

The function prediction_power() computes prediction power when pairs of variables in a given dataframe are used to predict a third variable from the same dataframe. The variable to be predicted and the dataframe of which it is part are given as input arguments, and the output is an upper triangular matrix giving the expected conditional entropies of pairs of row and column variables (denoted $X$ and $Y$) of the matrix, i.e. $EH(Z|X,Y)$, where $Z$ is the variable to be predicted; lower values indicate stronger prediction power. The diagonal gives $EH(Z|X)$, that is, when only one variable is used as a predictor. Note that NA's fill the row and column representing the variable being predicted.

Assume we are interested in predicting the variable status (that is, whether a lawyer in the data set is an associate or a partner). This is done by running the following syntax:

prediction_power('status', dyad.var)
#>           status gender office years   age practice lawschool cowork advice friend
#> status        NA     NA     NA    NA    NA       NA        NA     NA     NA     NA
#> gender        NA  1.375  1.180 0.670 0.855    1.304     1.225  1.306  1.263  1.270
#> office        NA     NA  2.147 0.493 0.820    1.374     1.245  1.373  1.325  1.334
#> years         NA     NA     NA 2.265 0.573    0.682     0.554  0.691  0.667  0.684
#> age           NA     NA     NA    NA 1.877    1.089     0.958  1.087  1.052  1.058
#> practice      NA     NA     NA    NA    NA    2.446     1.388  1.459  1.410  1.427
#> lawschool     NA     NA     NA    NA    NA       NA     3.335  1.390  1.337  1.350
#> cowork        NA     NA     NA    NA    NA       NA        NA  2.419  1.400  1.411
#> advice        NA     NA     NA    NA    NA       NA        NA     NA  2.781  1.407
#> friend        NA     NA     NA    NA    NA       NA        NA     NA     NA  3.408
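The off-diagonal entries are consistent with the conditional entropy identity $EH(Z|X,Y) = H(X,Y,Z) - H(X,Y)$, so they can be cross-checked against the trivariate and bivariate entropies computed earlier. For example (a sketch relying on the rounded printed values):

## Sketch: EH(status | gender, office) recovered from H3 and H2;
## column 4 of H3 holds the trivariate entropy H(V1,V2,V3)
h3 <- H3[H3$V1 == "status" & H3$V2 == "gender" & H3$V3 == "office", 4]
h3 - H2["gender", "office"]
#> [1] 1.18   # matches the gender/office cell above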

For better readability, the powers of different predictors can be conveniently compared by using prediction plots that display a color matrix with rows for $X$ and columns for $Y$, with darker colors in the cells when we have higher prediction power for $Z$. This is shown for the prediction of status:

Obviously, the darkest color is obtained when the variable to be predicted is included among the predictors, and the cells exhibit prediction power for a single predictor on the diagonal and for two predictors symmetrically outside the diagonal. Some findings are as follows: good predictors for status are given by years in combination with any other variable, and age in combination with any other variable. The best sole predictor is gender.

More details and examples of expected conditional entropies and prediction power are given in the vignette "prediction power based on expected conditional entropies".

Divergence Tests of Goodness of Fit

Occurring cliques in association graphs represent connected components of dependent variables, and by comparing the graphs for different thresholds, specific structural models of multivariate dependence can be suggested and tested. The function div_gof() allows such hypothesis tests for pairwise independence of $X$ and $Y$, $X \bot Y$, and pairwise independence conditional on a third variable $Z$, $X \bot Y|Z$.

To test friend $\bot$ cowork $|$ advice, that is, whether the dyad variable friend is independent of cowork given advice, we use the function as shown below:

div_gof(dat = dyad.var, var1 = "friend", var2 = "cowork", var_cond = "advice")
#> the specified model of conditional independence cannot be rejected
#>      D df(D)
#> 1 0.94    12
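The reported degrees of freedom are consistent with the usual count for a test of conditional independence, $df = r_{advice}(r_{friend} - 1)(r_{cowork} - 1) = 4 \cdot 3 \cdot 1 = 12$, where $r$ denotes the number of outcomes of each dyad variable (4 for each directed relation, 2 for the undirected co-work indicator).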

Not specifying the argument var_cond would instead test friend $\bot$ cowork without any conditioning.

References

Parts of the theoretical background are provided in the package vignettes, but for more details, consult the following literature:

Frank, O., & Shafie, T. (2016). Multivariate entropy analysis of network data. Bulletin of Sociological Methodology/Bulletin de Méthodologie Sociologique, 129(1), 45-63.
