
ProcData: An R Package for Process Data Analysis

ProcData provides tools for exploratory process data analysis. It contains an example dataset and functions for

reading responses from a csv file
process manipulation
action sequence generators
feature extraction methods
fitting sequence models and making predictions from them

Installation

Download the package from the download page and execute the following command in R:

install.packages(FILENAME, repos = NULL, dependencies = TRUE)

where FILENAME should be replaced by the name of the downloaded package file, including its path. The development version can be installed from GitHub with:

devtools::install_github("xytangtang/ProcData")

ProcData depends on the packages Rcpp and keras. A C compiler and Python are needed. Some functions in ProcData call functions in keras to fit neural networks. To make sure these functions run properly, execute the following commands in R:

library(keras)
install_keras(tensorflow = "1.13.1")

Note that if this step is skipped, ProcData can still be installed and loaded, but calling the functions that depend on keras will give an error.

Data Structure

ProcData organizes response processes as an object of class proc, which is a list containing the action sequences and the timestamp sequences. Functions are provided to summarize and manipulate proc objects.
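As a rough illustration of this layout, the sketch below builds a proc-like list using base R only. It does not call the package; the element name time_seqs, the respondent labels, and the actions RUN and CHECK_B are assumptions made up for the example.

```r
# Hypothetical illustration: a proc-like list built with base R only.
# Element names and values are assumptions, not taken from the package.
action_seqs <- list(
  r1 = c("start", "CHECK_A", "end"),
  r2 = c("start", "RUN", "CHECK_B", "end")
)
time_seqs <- list(
  r1 = c(0, 3.2, 5.0),
  r2 = c(0, 1.1, 4.7, 6.3)
)
proc_like <- list(action_seqs = action_seqs, time_seqs = time_seqs)

# Each timestamp sequence should line up with its action sequence.
seq_lengths <- lengths(proc_like$action_seqs)
stopifnot(identical(seq_lengths, lengths(proc_like$time_seqs)))
seq_lengths
```

For real data, the package's own functions should be used to create and check proc objects.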

Dataset

ProcData includes a dataset cc_data of the action sequences and binary item responses of 16920 respondents to item CP025Q01 in PISA 2012. The item interface can be found here. To load the dataset, run

data(cc_data)

cc_data is a list of two elements:

seqs is a proc object.
responses is a numeric vector containing the binary response outcomes.

For data stored in csv files, read.seqs can be used to read response processes into R and to organize them into a proc object. In the input csv file, each process can be stored in a single line or in multiple lines. The sample files for the two styles are example_single.csv and example_multiple.csv. The processes in the two files can be read by running

seqs1 <- read.seqs(file = "example_single.csv", style = "single", id_var = "ID", action_var = "Action", time_var = "Time", seq_sep = ", ")
seqs2 <- read.seqs(file = "example_multiple.csv", style = "multiple", id_var = "ID", action_var = "Action", time_var = "Time")
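To make the difference between the two styles concrete, here is a minimal base R sketch that converts a single-line layout into a one-row-per-action layout. The file contents are hypothetical and the bundled example files may differ in detail; only the column names ID, Action, Time and the separator ", " come from the calls above.

```r
# Hypothetical file contents in the "single" layout: one line per process,
# with actions and timestamps packed into delimited cells.
single_txt <- c(
  'ID,Action,Time',
  '1,"start, A, end","0, 2, 5"',
  '2,"start, B, B, end","0, 1, 3, 4"'
)
single <- read.csv(text = single_txt, stringsAsFactors = FALSE)

# Split each delimited cell into one row per action (the "multiple" layout).
split_cell <- function(x) strsplit(x, ", ", fixed = TRUE)[[1]]
multiple <- do.call(rbind, lapply(seq_len(nrow(single)), function(i) {
  data.frame(
    ID = single$ID[i],
    Action = split_cell(single$Action[i]),
    Time = as.numeric(split_cell(single$Time[i])),
    stringsAsFactors = FALSE
  )
}))
multiple
```

This only illustrates the two layouts; read.seqs should be used to build an actual proc object from either one.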

write.seqs can be used to write proc objects to csv files.

Data Generators

ProcData also provides three action sequence generators:

seq_gen generates action sequences of an imaginary simulation-experiment-based item;
seq_gen2 generates action sequences according to a given probability transition matrix;
seq_gen3 generates action sequences from a recurrent neural network. It depends on keras.
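The idea behind generating sequences from a probability transition matrix can be sketched in base R as follows. This is not the implementation of seq_gen2; the action names, the probabilities, and the helper simulate_seq are made up for illustration.

```r
# Sketch only: simulate action sequences from a transition matrix.
# Action names and probabilities are hypothetical.
actions <- c("start", "A", "B", "end")
P <- matrix(
  c(0, 0.6, 0.4, 0,
    0, 0.2, 0.5, 0.3,
    0, 0.4, 0.2, 0.4,
    0, 0,   0,   1),
  nrow = 4, byrow = TRUE, dimnames = list(actions, actions)
)
stopifnot(all(abs(rowSums(P) - 1) < 1e-8))  # each row is a probability distribution

# Draw the next action from the row of the current action until "end"
# is reached or the sequence hits max_len.
simulate_seq <- function(P, start = "start", end = "end", max_len = 50) {
  states <- rownames(P)
  current <- start
  out <- current
  while (current != end && length(out) < max_len) {
    current <- sample(states, size = 1, prob = P[current, ])
    out <- c(out, current)
  }
  out
}

set.seed(1)
sim_seqs <- replicate(5, simulate_seq(P), simplify = FALSE)
```

Each element of sim_seqs is a character vector that begins with "start" and stops at "end" or at the length cap.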

Feature Extraction Methods

ProcData implements three feature extraction methods that compress varying-length response processes into fixed-dimension numeric vectors. The first method extracts n-gram features from response processes. The other two methods are based on multidimensional scaling (MDS) and sequence-to-sequence autoencoders (seq2seq AE). Details of the methods can be found here.

N-Gram

Function seq2feature_ngram extracts n-gram features from response processes.

seqs <- seq_gen(100)
theta <- seq2feature_ngram(seqs)
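To show what an n-gram feature matrix looks like, the base R sketch below counts unigrams and bigrams in two toy sequences. It uses raw counts only; seq2feature_ngram may weight or select n-grams differently, and the helper ngrams and the toy data are hypothetical.

```r
# Sketch only: raw unigram and bigram counts for a toy set of sequences.
toy_seqs <- list(
  c("start", "A", "B", "end"),
  c("start", "B", "B", "end")
)

# All contiguous runs of n actions in x, each pasted into one string.
ngrams <- function(x, n) {
  if (length(x) < n) return(character(0))
  starts <- seq_len(length(x) - n + 1)
  vapply(starts, function(i) paste(x[i:(i + n - 1)], collapse = " "), character(1))
}

# The n-grams that occur anywhere define the feature columns.
grams_by_seq <- lapply(toy_seqs, function(x) c(ngrams(x, 1), ngrams(x, 2)))
vocab <- sort(unique(unlist(grams_by_seq)))

# One row per sequence, one column per n-gram, entries are counts.
count_mat <- t(vapply(
  grams_by_seq,
  function(g) as.numeric(table(factor(g, levels = vocab))),
  numeric(length(vocab))
))
colnames(count_mat) <- vocab
count_mat
```

Sequences of different lengths end up as rows of the same width, which is the point of the feature extraction step.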

MDS

The following functions implement the MDS methods.

seq2feature_mds extracts K features from a given set of response processes or their dissimilarity matrix.
chooseK_mds selects the number of features to be extracted by cross-validation.

seqs <- seq_gen(100)
K_res <- chooseK_mds(seqs, K_cand = 5:10, return_dist = TRUE)
theta <- seq2feature_mds(K_res$dist_mat, K_res$K)$theta

seq2seq AE

Similar to MDS, the seq2seq AE method is implemented by two functions. Both functions depend on keras.

seq2feature_seq2seq extracts K features from a given set of response processes.
chooseK_seq2seq selects the number of features to be extracted by cross-validation.

seqs <- seq_gen(100)
K_res <- chooseK_seq2seq(seqs, K_cand = c(5, 10), valid_prop = 0.2)
seq2seq_res <- seq2feature_seq2seq(seqs, K_res$K, samples_train = 1:80, samples_valid = 81:100)
theta <- seq2seq_res$theta

Note that if the number of candidates for K is large and a large number of epochs is needed for training the seq2seq AE, chooseK_seq2seq can be slow. One can parallelize the selection procedure via multiple independent calls of seq2feature_seq2seq with properly specified training, validation, and test sets.

Sequence Models

A sequence model relates response processes and covariates with a response variable. The model combines a recurrent neural network and a fully connected neural network.

seqm fits a sequence model. It returns an object of class seqm.
predict.seqm predicts the response variable with a given fitted sequence model. Both seqm and predict.seqm depend on keras.

n <- 100
seqs <- seq_gen(n)
y1 <- sapply(seqs$action_seqs, function(x) "CHECK_A" %in% x)
y2 <- sapply(seqs$action_seqs, function(x) log10(length(x)))
# hold out 10 processes for testing and train on the rest
index_test <- sample(1:n, 10)
index_train <- setdiff(1:n, index_test)
seqs_train <- sub_seqs(seqs, index_train)
seqs_test <- sub_seqs(seqs, index_test)
# the action set is taken from the action sequences only
actions <- unique(unlist(seqs$action_seqs))
# a simple sequence model for a binary response variable
# (the response is subset to match the training processes)
seqm_res1 <- seqm(seqs = seqs_train, response = y1[index_train], response_type = "binary", actions = actions, K_emb = 5, K_rnn = 5, n_epoch = 5)
pred_res1 <- predict(seqm_res1, new_seqs = seqs_test)
# a simple sequence model for a numeric response variable
seqm_res2 <- seqm(seqs = seqs_train, response = y2[index_train], response_type = "scale", actions = actions, K_emb = 5, K_rnn = 5, n_epoch = 5)
pred_res2 <- predict(seqm_res2, new_seqs = seqs_test)
