Pyndl - Naive Discriminative Learning in Python
pyndl implements Naïve Discriminative Learning (NDL) in Python. NDL is an incremental learning algorithm grounded in the principles of discrimination learning and motivated by animal and human learning research. Lately, NDL has become a popular tool in language research for examining large corpora and vocabularies, for example corpora with 750,000 spoken word tokens and a vocabulary size of 52,402 word types. In contrast to previous implementations, pyndl allows for a broader range of analyses, including non-English languages, adds further learning rules and provides better maintainability while achieving the same fast processing speed. As of today, it supports multiple research groups in their work and has led to several scientific publications.
Quickstart
Installation
First, you need to install pyndl. The easiest way to do this is using pip:
pip install --user pyndl
Warning
If you are using any operating system other than Linux, this process can be more difficult. Check out Installation for more detailed installation instructions. However, currently we can only ensure the expected behaviour on Linux systems. Be aware that on other operating systems some functionality may not work.
Naive Discriminative Learning
Naive Discriminative Learning, henceforth NDL, is an incremental learning algorithm based on the learning rule of Rescorla and Wagner [1], which describes the learning of direct associations between cues and outcomes. The learning is thereby structured in events, where each event consists of a set of cues which give hints to outcomes. Outcomes can be seen as the result of an event, where each outcome can be either present or absent. NDL is naive in the sense that cue-outcome associations are estimated separately for each outcome.
The Rescorla-Wagner learning rule describes how the association strength \(V_{i}^{t}\) between a cue \(i\) and an outcome changes over time \(t\). Time is here described in the form of learning events. For each event the association strength is updated as

\[V_{i}^{t+1} = V_{i}^{t} + \Delta V_{i}^{t}\]

Thereby, the change in association strength \(\Delta V_{i}^{t}\) is defined as

\[\Delta V_{i}^{t} = \begin{cases} 0 & \text{if cue } i \text{ is absent} \\ \alpha_{i}\beta_{1} \left(\lambda - \sum_{j \in \text{present cues}} V_{j}^{t}\right) & \text{if cue } i \text{ and the outcome are present} \\ \alpha_{i}\beta_{2} \left(0 - \sum_{j \in \text{present cues}} V_{j}^{t}\right) & \text{if cue } i \text{ is present and the outcome is absent} \end{cases}\]
with
\(\alpha_{i}\) being the salience of the cue \(i\)
\(\beta_{1}\) being the salience of the situation in which the outcome occurs
\(\beta_{2}\) being the salience of the situation in which the outcome does not occur
\(\lambda\) being the maximum level of associative strength possible
Note
Usually, the parameters are set to \(\alpha_{i} = \alpha_{j} \; \forall i, j\), \(\beta_{1} = \beta_{2}\) and \(\lambda = 1\).
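To make the update rule concrete, here is a minimal sketch in plain Python that applies one Rescorla-Wagner update for a single outcome and a single learning event. It is illustrative only and independent of pyndl's actual (and much faster) implementation; the function name rw_update and the default parameter values are our own choices, following the note above.

import collections

def rw_update(weights, cues, outcome_present,
              alpha=0.1, beta1=0.1, beta2=0.1, lambda_=1.0):
    """Apply one Rescorla-Wagner update (for one outcome) in place."""
    # Total activation of the outcome given all cues present in the event.
    v_total = sum(weights[cue] for cue in cues)
    if outcome_present:
        delta = alpha * beta1 * (lambda_ - v_total)
    else:
        delta = alpha * beta2 * (0.0 - v_total)
    # Only the cues present in the event are updated; absent cues keep
    # their association strength (the 0 case of the equation above).
    for cue in cues:
        weights[cue] += delta
    return weights

weights = collections.defaultdict(float)
rw_update(weights, cues={'#h', 'ha', 'an', 'nd', 'ds', 's#'}, outcome_present=True)
print(weights['ha'])  # approximately 0.01 after the first event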
Usage
Analyzing data with pyndl involves three steps:
1. The data has to be preprocessed into the correct format.
2. One of the learning methods of pyndl is used to learn the desired associations.
3. The learned associations (commonly also called weights) can be stored or directly analyzed further.
In the following, a usage example of pyndl is provided, in which the first two of the three steps are described for learning the associations between bigrams and meanings. The first section of this example focuses on the correct preparation of the data with inbuilt methods. However, it is worth noting that the learning algorithm itself neither requires the data to be preprocessed by pyndl, nor is it limited by that. The pyndl.preprocess module should rather be seen as a collection of established and commonly used preprocessing methods within the context of NDL. Custom preprocessing can be used as long as the created event files follow the structure outlined in the next section. The second section describes how the associations can be learned using pyndl, while the last section describes how they can be exported and, for instance, loaded in R for further investigation.
Data Preparation
To analyse any data using pyndl, it needs to be in the long format as a utf-8 encoded, tab delimited, gzipped text file with a header in the first line and two columns (see the minimal sketch after this list):

the first column contains an underscore delimited list of all cues
the second column contains an underscore delimited list of all outcomes
each line therefore represents an event with a pair of a cue and an outcome (occurring one time)
the events (lines) are ordered chronologically
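Such an event file can be written with Python's standard library alone. The following minimal sketch produces a valid, gzipped, tab delimited event file with the required header; the file name toy_events.tab.gz and the events themselves are made up for illustration:

import gzip

# Each line: tab separated cue list and outcome list, both underscore delimited.
events = [
    ('#h_ha_an_nd_ds_s#', 'hand_plural'),
    ('#l_la_ad_d#', 'lad'),
]

# 'toy_events.tab.gz' is an arbitrary example file name.
with gzip.open('toy_events.tab.gz', 'wt', encoding='utf-8') as event_file:
    event_file.write('cues\toutcomes\n')  # header in the first line
    for cues, outcomes in events:
        event_file.write(f'{cues}\t{outcomes}\n')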
The algorithm itself is agnostic to the actual domain as long as the data is tokenized as Unicode character strings. While pyndl provides some basic preprocessing for grapheme tokenization (see for instance the following examples), the tokenization of ideograms, pictograms, logograms, and speech has to be implemented manually. However, generic implementations are welcome as a contribution.
Creating Grapheme Clusters From Wide Format Data
Often the data which should be analysed is not in the right format to be processed with pyndl. To illustrate how to get the data into the right format we use data from Baayen, Milin, Đurđević, Hendrix & Marelli [2] as an example:
Table 1

| Word | Frequency | Lexical Meaning | Number |
|---|---|---|---|
| hand | 10 | HAND | |
| hands | 20 | HAND | PLURAL |
| land | 8 | LAND | |
| lands | 3 | LAND | PLURAL |
| and | 35 | AND | |
| sad | 18 | SAD | |
| as | 35 | AS | |
| lad | 102 | LAD | |
| lads | 54 | LAD | PLURAL |
| lass | 134 | LASS | |
Table 1 shows some words, their frequencies of occurrence and their meanings as an artificial lexicon in the wide format. In the following, the letters (unigrams and bigrams) of the words constitute the cues, whereas the meanings represent the outcomes.
As the data in table 1 is artificial, we can generate such a file for this example by expanding table 1 randomly with regard to the frequency of occurrence of each event. The resulting event file lexample.tab.gz consists of 420 lines (419 events, the sum of all frequencies, plus one header line) and looks like the following; a sketch of such an expansion follows after the table (nevertheless, you are encouraged to take a closer look at this file using any text editor of your choice):
| Cues | Outcomes |
|---|---|
| #h_ha_an_nd_ds_s# | hand_plural |
| #l_la_ad_d# | lad |
| #l_la_as_ss_s# | lass |
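For illustration, the following sketch shows how such an expansion could be scripted by hand: it turns each word into boundary-marked bigram cues, repeats every event according to its frequency, shuffles the events into a random chronological order and writes the gzipped event file. This is only one possible way to do it and not pyndl's inbuilt preprocessing; the helper bigram_cues and the output path are our own choices.

import gzip
import random

# Wide-format lexicon from table 1: (word, frequency, outcomes).
lexicon = [
    ('hand', 10, 'hand'), ('hands', 20, 'hand_plural'),
    ('land', 8, 'land'), ('lands', 3, 'land_plural'),
    ('and', 35, 'and'), ('sad', 18, 'sad'), ('as', 35, 'as'),
    ('lad', 102, 'lad'), ('lads', 54, 'lad_plural'), ('lass', 134, 'lass'),
]

def bigram_cues(word):
    """Return underscore delimited bigrams of a word, with # marking the word boundary."""
    marked = f'#{word}#'
    return '_'.join(marked[i:i + 2] for i in range(len(marked) - 1))

# Repeat every event according to its frequency, then shuffle to obtain
# a random chronological order (419 events in total).
events = [(bigram_cues(word), outcomes)
          for word, frequency, outcomes in lexicon
          for _ in range(frequency)]
random.shuffle(events)

# The output path is an assumption; adjust it to your setup.
with gzip.open('lexample.tab.gz', 'wt', encoding='utf-8') as event_file:
    event_file.write('cues\toutcomes\n')
    for cues, outcomes in events:
        event_file.write(f'{cues}\t{outcomes}\n')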
Creating Grapheme Clusters From Corpus Data
Often the corpus which should be analysed is only a raw utf-8 encoded text file that contains huge amounts of text. From here on we will refer to such a file as a corpus file. In a corpus file several documents can be stored, with an ---end.of.document--- or ---END.OF.DOCUMENT--- string marking where an old document finishes and a new document starts.
The pyndl.preprocess module (besides other things) provides the functionality to directly generate an event file based on a raw corpus file and filter it:
>>> from pyndl import preprocess
>>> preprocess.create_event_file(corpus_file='docs/data/lcorpus.txt',
...                              event_file='docs/data/levent.tab.gz',
...                              allowed_symbols='a-zA-Z',
...                              context_structure='document',
...                              event_structure='consecutive_words',
...                              event_options=(1, ),
...                              cue_structure='bigrams_to_word')
Here we use the example corpus lcorpus.txt to produce an event file levent.tab.gz which (uncompressed) looks like this; a short snippet for reading the file back follows after the table:
| Cues | Outcomes |
|---|---|
| an_#h_ha_d#_nd | hand |
| ot_fo_oo_#f_t# | foot |
| ds_s#_an_#h_ha_nd | hands |
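To verify the result, the header and the first few events of the generated file can be read back with the standard library:

>>> import gzip
>>> with gzip.open('docs/data/levent.tab.gz', 'rt', encoding='utf-8') as event_file:
...     for _ in range(4):  # header plus the first three events
...         print(event_file.readline(), end='')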
Note
pyndl.corpus allows you to generate such a corpus file from a bunch of gunzipped xml subtitle files filled with words.
Learn the associations
The strength of the associations for the data can now easily be computed using the pyndl.ndl.ndl function from the pyndl.ndl module:
>>> from pyndl import ndl
>>> weights = ndl.ndl(events='docs/data/levent.tab.gz',
...                   alpha=0.1, betas=(0.1, 0.1),
...                   method="threading")
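The returned weights form an xarray.DataArray indexed by outcomes and cues (the same coordinate names appear in the R snippet further below), so individual association strengths can be looked up by label. Assuming the outcome 'hand' and the cue 'ha' from the examples above occur in the corpus:

>>> # association strength between the bigram cue 'ha' and the outcome
>>> # 'hand'; both labels are examples taken from the tables above
>>> weights.sel(outcomes='hand', cues='ha')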
Save and load a weight matrix
To save time in the future, we recommend saving the weights. For compatibility reasons we recommend saving the weight matrix in the netCDF format [3]:
>>> weights.to_netcdf('docs/data/weights.nc')
Now, the saved weights can later be reused or analysed in Python or R. In Python the weights can simply be loaded with the xarray module:
>>> import xarray
>>> with xarray.open_dataarray('docs/data/weights.nc') as weights_read:
...     weights_read
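If a plain tabular view is more convenient, the loaded DataArray can be converted into a pandas.DataFrame; assuming the dimension order is outcomes by cues, rows are outcomes and columns are cues:

>>> # convert to a pandas.DataFrame for tabular inspection; this assumes
>>> # the first dimension is outcomes and the second is cues
>>> with xarray.open_dataarray('docs/data/weights.nc') as weights_read:
...     weights_frame = weights_read.to_pandas()
>>> weights_frame.loc['hand', 'ha']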
In R you need the ncdf4 package to load a matrix saved in the netCDF format:
> # install.packages("ncdf4")  # uncomment to install
> library(ncdf4)
> weights_nc <- nc_open(filename = "docs/data/weights.nc")
> weights_read <- t(as.matrix(ncvar_get(nc = weights_nc, varid = "__xarray_dataarray_variable__")))
> rownames(weights_read) <- ncvar_get(nc = weights_nc, varid = "outcomes")
> colnames(weights_read) <- ncvar_get(nc = weights_nc, varid = "cues")
> nc_close(nc = weights_nc)
> rm(weights_nc)
Clean up
In order to keep everything clean, we might want to remove all the files we created in this tutorial:
>>> import os
>>> os.remove('docs/data/levent.tab.gz')
>>> os.remove('docs/data/weights.nc')
[1] Rescorla, R. A., & Wagner, A. R. (1972). A theory of Pavlovian conditioning: Variations in the effectiveness of reinforcement and non-reinforcement. Classical conditioning II: Current research and theory, 2, 64-99.
[2] Baayen, R. H., Milin, P., Đurđević, D. F., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438.
[3] Unidata (2012). NetCDF. doi:10.5065/D6H70CW6. Retrieved from http://doi.org/10.5065/D6RN35XM