advanced naryn

Introduction

Naryn allows efficient access and analysis of medical records that are maintained in a custom database.

Naryn can work under R (as a package) or Python (as a module). The vast majority of the functions and concepts are shared between the two implementations, yet certain differences still exist and are summarized in a table in the Appendix. Code examples and function names in this document are presented for R, but they can equally run in Python with the interface changes listed in that table.

Database

DB dirs, Namespaces and Read-Only Tracks

Naryn allows accessing the data that resides in tracks, where each track holds a certain type of medical data such as patients’ diagnoses or their hemoglobin level at certain points in time. The track files can be aggregated from one or more directories. Before the tracks can be accessed, Naryn needs to establish a connection to these directories, also referred to as db dirs. Call the emr_db.connect function to establish access to the tracks in the db dirs. emr_db.connect requires at least one db dir. Optionally, it accepts additional db dirs, which can contain additional tracks. When two or more db dirs contain the same track name (a namespace collision), the track is taken from the db dir that was passed last in the order of connections. For example, if we have two db dirs /db1 and /db2 which both contain a track named track1, the call emr_db.connect(c('/db1', '/db2')) will result in Naryn using track1 from /db2. As you might expect, the overriding is consistent not only for the track’s data itself, but also for any other Naryn entity using or pointing to the track.
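The collision rule can be illustrated with a minimal sketch. It does not call Naryn: the db_dirs dictionary and the resolve_tracks helper are hypothetical stand-ins that only model "the db dir connected last wins".

```python
# Hypothetical stand-in for the track names found in each db dir.
# This is not Naryn's on-disk format, only a model of the lookup rule.
db_dirs = {
    "/db1": ["track1", "track2"],
    "/db2": ["track1", "track3"],
}


def resolve_tracks(connection_order, contents):
    """Map each track name to the db dir it would be read from."""
    owner = {}
    for db_dir in connection_order:
        for track in contents[db_dir]:
            # A db dir that comes later in the connection order
            # overrides an earlier one on a name collision.
            owner[track] = db_dir
    return owner


resolved = resolve_tracks(["/db1", "/db2"], db_dirs)
print(resolved)
```

Reversing the connection order to ["/db2", "/db1"] would make track1 resolve to /db1 instead.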

Even though all db dirs may contain track files, their designation is different. All the db dirs except for the last one in the order of connections are mainly read-only. The directory that was connected last is termed the user dir and is intended to store volatile data such as the results of intermediate calculations. By default, new tracks created with emr_track.import or emr_track.create are written to the db dir that was last in the order of connections. To write tracks to a db dir that is not last in the connection order, you must explicitly pass the path to that db dir, and this should be done only for a well-justified reason.

A track may be marked as read-only to prevent its accidental deletion or modification. Use emr_track.readonly to set or get the read-only property of a track. A newly created track is always writable. If you wish to mark it as “read-only”, please do so in a separate call.

Load-on-demand vs. Pre-load Modes

emr_db.connect supports two modes of work: ‘load on demand’ and ‘pre-load’. In ‘load on demand’ mode, tracks are loaded into memory only when they are accessed. Tracks stay in memory until the R session ends or the package is unloaded (Python: since modules cannot be forced to unload, db_unload is introduced).

In ‘pre-load’ mode, all the tracks are pre-loaded into memory, making subsequent track access significantly faster. As loaded tracks reside in shared memory, other R sessions running on the same machine may also enjoy a significant run-time boost. On the flip side, pre-loading all the tracks prolongs the execution of emr_db.connect and requires enough memory to accommodate all the data.

Choosing between the two modes depends on the specific needs. While load_on_demand=TRUE seems to be a solid default choice, in an environment with frequent short-lived R sessions, each accessing a track, one might opt for running a “daemon”: an additional permanent R session. The daemon would pre-load all the tracks in advance and stay alive, thus speeding up the sessions that start later.

Maintaining Database

Naryn caches certain data on the disk to maintain fast run-times. In particular, two files (.naryn and .ids) are created in any database, and another file called .logical_tracks is created in global databases.

The .naryn file contains a list of all tracks in the current root directory and their last modification dates. This file spares a full rescan of the root directory when emr_db.connect is called. The recorded modification dates make it possible to efficiently synchronize track changes made by concurrently running R sessions.

.logical_tracks implements the same mechanism for logical tracks, which store their properties (source and values) under a folder called logical.

The .ids file contains the available ids that are used to run certain types of track expression iterators (see below). These ids come from the patients.dob (i.e. Date Of Birth) track, which must be present in the global root directory before these iterators may be used.

Various functions such as emr_track.import modify these files according to the changes that the DB undergoes (addition / removal / modification of tracks). Thus manual (outside of Naryn) modification, replacement, addition or deletion of track files causes the cache files to go out of sync. Various problems might arise as a consequence, such as run-time errors, outdated data from modified tracks and sub-optimal run-time performance.

Manual modifications of the database files can still be performed, yet they must be ratified by running emr_db.reload.

File and directory permissions

Naryn creates files and directories with a umask of 007 (except for read-only tracks), which means that files and directories have permissions of 660 (rw-rw----) and 770 (rwxrwx---) respectively. Consequently, to access a database created by someone outside your group, the file and folder permissions need to be changed first.
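The arithmetic behind these permission values can be checked with a short sketch. It assumes the usual POSIX convention that files are requested with mode 666 and directories with mode 777 before the umask is applied; effective_mode is an illustrative helper, not a Naryn function.

```python
UMASK = 0o007  # the umask quoted above


def effective_mode(requested, umask=UMASK):
    """Permission bits that remain after the umask clears its bits."""
    return requested & ~umask


# Assumption: files are requested as rw-rw-rw- and directories as rwxrwxrwx.
file_mode = effective_mode(0o666)
dir_mode = effective_mode(0o777)

print(oct(file_mode), oct(dir_mode))
```

A umask of 007 removes every permission for "others" and leaves owner and group permissions untouched, which is why users outside the group cannot open the database.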

Tracks

Each track is stored in a binary file with the .nrtrack file extension. One of the two internal formats, dense or sparse, is automatically selected during track creation. The choice of the exact format is based on the optimal run-time performance.

Records and References

A track is a data structure that stores a set of records of (id, time, ref, numeric value) type. For example, the hemoglobin level of patients can be stored in this way, where id would be the id of the patient and time would indicate the moment when the blood test was made. Another track can contain the code of the laboratory which carried out the test. If the times of the records from the two tracks match, one can conclude which lab performed the given test.

Time resolution is always in hours. It might happen that two different blood tests are carried out by two different labs for the same patient at the same hour. Assuming that each lab has a certain bias due to the different equipment used, the hemoglobin readings might come out different. Since both tests are carried out at exactly the same hour, it will be impossible later to link each result to the lab that performed it.

In those cases when two or more values share an identical id and time, Naryn requires them to use different ref (reference) values. A reference is an integer in the range [-1, 254], which is normally set to -1 when no time collision occurs. However, in cases of ambiguity it can give additional resolution to the time. In our blood test example the results of the first lab could be recorded with ref = 0 and those of the second lab with ref = 1. This way the two hemoglobin readings could later be separated and correctly linked to their originating labs.
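A minimal sketch of how references separate colliding records follows. The records, the lab codes and the link_by_reference helper are made up for illustration and do not reflect how Naryn stores or joins tracks internally.

```python
# (id, time, ref, value) records; all numbers are invented for illustration.
hemoglobin = [
    (17, 420000, 0, 13.9),  # reading from the first lab
    (17, 420000, 1, 14.6),  # second lab, same patient and same hour
]
lab_code = [
    (17, 420000, 0, 501),
    (17, 420000, 1, 502),
]


def link_by_reference(values, labs):
    """Pair each value with the lab code that shares its (id, time, ref)."""
    lab_of = {(pid, t, ref): code for pid, t, ref, code in labs}
    return [(v, lab_of[(pid, t, ref)]) for pid, t, ref, v in values]


linked = link_by_reference(hemoglobin, lab_code)
print(linked)
```

Without the ref field both readings would share the key (17, 420000), and there would be no way to tell which lab produced which value.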

Categorical and Quantitative Tracks

Tracks store numerical values assigned to patients and times. The numerical data, however, can have different meanings and hence call for different sets of operations. Laboratory codes, diagnosis codes, and binary information such as date of birth or doctor visits are one type of data, which we call categorical. Another type of data usually represents the readings of different instruments, such as heartbeat rate or glucose level. This type of data is called quantitative.

The operations that can be applied to these two types can be very different. One might want to search for a specific diagnosis code, yet it makes little sense to search for a very specific heartbeat rate, say “68”. On the contrary, heartbeat rate readings from different times can be averaged, something that has no meaning in the case of categorical data.

During track creation one must specify the type of the track: categorical or quantitative. Various operations that can later be applied to the track are bound to the track type.

Logical tracks

In addition to the physical tracks which are stored in binary files, Naryn supports the concept of a logical track, which is an alias to a physical track. For example, assume we have a track called lab.103 which contains hemoglobin levels of patients. It would be more convenient to refer to it explicitly by hemoglobin instead of remembering the lab code. Logical tracks do exactly this: we can create a logical track called hemoglobin which refers to the physical lab.103:

    emr_track.logical.create("hemoglobin", "lab.103")
    emr_extract("hemoglobin")

You can also use logical tracks to create an alias for specific values from a categorical track. For example, suppose we have a track called diagnosis.250 which contains the diagnosis times of ICD code 250 (“250.*”), with the values being the sub-diagnosis (e.g. 1 for 250.1 and 4 for 250.4). Logical tracks allow us to create an alias for specific sub-diagnosis values and then refer to it as a regular track:

    emr_track.logical.create("dx.250.1_4", "diagnosis.250", values = c(1, 4))
    emr_extract("dx.250.1_4")

Under the hood, logical tracks are implemented using the virtual tracks mechanism (see below), but unlike virtual tracks they are part of the database and persist between sessions. You can delete a logical track by calling emr_track.logical.rm and list them using emr_track.logical.ls.

Track Attributes

In addition to numeric data, a track may store arbitrary meta-data such as description, source, etc. The meta-data is stored in the form of name-value pairs, or attributes, where the value is a character string.

Though not officially enforced, attributes are intended to store relatively short character strings. Please use track variables to store data in any other format.

A single attribute can be retrieved, added, modified or deleted using the emr_track.attr.get and emr_track.attr.set functions. Bulk access to more than one attribute is facilitated by the emr_track.attr.export function.

The names of tracks whose attribute values match a pattern can be retrieved using the emr_track.ls, emr_track.global.ls and emr_track.user.ls functions.

Track Variables

Track statistics, results of time-consuming per-track calculations, historical data and any other data in arbitrary format can be stored in a track’s supplementary data in the form of track variables. A track variable can be retrieved, added, modified or deleted using the emr_track.var.get, emr_track.var.set and emr_track.var.rm functions. The list of track variables can be retrieved using the emr_track.var.ls function.

Note: track variables created in R are not visible in Python and vice versa.

Track Attributes vs. Track Variables

Though both track attributes and track variables can be used to store meta-data of a track, there are a few important differences between the two that are summed up in the following table:

| | Track Attributes | Track Variables |
|---|---|---|
| Optimal use case | Track properties as short, non-empty character strings (description, source, …) | Arbitrary data associated with the track |
| Value type | Character string | Arbitrary |
| Single value retrieval | emr_track.attr.get | emr_track.var.get |
| Bulk value retrieval | emr_track.attr.export | |
| Single value modification | emr_track.attr.set | emr_track.var.set |
| Object names retrieval | emr_track.attr.export | emr_track.var.ls |
| Object removal | emr_track.attr.rm | emr_track.var.rm |
| Search by value | R: emr_track.ls, emr_track.global.ls, emr_track.user.ls | |
| R vs. Python compatibility | Yes | No |

Subsets

The analysis of data often involves dividing the data into train and test sets. Naryn allows subsetting the data via the emr_db.subset function. emr_db.subset accepts a list of ids or samples the ids randomly. These ids constitute the subset. The ids that are not in the subset are skipped by all the iterators, filters and various functions.

One may think of a subset as an additional layer, a “viewport”, that filters out some of the ids. Some lower-level functions such as emr_track.info or emr_track.unique ignore the subsets. The same applies to the percentile.* functions of virtual tracks.

Accessing the Data

Track Expressions

Introduction

A track expression allows retrieving numerical data that is recorded in the tracks. Track expressions are widely used in various functions (emr_screen, emr_extract, emr_dist, …).

A track expression is a character string that closely resembles a valid R/Python expression. Just like any other R/Python expression it may include conditions, function calls and variables defined beforehand. "1 > 2", "mean(1:10)" and "myvar < 17" are all valid track expressions. Unlike regular R/Python expressions, a track expression might also contain track names and / or virtual track names.

To understand how a track expression gives access to the tracks, we must explain how the track expression gets evaluated.

Every track expression is accompanied by an iterator that produces a set of id-time points of (id, time, ref) type. The track expression is evaluated for each iterator point. The value of the track expression "mean(1:10)" is constant regardless of the iterator point. However, the track expression might contain a track name mytrack, as in "mytrack * 3". Naryn then recognizes that mytrack is not a regular R/Python variable but rather a track name. A new run-time track variable named mytrack is then added to the R environment (or the Python module local dictionary). For each iterator point this variable is assigned the value of the track that matches (id, time, ref) (or NaN if no matching value exists in the track). Once mytrack is assigned the corresponding value, the track expression is evaluated in R/Python.

Run-time Track Variable is a Vector

To boost the performance of track expression evaluation, run-time track variables are actually defined as vectors in R rather than scalars. The result of the evaluation is also expected to be a vector of the same size. One should always keep the vector notation in mind and write track expressions accordingly.

For example, at first glance the track expression "min(mytrack, 10)" seems to be perfectly fine. However, the evaluation of this expression always produces a scalar, i.e. a single number, even if mytrack is actually a vector. The way to correct this track expression so that it works on vectors is to use the pmin function instead of min.
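The difference between a collapsing minimum and an element-wise minimum can be shown outside of Naryn. In the sketch below a plain Python list stands in for the run-time track variable, and the values are invented; in a real R track expression the element-wise form is pmin(mytrack, 10).

```python
# A plain list stands in for the vector-valued track variable.
mytrack = [4.0, 12.0, 9.5, 30.0]

# Collapsing minimum: one number for the whole vector,
# analogous to min(mytrack, 10) in R.
collapsed = min(*mytrack, 10.0)

# Element-wise minimum: one number per element,
# analogous to pmin(mytrack, 10) in R.
elementwise = [min(value, 10.0) for value in mytrack]

print(collapsed)
print(elementwise)
```

Only the element-wise form returns a result of the same length as the track variable, which is what the track expression evaluation expects.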

Python

Similarly to R, a track variable in Python is not a scalar but rather an instance of numpy.ndarray. The evaluation of a track expression must therefore produce a numpy.ndarray as well. Various operations on numpy arrays indeed work the same way as with scalars, however logical operations require a different syntax. For instance:

screen("mytrack1 > 1 and mytrack2 < 2", iterator="mytrack1")

will produce an error given that mytrack1 and mytrack2 are numpy arrays. The correct way to write the expression is:

screen("(mytrack1 > 1) & (mytrack2 < 2)", iterator="mytrack1")

One may coerce the track variable to behave like a scalar by setting the emr_eval.buf.size option to 1 (see Appendix for more details). Beware though that this might take a heavy toll on run-time.

Matching Reference in the Track Expression

If the track expression contains a track (or virtual track) name, then the values from the track are fetched one by one into the identically named R variable based on the id, time and ref of the iterator point. If, however, the ref of the iterator point equals -1, we treat it as a “wildcard”: matching is then required only for id and time.

A “wildcard” reference in the iterator might create a new issue: more than one track value might then match a single iterator point. In this case the value placed in the track variable (e.g. mytrack) depends on the type of the track. If the track is categorical, the track variable is set to -1; otherwise it is set to the average of all matching values.
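The wildcard rule can be summarized in a short sketch. The wildcard_value helper and the sample records are hypothetical; they only model the behavior described in this section for an iterator point whose reference is -1.

```python
from statistics import mean


def wildcard_value(track, point_id, point_time, categorical):
    """Value seen by the track variable for an iterator point with ref = -1."""
    # With a wildcard reference only id and time are compared.
    matches = [v for (pid, t, ref, v) in track
               if pid == point_id and t == point_time]
    if not matches:
        return float("nan")   # no matching value in the track
    if len(matches) == 1:
        return matches[0]
    # Several values match the same iterator point.
    return -1 if categorical else mean(matches)


# (id, time, ref, value) records invented for illustration.
track = [(1, 100, 0, 10.0), (1, 100, 1, 14.0), (2, 100, -1, 7.0)]

quantitative = wildcard_value(track, 1, 100, categorical=False)
categorical = wildcard_value(track, 1, 100, categorical=True)
single = wildcard_value(track, 2, 100, categorical=False)
print(quantitative, categorical, single)
```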

Virtual Tracks

So far we have shown that in some situations the mytrack variable can be set to the average of the matching track values. But what if we do not want to average the values but rather pick the maximal, minimal or median value? What if we want to use the percentile of a track value rather than the value itself? And maybe we even want to alter the time of the iterator point: shift it or expand it to a time window, and thereby look at a different set of track values? For instance: given an iterator point, we might want to know the maximal level of glucose during the year that preceded the time of the point.

This is where virtual tracks come in handy.

A virtual track is a named set of rules that describe how the track should be processed and how the time of the iterator point should be modified. Virtual tracks are created by the emr_vtrack.create function:

    emr_vtrack.create("annual_glucose",
        src = "glucose_track", func = "quantile", param = 0.5,
        time.shift = c(-year(), 0))

This call creates a new virtual track named annual_glucose based on the underlying physical source track glucose_track. For each iterator point with time T we look at the values of glucose_track in the time window [T-365*24, T], i.e. one year prior to T. We then calculate the median over these values (func="quantile", param=0.5).
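The window computation can be sketched in plain Python. The readings and the window_median helper are invented for illustration; returning NaN for an empty window is an assumption, and statistics.median may interpolate differently from Naryn's quantile function when the window holds an even number of values.

```python
from statistics import median

HOURS_PER_YEAR = 365 * 24


def window_median(readings, t):
    """Median of the values whose time lies in [t - 365*24, t]."""
    in_window = [value for (time, value) in readings
                 if t - HOURS_PER_YEAR <= time <= t]
    # Assumption: an empty window yields NaN.
    return median(in_window) if in_window else float("nan")


# (time in hours, glucose value) for a single patient; made-up numbers.
readings = [(1000, 90.0), (5000, 110.0), (9000, 100.0), (20000, 150.0)]

result = window_median(readings, 9500)
print(result)
```

For T = 9500 the window is [740, 9500], so the reading at hour 20000 is excluded and the median is taken over the remaining three values.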

There is a rich set of functions besides “quantile” that can be applied to the track values. Some of these functions can be used only with categorical tracks, others only with quantitative tracks, and some can be applied to both types of track. Please refer to the documentation of emr_vtrack.create.

Once a virtual track is created it can be used in a track expression:

    emr_extract("annual_glucose", iterator = list(year(), "patients.dob"))

This would give us the median annual glucose level in year-long steps starting from the patient’s birthday. (This example makes use of an Extended Beat Iterator, which is explained later.)

Let’s expand our example further and ignore in our calculations the glucose readings that were made within a week after steroids had been prescribed. We can use an additional filter parameter to do that.

    emr_filter.create("steroids_filter", "steroids_track", time.shift = c(-week(), 0))
    emr_vtrack.create("annual_glucose",
        src = "glucose_track", func = "quantile", param = 0.5,
        time.shift = c(-year(), 0), filter = "!steroids_filter")
    emr_extract("annual_glucose", iterator = list(year(), "date_of_birth_track"))

The filter is applied to the ID-Time points of the source track (e.g. glucose_track in our example). The virtual track function (quantile, …) is then applied only to the points that pass the filter. The concept of filters is explained extensively in a separate chapter.

Virtual tracks also allow remapping the patient ids. This is done via the id.map parameter, which accepts a data frame that defines the id mapping. Remapping ids might be useful when family ties are explored. For example, instead of the glucose level of the patient we might be interested in the glucose level of one of their family members.

Iterators

So far we have discussed track expressions and how they are evaluated given an iterator point. In this section we will show how the iterator points are generated.

An iterator is defined via the iterator parameter. There are a few types of iterators such as track iterator, beat iterator, etc. The type determines which points are generated by the iterator. The information about each type is listed below.

An iterator is always accompanied by four additional parameters: stime, etime, keepref and filter. stime and etime bound the time scope of the iterator: the points that the iterator generates always lie within these boundaries. The effect of keepref=TRUE depends on the iterator type. However, for all iterator types, if keepref=FALSE the reference of all the iterator points is set to -1. The filter parameter sets the iterator filter, which is discussed thoroughly in a separate chapter later in the document.

Track Iterator

A track iterator returns the points (including the reference) from the specified track. The track name is specified as a string.

If keepref=FALSE the reference of each point is set to -1.

Example:

    # Returns the level of glucose one hour after the insulin shot was made
    emr_vtrack.create("glucose", "glucose_track", func = "avg", time.shift = 1)
    emr_extract("glucose", iterator = "insulin_shot_track")

Id-Time Points Iterator

An Id-Time points iterator generates points from an id-time points table (see: Appendix). If keepref=FALSE the reference of each point is set to -1.

Example:

    # Returns the level of glucose one hour after the insulin shot was made
    emr_vtrack.create("glucose", "glucose_track", func = "avg", time.shift = 1)
    r <- emr_extract("insulin_shot_track")  # <-- implicit iterator is used here
    emr_extract("glucose", iterator = r)

Ids Iterator

An Ids iterator generates points with ids taken from an ids table (see: Appendix) and times that run from stime to etime with a step of 1.

If keepref=TRUE, for each id-time pair the iterator generates 255 points with references running from 0 to 254. If keepref=FALSE only one point is generated for the given id and time, and its reference is set to -1.
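The number of points produced under each keepref setting can be seen in a small sketch. ids_iterator_points is a hypothetical helper that models the description above; the ordering of the generated points and the inclusion of etime itself are assumptions.

```python
def ids_iterator_points(ids, stime, etime, keepref):
    """Enumerate (id, time, ref) points for an ids iterator."""
    # keepref=TRUE expands every id-time pair into references 0..254,
    # keepref=FALSE yields a single point with the wildcard reference -1.
    refs = range(0, 255) if keepref else [-1]
    return [(pid, t, ref)
            for pid in ids
            for t in range(stime, etime + 1)   # step of 1 hour
            for ref in refs]


without_refs = ids_iterator_points([2, 5], stime=10, etime=12, keepref=False)
with_refs = ids_iterator_points([2, 5], stime=10, etime=12, keepref=True)
print(len(without_refs), len(with_refs))
```

With two ids and three hours, keepref=FALSE gives 6 points while keepref=TRUE gives 6 * 255 points, which is worth keeping in mind for long time ranges.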

Example:

    # Returns the level of glucose for each hour in year 2016 for ids 2 and 5
    stime <- emr_date2time(1, 1, 2016, 0)
    etime <- emr_date2time(31, 12, 2016, 23)
    emr_extract("glucose", iterator = data.frame(id = c(2, 5)), stime = stime, etime = etime)

Time Intervals Iterator

A Time intervals iterator generates points for all the ids that appear in the ‘patients.dob’ track, with times taken from a time intervals table (see: Appendix). The times start at the beginning of each time interval and run to its end with a step of 1. Points that lie outside of the [stime, etime] range are skipped.

If keepref=TRUE, for each id-time pair the iterator generates 255 points with references running from 0 to 254. If keepref=FALSE only one point is generated for the given id and time, and its reference is set to -1.

Example:

    # Returns the level of hangover for all patients the next day after New Year Eve
    # for the years 2015 and 2016
    stime1 <- emr_date2time(1, 1, 2015, 0)
    etime1 <- emr_date2time(1, 1, 2015, 23)
    stime2 <- emr_date2time(1, 1, 2016, 0)
    etime2 <- emr_date2time(1, 1, 2016, 23)
    emr_extract("alcohol_level_track",
        iterator = data.frame(stime = c(stime1, stime2), etime = c(etime1, etime2)))

Id-Time Intervals Iterator

An Id-Time intervals iterator generates for each id the points that cover the ['stime', 'etime'] time range specified in an id-time intervals table (see: Appendix). The times start at the beginning of each time interval and run to its end with a step of 1. Points that lie outside of the [stime, etime] range are skipped.

If keepref=TRUE, for each id-time pair the iterator generates 255 points with references running from 0 to 254. If keepref=FALSE only one point is generated for the given id and time, and its reference is set to -1.

Beat Iterator

A Beat Iterator generates a “time beat” at the given period for each id that appears in the ‘patients.dob’ track. The period is always given in hours.

Example:

emr_extract("glucose_track",iterator=10,stime=1000,etime=2000)

This will create a beat iterator with a period of 10 hours starting at stime and running until etime is reached. If, for example, stime equals 1000 then the beat iterator will create for each id iterator points at times: 1000, 1010, 1020, …
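The time points of this beat can be reproduced with a one-line range. beat_times is an illustrative helper, not a Naryn function, and treating etime as inclusive is an assumption based on the statement that iterator points lie within the [stime, etime] boundaries.

```python
def beat_times(period, stime, etime):
    """Times generated for a single id by a beat iterator."""
    # Assumption: a point falling exactly on etime is still generated.
    return list(range(stime, etime + 1, period))


times = beat_times(period=10, stime=1000, etime=2000)
print(times[:3], times[-1], len(times))
```

The same list of times is produced for every id that appears in the ‘patients.dob’ track.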

If keepref=TRUE, for each id-time pair the iterator generates 255 points with references running from 0 to 254. If keepref=FALSE only one point is generated for the given id and time, and its reference is set to -1.

Extended Beat Iterator

An Extended beat iterator is, as its name suggests, a variation on the beat iterator. It works by the same principle of creating time points with the given period. However, instead of counting the times from stime, it accepts an additional parameter, a track or an Id-Time Points table, that specifies the initial time point for each of the ids. The two parameters (period and mapping) should be passed in a list. Each id is required to appear only once, and if a certain id does not appear at all, it is skipped by the iterator.

In any case, points that lie outside of the [stime, etime] range are not generated.

Example:

    # Returns the maximal weight of patients at one year span starting from their birthdays
    emr_vtrack.create("weight", "weight_track", func = "max", time.shift = c(0, year()))
    emr_extract("weight", iterator = list(year(), "birthday_track"), stime = 1000, etime = 2000)

Periodic Iterator

A periodic iterator goes over every year or month. You can use it by calling emr_monthly_iterator or emr_yearly_iterator.

Example:

    iter <- emr_yearly_iterator(emr_date2time(1, 1, 2002), emr_date2time(1, 1, 2017))
    emr_extract("dense_track", iterator = iter, stime = 1, etime = 3)
    iter <- emr_monthly_iterator(emr_date2time(1, 1, 2002), n = 15)
    emr_extract("dense_track", iterator = iter, stime = 1, etime = 3)

Implicit Iterator

The iterator is set implicitly if its value remains NULL (which is the default). In that case the track expression is analyzed and searched for track names. If all the track variables or virtual track variables point to the same track, this track is used as the source for a track iterator. If more than one track appears in the track expression, an error message is printed reporting the ambiguity.

Revealing Current Iterator Time

During the evaluation of a track expression one can access a specially defined variable named EMR_TIME (Python: TIME). This variable contains a vector (numpy.ndarray in Python) of current iterator times. The length of the vector matches the length of the track variable (which is a vector too).

Note that some values in EMR_TIME might be set to 0. Skip those entries and the values of the track variables at the corresponding indices.

    # Returns times of the current iterator as a day of month
    emr_extract("emr_time2dayofmonth(EMR_TIME)", iterator = "sparse_track")

Filters

A filter is used to approve / reject an ID-Time point. It can be applied to an iterator, in which case the iterator points must be approved by the filter before they are passed on to the track expression. A filter may also be used by a virtual track. In this case the virtual track function (see the func parameter of emr_vtrack.create) is applied only to the points from the source track (src parameter) that pass the filter.

A filter has the form of a logical expression consisting of named or unnamed elementary filters (the “building bricks” of the filter) connected with the logical operators &, |, ! (and, or and not in Python) and brackets ().

Named Filters

Suppose we are interested in the hemoglobin levels of patients who were prescribed either drugX or drugY but not drugZ within a time window of one week before the test. Assume that drugX, drugY and drugZ each reside in a separate track. Without filters we would need to call emr_extract four times, store potentially huge data frame results in memory and finally merge the tables within R while taking care of the time windows. With filters we can do it much more easily:

    # A negative start shift opens the window one week before the test
    emr_filter.create("filterX", "drugX", time.shift = c(-week(), 0))
    emr_filter.create("filterY", "drugY", time.shift = c(-week(), 0))
    emr_filter.create("filterZ", "drugZ", time.shift = c(-week(), 0))
    emr_extract("hemoglobin", filter = "(filterX | filterY) & !filterZ")
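To make the filter logic concrete, here is a sketch that evaluates the same logical expression by hand. The tracks are plain lists of (id, time) records with invented values, and passes is a hypothetical helper that models a named filter with a window of one week before the candidate point; it does not reproduce Naryn's implementation.

```python
WEEK = 7 * 24  # hours


def passes(track, pid, t, window=(-WEEK, 0)):
    """True if the track has a record for pid inside [t + window[0], t + window[1]]."""
    lo, hi = t + window[0], t + window[1]
    return any(p == pid and lo <= rt <= hi for (p, rt) in track)


# (id, time) prescription records; made-up data.
drug_x = [(1, 950)]
drug_y = [(2, 990)]
drug_z = [(2, 995), (3, 100)]

# Candidate points: hemoglobin tests at hour 1000 for three patients.
tests = [(1, 1000), (2, 1000), (3, 1000)]

# Equivalent of the filter "(filterX | filterY) & !filterZ".
kept = [(pid, t) for (pid, t) in tests
        if (passes(drug_x, pid, t) or passes(drug_y, pid, t))
        and not passes(drug_z, pid, t)]
print(kept)
```

Patient 1 is kept (drugX within the week, no drugZ), patient 2 is rejected because of drugZ, and patient 3 is rejected because neither drugX nor drugY was prescribed within the window.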

We can further expand the example above by specifying the ‘operator’ argument on filter creation. Suppose we wish to extract the same information as before, but only for patients who have a hemoglobin level above 16 (in addition to our drug treatment requirements). Under the same assumptions as in the previous example, our code would look like:

    # A negative start shift opens the window one week before the test
    emr_filter.create("filterX", "drugX", time.shift = c(-week(), 0))
    emr_filter.create("filterY", "drugY", time.shift = c(-week(), 0))
    emr_filter.create("filterZ", "drugZ", time.shift = c(-week(), 0))
    emr_filter.create("hemoglobin_gt_16", "hemoglobin", val = 16, operator = ">")
    emr_extract("hemoglobin", filter = "(filterX | filterY) & !filterZ & hemoglobin_gt_16")

Python

Filter with logical conditions will use Python’s notation like:

extract("hemoglobin",filter="(filterX or filterY) and not filterZ")

Each call to emr_filter.create creates a named elementary filter (or simply: named filter) with a unique name. The named filter can then be used in the filter parameter of an iterator and be combined with other named filters using the logical operators.

Other Objects within Filters

In our previous example we created three named filters based on three tracks. If the time window was not required, we could have used the names of the tracks directly in the filter, like: filter = "(drugX | drugY) & !drugZ".

In addition to track names, other types of objects can be used within the filter. These are: Id-Time Points Table, Ids Table, Time Intervals Table and Id-Time Intervals Table (see Appendix for the format of these tables). When used in the filter, the object should be constructed in advance and be referred to by its name. “In place” construction (i.e. filter = "data.frame(...)") is not allowed.

Managing Reference in Filters

An ID-Time point embeds a reference value. Named filters allow specifying whether the reference should be used for matching or not. When keepref=TRUE is set within emr_filter.create, the candidate point’s reference is matched with the filter’s reference. Otherwise the references are ignored.

It is important to remember that references are always ignored when any object other than a named filter is used within a filter. For instance, if filter = "drug" and drug is the name of a track (and not the name of a named filter), then the references will be ignored during the matching. To ensure the filter matches the references of the drug track, one must define a named filter with the keepref=TRUE parameter:

    emr_filter.create("drug_filter", "drug", keepref = TRUE)
    emr_extract(my.track.expression, filter = "drug_filter", keepref = TRUE)

Advanced Naryn

Random Algorithms

Various functions in the library, such as emr_quantiles, make use of a pseudo-random number generator. Each time the function is invoked a unique series of random numbers is issued. Hence two identical calls might produce different results. To guarantee reproducible results, call set.seed (Python: seed) before invoking the function.

Note: the R and Python implementations of Naryn use different pseudo-random number generator algorithms. This means that a result obtained in R cannot be reproduced in Python if randomness is involved, even if an identical seed is used on both platforms.

Multitasking

To boost run-time performance, various functions in the library support a multitasking mode, i.e. parallel computation by several concurrent processes. Multitasking is not invoked immediately: approximately 0.3 seconds after the function is launched, the actual progress is measured and the total run-time is estimated. If the estimated run-time exceeds the limit (currently: 2 seconds), multitasking kicks in.

The number of processes launched in the multitasking mode depends on the total run-time estimation (a longer run-time will use more processes) and the values of the emr_min.processes and emr_max.processes R options. In any case the number of processes never exceeds the number of CPU cores available.

Multitasking can significantly boost performance, however it uses more CPU. When low CPU utilization is the priority, it is advisable to switch off multitasking by setting the emr_multitasking R option to FALSE.

In addition to increased CPU usage, multitasking might also alter the behavior of functions that return ID-Time points, such as emr_extract and emr_screen. When multitasking is not invoked, these functions always return the results sorted by ID, time and reference. In multitasking mode, however, the result might come out unsorted. Moreover, subsequent calls might return the results in a different order. One might use the sort parameter of these functions to ensure the points come out sorted. Please bear in mind that sorting the results takes its toll, especially on particularly large data frames. That is why sort is set to FALSE by default.

Appendix

R vs. Python Interface Differences

|  | R | Python |
|---|---|---|
| Naming Conventions (except for virtual track ‘func’, which stays unchanged) | emr_xxx.yyy.zzz | xxx_yyy_zzz |
| Variables | Defined in .naryn environment: EMR_GROOT, EMR_UROOT | Defined in module’s environment: _GROOT, _UROOT |
| Run-time Variables (available only during track expression evaluation) | EMR_TIME | TIME |
| Package / Module Options | Controlled via standard options mechanism: options(emr_xxx.yyy=zzz), getOption("emr_xxx.yyy") | Controlled by module’s CONFIG variable: CONFIG['xxx_yyy']=zzz, CONFIG['xxx_yyy'] |
| Data Types (used as function parameters) | data.frame | pandas.DataFrame |
|  | list | list |
|  | vector of strings | list of strings |
|  | vector of numerics | numpy.ndarray of numerics |
|  | NULL | None |
| Data Types (return value) | data.frame | pandas.DataFrame |
|  | list | dict |
|  | vector of strings | numpy.ndarray of objects (strings) |
|  | vector of numerics, no labels | numpy.ndarray of numerics |
|  | vector of numerics, with labels | pandas.DataFrame with two columns (label, numeric) |
|  | NULL | None |
| Database Management | Database is unloaded when the package is detached. | db_unload() must be called explicitly to unload the database. |
| Setting seed for random number generator. Note: R and Python use different random generators, results are therefore not reproducible between them. | set.seed | seed |
| Track Variables | Variables saved in Python are not visible in R. | Variables saved in R are not visible in Python. |
| Setting Track Variables | emr_track.set creates a directory named .trackname.var | track_set creates a directory named .trackname.pyvar |
| Named Filters and Virtual Tracks | Named filters and virtual tracks may be saved along with the rest of R’s environment. | filter_export, filter_import, vtrack_export, vtrack_import must be explicitly called to save / restore named filters or virtual tracks. |
| Pattern Matching | emr_track.ls, emr_track.global.ls, emr_track.user.ls, emr_track.var.ls, emr_filter.ls accept pattern matching parameters. Return: vector of strings that match the pattern. | track_ls, track_global_ls, track_user_ls, track_var_ls, filter_ls do not support pattern matching. Return: numpy.ndarray of objects (strings) that contains all the objects (tracks, …) |
| Time shift parameter (used in various functions) | time.shift is a numeric or a vector of two numerics. | time_shift is a numeric or a list of two numerics. |
| Calculating Distribution | emr_dist returns N-dimensional vector with labels (dimension names) | dist returns N-dimensional numpy.ndarray without labels. |
| Calculating Correlation Statistics | emr_cor: For N-dimensional binning the returned value r may be addressed as: r$cor[bin1,...,binN,i,j], where i and j are indices of cor.exprs. | cor: For N-dimensional binning the returned value r may be addressed as: r['cor'][bin1,...,binN,i,j], where i and j are indices of cor_exprs. |
| Others | emr_annotate | Not implemented, use pandas.DataFrame.merge or pandas.merge_sorted instead. |
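The naming convention in the table can be expressed as a small helper. r_to_python_name is a hypothetical function written for this illustration and is not part of Naryn: it drops the emr_ prefix and replaces dots with underscores, which matches the name pairs listed above.

```python
def r_to_python_name(r_name):
    """Translate an R-style Naryn name to its Python-style counterpart."""
    prefix = "emr_"
    if r_name.startswith(prefix):
        r_name = r_name[len(prefix):]
    return r_name.replace(".", "_")


for name in ["emr_track.global.ls", "emr_dist", "time.shift", "emr_min.processes"]:
    print(name, "->", r_to_python_name(name))
```

The helper covers function, parameter and option names alike; values of the virtual track ‘func’ are the stated exception and should be passed through unchanged.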

Options

Naryn supports the following options. The options can be set/examined via R’s options and getOption.

(Use CONFIG['option_name'] to control the module options in Python. Mind Python’s naming convention as well: R’s emr_xxx.yyy option is named xxx_yyy.)

| Option | Default Value | Description |
|---|---|---|
| emr_multitasking | TRUE | Should the multitasking be allowed? |
| emr_min.processes | 8 | Minimal number of processes launched when multitasking is invoked. |
| emr_max.processes | 20 | Maximal number of processes launched when multitasking is invoked. |
| emr_max.data.size | 10000000 | Maximal size of data sets (rows of a data frame, length of a vector, …) stored in memory. Prevents excessive memory usage. |
| emr_eval.buf.size | 1000 | Size of the track expression evaluation buffer. |
| emr_warning.itr.no.filter.size | 100000 | Threshold above which “beat iterator used without filter” warning is issued. |

Common Table Formats

Id-Time Points Table

Id-Time Points table is a data frame whose first two columns are named ‘id’ and ‘time’. References might be specified by a third column named ‘ref’. If the ‘ref’ column is missing or named differently, references are set to -1. Additional columns, if present, are ignored.
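The rules for reading an Id-Time Points table can be sketched with plain Python structures. normalize_id_time_points is a hypothetical helper, not a Naryn function, and it represents the table as a list of column names plus a list of row tuples instead of a data frame.

```python
def normalize_id_time_points(columns, rows):
    """Return (id, time, ref) tuples from a table given as column names and rows."""
    if list(columns[:2]) != ["id", "time"]:
        raise ValueError("the first two columns must be named 'id' and 'time'")
    has_ref = len(columns) > 2 and columns[2] == "ref"
    points = []
    for row in rows:
        ref = row[2] if has_ref else -1  # missing or differently named: -1
        points.append((row[0], row[1], ref))
    return points


# Third column is 'ref', so it is used; the extra 'note' column is ignored.
with_ref = normalize_id_time_points(["id", "time", "ref", "note"], [(1, 10, 0, "x")])
# Third column is not 'ref', so the reference falls back to -1.
without_ref = normalize_id_time_points(["id", "time", "value"], [(1, 10, 3.5)])
print(with_ref)
print(without_ref)
```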

Id-Time Values Table

Id-Time Values table is an extension of the Id-Time Points table with an additional column named ‘value’. Additional columns, if present, are ignored.

Ids Table

Ids table is a data frame whose first column is named ‘id’. Each id must appear only once. Additional columns of the data frame, if present, are ignored.
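The uniqueness requirement can be checked before a table is handed to the library. duplicated_ids is a hypothetical helper that works on the values of the ‘id’ column given as a plain list.

```python
from collections import Counter


def duplicated_ids(ids):
    """Return the ids that appear more than once, in first-seen order."""
    counts = Counter(ids)
    return [value for value in counts if counts[value] > 1]


print(duplicated_ids([5, 9, 12]))       # every id is unique
print(duplicated_ids([5, 9, 5, 9, 1]))  # 5 and 9 are repeated
```

An empty result means the column satisfies the requirement.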

Time Intervals Table

Time Intervals table is a data frame whose first two columns are named ‘stime’ and ‘etime’ (i.e. start time and end time). Additional columns, if present, are ignored.

Id-Time Intervals Table

Id-Time Intervals table is a data frame whose first three columns are named ‘id’, ‘stime’ and ‘etime’ (i.e. start time and end time). Additional columns, if present, are ignored.

