1 - Preparing data for Segmentation/Clustering with segclust2d

R. Patin

2024-04-24

library(segclust2d)
data(simulshift)
data(simulmode)
simulmode$abs_spatial_angle <- abs(simulmode$spatial_angle)
simulmode <- simulmode[!is.na(simulmode$abs_spatial_angle), ]

This summary provides information on:

Type of data accepted and content required to run segmentation() or segclust()

Right now, the functions in the segclust2d package accept three different kinds of input data: data.frame, Move objects and ltraj objects.

Future versions may provide support for sftraj objects as well.

data.frame

data.frame is the format natively supported by segmentation() and segclust(). If x_data.frame is a data frame, the syntax is simply:

segmentation(x_data.frame, lmin = 5, Kmax = 25)

Move

A Move object can alternatively be provided to the function. If using segmentation(), the user may omit the seg.var argument, and the algorithm will use the movement coordinates as segmentation variables. Alternatively, if the user specifies the segmented variables with the seg.var argument, those variables must be present in the data associated with the Move object, x_move@data. If x_move is a Move object, the syntax is simply:

segmentation(x_move, lmin = 5, Kmax = 25)

ltraj

An ltraj object can alternatively be provided to the function. If using segmentation(), the user may omit the seg.var argument, and the algorithm will use the movement coordinates as segmentation variables. Alternatively, if the user specifies the segmented variables with the seg.var argument, those variables must be present in the data associated with the ltraj object. If x_ltraj is an ltraj object, the syntax is simply:

segmentation(x_ltraj, lmin = 5, Kmax = 25)

sftraj

sftraj objects are not supported for the moment.

Limitations on data size and subsampling

Computation cost for the algorithm scales non-linearly and can be demanding in both memory and time. Performance depends on the computer, but from what we have tested, a segmentation on more than 10000 data points can be quite memory intensive (more than 10 GB of RAM), and segmentation-clustering can be quite long for more than 1000 data points (a few minutes to hours). For such datasets we recommend either subsampling, if losing resolution is not a big deal (looking for home-range changes over a year with hourly points might be a waste of time when daily points are sufficient), or splitting the dataset for very long series. For segmentation-clustering, clusters will not be easily comparable between the different parts of the dataset; however, if each part is certain to contain all clusters, there should be no problem.
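As a minimal sketch of the splitting option mentioned above, the base R code below cuts a long data frame into consecutive parts. The toy track and the chunk size of 10000 rows are assumptions made for illustration, not package defaults, and no segclust2d function is involved.

```r
# Toy track standing in for a long data frame of locations (made-up values).
set.seed(1)
long_track <- data.frame(
  x = cumsum(rnorm(25000)),
  y = cumsum(rnorm(25000))
)

# chunk_size is an illustrative choice, not a package default.
chunk_size <- 10000

# Label each row with the index of the consecutive chunk it belongs to.
chunk_id <- ceiling(seq_len(nrow(long_track)) / chunk_size)
chunks <- split(long_track, chunk_id)

# Number of rows in each part.
vapply(chunks, nrow, integer(1))
```

Each element of chunks could then be passed to segmentation() separately, keeping in mind the caveat above about comparing clusters across parts.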

Subsampling options

Disabling automatic subsampling

Subsampling is automatically enabled in the functions to avoid unwanted memory saturation or very long computation times. By default, argument subsample is set to TRUE. To disable subsampling entirely, set subsample = FALSE:

shiftseg <- segmentation(simulshift, Kmax = 30, lmin = 5,
                         seg.var = c("x", "y"), subsample = FALSE)
mode_segclust <- segclust(simulmode, Kmax = 30, lmin = 5, ncluster = c(2, 3, 4),
                          seg.var = c("speed", "abs_spatial_angle"),
                          subsample = FALSE)

Automatic subsampling

By default subsampling is allowed (subsample = TRUE), and it occurs when the number of data points exceeds a threshold (10000 for segmentation, 1000 for segmentation-clustering). The function subsamples by the lowest factor (2, 3, 4…) for which the subsampled dataset falls below the threshold. For instance, a 2500-row dataset for segmentation-clustering would be subsampled by 3 to fall below 1000 rows (a factor of 2 would still leave 1250 rows). The threshold can be changed through argument subsample_over.

shiftseg <- segmentation(simulshift, Kmax = 30, lmin = 5,
                         seg.var = c("x", "y"), subsample_over = 2000)
mode_segclust <- segclust(simulmode, Kmax = 30, lmin = 5, ncluster = c(2, 3, 4),
                          seg.var = c("speed", "abs_spatial_angle"),
                          subsample_over = 500)
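The rule for choosing the subsampling factor can be sketched in a few lines of base R. The helper name smallest_subsample_factor is hypothetical and not part of segclust2d; how the package counts rows after subsampling and how it treats a dataset that lands exactly on the threshold are assumptions here.

```r
# Hypothetical helper (not part of segclust2d): smallest integer factor for
# which keeping one row out of every `k` brings the data under the threshold.
smallest_subsample_factor <- function(n_rows, threshold) {
  k <- 1L
  # Assumption: keeping rows 1, 1 + k, 1 + 2k, ... leaves ceiling(n_rows / k) rows.
  while (ceiling(n_rows / k) > threshold) {
    k <- k + 1L
  }
  k
}

# The 2500-row example from the text, with the segmentation-clustering threshold.
smallest_subsample_factor(2500, threshold = 1000)  # 3

# A dataset already under the threshold is left untouched (factor 1).
smallest_subsample_factor(800, threshold = 1000)   # 1
```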

Manual subsampling

One can also override automatic subsampling by directly selecting the subsampling factor through argument subsample_by.

shiftseg <- segmentation(simulshift, Kmax = 30, lmin = 5,
                         seg.var = c("x", "y"), subsample_by = 60)
mode_segclust <- segclust(simulmode, Kmax = 30, lmin = 5, ncluster = c(2, 3, 4),
                          seg.var = c("speed", "abs_spatial_angle"),
                          subsample_by = 2)

Consequences of subsampling on lmin

Beware that subsampling will also affect your lmin argument. If subsampling by 2, lmin will be divided by 2. The function reports the value of lmin and its adjustment for subsampling through messages such as:

#> ✔ Using lmin = 240
#> ✔ Adjusting lmin to subsampling. 
#> Dividing lmin by 60, with a minimum of 5
#> → After subsampling, lmin = 5. 
#> Corresponding to lmin = 300 on the original time scale
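The arithmetic behind these messages can be reproduced with a short sketch. The helper adjust_lmin is hypothetical, and rounding down when the division is not exact is an assumption; only the division by the subsampling factor and the minimum of 5 come from the messages above.

```r
# Hypothetical helper (not part of segclust2d) mirroring the messages above.
adjust_lmin <- function(lmin, subsample_by, min_lmin = 5) {
  # Assumption: non-integer results are rounded down before applying the minimum.
  lmin_subsampled <- max(floor(lmin / subsample_by), min_lmin)
  list(
    lmin_subsampled = lmin_subsampled,
    # Equivalent minimum segment length on the original time scale.
    lmin_original_scale = lmin_subsampled * subsample_by
  )
}

# lmin = 240 with subsample_by = 60: 240 / 60 = 4, raised to the minimum of 5,
# which corresponds to 5 * 60 = 300 points on the original time scale.
adjust_lmin(240, subsample_by = 60)
```

In this example the effective minimum segment length is therefore longer than the one requested (300 instead of 240 original points), which is worth checking whenever a large subsampling factor is used.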

Best practice with subsampling

In addition to reducing computation time, subsampling may also help the algorithm. Considering movement at the scale of hours when looking for home ranges at the scale of months may blur the signal; for such analyses, one data point per day may be sufficient. For all analyses, the user should think about the appropriate temporal resolution, keeping in mind that the finest temporal resolution may not always be the most appropriate.

Note that subsampling has been implemented in such a way that outputs will show all points, but the segmentation is calculated only on the subsampled points. Points used in the segmentation can be retrieved through augment(), in the data column subsample_ind (the subsample index for kept points and NA for ignored points).

Outputs may be more easily explored if subsampling is done before providing the data to segclust2d functions.
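One way to subsample beforehand is plain row indexing in base R. The sketch below thins a made-up hourly track to one location per day; the toy data and the factor of 24 are assumptions chosen for illustration.

```r
# Toy hourly track (made-up values): 60 days of hourly locations.
set.seed(42)
n_points <- 24 * 60
hourly_track <- data.frame(
  dateTime = seq(as.POSIXct("2024-01-01 00:00:00", tz = "UTC"),
                 by = "1 hour", length.out = n_points),
  x = cumsum(rnorm(n_points)),
  y = cumsum(rnorm(n_points))
)

# Keep one location out of every 24, i.e. one point per day.
keep <- seq(1, nrow(hourly_track), by = 24)
daily_track <- hourly_track[keep, ]

nrow(daily_track)  # 60
```

The thinned data frame could then be passed to segmentation() or segclust() with subsample = FALSE, so that lmin is expressed directly in the units of the thinned series (days here).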

Covariate calculations

The package also includes functions to calculate unusual covariates, such as the turning angle at constant step length (here called spatial_angle, see Patin et al. 2020 for more details). For this covariate, a radius has to be chosen and can be specified through argument radius. If no radius is specified, the default is the median of the step length distribution. The other covariates calculated are: persistence and turning speed (v_p and v_r) from Gurarie et al. (2009), distance travelled between points, speed, and a smoothed version of the latter. Covariates that depend on the time interval (like speed) are by default calculated in hours, but you can change this with argument units, as in the example below.

simple_data <- simulmode[, c("dateTime", "x", "y")]
full_data <- add_covariates(simple_data,
                            coord.names = c("x", "y"),
                            timecol = "dateTime",
                            smoothed = TRUE,
                            units = "min")
head(full_data)
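To illustrate how the time unit changes a time-dependent covariate, here is a small base R sketch that computes step speed in two different units. The helper step_speed and the three made-up locations are assumptions for illustration; this is not the implementation used by add_covariates().

```r
# Three made-up locations recorded 30 minutes apart.
track <- data.frame(
  dateTime = as.POSIXct(c("2024-01-01 00:00:00",
                          "2024-01-01 00:30:00",
                          "2024-01-01 01:00:00"), tz = "UTC"),
  x = c(0, 3, 3),
  y = c(0, 4, 10)
)

# Hypothetical helper (not the package's implementation): speed of each step,
# expressed as distance per chosen time unit.
step_speed <- function(track, units = "hours") {
  step_length <- sqrt(diff(track$x)^2 + diff(track$y)^2)
  step_duration <- as.numeric(diff(track$dateTime), units = units)
  step_length / step_duration
}

step_speed(track, units = "hours")  # 10 12 (distance units per hour)
step_speed(track, units = "mins")   # same steps, per minute
```

The same steps give values 60 times smaller per minute than per hour, so the chosen unit should be kept in mind when interpreting segment means of speed.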

Advice on data pre-processing

When pre-processing movement data before segmentation/clustering, it is common to interpolate missing data points. This may however cause problems if it leads to repeated values. The same issue can arise if the individual has a very stable speed (e.g. a boat, or a bird drifting on the sea), leading to very similar values.

When a run of identical or very similar values is longer than parameter lmin, there are segments with null variance, which cannot be accounted for by the algorithm. Should such cases arise, the algorithm will fail and tell you about it:

df <- data.frame(x = rep(1, 500), y = rep(2, 500))
segclust(df, seg.var = c("x", "y"), lmin = 50, ncluster = 3)
#> ✖ Data have repetition of nearly-identical values longer than lmin. 
#> The algorithm cannot estimate variance for segment with repeated values. This is potentially caused by interpolation  of missing values or rounding of values.
#> → Please check for repeated or very similar values of x and y

To avoid this problem, interpolation should be done on the covariates to be segmented rather than on the coordinates. Alternatively, small and rare gaps in the data could be ignored. If a gap is too large, there is also the possibility of splitting the dataset.

Datasets with naturally occurring repetition of similar values (a boat at constant speed) are generally difficult to process with our segmentation/clustering algorithm.
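Before running the algorithm, one can look for such runs with a quick base R check. The helper longest_flat_run and its tolerance are hypothetical; the exact criterion used internally by segclust2d to detect "nearly-identical" values may differ.

```r
# Hypothetical helper (not part of segclust2d): length of the longest run of
# consecutive values that differ by no more than `tolerance`.
longest_flat_run <- function(values, tolerance = 1e-8) {
  # TRUE where a value is (nearly) identical to the previous one.
  flat <- c(FALSE, abs(diff(values)) <= tolerance)
  flat[is.na(flat)] <- FALSE  # a missing value breaks a run
  runs <- rle(flat)
  flat_lengths <- runs$lengths[runs$values]
  if (length(flat_lengths) == 0) 1L else max(flat_lengths) + 1L
}

# The failing example from the text: 500 identical values, far longer than lmin = 50.
longest_flat_run(rep(1, 500))             # 500

# A short repetition inside otherwise varying data.
longest_flat_run(c(1, 1, 2, 2, 2, 3))     # 3
```

Comparing this value with lmin for each segmentation variable gives a rough indication of whether the error above is likely to be triggered.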

