This code (lines 949 to 963 in e156fbe):

    ## Set up subsamples (outside of the loop):
    subsample <- list()
    for (i in 1:ntrees) {
      if (is.function(sampfrac)) {
        subsample[[i]] <- sampfrac(n = n, weights = weights)
      }
      else if (sampfrac == 1) {
        subsample[[i]] <- sample(1:n, size = n, replace = TRUE,
                                 prob = weights)
      }
      else if (sampfrac < 1) {
        subsample[[i]] <- sample(1:n, size = round(sampfrac * n),
                                 replace = FALSE, prob = weights)
      }
    }

is a bad idea as it scales very poorly. E.g., suppose there are 200,000 rows, we want to subsample to half the size, and we want 1000 trees. Then it requires 10^5 * 10^3 * 4 bytes = 400 megabytes of RAM (4 bytes per integer index). I do see that the reason is the `foreach` call later
(line 998 in e156fbe):

    rules <- foreach::foreach(i = 1:ntrees, .combine = "c", .packages = c("partykit", "pre")) %dopar% {

However, one may still be able to get reproducible results if `clusterSetRNGStream` is used with `foreach`. That said, I have not used the `foreach` package much, and this requires that the loop iterations are split equally across the threads. An alternative is to replace the `foreach` call with `parSapply`, which I know will work.
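
To illustrate the `parSapply` alternative, here is a minimal sketch, not the package's code. It assumes a PSOCK cluster with two workers and uses much smaller, made-up sizes. The helper names `draw_subsample` and `run_once` are hypothetical, and the tree fitting is replaced by a stand-in that returns a summary of the drawn indices. Each subsample is drawn inside the worker, so the full list of subsamples is never stored up front.

```r
library(parallel)

# Hypothetical sizes, far smaller than in the example above
n <- 1000L
ntrees <- 8L
sampfrac <- 0.5
weights <- rep(1, n)

# Hypothetical helper: draw the subsample inside the worker instead of up front
draw_subsample <- function(i, n, sampfrac, weights) {
  idx <- sample(1:n, size = round(sampfrac * n), replace = FALSE, prob = weights)
  # Stand-in for fitting a tree on the rows in idx and extracting its rules
  sum(idx)
}

# Hypothetical helper: one full run on a fresh cluster with a given seed
run_once <- function(seed) {
  cl <- makeCluster(2L)
  on.exit(stopCluster(cl))
  # Give each worker its own reproducible L'Ecuyer random number stream
  clusterSetRNGStream(cl, iseed = seed)
  parSapply(cl, 1:ntrees, draw_subsample,
            n = n, sampfrac = sampfrac, weights = weights)
}

res1 <- run_once(1L)
res2 <- run_once(1L)
print(identical(res1, res2))
```

Note that `parSapply` splits the iterations statically over the workers, so under this approach the results should only be expected to match between runs that use the same seed and the same number of workers.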