A Scanner iterates over a Dataset's fragments and returns data according to given row filtering and column projection. A ScannerBuilder can help create one.
Factory
Scanner$create() wraps the ScannerBuilder interface to make a Scanner. It takes the following arguments:
dataset: A Dataset or arrow_dplyr_query object, as returned by the dplyr methods on Dataset.
projection: A character vector of column names to select columns, or a named list of expressions.
filter: An Expression to filter the scanned rows by, or TRUE (default) to keep all rows.
use_threads: logical: should scanning use multithreading? Default TRUE.
...: Additional arguments, currently ignored.
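The arguments above can be sketched in a single call. This is a minimal illustration, not the only way to build a scanner: the mtcars data, the temporary directory, and the chosen columns and threshold are assumptions made for the example.

```r
library(arrow)

# Write a small partitioned dataset to a temporary directory (illustrative data)
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

# Build the scanner in one step: select two columns and keep rows with hp > 100
scanner <- Scanner$create(
  ds,
  projection = c("mpg", "hp"),
  filter = Expression$field_ref("hp") > 100
)

# Evaluate the scan and collect the result as an Arrow Table
tab <- scanner$ToTable()
n_rows <- tab$num_rows

# Remove the temporary dataset directory
unlink(tf, recursive = TRUE)
```

Passing filter = TRUE (the default) instead would keep every row of the dataset.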
Methods
ScannerBuilder has the following methods:
$Project(cols): Indicate that the scan should only return columns given by cols, a character vector of column names or a named list of Expression.
$Filter(expr): Filter rows by an Expression.
$UseThreads(threads): logical: should the scan use multithreading? The method's default input is TRUE, but you must call the method to enable multithreading because the scanner default is FALSE.
$BatchSize(batch_size): integer: Maximum row count of scanned record batches, default is 32K. If scanned record batches are overflowing memory then this method can be called to reduce their size.
$schema: Active binding, returns the Schema of the Dataset.
$Finish(): Returns a Scanner.
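The builder methods listed above can be combined as in the following sketch. The dataset, the projected columns, and the batch size of 10 are assumptions chosen only to keep the example small.

```r
library(arrow)

# Illustrative dataset written to a temporary directory
tf <- tempfile()
dir.create(tf)
write_dataset(mtcars, tf, partitioning = "cyl")
ds <- open_dataset(tf)

builder <- ds$NewScan()

# Inspect the Dataset schema through the active binding before configuring
schema_names <- names(builder$schema)

# Configure the scan step by step
builder$Project(c("mpg", "hp"))  # return only these columns
builder$UseThreads(TRUE)         # multithreading is only enabled once this is called
builder$BatchSize(10L)           # cap each scanned record batch at 10 rows

# Finalize the configuration and evaluate the scan
scanner <- builder$Finish()
tab <- scanner$ToTable()
n_rows <- tab$num_rows

# Remove the temporary dataset directory
unlink(tf, recursive = TRUE)
```

A smaller batch size changes how the rows are chunked during the scan, not which rows are returned, so the resulting table still holds every row of the dataset.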
Scanner's $ToTable() method evaluates the query and returns an Arrow Table. As the example below shows, $ToRecordBatchReader() returns the results as a RecordBatchReader instead.
Examples
# Set up directory for examples
tf <- tempfile()
dir.create(tf)
on.exit(unlink(tf))

write_dataset(mtcars, tf, partitioning = "cyl")

ds <- open_dataset(tf)

scan_builder <- ds$NewScan()
scan_builder$Filter(Expression$field_ref("hp") > 100)
#> ScannerBuilder
scan_builder$Project(list(hp_times_ten = 10 * Expression$field_ref("hp")))
#> ScannerBuilder

# Once configured, call $Finish()
scanner <- scan_builder$Finish()

# Can get results as a table
as.data.frame(scanner$ToTable())
#>    hp_times_ten
#> 1          1130
#> 2          1090
#> 3          1100
#> 4          1100
#> 5          1100
#> 6          1050
#> 7          1230
#> 8          1230
#> 9          1750
#> 10         1750
#> 11         2450
#> 12         1800
#> 13         1800
#> 14         1800
#> 15         2050
#> 16         2150
#> 17         2300
#> 18         1500
#> 19         1500
#> 20         2450
#> 21         1750
#> 22         2640
#> 23         3350

# Or as a RecordBatchReader
scanner$ToRecordBatchReader()
#> RecordBatchReader
#> 1 columns
#> hp_times_ten: double
#>
#> See $metadata for additional Schema metadata