DataFrame

DataFrame.__getattr__(name)

Returns the Column denoted by name.

DataFrame.__getitem__(item)

Returns the column as a Column.

DataFrame.agg(*exprs)

Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()).

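A minimal sketch of whole-DataFrame aggregation; the SparkSession and the example df (columns id, amount) are illustrative assumptions, reused by the later sketches on this page:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["id", "amount"])

    # Aggregate over the entire DataFrame, without any grouping.
    df.agg(F.max("amount"), F.avg("amount")).show()
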
DataFrame.alias(alias)

Returns a new DataFrame with an alias set.

DataFrame.approxQuantile(col, probabilities, ...)

Calculates the approximate quantiles of numerical columns of a DataFrame.

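For example, the illustrative "amount" column from the first sketch could be summarized like this; the third argument is the allowed relative error (0 computes exact quantiles at higher cost):

    # Median and 90th percentile of "amount", within 1% relative error.
    quantiles = df.approxQuantile("amount", [0.5, 0.9], 0.01)
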
DataFrame.asTable()

Converts the DataFrame into a table_arg.TableArg object, which can be used as a table argument in a TVF (Table-Valued Function), including a UDTF (User-Defined Table Function).

DataFrame.cache()

Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER).

DataFrame.checkpoint([eager])

Returns a checkpointed version of this DataFrame.

DataFrame.coalesce(numPartitions)

Returns a new DataFrame that has exactly numPartitions partitions.

DataFrame.colRegex(colName)

Selects a column based on the column name specified as a regex and returns it as a Column.

DataFrame.collect()

Returns all the records in the DataFrame as a list of Row.

DataFrame.columns

Retrieves the names of all columns in the DataFrame as a list.

DataFrame.corr(col1, col2[, method])

Calculates the correlation of two columns of a DataFrame as a double value.

DataFrame.count()

Returns the number of rows in this DataFrame.

DataFrame.cov(col1, col2)

Calculate the sample covariance for the given columns, specified by their names, as a double value.

DataFrame.createGlobalTempView(name)

Creates a global temporary view with this DataFrame.

DataFrame.createOrReplaceGlobalTempView(name)

Creates or replaces a global temporary view using the given name.

DataFrame.createOrReplaceTempView(name)

Creates or replaces a local temporary view with this DataFrame.

DataFrame.createTempView(name)

Creates a local temporary view with this DataFrame.

DataFrame.crossJoin(other)

Returns the Cartesian product with another DataFrame.

DataFrame.crosstab(col1, col2)

Computes a pair-wise frequency table of the given columns.

DataFrame.cube(*cols)

Create a multi-dimensional cube for the current DataFrame using the specified columns, allowing aggregations to be performed on them.

DataFrame.describe(*cols)

Computes basic statistics for numeric and string columns.

DataFrame.distinct()

Returns a new DataFrame containing the distinct rows in this DataFrame.

DataFrame.drop(*cols)

Returns a new DataFrame without the specified columns.

DataFrame.dropDuplicates([subset])

Return a new DataFrame with duplicate rows removed, optionally only considering certain columns.

DataFrame.dropDuplicatesWithinWatermark([subset])

Return a new DataFrame with duplicate rows removed within a watermark, optionally only considering certain columns.

DataFrame.drop_duplicates([subset])

drop_duplicates() is an alias for dropDuplicates().

DataFrame.dropna([how, thresh, subset])

Returns a new DataFrame omitting rows with null or NaN values.

DataFrame.dtypes

Returns all column names and their data types as a list.

DataFrame.exceptAll(other)

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates.

DataFrame.executionInfo

Returns an ExecutionInfo object after the query has been executed.

DataFrame.exists()

Return a Column object for an EXISTS subquery.

DataFrame.explain([extended, mode])

Prints the (logical and physical) plans to the console for debugging purposes.

DataFrame.fillna(value[, subset])

Returns a new DataFrame in which null values are filled with a new value.

DataFrame.filter(condition)

Filters rows using the given condition.

DataFrame.first()

Returns the first row as a Row.

DataFrame.foreach(f)

Applies the f function to each Row of this DataFrame.

DataFrame.foreachPartition(f)

Applies the f function to each partition of this DataFrame.

DataFrame.freqItems(cols[, support])

Finds frequent items for columns, possibly with false positives.

DataFrame.groupBy(*cols)

Groups the DataFrame by the specified columns so that aggregation can be performed on them.

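A small sketch of grouped aggregation, assuming illustrative columns "key" and "value":

    from pyspark.sql import functions as F

    grouped = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])
    # One output row per key, with the summed values.
    grouped.groupBy("key").agg(F.sum("value").alias("total")).show()
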
DataFrame.groupingSets(groupingSets, *cols)

Create a multi-dimensional aggregation for the current DataFrame using the specified grouping sets, so that aggregations can be run on them.

DataFrame.head([n])

Returns the first n rows.

DataFrame.hint(name, *parameters)

Specifies some hint on the current DataFrame.

DataFrame.inputFiles()

Returns a best-effort snapshot of the files that compose this DataFrame.

DataFrame.intersect(other)

Return a new DataFrame containing rows only in both this DataFrame and another DataFrame.

DataFrame.intersectAll(other)

Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates.

DataFrame.isEmpty()

Checks if the DataFrame is empty and returns a boolean value.

DataFrame.isLocal()

Returns True if the collect() and take() methods can be run locally (without any Spark executors).

DataFrame.isStreaming

Returns True if this DataFrame contains one or more sources that continuously return data as it arrives.

DataFrame.join(other[, on, how])

Joins with another DataFrame, using the given join expression.

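A minimal join sketch with assumed example frames; `on` may be a column name, a list of names, or a join expression, and `how` defaults to "inner":

    left = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
    right = spark.createDataFrame([(1, 100)], ["id", "score"])

    # Keep every row of `left`, matching rows of `right` where the ids agree.
    left.join(right, on="id", how="left").show()
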
DataFrame.limit(num)

Limits the result count to the number specified.

DataFrame.lateralJoin(other[, on, how])

Performs a lateral join with another DataFrame, using the given join expression.

DataFrame.localCheckpoint([eager, storageLevel])

Returns a locally checkpointed version of this DataFrame.

DataFrame.mapInPandas(func, schema[, ...])

Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs pandas DataFrames, and returns the result as a DataFrame.

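A sketch of mapInPandas, reusing the illustrative df (columns id, amount) from the first sketch; the function receives an iterator of pandas DataFrames and must yield pandas DataFrames matching the declared schema:

    def double_amount(batches):
        for pdf in batches:           # pdf is one pandas.DataFrame batch
            pdf["amount"] = pdf["amount"] * 2.0
            yield pdf

    df.mapInPandas(double_amount, schema="id long, amount double").show()
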
DataFrame.mapInArrow(func, schema[, ...])

Maps an iterator of batches in the current DataFrame using a Python native function that takes and outputs pyarrow.RecordBatches, and returns the result as a DataFrame.

DataFrame.metadataColumn(colName)

Selects a metadata column based on its logical column name and returns it as a Column.

DataFrame.melt(ids, values, ...)

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.

DataFrame.na

Returns a DataFrameNaFunctions for handling missing values.

DataFrame.observe(observation, *exprs)

Define (named) metrics to observe on the DataFrame.

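A sketch using a named Observation on the illustrative df from the first sketch; the metric names and expressions here are assumptions, and the metrics become available only after an action runs:

    from pyspark.sql import Observation
    from pyspark.sql import functions as F

    obs = Observation("stats")
    observed = df.observe(obs,
                          F.count(F.lit(1)).alias("rows"),
                          F.max("amount").alias("max_amount"))
    observed.collect()   # an action must run before the metrics exist
    print(obs.get)       # e.g. {'rows': 2, 'max_amount': 20.0}
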
DataFrame.offset(num)

Returns a new DataFrame by skipping the first n rows.

DataFrame.orderBy(*cols, **kwargs)

Returns a new DataFrame sorted by the specified column(s).

DataFrame.persist([storageLevel])

Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed.

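A minimal persist/unpersist sketch on the illustrative df; the storage level below is just one possible choice:

    from pyspark import StorageLevel

    df.persist(StorageLevel.MEMORY_ONLY)
    df.count()       # the first action materializes the persisted data
    df.unpersist()   # release the blocks when no longer needed
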
DataFrame.plot

Returns a plot.core.PySparkPlotAccessor for plotting functions.

DataFrame.printSchema([level])

Prints out the schema in the tree format.

DataFrame.randomSplit(weights[, seed])

Randomly splits this DataFrame with the provided weights.

DataFrame.rdd

Returns the content as a pyspark.RDD of Row.

DataFrame.registerTempTable(name)

Registers this DataFrame as a temporary table using the given name.

DataFrame.repartition(numPartitions, *cols)

Returns a new DataFrame partitioned by the given partitioning expressions.

DataFrame.repartitionByRange(numPartitions, ...)

Returns a new DataFrame partitioned by the given partitioning expressions.

DataFrame.replace(to_replace[, value, subset])

Returns a new DataFrame replacing a value with another value.

DataFrame.rollup(*cols)

Create a multi-dimensional rollup for the current DataFrame using the specified columns, allowing for aggregation on them.

DataFrame.sameSemantics(other)

Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results.

DataFrame.sample([withReplacement, ...])

Returns a sampled subset of this DataFrame.

DataFrame.sampleBy(col, fractions[, seed])

Returns a stratified sample without replacement based on the fraction given on each stratum.

DataFrame.scalar()

Return a Column object for a SCALAR subquery containing exactly one row and one column.

DataFrame.schema

Returns the schema of this DataFrame as a pyspark.sql.types.StructType.

DataFrame.select(*cols)

Projects a set of expressions and returns a new DataFrame.

DataFrame.selectExpr(*expr)

Projects a set of SQL expressions and returns a new DataFrame.

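A short sketch contrasting select and selectExpr on the illustrative df; selectExpr accepts SQL expression strings directly:

    from pyspark.sql import functions as F

    df.select(F.col("amount") * 2).show()                # Column-based projection
    df.selectExpr("id", "amount * 2 AS doubled").show()  # SQL-string projection
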
DataFrame.semanticHash()

Returns a hash code of the logical query plan of this DataFrame.

DataFrame.show([n, truncate, vertical])

Prints the first n rows of the DataFrame to the console.

DataFrame.sort(*cols, **kwargs)

Returns a new DataFrame sorted by the specified column(s).

DataFrame.sortWithinPartitions(*cols, **kwargs)

Returns a new DataFrame with each partition sorted by the specified column(s).

DataFrame.sparkSession

Returns the Spark session that created this DataFrame.

DataFrame.stat

Returns a DataFrameStatFunctions for statistical functions.

DataFrame.storageLevel

Get the DataFrame's current storage level.

DataFrame.subtract(other)

Return a new DataFrame containing rows in this DataFrame but not in another DataFrame.

DataFrame.summary(*statistics)

Computes specified statistics for numeric and string columns.

DataFrame.tail(num)

Returns the last num rows as a list of Row.

DataFrame.take(num)

Returns the first num rows as a list of Row.

DataFrame.to(schema)

Returns a new DataFrame where each row is reconciled to match the specified schema.

DataFrame.toArrow()

Returns the contents of this DataFrame as a pyarrow.Table.

DataFrame.toDF(*cols)

Returns a new DataFrame with the new specified column names.

DataFrame.toJSON([use_unicode])

Converts a DataFrame into an RDD of strings.

DataFrame.toLocalIterator([prefetchPartitions])

Returns an iterator that contains all of the rows in this DataFrame.

DataFrame.toPandas()

Returns the contents of this DataFrame as a pandas.DataFrame.

DataFrame.transform(func, *args, **kwargs)

Returns a new DataFrame by applying func to it; concise syntax for chaining custom transformations.

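A sketch of chaining a custom transformation on the illustrative df; the helper with_tax and the rate argument are made-up names (extra *args/**kwargs are forwarded to the function):

    from pyspark.sql import functions as F

    def with_tax(frame, rate):
        return frame.withColumn("taxed", F.col("amount") * (1 + rate))

    # Keeps the fluent style: df -> transform -> further operations.
    df.transform(with_tax, 0.2).show()
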
DataFrame.transpose([indexColumn])

Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame.

DataFrame.union(other)

Return a new DataFrame containing the union of rows in this and another DataFrame.

DataFrame.unionAll(other)

Return a new DataFrame containing the union of rows in this and another DataFrame.

DataFrame.unionByName(other[, ...])

Returns a new DataFrame containing the union of rows in this and another DataFrame.

DataFrame.unpersist([blocking])

Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk.

DataFrame.unpivot(ids, values, ...)

Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set.

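A sketch of unpivot on an assumed wide frame; the last two arguments name the new variable and value columns:

    wide = spark.createDataFrame([(1, 11, 12)], ["id", "q1", "q2"])

    # One output row per (id, quarter) pair: (1, "q1", 11) and (1, "q2", 12).
    wide.unpivot("id", ["q1", "q2"], "quarter", "sales").show()
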
DataFrame.where(condition)

where() is an alias forfilter().

DataFrame.withColumn(colName, col)

Returns a new DataFrame by adding a column or replacing the existing column that has the same name.

DataFrame.withColumns(*colsMap)

Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names.

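A sketch adding two derived columns to the illustrative df in one pass; the new column names and expressions are assumptions:

    from pyspark.sql import functions as F

    df.withColumns({
        "amount_x2": F.col("amount") * 2,
        "is_large": F.col("amount") > 15,
    }).show()
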
DataFrame.withColumnRenamed(existing, new)

Returns a new DataFrame by renaming an existing column.

DataFrame.withColumnsRenamed(colsMap)

Returns a new DataFrame by renaming multiple columns.

DataFrame.withMetadata(columnName, metadata)

Returns a new DataFrame by updating an existing column with metadata.

DataFrame.withWatermark(eventTime, ...)

Defines an event time watermark for this DataFrame.

DataFrame.write

Interface for saving the content of the non-streaming DataFrame out into external storage.

DataFrame.writeStream

Interface for saving the content of the streaming DataFrame out into external storage.

DataFrame.writeTo(table)

Create a write configuration builder for v2 sources.

DataFrame.mergeInto(table, condition)

Merges a set of updates, insertions, and deletions based on a source table into a target table.

DataFrame.pandas_api([index_col])

Converts the existing DataFrame into a pandas-on-Spark DataFrame.

DataFrameNaFunctions.drop([how, thresh, subset])

Returns a new DataFrame omitting rows with null or NaN values.

DataFrameNaFunctions.fill(value[, subset])

Returns a new DataFrame in which null values are filled with a new value.

DataFrameNaFunctions.replace(to_replace[, ...])

Returns a new DataFrame replacing a value with another value.

DataFrameStatFunctions.approxQuantile(col, ...)

Calculates the approximate quantiles of numerical columns of a DataFrame.

DataFrameStatFunctions.corr(col1, col2[, method])

Calculates the correlation of two columns of a DataFrame as a double value.

DataFrameStatFunctions.cov(col1, col2)

Calculate the sample covariance for the given columns, specified by their names, as a double value.

DataFrameStatFunctions.crosstab(col1, col2)

Computes a pair-wise frequency table of the given columns.

DataFrameStatFunctions.freqItems(cols[, support])

Finds frequent items for columns, possibly with false positives.

DataFrameStatFunctions.sampleBy(col, fractions)

Returns a stratified sample without replacement based on the fraction given on each stratum.

Table Argument

DataFrame.asTable returns a table argument in PySpark.

This class provides methods to specify partitioning, ordering, and single-partition constraints when passing a DataFrame as a table argument to TVFs (Table-Valued Functions), including UDTFs (User-Defined Table Functions); a short sketch follows the method list below.

TableArg.partitionBy(*cols)

Partitions the data based on the specified columns.

TableArg.orderBy(*cols)

Orders the data within each partition by the specified columns.

TableArg.withSinglePartition()

Forces the data to be processed in a single partition.

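A sketch of passing a DataFrame as a partitioned table argument to a Python UDTF; the UDTF class, its filtering logic, and the partitioning column are all illustrative assumptions, and calling a UDTF directly with a TableArg follows the pattern shown in the asTable documentation:

    from pyspark.sql import Row
    from pyspark.sql.functions import udtf

    @udtf(returnType="id: bigint")
    class KeepLarge:
        def eval(self, row: Row):
            # Emit only rows whose id exceeds 5.
            if row.id > 5:
                yield (row.id,)

    df = spark.range(10)
    KeepLarge(df.asTable().partitionBy("id")).show()
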
Plotting

The DataFrame.plot attribute serves both as a callable method and a namespace, providing access to various plotting functions via the PySparkPlotAccessor. Users can call specific plotting methods in the format DataFrame.plot.<kind>.

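For instance, a line plot over two assumed numeric columns; this sketch assumes the default plotly backend, whose figures expose show():

    plot_df = spark.createDataFrame([(1, 2.0), (2, 4.0), (3, 1.5)], ["x", "y"])
    fig = plot_df.plot.line(x="x", y="y")  # returns a figure from the plotting backend
    fig.show()
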
PySparkPlotAccessor.area(x, y, **kwargs)

Draw a stacked area plot.

PySparkPlotAccessor.bar(x, y, **kwargs)

Vertical bar plot.

PySparkPlotAccessor.barh(x, y, **kwargs)

Make a horizontal bar plot.

PySparkPlotAccessor.line(x, y, **kwargs)

Plot DataFrame as lines.

PySparkPlotAccessor.pie(x, y, **kwargs)

Generate a pie plot.

PySparkPlotAccessor.scatter(x, y, **kwargs)

Create a scatter plot with varying marker point size and color.

PySparkPlotAccessor.box([column])

Make a box plot of the DataFrame columns.

PySparkPlotAccessor.kde(bw_method[, column, ind])

Generate Kernel Density Estimate plot using Gaussian kernels.

PySparkPlotAccessor.hist([column, bins])

Draw one histogram of the DataFrame’s columns.

