DataFrame.__getattr__(name)
| Returns the Column denoted by name. |
DataFrame.__getitem__(item)
| Returns the column as a Column. |
DataFrame.agg(*exprs)
| Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
DataFrame.alias(alias)
| Returns a new DataFrame with an alias set. |
DataFrame.approxQuantile(col, probabilities, ...)
| Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrame.asTable()
| Converts the DataFrame into a table_arg.TableArg object, which can be used as a table argument in a TVF (Table-Valued Function), including a UDTF (User-Defined Table Function). |
DataFrame.cache()
| Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER). |
DataFrame.checkpoint([eager])
| Returns a checkpointed version of this DataFrame. |
DataFrame.coalesce(numPartitions)
| Returns a new DataFrame that has exactly numPartitions partitions. |
DataFrame.colRegex(colName)
| Selects a column based on the column name specified as a regex and returns it as a Column. |
DataFrame.collect()
| Returns all the records in the DataFrame as a list of Row. |
DataFrame.columns
| Retrieves the names of all columns in the DataFrame as a list. |
DataFrame.corr(col1, col2[, method])
| Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrame.count()
| Returns the number of rows in this DataFrame. |
DataFrame.cov(col1, col2)
| Calculate the sample covariance for the given columns, specified by their names, as a double value. |
DataFrame.createGlobalTempView(name)
| Creates a global temporary view with this DataFrame. |
DataFrame.createOrReplaceGlobalTempView(name)
| Creates or replaces a global temporary view using the given name. |
DataFrame.createOrReplaceTempView(name)
| Creates or replaces a local temporary view with this DataFrame. |
DataFrame.createTempView(name)
| Creates a local temporary view with this DataFrame. |
DataFrame.crossJoin(other)
| Returns the Cartesian product with another DataFrame. |
DataFrame.crosstab(col1, col2)
| Computes a pair-wise frequency table of the given columns. |
DataFrame.cube(*cols)
| Create a multi-dimensional cube for the current DataFrame using the specified columns, allowing aggregations to be performed on them. |
DataFrame.describe(*cols)
| Computes basic statistics for numeric and string columns. |
DataFrame.distinct()
| Returns a new DataFrame containing the distinct rows in this DataFrame. |
DataFrame.drop(*cols)
| Returns a new DataFrame without specified columns. |
DataFrame.dropDuplicates([subset])
| Return a new DataFrame with duplicate rows removed, optionally only considering certain columns. |
DataFrame.dropDuplicatesWithinWatermark([subset])
| Return a new DataFrame with duplicate rows removed within the watermark, optionally only considering certain columns. |
DataFrame.drop_duplicates([subset])
| drop_duplicates() is an alias for dropDuplicates(). |
DataFrame.dropna([how, thresh, subset])
| Returns a new DataFrame omitting rows with null or NaN values. |
DataFrame.dtypes
| Returns all column names and their data types as a list. |
DataFrame.exceptAll(other)
| Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
DataFrame.executionInfo
| Returns an ExecutionInfo object after the query was executed. |
DataFrame.exists()
| Return a Column object for an EXISTS Subquery. |
DataFrame.explain([extended, mode])
| Prints the (logical and physical) plans to the console for debugging purposes. |
DataFrame.fillna(value[, subset])
| Returns a new DataFrame in which null values are replaced with the specified value. |
DataFrame.filter(condition)
| Filters rows using the given condition. |
DataFrame.first()
| Returns the first row as a Row. |
DataFrame.foreach(f)
| Applies the f function to all rows of this DataFrame. |
DataFrame.foreachPartition(f)
| Applies the f function to each partition of this DataFrame. |
DataFrame.freqItems(cols[, support])
| Finds frequent items for columns, possibly with false positives. |
DataFrame.groupBy(*cols)
| Groups the DataFrame by the specified columns so that aggregation can be performed on them. |
DataFrame.groupingSets(groupingSets, *cols)
| Create multi-dimensional aggregation for the current DataFrame using the specified grouping sets, so that aggregations can be run on them. |
DataFrame.head([n])
| Returns the first n rows. |
DataFrame.hint(name, *parameters)
| Specifies some hint on the current DataFrame. |
DataFrame.inputFiles()
| Returns a best-effort snapshot of the files that compose this DataFrame. |
DataFrame.intersect(other)
| Return a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
DataFrame.intersectAll(other)
| Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
DataFrame.isEmpty()
| Checks if the DataFrame is empty and returns a boolean value. |
DataFrame.isLocal()
| Returns True if the collect() and take() methods can be run locally (without any Spark executors). |
DataFrame.isStreaming
| Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. |
DataFrame.join(other[, on, how])
| Joins with another DataFrame, using the given join expression. |
DataFrame.limit(num)
| Limits the result count to the number specified. |
DataFrame.lateralJoin(other[, on, how])
| Lateral joins with another DataFrame, using the given join expression. |
DataFrame.localCheckpoint([eager, storageLevel])
| Returns a locally checkpointed version of this DataFrame. |
DataFrame.mapInPandas(func, schema[, ...])
| Maps an iterator of batches in the current DataFrame using a Python native function that takes pandas DataFrames as both input and output, and returns the result as a DataFrame. |
DataFrame.mapInArrow(func, schema[, ...])
| Maps an iterator of batches in the current DataFrame using a Python native function that takes pyarrow.RecordBatch objects as both input and output, and returns the result as a DataFrame. |
DataFrame.metadataColumn(colName)
| Selects a metadata column based on its logical column name and returns it as a Column. |
DataFrame.melt(ids, values, ...)
| Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. |
DataFrame.na
| Returns a DataFrameNaFunctions for handling missing values. |
DataFrame.observe(observation, *exprs)
| Define (named) metrics to observe on the DataFrame. |
DataFrame.offset(num)
| Returns a new DataFrame by skipping the first n rows. |
DataFrame.orderBy(*cols, **kwargs)
| Returns a new DataFrame sorted by the specified column(s). |
DataFrame.persist([storageLevel])
| Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed. |
DataFrame.plot
| Returns a plot.core.PySparkPlotAccessor for plotting functions. |
DataFrame.printSchema([level])
| Prints out the schema in the tree format. |
DataFrame.randomSplit(weights[, seed])
| Randomly splits this DataFrame with the provided weights. |
DataFrame.rdd
| Returns the content as a pyspark.RDD of Row. |
DataFrame.registerTempTable(name)
| Registers this DataFrame as a temporary table using the given name. |
DataFrame.repartition(numPartitions, *cols)
| Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame.repartitionByRange(numPartitions, ...)
| Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame.replace(to_replace[, value, subset])
| Returns a new DataFrame replacing a value with another value. |
DataFrame.rollup(*cols)
| Create a multi-dimensional rollup for the current DataFrame using the specified columns, allowing for aggregation on them. |
DataFrame.sameSemantics(other)
| Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
DataFrame.sample([withReplacement, ...])
| Returns a sampled subset of this DataFrame. |
DataFrame.sampleBy(col, fractions[, seed])
| Returns a stratified sample without replacement based on the fraction given on each stratum. |
DataFrame.scalar()
| Return a Column object for a SCALAR Subquery containing exactly one row and one column. |
DataFrame.schema
| Returns the schema of this DataFrame as a pyspark.sql.types.StructType. |
DataFrame.select(*cols)
| Projects a set of expressions and returns a new DataFrame. |
DataFrame.selectExpr(*expr)
| Projects a set of SQL expressions and returns a new DataFrame. |
DataFrame.semanticHash()
| Returns a hash code of the logical query plan against this DataFrame. |
DataFrame.show([n, truncate, vertical])
| Prints the first n rows of the DataFrame to the console. |
DataFrame.sort(*cols, **kwargs)
| Returns a new DataFrame sorted by the specified column(s). |
DataFrame.sortWithinPartitions(*cols, **kwargs)
| Returns a new DataFrame with each partition sorted by the specified column(s). |
DataFrame.sparkSession
| Returns the Spark session that created this DataFrame. |
DataFrame.stat
| Returns a DataFrameStatFunctions for statistical functions. |
DataFrame.storageLevel
| Get the DataFrame's current storage level. |
DataFrame.subtract(other)
| Return a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
DataFrame.summary(*statistics)
| Computes specified statistics for numeric and string columns. |
DataFrame.tail(num)
| Returns the last num rows as a list of Row. |
DataFrame.take(num)
| Returns the first num rows as a list of Row. |
DataFrame.to(schema)
| Returns a new DataFrame where each row is reconciled to match the specified schema. |
DataFrame.toArrow()
| Returns the contents of this DataFrame as a PyArrow pyarrow.Table. |
DataFrame.toDF(*cols)
| Returns a new DataFrame with the new specified column names. |
DataFrame.toJSON([use_unicode])
| Converts a DataFrame into an RDD of strings. |
DataFrame.toLocalIterator([prefetchPartitions])
| Returns an iterator that contains all of the rows in this DataFrame. |
DataFrame.toPandas()
| Returns the contents of this DataFrame as a pandas.DataFrame. |
DataFrame.transform(func, *args, **kwargs)
| Returns a new DataFrame produced by applying the given transformation function (concise syntax for chaining custom transformations). |
DataFrame.transpose([indexColumn])
| Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame. |
DataFrame.union(other)
| Return a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame.unionAll(other)
| Return a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame.unionByName(other[, ...])
| Returns a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame.unpersist([blocking])
| Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
DataFrame.unpivot(ids, values, ...)
| Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. |
DataFrame.where(condition)
| where() is an alias for filter(). |
DataFrame.withColumn(colName, col)
| Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
DataFrame.withColumns(*colsMap)
| Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names. |
DataFrame.withColumnRenamed(existing, new)
| Returns a new DataFrame by renaming an existing column. |
DataFrame.withColumnsRenamed(colsMap)
| Returns a new DataFrame by renaming multiple columns. |
DataFrame.withMetadata(columnName, metadata)
| Returns a new DataFrame by updating an existing column with metadata. |
DataFrame.withWatermark(eventTime, ...)
| Defines an event time watermark for this DataFrame. |
DataFrame.write
| Interface for saving the content of the non-streaming DataFrame out into external storage. |
DataFrame.writeStream
| Interface for saving the content of the streaming DataFrame out into external storage. |
DataFrame.writeTo(table)
| Create a write configuration builder for v2 sources. |
DataFrame.mergeInto(table, condition)
| Merges a set of updates, insertions, and deletions based on a source table into a target table. |
DataFrame.pandas_api([index_col])
| Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
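A minimal, illustrative sketch (not part of the reference above; the application name, column names, and data are hypothetical) showing how a few of the DataFrame methods listed above are typically combined:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("dataframe-api-sketch").getOrCreate()

    # Hypothetical input data with an explicit DDL schema string.
    df = spark.createDataFrame(
        [("a", 1), ("a", 2), ("b", 3)],
        "key string, value int",
    )

    # filter, groupBy().agg(), and orderBy as summarized in the table above.
    result = (
        df.filter(F.col("value") > 1)
        .groupBy("key")
        .agg(F.sum("value").alias("total"))
        .orderBy("key")
    )
    result.show()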
DataFrameNaFunctions.drop([how, thresh, subset])
| Returns a new DataFrame omitting rows with null or NaN values. |
DataFrameNaFunctions.fill(value[, subset])
| Returns a new DataFrame in which null values are replaced with the specified value. |
DataFrameNaFunctions.replace(to_replace[, ...])
| Returns a new DataFrame replacing a value with another value. |
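A minimal sketch (again with hypothetical data) showing the DataFrameNaFunctions methods above, which are normally reached through the DataFrame.na accessor:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical data containing missing values.
    df = spark.createDataFrame(
        [("a", None), (None, 2), ("b", 3)],
        "key string, value int",
    )

    df.na.drop(how="any").show()                        # drop rows with any null
    df.na.fill({"key": "unknown", "value": 0}).show()   # fill nulls per column
    df.na.replace("a", "A", subset=["key"]).show()      # replace a value in one column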
DataFrameStatFunctions.approxQuantile(col, ...)
| Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrameStatFunctions.corr(col1, col2[, method])
| Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrameStatFunctions.cov(col1, col2)
| Calculate the sample covariance for the given columns, specified by their names, as a double value. |
DataFrameStatFunctions.crosstab(col1, col2)
| Computes a pair-wise frequency table of the given columns. |
DataFrameStatFunctions.freqItems(cols[, support])
| Finds frequent items for columns, possibly with false positives. |
DataFrameStatFunctions.sampleBy(col, fractions)
| Returns a stratified sample without replacement based on the fraction given on each stratum. |
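Likewise, a minimal sketch (hypothetical data and column names) showing the DataFrameStatFunctions methods above, reached through the DataFrame.stat accessor:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Hypothetical numeric and categorical columns.
    df = spark.createDataFrame(
        [(1, 10.0, "x"), (2, 20.0, "y"), (3, 30.0, "x"), (4, 40.0, "y")],
        "id int, score double, grp string",
    )

    print(df.stat.corr("id", "score"))                    # Pearson correlation
    print(df.stat.approxQuantile("score", [0.5], 0.01))   # approximate median
    df.stat.crosstab("grp", "id").show()                  # pair-wise frequency table
    df.stat.sampleBy("grp", {"x": 0.5, "y": 0.5}, seed=42).show()  # stratified sample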