DataFrame.__getattr__ (name)
| Returns the Column denoted by name. |
DataFrame.__getitem__ (item)
| Returns the column as a Column. |
DataFrame.agg (*exprs)
| Aggregate on the entire DataFrame without groups (shorthand for df.groupBy().agg()). |
DataFrame.alias (alias)
| Returns a new DataFrame with an alias set. |
DataFrame.approxQuantile (col, probabilities, ...)
| Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrame.asTable ()
| Converts the DataFrame into a table_arg.TableArg object, which can be used as a table argument in a TVF (Table-Valued Function), including a UDTF (User-Defined Table Function). |
DataFrame.cache ()
| Persists the DataFrame with the default storage level (MEMORY_AND_DISK_DESER). |
DataFrame.checkpoint ([eager])
| Returns a checkpointed version of this DataFrame. |
DataFrame.coalesce (numPartitions)
| Returns a new DataFrame that has exactly numPartitions partitions. |
DataFrame.colRegex (colName)
| Selects a column based on the column name specified as a regex and returns it as a Column. |
DataFrame.collect ()
| Returns all the records in the DataFrame as a list of Row. |
DataFrame.columns
| Retrieves the names of all columns in the DataFrame as a list. |
DataFrame.corr (col1, col2[, method])
| Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrame.count ()
| Returns the number of rows in this DataFrame. |
DataFrame.cov (col1, col2)
| Calculate the sample covariance for the given columns, specified by their names, as a double value. |
DataFrame.createGlobalTempView (name)
| Creates a global temporary view with this DataFrame. |
DataFrame.createOrReplaceGlobalTempView (name)
| Creates or replaces a global temporary view using the given name. |
DataFrame.createOrReplaceTempView (name)
| Creates or replaces a local temporary view with this DataFrame. |
DataFrame.createTempView (name)
| Creates a local temporary view with this DataFrame. |
DataFrame.crossJoin (other)
| Returns the cartesian product with another DataFrame. |
DataFrame.crosstab (col1, col2)
| Computes a pair-wise frequency table of the given columns. |
DataFrame.cube (*cols)
| Create a multi-dimensional cube for the current DataFrame using the specified columns, allowing aggregations to be performed on them. |
DataFrame.describe (*cols)
| Computes basic statistics for numeric and string columns. |
DataFrame.distinct ()
| Returns a new DataFrame containing the distinct rows in this DataFrame. |
DataFrame.drop (*cols)
| Returns a new DataFrame without specified columns. |
DataFrame.dropDuplicates ([subset])
| Return a new DataFrame with duplicate rows removed, optionally only considering certain columns. |
DataFrame.dropDuplicatesWithinWatermark ([subset])
| Return a new DataFrame with duplicate rows removed, optionally only considering certain columns, within watermark. |
DataFrame.drop_duplicates ([subset])
| drop_duplicates() is an alias for dropDuplicates(). |
DataFrame.dropna ([how, thresh, subset])
| Returns a new DataFrame omitting rows with null or NaN values. |
DataFrame.dtypes
| Returns all column names and their data types as a list. |
DataFrame.exceptAll (other)
| Return a new DataFrame containing rows in this DataFrame but not in another DataFrame while preserving duplicates. |
DataFrame.executionInfo
| Returns an ExecutionInfo object after the query was executed. |
DataFrame.exists ()
| Return a Column object for an EXISTS Subquery. |
DataFrame.explain ([extended, mode])
| Prints the (logical and physical) plans to the console for debugging purposes. |
DataFrame.fillna (value[, subset])
| Returns a new DataFrame in which null values are filled with a new value. |
DataFrame.filter (condition)
| Filters rows using the given condition. |
DataFrame.first ()
| Returns the first row as a Row. |
DataFrame.foreach (f)
| Applies the function f to every Row of this DataFrame. |
DataFrame.foreachPartition (f)
| Applies the function f to each partition of this DataFrame. |
DataFrame.freqItems (cols[, support])
| Finds frequent items for columns, possibly with false positives. |
DataFrame.groupBy (*cols)
| Groups the DataFrame by the specified columns so that aggregation can be performed on them. |
DataFrame.groupingSets (groupingSets, *cols)
| Create multi-dimensional aggregation for the current DataFrame using the specified grouping sets, so we can run aggregation on them. |
DataFrame.head ([n])
| Returns the first n rows. |
DataFrame.hint (name, *parameters)
| Specifies some hint on the current DataFrame. |
DataFrame.inputFiles ()
| Returns a best-effort snapshot of the files that compose this DataFrame. |
DataFrame.intersect (other)
| Return a new DataFrame containing rows only in both this DataFrame and another DataFrame. |
DataFrame.intersectAll (other)
| Return a new DataFrame containing rows in both this DataFrame and another DataFrame while preserving duplicates. |
DataFrame.isEmpty ()
| Checks if the DataFrame is empty and returns a boolean value. |
DataFrame.isLocal ()
| Returns True if the collect() and take() methods can be run locally (without any Spark executors). |
DataFrame.isStreaming
| Returns True if this DataFrame contains one or more sources that continuously return data as it arrives. |
DataFrame.join (other[, on, how])
| Joins with another DataFrame, using the given join expression. |
DataFrame.limit (num)
| Limits the result count to the number specified. |
DataFrame.lateralJoin (other[, on, how])
| Lateral joins with another DataFrame, using the given join expression. |
DataFrame.localCheckpoint ([eager, storageLevel])
| Returns a locally checkpointed version of this DataFrame. |
DataFrame.mapInPandas (func, schema[, ...])
| Maps an iterator of batches in the current DataFrame using a Python native function that is performed on pandas DataFrames both as input and output, and returns the result as a DataFrame. |
DataFrame.mapInArrow (func, schema[, ...])
| Maps an iterator of batches in the current DataFrame using a Python native function that is performed on pyarrow.RecordBatch objects both as input and output, and returns the result as a DataFrame. |
DataFrame.metadataColumn (colName)
| Selects a metadata column based on its logical column name and returns it as a Column. |
DataFrame.melt (ids, values, ...)
| Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. |
DataFrame.na
| Returns a DataFrameNaFunctions for handling missing values. |
DataFrame.observe (observation, *exprs)
| Define (named) metrics to observe on the DataFrame. |
DataFrame.offset (num)
| Returns a new DataFrame by skipping the first n rows. |
DataFrame.orderBy (*cols, **kwargs)
| Returns a new DataFrame sorted by the specified column(s). |
DataFrame.persist ([storageLevel])
| Sets the storage level to persist the contents of the DataFrame across operations after the first time it is computed. |
DataFrame.plot
| Returns a plot.core.PySparkPlotAccessor for plotting functions. |
DataFrame.printSchema ([level])
| Prints out the schema in the tree format. |
DataFrame.randomSplit (weights[, seed])
| Randomly splits this DataFrame with the provided weights. |
DataFrame.rdd
| Returns the content as a pyspark.RDD of Row. |
DataFrame.registerTempTable (name)
| Registers this DataFrame as a temporary table using the given name. |
DataFrame.repartition (numPartitions, *cols)
| Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame.repartitionByRange (numPartitions, ...)
| Returns a new DataFrame partitioned by the given partitioning expressions. |
DataFrame.replace (to_replace[, value, subset])
| Returns a new DataFrame replacing a value with another value. |
DataFrame.rollup (*cols)
| Create a multi-dimensional rollup for the current DataFrame using the specified columns, allowing for aggregation on them. |
DataFrame.sameSemantics (other)
| Returns True when the logical query plans inside both DataFrames are equal and therefore return the same results. |
DataFrame.sample ([withReplacement, ...])
| Returns a sampled subset of this DataFrame. |
DataFrame.sampleBy (col, fractions[, seed])
| Returns a stratified sample without replacement based on the fraction given on each stratum. |
DataFrame.scalar ()
| Return a Column object for a SCALAR Subquery containing exactly one row and one column. |
DataFrame.schema
| Returns the schema of this DataFrame as a pyspark.sql.types.StructType. |
DataFrame.select (*cols)
| Projects a set of expressions and returns a new DataFrame. |
DataFrame.selectExpr (*expr)
| Projects a set of SQL expressions and returns a new DataFrame. |
DataFrame.semanticHash ()
| Returns a hash code of the logical query plan against this DataFrame. |
DataFrame.show ([n, truncate, vertical])
| Prints the first n rows of the DataFrame to the console. |
DataFrame.sort (*cols, **kwargs)
| Returns a new DataFrame sorted by the specified column(s). |
DataFrame.sortWithinPartitions (*cols, **kwargs)
| Returns a new DataFrame with each partition sorted by the specified column(s). |
DataFrame.sparkSession
| Returns the Spark session that created this DataFrame. |
DataFrame.stat
| Returns a DataFrameStatFunctions for statistic functions. |
DataFrame.storageLevel
| Get the DataFrame's current storage level. |
DataFrame.subtract (other)
| Return a new DataFrame containing rows in this DataFrame but not in another DataFrame. |
DataFrame.summary (*statistics)
| Computes specified statistics for numeric and string columns. |
DataFrame.tail (num)
| Returns the last num rows as a list of Row. |
DataFrame.take (num)
| Returns the first num rows as a list of Row. |
DataFrame.to (schema)
| Returns a new DataFrame where each row is reconciled to match the specified schema. |
DataFrame.toArrow ()
| Returns the contents of this DataFrame as a PyArrow pyarrow.Table. |
DataFrame.toDF (*cols)
| Returns a new DataFrame with the newly specified column names. |
DataFrame.toJSON ([use_unicode])
| Converts a DataFrame into an RDD of strings. |
DataFrame.toLocalIterator ([prefetchPartitions])
| Returns an iterator that contains all of the rows in this DataFrame. |
DataFrame.toPandas ()
| Returns the contents of this DataFrame as a pandas pandas.DataFrame. |
DataFrame.transform (func, *args, **kwargs)
| Returns a new DataFrame produced by applying func to this DataFrame; a concise syntax for chaining custom transformations. |
DataFrame.transpose ([indexColumn])
| Transposes a DataFrame such that the values in the specified index column become the new columns of the DataFrame. |
DataFrame.union (other)
| Return a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame.unionAll (other)
| Return a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame.unionByName (other[, ...])
| Returns a new DataFrame containing the union of rows in this and another DataFrame. |
DataFrame.unpersist ([blocking])
| Marks the DataFrame as non-persistent, and removes all blocks for it from memory and disk. |
DataFrame.unpivot (ids, values, ...)
| Unpivot a DataFrame from wide format to long format, optionally leaving identifier columns set. |
DataFrame.where (condition)
| where() is an alias for filter(). |
DataFrame.withColumn (colName, col)
| Returns a new DataFrame by adding a column or replacing the existing column that has the same name. |
DataFrame.withColumns (*colsMap)
| Returns a new DataFrame by adding multiple columns or replacing the existing columns that have the same names. |
DataFrame.withColumnRenamed (existing, new)
| Returns a new DataFrame by renaming an existing column. |
DataFrame.withColumnsRenamed (colsMap)
| Returns a new DataFrame by renaming multiple columns. |
DataFrame.withMetadata (columnName, metadata)
| Returns a new DataFrame by updating an existing column with metadata. |
DataFrame.withWatermark (eventTime, ...)
| Defines an event time watermark for this DataFrame. |
DataFrame.write
| Interface for saving the content of the non-streaming DataFrame out into external storage. |
DataFrame.writeStream
| Interface for saving the content of the streaming DataFrame out into external storage. |
DataFrame.writeTo (table)
| Create a write configuration builder for v2 sources. |
DataFrame.mergeInto (table, condition)
| Merges a set of updates, insertions, and deletions based on a source table into a target table. |
DataFrame.pandas_api ([index_col])
| Converts the existing DataFrame into a pandas-on-Spark DataFrame. |
DataFrameNaFunctions.drop ([how, thresh, subset])
| Returns a new DataFrame omitting rows with null or NaN values. |
DataFrameNaFunctions.fill (value[, subset])
| Returns a new DataFrame in which null values are filled with a new value. |
DataFrameNaFunctions.replace (to_replace[, ...])
| Returns a new DataFrame replacing a value with another value. |
DataFrameStatFunctions.approxQuantile (col, ...)
| Calculates the approximate quantiles of numerical columns of a DataFrame. |
DataFrameStatFunctions.corr (col1, col2[, method])
| Calculates the correlation of two columns of a DataFrame as a double value. |
DataFrameStatFunctions.cov (col1, col2)
| Calculate the sample covariance for the given columns, specified by their names, as a double value. |
DataFrameStatFunctions.crosstab (col1, col2)
| Computes a pair-wise frequency table of the given columns. |
DataFrameStatFunctions.freqItems (cols[, support])
| Finds frequent items for columns, possibly with false positives. |
DataFrameStatFunctions.sampleBy (col, fractions)
| Returns a stratified sample without replacement based on the fraction given on each stratum. |