<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Functions on PySpark Is Rad</title>
    <link>https://pysparkisrad.com/functions/</link>
    <description>Recent content in Functions on PySpark Is Rad</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language><atom:link href="https://pysparkisrad.com/functions/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>abs</title>
      <link>https://pysparkisrad.com/functions/abs/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/abs/</guid>
      <description>abs # pyspark.sql.functions.abs(col) # version: since 1.3 Computes the absolute value.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1},{&amp;#34;num&amp;#34;: -2},{&amp;#34;num&amp;#34;: 0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;absolute&amp;#34;, F.abs(&amp;#34;num&amp;#34;)) ) df.show() num absolute 1 1 -2 2 0 0 Usage:
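A related pattern, hinted at in the usage note below, is taking the absolute difference between two columns. A minimal sketch with made-up data (the columns a and b are hypothetical):
from pyspark.sql import functions as F
data = [{"a": 10, "b": 12}, {"a": 7, "b": 3}]
df2 = spark.createDataFrame(data)
# Absolute difference between the two columns: 2 and 4
df2 = df2.withColumn("abs_diff", F.abs(F.col("a") - F.col("b")))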
This is just a basic math function. Nothing special about it. I&amp;rsquo;ve used it before when doing subtraction between two columns.</description>
    </item>
    
    <item>
      <title>acos</title>
      <link>https://pysparkisrad.com/functions/acos/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/acos/</guid>
      <description>acos # pyspark.sql.functions.acos(col) # version: since 1.4 Inverse cosine of col, as if computed by java.lang.Math.acos()
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1.0},{&amp;#34;num&amp;#34;: 0.5},{&amp;#34;num&amp;#34;: 0.0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;acos&amp;#34;, F.acos(&amp;#34;num&amp;#34;)) ) df.show() num acos 1.0 0.0 0.5 1.0471975511965979 0.0 1.5707963267948966 Usage:
This is just a basic math function. Nothing special about it. Never used it.</description>
    </item>
    
    <item>
      <title>acosh</title>
      <link>https://pysparkisrad.com/functions/acosh/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/acosh/</guid>
      <description>acosh # pyspark.sql.functions.acosh(col) # version: since 3.1.0 Computes inverse hyperbolic cosine of the input column.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1.0},{&amp;#34;num&amp;#34;: 2.5},{&amp;#34;num&amp;#34;: 5.0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;acosh&amp;#34;, F.acosh(&amp;#34;num&amp;#34;)) ) df.show() num acosh 1.0 0.0 2.5 1.566799236972411 5.0 2.2924316695611777 Usage:
This is just a basic math function. Nothing special about it. In fact I have no idea what it means.</description>
    </item>
    
    <item>
      <title>add_months</title>
      <link>https://pysparkisrad.com/functions/add_months/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/add_months/</guid>
      <description>add_months # pyspark.sql.functions.add_months(start, months) # version: since 1.5.0 Returns the date that is months months after start
start: date column
months: integer
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;date&amp;#34;: &amp;#39;2047-04-08&amp;#39;}, {&amp;#34;date&amp;#34;: &amp;#39;1999-12-31&amp;#39;}, {&amp;#34;date&amp;#34;: &amp;#39;1906-02-28&amp;#39;}] df = spark.createDataFrame(data) df = df.select(F.to_date(df.date, &amp;#39;yyyy-MM-dd&amp;#39;) .alias(&amp;#34;date&amp;#34;)) # Use function df = (df .withColumn(&amp;#34;add_months&amp;#34;, F.add_months(F.col(&amp;#34;date&amp;#34;),3)) ) df.show() date add_months 2047-04-08 2047-07-08 1999-12-31 2000-03-31 1906-02-28 1906-05-28 Usage:</description>
    </item>
    
    <item>
      <title>array</title>
      <link>https://pysparkisrad.com/functions/array/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array/</guid>
      <description>array # pyspark.sql.functions.array(*cols) # version: since 1.4.0 Creates a new array column.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: 1,&amp;#34;b&amp;#34;: 2},{&amp;#34;a&amp;#34;: 3,&amp;#34;b&amp;#34;: 4},{&amp;#34;a&amp;#34;: 5,&amp;#34;b&amp;#34;: 6}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;array&amp;#34;, F.array(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;))) ) df.show() a b array 1 2 [1, 2] 3 4 [3, 4] 5 6 [5, 6] Usage:
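The usage note below mentions building a placeholder array by filling it with None; a minimal sketch of that idea (made-up data, with a cast so the element type is explicit):
from pyspark.sql import functions as F
data = [{"a": 1}, {"a": 2}]
df2 = spark.createDataFrame(data)
# Array holding a single null string element, usable as a placeholder
df2 = df2.withColumn("placeholder", F.array(F.lit(None).cast("string")))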
I use this often. It is also handy for creating a placeholder array by filling it with None, as in the sketch above.</description>
    </item>
    
    <item>
      <title>array_contains</title>
      <link>https://pysparkisrad.com/functions/array_contains/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_contains/</guid>
      <description>array_contains # pyspark.sql.functions.array_contains(col, value) # version: since 1.5.0 Collection function: returns null if the array is null, true if the array contains the given value, and false otherwise.
value: value or column to check for in an array
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: [],&amp;#34;b&amp;#34;: 1},{&amp;#34;a&amp;#34;: [1,2,2],&amp;#34;b&amp;#34;: 1}, {&amp;#34;a&amp;#34;: [4,5,5],&amp;#34;b&amp;#34;: 1}] df = spark.createDataFrame(data) # Use function df = (df .</description>
    </item>
    
    <item>
      <title>array_distinct</title>
      <link>https://pysparkisrad.com/functions/array_distinct/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_distinct/</guid>
      <description>array_distinct # pyspark.sql.functions.array_distinct(col) # version: since 2.4.0 Collection function: removes duplicate values from the array.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: [1,2,2],&amp;#34;b&amp;#34;: 1},{&amp;#34;b&amp;#34;: 1}] df = spark.createDataFrame(data).drop(&amp;#34;b&amp;#34;) # Use function df = (df .withColumn(&amp;#34;array_distinct&amp;#34;, F.array_distinct(F.col(&amp;#34;a&amp;#34;))) ) df.show() a array_distinct [1, 2, 2] [1, 2] null null Usage:
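A sketch of the combine-then-deduplicate pattern mentioned in the usage note below (made-up data; F.concat accepts array columns from Spark 2.4):
from pyspark.sql import functions as F
data = [{"a": [1, 2], "b": [2, 3]}]
df2 = spark.createDataFrame(data)
# Merge the two arrays, then drop duplicate elements: [1, 2, 3]
df2 = df2.withColumn("merged", F.array_distinct(F.concat(F.col("a"), F.col("b"))))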
Simple array function. I have used it a lot. Especially when combining two columns of arrays that may have the same values in them.</description>
    </item>
    
    <item>
      <title>array_except</title>
      <link>https://pysparkisrad.com/functions/array_except/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_except/</guid>
      <description>array_except # pyspark.sql.functions.array_except(col1, col2) # version: since 2.4.0 Collection function: returns an array of the elements in col1 but not in col2, without duplicates.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: [1,2,2],&amp;#34;b&amp;#34;: [3,2,2]},{&amp;#34;b&amp;#34;: [1,2,2]},{&amp;#34;a&amp;#34;: [4,5,5],&amp;#34;b&amp;#34;: [1,2,2]},{&amp;#34;a&amp;#34;: [4,5,5]}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;array_except&amp;#34;, F.array_except(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;))) ) df.show() a b array_except [1, 2, 2] [3, 2, 2] [1] null [1, 2, 2] null [4, 5, 5] [1, 2, 2] [4, 5] [4, 5, 5] null null Usage:</description>
    </item>
    
    <item>
      <title>array_intersect</title>
      <link>https://pysparkisrad.com/functions/array_intersect/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_intersect/</guid>
      <description>array_intersect # pyspark.sql.functions.array_intersect(col1, col2) # version: since 2.4.0 Collection function: returns an array of the elements in the intersection of col1 and col2, without duplicates.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: [1,2,2],&amp;#34;b&amp;#34;: [3,2,2]},{&amp;#34;b&amp;#34;: [1,2,2]},{&amp;#34;a&amp;#34;: [4,5,5],&amp;#34;b&amp;#34;: [1,2,2]},{&amp;#34;a&amp;#34;: [4,5,5]}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;array_intersect&amp;#34;, F.array_intersect(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;))) ) df.show() a b array_intersect [1, 2, 2] [3, 2, 2] [2] null [1, 2, 2] null [4, 5, 5] [1, 2, 2] [] [4, 5, 5] null null Usage:</description>
    </item>
    
    <item>
      <title>array_join</title>
      <link>https://pysparkisrad.com/functions/array_join/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_join/</guid>
      <description>array_join # pyspark.sql.functions.array_join(col, delimiter, null_replacement=None) # version: since 2.4.0 Concatenates the elements of column using the delimiter. Null values are replaced with null_replacement if set, otherwise they are ignored.
delimiter: string that goes between elements
null_replacement: string to use in place of null elements
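A minimal sketch showing the null_replacement argument (made-up data, separate from the runnable snippet below):
from pyspark.sql import functions as F
data = [{"a": ["x", None, "z"]}]
df2 = spark.createDataFrame(data)
# Nulls are replaced with the string NA, giving "x-NA-z"
df2 = df2.withColumn("joined", F.array_join(F.col("a"), "-", null_replacement="NA"))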
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;:1,&amp;#34;b&amp;#34;:2,&amp;#34;c&amp;#34;:2},{&amp;#34;a&amp;#34;:3,&amp;#34;c&amp;#34;:5}] df = spark.createDataFrame(data) df = df.select(F.array(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;),F.col(&amp;#34;c&amp;#34;)).alias(&amp;#34;a&amp;#34;)) # Use function df = (df .</description>
    </item>
    
    <item>
      <title>array_max</title>
      <link>https://pysparkisrad.com/functions/array_max/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_max/</guid>
      <description>array_max # pyspark.sql.functions.array_max(col) # version: since 2.4.0 Collection function: returns the maximum value of the array.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;:1,&amp;#34;b&amp;#34;:2,&amp;#34;c&amp;#34;:2},{&amp;#34;a&amp;#34;:3,&amp;#34;c&amp;#34;:5}] df = spark.createDataFrame(data) df = df.select(F.array(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;),F.col(&amp;#34;c&amp;#34;)).alias(&amp;#34;a&amp;#34;)) # Use function df = (df .withColumn(&amp;#34;array_max&amp;#34;, F.array_max(F.col(&amp;#34;a&amp;#34;))) ) df.show() a array_max [1, 2, 2] 2 [3, null, 5] 5 Usage:
Simple array function.
returns: Column(sc._jvm.functions.array_max(_to_java_column(col))) PySpark manual
tags: largest number in array, highest in array, highest in list</description>
    </item>
    
    <item>
      <title>array_min</title>
      <link>https://pysparkisrad.com/functions/array_min/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_min/</guid>
      <description>array_min # pyspark.sql.functions.array_min(col) # version: since 2.4.0 Collection function: returns the minimum value of the array.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;:1,&amp;#34;b&amp;#34;:2,&amp;#34;c&amp;#34;:2},{&amp;#34;a&amp;#34;:3,&amp;#34;c&amp;#34;:5}] df = spark.createDataFrame(data) df = df.select(F.array(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;),F.col(&amp;#34;c&amp;#34;)).alias(&amp;#34;a&amp;#34;)) # Use function df = (df .withColumn(&amp;#34;array_min&amp;#34;, F.array_min(F.col(&amp;#34;a&amp;#34;))) ) df.show() a array_min [1, 2, 2] 1 [3, null, 5] 3 Usage:
Simple array function.
returns: Column(sc._jvm.functions.array_min(_to_java_column(col))) PySpark manual
tags: smallest number in array, lowest in array, lowest in list</description>
    </item>
    
    <item>
      <title>array_position</title>
      <link>https://pysparkisrad.com/functions/array_position/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_position/</guid>
      <description>array_position # pyspark.sql.functions.array_position(col, value) # version: since 2.4.0 Collection function: Locates the position of the first occurrence of the given value in the given array. Returns null if either of the arguments is null, and 0 if the value is not found.
Note that the return value is the cardinal position, not zero based.
value: string or number
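A minimal sketch of the 1-based position and the 0-for-not-found behaviour (made-up data, separate from the runnable snippet below):
from pyspark.sql import functions as F
data = [{"arr": ["a", "b", "c"]}]
df2 = spark.createDataFrame(data)
df2 = (df2
    .withColumn("pos_b", F.array_position(F.col("arr"), "b"))  # 2: cardinal position, not zero based
    .withColumn("pos_z", F.array_position(F.col("arr"), "z"))  # 0: value not found
)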
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;:1,&amp;#34;b&amp;#34;:2,&amp;#34;c&amp;#34;:2},{&amp;#34;a&amp;#34;:3,&amp;#34;c&amp;#34;:5}] df = spark.</description>
    </item>
    
    <item>
      <title>array_remove</title>
      <link>https://pysparkisrad.com/functions/array_remove/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_remove/</guid>
      <description>array_remove # pyspark.sql.functions.array_remove(col, element) # version: since 2.4.0 Collection function: removes all elements equal to element from the given array.
element: string or number
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;:1,&amp;#34;b&amp;#34;:2,&amp;#34;c&amp;#34;:2},{&amp;#34;a&amp;#34;:3,&amp;#34;c&amp;#34;:5}] df = spark.createDataFrame(data) df = df.select(F.array(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;),F.col(&amp;#34;c&amp;#34;)).alias(&amp;#34;a&amp;#34;)) # Use function df = (df .withColumn(&amp;#34;array_remove&amp;#34;, F.array_remove(F.col(&amp;#34;a&amp;#34;),2)) ) df.show() a array_remove [1, 2, 2] [1] [3, null, 5] [3, null, 5] Usage:</description>
    </item>
    
    <item>
      <title>array_repeat</title>
      <link>https://pysparkisrad.com/functions/array_repeat/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_repeat/</guid>
      <description>array_repeat # pyspark.sql.functions.array_repeat(col, count) # version: since 2.4.0 Collection function: creates an array containing a column repeated count times.
count: int
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;:1},{&amp;#34;a&amp;#34;:2},{&amp;#34;a&amp;#34;:5}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;array_repeat&amp;#34;, F.array_repeat(F.col(&amp;#34;a&amp;#34;),3)) ) df.show() a array_repeat 1 [1, 1, 1] 2 [2, 2, 2] 5 [5, 5, 5] Usage:
Simple array function.
return Column(sc._jvm.functions.array_repeat(_to_java_column(col),_to_java_column(count) if isinstance(count, Column) else count)) PySpark manual
    </item>
    
    <item>
      <title>array_sort</title>
      <link>https://pysparkisrad.com/functions/array_sort/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_sort/</guid>
      <description>array_sort # pyspark.sql.functions.array_sort(col) # version: since 2.4.0 Collection function: sorts the input array in ascending order. The elements of the input array must be orderable. Null elements will be placed at the end of the returned array.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;:3,&amp;#34;b&amp;#34;:2,&amp;#34;c&amp;#34;:2},{&amp;#34;a&amp;#34;:3,&amp;#34;c&amp;#34;:5}] df = spark.createDataFrame(data) df = df.select(F.array(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;),F.col(&amp;#34;c&amp;#34;)).alias(&amp;#34;a&amp;#34;)) # Use function df = (df .withColumn(&amp;#34;array_sort&amp;#34;, F.array_sort(F.col(&amp;#34;a&amp;#34;))) ) df.show() a array_sort [3, 2, 2] [2, 2, 3] [3, null, 5] [3, 5, null] Usage:</description>
    </item>
    
    <item>
      <title>array_union</title>
      <link>https://pysparkisrad.com/functions/array_union/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/array_union/</guid>
      <description>array_union # pyspark.sql.functions.array_union(col1, col2) # version: since 2.4.0 Collection function: returns an array of the elements in the union of col1 and col2, without duplicates.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: [1,2,2],&amp;#34;b&amp;#34;: [3,2,2]},{&amp;#34;b&amp;#34;: [1,2,2]},{&amp;#34;a&amp;#34;: [4,5,5],&amp;#34;b&amp;#34;: [1,2,2]}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;array_union&amp;#34;, F.array_union(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;))) ) df.show() a b array_union [1, 2, 2] [3, 2, 2] [1, 2, 3] null [1, 2, 2] null [4, 5, 5] [1, 2, 2] [4, 5, 1, 2] Usage:</description>
    </item>
    
    <item>
      <title>arrays_overlap</title>
      <link>https://pysparkisrad.com/functions/arrays_overlap/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/arrays_overlap/</guid>
      <description>arrays_overlap # pyspark.sql.functions.arrays_overlap(a1, a2) # version: since 2.4.0 Collection function: returns true if the arrays contain any common non-null element; if not, returns null if both the arrays are non-empty and any of them contains a null element; returns false otherwise.
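The three possible outcomes are easier to see with concrete rows; a small sketch with made-up data (separate from the runnable snippet below):
from pyspark.sql import functions as F
data = [
    {"a": [1, 2], "b": [2, 3]},     # shares the element 2, so expect true
    {"a": [1, None], "b": [3, 4]},  # no overlap but a null is present, so expect null
    {"a": [1, 2], "b": [3, 4]},     # no overlap and no nulls, so expect false
]
df2 = spark.createDataFrame(data)
df2 = df2.withColumn("overlap", F.arrays_overlap(F.col("a"), F.col("b")))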
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: [1,2,2],&amp;#34;b&amp;#34;: [3,2,2]},{&amp;#34;b&amp;#34;: [1,2,2]},{&amp;#34;a&amp;#34;: [4,5,5],&amp;#34;b&amp;#34;: [1,2,2]}] df = spark.createDataFrame(data) # Use function df = (df .</description>
    </item>
    
    <item>
      <title>arrays_zip</title>
      <link>https://pysparkisrad.com/functions/arrays_zip/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/arrays_zip/</guid>
      <description>arrays_zip # pyspark.sql.functions.arrays_zip(*cols) # version: since 2.4.0 Collection function: Returns a merged array of structs in which the N-th struct contains all N-th values of input arrays.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: [1,2],&amp;#34;b&amp;#34;: [3,2]},{&amp;#34;b&amp;#34;: [1,2]}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;arrays_zip&amp;#34;, F.arrays_zip(F.col(&amp;#34;a&amp;#34;),F.col(&amp;#34;b&amp;#34;))) ) df.show() a b arrays_zip [1, 2] [3, 2] [{1, 3}, {2, 2}] null [1, 2] null Usage:</description>
    </item>
    
    <item>
      <title>ascii</title>
      <link>https://pysparkisrad.com/functions/ascii/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/ascii/</guid>
      <description>ascii # pyspark.sql.functions.ascii(col) # version: since 1.5.0 Computes the numeric value of the first character of the string column.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;b&amp;#34;: &amp;#34;hi&amp;#34;,&amp;#34;a&amp;#34;: &amp;#34;aa&amp;#34;},{&amp;#34;a&amp;#34;: &amp;#34;&amp;#34;},{&amp;#34;b&amp;#34;: &amp;#34;bob&amp;#34;}] df = spark.createDataFrame(data).drop(&amp;#34;b&amp;#34;) # Use function df = (df .withColumn(&amp;#34;ascii&amp;#34;, F.ascii(F.col(&amp;#34;a&amp;#34;))) ) df.show() a ascii aa 97 0 null null Usage:
Simple function. Just gets the ascii code of the first letter.</description>
    </item>
    
    <item>
      <title>asin</title>
      <link>https://pysparkisrad.com/functions/asin/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/asin/</guid>
      <description>asin # pyspark.sql.functions.asin(col) # version: since 1.3 Inverse sine of col, as if computed by java.lang.Math.asin()
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1.0},{&amp;#34;num&amp;#34;: .5},{&amp;#34;num&amp;#34;: 0.0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;asin&amp;#34;, F.asin(&amp;#34;num&amp;#34;)) ) df.show() num asin 1.0 1.5707963267948966 0.5 0.5235987755982989 0.0 0.0 Usage:
This is just a basic math function. Nothing special about it. Never used it.</description>
    </item>
    
    <item>
      <title>asinh</title>
      <link>https://pysparkisrad.com/functions/asinh/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/asinh/</guid>
      <description>asinh # pyspark.sql.functions.asinh(col) # version: since 3.1.0 Computes inverse hyperbolic sine of the input column.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1.0},{&amp;#34;num&amp;#34;: .5},{&amp;#34;num&amp;#34;: 0.0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;asinh&amp;#34;, F.asinh(&amp;#34;num&amp;#34;)) ) df.show() num asinh 1.0 0.8813735870195429 0.5 0.48121182505960347 0.0 0.0 Usage:
This is just a basic math function. Nothing special about it. Never used it.</description>
    </item>
    
    <item>
      <title>assert_true</title>
      <link>https://pysparkisrad.com/functions/assert_true/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/assert_true/</guid>
      <description>assert_true # pyspark.sql.functions.assert_true(col, errMsg=None) # version: since 3.1.0 Returns null if the input column is true; throws an exception with the provided error message otherwise.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: 1,&amp;#34;b&amp;#34;: 2}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;assert_true&amp;#34;, F.assert_true( F.col(&amp;#34;a&amp;#34;) &amp;lt; F.col(&amp;#34;b&amp;#34;))) ) df.show() a b assert_true 1 2 null Usage:
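Building on the df above, a sketch of the validation idea with a custom error message (the message text is made up; the errMsg argument needs Spark 3.1+):
# Any row failing the check raises an error with this message
df = df.withColumn("check", F.assert_true(F.col("a") &amp;lt; F.col("b"), "a must be smaller than b"))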
Never used it. But I could see it being useful for some form of validation.</description>
    </item>
    
    <item>
      <title>atan</title>
      <link>https://pysparkisrad.com/functions/atan/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/atan/</guid>
      <description>atan # pyspark.sql.functions.atan(col) # version: since 1.4 Inverse tangent of col, as if computed by java.lang.Math.atan()
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1.0},{&amp;#34;num&amp;#34;: .5},{&amp;#34;num&amp;#34;: 0.0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;atan&amp;#34;, F.atan(&amp;#34;num&amp;#34;)) ) df.show() num atan 1.0 0.7853981633974483 0.5 0.4636476090008061 0.0 0.0 Usage:
This is just a basic math function. Nothing special about it. Never used it.</description>
    </item>
    
    <item>
      <title>atan2</title>
      <link>https://pysparkisrad.com/functions/atan2/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/atan2/</guid>
      <description>atan2 # pyspark.sql.functions.atan2(col1, col2) # version: since 1.4 The theta component of the point (r, theta) in polar coordinates that corresponds to the point (x, y) in Cartesian coordinates, as if computed by java.lang.Math.atan2()
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num1&amp;#34;: 1.0,&amp;#34;num2&amp;#34;: 1.0},{&amp;#34;num1&amp;#34;: .5,&amp;#34;num2&amp;#34;: 1.0},{&amp;#34;num1&amp;#34;: 0.0,&amp;#34;num2&amp;#34;: 1.0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;atan2&amp;#34;, F.atan2(&amp;#34;num1&amp;#34;,&amp;#34;num2&amp;#34;)) ) df.show() num1 num2 atan2 1.</description>
    </item>
    
    <item>
      <title>atanh</title>
      <link>https://pysparkisrad.com/functions/atanh/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/atanh/</guid>
      <description>atanh # pyspark.sql.functions.atanh(col) # version: since 3.1.0 Computes inverse hyperbolic tangent of the input column.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1.0},{&amp;#34;num&amp;#34;: .5},{&amp;#34;num&amp;#34;: 0.0}] df = spark.createDataFrame(data) # Use function df = (df .withColumn(&amp;#34;atanh&amp;#34;, F.atanh(&amp;#34;num&amp;#34;)) ) df.show() num atanh 1.0 Infinity 0.5 0.5493061443340549 0.0 0.0 Usage:
This is just a basic math function. Nothing special about it. Never used it.</description>
    </item>
    
    <item>
      <title>avg</title>
      <link>https://pysparkisrad.com/functions/avg/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/avg/</guid>
      <description>avg # pyspark.sql.functions.avg(col) # version: since 1.3 Aggregate function: returns the average of the values in a group.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;num&amp;#34;: 1.0},{&amp;#34;num&amp;#34;: .5},{&amp;#34;num&amp;#34;: 0.0}] df = spark.createDataFrame(data) # Use function df = df.select(F.avg(&amp;#34;num&amp;#34;)) df.show() avg(num) 0.5 Usage:
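In practice avg usually appears together with groupBy; a minimal sketch with made-up data:
data2 = [{"cat": "x", "num": 1.0}, {"cat": "x", "num": 3.0}, {"cat": "y", "num": 5.0}]
df2 = spark.createDataFrame(data2)
# Average per group: x gives 2.0, y gives 5.0
df2.groupBy("cat").agg(F.avg("num").alias("avg_num")).show()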
Often used aggregation function.
returns: _invoke_function_over_column(&#34;avg&#34;, col) PySpark manual
tags: average value, median, mean, average price
</description>
    </item>
    
    <item>
      <title>base64</title>
      <link>https://pysparkisrad.com/functions/base64/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/base64/</guid>
      <description>base64 # pyspark.sql.functions.base64(col) # version: since 1.5 Computes the BASE64 encoding of a binary column and returns it as a string column.
Runnable Code:
from pyspark.sql import functions as F, types as T # Set up dataframe df = spark.createDataFrame([ (bytearray(b&amp;#39;0001&amp;#39;), 1) ], schema=T.StructType([ T.StructField(&amp;#34;bin&amp;#34;, T.BinaryType()), T.StructField(&amp;#34;number&amp;#34;, T.IntegerType()) ])) df = df.drop(&amp;#34;number&amp;#34;) # Use function df = (df .withColumn(&amp;#34;base64&amp;#34;, F.base64(F.col(&amp;#34;bin&amp;#34;))) ) df.show() bin base64 [30 30 30 31] MDAwMQ== Usage:
I&amp;rsquo;ve never used this one.</description>
    </item>
    
    <item>
      <title>bin</title>
      <link>https://pysparkisrad.com/functions/bin/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/bin/</guid>
      <description>bin # pyspark.sql.functions.bin(col) # version: since 1.5.0 Returns the string representation of the binary value of the given column.
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: 1,&amp;#34;b&amp;#34;: 2},{&amp;#34;a&amp;#34;: 3,&amp;#34;b&amp;#34;: 2},{&amp;#34;a&amp;#34;: 5},{&amp;#34;b&amp;#34;: 5}] df = spark.createDataFrame(data) df = df.drop(&amp;#34;b&amp;#34;) # Use function df = (df .withColumn(&amp;#34;bin&amp;#34;, F.bin(F.col(&amp;#34;a&amp;#34;))) ) df.show() a bin 1 1 3 11 5 101 null null Usage:
I don&amp;rsquo;t find myself needing binary very often.</description>
    </item>
    
    <item>
      <title>bround</title>
      <link>https://pysparkisrad.com/functions/bround/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/functions/bround/</guid>
      <description>bround # pyspark.sql.functions.bround(col, scale=0) # version: since 2.0.0 Round the given value to scale decimal places using HALF_EVEN rounding mode if scale &amp;gt;= 0 or at integral part when scale &amp;lt; 0.
scale: decimal places
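What sets it apart from round is the HALF_EVEN handling of ties; a small sketch with made-up data:
from pyspark.sql import functions as F
data = [{"a": 0.5}, {"a": 1.5}, {"a": 2.5}]
df2 = spark.createDataFrame(data)
df2 = (df2
    .withColumn("bround", F.bround(F.col("a")))  # HALF_EVEN: 0.0, 2.0, 2.0
    .withColumn("round", F.round(F.col("a")))    # HALF_UP: 1.0, 2.0, 3.0
)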
Runnable Code:
from pyspark.sql import functions as F # Set up dataframe data = [{&amp;#34;a&amp;#34;: 1.85,&amp;#34;b&amp;#34;: 2},{&amp;#34;a&amp;#34;: 1.86},{&amp;#34;b&amp;#34;: 5}]#,{}] df = spark.createDataFrame(data) df = df.drop(&amp;#34;b&amp;#34;) # Use function df = (df .withColumn(&amp;#34;bround&amp;#34;, F.bround(F.col(&amp;#34;a&amp;#34;), scale=1)) ) df.</description>
    </item>
    
  </channel>
</rss>
