uuid
uuids in pyspark
I often need uuids in Databricks.
I use both deterministic (repeatable) ids and random ids.
Random UUIDs
For random ids I used to use a UDF.
import uuid

from pyspark.sql.functions import udf

@udf  # returns StringType by default
def create_random_id():
    return str(uuid.uuid4())
But as of Spark 3.0.0 there is a Spark SQL function for random UUIDs, uuid(). So now I use this:
from pyspark.sql import functions as F
df.withColumn("uuid", F.expr("uuid()"))
This is nicer and should be faster, since uuid() runs natively in Spark SQL, while a Python UDF has to ship every row to a Python worker and back.
Deterministic UUIDs
If I want a repeatable UUID based on one or more columns, this is the UDF (for several columns, concatenate them into one string first):
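import uuid

from pyspark.sql.functions import udf

@udf
def create_deterministic_uuid(some_string):
    # uuid5 hashes the namespace and name, so the same input always gives the same id
    return str(
        uuid.uuid5(
            uuid.NAMESPACE_OID,
            f'something:{some_string}'
        )
    )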
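The UDF takes a single string, so an id built from several columns needs the values joined first. Here is a minimal sketch of that in plain Python, so it can be tried without a cluster. The helper name `deterministic_uuid`, the `|` separator and the sample values are my own assumptions for illustration; the namespace and the `something:` prefix are the same as in the UDF.

```python
import uuid


def deterministic_uuid(*parts, sep="|"):
    """Build a repeatable UUID (version 5) from one or more values."""
    # Join the values with a separator so ("ab", "c") and ("a", "bc")
    # do not collapse into the same input string.
    key = sep.join(str(p) for p in parts)
    return str(uuid.uuid5(uuid.NAMESPACE_OID, f"something:{key}"))


# Same inputs always give the same id.
first = deterministic_uuid("customer-1", "2023-01-01")
second = deterministic_uuid("customer-1", "2023-01-01")
print(first == second)  # True

# Different inputs give a different id.
other = deterministic_uuid("customer-2", "2023-01-01")
print(first == other)  # False
```

In Spark the joining can happen before the UDF is called, for example with `F.concat_ws("|", "col_a", "col_b")`, where `col_a` and `col_b` are placeholder column names. Keep in mind that `concat_ws` skips null values, so rows that differ only in which column is null can end up with the same id.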
tags: uuid, uuid4, uuid5, random id
© 2023 PySpark Is Rad