uuid

uuid #

uuids in pyspark #

I often need uuids in Databricks.

I use both deterministic (repeatable) ids and random ids.

Random UUIDs #

For random ids I used to use a UDF.

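import uuid
from pyspark.sql.functions import udf

# Python UDF: returns a new random (version 4) UUID string for each row
@udf
def create_random_id():
    return str(uuid.uuid4())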
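As a quick sanity check of what the UDF body returns, here is a minimal plain-Python sketch (no Spark needed). The variable names are just for illustration.

```python
import uuid

# Same expression the UDF body uses
random_id = str(uuid.uuid4())

# Parse it back to inspect the UUID's properties
parsed = uuid.UUID(random_id)

print(len(random_id))                   # 36 (32 hex digits plus 4 hyphens)
print(parsed.version)                   # 4
print(str(uuid.uuid4()) != random_id)   # True: each call gives a new id
```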

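But Spark SQL has a built-in uuid() function for random uuids (the Spark SQL function reference lists it as available since 2.3.0, so it is there in Spark 3.0.0 and later). So now I use this: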

from pyspark.sql import functions as F

# withColumn returns a new DataFrame, so keep the result
df = df.withColumn("uuid", F.expr("uuid()"))

This is nicer and should be faster, since uuid() is evaluated natively by Spark SQL, whereas a Python UDF has to send every row to a Python worker process and back.

Deterministic UUIDs #

If I want to create a repeatable uuid based on one or more columns, this is the UDF (it takes a single string, so for several columns I concatenate them into one string first):

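import uuid
from pyspark.sql.functions import udf

# Python UDF: the same input string always yields the same version 5 UUID
@udf
def create_deterministic_uuid(some_string):
    # uuid5 hashes the namespace together with the name, so the result is repeatable
    return str(
        uuid.uuid5(
            uuid.NAMESPACE_OID,
            f'something:{some_string}'
        )
    )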
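To show the "one or more columns" case, here is a minimal plain-Python sketch of the same idea. The helper name deterministic_uuid, the "|" separator, and the sample values are my assumptions for illustration, not part of the UDF above. With a single value it produces the same id as the UDF body.

```python
import uuid

# Hypothetical separator; pick one that cannot appear in the column values,
# otherwise ("a|b", "c") and ("a", "b|c") would produce the same id.
SEPARATOR = "|"

def deterministic_uuid(*values):
    """Build a repeatable version 5 UUID from one or more values."""
    # None becomes an empty string so missing values still hash consistently
    joined = SEPARATOR.join("" if v is None else str(v) for v in values)
    return str(uuid.uuid5(uuid.NAMESPACE_OID, f"something:{joined}"))

first = deterministic_uuid("acme", 42)
second = deterministic_uuid("acme", 42)
other = deterministic_uuid("acme", 43)

print(first == second)  # True: same inputs, same id
print(first == other)   # False: different inputs, different id
```

In Spark this could be wrapped with udf(deterministic_uuid) and called with several column names, or the columns could be joined with F.concat_ws before being passed to the single-argument UDF.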

tags: uuid, uuid4, uuid5, random id




© 2023 PySpark Is Rad