<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Miscs on PySpark Is Rad</title>
    <link>https://pysparkisrad.com/misc/</link>
    <description>Recent content in Miscs on PySpark Is Rad</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en-us</language><atom:link href="https://pysparkisrad.com/misc/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>dataframe</title>
      <link>https://pysparkisrad.com/misc/dataframe/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/misc/dataframe/</guid>
      <description>dataframe: creating a simple dataframe
Often you want to create a simple dataframe to try out PySpark functions.
Here is my favorite way to create a simple dataframe.
data = [{&amp;#34;a&amp;#34;: &amp;#34;hi&amp;#34;}, {&amp;#34;a&amp;#34;: &amp;#34;bye&amp;#34;}, {&amp;#34;a&amp;#34;: &amp;#34;fly&amp;#34;}]
df = spark.createDataFrame(data)
This produces a dataframe with a single column, a, holding the values hi, bye, and fly.
For me, using a list of dictionaries is the easiest way.
You can use the same technique to create array columns (from Python lists), and you can create nulls by omitting a key from one of the dictionaries.</description>
    </item>
    
    <item>
      <title>uuid</title>
      <link>https://pysparkisrad.com/misc/uuid/</link>
      <pubDate>Mon, 01 Jan 0001 00:00:00 +0000</pubDate>
      
      <guid>https://pysparkisrad.com/misc/uuid/</guid>
      <description>uuid: UUIDs in PySpark
I often need UUIDs in Databricks.
I use both deterministic (repeatable) ids and random ids.
Random UUIDs
For random ids I used to use a UDF.
import uuid
from pyspark.sql.functions import udf

@udf
def create_random_id():
    return str(uuid.uuid4())
But as of Spark 3.0.0 there is a built-in Spark SQL function, uuid(), for random UUIDs. So now I use this:
from pyspark.sql import functions as F
df.withColumn(&amp;#34;uuid&amp;#34;, F.expr(&amp;#34;uuid()&amp;#34;))
This is nicer and should be faster, since it uses a native Spark SQL function instead of a UDF that has to run Python.</description>
    </item>
    
  </channel>
</rss>
