When you're joining a large dataset with a small one in Apache Spark, a broadcast join can drastically improve performance. It's one of the simplest ways to avoid a costly shuffle join, and it works by shipping the small dataset to every node in the cluster.
📦 What Is a Broadcast Join?
In a typical join, Spark shuffles both sides of the join to align keys. But if one side is small enough, Spark can instead replicate it across all worker nodes (see the sketch after this list). This means:

- No shuffling of the large dataset
- Local joins happen on each node
- Less network traffic and more speed
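To make that concrete, here's a minimal, self-contained sketch. The DataFrame names and row counts are hypothetical, chosen to mirror the numbers used later in this post:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast_demo").getOrCreate()

# Hypothetical data: a 10M-row "large" side and a 100-row lookup table
df_large = spark.range(0, 10_000_000)          # single column: id
df_small = spark.createDataFrame(
    [(i, f"label_{i}") for i in range(100)],   # tiny -- fits in executor memory
    ["id", "label"],
)

# broadcast() copies df_small to every executor, so each partition of
# df_large joins locally and the large side is never shuffled
joined = df_large.join(broadcast(df_small), "id")
joined.show(5)
```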
🧪 When to Use It
- You're joining a large table (e.g., 10M rows) with a small lookup table (e.g., 100 rows)
- The small dataset fits in memory on each executor
- You're doing star schema joins in analytics (one big fact table joined to several small dimension tables)
🔧 How to Enable Broadcast Joins
Spark does this automatically when it detects that one side of the join is below spark.sql.autoBroadcastJoinThreshold (default 10 MB).
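The threshold is just a session config, so you can tune it at runtime. The 50 MB value below is only an illustrative choice:

```python
# Value is in bytes; this raises the cutoff for automatic broadcasting to 50 MB
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)
```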
You can also force it manually:
```python
from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "id")
```
Or in SQL:
```sql
SELECT /*+ BROADCAST(dim_table) */ *
FROM fact_table
JOIN dim_table
  ON fact_table.key = dim_table.key;
```
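If you're in PySpark rather than a SQL shell, you can issue the same hint through spark.sql() after registering temp views. Here fact_table and dim_table are just the names from the query above, and the join column is assumed to be id, matching the earlier sketch:

```python
# Register the DataFrames so the SQL hint can reference them by name
df_large.createOrReplaceTempView("fact_table")
df_small.createOrReplaceTempView("dim_table")

result = spark.sql("""
    SELECT /*+ BROADCAST(dim_table) */ *
    FROM fact_table
    JOIN dim_table ON fact_table.id = dim_table.id
""")
```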
⚠️ When Not to Use It
Your "small" table is actually too large for memory on all nodes.
You're joining two large tables — broadcast won't help.
You’re using wide rows with complex nested fields (can blow up memory).
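If Spark's size estimate tricks it into broadcasting something too big, one escape hatch is to disable automatic broadcasting entirely and opt in per join with an explicit hint. Setting the threshold to -1 is the documented way to turn it off:

```python
# -1 disables automatic broadcast joins; explicit broadcast() hints still apply
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
```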
🧠 Pro Tip: Use Explain to Confirm
Always check Spark’s physical plan to see whether it applied the broadcast strategy:
```python
df.join(broadcast(df_small), "id").explain()
```
Look for:
```
*(2) BroadcastHashJoin
```
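On Spark 3.0 and later, explain() also accepts a mode argument; the formatted mode labels each operator, which makes the join strategy easier to spot:

```python
# "formatted" is available on Spark 3.0+; older versions only take a boolean
df_large.join(broadcast(df_small), "id").explain(mode="formatted")
```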
Bottom Line: Broadcast joins are low-hanging fruit for Spark performance — just make sure your small table is actually small, and let Spark handle the rest.