When you're joining a large dataset with a small one in Apache Spark, a broadcast join can drastically improve performance. It's one of the simplest ways to avoid a costly shuffle, because Spark ships the small dataset to every node in the cluster instead of moving the large one across the network.
In a typical join, Spark shuffles both sides of the join to align keys. But if one side is small enough, Spark can replicate it across all worker nodes. This means:
- No shuffling of the large dataset
- Local joins happen on each node
- Less network traffic and more speed (a minimal setup sketch follows)
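To make the snippets below concrete, here's a minimal setup sketch; the names df_large and df_small and the synthetic data are illustrative, not from any particular workload:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

# Large side: 10 million synthetic rows keyed by "id"
df_large = spark.range(10_000_000).withColumn("amount", (col("id") % 100).cast("double"))

# Small side: a 100-row lookup table sharing the same key
df_small = spark.createDataFrame(
    [(i, f"label_{i}") for i in range(100)],
    ["id", "label"],
)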
A broadcast join is a good fit when:
- You're joining a large table (e.g., 10M rows) with a small lookup table (e.g., 100 rows)
- The small dataset fits comfortably in memory on each executor
- You're doing star schema joins in analytics, where one big fact table joins several small dimension tables (see the sketch after this list)
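Here's what the star-schema case might look like; the table names (sales, dim_product, dim_store) and join keys are hypothetical:

from pyspark.sql.functions import broadcast

fact_sales = spark.table("sales")          # large fact table (hypothetical name)
dim_product = spark.table("dim_product")   # small dimension (hypothetical name)
dim_store = spark.table("dim_store")       # small dimension (hypothetical name)

# Both small dimensions are broadcast, so the big fact table is never shuffled
enriched = (
    fact_sales
    .join(broadcast(dim_product), "product_id")
    .join(broadcast(dim_store), "store_id")
)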
Spark does this automatically when it detects small tables (below spark.sql.autoBroadcastJoinThreshold, default 10 MB).
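You can check or change that threshold at runtime through the session config; the 50 MB value below is just an example:

# Current threshold as a string (exact formatting varies by Spark version)
print(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

# Raise it to 50 MB...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)

# ...or set it to -1 to disable automatic broadcast joins entirely
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)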
You can also force it manually:
from pyspark.sql.functions import broadcast

df_large.join(broadcast(df_small), "id")
Or in SQL:
SELECT /*+ BROADCAST(dim_table) */ *
FROM fact_table
JOIN dim_table
  ON fact_table.key = dim_table.key;
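The same hint also exists on the DataFrame side; a small sketch reusing df_large and df_small from above:

# "broadcast" is a built-in hint name understood by the optimizer
df_large.join(df_small.hint("broadcast"), "id")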
Your "small" table is actually too large for memory on all nodes.
You're joining two large tables — broadcast won't help.
You’re using wide rows with complex nested fields (can blow up memory).
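A rough, purely illustrative guard: row count is a crude proxy for size (wide or nested rows can be huge even with few rows), and the one-million-row cutoff is an arbitrary number for this sketch:

from pyspark.sql.functions import broadcast

small_row_count = df_small.count()

if small_row_count < 1_000_000:
    # Small enough to replicate: hint the broadcast explicitly
    joined = df_large.join(broadcast(df_small), "id")
else:
    # Too big to ship to every executor: fall back to a regular shuffle join
    joined = df_large.join(df_small, "id")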
Always check Spark’s physical plan to see whether it applied the broadcast strategy:
df.join(broadcast(df_small), "id").explain()
Look for:
*(2) BroadcastHashJoin
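For output that's easier to scan, explain(mode="formatted") (Spark 3.0+) prints a summarized plan. You can also check the executed plan programmatically, but that goes through Spark internals via py4j, so treat it as a version-dependent convenience rather than a stable API:

from pyspark.sql.functions import broadcast

joined = df_large.join(broadcast(df_small), "id")

# Human-friendly plan summary (Spark 3.0+)
joined.explain(mode="formatted")

# Programmatic check against the executed plan (internal API, may change between versions)
plan_text = joined._jdf.queryExecution().executedPlan().toString()
print("BroadcastHashJoin" in plan_text)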
Bottom Line: Broadcast joins are low-hanging fruit for Spark performance — just make sure your small table is actually small, and let Spark handle the rest.