
Broadcast Joins in Spark: When Less Shuffle Is More Speed

When you're joining a large dataset with a small one in Apache Spark, a broadcast join can drastically improve performance. It's one of the simplest ways to avoid a costly shuffle join, and it's powered by Spark's ability to move small data to all nodes in the cluster.

📦 What Is a Broadcast Join?

In a typical join, Spark shuffles both sides of the join to align keys. But if one side is small enough, Spark can replicate it across all worker nodes (see the sketch after this list). This means:

  • No shuffling of the large dataset

  • Local joins happen on each node

  • Less network traffic and more speed
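
Here is a minimal sketch of the pattern in PySpark. The DataFrames (df_large, df_small) and their sizes are hypothetical, chosen only to illustrate the shape of the code:

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-demo").getOrCreate()

# Illustrative data: a 10M-row fact-like table and a 100-row lookup table
df_large = spark.range(10_000_000).withColumnRenamed("id", "key")
df_small = spark.createDataFrame(
    [(i, f"label_{i}") for i in range(100)], ["key", "label"]
)

# broadcast() ships a full copy of df_small to every executor, so each
# partition of df_large is joined locally, with no shuffle of the large side
joined = df_large.join(broadcast(df_small), "key")
joined.show(5)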

🧪 When to Use It

  • You're joining a large table (e.g., 10M rows) with a small lookup table (e.g., 100 rows)

  • The small dataset fits in memory on each executor

  • You're doing star-schema joins in analytics (a large fact table joined to small dimension tables)

🔧 How to Enable Broadcast Joins

Spark does this automatically when a table's estimated size falls below spark.sql.autoBroadcastJoinThreshold (10 MB by default).
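
For example, you can raise the threshold to 50 MB (the value is in bytes; the number here is purely illustrative):

# Raise the auto-broadcast limit; tables estimated below 50 MB get broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50 * 1024 * 1024)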

You can also force it manually:

from pyspark.sql.functions import broadcast

# Mark the small side; Spark ships a full copy of it to every executor
df_large.join(broadcast(df_small), "id")

Or in SQL:

SELECT /*+ BROADCAST(dim_table) */ *
FROM fact_table
JOIN dim_table
  ON fact_table.key = dim_table.key;

⚠️ When Not to Use It

  • Your "small" table is actually too large for memory on all nodes.

  • You're joining two large tables — broadcast won't help.

  • You’re using wide rows with complex nested fields (can blow up memory).
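
If Spark's size estimate is wrong and it auto-broadcasts a table that turns out to be too big, you can disable the automatic behavior and rely on explicit hints only; a minimal sketch:

# Disable automatic broadcast joins; explicit broadcast() hints still apply
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)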

🧠 Pro Tip: Use Explain to Confirm

Always check Spark’s physical plan to see whether it applied the broadcast strategy:

df_large.join(broadcast(df_small), "id").explain()

Look for a BroadcastHashJoin node in the plan:

*(2) BroadcastHashJoin

If you see SortMergeJoin instead, Spark declined to broadcast, usually because the table's estimated size exceeded the threshold.
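
On Spark 3.x, explain's formatted mode makes the chosen strategy easier to spot (a small sketch):

# Spark 3.x: prints one operator per line plus a details section
df_large.join(broadcast(df_small), "id").explain(mode="formatted")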

Bottom Line: Broadcast joins are low-hanging fruit for Spark performance — just make sure your small table is actually small, and let Spark handle the rest.