We are running Spark 3.5.0 (PySpark 3.5.0) in client mode on a 5-node cluster; the nodes run Ubuntu 22.04.2 LTS.
We were running into "No space left on device" errors caused by Spark writing shuffle and cache files to the default /tmp on our worker nodes.
The problem: only one of our five nodes honors spark.local.dir defined via spark-submit and SPARK_WORKER_OPTS defined in spark-env.sh.
We have tried to define spark.local.dir via:

1) spark-submit
2) SparkConf().set()
3) spark-defaults.conf

and SPARK_LOCAL_DIRS via:

1) spark-env.sh
2) --conf "spark.executorEnv.SPARK_LOCAL_DIRS=/mnt_path"

In every case, one of our nodes successfully writes to the defined path, but the other nodes keep writing to the default /tmp. Note: the defined path is a high-speed mounted volume shared by all nodes in the cluster.
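For concreteness, the variants we tried look roughly like this (a sketch: /mnt_path is the mount point mentioned above, while app.py and the master URL are placeholders):

```shell
# 1) via spark-submit (app.py and the master URL are placeholders)
spark-submit \
  --master spark://master-host:7077 \
  --conf spark.local.dir=/mnt_path \
  app.py

# 2) via SparkConf().set() in PySpark:
#    SparkConf().set("spark.local.dir", "/mnt_path")

# 3) via conf/spark-defaults.conf on each node:
#    spark.local.dir  /mnt_path

# and SPARK_LOCAL_DIRS via conf/spark-env.sh on each worker:
#    export SPARK_LOCAL_DIRS=/mnt_path
```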
We have also added the following to our spark-env.sh:

export SPARK_WORKER_OPTS="-Dspark.worker.cleanup.enabled=true -Dspark.worker.cleanup.interval=60 -Dspark.worker.cleanup.appDataTtl=60"

in the hope of cleaning up the work folders that are filling our root volume. Again, this environment variable only takes effect on one of our nodes.
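For reference, this is a sketch of how one can check, on each worker node, which local directory the running Spark processes actually picked up (it assumes a Standalone worker whose JVM command line contains deploy.worker.Worker; the process pattern would differ under YARN or another cluster manager):

```shell
# check_spark_local: report where Spark scratch dirs and the worker's
# environment actually point on this node.
check_spark_local() {
  # Spark scratch directories in either candidate location
  ls -d /tmp/spark-* /mnt_path/spark-* 2>/dev/null || true

  # Environment of the running Standalone worker JVM, if any,
  # to confirm whether SPARK_LOCAL_DIRS was exported to it.
  pid=$(pgrep -f 'deploy.worker.Worker' | head -n 1)
  if [ -n "$pid" ]; then
    tr '\0' '\n' < "/proc/$pid/environ" | grep SPARK_LOCAL
  else
    echo "no Spark worker process found"
  fi
}

check_spark_local
```

On a healthy node this should list scratch directories only under the mount point and show SPARK_LOCAL_DIRS in the worker's environment; on a misbehaving node the scratch directories appear under /tmp instead.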