Document Properties for running SW on EMR 5.32
Description
When launching emr-5.32.0 with Spark 2.4.7 and SW 3.30.1.2-1-2.4, the H2O cluster forms with only one node (no additional executors can be obtained).
Attempted settings:
- unset SPARK_HOME
- export SPARK_HOME
- EMR Core nodes = 2 or 3
- Additional conf: --conf "spark.ext.h2o.client.ignore.SPARK_PUBLIC_DNS=true" --conf "spark.scheduler.minRegisteredResourcesRatio=1"
Cluster config settings:
- Cluster scaling is off
- Classification=spark-defaults with spark.dynamicAllocation.enabled=false
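For reference, the spark-defaults classification above can be supplied at cluster creation time, e.g. via the AWS CLI. This is only a sketch: the cluster name, instance type, and instance count are placeholders, not values from this ticket.

```
# Sketch: create an EMR 5.32.0 cluster with dynamic allocation disabled
# (name/instance type/count are placeholders)
aws emr create-cluster \
  --name "sw-emr-532-test" \
  --release-label emr-5.32.0 \
  --applications Name=Spark \
  --use-default-roles \
  --instance-type m5.xlarge --instance-count 3 \
  --configurations '[{"Classification":"spark-defaults","Properties":{"spark.dynamicAllocation.enabled":"false"}}]'
```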
Launched SW:
```
bin/sparkling-shell --num-executors 2 --executor-memory 2g \
  --master yarn --deploy-mode client \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.ext.h2o.client.ignore.SPARK_PUBLIC_DNS=true" \
  --conf "spark.scheduler.minRegisteredResourcesRatio=1"
```
Launching plain spark-shell/pyspark (without SW) gives 2 executors.
There is no issue when using emr-5.22.0
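The executor/container allocation can also be cross-checked from YARN on the EMR master node, which helps distinguish "YARN granted one big container" from "Spark asked for one executor". A sketch, assuming the YARN CLI is available on the master; `<application_id>` is a placeholder.

```
# List running YARN applications to find the Spark/SW application id
yarn application -list

# Show the application attempts (containers) for that application
yarn applicationattempt -list <application_id>
```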
Activity
Thanks for following the thread. I expected to be notified about a reply by email, but it seems this doesn't work.
Ok, I will try the recommended settings and add them to the documentation as part of this ticket.
Response on the forum ticket from Jonathan@AWS:
This is a performance optimization feature of emr-5.32.0, where Spark/YARN on EMR will now consolidate container requests into a fewer number of larger containers. Executor memory/cores will be a multiple of spark.executor.memory/cores. Generally, using a smaller number of larger executors will be more performant than a larger number of smaller executors, so this is now the behavior that is performed by default.
If for some reason you need to disable this behavior, you may do so by setting spark.yarn.heterogeneousExecutors.enabled=false. Alternatively, you may set spark.executor.maxMemory/maxCores to values lower than Int.MaxValue, if you want to cap the memory/cores that will be used for each executor. (That is, you can set specific maxMemory/maxCores values without fully disabling the feature.) Note that these are EMR-specific properties and will not be found in Apache Spark documentation.
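Based on the reply above, a possible way to restore the pre-5.32 behavior when launching SW would be to add the EMR-specific flag to the original command. This is an untested sketch; the property name is taken verbatim from the AWS response, not from Apache Spark documentation.

```
# Sketch: disable EMR's container-consolidation feature so YARN honors
# --num-executors (spark.yarn.heterogeneousExecutors.enabled is EMR-specific)
bin/sparkling-shell --num-executors 2 --executor-memory 2g \
  --master yarn --deploy-mode client \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.yarn.heterogeneousExecutors.enabled=false"
```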
Since this is not documented by AWS, we may need to document it ourselves for our users. It would be helpful if AWS were more transparent about such changes (especially with every future EMR release).
A forum ticket: https://forums.aws.amazon.com/thread.jspa?threadID=332806
I see. Maybe it can be posted on the discussion forum. https://forums.aws.amazon.com/index.jspa
I got it from the Resolution page: https://aws.amazon.com/premiumsupport/knowledge-center/send-feedback-aws/
That’s a good idea, but do you know how to report it? If I go to the AWS Support Center to open a case, it tells me: Technical support is unavailable under Basic Support Plan.