Document Properties for running SW on EMR 5.32

Description

When launching emr-5.32.0 with Spark 2.4.7 and using SW 3.30.1.2-1-2.4, only a cluster of size 1 is formed (no additional executors can be obtained).

Attempted settings:

  • unset SPARK_HOME

  • export SPARK_HOME

  • Setting the number of EMR core nodes to 2 or 3

  • Adding additional conf: --conf "spark.ext.h2o.client.ignore.SPARK_PUBLIC_DNS=true" --conf "spark.scheduler.minRegisteredResourcesRatio=1"

Cluster configuration:

  • Cluster scaling is off

  • Classification=spark-defaults with spark.dynamicAllocation.enabled=false
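The spark-defaults classification can be supplied at cluster launch as an EMR configuration object; a minimal sketch (the surrounding create-cluster options are omitted):

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.dynamicAllocation.enabled": "false"
    }
  }
]
```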

Launched SW:
```
bin/sparkling-shell --num-executors 2 --executor-memory 2g --master yarn --deploy-mode client --conf "spark.dynamicAllocation.enabled=false" --conf "spark.ext.h2o.client.ignore.SPARK_PUBLIC_DNS=true" --conf "spark.scheduler.minRegisteredResourcesRatio=1"
```
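To confirm how many executors actually registered once the shell comes up, a quick check from inside sparkling-shell can help (a sketch; `sc` is the SparkContext the shell provides, and in yarn-client mode the driver itself appears in the map, hence the subtraction):

```scala
// Block managers known to the driver; subtract 1 for the driver itself.
val executorCount = sc.getExecutorMemoryStatus.size - 1
println(s"executors registered: $executorCount")
```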

Launching spark/pyspark (without SW) yields 2 executors.

There is no issue when using emr-5.22.0.

Activity

Marek Novotny
January 4, 2021, 12:30 PM

Thanks for following the thread. I expected to be notified about a reply by email, but it seems this doesn't work.

OK, I will try the recommended settings and add them to the documentation as part of this ticket.

Neema Mashayekhi
December 31, 2020, 8:22 PM

Response on the forum ticket from Jonathan@AWS:

This is a performance optimization feature of emr-5.32.0, where Spark/YARN on EMR will now consolidate container requests into a fewer number of larger containers. Executor memory/cores will be a multiple of spark.executor.memory/cores. Generally, using a smaller number of larger executors will be more performant than a larger number of smaller executors, so this is now the behavior that is performed by default.

If for some reason you need to disable this behavior, you may do so by setting spark.yarn.heterogeneousExecutors.enabled=false. Alternatively, you may set spark.executor.maxMemory/maxCores to values lower than Int.MaxValue, if you want to cap the memory/cores that will be used for each executor. (That is, you can set specific maxMemory/maxCores values without fully disabling the feature.) Note that these are EMR-specific properties and will not be found in Apache Spark documentation.
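Based on the reply above, the original launch command could be retried with the EMR-specific property disabled; a sketch (the property is EMR-only and untested here):

```
bin/sparkling-shell --num-executors 2 --executor-memory 2g \
  --master yarn --deploy-mode client \
  --conf "spark.dynamicAllocation.enabled=false" \
  --conf "spark.yarn.heterogeneousExecutors.enabled=false"
```

Alternatively, per the same reply, spark.executor.maxMemory/maxCores can be set to cap per-executor memory/cores without disabling the feature entirely.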

Since this is not documented, we may need to document it ourselves for our users. It would be helpful if AWS were more transparent (especially with every future EMR release).

Marek Novotny
December 17, 2020, 3:45 PM
Neema Mashayekhi
December 16, 2020, 5:22 PM

I see. Maybe it can be posted on the discussion forum. https://forums.aws.amazon.com/index.jspa

I got it from the Resolution page: https://aws.amazon.com/premiumsupport/knowledge-center/send-feedback-aws/

Marek Novotny
December 16, 2020, 10:13 AM

That’s a good idea, but do you know how to report it? If I go to the AWS Support Center to open a case, it tells me: Technical support is unavailable under Basic Support Plan.


Fixed

Assignee

Marek Novotny

Reporter

Neema Mashayekhi

Labels

None

CustomerVisible

No

testcase 1

None

testcase 2

None

testcase 3

None

h2ostream link

None

Affected Spark version

None

AffectedContact

None

AffectedCustomers

None

AffectedPilots

None

AffectedOpenSource

None

Support Assessment

None

Customer Request Type

None

Support ticket URL

None

End date

None

Baseline start date

None

Baseline end date

None

Task progress

None

Task mode

None

ReleaseNotesHidden

None

Fix versions

Priority

Major