pyspark - Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found - Stack Overflow

When I try to run a PySpark step on my EMR cluster, I get an error: Caused by: java.lang.ClassNotFoundException: Class com.amazon.ws.emr.hadoop.fs.EmrFileSystem not found. My understanding from the AWS documentation is that the EMR file system should already be installed on every EMR cluster. I also tried referencing my .py file in S3 using s3a:// instead, and got a similar error saying the S3A file system can't be found.

Here's how I'm creating the EMR step:

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
 --steps 'Type=spark,Name=Bronze,Args=[ --deploy-mode,cluster, --master,yarn, --conf,spark.yarn.submit.waitAppCompletion=true,s3://my-bucket/spark-scripts/spark_streaming.py],ActionOnFailure=CONTINUE'

And my cluster's bootstrap script is:

#!/bin/bash
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.2.1/delta-spark_2.12-3.2.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-storage/3.2.1/delta-storage-3.2.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/software/amazon/awssdk/sqs/2.29.6/sqs-2.29.6.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://awslabs-code-us-east-1.s3.amazonaws.com/spark-streaming-sql-s3-connector/spark-streaming-sql-s3-connector-0.0.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://jdbc.postgresql.org/download/postgresql-42.7.4.jar
sudo python3 -m pip install delta-spark==3.2.1
sudo python3 -m pip install boto3

1 Answer


I resolved this by removing the following line from my bootstrap script:

sudo python3 -m pip install delta-spark==3.2.1

As mentioned in similar questions, overwriting EMR's Spark installation causes this issue. The delta-spark package on PyPI declares pyspark as a dependency, so this pip install was unintentionally reinstalling Spark on top of EMR's own build.
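For reference, here's a sketch of what the corrected bootstrap can look like (keeping the JAR downloads from the question; the --no-deps variant at the end is an untested assumption for the case where the Python delta module itself is still needed):

#!/bin/bash
# Download the connector JARs into Spark's classpath; EMR's own Spark install stays untouched.
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-spark_2.12/3.2.1/delta-spark_2.12-3.2.1.jar
sudo curl -O --output-dir /usr/lib/spark/jars/ https://repo1.maven.org/maven2/io/delta/delta-storage/3.2.1/delta-storage-3.2.1.jar
# ...the remaining curl downloads from the question, unchanged...
sudo python3 -m pip install boto3
# If the delta Python API is needed, --no-deps should stop pip from pulling in a second pyspark
# (assumption, not verified on EMR):
# sudo python3 -m pip install --no-deps delta-spark==3.2.1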

So now, to use the Delta library in Jupyter I reference it with a magic command, and for EMR steps I reference the JAR via a --conf argument.
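The exact commands aren't shown above, but here's a sketch of what that can look like, assuming a Sparkmagic-based EMR notebook and the JAR locations from the bootstrap. In Jupyter, the %%configure cell magic sets the session conf before the Spark session starts:

%%configure -f
{"conf": {"spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
          "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaSparkSessionCatalog"}}

And for a step, the same settings can go in as --conf pairs in the Args list (a single JAR path is shown because comma-separated conf values need extra escaping in the add-steps shorthand):

aws emr add-steps --cluster-id j-XXXXXXXXXXXXX \
 --steps 'Type=spark,Name=Bronze,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.jars=/usr/lib/spark/jars/delta-spark_2.12-3.2.1.jar,--conf,spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,--conf,spark.yarn.submit.waitAppCompletion=true,s3://my-bucket/spark-scripts/spark_streaming.py],ActionOnFailure=CONTINUE'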
