I am reading a small table from an Oracle database using Spark on Databricks. The code is very simple:
df = spark.read.jdbc(url=url, table=table_name, properties={"driver": "oracle.jdbc.driver.OracleDriver"})
df.write.format("delta").save(delta_location)
The extraction process is painfully slow: around 2 hours for 40k rows. The source server is located in Australia while Spark runs in Europe, so I suspect a network latency issue, since querying a different server that is closer is significantly faster.
I tried adding .option("fetchsize", n) with different n values and it made no visible difference.
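For reference, this is roughly what that looked like (the fetchsize value shown is just one of the values I tried):

df = (spark.read.format("jdbc")
      .option("url", url)
      .option("dbtable", table_name)
      .option("driver", "oracle.jdbc.driver.OracleDriver")
      .option("fetchsize", 10000)  # tried several values; no visible difference
      .load())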
What am I missing? Am I doing something wrong? Is there more troubleshooting I can do?
The exact property name for the Oracle JDBC driver is "defaultRowPrefetch". Try it with that name, and also try making it part of the URL (url?defaultRowPrefetch=1000) in case Spark is not passing the property through correctly.
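A minimal sketch of both variants, assuming the URL accepts properties appended as ?name=value (as suggested above):

# Variant 1: defaultRowPrefetch as a connection property
df = spark.read.jdbc(
    url=url,
    table=table_name,
    properties={
        "driver": "oracle.jdbc.driver.OracleDriver",
        "defaultRowPrefetch": "1000",
    },
)

# Variant 2: append it to the URL in case Spark does not forward the property
df = spark.read.jdbc(
    url=url + "?defaultRowPrefetch=1000",
    table=table_name,
    properties={"driver": "oracle.jdbc.driver.OracleDriver"},
)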
Spark will run a query along the lines of select * from (select * from table_name) against Oracle. You can run that query directly (say, using SQL Developer or DBeaver) on that Oracle server and see how long it takes. 40k rows should not take 2 hours unless the rows are huge (e.g. you have thousands of columns, or some columns hold big BLOBs as values). If the Oracle query itself is the cause, then the only possible "Spark optimization" is what Ali suggested. – Kashyap, Jan 2 at 16:54
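If neither SQL Developer nor DBeaver is handy, the raw query can be timed from Python with something like the sketch below (credentials and DSN are placeholders, not taken from the question):

import time
import oracledb  # python-oracledb thin client

# Placeholder connection details -- substitute your own.
conn = oracledb.connect(user="scott", password="tiger", dsn="dbhost.example.com:1521/ORCLPDB1")

start = time.time()
with conn.cursor() as cur:
    # Same shape of query that Spark issues for a plain table read
    cur.execute("select * from (select * from table_name)")
    rows = cur.fetchall()
print(f"Fetched {len(rows)} rows in {time.time() - start:.1f} s")
conn.close()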