Painfully slow Spark Oracle read (JDBC) - Stack Overflow


I am reading a small table from an Oracle database using Spark on Databricks. The code is very simple:

df = spark.read.jdbc(url=url, table=table_name, properties={"driver": "oracle.jdbc.driver.OracleDriver"})

df.write.format("delta").save(delta_location)

The extraction process is painfully slow: around 2 hours for 40k rows. The source server is located in Australia while Spark runs in Europe, so I suspect a network latency issue, since querying a different server that is closer is significantly faster.

I tried adding .option("fetchsize", n) with different values of n, and it made no visible difference.
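For reference, the read with the option added looked roughly like this (url and table_name are the same placeholders as above):

# Roughly what I tried: the same JDBC read with an explicit fetchsize.
df = (
    spark.read.format("jdbc")
    .option("url", url)
    .option("dbtable", table_name)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("fetchsize", 10000)  # tried several values of n here
    .load()
)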

What am I missing? Am I doing something wrong? What else can I do to troubleshoot?


asked Jan 2 at 10:45 by Łukasz Kastelik
  • hmm, network latency, as you've already suspected. let me come back to you with suggestions. – user84 Commented Jan 2 at 13:52
  • Should be pretty straightforward to pinpoint if you suspect network latency. Just look at the Oracle logs in Australia to see how long the query takes to execute (the query should be something like select * from (select * from table_name)). Or run that query directly against Oracle (say, using SQL Developer or DBeaver) and see how long it takes; see the timing sketch after these comments. 40k rows should not take 2 hours unless the rows are huge (e.g. you have thousands of columns, or some columns hold big BLOBs). If the Oracle query itself is the cause, then the only possible "Spark optimization" is what Ali suggested. – Kashyap Commented Jan 2 at 16:54
  • If it's a straight select from a table, it's definitely going to be the network. The I/O cost of a single table scan is minuscule compared to the network cost of the client fetching the results. What was fetchsize set to? You'd want orders-of-magnitude changes, not incremental ones (e.g. 1000 or 10000). Lastly, are there any LOB columns being fetched? Those greatly increase the chattiness of a data pull and so incur the latency penalty in extreme ways. There are some possibilities for mitigating this impact, but we'd need to know what you're fetching (how many bytes per row, any odd datatypes). – Paul W Commented Jan 3 at 1:26
  • And lastly, check the client side (process CPU) to ensure that it isn't spending a significant portion of its time on the save operation. – Paul W Commented Jan 3 at 1:28
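As suggested in the comments above, a minimal sketch for timing the bare query outside Spark, to separate network latency from Spark overhead. This assumes the python-oracledb package and placeholder credentials, neither of which are from the original post:

# Hypothetical sketch: run the same SELECT directly against Oracle and
# time it, so network latency can be separated from Spark overhead.
# Assumes python-oracledb (pip install oracledb); DSN and credentials
# are placeholders.
import time
import oracledb

conn = oracledb.connect(user="app_user", password="***",
                        dsn="aus-host:1521/ORCLPDB")  # placeholder DSN
cur = conn.cursor()
cur.arraysize = 10000  # client-side prefetch, analogous to fetchsize

start = time.time()
cur.execute("SELECT * FROM " + table_name)  # same table the Spark job reads
rows = cur.fetchall()
print(len(rows), "rows in", round(time.time() - start, 1), "s")
conn.close()

If this standalone fetch is also slow, the bottleneck is the network or the table itself, not Spark.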

1 Answer


The exact property name for the Oracle JDBC driver is "defaultRowPrefetch". Try that name, and also try making it part of the URL (url?defaultRowPrefetch=1000) in case Spark is not passing the property through correctly.
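A minimal sketch of both variants (the host/service name is a placeholder, not from the original post):

# Variant 1: pass defaultRowPrefetch as a reader option / connection property.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//aus-host:1521/ORCLPDB")  # placeholder
    .option("dbtable", table_name)
    .option("driver", "oracle.jdbc.driver.OracleDriver")
    .option("defaultRowPrefetch", "1000")
    .load()
)

# Variant 2: append it to the URL in case Spark does not forward the option.
url2 = "jdbc:oracle:thin:@//aus-host:1521/ORCLPDB?defaultRowPrefetch=1000"
df = spark.read.jdbc(url=url2, table=table_name,
                     properties={"driver": "oracle.jdbc.driver.OracleDriver"})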
