Oracle DB connection and processing

Say I want to process data in an Oracle DB, transform it using some tables in a Hive DWH, and publish the result as a new Hive table -- will that be possible in Paxata?

Is connecting to an Oracle data warehouse seamless in Paxata?

How does the processing happen? Does Paxata just pull the data from Oracle directly into Spark every time the project is run?

@MagmaMan

Is connecting to an Oracle data warehouse seamless in Paxata?

Yes, we provide a database connector that supports many databases, including:

  • Oracle 11 and 12. 

We also support database connectivity to the following Hive versions: 

  • Hive (CDH5) - Version: CDH 5.12-5.14
  • Hive (HDP2) - Version: HDP 2.6.3

For details about configuring a database connection in Paxata, please view: 
https://community.datarobot.com/t5/admin-corner/how-to-configure-a-jdbc-data-source/m-p/6493#M46
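Purely for illustration (the actual setup happens through the Paxata admin UI, per the link above), an Oracle JDBC connection is typically defined by parameters like the following. The host, service name, and credentials here are placeholders, not Paxata defaults:

```python
# Hypothetical Oracle JDBC connection parameters -- all values are
# placeholders, not Paxata defaults. The thin-driver URL format is
# standard Oracle JDBC.
oracle_jdbc = {
    "url": "jdbc:oracle:thin:@//oracle-host.example.com:1521/ORCLPDB1",
    "driver": "oracle.jdbc.OracleDriver",
    "user": "paxata_reader",   # placeholder service account
    "password": "********",
    "fetchsize": "10000",      # larger fetch sizes help bulk reads
}
```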

For a list of all connectivity options in our 2018.2 release, please view: 
https://community.datarobot.com/t5/admin-corner/what-are-the-data-connections-i-can-access-within-pa...

How does the processing happen? 

From what I currently understand, your workflow would look like this in Paxata (see the sketch after the list): 

  1. Configure the JDBC Connector for your Oracle warehouse
  2. Configure the Hive Connector for your Hive database
  3. Import data from Oracle using the JDBC Connector, either by browsing to a table or by using SQL queries, to create a cached Data Set in the Paxata Data Library
  4. Import data from Hive using the Hive Connector, either by browsing to a table or by using SQL queries, to create a cached Data Set in the Paxata Data Library
  5. Create a Project in Paxata to perform Data Preparation
  6. Load your initial Data Set into the Project. This would likely be your Oracle data
  7. Add additional Data Sets to the Project via append (similarly structured data to create a longer data set) or via lookup (join additional data to each row)
  8. Transform your data into a final prepared structure
  9. Export the Answer Set to Hive using the Hive Connector
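Paxata performs all of these steps through its point-and-click UI, so no coding is required. Purely as a sketch of the equivalent data movement (this is not Paxata's implementation; all hostnames, credentials, and table names below are placeholders), the pipeline amounts to something like this in PySpark:

```python
# Sketch of the Oracle -> transform -> Hive flow described above.
# NOT Paxata's implementation -- it only illustrates the data movement.
# All hostnames, credentials, and table names are placeholders.
# (Running this for real would also require the Oracle JDBC driver jar
# on the Spark classpath.)
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("oracle-to-hive-sketch")
    .enableHiveSupport()   # allows reading/writing Hive tables
    .getOrCreate()
)

# Steps 3 and 6: pull the Oracle table into Spark via JDBC.
orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//oracle-host.example.com:1521/ORCLPDB1")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", "SALES.ORDERS")   # placeholder table
    .option("user", "paxata_reader")
    .option("password", "********")
    .load()
)

# Step 4: read a Hive lookup table (Hive support enabled above).
customers = spark.table("dwh.customers")   # placeholder Hive table

# Steps 7-8: join ("lookup") and transform into the prepared shape.
prepared = (
    orders.join(customers, on="customer_id", how="left")
          .withColumnRenamed("cust_name", "customer_name")
)

# Step 9: publish the Answer Set as a new Hive table.
prepared.write.mode("overwrite").saveAsTable("dwh.orders_enriched")
```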

 

Does Paxata just pull the data from Oracle directly into Spark every time the project is run?

After your initial Data Preparation (above), Paxata lets you schedule recurring executions of your Data Preparation Project via our Automation feature.

  • You can choose to import new data from your databases on each execution or to use the latest cached version available in the Data Library. During execution, Paxata will load the data into Spark according to your specification. 
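To make the difference between those two options concrete, here is a hedged sketch (again, not Paxata's internal mechanism; the path, table name, and flag below are invented for illustration):

```python
# Sketch contrasting "import new data on each run" with "use the latest
# cached version" -- NOT Paxata's internals; the path, table names, and
# flag are invented for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("automation-sketch").getOrCreate()

USE_CACHED_VERSION = True  # stands in for the Automation import setting

if USE_CACHED_VERSION:
    # Reuse the snapshot already materialized in the Data Library
    # (hypothetical path).
    orders = spark.read.parquet("/data-library/orders/latest")
else:
    # Re-import from Oracle so the run sees the newest source rows.
    orders = (
        spark.read.format("jdbc")
        .option("url", "jdbc:oracle:thin:@//oracle-host.example.com:1521/ORCLPDB1")
        .option("driver", "oracle.jdbc.OracleDriver")
        .option("dbtable", "SALES.ORDERS")  # placeholder table
        .option("user", "paxata_reader")
        .option("password", "********")
        .load()
    )

orders.show(5)  # downstream prep steps would run on this DataFrame
```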

Thanks for your question. 
Bill

@bstephens
Thank you for a great explanation! Now I better appreciate the Data Library part. It looks like it may need to handle very large data volumes. Is Azure Blob Storage a commonly used option for the Data Library?

NOTE: The following comment is wrong. Please ignore it. Direct Publish is when Spark talks directly with the Data Library. So data is ultimately always loaded from the Data Library, either directly or through Data Core servers (and not from the data sources, as I wrongly explained below). Apologies.

EARLIER WRONG COMMENT:
I just found that there is a Direct Data Load mode for tenants. If that is enabled, the Spark workers load data directly from the source instead of going through the Data Library! Otherwise, Data Core servers need to be installed; they bring in data from the Data Library, and the Spark workers talk to the Data Core servers.