In this example we will use IBM dashDB, a fully managed cloud data warehouse purpose-built for analytics, and explore how to connect to it from an R notebook and access data using SparkR dataframes.
Database connectivity will be established using JDBC (in another article we covered how to connect to a dashDB database using RJDBC).
If you don't already have an instance of dashDB, you can get one for free by following the steps in this article.
Let's start by creating a new Jupyter notebook in Data Scientist Workbench. Set it to use R.
Create a new cell and load the RJDBC package:
library(RJDBC)

Next, initialize a SQL Context:
sqlContext <- sparkRSQL.init(sc)

Enter the database connection credentials for your instance of dashDB (or a DB2 database) and run the cell.
# Enter the values for your database connection
dsn_hostname = "<Enter Hostname>"   # e.g. "bluemix05.bluforcloud.com"
dsn_port = "50000"                  # e.g. "50000"
dsn_database = "BLUDB"              # e.g. "BLUDB"
dsn_uid = "<Enter UserID>"          # e.g. "dash104434"
dsn_pwd = "<Enter Password>"        # e.g. "7dBZ39xN6$o0JiX!m"

Let's create a connection URL:

conurl <- paste0("jdbc:db2://", dsn_hostname, ":", dsn_port, "/", dsn_database,
                 ":user=", dsn_uid, ";password=", dsn_pwd, ";")

Now let's read the data from a sample table into a Spark dataframe:
df <- read.df(sqlContext, source="jdbc", url=conurl,
              dbtable="GOSALES.BRANCH")  # example sample table; substitute any table in your database
class(df)        # Confirm that df is a Spark dataframe
printSchema(df)  # Print the schema of the Spark dataframe

We can now use either SQL or native Spark dataframe functions to query and manipulate the data. Let's try SQL first:
registerTempTable(df, "tempdf")
results <- sql(sqlContext, "SELECT * FROM tempdf LIMIT 10")
# results is now a DataFrame

And now using Spark dataframe functions:
SparkR::head(df)

To learn how to manipulate data using Spark dataframes, there is a tutorial notebook on the Welcome page of the Jupyter notebooks in Data Scientist Workbench.
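Beyond head(), SparkR's dataframe API offers functions such as select, filter, and count. Below is a minimal sketch of what such operations look like against the df loaded above, assuming the Spark 1.x SparkR API used in this article. The column name "CITY" and the value "Toronto" are hypothetical; substitute a column shown by printSchema(df) and a value from your own data:

```r
# A sketch of common SparkR dataframe operations (Spark 1.x API).
# "CITY" is a hypothetical column name; replace it with one from printSchema(df).
subset_df <- select(df, "CITY")                 # project a single column
filtered  <- filter(df, df$CITY == "Toronto")   # keep only matching rows
count(filtered)                                 # number of rows that matched
SparkR::head(subset_df, 5)                      # first five rows as a local R data.frame
```

Note that these operations are lazy: Spark only reads from dashDB when an action such as count() or head() is executed.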
To learn more about Spark you can take the free Spark Fundamentals course in Big Data University.