How to work with Hadoop data using SQL in an R Jupyter notebook

← Jupyter Notebooks

IBM Big SQL provides standards-compliant SQL access to data in Hadoop. Developers familiar with SQL can access data in Hadoop without having to learn new languages or skills.

You can work with Hadoop data using SQL in a Jupyter Notebook.The required libraries are pre-installed in your Jupyter Notebook environment, so you can establish a connection to a remote Hadoop cluster with Big SQL and then run SQL queries over Hadoop data.

Credentials

Big SQL Technology Sandbox is a large, shared cluster powered by Hadoop. You can use it to run R, SQL, Spark, and Hadoop jobs. It is a high performance environment demonstrating the advantages of parallelized processing of big data sets.

For credentials, sign up for an account on Demo Cloud. Your username and password there will be used for new SQL connections.

Access Hadoop data using SQL

Create a new Jupyter notebook in Data Scientist Workbench. Set it to use R.

Image: https://idsw.uservoice.com/assets/94802277/Screenshot%202016-01-22%2013.12.20.png

Image: https://idsw.uservoice.com/assets/94802277/Screenshot%202016-01-22%2013.12.20.png

Enter your Big SQL Technology Sandbox username and password in a new cell.

username = "my_demo_cloud_username";

password = "my_demo_cloud_password"

Notice: Your Big SQL Technology Sandbox username is different from your email address. For example, the username for jane.doe@example.com might be janedoe. You can see your username in the top right corner of Demo Cloud when you're logged in.

Run the cell above.

Image: https://idsw.uservoice.com/assets/94801053/Screenshot%202016-01-22%2012.56.57.png

Create a new cell.

Image: https://idsw.uservoice.com/assets/94801086/Screenshot%202016-01-22%2012.57.55.png

Enter the other connection details for our cluster.

database = "bigsql";
hostname = "iop-bi-master.imdemocloud.com";
port = "32051"

Run the cell above. Create a new cell.

Create the connection using the code below.

library(RJDBC);

jcc = JDBC("com.ibm.db2.jcc.DB2Driver", "/usr/local/lib/db2jcc4.jar");
jdbc_path = paste("jdbc:db2://",  hostname, ":", port, "/", database, sep="");
conn = dbConnect(jcc, jdbc_path, user=username, password=password)

Run the cell above. If something didn't work, make sure you set the username and password variables in the first cell to your Demo Cloud username and password.

gosalesdw.emp_employee_dim is a sample table in the bigsql database.

Create a new cell. Run a SQL query against the sample data.

query = "select * from gosalesdw.emp_employee_dim";
rs = dbSendQuery(conn, query);
df = fetch(rs, -1);
df

Cool! You've accessed data in a Hadoop cluster using a SQL connection from a Jupyter notebook.

dbDisconnect(conn)

Finally, as a best practice we should close the database connection once we're done with it.

How to work with Hadoop data using SQL in an R Jupyter notebook

Feedback and Knowledge Base

Searching…

Contact support

Give feedback

Knowledge Base

Cognitive Class Labs

We're glad you're here

Searching…

Contact support

Give feedback

Knowledge Base

Cognitive Class Labs