Skip to content

How to work with Hadoop data using SQL in an R Jupyter notebook

IBM Big SQL provides standards-compliant SQL access to data in Hadoop. Developers familiar with SQL can access data in Hadoop without having to learn new languages or skills.

You can work with Hadoop data using SQL in a Jupyter Notebook.The required libraries are pre-installed in your Jupyter Notebook environment, so you can establish a connection to a remote Hadoop cluster with Big SQL and then run SQL queries over Hadoop data.


Big SQL Technology Sandbox is a large, shared cluster powered by Hadoop. You can use it to run R, SQL, Spark, and Hadoop jobs. It is a high performance environment demonstrating the advantages of parallelized processing of big data sets.

For credentials, sign up for an account on Demo Cloud. Your username and password there will be used for new SQL connections.

Access Hadoop data using SQL

Create a new Jupyter notebook in Data Scientist Workbench. Set it to use R.

Enter your Big SQL Technology Sandbox username and password in a new cell.
username = "my_demo_cloud_username";

password = "my_demo_cloud_password"
Notice: Your Big SQL Technology Sandbox username is different from your email address. For example, the username for might be janedoe. You can see your username in the top right corner of Demo Cloud when you're logged in.

Run the cell above.

Create a new cell.

Enter the other connection details for our cluster.
database = "bigsql";
hostname = "";
port = "32051"
Run the cell above. Create a new cell.

Create the connection using the code below.

jcc = JDBC("", "/usr/local/lib/db2jcc4.jar");
jdbc_path = paste("jdbc:db2://", hostname, ":", port, "/", database, sep="");
conn = dbConnect(jcc, jdbc_path, user=username, password=password)

Run the cell above. If something didn't work, make sure you set the username and password variables in the first cell to your Demo Cloud username and password.

gosalesdw.emp_employee_dim is a sample table in the bigsql database.

Create a new cell. Run a SQL query against the sample data.
query = "select * from gosalesdw.emp_employee_dim";
rs = dbSendQuery(conn, query);
df = fetch(rs, -1);
Cool! You've accessed data in a Hadoop cluster using a SQL connection from a Jupyter notebook.
Finally, as a best practice we should close the database connection once we're done with it.

Feedback and Knowledge Base