How to work with Hadoop data using SQL in a Scala Jupyter notebook
You can work with Hadoop data using SQL in a Jupyter Notebook.The required libraries are pre-installed in your Jupyter Notebook environment, so you can establish a connection to a remote Hadoop cluster with Big SQL and then run SQL queries over Hadoop data.
Credentials
Big SQL Technology Sandbox is a large, shared cluster powered by Hadoop. You can use it to run R, SQL, Spark, and Hadoop jobs. It is a high performance environment demonstrating the advantages of parallelized processing of big data sets.
For credentials, sign up for an account on Demo Cloud. Your username and password there will be used for new SQL connections.
Access Hadoop data using SQL
Create a new Jupyter notebook in Data Scientist Workbench. Set it to use Scala.
Enter your Big SQL Technology Sandbox username and password in a new cell.
val username = "my_demo_cloud_username";
val password = "my_demo_cloud_password"
Notice: Your username is different from your email address. For example, the username forjane.doe@example.com
might bejanedoe
. You can see your username in the top right corner of Demo Cloud when you're logged in.
Run the cell above.
Create a new cell.
Enter the other connection details for our cluster.
val database = "bigsql";Run the cell above. Create a new cell.
val hostname = "iop-bi-master.imdemocloud.com";
val port = "32051"
Create the connection using the code below.
import java.sql.{Connection, DriverManager, ResultSet};
java.sql.DriverManager.registerDriver(new com.ibm.db2.jcc.DB2Driver);
val url = List("jdbc:db2://", hostname, ":", port, "/", database).mkString("");
var connection = java.sql.DriverManager.getConnection(url, username, password)
Run the cell above. If something didn't work, make sure you set the username and password variables in the first cell to your Demo Cloud username and password.
gosalesdw.emp_employee_dim
is a sample table in the bigsql database.Create a new cell. Run a SQL query against the sample data.
val sql = "select * from gosalesdw.emp_employee_dim";Create a new cell. Let's print the results.
val results = connection.createStatement.executeQuery(sql)
while ( results.next() ) {Cool! You've accessed data in a Hadoop cluster using a SQL connection from a Jupyter notebook.
val name = results.getString("EMPLOYEE_NAME");
val key = results.getString("EMPLOYEE_KEY");
println("Employee key, name = " + key + ", " + name)
}
Finally, as a best practice we should close the database connection once we're done with it.
connection.close()