
Reading CSV files into Spark DataFrames with read.df

A SparkDataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R, but with richer optimizations under the hood. SparkDataFrames can be constructed from a wide array of sources, such as structured data files, tables in Hive, external databases (e.g., MySQL, IBM dashDB), or existing local R data frames.

The general method for creating SparkDataFrames from data sources is read.df. This method takes the path of the file to load and the type of data source; in Spark 2.0 and later the currently active SparkSession is used automatically, while earlier versions take a SQLContext as the first argument. SparkR supports reading JSON, CSV, and Parquet files natively, and additional formats through Spark Packages. These packages can be added by specifying --packages with the spark-submit or sparkR commands.
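As a minimal sketch of the Spark 2.0+ form, assuming an active session and a hypothetical JSON file at /tmp/people.json (the path and variable name are placeholders):

    # read.df picks the reader from the "source" argument and infers the schema
    people <- read.df("/tmp/people.json", source = "json")
    printSchema(people)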

spark-csv

The spark-csv package implements a CSV data source for Apache Spark versions prior to 2.0. (Note: spark-csv was subsumed into Apache Spark 2.0, so installation and configuration of spark-csv are no longer required there.) With it, CSV files can be read directly into a DataFrame.
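As a sketch of the pre-2.0 call, assuming an existing SQLContext named sqlContext and a placeholder path /tmp/data.csv; the header option is a spark-csv feature that treats the first row as column names:

    # Fully qualified name of the spark-csv data source
    df <- read.df(sqlContext, "/tmp/data.csv",
                  source = "com.databricks.spark.csv", header = "true")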

Please go through the following steps to read a CSV file using read.df in SparkR:

  1. Open Cognitive Class Labs (Data Scientist Workbench) and go to RStudio IDE
  2. Go to the Files view (typically the bottom-right pane)
  3. Under Files, select .Rprofile. Open the file, add the following line of code anywhere within the if(interactive()) block, and save .Rprofile:
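    # Pass --packages to spark-submit so the spark-csv package is fetched when SparkR starts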
    Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.11:1.4.0" "sparkr-shell"')
  4. Restart the R session:
    • Select Restart R in the Session menu, or
    • Execute the following code in the console:
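      # Stop the running Spark context, then start a new one that picks up SPARKR_SUBMIT_ARGS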
      sparkR.stop()
      sc <- sparkR.init()
  5. Then run the following code in the console to try spark-csv out; a couple of optional checks on the result follow this list:
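    # Create a SQLContext from the Spark context; read.df then loads the CSV through spark-csv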
    sqlContext <- sparkRSQL.init(sc)
    medals <- read.df(sqlContext, "/resources/data/samples/olympic-medals/medals.csv", source = "csv")
    medals
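
If the load succeeds, printing medals shows a SparkDataFrame with the columns spark-csv inferred. As an optional sketch using standard SparkR functions:

    printSchema(medals)   # column names and types inferred from the file
    head(medals)          # first rows, returned as a local R data frame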
