Developing a Spark Application
Introduction
In our previous blog post, we showed how to upload data into Databricks on AWS using the Databricks UI. In this post, let's write a short Spark application in a notebook to query the data set we uploaded there.
Creating a Notebook
To develop a Spark application, you first need to create a notebook. A notebook is an interactive environment where you can manipulate data by writing queries in any of the supported languages. To create a notebook, go to “New” in the sidebar and select “Notebook”.
Connecting to a Cluster
Once the notebook is created, you need to attach it to a cluster. Click the “Connect” button and a list of available clusters appears. In our case there is only one cluster, and it is selected automatically.
Writing Queries
All right, let's write some queries against the data we uploaded earlier. First, let's list the directories in the Databricks File System (DBFS) using the ‘fs.ls’ command.
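A minimal sketch of that listing, assuming the files from the previous post went to the default /FileStore/tables/ location (adjust the path to wherever your files were uploaded):

# List the directories and files under the default upload location in DBFS.
display(dbutils.fs.ls("/FileStore/tables/"))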
Loading Data into Data Frames
Then, to work with the data in our files, we load it into data frames using spark.read.csv. From the uploaded files we will create two data frames named orders and order_items.
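A rough sketch of the load, assuming the files are named orders.csv and order_items.csv, sit under /FileStore/tables/, and include header rows (file names and paths are assumptions, not confirmed by the previous post):

# Read each CSV into a data frame, using the header row for column names
# and letting Spark infer the column types.
orders = spark.read.csv("/FileStore/tables/orders.csv", header=True, inferSchema=True)
order_items = spark.read.csv("/FileStore/tables/order_items.csv", header=True, inferSchema=True)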
Joining Data Frames
And now let's join the two data frames using the join function, matching the orders and order_items data frames on their common key, the order ID.
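A sketch of the join, assuming the key columns are named order_id and order_item_order_id (names typical of the retail orders data set, but an assumption here):

# Join the two data frames on the order ID.
order_details = orders.join(order_items, orders.order_id == order_items.order_item_order_id)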
Filtering Data
Next, let's add a filter so that only orders whose status is “complete” or “closed” are kept.
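A sketch of the filter, assuming the status column is named order_status and stores the values in upper case:

# Keep only completed or closed orders.
completed_orders = order_details.filter(order_details.order_status.isin("COMPLETE", "CLOSED"))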
Grouping and Aggregating Data
After that, we group the data by order date using groupBy and compute the daily revenue with the sum function.
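A sketch of the aggregation, assuming the date column is order_date and the per-item amount is order_item_subtotal (both assumed column names):

from pyspark.sql.functions import sum as sum_

# Group by order date and sum the item subtotals to get daily revenue.
daily_revenue = completed_orders.groupBy("order_date").agg(sum_("order_item_subtotal").alias("revenue"))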
Rounding and Ordering Data
Last but not least, for readability the revenue is rounded to two decimal places and the data is sorted by order date.
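A sketch of this final step, reusing the column names assumed in the earlier snippets:

from pyspark.sql.functions import round as round_

# Round the revenue to two decimal places and sort by order date.
daily_revenue = daily_revenue.withColumn("revenue", round_("revenue", 2)).orderBy("order_date")
display(daily_revenue)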