Commit 716cde3

author
James Lee
committed
add HousePriceProblem and HousePriceSolution
1 parent 1c4966f commit 716cde3

3 files changed, +75 -1 lines changed

in/RealEstate.csv

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-MLS,Location,Price,Bedrooms,Bathrooms,Size,Price/SQ.Ft,Status
+MLS,Location,Price,Bedrooms,Bathrooms,Size,Price SQ Ft,Status
 132842,Arroyo Grande,795000.00,3,3,2371,335.30,Short Sale
 134364,Paso Robles,399000.00,4,3,2818,141.59,Short Sale
 135141,Paso Robles,545000.00,4,3,3032,179.75,Short Sale
Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
+package com.sparkTutorial.sparkSql;
+
+
+public class HousePriceProblem {
+
+    /* TODO: Create a Spark program to read the house data from in/RealEstate.csv, group by location, aggregate the average price per SQ Ft and max price, and sort by average price per SQ Ft.
+
+    The HOUSES dataset contains a collection of recent real estate listings in San Luis Obispo county and
+    around it. The dataset is provided in two formats: as a CSV file and as a Microsoft Excel (1997-2003)
+    spreadsheet.
+
+    The dataset contains the following fields:
+    1. MLS: Multiple listing service number for the house (unique ID).
+    2. Location: city/town where the house is located. Most locations are in San Luis Obispo county and
+    northern Santa Barbara county (Santa Maria-Orcutt, Lompoc, Guadalupe, Los Alamos), but there
+    are some out-of-area locations as well.
+    3. Price: the most recent listing price of the house (in dollars).
+    4. Bedrooms: number of bedrooms.
+    5. Bathrooms: number of bathrooms.
+    6. Size: size of the house in square feet.
+    7. Price/SQ.ft: price of the house per square foot.
+    8. Status: type of sale. Three types are represented in the dataset: Short Sale, Foreclosure and Regular.
+
+    Each field is comma separated.
+
+    Sample output:
+
+    +----------------+-----------------+----------+
+    |        Location| avg(Price SQ Ft)|max(Price)|
+    +----------------+-----------------+----------+
+    |          Oceano|           1145.0|   1195000|
+    |         Bradley|            606.0|   1600000|
+    | San Luis Obispo|            459.0|   2369000|
+    |      Santa Ynez|            391.4|   1395000|
+    |         Cayucos|            387.0|   1500000|
+    |.............................................|
+    |.............................................|
+    |.............................................|
+
+    */
+}
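
For readers without a Spark environment, the aggregation the TODO describes (group by location, average the price per square foot, take the maximum price, sort by the average) can be sketched in plain Java. This is only an illustration of the logic, not the Spark program the exercise asks for: the class `HousePriceSketch`, its nested types, and the method `summarize` are hypothetical names, the input is just the three rows visible in the in/RealEstate.csv hunk, and the descending sort order is assumed from the sample output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical plain-Java sketch of the exercise's aggregation (no Spark involved).
public class HousePriceSketch {

    /** One listing, reduced to the three fields the exercise uses. */
    static final class Listing {
        final String location;
        final double price;
        final double pricePerSqFt;

        Listing(String location, double price, double pricePerSqFt) {
            this.location = location;
            this.price = price;
            this.pricePerSqFt = pricePerSqFt;
        }
    }

    /** Aggregated values for one location. */
    static final class LocationStats {
        final String location;
        final double avgPricePerSqFt;
        final double maxPrice;

        LocationStats(String location, double avgPricePerSqFt, double maxPrice) {
            this.location = location;
            this.avgPricePerSqFt = avgPricePerSqFt;
            this.maxPrice = maxPrice;
        }
    }

    /** Groups by location, computes avg(price per sq ft) and max(price), sorts by the average, descending. */
    static List<LocationStats> summarize(List<Listing> listings) {
        // Step 1: group the listings by location.
        Map<String, List<Listing>> byLocation = new LinkedHashMap<>();
        for (Listing listing : listings) {
            byLocation.computeIfAbsent(listing.location, k -> new ArrayList<>()).add(listing);
        }

        // Step 2: aggregate each group.
        List<LocationStats> stats = new ArrayList<>();
        for (Map.Entry<String, List<Listing>> entry : byLocation.entrySet()) {
            double avg = entry.getValue().stream().mapToDouble(x -> x.pricePerSqFt).average().orElse(0.0);
            double max = entry.getValue().stream().mapToDouble(x -> x.price).max().orElse(0.0);
            stats.add(new LocationStats(entry.getKey(), avg, max));
        }

        // Step 3: sort by the average, highest first (assumed from the sample output).
        stats.sort(Comparator.comparingDouble((LocationStats s) -> s.avgPricePerSqFt).reversed());
        return stats;
    }

    public static void main(String[] args) {
        // The three rows shown in the in/RealEstate.csv hunk of this commit.
        List<Listing> rows = Arrays.asList(
                new Listing("Arroyo Grande", 795000.00, 335.30),
                new Listing("Paso Robles", 399000.00, 141.59),
                new Listing("Paso Robles", 545000.00, 179.75));

        for (LocationStats s : summarize(rows)) {
            System.out.printf("%s avg=%.2f max=%.0f%n", s.location, s.avgPricePerSqFt, s.maxPrice);
        }
    }
}
```

The grouping, aggregation, and sorting steps are kept separate here so that each one maps onto a single call (`groupBy`, `agg`, `orderBy`) in a Spark version.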
Lines changed: 33 additions & 0 deletions
@@ -0,0 +1,33 @@
+package com.sparkTutorial.sparkSql;
+
+
+import org.apache.log4j.Level;
+import org.apache.log4j.Logger;
+import org.apache.spark.sql.Column;
+import org.apache.spark.sql.Dataset;
+import org.apache.spark.sql.Row;
+import org.apache.spark.sql.SparkSession;
+
+import static org.apache.spark.sql.functions.avg;
+import static org.apache.spark.sql.functions.max;
+
+public class HousePriceSolution {
+
+    private static final String PRICE = "Price";
+    private static final String PRICE_SQ_FT = "Price SQ Ft";
+
+    public static void main(String[] args) throws Exception {
+
+        Logger.getLogger("org").setLevel(Level.ERROR);
+        SparkSession session = SparkSession.builder().appName("HousePriceSolution").master("local[1]").getOrCreate();
+
+        Dataset<Row> realEstate = session.read().option("header", "true").csv("in/RealEstate.csv");
+
+        Dataset<Row> castedRealEstate = realEstate.withColumn(PRICE, new Column(PRICE).cast("long")).withColumn(PRICE_SQ_FT, new Column(PRICE_SQ_FT).cast("long"));
+
+        castedRealEstate.groupBy("Location")
+                .agg(avg(PRICE_SQ_FT), max(PRICE))
+                .orderBy(new Column("avg(" + PRICE_SQ_FT + ")").desc())
+                .show();
+    }
+}
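
One detail worth noting for reviewers: the solution casts "Price SQ Ft" to `long` before averaging, while the CSV stores values such as 141.59 and 179.75. Assuming the cast drops the fractional part (the averages in the sample output are consistent with whole-number inputs, but this is not verified against a Spark run here), the reported average differs slightly from the average of the raw values. The plain-Java sketch below illustrates the size of that effect on the two Paso Robles rows from the CSV hunk; `CastTruncationSketch` and its methods are hypothetical names, and `BigDecimal.longValue()` only stands in for the cast.

```java
import java.math.BigDecimal;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of averaging truncated vs. untruncated values (no Spark involved).
public class CastTruncationSketch {

    /** Average after dropping each value's fractional part, standing in for a cast to a whole-number type. */
    static double averageTruncated(List<String> raw) {
        return raw.stream()
                .mapToLong(s -> new BigDecimal(s).longValue()) // longValue() discards the fraction
                .average()
                .orElse(0.0);
    }

    /** Average of the values as written in the CSV. */
    static double averageExact(List<String> raw) {
        return raw.stream()
                .mapToDouble(Double::parseDouble)
                .average()
                .orElse(0.0);
    }

    public static void main(String[] args) {
        // "Price SQ Ft" values of the two Paso Robles rows shown in the CSV hunk.
        List<String> pasoRobles = Arrays.asList("141.59", "179.75");

        System.out.println("truncated average: " + averageTruncated(pasoRobles)); // average of 141 and 179
        System.out.println("exact average:     " + averageExact(pasoRobles));
    }
}
```

If fractional precision matters, casting the column to `double` instead of `long` would be the natural alternative, at the cost of no longer matching the sample output shown in HousePriceProblem.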
