The statistics and analytics assignment requires five separate tasks:
- An overall view of house prices in Baycoast.
- Identification of the main factors influencing house prices.
- Development of a multiple regression model for prices.
- Some basic time series analysis of house prices.
- A discussion of the suitability of the data set, along with other potential data sources and approaches for the purpose of this analysis.
Baycoast is a (fictitious) local government area (called a ‘city’) within greater Melbourne, Australia. There are a number of different suburbs, each with its own history of development. Because new suburbs emerged gradually, the city grew in stages. It covers some wealthy suburbs and some less wealthy ones. As the name indicates, the city is located on the Bay.
The city stretches for several kilometres along the Bay’s lovely beaches and for several kilometres inland. About 60,000 people live in the suburbs of Baycoast.
In this section, a descriptive analysis of house prices was carried out in order to gain a comprehensive understanding of house prices in the Baycoast region. The analysis was based on a random sample of 120 houses from the city.
The mean of Prices was $’000 886.575, with a standard deviation of $’000 324.9467. The median was $’000 852. Looking at the histogram (Figure 1.1: Histogram of Prices), the mean and median are close to each other, which suggests a roughly symmetric distribution and is consistent with, though not proof of, a normal distribution. It also suggests that the data did not contain many seriously extreme house prices. However, the standard deviation of $’000 324.9467 indicated a fairly wide spread of house prices. This is further revealed by the large range of house prices.
The lowest and highest Prices were $’000 192.00 and $’000 1761.00 respectively, giving a range of $’000 1569. The highest house price, $’000 1761, lay far above the mean and median, indicating the presence of at least one extreme value among the 120 house prices collected for this study. The same price was flagged as a mild outlier (case number 86) in Figure 1.2: Box Plot of Prices.
One useful way of understanding the spread of house prices is the Interquartile Range. If we arrange house prices from lowest to highest and divide them into 4 parts or quartiles (each quarter holding 25 % of the observations), the Interquartile Range is the difference between the third and first quartiles, that is, the spread of the middle 50 % of house prices. It was found as $’000 852.
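The quartile logic above, and the box-plot rule that flags outliers, can be sketched as follows. This is a minimal illustration only: the prices are hypothetical (not the Baycoast sample), the 1.5 × IQR fences are the usual box-plot convention and are assumed here, and the quartile method may differ slightly from the one used by the statistics package behind Figure 1.2.

```python
import statistics

def iqr_summary(prices):
    """Return quartiles, IQR and the 1.5*IQR box-plot fences for a list of prices."""
    # method="inclusive" treats the data as the whole set of interest;
    # other packages may interpolate quartiles slightly differently.
    q1, q2, q3 = statistics.quantiles(prices, n=4, method="inclusive")
    iqr = q3 - q1
    return {
        "q1": q1,
        "median": q2,
        "q3": q3,
        "iqr": iqr,
        "lower_fence": q1 - 1.5 * iqr,
        "upper_fence": q3 + 1.5 * iqr,
    }

# Hypothetical prices in $'000 (illustrative only, NOT the Baycoast data)
sample = [400, 500, 600, 700, 800, 900, 1000, 1100, 2500]
summary = iqr_summary(sample)

# Values outside the fences would be drawn as outliers on a box plot
outliers = [p for p in sample
            if p < summary["lower_fence"] or p > summary["upper_fence"]]
print(summary["iqr"], outliers)
```

Applied to the 120 sampled prices, the same function would give the quartiles and show whether case 86 falls beyond the upper fence.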
Based on the sample, we can estimate the house prices of the whole population of houses in Baycoast city. However, some inaccuracy will be associated with the estimate for several reasons. Two examples are (1) sampling error, because the sample is small compared with the size of the population, and (2) human error in observing, coding or documenting house prices. To allow for such deviation, we estimate with a Confidence Level. The commonly used level is 95 %, which means that if the sampling were repeated many times, about 95 % of the intervals constructed in this way would contain the true population mean. Correspondingly, we accept a 5 % chance that the interval misses the true mean house price.
Based on the sample data of house prices, we can say with 95 % confidence that the mean house price of all houses in Baycoast city (the population) lies between $’000 827.838 and $’000 945.312. Increasing the Confidence Level above 95 % will widen the interval and, conversely, decreasing it below 95 % will narrow the interval.
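The interval can be checked from the summary statistics alone. The sketch below uses the reported mean, standard deviation and sample size. The critical values 1.9801 (95 %) and 2.618 (99 %) are approximate t-table values for 119 degrees of freedom and are assumptions brought in for illustration, as is the function name.

```python
import math

def mean_confidence_interval(mean, sd, n, t_crit):
    """Confidence interval for a population mean from summary statistics."""
    standard_error = sd / math.sqrt(n)   # standard error of the mean
    margin = t_crit * standard_error     # half-width of the interval
    return mean - margin, mean + margin

# Reported sample statistics, in $'000
MEAN, SD, N = 886.575, 324.9467, 120

# t critical values for df = 119 (approximate, taken from t tables)
lower95, upper95 = mean_confidence_interval(MEAN, SD, N, t_crit=1.9801)
lower99, upper99 = mean_confidence_interval(MEAN, SD, N, t_crit=2.618)

print(f"95 % CI: {lower95:.3f} to {upper95:.3f}")
print(f"99 % CI: {lower99:.3f} to {upper99:.3f}")
```

The 95 % result agrees with the reported bounds to within rounding, and the 99 % interval is wider, as described above.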
The data was slightly more concentrated towards the lower side of Prices, as seen in the histogram (Figure 1.1). The bars between $’000 500 and $’000 1000 contain more observations than the bars between $’000 1000 and $’000 1500.
The skewness of 0.426 indicated two things about the concentration of house prices. First, the positive sign indicated that the data was more concentrated towards the lower side, with a longer tail towards higher prices. (A negative skewness value would have indicated concentration on the higher side of house prices, which is not the case in this sample.)
The second is the magnitude of this concentration. If the distribution were perfectly symmetric around the mean, the skewness would be zero, which is rarely found in real data. The skewness value of 0.426 is mild, and hence we say that the sample data (not the entire set of houses in Baycoast city) was only slightly more concentrated on the lower side.
The kurtosis value of -0.148 indicated a slightly flatter distribution, in terms of peakedness, than a perfect normal (bell-shaped) curve. Again, the magnitude of the kurtosis was not a major concern.
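For reference, skewness and kurtosis of this kind can be computed as sketched below. The bias-adjusted sample formulas are assumed here because common statistics packages and spreadsheets report them; the report does not state which formula produced 0.426 and -0.148. The data are hypothetical.

```python
import math

def _standardised(x):
    """Return n and the z-scores of x using the sample standard deviation."""
    n = len(x)
    mean = sum(x) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in x) / (n - 1))
    return n, [(v - mean) / s for v in x]

def sample_skewness(x):
    """Bias-adjusted sample skewness; 0 for a symmetric sample."""
    n, z = _standardised(x)
    return n / ((n - 1) * (n - 2)) * sum(v ** 3 for v in z)

def sample_excess_kurtosis(x):
    """Bias-adjusted excess kurtosis; negative means flatter than normal."""
    n, z = _standardised(x)
    z4 = sum(v ** 4 for v in z)
    return (n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

# Hypothetical samples (illustrative only)
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 2, 2, 3, 3, 3, 4, 10]   # long tail towards high values

print(sample_skewness(symmetric), sample_excess_kurtosis(symmetric))
print(sample_skewness(right_tailed))
```

A sample bunched at the low end with a long upper tail, like `right_tailed`, gives a positive skewness, which is the pattern described for Prices.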
Descriptive statistics are shown in Table 1.1.
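|Descriptives of Price ($’000)||Statistic|
|Mean||886.575|
|95% Confidence Interval for Mean, Lower Bound||827.838|
|95% Confidence Interval for Mean, Upper Bound||945.312|
|5% Trimmed Mean||876.778|
|Median||852|
|Std. Deviation||324.9467|
|Minimum||192.00|
|Maximum||1761.00|
|Range||1569|
|Skewness||0.426|
|Kurtosis||-0.148|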
Table 1.1: Descriptive of Prices
The correlation coefficient is a standard statistical tool for analysing the strength and direction of the linear relationship between variables. We ran a correlation analysis between Prices and the other variables. Table 2.1 (in the Excel sheet) shows the Pearson correlation coefficients between Prices and the other 22 variables. Seven variables, viz. Material, ToTrainkm, ToBuskm, ToShopskm, Kitchen, AirCon and RentalStatus, did not have a useful relationship with Prices. The following variables showed significant correlation with Prices (only the 5 highest correlations are shown below):
- Street, 0.723
- BayViews, 0.676
- WeeklyRents$, 0.666
- Areasqm, 0.568
- Storeys, 0.565
Scatter plots of the above 5 variables against Prices are shown in the Excel sheet.
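The screening step above, computing a Pearson coefficient for each candidate variable and ranking by strength, can be sketched as follows. The column names are borrowed from the report, but the values are hypothetical and the helper name `pearson_r` is an assumption for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical values (illustrative only, NOT the Baycoast data)
price = [500, 650, 700, 900, 1200]          # $'000
candidates = {
    "Areasqm":   [120, 155, 160, 195, 260],
    "ToTrainkm": [2.0, 0.5, 3.0, 1.0, 2.5],
}

# Rank candidate predictors by the absolute size of their correlation with price
ranked = sorted(
    ((name, pearson_r(values, price)) for name, values in candidates.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, r in ranked:
    print(f"{name}: {r:.3f}")
```

Ranking by absolute value keeps strong negative relationships in view as well; a cut-off such as 0.50 can then be applied to the ranked list.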
The regression models built for predicting Prices are described below and summarised in Table 3.1.
Model 1 included all 22 variables [Rooms, LotSizesqm, Age, Areasqm, Material, ToTrainkm, ToBuskm, ToShopkm, Street, Storeys, Style, Bedrooms, Bathrooms, Kitchen, Heating, Aircon, Bayviews, Suburb, WeeklyRent$, RentalReturn, Condition and RentalStatus] as independent variables for predicting Prices. The model produced a high adjusted R Square of 0.988 with a Standard Error of Estimate of 35.9410. The Durbin-Watson statistic was 2.002, which is close to 2 and showed that autocorrelation was not a concern.
However, 10 variables were found to have a non-linear relationship with Prices, and hence in model 2 they were removed from the list of predictors. Model 2 had only 12 predictors [Lotsizesqm, Areasqm, ToTrainkm, Street, Storeys, Bathrooms, Kitchen, Heating, BayViews, Suburb, WeeklyRent$ and RentalReturn]. As expected, the results were almost the same as those of model 1.
In model 3, we built the model with only those variables whose correlation coefficient with Prices was higher than 0.50. Only 5 variables met this criterion, viz. [Areasqm, Street, Storeys, BayViews and WeeklyRent$]. This model showed the lowest adjusted R Square and a higher Standard Error of Estimate, and was thus not found attractive for predictive purposes.
Based on the comparison above, model 2, with only 12 predictors [Lotsizesqm, Areasqm, ToTrainkm, Street, Storeys, Bathrooms, Kitchen, Heating, BayViews, Suburb, WeeklyRent$ and RentalReturn], was found to be the best of the three attempted regression models based on highest R-Square and lowest Standard Error of Estimate, and is recommended for predicting Prices.
Table 3.1: Regression models
|Model number||Variables||Adjusted R Square||Standard Error of Estimate||Durbin-Watson Statistic|
|1||All 22 variables||0.988||35.9410||2.002|
|2||12 only [Lotsizesqm, Areasqm, ToTrainkm, Street, Storeys, Bathrooms, Kitchen, Heating, BayViews, Suburb, WeeklyRent$ and RentalReturn]|| || || |
|3||5 only [Areasqm, Street, Storeys, BayViews and WeeklyRent$]|| || || |
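The three comparison measures in Table 3.1 can be computed from observed and fitted values as sketched below. The formulas are the textbook definitions. The data and the fitted values are hypothetical and do not come from any of the three models, and the function names are assumptions for illustration.

```python
import math

def adjusted_r_square(observed, fitted, k):
    """Adjusted R Square for a model with k predictors."""
    n = len(observed)
    mean_y = sum(observed) / n
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))   # unexplained
    sst = sum((y - mean_y) ** 2 for y in observed)              # total
    r_square = 1 - sse / sst
    return 1 - (1 - r_square) * (n - 1) / (n - k - 1)

def standard_error_of_estimate(observed, fitted, k):
    """Typical size of a residual, in the units of the response."""
    n = len(observed)
    sse = sum((y - f) ** 2 for y, f in zip(observed, fitted))
    return math.sqrt(sse / (n - k - 1))

def durbin_watson(observed, fitted):
    """Durbin-Watson statistic; values near 2 suggest no autocorrelation."""
    e = [y - f for y, f in zip(observed, fitted)]
    return sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e))) / sum(v ** 2 for v in e)

# Hypothetical observed and fitted values from a one-predictor model
observed = [10, 12, 15, 19, 24]
fitted = [10.5, 11.5, 15.5, 18.5, 24.5]

adj_r2 = adjusted_r_square(observed, fitted, k=1)
see = standard_error_of_estimate(observed, fitted, k=1)
dw = durbin_watson(observed, fitted)
print(round(adj_r2, 4), round(see, 4), round(dw, 2))
```

Because adjusted R Square penalises each extra predictor through the `n - k - 1` term, it is a fairer basis than plain R Square for comparing models with 22, 12 and 5 predictors.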
The following steps were followed to estimate the forecasts and subsequently the MAPE. They should be read together with the Excel sheet Q4 (annexed).
- A 4-quarter moving average (column 4)
- 4-quarter, 2 year moving total (column 5)
- Ratio of actual centered moving average (column 6)
- (Observed/Centered moving average)*100 (column 7)
- Sum of non-extreme indices, quarter wise (column 8)
- Adjustment factor: 400 / (sum obtained in the previous step) (column 9)
- Adjusted indices (column 11)
- De-seasoned data (column 12)
- Regression model with the de-seasoned data as response variable and Time Period as predictor, which gave a constant of 492.8191 and a regression coefficient of 27.584
- De-seasoned prediction based on regression model (column 13)
- Seasoned Forecast (column 14)
- $\mathrm{MAPE} = \frac{1}{n}\sum_{t=1}^{n}\left|\frac{Y_t - F_t}{Y_t}\right|$; where $F_t$ is Seasoned Forecast, $Y_t$ is observed Prices and $n$ is the number of observations.
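After plugging in the respective values, the error measure was found as -0.045. Because MAPE averages absolute percentage errors it cannot be negative, so this figure is better read as a signed mean percentage error.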
MAPE is most useful when comparing two or more predictive models. The MAPE of a single model cannot, on its own, show whether the model is good or bad.
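The forecasting steps listed above follow the ratio-to-moving-average pattern, which can be sketched as below. This is an illustration under stated assumptions, not a reproduction of the Excel sheet: the quarterly series is synthetic, it is assumed to start in quarter 1, the highest and lowest ratio for each quarter are dropped as the "extreme" indices, and all function names are hypothetical.

```python
def centered_moving_average(y, period=4):
    """Map each 0-based position with a full centred window to its CMA."""
    ma = [sum(y[i:i + period]) / period for i in range(len(y) - period + 1)]
    # Averaging two adjacent moving averages centres the value on a quarter.
    return {i + period // 2: (ma[i] + ma[i + 1]) / 2 for i in range(len(ma) - 1)}

def seasonal_indices(y, period=4):
    """Adjusted seasonal indices (base 100) from ratios of observed to CMA."""
    ratios = {q: [] for q in range(period)}
    for t, cma in centered_moving_average(y, period).items():
        ratios[t % period].append(100 * y[t] / cma)
    means = []
    for q in range(period):
        r = sorted(ratios[q])
        if len(r) > 2:
            r = r[1:-1]                    # drop the extreme ratios
        means.append(sum(r) / len(r))
    factor = 100 * period / sum(means)     # 400 / sum for quarterly data
    return [m * factor for m in means]

def linear_trend(values):
    """Least-squares intercept and slope against time periods 1..n."""
    n = len(values)
    mean_t, mean_v = (n + 1) / 2, sum(values) / n
    sxy = sum((t - mean_t) * (v - mean_v) for t, v in enumerate(values, start=1))
    sxx = sum((t - mean_t) ** 2 for t in range(1, n + 1))
    slope = sxy / sxx
    return mean_v - slope * mean_t, slope

def mape(actual, forecast):
    """Mean absolute percentage error, as a fraction (never negative)."""
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)

# Synthetic quarterly prices: linear trend times a fixed seasonal pattern
pattern = [0.90, 1.00, 1.15, 0.95]
prices = [(500 + 25 * t) * pattern[(t - 1) % 4] for t in range(1, 17)]

indices = seasonal_indices(prices)
deseasoned = [p / (indices[i % 4] / 100) for i, p in enumerate(prices)]
intercept, slope = linear_trend(deseasoned)
forecast = [(intercept + slope * (i + 1)) * indices[i % 4] / 100
            for i in range(len(prices))]
error = mape(prices, forecast)
print([round(v, 1) for v in indices], round(slope, 2), round(error, 4))
```

Taking the absolute value inside `mape` is what keeps the measure non-negative; averaging the signed errors instead would let over- and under-forecasts cancel out.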
Research is an essential part of running a business successfully. Business research addresses specific business problems and helps in understanding the business, typically its consumers and products.
Questions such as which product consumers like most, the frequency and pattern of sales, and consumer demographics (age, gender, income class, education class and location) are addressed through a scientific approach of data collection, data analysis and subsequent decision making based on the results or conclusions of the analysis.
When addressing a particular business problem or research question, the problem must first be put into proper words and the target population or set of consumers identified. Next, suitable sample data must be collected. The questions to be asked of the respondents are of paramount importance and hence must be designed carefully.
To start with, the gathered data is analysed through its central tendency and dispersion. Descriptive statistics along with graphical representations are used for understanding the data. Extreme values or outliers are analysed through suitable graphical representations.
To answer typical research questions about the target population based on sample data, inferential statistical tools are used. The conclusions drawn from these tools are then used for business decision making. Thus the whole process of research methods is well suited to solving business problems.
As an alternative to a substantial sample, wherever one is not possible, a focused interview approach is also used. In this approach, relevant questions are put to prominent experts in the related domain and their feedback and opinions are then analysed.
Although the approach based on primary data is heavily used in solving business problems, an approach based on secondary data is also useful. Sources such as magazines, government publications on demographic traits and research articles are very useful for sourcing data.
Between the focused interview and substantial sampling approaches, the latter is more suitable when the same study or analysis is to be repeated in future. This is because generalization is feasible when we approach a business problem through sampling. In the focused interview approach, generalization is not feasible.