Posted: August 27th, 2021
Student’s Name
Instructor’s Name
Course
Date
Methods of Data Analysis
Introduction
Data analysis is central to extracting meaning from business data in the modern business setting, because it provides a basis for business success. Large volumes of data are created daily, yet less than 1% is analyzed and used to improve a business's value. Even this small share provides essential information for achieving the goals of any organization. Knowing how to collect, analyze, and interpret data therefore remains a challenge.
In light of the above, a researcher from Mini Project Part 2 Company seeks support in data analysis. The company has a new dataset of 1,000 California properties that contains the same variables as its first dataset. The researcher wants to know which combination of property characteristics best explains the variation observed in the median housing value of homes in California. They would like help developing a multiple linear model that includes the predictors that best explain median housing value and that can appropriately predict the median housing value of a new neighborhood. They will also share this model with real estate agents, so it should be easy to understand. Thus, the proposed model should be complex enough for good predictions and description of the population (with all the suitable properties) but simple enough that it is easily understood. Hence, it is essential that the analysis steps are clear and justifiable, so that there is no question about why the presented model was chosen over any other possibility.
Methods
The data for the study was collected from the Quercus project page. From a population of 20,433 California homes, a sample of 1,000 homes was drawn for the study.
Variable Selection
The study utilized 14 variables to arrive at an appropriate model. The variables were taken from the company data and processed through four stages to build a reliable model. First, all the variables were subjected to regression analysis. The model was then tested for reliability, using scatterplots to assess the residuals. Afterwards, the model was revised to remove predictors that exhibited excessive multicollinearity. Finally, standardized residual (r-standard) plots were produced to examine how far the data deviate from the mean and to help assess the quality of the final model. The following are the variables utilized in the model:
Longitude – the longitude where the home/region is located
Latitude – the latitude where the home/region is located
Housing median age – the median age of houses in the area where this home is located
Total rooms – the total number of rooms in the homes in this area
Total bedrooms – the total number of bedrooms in the homes in this area
Population – the population of the area where this home is located
Households – the number of households in the area where this home is located
Median income – the median income of households in the area (in ten-thousand dollars)
Median house value – the median house value in the area where this home is located
Near bay – an indicator of whether the home/region is located near a bay
Near ocean – an indicator of whether the home/region is located near the ocean
One-hour ocean – an indicator of whether the home/region is located within a one-hour drive of the ocean
Inland – an indicator of whether the home/region is located inland
X – an identifier for each observation in the data
Accordingly, the following model was used in setting up the assessment:
(i)
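Assuming model (i) corresponds to the first lm() call in the appendix, which regresses median house value on all twelve predictors, it can be written out as follows. The coefficient symbols and the error term are notation introduced here for illustration; they do not appear in the original report.

```latex
\begin{aligned}
\text{median\_house\_value} ={}& \beta_0 + \beta_1\,\text{longitude} + \beta_2\,\text{latitude} + \beta_3\,\text{housing\_median\_age} \\
&+ \beta_4\,\text{total\_rooms} + \beta_5\,\text{total\_bedrooms} + \beta_6\,\text{population} + \beta_7\,\text{households} \\
&+ \beta_8\,\text{median\_income} + \beta_9\,\text{near\_bay} + \beta_{10}\,\text{near\_ocean} \\
&+ \beta_{11}\,\text{oneh\_ocean} + \beta_{12}\,\text{inland} + \varepsilon
\end{aligned}
```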
Model Validation
There are different approaches to model validation. The study used the split data method, implemented with functions in RStudio. In this approach, the housing.csv data was split into two parts: training data and validation data. The predicted probability (score) for the validation sample was then computed using the model under consideration. The score file was ranked in descending order of estimated probability and split into deciles; the observations in each decile were counted and the cumulative events assessed for each decile. The cumulative gain score was then determined and divided by the percentage of data in each decile. Lastly, the KS statistic was computed to measure the degree of separation between the negative and positive distributions.
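The split-sample idea can be sketched in a few lines of R. This is a minimal illustration on simulated data, not the study's actual procedure: the data frame `sim`, its coefficients, and the `id` column are hypothetical stand-ins, and the sketch compares root mean squared error (RMSE) on the two halves rather than reproducing the decile and KS steps, which apply to predicted probabilities.

```r
# Minimal sketch of split-sample validation for a linear model.
# The data are simulated; names and coefficients are hypothetical.
set.seed(123)
n <- 1000
sim <- data.frame(
  id = 1:n,
  median_income = runif(n, 1, 10),
  housing_median_age = sample(1:52, n, replace = TRUE)
)
sim$median_house_value <- 40000 * sim$median_income +
  1000 * sim$housing_median_age + rnorm(n, sd = 50000)

# Split by row identifier: 500 rows for training, the rest for validation
train_ids <- sample(sim$id, 500, replace = FALSE)
train <- sim[sim$id %in% train_ids, ]
valid <- sim[!(sim$id %in% train_ids), ]

# Fit on the training half only
fit <- lm(median_house_value ~ median_income + housing_median_age, data = train)

# Predict the held-out half and compare the prediction error on both halves
pred <- predict(fit, newdata = valid)
rmse_train <- sqrt(mean(residuals(fit)^2))
rmse_valid <- sqrt(mean((valid$median_house_value - pred)^2))
print(c(train = rmse_train, validation = rmse_valid))
```

A validation RMSE close to the training RMSE suggests the model is not overfitted to the training half. Matching on the identifier column, rather than on the whole data frame, is what keeps the two halves disjoint.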
Model Violations/Diagnosis
The model was tested for multicollinearity, which arises when the independent variables are correlated with one another. When the degree of correlation is high, the coefficient estimates become unstable and hard to interpret, which undermines the validity of the model.
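One common way to quantify multicollinearity is the variance inflation factor (VIF). The appendix calls vif() from the car package; the sketch below computes the same quantity by hand with base R so that it runs without extra packages. The data are simulated and the variable names are only illustrative.

```r
# Sketch: variance inflation factors computed by hand with base R.
# Simulated data; total_rooms is built to be strongly tied to households.
set.seed(1)
n <- 500
households <- rnorm(n, 500, 150)
sim <- data.frame(
  households = households,
  total_rooms = 5 * households + rnorm(n, sd = 100),
  median_income = rnorm(n, 4, 1.5)  # unrelated to the other two
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing
# predictor j on all the other predictors
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))
```

A VIF near 1 indicates a predictor that is nearly uncorrelated with the others. Values above roughly 5 or 10 are often treated as a rule-of-thumb warning sign, although the exact cut-off is a judgment call.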
Results
In this section, the results of the analysis are presented. The aim is to illustrate the information obtained from analyzing the company data. The section also describes the data, the processes involved in obtaining the results, and the assessment of the quality of the model.
Description of Data
Regression Analysis – based on the regression model under (i), the analysis produced the following regression output.
Figure 1: Results of Regression Analysis
Figure 1 shows a summary of the regression output as obtained from the RStudio analysis report. The resulting regression model is as follows:
(ii)
The residuals ranged from a minimum of -255,804 to a maximum of 334,858. The first and third quartiles were -40,354 and 30,449, respectively.
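Figures of this kind head the printout of summary() for a fitted lm object, under the label Residuals. As a hedged sketch on simulated data (the variables `x` and `y` are placeholders, not the housing data), the five-number summary of the residuals can also be computed directly:

```r
# Sketch: five-number summary of the residuals of a linear model.
# Simulated data; x and y are placeholders.
set.seed(42)
x <- runif(200, 0, 10)
y <- 3 + 2 * x + rnorm(200, sd = 4)
fit <- lm(y ~ x)

res <- residuals(fit)
res_summary <- quantile(res, probs = c(0, 0.25, 0.5, 0.75, 1))
names(res_summary) <- c("Min", "1Q", "Median", "3Q", "Max")
print(round(res_summary, 3))
```

A median near zero with first and third quartiles of similar magnitude points to roughly symmetric residuals, while a minimum and maximum of very different size hint at skewness or outliers.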
Distribution of Data
The distribution assessment examined how the data are distributed, using the scatterplot displayed in Figure 2 below.
Figure 2: Scatter Plot 1
Figure 2 shows the distribution of the median housing value. Figure 3 below is a normal probability plot, which was used to assess the distribution of the sample data against the normal line. The quantile-to-quantile plot shows how far the data depart from the expected normal distribution line.
Figure 3: Normal Q-Q Plot
Figures 5 and 6 show the r-standard plots for the data. The r-standard plots were utilized to assess the standard dispersion of data from the mean. Figure 5 shows the r-standard distribution against deciles, and figure 6 shows the r-standard distribution against the quantiles.
Figure 4: Housing Data Distribution
Figure 5: R-Standard Plot
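The r-standard plots are built on standardized residuals, which R returns from rstandard(). The sketch below, on simulated data with placeholder variables, shows how a standardized residual is formed from the raw residual, the residual standard error, and the leverage, and how the Q-Q coordinates can be obtained without drawing a plot.

```r
# Sketch: standardized residuals and Q-Q coordinates for a linear model.
# Simulated data; x and y are placeholders.
set.seed(7)
n <- 300
x <- runif(n, 0, 10)
y <- 5 + 3 * x + rnorm(n, sd = 2)
fit <- lm(y ~ x)

# Standardized residual: e_i / (s * sqrt(1 - h_ii)), with h_ii the leverage
e <- residuals(fit)
s <- summary(fit)$sigma
h <- hatvalues(fit)
r_manual <- e / (s * sqrt(1 - h))
r_std <- rstandard(fit)
print(all.equal(unname(r_manual), unname(r_std)))

# Share of observations more than two standard deviations from zero
share_large <- mean(abs(r_std) > 2)
print(share_large)

# Q-Q coordinates without drawing; a correlation near 1 indicates that the
# residuals track the normal reference line closely
qq <- qqnorm(r_std, plot.it = FALSE)
print(cor(qq$x, qq$y))
```

Passing the same vector to qqnorm() and qqline(), as in the appendix, draws the plot itself.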
Goodness of the Final Model
The goodness of the final model was assessed by examining the goodness of fit. This is illustrated in Figure 6 below. The best model is the one whose fitted line passes closest to the majority of the data points.
Figure 6: Scatter Plot 2 (Improved Version)
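The appendix draws observed values against fitted values with a 45-degree reference line. A numerical counterpart to that visual check is R-squared, which for a least-squares model with an intercept equals the squared correlation between observed and fitted values. The sketch below illustrates this on simulated data; the variables `x1`, `x2`, and `y` are hypothetical.

```r
# Sketch: R-squared as the squared correlation of observed and fitted values.
# Simulated data; x1, x2 and y are placeholders.
set.seed(99)
n <- 250
x1 <- runif(n, 1, 10)
x2 <- rnorm(n)
y <- 20 + 4 * x1 - 2 * x2 + rnorm(n, sd = 3)
fit <- lm(y ~ x1 + x2)

r2_summary <- summary(fit)$r.squared
r2_from_fit <- cor(y, fitted(fit))^2
print(c(summary = r2_summary, from_fitted = r2_from_fit))
```

The closer the points lie to the 45-degree line in the observed-versus-fitted plot, the closer this value is to 1.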
Discussion and Conclusion
The section presents a summary of the findings and interpretation of the analysis results as demonstrated under the results section. The purpose is to find meaning and assess the relevance of the information towards addressing the current problem faced by the company. It also discusses the flaws in the data.
Interpretation and Importance
The final regression analysis revealed the following regression model:
(ii)
With all predictors set to zero, the model predicts a housing value of -3.367 units (the intercept). Latitude and longitude both have negative coefficients: holding the other predictors constant, a one-unit increase in either is associated with a decrease in housing value of 3.845 and 3.442 units, respectively. The same direction holds for total rooms, population, and households, which are associated with decreases of 3.507, 2.845, and 1.611 units. In contrast, a one-unit increase in total bedrooms, median housing age, or median income is associated with an increase in housing value of 1.296, 1.012, and 3.713 units, respectively. Likewise, houses near the bay, near the ocean, or within a one-hour drive of the ocean have a higher value than those away from these locations. At the 5% significance level (p-value < 0.05), only latitude, longitude, median housing age, total bedrooms, population, median income, and near ocean are statistically significant; the remaining coefficients are not. This suggests that these are the main factors associated with changes in housing value in the company's data.
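Reading coefficients and p-values off a fitted model can be sketched as follows. The data are simulated, and `noise_var` is a hypothetical predictor with no true effect, included only to show a coefficient that would usually fail the 5% test.

```r
# Sketch: extracting the coefficient table and the significant predictors.
# Simulated data; noise_var is a hypothetical predictor with no true effect.
set.seed(2021)
n <- 400
sim <- data.frame(
  median_income = runif(n, 1, 10),
  noise_var = rnorm(n)
)
sim$median_house_value <- 50000 + 40000 * sim$median_income +
  rnorm(n, sd = 60000)

fit <- lm(median_house_value ~ median_income + noise_var, data = sim)

# Columns: Estimate, Std. Error, t value, Pr(>|t|)
coefs <- summary(fit)$coefficients
print(signif(coefs, 4))

# Terms whose p-value falls below the 5% level
significant <- rownames(coefs)[coefs[, "Pr(>|t|)"] < 0.05]
print(significant)
```

R often prints estimates in scientific notation, such as 4.0e+04, so the exponent has to be read together with the leading digits when reporting the size of an effect. A coefficient that is not significant at the 5% level is not thereby shown to have no effect; the data simply do not distinguish it from zero.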
Regarding the quality of the data, most of the housing value data were positively skewed, as shown in Figure 2. The quantile-to-quantile plot (Figure 3) confirmed that the data are positively skewed and deviate from the expected normal distribution. Equally, the r-standard distribution plot indicates that most of the data were dispersed from the mean.
Thus, the most appropriate model for consideration is the following:
The model is the best fit because it passes through most of the data points, as illustrated in Figure 6. This implies that the predicted value may fluctuate as the distribution of the data changes.
Limitation of Analysis
In summary, the analysis was not without challenges. Most of the data were obtained from open sources without validation. Further, the study indicates that other factors should be considered when identifying appropriate predictors of housing value. However, the study was limited to the current factors because there was not enough time to incorporate others. Future studies should therefore incorporate a wider range of factors.
Appendices
RStudio Programming Code
Data Preparation
Data = read.csv("housing.csv")
set.seed(123)
rows = sample(1:nrow(Data), 1000, replace = F)
newdata = Data[rows,]
housing = newdata[sample(1:nrow(newdata),500, replace = F), ]
# rows of newdata whose identifier X is not in the training set form the test set
testdata = newdata[which(!(newdata$X %in% housing$X)), ]
Regression Operations
full = lm(median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income+ near_bay + near_ocean + oneh_ocean + inland, data = housing)
summary(full)
# summary() shows NA for the "inland" row, so drop inland
full2 = lm(median_house_value ~ longitude + latitude + housing_median_age + total_rooms + total_bedrooms + population + households + median_income+ near_bay + near_ocean + oneh_ocean, data = housing)
summary(full2)
# the predictor that showed a correlation of 1 with the others has now been removed
# check the conditions to see whether the residual plots reveal what is wrong with the model
pairs(housing[, c(2,3,4,5,6,7,8,9)])
plot(housing$median_house_value~fitted(full2))
abline(a=0,b=1)
# the first and second columns show a non-linear relationship, so drop longitude and latitude
# full3: no longitude or latitude
full3 = lm(median_house_value ~ housing_median_age + total_rooms + total_bedrooms + population + households + median_income+ near_bay + near_ocean + oneh_ocean, data = housing)
pairs(housing[, c(4,5,6,7,8,9)])
plot(housing$median_house_value~fitted(full3))
abline(a=0,b=1)
#Residual Plot
plot(rstandard(full3) ~ fitted(full3))
plot(rstandard(full3) ~ housing[, 4])
plot(rstandard(full3) ~ housing[, 5])
plot(rstandard(full3) ~ housing[, 6])
qqnorm(rstandard(full3))
qqline(rstandard(full3))
# Candidates for removal: observations at or above 500001
w = which (housing$median_house_value >= 500001)
# many values sit at exactly 500001, which looks strange
nd = housing[-w, ]
full4 = lm(median_house_value ~ housing_median_age + total_rooms + total_bedrooms + population + households + median_income+ near_bay + near_ocean + oneh_ocean, data = nd)
# use the reduced data (nd) and the refitted model (full4) for the checks
pairs(nd[, c(4,5,6,7,8,9)])
plot(nd$median_house_value~fitted(full4))
abline(a=0,b=1)
#residual plot
plot(rstandard(full4) ~ fitted(full4))
plot(rstandard(full4) ~ nd[, 4])
plot(rstandard(full4) ~ nd[, 5])
plot(rstandard(full4) ~ nd[, 6])
plot(rstandard(full4) ~ nd[, 7])
plot(rstandard(full4) ~ nd[, 8])
plot(rstandard(full4) ~ nd[, 9])
# looks a bit better
# Transformation
# check for multicollinearity with variance inflation factors
library(car)
vif(full4)
# high VIF: total_rooms, total_bedrooms, population, households
# try removing them and see what happens
full5 = lm(median_house_value ~ housing_median_age + median_income+ near_bay + near_ocean + oneh_ocean, data = nd)
pairs(nd[,c(4,9)])
plot(nd$median_house_value~fitted(full5))
abline(a=0,b=1)
plot(rstandard(full5)~fitted(full5))
plot(rstandard(full5)~nd[,4])
#model selection
#see word doc
#model violation
#see word doc