Exploratory Data Analysis (EDA) for House Prices
Project Overview
This project aims to analyze house price trends using exploratory data analysis (EDA). By leveraging statistical techniques and visualization tools, the study uncovers patterns, correlations, and key factors influencing house prices from 2006 to 2010. The insights gained from this analysis provide valuable information for real estate investors, buyers, and market analysts.
Dataset
The dataset is sourced from Kaggle’s House Prices – Advanced Regression Techniques competition. It contains 1,460 rows and 81 columns, including both numerical and categorical variables. The target variable in this dataset is SalePrice, representing the selling price of houses.
Features and Variables
- Numerical Features: SalePrice, LotArea, GrLivArea, GarageArea, OverallQual, etc.
- Categorical Features: MSZoning, Neighborhood, SaleCondition, etc.
- Target Variable: SalePrice (House Price)
Key Observations
- OverallQual, GrLivArea, and GarageCars have strong correlations with SalePrice.
- The majority of houses fall in the price range of $160,000 – $200,000, with some outliers representing premium properties.
- House prices remained relatively stable from 2006 to 2010, with minor fluctuations.
Analysis Performed
1. Data Preprocessing
- Handling Missing Values:
  - Columns with <5% missing values were imputed using the median/mode.
  - Columns with 5–20% missing values were carefully imputed based on domain knowledge.
  - Columns with >20% missing values (e.g., PoolQC, Alley) were dropped.
- Removing Duplicates to ensure data integrity.
- Outlier Detection:
  - Outliers were identified in features like LotArea, GrLivArea, and SalePrice.
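The missing-value thresholds and duplicate removal described above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the function name `handle_missing` and its parameters are hypothetical, and columns in the 5–20% band are deliberately left untouched because their imputation depends on domain knowledge.

```python
import pandas as pd


def handle_missing(df: pd.DataFrame,
                   drop_above: float = 0.20,
                   simple_below: float = 0.05) -> pd.DataFrame:
    """Drop very sparse columns, impute lightly-missing ones, remove duplicate rows.

    Hypothetical helper; the thresholds mirror the ones listed in this README.
    """
    out = df.copy()
    missing_share = out.isna().mean()  # fraction of missing values per column

    # Columns with more than 20% missing values are dropped.
    out = out.drop(columns=missing_share[missing_share > drop_above].index)

    # Columns with under 5% missing: median for numeric, mode for everything else.
    lightly_missing = missing_share[(missing_share > 0) & (missing_share < simple_below)]
    for col in lightly_missing.index:
        if pd.api.types.is_numeric_dtype(out[col]):
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(out[col].mode().iloc[0])

    # Columns in the 5-20% band are left as-is; they need case-by-case handling.
    return out.drop_duplicates()
```

A call might look like `cleaned = handle_missing(pd.read_csv("train.csv"))`, assuming the Kaggle training file has been saved under that name.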
2. Univariate Analysis
- SalePrice Distribution: Right-skewed distribution indicating a majority of lower-priced houses with a few high-priced properties.
- LotArea and GrLivArea: Highly skewed, requiring log transformation for better visualization.
- Neighborhoods: Some areas (e.g., StoneBr, NridgHt) have significantly higher house prices than others.
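As an illustration of the log transformation mentioned for skewed features, the sketch below adds a log-scaled copy of each column whose skewness exceeds a cutoff. The helper name `log_transform_skewed` and the cutoff of 1.0 are assumptions made for this example; the notebook may use a different rule.

```python
import numpy as np
import pandas as pd


def log_transform_skewed(df: pd.DataFrame, columns, skew_threshold: float = 1.0) -> pd.DataFrame:
    """Add a `<col>_log` column for each listed column that is strongly right-skewed.

    Hypothetical helper; the threshold is an illustrative choice.
    """
    out = df.copy()
    for col in columns:
        if out[col].skew() > skew_threshold:
            # log1p computes log(1 + x), which stays defined when x is 0.
            out[f"{col}_log"] = np.log1p(out[col])
    return out
```

For example, `log_transform_skewed(df, ["LotArea", "GrLivArea"])` would leave the original columns in place and add log-scaled versions only where the skew is large enough.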
3. Multivariate Analysis
- Correlation Heatmap:
  - Strong positive correlation between SalePrice and features like OverallQual (0.79), GrLivArea (0.71), and TotalBsmtSF (0.63).
  - Weak correlation between YrSold and SalePrice, indicating no significant price trend over the years.
- Boxplots: Examined the relationship between categorical features (e.g., MSZoning, Neighborhood) and SalePrice.
- Seasonal Analysis: Sales peak in May – July, while November – February shows a decline in transactions.
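The correlation ranking and the seasonal count above can be approximated with a few lines of pandas. This is a sketch under assumptions: the helper names are hypothetical, and `MoSold` (month of sale) is taken to be the column from the Kaggle dataset used for the seasonal view.

```python
import pandas as pd


def top_correlations(df: pd.DataFrame, target: str = "SalePrice", n: int = 5) -> pd.Series:
    """Return the n numeric features with the strongest positive Pearson correlation to the target."""
    corr = df.select_dtypes(include="number").corr()[target].drop(target)
    return corr.sort_values(ascending=False).head(n)


def sales_per_month(df: pd.DataFrame, month_col: str = "MoSold") -> pd.Series:
    """Count transactions per calendar month, ordered from January to December."""
    return df[month_col].value_counts().sort_index()
```

The full matrix from `df.select_dtypes(include="number").corr()` can be passed to `seaborn.heatmap` to draw the heatmap itself.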
4. Statistical Tests
| Statistical Test | Purpose | Test Statistic | p-value | Conclusion |
|---|---|---|---|---|
| Pearson Correlation | Test for a linear relationship between YrSold and SalePrice | -0.0289 | 0.2694 | No significant correlation |
| Chi-Square Test | Test distribution differences in house prices over years | 8.6713 | 0.7307 | No significant difference |
| T-Test | Compare mean house prices between years | 0.4950 | 0.5344 | No significant difference |
| ANOVA | Test whether mean house prices differ across years | 0.6455 | 0.6301 | No significant difference |
| Linear Regression | Test linear relationship between YrSold and SalePrice | R-squared: 0.001 | 0.2690 | No significant trend |
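Three of the tests in the table (Pearson correlation, one-way ANOVA, and simple linear regression) can be reproduced in outline with `scipy.stats`. The function `year_effect_tests` is a hypothetical wrapper written for this example, and the exact setup used in the notebook (for instance, how years were grouped for the t-test or chi-square test) is not shown here.

```python
import pandas as pd
from scipy import stats


def year_effect_tests(df: pd.DataFrame,
                      year_col: str = "YrSold",
                      price_col: str = "SalePrice") -> dict:
    """Run Pearson, one-way ANOVA, and linear regression of price on sale year.

    Hypothetical helper for illustration only.
    """
    # Pearson correlation between sale year and price.
    r, p_r = stats.pearsonr(df[year_col], df[price_col])

    # One-way ANOVA: do mean prices differ between sale years?
    groups = [g[price_col].to_numpy() for _, g in df.groupby(year_col)]
    f_stat, p_f = stats.f_oneway(*groups)

    # Simple linear regression of price on year.
    reg = stats.linregress(df[year_col], df[price_col])

    return {
        "pearson_r": r, "pearson_p": p_r,
        "anova_f": f_stat, "anova_p": p_f,
        "r_squared": reg.rvalue ** 2, "regression_p": reg.pvalue,
    }
```

With a single predictor, the regression R-squared equals the square of the Pearson coefficient, which is why the Pearson and regression rows in the table report nearly identical p-values.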
Key Insights
1. Features Affecting House Prices
- Overall Quality (OverallQual) has the strongest correlation with SalePrice (0.79).
- GrLivArea (Above ground living area) significantly affects pricing.
- GarageCars (Number of garage spaces) contributes to pricing variations.
- Neighborhood plays a crucial role, with premium areas commanding higher prices.
2. Market Trends
- House prices remained stable from 2006 to 2010: none of the statistical tests found a significant difference or trend across sale years.
- Seasonality effect: House sales peak in summer (May–July) and drop in winter (November–February).
- Premium properties: Some outliers in GrLivArea and LotArea represent high-value luxury homes.
Technologies Used
- Python Libraries:
  - pandas – Data manipulation and preprocessing
  - numpy – Numerical calculations
  - matplotlib & seaborn – Data visualization
  - statsmodels & scipy – Statistical analysis
How to Run the Project
- Clone the repository:
  git clone https://github.com/hijirdella/House-Price-Analysis-EDA-and-Correlation-Insights
- Install dependencies:
  pip install -r requirements.txt
- Run the Jupyter Notebook to explore the dataset and insights.
Contact
- Author: Hijir Della Wirasti
- GitHub: House Price Analysis Repository
- LinkedIn: Hijir Della Wirasti
- Email: hijirdw@gmail.com
This project provides a deep dive into the factors influencing house prices, offering valuable insights for real estate professionals and data analysts. Future improvements could include predictive modeling to forecast house prices based on these influential factors.
