Okay, here’s a sample dataset for a house price prediction model, incorporating many of the features we discussed. This data is synthetic and intended to illustrate the variety of features.
UniqueID,Size_LivingArea_SqFt,Size_Lot_SqFt,Size_TotalArea_SqFt,Rooms_Total,Bedrooms,Bathrooms_Full,Bathrooms_Half,Basement_Area_SqFt,Basement_Finished,Garage_Cars,Fireplaces,Porch_Area_SqFt,Year_Built,Year_Remodeled,Condition_Overall,Quality_Overall,Building_Type,House_Style,Foundation_Type,Roof_Material,Exterior_Material,Heating_Type,Cooling_Type,Kitchen_Quality,Bathroom_Quality,Fireplace_Quality,Basement_Quality,Stories,Floor_Material,Neighborhood,Proximity_Schools_Miles,Proximity_Parks_Miles,Proximity_PublicTransport_Miles,Proximity_Shopping_Miles,Proximity_Hospitals_Miles,Safety_CrimeRate_Index,Environmental_NoiseLevel_dB,Environmental_AirQuality_Index,Flood_Zone,View,Time_of_Sale,Interest_Rate,Inflation_Rate,Unemployment_Rate,Housing_Inventory,Economic_Growth_Rate,Sale_Price
1,1800,7500,2500,7,3,2,1,700,1,2,1,150,1995,2010,7,7,House,Ranch,Slab,Composition Shingle,Brick,Forced Air,Central AC,Good,Good,Average,Average,1,Hardwood,Bentonville Central,0.5,1.2,0.8,1.5,2.0,65,45,35,No,None,2024-08,6.2,3.5,4.2,0.05,2.5,285000
2,2200,10000,3000,8,4,3,0,800,0,2,1,200,2005,2005,6,6,House,Two-Story,Foundation,Composition Shingle,Siding,Forced Air,Central AC,Average,Average,Average,Poor,2,Carpet,Bentonville West,1.5,0.3,2.5,0.5,0.8,40,55,50,No,Trees,2024-11,6.5,3.8,4.0,0.03,2.8,350000
3,1500,6000,1800,6,3,1,1,0,0,1,0,100,1980,1980,5,5,House,Split-Level,Crawl Space,Asphalt,Vinyl Siding,Baseboard Heat,Window AC,Fair,Fair,None,None,1.5,Carpet,Bella Vista,3.0,0.8,0.5,2.0,5.0,80,35,25,Yes,None,2024-05,5.8,3.2,4.5,0.07,2.2,195000
4,2800,12000,3500,9,4,3,1,1000,1,3,2,250,2015,2018,8,8,House,Traditional,Foundation,Composition Shingle,Brick Veneer,Forced Air,Central AC,Excellent,Excellent,Good,Good,2,Hardwood,Centerton,0.2,2.0,1.0,0.3,1.0,50,40,30,No,Park View,2025-01,6.8,4.0,3.8,0.02,3.0,450000
5,1200,5000,1500,5,2,1,0,0,0,1,0,50,1970,1970,4,4,House,Ranch,Slab,Asphalt,Aluminum Siding,Wall Unit,Window AC,Poor,Fair,None,None,1,Vinyl,Rogers,2.5,1.5,3.5,1.0,3.0,90,60,65,No,None,2024-07,6.0,3.4,4.3,0.06,2.4,150000
6,3200,15000,4000,10,5,4,1,1200,1,3,2,300,2020,2022,9,9,House,Modern,Foundation,Metal,Stucco,Geothermal,Central AC,Excellent,Excellent,Excellent,Excellent,2,Tile,Bentonville Central,0.1,0.5,0.2,0.8,0.5,30,30,20,No,City View,2025-03,7.0,4.2,3.5,0.01,3.2,580000
7,1900,8000,2600,7,3,2,1,750,1,2,1,180,1998,2015,7,8,House,Colonial,Foundation,Composition Shingle,Brick,Forced Air,Central AC,Good,Excellent,Average,Good,2,Hardwood,Bella Vista,2.0,1.0,1.5,1.2,4.0,70,48,38,No,Trees,2024-09,6.3,3.6,4.1,0.04,2.6,310000
8,2500,11000,3300,8,4,2,1,900,1,2,1,220,2010,2010,6,7,House,Ranch,Slab,Composition Shingle,Siding,Forced Air,Central AC,Average,Good,Average,Average,1,Carpet,Rogers,1.0,2.5,2.0,0.7,2.5,55,52,45,No,None,2024-12,6.6,3.9,3.9,0.035,2.9,390000
9,1600,6500,2000,6,3,2,0,0,0,1,0,120,1985,1985,5,5,House,Split-Level,Crawl Space,Asphalt,Vinyl Siding,Baseboard Heat,Window AC,Fair,Fair,None,None,1.5,Vinyl,Centerton,2.8,0.5,0.3,2.5,1.5,85,40,30,Yes,None,2024-06,5.9,3.3,4.4,0.065,2.3,220000
10,3000,13000,3800,9,4,3,1,1100,1,3,2,280,2018,2020,8,9,House,Traditional,Foundation,Composition Shingle,Brick Veneer,Forced Air,Central AC,Excellent,Excellent,Good,Good,2,Hardwood,Bentonville West,0.3,1.8,0.9,0.5,0.7,45,35,28,No,Park View,2025-02,6.9,4.1,3.7,0.015,3.1,510000
Explanation of the Columns:
- UniqueID: A unique identifier for each house.
- Size_LivingArea_SqFt: The square footage of the living space.
- Size_Lot_SqFt: The square footage of the land lot.
- Size_TotalArea_SqFt: The total square footage including basement, etc.
- Rooms_Total: The total number of rooms.
- Bedrooms: The number of bedrooms.
- Bathrooms_Full: The number of full bathrooms.
- Bathrooms_Half: The number of half bathrooms.
- Basement_Area_SqFt: The square footage of the basement.
- Basement_Finished: 1 if the basement is finished, 0 otherwise.
- Garage_Cars: The number of cars the garage can hold.
- Fireplaces: The number of fireplaces.
- Porch_Area_SqFt: The square footage of porches, decks, or patios.
- Year_Built: The year the house was built.
- Year_Remodeled: The year the house was last remodeled (if applicable).
- Condition_Overall: An overall rating of the house’s condition (1-10).
- Quality_Overall: An overall rating of the house’s material and finish quality (1-10).
- Building_Type: The type of building (e.g., House, Townhouse, Condo).
- House_Style: The architectural style of the house (e.g., Ranch, Two-Story).
- Foundation_Type: The type of foundation (e.g., Slab, Foundation, Crawl Space).
- Roof_Material: The material of the roof (e.g., Composition Shingle, Asphalt).
- Exterior_Material: The material of the exterior (e.g., Brick, Siding).
- Heating_Type: The type of heating system (e.g., Forced Air, Baseboard Heat).
- Cooling_Type: The type of cooling system (e.g., Central AC, Window AC).
- Kitchen_Quality: A rating of the kitchen quality (e.g., Poor, Fair, Average, Good, Excellent).
- Bathroom_Quality: A rating of the bathroom quality.
- Fireplace_Quality: A rating of the fireplace quality.
- Basement_Quality: A rating of the basement quality.
- Stories: The number of stories in the house.
- Floor_Material: The primary flooring material (e.g., Hardwood, Carpet, Tile).
- Neighborhood: The name of the neighborhood (using Bentonville, Arkansas area examples).
- Proximity_Schools_Miles: The distance to the nearest good school in miles.
- Proximity_Parks_Miles: The distance to the nearest park in miles.
- Proximity_PublicTransport_Miles: The distance to the nearest public transportation stop in miles.
- Proximity_Shopping_Miles: The distance to the nearest shopping center in miles.
- Proximity_Hospitals_Miles: The distance to the nearest hospital in miles.
- Safety_CrimeRate_Index: A numerical index representing the crime rate (lower is safer).
- Environmental_NoiseLevel_dB: The average noise level in decibels.
- Environmental_AirQuality_Index: An index representing the air quality (lower is better).
- Flood_Zone: “Yes” if in a flood zone, “No” otherwise.
- View: A description of any significant view (e.g., Park View, City View, None).
- Time_of_Sale: The month of the sale, formatted as YYYY-MM.
- Interest_Rate: The prevailing mortgage interest rate at the time of sale.
- Inflation_Rate: The inflation rate at the time of sale.
- Unemployment_Rate: The unemployment rate in the area at the time of sale.
- Housing_Inventory: A measure of the available housing inventory (lower values indicate a tighter market with more competition among buyers).
- Economic_Growth_Rate: The economic growth rate at the time of sale.
- Sale_Price: The target variable – the price the house sold for.
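Time_of_Sale is stored as a year-month string, so it has to be converted to numbers before a model can use it. The sketch below shows one possible way to do that with pandas. The derived column names Sale_Year and Sale_Month are illustrative assumptions and are not part of the sample dataset.

```python
import pandas as pd

# Two illustrative rows using the Time_of_Sale format from the sample (YYYY-MM).
sales = pd.DataFrame({"Time_of_Sale": ["2024-08", "2025-01"]})

# Parse the year-month strings into timestamps (first day of each month).
parsed = pd.to_datetime(sales["Time_of_Sale"], format="%Y-%m")

# Hypothetical derived columns; the names are illustrative only.
sales["Sale_Year"] = parsed.dt.year
sales["Sale_Month"] = parsed.dt.month

print(sales)
```

Splitting the date into year and month is only one option; a single "months since the first sale" column would work as well.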
How to Use This Data in Vertex AI:
- Save as CSV: Save this data as a .csv file (e.g., house_price_data.csv).
- Upload to GCS: Upload this CSV file to a Google Cloud Storage (GCS) bucket.
- Create a Vertex AI Dataset: In the Vertex AI console, create a new Tabular Dataset and point it to the CSV file in your GCS bucket.
- Train Your Model: You can then use this Dataset to train an AutoML Tabular Regression model or use it as the data source for a custom training job.
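The upload and dataset-creation steps can also be scripted rather than done in the console. The following is a rough sketch using the google-cloud-storage and google-cloud-aiplatform Python packages; the project, location, bucket, object path, and display name are all placeholders you would replace, and the function is defined but not called here because it needs real credentials.

```python
def gcs_uri(bucket_name: str, blob_name: str) -> str:
    """Build the gs:// URI for an object in a GCS bucket."""
    return f"gs://{bucket_name}/{blob_name}"


def upload_and_create_dataset(project: str, location: str, bucket_name: str,
                              local_csv: str = "house_price_data.csv"):
    """Upload the CSV to GCS and register it as a Vertex AI tabular dataset."""
    # Imported lazily so gcs_uri() works without the Google Cloud SDKs installed.
    from google.cloud import aiplatform, storage

    blob_name = f"data/{local_csv}"  # hypothetical object path inside the bucket
    bucket = storage.Client(project=project).bucket(bucket_name)
    bucket.blob(blob_name).upload_from_filename(local_csv)

    aiplatform.init(project=project, location=location)
    return aiplatform.TabularDataset.create(
        display_name="house-prices-sample",  # hypothetical display name
        gcs_source=[gcs_uri(bucket_name, blob_name)],
    )


print(gcs_uri("my-bucket", "data/house_price_data.csv"))
```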
Remember that this is just a small sample. For a real-world model, you would need a much larger and more diverse dataset. You would also need to handle data cleaning, preprocessing, and feature engineering steps within Vertex AI or your custom training code.
PyTorch model for the dataset above
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
from sklearn.model_selection import train_test_split
import pandas as pd
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# 1. Load the dataset
data = """UniqueID,Size_LivingArea_SqFt,Size_Lot_SqFt,Size_TotalArea_SqFt,Rooms_Total,Bedrooms,Bathrooms_Full,Bathrooms_Half,Basement_Area_SqFt,Basement_Finished,Garage_Cars,Fireplaces,Porch_Area_SqFt,Year_Built,Year_Remodeled,Condition_Overall,Quality_Overall,Building_Type,House_Style,Foundation_Type,Roof_Material,Exterior_Material,Heating_Type,Cooling_Type,Kitchen_Quality,Bathroom_Quality,Fireplace_Quality,Basement_Quality,Stories,Floor_Material,Neighborhood,Proximity_Schools_Miles,Proximity_Parks_Miles,Proximity_PublicTransport_Miles,Proximity_Shopping_Miles,Proximity_Hospitals_Miles,Safety_CrimeRate_Index,Environmental_NoiseLevel_dB,Environmental_AirQuality_Index,Flood_Zone,View,Time_of_Sale,Interest_Rate,Inflation_Rate,Unemployment_Rate,Housing_Inventory,Economic_Growth_Rate,Sale_Price
1,1800,7500,2500,7,3,2,1,700,1,2,1,150,1995,2010,7,7,House,Ranch,Slab,Composition Shingle,Brick,Forced Air,Central AC,Good,Good,Average,Average,1,Hardwood,Bentonville Central,0.5,1.2,0.8,1.5,2.0,65,45,35,No,None,2024-08,6.2,3.5,4.2,0.05,2.5,285000
2,2200,10000,3000,8,4,3,0,800,0,2,1,200,2005,2005,6,6,House,Two-Story,Foundation,Composition Shingle,Siding,Forced Air,Central AC,Average,Average,Average,Poor,2,Carpet,Bentonville West,1.5,0.3,2.5,0.5,0.8,40,55,50,No,Trees,2024-11,6.5,3.8,4.0,0.03,2.8,350000
3,1500,6000,1800,6,3,1,1,0,0,1,0,100,1980,1980,5,5,House,Split-Level,Crawl Space,Asphalt,Vinyl Siding,Baseboard Heat,Window AC,Fair,Fair,None,None,1.5,Carpet,Bella Vista,3.0,0.8,0.5,2.0,5.0,80,35,25,Yes,None,2024-05,5.8,3.2,4.5,0.07,2.2,195000
4,2800,12000,3500,9,4,3,1,1000,1,3,2,250,2015,2018,8,8,House,Traditional,Foundation,Composition Shingle,Brick Veneer,Forced Air,Central AC,Excellent,Excellent,Good,Good,2,Hardwood,Centerton,0.2,2.0,1.0,0.3,1.0,50,40,30,No,Park View,2025-01,6.8,4.0,3.8,0.02,3.0,450000
5,1200,5000,1500,5,2,1,0,0,0,1,0,50,1970,1970,4,4,House,Ranch,Slab,Asphalt,Aluminum Siding,Wall Unit,Window AC,Poor,Fair,None,None,1,Vinyl,Rogers,2.5,1.5,3.5,1.0,3.0,90,60,65,No,None,2024-07,6.0,3.4,4.3,0.06,2.4,150000
6,3200,15000,4000,10,5,4,1,1200,1,3,2,300,2020,2022,9,9,House,Modern,Foundation,Metal,Stucco,Geothermal,Central AC,Excellent,Excellent,Excellent,Excellent,2,Tile,Bentonville Central,0.1,0.5,0.2,0.8,0.5,30,30,20,No,City View,2025-03,7.0,4.2,3.5,0.01,3.2,580000
7,1900,8000,2600,7,3,2,1,750,1,2,1,180,1998,2015,7,8,House,Colonial,Foundation,Composition Shingle,Brick,Forced Air,Central AC,Good,Excellent,Average,Good,2,Hardwood,Bella Vista,2.0,1.0,1.5,1.2,4.0,70,48,38,No,Trees,2024-09,6.3,3.6,4.1,0.04,2.6,310000
8,2500,11000,3300,8,4,2,1,900,1,2,1,220,2010,2010,6,7,House,Ranch,Slab,Composition Shingle,Siding,Forced Air,Central AC,Average,Good,Average,Average,1,Carpet,Rogers,1.0,2.5,2.0,0.7,2.5,55,52,45,No,None,2024-12,6.6,3.9,3.9,0.035,2.9,390000
9,1600,6500,2000,6,3,2,0,0,0,1,0,120,1985,1985,5,5,House,Split-Level,Crawl Space,Asphalt,Vinyl Siding,Baseboard Heat,Window AC,Fair,Fair,None,None,1.5,Vinyl,Centerton,2.8,0.5,0.3,2.5,1.5,85,40,30,Yes,None,2024-06,5.9,3.3,4.4,0.065,2.3,220000
10,3000,13000,3800,9,4,3,1,1100,1,3,2,280,2018,2020,8,9,House,Traditional,Foundation,Composition Shingle,Brick Veneer,Forced Air,Central AC,Excellent,Excellent,Good,Good,2,Hardwood,Bentonville West,0.3,1.8,0.9,0.5,0.7,45,35,28,No,Park View,2025-02,6.9,4.1,3.7,0.015,3.1,510000
"""
from io import StringIO
df = pd.read_csv(StringIO(data))
# 2. Preprocessing
# Identify numerical and categorical features
numerical_features = ['Size_LivingArea_SqFt', 'Size_Lot_SqFt', 'Size_TotalArea_SqFt', 'Rooms_Total',
                      'Bedrooms', 'Bathrooms_Full', 'Bathrooms_Half', 'Basement_Area_SqFt',
                      'Basement_Finished',  # 0/1 flag, treated as numeric
                      'Garage_Cars', 'Fireplaces', 'Porch_Area_SqFt', 'Year_Built', 'Year_Remodeled',
                      'Condition_Overall', 'Quality_Overall', 'Stories',
                      'Proximity_Schools_Miles', 'Proximity_Parks_Miles',
                      'Proximity_PublicTransport_Miles', 'Proximity_Shopping_Miles',
                      'Proximity_Hospitals_Miles', 'Safety_CrimeRate_Index',
                      'Environmental_NoiseLevel_dB', 'Environmental_AirQuality_Index',
                      'Interest_Rate', 'Inflation_Rate', 'Unemployment_Rate',
                      'Housing_Inventory', 'Economic_Growth_Rate']
categorical_features = ['Building_Type', 'House_Style', 'Foundation_Type', 'Roof_Material',
                        'Exterior_Material', 'Heating_Type', 'Cooling_Type', 'Kitchen_Quality',
                        'Bathroom_Quality', 'Fireplace_Quality', 'Basement_Quality',
                        'Floor_Material', 'Neighborhood', 'Flood_Zone', 'View']
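# Create preprocessor
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_features)
    ],
    # Drop unlisted columns (UniqueID, Time_of_Sale): the ID carries no signal and
    # the 'YYYY-MM' string cannot be cast to a float tensor
    remainder='drop',
    # Always return a dense array so torch.tensor() can consume it
    sparse_threshold=0.0
)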
# Separate features and target
X = df.drop('Sale_Price', axis=1)
y = df['Sale_Price'].values.reshape(-1, 1)
# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Fit the preprocessor on the training data only, then transform both splits
X_train_processed = preprocessor.fit_transform(X_train)
X_test_processed = preprocessor.transform(X_test)
# Convert processed features and targets to PyTorch Tensors
X_train_tensor = torch.tensor(X_train_processed, dtype=torch.float32)
X_test_tensor = torch.tensor(X_test_processed, dtype=torch.float32)
y_train_tensor = torch.tensor(y_train, dtype=torch.float32)
y_test_tensor = torch.tensor(y_test, dtype=torch.float32)
# 3. Define the PyTorch Dataset
class HousePriceDataset(Dataset):
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels
        self.n_samples = features.shape[0]

    def __getitem__(self, index):
        return self.features[index], self.labels[index]

    def __len__(self):
        return self.n_samples
train_dataset = HousePriceDataset(X_train_tensor, y_train_tensor)
test_dataset = HousePriceDataset(X_test_tensor, y_test_tensor)
# 4. Define the DataLoader
batch_size = 8
train_loader = DataLoader(dataset=train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(dataset=test_dataset, batch_size=batch_size, shuffle=False)
# 5. Define the Neural Network Model
class HousePriceModel(nn.Module):
    def __init__(self, input_size):
        super(HousePriceModel, self).__init__()
        self.linear1 = nn.Linear(input_size, 64)
        self.relu = nn.ReLU()
        self.linear2 = nn.Linear(64, 32)
        self.relu2 = nn.ReLU()
        self.linear3 = nn.Linear(32, 1)  # Output is a single predicted price

    def forward(self, x):
        out = self.linear1(x)
        out = self.relu(out)
        out = self.linear2(out)
        out = self.relu2(out)
        out = self.linear3(out)
        return out
# Get the input size (number of features after preprocessing)
input_size = X_train_tensor.shape[1]
model = HousePriceModel(input_size)
# 6. Define Loss Function and Optimizer
learning_rate = 0.01
criterion = nn.MSELoss() # Mean Squared Error for regression
optimizer = optim.Adam(model.parameters(), lr=learning_rate)
# 7. Training Loop
num_epochs = 100
for epoch in range(num_epochs):
    for batch_idx, (features, labels) in enumerate(train_loader):
        # Forward pass
        outputs = model(features)
        loss = criterion(outputs, labels)
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # With 8 training rows and batch_size=8 there is only one batch per epoch,
    # so log every 10 epochs instead of every 10 batches
    if (epoch + 1) % 10 == 0:
        print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
print('Finished Training')
# 8. Evaluation
model.eval()  # Switch to evaluation mode before scoring
with torch.no_grad():
    test_loss = 0
    for features, labels in test_loader:
        outputs = model(features)
        test_loss += criterion(outputs, labels).item()
    avg_test_loss = test_loss / len(test_loader)
    print(f'Test Loss: {avg_test_loss:.4f}')
# 9. Save the Trained Model (for Vertex AI deployment)
torch.save(model.state_dict(), 'house_price_model.pth')
print('Trained model saved as house_price_model.pth')
# To deploy on Vertex AI, you would typically need to:
# 1. Upload 'house_price_model.pth' and potentially the preprocessing pipeline
# (saved using pickle) to Google Cloud Storage.
# 2. Create a custom serving container that loads the PyTorch model and the
# preprocessing steps.
# 3. Deploy the container and model to a Vertex AI Endpoint.
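The test loss printed by the script is a mean squared error, so it is expressed in squared dollars and can look alarmingly large. Taking the square root gives the RMSE, which is back on the dollar scale. A small worked sketch, with made-up predictions and prices chosen purely for illustration:

```python
import math

import torch
import torch.nn as nn

# Hypothetical predictions and true prices in dollars, for illustration only.
predicted = torch.tensor([[300000.0], [200000.0]])
actual = torch.tensor([[310000.0], [190000.0]])

mse = nn.MSELoss()(predicted, actual).item()  # mean of squared errors, in dollars squared
rmse = math.sqrt(mse)                          # square root brings it back to dollars

print(f"MSE: {mse:.0f}, RMSE: {rmse:.0f}")
```

Both predictions here are off by $10,000, so the MSE is 100,000,000 and the RMSE is 10,000.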
Explanation:
- Load Data: The provided sample data is loaded using pandas.
- Preprocessing:
  - Identify Feature Types: Numerical and categorical features are separated.
  - Create Preprocessor: ColumnTransformer from sklearn.compose is used to apply different preprocessing steps to different columns.
  - StandardScaler: Numerical features are scaled to have zero mean and unit variance.
  - OneHotEncoder: Categorical features are converted into a one-hot encoded format. handle_unknown='ignore' is used to avoid errors if unseen categories appear during prediction.
  - Fit and Transform: The preprocessor is fitted on the training data and then used to transform both the training and testing data.
  - Convert to Tensors: The processed NumPy arrays are converted to PyTorch Tensors.
- PyTorch Dataset: A custom HousePriceDataset class is created to load the features and labels in a PyTorch-friendly way.
- DataLoader: DataLoader is used to create iterable batches of data for training and evaluation.
- Neural Network Model (HousePriceModel):
  - A simple feedforward neural network with three linear layers and ReLU activation functions is defined.
  - The output layer has a single neuron for predicting the house price.
- Loss Function and Optimizer:
  - nn.MSELoss() (Mean Squared Error) is chosen as the loss function, suitable for regression tasks.
  - optim.Adam() is a popular and effective optimization algorithm.
- Training Loop:
  - The model iterates through the training data for a specified number of epochs.
  - In each batch:
    - The forward pass calculates the model’s predictions.
    - The loss is computed.
    - Gradients are calculated using backpropagation (loss.backward()).
    - The optimizer updates the model’s weights (optimizer.step()).
- Evaluation:
  - The model is set to evaluation mode (model.eval()).
  - The test loss is calculated without tracking gradients (torch.no_grad()).
  - The average test loss is printed.
- Save Model: The trained model’s state dictionary (the learned weights and biases) is saved to a .pth file.
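To make the preprocessing step above more concrete, here is a small sketch of how StandardScaler and OneHotEncoder behave inside a ColumnTransformer when a category shows up that was never seen during fitting. The three training rows and the "Modern" query row are made up for illustration, and demo_preprocessor is a stand-in, not the preprocessor from the script.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

train = pd.DataFrame({
    "Size_LivingArea_SqFt": [1200, 1800, 2400],
    "House_Style": ["Ranch", "Colonial", "Ranch"],
})
# "Modern" never appears in the training rows.
new = pd.DataFrame({"Size_LivingArea_SqFt": [1800], "House_Style": ["Modern"]})

demo_preprocessor = ColumnTransformer(
    transformers=[
        ("num", StandardScaler(), ["Size_LivingArea_SqFt"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["House_Style"]),
    ],
    sparse_threshold=0.0,  # always return a dense NumPy array
)

train_matrix = demo_preprocessor.fit_transform(train)  # 1 scaled column + 2 one-hot columns
new_matrix = demo_preprocessor.transform(new)

print(train_matrix.shape)
print(new_matrix)
```

The unseen style is encoded as all zeros in the one-hot columns instead of raising an error, and the size of 1800 sq ft scales to 0 because it equals the training mean.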
To deploy this model on Vertex AI:
- Save Preprocessor: You would also need to save the fitted preprocessor (using pickle) so you can apply the same transformations to incoming prediction data.
- Create Serving Container: You would need to create a custom Docker container that includes:
  - Your PyTorch model (house_price_model.pth).
  - The saved preprocessor.
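The save-and-restore round trip that a serving container would rely on might look roughly like the sketch below. It uses a tiny StandardScaler and a single nn.Linear layer as stand-ins for the fitted preprocessor and HousePriceModel, and in-memory buffers instead of files; the predict function is a hypothetical helper, not a Vertex AI API.

```python
import io
import pickle

import numpy as np
import torch
import torch.nn as nn
from sklearn.preprocessing import StandardScaler

# Stand-ins for the fitted preprocessor and the trained network.
scaler = StandardScaler().fit(np.array([[1200.0], [1800.0], [2400.0]]))
net = nn.Linear(1, 1)

# Serialize both artifacts (in-memory buffers here; files on GCS in practice).
scaler_bytes = pickle.dumps(scaler)
weights_buffer = io.BytesIO()
torch.save(net.state_dict(), weights_buffer)

# Inside the serving container: restore both artifacts.
restored_scaler = pickle.loads(scaler_bytes)
restored_net = nn.Linear(1, 1)  # same architecture as the saved model
weights_buffer.seek(0)
restored_net.load_state_dict(torch.load(weights_buffer))
restored_net.eval()


def predict(raw_rows):
    """Apply the saved preprocessing, then run the restored model."""
    features = torch.tensor(restored_scaler.transform(raw_rows), dtype=torch.float32)
    with torch.no_grad():
        return restored_net(features)


request = np.array([[2000.0]])
print(predict(request))
```

Because only the state dictionary is saved, the container has to rebuild the same model class before loading the weights. Pickled preprocessors should also be loaded with the same scikit-learn version that produced them.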