In this post, we will explore the structure and usage of a typical `params.yaml` file from an insurance price prediction project.
### Example: The `params.yaml` Breakdown
Below is the content of a typical `params.yaml` file:

    base:
      project: insurance-price-prediction
      random_state: 180
      target_data: expenses
    data_source:
      local_data: data_given/insurance_updated.csv
    load_data:
      raw_data_csv: data/raw/insurance_updated.csv
    raw_data:
      raw: insurance_data/insurance.csv
    split_data:
      train_path: data/processed/train_insurance.csv
      test_path: data/processed/test_insurance.csv
      split_ratio: 0.250
    estimators:
      GradientBoostingRegressor:
        params:
          learning_rate: 0.1001
          n_estimators: 100
          alpha: 0.8
          verbose: 0
          validation_fraction: 0.000001
          tol: 0.0001
          ccp_alpha: 0.1
    model_dirs: saved_models
    reports:
      scores: reports/scores.json
      params: reports/params.json
    webapp_model_dir: prediction_service/model/model.pkl

This `params.yaml` file is divided into different sections, each handling a specific part of the project configuration.
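
To make the later sections concrete, here is a minimal sketch of how a pipeline script might read this file. It assumes PyYAML is installed and that `params.yaml` sits in the working directory; the helper names `read_params` and `get_param` are hypothetical and not part of any library.

```python
from pathlib import Path
from typing import Any


def read_params(config_path: str = "params.yaml") -> dict[str, Any]:
    """Parse the YAML config into a nested dict."""
    # PyYAML is a third-party package (pip install pyyaml). Importing it
    # here keeps get_param usable even where PyYAML is not installed.
    import yaml

    with Path(config_path).open(encoding="utf-8") as f:
        return yaml.safe_load(f)


def get_param(config: dict[str, Any], dotted_key: str) -> Any:
    """Look up a nested value with a dotted path such as 'base.random_state'."""
    value: Any = config
    for part in dotted_key.split("."):
        value = value[part]  # raises KeyError if a key is missing
    return value
```

With the file above, `get_param(read_params(), "split_data.split_ratio")` would return `0.25`, since YAML parses `0.250` as a float.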
### 1. **Base Configuration**

    base:
      project: insurance-price-prediction
      random_state: 180
      target_data: expenses

- **Project**: This key specifies the project name (`insurance-price-prediction`). It is useful when working with multiple projects to keep configurations organized.
- **Random State**: The seed (`180`) makes runs reproducible. Passing the same seed to the train/test split and to model initialization yields the same results on every run.
- **Target Data**: This specifies the target column (`expenses`) that the model predicts.
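
As a rough illustration of how `target_data` might be used, the sketch below separates the target column from the feature columns. It assumes the config is already parsed into a dict and that each data row is a dict keyed by column name; the `split_features_target` helper and the sample rows are made up for this example.

```python
from typing import Any


def split_features_target(
    rows: list[dict[str, Any]], target: str
) -> tuple[list[dict[str, Any]], list[Any]]:
    """Separate the target column from the remaining feature columns."""
    features = [{k: v for k, v in row.items() if k != target} for row in rows]
    labels = [row[target] for row in rows]
    return features, labels


# Parsed `base` section, as it would look after loading params.yaml.
base = {
    "project": "insurance-price-prediction",
    "random_state": 180,
    "target_data": "expenses",
}

# Made-up rows for illustration only; not taken from the real dataset.
rows = [
    {"age": 20, "bmi": 25.0, "expenses": 100.0},
    {"age": 40, "bmi": 30.0, "expenses": 200.0},
]

X, y = split_features_target(rows, base["target_data"])
```

Because the column name comes from the config, retargeting the model to another column only requires editing `params.yaml`.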
### 2. **Data Source**

    data_source:
      local_data: data_given/insurance_updated.csv

- **Local Data**: The path to the input dataset (`data_given/insurance_updated.csv`) on the local file system. This is the file the pipeline reads before any processing or model training.
### 3. **Load Data**

    load_data:
      raw_data_csv: data/raw/insurance_updated.csv

- **Raw Data CSV**: The path (`data/raw/insurance_updated.csv`) where the loaded data is written as a raw CSV file, before any cleaning or processing happens.
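
A load stage that connects `data_source` and `load_data` could look like the sketch below. It assumes the stage does nothing more than copy the file; a real project might also rename columns or validate the schema at this point. The function name `load_and_save` is hypothetical.

```python
import shutil
from pathlib import Path
from typing import Any


def load_and_save(config: dict[str, Any]) -> Path:
    """Copy the source CSV to the raw-data location named in the config."""
    source = Path(config["data_source"]["local_data"])
    destination = Path(config["load_data"]["raw_data_csv"])
    # Create the destination folder (e.g. data/raw/) if it does not exist yet.
    destination.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, destination)
    return destination
```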
### 4. **Raw Data**

    raw_data:
      raw: insurance_data/insurance.csv

- **Raw**: This parameter points to the original data file (`insurance_data/insurance.csv`) that contains the raw data. It is the base file that will be processed or transformed later.
### 5. **Split Data**

    split_data:
      train_path: data/processed/train_insurance.csv
      test_path: data/processed/test_insurance.csv
      split_ratio: 0.250

- **Train Path**: The file path where the processed training data will be saved (`data/processed/train_insurance.csv`).
- **Test Path**: The file path where the processed test data will be saved (`data/processed/test_insurance.csv`).
- **Split Ratio**: This indicates the proportion of the dataset that will be allocated to the test set. Here, the value is 0.250, meaning 25% of the data will be used for testing, while the remaining 75% will be used for training.
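
The interplay between `split_ratio` and `random_state` can be sketched with the standard library alone. Projects like this one usually call scikit-learn's `train_test_split(data, test_size=..., random_state=...)` instead; the `train_test_split_rows` function below is a simplified stand-in written for this example, not that library function.

```python
import random
from typing import Any, Sequence


def train_test_split_rows(
    rows: Sequence[Any], test_ratio: float, seed: int
) -> tuple[list[Any], list[Any]]:
    """Shuffle rows with a fixed seed, then split off `test_ratio` for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order on every run
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


rows = list(range(100))  # stand-in for 100 dataset rows
train, test = train_test_split_rows(rows, test_ratio=0.25, seed=180)
print(len(train), len(test))  # 75 25
```

Running the split twice with the same seed returns identical train and test sets, which is the reproducibility the `random_state` key is meant to provide.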
### 6. **Estimators**

    estimators:
      GradientBoostingRegressor:
        params:
          learning_rate: 0.1001
          n_estimators: 100
          alpha: 0.8
          verbose: 0
          validation_fraction: 0.000001
          tol: 0.0001
          ccp_alpha: 0.1

This section defines the machine learning model to be used, in this case scikit-learn's `GradientBoostingRegressor`. The parameters for this model are set as follows:

- **Learning Rate**: This value (`0.1001`) shrinks the contribution of each tree. Lower values usually need more boosting stages to reach the same fit.
- **Number of Estimators**: This sets the number of boosting stages (`100`) to be run during training.
- **Alpha**: In `GradientBoostingRegressor`, `alpha` (`0.8`) is the quantile used by the Huber and quantile loss functions, not a regularization strength. It only takes effect when `loss` is `huber` or `quantile`. Since this config does not set `loss`, the default squared-error loss applies and the value has no effect.
- **Verbose**: A verbosity flag (`0`), controlling whether progress logs are printed during training. `0` means no output.
- **Validation Fraction**: The fraction (`0.000001`) of the training data set aside as a validation set for early stopping. It is only used when `n_iter_no_change` is set, which this config does not do, so the value has no effect here.
- **Tolerance**: The early-stopping tolerance (`0.0001`). Training stops when the loss fails to improve by at least this amount for `n_iter_no_change` iterations, so it also only matters when `n_iter_no_change` is set.
- **CCP Alpha**: This is the complexity parameter for Minimal Cost-Complexity Pruning (`0.1`). It prunes each individual tree to reduce overfitting; larger values prune more aggressively.
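
One way to turn this section into a model object is to unpack the `params` mapping as keyword arguments. The sketch below assumes the config is already parsed into a dict; `build_estimator` and the `registry` argument are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Mapping


def build_estimator(
    config: Mapping[str, Any],
    name: str,
    registry: Mapping[str, Callable[..., Any]],
) -> Any:
    """Instantiate the estimator called `name` with its params from the config."""
    params = config["estimators"][name]["params"]
    # Each YAML key becomes a keyword argument of the estimator's constructor.
    return registry[name](**params)


# With scikit-learn installed, the registry could be set up like this:
#   from sklearn.ensemble import GradientBoostingRegressor
#   registry = {"GradientBoostingRegressor": GradientBoostingRegressor}
#   model = build_estimator(config, "GradientBoostingRegressor", registry)
```

Keeping the class lookup in a registry means a second model can be added under `estimators` and selected by name without touching the training code.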
### 7. **Model Directories**

    model_dirs: saved_models

- **Model Dirs**: The directory (`saved_models`) where trained models are saved so they can be reused or shared later.
### 8. **Reports**

    reports:
      scores: reports/scores.json
      params: reports/params.json

- **Scores**: The path (`reports/scores.json`) where the model's performance metrics are stored. For a regression task like this one, these are metrics such as RMSE, MAE, or R².
- **Params**: The path (`reports/params.json`) where the parameters of the trained model are saved for future reference or model reproduction.
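
Writing both report files needs only the standard library. The sketch below assumes RMSE is one of the recorded metrics and that the config is already parsed into a dict; `rmse` and `write_reports` are illustrative helpers, not functions from the project.

```python
import json
import math
from pathlib import Path
from typing import Any, Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def write_reports(
    config: dict[str, Any], scores: dict[str, float], params: dict[str, Any]
) -> None:
    """Write metrics and hyperparameters to the JSON paths under `reports`."""
    for key, payload in (("scores", scores), ("params", params)):
        path = Path(config["reports"][key])
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(payload, indent=4), encoding="utf-8")
```

A training script could then call `write_reports(config, {"rmse": rmse(y_test, predictions)}, model_params)` after evaluation, where `y_test`, `predictions`, and `model_params` come from the earlier stages.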
### 9. **Web Application Model Directory**

    webapp_model_dir: prediction_service/model/model.pkl

- **Webapp Model Directory**: Despite the `_dir` suffix, this value is a file path: the location of the serialized model (`prediction_service/model/model.pkl`) that the web service loads for real-time predictions.
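
Saving and reloading the model at this path might look like the sketch below. The `.pkl` extension suggests pickle-style serialization, but the source does not say which library is used (many projects use `joblib` instead), so the use of `pickle` here is an assumption, as are the helper names.

```python
import pickle
from pathlib import Path
from typing import Any


def save_model(model: Any, model_path: str) -> None:
    """Serialize `model` to `model_path`, creating parent folders as needed."""
    path = Path(model_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(model_path: str) -> Any:
    """Load a pickled model. Only unpickle files from a trusted source."""
    with Path(model_path).open("rb") as f:
        return pickle.load(f)
```

The training stage would call `save_model(model, config["webapp_model_dir"])`, and the web service would call `load_model` with the same path once at startup rather than on every request.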
### Conclusion
The `params.yaml` file is an essential tool in machine learning projects, providing a clean, structured way to manage configuration settings. It allows for easy adjustments, reproducibility, and scalability across different stages of the pipeline, from data ingestion to model evaluation.
Key points:
- **Project Structure**: Well-organized projects are easier to maintain and reproduce.
- **Parameter Control**: Centralized control of all settings ensures consistency.
- **Scalability**: Easily modify parameters or switch models without changing the core code.
Using YAML files like `params.yaml` simplifies the configuration and makes collaboration among data scientists, engineers, and developers more efficient.