The `sklearn.datasets` module in scikit-learn provides a variety of datasets useful for machine learning tasks. Here are some commonly used datasets:
1. **`load_boston`**:
- Boston House Prices dataset. Deprecated in scikit-learn 1.0 and removed in 1.2, so it is not available in current releases.
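2. **`load_iris`**:
- Iris dataset (classification, 3 species).
3. **`load_digits`**:
- Digits dataset (classification of 8x8 images of handwritten digits).
4. **`load_diabetes`**:
- Diabetes dataset (regression).
5. **`load_wine`**:
- Wine recognition dataset (classification, 3 classes).
6. **`load_breast_cancer`**:
- Breast cancer Wisconsin dataset (binary classification).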
7. **`fetch_openml`**:
- Fetches datasets from the OpenML repository.
8. **`make_classification`**:
- Generates a random classification problem.
9. **`make_regression`**:
- Generates a random regression problem.
10. **`make_blobs`**:
- Generates isotropic Gaussian blobs for clustering.
11. **`make_moons`**:
- Generates two interleaving half circles for binary classification.
12. **`make_circles`**:
- Generates a large circle containing a smaller circle for binary classification.
13. **`make_checkerboard`**:
- Generates an array with a block checkerboard structure, intended for biclustering rather than classification.
These are only a few of the loaders and generators in `sklearn.datasets`. For details and additional datasets, see the datasets section of the official scikit-learn documentation.
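As a rough illustration of how the loaders and generators listed above differ, the sketch below loads one bundled dataset and generates two synthetic ones. The sample counts, noise level, and `random_state` values are arbitrary choices made for this example, not recommendations.

```python
from sklearn.datasets import load_wine, make_blobs, make_moons

# load_* functions return a Bunch object with .data and .target attributes
wine = load_wine()
print("wine:", wine.data.shape, wine.target.shape)

# make_* functions return an (X, y) tuple of freshly generated samples.
# n_samples, centers, noise and random_state are illustrative values only.
X_blobs, y_blobs = make_blobs(n_samples=100, centers=3, n_features=2, random_state=0)
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=0)

print("blobs:", X_blobs.shape, "labels:", sorted(set(y_blobs.tolist())))
print("moons:", X_moons.shape, "labels:", sorted(set(y_moons.tolist())))
```

The bundled loaders ship with scikit-learn and need no network access, whereas `fetch_openml` downloads data on first use.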
The `iris` dataset is a well-known dataset in machine learning, commonly used for classification problems. Here's a breakdown of what `iris.data` and `iris.target` mean:
### **1. What is `iris.data`? Why does it have 4 columns?**
- `iris.data` is a **NumPy array** of shape `(150, 4)` that contains the feature values (measurements) of **150 samples of iris flowers**.
- Each **row** represents an **individual flower**.
- Each **column** represents a **specific feature** of the flower:
  1. **Sepal length (cm)**
  2. **Sepal width (cm)**
  3. **Petal length (cm)**
  4. **Petal width (cm)**
These four features are used to classify the flowers into different species.
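The row and column layout described above can be checked directly. This is a minimal sketch that uses the `feature_names` attribute of the object returned by `load_iris`; the per-column summary is only there for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# 150 rows (flowers) by 4 columns (measurements)
print(iris.data.shape)

# Column names, in the same order as the columns of iris.data
print(iris.feature_names)

# Transposing gives one array per feature, so each column can be summarised
for name, column in zip(iris.feature_names, iris.data.T):
    print(f"{name}: min={column.min()}, max={column.max()}")
```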
### **2. What is `iris.target`? Why is it a flat array of 0s, 1s, and 2s?**
- `iris.target` is a **1D NumPy array** that contains the **labels** (or class IDs) for each flower.
- The numbers **0, 1, and 2** represent the three different species of iris flowers:
  - **0** → `setosa`
  - **1** → `versicolor`
  - **2** → `virginica`
- So, if `iris.target[0]` is `0`, it means the first flower sample in `iris.data` is an **Iris Setosa**.
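The mapping from class IDs to species names is stored in `iris.target_names`, so the integer labels can be translated back with plain NumPy indexing. The sketch below assumes nothing beyond that attribute; the variable names `species` and `counts` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# One label per flower, so the array is one-dimensional
print(iris.target.shape)

# Position i in target_names is the species name for class ID i
print(iris.target_names)

# Index target_names with the label array to get a name for every sample
species = iris.target_names[iris.target]
print(species[0])

# Count how many samples fall in each class
counts = np.bincount(iris.target)
print(counts)
```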
### **Example: Understanding `iris.data` and `iris.target`**
Let’s extract the first row:
```python
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Print the first sample
print("First sample (features):", iris.data[0])
print("First sample (target/class):", iris.target[0])
```
**Output:**
```
First sample (features): [5.1 3.5 1.4 0.2]
First sample (target/class): 0
```
- The flower has **sepal length = 5.1 cm**, **sepal width = 3.5 cm**, **petal length = 1.4 cm**, **petal width = 0.2 cm**.
- The class label is **0**, which means it is an **Iris Setosa**.