Yet Another Data Science Blog: Datasets

Saturday, August 3, 2024

Common Datasets in sklearn.datasets for Machine Learning

**The `sklearn.datasets` Module in scikit-learn**

The `sklearn.datasets` module in scikit-learn provides a variety of datasets useful for machine learning tasks. Here are some commonly used datasets:

1. **`load_boston`**:

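- Boston House Prices dataset. Note that this loader was deprecated in scikit-learn 1.0 and removed in 1.2, so it is only available in older versions; `fetch_california_housing` is a commonly used regression alternative.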

2. **`load_iris`**:

- Iris dataset.

3. **`load_digits`**:

- Digits dataset.

4. **`load_diabetes`**:

- Diabetes dataset.

5. **`load_wine`**:

- Wine recognition dataset.

6. **`load_breast_cancer`**:

- Breast cancer Wisconsin dataset.

7. **`fetch_openml`**:

- Fetches datasets from the OpenML repository.

8. **`make_classification`**:

- Generates a random classification problem.

9. **`make_regression`**:

- Generates a random regression problem.

10. **`make_blobs`**:

- Generates isotropic Gaussian blobs for clustering.

11. **`make_moons`**:

- Generates two interleaving half circles for binary classification.

12. **`make_circles`**:

- Generates a large circle containing a smaller circle for binary classification.

13. **`make_checkerboard`**:

- Generates an array with a block checkerboard structure, intended for biclustering rather than classification.
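The generator functions in the list above can be tried out with a short sketch like the one below. The sample counts, noise levels, and `random_state` values are arbitrary choices for illustration, not recommended settings.

```python
from sklearn.datasets import (
    make_blobs,
    make_circles,
    make_classification,
    make_moons,
    make_regression,
)

# random_state fixes the random draw so repeated runs produce the same data.
X_clf, y_clf = make_classification(
    n_samples=200, n_features=10, n_informative=4, random_state=0
)
X_reg, y_reg = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_blobs, y_blobs = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0)
X_moons, y_moons = make_moons(n_samples=100, noise=0.1, random_state=0)
X_circles, y_circles = make_circles(n_samples=100, noise=0.05, factor=0.5, random_state=0)

# Each generator returns a feature matrix X and a target vector y.
generated = [
    ("make_classification", X_clf, y_clf),
    ("make_regression", X_reg, y_reg),
    ("make_blobs", X_blobs, y_blobs),
    ("make_moons", X_moons, y_moons),
    ("make_circles", X_circles, y_circles),
]
for name, X, y in generated:
    print(f"{name:20s} X shape = {X.shape}, y shape = {y.shape}")
```

Because the data is synthetic, these functions are handy for testing a pipeline or visualising how a model behaves on a known shape before moving to real data.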

The functions listed above are just a few of the datasets available in scikit-learn. For more details and additional datasets, see the datasets section of the official scikit-learn documentation.
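As a rough illustration of how the bundled loaders behave, the sketch below loads four of the small datasets that ship with scikit-learn and prints their shapes. The `loaders` and `shapes` dictionaries are just names chosen for this example. `fetch_openml` is left out because it needs a network connection.

```python
from sklearn.datasets import (
    load_breast_cancer,
    load_diabetes,
    load_digits,
    load_wine,
)

# Map a readable name to each loader function (names chosen for this example).
loaders = {
    "wine": load_wine,
    "digits": load_digits,
    "diabetes": load_diabetes,
    "breast_cancer": load_breast_cancer,
}

shapes = {}
for name, loader in loaders.items():
    # return_X_y=True returns (features, target) instead of a Bunch object.
    X, y = loader(return_X_y=True)
    shapes[name] = X.shape
    print(f"{name:14s} samples = {X.shape[0]:5d}, features = {X.shape[1]:3d}")
```

Calling a loader without `return_X_y=True` returns a dictionary-like object with attributes such as `data`, `target`, and `DESCR`.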

The `iris` dataset is a well-known dataset in machine learning, commonly used for classification problems. Here's a breakdown of what `iris.data` and `iris.target` mean, where `iris` is the object returned by `load_iris()`:

### **1. What is `iris.data`? Why does it have 4 columns?**

- `iris.data` is a **NumPy array** that contains the feature values (measurements) of **150 samples of iris flowers**.

- Each **row** represents an **individual flower**.

- Each **column** represents a **specific feature** of the flower:

    1. **Sepal length (cm)**

    2. **Sepal width (cm)**

    3. **Petal length (cm)**

    4. **Petal width (cm)**

These four features are used to classify the flowers into different species.
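A minimal sketch for inspecting the feature matrix described above. The variable name `petal_length` is only illustrative.

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(type(iris.data))     # the features are stored in a NumPy array
print(iris.data.shape)     # (150, 4): 150 flowers, 4 measurements each
print(iris.feature_names)  # column names, in the same order as the columns

# Column index 2 holds the petal length of every flower.
petal_length = iris.data[:, 2]
print("Mean petal length (cm):", petal_length.mean())
```

Slicing with `iris.data[:, i]` selects one feature for all samples, while `iris.data[i]` selects all four features for one sample.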

### **2. What is `iris.target`? Why is it a flat array of 0s, 1s, and 2s?**

- `iris.target` is a **1D NumPy array** that contains the **labels** (or class IDs) for each flower.

- The numbers **0, 1, and 2** represent the three different species of iris flowers:

    - **0** → `setosa`

    - **1** → `versicolor`

    - **2** → `virginica`

- So, if `iris.target[0]` is `0`, it means the first flower sample in `iris.data` is an **Iris Setosa**.
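The mapping from class IDs to species names can be checked with a short sketch like this one. The names `counts` and `species` are chosen for the example.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.target.shape)  # (150,): one label per flower
print(iris.target_names)  # species names, indexed by class ID

# Count how many samples belong to each class ID.
counts = np.bincount(iris.target)
for class_id, name in enumerate(iris.target_names):
    print(f"{class_id} -> {name}: {counts[class_id]} samples")

# Indexing target_names with the label array turns IDs into names.
species = iris.target_names[iris.target]
print(species[:3])
```

The labels are stored as integers because most scikit-learn estimators expect numeric targets; `target_names` is there to translate them back for reporting.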

### **Example: Understanding `iris.data` and `iris.target`**

Let’s extract the first row:

```python
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Print the first sample
print("First sample (features):", iris.data[0])
print("First sample (target/class):", iris.target[0])
```

**Output:**

```
First sample (features): [5.1 3.5 1.4 0.2]
First sample (target/class): 0
```

- The flower has **sepal length = 5.1 cm**, **sepal width = 3.5 cm**, **petal length = 1.4 cm**, **petal width = 0.2 cm**.

- The class label is **0**, which means it is an **Iris Setosa**.
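To make the same reading less manual, a sample's values can be paired with the feature and species names. The helper `describe_sample` below is a hypothetical function written for this post, not part of scikit-learn.

```python
from sklearn.datasets import load_iris

iris = load_iris()


def describe_sample(index):
    """Return one sample as named features plus its species name.

    Hypothetical helper for illustration; not a scikit-learn function.
    """
    features = {
        name: float(value)
        for name, value in zip(iris.feature_names, iris.data[index])
    }
    species = str(iris.target_names[iris.target[index]])
    return {"features": features, "species": species}


first = describe_sample(0)
for name, value in first["features"].items():
    print(f"{name}: {value}")
print("species:", first["species"])
```

The same call works for any index from 0 to 149, which makes it easy to spot-check samples from the other two species.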
