
Saturday, August 3, 2024

Common Datasets in sklearn.datasets for Machine Learning

**The `sklearn.datasets` Module in scikit-learn**

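The `sklearn.datasets` module in scikit-learn provides three kinds of helpers: `load_*` functions for small datasets bundled with the library, `fetch_*` functions that download larger datasets on demand, and `make_*` functions that generate synthetic data. Here are some commonly used ones: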

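1. **`load_boston`**:
   - Boston House Prices dataset. It was deprecated in scikit-learn 1.0 and removed in 1.2, so it is unavailable in current releases; `fetch_california_housing` is a common replacement for regression examples.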

2. **`load_iris`**:
   - Iris dataset.

3. **`load_digits`**:
   - Digits dataset.

4. **`load_diabetes`**:
   - Diabetes dataset.

5. **`load_wine`**:
   - Wine recognition dataset.

6. **`load_breast_cancer`**:
   - Breast cancer Wisconsin dataset.

7. **`fetch_openml`**:
   - Fetches datasets from the OpenML repository.

8. **`make_classification`**:
   - Generates a random classification problem.

9. **`make_regression`**:
   - Generates a random regression problem.

10. **`make_blobs`**:
    - Generates isotropic Gaussian blobs for clustering.

11. **`make_moons`**:
    - Generates two interleaving half circles for binary classification.

12. **`make_circles`**:
    - Generates a large circle containing a smaller circle for binary classification.

13. **`make_checkerboard`**:
    - Generates an array with block checkerboard structure for biclustering.

These are just a few of the datasets and generators available in scikit-learn. For more details and additional datasets, see the "Dataset loading utilities" section of the official scikit-learn documentation.
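As a rough illustration of how the loaders and generators listed above are called, here is a minimal sketch. The variable names and the specific parameter values (sample counts, `random_state=0`, and so on) are arbitrary choices for this example, not recommendations.

```python
from sklearn.datasets import load_wine, make_blobs, make_classification

# Bundled dataset: return_X_y=True returns (features, labels) as two arrays
X_wine, y_wine = load_wine(return_X_y=True)
print(X_wine.shape, y_wine.shape)  # (178, 13) (178,)

# Synthetic binary classification problem (parameter values are illustrative)
X_clf, y_clf = make_classification(
    n_samples=200,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    random_state=0,
)
print(X_clf.shape, y_clf.shape)  # (200, 5) (200,)

# Three isotropic Gaussian blobs in 2D, e.g. for trying out clustering
X_blobs, y_blobs = make_blobs(
    n_samples=150, centers=3, n_features=2, random_state=0
)
print(X_blobs.shape, y_blobs.shape)  # (150, 2) (150,)
```

Fixing `random_state` makes the generated data reproducible between runs, which helps when comparing models on the same synthetic problem.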

The `iris` dataset is a well-known dataset in machine learning, commonly used for classification problems. Here's a breakdown of what `iris.data` and `iris.target` mean:

### **1. What is `iris.data`? Why does it have 4 columns?**
- `iris.data` is a **NumPy array** that contains the feature values (measurements) of **150 samples of iris flowers**.
- Each **row** represents an **individual flower**.
- Each **column** represents a **specific feature** of the flower:
  1. **Sepal length (cm)**
  2. **Sepal width (cm)**
  3. **Petal length (cm)**
  4. **Petal width (cm)**

These four features are used to classify the flowers into different species.
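To check the layout described above, the following short sketch inspects the array's type and shape along with the column names that scikit-learn stores in `iris.feature_names`.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# iris.data is a 2D array: one row per flower, one column per feature
print(type(iris.data).__name__)  # ndarray
print(iris.data.shape)           # (150, 4)

# The column order matches iris.feature_names
print(iris.feature_names)
# ['sepal length (cm)', 'sepal width (cm)', 'petal length (cm)', 'petal width (cm)']
```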

### **2. What is `iris.target`? Why is it a flat array of 0s, 1s, and 2s?**
- `iris.target` is a **1D NumPy array** that contains the **labels** (or class IDs) for each flower.
- The numbers **0, 1, and 2** represent the three different species of iris flowers:
  - **0** → `setosa`
  - **1** → `versicolor`
  - **2** → `virginica`
- So, if `iris.target[0]` is `0`, it means the first flower sample in `iris.data` is an **Iris Setosa**.
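The mapping from class IDs to species names is stored in `iris.target_names`, so it does not need to be hard-coded. A small sketch that looks up the names and counts the samples per class (the variable names `counts` and `first_species` are just for this example):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# One label per flower, stored as a flat (1D) array
print(iris.target.shape)  # (150,)

# Position i in target_names is the species name for class ID i
print(iris.target_names.tolist())  # ['setosa', 'versicolor', 'virginica']

# Number of samples in each class
counts = np.bincount(iris.target)
print(counts)  # [50 50 50]

# Translate the first sample's class ID into a species name
first_species = str(iris.target_names[iris.target[0]])
print(first_species)  # setosa
```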

### **Example: Understanding `iris.data` and `iris.target`**
Let’s extract the first row:


```python
from sklearn import datasets

# Load the iris dataset
iris = datasets.load_iris()

# Print the first sample
print("First sample (features):", iris.data[0])
print("First sample (target/class):", iris.target[0])
```

**Output:**

```
First sample (features): [5.1 3.5 1.4 0.2]
First sample (target/class): 0
```


- The flower has **sepal length = 5.1 cm**, **sepal width = 3.5 cm**, **petal length = 1.4 cm**, **petal width = 0.2 cm**.
- The class label is **0**, which means it is an **Iris Setosa**.
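The same reading can be produced in code by pairing each value with its feature name and looking up the species name. This is a minimal sketch; `first` and `species` are names chosen only for this example.

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Pair each measurement of the first flower with its feature name
first = dict(zip(iris.feature_names, iris.data[0].tolist()))
for name, value in first.items():
    print(f"{name}: {value}")

# Look up the species name for the first flower's class ID
species = str(iris.target_names[iris.target[0]])
print("species:", species)  # species: setosa
```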


