From 23486197c23cba68c4c0f6bfce0a8adc46a1c870 Mon Sep 17 00:00:00 2001
From: Ajay Dhangar
Date: Wed, 31 Dec 2025 20:52:00 +0530
Subject: [PATCH] added ml content

---
 .../regression/linear-regression.mdx     | 200 ++++++++++++++++++
 .../regression/polynomial-regression.mdx | 131 ++++++++++++
 2 files changed, 331 insertions(+)

diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/regression/linear-regression.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/regression/linear-regression.mdx
index e69de29..8eab36b 100644
--- a/docs/machine-learning/machine-learning-core/supervised-learning/regression/linear-regression.mdx
+++ b/docs/machine-learning/machine-learning-core/supervised-learning/regression/linear-regression.mdx
@@ -0,0 +1,200 @@
+---
+title: Linear Regression
+sidebar_label: Linear Regression
+description: "Mastering the fundamentals of predicting continuous values using lines, slopes, and intercepts."
+tags: [machine-learning, supervised-learning, regression, linear-regression, ordinary-least-squares]
+---
+
+**Linear Regression** is a supervised learning algorithm used to predict a continuous numerical output from one or more input features. It assumes a linear relationship between the input variables ($X$) and the single output variable ($y$).
+
+## 1. The Mathematical Model
+
+The goal of linear regression is to find the "Line of Best Fit." Mathematically, this line is represented by the equation:
+
+$$
+y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_nx_n + \epsilon
+$$
+
+Where:
+
+* **$y$**: The dependent variable (Target).
+* **$x_1, x_2, \dots, x_n$**: The independent variables (Features).
+* **$\beta_0$**: The **Intercept** (where the line crosses the y-axis).
+* **$\beta_1, \beta_2, \dots, \beta_n$**: The **Coefficients** or Slopes (representing the weight of each feature).
+* **$\epsilon$**: The error term (Residual).
+
+## 2. Ordinary Least Squares (OLS)
+
+How does the model find the "best" line? It uses a method called **Ordinary Least Squares**.
+
+The algorithm calculates the vertical distance (the residual) between every actual data point and the corresponding predicted point on the line. It then squares these distances (which removes negative signs and penalizes large errors more heavily) and sums them up. The "best" line is the one that minimizes this **Sum of Squared Errors (SSE)**.
+
+```mermaid
+graph LR
+    subgraph LR["Linear Regression Model"]
+        X["$$x$$ (Input Feature)"] --> H["$$\hat{y} = wx + b$$"]
+    end
+
+    subgraph ERR["Residuals"]
+        Y["$$y$$ (Actual Value)"]
+        H --> R["$$r = y - \hat{y}$$"]
+        Y --> R
+    end
+
+    subgraph SSE["Sum of Squared Errors"]
+        R --> S1["$$r^2 = (y - \hat{y})^2$$"]
+        S1 --> S2["$$\text{SSE} = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$"]
+        S2 --> S3["$$\text{Loss to Minimize}$$"]
+    end
+
+    X -.->|"$$\text{Best Fit Line}$$"| Y
+```
+
+In this diagram:
+
+* The input feature ($x$) is fed into the linear model to produce a predicted value ($\hat{y}$).
+* The residual ($r$) is calculated as the difference between the actual value ($y$) and the predicted value ($\hat{y}$).
+* The squared residuals are summed up to compute the SSE, which the model aims to minimize.
+
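+To make the OLS idea concrete, here is a minimal, illustrative sketch (the five data points are made up for this example) that computes the intercept and slope for a single feature from the normal equation, $\hat{\beta} = (X^TX)^{-1}X^Ty$, and then evaluates the SSE of the fitted line:
+
+```python title="OLS by Hand with NumPy (Illustrative Sketch)"
+import numpy as np
+
+# Small made-up dataset: one feature, one target
+x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
+y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])
+
+# Design matrix with a column of ones for the intercept term
+X = np.column_stack([np.ones_like(x), x])
+
+# Normal equation: beta = (X^T X)^-1 X^T y
+# (np.linalg.lstsq is preferred in practice; the explicit inverse is shown for clarity)
+beta = np.linalg.inv(X.T @ X) @ X.T @ y
+intercept, slope = beta
+
+# Sum of Squared Errors for the fitted line
+y_hat = intercept + slope * x
+sse = np.sum((y - y_hat) ** 2)
+
+print(f"Intercept: {intercept:.3f}")
+print(f"Slope: {slope:.3f}")
+print(f"SSE of the fitted line: {sse:.4f}")
+```
+
+Scikit-Learn's `LinearRegression` (used in Section 5 below) solves the same minimization problem with a more numerically stable routine under the hood.
+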
+## 3. Simple vs. Multiple Linear Regression
+
+* **Simple Linear Regression:** Uses only one feature to predict the target (e.g., predicting house price based *only* on square footage).
+* **Multiple Linear Regression:** Uses two or more features (e.g., predicting house price based on square footage, number of bedrooms, and age of the house).
+
+## 4. Key Assumptions
+
+For Linear Regression to be effective and reliable, the data should ideally meet these criteria:
+
+1. **Linearity:** The relationship between $X$ and $y$ is a straight line.
+2. **Independence:** Observations are independent of each other.
+3. **Homoscedasticity:** The variance of the residual errors is constant across all levels of the independent variables.
+4. **Normality:** The residuals (errors) of the model are normally distributed.
+
+## 5. Implementation with Scikit-Learn
+
+```python title="Linear Regression with Scikit-Learn"
+import numpy as np
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.linear_model import LinearRegression
+from sklearn.metrics import mean_squared_error, r2_score
+
+# --------------------------------------------------
+# 1. Create a sample dataset
+# --------------------------------------------------
+# Example: Predict salary based on years of experience
+
+X = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]).reshape(-1, 1)  # Feature
+y = np.array([30, 35, 37, 42, 45, 50, 52, 56, 60, 65])        # Target
+
+# --------------------------------------------------
+# 2. Split the data into training and testing sets
+# --------------------------------------------------
+X_train, X_test, y_train, y_test = train_test_split(
+    X, y, test_size=0.2, random_state=42
+)
+
+# --------------------------------------------------
+# 3. Initialize the Linear Regression model
+# --------------------------------------------------
+model = LinearRegression()
+
+# --------------------------------------------------
+# 4. Train the model
+# --------------------------------------------------
+model.fit(X_train, y_train)
+
+# --------------------------------------------------
+# 5. Make predictions
+# --------------------------------------------------
+y_pred = model.predict(X_test)
+
+# --------------------------------------------------
+# 6. Inspect learned parameters
+# --------------------------------------------------
+print(f"Intercept (β₀): {model.intercept_}")
+print(f"Coefficient (β₁): {model.coef_[0]}")
+
+# --------------------------------------------------
+# 7. Evaluate the model
+# --------------------------------------------------
+mse = mean_squared_error(y_test, y_pred)
+r2 = r2_score(y_test, y_pred)
+
+print(f"Mean Squared Error (MSE): {mse}")
+print(f"R² Score: {r2}")
+
+# --------------------------------------------------
+# 8. Compare actual vs predicted values
+# --------------------------------------------------
+results = pd.DataFrame({
+    "Actual": y_test,
+    "Predicted": y_pred
+})
+
+print("\nPrediction Results:")
+print(results)
+```
+
+```bash title="Output"
+Intercept (β₀): 26.025862068965512
+Coefficient (β₁): 3.836206896551725
+Mean Squared Error (MSE): 0.9994426278240237
+R² Score: 0.9936035671819262
+
+Prediction Results:
+   Actual  Predicted
+0      60  60.551724
+1      35  33.698276
+```
+
+## 6. Evaluating Regression Models
+
+Unlike classification (where we typically use accuracy), we evaluate regression with error metrics:
+
+* **Mean Squared Error (MSE):** The average of the squared differences between predicted and actual values.
+* **Root Mean Squared Error (RMSE):** The square root of MSE (brings the error back to the original units of $y$).
+* **R-Squared ($R^2$):** The proportion of the variance in $y$ that the model explains (1.0 is a perfect fit; values near 0 mean the model explains almost nothing).
+
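+To connect these formulas to the numbers above, the short sketch below recomputes all three metrics directly from the two held-out test predictions shown in the training output (using the rounded values as printed there):
+
+```python title="Recomputing the Metrics by Hand (Illustrative Sketch)"
+import numpy as np
+
+# The two test points from the output above
+y_true = np.array([60.0, 35.0])
+y_hat = np.array([60.551724, 33.698276])
+
+# MSE: mean of the squared residuals
+mse = np.mean((y_true - y_hat) ** 2)
+
+# RMSE: back in the original units of y
+rmse = np.sqrt(mse)
+
+# R²: 1 - (residual sum of squares / total sum of squares around the mean)
+ss_res = np.sum((y_true - y_hat) ** 2)
+ss_tot = np.sum((y_true - y_true.mean()) ** 2)
+r2 = 1 - ss_res / ss_tot
+
+print(f"MSE:  {mse:.4f}")   # ≈ 0.9994
+print(f"RMSE: {rmse:.4f}")  # ≈ 0.9997
+print(f"R²:   {r2:.4f}")    # ≈ 0.9936
+```
+
+The results match the Scikit-Learn metrics computed below, which is a useful sanity check that the library functions implement exactly these formulas.
+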
+```python title="Evaluating Linear Regression Model"
+from sklearn.metrics import mean_squared_error, r2_score
+import numpy as np
+
+# Calculate evaluation metrics (y_test and y_pred come from the example above)
+mse = mean_squared_error(y_test, y_pred)
+rmse = np.sqrt(mse)  # Root Mean Squared Error
+r2 = r2_score(y_test, y_pred)
+
+# Display results
+print("Model Evaluation Metrics")
+print("------------------------")
+print(f"Mean Squared Error (MSE): {mse:.4f}")
+print(f"Root Mean Squared Error (RMSE): {rmse:.4f}")
+print(f"R-Squared (R²): {r2:.4f}")
+```
+
+```bash title="Output"
+Model Evaluation Metrics
+------------------------
+Mean Squared Error (MSE): 0.9994
+Root Mean Squared Error (RMSE): 0.9997
+R-Squared (R²): 0.9936
+```
+
+## 7. Pros and Cons
+
+| Advantages | Disadvantages |
+| --- | --- |
+| **Highly Interpretable:** You can see exactly how much each feature influences the result. | **Sensitive to Outliers:** A single extreme value can significantly tilt the line. |
+| **Fast:** Requires very little computational power. | **Assumption Heavy:** Fails if the underlying relationship is non-linear. |
+| **Baseline Model:** An excellent starting point for any regression task. | **Multicollinearity:** Highly correlated features make the coefficient estimates unstable and hard to interpret. |
+
+## References for More Details
+
+* **[Scikit-Learn Linear Models](https://scikit-learn.org/stable/modules/linear_model.html):** Technical details on OLS and alternative solvers.
\ No newline at end of file
diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/regression/polynomial-regression.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/regression/polynomial-regression.mdx
index e69de29..420ec59 100644
--- a/docs/machine-learning/machine-learning-core/supervised-learning/regression/polynomial-regression.mdx
+++ b/docs/machine-learning/machine-learning-core/supervised-learning/regression/polynomial-regression.mdx
@@ -0,0 +1,131 @@
+---
+title: "Polynomial Regression: Beyond Straight Lines"
+sidebar_label: Polynomial Regression
+description: "Learning to model curved relationships by transforming features into higher-degree polynomials."
+tags: [machine-learning, supervised-learning, regression, polynomial-features, non-linear]
+---
+
+**Polynomial Regression** is a form of regression analysis in which the relationship between the independent variable $x$ and the dependent variable $y$ is modelled as an $n^{th}$ degree polynomial.
+
+While it fits a non-linear curve to the data, as a statistical estimation problem it is still considered **linear**, because the regression function is linear in the unknown parameters ($\beta$) that are estimated from the data.
+
+## 1. Why use Polynomial Regression?
+
+Linear regression requires a straight-line relationship. However, real-world data often follows curves, such as:
+
+* **Growth Rates:** Biological growth or compound interest.
+* **Physics:** The path of a projectile or the relationship between speed and braking distance.
+* **Economics:** Diminishing returns on investment.
+
+## 2. The Mathematical Equation
+
+In a simple linear model, we have:
+
+$$
+y = \beta_0 + \beta_1x
+$$
+
+In Polynomial Regression, we add higher-degree terms of the same feature:
+
+$$
+y = \beta_0 + \beta_1x + \beta_2x^2 + \beta_3x^3 + \dots + \beta_nx^n + \epsilon
+$$
+
+Where:
+
+* **$y$**: The dependent variable (Target).
+* **$x$**: The independent variable (Feature).
+* **$\beta_0$**: The Intercept.
+* **$\beta_1, \beta_2, \dots, \beta_n$**: The Coefficients for each polynomial term.
+* **$\epsilon$**: The error term (Residual).
+
+By treating $x^2, x^3, \dots$ as distinct features, we allow the model to "bend" to fit the data points.
+
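+To see what "treating powers of $x$ as distinct features" looks like in practice, here is a tiny, illustrative sketch (the three sample values are made up) of the columns that Scikit-Learn's `PolynomialFeatures` transformer produces:
+
+```python title="What PolynomialFeatures Produces (Illustrative Sketch)"
+import numpy as np
+from sklearn.preprocessing import PolynomialFeatures
+
+# A toy feature column with three samples
+X = np.array([[2], [3], [4]])
+
+# Degree-3 expansion: adds x^2 and x^3 as extra columns
+poly = PolynomialFeatures(degree=3, include_bias=False)
+X_poly = poly.fit_transform(X)
+
+# get_feature_names_out requires scikit-learn >= 1.0
+print(poly.get_feature_names_out(["x"]))  # ['x' 'x^2' 'x^3']
+print(X_poly)
+# [[ 2.  4.  8.]
+#  [ 3.  9. 27.]
+#  [ 4. 16. 64.]]
+```
+
+A plain `LinearRegression` fitted on `X_poly` learns one coefficient per column, which is exactly why the model remains linear in its parameters even though the resulting curve is not a straight line.
+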
+## 3. The Danger of Degree: Overfitting
+
+Choosing the right **degree** ($n$) is the most critical part of Polynomial Regression:
+
+* **Underfitting (Degree 1):** A straight line that fails to capture the curve in the data.
+* **Optimal Fit (Degree 2 or 3):** A smooth curve that captures the general trend.
+* **Overfitting (Degree 10+):** A wiggly curve that passes through every single data point but fails to predict new data, because it has captured the noise instead of the signal.
+
+```mermaid
+graph LR
+    subgraph UF["Underfitting (Low Degree)"]
+        X1["$$x$$"] --> L1["$$\hat{y} = w_1x + b$$"]
+        L1 --> U1["$$\text{High Bias}$$"]
+        U1 --> U2["$$\text{Misses Data Pattern}$$"]
+        U2 --> U3["$$\text{High Train Error}$$"]
+    end
+
+    subgraph OFIT["Optimal Fit (Medium Degree)"]
+        X2["$$x$$"] --> M1["$$\hat{y} = w_1x + w_2x^2 + b$$"]
+        M1 --> O1["$$\text{Balanced Bias–Variance}$$"]
+        O1 --> O2["$$\text{Captures True Trend}$$"]
+        O2 --> O3["$$\text{Low Train \& Test Error}$$"]
+    end
+
+    subgraph OVF["Overfitting (High Degree)"]
+        X3["$$x$$"] --> H1["$$\hat{y} = \sum_{k=1}^{d} w_k x^k$$"]
+        H1 --> V1["$$\text{Low Bias}$$"]
+        V1 --> V2["$$\text{High Variance}$$"]
+        V2 --> V3["$$\text{Fits Noise}$$"]
+        V3 --> V4["$$\text{Poor Generalization}$$"]
+    end
+
+    U3 -.->|"$$\text{Increase Degree}$$"| O3
+    O3 -.->|"$$\text{Too Complex}$$"| V4
+```
+
+## 4. Implementation with Scikit-Learn
+
+In Scikit-Learn, we perform Polynomial Regression by using a **Transformer** to generate the new features and then passing them to a standard `LinearRegression` model.
+
+```python title="Polynomial Regression with Scikit-Learn"
+import numpy as np
+from sklearn.preprocessing import PolynomialFeatures
+from sklearn.linear_model import LinearRegression
+from sklearn.pipeline import make_pipeline
+
+# 1. Generate sample data (Example: a noisy parabola)
+rng = np.random.default_rng(42)
+X = np.linspace(-3, 3, 50).reshape(-1, 1)
+y = 0.5 * X.ravel() ** 2 - X.ravel() + 2 + rng.normal(0, 0.5, size=50)
+
+# 2. Create a pipeline that:
+#    a) Generates the polynomial terms (x^2)
+#    b) Fits a linear regression to those terms
+degree = 2
+poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
+
+# 3. Train the model
+poly_model.fit(X, y)
+
+# 4. Predict
+y_pred = poly_model.predict(X)
+```
+
+## 5. Feature Scaling is Mandatory
+
+When you square or cube features, the range of values expands drastically:
+
+* If $x = 10$, then $x^2 = 100$ and $x^3 = 1{,}000$.
+* If $x = 100$, then $x^2 = 10{,}000$ and $x^3 = 1{,}000{,}000$.
+
+Because of this explosive growth, you should **always scale your features** (e.g., using `StandardScaler`, typically as a pipeline step after the polynomial transformation) to prevent numerical instability, as sketched below.
+
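+As a minimal illustration of that advice (the synthetic data and the degree are made up for this sketch), the pipeline below chains the polynomial expansion, the scaler, and the linear model, so the scaling is applied to the expanded $x^2$ and $x^3$ columns and reused automatically at prediction time:
+
+```python title="Polynomial Features and Scaling in One Pipeline (Illustrative Sketch)"
+import numpy as np
+from sklearn.pipeline import make_pipeline
+from sklearn.preprocessing import PolynomialFeatures, StandardScaler
+from sklearn.linear_model import LinearRegression
+
+# Synthetic curved data for illustration
+rng = np.random.default_rng(0)
+X = np.linspace(0, 100, 200).reshape(-1, 1)
+y = 0.02 * X.ravel() ** 2 - 1.5 * X.ravel() + 40 + rng.normal(0, 20, size=200)
+
+# Expand -> scale -> fit, so the scaler also sees the x^2 and x^3 columns
+scaled_poly_model = make_pipeline(
+    PolynomialFeatures(degree=3, include_bias=False),
+    StandardScaler(),
+    LinearRegression(),
+)
+scaled_poly_model.fit(X, y)
+
+print(f"Training R²: {scaled_poly_model.score(X, y):.3f}")
+```
+
+Keeping the scaler inside the pipeline also avoids a common mistake: fitting the scaler on the full dataset (or separately on the test set) instead of only on the training data.
+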
+## 6. Pros and Cons
+
+| Advantages | Disadvantages |
+| --- | --- |
+| Can model complex, non-linear relationships. | Extremely sensitive to outliers. |
+| Can approximate a broad range of curved functions. | High risk of overfitting if the degree is too high. |
+| Fits into the standard linear regression framework (same OLS machinery). | Becomes computationally expensive with many features, since the number of polynomial terms grows rapidly. |
+
+## References for More Details
+
+* **[Interactive Least-Squares Regression Demo](https://phet.colorado.edu/sims/html/least-squares-regression/latest/least-squares-regression_en.html):** A hands-on way to build intuition about residuals and the line of best fit.
+
+* **[Scikit-Learn: Polynomial Features](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html):** Understanding how the `interaction_only` parameter works for multiple variables.
+
+---
+
+**Polynomial models can easily become too complex and overfit. How do we keep the model's weights in check?**
\ No newline at end of file