diff --git a/docs/machine-learning/machine-learning-core/scikit-learn/data-loading.mdx b/docs/machine-learning/machine-learning-core/scikit-learn/data-loading.mdx index e69de29..6e61791 100644 --- a/docs/machine-learning/machine-learning-core/scikit-learn/data-loading.mdx +++ b/docs/machine-learning/machine-learning-core/scikit-learn/data-loading.mdx @@ -0,0 +1,112 @@ +--- +title: "Loading Data in Scikit-Learn" +sidebar_label: Data Loading +description: "How to use Scikit-Learn's built-in datasets, fetchers, and external loaders to prepare data for modeling." +tags: [scikit-learn, data-loading, python, machine-learning, datasets] +--- + +Before you can train a model, you need to get your data into a format that Scikit-Learn understands. Scikit-Learn works primarily with **NumPy arrays** or **Pandas DataFrames**, but it also provides built-in tools to help you get started quickly. + +## 1. The Scikit-Learn Data Format + +Regardless of how you load your data, Scikit-Learn expects two main components: + +1. **The Feature Matrix ($X$):** A 2D array of shape `(n_samples, n_features)`. +2. **The Target Vector ($y$):** A 1D array of shape `(n_samples,)` containing the labels or values you want to predict. + +## 2. Built-in "Toy" Datasets + +Scikit-Learn comes bundled with small datasets that require no internet connection. These are perfect for testing your code or learning new algorithms. + +* `load_iris()`: Classic classification dataset (flowers). +* `load_diabetes()`: Regression dataset. +* `load_digits()`: Classification dataset (handwritten digits). + +```python +from sklearn.datasets import load_iris + +# Load the dataset +iris = load_iris() + +# Access data and labels +X = iris.data +y = iris.target + +print(f"Features: {iris.feature_names}") +print(f"Target Names: {iris.target_names}") + +``` + +## 3. 
Fetching Large Real-World Datasets + +For larger datasets, Scikit-Learn provides "fetchers" that download data from the internet and cache it locally in your `~/scikit_learn_data` folder. + +* `fetch_california_housing()`: Predict median house values. +* `fetch_20newsgroups()`: Text dataset for NLP. +* `fetch_lfw_people()`: Labeled Faces in the Wild (for face recognition). + +```python +from sklearn.datasets import fetch_california_housing + +housing = fetch_california_housing() +print(f"Dataset shape: {housing.data.shape}") + +``` + +## 4. Loading from External Sources + +In a professional environment, you will rarely use the built-in datasets. You will likely load data from **CSVs**, **SQL Databases**, or **Pandas DataFrames**. + +### From Pandas to Scikit-Learn + +Scikit-Learn is designed to be "Pandas-friendly." You can pass DataFrames directly into models. + +```python +import pandas as pd +from sklearn.linear_model import LinearRegression + +# Load your own CSV +df = pd.read_csv('my_data.csv') + +# Split into X and y +X = df[['feature1', 'feature2']] # Select specific columns +y = df['target_column'] + +# Train model directly +model = LinearRegression().fit(X, y) + +``` + +## 5. Generating Synthetic Data + +Sometimes you need to create "fake" data to test how an algorithm handles specific scenarios (like high noise or non-linear patterns). + +```python +from sklearn.datasets import make_blobs, make_moons + +# Create 3 distinct clusters for a classification task +X, y = make_blobs(n_samples=100, centers=3, n_features=2, random_state=42) + +``` + +## 6. Understanding the "Bunch" Object + +When you use `load_*` or `fetch_*`, Scikit-Learn returns a **`Bunch` object**. This is essentially a dictionary that contains: + +* `.data`: The feature matrix. +* `.target`: The labels. +* `.feature_names`: The names of the columns. +* `.DESCR`: A full text description of where the data came from. 
+ +:::tip +Use `as_frame=True` in your loader to get the data returned as a Pandas DataFrame immediately: `data = load_iris(as_frame=True).frame` +::: + +## References for More Details + +* **[Sklearn Dataset Loading Guide](https://scikit-learn.org/stable/datasets.html):** Exploring all 20+ available fetchers and loaders. +* **[OpenML Integration](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html):** Accessing thousands of community-uploaded datasets via `fetch_openml`. + +--- + +**Now that you can load data, the next step is to ensure it's in the right shape and split correctly for training and testing.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/scikit-learn/data-preparation.mdx b/docs/machine-learning/machine-learning-core/scikit-learn/data-preparation.mdx index e69de29..43d2535 100644 --- a/docs/machine-learning/machine-learning-core/scikit-learn/data-preparation.mdx +++ b/docs/machine-learning/machine-learning-core/scikit-learn/data-preparation.mdx @@ -0,0 +1,120 @@ +--- +title: Data Preparation in Scikit-Learn +sidebar_label: Data Preparation +description: "Transforming raw data into model-ready features using Scikit-Learn's preprocessing and imputation tools." +tags: [scikit-learn, preprocessing, encoding, scaling, imputation] +--- + +Before feeding data into an algorithm, it must be cleaned and transformed. Scikit-Learn provides a robust suite of **Transformers**—classes that follow a standard `.fit()` and `.transform()` API—to automate this work. + +## 1. Handling Missing Values + +Machine Learning models cannot handle `NaN` (Not a Number) or `null` values. The `SimpleImputer` class helps fill these gaps. 
+ +```python +from sklearn.impute import SimpleImputer +import numpy as np + +# Sample data with missing values +X = [[1, 2], [np.nan, 3], [7, 6]] + +# strategy='mean', 'median', 'most_frequent', or 'constant' +imputer = SimpleImputer(strategy='mean') +X_filled = imputer.fit_transform(X) + +``` + +## 2. Encoding Categorical Data + +Computers understand numbers, not words. If you have a column for "City" (New York, Paris, Tokyo), you must encode it. + +### A. One-Hot Encoding (Nominal) + +Creates a new binary column for each category. Best for data without a natural order. + +```python +from sklearn.preprocessing import OneHotEncoder + +encoder = OneHotEncoder(sparse_output=False) +cities = [['New York'], ['Paris'], ['Tokyo']] +encoded_cities = encoder.fit_transform(cities) + +``` + +### B. Ordinal Encoding (Ranked) + +Converts categories into integers ($0, 1, 2, \dots$). Use this when the order matters (e.g., Small, Medium, Large). + +## 3. Feature Scaling + +As discussed in our [Data Engineering module](/tutorial/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-scaling), scaling ensures that features with large ranges (like Salary) don't overpower features with small ranges (like Age). + +### Standardization (`StandardScaler`) + +Rescales data to have a mean of $0$ and a standard deviation of $1$. + +$$ +z = \frac{x - \mu}{\sigma} +$$ + +### Normalization (`MinMaxScaler`) + +Rescales data to a fixed range, usually $[0, 1]$. + +```python +from sklearn.preprocessing import StandardScaler + +scaler = StandardScaler() +X_scaled = scaler.fit_transform(X_filled) + +``` + +## 4. The `fit` vs `transform` Rule + +One of the most important concepts in Scikit-Learn is the distinction between these two methods: + +* **`.fit()`**: The transformer calculates the parameters (e.g., the mean and standard deviation of your data). **Only do this on Training data.** +* **`.transform()`**: The transformer applies those calculated parameters to the data. 
+* **`.fit_transform()`**: Does both in one step. + +```mermaid +graph TD + Train[Training Data] --> Fit[Fit: Learn Mean/Std] + Fit --> TransTrain[Transform Training Data] + Fit --> TransTest[Transform Test Data] + + style Fit fill:#f3e5f5,stroke:#7b1fa2,color:#333 + +``` + +:::warning +Never `fit` on your Test data. This leads to **Data Leakage**, where your model "cheats" by seeing the distribution of the test set during training. +::: + +## 5. ColumnTransformer: Selective Processing + +In real datasets, you have a mix of types: some columns need scaling, others need encoding, and some need nothing. `ColumnTransformer` allows you to apply different prep steps to different columns simultaneously. + +```python +from sklearn.compose import ColumnTransformer + +preprocessor = ColumnTransformer( + transformers=[ + ('num', StandardScaler(), ['age', 'income']), + ('cat', OneHotEncoder(), ['city', 'gender']) + ]) + +# X_processed = preprocessor.fit_transform(df) + +``` + +--- + +## References for More Details + +* **[Scikit-Learn Preprocessing Guide](https://scikit-learn.org/stable/modules/preprocessing.html):** Discovering advanced transformers like `PowerTransformer` or `PolynomialFeatures`. +* **[Imputing Missing Values](https://scikit-learn.org/stable/modules/impute.html):** Learning about `IterativeImputer` (MICE) and `KNNImputer`. + +--- + +**Manual data preparation can get messy and hard to replicate. 
To solve this, Scikit-Learn uses a powerful tool to chain all these steps together into a single object.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/scikit-learn/hyperparameter-tuning.mdx b/docs/machine-learning/machine-learning-core/scikit-learn/hyperparameter-tuning.mdx index e69de29..64d4748 100644 --- a/docs/machine-learning/machine-learning-core/scikit-learn/hyperparameter-tuning.mdx +++ b/docs/machine-learning/machine-learning-core/scikit-learn/hyperparameter-tuning.mdx @@ -0,0 +1,117 @@ +--- +title: Hyperparameter Tuning +sidebar_label: Hyperparameter Tuning +description: "Optimizing model performance using GridSearchCV, RandomizedSearchCV, and Halving techniques." +tags: [scikit-learn, hyperparameter-tuning, grid-search, optimization, model-selection] +--- + +In Machine Learning, there is a crucial difference between **Parameters** and **Hyperparameters**: + +* **Parameters:** Learned by the model during training (e.g., weights in a regression or coefficients in a neural network). +* **Hyperparameters:** Set by the engineer *before* training starts (e.g., the depth of a Decision Tree or the number of neighbors in KNN). + +**Hyperparameter Tuning** is the automated search for the best combination of these settings to minimize error. + +## 1. Why Tune Hyperparameters? + +Most algorithms come with default settings that work reasonably well, but they are rarely optimal for your specific data. Proper tuning can often bridge the gap between a mediocre model and a state-of-the-art one. + +## 2. GridSearchCV: The Exhaustive Search + +`GridSearchCV` takes a predefined list of values for each hyperparameter and tries **every possible combination**. + +* **Pros:** Guaranteed to find the best combination within the provided grid. +* **Cons:** Computationally expensive. If you have 5 parameters with 5 values each, you must train the model $5^5 = 3,125$ times. 
+ +```python +from sklearn.model_selection import GridSearchCV +from sklearn.ensemble import RandomForestClassifier + +param_grid = { + 'n_estimators': [50, 100, 200], + 'max_depth': [None, 10, 20], + 'min_samples_split': [2, 5] +} + +grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5) +grid_search.fit(X_train, y_train) + +print(f"Best Parameters: {grid_search.best_params_}") + +``` + +## 3. RandomizedSearchCV: The Efficient Alternative + +Instead of trying every combination, `RandomizedSearchCV` picks a fixed number of random combinations from a distribution. + +* **Pros:** Much faster than GridSearch. It often finds a result almost as good as GridSearch in a fraction of the time. +* **Cons:** Not guaranteed to find the absolute best "peak" in the parameter space. + +```python +from sklearn.model_selection import RandomizedSearchCV +from scipy.stats import randint + +param_dist = { + 'n_estimators': randint(50, 500), + 'max_depth': [None, 10, 20, 30, 40, 50], +} + +random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=20, cv=5) +random_search.fit(X_train, y_train) + +``` + +## 4. Advanced: Successive Halving + +For massive datasets, even Random Search is slow. Scikit-Learn offers **`HalvingGridSearchCV`** (an experimental estimator that, at the time of writing, must be enabled with `from sklearn.experimental import enable_halving_search_cv`). It trains all combinations on a small amount of data, discards the weakest candidates, and keeps the "promising" ones for the next round with more data. The diagram below halves the candidates each round for illustration; the default `factor=3` keeps roughly the top third. + +```mermaid +graph TD + S1[Round 1: 100 candidates, 10% data] --> S2[Round 2: 50 candidates, 20% data] + S2 --> S3[Round 3: 25 candidates, 40% data] + S3 --> S4[Final Round: Best candidates, 100% data] + + style S1 fill:#fff3e0,stroke:#ef6c00,color:#333 + style S4 fill:#e8f5e9,stroke:#2e7d32,color:#333 + +``` + +## 5. Avoiding the Validation Trap + +If you tune your hyperparameters using the **Test Set**, you are "leaking" information. The model will look great on that test set, but fail on new data. 
+ +**The Solution:** Use **Nested Cross-Validation** or ensure that your `GridSearchCV` only uses the **Training Set** (it will internally split the training data into smaller validation folds). + +```mermaid +graph LR + FullData[Full Dataset] --> Split{Initial Split} + Split --> Train[Training Set] + Split --> Test[Hold-out Test Set] + + subgraph Optimization [GridSearch with Internal CV] + Train --> CV1[Fold 1] + Train --> CV2[Fold 2] + Train --> CV3[Fold 3] + end + + Optimization --> BestModel[Best Hyperparameters] + BestModel --> FinalEval[Final Evaluation on Test Set] + +``` + +## 6. Tuning Strategy Summary + +| Method | Best for... | Resource Usage | +| --- | --- | --- | +| **Manual Tuning** | Initial exploration / small models | Low | +| **GridSearch** | Small number of parameters | High | +| **RandomSearch** | Many parameters / large search space | Moderate | +| **Halving Search** | Large datasets / expensive training | Low-Moderate | + +## References for More Details + +* **[Sklearn Tuning Guide](https://scikit-learn.org/stable/modules/grid_search.html):** Deep dive into `HalvingGridSearchCV` and custom scoring. + +--- + +**Now that your model is fully optimized and tuned, it's time to evaluate its performance using metrics that go beyond simple "Accuracy."** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/scikit-learn/model-selection.mdx b/docs/machine-learning/machine-learning-core/scikit-learn/model-selection.mdx index e69de29..9324b59 100644 --- a/docs/machine-learning/machine-learning-core/scikit-learn/model-selection.mdx +++ b/docs/machine-learning/machine-learning-core/scikit-learn/model-selection.mdx @@ -0,0 +1,104 @@ +--- +title: Model Selection & Validation +sidebar_label: Model Selection +description: "How to choose the right algorithm, split data correctly, and use Cross-Validation to ensure model reliability." 
+tags: [scikit-learn, model-selection, cross-validation, train-test-split, machine-learning] +--- + +**Model Selection** is the process of selecting the most appropriate Machine Learning algorithm for a specific task. However, a model that performs perfectly on your training data might fail miserably in the real world. To prevent this, we use validation techniques to ensure our model **generalizes**. + +## 1. The Scikit-Learn Estimator API + +In Scikit-Learn, every model (classifier or regressor) is an **Estimator**. They all share a consistent interface: + +1. **Initialize:** `model = RandomForestClassifier()` +2. **Train:** `model.fit(X_train, y_train)` +3. **Predict:** `y_pred = model.predict(X_test)` + +## 2. Training vs. Testing: The Fundamental Split + +The "Golden Rule" of Machine Learning is to never evaluate your model on the same data it used for training. We use `train_test_split` to create a "hidden" set of data. + +```python +from sklearn.model_selection import train_test_split + +# Usually 80% for training and 20% for testing +X_train, X_test, y_train, y_test = train_test_split( + X, y, test_size=0.2, random_state=42, stratify=y +) + +``` + +:::tip Why `stratify=y`? +For classification, this ensures the ratio of classes (e.g., 90% "No" and 10% "Yes") is identical in both the training and testing sets. +::: + +## 3. Cross-Validation (K-Fold) + +A single train-test split can be lucky or unlucky depending on which rows ended up in the test set. **K-Fold Cross-Validation** provides a more stable estimate of model performance. + +**How it works:** + +1. Split the data into $k$ equal parts (folds). +2. Train the model $k$ times. Each time, use 1 fold for testing and the remaining $k-1$ folds for training. +3. Average the scores from all $k$ rounds. 
+ +### Implementation: `cross_val_score` + +```python +from sklearn.model_selection import cross_val_score +from sklearn.ensemble import RandomForestClassifier + +model = RandomForestClassifier() + +# Perform 5-Fold Cross Validation +scores = cross_val_score(model, X, y, cv=5) + +print(f"Mean Accuracy: {scores.mean():.2f}") +print(f"Standard Deviation: {scores.std():.2f}") + +``` + +## 4. Comparing Different Models + +Model selection often involves running several candidates through the same validation pipeline to see which performs best. + +| Algorithm | Strengths | Weaknesses | +| --- | --- | --- | +| **Logistic Regression** | Fast, interpretable | Assumes linear relationships | +| **Decision Trees** | Easy to visualize | Prone to overfitting | +| **Random Forest** | Robust, handles non-linear data | Slower, "Black box" | +| **SVM** | Good for high dimensions | Memory intensive | + +## 5. Learning Curves: Diagnosing Your Model + +A **Learning Curve** plots the training and validation error against the number of training samples. It helps you identify: + +* **High Bias (Underfitting):** Both training and validation errors are high. +* **High Variance (Overfitting):** Low training error but high validation error. + +## 6. The Model Selection Workflow + +```mermaid +graph TD + Start[Load Data] --> Pre[Preprocess Data] + Pre --> Split[Train-Test Split] + Split --> Candidates[Try Multiple Algorithms] + Candidates --> CV[K-Fold Cross-Validation] + CV --> Best{Compare Scores} + Best --> Tune[Fine-tune Hyperparameters] + Best --> Fail[Revise Features/Data] + + style CV fill:#f3e5f5,stroke:#7b1fa2,color:#333 + style Best fill:#e1f5fe,stroke:#01579b,color:#333 + +``` + +## References for More Details + +* **[Scikit-Learn Model Evaluation](https://scikit-learn.org/stable/modules/model_evaluation.html):** Learning about scoring metrics like F1-Score and ROC-AUC. 
+* **[Cross-Validation Guide](https://scikit-learn.org/stable/modules/cross_validation.html):** Advanced techniques like `StratifiedKFold` and `TimeSeriesSplit`. + +--- + +**Selecting the right model is only half the battle. Once you've chosen an algorithm, you need to "turn the knobs" to find its peak performance.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/scikit-learn/predictions.mdx b/docs/machine-learning/machine-learning-core/scikit-learn/predictions.mdx index e69de29..d53c18b 100644 --- a/docs/machine-learning/machine-learning-core/scikit-learn/predictions.mdx +++ b/docs/machine-learning/machine-learning-core/scikit-learn/predictions.mdx @@ -0,0 +1,101 @@ +--- +title: Making Predictions +sidebar_label: Predictions +description: "How to use trained Scikit-Learn estimators to generate point predictions and probability estimates." +tags: [scikit-learn, machine-learning, inference, probability, prediction] +--- + +Once a model has been trained using `.fit()`, it is ready for **Inference**. In Scikit-Learn, this is handled by a consistent set of methods that allow you to generate outcomes for new data. + +## 1. Point Predictions with `.predict()` + +The most common way to get an answer from your model is the `.predict()` method. It returns a single value for each input sample. + +* **In Regression:** Returns the predicted continuous value (e.g., $250,000$). +* **In Classification:** Returns the predicted class label (e.g., "Spam"). + +```python +from sklearn.ensemble import RandomForestClassifier + +# Assuming model is already trained +# X_new contains unseen samples +predictions = model.predict(X_new) + +print(f"Predicted labels: {predictions}") + +``` + +## 2. Predicting Probabilities with `.predict_proba()` + +In many classification tasks, knowing the **label** isn't enough; you need to know how **confident** the model is. Most Scikit-Learn classifiers provide the `.predict_proba()` method. 
+ +It returns an array of shape `(n_samples, n_classes)`, where each value represents the probability of a sample belonging to a specific class. + +```python +# Returns probabilities for [Class 0, Class 1] +probs = model.predict_proba(X_new) + +# Example output: [0.1, 0.9] means 90% confidence it is Class 1 +print(f"Confidence levels: {probs}") + +``` + +### Why use probabilities? + +1. **Risk Management:** In medical diagnosis, you might only take action if the probability is above a threshold you choose. +2. **Threshold Tuning:** By default, `.predict()` uses a $0.5$ threshold for binary classification. You can manually set a higher threshold to reduce False Positives. + +## 3. Decision Functions + +Some models, like **SVM (Support Vector Machines)** or **Linear Classifiers**, provide a `.decision_function()`. + +Unlike probabilities (which range from $0$ to $1$), the decision function returns a "signed distance" to the decision boundary. + +* **Positive value:** Predicted as the positive class. +* **Negative value:** Predicted as the negative class. +* **Magnitude:** Indicates how far the point is from the boundary (certainty). + +## 4. Predicting in Regression + +For regression models, `.predict()` returns the expected numerical value. Note that standard Scikit-Learn regression models do not provide "probabilities" because the output is a continuous range, not a discrete set of classes. + +$$ +\hat{y} = w_1x_1 + w_2x_2 + ... + b +$$ + +## 5. Deployment Checklist: The "Input Shape" Trap + +One of the most frequent errors during prediction is a **Shape Mismatch**. + +* Scikit-Learn estimators expect a **2D array** for the input $X$. +* If you are predicting for a **single sample**, you must reshape it. + +```python +# Error: model.predict([1, 2, 3]) +# Correct: +single_sample = [1, 2, 3] +model.predict([single_sample]) # Wrapped in a list to make it 2D + +``` + +## 6. 
The Workflow Summary + +```mermaid +graph LR + NewData[New Unseen Data] --> Prep[Same Preprocessing as Training] + Prep --> Model[Trained Estimator] + Model --> Out1[predict: Hard Labels] + Model --> Out2[predict_proba: Soft Probabilities] + + style Model fill:#f3e5f5,stroke:#7b1fa2,color:#333 + style Out2 fill:#e1f5fe,stroke:#01579b,color:#333 + +``` + +## References for More Details + +* **[Probability Calibration](https://scikit-learn.org/stable/modules/calibration.html):** Learning how to turn decision scores into reliable probability estimates. + +--- + +**Predictions are useless if they aren't accurate. Now that you know how to get answers from your model, you must learn how to verify if those answers are correct.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/scikit-learn/text-data.mdx b/docs/machine-learning/machine-learning-core/scikit-learn/text-data.mdx index e69de29..e0c4a9d 100644 --- a/docs/machine-learning/machine-learning-core/scikit-learn/text-data.mdx +++ b/docs/machine-learning/machine-learning-core/scikit-learn/text-data.mdx @@ -0,0 +1,108 @@ +--- +title: Working with Text Data +sidebar_label: Text Data +description: "Transforming raw text into numerical features using Bag of Words, TF-IDF, and Scikit-Learn's feature extraction tools." +tags: [scikit-learn, nlp, text-processing, tf-idf, vectorization] +--- + +Machine Learning algorithms operate on fixed-size numerical arrays. They cannot understand a sentence like *"I love this product"* directly. To process text, we must convert it into numbers. In Scikit-Learn, this process is called **Feature Extraction** or **Vectorization**. + +## 1. The Bag of Words (BoW) Model + +The simplest way to turn text into numbers is to count how many times each word appears in a document. This is known as the **Bag of Words** approach. + +1. **Tokenization:** Breaking sentences into individual words (tokens). +2. 
**Vocabulary Building:** Collecting all unique words across all documents. +3. **Encoding:** Creating a vector for each document representing word counts. + +### Implementation: `CountVectorizer` + +```python +from sklearn.feature_extraction.text import CountVectorizer + +corpus = [ + 'Machine learning is great.', + 'Learning machine learning is fun.', + 'Data science is the future.' +] + +vectorizer = CountVectorizer() +X = vectorizer.fit_transform(corpus) + +# View the vocabulary +print(vectorizer.get_feature_names_out()) +# View the resulting matrix +print(X.toarray()) + +``` + +## 2. TF-IDF: Beyond Simple Counts + +A major problem with simple counts is that words like "is", "the", and "and" appear frequently but carry very little meaning. **TF-IDF (Term Frequency-Inverse Document Frequency)** fixes this by penalizing words that appear too often across all documents. + +$$ +W_{i,j} = TF_{i,j} \times \log\left(\frac{N}{DF_i}\right) +$$ + +* **TF (Term Frequency):** How often a word appears in a specific document. +* **IDF (Inverse Document Frequency):** How rare a word is across the entire corpus. + +### Implementation: `TfidfVectorizer` + +```python +from sklearn.feature_extraction.text import TfidfVectorizer + +tfidf = TfidfVectorizer() +X_tfidf = tfidf.fit_transform(corpus) + +# High values are given to unique, meaningful words like 'future' or 'fun' +print(X_tfidf.toarray()) + +``` + +## 3. Handling Large Vocabularies: Hashing + +If you have millions of unique words, your feature matrix becomes massive and may crash your memory. The **HashingVectorizer** uses a mathematical hash function to map words to a fixed number of features without storing a vocabulary in memory. + +## 4. Text Preprocessing Pipeline + +Before vectorizing, it is common practice to "clean" the text to reduce noise: + +* **Lowercasing:** Converting all text to lowercase. +* **Stop-word Removal:** Removing common words (a, an, the) using `stop_words='english'`. 
+* **N-grams:** Looking at pairs or triplets of words (e.g., "not good" instead of just "not" and "good") using `ngram_range=(1, 2)`. + +```python +# Advanced Vectorizer configuration +vectorizer = CountVectorizer( + stop_words='english', + ngram_range=(1, 2), # Captures single words and two-word phrases + max_features=1000 # Only keep the top 1000 most frequent words +) + +``` + +## 5. The "Sparsity" Challenge + +Text data results in **Sparse Matrices**. Since most documents only contain a tiny fraction of the total vocabulary, most entries in your matrix will be zero. Scikit-Learn stores these as `scipy.sparse` objects to save RAM. + +```mermaid +graph LR + Raw[Raw Text] --> Clean[Pre-processing] + Clean --> Vector[Vectorizer] + Vector --> Sparse[Sparse Matrix] + Sparse --> Model[ML Algorithm] + + style Vector fill:#f3e5f5,stroke:#7b1fa2,color:#333 + style Sparse fill:#fff3e0,stroke:#ef6c00,color:#333 + +``` + +## References for More Details + +* **[Sklearn Text Feature Extraction](https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction):** Understanding the math behind TF-IDF implementation. +* **[Natural Language Processing with Python](https://www.nltk.org/book/):** Deep diving into linguistics and advanced tokenization. 
+ +--- + +**Now that you can convert text and numbers into features, you need to learn how to organize these steps into a clean, repeatable workflow.** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/classification/knn.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/classification/knn.mdx index e69de29..f0da543 100644 --- a/docs/machine-learning/machine-learning-core/supervised-learning/classification/knn.mdx +++ b/docs/machine-learning/machine-learning-core/supervised-learning/classification/knn.mdx @@ -0,0 +1,96 @@ +--- +title: "K-Nearest Neighbors (KNN)" +sidebar_label: "K-Nearest Neighbors" +description: "Understanding the proximity-based classification algorithm: distance metrics, choosing K, and the curse of dimensionality." +tags: [machine-learning, supervised-learning, classification, knn, distance-metrics] +--- + +**K-Nearest Neighbors (KNN)** is one of the simplest and most intuitive supervised learning algorithms. It belongs to the family of **Lazy Learners** (or Instance-Based Learners) because it doesn't build a mathematical model during training. Instead, it stores the entire training dataset and performs a calculation only when a new prediction is requested. + +## 1. How KNN Works + +The core philosophy of KNN is: *"Show me your neighbors, and I'll tell you who you are."* + +When you want to classify a new data point: +1. **Calculate Distance:** Find the distance between the new point and all points in the training set. +2. **Find Neighbors:** Pick the $K$ closest points (the "neighbors"). +3. **Vote:** The new point is assigned the class that is most common among its $K$ neighbors. + +## 2. Distance Metrics + +To find the "nearest" neighbor, we need a mathematical way to measure distance. + +### Euclidean Distance +The most common metric, representing the "straight-line" distance between two points. 
+ +$$ +d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} +$$ + +### Manhattan Distance + +Also known as "City Block" distance, it measures distance along axes at right angles. + +$$ +d(p, q) = \sum_{i=1}^{n} |q_i - p_i| +$$ + +## 3. Choosing the Right 'K' + +The choice of $K$ (the number of neighbors) is critical to the model's performance: + +* **Small K (e.g., $K=1$):** The model is extremely sensitive to noise and outliers. This leads to **Overfitting**. +* **Large K (e.g., $K=100$):** The model becomes too "smooth" and might ignore local patterns. This leads to **Underfitting**. + +:::tip +**Rule of Thumb:** A common practice is to set $K$ to the square root of the number of training samples: + +$$ +K \approx \sqrt{N} +$$ +where $N$ is the number of training samples. +However, always validate your choice of $K$ using techniques like Cross-Validation. +::: + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.neighbors import KNeighborsClassifier +from sklearn.preprocessing import StandardScaler + +# 1. IMPORTANT: KNN requires scaling! +scaler = StandardScaler() +X_train_scaled = scaler.fit_transform(X_train) +X_test_scaled = scaler.transform(X_test) + +# 2. Initialize and Train +# n_neighbors is the 'K' parameter +knn = KNeighborsClassifier(n_neighbors=5, metric='euclidean') +knn.fit(X_train_scaled, y_train) + +# 3. Predict +y_pred = knn.predict(X_test_scaled) + +``` + +## 5. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| Simple to understand and implement. | **Slow Prediction:** It must calculate distance to every point for every prediction. | +| No assumptions about data distribution. | **Memory Intensive:** Must store the entire dataset in RAM. | +| Naturally handles multi-class classification. | **Sensitive to Scale:** Features with larger units will dominate the distance. | + +## 6. The Curse of Dimensionality + +KNN suffers significantly from the "Curse of Dimensionality." 
As the number of features increases, the "volume" of the space grows so fast that even the "nearest" neighbors become very far away. + +**Solution:** Always perform [Dimensionality Reduction (PCA)](/tutorial/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/dimensionality-reduction) or [Feature Selection](/tutorial/machine-learning/data-engineering-basics/data-cleaning-and-preprocessing/feature-selection) before using KNN on high-dimensional data. + +## References for More Details + +* **[Scikit-Learn KNN Documentation](https://scikit-learn.org/stable/modules/neighbors.html):** Learning about advanced algorithms like BallTree and KDTree for faster searches. + +--- + +**KNN is great for simple patterns, but what if you want a model that learns a "boundary" or a logic tree?** \ No newline at end of file diff --git a/docs/machine-learning/machine-learning-core/supervised-learning/classification/logistic-regression.mdx b/docs/machine-learning/machine-learning-core/supervised-learning/classification/logistic-regression.mdx index e69de29..adf46f0 100644 --- a/docs/machine-learning/machine-learning-core/supervised-learning/classification/logistic-regression.mdx +++ b/docs/machine-learning/machine-learning-core/supervised-learning/classification/logistic-regression.mdx @@ -0,0 +1,135 @@ +--- +title: Logistic Regression +sidebar_label: Logistic Regression +description: "Understanding binary classification, the Sigmoid function, and decision boundaries." +tags: [machine-learning, supervised-learning, classification, logistic-regression, sigmoid] +--- + +**Logistic Regression** is the go-to algorithm for binary classification (problems with two possible outcomes). While Linear Regression predicts a continuous number, Logistic Regression predicts the **probability** that an input belongs to a specific category. + +## 1. The Sigmoid Function + +The core difference between linear and logistic regression is the **Activation Function**. 
To turn a real-valued number into a probability between $0$ and $1$, we use the **Sigmoid (or Logistic) function**. + +**The Formula:** + +$$ +\sigma(z) = \frac{1}{1 + e^{-z}} +$$ + +**Where:** +* $z$ is the input (a linear combination of features). +* $e$ is Euler's number (approximately $2.71828$). + +**Key Properties:** +* If $z$ is a large positive number, $\sigma(z)$ approaches $1$. +* If $z$ is a large negative number, $\sigma(z)$ approaches $0$. +* If $z = 0$, $\sigma(z) = 0.5$. + +## 2. From Linear to Logistic + +Logistic Regression starts by calculating a linear combination of inputs, just like Linear Regression: + +$$ +z = \beta_0 + \beta_1x_1 + \beta_2x_2 + ... +$$ + +It then passes that result through the Sigmoid function to get the probability ($p$): + +$$ +p = \sigma(z) +$$ + +```mermaid +graph LR + X["$$x$$ (Input Feature)"] --> LR["Linear Regression"] + + LR --> L1["$$y = mx + c$$"] + L1 --> L2["$$\text{Straight Line}$$"] + L2 --> L3["$$y \in (-\infty, +\infty)$$"] + L3 --> L4["$$\text{❌ Not suitable for } y \in \{0,1\}$$"] + + X --> LOGR["Logistic Regression"] + + LOGR --> Z["$$z = wx + b$$"] + Z --> S["$$\sigma(z) = \frac{1}{1 + e^{-z}}$$"] + S --> S1["$$\text{S-curve (Sigmoid)}$$"] + S1 --> S2["$$P(y=1|x) \in [0,1]$$"] + S2 --> S3["$$\text{✅ Best fit for Binary Data}$$"] + + L4 -.->|"$$\text{Comparison}$$"| S3 + +``` + +## 3. The Decision Boundary + +To make a final classification, we apply a **threshold** (usually $0.5$). +* If $p \geq 0.5$, classify as **Class 1** (e.g., "Spam"). +* If $p < 0.5$, classify as **Class 0** (e.g., "Not Spam"). + +The line (or plane) where the probability is exactly $0.5$ is called the **Decision Boundary**. + +## 4. Implementation with Scikit-Learn + +```python +from sklearn.linear_model import LogisticRegression +from sklearn.model_selection import train_test_split + +# 1. Initialize the model +# 'liblinear' is a good solver for small datasets +model = LogisticRegression(solver='liblinear') + +# 2. 
Train +model.fit(X_train, y_train) + +# 3. Predict Class Labels +y_pred = model.predict(X_test) + +# 4. Predict Probabilities +y_probs = model.predict_proba(X_test)[:, 1] # Probability of being Class 1 + +``` + +## 5. Cost Function: Log Loss + +In Linear Regression, we use Mean Squared Error. However, because of the Sigmoid function, MSE would result in a non-convex cost function that is hard to optimize. Instead, Logistic Regression uses **Log Loss** (Cross-Entropy). + +Log Loss penalizes the model heavily when it is confident about a wrong prediction. + +$$ +J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} [y^{(i)} \log(\hat{y}^{(i)}) + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)})] +$$ + +where $m$ is the number of training samples, $y^{(i)}$ is the true label of sample $i$, and $\hat{y}^{(i)} = \sigma(z^{(i)})$ is its predicted probability. + +## 6. Multi-class Classification (One-vs-Rest) + +In its basic form, Logistic Regression is binary. To handle multiple classes (e.g., classifying an image as "Cat", "Dog", or "Bird"), one option is the **One-vs-Rest (OvR)** strategy, which trains one binary classifier per class and picks the class with the highest score. Note that recent Scikit-Learn versions fit a single multinomial (softmax) model by default with most solvers, so whether OvR is actually used depends on the solver and library version. + +## 7. Pros and Cons + +| Advantages | Disadvantages | +| --- | --- | +| Highly interpretable (you can see feature weights). | Assumes a linear relationship between features and log-odds. | +| Fast to train and predict. | Often outperformed by more complex models (like Random Forests). | +| Provides probabilities, not just hard labels. | Can struggle with highly non-linear data. | + + +## References for More Details + +* **[Scikit-Learn Logistic Regression Documentation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html):** Understanding regularization parameters like `C` (inverse of regularization strength). +* In this video, StatQuest provides an excellent visual explanation of Logistic Regression concepts: +