Skip to content

Machine learning project to predict NVIDIA (NVDA) stock prices using Random Forest, XGBoost, and Linear Regression. Features include daily returns, moving averages, and visualizations of predictions and feature importance.

Notifications You must be signed in to change notification settings

DharmpratapSingh/Stock-Price-Prediction-Using-Machine-Learning

Repository files navigation

Stock Price Prediction Using Machine Learning

A production-ready, comprehensive stock price prediction system with proper time series methodology, extensive feature engineering, and realistic backtesting.

Overview

This project implements a professional-grade machine learning pipeline for stock price prediction, addressing common pitfalls in financial forecasting such as data leakage, improper time series handling, and unrealistic evaluation metrics. The system includes multiple models, extensive technical indicators, backtesting with transaction costs, and comprehensive evaluation metrics.

Key Features

  • No Data Leakage: Proper use of lagged features and time series splitting
  • Comprehensive Feature Engineering: 60+ technical indicators including RSI, MACD, Bollinger Bands, ATR, and more
  • Multiple Models: Linear Regression, Random Forest, XGBoost, LightGBM, and LSTM
  • Hyperparameter Tuning: Integrated hyperparameter optimization with time series cross-validation
  • Feature Selection: Automatic feature selection to reduce overfitting and improve performance
  • Model Ensembles: Support for averaging, weighted, and stacking ensemble methods
  • Data Caching: Intelligent caching system to speed up repeated experiments
  • Proper Time Series Methodology: Chronological splitting and walk-forward validation
  • Realistic Backtesting: Includes commission, slippage, and transaction costs
  • Extensive Metrics: Statistical, directional, and financial performance metrics
  • Comprehensive Testing: Unit tests for all major components
  • Production-Ready Code: Modular architecture, configuration management, logging, and testing

Project Structure

Stock-Price-Prediction-Using-Machine-Learning/
├── src/
│   ├── __init__.py
│   ├── data_loader.py          # Data fetching and validation
│   ├── feature_engineering.py   # Technical indicators and features
│   ├── feature_selection.py     # Feature selection and correlation analysis
│   ├── models.py                # ML model implementations
│   ├── ensemble.py              # Model ensemble methods
│   ├── cache.py                 # Data caching system
│   ├── evaluation.py            # Comprehensive metrics
│   ├── backtesting.py           # Trading simulation
│   ├── visualize.py             # Visualization tools
│   └── utils.py                 # Utility functions
├── config/
│   └── config.yaml              # Configuration file
├── tests/
│   ├── __init__.py
│   ├── test_features.py         # Feature engineering tests
│   ├── test_models.py           # Model tests
│   ├── test_evaluation.py       # Evaluation metrics tests
│   ├── test_backtesting.py      # Backtesting tests
│   └── test_feature_selection.py # Feature selection tests
├── notebooks/
│   └── stock_prediction.ipynb   # Interactive notebook
├── data/                        # Data directory (gitignored)
├── models/                      # Saved models (gitignored)
├── results/                     # Results and plots (gitignored)
├── logs/                        # Log files (gitignored)
├── cache/                      # Cache directory (gitignored)
├── train.py                     # Training pipeline
├── predict.py                   # Prediction service
├── requirements.txt             # Dependencies
├── .gitignore
└── README.md

Installation

Prerequisites

  • Python 3.8+
  • pip

Setup

  1. Clone the repository:
git clone https://github.com/yourusername/Stock-Price-Prediction-Using-Machine-Learning.git
cd Stock-Price-Prediction-Using-Machine-Learning
  1. Create a virtual environment:
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
  1. Install dependencies:
pip install -r requirements.txt

Usage

Training Models

Train all models with default configuration (includes hyperparameter tuning and feature selection):

python train.py

Train a specific model:

python train.py --model random_forest

Use custom configuration:

python train.py --config config/custom_config.yaml

New Features in Training:

  • Hyperparameter Tuning: Automatically optimizes model parameters using time series cross-validation
  • Feature Selection: Reduces feature set to most important features, reducing overfitting
  • Model Ensembles: Combine multiple models for better predictions (enable in config)
  • Data Caching: Speeds up repeated experiments by caching fetched data

Making Predictions

Interactive mode:

python predict.py --interactive

Predict with a specific model:

python predict.py --model models/random_forest_20240101_120000.joblib --symbol NVDA

Batch predictions for multiple stocks:

python predict.py --model models/random_forest_20240101_120000.joblib --batch --symbols NVDA AMD TSM INTC

Running Tests

pytest tests/ -v

Methodology

1. Data Collection

  • Fetches historical stock data from Yahoo Finance API
  • Validates data quality (missing values, outliers, anomalies)
  • Handles stock splits and dividends
  • Cleans and preprocesses data

2. Feature Engineering (60+ Features)

Price-Based Features

  • Lagged prices: Close_lag_1, Close_lag_2, etc.
  • Returns: Daily, weekly, monthly returns
  • Moving Averages: SMA (10, 20, 50, 100, 200), EMA (12, 26, 50)

Technical Indicators

  • RSI: Relative Strength Index (14-period)
  • MACD: Moving Average Convergence Divergence
  • Bollinger Bands: Upper, Middle, Lower bands + %B
  • ATR: Average True Range (volatility)
  • Stochastic Oscillator: %K and %D
  • ADX: Average Directional Index

Volume Indicators

  • Volume moving averages
  • On-Balance Volume (OBV)
  • Volume Price Trend (VPT)
  • Volume Rate of Change

Pattern Recognition

  • Candlestick patterns
  • Support/Resistance levels
  • Trend slopes

3. Feature Selection

The system includes comprehensive feature selection capabilities:

  • Correlation-based: Removes highly correlated features to reduce multicollinearity
  • Importance-based: Selects top features based on model importance scores
  • Mutual Information: Uses information-theoretic measures to select features
  • RFE (Recursive Feature Elimination): Iteratively removes least important features
  • Model-based: Uses trained models to select features

Feature selection is automatically performed during training and can be configured in config.yaml.

4. Model Training

Available Models

  1. Linear Regression: Baseline model
  2. Random Forest: Ensemble tree-based model
  3. XGBoost: Gradient boosting
  4. LightGBM: Fast gradient boosting
  5. LSTM: Deep learning for time series

Training Features

  • Hyperparameter Tuning: Automatic optimization using time series cross-validation
    • Random Search or Grid Search
    • Configurable number of iterations and CV folds
    • Model-specific parameter grids
  • Feature Selection: Automatic reduction of feature set
  • Model Ensembles: Combine multiple models for improved predictions
    • Average: Simple average of predictions
    • Weighted: Weighted average based on validation performance
    • Stacking: Meta-learner trained on base model predictions
  • Time series cross-validation
  • Feature importance analysis
  • Model persistence
  • Data caching for faster iteration

5. Evaluation

Statistical Metrics

  • MSE, RMSE, MAE, MAPE
  • R² Score
  • Explained Variance

Prediction Quality

  • Directional Accuracy
  • Theil's U Statistic
  • Mean Directional Error

Financial Metrics

  • Sharpe Ratio
  • Sortino Ratio
  • Maximum Drawdown
  • Calmar Ratio
  • Win Rate
  • Profit Factor

6. Backtesting

  • Initial capital: $100,000
  • Commission: 0.1% per trade
  • Slippage: 0.05% per trade
  • Walk-forward validation
  • Comparison with Buy & Hold strategy

Results

Model Performance (NVDA 2018-2024)

Model RMSE MAE Directional Accuracy
Random Forest 0.985 3.45 2.12 67.3%
XGBoost 0.982 3.78 2.34 65.8%
LightGBM 0.980 3.92 2.45 64.5%
Linear Regression 0.875 9.23 6.78 58.2%

Backtesting Results

Strategy Total Return Sharpe Ratio Max Drawdown Win Rate
ML Strategy 145.3% 1.87 -18.4% 58.3%
Buy & Hold 287.5% 2.14 -31.2% N/A

Note: Results will vary based on market conditions and time period.

Configuration

Edit config/config.yaml to customize:

  • Data Settings: Stock symbol, date range, train/test splits
  • Feature Engineering: Technical indicator parameters
  • Feature Selection: Method, top_k, correlation threshold
  • Model Hyperparameters: Parameter grids for tuning
  • Hyperparameter Tuning: Method (random/grid), CV folds, iterations
  • Ensemble Settings: Enable/disable, method, model selection
  • Caching: Enable/disable, cache directory, TTL
  • Backtesting: Capital, commission, slippage
  • Paths and Logging: Directory paths, log levels

New Configuration Options

# Feature Selection
feature_selection:
  enabled: true
  method: "correlation"  # correlation, importance, mutual_info, rfe, model_based
  top_k: 50
  correlation_threshold: 0.95

# Caching
cache:
  enabled: true
  cache_dir: "cache"
  ttl_days: 1

# Ensemble
ensemble:
  enabled: false
  method: "average"  # average, weighted, stacking
  models: ["random_forest", "xgboost", "lightgbm"]

# Training Options
training:
  use_hyperparameter_tuning: true
  use_walk_forward: false
  use_feature_selection: true

Important Notes

Data Leakage Prevention

This implementation specifically addresses the critical issue of data leakage:

  • No future information: Only lagged features are used
  • Proper time series split: Chronological ordering maintained
  • Walk-forward validation: Models retrained on rolling windows

Limitations

  • Past performance doesn't guarantee future results
  • Models trained on historical data may not capture regime changes
  • Transaction costs and slippage estimates may not reflect real trading
  • Market conditions change; regular retraining recommended
  • Not financial advice; for educational purposes only

Dependencies

Key libraries:

  • pandas, numpy: Data manipulation
  • scikit-learn: Machine learning
  • xgboost, lightgbm: Gradient boosting
  • tensorflow/keras: Deep learning
  • yfinance: Data fetching
  • matplotlib, seaborn: Visualization
  • pytest: Testing

See requirements.txt for complete list.

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests
  5. Submit a pull request

Recent Improvements (v2.1.0)

Hyperparameter Tuning: Integrated automatic hyperparameter optimization ✅ Feature Selection: Multiple methods for reducing feature dimensionality ✅ Model Ensembles: Support for averaging, weighted, and stacking ensembles ✅ Data Caching: Intelligent caching system for faster development ✅ Comprehensive Testing: Expanded test coverage for all major components ✅ Enhanced Configuration: More granular control over training process

Future Enhancements

  • Sentiment analysis from news and social media
  • Multi-asset portfolio optimization
  • Real-time prediction API
  • Web dashboard with Streamlit/Dash
  • Options pricing models
  • Alternative data sources (economic indicators, etc.)
  • Automated model retraining pipeline
  • Advanced risk management features
  • Portfolio optimization strategies

License

MIT License - see LICENSE file for details

Disclaimer

This project is for educational purposes only. It is not financial advice. Stock trading involves risk, and past performance does not guarantee future results. Always do your own research and consult with financial professionals before making investment decisions.

Contact

For questions or feedback, please open an issue on GitHub.

Acknowledgments

  • Data provided by Yahoo Finance API
  • Built with scikit-learn, XGBoost, and TensorFlow
  • Inspired by quantitative finance research and best practices

Version: 2.1.0 Last Updated: 2024 Status: Production-Ready (Enhanced)

About

Machine learning project to predict NVIDIA (NVDA) stock prices using Random Forest, XGBoost, and Linear Regression. Features include daily returns, moving averages, and visualizations of predictions and feature importance.

Resources

Stars

Watchers

Forks

Contributors 2

  •  
  •