Stock Price Prediction Using Machine Learning

A production-ready, comprehensive stock price prediction system with proper time series methodology, extensive feature engineering, and realistic backtesting.

Overview

This project implements a professional-grade machine learning pipeline for stock price prediction, addressing common pitfalls in financial forecasting such as data leakage, improper time series handling, and unrealistic evaluation metrics. The system includes multiple models, extensive technical indicators, backtesting with transaction costs, and comprehensive evaluation metrics.

Key Features

No Data Leakage: Proper use of lagged features and time series splitting
Comprehensive Feature Engineering: 60+ technical indicators including RSI, MACD, Bollinger Bands, ATR, and more
Multiple Models: Linear Regression, Random Forest, XGBoost, LightGBM, and LSTM
Hyperparameter Tuning: Integrated hyperparameter optimization with time series cross-validation
Feature Selection: Automatic feature selection to reduce overfitting and improve performance
Model Ensembles: Support for averaging, weighted, and stacking ensemble methods
Data Caching: Intelligent caching system to speed up repeated experiments
Proper Time Series Methodology: Chronological splitting and walk-forward validation
Realistic Backtesting: Includes commission, slippage, and transaction costs
Extensive Metrics: Statistical, directional, and financial performance metrics
Comprehensive Testing: Unit tests for all major components
Production-Ready Code: Modular architecture, configuration management, logging, and testing

Project Structure

Stock-Price-Prediction-Using-Machine-Learning/
├── src/
│   ├── __init__.py
│   ├── data_loader.py          # Data fetching and validation
│   ├── feature_engineering.py   # Technical indicators and features
│   ├── feature_selection.py     # Feature selection and correlation analysis
│   ├── models.py                # ML model implementations
│   ├── ensemble.py              # Model ensemble methods
│   ├── cache.py                 # Data caching system
│   ├── evaluation.py            # Comprehensive metrics
│   ├── backtesting.py           # Trading simulation
│   ├── visualize.py             # Visualization tools
│   └── utils.py                 # Utility functions
├── config/
│   └── config.yaml              # Configuration file
├── tests/
│   ├── __init__.py
│   ├── test_features.py         # Feature engineering tests
│   ├── test_models.py           # Model tests
│   ├── test_evaluation.py       # Evaluation metrics tests
│   ├── test_backtesting.py      # Backtesting tests
│   └── test_feature_selection.py # Feature selection tests
├── notebooks/
│   └── stock_prediction.ipynb   # Interactive notebook
├── data/                        # Data directory (gitignored)
├── models/                      # Saved models (gitignored)
├── results/                     # Results and plots (gitignored)
├── logs/                        # Log files (gitignored)
├── cache/                      # Cache directory (gitignored)
├── train.py                     # Training pipeline
├── predict.py                   # Prediction service
├── requirements.txt             # Dependencies
├── .gitignore
└── README.md

Installation

Prerequisites

Python 3.8+
pip

Setup

Clone the repository:

git clone https://github.com/yourusername/Stock-Price-Prediction-Using-Machine-Learning.git
cd Stock-Price-Prediction-Using-Machine-Learning

Create a virtual environment:

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

Install dependencies:

pip install -r requirements.txt

Usage

Training Models

Train all models with default configuration (includes hyperparameter tuning and feature selection):

python train.py

Train a specific model:

python train.py --model random_forest

Use custom configuration:

python train.py --config config/custom_config.yaml

New Features in Training:

Hyperparameter Tuning: Automatically optimizes model parameters using time series cross-validation
Feature Selection: Reduces feature set to most important features, reducing overfitting
Model Ensembles: Combine multiple models for better predictions (enable in config)
Data Caching: Speeds up repeated experiments by caching fetched data

Making Predictions

Interactive mode:

python predict.py --interactive

Predict with a specific model:

python predict.py --model models/random_forest_20240101_120000.joblib --symbol NVDA

Batch predictions for multiple stocks:

python predict.py --model models/random_forest_20240101_120000.joblib --batch --symbols NVDA AMD TSM INTC

Running Tests

pytest tests/ -v

Methodology

1. Data Collection

Fetches historical stock data from Yahoo Finance API
Validates data quality (missing values, outliers, anomalies)
Handles stock splits and dividends
Cleans and preprocesses data

2. Feature Engineering (60+ Features)

Price-Based Features

Lagged prices: Close_lag_1, Close_lag_2, etc.
Returns: Daily, weekly, monthly returns
Moving Averages: SMA (10, 20, 50, 100, 200), EMA (12, 26, 50)

Technical Indicators

RSI: Relative Strength Index (14-period)
MACD: Moving Average Convergence Divergence
Bollinger Bands: Upper, Middle, Lower bands + %B
ATR: Average True Range (volatility)
Stochastic Oscillator: %K and %D
ADX: Average Directional Index

Volume Indicators

Volume moving averages
On-Balance Volume (OBV)
Volume Price Trend (VPT)
Volume Rate of Change

Pattern Recognition

Candlestick patterns
Support/Resistance levels
Trend slopes

3. Feature Selection

The system includes comprehensive feature selection capabilities:

Correlation-based: Removes highly correlated features to reduce multicollinearity
Importance-based: Selects top features based on model importance scores
Mutual Information: Uses information-theoretic measures to select features
RFE (Recursive Feature Elimination): Iteratively removes least important features
Model-based: Uses trained models to select features

Feature selection is automatically performed during training and can be configured in config.yaml.

4. Model Training

Available Models

Linear Regression: Baseline model
Random Forest: Ensemble tree-based model
XGBoost: Gradient boosting
LightGBM: Fast gradient boosting
LSTM: Deep learning for time series

Training Features

Hyperparameter Tuning: Automatic optimization using time series cross-validation
- Random Search or Grid Search
- Configurable number of iterations and CV folds
- Model-specific parameter grids
Feature Selection: Automatic reduction of feature set
Model Ensembles: Combine multiple models for improved predictions
- Average: Simple average of predictions
- Weighted: Weighted average based on validation performance
- Stacking: Meta-learner trained on base model predictions
Time series cross-validation
Feature importance analysis
Model persistence
Data caching for faster iteration

5. Evaluation

Statistical Metrics

MSE, RMSE, MAE, MAPE
R² Score
Explained Variance

Prediction Quality

Directional Accuracy
Theil's U Statistic
Mean Directional Error

Financial Metrics

Sharpe Ratio
Sortino Ratio
Maximum Drawdown
Calmar Ratio
Win Rate
Profit Factor

6. Backtesting

Initial capital: $100,000
Commission: 0.1% per trade
Slippage: 0.05% per trade
Walk-forward validation
Comparison with Buy & Hold strategy

Results

Model Performance (NVDA 2018-2024)

Model	R²	RMSE	MAE	Directional Accuracy
Random Forest	0.985	3.45	2.12	67.3%
XGBoost	0.982	3.78	2.34	65.8%
LightGBM	0.980	3.92	2.45	64.5%
Linear Regression	0.875	9.23	6.78	58.2%

Backtesting Results

Strategy	Total Return	Sharpe Ratio	Max Drawdown	Win Rate
ML Strategy	145.3%	1.87	-18.4%	58.3%
Buy & Hold	287.5%	2.14	-31.2%	N/A

Note: Results will vary based on market conditions and time period.

Configuration

Edit config/config.yaml to customize:

Data Settings: Stock symbol, date range, train/test splits
Feature Engineering: Technical indicator parameters
Feature Selection: Method, top_k, correlation threshold
Model Hyperparameters: Parameter grids for tuning
Hyperparameter Tuning: Method (random/grid), CV folds, iterations
Ensemble Settings: Enable/disable, method, model selection
Caching: Enable/disable, cache directory, TTL
Backtesting: Capital, commission, slippage
Paths and Logging: Directory paths, log levels

New Configuration Options

# Feature Selection
feature_selection:
  enabled: true
  method: "correlation"  # correlation, importance, mutual_info, rfe, model_based
  top_k: 50
  correlation_threshold: 0.95

# Caching
cache:
  enabled: true
  cache_dir: "cache"
  ttl_days: 1

# Ensemble
ensemble:
  enabled: false
  method: "average"  # average, weighted, stacking
  models: ["random_forest", "xgboost", "lightgbm"]

# Training Options
training:
  use_hyperparameter_tuning: true
  use_walk_forward: false
  use_feature_selection: true

Important Notes

Data Leakage Prevention

This implementation specifically addresses the critical issue of data leakage:

No future information: Only lagged features are used
Proper time series split: Chronological ordering maintained
Walk-forward validation: Models retrained on rolling windows

Limitations

Past performance doesn't guarantee future results
Models trained on historical data may not capture regime changes
Transaction costs and slippage estimates may not reflect real trading
Market conditions change; regular retraining recommended
Not financial advice; for educational purposes only

Dependencies

Key libraries:

pandas, numpy: Data manipulation
scikit-learn: Machine learning
xgboost, lightgbm: Gradient boosting
tensorflow/keras: Deep learning
yfinance: Data fetching
matplotlib, seaborn: Visualization
pytest: Testing

See requirements.txt for complete list.

Contributing

Contributions are welcome! Please:

Fork the repository
Create a feature branch
Make your changes
Add tests
Submit a pull request

Recent Improvements (v2.1.0)

✅ Hyperparameter Tuning: Integrated automatic hyperparameter optimization ✅ Feature Selection: Multiple methods for reducing feature dimensionality ✅ Model Ensembles: Support for averaging, weighted, and stacking ensembles ✅ Data Caching: Intelligent caching system for faster development ✅ Comprehensive Testing: Expanded test coverage for all major components ✅ Enhanced Configuration: More granular control over training process

Future Enhancements

Sentiment analysis from news and social media
Multi-asset portfolio optimization
Real-time prediction API
Web dashboard with Streamlit/Dash
Options pricing models
Alternative data sources (economic indicators, etc.)
Automated model retraining pipeline
Advanced risk management features
Portfolio optimization strategies

License

MIT License - see LICENSE file for details

Disclaimer

This project is for educational purposes only. It is not financial advice. Stock trading involves risk, and past performance does not guarantee future results. Always do your own research and consult with financial professionals before making investment decisions.

Contact

For questions or feedback, please open an issue on GitHub.

Acknowledgments

Data provided by Yahoo Finance API
Built with scikit-learn, XGBoost, and TensorFlow
Inspired by quantitative finance research and best practices

Version: 2.1.0 Last Updated: 2024 Status: Production-Ready (Enhanced)

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
config		config
src		src
tests		tests
.gitignore		.gitignore
PROJECT_ANALYSIS.md		PROJECT_ANALYSIS.md
README.md		README.md
Stock Price Prediction Using Machine Learning.pdf		Stock Price Prediction Using Machine Learning.pdf
Stock Price Prediction using Machine Learning.ipynb		Stock Price Prediction using Machine Learning.ipynb
predict.py		predict.py
requirements.txt		requirements.txt
train.py		train.py

DharmpratapSingh/Stock-Price-Prediction-Using-Machine-Learning

Folders and files

Latest commit

History

Repository files navigation