pynat/mkr_coin

Analysis on MKRUSD, trying different machine learning models to find patterns for percentage change in the closing price

Overview

Project Focus:

  • Analyze MKR price volatility and predict its price changes using machine learning models

MKR Overview:

  • Native governance token of the MakerDAO ecosystem
  • MakerDAO supports DAI, a decentralized stablecoin pegged to the US dollar
  • MKR holders influence decision-making within the ecosystem

DAI Importance:

  • Stability is crucial for decentralized finance (DeFi) applications
  • DAI is created by locking collateral in MakerDAO smart contracts

Value of Predicting MKR Price:

  • MKR impacts the health of DAI and MakerDAO's protocol
  • Accurate predictions provide insights into market sentiment, governance decisions, and DAI stability
  • Beneficial for DeFi participants and investors

Prediction Target (y):

  • y represents the percentage change in the closing price of MKR over consecutive time periods
  • It is calculated as:
y = (close - close_lag_1) / close_lag_1
  • close: Closing price of MKR at the current time step
  • close_lag_1: Closing price of MKR at the previous time step
  • Continuous variable representing the relative change in MKR price
  • Positive values = price increase; negative values = price decrease
  • Values expressed as decimals (e.g., 0.05 = 5% increase, -0.03 = 3% decrease)
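Concretely, the target can be computed with pandas as a one-step percentage change (the prices below are illustrative, not real MKR data):

```python
import pandas as pd

# Hypothetical closing prices; column names follow the definitions above.
df = pd.DataFrame({"close": [1500.0, 1575.0, 1527.75]})
df["close_lag_1"] = df["close"].shift(1)

# y = (close - close_lag_1) / close_lag_1, the one-step percentage change
df["y"] = (df["close"] - df["close_lag_1"]) / df["close_lag_1"]

print(df["y"].tolist())  # first value is NaN (no previous close)
```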

Why Use Percentage Change:

  • Normalizes price movements, reducing sensitivity to absolute price levels
  • Captures relative price movements, essential for understanding volatility and predicting trends in a highly volatile asset like MKR

Importance of Analyzing Percentage Change:

  • Reveals patterns in MKR's volatility and behavior
  • Highlights MKR's impact on DAI stability and the MakerDAO ecosystem
  • Provides actionable insights for DeFi participants and investors

Features

Crypto Data Fetcher:

  • Retrieves OHLC data for selected cryptocurrencies and stablecoins using the Binance and Kraken APIs
  • Includes additional derived metrics and timezone conversion
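A fetcher like this can be sketched against Binance's public klines endpoint; the symbol and interval values here are assumptions, and the repo's actual fetcher may differ in shape:

```python
import json
import urllib.parse
import urllib.request

def parse_kline(k):
    # Binance kline rows look like:
    # [open_time, open, high, low, close, volume, close_time, ...]
    return {
        "open_time": k[0],
        "open": float(k[1]),
        "high": float(k[2]),
        "low": float(k[3]),
        "close": float(k[4]),
        "volume": float(k[5]),
    }

def fetch_ohlc(symbol="MKRUSDT", interval="1h", limit=5):
    # Public endpoint; no API key is needed for historical candles.
    qs = urllib.parse.urlencode({"symbol": symbol, "interval": interval, "limit": limit})
    url = f"https://api.binance.com/api/v3/klines?{qs}"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return [parse_kline(k) for k in json.load(resp)]
```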

Stock Data Fetcher:

  • Fetches hourly stock data for predefined tickers using Yahoo Finance
  • Enriches data with calculated metrics

Feature Engineering:

  • Creates various technical features and custom calculations
  • Utilizes the TA-Lib library for advanced technical analysis
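Alongside the TA-Lib indicators, rolling features such as the 7-period moving average and volatility (both of which appear in the data exploration below) can be sketched with pandas; the prices here are placeholders:

```python
import pandas as pd

# Placeholder close prices standing in for the merged dataset.
df = pd.DataFrame({"close": [100.0, 102.0, 101.0, 105.0, 104.0, 108.0, 110.0, 109.0]})

df["price_change"] = df["close"].pct_change()          # one-step percentage change
df["7d_ma"] = df["close"].rolling(7).mean()            # 7-period moving average
df["7d_volatility"] = df["price_change"].rolling(7).std()  # 7-period volatility
```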

MKRUSDT Analysis:

  • Focuses on the governance token MKR
  • Examines factors influencing its price growth
  • Uses machine learning models to predict the target variable (y)

Machine Learning Models:

  • Implements models such as Linear Regression (LR), Decision Trees (DT), Random Forest (RF), and XGBoost
  • Designed to predict MKR price trends

Flask:

  • Included for programmatic interaction with the data
  • Optional, suitable for deployment

Docker Support:

  • A Dockerfile is provided for easy deployment in containerized environments

Datasets

Crypto Data:
Link to Cryptocurrencies Binance Dataset

Link to Cryptocurrencies Kraken Dataset

Stock Data:
Link to Stocks Dataset

Merged Data with Features:
Link to merged Dataset

Structure

stable_coin/  
├── images/                                     # Contains the images that are generated through EDA     
│   ├── boxplot_mkr.png      
│   ├── correlation_matrix_mkr.png   
│   ├── distribution_price_change.png            
│   ├── timeseries_mkrusdt.png      
│   ├── timeseries_eur.png    
│   ├── price_change_correlation_with_volume.png 
├── README.md                      
├── notebook.ipynb       
│   ├── get_coins                               # Fetches and processes cryptocurrency data     
│   ├── get_stocks                              # Fetches and processes stock data      
│   ├── feature engineering                     # Adds derived metrics for ML models      
│   ├── model evaluation and tuning             # Compares models and saves the best as a pickle file      
├── train.py                                    # Trains the best model            
├── predict.py                                  # Flask application for making predictions       
├── requirements.txt                            # List of required Python packages       
├── environment.yml                             # Conda environment file     
├── LICENSE      
├── Dockerfile                                  # For containerized deployment    

Data Exploration:

MKRUSDT (Maker):
Number of data points: 1862
Close Price:
Mean: 1566.29
Min: 1063.00
Max: 2411.00
Standard Deviation: 298.14

Price Change:
Mean: 0.50%
Max: 6.62%
Min: -4.03%

Volume:
Mean: 572.94
Min: 16.25
Max: 8915.15

7-Day Moving Average (7d_ma):
Mean: 1564.15
Min: 201.77
Max: 2374.29

7-Day Volatility:
Mean: 17.30%
Max: 751.05%

DAIUSD (DAI Stablecoin):
Number of data points: 613
Maximum price change: 0.47%

Correlation for MKRUSDT

Correlation Matrix
Key observations:

  • 7d_ma and 30d_ma: Highly correlated with close and open, indicating their importance for identifying price trends
  • atr is moderately correlated with price indicators, emphasizing its role in volatility analysis
  • Indicators like adx and rsi have weak correlations with price-related variables but are useful for providing additional signals
  • volume and volume_change are moderately correlated with certain price metrics, making them valuable for demand-supply analysis
  • growth_future_1h and growth_future_24h have weak correlations with other features, suggesting they may be challenging targets to predict directly
  • Feature Combination: Moving averages, volatility, and volume-based features form a strong foundation for predicting MKR price trends

Boxplot for Closing Prices for MKRUSDT

Key observations:

  • Median price around 1500
  • Outliers visible around 2200, indicating occasional price spikes
  • Spread and Support Level: Moderate spread within the core trading range, lower whisker extends to ~1000, suggesting a historical support level
  • Interquartile Range (IQR): Middle 50% of price activity is relatively concentrated
  • Overall Pattern: Indicates a stable trading range with occasional upside volatility
    Boxplot

Timeseries for MKRUSDT and DAIUSD

Timeseries Key observations for MKRUSDT:

  • Starting Point: Began around $1200 with initial sideways movement until early November
  • Upward Trend: Strong rally from November to early December, peaking at ~$2400 in early December
  • December Volatility: Multiple peaks above $2000, significant price fluctuations
  • Downward Trend: Gradual decline since mid-December, currently trading around $1400 with bearish momentum
  • Overall Range: $1000–2400, with most activity between $1400–2000
  • Market Pattern: Suggests a completed pump-and-distribution phase

Timeseries
Key observations for DAIUSD:

  • Price Stability: Consistent around $1.00, typical for a stablecoin
  • Minimal Volatility: Fluctuations mostly within the $0.999–$1.001 range
  • Notable Spikes: Brief spike to $1.005 on January 9th, small spike to $1.002 on January 1st
  • Peg Stability: Maintains excellent peg stability around $1.00
  • Recent Activity (Jan 9–13): Slightly increased volatility, but remains within acceptable ranges

Distribution of Price Change for MKRUSDT

Distribution of Price Change
Key observations:

  • Distribution Shape: Appears normal (bell-shaped) and centered around 0, indicating balanced price movements
  • Most Frequent Changes: Small fluctuations, typically between -1 and +1
  • Outliers: A few extreme positive outliers, reaching up to +6
  • Tails: Distribution tails extend from roughly -4 to +6
  • Peak Frequency: The smallest price movements are the most common, peaking at roughly 250 occurrences

Machine Learning Models

Target Variable Analysis: y

Mean y: 0.0557
Standard Deviation y: 5.3460
Histogram of y
Key observations:

  • Illustrates the frequency distribution of the target variable y across the Train, Validation, and Test datasets
  • Majority of values concentrated near 0
  • Extreme outliers present, with some values exceeding 5000
  • Distribution is highly skewed, with most values in a small range and a few significantly larger ones
  • Extreme outliers may adversely affect the model by increasing error and reducing prediction accuracy
  • Skewness suggests the model might face difficulty in accurately predicting y

Boxplot of y Key observations:

  • Interquartile Range (IQR) is small, suggesting that most data points are closely clustered
  • Numerous strong outliers exceed 1000
  • This aligns with the histogram: the majority of values are small, with a few extreme values
  • These outliers can significantly distort metrics like MSE and RMSE during training and validation
  • Next Step: further analysis to decide whether to remove or transform the outliers

Transformation of y

  • To address the skewness and extreme outliers, a logarithmic transformation was applied to y:
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
y_test_log = np.log1p(y_test)

Histogram of y
Boxplot of y
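For reference, np.expm1 inverts the np.log1p transform, so predictions can be mapped back to the original scale. Note that log1p is only defined for values greater than -1, which holds for a percentage change as long as the price never drops to zero. A sketch with illustrative values:

```python
import numpy as np

# Illustrative targets: small changes plus large positive outliers,
# mimicking the skewed distribution described above.
y = np.array([0.05, -0.03, 12.0, 5000.0])

y_log = np.log1p(y)   # compresses the large outliers
y_back = np.expm1(y_log)  # inverse transform recovers the original scale

assert np.allclose(y_back, y)
```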

Linear Regression (LR)

Accuracy Drop Linear Regression

  • Features ppo, trix, and atr show a positive accuracy drop, meaning removing them decreases model accuracy
  • Features like sma20, cci, and roc show a negative accuracy drop, meaning removing them could improve model accuracy
  • Several features have no influence on accuracy and could be considered for removal
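The accuracy-drop idea amounts to drop-one-feature ablation. A minimal sketch with scikit-learn on synthetic data (the feature names merely echo the ones above and are not the real columns):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))
y = 2 * X[:, 0] + rng.normal(scale=0.1, size=300)  # only feature 0 matters
names = ["ppo", "sma20", "cci"]  # placeholder names for illustration

# Baseline error with all features, then error after dropping each one.
base = mean_squared_error(y, LinearRegression().fit(X, y).predict(X))
drops = {}
for i, name in enumerate(names):
    X_drop = np.delete(X, i, axis=1)
    mse = mean_squared_error(y, LinearRegression().fit(X_drop, y).predict(X_drop))
    drops[name] = mse - base  # positive = removing the feature hurts

print(drops)
```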

Distribution of Predicted Values for Linear Regression
Key Observations:

  • Most predictions are centered around 0, with a sharp peak and minimal spread
  • Indicates that the model is predicting a narrow range of values, which could suggest underfitting or that the target variable has a limited variance

Decision Trees (DT)

Cross-Validation MSE Heatmap for Decision Tree

Key observations:

  • Higher values for min_samples_leaf (13-15) yield better results, with MSE around 12.9
  • max_depth has minimal impact, with stable MSE across depths, except for min_samples_leaf=1, where a significant deterioration occurs at depth=4 (MSE spikes to ~34)
  • Optimal Configuration: max_depth=4, min_samples_leaf=13, achieving an MSE of 12.9
  • Interpretation: The model benefits from higher leaf sample restrictions, preventing overfitting
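A grid search of this kind can be sketched with scikit-learn on synthetic data; the parameter ranges below are assumptions based on the observations above:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(400, 5))
y = X[:, 0] ** 2 + rng.normal(scale=0.5, size=400)  # synthetic nonlinear target

# Cross-validated search over max_depth and min_samples_leaf,
# mirroring the heatmap axes above.
grid = GridSearchCV(
    DecisionTreeRegressor(random_state=1),
    {"max_depth": [2, 3, 4, 5], "min_samples_leaf": [1, 5, 13, 15]},
    scoring="neg_mean_squared_error",
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_)
```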

Random Forest (RF)

MSE vs Number of Trees for Different Minimum Sample Leafs Random Forest
Key observations:

  • Best number of trees (n_estimators): 180, achieving an MSE of ~2900.731
  • Best max_depth: 10, balancing between underfitting (depth=5) and overfitting (depth=15)
  • Best min_samples_leaf: 1, enabling the model to capture detailed patterns
  • Final model performance: MSE: 2870.841, RMSE: 53.580, R² Score: 0.085
  • While RMSE is reasonable, the low R² score (8.5%) indicates the model has limited explanatory power

Residual Plot For Random Forest
Key observations:

  • The residuals (in log scale) are mostly clustered around zero, indicating accurate model predictions in this scale
  • A few residuals deviate from zero, suggesting areas where the model struggles with accuracy
  • No clear pattern in the residuals, indicating that the model is well-calibrated in the log scale
  • Despite the well-calibrated residuals in the log scale, the low R² suggests that the model’s explanatory power remains limited

XGBoost

Scatterplot Actual vs Predicted Values XGBOOST
Key observations:

  • Majority of points cluster around the diagonal (pink dashed line), indicating model predictions generally align with actual values
  • Strong linear relationship between predicted and actual values, suggesting the model captures underlying patterns well
  • Most data points are concentrated around the 0 value on both axes
  • Data spans from approximately -2 to 6 on both axes, with sparse points in higher value ranges (4-6)
  • Some outliers visible, particularly around (-2, 0) and (6, 0)
  • Slight tendency to underpredict at extreme values
  • Sparsity at higher values: Indicates less reliable predictions in these ranges
  • Generating more training data for extreme value ranges (4-6) could improve model reliability in these areas

Feature Importance For XGBOOST
Key observations:

  • Dominant Features: Technical indicators like trix, roc, ppo, cmo, cci, and bop dominate, suggesting strong predictive power for 1-hour predictions
  • Time-Based Features: hour, day, month, and year show very low importance, indicating that price movements are more influenced by technical factors than by time (but consider the short time-frame of 3m)
  • Cross-Crypto Correlation: Most cryptocurrency tickers (e.g., BTCUSDT, ETHUSDT) have minimal impact, showing limited cross-crypto correlation in 1-hour predictions
  • Traditional Market Indicators: Indicators like ^SPX and ^VIX show low importance, with limited correlation to traditional markets
  • Fibonacci Levels: Surprisingly low importance across all timeframes, despite their common use in technical analysis
  • Focusing on technical indicators is more valuable than time-based or cross-crypto features, while traditional market indicators and Fibonacci levels have limited predictive power

To simplify the model, I focused on the top 10-15 features and examined SHAP values

SHAP For XGBOOST

Key observations:

  • SHAP Analysis: Provides insight into how individual feature values influence predictions
  • Each dot represents a data point, with color indicating feature value (blue = low, red = high)
  • For "price_change", high values (red dots) positively push predictions, while low values (blue dots) negatively influence them
  • Features like cci and roc exhibit a mix of positive and negative effects, suggesting non-linear relationships with the target variable
  • Features with narrow SHAP value distributions, like day or ln_volume, have a limited effect on predictions across the dataset

Retraining Result:

Surprisingly, retraining the model with the most important features resulted in lower scores

The best model is:

XGBoost

  • Best Hyperparameters:
    • eta: 0.25
    • max_depth: 5
    • min_child_weight: 1
  • Validation Set Metrics:
    • RMSE: 1.4671
    • MSE: 2.1524
    • MAE: 0.0163
    • R² Score: 0.9247

Installation

  • Clone the repository:
  git clone https://github.com/your-repo.git
  cd your-repo
  • Set up the environment using Conda:
  conda env create -f environment.yml
  conda activate your-environment-name
  • Using pip:
  pip install -r requirements.txt

Installation Instructions for TA-Lib

  • The TA-Lib library is required for this project but is not installed automatically via the environment.yml file
  • You need to install it manually due to potential platform-specific compilation requirements

To install TA-Lib, follow these steps:

  • Using Conda (Recommended):
conda install -c conda-forge ta-lib
  • Using pip: If you prefer pip, ensure you have the required dependencies installed and run:
pip install TA-Lib
  • On macOS with Homebrew: First, install the TA-Lib C library:
brew install ta-lib
  • Then install the Python wrapper:
pip install TA-Lib
  • On Linux: Install the required development library (e.g., for Ubuntu):
sudo apt-get install libta-lib-dev
  • Then install the Python wrapper:
pip install TA-Lib
  • On Windows: Download and install the precompiled binaries for your system from the TA-Lib website, then install the Python wrapper:
pip install TA-Lib

Make sure TA-Lib is installed before running the application. If you encounter any issues, refer to the TA-Lib documentation for further assistance

How to Use

  • Jupyter Notebook (notebook.ipynb):

    • Fetch cryptocurrency and stock market data:
      • Fetches data for cryptocurrencies and stablecoins defined in the coins list.
      • Processes the data and adds derived metrics (e.g., price change).
      • Saves the final dataset as stable_coins.csv.
    • Fetch hourly stock data for predefined tickers:
      • Adds derived metrics.
      • Formats timestamps.
      • Combines all stock data into a single DataFrame and saves it as a CSV.
      • Logs missing or delisted stocks/cryptos as warnings or errors.
    • Perform feature engineering and derive metrics.
    • Evaluate multiple machine learning models.
    • Save the best models as .pkl files.
  • Train the Model:

    • Use train.py to train the best-performing model (default: XGBoost) on the processed data.
    • Save the trained model as a .pkl file.
  • Deploy the Model with Flask:

    • Use predict.py to deploy the model and provide predictions via Flask.

Flask

  • The repository includes a Flask application (predict.py) to interact with the trained XGBoost model. The API allows users to predict the 'close' price change within the next hour.

  • Steps to Use:

    • Start the Flask Server
    • Ensure the conda environment is active and run:
    python predict.py --port=<PORT>
  • Replace <PORT> with the desired port number (e.g., 5001). If no port is specified, the server defaults to port 8000

  • Example:

    python predict.py --port=5001

Make Predictions

  • Send an HTTP POST request with the input features as JSON to the /predict endpoint, using the port you specified earlier

  • Example Input:

curl -X POST http://127.0.0.1:5001/predict \
-H "Content-Type: application/json" \
-d '{"ln_volume": -25.422721545090816, "bop": -0.44444, "ppo": 1.0097517730496455}'
  • Example Response:
{
  "predicted_growth_rate": -0.002064734697341919
}
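For illustration, a stripped-down version of such an endpoint might look like the sketch below. The stub model stands in for the real pickled XGBoost model, and the feature handling is simplified; the actual predict.py may differ:

```python
import argparse

from flask import Flask, jsonify, request

app = Flask(__name__)

class _StubModel:
    # Stand-in for the pickled XGBoost model so the sketch is self-contained.
    def predict(self, rows):
        return [sum(rows[0]) * 0.001]

model = _StubModel()

@app.route("/predict", methods=["POST"])
def predict():
    features = request.get_json()
    pred = model.predict([list(features.values())])
    return jsonify({"predicted_growth_rate": float(pred[0])})

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--port", type=int, default=8000)
    app.run(port=parser.parse_args().port)
```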

Run with Docker

  • To simplify deployment, a Dockerfile is provided. To build and run the Docker container:

  • Build the Docker image (the first command pulls the Anaconda base image):

docker pull continuumio/anaconda3
docker build -t mkr-coin-analysis .
  • Run the container:
docker run -p 5001:5001 mkr-coin-analysis

License

This project is open-source and licensed under the MIT License.
