- Analyze MKR price volatility and predict its price changes using machine learning models
- Native governance token of the MakerDAO ecosystem
- MakerDAO supports DAI, a decentralized stablecoin pegged to the US dollar
- MKR holders influence decision-making within the ecosystem
- Stability is crucial for decentralized finance (DeFi) applications
- DAI is created by locking collateral in MakerDAO smart contracts
- MKR impacts the health of DAI and MakerDAO's protocol
- Accurate predictions provide insights into market sentiment, governance decisions, and DAI stability
- Beneficial for DeFi participants and investors
- y represents the percentage change in the closing price of MKR over consecutive time periods
- It is calculated as (a pandas sketch follows this list):
y = (close - close_lag_1) / close_lag_1
- close: Closing price of MKR at the current time step
- close_lag_1: Closing price of MKR at the previous time step
- Continuous variable representing the relative change in MKR price
- Positive values = price increase; negative values = price decrease
- Values expressed as decimals (e.g., 0.05 = 5% increase, -0.03 = 3% decrease)
- Normalizes price movements, reducing sensitivity to absolute price levels
- Captures relative price movements, essential for understanding volatility and predicting trends in a highly volatile asset like MKR
- Reveals patterns in MKR's volatility and behavior
- Highlights MKR's impact on DAI stability and the MakerDAO ecosystem
- Provides actionable insights for DeFi participants and investors
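A minimal sketch of how this target could be computed with pandas (the DataFrame name `df` and its `close` column are assumptions):

```python
import pandas as pd

def add_target(df: pd.DataFrame) -> pd.DataFrame:
    """Add the percentage-change target y to an hourly OHLC DataFrame sorted by time."""
    out = df.copy()
    out["close_lag_1"] = out["close"].shift(1)                       # previous hour's close
    out["y"] = (out["close"] - out["close_lag_1"]) / out["close_lag_1"]
    return out.dropna(subset=["y"])                                  # first row has no lagged close
```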
- Retrieves OHLC data for selected cryptocurrencies and stablecoins using the Binance and Kraken APIs (a fetching sketch follows this list)
- Includes additional derived metrics and timezone conversion
- Fetches hourly stock data for predefined tickers using Yahoo Finance
- Enriches data with calculated metrics
- Creates various technical features and custom calculations
- Utilizes the TA-Lib library for advanced technical analysis
- Focuses on the governance token MKR
- Examines factors influencing its price growth
- Uses machine learning models to predict the target variable (y)
- Implements models such as Linear Regression (LR), Decision Trees (DT), Random Forest (RF), and XGBoost
- Designed to predict MKR price trends
- Included for programmatic interaction with the data
- Optional, suitable for deployment
- A Dockerfile is provided for easy deployment in containerized environments
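A hedged sketch of how the two data sources could be queried; the Binance public klines endpoint and the yfinance call shown here are illustrative and may differ from the notebook's exact implementation:

```python
import pandas as pd
import requests
import yfinance as yf

def fetch_binance_ohlc(symbol: str = "MKRUSDT", interval: str = "1h", limit: int = 1000) -> pd.DataFrame:
    """Pull hourly klines from Binance's public REST API and keep the OHLCV columns."""
    url = "https://api.binance.com/api/v3/klines"
    raw = requests.get(url, params={"symbol": symbol, "interval": interval, "limit": limit}).json()
    cols = ["open_time", "open", "high", "low", "close", "volume",
            "close_time", "quote_volume", "trades", "taker_base", "taker_quote", "ignore"]
    df = pd.DataFrame(raw, columns=cols)
    df["open_time"] = pd.to_datetime(df["open_time"], unit="ms", utc=True)   # timezone-aware timestamps
    return df[["open_time", "open", "high", "low", "close", "volume"]].astype(
        {"open": float, "high": float, "low": float, "close": float, "volume": float})

def fetch_stock_hourly(ticker: str = "^VIX") -> pd.DataFrame:
    """Pull hourly stock/index data via Yahoo Finance."""
    return yf.download(ticker, interval="1h", period="1mo")
```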
Crypto Data:
Link to Cryptocurrencies Binance Dataset
Link to Cryptocurrencies Kraken Dataset
Stock Data:
Link to Stocks Dataset
Merged Data with Features:
Link to merged Dataset
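A minimal sketch of the kind of TA-Lib feature engineering behind the merged dataset (column names and window sizes are assumptions, not the notebook's exact choices):

```python
import pandas as pd
import talib

def add_ta_features(df: pd.DataFrame) -> pd.DataFrame:
    """Derive technical indicators from an hourly OHLCV DataFrame."""
    out = df.copy()
    o, h, l, c = (out[col].astype(float).to_numpy() for col in ["open", "high", "low", "close"])
    out["rsi"] = talib.RSI(c, timeperiod=14)            # momentum oscillator
    out["adx"] = talib.ADX(h, l, c, timeperiod=14)      # trend strength
    out["atr"] = talib.ATR(h, l, c, timeperiod=14)      # volatility
    out["cci"] = talib.CCI(h, l, c, timeperiod=14)
    out["roc"] = talib.ROC(c, timeperiod=10)
    out["ppo"] = talib.PPO(c, fastperiod=12, slowperiod=26)
    out["trix"] = talib.TRIX(c, timeperiod=30)
    out["cmo"] = talib.CMO(c, timeperiod=14)
    out["bop"] = talib.BOP(o, h, l, c)
    out["volume_change"] = out["volume"].pct_change()
    out["7d_ma"] = out["close"].rolling(7 * 24).mean()  # 7 days of hourly bars (window is an assumption)
    out["7d_volatility"] = out["close"].rolling(7 * 24).std()
    return out
```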
stable_coin/
├── images/ # Images generated during EDA
│ ├── boxplot_mkr.png
│ ├── correlation_matrix_mkr.png
│ ├── distribution_price_change.png
│ ├── timeseries_mkrusdt.png
│ ├── timeseries_eur.png
│ ├── price_change_correlation_with_volume.png
├── README.md
├── notebook.ipynb # Main notebook with the sections below
│ ├── get_coins # Fetches and processes cryptocurrency data
│ ├── get_stocks # Fetches and processes stock data
│ ├── feature engineering # Adds derived metrics for ML models
│ ├── model evaluation and tuning # Compares models and saves the best as a pickle file
├── train.py # Trains the best model
├── predict.py # Flask application for making predictions
├── requirements.txt # List of required Python packages
├── environment.yml # Conda environment file
├── LICENSE
├── Dockerfile # For containerized deployment
MKRUSDT (Maker):
Number of data points: 1862
Close Price:
Mean: 1566.29
Min: 1063.00
Max: 2411.00
Standard Deviation: 298.14
Price Change:
Mean: 0.50%
Max: 6.62%
Min: -4.03%
Volume:
Mean: 572.94
Min: 16.25
Max: 8915.15
7-Day Moving Average (7d_ma):
Mean: 1564.15
Min: 201.77
Max: 2374.29
7-Day Volatility:
Mean: 17.30%
Max: 751.05%
DAIUSD (DAI Stablecoin):
Number of data points: 613
Maximum price change: 0.47%
- `7d_ma` and `30d_ma`: Highly correlated with `close` and `open`, indicating their importance for identifying price trends
- `atr` is moderately correlated with price indicators, emphasizing its role in volatility analysis
- Indicators like `adx` and `rsi` have weak correlations with price-related variables but are useful for providing additional signals
- `volume` and `volume_change` are moderately correlated with certain price metrics, making them valuable for demand-supply analysis
- `growth_future_1h` and `growth_future_24h` have weak correlations with other features, suggesting they may be challenging targets to predict directly
- Feature Combination: Moving averages, volatility, and volume-based features form a strong foundation for predicting MKR price trends
Key observations:
- Median price around 1500
- Outliers visible around 2200, indicating occasional price spikes
- Spread and Support Level: Moderate spread within the core trading range, lower whisker extends to ~1000, suggesting a historical support level
- Interquartile Range (IQR): Middle 50% of price activity is relatively concentrated
- Overall Pattern: Indicates a stable trading range with occasional upside volatility
- Starting Point: Began around $1200 with initial sideways movement until early November
- Upward Trend: Strong rally from November to early December, peaking at ~$2400 in early December
- December Volatility: Multiple peaks above $2000, significant price fluctuations
- Downward Trend: Gradual decline since mid-December, currently trading around $1400 with bearish momentum
- Overall Range: $1000–2400, with most activity between $1400–2000
- Market Pattern: Suggests a completed pump-and-distribution phase
- Price Stability: Consistent around $1.00, typical for a stablecoin
- Minimal Volatility: Fluctuations mostly within the $0.999–$1.001 range
- Notable Spikes: Brief spike to $1.005 on January 9th, small spike to $1.002 on January 1st
- Peg Stability: Maintains excellent peg stability around $1.00
- Recent Activity (Jan 9–13): Slightly increased volatility, but remains within acceptable ranges
- Distribution Shape: Appears normal (bell-shaped) and centered around 0, indicating balanced price movements
- Most Frequent Changes: Small fluctuations, typically between -1 and +1
- Outliers: A few extreme positive outliers, reaching up to +6
- Tails: Distribution tails extend from roughly -4 to +6
- Peak Frequency: The most frequent bin (the smallest price movements) contains around 250 occurrences
Mean y: 0.0557
Standard Deviation y: 5.3460
Key observations:
- Illustrates the frequency distribution of the target variable y across the Train, Validation, and Test datasets
- Majority of values concentrated near 0
- Extreme outliers present, with some values exceeding 5000
- Distribution is highly skewed, with most values in a small range and a few significantly larger ones
- Extreme outliers may adversely affect the model by increasing error and reducing prediction accuracy
- Skewness suggests the model might face difficulty in accurately predicting y
- Interquartile Range (IQR) is small, suggesting that most data points are closely clustered
- Numerous strong outliers exceed 1000
- This aligns with the histogram: the majority of values are small, with a few extreme values
- These outliers can significantly distort metrics like MSE and RMSE during training and validation
- Next Step: further analysis to decide whether to remove or transform the outliers
- To address the skewness and extreme outliers, a logarithmic transformation was applied to y:
import numpy as np

y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
y_test_log = np.log1p(y_test)
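Predictions made in this log space can be mapped back to the original scale with the inverse transform before computing metrics (a small sketch; `model` and `X_val` are assumed to be the fitted estimator and validation features):

```python
y_pred_log = model.predict(X_val)   # predictions in log1p space
y_pred = np.expm1(y_pred_log)       # inverse of np.log1p, back to the original scale of y
```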
- Features `ppo`, `trix`, and `atr` show a positive accuracy drop, meaning removing these features decreases model accuracy
- Features like `sma20`, `cci`, and `roc` show a negative accuracy drop, meaning removing these features could improve model accuracy
- Various features have no influence on accuracy and could be considered for removal
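One way to measure this kind of accuracy drop is scikit-learn's permutation importance (not necessarily the exact method used in the notebook); a sketch assuming a fitted `model` and a validation split `X_val`/`y_val_log`:

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature on the validation set and record the resulting change in score;
# a negative mean importance means the model scores as well or better with that feature shuffled.
result = permutation_importance(model, X_val, y_val_log, n_repeats=10,
                                random_state=42, scoring="neg_mean_squared_error")
drops = sorted(zip(X_val.columns, result.importances_mean), key=lambda t: -t[1])
for name, drop in drops[:15]:
    print(f"{name:20s} {drop:+.4f}")
```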
- Most predictions are centered around 0, with a sharp peak and minimal spread
- Indicates that the model is predicting a narrow range of values, which could suggest underfitting or that the target variable has a limited variance
Key observations:
- Higher values for min_samples_leaf (13-15) yield better results, with MSE around 12.9
- max_depth has minimal impact, with stable MSE across depths, except for min_samples_leaf=1, where a significant deterioration occurs at depth=4 (MSE spikes to ~34)
- Optimal Configuration: max_depth=4, min_samples_leaf=13, achieving an MSE of 12.9
- Interpretation: The model benefits from higher leaf sample restrictions, preventing overfitting
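A sketch of the kind of grid search that could produce this comparison, assuming this stage tunes a `DecisionTreeRegressor` on the log-transformed target (parameter ranges and data names are assumptions):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeRegressor

param_grid = {
    "max_depth": [2, 3, 4, 5, 6],
    "min_samples_leaf": [1, 3, 5, 9, 13, 15],
}
grid = GridSearchCV(DecisionTreeRegressor(random_state=42), param_grid,
                    scoring="neg_mean_squared_error", cv=5)
grid.fit(X_train, y_train_log)
print(grid.best_params_, -grid.best_score_)   # e.g. {'max_depth': 4, 'min_samples_leaf': 13}
```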
- Best number of trees (n_estimators): 180, achieving an MSE of ~2900.731
- Best max_depth: 10, balancing between underfitting (depth=5) and overfitting (depth=15)
- Best min_samples_leaf: 1, enabling the model to capture detailed patterns
- Final model performance: MSE: 2870.841, RMSE: 53.580, R² Score: 0.085
- While RMSE is reasonable, the low R² score (8.5%) indicates the model has limited explanatory power
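A sketch of how the final random forest could be reproduced with the reported hyperparameters (data split names are assumptions; the raw target is shown here, since the MSE values above are on that scale):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

rf = RandomForestRegressor(n_estimators=180, max_depth=10, min_samples_leaf=1,
                           random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
pred = rf.predict(X_val)
mse = mean_squared_error(y_val, pred)
print(f"MSE: {mse:.3f}  RMSE: {np.sqrt(mse):.3f}  R²: {r2_score(y_val, pred):.3f}")
```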
- The residuals (in log scale) are mostly clustered around zero, indicating accurate model predictions in this scale
- A few residuals deviate from zero, suggesting areas where the model struggles with accuracy
- No clear pattern in the residuals, indicating that the model is well-calibrated in the log scale
- Despite the well-calibrated residuals in the log scale, the low R² suggests that the model’s explanatory power remains limited
- Majority of points cluster around the diagonal (pink dashed line), indicating model predictions generally align with actual values
- Strong linear relationship between predicted and actual values, suggesting the model captures underlying patterns well
- Most data points are concentrated around the 0 value on both axes
- Data spans from approximately -2 to 6 on both axes, with sparse points in higher value ranges (4-6)
- Some outliers visible, particularly around (-2, 0) and (6, 0)
- Slight tendency to underpredict at extreme values
- Sparsity at higher values: Indicates less reliable predictions in these ranges
- Generating more training data for extreme value ranges (4-6) could improve model reliability in these areas
- Dominant Features: Technical indicators like `trix`, `roc`, `ppo`, `cmo`, `cci`, and `bop` dominate, suggesting strong predictive power for 1-hour predictions
- Time-Based Features: `hour`, `day`, `month`, and `year` show very low importance, indicating that price movements are more influenced by technical factors than by time (though note the short time frame of about 3 months)
- Cross-Crypto Correlation: Most cryptocurrency tickers (e.g., `BTCUSDT`, `ETHUSDT`) have minimal impact, showing limited cross-crypto correlation in 1-hour predictions
- Traditional Market Indicators: Indicators like `^SPX` and `^VIX` show low importance, with limited correlation to traditional markets
- Fibonacci Levels: Surprisingly low importance across all timeframes, despite their common use in technical analysis
- Focusing on technical indicators is more valuable than time-based or cross-crypto features, while traditional market indicators and Fibonacci levels have limited predictive power
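A small sketch of how such a ranking could be listed from a fitted tree-based model's built-in importance scores (`model` and `X_train` are assumptions):

```python
import pandas as pd

importances = pd.Series(model.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(15))   # top features driving 1-hour predictions
```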
To simplify the model, I considered focusing on the top 10-15 features and examined their SHAP values
Key observations:
- SHAP Analysis: Provides insight into how individual feature values influence predictions
- Each dot represents a data point, with color indicating feature value (blue = low, red = high)
- For "price_change", high values (red dots) positively push predictions, while low values (blue dots) negatively influence them
- Features like `cci` and `roc` exhibit a mix of positive and negative effects, suggesting non-linear relationships with the target variable
- Features with narrow SHAP value distributions, like `day` or `ln_volume`, have a limited effect on predictions across the dataset
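A sketch of how the SHAP summary could be generated, assuming a fitted tree-based model and a validation feature matrix:

```python
import shap

explainer = shap.TreeExplainer(model)          # works for tree ensembles such as RF or XGBoost
shap_values = explainer.shap_values(X_val)
shap.summary_plot(shap_values, X_val)          # beeswarm plot: red = high feature value, blue = low
```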
Surprisingly, retraining the model on only the most important features resulted in lower scores
- Best Hyperparameters:
- eta: 0.25
- max_depth: 5
- min_child_weight: 1
- rmse: 1.467122
- MSE on the Validation Set: 2.1524
- MAE on the Validation Set: 0.0163
- R² Score on the Validation Set: 0.9247
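A sketch of training XGBoost with the reported hyperparameters (the boosting-round count, early stopping, and data names are assumptions):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train_log)
dval = xgb.DMatrix(X_val, label=y_val_log)

params = {
    "objective": "reg:squarederror",
    "eta": 0.25,
    "max_depth": 5,
    "min_child_weight": 1,
    "eval_metric": "rmse",
}
model = xgb.train(params, dtrain, num_boost_round=200,
                  evals=[(dval, "validation")], early_stopping_rounds=20)
```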
- Clone the repository:
git clone https://github.com/your-repo.git
cd your-repo
- Set up the environment using Conda:
conda env create -f environment.yml
conda activate your-environment-name
- Using pip:
pip install -r requirements.txt
- The TA-Lib library is required for this project but is not installed automatically via the environment.yml file
- You need to install it manually due to potential platform-specific compilation requirements
- Using Conda (Recommended):
conda install -c conda-forge ta-lib
- Using pip: If you prefer pip, ensure you have the required dependencies installed and run:
pip install TA-Lib
- On macOS with Homebrew: First, install the TA-Lib C library:
brew install ta-lib
- Then install the Python wrapper:
pip install TA-Lib
- On Linux: Install the required development library (e.g., for Ubuntu):
sudo apt-get install libta-lib-dev
- Then install the Python wrapper:
pip install TA-Lib
- On Windows: Download and install the precompiled binaries for your system from the TA-Lib website, then install the Python wrapper:
pip install TA-Lib
Make sure TA-Lib is installed before running the application. If you encounter any issues, refer to the TA-Lib documentation for further assistance
- Jupyter Notebook (`notebook.ipynb`):
- Fetch cryptocurrency and stock market data:
- Fetches data for cryptocurrencies and stablecoins defined in the coins list.
- Processes the data and adds derived metrics (e.g., price change).
- Saves the final dataset as `stable_coins.csv`.
- Fetch hourly stock data for predefined tickers:
- Adds derived metrics.
- Formats timestamps.
- Combines all stock data into a single DataFrame and saves it as a CSV.
- Logs missing or delisted stocks/cryptos as warnings or errors.
- Perform feature engineering and derive metrics.
- Evaluate multiple machine learning models.
- Save the best models as `.pkl` files.
- Train the Model:
- Use `train.py` to train the best-performing model (default: XGBoost) on the processed data.
- Save the trained model as a `.pkl` file.
- Deploy the Model with Flask:
- Use `predict.py` to deploy the model and provide predictions via Flask.
- The repository includes a Flask API (`predict.py`) to interact with the trained XGBoost model. The API allows users to predict the change in the `close` price over the next hour.
- Steps to Use:
- Start the Flask Server
- Ensure the conda environment is active and run:
python predict.py --port=<PORT>
- Replace `<PORT>` with the desired port number (e.g., 5001). If no port is specified, the server defaults to port 8000
- Example:
python predict.py --port=5001
- The server runs at http://0.0.0.0:<PORT>, for example: http://0.0.0.0:5001
- Make Predictions
- Send an HTTP POST request with the input features as JSON to the /predict endpoint. Replace `<PORT>` with the port you specified earlier
- Example Input:
curl -X POST http://127.0.0.1:5001/predict \
-H "Content-Type: application/json" \
-d '{"ln_volume": -25.422721545090816, "bop": -0.44444, "ppo": 1.0097517730496455}'
- Example Response:
{
"predicted_growth_rate": -0.002064734697341919
}
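The same request can also be sent from Python with the `requests` library:

```python
import requests

resp = requests.post(
    "http://127.0.0.1:5001/predict",
    json={"ln_volume": -25.422721545090816, "bop": -0.44444, "ppo": 1.0097517730496455},
)
print(resp.json())   # e.g. {"predicted_growth_rate": ...}
```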
- To simplify deployment, a Dockerfile is provided. To build and run the Docker container:
- Build the Docker image:
docker pull continuumio/anaconda3
docker build -t mkr-coin-analysis .
- Run the container:
docker run -p 5001:5001 mkr-coin-analysis
This project is open-source and licensed under the MIT License.