Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Predicted next-day end-of-day implied volatility of stocks using random forests
- Examined the usefulness of different sources of predictors and the value of attention and sentiment features from Twitter
- Studied the approach on the 165 most liquid US stocks across 11 traditional market sectors
- Discovered that stocks in certain sectors are easier to predict than others
- Discrepancies may be caused by excess social media attention or low option liquidity
- Explored how the approach fares over time by identifying four underlying market regimes in implied volatility

Paper Content

Introduction
- Social media has caused significant changes in the world, including in financial markets
- The Efficient Market Hypothesis (Fama, 1970) suggests that rapid information diffusion could lead to higher price efficiency
- Behavioral economists argue that social media could influence investors and incite herd behavior
- Data providers offer social media indicators to help financial institutions
- Academic studies have tried to quantify the interplay between social media and financial variables
- Existing research has mainly focused on Twitter and its influence on stock price, volatility, and trading volume
- Current literature overlooks the interaction between social media and market implied volatility
- This study investigates the ability to predict the one-day-ahead movement in implied volatility using machine learning
- The stock universe is diversified across 11 traditional US stock market sectors
- The out-of-sample period spans from January 1st, 2013 to March 1st, 2019
- Hidden Markov models are used to identify four regimes in implied volatility and to measure performance across them

Preliminaries
- Explains market implied volatility and its relation to derivatives
- Describes the random forest machine learning model used for predictions
- Describes the hidden Markov model used for quantifying regimes in market implied volatility

Market implied volatility
- Options are a type of financial instrument
- Sellers of options are exposed to risk
- Measuring this risk requires considering expected price fluctuations of the underlying asset
- The CBOE Volatility Index (VIX) combines the implied volatility of different option contracts into an index
- The VIX is a measure of expected price fluctuations in the S&P 500 Index over the next 30 days
- Equation 1 is used to compute the VIX for a given term
- Option contracts typically have fixed expiration dates
- The VIX is calculated by linearly interpolating between two computed measures

Random forests
- Random forests are a machine learning approach for learning a predictive model
- They consist of multiple decision (or regression) trees whose predictions are combined
- Combination is typically done by taking the mode (or average) of all outputs
- Random forests are fast to build, unaffected by feature scaling, and robust to irrelevant predictors and noisy data
- Constructing an ensemble model by randomly subsampling both data points and features helps reduce overfitting

Hidden Markov models
- Hidden Markov models (HMMs) are a generative approach for modeling systems that follow a Markov process
- An HMM models the joint distribution of a sequence of hidden states and observations
- The parameters of an HMM are the initial state distribution, the state transition model, and the emission probabilities model
- Three key tasks are associated with HMMs: computing the probability of a sequence of observations, finding the best sequence of hidden states, and learning an HMM

Methodology
- The main goal of the study is to explore 3 questions related to stock market performance
- The study uses random forests to predict stock market performance using stock price, implied volatility, and Twitter features
- The study covers 165 stocks over a 6-year period
- Performance is grouped by 11 traditional stock sectors
- Hidden Markov models are used to identify 4 distinct implied volatility regimes per stock

Stock universe selection
- Looked at popular ETFs to obtain a diversified universe of stocks
- Selected the 15 most liquid stocks per sector, for a total of 165 stocks
- Excluded some stocks due to stock splits, late introduction, and ambiguous names
- Replaced excluded stocks to maintain 15 stocks per sector

Data acquisition and feature generation
- Data from January 1st, 2011 to March 1st, 2019 was used from 3 sources: stock prices, option contracts, and Twitter
- 4 features were extracted per stock for each trading day: closing price, 30-day implied volatility, total tweet count, and average sentiment polarity
- Sentiment polarity was calculated using VADER
- 2 additional predictors were generated per feature: the daily difference, and the difference between the daily value and the exponential moving average of the last 10 trading days

Predicting movements in implied volatility
- This study aims to predict one-day-ahead movements in a stock's 30-day implied volatility...
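
The two derived predictors per feature (daily difference, and the gap between the daily value and a 10-day exponential moving average) and a one-day-ahead movement target can be sketched as below. This is a minimal illustration and not the paper's implementation. The function names are hypothetical, and the summary does not specify the smoothing factor alpha = 2 / (span + 1), the choice to include the current day in the moving average, or the binary up/down encoding of the target, so all three are assumptions.

```python
from typing import List, Optional


def daily_difference(values: List[float]) -> List[Optional[float]]:
    """Day-over-day change; the first day has no previous value, so it is None."""
    return [None] + [curr - prev for prev, curr in zip(values, values[1:])]


def exponential_moving_average(values: List[float], span: int = 10) -> List[float]:
    """Recursive EMA seeded with the first observation.

    Assumption: alpha = 2 / (span + 1), a common convention that the
    summary does not confirm.
    """
    alpha = 2.0 / (span + 1)
    ema: List[float] = []
    for value in values:
        if not ema:
            ema.append(value)
        else:
            ema.append(alpha * value + (1.0 - alpha) * ema[-1])
    return ema


def ema_gap(values: List[float], span: int = 10) -> List[float]:
    """Difference between each daily value and its EMA.

    Assumption: the EMA includes the current day's value.
    """
    ema = exponential_moving_average(values, span)
    return [value - avg for value, avg in zip(values, ema)]


def next_day_movement(implied_vol: List[float]) -> List[Optional[int]]:
    """Target label: 1 if implied volatility rises the next day, else 0.

    Assumption: a binary up/down encoding. The last day has no next-day
    value, so its label is None.
    """
    labels: List[Optional[int]] = [
        1 if nxt > curr else 0 for curr, nxt in zip(implied_vol, implied_vol[1:])
    ]
    return labels + [None]


if __name__ == "__main__":
    # Made-up implied volatility values, for illustration only.
    iv = [0.20, 0.22, 0.21, 0.25, 0.24]
    for row in zip(iv, daily_difference(iv), ema_gap(iv), next_day_movement(iv)):
        print(row)
```

The same three functions would be applied to each of the four daily features (closing price, implied volatility, tweet count, and average sentiment polarity). Rows containing None would be dropped before fitting a model.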