# Pattern Recognition in Sector ETFs: Industry Return Predictability - Machine Learning Approach

Our last __publication__ on pattern recognition using sector ETFs did not yield outstanding results. The dependencies among these ETFs do not manifest themselves easily and did not allow for a simple prediction; at least, we could not find one. Further research has led us to this publication:

"__Industry Return Predictability: A Machine Learning Approach__" by Rapach, Strauss, Tu, and Zhou.

This publication concludes that: "Controlling for post-selection inference and multiple testing, in-sample results provide extensive evidence of industry return predictability, pointing to the existence of industry-related information frictions in the equity market." We are looking for exactly this kind of predictability, one that describes the wheel of the economy as a cyclic pattern.

The paper uses the database on Dr. French's __web page__. We will keep using ETFs that track (or attempt to track) sector returns and try to replicate the research paper's predictive results and reconcile them with our own previous work. We will check whether the information across sector returns persists robustly, independently of the "exact" underlying assets employed. We will use this __notebook__, published by __Alpha Architect__ in 2018, as a guideline; we will simplify it by not replicating the research paper's exact results. We will try to reproduce the strategy in a practical manner, discarding some scientific rigor in the process. We will proceed slowly and publish the notebook at each stage; we will start with the data set-up and by obtaining the __LASSO regression__ values.

The initial modules we will use are:

```
self = QuantBook()
import numpy as np
import pandas as pd
from datetime import datetime  # used for the history date range further down
from IPython.display import display
from sklearn.linear_model import LassoLarsIC, LinearRegression, Ridge
from sklearn.metrics import r2_score
```

We will import the ridge regression module for future comparison purposes. After comparing the ordinary least squares regressions with the LASSO candidates and with all candidates, we can run a cross-validation ridge regression to understand LASSO feature selection's effects better, including cross-validation effects.

These are the ETFs we are going to use, each with a "common" name to align it more easily with the sector risk categories described by Dr. French's industry portfolios; of course, these do not match one to one:

```
sector_ETF = ['XLK', 'XLY', 'XLC', 'XLB', 'XLV', 'XLP',
              'XLI', 'XLU', 'XLF', 'XLE', 'XHB']
names = {'XLK': 'Tech', 'XLY': 'Disc', 'XLC': 'Commns',
         'XLB': 'Mats', 'XLV': 'Health', 'XLP': 'Staples',
         'XLI': 'Ind', 'XLU': 'Util', 'XLF': 'Fin',
         'XLE': 'Energy', 'XHB': 'Home'}
'''
Energy Select Sector SPDR Fund (XLE)
Materials Select Sector SPDR ETF (XLB)
Industrial Select Sector SPDR Fund (XLI)
Consumer Discretionary Select Sector SPDR Fund (XLY)
SPDR S&P Homebuilders ETF (XHB)
Consumer Staples Select Sector SPDR Fund (XLP)
Health Care Select Sector SPDR Fund (XLV)
Financial Select Sector SPDR Fund (XLF)
Technology Select Sector SPDR Fund (XLK)
Communication Services Select Sector SPDR Fund (XLC)
Utilities Select Sector SPDR Fund (XLU)
'''
start = datetime(2000, 1, 1)
end = datetime(2015, 1, 1)
ETF_symbols = {etf: str(self.AddEquity(etf).Symbol.ID) for etf in sector_ETF}
history = self.History(self.Securities.Keys, start, end,
                       Resolution.Daily,
                       dataMode=DataNormalizationMode.Adjusted)
```

The history runs from 2000 to 2015. Most of these ETFs started trading in 1998; with these date limits, we have a sufficient amount of data for our research, and we can reserve 2015 onwards for validation through backtesting. From this history, we will compute the daily returns and resample them at a monthly frequency. We will keep just the first three characters of each symbol identifier for easy recognition of the ETF:

```
returns = history['close'].unstack(level=0).pct_change()
returns.columns = list(pd.Series(returns.columns).map(lambda x: names[x[:3]]))
monthly_returns = returns.resample('M').sum()
features = monthly_returns.columns
display(monthly_returns)
```
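Note that the resampling step sums daily simple returns to obtain a monthly figure, which is an approximation of the compounded monthly return. The following is a minimal sketch in plain Python (no pandas, so it runs standalone) of how the two numbers differ; the daily returns and the function names are hypothetical and only serve as an illustration:

```python
import math

def summed_return(daily_returns):
    """Monthly return approximated by summing daily simple returns."""
    return sum(daily_returns)

def compounded_return(daily_returns):
    """Monthly return obtained by compounding daily simple returns."""
    return math.prod(1.0 + r for r in daily_returns) - 1.0

# Hypothetical daily returns for a short, volatile period (illustrative only).
daily = [0.02, -0.03, 0.015, -0.01, 0.025]

approx = summed_return(daily)
exact = compounded_return(daily)
print(f"summed: {approx:.5f}  compounded: {exact:.5f}  gap: {approx - exact:.5f}")
```

For small daily moves the gap is small, which is why the simple sum is often acceptable for exploratory work; in very volatile months the difference grows.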

The monthly returns look like this:

The returns are very volatile at the "head" of the data set; the dot-com technology bubble was bursting, and volatility in the technology sector was exceptionally high during this period. It is worth checking the returns and volatilities for these instruments monthly and yearly to confirm that the numbers make sense a priori.

For the monthly returns and volatility:

And for the yearly time frame:

These data frame views are obtained with the display calls below. Volatility can also be computed directly at the yearly frequency instead of annualizing the monthly volatilities; the difference is usually minimal:

```
display(pd.DataFrame(returns.resample('M').sum().mean(axis=0), columns = ['Monthly - Mean Returns']))
display(pd.DataFrame(returns.resample('M').sum().std(axis=0), columns = ['Monthly - Volatility']))
display(pd.DataFrame(returns.resample('Y').sum().mean(axis=0), columns = ['Year - Mean Returns']))
display(pd.DataFrame(returns.resample('M').sum().std(axis=0), columns = ['Yearly - Volatility'])*np.sqrt(12))
```
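The yearly volatility above is the monthly volatility scaled by the square root of 12. As a rough sketch of that scaling, assuming monthly returns are independent and identically distributed, the same arithmetic can be written in plain Python; the return series and the helper name are hypothetical:

```python
import math
import statistics

def annualized_volatility(monthly_returns):
    """Scale the sample standard deviation of monthly returns by sqrt(12).

    Assumes independent, identically distributed monthly returns.
    """
    return statistics.stdev(monthly_returns) * math.sqrt(12)

# Hypothetical monthly returns for one year (illustrative numbers only).
monthly = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02,
           0.015, -0.025, 0.005, 0.01, -0.005, 0.02]

vol_monthly = statistics.stdev(monthly)
vol_yearly = annualized_volatility(monthly)
print(f"monthly vol: {vol_monthly:.4f}  annualized vol: {vol_yearly:.4f}")
```

`statistics.stdev` uses the sample (n - 1) denominator, which matches the default of the pandas `std` call used in the notebook.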

The returns we have found will be the features for our machine learning models. The targets we are trying to predict are the returns one month into the future. We will initially limit our predictions to this 1-month-ahead horizon; the model can always be extended to additional future windows at additional computation costs.

```
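targets = []
for col in monthly_returns.columns:
    name = col + '_Future'
    # Next month's return, aligned to the current month's row.
    monthly_returns[name] = monthly_returns[col].shift(-1)
    targets.append(name)
# The last month has no future return (NaN); drop it so the models can be fit.
monthly_returns = monthly_returns.dropna()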
```
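To make the alignment explicit: shifting by -1 places next month's return on the current month's row, so each row pairs what we know now with what we want to predict. A small standalone sketch of the same pairing, with a hypothetical helper and made-up numbers:

```python
def pair_with_next_month(months, returns):
    """Return (month, current_return, next_month_return) rows.

    The last month is left out because its future return is not yet known.
    """
    return [
        (months[i], returns[i], returns[i + 1])
        for i in range(len(returns) - 1)
    ]

# Hypothetical monthly returns for one sector (illustrative numbers only).
months = ["2000-01", "2000-02", "2000-03", "2000-04"]
tech = [0.010, -0.020, 0.030, 0.005]

rows = pair_with_next_month(months, tech)
for month, current, future in rows:
    print(f"{month}: feature={current:+.3f}  target={future:+.3f}")
```

Four months of data produce only three usable rows, which is why the final month has no target value.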

We bring a small innovation to the research publications by using __cross-validation in our LASSO__ model selection step. There are drawbacks to using cross-validation to estimate the LASSO model parameters: we may end up with a model that leaves out useful features. We can compare the LASSO with cross-validation to the one without it, and even combine both in the future if needed. The approach is to use a time series split as the cross-validation scheme for our LASSO and to record the selected model features for each target:

```
from sklearn.linear_model import LassoCV
from sklearn.model_selection import TimeSeriesSplit
import warnings

warnings.filterwarnings("ignore")
tscv = TimeSeriesSplit()
lasso_cv = LassoCV(cv = tscv, max_iter = 10000)
lasso_cv_results = pd.DataFrame(0, index = targets, columns=features)
for target in targets:
    X = monthly_returns[features]
    y = monthly_returns[target]
    lasso_cv.fit(X, y)
    lasso_cv_results.loc[target] = np.round(lasso_cv.coef_, 5)
    score = lasso_cv.score(X, y)
    print("Score for {} - {}".format(target, score))
warnings.filterwarnings('default')
```
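A time series split differs from ordinary k-fold cross-validation in that every validation fold lies strictly after its training window, so the model is never validated on data older than what it was trained on. The sketch below is a simplified, hypothetical re-implementation of that expanding-window idea in plain Python; it is meant to illustrate the scheme, not to reproduce scikit-learn's code, and the sample size is arbitrary:

```python
def expanding_window_splits(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) pairs in chronological order.

    Hypothetical helper: each test fold follows its training window,
    and the training window grows with every split.
    """
    test_size = n_samples // (n_splits + 1)
    first_test_start = n_samples - n_splits * test_size
    for test_start in range(first_test_start, n_samples, test_size):
        train = list(range(0, test_start))
        test = list(range(test_start, test_start + test_size))
        yield train, test

# Twelve months of data, indexed 0..11 (illustrative size only).
splits = list(expanding_window_splits(12, n_splits=5))
for train, test in splits:
    print(f"train months 0..{train[-1]}  ->  test months {test}")
```

With roughly 180 monthly observations and the default of five splits, the earliest folds train on only a few years of data, which may partly explain why the cross-validated LASSO is so conservative.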

The intermediate data frame with the LASSO results is this, showing the relevant sectors that predict the target as a non-zero parameter:

We will "list-comprehend" this data frame into a dictionary for easy reading of the sector returns that can be used to predict the future of another sector according to the LASSO model:

```
cv_candidates = {}
for target in targets:
    comp = [feature for feature in lasso_cv_results.loc[target].index
            if lasso_cv_results.loc[target, feature] != 0]
    cv_candidates[target] = comp
```

The resulting dictionary contains the predictors for each sector ETF:

```
{'Home_Future': ['Home'],
'Mats_Future': [],
'Energy_Future': ['Energy', 'Fin', 'Tech'],
'Fin_Future': [],
'Ind_Future': [],
'Tech_Future': [],
'Staples_Future': [],
'Util_Future': ['Home', 'Energy', 'Fin', 'Ind', 'Tech', 'Health'],
'Health_Future': [],
'Disc_Future': ['Tech']}
```

The dictionary leaves us with a very sparse model with factors of dubious usability. The Home sector is auto-regressive, and so is Energy, with financials and technology predicting it to a lesser degree. The Utilities sector has a large number of candidate predictors, but it is not one of them. In any case, the comparison with the non-cross-validated LASSO model is this:

```
{'Home_Future': [],
'Mats_Future': [],
'Energy_Future': ['Home', 'Energy', 'Fin', 'Ind', 'Tech', 'Staples',
'Health'],
'Fin_Future': ['Home', 'Energy', 'Fin', 'Tech', 'Health'],
'Ind_Future': ['Energy', 'Fin', 'Tech', 'Health'],
'Tech_Future': [],
'Staples_Future': ['Mats', 'Energy', 'Fin', 'Tech', 'Util', 'Health'],
'Util_Future': ['Home', 'Energy', 'Fin', 'Ind', 'Tech', 'Health'],
'Health_Future': [],
'Disc_Future': ['Energy', 'Tech']}
```
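Both dictionaries contain empty lists because the LASSO penalty can set coefficients to exactly zero, and a stronger penalty removes more predictors. In the textbook special case of orthonormal predictors, the LASSO coefficient is the unpenalized coefficient passed through a soft-thresholding function. The sketch below illustrates that mechanism under this simplifying assumption; the coefficient values and helper names are hypothetical and are not taken from the notebook:

```python
def soft_threshold(beta, alpha):
    """Shrink beta toward zero by alpha; return exactly 0.0 if |beta| <= alpha."""
    if beta > alpha:
        return beta - alpha
    if beta < -alpha:
        return beta + alpha
    return 0.0

def selected_features(coefs, alpha):
    """Names of the features whose shrunken coefficient is non-zero."""
    return [name for name, beta in coefs.items()
            if soft_threshold(beta, alpha) != 0.0]

# Hypothetical unpenalized coefficients for three predictors (illustrative only).
ols_coefs = {"Tech": 0.12, "Fin": -0.03, "Energy": 0.05}

for alpha in (0.01, 0.04, 0.10):
    print(f"alpha={alpha}: selected {selected_features(ols_coefs, alpha)}")
```

The two model selection procedures effectively settle on different penalty strengths, so it is not surprising that they keep different sets of predictors.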

We will trust and use the cross-validated model for the time being; we can always come back to the other LASSO model and investigate its predictors in depth. With the predictors in place, it is just a matter of fitting the regression using them as features.

```
ols = LinearRegression()
for target in targets:
    candidate_features = cv_candidates[target]
    if len(candidate_features) < 1:
        print('No features for {}. Skipping.'.format(target))
        print("---")
        continue
    X = monthly_returns[candidate_features]
    y = monthly_returns[target]
    ols.fit(X, y)
    y_pred = ols.predict(X)
    print(target)
    print("In-sample OLS R-squared: %.2f%%" % (100 * r2_score(y, y_pred)))
    print("---")
```
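As a reminder of what the reported score measures, the coefficient of determination compares the model's squared errors with those of a naive forecast that always predicts the mean. A minimal standalone sketch of that formula follows; the numbers are hypothetical and the helper is written by hand for illustration:

```python
def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, the coefficient of determination."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical realized monthly returns (illustrative numbers only).
y = [0.02, -0.01, 0.03, -0.02, 0.01]
mean_forecast = [sum(y) / len(y)] * len(y)
wrong_sign_forecast = [-v for v in y]

print("perfect forecast:   ", r_squared(y, y))
print("mean forecast:      ", r_squared(y, mean_forecast))
print("wrong-sign forecast:", r_squared(y, wrong_sign_forecast))
```

A score near zero means the model is barely better than predicting the average return, and a forecast that is worse than the mean produces a negative score.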

R-squared values for predictive linear regressions are not easy to interpret. It seems that the LASSO-selected variables generally do a poor job of predicting the future and that, in general, the more feature variables, the higher the coefficient of determination:

```
Home_Future
In-sample OLS R-squared: 0.95%
---
No features for Mats_Future. Skipping.
---
Energy_Future
In-sample OLS R-squared: 5.73%
---
No features for Fin_Future. Skipping.
---
No features for Ind_Future. Skipping.
---
No features for Tech_Future. Skipping.
---
No features for Staples_Future. Skipping.
---
Util_Future
In-sample OLS R-squared: 14.14%
---
No features for Health_Future. Skipping.
---
Disc_Future
In-sample OLS R-squared: 3.32%
---
```

For comparison purposes, these are the same values when fitting the linear regression to all available features, not only those selected by the LASSO:

```
Home_Future
In-sample OLS R-squared: 3.78%
---
Mats_Future
In-sample OLS R-squared: 7.63%
---
Energy_Future
In-sample OLS R-squared: 10.22%
---
Fin_Future
In-sample OLS R-squared: 7.56%
---
Ind_Future
In-sample OLS R-squared: 9.46%
---
Tech_Future
In-sample OLS R-squared: 4.83%
---
Staples_Future
In-sample OLS R-squared: 9.79%
---
Util_Future
In-sample OLS R-squared: 14.54%
---
Health_Future
In-sample OLS R-squared: 7.23%
---
Disc_Future
In-sample OLS R-squared: 8.31%
---
```
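In-sample R-squared never decreases when features are added to an ordinary least squares fit, so the comparison above favors the larger model by construction. One common correction is the adjusted R-squared, which penalizes the number of features. The sketch below applies it to the two Util_Future values reported above; the observation count of 179 months and the feature count of 10 for the full model are assumptions made for illustration, not figures from the notebook:

```python
def adjusted_r_squared(r2, n_obs, n_features):
    """Penalize R-squared for the number of fitted features."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_features - 1)

# Assumption: about 15 years of monthly data minus the last month without a target.
N_OBS = 179

# Util_Future: 14.14% with 6 LASSO-selected features, 14.54% with all features.
lasso_subset = adjusted_r_squared(0.1414, N_OBS, n_features=6)
all_features = adjusted_r_squared(0.1454, N_OBS, n_features=10)

print(f"6 LASSO features: adjusted R2 = {lasso_subset:.4f}")
print(f"10 features:      adjusted R2 = {all_features:.4f}")
```

Under these assumptions the smaller model comes out ahead after the adjustment, which suggests that the extra features add little beyond noise for this target.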

The higher in-sample values obtained with the full feature set may, at this point, indicate nothing more than overfitting to a multitude of noisy variables. We will abstain from further interpreting the resulting model and use this LASSO and OLS regression as a backbone for an initial backtest. In future publications, we will fit this model dynamically inside a backtest and check our returns by blindly following the generated target predictions. The research notebook with the initial steps is attached at the very end below.

Information on __ostirion.net__ does not constitute financial advice; we do not hold positions in any of the companies or assets that we mention in our posts at the time of posting. If you require quantitative model development, deployment, verification, or validation, do not hesitate to __contact us__. We will also be glad to help you with your machine learning or artificial intelligence challenges when applied to asset management, trading, or risk evaluations.