The Buy-and-Hold Trap: What Backtests Must Really Measure (Python Included)

When developing new trading strategies, it’s crucial to compare their performance against a benchmark. This practice not only validates your strategy’s effectiveness but also offers valuable insights into risk management and capital efficiency. In this article, we’ll explore the importance of benchmarking, demonstrate how to implement a simple trading strategy using Python, and compare it to a buy-and-hold approach using various financial metrics.

Why Benchmarking Matters in Trading Strategies

Benchmarking serves as a reference point to evaluate the performance of your trading strategy. Without it, you might misinterpret results, attributing success or failure to your strategy rather than market movements.

  • Stock Strategies: When trading stocks, a common benchmark is a broad market index like the S&P 500. If your stock portfolio performs worse than the S&P 500, it might be more efficient to invest in an index fund.
  • Forex Strategies: In forex trading, benchmarks could be the interest rate differentials between two currencies or the performance of a currency index.

By comparing your strategy against an appropriate benchmark, you ensure that your efforts add value beyond what the market naturally provides.

I won’t get into calculating slippage, fees, etc.; this article describes a fairly basic comparison to use when prototyping a strategy, before deciding whether it is worth getting into more detail.

What to expect

In this article, I will run a basic strategy on SPY over the last 4+ years, and besides comparing it to buy-and-hold, I will introduce and explain several other metrics:

  • Risk Exposure
  • Risk-Adjusted Returns
  • Drawdown Comparison
  • Capital Efficiency
  • Volatility

First things first

First, let’s do our imports and download SPY data using the yfinance library.

import yfinance as yf
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import pandas_ta as ta

ticker = 'SPY'
# auto_adjust=False keeps the 'Adj Close' column used throughout this article
data = yf.download(ticker, start='2020-01-01', end='2024-09-30', auto_adjust=False)
data.index = pd.to_datetime(data.index)




The strategy we will compare against SPY itself is quite simple. We use two moving averages, a fast one of 50 days and a slow one of 100 days, and we also calculate the 14-day ADX. We go long when the fast MA is above the slow one (trend following) and short in the opposite case; however, we stay neutral when the ADX is below 20, which indicates a weak or non-existent trend. Let’s calculate the indicators and create the signal:

fast_MA_period = 50
slow_MA_period = 100
adx_period = 14
adx_threshold = 20

# Calculate the technicals and drop the NA rows
data['MA_fast'] = data['Adj Close'].rolling(window=fast_MA_period).mean()
data['MA_slow'] = data['Adj Close'].rolling(window=slow_MA_period).mean()
data['ADX'] = data.ta.adx(length=adx_period)[f'ADX_{adx_period}']
data.dropna(inplace=True)

# Determine the trend based on ADX and Moving Averages
def identify_trend(row):
    if row['ADX'] > adx_threshold and row['MA_fast'] > row['MA_slow']:
        return 1
    elif row['ADX'] > adx_threshold and row['MA_fast'] < row['MA_slow']:
        return -1
    else:
        return 0

data['Signal'] = data.apply(identify_trend, axis=1)
# Count the values per position
data['Signal'].value_counts()

Output:

Signal
 1    456
 0    400
-1    138
Name: count, dtype: int64




We can see from the output that we will be long for 456 days of this period, short for 138, and hold no position for 400. Let’s plot it: green marks the days we are long, and red the days we are short.

df = data.copy()
# Plotting with adjusted subplot heights
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(14, 10), sharex=True, 
                               gridspec_kw={'height_ratios': [3, 1]})

# Plotting the close price with the color corresponding to the trend
for i in range(1, len(df)):
    ax1.plot(df.index[i-1:i+1], df['Close'].iloc[i-1:i+1], 
             color='green' if df['Signal'].iloc[i] == 1 else 
                   ('red' if df['Signal'].iloc[i] == -1 else 'darkgrey'), linewidth=2)

# Plot the Moving Averages
ax1.plot(df['MA_fast'], label='Fast MA', color='blue')
ax1.plot(df['MA_slow'], label='Slow MA', color='orange')
ax1.set_title(f'{ticker} - Price, ADX and Moving Averages')
ax1.legend(loc='best')

# Plot ADX on the second subplot (smaller height)
ax2.plot(df.index, df['ADX'], label='ADX', color='purple')
ax2.axhline(20, color='black', linestyle='--', linewidth=1)  # Horizontal line at the ADX threshold (20)
ax2.set_title(f'{ticker} - ADX')
ax2.legend(loc='best')

plt.show()




Now we should calculate the equity curve of the strategy and the benchmark (SPY):

def calculate_returns(df_for_returns, col_for_returns='Adj Close', col_for_signal='Signal'):

    # Calculate daily returns and apply yesterday's signal to today's return
    df_for_returns['Daily_Returns'] = df_for_returns[col_for_returns].pct_change()
    df_for_returns['Returns'] = df_for_returns['Daily_Returns'] * df_for_returns[col_for_signal].shift(1)
    df_for_returns['Returns'] = df_for_returns['Returns'].fillna(0)
    df_for_returns['Equity_Curve'] = 100 * (1 + df_for_returns['Returns']).cumprod()

    return df_for_returns

data = calculate_returns(data, col_for_returns = 'Adj Close', col_for_signal = 'Signal')

def calculate_benchmark_returns(df_for_returns, col_for_returns='Adj Close'):

    # Calculate daily buy-and-hold returns
    df_for_returns['Benchmark_Returns'] = df_for_returns[col_for_returns].pct_change()
    df_for_returns['Benchmark_Returns'] = df_for_returns['Benchmark_Returns'].fillna(0)
    df_for_returns['Benchmark_Equity_Curve'] = 100 * (1 + df_for_returns['Benchmark_Returns']).cumprod()

    return df_for_returns

data = calculate_benchmark_returns(data, col_for_returns='Adj Close')

And finally, let's plot both equity curves:

# Set up the figure and axes
plt.figure(figsize=(14, 8))

# Plot both equity curves
plt.plot(data.index, data['Equity_Curve'], label='Equity Curve', color='blue')
plt.plot(data.index, data['Benchmark_Equity_Curve'], label='Benchmark Equity Curve', color='green')

# Add labels and legend
plt.xlabel('Date')
plt.ylabel('Equity Value')
plt.title('Comparison of Two Equity Curves')
plt.legend()

# Show the plot
plt.grid()
plt.show()




With the naked eye, we can see:

  • The benchmark performs better overall, but the strategy is quite close
  • During 2021, the benchmark significantly outperforms our strategy, mostly because of extended periods of being neutral
  • During 2022, when the benchmark has significant losses, the strategy does much better
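To put rough numbers on the naked-eye comparison, a small sketch like the following computes the total return and CAGR of an equity curve. The two series below are synthetic stand-ins for illustration; in the article's notebook you would pass `data['Equity_Curve']` and `data['Benchmark_Equity_Curve']` instead:

```python
import numpy as np
import pandas as pd

def summarize(curve, trading_days=252):
    """Total return and annualized (CAGR) return of an equity curve."""
    total = curve.iloc[-1] / curve.iloc[0] - 1
    years = len(curve) / trading_days
    cagr = (1 + total) ** (1 / years) - 1
    return total, cagr

# Synthetic stand-in equity curves starting at 100 (assumption for illustration)
idx = pd.date_range('2020-01-01', periods=252 * 4, freq='B')
rng = np.random.default_rng(0)
strategy = pd.Series(100 * (1 + rng.normal(0.0004, 0.008, len(idx))).cumprod(), index=idx)
benchmark = pd.Series(100 * (1 + rng.normal(0.0005, 0.010, len(idx))).cumprod(), index=idx)

for name, curve in [('Strategy', strategy), ('Benchmark', benchmark)]:
    total, cagr = summarize(curve)
    print(f"{name}: total return {total * 100:.1f}%, CAGR {cagr * 100:.1f}%")
```

Annualizing via `(1 + total) ** (1 / years)` makes curves of different lengths directly comparable.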

Comparing the strategy to the benchmark

Now let’s get to the actual job and start checking the metrics.

Risk Exposure

First, let’s see how much time we are exposed to the market during this period.

time_in_market = data['Signal'][data['Signal'] != 0].count() / len(data)
print(f"Strategy Time Exposure: {time_in_market * 100:.2f}%")

Output:
Strategy Time Exposure: 59.76%




It looks like the strategy had a lower time exposure to the market (around 60%). This means that while it underperformed in absolute terms, it involved less risk because it was not always in the market. The buy-and-hold approach is exposed to market risk 100% of the time.

This brings us to our second metric:

Capital Efficiency

If the strategy is only active for a portion of the time, the capital could theoretically be deployed elsewhere when not in use. In Europe at least, there are brokers that pay interest on your cash, and you can always invest in treasury bills or other “risk-free” instruments as well.

So, to make the strategy’s equity curve fairer, on the days with a neutral position I will add interest income. The rate will be based on the approximate annual rate of the Vanguard Federal Money Market Fund (USD), as follows:

  • 2020: ~0.25%
  • 2021: ~0.03%
  • 2022: ~2.10%
  • 2023: ~4.70%
  • 2024: ~5.30%

# Set up the annual rates dictionary
annual_rates = {
    2020: 0.0025,
    2021: 0.0003,
    2022: 0.0210,
    2023: 0.0470,
    2024: 0.0530
}

# Create a new column that holds the difference in days between the current row and the previous one
data['Days_Difference'] = data.index.to_series().diff().dt.days
data['Days_Difference'] = data['Days_Difference'].fillna(1)

# Extract the year from the index and create a new column for the year
data['Year'] = data.index.year
# Calculate daily rates
daily_rates= {year: (1 + rate) ** (1 / 365) - 1 for year, rate in annual_rates.items()}
# Map the daily rates to the corresponding year and create a new column for the daily rate
data['Daily_Rate_Returns'] = data['Year'].map(daily_rates) * data['Days_Difference']

# Determine the days with an open position (yesterday's signal applies to today)
data['Open_Position'] = data['Signal'].shift(1)
data['Open_Position'] = data['Open_Position'].fillna(0)

# Use the position's return when in the market, the interest return otherwise
data['Returns_Combined'] = data.apply(lambda row: row['Open_Position'] * row['Daily_Returns'] if row['Open_Position'] != 0 else row['Daily_Rate_Returns'], axis=1)
data['Returns_Combined'] = data['Returns_Combined'].fillna(0)
data['Equity_Curve_including_interest'] = 100 * (1 + data['Returns_Combined']).cumprod()

And let’s plot it:

# Set up the figure and axes
plt.figure(figsize=(14, 8))

# Plot both equity curves
plt.plot(data.index, data['Equity_Curve'], label='Equity Curve', color='blue')
plt.plot(data.index, data['Benchmark_Equity_Curve'], label='Benchmark Equity Curve', color='green')
plt.plot(data.index, data['Equity_Curve_including_interest'], label='Equity Curve Including Interest', color='red')

# Add labels and legend
plt.xlabel('Date')
plt.ylabel('Equity Value')
plt.title('Comparison of Three Equity Curves')
plt.legend()

# Show the plot
plt.grid()
plt.show()




You will see that the equity curve including interest (red) inches ahead on the days when the blue curve is flat (neutral position). By the end, its return is higher, since the strategy is neutral on roughly 40% of the days.

This is a somewhat controversial approach, since in practice traders try to use various strategies rather than keep cash unutilised. If you like this approach, go ahead and use it. For the rest of the article, I will present the outcome of both variants (with and without interest) compared to the benchmark.

Drawdown Comparison

Maximum drawdown is one of the most common ways to quantify risk. Let’s calculate it:

def max_drawdown(cumulative_returns):
    # Drawdown is the drop from the running peak; return the deepest one
    roll_max = cumulative_returns.cummax()
    drawdown = cumulative_returns / roll_max - 1.0
    return drawdown.min()

max_dd_strategy = max_drawdown(data['Equity_Curve'])
max_dd_market = max_drawdown(data['Benchmark_Equity_Curve'])
max_dd_market_combined = max_drawdown(data['Equity_Curve_including_interest'])

print(f"Strategy Maximum Drawdown: {max_dd_strategy * 100:.2f}%")
print(f"Benchmark Maximum Drawdown: {max_dd_market * 100:.2f}%")
print(f"Strategy Combined Maximum Drawdown: {max_dd_market_combined * 100:.2f}%")

Output:
Strategy Maximum Drawdown: -16.22%
Benchmark Maximum Drawdown: -24.50%
Strategy Combined Maximum Drawdown: -16.02%




The outcome looks quite positive for the strategy. While the benchmark (SPY) suffered a 24.5% drawdown during this period, the strategy’s was significantly lower at 16.22% (and slightly lower still, 16.02%, with the interest option).

Risk-Adjusted Returns

Now, I will introduce risk-adjusted metrics such as the Sharpe ratio and the Sortino ratio, which adjust returns based on risk. Let’s calculate and explain them.

The Sharpe Ratio helps determine whether the returns of an investment are due to smart decisions or excessive risk. A higher Sharpe Ratio means better risk-adjusted returns, indicating that a strategy or investment has achieved more return relative to the risk it has taken on. Essentially, it tells you if you’re getting adequately rewarded for the risk you’re taking.

Generally, these guidelines can help interpret the Sharpe Ratio:

  • Below 1.0: Suboptimal, indicating that the return does not adequately compensate for the risk.
  • 1.0 to 1.99: Acceptable or good, showing a reasonable risk-return balance.
  • 2.0 to 2.99: Very good, suggesting a strong return relative to risk.
  • Above 3.0: Excellent, indicating high returns with relatively low risk.
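Strictly speaking, the Sharpe Ratio is computed on excess returns over a risk-free rate; the article's calculation below assumes a 0% risk-free rate for simplicity. For reference, here is a minimal sketch with the adjustment included; the 2% annual rate is an arbitrary placeholder, not a value from the article:

```python
import numpy as np

def sharpe_ratio(daily_returns, risk_free_annual=0.02, trading_days=252):
    """Annualized Sharpe Ratio on excess returns.

    The 2% default risk-free rate is an arbitrary assumption; pass 0.0
    to match the simplified calculation used in this article.
    """
    r = np.asarray(daily_returns)
    # Convert the annual risk-free rate to a daily rate
    daily_rf = (1 + risk_free_annual) ** (1 / trading_days) - 1
    excess = r - daily_rf
    return excess.mean() * trading_days / (excess.std(ddof=1) * np.sqrt(trading_days))

# Example with synthetic daily returns
rng = np.random.default_rng(42)
returns = rng.normal(0.0005, 0.01, 1000)
print(f"Sharpe (rf=2%): {sharpe_ratio(returns):.2f}")
print(f"Sharpe (rf=0%): {sharpe_ratio(returns, risk_free_annual=0.0):.2f}")
```

With short rates near 5% in 2023-2024, the adjustment can noticeably lower the reported ratio.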

# Sharpe Ratio
annual_return_strategy = data['Returns'].mean() * 252
annual_volatility_strategy = data['Returns'].std() * np.sqrt(252)
sharpe_ratio_strategy = annual_return_strategy / annual_volatility_strategy

annual_return_market = data['Benchmark_Returns'].mean() * 252
annual_volatility_market = data['Benchmark_Returns'].std() * np.sqrt(252)
sharpe_ratio_market = annual_return_market / annual_volatility_market

annual_return_market_combined = data['Returns_Combined'].mean() * 252
annual_volatility_market_combined = data['Returns_Combined'].std() * np.sqrt(252)
sharpe_ratio_market_combined = annual_return_market_combined / annual_volatility_market_combined

print(f"Strategy Sharpe Ratio: {sharpe_ratio_strategy:.2f}")
print(f"Benchmark Sharpe Ratio: {sharpe_ratio_market:.2f}")
print(f"Strategy Combined Sharpe Ratio: {sharpe_ratio_market_combined:.2f}")

Output:
Strategy Sharpe Ratio: 0.99
Benchmark Sharpe Ratio: 0.93
Strategy Combined Sharpe Ratio: 1.06




The output looks interesting. Our strategy delivers better risk-adjusted returns. Interestingly, SPY itself has a Sharpe Ratio below the threshold of 1, meaning you are not adequately compensated for the risk.

The Sortino Ratio is similar to the Sharpe Ratio but focuses only on downside risk, meaning it measures the return earned relative to harmful volatility (negative returns). It shows how well a strategy compensates for negative fluctuations, making it a better measure when you’re concerned about downside risk rather than overall volatility. A higher Sortino Ratio indicates better risk-adjusted returns, with a focus on avoiding significant losses.

# Downside Volatility
downside_volatility_strategy = data[data['Returns'] < 0]['Returns'].std() * np.sqrt(252)
sortino_ratio_strategy = annual_return_strategy / downside_volatility_strategy

downside_volatility_market = data[data['Benchmark_Returns'] < 0]['Benchmark_Returns'].std() * np.sqrt(252)
sortino_ratio_market = annual_return_market / downside_volatility_market


downside_volatility_market_combined = data[data['Returns_Combined'] < 0]['Returns_Combined'].std() * np.sqrt(252)
sortino_ratio_market_combined = annual_return_market_combined / downside_volatility_market_combined

print(f"Strategy Sortino Ratio: {sortino_ratio_strategy:.2f}")
print(f"Benchmark Sortino Ratio: {sortino_ratio_market:.2f}")
print(f"Strategy Combined Sortino Ratio: {sortino_ratio_market_combined:.2f}")

Output:
Strategy Sortino Ratio: 1.11
Benchmark Sortino Ratio: 1.31
Strategy Combined Sortino Ratio: 1.20




As you can see, in this case the benchmark beats the strategy. What is interesting is that even though the benchmark seems to handle negative returns better on average, it still had the worse maximum drawdown, as we saw earlier. This usually means the maximum drawdown came from a single event that caused a sharp downward move, rather than from a consistent issue.
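The intuition that downside volatility and maximum drawdown can disagree is easy to demonstrate: downside volatility ignores the ordering of losses, while drawdown does not. A hypothetical sketch with two series containing identical daily returns in different orders:

```python
import numpy as np
import pandas as pd

def max_drawdown(returns):
    """Max drawdown from a daily return series (same logic as the function above)."""
    equity = (1 + pd.Series(returns)).cumprod()
    return (equity / equity.cummax() - 1).min()

# Two hypothetical return series with identical daily values;
# only the ordering differs: scattered losses vs. one clustered losing streak.
gains = [0.01] * 10
losses = [-0.015, -0.025] * 5
scattered = [v for pair in zip(gains, losses) for v in pair]  # gain, loss, gain, loss, ...
clustered = gains + losses                                    # all gains, then all losses

# Downside volatility is identical because it ignores ordering...
same = np.std([r for r in scattered if r < 0]) == np.std([r for r in clustered if r < 0])
print(f"Same downside volatility: {same}")

# ...but the clustered losses compound into a much deeper drawdown
print(f"Scattered max drawdown: {max_drawdown(scattered) * 100:.2f}%")
print(f"Clustered max drawdown: {max_drawdown(clustered) * 100:.2f}%")
```

This is why it pays to look at both metrics: Sortino describes the average character of the losses, drawdown describes their worst concentration.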

Volatility

Volatility is another measure that risk-averse investors look at. It is simply the standard deviation of the price changes (daily returns). Let’s calculate it:

# Function to calculate volatility (standard deviation of returns)
def volatility(cumulative_returns):
    returns = cumulative_returns.pct_change().dropna()
    return returns.std()

# Assuming `data` is your DataFrame containing multiple equity curves
vol_strategy = volatility(data['Equity_Curve'])
vol_market = volatility(data['Benchmark_Equity_Curve'])
vol_market_combined = volatility(data['Equity_Curve_including_interest'])

print(f"Strategy Volatility: {vol_strategy:.4f}")
print(f"Benchmark Volatility: {vol_market:.4f}")
print(f"Strategy Combined Volatility: {vol_market_combined:.4f}")

Output:
Strategy Volatility: 0.0084
Benchmark Volatility: 0.0105
Strategy Combined Volatility: 0.0084




The benchmark’s volatility is higher than the strategy’s, 0.0105 compared to 0.0084. This also implies that with the strategy above, you accept lower returns in exchange for lower risk.

You can find the Python notebook of the above code at my github repository here.

Conclusions

Risk is not necessarily bad. If a strategy has a smaller payoff but is less risky, that doesn’t automatically make it better. Risk pays massively (and loses massively). The worst kind of risk is unnecessary risk. The metrics above can be used as follows:

  • Compare strategies (even buy-and-hold is a strategy). If the returns are the same, use the less risky one.
  • If you have a risky strategy, make sure that the returns are enough to compensate you for your stress 😉
  • There is no one-size-fits-all metric that will tell you one strategy is better than another. Each metric has its own reason for existence.

Next Steps

I hope this article has sparked some ideas on your end. A few ways to extend this analysis:

  • Introduce these metrics to your backtesting and optimization results
  • Weigh each metric to arrive at a single number. Run various backtests, with various weights, and check how each optimized strategy would finally return in future data.
  • You might want to have a look at a previous article of mine with more risk metrics
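The second bullet, collapsing several metrics into a single weighted score, can be sketched like this. The metric values are taken from the outputs above, but the weights are arbitrary assumptions; in practice you would also normalize the metrics to comparable scales first:

```python
# Metric values taken from the article's outputs above
metrics = {
    'sharpe':       {'strategy': 0.99,    'benchmark': 0.93},
    'sortino':      {'strategy': 1.11,    'benchmark': 1.31},
    'max_drawdown': {'strategy': -0.1622, 'benchmark': -0.2450},
}

# Arbitrary illustrative weights -- an assumption, not a recommendation
weights = {'sharpe': 0.4, 'sortino': 0.3, 'max_drawdown': 0.3}

def score(name):
    """Weighted sum; drawdown is negative, so a shallower drawdown scores higher."""
    return sum(weights[m] * metrics[m][name] for m in metrics)

for name in ('strategy', 'benchmark'):
    print(f"{name}: {score(name):.3f}")
```

Running this over many backtests with different weight sets, and then checking each winner on out-of-sample data, is one way to test whether your chosen weighting actually predicts future performance.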

12 Risk metrics for investments with Python. From standard deviation to R-squared


I hope you enjoyed the article! If you found this useful, please clap and share your thoughts below!