Welcome to Magic: The Gathering, a trading card game produced by Wizards of the Coast in which each player takes the role of a wizard casting spells and summoning creatures to reduce their opponent's life total from 20 to 0. The game has been around since 1993 and has changed dramatically since then, growing in both power level and complexity over its 30 years of existence.
Throughout this analysis I will use several Magic: The Gathering-specific terms. While it is not necessary to have played Magic to understand this analysis, the following definitions will certainly be helpful.
Name: Each card has a unique name. The name of the card below is Colossal Dreadmaw.
Our goal today is to take a look at the mana costs of the creatures of Magic, and see if we can build a model to predict these mana costs based on the other statistics of the card. We will then see if creatures have become more aggressively costed over time (meaning that the same stats cost less mana to cast), a phenomenon known as "power creep" in the Magic community.
Here we import all necessary modules for the project.
import pandas as pd
import json
import requests
import matplotlib.pyplot as plt
import re
import numpy as np
import datetime as dt
from sklearn.linear_model import LinearRegression
import statsmodels.api as sm
import statistics
The data we will use is from Scryfall, a community-run Magic: The Gathering site. Scryfall's data is often more reliable and accurate than the data that Wizards produces, and it is freely available for download through their API. This data contains every single card in Magic's history that was printed in English - 78,242 card objects. However, many of these objects are extraneous and would hurt our analysis. In the next section, I will prune down these cards to exclude reprinted cards, illegal cards, joke cards, and many other types of cards that have been created over the years.
URL = "https://data.scryfall.io/default-cards/default-cards-20221212220657.json"
full_scryfall_df = pd.DataFrame(json.loads(requests.get(URL).text))
full_scryfall_df.head()
 | object | id | oracle_id | multiverse_ids | mtgo_id | mtgo_foil_id | tcgplayer_id | cardmarket_id | name | lang | ... | tcgplayer_etched_id | attraction_lights | color_indicator | life_modifier | hand_modifier | printed_type_line | printed_text | content_warning | flavor_name | variation_of |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | card | 0000579f-7b35-4ed3-b44c-db2a538066fe | 44623693-51d6-49ad-8cd7-140505caf02f | [109722] | 25527.0 | 25528.0 | 14240.0 | 13850.0 | Fury Sliver | en | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | card | 00006596-1166-4a79-8443-ca9f82e6db4e | 8ae3562f-28b7-4462-96ed-be0cf7052ccc | [189637] | 34586.0 | 34587.0 | 33347.0 | 21851.0 | Kor Outfitter | en | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | card | 0000a54c-a511-4925-92dc-01b937f9afad | dc4e2134-f0c2-49aa-9ea3-ebf83af1445c | [] | NaN | NaN | 98659.0 | NaN | Spirit | en | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | card | 0000cd57-91fe-411f-b798-646e965eec37 | 9f0d82ae-38bf-45d8-8cda-982b6ead1d72 | [435231] | 65170.0 | 65171.0 | 145764.0 | 301766.0 | Siren Lookout | en | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | card | 00012bd8-ed68-4978-a22d-f450c8a6e048 | 5aa12aff-db3c-4be5-822b-3afdf536b33e | [1278] | NaN | NaN | 1623.0 | 5664.0 | Web | en | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 84 columns
This dataframe also comes with 84 columns, many of which we do not need. I will remove all columns except the ones that are useful for determining a card's qualities, and for deciding whether we want to analyze the card at all.
df = full_scryfall_df[['name', # the name of the card - not technically necessary but helpful for debugging
'mana_cost', # what type of mana the card costs to summon
'cmc', # how much mana the card costs
'type_line', # the type of the card (creature, sorcery, etc)
'oracle_text', # what the card does
'power', 'toughness', # the strength of the card if it's a creature
'colors', 'color_identity', # more info on what type of mana the card costs
'keywords', # the keywords on the card (more on this later)
'set', 'released_at', # when the card was released
'rarity', # how much the card was printed
'games', # games tells if it is legal online or in paper (we exclude online-only cards)
'legalities']] # which formats the card is legal in
First, we will remove online-only cards. Wizards of the Coast released a digital client called Magic: The Gathering Arena, and to promote it they released cards that are only legal in that client. However, these cards were not created with the balance of paper Magic in mind, and they reference random effects and other things that are only possible in a digital game. Therefore, I am excluding them from this analysis.
df = df[df['games'].apply(lambda i: 'paper' in i)]
Next, we will remove all cards that appear in the data multiple times (e.g. because they were printed in multiple sets). Wizards reprints cards to bring back fan favorites or to keep some staple cards always available.
df = df.sort_values(by=['released_at', 'name'])
df = df.drop_duplicates(subset=['name'])
Some cards are illegal to play for power-level reasons (too strong for the format); however, we can still analyze these. The "not legal" designation instead marks cards that are literally unplayable: they are printed alongside Magic cards but just carry promotional text or act as other game pieces. Tokens are one such piece; some cards create tokens, but you can't put the actual token cards in your deck. Scryfall nevertheless treats all of these as "card objects" and includes them in the data.
def legal(legalities):
v = legalities.values()
if len(set(v)) == 1 and "not_legal" in v:
return False
return True
df = df[df['legalities'].apply(legal)]
df = df[~df["type_line"].str.contains("Token", na=False)] # remove tokens
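For intuition, here is what the legal check does on a couple of made-up legalities dictionaries (hypothetical values, not rows from the data):

# A card that is not_legal in every format is a non-playable object and gets filtered out...
print(legal({'standard': 'not_legal', 'modern': 'not_legal'}))  # -> False
# ...but a card that is legal (or merely banned) somewhere is kept for analysis.
print(legal({'standard': 'not_legal', 'vintage': 'legal'}))     # -> True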
Finally, some cards were designed as jokes by the Wizards designers in sets called "unsets". These cards, like the online cards, aren't tuned for interacting with any other cards, and so I will exclude them from this dataset.
unsets = ['unglued', 'unhinged', 'unstable', 'unsanctioned', 'unfinity']
sets = json.loads(requests.get("https://api.scryfall.com/sets").text)
for s in sets["data"]:
if s['name'].lower() in unsets:
        df = df[df["set"] != s['code']]  # match the set code exactly so we only drop that unset
Some of these columns aren't quite in the format we'd like them to be in, so I'm going to change them to be friendlier to numerical algorithms, and add some useful columns.
The actual release date isn't particularly mathematically helpful, so I will change it to the number of years since the first set (known as Alpha) was released, on August 5th 1993.
df = df.dropna() # drop NAN values
alpha_release_date = dt.datetime(1993, 8, 5)
df['released_at'] = df['released_at'].apply(lambda i: (dt.datetime.strptime(i, '%Y-%m-%d').year - alpha_release_date.year))
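As a quick sanity check of the conversion (using a made-up date string, not a row from the data):

# A card released in 2003 becomes 10 "years since Alpha" (2003 - 1993).
print(dt.datetime.strptime('2003-10-02', '%Y-%m-%d').year - alpha_release_date.year)  # -> 10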
One helpful factor in our analysis is how deep into a color a card is. Wizards often requires more mana of a specific color as a balancing lever: a cost that is harder to pay lets them print a more powerful creature at the same total mana cost. So, counting the amount of colored mana required to cast the spell will be a useful factor in the analysis. A related factor is the number of different colors: a spell that requires multiple types of mana is considered harder to cast, and so may get a lower mana cost.
df["num_colored_pips"] = df["mana_cost"].apply(lambda mana_cost: len(re.findall("\{[^\d]\}", str(mana_cost))))
df["num_colors"] = df["colors"].apply(len)
Finally, for this analysis we will just be looking at creatures. Let's also make the CMC, power, and toughness values integers so they are easier to work with. If they aren't integers underneath (some have X, *, or ?, or nothing), we will ignore those cards, as they are difficult to handle without also parsing the ability text.
df = df[df["type_line"].str.contains("Creature", na=False)]
def make_int(i):
try:
return int(i)
except ValueError:
return np.nan
for i in ['cmc', 'power', 'toughness']:
    df[i] = df[i].apply(make_int).astype('Int64')
df = df.dropna()
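To see what make_int is doing with the odd values (a small sketch, not part of the pipeline):

# Plain integers convert normally; anything else (X, *, ?, empty) becomes NaN
# and is then removed by the dropna() above.
print(make_int('3'), make_int('*'))  # -> 3 nan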
Let's take a look!
We can look at the number of creatures in each color (extremely even).
order = ['White', 'Blue', 'Black', 'Red', 'Green']
codes = ['W', 'U', 'B', 'R', 'G']  # Magic's color codes: note that blue is 'U' and black is 'B'
plt.pie([len(df[df['colors'].apply(lambda x: c in x)].index) for c in codes],
        labels=order, colors=order,
        wedgeprops={"edgecolor": "k", 'linewidth': 2})  # draw a border so the white wedge is visible
plt.axis('equal')
plt.title("Creatures by Color")
plt.show()
Or we can look at the average length of each card's rules text over time (a rough estimate of complexity). Another problem often pointed out by the community is "complexity creep": Magic cards getting more complex over time. As we can see, the claim has some merit: by this very rough measurement, cards have certainly been getting more and more complex.
r = range(max(df['released_at']))
plt.plot(r, [statistics.mean(df[df['released_at'] == i]['oracle_text'].apply(len)) for i in r], color='purple')
plt.xlabel("Years since Alpha")
plt.ylabel("Average number of Words")
plt.title("Complexity Creep over Time")
plt.show()
df.head()
 | name | mana_cost | cmc | type_line | oracle_text | power | toughness | colors | color_identity | keywords | set | released_at | rarity | games | legalities | num_colored_pips | num_colors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32371 | Air Elemental | {3}{U}{U} | 5 | Creature — Elemental | Flying | 4 | 4 | [U] | [U] | [Flying] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 |
5420 | Benalish Hero | {W} | 1 | Creature — Human Soldier | Banding (Any creatures with banding, and up to... | 1 | 1 | [W] | [W] | [Banding] | lea | 0 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 |
26263 | Birds of Paradise | {G} | 1 | Creature — Bird | Flying\n{T}: Add one mana of any color. | 0 | 1 | [G] | [G] | [Flying] | lea | 0 | rare | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 |
59195 | Black Knight | {B}{B} | 2 | Creature — Human Knight | First strike (This creature deals combat damag... | 2 | 2 | [B] | [B] | [First strike, Protection] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 |
31516 | Bog Wraith | {3}{B} | 4 | Creature — Wraith | Swampwalk (This creature can't be blocked as l... | 3 | 3 | [B] | [B] | [Landwalk, Swampwalk] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 |
As you can see, we've now got the important parts of each card, and the cards are conveniently sorted by their release date. We are now looking at the first Magic cards ever released. How strong were they? Let's find out!
A vanilla creature is a creature with no text whatsoever - just stats! The "classic" vanilla creature is the Grizzly Bears, a 2/2 for 2 mana in green.
df[df["name"].str.contains("Grizzly Bears")]
 | name | mana_cost | cmc | type_line | oracle_text | power | toughness | colors | color_identity | keywords | set | released_at | rarity | games | legalities | num_colored_pips | num_colors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
63158 | Grizzly Bears | {1}{G} | 2 | Creature — Bear | 2 | 2 | [G] | [G] | [] | lea | 0 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 |
However, not all creatures are created equally. For example, the Coral Eel has the same mana cost, but only 1 toughness. The difference? It's in blue!
df[df["name"].str.contains("Coral Eel")]
 | name | mana_cost | cmc | type_line | oracle_text | power | toughness | colors | color_identity | keywords | set | released_at | rarity | games | legalities | num_colored_pips | num_colors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
16522 | Coral Eel | {1}{U} | 2 | Creature — Fish | 2 | 1 | [U] | [U] | [] | por | 4 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 |
Clearly, some colors are better at producing creatures than others. But how much better? Let's start with vanilla creatures, since we know no abilities are influencing their mana costs. That lets us see directly how much each point of power and toughness costs, mana-wise.
vanilla_df = df[df["oracle_text"] == ""]
vanilla_df.head()
 | name | mana_cost | cmc | type_line | oracle_text | power | toughness | colors | color_identity | keywords | set | released_at | rarity | games | legalities | num_colored_pips | num_colors |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58740 | Craw Wurm | {4}{G}{G} | 6 | Creature — Wurm | 6 | 4 | [G] | [G] | [] | lea | 0 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 | |
54563 | Earth Elemental | {3}{R}{R} | 5 | Creature — Elemental | 4 | 5 | [R] | [R] | [] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 | |
66828 | Fire Elemental | {3}{R}{R} | 5 | Creature — Elemental | 5 | 4 | [R] | [R] | [] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 | |
35350 | Gray Ogre | {2}{R} | 3 | Creature — Ogre | 2 | 2 | [R] | [R] | [] | lea | 0 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 | |
63158 | Grizzly Bears | {1}{G} | 2 | Creature — Bear | 2 | 2 | [G] | [G] | [] | lea | 0 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 |
Only 2.57% of creatures ever printed have been vanilla, so this analysis won't tell us much about most creatures. However, it gives us a base from which to perform similar analyses on more complicated creatures.
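For reference, that figure is just the ratio of the two dataframes we built above (a one-liner sketch):

# Share of creatures that are vanilla, as a percentage of all creatures in df.
print(f"{len(vanilla_df) / len(df) * 100:.2f}% of creatures are vanilla")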
plt.pie([len(vanilla_df.index), len(df.index) - len(vanilla_df.index)],
explode=(0.1, 0),
labels=['Vanilla Creatures', 'Other Creatures'],
colors=['red', 'purple'],
startangle=90)
plt.axis('equal')
plt.show()
Vanilla creatures are also a dying breed: a complaint similar to complexity creep is that there are fewer and fewer vanilla creatures with each new set. Vanilla creatures are often touted as an easy way for beginners to get into the game, so this is, in a sense, a graph of Magic's beginner-friendliness over time. Perhaps Magic went through an age, roughly 6-12 years after its release, when the designers were happy to print plenty of simple creatures, and we are entering another such age. Or is this the end of vanilla creatures, with none printed in all of 2022? Time will tell.
r = range(max(df['released_at']))
plt.plot(r, [len(vanilla_df[vanilla_df['released_at'] == i]['oracle_text'].index) for i in r], color='red')
plt.xlabel("Years since Alpha")
plt.ylabel("Number of Vanilla Creatures")
plt.title("Vanilla Creatures over Time")
plt.show()
Now we can look at plots of mana cost, power, and toughness. We can see from this plot that the vast majority of vanilla creatures are very small, with less than 4 power and toughness. Generally, when a creature has larger power and toughness, it is a more powerful creature in the Magic storyline, and the designers want to make the card more exciting. So, they give it abilities, which means it doesn't show up in this plot.
mp, mt = max(vanilla_df['power']) + 1, max(vanilla_df['toughness']) + 1
data = np.zeros((mp, mt))
fig, ax = plt.subplots()
for power in range(mp):
for toughness in range(mt):
count = len(vanilla_df[(vanilla_df['power'] == power) & (vanilla_df['toughness'] == toughness)]['cmc'].values)
data[mp - power - 1][toughness] = count
text = ax.text(toughness, mp - power - 1, count,
ha="center", va="center", color="w")
ax.set_yticks(np.arange(mp), labels=list(range(mp))[::-1]) # invert power so that 0/0 is the bottom left corner
im = ax.imshow(data)
fig.tight_layout()
plt.xlabel("Toughness")
plt.ylabel("Power")
plt.title("Power vs Toughness of Vanilla Creatures")
plt.show()
Magic cards are distributed in packs, and so they have rarities which determine how often they show up in packs. In a 15-card pack, there are 11 commons, 3 uncommons, and only 1 rare, so getting a vanilla creature as your rare would not be good for business. Therefore, Magic doesn't print very many rare vanilla creature cards, but they do print some.
counts = vanilla_df['rarity'].value_counts().reindex(['common', 'uncommon', 'rare'])  # align counts with the bar labels
plt.bar(['Common', 'Uncommon', 'Rare'], counts, color=['black', 'silver', 'gold'])
plt.title("Vanilla Rarity Distribution")
plt.xlabel("Rarity")
plt.ylabel("Number")
plt.show()
Now let's try to predict the mana cost of a creature based on the data we have. First, let's drop all unnecessary data and get just the good stuff into a `data_df`; this will contain all the variables with which we will predict Converted Mana Cost.
Our goal is to predict the Converted Mana Cost of a creature based just on its vanilla stats: its power, toughness, color, rarity, release date, number of colored pips, and number of colors. I have predictions about how each of these will behave, which we will check against the regression results below.
data_df = vanilla_df.drop(columns=["mana_cost", "type_line", "oracle_text", "color_identity", "keywords", "set", "games", "legalities"])
data_df.head()
 | name | cmc | power | toughness | colors | released_at | rarity | num_colored_pips | num_colors |
---|---|---|---|---|---|---|---|---|---|
58740 | Craw Wurm | 6 | 6 | 4 | [G] | 0 | common | 2 | 1 |
54563 | Earth Elemental | 5 | 4 | 5 | [R] | 0 | uncommon | 2 | 1 |
66828 | Fire Elemental | 5 | 5 | 4 | [R] | 0 | uncommon | 2 | 1 |
35350 | Gray Ogre | 3 | 2 | 2 | [R] | 0 | common | 1 | 1 |
63158 | Grizzly Bears | 2 | 2 | 2 | [G] | 0 | common | 1 | 1 |
We still have some categorical variables. A linear regression cannot handle these directly, so I wrote a custom function that will expand and one-hot encode them. One-hot encoding means that instead of a single rarity column with values like "common" and "uncommon", we get one column per value, such as "rarity_common" and "rarity_uncommon", each containing a 1 if the card has that rarity and a 0 otherwise. We will do the same thing for colors -- note that a card can have multiple colors, so several of the `colors` columns can be 1 for a given card.
To learn more about one-hot encoding, visit this link.
This will give us a total of 13 variables for our regression.
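For intuition, pandas' built-in get_dummies does the same thing for a simple single-valued column; here is a toy sketch on made-up data (our dummy_list below is needed because the colors column holds lists rather than single values):

# One-hot encode a toy 'rarity' column: one new 0/1 column per distinct value.
toy = pd.DataFrame({'rarity': ['common', 'uncommon', 'common', 'rare']})
print(pd.get_dummies(toy, columns=['rarity'], dtype=int))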
def dummy_list(data_df, one_hot_df, column, predicate=lambda i, j: i == j):
x = set(data_df.explode(column)[column].values) # get all values from the column
if np.nan in x: # remove NaNs that might be in there
x.remove(np.nan)
for i in x: # make the new one-hot column
one_hot_df[f'{column}_{i}'] = data_df[column].apply(lambda j: int(predicate(i, j)))
one_hot_df = data_df.drop(columns=['colors', 'rarity'])
dummy_list(data_df, one_hot_df, 'colors', predicate=lambda i, j: i in j)
dummy_list(data_df, one_hot_df, 'rarity')
one_hot_df.head()
 | name | cmc | power | toughness | released_at | num_colored_pips | num_colors | colors_W | colors_R | colors_U | colors_B | colors_G | rarity_common | rarity_rare | rarity_uncommon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58740 | Craw Wurm | 6 | 6 | 4 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
54563 | Earth Elemental | 5 | 4 | 5 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
66828 | Fire Elemental | 5 | 5 | 4 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
35350 | Gray Ogre | 3 | 2 | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
63158 | Grizzly Bears | 2 | 2 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 |
We will now perform the linear regression. The basic idea behind linear regression is that we are trying to find the best way to predict the target variable (CMC, or y
in the below code) using a linear combination of the other variables in our dataset (the other columns in one_hot_df
, or X
in the below code). I chose a linear regression because we have a target variable and a suite of predictors that we can reasonably expect to combine linearly, since Magic cards are designed with all of these variables in mind. Other algorithms, such as a neural network, could work, but that seemed like overkill and computationally intensive, while a linear regression is computationally light and requires much less data.
First, we split the dataframe to get the variables and the expected output:
X = one_hot_df.drop(columns=['name', 'cmc'])
y = one_hot_df['cmc']
Linear regressions also contain a term known as a constant or an intercept; this is equivalent to the intercept (b) in the typical formula for a line y = mx + b
, where mx
is actually $$\sum_{i=1}^n m_ix_i$$ for each variable x and coefficient m, creating a linear combination. To add the constant, we use the add_constant
function, which just adds a column of 1's to X.
X = sm.add_constant(X)
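To see concretely what add_constant does, here is a tiny demonstration on made-up numbers (not part of the analysis):

# add_constant prepends a 'const' column of 1.0s, which becomes the intercept term.
demo = pd.DataFrame({'power': [2, 6], 'toughness': [2, 4]})
print(sm.add_constant(demo))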
Finally, we run the Least Squares algorithm. This tries to find the linear combination "regression line" that fits the data the best, where "the best" means the sum of the squares of all the errors is minimized. You can learn more about the intuition behind least squares regression here.
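In symbols, using the same m, x, and b notation as above (with j indexing the N creatures in our dataset), least squares picks the coefficients that minimize the total squared error:

$$\hat{m}, \hat{b} = \underset{m,\, b}{\arg\min} \; \sum_{j=1}^{N} \Big( y_j - b - \sum_{i=1}^{n} m_i x_{j,i} \Big)^2$$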
# Run statsmodels' OLS (Ordinary Least Squares) model, making sure the inputs are integers
model = sm.OLS(np.asarray(y, dtype=int),
np.asarray(X, dtype=int)).fit()
# Plot the coefficients, with error bars
plt.errorbar(model.params, X.columns, xerr=model.bse, fmt='o',
alpha=0.5, ecolor='grey', capsize=5)
plt.xlabel("Coefficient")
plt.ylabel("Variable")
plt.title("Linear Regression for Converted Mana Cost of Vanilla Creatures")
plt.show()
There are several interesting results to discuss. Firstly, the rarity behaved as expected: commons are overcosted, while rares are expected to cost a full mana less (a large amount in Magic). There aren't very many rare vanilla creatures, which leads to the wide error bars.
Secondly, green creatures are expected to cost about half a mana less, with the other colors clustered roughly together. This makes sense, as green is the color most focused on creatures; blue and black show the largest expected increases in mana cost.
A point of power is also expected to be worth more than a point of toughness; this was as I expected, since cheap defenders are more plentiful than cheap attackers.
The number of colors did not have much effect, but what intrigued me the most was that released_at
had almost no effect. One of the biggest complaints about Magic is that cards increase in power over time, making older cards unplayable; however, it seems that the Grizzly Bears from Alpha is still pretty playable! Let's see.
To check the expected mana value of a card, we can take each value from its row and multiply it by the coefficient from the linear regression, then add the intercept:
def expected_mana_cost(row):
return sum(row * model.params[1:]) + model.params[0]
one_hot_df['expected_mana_value'] = one_hot_df.drop(columns=['name', 'cmc']).apply(expected_mana_cost, axis=1)
one_hot_df
 | name | cmc | power | toughness | released_at | num_colored_pips | num_colors | colors_W | colors_R | colors_U | colors_B | colors_G | rarity_common | rarity_rare | rarity_uncommon | expected_mana_value |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
58740 | Craw Wurm | 6 | 6 | 4 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 5.690056 |
54563 | Earth Elemental | 5 | 4 | 5 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 4.993144 |
66828 | Fire Elemental | 5 | 5 | 4 | 0 | 2 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 5.151129 |
35350 | Gray Ogre | 3 | 2 | 2 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 2.764881 |
63158 | Grizzly Bears | 2 | 2 | 2 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2.186319 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
11291 | Highborn Vampire | 4 | 4 | 3 | 27 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 3.967672 |
73427 | Murasa Brute | 3 | 3 | 3 | 27 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2.636510 |
22628 | Grizzled Outrider | 5 | 5 | 5 | 28 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 5.092426 |
22694 | Ageless Guardian | 2 | 1 | 4 | 28 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2.308297 |
59796 | Spined Karok | 3 | 2 | 4 | 28 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 2.449176 |
324 rows × 16 columns
We can see that it got fairly close on some of these old cards. Looking at the newest cards, we can see that Gray Ogre (a 2/2 for 3 mana in red) is considerably more overcosted than the newer Spined Karok, a 2/4 for 3 mana in green. So there certainly is some power creep, but it is not as strong as some people say. Or is it?
Now we will look at slightly more interesting creatures: creatures with keywords. While these creatures still don't have ability text, they can actually have useful keywords on them that help during combat, help to protect the creature, or anything in between. I will provide examples of some of the most common abilities after a bit of data processing.
To find keywords, Scryfall helpfully provides a column called keywords. However, some newer cards carry so-called "flavor words", which Scryfall counts as keywords but which we don't want to count here. Therefore, I am excluding them from the analysis; you can read some of the logic behind their introduction here.
keywords = set()
for i, row in df.iterrows():
if row['set'] not in ['afr', '40k', 'clb', 'sld']:
# normal set
keywords = keywords.union(row['keywords'])
else:
# ignore flavor words by ignoring keywords with spaces
keywords = keywords.union([kw for kw in row['keywords'] if ' ' not in kw])
keyword_soup = ' '.join(keywords).lower() # easy way to convert everything to lowercase
First, let's write some code to extract the relevant text from a magic card. Magic cards can have lots of things in their oracle text; for example, text in parentheses that reminds the player how a certain ability works. Additionally, some abilities have arguments; for example, Ward is an ability that comes with an associated cost, while Protection comes with a description of what it protects from. Because these aren't part of the keyword, they won't be in our keyword_soup
variable, and so we have to process them out.
removes = [r'(\(.*?\))', # remove parentheses (reminder text)
r'(\{.*?\})', # remove mana costs for certain abilities (outlast, etc.)
r'(—[^ ][^\n]*)', # remove ward costs
r'[Pp]rotection(?! F)([^\n]*)', # remove protection type
           r'(\d+)', # remove numbers for certain abilities (rampage, etc.)
r'Prototype([^\n]*)'] # remove prototype costs
def extract_ability_text(row):
text = row['oracle_text']
if text is np.nan or text == '':
return ''
for r in removes:
if m := re.search(r, text):
start, end = m.span(1) # remove first capturing group
text = text[:start] + text[end:]
text = text.replace(',', '').replace(';', '')
text = text.lower()
text = text.strip()
return text
df['ability_text'] = df.apply(extract_ability_text, axis=1)
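As a quick sanity check on a made-up oracle text (not a real card), the reminder text and the ward cost get stripped while the keywords themselves survive:

# Hypothetical oracle text, just to illustrate the cleanup steps.
example = {'oracle_text': "Flying, ward {2} (To target this creature, pay {2}.)\nTrample"}
print(extract_ability_text(example))  # -> roughly "flying ward \ntrample"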
Now we're prepared for our is_french_vanilla
function. This function will take a row and determine if it's french vanilla by extracting the relevant text and making sure that every token in the relevant text is a keyword. If there is a non-keyword token, it is not french vanilla (remember, a french vanilla creature only has keywords as its abilities).
def is_french_vanilla(row):
text = row['ability_text']
for i in text.split():
        if i not in keyword_soup: # check that every word is a keyword
# note that keywords are never in front of a period or any punctuation other
# than , or ; so we don't need to do any complicated tokenization
return False
return True
df['is_french_vanilla'] = df.apply(is_french_vanilla, axis=1)
Making the dataframe of only french vanilla creatures is quite easy; simply apply our is_french_vanilla
to each row, then take only the rows that have True
in that column.
french_vanilla_df = df[df['is_french_vanilla']]
french_vanilla_df.head()
 | name | mana_cost | cmc | type_line | oracle_text | power | toughness | colors | color_identity | keywords | set | released_at | rarity | games | legalities | num_colored_pips | num_colors | ability_text | is_french_vanilla |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32371 | Air Elemental | {3}{U}{U} | 5 | Creature — Elemental | Flying | 4 | 4 | [U] | [U] | [Flying] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 | flying | True |
5420 | Benalish Hero | {W} | 1 | Creature — Human Soldier | Banding (Any creatures with banding, and up to... | 1 | 1 | [W] | [W] | [Banding] | lea | 0 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 | banding | True |
59195 | Black Knight | {B}{B} | 2 | Creature — Human Knight | First strike (This creature deals combat damag... | 2 | 2 | [B] | [B] | [First strike, Protection] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 | first strike \nprotection | True |
31516 | Bog Wraith | {3}{B} | 4 | Creature — Wraith | Swampwalk (This creature can't be blocked as l... | 3 | 3 | [B] | [B] | [Landwalk, Swampwalk] | lea | 0 | uncommon | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 1 | 1 | swampwalk | True |
58740 | Craw Wurm | {4}{G}{G} | 6 | Creature — Wurm | 6 | 4 | [G] | [G] | [] | lea | 0 | common | [paper] | {'standard': 'not_legal', 'future': 'not_legal... | 2 | 1 | True |
There is now a new rarity, mythic rare! Mythic rare sits above rare: roughly one pack in eight has a mythic in place of its rare. Most mythics are extremely powerful and so have more abilities than just keywords, but a few french vanilla mythics have been printed:
fv_counts = french_vanilla_df['rarity'].value_counts().reindex(['common', 'uncommon', 'rare', 'mythic'])  # align counts with the bar labels
plt.bar(['Common', 'Uncommon', 'Rare', 'Mythic'],
        fv_counts,
        color=['black', 'silver', 'gold', 'red'])
plt.title("French Vanilla Rarity Distribution")
plt.xlabel("Rarity")
plt.ylabel("Number")
plt.show()
Now we can make a similar data_df
to before, except now we are including the keywords
column. We can do the same process to one-hot encode the dataframe. Notice there is a new rarity column, rarity_mythic
.
fv_data_df = french_vanilla_df.drop(columns=['mana_cost', 'type_line', 'color_identity', 'set', 'is_french_vanilla', 'games', 'legalities'])
fv_one_hot_df = fv_data_df.drop(columns=['colors', 'oracle_text', 'keywords', 'rarity'])
dummy_list(fv_data_df, fv_one_hot_df, 'colors', predicate=lambda i, j: i in j)
dummy_list(fv_data_df, fv_one_hot_df, 'rarity')
fv_one_hot_df.head()
 | name | cmc | power | toughness | released_at | num_colored_pips | num_colors | ability_text | colors_W | colors_R | colors_U | colors_B | colors_G | rarity_common | rarity_mythic | rarity_rare | rarity_uncommon |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
32371 | Air Elemental | 5 | 4 | 4 | 0 | 2 | 1 | flying | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
5420 | Benalish Hero | 1 | 1 | 1 | 0 | 1 | 1 | banding | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
59195 | Black Knight | 2 | 2 | 2 | 0 | 2 | 1 | first strike \nprotection | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
31516 | Bog Wraith | 4 | 3 | 3 | 0 | 1 | 1 | swampwalk | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
58740 | Craw Wurm | 6 | 6 | 4 | 0 | 2 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 |
For keywords, we have to do something slightly different. It is possible for one card to have multiple instances of the same keyword (only 7 cards ever do, but potentially more could in the future). So, we have to count the instances of each keyword.
keyword_counts = {}
for keyword in keywords:
counts = fv_data_df.apply(lambda row: row['ability_text'].count(keyword.lower()), axis=1)
if sum(counts) > 0: # some keywords never appear on french vanilla creatures
keyword_counts[f'keywords_{keyword}'] = sum(counts)
fv_one_hot_df[f'keywords_{keyword}'] = counts
Let's run the regression! For the non-keyword variables I expect results similar to the vanilla regression, but we'll include them again too.
X = fv_one_hot_df.drop(columns=['name', 'cmc', 'ability_text'])
X = sm.add_constant(X)
y = fv_one_hot_df['cmc']
model = sm.OLS(np.asarray(y, dtype=int),
np.asarray(X, dtype=int)).fit()
I'm going to handpick some keywords, and get the indices of the things I will plot.
indices = []          # column indices of the non-keyword variables
keyword_indices = []  # [index, coefficient] pairs for the handpicked keywords
per_year = 0
for i, name in enumerate(X.columns):
    if name not in keyword_counts:
        indices.append(i)
        if name == 'released_at':
            per_year = model.params[i]  # saved for later, to subtract out the release-date control
    elif any(kw in name for kw in ['Haste', 'Defender', 'Double strike', 'Flying', 'Trample', 'Cascade', 'Lifelink', 'Hexproof', 'Shroud', 'Fear', 'Intimidate', 'Delve', 'Convoke', 'Undying', 'Flash', 'Vigilance', 'Echo']):
        keyword_indices.append([i, model.params[i]])
keyword_indices.sort(key=lambda pair: pair[1])  # sort keywords by coefficient
keyword_indices = [pair[0] for pair in keyword_indices]
Looking at the previous variables, we can now see that mythics are incredibly powerful for their cost, expected to cost about 2.5 mana less than their stats suggest. Interestingly, rares now have a slightly positive effect on the mana cost; this is completely unexpected. We again see that green creatures are the strongest for their mana cost, and that the release date did not have much effect. This time, since we have more multicolored creatures, we can see that the number of colors has a fairly strong effect on the mana cost. Toughness and power have values similar to before.
plt.errorbar([model.params[i] for i in indices],
[X.columns[i] for i in indices],
xerr=[model.bse[i] for i in indices],
color='green', alpha=0.5, fmt='o', ecolor='black', capsize=5)
plt.xlabel("Coefficient")
plt.ylabel("Variable")
plt.title("Linear Regression for Converted Mana Cost of French Vanilla Creatures")
plt.show()
Now, let's look at the keywords! There are many types of keywords; my prediction is that keywords that can reduce the cost of casting the creature (like Delve and Convoke) will give the largest increase in printed mana cost, followed by protective keywords (keywords that shield the creature, like hexproof and shroud), followed by keywords that give the creature an advantage in combat. I think this last group of keywords is not particularly highly valued because while a protective keyword helps the creature against all sorts of spells, combat keywords only help against other creatures.
Finally, I included some keywords that impart a negative effect on the creature that I expect will have a strongly negative effect on the mana cost.
plt.errorbar([model.params[i] for i in keyword_indices],
[X.columns[i] for i in keyword_indices],
xerr=[model.bse[i] for i in keyword_indices],
color='green', alpha=0.5, fmt='o', ecolor='black', capsize=5)
plt.xlabel("Coefficient")
plt.ylabel("Variable")
plt.title("Linear Regression for Converted Mana Cost of French Vanilla Creatures")
plt.show()
I chose many popular keywords, as well as keywords that I thought would be interesting. For example, the Delve and Convoke keywords both have the ability to reduce a creature's mana cost, and they both place towards the top of the list. Keywords such as double strike and flying performed much better than I thought they would, as they are only effective in combat; keywords such as hexproof and shroud don't give as much of a boost as I expected. The negative keywords, echo and defender (echo makes you pay the mana cost twice, and defender means the creature cannot attack) decreased the mana cost sharply as I expected.
One final thing is to analyze the power creep! Let's apply the expected_mana_cost function we wrote earlier and take a look at expected mana values vs actual mana values over time. We will subtract out the coefficient that was controlling for the release date.
fv_one_hot_df['expected_mana_value'] = fv_one_hot_df.drop(columns=['name', 'cmc', 'ability_text']).apply(expected_mana_cost, axis=1)
fv_one_hot_df['power_level'] = fv_one_hot_df['expected_mana_value'] - per_year * fv_one_hot_df['released_at'] - fv_one_hot_df['cmc']
r = range(max(fv_one_hot_df['released_at']))
plt.plot(r, [statistics.mean(fv_one_hot_df[fv_one_hot_df['released_at'] == i]['power_level']) for i in r])
plt.xlabel("Years since Alpha")
plt.ylabel("Difference in Expected vs Actual Mana Value")
plt.title("Power Level over Time")
plt.show()
As you can see, once we remove the model's adjustment for release date, there is a steady increase in power level (expected mana value minus actual mana value) over time. As with the complexity creep graph, this arguably makes the game more fun to play as time goes on, but it also makes old cards harder to keep playing, and therefore makes keeping up with the game more expensive.
Vanilla and French Vanilla creatures provide interesting insights into the minds designing cards for Magic: The Gathering. These creatures are often designed with beginners in mind, as introductions to the game's mechanics (a common complaint among veterans is the "bear with the set's mechanic" card, a card that does nothing but introduce a new keyword). They are a great signpost for what R&D is thinking, as we can directly compare the mana value of a vanilla 2/2 and a 2/2 with flying and see how R&D treats each.
Today, we did a more complicated analysis by running a regression on many different statistics found on a Magic card. We found that some colors are certainly more efficient at producing creatures; that R&D definitely takes the number of colors and number of colored pips into account when designing a card; and that rares and mythics are definitely more powerful than commons. We learned some interesting facts about keywords, such as the fact that combat keywords are rated more highly by R&D than I expected and therefore increase the expected mana cost by more. Finally, we compared expected mana costs to actual mana costs over time and found some tentative evidence for power creep.
There is much more to do with this analysis. For one, some creatures have activated, triggered, and static abilities, none of which we analyzed. I wrote code to detect activated abilities, but there are so many different ones that running a regression becomes difficult. Something new, perhaps a neural network or an NLP pipeline, would be required to parse the text of each individual ability and rate its strength. We also only analyzed creatures; there are many other interesting card types, such as artifacts or planeswalkers, that we could analyze with similar methods. Overall, this analysis showed some evidence of power creep, but a deeper look is definitely needed.