Using Bayesian estimation to decide the best Spring crop to plant your second year


The realization

I had been eyeing Stardew Valley for a while and finally started playing it.

After planting a variety of crops and going through my first harvest, I noticed I was obtaining different amounts and qualities of crops. For instance, harvesting a potato plant can yield 1, 2, or 3 (possibly more) potatoes and each potato can be either regular, silver, or gold quality. I have seen a fourth quality (i.e. iridium) for other items but not for any of the Spring crops I'll discuss.

At this point I realized that these kinds of factors and others would result in ranges/distributions of revenue and profit instead of a single value.

The question

Once I realized each crop has a distribution of potential profits, I wondered how I could use statistics to make Stardew Valley ACTUALLY fun decide what to plant to make the most profit?

My approach

The main metric I'll focus on is profit-per-square (of land) per year. The reason is the main exhaustable resource, that I've encountered so far at least, is land.

Profit is revenue minus expenses.

Yearly expenses are constant for each product/crop, at least for the Spring crops I'm focusing on. They are determined by the number of harvests and the cost of seeds (i.e. at Pierre's).

Revenue can vary due to quality and, potentially, because more than one item can result from each pull or harvest of a plant. Hence, the statistical focus of this article will be on determining the distribution of yearly revenue and then subtracting the yearly expenses from each potential revenue value to obtain the corresponding yearly profit distribution.

Most of the crops I'll analyze vary only in quality and are thus easier to build revenue distributions for. There are 2 crops (i.e. potatoes and unmilled rice) that vary in both quality and items upon harvest.

Here is an overview of the steps I'll take to derive profit-per-square per year distributions for each Spring crop:

  1. Collect quality and, if applicable, items upon harvest counts
  2. Obtain, if applicable, the items upon harvest distribution using those counts and a Bayesian statistical technique
  3. Obtain the quality distribution using a Bayesian statistical technique
  4. Obtain the price distribution from the quality distribution and the corresponding quality prices
  5. Obtain the revenue per harvest distribution by combining the items per harvest and price distributions
  6. Calculate the number of harvests per year using using the days until first harvest
  7. Obtain the revenue per year distribution using the revenue per harvest distribution and number of harvests.
  8. Calculate the expenses per year using the number of harvests per year and seed cost
  9. Obtain the distribution of profit per year using the revenue per year distribution and the yearly expenses.

Note that this process includes an assumption that the quality and count distributions are independent, meaning the number of items upon harvest does not affect the chance of those items having different qualities. This is based on some analysis of the data I collected and realizing I don't have enough data to say either way and my assumption that coding sampling of independent probabilities is easier than coding sampling of a joint distribution. Hence, I will derive the quality and count distributions separately for these multi-item crops.

Once I've obtained the profit-per-square per year distributions I can compare them to figure out the most profitable crop(s).

The data I collected

I planted blue jazz, cauliflower, garlic, potatoes, rice, kale, parsnips, and tulips and made sure they were watered everyday and protected from birds by scarecrows.

I then harvested them and recorded quality and count data for each harvest for the full Spring growing season. Here are the final yearly counts of the different item qualities:

item quality count
blue_jazz regular 15
blue_jazz silver 11
blue_jazz gold 7
cauliflower regular 87
cauliflower silver 71
cauliflower gold 39
garlic regular 67
garlic silver 28
garlic gold 24
kale silver 13
kale regular 12
kale gold 7
parsnip regular 27
parsnip silver 23
parsnip gold 15
potato regular 25
potato silver 13
potato gold 5
tulip silver 14
tulip regular 13
tulip gold 6
unmilled_rice silver 14
unmilled_rice regular 12
unmilled_rice gold 6

Unmilled rice and potatoes were the multi-item harvest crops. Here are the item counts upon harvest for them:

item item_per_harvest count
potato 1 33
potato 2 6
potato 3 2
unmilled_rice 1 27
unmilled_rice 2 3

I also kept track of the days until the first harvest, the initial investment (seed cost), and the prices of regular, silver, and gold quality versions of each item in Pierre's store:

item days_til_first_harvest initial_investment regular silver gold
blue_jazz 7 30 50 62 75
cauliflower 12 80 175 218 262
garlic 4 40 60 75 90
potato 6 50 80 100 360
unmilled_rice 8 40 30 37 45
kale 6 70 110 137 165
parsnip 4 20 35 43 52
tulip 6 20 30 37 45

How I obtained the distribution of item counts upon harvest

I imagined the process the game developer used to choose the number of items upon harvest as, starting with a guaranteed first item, choosing a random number between 0 and 1, determining if that number is within a range of values that decides whether another item is drawn, and continuing this process until the latest random number is not in the extra item draw range.

I considered 2 other processes: 1) choosing an average item count upon harvest and sampling from a Poisson distribution and 2) hard-coded probability ranges for each item count upon harvest value. The first method seemed too into the statistical weeds for a single game developer with no other employees. The second constrained me to the data I had collected thus far (i.e. no more than 3 items could be harvested). I suspect more than 3 items might be possible and the method I did go with seemed like the easiest to code for the developer.

After a lot of consideration, I decided to model the counts using a geometric distribution with a modification: I subtracted 1 from the exponent of the success class probability (i.e. new item is drawn), to reflect that we already have 1 item at the start.

To estimate the range of values of this, presumably hard-coded, extra item draw chance I assumed a flat prior distribution between 0 and 1 and the modified geometric likelihood. Following Baye's theorem I multiplied the equal probability vector of the prior by a vector representing the probability of each item count upon harvest, assuming the modified geometric form of the likelihood, until I had exhausted the data.

Heres the resulting posterior distribution, representing the probability of an extra draw that is consistent with the data for potatoes:

0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

The 94% credible interval for the extra potato draw chance ranges from 11.4 to 32%.

This posterior distribution can be used, with the modified geometric distribution, to build the posterior predictive distribution, which is the probability of future items upon harvest data given the data encountered so far.

It's calculated by taking each extra draw chance value from the gridded posterior distribution, building the corresponding probability of different items upon harvest using the modified geometric distribution, and forming the sum of these weighted by the corresponding draw chance probability.

Heres the result for potatoes:

Potato items upon harvest chance fractional probability items per harvest 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 2 3 4 5 6 7 8 9 10

This matches well with the count data that I collected.

How I obtained the distribution of qualities

Thankfully, qualities have only four values: regular, silver, gold, and iridium. I've only encountered three for the Spring crops I've encountered so far so, I'll just ignore iridium until I encounter it in future data.

I prefer starting with simple gridded approaches if possible. Since the probabilities must sum to one I really only need to iterate over two them.

For the Bayesian update I'll assume the data follows a multinomial distribution. This is equivalent to sampling from a bag with 3 different colors of marbles with replacement.

Since I don't have any idea what the quality values should be I'll just use a flat prior again. This distribution consists of the same probability for all combinations of draw chances for regular, silver, and gold qualities that sum to one.

I used Bayes Theorem to update this prior with a similarly structured likelihood joint distribution, for all of the same quality draw chance combinations, but with the probability of the observed quality counts, as dictated by the multinomial distribution. Here's the resulting posterior distribution for potatoes:

Joint posterior distribution of quality draw chances

The darker red means a higher probability for that combination of draw chances for regular, silver, and gold qualities.

I multiplied each of these quality probability distributions/points by the corresponding probability of it (i.e. the value represented by the red color) and summed them to obtain the distribution of qualities (i.e. the posterior predictive). Here's the version for potatoes:

Potato quality chance fractional probability quality 0.0 0.1 0.2 0.3 0.4 0.5 0.6 regular silver gold

Again this is close to the data.

How I obtained the distribution of revenue upon harvest

Now I need to combine the quality-based price distribution (i.e. the quality distribution but with the qualities replaced with their corresponding price) and that for the item counts upon harvest to obtain that for the revenue per harvest.

To do this, I need to build the revenue distribution for each potential item harvest count in that item's count per harvest distribution, weight it by the probability of that count, and then sum all of them.

For example, to build the revenue distribution for 2 items harvested, we can take the sum of all potential price combinations between a price distribution and itself and multiply the corresponding probabilities. This will sometimes result in duplicate revenues. In these cases the duplicate probabilities will be added so there is only one instance of each revenue.

If we wanted to extend this scenario to 3 items per harvest we would keep the resulting revenue distribution for 2 items per harvest and perform the same process with it and another of the original price distributions. This process can be repeated to obtain the revenue distribution for any number of items per harvest.

Here's the potato revenue per harvest distribution:

Potato revenue per harvest fractional probability revenue per harvest 0.0 0.1 0.2 0.3 0.4 0.5 80 100 160 180 200 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520

The patterns make sense to me. 80 gold per harvest, corresponding to 1 regular potato, is the most likely by far. Next is 1 silver potato (100 gold) then 1 gold potato (360 gold).

I want to point out that I've cut off this figure drastically, since it ranges from 80 (1 regular potato) to 3600 gold (10 gold potatoes) and the probabilities above 520 are negligible.

Calculating the number of harvests per year

The number of harvests per year is required to use the revenue per harvest distribution to calculate the revenue per year.

Harvests per year is equivalent to the number of harvest/mature/plant periods in the days remaining after the first plant/mature period plus an additional harvest for any remaining partial harvest/mature/plant period over the entire growing/harvesting season. This can also vary depending on whether the plant is destroyed upon harvest and/or whether it is never destroyed (like the fruit trees I've encountered).

If this isn't clear enough I've implemented a function in the corresponding section of the notebook on Github.

Turns out that the number of harvests per year for potatoes is 3.

Calculating the revenue per year distribution

The same process used to calculate the revenue per harvest for a specific item per harvest count is used to calculate the revenue per year distribution. First, the summed distribution is calculated for 2 harvests using the revenue per harvest distribution by summing all combinations of revenues and multiplying the corresponding probabilities, then instances of the same revenue have their probabilties summed. This process is repeated using this new distribution with the original revenue per harvest distribution for the number of harvests.

Here's the revenue per year distribution for potatoes:

Potato yearly revenue fractional probability yearly revenue 0.00 0.05 0.10 0.15 240 260 280 300 320 340 360 380 400 420 440 460 480 500 520 540 560 580 600 620 640 660 680 700 720 740 760 780 800 820 840 860 880 900 920 940 960 980 1000 1020 1040 1060 1080 1100

I cut off the figure at 1100 gold because it goes all the way to 10800 (10 gold potatoes for all 3 harvests)!

Oddly enough, harvesting 3 regular potatoes (240 gold) appears to be less likely than 2 regular and 1 silver (260 gold). I suspect something is happening akin to 2 of a kind is more likely than 3 of a kind in Poker. It is true that regular potatoes are more probable than gold. So, this analogy breaks down a bit. However, I did some completely different sampling simulations and, basically, obtained the exact same distribution. Hence, I have high confidence in the result.

Obtaining the profit per year distribution

Profit is revenue minus expenses. Basically we just need to calculate the expenses for every crop and subtract that from each of the revenue values in the yearly revenue distributions.

For these simple crops the expense is, simply, the seed cost or initial investment times the number of harvests.

I calculated all of the profit distributions for all crops and visualized their cumulative distribution functions (CDFs), obtained by ordering by the profit and summing all probabilities up until that point:

Crop profit CDFs 0.0 0.2 0.4 0.6 0.8 1.0 -200 0 200 400 600 800 1000 blue_jazz cauliflower garlic potato unmilled_rice kale parsnip tulip Cumulative fractional probability Yearly profit per square

For those unfamiliar with CDFs, they represent the sum of all probabilities up until the value on the x-axis, in this case profit. Note how the y-axis ranges from 0 to 1, since the total probability of everything is 1 (100%). I find it is easier to visualize multiple probability distributions, especially those that aren't continuous this way.

I cut the figure off at 1000 gold for better visibility. Unmilled rice's distribution actually extends all the way to 1230 and that for potatoes extends to 10650 gold.

One thing that immediately stands out to me is that there is a good chance unmilled rice is unprofitable (~62.3% chance). Now, I'm aware that certain items can be converted to other products (e.g. cherries to cherry wine). Though I don't seem to have access a new product trasnformation for unmilled rice yet.

There appears to be a clear hierachy of unmilled_rice < tulip < blue_jazz. The hierarchy is less clear after that.

The main contenders for highest profit are potatoes, cauliflower, and kale.

One way to make a decision about what crop to plant is to derive the probability that the profit distribution of one crop is higher than another, what I've read referred to as the "probability of superiority".

To calculate this for two discrete distributions, we simply iterate through all pairs of hypotheses or profit values, multiply their probabilities (because this is the probability that they're both drawn together), filter to only the instances where the expected higher distribution's profit value is higher than the other one, then sum them together.

Here are the fractional probabilities of superiority for all pairs of the top contender crops:

item_lower cauliflower kale potato
item_higher      
cauliflower 0.396024 0.880723 0.492459
kale 0.119277 0.432060 0.389162
potato 0.494593 0.609618 0.469376

Each numerical value is the fractional probability that the item in the left most column is higher than the item in the corresponding column. For instance, cauliflower is about 88.1% likely to result in a higher profit than kale. I've made the higher probability values proportionally darker for faster visual reference.

An item's probability of having a higher profit than itself might seem kinda low. The reason for this is there is overlap where the values are the same.

To my eye, kale is rather less likely to have a higher profit than cauliflower. The decision is less clear in comparison to potatoes.

Regarding whether cauliflower or potatoes has a better chance to have a higher profit it seems very evenly split. We could decide to plant potatoes or maybe both. This hints at another article I'm thinking of writing, where I plant crops in proportion to how high their simulated profits are, based on a multi-armed bandit approach.

For now, I'll just focus on upsides and downsides of planting the top 3 crops. Here are the cumulative distribution functions of the profit differences of potatoes over cauliflower, cauliflower over kale, and potatoes over kale:

Top profit crop difference CDFs 0.0 0.2 0.4 0.6 0.8 1.0 -400 -200 0 200 400 600 800 1000 potato - cauliflower cauliflower - kale potato - kale

Note that the figure extends all the way to 10530 gold, but I've cut it off soon after leveling off for better visibility.

Planting cauliflower instead of kale is, again much more likely to result in a higher profit. The worst downside would be 95 gold less in profit. The best upside would be only 244 more profit. For comparison, the yearly seed costs for potatoes is 150 gold. Neither scenario is terrible or exciting.

I think the choice really boils down to cauliflower versus potatoes. The worst outcome of planting potatoes instead of cauliflower is making 274 less gold. This corresponds to making the least amount of profit possible from potatoes and making the most amount from cauliflower. Yet the value isn't that bad. Given the count data I collected and the analysis I did, there's about a 49.5% chance that I could make between 0 and 10460 gold additional profit. I'm not particularly worried about the worst scenario and the best the upside could be enormous. I would caution, however, that the 94% credible interval is between -210 and 574 extra gold.

Given the better upside I would, probably, plant potatoes. However, the machine learning geek in me is really leaning towards that multi-armed bandit approach ;)

Conclusion

I manually collected count data for different crops during the Spring season of my 1st year in Stardew Valley, used Bayesian inference to derive profit-per-square per year distributions, and compared them to figure out that cauliflower and potatoes are probably the most profitable. Hope you enjoyed this article!

If you're interested in calculating the profit distribution of potatoes with code here's the corresponding notebook in Github.