We don't actually know when many Steam games were first released

One of the most unreliable fields on a Steam store page is the release date field. Although there is some real day in history on which each and every game on Steam first became available to the public for download, the "release date" field doesn't accurately capture that value. Instead, the developer can stick just about anything they want in there, and they can change it even after the game has released. You can't fault developers too much, however, because the release date field doesn't make it clear what it's supposed to mean.

Here's a sampling of various things I've seen it used for:

  • The first day the game was actually released on Steam (i.e., you can play it now)
  • The first day the game was historically released (e.g. a DOS game from 1993)
  • For a former Early Access game, the date it graduated from Early Access
  • For a former Early Access game, the date it launched under Early Access
  • The first day the game was available for pre-order on Steam prior to it actually being available to play
  • For an unreleased game, the day the game will release in the future
  • Something random and weird that makes no sense at all

Okay, so why should anybody care, and how is this a big deal?

Hi there! If you're new to this blog, my name's Lars and I'm an independent game developer, games business analyst, and maintainer of the free database www.gamedatacrunch.com. Prior to starting that site I spent a little over a year at Valve as a contractor on the Steam Labs initiative. I don't work for them anymore though, and all my work on gamedatacrunch has been entirely independent, and I don't speak for Valve in any way. All of my Steam data is sourced from public web pages and APIs that anyone can access – no NDA violations or secret sauce here.

The release date field being unreliable is a big deal because it makes it really hard to answer, "how many days has this game been listed on Steam for?" as well as, "in what year did this game first release on Steam?" These two pieces of information are crucial not only for the historical record, but also for properly analyzing the performance of games for my clients. It's not uncommon to find two games with the same stated release date where one has only been listed on Steam for a few days, while the other has been there for literal years but only just now exited Early Access. If I don't account for this my analysis will be flawed.

Note on Steam

Before we dive in, I want to make it clear I'm not really interested in "bashing Steam" for not having great records of this information. It'd be great if they had all the information I wanted, but the fact is they don't, and they're probably frustrated they're missing it too. Hindsight is 20/20, and honestly the amount and quality of data we have for many other leading platforms (including 3rd party efforts at recording them) is far worse. We're kind of lucky to have the piles of data we do on Steam, disorganized as they are.

The Steam codebase must be close to 20 years old by now. It's easy to make fun of legacy software, but the whole world runs on it, and the mockery comes easiest to people who have never had to maintain such things themselves. Besides, I run a rinky-dink third party database with tons of embarrassing bugs I keep promising I'll get around to solving one day. Glass houses and stones and all that.

With that said, let's look at some examples of the issue.

Examples

Hades' release date reflects not the first day it was available on Steam, but the date on which its Early Access period ended:

Fortunately, Steam seems to have started tracking this information. On recent games like Hades there's an information box down below that tells you when the Early Access period ended:

But this isn't available for all games. Overgrowth claims to have been released on Oct 16, 2017:

...but the info-box down below gives no indication it was ever in Early Access (it was listed as coming soon back in 2008, and the beta version launched on Steam under Early Access on 2017-01-17).

According to my database, 3,858 games have stated Early Access release dates visible on their Steam store pages. The observed gap in reporting suggests that at some point the "former Early Access tracking" feature was added, which means there's a bunch of missing data from before that point. Let's see if we can find it.

According to my database, the oldest Early Access release date noted on a store page is for a well-received voxel-based vehicular combat game called From the Depths, which launched under E.A. on 2014-08-07 and graduated to full release on 2020-09-06.

We know that's not the earliest Early Access game, however. An article from the Verge reports that Steam Early Access started on March 20, 2013, with one of the inaugural twelve E.A. titles being none other than the legendary Kerbal Space Program.

Kerbal's release date doesn't reflect this:

Nor does its info box below:

Detective Work

Poking around my database, I see a pattern: the "Early Access release date" fields seem to be populated starting on some date X, triggered by a particular game leaving Early Access. If a game had already left Early Access prior to that point, then it wouldn't have an "Early Access release date" field in its lower infobox.

This theory generates several testable hypotheses:

  • Games that started E.A. before X and left before X will not be tracked
  • Games that started E.A. before X and left after X will be tracked
  • Games that started E.A. after X and left after X will be tracked

Based on the available evidence, I believe X is 2018-10-31, or Halloween of 2018. Queries based on the three hypotheses confirm that X could be as early as this, but we have further evidence for this particular date. We see only 55 "Early Access release dates" tracked before that date, and 3,853 on or after that date. 2018-10-30 has only one record, whereas starting from 2018-10-31 we have a long pattern of multiple records per day, which is consistent with the observed number of new games being released on Steam throughout that year (8,045 according to my DB). Finally, Halloween typically marks a Steam sale, which adds a final sprinkling of confidence for this being the kind of date Valve would choose to silently launch a new feature on.
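The three hypotheses boil down to one check: a game should have an Early Access info box if and only if it left Early Access on or after X. Here is a minimal Python sketch of that check. The `Game` record, its field names, and the two "Hypothetical" rows are made up for illustration and are not my actual database schema or queries.

```python
from datetime import date
from typing import List, NamedTuple


class Game(NamedTuple):
    name: str
    left_early_access: date  # day the game graduated to full release
    has_ea_info_box: bool    # store page shows an "Early Access release date"


def contradictions(games: List[Game], cutoff: date) -> List[Game]:
    """Return the games that break the theory for a candidate cutoff X.

    Theory: a game is tracked if and only if it left Early Access on or
    after the cutoff. When it entered Early Access does not matter.
    """
    return [g for g in games if (g.left_early_access >= cutoff) != g.has_ea_info_box]


sample = [
    Game("From the Depths", date(2020, 9, 6), True),        # dates from its store page
    Game("Hypothetical Game A", date(2016, 4, 1), False),   # made-up row
    Game("Hypothetical Game B", date(2018, 10, 31), True),  # made-up row
]

print(contradictions(sample, date(2018, 10, 31)))  # prints [] for this sample
```

An empty result only means the candidate date is consistent with the sample. It does not prove the date is right, which is why the per-day record counts matter as a second line of evidence.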

That gives us a gap between 2013-03-20 and 2018-10-31, a little more than five and a half years. If a game was released before that window, we know it can't have been Early Access. If a game was released after it, we can just check if it has an Early Access transition date listed on its lower info box; if it doesn't, it was never in Early Access.

But we still need to contend with the fact that the release date field in the upper info-box itself is unreliable. What if a game was "originally" released off of Steam in 2004, and then re-released on Steam in 2015? Early Access aside, we wouldn't be able to know.

Luckily we can check several other sources of information.

Steam provides a review graph (and matching API) that lets you query every review that has ever been posted for a game. So you can just check if a game's review history dates to before the game's current stated release date:

Even better, individual reviews are marked with whether they were written during Early Access. All it takes is one:

This is proof positive that Overgrowth is a former Early Access game, and the date of the earliest E.A. review gives the latest possible start date for its Early Access period.
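As a rough sketch, scanning a game's reviews for the earliest Early Access one could look like this in Python. It assumes each review is a dictionary with a `timestamp_created` Unix timestamp and a `written_during_early_access` flag, modeled on the shape of Steam's public review responses. Treat those field names as assumptions and check them against the live data before relying on them.

```python
from datetime import datetime, timezone
from typing import Iterable, Mapping, Optional


def earliest_early_access_review(reviews: Iterable[Mapping]) -> Optional[datetime]:
    """Return the UTC time of the earliest review flagged as Early Access.

    Returns None when no review carries the flag, which by itself proves
    nothing: the game may simply never have been reviewed during E.A.
    """
    stamps = [
        r["timestamp_created"]
        for r in reviews
        if r.get("written_during_early_access")  # assumed field name
    ]
    if not stamps:
        return None
    return datetime.fromtimestamp(min(stamps), tz=timezone.utc)


# Made-up sample reviews, not real Steam data.
sample_reviews = [
    {"timestamp_created": int(datetime(2017, 10, 20, tzinfo=timezone.utc).timestamp()),
     "written_during_early_access": False},
    {"timestamp_created": int(datetime(2014, 2, 1, tzinfo=timezone.utc).timestamp()),
     "written_during_early_access": True},
    {"timestamp_created": int(datetime(2013, 12, 20, tzinfo=timezone.utc).timestamp()),
     "written_during_early_access": True},
]

print(earliest_early_access_review(sample_reviews))
```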

(There is sometimes some occasional weirdness with the review API which we'll address later, but it's usually not a problem).

But this isn't the only thing we can check. The amazing site IsThereAnyDeal.com tracks historical prices on games dating back to 2010, and the site maintainer has generously allowed me to use their data to augment my database. Here's the record for Overgrowth:

According to this, the earliest price listed for the game dates back to December 17, 2013, three days older than the first review. So that's a second data point that corroborates the review data.

If we take the stated release date, the date of the earliest review, and the date of the earliest IsThereAnyDeal price, we can make a pretty confident guess at the "earliest listing date." This is the first day that the game was originally available for purchase (or download, if free) on Steam, or close to it.
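One simple way to combine the three dates is to take the earliest one that is known. The sketch below does exactly that. Taking the plain minimum is an assumption made for illustration: it ignores the pre-order and anomalous-review cases discussed later, and the function name is hypothetical.

```python
from datetime import date
from typing import Optional


def earliest_listing_date(
    stated_release: Optional[date],
    earliest_review: Optional[date],
    earliest_price: Optional[date],
) -> Optional[date]:
    """Best guess at the first day a game was listed: the earliest known date."""
    known = [d for d in (stated_release, earliest_review, earliest_price) if d is not None]
    return min(known) if known else None


# Overgrowth: stated release Oct 16, 2017; first price Dec 17, 2013;
# first review three days after the first price.
print(earliest_listing_date(date(2017, 10, 16), date(2013, 12, 20), date(2013, 12, 17)))
# prints 2013-12-17
```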

Now, you might be thinking, "Hey Lars, why not just google 'when did Overgrowth launch in Early Access'? Or, why not just go to the news page, scroll to the bottom, and look for the answer there? Or check archive.org?"

That will certainly work, but it doesn't scale. There are well over 55,000 Steam games in my database and more being released every day. When I started this journey I had over 42,000 ambiguous cases to identify:

All of the above methods are good ways to check against ground truth, however, so if I wind up making any bad calls it will at least be possible in theory to identify some of them. Another source of truth could be tucked away somewhere in SteamDB.info's records, but I'm not sure how easy that is to query at scale.

Divide and Conquer

The good news is we now have nearly all the tools we need. For every game on Steam we now know the following information:

  • Is it released now
  • Is it Early Access now
  • The date of its earliest review
  • The date of its earliest price
  • When it claims to have been released
  • Our best guess for its earliest listing date
  • If it has an Early Access review, and if so, the earliest date for such a review

Looking only at games that are both released right now (not "coming soon") and classified as "Not currently Early Access, but former status is unknown", we can apply the following rules:

A) Has at least one Early Access review                 FORMER E.A.
B) Has an Early Access info box                         FORMER E.A.
C) Listed before March 20, 2013                         NEVER E.A.
D) Listed after October 31, 2018, with neither A nor B  NEVER E.A.
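The four rules translate fairly directly into code. Below is a minimal Python sketch under a few assumptions: the names are hypothetical, rules A and B are checked first so that positive evidence always wins, and anything that falls through is left as unknown.

```python
from datetime import date
from enum import Enum

EA_PROGRAM_START = date(2013, 3, 20)    # Early Access launches on Steam
EA_TRACKING_START = date(2018, 10, 31)  # my estimate of cutoff X


class Status(Enum):
    FORMER_EA = "former E.A."
    NEVER_EA = "never E.A."
    UNKNOWN = "unknown"


def classify(listing_date: date, has_ea_review: bool, has_ea_info_box: bool) -> Status:
    """Apply rules A-D to a released game that is not currently Early Access."""
    if has_ea_review or has_ea_info_box:   # rules A and B
        return Status.FORMER_EA
    if listing_date < EA_PROGRAM_START:    # rule C
        return Status.NEVER_EA
    if listing_date > EA_TRACKING_START:   # rule D (A and B already ruled out)
        return Status.NEVER_EA
    return Status.UNKNOWN                  # the ambiguous window


print(classify(date(2015, 6, 1), False, False))  # prints Status.UNKNOWN
```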

That identifies a wide swathe of titles whose status we can be pretty confident in. But we still have a few cases which are ambiguous. These are games where the following holds true:

  • Released between March 20, 2013 and October 31, 2018
  • Has no Early Access reviews
  • Has no Early Access info box

So under what circumstances could games in this status be former Early Access games that we still haven't detected? They would have had to release between the two dates, never gain a single review during their entire Early Access period, and then launch before their first review.

There are three basic categories here:

  • Games with no reviews at all
  • Games with extremely few reviews (so the slow rate of reviews could have plausibly missed a hypothetical hidden E.A. period)
  • Games with an extremely short E.A. period (so it's plausible there was no time to catch a single review)

In the first case (no reviews at all) there's genuinely no way to know based on the available data other than brute force. The game could have entered and left E.A. without leaving any trace I can detect. The only way to rule these in or out would be to look at the text of their update information manually, or scrape Archive.org. Given these are extremely low performing games that's not high on my priority list.

In the second case (extremely small amounts of reviews) it's a bit ambiguous, but I can narrow the field a bit – typically a game's biggest launch is its Early Access launch – some games (like Hades) will enjoy a "second launch spike" when they graduate from Early Access, but most games won't. Therefore, if a game currently has some reasonable threshold of reviews X, and was in Early Access for some reasonable length of time Y, it becomes increasingly implausible that given its current popularity it wasn't able to gain even a single review during a hypothetical Early Access period. Keep in mind I gather all reviews, even those redeemed with Steam Keys or hidden by default because of review bomb activity. The developer didn't even get a single friend to review the game for all of E.A., but the game is reasonably popular now? The most likely explanation is that there simply was no E.A. period for these games. Now I just have to define reasonable thresholds X and Y. I think 1,000 reviews is probably a nice conservative figure for X, and as little as 30 days might be reasonable for Y.

Punting on that for now, we consider games with an extremely short window for an E.A. period to be hiding in. We can detect this by comparing the distance between the first price for a game and the first review. If it's only a day or so, that's strong evidence that the game was reviewed almost immediately. If that earliest review is not marked as Early Access, it's as close as there is to a sure bet the game was never in Early Access. Now what if the gap wasn't one day, but three days? Four days? Five days? I decided to stop at seven. If the gap between first price and first review is a week or less, I rule out the possibility of a hidden E.A. period where we just weren't lucky enough to catch a review in time. Besides, if some random game somehow managed to actually have a tiny E.A. period that lasted for a week or less without catching a single review, it's reasonable to dismiss that as not having been in Early Access at all for my purposes.
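The seven-day rule can be sketched as a small predicate. The function and parameter names are hypothetical, and treating a review that predates the first price as "can't rule anything out" is my own assumption for this sketch, since those anomalous cases need separate handling.

```python
from datetime import date, timedelta

MAX_GAP = timedelta(days=7)  # the cutoff chosen above


def rules_out_hidden_ea(first_price: date, first_review: date, first_review_is_ea: bool) -> bool:
    """True when the first review came so soon after the first price that a
    hidden, review-less Early Access period is not plausible."""
    if first_review_is_ea:
        return False  # the game is a known former E.A. title, nothing to rule out
    gap = first_review - first_price
    # A negative gap (review before price) is anomalous, so don't rule anything out.
    return timedelta(0) <= gap <= MAX_GAP


print(rules_out_hidden_ea(date(2015, 6, 1), date(2015, 6, 4), False))  # prints True
print(rules_out_hidden_ea(date(2015, 6, 1), date(2015, 7, 1), False))  # prints False
```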

Whenever I make these more "fuzzy" rulings that could be wrong in theory, I flag them as such in my database along with the reasoning I used, so that I can undo them later if someone points out a flaw in my methods:
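In code, a flagged ruling might be modeled roughly like this. It is an illustrative sketch only: the `Ruling` record, its field names, and the sample rows (including the app IDs) are hypothetical and do not reflect my actual database schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Ruling:
    app_id: int
    status: str     # e.g. "never E.A." or "former E.A."
    reasoning: str  # which rule produced the ruling, in plain words
    fuzzy: bool     # True when the rule could be wrong in theory


def rulings_to_revisit(rulings: List[Ruling]) -> List[Ruling]:
    """Return only the fuzzy rulings, i.e. the ones worth re-checking
    if a flaw turns up in the method that produced them."""
    return [r for r in rulings if r.fuzzy]


# Made-up rows for illustration.
rulings = [
    Ruling(1001, "former E.A.", "has an Early Access review", fuzzy=False),
    Ruling(1002, "never E.A.", "first review within 7 days of first price", fuzzy=True),
]

for r in rulings_to_revisit(rulings):
    print(r.app_id, r.reasoning)
```

Storing the reasoning next to the ruling means a whole class of decisions can be undone in one pass if the rule behind it turns out to be flawed.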

With all that said and done, how many ambiguous cases do we have left?

That's honestly pretty good. Yeah 3K is still a lot, but we've reduced our overall uncertainty by 93%. And from the looks of it a huge portion of the remaining ambiguous cases are low performing games that simply have no reviews at all. To reduce uncertainty to zero we'll have to move on to more expensive methods like textual analysis of news pages or scraping archive.org, but I think this is a good stopping point for now as we've likely captured most of the games people want to know about, and we can address the rest later.

What about mistakes?

How confident are we that all our rulings have been accurate? I'm pretty confident, but I would be shocked if I didn't get at least a few of these wrong out of tens of thousands of games. Is it reasonable to just charge forward like this?

For my purposes, yes. Having this data properly coded lets me do analysis at scale, which isn't going to be impacted by a few mis-called titles; and besides, the status quo is to just naively trust the stated release dates and what's immediately discernable from Steam store pages, which is going to be far more error prone.

Further, when I drill down to a specific game for analysis purposes I typically double check the Early Access status manually (and update it in the DB if necessary). There is the chance I could be misinforming someone on a particular game, but the best I can do there is publish this article and ask for people's corrections and suggestions and slowly improve over time.

With all that said, let's enumerate all the possible sources of errors:

IsThereAnyDeal could be wrong
I could have faulty price history data. Discrepancies like this should stick out if e.g. a price history goes back years before the first review as well as the stated release date, especially for a game with a decent number of total reviews. I haven't checked for that case explicitly but it's on my radar.
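A check for that case could look something like the sketch below. The thresholds (a one-year lead and at least 50 total reviews) are placeholder assumptions rather than tuned values, and the function name is hypothetical.

```python
from datetime import date, timedelta


def suspicious_price_history(
    first_price: date,
    first_review: date,
    stated_release: date,
    total_reviews: int,
    max_lead: timedelta = timedelta(days=365),  # assumed threshold
    min_reviews: int = 50,                      # assumed threshold
) -> bool:
    """Flag games whose price history starts long before both the first
    review and the stated release date, despite a decent review count."""
    earliest_other = min(first_review, stated_release)
    return total_reviews >= min_reviews and (earliest_other - first_price) > max_lead


# Made-up example: prices start in 2011, but nothing else appears before 2015.
print(suspicious_price_history(date(2011, 3, 1), date(2015, 5, 2), date(2015, 5, 1), 400))
# prints True
```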

Steam's review API could be wrong
There are at least three different ways to gather review data from Steam – the summarized number of reviews on the top info box on the store page, the API used to build the review graph on store pages, and the API that fetches individual reviews one by one. Steam also has various filters for getting "filtered" reviews (excluding review bombs and reviews where the player used a Steam key or received the title as a gift rather than e.g. paying for the game). Usually the data from all sources is reconcilable, but it can be off slightly sometimes. This could lead to me misjudging an "earliest review" date. I've also detected a few anomalous cases where the earliest reviews for a game were somehow posted long before the game's first price or stated release date, which I don't fully understand. Further complicating matters, before Steam's review system was launched (November 16, 2013) there was a "recommendation" system, and it seems all the personal "recommendations" were converted into "reviews" at some point (with the original dates retained). And if Steam didn't start tagging Early Access reviews until a certain date, that undoes a bunch of my prior assumptions.

Pre-orders, unfortunately, exist
I haven't dealt with them yet, but we know that certain games on Steam have been available for pre-order (Cyberpunk 2077, for instance). This means a price would be posted well in advance of any ability to play or review the game. So far I've treated the "first listing date" as the day the game was first available to play on Steam but pre-orders would split that into "first date you could play the game" and "first date you could pay for the game" which further complicates things. You could argue that most games that do pre-orders are AAA games which probably launch out of the gate without going into Early Access, but xPaw from SteamDB pointed out to me that you sometimes see bizarre edge cases in the wild like games that offer pre-orders on Deluxe Edition packages, with the Deluxe Edition being under Early Access. So there's that.

I haven't yet thought about how to detect if a game was formerly available for pre-order after it launches for real, but I will have to go back and deal with this case eventually.

My scrapers could have bugs
Web scraping, be it from APIs or directly from web pages, is inherently fragile. If one of my sources updates their HTML structure or changes the inputs to an API endpoint it can force me to rewrite scripts or lose data in the interim. Also, the amount of data I need to gather can be staggering, so I try to stay on all the platforms' good side – I respect rate limits and aggressively cache information on my server to minimize calls. For third party databases I always try to reach out and explain what I'm looking for and offer to make a donation to cover the cost of the bandwidth I'm consuming before I start gathering data. But the bottom line to all this is I constantly find little bugs in my stack that I have to go back in to fix. I've organized my database in a flexible and robust way so that if I lose a few days' data I can generally back-fill it from external sources, and I have multiple ways to double-check my internal results for consistency, but it will never be perfect.

My assumptions could be wrong
Almost all of the above methods rely on trusting some assumption. If any of those are wrong, that would undermine my conclusions.

The good news is that any individual game is fairly easy to check by hand, so this isn't one of those situations where it's impossible to know in principle if I got something wrong.

What Next?

Now that we've (mostly) worked out Early Access status, the next step is to identify clear Early Access launch and graduation dates for all former Early Access games.

We've gathered some of these already from store pages that have an explicit Early Access info box, but a) we don't know how accurate those "Early Access launch date" fields are, and b) we are missing that data for many games in our data set.

The next step is to use the IsThereAnyDeal price data along with timestamps from user reviews to put an upper bound (a latest possible date) on all the missing Early Access launch dates, but we'll need more robust methods to nail down every last one of them. To get to confident ground truth we'll probably have to scrape Steam pages on Archive.org and ask other 3rd-party Steam database maintainers to weigh in.

When that's all been done, we'll finally be able to say with confidence how many days each and every game has actually been available on Steam for.