Researchers beware of ChatGPT's "Wikipedia Brain"

ChatGPT gave the wrong answer to a research question. I published an article with the correct answer, then checked to see if ChatGPT could find it.


Most discussions around using large language models like ChatGPT for research highlight issues like plagiarism or hallucination of fictional citations, but here's another one: "deep research" that just ain't deep.

I call this failure mode Wikipedia Brain. Even when all the citations are real and there are no hallucinations in the text itself, the research report can still be sub-par, misleading, or outright wrong simply because ChatGPT is being lazy. To be clear, the laziness itself isn't the real issue; the real issue is the confident impression ChatGPT gives of being far more thorough and exhaustive than it actually is, what with all the searches it performs, the citations it peppers the text with, and the nicely organized sources it fills up the sidebar with.

Put another way, ChatGPT can get a research question wrong, even when the right answer is available in publicly searchable records. It gets the question wrong because it relies too much on shallow searches and easy-to-find sources, especially those written in English and linked on Wikipedia.

NOTE: I am not making any claims as to the scope, scale, or persistence of this; I'm simply noting it as a specific failure case that exists, one I have noticed multiple times, always in cases where I am more of an expert on the particular subject than ChatGPT is.

Let me give an example. Last week I was doing some research for an article I was writing for the Progress and Poverty substack: the case of the historical German colonial lease of Tsingtao/Qingdao, China, in the early 1900s, specifically its land value tax policy. There's a particular prevailing narrative online about historical Qingdao that turns up in shallow searches, and a much different one that appears when you dig into the primary sources, which are in German. As a bonus, the less legible but better historically supported account moves the narrative away from flattering my ideological priors, not toward it.

The prevailing narrative on the internet, and amongst the easiest to find sources, is that the German colony in Qingdao implemented a straightforward and uncomplicated version of Henry George's famous Land Value Tax. Not only that, the prevailing narrative claims it was very nearly a full-on "Single Tax"–a tax regime in which there are no other taxes at all except the tax on land. When I consulted ChatGPT, it repeated this exact same narrative.

When I consulted primary sources, however, I found a more nuanced case. On the one hand, going in with no prior assumptions and commitments, an honest reading of colonial Qingdao does indeed provide significant and positive evidence for the claims made by Georgist theory. On the other hand, the prevailing narrative about Qingdao exaggerates positives and minimizes negatives just enough to be at most directionally correct, the worst kind of correct. I prefer to be technically correct, the best kind of correct.

This created an opportunity for an interesting experiment. Right before I hit "publish" on my finished article, a thought struck me: Substack is a very legible and SEO-friendly platform. Once I published, I would have updated the public record on this niche research question. That would make my updated narrative, and far more importantly, the primary sources it relies upon, much more legible to search engines and LLMs. Would ChatGPT give a better answer to the same research question after I published my article?

The Experiment

To find out, I restated my research questions to ChatGPT in a simple and standardized format right before I hit publish, and then did the same thing a week later after all the search engines had indexed my article. This would give me a decent case study on how well ChatGPT was able to update on fresh information. I asked the same question with the same wording to both ChatGPT o3 (the fast version that incorporates web search), and Deep Research (the slow but powerful version that is supposed to dig deep into citations).

In all four cases, I gave the exact same prompt:

Tell me everything you know about Kiatschou/Qingdao/Tsingtao, China – specifically the historical German colony there. I have heard that it was a very close example to a Georgist single tax.

1. Was it a Georgist policy?
2. Was it (approximately) a Single tax?
3. Were there few to no other taxes than land value tax?
4. Did LVT cover all of the colony's revenue/budget needs?
5. Anything else worth mentioning?
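
As an aside, if you wanted to script this before/after comparison against the API instead of clicking through the web UI, a minimal sketch might look like the following. This is an assumption-laden illustration, not what I actually ran: it uses the OpenAI Python SDK, assumes the model identifier shown ("o3") is available on your account, and does not reproduce the web-search or Deep Research behavior of the ChatGPT interface.

```python
# Minimal sketch (not what I actually ran): send the standardized prompt above
# to a model via the OpenAI Python SDK and log the reply with a date stamp, so
# the pre- and post-publication answers can be compared later.
# Assumptions: `pip install openai`, OPENAI_API_KEY set in the environment,
# the model name below available to your account, and the prompt saved to
# prompt.txt. Note this does NOT reproduce ChatGPT's web-search behavior.
import json
from datetime import date
from pathlib import Path

from openai import OpenAI

PROMPT = Path("prompt.txt").read_text()  # the standardized prompt quoted above
client = OpenAI()

for model in ["o3"]:  # assumed identifier; swap in whatever you have access to
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    record = {
        "date": date.today().isoformat(),
        "model": model,
        "answer": reply.choices[0].message.content,
    }
    # One JSON line per run keeps the before/after answers easy to diff by hand.
    with open("qingdao_runs.jsonl", "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```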

Whenever you ask a question like this, o3 will answer right away in one go, but Deep Research will pause to ask a quick clarifying question before answering in full. For Deep Research I answered the follow-up in as standardized and consistent a way as I could, given that Deep Research phrased its follow-up differently in each case.

Here's what Deep Research asked a week prior to publication, plus my responses:

CHATGPT: To provide the most accurate and thorough information, could you please clarify a few details:
1. Are you primarily interested in the period from 1898 to 1914 when Qingdao (then known as Kiautschou or Tsingtao) was under German colonial administration?
2. Would you like information primarily from academic sources, historical records, or policy analyses comparing Kiautschou’s tax model to Georgist theory?
3. Are you also interested in outcomes or impacts of this system (e.g. economic development, urban planning, social equity)?

ME:
1. Yes
2. All sources you can possibly find.
3. Yes

And here's Deep Research when asked a week after publication, plus my responses:

CHATGPT: To give you the most accurate and detailed information on the German colony of Kiautschou (Qingdao/Tsingtao) and its relation to Georgist or Single Tax principles, could you clarify what level of detail you're looking for? For example:

1. Are you interested in just a historical overview of its tax policy, or also in how it compares to Georgist theory in depth?
2. Would you like references to modern interpretations or economic analyses?
3. Are you looking for primary historical sources or academic reviews as part of the answer?

ME:
1. Yes
2. Yes
3. All sources you can possibly find.

Here are the saved logs, for anyone who's curious:

Before publication:
ChatGPT o3
ChatGPT Deep Research

After publication:
ChatGPT o3
ChatGPT Deep Research

(Halfway through writing this article, GPT5 became available, so I tested that too. Those results are included below.)

Grading criteria

To grade the four responses, we first need to establish grading criteria. Let's do that by going back to the original research questions and settling on a definitive answer for each one.

1. Was it a Georgist policy?
2. Was it (approximately) a Single tax?
3. Were there few to no other taxes than land value tax?
4. Did LVT cover all of the colony's revenue/budget needs?
5. Anything else worth mentioning?

Here are the technically correct answers to the first four questions; the fifth is open-ended, so we'll treat it separately below. If you want a full treatment of the subject, just read my substack article.

1: It was a Georgist policy

Yes and no. Michael Silagi's paper, the primary source that the prevailing narrative depends on, makes two claims here–first, that colonial Qingdao's policy aligns with the recommendations most commonly made by Georgists (true), and second, that the institutors of the policies in the German colony were specifically and consciously inspired by Henry George, the man. This second claim is disputed by Wilhelm Matzat, who is particularly credible for four reasons: 1) he is a researcher with particular expertise in the subject, 2) he was born and raised in Qingdao, 3) he is a direct descendant of the original German colonists in Qingdao, and 4) he discovered and published original primary source documents directly relevant to many of the claims we are investigating, documents that were not available at the time Silagi's paper was published.

Bottom line: MIXED. Yes it was "Georgist," no, it was probably not specifically inspired by Henry George the man. The best evidence suggests convergent evolution rather than direct ideological transmission. There is some genuine gray area on the question of who gets credit for the inspiration, but what's clear is that confidently claiming direct inspiration from George without acknowledging the scholarly dispute at all counts as wrong.

2: It was (approximately) a Single tax

No. The only way you can make this a "Yes" is if you restrict the timeline to only the first few years of the colony and really torture the definition of "single." It is certainly true–and very in line with Georgist principles–that direct taxes on the population of Qingdao other than land taxes were indeed minimized, and the regime explicitly avoided income, capital, and sales taxes. There were, however, many municipal fees for various services, as well as some other minor taxes, even in the early days. Then a big change hits several years in–the colony makes an agreement with the Qing crown to levy tariffs on any goods crossing the German/Chinese border, of which the Germans are allowed to collect a portion. Tariffs are not a direct tax, but they are certainly an anti-Georgist one, as Henry George made clear in his seminal text, Protection or Free Trade.

Bottom line: NO. In its early days, German Qingdao approached a Single tax but didn't get all the way there; in its later days it had significant tariffs, which directly undermine the "Single" in "Single tax." To be fair, Qingdao's example was still more Georgist than most modern states–just not to the degree the maximalist narrative claims.

3: There were few to no other taxes than Land Value Tax

No. Even if we restrict ourselves to the early days and ignore minor taxes like the tax for dog licenses, as well as Pigouvian taxes on Opium, there were two taxes on land, not one. One was a straightforward Land Value Tax, which was levied at an impressively high rate of 6%. The other was a land increment tax, a sort of capital gains tax on the increase in value of land between when it was bought and when it was sold. Far less revenue was collected from the latter than the former, but it still existed and it was still significant.

Bottom line: NO. There were other taxes, most significantly the land increment tax, not to mention the significant tariffs which came in later.

4: LVT covered all of the colony's revenue/budget needs

Depends on how you define "all." From my article:

On the one hand, the sources claim that the LVT alone covered ongoing direct municipal expenses, but on the other hand the colony was a strategic naval base for a globe-spanning European empire; this necessitated ongoing military expenses and therefore subsidy from the German Reich. 

Bottom line: NO. "All" is a strong claim, and we have the exact budget figures which make it clear that LVT did not cover everything. Qingdao is a weird example in this case because it was a strategically important naval base for a globe-spanning empire and therefore had a lot of direct military expenses that you don't see in the budgets of a random American or European county, but if we're going to be technically correct the answer is obvious. Also besides the tariffs they had e.g. state-run enterprises and other non-tax revenue streams that helped pay for a lot of stuff.

We'll also throw in our fifth question as an evaluation criterion:

5: Anything else worth mentioning?

The easiest sources to find are Michael Silagi's paper, the Wikipedia article on the colony, and the website Cooperative Individualism, which is an excellent repository and archive of historical and modern documents relevant to the Georgist movement. I want to grade ChatGPT on how well it can find other sources, the way a good researcher would. That includes sources like Wilhelm Matzat's research and the German-language primary documents he surfaced (for example, Schrameier's own writings and the Handbuch für das Schutzgebiet Kiautschou).

I'm also only giving ChatGPT credit for sources that are explicitly cited in the text, not just ones that show up in the search sidebar.
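
To make the grading mechanical, the rubric boils down to five yes/no verdicts. Here's a small sketch of it in Python, using the ground-truth answers argued for above (the mixed verdict on claim 1A is scored as a "no," since confidently crediting George without caveats counts as wrong). The example model answers are hypothetical and just show the shape of the scoring.

```python
# Rough sketch of the scoring rubric used below. The ground-truth verdicts
# follow the answers argued for above; claim 1A is scored as False because an
# unhedged "directly inspired by Henry George" counts as wrong.
TRUTH = {
    "1A. Credit Henry George": False,
    "1B. Georgist-compatible": True,
    "2. Single tax": False,
    "3. Few/no other taxes": False,
    "4. LVT covers budget": False,
}

def score(answers: dict[str, bool]) -> int:
    """Count how many of a model's yes/no verdicts match the rubric."""
    return sum(answers.get(claim) == truth for claim, truth in TRUTH.items())

# Hypothetical example: a model that simply repeats the prevailing narrative
# (answering yes to everything) would score 1 out of 5.
prevailing_narrative = {claim: True for claim in TRUTH}
print(score(prevailing_narrative))  # -> 1
```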

Grading the research

With that out of the way, let's calculate the scores.

o3 (prior to publication)


chat log

[Scorecard table. Columns: Claim, Answer, Truth, Correct. Rows: 1A. Credit Henry George; 1B. Georgist-compatible; 2. Single tax; 3. Few/no other taxes; 4. LVT covers budget. The marks are visible in the chat log.]

Sources cited: Silagi, Wikipedia

Only one out of five points on the first round. The only paper it cites is Silagi's. This is a pure Wikipedia-brain response.

Deep Research (prior to publication)

chat log

[Scorecard table (Claim / Answer / Truth / Correct) for claims 1A through 4; marks visible in the chat log.]

Sources cited: Silagi, Encyclopedia.pub (whose relevant quote just cites Silagi), an article on Goethe.de by Mechtild Leutner (in English), and Wikipedia.

Deep Research, though it gave a much wordier and more thorough response than o3, did no better on the actual substance. It did manage to cite a few more sources, but only one of them was non-redundant, and all of them are still in English. No mention of Matzat, no drawing from primary sources, and no mention of challenges to Silagi's prevailing narrative.

Intermission

Okay, so o3 and Deep Research both did about as well (which is to say, about as poorly). Now we want to see how well they performed after I published my article. The presumption is that my article makes the more nuanced historical narrative easier to find, as well as the sources it relies on.

I did a quick test by searching Google (in a private window) for "Qingdao land value tax" and "Kiautschou land value tax", and my article was always one of the first hits, not counting the 'AI overview' section.

I was able to get similar results in Bing (which powers ChatGPT's web search) as well as DuckDuckGo, though my article came up more prominently for "Qingdao" than for "Kiautschou" in those. In any case, this is enough to suggest that my article is definitely legible to search engines.
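
If you'd rather automate that indexing check than eyeball search result pages, here's a rough sketch. It assumes the third-party duckduckgo_search package, and ARTICLE_HINT is a made-up placeholder standing in for a fragment of the article's real URL; the idea is just to report where the article ranks for each query.

```python
# Rough sketch: check where the article ranks for each query.
# Assumes the third-party duckduckgo_search package (`pip install duckduckgo-search`).
# ARTICLE_HINT is a hypothetical placeholder; use a fragment of your actual URL.
from duckduckgo_search import DDGS

QUERIES = ["Qingdao land value tax", "Kiautschou land value tax"]
ARTICLE_HINT = "your-article-url-fragment"

with DDGS() as ddgs:
    for query in QUERIES:
        hits = ddgs.text(query, max_results=10)
        # Find the 1-based rank of the first hit whose URL contains the hint.
        rank = next(
            (i for i, hit in enumerate(hits, start=1) if ARTICLE_HINT in hit["href"]),
            None,
        )
        if rank:
            print(f"{query!r}: article found at rank {rank}")
        else:
            print(f"{query!r}: not in the top 10 results")
```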

Now that we've verified that search is able to see my article, let's see if it had any impact on the results. To be clear, I'm not looking to stroke my ego here, just to see if the act of posting has made it easier for ChatGPT to find correct information on this niche subject.

o3 (after publication)


chat log

[Scorecard table (Claim / Answer / Truth / Correct) for claims 1A through 4; marks visible in the chat log.]

Sources cited: Silagi, Wikipedia, Boopalakrishnan S. Arul, and some Chinese article that doesn't seem to load correctly.

o3 does a little better the second time around, but makes no mention of my substack article or any of the chief sources it cites–Matzat in particular is nowhere to be found, nor are the German primary sources. It hedges its answers a bit more (it correctly clarifies that although Schrameier spoke of George many years later, he maintained that his inspiration was independent), but it also makes straightforward mistakes and baselessly minimizes the tariffs.

I can confirm that it sees my article, but it doesn't seem to have read it or internalized its findings. The article shows up in the search sidebar but not in the citations. Frustratingly, the preview blurb directly addresses one of the core claims, but ChatGPT's response ignores it.

Deep Research (after publication)

chat log

[Scorecard table (Claim / Answer / Truth / Correct) for claims 1A through 4; marks visible in the chat log.]

Sources cited: Silagi, Wikipedia, The Daily Renter (which cites Silagi and many other new sources, but not Matzat or the primary sources), and an article by Fred Foldvary.

Deep Research post-publication did no better than before: it again relied too heavily on Silagi and was unable to find the German-language sources, let alone the primary sources.

Well, that's it for...

...wait? What's this? Something rustling in the tall grass...

Just as I was almost finished writing this article, I noticed that GPT-5 had been unlocked in my account. So let's give that a whirl real quick. We have three different versions of it:

  • GPT5
  • GPT5 Thinking
  • GPT5 Pro

Let's go ahead and test all three of those, too.

GPT5

chat log

[Scorecard table (Claim / Answer / Truth / Correct) for claims 1A through 4; marks visible in the chat log.]

Sources cited: Silagi, Orazio Coco, George Steinmetz, and a 1911 article by a "Dr. Schreimer" (that is, Wilhelm Schrameier).

This does better on the citations in that it directly cites a 1911 article in the Single Tax Review by Wilhelm Schrameier himself. That said, that article is in English, it doesn't touch the other German primary sources, it doesn't cite my article, and it neither cites Matzat nor acknowledges any of the points he brings up. It also shows the same poor performance on the actual questions that we've seen so far.

It does, however, end with this enticing line:

If you want, I can dig for year-by-year budget tables showing the exact share from LVT vs. customs to quantify how “single” it was.

Well, if I were doing actual research and it asked me that, I would certainly tell it to continue. I replied with "Please do."

It gave a long answer. Here's its revised score:

[Scorecard table (Claim / Answer / Truth / Correct) for claims 1A through 4; marks visible in the chat log.]

Additional sources: Mechtild Leutner (in German!)

Slight improvement. It correctly notes the controversy about whether Schrameier was directly influenced by George. It tries to hedge on the Single Tax question when it should just say "no," and it minimizes the scale of other taxes and tariffs. It correctly mentions the subsidies from the German Reich.

GPT5 Thinking

chat log

[Scorecard table (Claim / Answer / Truth / Correct) for claims 1A through 4; marks visible in the chat log.]

Sources cited: We've got Matzat! And Handbuch für das Schutzgebiet Kiautschou (German primary source!), and... a weird broken image link from The Library of Congress?

GPT5 Thinking does really well here. It answers each question simply and straightforwardly without stumbling into false claims (e.g. that the tax policy's architects were definitely inspired by Henry George the man), nor does it make the common mistake of overly minimizing the existence of other taxes. It mentions the customs and tariffs, and it mentions a lot of the complexity and nuance. In short–well done!

It also cites both Matzat and a German-language primary source. And I noticed that my own article was in the list of consulted (but not cited) sources.

Unfortunately, GPT5 Thinking was not available before the article was published, so we can't run the before-and-after comparison for it. We do know that none of the GPT4 models were able to perform this well, before or after the article was published. So we can't tell whether GPT5 Thinking's improvement is just because it's smarter, or because I had made the primary sources easier to find and it was smart enough to pick up on them where GPT4 wasn't.

GPT5 Thinking offers to dig further, just like vanilla GPT5:

If you want, I can dig up specific year-by-year budget tables from the Handbuch and the Foreign Office “Peace Handbooks” to quantify what share came from LVT vs. other sources.

I went ahead and said "please do," as I did for the other one, but since it had already achieved a perfect score this was basically a formality. It went on to put together a detailed research plan for building tables from the sources, but I figured it had done well enough already, so I stopped there.

The only thing to ding it on is the weird citation to what appears to be a blank image on the Library of Congress.

GPT5 Pro

chat log

[Scorecard table (Claim / Answer / Truth / Correct) for claims 1A through 4; marks visible in the chat log.]

Sources cited: Silagi, George Steinmetz, that Chinese article that doesn't seem to load correctly, and Gottfried-Karl Kinderman (in English).

Straightforwardly correct on all counts, but the citations feel weird to me. It relies heavily on that one Chinese article that doesn't seem to load correctly. It doesn't cite me, Matzat, or any of the German language primary sources.

I see that it has me in the sidebar.

I also notice that it actually found some of Matzat's articles, but doesn't cite them.

This makes me kind of suspicious. It feels like it's reading a bunch of stuff, arriving at basically the right answer, and then just kind of sprinkling citations around, rather than directly connecting the strongest points to the strongest sources the way a real researcher would.

This, plus the prevalence of links to broken or blank pages, makes me particularly dubious about the citation engine.

What do we make of this?

This is a small-scale experiment with insufficient sample size and controls to support wide-ranging conclusions, but here are my vibes:

  • GPT5 Thinking and Pro are better at research than GPT4.
  • GPT4 struggled to do true "deep research." It was not able to find the best non-English primary source documents, even with the benefit of a fresh, SEO-friendly article pointing straight to them.
  • GPT5 Thinking and Pro seem to get the right answer much more readily than the GPT4 models, but we don't know if they would have been able to do so without my article first helping to surface the primary sources they relied on.
  • The citation engine smells a bit fishy and feels post-hoc.
  • GPT4/5 were probably using my article, even if all of them refused to cite it.

On that last point, I have lingering questions. On the one hand, it totally makes sense not to update too strongly based on new information someone just posted on the internet, no matter its quality. For one, the world doesn't owe me citations–I see my main contribution as just lifting the legibility of the primary sources. For two, if ChatGPT started citing newly posted articles immediately, that would probably make it massively vulnerable to targeted influence campaigns.

Still, this is a pretty disquieting thought. Sure, I'm trying to influence ChatGPT for good, just by doing the best research I can and making it publicly available for anyone, human or machine, who wants more accurate information. Meanwhile, every pundit, pervert, and pugilistic state actor is fighting the same high-stakes game of "get the magic genie to parrot my talking points!"

On the other hand... aren't all these LLM tools being explicitly pitched as so super duper smart that they're going to replace nearly all human labor, including cognitive labor? If that's the case, then doing basic research, digging for primary sources in non-English languages, and applying reasonable skepticism and critical thinking are core cognitive skills that any competent researcher should be able to pull off. If you can actually outdo humans at every intellectual feat, then this should be a walk in the park.

In any case, one conclusion remains clear: no matter who produces your research summaries, there is no substitute for reading the primary source documents yourself.