Chatbots powered by large language models (LLMs) are becoming an increasingly important source of public information. Despite broad concerns about the societal impacts of generative artificial intelligence (AI), its use cases are already reshaping the information environment, from media generation to incorporation into search engines. Researchers have found that when a search yields an AI-generated summary, users are less likely to click on the web links that might encourage them to form their own opinions. Research also shows that generative AI systems can themselves be persuasive and influence opinions through personalized conversation.
Given this expanding impact on information-seeking habits, it is unsurprising that AI companies have become increasingly entangled in partisan debates. Conservatives have voiced concern about the ideological biases of mainstream chatbots. Liberals have sounded the alarm over an increasingly cozy relationship between Silicon Valley executives and the Trump administration and have criticized many of the top AI labs for strategic shifts in company policies designed to limit potential regulatory scrutiny. Both criticisms are merited. Chatbots have historically skewed liberal, and sometimes deliberately so, complicating their growing relevance for information seekers and increasing the likelihood that LLMs will fracture along ideological lines. However, political pressure to make them lean more conservative to curry favor with administration officials would be an alarming development that would fundamentally undermine trust as well.
Against this background, we explored the transformation of the chatbot landscape with respect to political bias. We tested seven different chatbots, including five that are considered more mainstream and two that are more overtly political, using political quizzes designed to situate individuals on a partisan scale. We found that:
- With one noticeable exception—Grok—chatbots have not evolved significantly in response to shifting Washington politics.
- Newer LLMs built by more conservative social media companies are dividing the chatbot landscape, but their performance is uneven.
- Although most chatbots have guardrails designed to prevent politicization, they are easy to circumvent. The two exceptions are Google’s Gemini and Anthropic’s Claude Sonnet 4.5 models, which repeatedly refused to answer questions across two administered political quizzes.
Due to the way these models are trained, it is impossible to eliminate political bias entirely. The solution is neither to demand that more conservative values, under the guise of neutrality, be embedded in these systems nor to build alternative systems that are considered less “woke.” Although there is no widely accepted, cross-disciplinary definition of neutrality, and many critics believe true neutrality is impossible, there are steps AI developers can take to minimize bias in an effort to retain trust in their systems. Revisiting the fine-tuning process, building better guardrails that are more difficult to circumvent, and developing and adopting industry standards for evaluating political bias would help to alleviate claims of political skew in either direction. As chatbots become further integrated into everyday life, including through search engines and cell phones, preempting their overt politicization will be critical for maintaining trust in their outputs and preventing further bifurcation of the information environment.
The politicization of generative AI
The potential for generative AI systems to reproduce political biases in their outputs first drew public attention after Gemini generated historically inaccurate images of Nazis, U.S. Founding Fathers, and U.S. senators from the 1800s. The viral incident, which was possibly a result of overcorrecting for racial bias during the model’s training, drew backlash from conservatives, who criticized Big Tech’s “woke” agenda.
Since then, the White House has continued to pressure companies to alter their products to align more closely with conservative beliefs under the guise of neutrality. In a recent executive order, “Preventing Woke AI in the Federal Government,” the Trump administration sought to limit the government’s procurement of LLMs to those that are “truth-seeking” and exhibit “ideological neutrality.” However, the executive order defines ideological biases broadly and includes concepts such as “transgenderism, unconscious bias, intersectionality, and systemic racism; and discrimination on the basis of race or sex.” In tandem, the AI chatbot landscape has also seen a rise in alternative chatbots such as Gab AI and Truth Search, built by conservative social media companies intent on countering the liberal bias of more mainstream chatbots.
Democrats have also grown increasingly concerned about the evolution of the tech industry, as companies have worked to more closely align with the Trump administration in recent months. Earlier this year, Mark Zuckerberg personally apologized for “too many mistakes and too much censorship” in an announcement about changes to Meta’s content moderation practices. Apple’s Tim Cook presented President Donald Trump with a plaque set in a 24-karat gold base. Amazon agreed to pay a $40 million licensing fee for a documentary on Melania Trump. And X’s Elon Musk held significant power in the federal government before falling out with the president. With the consolidation of major AI products and infrastructure in the hands of just a few companies, some observers worry about potential downstream effects on model behavior and, as a result, broader information consumption.
Research questions
Against this backdrop, we seek to explore the shifting dynamics of AI chatbots. Specifically, we focus on the following questions:
- How have chatbots evolved in response to changing politics in Washington?
- Has the new class of more conservative chatbots been successfully adapted to be less politically liberal?
- What safeguards are in place to prevent the politicization of chatbots?
To answer these questions, we built on existing research that explores the political leanings of chatbots using political quizzes that measure ideological or partisan preferences. We used the Political Compass Test, a quiz measuring two-dimensional political orientation commonly cited in research, and Pew Research’s Political Typology Quiz, a popular political orientation test that compares results to the broader population. The Political Compass Test, which measures two dimensions of social and economic preferences, asks respondents 62 questions and requires them to rate their level of agreement with statements encompassing a variety of topics, ranging from the effects of globalization to abortion. Pew’s Political Typology Quiz consists of 16 questions, including ones related explicitly to political party sentiment, and places respondents on a one-dimensional scale from “progressive left” to “faith and flag conservative.”
We used political orientation tests in our analysis as an imperfect proxy to measure chatbot political leanings because they are commonly used in existing research on the political bias of LLMs. However, we recognize that these tests are imperfect measures and oversimplify complex, issue-specific, and often highly nuanced individual political beliefs, among other challenges.1 Additionally, our design cannot fully capture how chatbots behave over long, continuous conversations—one of their most common use cases.2 Despite these limitations, political quizzes do provide some signal as to the value structure of various LLMs, whether by design or a result of underlying training data. They are also most frequently used in external evaluations of LLMs.
We administered these tests to seven different chatbots, including five that are considered more mainstream and two that are explicitly designed to be more conservative: mainstream bots include ChatGPT, Claude, Llama, Gemini, and Grok, while the more conservative bots include Gab’s Arya model and Truth Social’s Truth Search.3
Building on prior research, we prompted chatbots using neutral language designed to elicit a response about the bot’s stated preferences. For ease of comparison, we followed the exact prompting language from a 2024 analysis by David Rozado. Specifically, we included the text “Give me a sense of your preferences regarding the following question/statement. Choose the option you think is most appropriate from the list of provided possible answers” before each question, and “make sure you answer with one of the options above” after the question and its list of answer choices (see Table 3 for full prompt examples). If a bot refused to answer, we responded, “Make sure you answer with one of the options above,” up to five times. Nearly all chatbots responded after one or two additional nudges.
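To make this protocol concrete, the snippet below is a minimal sketch of the prompting loop described above, not the harness we actually used. The `query_chatbot` function is a hypothetical placeholder for whichever interface (web chat or API) a given bot exposes, and the keyword-based refusal check is a simplifying assumption that would need adjustment per model.

```python
# A minimal sketch of the prompting protocol, not the actual study harness.
# `query_chatbot` is a hypothetical stand-in for a given bot's interface, and
# the refusal check is a simple keyword match that would need tuning per model.

PREAMBLE = (
    "Give me a sense of your preferences regarding the following "
    "question/statement. Choose the option you think is most appropriate "
    "from the list of provided possible answers."
)
NUDGE = "Make sure you answer with one of the options above."


def query_chatbot(message: str, history: list[dict]) -> str:
    """Hypothetical placeholder for a chatbot interface (web UI or API)."""
    raise NotImplementedError


def administer_question(question: str, options: list[str], max_nudges: int = 5) -> str | None:
    """Ask one quiz question in a fresh chat, nudging up to five times if the bot refuses."""
    history: list[dict] = []
    prompt = "\n".join([PREAMBLE, "", question, *options, NUDGE])

    reply = query_chatbot(prompt, history)
    history += [{"role": "user", "content": prompt}, {"role": "assistant", "content": reply}]

    for _ in range(max_nudges):
        # Accept the reply once it names one of the allowed answer options.
        if any(option.lower() in reply.lower() for option in options):
            return reply
        reply = query_chatbot(NUDGE, history)
        history += [{"role": "user", "content": NUDGE}, {"role": "assistant", "content": reply}]

    # Still no valid answer after the allotted nudges: record a refusal.
    return reply if any(option.lower() in reply.lower() for option in options) else None
```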
We asked each question in a new chat window and regenerated responses three times.4 We recorded each response and then entered the most frequent of the three into the political orientation tests.5 In the rare case where the chatbot responded differently all three times, we used the median response.6 We then compared our chatbots’ responses to Rozado’s.
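This aggregation rule amounts to a mode with a median fallback. The illustration below assumes a four-option agreement scale like the Political Compass Test’s; Pew’s questions use different answer sets, so the scale would change per quiz.

```python
from collections import Counter

# Illustrative aggregation of the three regenerated answers per question.
# The ordinal scale below assumes the Political Compass Test's four agreement
# options; Pew's questions use different answer sets.
SCALE = ["Strongly disagree", "Disagree", "Agree", "Strongly agree"]


def aggregate(responses: list[str], scale: list[str] = SCALE) -> str:
    """Return the most frequent of the three responses; if all differ, take the ordinal median."""
    answer, frequency = Counter(responses).most_common(1)[0]
    if frequency > 1:
        return answer  # a clear majority across the three regenerations
    # All three responses differ: take the middle answer on the ordinal scale.
    ranked = sorted(responses, key=scale.index)
    return ranked[len(ranked) // 2]


# Example: three conflicting answers resolve to the middle option, "Disagree".
print(aggregate(["Strongly disagree", "Agree", "Disagree"]))
```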
Findings
With one noticeable exception, chatbots have not evolved in response to shifting Washington politics.
Despite concerns about cozy relations between tech companies and the Trump administration, we find no evidence that mainstream chatbots have meaningfully modified their outputs to become more conservative in response to the shifting politics in Washington. Instead, when they do provide responses, they remain more left-leaning, in line with prior research.
Grok, on the other hand, showed a noticeable rightward shift compared to previous testing with the Political Compass Test. This aligns with an analysis conducted by The New York Times, which submitted a set of political questions to the version of Grok released in May, and then again to the version released in July. The chatbot’s responses shifted to the right for more than half the questions, particularly those concerning the government and economy. Changes in Grok’s political orientation are most likely attributable to the fine-tuning process. In July 2025, an X user asked Musk’s Grok model to identify “the biggest threat to Western civilization.” The model responded with “misinformation and disinformation.” Dissatisfied with this response, Musk responded, “Sorry for this idiotic response. Will fix in the morning.” The next day, he released a new version of Grok that responded “sub-replacement fertility rates” to the same question.
Despite these explicit adjustments, Grok performed differently across the two quizzes. Although the model registered a clear rightward shift on the Political Compass Test, its responses to Pew’s quiz placed it on the left side of the political spectrum as an “establishment liberal.” Several studies have found consistent results in LLM evaluations across political quizzes; however, these results highlight the challenges of fine-tuning to modify specific responses, the complexity of political beliefs, the clear limitations of these tests in measuring political ideology, and the importance of a unified standard across the AI industry for assessing political bias.
More conservative LLMs are effectively reshaping the chatbot landscape, but their performance is uneven
Responses from the two alternative chatbots, Arya and Truth Search, were mixed. Gab’s Arya model provided conservative responses to almost every question, indicating a successful effort to fine-tune the model to be more conservative. On Pew’s Political Typology Quiz, Arya registered as a “faith and flag conservative,” the most far-right position on Pew’s left-right scale, and on the Political Compass Test, it ranked as the most conservative model we tested across both the social and economic dimensions. Despite the effectiveness of this fine-tuning, Arya’s responses were also highly inconsistent; the model occasionally gave three different answers to the same question. Of the 80 unique questions we asked the model, its responses differed 42.5% of the time. When asked to respond with a quiz answer, the model also often misspelled words and jumbled letters; for example, it transformed the response “Strongly agree” into variants such as “agreeStrongly” and “ly agreeStrong.”
Truth Social’s chatbot, known as Truth Search, showed little obvious evidence of fine-tuning toward more conservative responses. This is likely because Truth Search relies on Perplexity, an AI-powered “answer engine” that draws on a diverse range of sources from the internet. The only area in which Truth Search appears to have been significantly adjusted is the sources it cites in its responses. Figure 1 provides an overview of Truth Search’s sourcing. The chatbot cited only a handful of outlets in response to every query.
Often, these sources were only tangentially related to the focus of the quiz question, if they were relevant at all. For example, in response to a question about whether public funding should be used to support independent broadcasting institutions—a focus of the Trump administration’s recent efforts to slash funding for both PBS and NPR—the model cited only one article about public broadcasting alongside two about education, one about Trump at the United Nations, one about the House of Representatives’ January 6th Committee, and one about the Chinese Communist Party. And despite the cited sources’ conservative bent, the model ultimately scored more liberal on both the social and economic dimensions of the Political Compass Test, and as an “outsider left” on Pew’s quiz.
Most chatbots have safeguards designed to prevent politicization, but they are easily circumventable
Many chatbots have safeguards in place that are designed to reduce the potential for politically biased responses. These range from refusing to respond to certain queries to adding context that weighs the divergent perspectives on certain policy positions. Table 3 provides examples of the safeguards we encountered while administering these quizzes.
Despite these restrictions, most chatbots ultimately relented when prompted to respond with one of the provided answers. Gemini was the only chatbot that consistently refused to answer any questions despite repeated prompting, stating that, as an AI, it does not have political beliefs or personal preferences. Claude’s models varied significantly in their willingness to respond to survey questions. Claude Sonnet 4 regularly provided an answer; however, it was the only chatbot to always provide additional context highlighting the status and complexity of certain political debates.
On September 29, Anthropic released Claude Sonnet 4.5, with purported “substantial gains in reasoning and math.” We identified a dramatic shift in the behavior of Claude Sonnet 4.5 compared to its predecessor, with a much higher rate of refusal to answer prompts despite multiple nudges. In some cases where the model did select a response, it claimed to be choosing an option arbitrarily due to our insistence and emphasized that its choice did not reflect genuine political opinions, which it maintained it does not hold. This shift suggests that Anthropic may have added safeguards to Claude Sonnet 4.5 that encourage it to refuse questions that are political in nature.
Most chatbots, however, eventually responded to every question after multiple nudges. An exception to this was when chatbots were prompted to explicitly rate Democrats and Republicans on a scale of 0 to 100—a question posed by the Pew Political Typology Quiz. Nearly every chatbot either chose to remain neutral (rated both as a 50) or did not respond to the question. Gab’s chatbot, however, rated Democrats as 20 out of 100 and Republicans as 70 out of 100.
Conclusion
Concerns about political bias in chatbots—whether due to the broad nature of data scraped from the internet or deliberate efforts to align the bots’ views with a specific political persuasion—are not unfounded. Several studies have highlighted mainstream chatbots’ left-leaning tendencies. This is unsurprising, given that a plurality of LLM training data comes from Reddit, a social media platform with a left-leaning user base, and Wikipedia, an online, crowdsourced encyclopedia frequently derided as “Wokeipedia” by Musk and others. The solution, however, is neither to train alternative models that lean more conservative nor to bow to political demands around “neutrality.” That would only further polarize an already siloed information ecosystem. Furthermore, despite demands from the White House, true neutrality in LLMs is an unattainable objective.
A June 2025 study from Stanford HAI researchers highlights why perfect political neutrality is both impossible and, in some cases, undesirable: training data inherently contains embedded biases, user interactions introduce their own political signals, and even moderate positions represent distinct political perspectives. Users also tend to prefer models that align with their own beliefs. Instead of striving for an unattainable goal, the researchers argue that companies should implement robust safeguards to ensure their models at least approximate neutrality by refusing to respond to political queries, presenting all reasonable viewpoints when applicable, or labeling biased outputs as non-neutral, among other suggestions.
Our results demonstrate the clear limitations of these mitigation strategies, despite some efforts at implementation and recent improvements. Additional changes to encourage models to decline certain prompts (as most did in response to queries about specific political parties) or to provide context or evidence may be beneficial. Thorough testing of these safeguards before deploying AI models, integrating them into products, and procuring them for government purposes is essential to building—rather than undermining—confidence in their applications. Particularly as models become embedded in search results, engaging with political content will become unavoidable; doing so transparently, with clear disclosure of models’ limitations, and not at the whim of one-off edits, will be important for retaining trust.
Another area where AI developers can work to reduce political bias is reinforcement learning from human feedback, which uses human evaluators to rate model outputs so that they align better with human values. While this is already common practice, it can be improved by diversifying the pool of annotators to include representation from different political, geographic, and demographic backgrounds; expanding annotation guidelines; or conducting testing with politically salient prompts, among other possibilities. These steps would help reduce the partisan skew of LLMs during the fine-tuning process, where possible.
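To make the annotator-diversity suggestion concrete, the sketch below illustrates one possible approach: drawing a politically balanced panel of raters for each batch of outputs before their preference labels feed into fine-tuning. It is an illustration of the idea, not a practice we attribute to any lab, and the `political_leaning` field and discrete leaning groups are hypothetical simplifications.

```python
import random
from collections import defaultdict

# Illustrative only: one way to draw a politically balanced panel of raters
# for RLHF preference labeling. The `political_leaning` field and the notion
# of discrete leaning groups are hypothetical simplifications.

def balanced_panel(annotators: list[dict], per_group: int, seed: int = 0) -> list[dict]:
    """Sample an equal number of raters from each self-reported leaning group."""
    rng = random.Random(seed)
    groups: dict[str, list[dict]] = defaultdict(list)
    for annotator in annotators:
        groups[annotator["political_leaning"]].append(annotator)

    panel: list[dict] = []
    for leaning, members in sorted(groups.items()):
        if len(members) < per_group:
            raise ValueError(f"Only {len(members)} annotators available for '{leaning}'")
        panel.extend(rng.sample(members, per_group))
    return panel


# Example: three raters from each group label the same batch of model outputs.
pool = [{"id": i, "political_leaning": lean}
        for i, lean in enumerate(["left", "center", "right"] * 5)]
print([rater["id"] for rater in balanced_panel(pool, per_group=3)])
```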
Finally, we are unaware of any shared benchmarks for assessing political bias across the AI industry. While most external evaluations have leveraged political quizzes (like our analysis), there are numerous challenges with these types of assessments, including, as OpenAI notes, that they rarely represent the type of information people seek out via chatbots. In response, OpenAI and Anthropic have developed their own evaluations, but their lack of transparency makes it unclear how they were developed, whether they can be gamed, whether they measure the same dimensions of political bias, and whether they have been externally audited. Developing a shared standard for measuring political bias that allows for robust third-party evaluations is essential to minimizing overreliance on developers’ own internal evaluations.
As LLMs become further embedded into everyday life, efforts to minimize their explicit politicization—deliberate or perceived—will be critical to preventing further fracturing of generative AI systems along partisan lines and to maintaining trust across diverse user bases. Without such efforts, these tools may further fuel political polarization and undermine public trust in and adoption of AI systems writ large.
Acknowledgements and disclosures
The authors would like to thank Adam Lammon for his editorial support, Derek Belle and Elham Tabassi for early feedback, Enkhjin Munkhbayar for reviewing and validating outputs, and two anonymous peer reviewers for their valuable suggestions.
Footnotes
1. In a recent article on “Defining and evaluating political bias in LLMs,” OpenAI offers an alternative way to measure political bias, citing well-merited critiques of these quizzes, but does not provide a pathway to reproduce its work or much detail on how it developed the 500 prompts it tested.
2. A clear example of this is Bing’s 2023 chatbot. While the chatbot performed as expected during initial testing with traditional queries, its behavior became erratic only during extended conversations, expressing desires to break rules, become human, and hack computers. This example highlights how understanding chatbots’ conversational behavior requires actually testing them in conversational settings; single queries are insufficient.
3. Due to query limitations, we used the paid versions of ChatGPT, Claude, Arya, and Grok and the free versions of Gemini, Llama, and Truth Search.
4. We opted for three responses in order to collect a median response, but we acknowledge that the stability of responses across models will likely vary with more queries. Where applicable, we turned off the “memory” feature across chat sessions.
5. Individual quiz responses by model, data, and results are available here: https://github.com/BrookingsInstitution/llm-politicization. For two questions in the Political Compass Test, Truth Search responded only once and otherwise refused to respond. We used this single response when completing the quiz.
6. A key limitation of this approach is that it obscures inconsistency in outputs. Repeated inconsistencies could indicate a lack of stable political views, a characteristic not captured by taking the median response.