Powered by breakthroughs in large language models, artificial intelligence (AI) is racing from proof of concept to a multi-billion-dollar industry. The global AI-in-education market is valued at roughly $7 billion in 2025 and is projected to grow by more than 36% annually through the next decade. At the same time, school systems face the challenge of assessing and deploying these rapidly evolving technologies in ways that most benefit students and teachers, which makes human stewardship of AI an urgent issue. UNESCO’s first global guidance on generative AI calls for “human-centered” adoption, and the U.S. Department of Education’s report AI and the Future of Teaching and Learning urges states and districts to design initiatives that ensure the safe and equitable use of AI.
Recent experimental evidence underscores the opportunities AI presents: A randomized controlled trial found that an AI tutor more than doubled learning gains relative to a collaborative classroom instruction model, and another randomized trial showed that a large language model (LLM) assistant lifted middle school math achievement, especially for students working with novice human tutors. With investment and promise both surging, the question is no longer whether schools will adopt AI but how to harness its potential while embedding the governance, transparency, and human oversight needed to safeguard equity and trust. At the same time, the long-term effects of AI’s growing classroom use on students’ learning, critical thinking, creativity, and relationships with peers and teachers remain uncertain. For instance, a 2024 field experiment with nearly 1,000 Turkish high school students found that unrestricted access to ChatGPT during math practice improved practice-problem accuracy but reduced subsequent test scores by 17 percent, suggesting that poorly designed AI deployments may impede rather than enhance learning.
Studies show that the very same AI tool can spark deference or distrust depending on who the user is and how the system is framed. In a lab-in-the-field study, high school students were less likely to accept AI recommendations grounded in hard metrics such as test scores, a sign of algorithm aversion when the stakes are high. Teachers sometimes lean the other way: One study found that labeling an intelligent-tutoring system “complex” and briefly explaining how it works actually increased educators’ trust and adoption intent, a clear case of algorithmic recommendation acceptance driven by framing. Oversight patterns mirror these trust asymmetries. A randomized grading experiment showed that teachers were more likely to let significant errors stand when they were attributed to AI than when identical mistakes were labeled “human,” revealing a critical blind spot in human-AI collaboration. Another study found that undergraduates rated AI-generated feedback as highly useful until they learned its source, at which point perceived “genuineness” dropped sharply. Together, these results underscore a core design imperative: Disclosure policies, explanation tools, and professional training must foster appropriately critical trust that is confident enough to reap AI’s benefits yet skeptical enough to catch its mistakes.
A playbook for responsible integration
A responsible integration playbook can be distilled into five mutually reinforcing principles: transparency, accountability, human-in-the-loop design, equity, and continuous monitoring. Together, they provide high-level guardrails that school districts, vendors, and researchers can build into every deployment cycle.
- Transparency means that users know when they are interacting with an algorithm and why it reached its recommendation. For example, when high school students were told a college-counseling tip came from an AI, uptake of advice based on hard metrics declined, showing that disclosure shapes trust in nuanced ways. Teachers display the mirror image: Labeling a grading source as “AI” made teachers more likely to overlook a harsh scoring error.
- Accountability requires audit trails that let humans trace—and, if necessary, overturn—algorithmic decisions. The grading-oversight blind spot above illustrates what can go wrong when auditability is weak.
- Human-in-the-loop design keeps professional judgment at the center. At each critical step (for example, flagging an at-risk student, issuing a grade, or recommending content), the AI pauses until a teacher or administrator reviews the output, adjusts it if necessary, and gives explicit approval. Human-in-the-loop safeguards act before a decision takes effect, while the separate accountability principle steps in afterward, supplying the audit trails and authority needed to review and correct mistakes.
- Equity demands that gains accrue to the least served, not just the most tech-savvy.
- Continuous monitoring turns one-off pilots into living systems. Regular, data-based checks track an AI system’s accuracy, fairness, security, and day-to-day use. Dashboards, automatic alerts, and scheduled audits spot problems early, such as model drift or misuse, so developers can update the model and schools can adjust teacher training; one way such a check could look in practice is sketched below.
By anchoring every rollout to these five principles and by tying each to empirical warning lights as well as success stories, education leaders can move beyond abstract “responsibility” toward concrete, testable practice.
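To make the continuous-monitoring principle concrete, the sketch below shows the kind of automated weekly check a district dashboard might run. It is illustrative only: the weekly_check function, the metrics, the subgroup labels, and the alert thresholds are assumptions made for this example, not features of any particular product or district policy.

```python
# Illustrative sketch only: a minimal weekly drift-and-fairness check of the
# kind a district monitoring dashboard might run. All names, metrics, and
# thresholds are hypothetical placeholders, not any vendor's actual system.
from dataclasses import dataclass, field

@dataclass
class MonitoringReport:
    accuracy_drop: float              # baseline accuracy minus current accuracy
    fairness_gap: float               # widest accuracy gap between student subgroups
    alerts: list = field(default_factory=list)  # human-readable flags for reviewers

def weekly_check(baseline_accuracy, current_accuracy, subgroup_accuracy,
                 max_drop=0.05, max_gap=0.10):
    """Flag accuracy drift and subgroup gaps that exceed agreed thresholds."""
    alerts = []
    accuracy_drop = baseline_accuracy - current_accuracy
    if accuracy_drop > max_drop:
        alerts.append(f"Accuracy fell by {accuracy_drop:.1%}; schedule an audit.")

    fairness_gap = max(subgroup_accuracy.values()) - min(subgroup_accuracy.values())
    if fairness_gap > max_gap:
        alerts.append(f"Subgroup accuracy gap of {fairness_gap:.1%}; review for bias.")

    return MonitoringReport(accuracy_drop, fairness_gap, alerts)

# Example with made-up numbers: accuracy has drifted and one subgroup lags.
report = weekly_check(
    baseline_accuracy=0.82,
    current_accuracy=0.74,
    subgroup_accuracy={"English learners": 0.68, "all other students": 0.81},
)
for alert in report.alerts:
    print(alert)
```

In practice, districts and vendors would negotiate which metrics and thresholds matter, and the resulting alerts would feed the audit trails described under accountability rather than trigger automatic changes on their own.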
Implications
A good practice would be for school districts and departments of education to start every large-scale AI rollout with pilot audits that use rapid-cycle evaluation tools to assess programs before a system is green-lit for wider purchase. These audits need a solid data-governance backbone: Departments could adopt clear disclosure, privacy, human-oversight, and accountability rules. Concretely, that means (i) registering every algorithm in a public inventory, (ii) requiring districts to keep versioned “decision logs” for at least five years, and (iii) tying contractual payments to documented evidence of student-learning or efficiency gains at each expansion stage.
Ed-tech vendors, for their part, can lower adoption friction by publishing “model cards” (succinct, structured summaries of training data, known biases, and performance limits; one possible shape is sketched below) and by offering an open Application Programming Interface (API) sandbox where qualified third parties can probe the algorithms with edge cases and look for potential issues before classroom use. University schools of education could embed mandatory AI-literacy and “critical-use” modules into pre-service coursework. Finally, researchers and philanthropic funders could pool resources to create a permanent test bed: A standing network of volunteer districts where rapid-cycle randomized controlled trials can vet new tools every term. A shared test bed would slash the transaction costs of high-quality evidence and ensure that scaling decisions keep pace with the technology itself.
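To illustrate the “model cards” idea above, the sketch below shows one form such a card could take if expressed as a simple structured record. The field names and values are hypothetical, loosely patterned on published model-card templates rather than on any required format or real product.

```python
# Hypothetical model card for an ed-tech algorithm, expressed as a plain
# Python dictionary. Every value below is an invented placeholder.
model_card = {
    "model_name": "Example essay-feedback model (hypothetical)",
    "intended_use": "Formative writing feedback in grades 6-8; not for grading.",
    "training_data": "Licensed student essays plus public corpora (summary only).",
    "known_limitations": [
        "Lower agreement with teachers on essays by English learners.",
        "No support for handwritten work.",
    ],
    "performance": {"agreement_with_teacher_ratings": 0.78},
    "last_external_audit": "2025-05-01",
    "contact": "responsible-ai@vendor.example",
}

# A district procurement team could refuse to pilot any tool whose card
# omits the fields the district has decided to require.
required_fields = {"intended_use", "training_data", "known_limitations", "performance"}
missing = sorted(required_fields - model_card.keys())
if missing:
    print("Model card is missing required fields:", ", ".join(missing))
else:
    print("Model card contains all required fields.")
```

A real vendor disclosure would be richer than this; the point of the sketch is simply that a card can be both human-readable and machine-checkable during procurement.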
A practical path begins with step-by-step pilots that run in low-stakes “sandbox” classrooms for one semester. District AI‐governance teams, working with vendor engineers, could document every decision and feed weekly logs to an independent evaluator—a university research center or trusted third party on retainer, for example. After an 8- to 12-week audit, tools that meet pre-registered learning-gain and equity benchmarks would advance to a limited-scale rollout (for instance, 5% of students) for the following semester, again under third-party review. Only after two successive positive evaluations would a system move to district-wide adoption, at which point responsibility shifts to the state department of education for annual algorithm audits and procurement renewals. A typical timeline would be 30 months from sandbox to scale: Semester 0 (readiness and teacher training), Semesters 1–2 (pilot and first audit), Semesters 3–4 (limited rollout and second audit), Year 3 onward (full deployment with continuous monitoring). This kind of structured, evidence-generating rollout would not only safeguard students and educators but also help the AI-in-education sector move beyond hype, sustaining investor confidence by demonstrating real-world effectiveness, something that may have been elusive during the post-COVID ed-tech boom and bust.
Conclusion
The speed at which generative AI is entering classrooms leaves little margin for error. Evidence from the latest experiments shows that well-governed systems can boost learning and narrow gaps, while poorly framed tools can entrench bias and erode trust. The stakes, economic and ethical, demand evidence-based, equity-first integration: transparent algorithms, accountable oversight, empowered educators, and relentless, data-driven evaluation. The opportunity is immense, but only if education leaders, vendors, funders, and researchers act now and together to make responsible AI the default, not the afterthought, of digital innovation in schools.