Details
Constitutional AI, published as a research paper in December 2022, is the foundational method Anthropic uses to align Claude's behavior. Instead of relying solely on human reviewers to label harmful outputs, the technique uses AI feedback guided by an explicit set of principles to shape how the model responds. In January 2025, Anthropic published Constitutional Classifiers — a separate filtering layer that withstood more than 3,000 hours of adversarial testing without a universal bypass being found. Anthropic's Transparency Hub reports that in July through December 2025, the company banned 1.45 million accounts, processed 52,000 appeals (overturning 1,700), and reported 5,005 pieces of content to the National Center for Missing and Exploited Children. Usage policies in effect since September 2025 prohibit 14 categories of harmful use and require human-in-the-loop oversight for high-risk applications in healthcare, law, finance, and journalism. Developers using the Claude API receive access to real-time moderation tools at no additional cost.
Have evidence about Anthropic's AI practices? Submit a report.
Report a Sighting →