In the fast-changing era of artificial intelligence (AI), one question has become paramount: how do content creators and site owners control whether their work is used to train Large Language Models (LLMs)? A newly proposed standard, dubbed llms.txt, aims to give them a clear answer.
Llms.txt is a text file placed on a website that specifies limitations for AI web crawlers, much as the existing robots.txt file does for search engine crawlers. Its primary function is to state which parts of a website, if any, may be used to train AI models. The idea was initially suggested by a consortium of technology companies and is gaining traction as a simple way for content creators to express their preferences (Broussard, 2024).
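To make this concrete, here is a purely hypothetical example of what such a file might contain. The directive names (User-Agent, Disallow, Allow) are borrowed from robots.txt conventions for illustration; no official llms.txt syntax has been standardized yet.

```text
# Hypothetical llms.txt — syntax borrowed from robots.txt for illustration only
# Applies to all AI training crawlers
User-Agent: *
Disallow: /premium-articles/
Allow: /blog/
```

In this sketch, a crawler that honors the file would skip the premium section when gathering training data but could use the blog.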
The Pros: Why You Might Want llms.txt
There are several solid reasons why webmasters might want to adopt llms.txt.
Clear Control: The strongest reason is the ability to state your wishes explicitly. It removes any ambiguity about whether you consent to your data being used to train AI.
Protection of Intellectual Property: For companies, artists, and writers, website content is a valuable asset. llms.txt provides a means to keep this proprietary material from being drawn into third-party AI models without permission or compensation.
Future-Proofing: Since the legal and ethical standards for training AIs are not yet fully developed, adopting llms.txt is a proactive move. It prepares your site for a potential future industry standard.
The Cons: The Current Limitations
Despite its potential, llms.txt is far from perfect and has some major drawbacks.
Voluntary Adoption: The most significant flaw is that compliance with llms.txt is entirely voluntary. Malicious actors, or companies that object to the standard, can simply ignore the file and scrape your data anyway (TechCrunch, 2024).
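The voluntary nature of the file is easiest to see in code: a well-behaved crawler has to choose to fetch the file, parse it, and respect it. The sketch below shows what that check might look like. It is a minimal illustration under assumed conventions, since the directive names (User-Agent, Disallow) are borrowed from robots.txt and no llms.txt syntax has been standardized.

```python
# Hypothetical sketch of a well-behaved AI crawler's policy check.
# Directive names are assumptions borrowed from robots.txt; there is
# no official llms.txt syntax yet.

def parse_policy(text):
    """Map each user-agent name to its list of disallowed path prefixes."""
    policy = {}
    agent = None
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        key, _, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            agent = value
            policy.setdefault(agent, [])
        elif key == "disallow" and agent is not None:
            policy[agent].append(value)
    return policy

def may_train_on(policy, agent, path):
    """True unless a Disallow rule for this agent (or '*') covers the path."""
    rules = policy.get(agent, []) + policy.get("*", [])
    return not any(rule and path.startswith(rule) for rule in rules)

sample = """
User-Agent: *
Disallow: /private/
"""
rules = parse_policy(sample)
print(may_train_on(rules, "example-bot", "/private/data.html"))  # False
print(may_train_on(rules, "example-bot", "/blog/post.html"))     # True
```

The point of the sketch is the last two lines: nothing in the protocol enforces the False result. A crawler that never calls this check scrapes /private/ unimpeded, which is exactly the weakness described above.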
Not a Legal Shield: llms.txt is a technical guideline, not a legally binding agreement. It does not eliminate the need for clear terms of service or copyright notices, and its effectiveness relies on the good faith of AI developers.
A Developing Standard: The proposal is still fairly recent, and it is by no means certain that it will become the widely accepted standard. Alternative methods may emerge, or large AI companies may devise their own proprietary systems.
If you’re interested in learning more about how to protect your website data from AI, contact us today!