Navigating the Line Between Fair Use and Plagiarism in the Age of AI Chatbots

The emergence of generative AI has raised questions about the ethics of content usage and summarization. Perplexity AI, a startup that combines a search engine with a language model, has faced accusations of unethical practices: Forbes accused Perplexity of plagiarizing one of its news articles, while Wired claimed the startup illicitly scraped its website and others. Perplexity denies any wrongdoing and argues that it operates within the bounds of fair use under copyright law.

The dispute centers on two concepts. The first is the Robots Exclusion Protocol, by which websites indicate whether they permit web crawlers to access their content. The second is fair use, a doctrine in copyright law that allows limited use of copyrighted material without permission or payment. Wired and developer Robb Knight ran experiments suggesting that Perplexity was scraping content from websites against publishers' wishes. Perplexity counters that summarizing a URL is not the same as crawling, because it only responds to direct user requests for information. This distinction between summarization and scraping is a key point of contention between Perplexity and publishers. Wired also accused Perplexity of hallucinating entire stories, highlighting the challenges of using AI models for summarization.

Perplexity's CEO has cited fair use as justification for summarizing articles, comparing the practice to how journalists draw on information from other sources. Plagiarism, while frowned upon, is not illegal, and fair use permits limited use of copyrighted material for purposes such as commentary or news reporting. The legality of Perplexity's summaries therefore depends on whether they reproduce significant portions of the original articles, and the lack of bright lines in fair use law makes the question hard to settle. To shield itself, Perplexity is pursuing advertising revenue-sharing deals with publishers and offering them access to its technology for Q&A experiences.
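To illustrate how the Robots Exclusion Protocol works in practice, here is a minimal sketch using Python's standard-library `urllib.robotparser`. The `robots.txt` content and the bot names are hypothetical examples, not the actual rules or user agents involved in this dispute:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
robots_txt = """\
User-agent: ExampleNewsBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler checks can_fetch() before requesting a page.
print(rp.can_fetch("ExampleNewsBot", "https://example.com/article"))  # blocked
print(rp.can_fetch("SomeOtherBot", "https://example.com/article"))    # permitted
```

Note that the protocol is purely advisory: nothing technically prevents a crawler from ignoring `robots.txt`, which is why compliance is a matter of convention and, as in this case, controversy.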
However, if AI scrapers continue to repurpose publishers' work, they could undermine publishers' ability to earn ad revenue and lead to a decline in available content. That, in turn, could leave generative AI systems training on synthetic data, potentially producing biased and inaccurate output.