I’m sympathetic to those rallying against allowing companies to scrape the open web to train their models, but unfortunately that ship has sailed. By trying to block scrapers today, you guarantee that only OpenAI and Google can build large models, since they have already scraped everything. That grants them a duopoly and turns their feature into a platform.

At this point, we’re better off either fostering competition by making it easier to bootstrap a new foundation model, or legislating that any model trained on public data must be fully open-sourced.