The Flow Report - technical drift in web scraping
Web scraping presents a perplexing set of risks that fluctuate with developments in software, law, and business norms. These risks shift as use of the web becomes more automated overall. The current trend in major markets does track the anticipated growth of web scraping as an almost conventional means of accessing data, checked by click-wrap legal terms and access controls such as CAPTCHA. But site operators and scrapers are deploying less well-known tools to fortify those walls and to profit from the data they protect.
Scraping went mainstream with the rise of AI training
Commentators predicted that, following the hiQ litigation in 2022, risks in scraping would become more predictable, turning on site-specific terms and the software used to prevent scraping. This has largely been the case, as in the ongoing litigation between Bright Data and X Corp. Yet artificial intelligence deployed by the largest platforms is normalizing the basic concept of collecting other platforms' data, leading to less inhibited views of what counts as valid scraping and data collection.
(Of course, there are also high-profile examples of litigation where creators, artists, and content providers have not accepted data collection for the purposes of training AI. And there is a recent 11th Circuit case in the United States (listed below) finding liability for high-volume scraping. The use case for particular scraped data – whether unrelated or competitive – remains a highly relevant risk factor.)
Technical norms appear to be shifting
A few patterns are emerging as large companies train AI:
Web scrapers seem more comfortable ignoring robots.txt on the premise that it is optional (or applies only to logged-in users); a sketch of what honoring robots.txt looks like appears after this list.
Increased load on target sites is considered defensible, so long as scraping does not disrupt them.
The speed of automated collection is increasing, with software such as Bytespider (owned by ByteDance, provider of TikTok) crawling at a reported 25 times the rate of OpenAI's crawler.
Proxy server use also seems to be more widespread; a proxy configuration sketch also appears below.
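For reference, honoring robots.txt takes only a few lines in most languages. Below is a minimal Python sketch using the standard library's urllib.robotparser; the site, paths, and user-agent string are illustrative placeholders, not a recommendation of any particular target.

    import time
    from urllib.robotparser import RobotFileParser

    USER_AGENT = "example-research-bot"   # illustrative bot name
    SITE = "https://www.example.com"      # illustrative target site

    # Fetch and parse the site's robots.txt once per crawl session.
    robots = RobotFileParser()
    robots.set_url(f"{SITE}/robots.txt")
    robots.read()

    # Honor a declared Crawl-delay if the site sets one; default to 1 second.
    delay = robots.crawl_delay(USER_AGENT) or 1.0

    for path in ["/listings", "/listings?page=2"]:
        url = SITE + path
        if robots.can_fetch(USER_AGENT, url):
            # ... fetch and parse url here ...
            time.sleep(delay)  # pace requests so the site is not disrupted

The norm shift described above is precisely that more operators skip this check, not that the check is difficult.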
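Proxy routing is similarly trivial with common HTTP clients. A minimal sketch with Python's requests library follows; the proxy address and credentials are placeholders, and commercial proxy networks typically rotate the exit IP automatically.

    import requests

    # Placeholder proxy endpoint; replace with a vendor-supplied address.
    PROXIES = {
        "http": "http://user:pass@proxy.example.net:8080",
        "https": "http://user:pass@proxy.example.net:8080",
    }

    # The request exits through the proxy rather than the scraper's own IP,
    # which is what weakens per-IP rate limits and blocklists as defenses.
    response = requests.get("https://www.example.com/listings",
                            proxies=PROXIES, timeout=30)
    print(response.status_code)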
Sites adopt defensive measures
At the same time, site operators are deploying software-based defenses against scraping, such as Cloudflare's (relatively recent) Turnstile control, which gates access somewhat like CAPTCHA and has been widely used since its 2022 beta release. But even smaller scraping firms can quickly download open-source tools, such as human mouse movement simulators, from GitHub or elsewhere to solve or bypass these controls. Many of these software tools have been available for years; however, their use is becoming commonplace.
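For site operators, the server side of a Turnstile deployment is a single verification call against Cloudflare's published siteverify endpoint. A minimal Python sketch follows; the secret key and client token are placeholders.

    import requests

    VERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"
    SECRET_KEY = "0x_placeholder_secret"  # issued by Cloudflare per site

    def token_is_valid(client_token: str, client_ip: str | None = None) -> bool:
        """Ask Cloudflare whether the token produced by the browser widget is genuine."""
        payload = {"secret": SECRET_KEY, "response": client_token}
        if client_ip:
            payload["remoteip"] = client_ip
        result = requests.post(VERIFY_URL, data=payload, timeout=10).json()
        return result.get("success", False)

A request that cannot present a valid token is treated as automated; the open-source tools mentioned above aim at the other side of this exchange, producing the human-like signals that let the check pass.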
Glacier’s perspective
The internet is moving towards free and open data access as legal cases, data storage, and user-generated content support an increasingly common view that much of this data is public or, at least, should not be the intellectual property of any one platform.
Glacier nevertheless encourages both providers and consumers of scraped data to carefully consider acceptable and safe parameters for the technical issues mentioned above, whether in a formal written policy (for regulated data buyers) or at the point of contracting for scraping services, in accordance with applicable law.
Practical considerations
In light of both recent changes in commercial norms around web scraping and recent case law in the United States, consumers of web scraped data should take several steps to recalibrate their scraping practices:
Revisit written policies and procedures to ensure that they reflect current market practices (including vendor practices),
Ask existing vendors, at contract renewal, about their adoption of new software tools and any changes to their scraping protocols, and
Determine the extent to which scraped data is collected by subcontractors, including those that may be ex-US.
Recent legal cases and related resources
X Corp. v. Bright Data Ltd. (U.S. District Court for the Northern District of California) (next hearing set for 10/23/24 at 8:00 AM PDT)
Compulife Software, Inc. v. Newman et al. (U.S. Court of Appeals for the 11th Circuit) (finding trade secret misappropriation in the scraping of a substantial portion of a database)
Ethical Web Data Collection Initiative (EWDCI) – Draft Principles
©2024 Glacier Network LLC d/b/a Glacier Risk (“Glacier”). This post has been prepared by Glacier for informational purposes and is not legal, tax, or investment advice. This post is not intended to create, and receipt of it does not constitute, a lawyer-client relationship. This post was written by Don D’Amico without the use of generative AI tools. Don is the Founder & CEO of Glacier, a data risk company providing services to users of external and alternative data. Visit www.glaciernetwork.co to learn more.