Web Scraping, AI Training, and India’s Data Protection Blind Spot

Keywords: Web scraping law in India, AI training data legality, Publicly available personal data, Digital Personal Data Protection Act 2023, DPDPA Section 3(c)(ii), AI and privacy law India, Legality of AI training in India, Data protection and artificial intelligence, Consent under DPDPA, AI data scraping compliance
Artificial Intelligence is transforming everything from governance and law to healthcare, finance, and communication. At the heart of this transformation are Large Language Models (LLMs) - the systems powering generative AI tools across the world. But behind the sophistication of these systems lies a simple and controversial practice: web scraping.
Web scraping - the automated extraction of data from websites - has become the backbone of AI training. While it has legitimate uses in research, compliance monitoring, and analytics, its use for training commercial AI systems raises a pressing legal question:
If data is publicly available online, does that mean it can be freely scraped and used to train AI?
Under India’s Digital Personal Data Protection Act, 2023 (DPDPA), the answer appears dangerously close to “yes.” And that may be a problem.
Public Access Is Not the Same as Consent
A common assumption in the digital world is that if information is visible without a login or paywall, it is free for use. But legally, there is a critical difference between:
- Access – technical ability to view data
- Authorization – legal permission to use it
- Consent – informed and voluntary approval for specific processing
Just because someone can view your LinkedIn profile, blog post, or public tweet does not mean they have your consent to scrape it, aggregate it, and feed it into a commercial AI system.
Consent in data protection law is purpose-specific. If a person shares their professional details for networking, that does not automatically imply consent for:
- AI training
- Commercial exploitation
- Model optimization
Yet, India’s DPDPA creates a major exception.
The Public Data Exemption Under the DPDPA
Section 3(c)(ii) of the Digital Personal Data Protection Act, 2023 excludes “publicly available personal data” from the Act’s scope.
In simple terms, if personal data has been made publicly available by the individual or by someone legally obligated to do so, it is exempt from the Act’s consent and compliance requirements.
This means:
- No consent requirement
- No purpose limitation
- No transparency obligation
- No accountability framework
For AI companies, this exemption creates a regulatory safe zone. Public data can be scraped at scale and used to train models without triggering India’s primary data protection law.
But this approach rests on a flawed assumption:
Public availability equals unrestricted permission.
Why AI Changes the Equation
Traditional data processing is usually discrete and reversible.
- If a company misuses your data, it can delete it.
- If you withdraw consent, processing can stop.
AI training is different.
When personal data is used to train an AI model:
- It becomes embedded in the model’s internal parameters
- It cannot easily be isolated or removed
- It may reappear in generated outputs
- Deletion requests may be technically impractical
This creates two serious risks.
1. Data Memorization and Leakage
Research has shown that large language models can reproduce fragments of training data when prompted strategically.
This may include:
- Names
- Email addresses
- Other identifiers scraped from the web
In such cases, harm does not occur at the point of collection - it occurs later, when the AI generates an output.
The DPDPA’s public data exemption does not account for this generative risk.
2. The Irreversibility Problem
Once data is used to train a model, removing it may require retraining the entire system - an expensive and often unrealistic solution.
This renders traditional rights ineffective in practice, including:
- Consent withdrawal
- Data erasure
If public data is exempt from regulation at the collection stage, individuals may be left with no meaningful remedy once harm occurs.
The Provenance Problem: Was It Really Public?
There’s another complication.
Web scrapers cannot reliably verify whether data was made public by:
- the individual, or
- someone else.
Content may be:
- reposted
- mirrored
- scraped from elsewhere
- leaked
So how does an AI developer confirm:
- That the data was voluntarily made public?
- That it was not the result of a breach?
- That it was not later deleted or restricted?
The DPDPA does not address this ambiguity.
Instead, it shifts the burden onto individuals, even though they have no visibility into whether their data has been scraped or embedded into AI systems.
How Other Jurisdictions Handle Public Data
India’s approach contrasts sharply with emerging global trends.
European Union
Under the GDPR, publicly available personal data remains protected.
Public access does not eliminate compliance obligations. Controllers must still demonstrate:
- Lawful processing
- Proportionality
- Purpose limitation
China
China’s Personal Information Protection Law (PIPL) allows processing of public data only within a “reasonable scope.”
It explicitly restricts uses that significantly affect individuals’ rights.
Canada
Canadian law allows limited use of publicly available data, but under narrowly defined and regulated circumstances.
Global Regulatory Consensus
In 2023 and 2024, multiple international data protection authorities issued joint statements warning that:
Scraping publicly available personal data for AI training may constitute a privacy violation.
The global direction is clear:
Public visibility does not erase privacy rights.
India’s blanket exemption stands apart from this trend.
The Risk of Becoming a Data Extraction Haven
If India maintains a broad exemption for publicly available personal data, it risks becoming an attractive jurisdiction for AI developers seeking fewer regulatory constraints.
This could have several consequences:
- Weakening of constitutional privacy protections
- Erosion of public trust in digital systems
- International interoperability challenges
- Increased exposure to AI-related privacy harms
In the long term, regulatory permissiveness may undermine rather than promote innovation.
The Case for Reform
A balanced approach is possible - one that supports AI innovation while preserving privacy.
Key reforms could include:
1. Context-Based Limitation
Publicly available personal data should not be exempt for all purposes.
The exemption should apply only where use aligns with the original context of disclosure.
2. AI as High-Risk Processing
AI training should be classified as high-risk data processing, triggering enhanced transparency and accountability requirements.
3. Provenance Verification Obligations
The burden should shift to data fiduciaries to verify that public data was:
- voluntarily disclosed
- lawfully sourced
4. Transparency in AI Training Data
Developers should disclose categories and sources of training data, enabling meaningful oversight.
5. Safeguards Against Memorization
Technical safeguards should be required to reduce data leakage and memorization risks.
Moving Beyond “Public vs Private”
The central lesson is simple:
Digital privacy cannot be reduced to a binary distinction between public and private data.
In the AI era, what matters is not whether data is visible but:
- How it is collected
- Why it is used
- Whether individuals retain control
- Whether the use creates disproportionate risk
India’s data protection law was enacted to protect informational self-determination.
Allowing unrestricted scraping of publicly available personal data for AI training undermines that objective.
If India wishes to lead in responsible AI development, it must move toward a contextual, risk-based regulatory model - one that recognizes that public access is not the same as perpetual consent.
Innovation and privacy are not mutually exclusive.
But without reform, India’s current framework risks sacrificing the latter in pursuit of the former.