Is Web Scraping Legal? A Developer's Guide to Staying Compliant

Let me start with the lawyer’s favorite answer: it depends.

But here’s the thing — that’s actually a useful answer once you understand what it depends ON. Web scraping exists in a legal gray area, but there are clear principles that separate risky scraping from safe scraping.

I’m going to break this down in plain English. No law degree required.

Look, here’s the reality: web scraping is massive business and the law is struggling to keep up.

| Metric | Value | Source |
|---|---|---|
| Web scraping market size | $4.2B (2026) → $12.3B (2030) | MarketsandMarkets Report |
| Enterprise scraping adoption | 78% of Fortune 500 companies | Forrester Research 2026 |
| GDPR data scraping fines (2026) | €27.5M in total fines | [GDPR Enforcement Tracker](https://gdpr enforcementtracker.com/) |
| Legal cases involving scraping | 47 major cases filed in 2026 | LegalTech Database |
| CFAA-related scraping cases | Only 3 successful prosecutions since 2018 | EFF Analysis |
| AI training data lawsuits | 12 major cases in 2026 alone | AI Litigation Tracker |

The paradox: Every major tech company scrapes data (Google, Microsoft, OpenAI, Meta), yet they simultaneously sue scrapers. The legal framework is inconsistent because the technology evolved faster than the law.

2026 key developments:

  • hiQ v. LinkedIn precedent still holding strong - public data scraping largely protected
  • GDPR enforcement targeting large-scale data harvesting (Clearview AI fined €20M)
  • AI training lawsuits creating new legal precedents about fair use
  • State-level regulations emerging (California’s bot disclosure law)

The market speaks: a $4.2B market growing to $12.3B by 2030 means businesses are investing heavily in scraping. That level of investment would be hard to sustain if scraping were broadly illegal, though market size alone is not a legal defense.

The Short Answer

| What You’re Scraping | Generally Legal? | Risk Level |
|---|---|---|
| Publicly accessible data | Usually yes | Low |
| Data behind login | Risky | Medium-High |
| Personal data (EU) | Complicated | High |
| Copyrighted content | Depends on use | Medium |
| Data explicitly forbidden by ToS | Gray area | Medium |

Now let’s dig into why.

1. Computer Fraud and Abuse Act (CFAA) — United States

The CFAA is the big one in the US. It’s a federal law that makes it illegal to access a computer “without authorization” or to “exceed authorized access.”

For years, this was interpreted broadly. Companies argued that violating their Terms of Service meant you were “exceeding authorized access.” That interpretation was scary for scrapers.

Then came the landmark cases:

hiQ Labs v. LinkedIn (2022)

This is the case that changed everything.

Background: hiQ scraped LinkedIn’s public profiles to provide workforce analytics. LinkedIn sent cease-and-desist letters, then blocked hiQ’s IP addresses.

Ruling: The Ninth Circuit held that scraping publicly accessible data likely does not violate the CFAA. If the data is available to any member of the public without authentication, accessing it isn’t “unauthorized.”

Key takeaway: In the court’s reasoning, the CFAA’s “without authorization” language targets systems gated by something like a password. Pages open to the general public have no authorization requirement to breach.

Van Buren v. United States (2021)

The Supreme Court narrowed the CFAA’s scope significantly.

Ruling: “Exceeds authorized access” only covers accessing information on a computer that someone is not entitled to access AT ALL — not accessing allowed information for improper purposes.

Impact: This makes the “violating ToS = computer crime” argument much weaker.

What CFAA Means for Scrapers

| Scenario | CFAA Risk | Why |
|---|---|---|
| Scraping public web pages | Low | hiQ ruling: public data doesn’t require authorization |
| Scraping after logging in with real account | Medium | You’re authorized to access, question is scope |
| Scraping with fake accounts | Higher | Creating fake identity could be fraudulent |
| Scraping after IP ban | Gray area | Ban indicates withdrawn authorization |
| Bypassing technical measures | Higher | DMCA might apply if circumventing access controls |

2. GDPR — European Union

If you’re scraping data that includes information about EU residents, GDPR applies. Full stop. Doesn’t matter where your servers are.

Key GDPR Concepts

Personal Data: Any information relating to an identified or identifiable person. This includes:

  • Names
  • Email addresses
  • Phone numbers
  • IP addresses (yes, really)
  • Photos of people
  • Location data
  • Online identifiers

Legal Bases for Processing: You need a legal justification to collect personal data:

  1. Consent — Person agreed (unlikely for scraping)
  2. Contract — Needed to fulfill an agreement (rarely applies)
  3. Legal obligation — Required by law
  4. Vital interests — Life or death situation
  5. Public interest — Government functions
  6. Legitimate interest — Your interest, balanced against data subject’s rights

For scraping, legitimate interest is usually the only viable basis. But you must:

  • Have a genuine legitimate interest
  • Scraping must be necessary for that interest
  • Balance your interest against privacy impact
  • Document your reasoning
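
One way to make the “document your reasoning” step concrete is to keep a small structured record next to each scraping job. The sketch below is only an illustration: the `LegitimateInterestAssessment` shape and the `missingFields` helper are hypothetical names, not a format prescribed by the GDPR or any regulator.

```typescript
// Hypothetical record mirroring the four bullets above.
interface LegitimateInterestAssessment {
  purpose: string;    // the genuine interest being pursued
  necessity: string;  // why scraping is needed to achieve it
  balancing: string;  // how the interest was weighed against privacy impact
  assessedOn: string; // ISO date the reasoning was written down
}

const REQUIRED_FIELDS: (keyof LegitimateInterestAssessment)[] = [
  "purpose",
  "necessity",
  "balancing",
  "assessedOn",
];

// Returns the fields that are absent or blank, so a job can refuse to run
// until the assessment has actually been filled in.
function missingFields(
  lia: Partial<LegitimateInterestAssessment>
): (keyof LegitimateInterestAssessment)[] {
  return REQUIRED_FIELDS.filter((key) => (lia[key] ?? "").trim() === "");
}

const draftAssessment: Partial<LegitimateInterestAssessment> = {
  purpose: "Price monitoring for a comparison service",
  necessity: "The retailer offers no API for current prices",
};
console.log(missingFields(draftAssessment));
```

A record like this does not make processing lawful by itself. It just ensures the reasoning exists in writing before the scraper starts.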

GDPR-Compliant Scraping

✓ DO:
- Scrape truly public data (published by the person themselves)
- Document your legitimate interest
- Minimize data collection (don't collect more than needed)
- Secure the data appropriately
- Honor data subject requests (right to erasure, etc.)
- Have a privacy policy explaining your practices
✗ DON'T:
- Scrape private information without consent
- Build profiles on individuals without legal basis
- Ignore data subject access requests
- Keep data longer than necessary
- Transfer EU data to non-adequate countries without safeguards
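
Two items on these lists, honoring erasure requests and not keeping data longer than necessary, translate fairly directly into code. The following is a minimal sketch assuming scraped records live in memory and carry a subject identifier and a collection timestamp. `ScrapedRecord`, `purgeExpired`, and `eraseSubject` are hypothetical names, and the retention period is something you would set from your own policy.

```typescript
// Hypothetical shape for a stored scrape result.
interface ScrapedRecord {
  subjectId: string;              // identifier for the person the data is about
  collectedAt: Date;              // when the record was scraped
  fields: Record<string, string>; // the minimized data that was kept
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Keep only records that are still inside the retention window.
function purgeExpired(
  records: ScrapedRecord[],
  retentionDays: number,
  now: Date = new Date()
): ScrapedRecord[] {
  const cutoff = now.getTime() - retentionDays * DAY_MS;
  return records.filter((record) => record.collectedAt.getTime() >= cutoff);
}

// Drop everything held about one data subject (right to erasure).
function eraseSubject(records: ScrapedRecord[], subjectId: string): ScrapedRecord[] {
  return records.filter((record) => record.subjectId !== subjectId);
}
```

In a real system the same two operations would run against your database and backups, typically on a schedule for retention and on demand for erasure requests.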

GDPR Fines Are Real

| Company | Fine | Reason |
|---|---|---|
| Clearview AI | €20M (Italy) | Scraping facial images without consent |
| Clearview AI | £7.5M (UK) | Same: scraping faces |
| Meta | €1.2B | Data transfer violations |

The pattern is clear: scraping personal data at scale without proper legal basis attracts regulatory attention.

3. Copyright

Scraping copyrighted content adds another layer of complexity.

What’s Protected

  • Original text (articles, posts, descriptions)
  • Images and graphics
  • Videos and audio
  • Software code
  • Database structures (in some jurisdictions)

What’s Usually OK

  • Facts and data — Copyright doesn’t protect facts, only creative expression
  • Short excerpts — Fair use may allow limited quotation
  • Metadata — Titles, dates, categories (not creative expression)
  • Transformative use — Using data in a fundamentally different way

The Database Right (EU)

The EU has a special “sui generis” database right. If someone invested substantial effort in creating a database, extracting substantial portions may infringe this right — even if individual entries aren’t copyrighted.

Example: A phone directory’s individual listings aren’t copyrighted, but systematically extracting the whole database could violate the database right.

4. Terms of Service

Here’s where it gets philosophically interesting.

Are ToS Legally Binding?

Courts have generally found clickwrap agreements (where you click “I agree”) enforceable. Browsewrap agreements (where terms are just linked at the bottom of the page) are shakier.

But here’s the thing: ToS violations are typically breach of contract, not crimes. The remedy is civil, not criminal.

The ToS Defense Doesn’t Always Work for Websites Either

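In hiQ v. LinkedIn, the court noted that LinkedIn’s attempts to use contract law to prevent scraping of public data raised anti-competitive concerns. The same litigation also shows the limits of that argument: on remand, the district court found in 2022 that hiQ had breached LinkedIn’s User Agreement, and the case ended in a settlement.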

Practical ToS Approach

| ToS Says | Risk Level | Recommendation |
|---|---|---|
| Nothing about scraping | Low | Proceed carefully |
| “No scraping” general prohibition | Low-Medium | Public data likely still OK |
| “No automated access” | Medium | Technical prohibition, debatable |
| “We will sue you” specific threat | Medium-High | They’re serious about enforcement |
| Registration wall + no-scraping ToS | Higher | You agreed to something specific |

Real-World Case Studies

Case 1: Price Comparison Scraping

Scenario: Scraping product prices from retail websites for a comparison service.

Legal Analysis:

  • ✓ Prices are facts, not copyrightable
  • ✓ Data is publicly accessible
  • ✓ Serves legitimate consumer interest
  • ⚠ May violate ToS (civil, not criminal)
  • ⚠ High-volume scraping could cause technical interference

Risk: Low-Medium

Best Practice: Use reasonable request rates, identify your scraper, cache aggressively.
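
“Cache aggressively” can be as simple as remembering each page for a fixed time so that repeat lookups never reach the retailer’s server. Here is a minimal in-memory sketch. `TtlCache` and `cachedFetch` are hypothetical names, and the `fetcher` argument stands in for whatever HTTP client you use.

```typescript
// Small time-to-live cache keyed by URL.
class TtlCache<V> {
  private readonly ttlMs: number;
  private readonly entries = new Map<string, { value: V; expiresAt: number }>();

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): V | undefined {
    const entry = this.entries.get(key);
    if (entry === undefined) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired: forget it and report a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, now: number = Date.now()): void {
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Only calls `fetcher` when the cache has no fresh copy of the page.
async function cachedFetch(
  url: string,
  cache: TtlCache<string>,
  fetcher: (url: string) => Promise<string>
): Promise<string> {
  const hit = cache.get(url);
  if (hit !== undefined) return hit;
  const body = await fetcher(url);
  cache.set(url, body);
  return body;
}

// Usage with a stub fetcher (a real one would perform the HTTP request).
const priceCache = new TtlCache<string>(15 * 60 * 1000);
const stubFetcher = async (url: string) => `<html>stub page for ${url}</html>`;
cachedFetch("https://example.com/product/1", priceCache, stubFetcher).then((body) =>
  console.log(body.length)
);
```

The time-to-live is a judgment call: prices that change hourly do not need to be re-fetched every minute.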

Case 2: Social Media Profile Scraping

Scenario: Scraping public LinkedIn profiles for recruiting analytics.

Legal Analysis:

  • ✓ Publicly accessible (hiQ precedent)
  • ⚠ Contains personal data (GDPR applies if EU users)
  • ⚠ LinkedIn actively fights scrapers
  • ✓ Legitimate business interest exists

Risk: Medium (higher if you’re in EU or scraping EU profiles)

Best Practice: Document your legitimate interest, have GDPR-compliant data handling, consider using official APIs if available.

Case 3: News Article Scraping

Scenario: Scraping full news articles for an aggregation service.

Legal Analysis:

  • ✗ Articles are copyrighted creative works
  • ⚠ Full reproduction likely infringes copyright
  • ✓ Headlines/summaries might be OK (fair use)
  • ⚠ News sites often have restrictive ToS

Risk: Medium-High for full text

Best Practice: Scrape headlines and metadata, link to original source, consider licensing content.

Case 4: Government Data Scraping

Scenario: Scraping publicly available government records.

Legal Analysis:

  • ✓ Government data is generally public domain
  • ✓ Strong public interest argument
  • ✓ Often explicitly allowed
  • ⚠ Some government sites have technical access policies

Risk: Low

Best Practice: Check specific agency policies, respect technical limitations.

The Ethical Dimension

Legal is the floor, not the ceiling. Here’s what responsible scraping looks like:

Respect robots.txt

User-agent: *
Disallow: /private/
Crawl-delay: 10
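
A scraper can read those directives before it fetches anything. The sketch below is a deliberately minimal parser with several assumptions: it handles only `User-agent`, `Disallow`, and `Crawl-delay`, treats `Disallow` values as plain path prefixes, and ignores `Allow` rules and wildcards. `parseRobotsTxt` and `isAllowed` are illustrative names, not functions from any library. Note that `Crawl-delay` is a widely seen but non-standard directive.

```typescript
interface RobotsRules {
  disallow: string[];               // path prefixes the matching group forbids
  crawlDelaySeconds: number | null; // requested pause between requests, if any
}

function parseRobotsTxt(text: string, userAgent: string = "*"): RobotsRules {
  const rules: RobotsRules = { disallow: [], crawlDelaySeconds: null };
  let applies = false;      // are we inside a group that matches our agent?
  let lastWasAgent = false; // consecutive User-agent lines share one group

  for (const rawLine of text.split(/\r?\n/)) {
    const hashAt = rawLine.indexOf("#");
    const line = (hashAt === -1 ? rawLine : rawLine.slice(0, hashAt)).trim();
    const sep = line.indexOf(":");
    if (sep === -1) continue; // blank or malformed line

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      const matches = value === "*" || value.toLowerCase() === userAgent.toLowerCase();
      applies = lastWasAgent ? applies || matches : matches;
      lastWasAgent = true;
      continue;
    }
    lastWasAgent = false;

    if (!applies) continue;
    if (field === "disallow" && value !== "") {
      rules.disallow.push(value);
    } else if (field === "crawl-delay") {
      const seconds = Number(value);
      if (Number.isFinite(seconds)) rules.crawlDelaySeconds = seconds;
    }
  }
  return rules;
}

// Prefix match only; a production crawler should use a full parser.
function isAllowed(path: string, rules: RobotsRules): boolean {
  return !rules.disallow.some((prefix) => path.startsWith(prefix));
}
```

With the example file above, `/private/` paths would be skipped and the scraper would wait 10 seconds between requests.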

Rate Limiting

// Don't hammer servers
// (`urls` and `scrape` are assumed to be defined elsewhere)
const delay = (ms: number) => new Promise(r => setTimeout(r, ms));
for (const url of urls) {
  await scrape(url);
  await delay(2000); // 2 seconds between requests
}
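
A fixed delay is a good default, but a polite scraper also slows down when the server asks it to. The sketch below picks the next wait time from an HTTP response: it honors a numeric `Retry-After` header on 429 and 503 responses and otherwise falls back to capped exponential backoff. `retryDelayMs` and its default values are assumptions for illustration. The date form of `Retry-After` is not parsed here and simply falls through to backoff.

```typescript
function retryDelayMs(
  status: number,
  retryAfter: string | null, // raw Retry-After header value, if present
  attempt: number,           // 0 for the first retry, 1 for the second, ...
  baseMs: number = 2000,
  maxBackoffMs: number = 60000
): number {
  // Normal responses: keep the usual spacing between requests.
  if (status !== 429 && status !== 503) return baseMs;

  // The server told us how many seconds to wait: respect it as-is.
  if (retryAfter !== null && retryAfter.trim() !== "") {
    const seconds = Number(retryAfter);
    if (Number.isFinite(seconds) && seconds >= 0) return seconds * 1000;
  }

  // No usable hint: double the wait on each attempt, up to a ceiling.
  return Math.min(baseMs * 2 ** attempt, maxBackoffMs);
}

console.log(retryDelayMs(429, "30", 0));
console.log(retryDelayMs(503, null, 2));
```

Repeated 429 responses are also a signal in their own right: if the site keeps pushing back, lowering your request rate permanently is the more respectful fix.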

Identify Yourself

// Good: Identify your scraper
const identifiedHeaders = {
  'User-Agent': 'MyScraper/1.0 (contact@mycompany.com)',
};
// Bad: Pretend to be a regular browser
const disguisedHeaders = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0...',
};

Wait — doesn’t GoLogin help you pretend to be a regular browser? Yes, and there’s a difference:

  • Identification is about being reachable if there’s a problem
  • Fingerprinting is about bot detection systems

You can have a realistic fingerprint while still being identifiable through other means (contact email, company info, etc.).

Data Minimization

Collect only what you need. Don’t scrape entire profiles when you only need names.

// (`page` is assumed to be a Puppeteer Page)
// Good: Scrape only needed fields
const neededData = {
  productName: await page.$eval('.name', el => el.textContent),
  price: await page.$eval('.price', el => el.textContent),
};
// Bad: Scrape everything "just in case"
const everything = await page.evaluate(() => document.body.innerHTML);
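
The same idea can be enforced after extraction, too. If a parser hands back more than you asked for, an allowlist keeps the extra fields from ever reaching storage. This is a small sketch. `pickFields` is a hypothetical helper, and the sample record is made up.

```typescript
// Copy only the allowlisted keys; everything else is dropped.
function pickFields<T extends object, K extends keyof T>(
  record: T,
  allowed: readonly K[]
): Pick<T, K> {
  const out = {} as Pick<T, K>;
  for (const key of allowed) {
    out[key] = record[key];
  }
  return out;
}

// Example: the parser returned a seller email we have no use for.
const rawListing = {
  productName: "Widget",
  price: "9.99",
  sellerEmail: "seller@example.com",
};
const storedListing = pickFields(rawListing, ["productName", "price"]);
console.log(Object.keys(storedListing));
```

Because the allowlist is explicit, adding a new field to storage becomes a deliberate decision rather than a side effect of a parser change.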

Practical Compliance Checklist

Before you scrape, run through this:

1. Data Assessment

  • Is the data publicly accessible without login?
  • Does the data include personal information?
  • Is the content copyrighted?
  • Is there a database right concern (EU)?

2. Legal Basis

  • If personal data: What’s your GDPR legal basis?
  • If copyrighted: Do you have fair use defense?
  • Have you documented your legitimate interest?

3. Technical Respect

  • Have you checked robots.txt?
  • Are you rate-limiting requests?
  • Is your scraper identifiable?
  • Can site owners contact you about issues?

4. ToS Review

  • Have you read the Terms of Service?
  • Does it explicitly prohibit scraping?
  • Would you be comfortable defending your scraping in court?

5. Data Handling

  • Are you minimizing data collection?
  • Is collected data secured appropriately?
  • Do you have a data retention policy?
  • Can you honor data subject requests?
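
If you run many scrapers, the checklist can be turned into a gate that every new job has to pass. The sketch below encodes a subset of the questions above. `PreScrapeAnswers` and `openIssues` are hypothetical names, and the wording of each issue is only an example. It flags items for a human to review and does not decide legality.

```typescript
// Answers to a subset of the checklist questions above.
interface PreScrapeAnswers {
  publicWithoutLogin: boolean;
  containsPersonalData: boolean;
  legalBasisDocumented: boolean;
  robotsTxtChecked: boolean;
  rateLimited: boolean;
  contactableUserAgent: boolean;
  tosReviewed: boolean;
  retentionPolicyDefined: boolean;
}

// Returns the checklist items that still need attention.
function openIssues(a: PreScrapeAnswers): string[] {
  const issues: string[] = [];
  if (!a.publicWithoutLogin) issues.push("Data is behind a login: review authorization and ToS");
  if (a.containsPersonalData && !a.legalBasisDocumented) {
    issues.push("Personal data without a documented GDPR legal basis");
  }
  if (!a.robotsTxtChecked) issues.push("robots.txt has not been checked");
  if (!a.rateLimited) issues.push("Requests are not rate-limited");
  if (!a.contactableUserAgent) issues.push("Site owners cannot contact you");
  if (!a.tosReviewed) issues.push("Terms of Service have not been read");
  if (a.containsPersonalData && !a.retentionPolicyDefined) {
    issues.push("No data retention policy for personal data");
  }
  return issues;
}

const plannedJob: PreScrapeAnswers = {
  publicWithoutLogin: true,
  containsPersonalData: true,
  legalBasisDocumented: false,
  robotsTxtChecked: true,
  rateLimited: true,
  contactableUserAgent: true,
  tosReviewed: true,
  retentionPolicyDefined: true,
};
console.log(openIssues(plannedJob));
```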

When NOT to Scrape

Some situations are just not worth the risk:

Definite No-Gos

  • Medical records — Even if technically accessible, massive liability
  • Financial account data — Unauthorized access to financial systems
  • Classified or restricted government info — Actual crimes
  • Password-protected content you don’t own — Clear unauthorized access
  • Anything requiring you to break encryption — DMCA violations

Probably Avoid

  • Data explicitly marked private by the user
  • Sites that have sued scrapers and won
  • Extremely aggressive rate limiting (they REALLY don’t want you there)
  • Content where the only purpose is republication (clear copyright infringement)

The Future of Scraping Law

The legal landscape is evolving. Here’s what to watch:

AI Training Data Debates

The rise of large language models trained on scraped web data is sparking new legal battles:

  • Getty Images v. Stability AI
  • New York Times v. OpenAI
  • Class actions against various AI companies

These cases will further define what’s permissible with scraped data.

California Bot Disclosure Law

California requires bots interacting with Californians to disclose they’re bots in certain contexts. More states may follow.

EU Data Governance Act and Data Act

The Data Governance Act sets rules on data sharing and reuse, and the related Data Act adds B2G (business-to-government) and B2B data obligations. Both may affect scraping practices.

Platform Regulation

The Digital Markets Act (EU) and similar laws may force major platforms to provide data access, potentially reducing the need for scraping.

Frequently Asked Questions

Can I get arrested for web scraping?

Short answer: Almost certainly not for basic scraping. Long answer: It depends on what you’re scraping and how.

What will NOT get you arrested:

  • Scraping public product prices
  • Collecting business contact information
  • Gathering market research data
  • Monitoring competitor websites
  • Even ignoring robots.txt (civil issue, not criminal)

What MIGHT get you arrested (rare cases):

  • Breaking encryption/circumventing passwords - DMCA violations
  • Accessing financial systems without authorization - Bank fraud laws
  • Scraping classified government data - National security laws
  • Identity theft/fraud using scraped data - Various fraud statutes

The reality check: Successful CFAA prosecutions for web scraping are rare (the figures cited at the top of this guide count only 3 since 2018). Compare that to thousands of scraping operations running daily.

Prosecution criteria (what makes it criminal):

const criminalScraping = {
  unauthorizedAccess: true,   // Bypassing authentication
  financialData: true,        // Banking/financial systems
  encryptionBreaking: true,   // Circumventing security measures
  intentToDefraud: true,      // Using data for fraud
  governmentClassified: true, // National security implications
};
const businessScraping = {
  publicData: true,    // Anyone can access
  commercialUse: true, // Using for business purposes
  highVolume: true,    // Scaling operations
  toSViolation: false, // Civil matter, not criminal
};

Bottom line: Business scraping that stops when asked to stop is virtually never criminal. Criminal cases involve clear fraudulent intent or breaking actual security barriers.

Does GDPR apply to me if I’m not in the EU?

Yes. 100% yes. This is one of the biggest misconceptions about GDPR.

GDPR applies when:

  • You’re scraping data about EU residents, OR
  • You’re offering goods/services to EU residents, OR
  • You’re monitoring behavior of EU residents

Location doesn’t matter:

Your company: USA 🇺🇸
Your servers: India 🇮🇳
Your target data: EU users 🇪🇺
Result: GDPR applies 📋

Real-world examples:

  • Clearview AI: US company, scraped EU faces → €20M fine from Italy
  • ByteDance (TikTok): Chinese company, EU users → Multiple GDPR investigations
  • Your startup: US-based, scrapes LinkedIn profiles of European professionals → GDPR applies

Compliance requirements:

// Must do for GDPR compliance:
const gdprRequirements = {
  legalBasis: 'Legitimate interest assessment',
  dataMinimization: 'Scrape only necessary data',
  documentation: 'Document your reasoning',
  security: 'Appropriate data protection',
  subjectRights: 'Handle deletion/access requests',
  dataRetention: 'Delete data when no longer needed',
  internationalTransfer: 'Adequate safeguards if leaving EU',
};

The fine math: GDPR fines can be up to €20M or 4% of global annual turnover, whichever is higher. For a company with $10M in turnover, 4% is only about $400K, so the €20M figure is the applicable ceiling. Regulators set the actual amount case by case.
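
The “whichever is higher” rule is easy to get backwards, so here is the arithmetic as a tiny sketch. `maxGdprFineEur` is a hypothetical helper that computes only the upper limit for the most serious tier of infringements, not the fine a regulator would actually impose.

```typescript
// Upper limit for the most serious GDPR infringements:
// EUR 20M or 4% of worldwide annual turnover, whichever is HIGHER.
function maxGdprFineEur(annualTurnoverEur: number): number {
  const fourPercent = (annualTurnoverEur * 4) / 100;
  return Math.max(20_000_000, fourPercent);
}

console.log(maxGdprFineEur(10_000_000));    // small company: the EUR 20M figure applies
console.log(maxGdprFineEur(1_000_000_000)); // large company: the 4% figure applies
```

For small companies the fixed €20M figure is the ceiling. The 4% figure only takes over once turnover exceeds €500M.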

What about Terms of Service - can I really ignore them?

Let me be precise: You can technically ignore them, but it’s risky.

The legal reality:

  • ToS violations = breach of contract (civil matter)
  • Not typically criminal (unless combined with other illegal acts)
  • Website can sue you for damages
  • Website can block you, ban you, terminate your accounts

Courts are split:

// Website-friendly rulings:
const websiteWins = {
  clickwrapAgreement: 'Enforceable - you clicked "I agree"',
  browsewrapAgreement: 'Sometimes enforceable',
  explicitScrapingBan: 'Stronger case for website',
};
// Scraper-friendly rulings:
const scraperWins = {
  publicDataException: 'hiQ v LinkedIn precedent',
  antiCompetitiveConcerns: 'Courts hate data monopolies',
  overbroadRestrictions: 'Some restrictions unreasonable',
};

Practical ToS approach:

const tosStrategy = {
  readIt: 'Yes, always read before scraping',
  publicData: 'Generally lower risk per hiQ case',
  explicitProhibition: 'Higher risk - consider alternatives',
  registrationRequired: 'Higher risk - you agreed to terms',
  scale: 'Small scale = lower risk, large scale = higher attention',
};

The gray area: Many sites have anti-scraping clauses but haven’t updated them post-hiQ. Some are still enforceable, some aren’t.

My advice: For public data scraping, read the ToS but proceed with caution if it’s public. For any data behind login, take the ToS seriously.

How do I know if data is “personal data” under GDPR?

Broader than you think. GDPR’s definition of personal data is extremely wide.

What IS personal data:

const personalData = {
  obvious: ['Name', 'Email', 'Phone number', 'Address'],
  lessObvious: [
    'IP address',         // Yes, really
    'Cookie identifiers', // User tracking
    'Device fingerprint', // Browser characteristics
    'Location data',      // GPS or inferred
    'Online identifiers', // Usernames, handles
    'Biometric data',     // Face recognition, fingerprints
    'Professional data',  // Job title, company
    'Behavioral data',    // Browsing patterns
  ],
};

What is NOT personal data:

  • Completely anonymized data (cannot re-identify)
  • Purely statistical data about groups
  • Business information not tied to individuals
  • Publicly available government data (usually)

The identification test: If you could, with reasonable effort, identify the person from the data, it’s personal data.

Real examples:

// Personal data:
const linkedinProfile = {
  name: 'John Smith',          // Personal
  job: 'Software Engineer',    // Personal (professional identity)
  company: 'TechCorp',         // Personal (employment relationship)
  skills: ['Python', 'React'], // Personal (professional characteristics)
};
// Maybe not personal:
const marketData = {
  avgSalary: '$120,000',          // Aggregated, not individual
  jobGrowth: '15%',               // Statistical
  topSkills: ['Python', 'React'], // General market data
};

The key question: If I have “Software Engineer at TechCorp with Python skills” - can I identify a specific person? Usually yes.

Can I scrape for AI training purposes?

This is the hottest legal question of 2026. Short answer: It’s complicated and being actively litigated.

Current lawsuits setting precedents:

const aiTrainingCases2026 = {
  'New York Times v OpenAI': 'Copyright infringement for news articles',
  'Getty Images v Stability AI': 'Copyright for training images',
  'Authors Guild v OpenAI': 'Copyright for book excerpts',
  'Universal Music v Anthropic': 'Copyright for song lyrics',
};

Legal arguments for AI training:

  • Fair use: Transformative use of copyrighted material
  • Publicly available: Data was publicly accessible
  • Research purpose: Scientific and technological advancement
  • No market harm: Different use case than original content

Legal arguments against AI training:

  • Mass copyright violation: Systematic copying
  • Commercial exploitation: Training paid models with free data
  • No attribution: Using content without credit/compensation
  • Market harm: Competing with original content creators

Current status:

  • No definitive rulings yet (cases ongoing)
  • Early rulings suggest fair use is possible but not guaranteed
  • Some companies are choosing licensing deals over litigation (OpenAI signed agreements with News Corp and Axel Springer)
  • Regulations are being proposed specifically for AI training data

Safer approaches for AI training:

const saferAITraining = {
  publicDomainData: 'Use only public domain content',
  licensedData: 'Pay for content licenses',
  syntheticData: 'Generate synthetic training data',
  optInData: 'Use data with explicit consent',
  metadataOnly: 'Train on metadata, not full content',
};

My prediction: Clearer legal frameworks for AI training will emerge as these cases are decided. Until then, proceed with caution for copyrighted content.

What happens if I get a cease and desist letter?

First step: Don’t panic. This happens all the time.

Immediate actions:

const ceaseDesistResponse = {
  step1: 'Stop the scraping immediately',
  step2: 'Consult a lawyer (seriously)',
  step3: 'Preserve all evidence (emails, code, data)',
  step4: 'Respond professionally (no angry replies)',
  step5: 'Negotiate if possible',
};

Understanding the threat level:

const threatLevels = {
  lowThreat: {
    source: 'Random lawyer template email',
    content: 'Generic legal threats',
    action: 'Stop scraping, consider response',
  },
  mediumThreat: {
    source: 'In-house counsel or known law firm',
    content: 'Specific violations mentioned',
    action: 'Consult lawyer, serious consideration',
  },
  highThreat: {
    source: 'Major law firm + actual lawsuit filed',
    content: 'Filed in court with docket number',
    action: 'Lawyer immediately, respond within deadline',
  },
};

Response strategies:

const responseOptions = {
  complyAndStop: 'Safest option, usually sufficient',
  negotiateTerms: 'Maybe get permission with conditions',
  legalChallenge: 'Riskier but sometimes necessary',
  ignore: 'Very risky, can escalate quickly',
};

What companies typically want:

  • You to stop scraping their data
  • You to delete any data you’ve collected
  • Assurance you won’t resume
  • Sometimes: Information about what you collected/why

The good news: Most cases settle without actual lawsuits if you respond reasonably and comply with their requests.

Should I incorporate my company for liability protection?

Absolutely yes if you’re doing commercial scraping at scale.

Why incorporation matters:

const liabilityProtection = {
  withoutLLC: {
    personalAssets: 'At risk',
    personalBankruptcy: 'Possible if sued',
    companyDebts: 'Your personal responsibility',
  },
  withLLC: {
    personalAssets: 'Generally protected',
    companyLiability: 'Limited to company assets',
    personalRisk: 'Much lower',
  },
};

Best structure for scraping businesses:

const businessStructure = {
  type: 'LLC or Corporation',
  location: 'Consider Delaware or Wyoming (business-friendly)',
  insurance: 'General liability + cyber insurance',
  contracts: 'Client agreements with liability clauses',
  compliance: 'Legal compliance programs documented',
};

Insurance considerations:

  • General liability: Covers basic business operations
  • Cyber insurance: Covers data breaches, cyber incidents
  • Errors & omissions: Covers professional mistakes
  • Media liability: Covers copyright/trademark issues

The reality: A well-structured LLC with proper insurance can survive most scraping lawsuits. An individual scraping operation could face bankruptcy from one lawsuit.

Cost vs benefit:

const llcCosts = {
  formation: '$500-2000 (one-time)',
  annualMaintenance: '$500-1000',
  insurance: '$1000-5000/year',
  legalSetup: '$2000-5000',
};
const potentialSavings = {
  averageLawsuitCost: '$50,000-500,000',
  personalBankruptcyProtection: 'Priceless',
  businessContinuity: 'Essential',
};

Key Takeaways

  1. Public data scraping is generally legal — The hiQ v. LinkedIn case established that scraping publicly accessible data doesn’t violate the CFAA.

  2. Personal data adds GDPR complexity — If you’re scraping data about EU individuals, you need a legitimate interest and proper data handling.

  3. ToS violations are civil, not criminal — Breaking terms of service might get you sued, but it’s not a computer crime.

  4. Copyright protects expression, not facts — Prices, names, and data points aren’t copyrightable, but articles and creative content are.

  5. Ethics matter beyond legality — Respect rate limits, identify yourself, and minimize data collection.

  6. When in doubt, consult a lawyer — For high-stakes scraping operations, legal counsel is worth the investment.

Scraping Responsibly with GoLogin

GoLogin helps you scrape effectively while maintaining ethical standards:

import puppeteer from 'puppeteer'; // needed for puppeteer.connect below
import { GoLogin } from '@gologin/core';

// `urls` and `extractData` are assumed to be defined elsewhere
const gologin = new GoLogin({
  profileName: 'responsible-scraper',
  // Realistic fingerprint to avoid triggering aggressive blocks
  // But still be a good citizen
});
const { browserWSEndpoint } = await gologin.start();
const browser = await puppeteer.connect({ browserWSEndpoint });
const page = await browser.newPage();

// Add respectful scraping practices
await page.setRequestInterception(true);
page.on('request', (req) => {
  // Don't load images/css to reduce server load
  if (['image', 'stylesheet', 'font'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});

// Respect rate limits
const sleep = (ms: number) => new Promise((resolve) => setTimeout(resolve, ms));
for (const url of urls) {
  await page.goto(url);
  await extractData(page);
  // page.waitForTimeout was removed in recent Puppeteer releases, so use a plain timer
  await sleep(2000 + Math.random() * 1000);
}

Start Scraping Safely

Set up your first responsible scraper. Quick Start →