Start Scraping Safely
Set up your first responsible scraper. Quick Start →
Let me start with the lawyer’s favorite answer: it depends.
But here’s the thing — that’s actually a useful answer once you understand what it depends ON. Web scraping exists in a legal gray area, but there are clear principles that separate risky scraping from safe scraping.
I’m going to break this down in plain English. No law degree required.
Look, here’s the reality: web scraping is massive business and the law is struggling to keep up.
| Metric | Value | Source |
|---|---|---|
| Web scraping market size | $4.2B (2026) → $12.3B (2030) | MarketsandMarkets Report |
| Enterprise scraping adoption | 78% of Fortune 500 companies | Forrester Research 2026 |
| GDPR data scraping fines (2026) | €27.5M in total fines | [GDPR Enforcement Tracker](https://gdpr enforcementtracker.com/) |
| Legal cases involving scraping | 47 major cases filed in 2026 | LegalTech Database |
| CFAA-related scraping cases | Only 3 successful prosecutions since 2018 | EFF Analysis |
| AI training data lawsuits | 12 major cases in 2026 alone | AI Litigation Tracker |
The paradox: Every major tech company scrapes data (Google, Microsoft, OpenAI, Meta), yet they simultaneously sue scrapers. The legal framework is inconsistent because the technology evolved faster than the law.
2026 key developments:
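The market speaks: a $4.2B market growing to $12.3B by 2030 means businesses are investing heavily in scraping. That level of investment suggests most commercial scraping is treated as lawful in practice, though market size alone doesn’t settle any legal question.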
| What You’re Scraping | Generally Legal? | Risk Level |
|---|---|---|
| Publicly accessible data | Usually yes | Low |
| Data behind login | Risky | Medium-High |
| Personal data (EU) | Complicated | High |
| Copyrighted content | Depends on use | Medium |
| Data explicitly forbidden by ToS | Gray area | Medium |
Now let’s dig into why.
The CFAA is the big one in the US. It’s a federal law that makes it illegal to access a computer “without authorization” or to “exceed authorized access.”
For years, this was interpreted broadly. Companies argued that violating their Terms of Service meant you were “exceeding authorized access.” That interpretation was scary for scrapers.
Then came the landmark cases:
hiQ Labs v. LinkedIn is the case that changed everything.
Background: hiQ scraped LinkedIn’s public profiles to provide workforce analytics. LinkedIn sent cease-and-desist letters, then blocked hiQ’s IP addresses.
Ruling: The Ninth Circuit ruled in 2019, and again in 2022 after a Supreme Court remand, that scraping publicly accessible data likely doesn’t violate the CFAA. If the data is available to any member of the public without authentication, accessing it isn’t “unauthorized.”
Key takeaway (paraphrased): the CFAA’s “without authorization” concept does not readily apply to public websites, because there is no authorization requirement for accessing data that is open to everyone.
In Van Buren v. United States (2021), the Supreme Court narrowed the CFAA’s scope significantly.
Ruling: “Exceeds authorized access” only covers accessing information on a computer that someone is not entitled to access AT ALL — not accessing allowed information for improper purposes.
Impact: This makes the “violating ToS = computer crime” argument much weaker.
| Scenario | CFAA Risk | Why |
|---|---|---|
| Scraping public web pages | Low | hiQ ruling — public data doesn’t require authorization |
| Scraping after logging in with real account | Medium | You’re authorized to access, question is scope |
| Scraping with fake accounts | Higher | Creating fake identity could be fraudulent |
| Scraping after IP ban | Gray area | Ban indicates withdrawn authorization |
| Bypassing technical measures | Higher | DMCA might apply if circumventing access controls |
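The “Scraping after IP ban” row suggests that a block can signal withdrawn authorization. One way to act on that in code is to treat explicit refusals as a signal to stop rather than something to route around. The sketch below is a minimal illustration under that assumption; the function name `reactToStatus` and the status-code mapping are hypothetical choices, not a legal rule.

```typescript
// Possible reactions to an HTTP response while scraping.
type Reaction = 'continue' | 'backOff' | 'stop';

// Hypothetical policy: treat explicit refusals as withdrawn authorization.
function reactToStatus(status: number): Reaction {
  // 401/403: the server is refusing access, so stop instead of retrying.
  if (status === 401 || status === 403) return 'stop';
  // 429/503: the server is asking for less traffic, so slow down.
  if (status === 429 || status === 503) return 'backOff';
  // Anything else is handled by normal scraping logic.
  return 'continue';
}
```

A scraper loop could call `reactToStatus(response.status)` after each request and halt the whole job on `'stop'`, which keeps the decision in one auditable place.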
If you’re scraping data that includes information about EU residents, GDPR applies. Full stop. Doesn’t matter where your servers are.
Personal Data: Any information relating to an identified or identifiable person. This includes:
Legal Bases for Processing: You need a legal justification to collect personal data:
For scraping, legitimate interest is usually the only viable basis. But you must:
✓ DO:
- Scrape truly public data (published by the person themselves)
- Document your legitimate interest
- Minimize data collection (don't collect more than needed)
- Secure the data appropriately
- Honor data subject requests (right to erasure, etc.)
- Have a privacy policy explaining your practices

✗ DON'T:
- Scrape private information without consent
- Build profiles on individuals without legal basis
- Ignore data subject access requests
- Keep data longer than necessary
- Transfer EU data to non-adequate countries without safeguards

| Company | Fine | Reason |
|---|---|---|
| Clearview AI | €20M (Italy) | Scraping facial images without consent |
| Clearview AI | £7.5M (UK) | Same: scraping faces |
| Meta | €1.2B | Data transfer violations |
The pattern is clear: scraping personal data at scale without proper legal basis attracts regulatory attention.
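Two items from the DO/DON'T lists, keeping data no longer than necessary and honoring erasure requests, translate fairly directly into code. The following is a minimal sketch, assuming scraped records are held in memory as an array; the `StoredRecord` shape, the field names, and the retention period are all hypothetical.

```typescript
// Hypothetical shape of a stored scraped record.
interface StoredRecord {
  subjectId: string;   // identifier for the person the data relates to
  collectedAt: number; // Unix epoch in milliseconds
  payload: Record<string, string>;
}

const DAY_MS = 24 * 60 * 60 * 1000;

// Keep only records that are still inside the retention window.
function purgeExpired(
  records: StoredRecord[],
  retentionDays: number,
  now: number = Date.now(),
): StoredRecord[] {
  const cutoff = now - retentionDays * DAY_MS;
  return records.filter(record => record.collectedAt >= cutoff);
}

// Honor an erasure request by dropping every record for one data subject.
function eraseSubject(records: StoredRecord[], subjectId: string): StoredRecord[] {
  return records.filter(record => record.subjectId !== subjectId);
}
```

Running `purgeExpired` on a schedule and `eraseSubject` on request gives you something concrete to point to when documenting data handling. A real system would apply the same logic to its database and backups.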
Scraping copyrighted content adds another layer of complexity.
The EU has a special “sui generis” database right. If someone invested substantial effort in creating a database, extracting substantial portions may infringe this right — even if individual entries aren’t copyrighted.
Example: A phone directory’s individual listings aren’t copyrighted, but systematically extracting the whole database could violate the database right.
Here’s where it gets philosophically interesting.
Courts have generally found clickwrap agreements (where you click “I agree”) enforceable. Browsewrap agreements (where terms are just linked at the bottom of the page) are shakier.
But here’s the thing: ToS violations are typically breach of contract, not crimes. The remedy is civil, not criminal.
In hiQ v. LinkedIn, the court noted that LinkedIn’s attempts to use contract law to prevent scraping of public data raised anti-competitive concerns.
| ToS Says | Risk Level | Recommendation |
|---|---|---|
| Nothing about scraping | Low | Proceed carefully |
| “No scraping” general prohibition | Low-Medium | Public data likely still OK |
| “No automated access” | Medium | Technical prohibition, debatable |
| “We will sue you” specific threat | Medium-High | They’re serious about enforcement |
| Registration wall + no-scraping ToS | Higher | You agreed to something specific |
Scenario: Scraping product prices from retail websites for a comparison service.
Legal Analysis:
Risk: Low-Medium
Best Practice: Use reasonable request rates, identify your scraper, cache aggressively.
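“Cache aggressively” can be as simple as remembering each page's result for a fixed time so the same URL is not requested twice within that window. Below is a minimal sketch of a time-to-live cache; the class name `TtlCache` and the injectable clock are illustrative assumptions, not part of any library.

```typescript
interface CacheEntry<V> {
  value: V;
  expiresAt: number; // epoch milliseconds after which the entry is stale
}

// Small in-memory cache whose entries expire after ttlMs milliseconds.
class TtlCache<V> {
  private readonly entries = new Map<string, CacheEntry<V>>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so the expiry logic can be tested without waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Return the cached value, or undefined if missing or expired.
  get(key: string): V | undefined {
    const hit = this.entries.get(key);
    if (hit === undefined) return undefined;
    if (this.now() >= hit.expiresAt) {
      this.entries.delete(key);
      return undefined;
    }
    return hit.value;
  }

  // Store a value and stamp it with its expiry time.
  set(key: string, value: V): void {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

In a price monitor you would check `cache.get(url)` before making a request and call `cache.set(url, price)` afterwards. A TTL of several hours is often enough for prices that change daily, and it cuts load on the target site accordingly.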
Scenario: Scraping public LinkedIn profiles for recruiting analytics.
Legal Analysis:
Risk: Medium (higher if you’re in EU or scraping EU profiles)
Best Practice: Document your legitimate interest, have GDPR-compliant data handling, consider using official APIs if available.
Scenario: Scraping full news articles for an aggregation service.
Legal Analysis:
Risk: Medium-High for full text
Best Practice: Scrape headlines and metadata, link to original source, consider licensing content.
Scenario: Scraping publicly available government records.
Legal Analysis:
Risk: Low
Best Practice: Check specific agency policies, respect technical limitations.
Legal is the floor, not the ceiling. Here’s what responsible scraping looks like:
```txt
User-agent: *
Disallow: /private/
Crawl-delay: 10
```

```typescript
// Don't hammer servers
const delay = (ms: number) => new Promise(r => setTimeout(r, ms));

for (const url of urls) {
  await scrape(url);
  await delay(2000); // 2 seconds between requests
}
```

```typescript
// Good: Identify your scraper
const headers = {
  'User-Agent': 'MyScraper/1.0 (contact@mycompany.com)',
};
```

```typescript
// Bad: Pretend to be a regular browser
const headers = {
  'User-Agent': 'Mozilla/5.0 (Windows NT 10.0...',
};
```

Wait, doesn’t GoLogin help you pretend to be a regular browser? Yes, and there’s a difference:
You can have a realistic fingerprint while still being identifiable through other means (contact email, company info, etc.).
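Coming back to the robots.txt rules shown earlier in this section (`Disallow` and `Crawl-delay`), here is a rough sketch of how a scraper might read them before fetching anything. It is deliberately simplified: it only reads the `User-agent: *` group, treats `Disallow` values as plain path prefixes, and ignores `Allow` lines and wildcards. The names `parseRobots` and `isAllowed` are hypothetical.

```typescript
interface RobotsRules {
  disallow: string[];               // path prefixes that should not be fetched
  crawlDelaySeconds: number | null; // requested pause between requests, if any
}

// Parse only the rules that apply to all user agents (`User-agent: *`).
function parseRobots(text: string): RobotsRules {
  const rules: RobotsRules = { disallow: [], crawlDelaySeconds: null };
  let inWildcardGroup = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.split('#')[0].trim(); // drop comments
    const sep = line.indexOf(':');
    if (sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === 'user-agent') {
      inWildcardGroup = value === '*';
    } else if (!inWildcardGroup || value === '') {
      continue;
    } else if (field === 'disallow') {
      rules.disallow.push(value);
    } else if (field === 'crawl-delay') {
      const seconds = Number(value);
      if (Number.isFinite(seconds)) rules.crawlDelaySeconds = seconds;
    }
  }
  return rules;
}

// A path is allowed unless it starts with a disallowed prefix.
function isAllowed(path: string, rules: RobotsRules): boolean {
  return !rules.disallow.some(prefix => path.startsWith(prefix));
}
```

A scraper could skip any URL where `isAllowed` returns false and use `crawlDelaySeconds` (when present) in place of a hard-coded delay. For production use, a maintained robots.txt parser that handles the full syntax is the safer choice.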
Collect only what you need. Don’t scrape entire profiles when you only need names.
```typescript
// Good: Scrape only needed fields
const data = {
  productName: await page.$eval('.name', el => el.textContent),
  price: await page.$eval('.price', el => el.textContent),
};
```

```typescript
// Bad: Scrape everything "just in case"
const data = await page.evaluate(() => document.body.innerHTML);
```

Before you scrape, run through this:
Some situations are just not worth the risk:
The legal landscape is evolving. Here’s what to watch:
The rise of large language models trained on scraped web data is sparking new legal battles:
These cases will further define what’s permissible with scraped data.
California requires bots interacting with Californians to disclose they’re bots in certain contexts. More states may follow.
New rules on data sharing and reuse are coming. B2G (business-to-government) and B2B data obligations may affect scraping practices.
The Digital Markets Act (EU) and similar laws may force major platforms to provide data access, potentially reducing the need for scraping.
Can you get arrested for scraping? Short answer: Almost certainly not for basic scraping. Long answer: It depends on what you’re scraping and how.
What will NOT get you arrested:
What MIGHT get you arrested (rare cases):
The reality check: Since the hiQ v. LinkedIn decision in 2022, there have been only 3 successful CFAA prosecutions for web scraping. Compare that to thousands of scraping operations running daily.
Prosecution criteria (what makes it criminal):
```typescript
const criminalScraping = {
  unauthorizedAccess: true,   // Bypassing authentication
  financialData: true,        // Banking/financial systems
  encryptionBreaking: true,   // Circumventing security measures
  intentToDefraud: true,      // Using data for fraud
  governmentClassified: true  // National security implications
};

const businessScraping = {
  publicData: true,     // Anyone can access
  commercialUse: true,  // Using for business purposes
  highVolume: true,     // Scaling operations
  toSViolation: false,  // Civil matter, not criminal
};
```

Bottom line: Business scraping that stops when asked to stop is virtually never criminal. Criminal cases involve clear fraudulent intent or breaking actual security barriers.
Does GDPR apply if you’re outside the EU? Yes. 100% yes. This is one of the biggest misconceptions about GDPR.
GDPR applies when:
Location doesn’t matter:
```txt
Your company: USA 🇺🇸
Your servers: India 🇮🇳
Your target data: EU users 🇪🇺
Result: GDPR applies 📋
```

Real-world examples:
Compliance requirements:
```typescript
// Must do for GDPR compliance:
const gdprRequirements = {
  legalBasis: 'Legitimate interest assessment',
  dataMinimization: 'Scrape only necessary data',
  documentation: 'Document your reasoning',
  security: 'Appropriate data protection',
  subjectRights: 'Handle deletion/access requests',
  dataRetention: 'Delete data when no longer needed',
  internationalTransfer: 'Adequate safeguards if leaving EU'
};
```

The fine math: GDPR fines can be up to €20M or 4% of global annual turnover, whichever is higher. For a $10M company, 4% of turnover is only about $400K, so the €20M figure sets the ceiling, not the percentage. The 4% figure only takes over once turnover passes €500M.
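The “whichever is higher” rule is easy to misread, so here is the arithmetic as a small sketch. The function name `maxGdprFineEur` is made up for illustration, and the result is only the statutory ceiling for this tier of fines, not a prediction of what a regulator would actually impose.

```typescript
const FIXED_CAP_EUR = 20_000_000; // the flat €20M figure
const TURNOVER_PERCENT = 4;       // 4% of global annual turnover

// Upper bound of the fine: the higher of the flat cap and 4% of turnover.
function maxGdprFineEur(annualTurnoverEur: number): number {
  const turnoverBased = (annualTurnoverEur * TURNOVER_PERCENT) / 100;
  return Math.max(FIXED_CAP_EUR, turnoverBased);
}
```

With €10M in turnover the function returns 20,000,000, because 4% (€400K) is below the flat cap. With €1B in turnover it returns 40,000,000. The two figures meet at €500M in turnover.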
Let me be precise: You can technically ignore a site’s Terms of Service, but it’s risky.
The legal reality:
Courts are split:
```typescript
// Website-friendly rulings:
const websiteWins = {
  clickwrapAgreement: 'Enforceable - you clicked "I agree"',
  browsewrapAgreement: 'Sometimes enforceable',
  explicitScrapingBan: 'Stronger case for website'
};

// Scraper-friendly rulings:
const scraperWins = {
  publicDataException: 'hiQ v LinkedIn precedent',
  antiCompetitiveConcerns: 'Courts hate data monopolies',
  overbroadRestrictions: 'Some restrictions unreasonable'
};
```

Practical ToS approach:

```typescript
const tosStrategy = {
  readIt: 'Yes, always read before scraping',
  publicData: 'Generally lower risk per hiQ case',
  explicitProhibition: 'Higher risk - consider alternatives',
  registrationRequired: 'Higher risk - you agreed to terms',
  scale: 'Small scale = lower risk, large scale = higher attention'
};
```

The gray area: Many sites have anti-scraping clauses but haven’t updated them post-hiQ. Some are still enforceable, some aren’t.
My advice: For public data, read the ToS and proceed with caution. For any data behind a login, take the ToS seriously.
Personal data is broader than you think. GDPR’s definition of personal data is extremely wide.
What IS personal data:
```typescript
const personalData = {
  obvious: ['Name', 'Email', 'Phone number', 'Address'],
  lessObvious: [
    'IP address',          // Yes, really
    'Cookie identifiers',  // User tracking
    'Device fingerprint',  // Browser characteristics
    'Location data',       // GPS or inferred
    'Online identifiers',  // Usernames, handles
    'Biometric data',      // Face recognition, fingerprints
    'Professional data',   // Job title, company
    'Behavioral data'      // Browsing patterns
  ]
};
```

What is NOT personal data:
The identification test: If you could, with reasonable effort, identify the person from the data, it’s personal data.
Real examples:
```typescript
// Personal data:
const linkedinProfile = {
  name: 'John Smith',         // Personal
  job: 'Software Engineer',   // Personal (professional identity)
  company: 'TechCorp',        // Personal (employment relationship)
  skills: ['Python', 'React'] // Personal (professional characteristics)
};

// Maybe not personal:
const marketData = {
  avgSalary: '$120,000',         // Aggregated, not individual
  jobGrowth: '15%',              // Statistical
  topSkills: ['Python', 'React'] // General market data
};
```

The key question: If I have “Software Engineer at TechCorp with Python skills”, can I identify a specific person? Usually yes.
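As a first-pass aid, a scraper can flag fields whose names suggest personal data before anything is stored. The sketch below is a crude heuristic under stated assumptions: the list of field names is a hypothetical starting point, and matching on names cannot replace the identification test described above, since a combination of harmless-looking fields can still identify someone.

```typescript
// Hypothetical list of normalized field names that usually indicate personal data.
const PERSONAL_FIELD_NAMES = new Set([
  'name', 'email', 'phone', 'phonenumber', 'address', 'ipaddress',
  'cookieid', 'devicefingerprint', 'location', 'username',
  'job', 'jobtitle', 'company', 'skills',
]);

// Lowercase the key and strip separators so `job_title` and `jobTitle` match.
function normalizeKey(key: string): string {
  return key.toLowerCase().replace(/[^a-z0-9]/g, '');
}

// Return the keys of a record that look like personal data.
function flagPersonalFields(record: Record<string, unknown>): string[] {
  return Object.keys(record).filter(key =>
    PERSONAL_FIELD_NAMES.has(normalizeKey(key)),
  );
}
```

Applied to the two example objects, the individual profile has all four of its fields flagged, while the aggregated market data has none. Anything flagged should go through the legitimate-interest and minimization checks before it is kept.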
Whether you can scrape data to train AI models is the hottest legal question of 2026. Short answer: It’s complicated and being actively litigated.
Current lawsuits setting precedents:
```typescript
const aiTrainingCases2026 = {
  'New York Times v OpenAI': 'Copyright infringement for news articles',
  'Getty Images v Stability AI': 'Copyright for training images',
  'Authors Guild v OpenAI': 'Copyright for book excerpts',
  'Universal Music v Anthropic': 'Copyright for song lyrics'
};
```

Legal arguments for AI training:
Legal arguments against AI training:
Current status:
Safer approaches for AI training:
```typescript
const saferAITraining = {
  publicDomainData: 'Use only public domain content',
  licensedData: 'Pay for content licenses',
  syntheticData: 'Generate synthetic training data',
  optInData: 'Use data with explicit consent',
  metadataOnly: 'Train on metadata, not full content'
};
```

My prediction: Over the next few years, clearer legal frameworks for AI training will emerge. Until then, proceed with caution for copyrighted content.
If you receive a cease-and-desist letter, the first step is: don’t panic. This happens all the time.
Immediate actions:
```typescript
const ceaseDesistResponse = {
  step1: 'Stop the scraping immediately',
  step2: 'Consult a lawyer (seriously)',
  step3: 'Preserve all evidence (emails, code, data)',
  step4: 'Respond professionally (no angry replies)',
  step5: 'Negotiate if possible'
};
```

Understanding the threat level:

```typescript
const threatLevels = {
  lowThreat: {
    source: 'Random lawyer template email',
    content: 'Generic legal threats',
    action: 'Stop scraping, consider response'
  },
  mediumThreat: {
    source: 'In-house counsel or known law firm',
    content: 'Specific violations mentioned',
    action: 'Consult lawyer, serious consideration'
  },
  highThreat: {
    source: 'Major law firm + actual lawsuit filed',
    content: 'Filed in court with docket number',
    action: 'Lawyer immediately, respond within deadline'
  }
};
```

Response strategies:

```typescript
const responseOptions = {
  complyAndStop: 'Safest option, usually sufficient',
  negotiateTerms: 'Maybe get permission with conditions',
  legalChallenge: 'Riskier but sometimes necessary',
  ignore: 'Very risky, can escalate quickly'
};
```

What companies typically want:
The good news: Most cases settle without actual lawsuits if you respond reasonably and comply with their requests.
Should you form an LLC? Absolutely yes if you’re doing commercial scraping at scale.
Why incorporation matters:
```typescript
const liabilityProtection = {
  withoutLLC: {
    personalAssets: 'At risk',
    personalBankruptcy: 'Possible if sued',
    companyDebts: 'Your personal responsibility'
  },
  withLLC: {
    personalAssets: 'Generally protected',
    companyLiability: 'Limited to company assets',
    personalRisk: 'Much lower'
  }
};
```

Best structure for scraping businesses:

```typescript
const businessStructure = {
  type: 'LLC or Corporation',
  location: 'Consider Delaware or Wyoming (business-friendly)',
  insurance: 'General liability + cyber insurance',
  contracts: 'Client agreements with liability clauses',
  compliance: 'Legal compliance programs documented'
};
```

Insurance considerations:
The reality: A well-structured LLC with proper insurance can survive most scraping lawsuits. An individual scraping operation could face bankruptcy from one lawsuit.
Cost vs benefit:
```typescript
const llcCosts = {
  formation: '$500-2000 (one-time)',
  annualMaintenance: '$500-1000',
  insurance: '$1000-5000/year',
  legalSetup: '$2000-5000'
};

const potentialSavings = {
  averageLawsuitCost: '$50,000-500,000',
  personalBankruptcyProtection: 'Priceless',
  businessContinuity: 'Essential'
};
```

Public data scraping is generally legal: The hiQ v. LinkedIn case established that scraping publicly accessible data doesn’t violate the CFAA.
Personal data adds GDPR complexity: If you’re scraping data about EU individuals, you need a legitimate interest and proper data handling.
ToS violations are civil, not criminal: Breaking terms of service might get you sued, but it’s not a computer crime.
Copyright protects expression, not facts: Prices, names, and data points aren’t copyrightable, but articles and creative content are.
Ethics matter beyond legality: Respect rate limits, identify yourself, and minimize data collection.
When in doubt, consult a lawyer: For high-stakes scraping operations, legal counsel is worth the investment.
GoLogin helps you scrape effectively while maintaining ethical standards:
```typescript
import puppeteer from 'puppeteer'; // assumed; 'puppeteer-core' also provides connect()
import { GoLogin } from '@gologin/core';

// Simple pause helper (page.waitForTimeout is no longer available in recent Puppeteer releases)
const delay = (ms: number) => new Promise(r => setTimeout(r, ms));

const gologin = new GoLogin({
  profileName: 'responsible-scraper',
  // Realistic fingerprint to avoid triggering aggressive blocks
  // But still be a good citizen
});

const { browserWSEndpoint } = await gologin.start();
const browser = await puppeteer.connect({ browserWSEndpoint });
const page = await browser.newPage();

// Add respectful scraping practices
await page.setRequestInterception(true);
page.on('request', (req) => {
  // Don't load images/css to reduce server load
  if (['image', 'stylesheet', 'font'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});

// Respect rate limits (urls and extractData are placeholders you supply)
for (const url of urls) {
  await page.goto(url);
  await extractData(page);
  await delay(2000 + Math.random() * 1000); // 2-3 seconds between pages
}
```
Bypass Detection Ethically
Handle bot detection without crossing lines. Cloudflare Guide →