AI Scrapers from China: How to Defend Your Website in 2026
Cybersecurity


Bytespider, PetalBot, and DeepSeek's crawlers are harvesting Malaysian websites at scales that didn't exist 18 months ago. The honest threat assessment, and the practical defence stack we run with our clients in 2026.

By Choong Ruey Liuh 2026-04-21 10 min read

I run production messaging systems for a living, not web defence. But the threat surface I keep encountering on behalf of my clients in 2026 starts at the same place: their content is being harvested at industrial scale by AI training crawlers, mostly from China, and most of them have no idea it's happening.

This article is the assessment I share when a Malaysian client's server bills mysteriously climb, their pages start loading slowly during off-hours, or their original research starts appearing — uncited — inside Chinese AI products they never licensed. The pattern is consistent enough now that it deserves its own playbook.

1. The 2026 scale problem

The numbers are uncomfortable. According to Fortune's reporting, ByteDance's Bytespider — the LLM training crawler operated by TikTok's parent company — was scraping the open web at roughly 25 times the rate of OpenAI's crawlers in late 2024, and 3,000 times the rate of Anthropic's ClaudeBot. Barracuda's 2025 threat spotlight documented similar surges across DeepSeek's crawler infrastructure and Huawei's PetalBot.

Three characteristics set these crawlers apart from traditional search-engine crawlers:

  • They generally do not respect robots.txt. Some attempt to. Many do not. The polite norms of the early-2010s web have not survived the LLM-training arms race.
  • They scrape aggressively. Where Googlebot rate-limits itself out of respect for site bandwidth, training crawlers prioritise data acquisition speed over hosting cost externalised to the site owner.
  • They obscure provenance. User agents change, IP ranges rotate, and headers get spoofed. The bot you see today may not be the same bot tomorrow, even from the same operator.

2. The actors I see most in 2026 Malaysian client logs

Bytespider — ByteDance/TikTok's LLM training crawler. Heavy presence on consumer brand sites and content-rich Malaysian blogs. User-agent contains "Bytespider".

PetalBot — Huawei's search-and-AI training crawler. Blocked by around 2 percent of popular indexed websites globally, yet seen with rising frequency in 2026 Malaysian server logs.

DeepSeekBot — DeepSeek's training crawler. Detected by Barracuda starting late 2024, growing throughout 2025–2026. Often arrives in conjunction with adversarial probing of public APIs.

Tencent / ByteDance content-platform crawlers — Variants tied to specific consumer products (Doubao, Tencent Yuanbao, ByteDance Coze) that crawl for product-specific use cases beyond just LLM training.

Generic "gray bot" infrastructure — Bots from cloud-rental IPs (Alibaba Cloud, Tencent Cloud, regional CDN providers) using rotating user agents with no clear publisher attribution. The riskiest category, because these bots cannot be politely opted out of — they are explicitly designed to evade controls.
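As a first-pass triage of your own logs, the named crawlers above can be flagged with simple substring matching. A minimal sketch — the signature list is drawn from the crawlers discussed here, and substring matching only catches operators that identify themselves honestly, not the gray bots that rotate user agents:

```python
# First-pass classifier for self-identifying AI training crawlers.
# Substring matching is a triage signal only: gray bots spoof or rotate
# user agents, so this catches the honest operators, not the evaders.
AI_CRAWLER_SIGNATURES = (
    "bytespider", "petalbot", "deepseek", "gptbot",
    "claudebot", "anthropic-ai", "google-extended", "ccbot",
)

def is_known_ai_crawler(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(sig in ua for sig in AI_CRAWLER_SIGNATURES)
```

Run it over the user-agent column of your access logs to get a quick baseline before investing in heavier tooling.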

3. Why this matters for Malaysian businesses specifically

Three impacts I see consistently:

  • Bandwidth and hosting costs. A Malaysian SME with a content-rich website (industry directory, online publication, e-commerce catalogue) can see 30–60% of total bandwidth consumed by AI training crawlers. The hosting bill is often the first signal something is wrong.
  • Original research and IP being absorbed without attribution. If your business invests in original case studies, legal analysis, market research, or industry reports, those investments are being consumed into models that compete with your services without paying you a cent.
  • PDPA exposure. If your site contains personal data — even semi-public, like staff names and email addresses on team pages — that data is being ingested into models trained outside Malaysia's jurisdiction. The PDPA implications are not yet clear, but the regulatory wind is shifting.

4. The defence stack I deploy in 2026

No single control catches everything. The pattern that works is layered — multiple lightweight controls instead of one heavy one.

Layer 1: robots.txt (polite signal)

Add explicit Disallow rules for the user-agents you don't want training on your content:

User-agent: Bytespider
Disallow: /

User-agent: PetalBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

Honest expectation: this catches the crawlers that respect robots.txt (Anthropic, OpenAI, Google) and signals intent to the rest. For the bots that ignore robots.txt entirely (most Chinese training crawlers fall in this category), the signal is documentary — useful when you need to demonstrate intent in any future legal or commercial conversation.
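Before deploying, it is worth sanity-checking that your rules actually disallow the agents you intend. Python's standard-library robots.txt parser can do this locally — a quick sketch using a trimmed version of the block above:

```python
# Verify robots.txt rules locally with the standard-library parser.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: GPTBot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

# A listed agent is disallowed everywhere; an unlisted agent with no
# wildcard rule is still allowed.
print(rp.can_fetch("Bytespider", "https://example.com/research/"))   # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/research/"))  # True
```

Note the second result: without a `User-agent: *` fallback rule, any crawler you did not name explicitly remains allowed.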

Layer 2: User-agent and IP-range blocking at the edge

For known offending user-agents, return a 403 at the web server, CDN, or WAF layer rather than serving the content. Cloudflare, BunnyCDN, and most managed hosting providers offer this as a one-click toggle in 2026.

Cloudflare's "Block AI bots" feature in particular is the path of least resistance for most Malaysian SMEs already on the Cloudflare free tier — single click, blocks the major training crawlers, no infrastructure work.
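If you are not behind a CDN and control a Python application server, the same user-agent blocking can be sketched as WSGI middleware. This is a fallback, not a recommendation over edge blocking — the request still reaches your server — and the signature list is illustrative:

```python
# Sketch: WSGI middleware returning 403 for known AI-training user agents.
# Edge blocking (CDN/WAF) is preferable; use this only when the application
# layer is all you control. Signatures are illustrative, not exhaustive.
BLOCKED_UA_SUBSTRINGS = ("bytespider", "petalbot", "gptbot", "ccbot")

def block_ai_crawlers(app):
    def middleware(environ, start_response):
        ua = environ.get("HTTP_USER_AGENT", "").lower()
        if any(sig in ua for sig in BLOCKED_UA_SUBSTRINGS):
            start_response("403 Forbidden", [("Content-Type", "text/plain")])
            return [b"Forbidden"]
        return app(environ, start_response)
    return middleware
```

Wrap your existing WSGI app (`app = block_ai_crawlers(app)`) and the check runs before any content is rendered or served.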

Layer 3: Behavioural rate-limiting

For the gray bots that rotate user-agents and IPs, signature-based blocking fails. Behavioural controls work better: rate limits per-IP-per-minute, anomaly detection on user-agent diversity from a single ASN, and tighter limits on heavy resources (PDFs, large images, sitemap-traversal patterns).
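The per-IP-per-minute limit is the simplest of these controls to reason about. A minimal sliding-window sketch — the limit and window values are placeholders, and a production deployment would keep this state in the WAF or a shared store rather than in-process:

```python
# Sketch: sliding-window rate limiter, at most `limit` requests per IP
# per `window` seconds. State is in-process for illustration only.
import time
from collections import defaultdict, deque

class PerIPRateLimiter:
    def __init__(self, limit=60, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def allow(self, ip, now=None):
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False
        q.append(now)
        return True
```

The same pattern extends to the other behavioural signals: key the window on ASN instead of IP to catch user-agent rotation from a single network, or weight heavy resources (PDFs, large images) as multiple hits.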

Layer 4: Content gating for high-value pages

For original research, case studies, or downloadable resources, gate behind a lightweight email capture or a basic CAPTCHA. This loses some SEO juice and a fraction of organic conversions, but eliminates 95% of bot harvesting on those specific pages. Reserve this for content where the IP value is real.
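One lightweight way to implement the gate is a short-lived signed token issued after email capture and verified before serving the asset. A sketch using the standard library — the secret key and TTL are illustrative placeholders, and in production the key would come from configuration, not source code:

```python
# Sketch: HMAC-signed, time-limited download token for gated content.
# SECRET_KEY and TOKEN_TTL are placeholders; load the key from config.
import hashlib, hmac, time

SECRET_KEY = b"rotate-me-in-production"
TOKEN_TTL = 3600  # seconds a token stays valid

def issue_token(email, now=None):
    ts = str(int(time.time() if now is None else now))
    sig = hmac.new(SECRET_KEY, f"{email}:{ts}".encode(), hashlib.sha256).hexdigest()
    return f"{email}:{ts}:{sig}"

def verify_token(token, now=None):
    try:
        email, ts, sig = token.rsplit(":", 2)
        issued = int(ts)
    except ValueError:
        return False
    expected = hmac.new(SECRET_KEY, f"{email}:{ts}".encode(), hashlib.sha256).hexdigest()
    fresh = (time.time() if now is None else now) - issued <= TOKEN_TTL
    return hmac.compare_digest(sig, expected) and fresh
```

Bots that scrape the landing page never receive a valid token, so the gated asset itself stays out of the harvest.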

Layer 5: Honeypots and watermarking

For organisations sufficiently advanced: serve unique synthetic phrases or watermarked content to bot-detected sessions, then periodically search those phrases on AI products to detect uncited reproduction. This is the discovery layer for proving harvesting in legal contexts. Not a defence per se, but useful in establishing the case.
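The core mechanic is generating a phrase unique enough to be searchable later. A minimal sketch — the phrase format is illustrative, and in practice you would log each issued canary against the session it was served to:

```python
# Sketch: generate a unique, searchable "canary" phrase for a bot-flagged
# session. If the phrase later appears in an AI product's output, you have
# evidence the page was harvested. Phrase format is illustrative.
import secrets

def canary_phrase(session_id):
    # Random hex tag keeps every issued phrase globally unique, so a later
    # web/AI-product search for it has essentially zero false positives.
    tag = secrets.token_hex(6)
    return f"(ref: kl-{tag}-{session_id})"
```

Embed the phrase invisibly to humans (for example in a footnote or attribution line served only to bot-flagged sessions), keep a log of issued phrases, and search for them quarterly.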

5. What does NOT work

Three things I see Malaysian SMEs waste effort on:

  • Over-aggressive WAF rules. Blocking everything from Chinese cloud IPs blocks legitimate Malaysian users on regional VPNs, regional CDNs, and ASEAN business travellers. False-positive rate is painful.
  • Pure JavaScript-only content rendering. Modern training crawlers render pages in headless browsers. JS-only delivery is no longer a defence — if anything, it raises hosting costs by forcing a larger dynamic surface to be served.
  • Public-facing "no AI training" notices. Useful as documentation of intent. Has zero technical effect.

6. Practical recommendations for the next 30 days

  • Audit your server logs for top 20 user-agents and top 20 source-ASNs over the last 30 days. The picture will be clearer than you expect.
  • Add the standard AI-bot Disallow block to robots.txt today. 5-minute task; documentary value for years.
  • If you're on Cloudflare free tier, enable the "Block AI bots" toggle immediately. If not on Cloudflare, evaluate moving — for SMEs, the free-tier protection alone justifies the migration.
  • Identify the 3-5 highest-IP-value pages on your site. Decide explicitly: leave open for SEO, gate for harvesting protection, or watermark for evidence.
  • Set up a quarterly review of your bot defence — these crawlers and tactics change every 90 days.
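For the log audit in the first step, the user-agent tally can be done in a few lines. A sketch for the standard Nginx/Apache "combined" log format — adjust the regex if your `log_format` differs:

```python
# Sketch: tally the top user agents in a "combined"-format access log.
# The regex grabs the last quoted field on each line (the user agent).
import re
from collections import Counter

UA_RE = re.compile(r'"[^"]*" "([^"]*)"\s*$')

def top_user_agents(lines, n=20):
    counts = Counter()
    for line in lines:
        m = UA_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts.most_common(n)

# Usage: with open("/var/log/nginx/access.log") as f:
#            for ua, hits in top_user_agents(f): print(hits, ua)
```

Expect the result to surprise you — on the audits described above, AI crawlers frequently dominate the top of this list.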

For Malaysian organisations needing to formalise web-content protection alongside AI deployment governance, our AI Agentic Security programme covers the broader threat-and-defence framework. HRDC SBL-KHAS claimable for eligible employers.

About the author

Choong Ruey Liuh

15+ yrs Messaging Systems · WhatsApp Business API Specialist · APAC

Ruey has 15+ years architecting messaging, CRM, and conversational AI systems for banks, insurers, and consumer brands across Southeast Asia. A Meta Business Partner-grade WhatsApp Business API implementer, he designs WhatsApp-native AI assistants that scale to millions of monthly conversations without breaking compliance or unit economics.

Frequently Asked Questions

Is it legal for AI companies to scrape my website?

The legal situation in 2026 is unsettled. Most jurisdictions (including Malaysia under PDPA and copyright law) have established that scraping copyrighted content for commercial gain without permission is at minimum a civil cause of action. But enforcement against operators based in jurisdictions with weak IP-enforcement reciprocity (China being the most prominent case for Malaysian businesses) is impractical. Practical defence is technical, not legal, in 2026.

Does robots.txt actually stop these crawlers?

Mostly no. Bytespider has a documented history of disregarding robots.txt directives, as do several other major training crawlers from China. The signal is still worth setting because: (1) the polite crawlers (OpenAI's GPTBot, Anthropic's ClaudeBot, Google-Extended) DO honour robots.txt; (2) it documents your intent for any future legal or commercial conversation; and (3) some intermediate crawlers fall back to robots.txt directives as a politeness layer.

Should I just block all traffic from Chinese IP ranges?

Not unless your business has zero legitimate Malaysian users on regional VPNs, ASEAN business travellers visiting from China, or Tencent/Alibaba cloud-hosted services calling your APIs. The false-positive rate on geo-blocking is high. Better approach: per-user-agent and per-ASN rate-limiting, plus the standard AI-bot Disallow block, plus Cloudflare's bot-management features.

How much bandwidth do AI training crawlers actually consume?

On the Malaysian SME sites we audit, AI training crawlers consume 30–60 percent of total bandwidth on content-rich properties, often more on industry directories and online publications. After deploying Cloudflare's AI-bot blocking plus targeted user-agent rules at the WAF, that fraction typically drops to under 5 percent within a week. The hosting-cost saving is meaningful.

Do you offer training that covers AI-bot defence?

Yes — the AI Agentic Security programme covers AI-bot threat modelling, defence stack design, and the broader governance posture (PDPA implications, BNM RMiT alignment for FIs, ISO 27001 controls). HRDC SBL-KHAS claimable for eligible Malaysian employers.

Want to apply this in your organisation?

AITraining2U runs HRDC-claimable corporate AI training for Malaysian organisations — from leadership awareness to hands-on builder workshops. Talk to us about a programme tailored to your team.