I run production messaging systems for a living, not web defence. But the threat surface I keep encountering on behalf of my clients in 2026 starts at the same place: their content is being harvested at industrial scale by AI training crawlers, mostly from China, and most of them have no idea it's happening.
This article is the assessment I share when a Malaysian client's server bills mysteriously climb, their pages start loading slowly during off-hours, or their original research starts appearing — uncited — inside Chinese AI products they never licensed. The pattern is consistent enough now that it deserves its own playbook.
1. The 2026 scale problem
The numbers are uncomfortable. According to Fortune's reporting, ByteDance's Bytespider — the LLM training crawler operated by TikTok's parent company — was scraping the open web at roughly 25 times the rate of OpenAI's crawlers in late 2024, and 3,000 times the rate of Anthropic's ClaudeBot. Barracuda's 2025 threat spotlight documented similar surges across DeepSeek's crawler infrastructure and Huawei's PetalBot.
Compared with traditional search-engine crawlers, these crawlers have three defining characteristics:
- They generally do not respect robots.txt. Some attempt to; many do not. The polite norms of the early-2010s web have not survived the LLM-training arms race.
- They scrape aggressively. Where Googlebot rate-limits itself out of respect for site bandwidth, training crawlers prioritise speed of data acquisition over the hosting costs they externalise to the site owner.
- They obscure provenance. User agents change, IP ranges rotate, and headers get spoofed. The bot you see today may not be the same bot tomorrow, even from the same operator.
2. The actors I see most in 2026 Malaysian client logs
- Bytespider — ByteDance/TikTok's LLM training crawler. Heavy presence on consumer brand sites and content-rich Malaysian blogs. User-agent contains "Bytespider".
- PetalBot — Huawei's search-and-AI training crawler. Observed on around 2 percent of popular indexed websites globally, and appearing with rising frequency in 2026 Malaysian server logs.
- DeepSeekBot — DeepSeek's training crawler. Detected by Barracuda starting late 2024, growing throughout 2025–2026. Often arrives in conjunction with adversarial probing of public APIs.
- Tencent / ByteDance content-platform crawlers — Variants tied to specific consumer products (Doubao, Tencent Yuanbao, ByteDance Coze) that crawl for product-specific use cases beyond LLM training.
- Generic "gray bot" infrastructure — Bots from cloud-rental IPs (Alibaba Cloud, Tencent Cloud, regional CDN providers) using rotating user agents with no clear publisher attribution. The riskiest category: there is no polite opt-out, because these bots are explicitly designed to evade controls.
3. Why this matters for Malaysian businesses specifically
Three impacts I see consistently:
- Bandwidth and hosting costs. A Malaysian SME with a content-rich website (industry directory, online publication, e-commerce catalogue) can see 30–60% of total bandwidth consumed by AI training crawlers. The hosting bill is often the first signal something is wrong.
- Original research and IP being absorbed without attribution. If your business invests in original case studies, legal analysis, market research, or industry reports, those investments are being consumed into models that compete with your services without paying you a cent.
- PDPA exposure. If your site contains personal data — even semi-public, like staff names and email addresses on team pages — that data is being ingested into models trained outside Malaysia's jurisdiction. The PDPA implications are not yet clear, but the regulatory wind is shifting.
4. The defence stack I deploy in 2026
No single control catches everything. The pattern that works is layered — multiple lightweight controls instead of one heavy one.
Layer 1: robots.txt (polite signal)
Add explicit Disallow rules for the user-agents you don't want training on your content:
User-agent: Bytespider
Disallow: /
User-agent: PetalBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
Honest expectation: this catches the crawlers that respect robots.txt (Anthropic, OpenAI, Google) and signals intent to the rest. For the bots that ignore robots.txt entirely (most Chinese training crawlers fall in this category), the signal is documentary — useful when you need to demonstrate intent in any future legal or commercial conversation.
Layer 2: User-agent and IP-range blocking at the edge
For known offending user-agents, return a 403 at the web server, CDN, or WAF layer rather than serving the content. Cloudflare, BunnyCDN, and most managed hosting providers offer this as a one-click toggle in 2026.
Cloudflare's "Block AI bots" feature in particular is the path of least resistance for most Malaysian SMEs already on the Cloudflare free tier — single click, blocks the major training crawlers, no infrastructure work.
Layer 3: Behavioural rate-limiting
For the gray bots that rotate user-agents and IPs, signature-based blocking fails. Behavioural controls work better: rate limits per-IP-per-minute, anomaly detection on user-agent diversity from a single ASN, and tighter limits on heavy resources (PDFs, large images, sitemap-traversal patterns).
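To make the per-IP-per-minute idea concrete, here is a minimal in-process Python sketch. The window and budgets are illustrative, not recommendations, and a production deployment would normally keep these counters at the CDN, WAF, or in a shared store like Redis rather than in application memory:

```python
import time
from collections import defaultdict, deque

WINDOW = 60.0                            # seconds
LIMITS = {"default": 120, "heavy": 10}   # illustrative per-IP budgets per window

_hits: dict[tuple[str, str], deque] = defaultdict(deque)

def _bucket(path: str) -> str:
    # Heavy resources get a tighter budget: PDFs, archives, sitemap traversal.
    if path.endswith((".pdf", ".zip")) or "sitemap" in path:
        return "heavy"
    return "default"

def allow(ip: str, path: str) -> bool:
    """Sliding-window per-IP limiter with separate budgets per resource class."""
    bucket = _bucket(path)
    q = _hits[(ip, bucket)]
    now = time.monotonic()
    while q and now - q[0] > WINDOW:     # drop hits outside the window
        q.popleft()
    if len(q) >= LIMITS[bucket]:
        return False                     # over budget: serve a 429
    q.append(now)
    return True
```

The separate budget for heavy resources matters because training crawlers disproportionately hammer PDFs and sitemaps, exactly where each request costs the most bandwidth.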
Layer 4: Content gating for high-value pages
For original research, case studies, or downloadable resources, gate behind a lightweight email capture or a basic CAPTCHA. This loses some SEO juice and a fraction of organic conversions, but eliminates 95% of bot harvesting on those specific pages. Reserve this for content where the IP value is real.
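One lightweight way to implement the gate without running a user database is a signed, short-lived download token minted after email capture. A sketch using only the Python standard library; GATE_SECRET, the one-hour TTL, and the token layout are all illustrative choices, not a prescribed scheme:

```python
import hashlib, hmac, os, time

SECRET = os.environ.get("GATE_SECRET", "change-me").encode()  # illustrative

def issue_token(email: str, resource: str, ttl: int = 3600) -> str:
    """Mint a short-lived signed token after a successful email capture."""
    expires = str(int(time.time()) + ttl)
    msg = f"{email}|{resource}|{expires}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return f"{expires}.{sig}"

def token_valid(email: str, resource: str, token: str) -> bool:
    """Check expiry and signature before serving the gated file."""
    expires, _, sig = token.partition(".")
    if not expires.isdigit() or int(expires) < time.time():
        return False
    msg = f"{email}|{resource}|{expires}".encode()
    expected = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, expected)
```

Because the download URL carries a signature, crawlers cannot guess or enumerate it, while a legitimate visitor's link keeps working for the hour.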
Layer 5: Honeypots and watermarking
For organisations sufficiently advanced: serve unique synthetic phrases or watermarked content to bot-detected sessions, then periodically search those phrases on AI products to detect uncited reproduction. This is the discovery layer for proving harvesting in legal contexts. Not a defence per se, but useful in establishing the case.
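The mechanics are simple: when your bot detection flags a session, inject a unique nonsense phrase into the served page and record which session received it. A minimal sketch; the phrase template and the in-memory store are illustrative stand-ins for whatever persistence you already have:

```python
import secrets

def canary_phrase(session_id: str, store: dict[str, str]) -> str:
    """Mint a unique, searchable phrase for pages served to suspected bots."""
    phrase = f"the {secrets.token_hex(4)} lighthouse archive"  # illustrative template
    store[session_id] = phrase  # persist (session, phrase) for later attribution
    return phrase
```

Searching AI products for the stored phrases each quarter then ties specific crawler sessions to specific models, which is the evidence a legal or licensing conversation needs.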
5. What does NOT work
Three things I see Malaysian SMEs try that waste effort:
- Over-aggressive WAF rules. Blocking everything from Chinese cloud IPs blocks legitimate Malaysian users on regional VPNs, regional CDNs, and ASEAN business travellers. False-positive rate is painful.
- Pure JavaScript-only content rendering. Modern training crawlers render pages in headless browsers. JS-only delivery is no longer a defence; if anything, it raises hosting costs by forcing a larger dynamic surface to be served.
- Public-facing "no AI training" notices. Useful as documentation of intent. Has zero technical effect.
6. Practical recommendations for the next 30 days
- Audit your server logs for the top 20 user-agents and top 20 source ASNs over the last 30 days. The picture will be clearer than you expect; a log-audit sketch follows this list.
- Add the standard AI-bot Disallow block to robots.txt today. A 5-minute task; documentary value for years.
- If you're on the Cloudflare free tier, enable the "Block AI bots" toggle immediately. If not on Cloudflare, evaluate moving — for SMEs, the free-tier protection alone justifies the migration.
- Identify the 3–5 highest-IP-value pages on your site. Decide explicitly: leave open for SEO, gate for harvesting protection, or watermark for evidence.
- Set up a quarterly review of your bot defence — these crawlers and tactics change every 90 days.
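For the log audit in the first item of this list, here is a minimal Python sketch that counts user-agents from an access log. It assumes the common combined log format, where the user-agent is the last quoted field on each line, and it leaves out ASN attribution, which needs an external GeoIP/ASN database:

```python
import re
from collections import Counter

# Combined log format: the user-agent is the final quoted field on each line.
UA_RE = re.compile(r'"([^"]*)"\s*$')

def top_user_agents(log_path: str, n: int = 20) -> list[tuple[str, int]]:
    """Return the n most frequent user-agents seen in an access log."""
    counts: Counter[str] = Counter()
    with open(log_path, errors="replace") as f:
        for line in f:
            m = UA_RE.search(line)
            if m:
                counts[m.group(1)] += 1
    return counts.most_common(n)
```

Compare the output against the actor list in section 2; the named training crawlers usually announce themselves in the top 20 before the gray bots do.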
For Malaysian organisations needing to formalise web-content protection alongside AI deployment governance, our AI Agentic Security programme covers the broader threat-and-defence framework. HRDC SBL-KHAS claimable for eligible employers.