The Data Wall AI Can't Climb — And What It Means for Vertical Platforms
When people want information today, more of them are skipping the search bar and going straight to AI. The appeal is intuitive: ask naturally, get a curated answer, skip the legwork.
But ask AI about something happening right now, and you'll often get this: "This event may still be ongoing, but schedules are subject to change — I recommend confirming directly."
Technically accurate. Practically useless.
This isn't just AI being cautious. It's AI drawing a boundary — one that emerged as the industry grappled with hallucination. Once models learned to flag uncertainty, real-time queries became the first casualty. The deeper question, then, isn't why AI hedges. It's why the information that clearly exists somewhere online is still out of reach.
That gap isn't a bug. It's structural.
Why AI Doesn't Know What's Happening in Your Neighborhood
When the Dubai Chewy Cookie craze swept Korea, nobody asked AI where to find one. Everyone already knew: something that spreads by word of mouth in days, generates lines overnight, and disappears a week later isn't something AI can track. Instead, people opened KakaoMap. They scrolled through Naver blogs.
But this isn't just a freshness problem. There's a deeper structural reason AI struggles with real-time information.
AI Lives in a Snapshot of the Past
AI's reluctance around real-time queries is a relatively recent behavioral shift. Not long ago, these same models would confidently describe menus from restaurants that had already closed, or direct users to popups that had ended months earlier.
As hallucination became an industry-wide concern, models began self-limiting — flagging uncertainty rather than confabulating. But the more important question is: why does AI classify these questions as uncertain in the first place?
The answer lies in how LLMs are built.
LLMs work by ingesting data up to a fixed cutoff point and compressing it into hundreds of billions of parameters — a static snapshot of the world at a specific moment in time. Unlike the human brain, which continuously layers new experiences onto existing memory, an LLM stores all knowledge diffused across one massive parameter matrix.
Adding new information means adjusting weights across that entire structure. The risk: overwriting previously learned patterns, degrading language reasoning and contextual understanding in the process. The more you update, the more you risk eroding the model's core capabilities.
Machine learning researchers call this catastrophic forgetting.
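The effect is visible even in a deliberately tiny model. The sketch below is illustrative only, nothing like how LLM training actually runs: it fits a one-parameter regression to task A, then continues training on task B alone, and the single weight that encoded task A is overwritten.

```python
def train(w, data, epochs=200, lr=0.1):
    # plain SGD on a one-parameter model: predict y = w * x
    for _ in range(epochs):
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x
    return w

def mse(w, data):
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

task_a = [(1.0, 2.0), (2.0, 4.0)]   # consistent with w = 2
task_b = [(1.0, 10.0)]              # consistent with w = 10

w = train(0.0, task_a)         # learn task A: w converges to ~2
err_before = mse(w, task_a)    # near zero
w = train(w, task_b)           # continue training on task B only
err_after = mse(w, task_a)     # task A is forgotten: error is now large
```

An LLM has hundreds of billions of such weights instead of one, but the tradeoff is the same shape: the update that absorbs new information is the update that disturbs what was already stored.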
The industry's practical workaround is RAG — Retrieval-Augmented Generation. Rather than retraining the model, RAG retrieves relevant documents from an external database at query time and feeds them as context. Most AI search services are built on this architecture.
But RAG has limits. Because it matches queries to semantically similar documents, asking "Is this restaurant open right now?" may surface a document that says "This restaurant typically opens at 11 AM" — a historically accurate but situationally useless answer.
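The mismatch is easy to reproduce. The toy retriever below uses bag-of-words cosine similarity as a stand-in for the dense embeddings real RAG systems use; the documents and query are invented. It ranks the "typically opens at 11 AM" document highest for an "open right now?" query, because lexical overlap says nothing about whether the information is current.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # cosine similarity over word-count vectors (missing words count as 0)
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = [
    "this restaurant typically opens at 11 am",
    "the chef trained in italy",
]
query = "is this restaurant open right now"

def retrieve(query: str, docs: list) -> str:
    # return the document most similar to the query
    q = Counter(query.lower().split())
    return max(docs, key=lambda d: cosine(q, Counter(d.lower().split())))

retrieve(query, docs)  # → "this restaurant typically opens at 11 am"
```

The retrieved document is historically accurate and topically on-target, and still cannot answer the question asked.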
More fundamentally, RAG can only retrieve what's publicly accessible on the web. For entire categories of information, there's simply nothing there to find.
AI's real-time blind spot isn't a matter of insufficient technology. LLMs are designed to capture the world as it existed at a point in time — a photograph, not a live feed. The information that exists only in this moment lives exclusively with the vertical services that generate and manage it.
The 4 Types of Data AI Cannot Read
If the architecture makes real-time learning structurally difficult, the next question becomes: what, precisely, is off-limits?
Real-Time Status Data
Operational states that change by the minute: whether a restaurant is open, current inventory levels, hospital wait times, delivery availability. The "Preparing" or "Delayed" status on Baemin isn't floating on the web — it's piped directly from the merchant's POS system. No matter how sophisticated the inference, AI cannot know whether that restaurant is open right now. The data lives inside a vertical's operational infrastructure, and nowhere else.
Transaction-Based Data
Economically linked information: credit card benefits, airfare pricing, hotel rates, coupon eligibility. These aren't just frequently updated — they're governed by interlocking variables: annual fees, cashback tiers, promotional windows. The difference between Cardgorilla or Banksalad pulling live data via card company APIs and AI recalling benefit information from training data isn't a matter of degree. It's a matter of reliability.
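A sketch of why the interlocking variables matter; the rule, rates, and field names below are all invented for illustration. The same purchase yields a different answer depending on the date it happens, so a model recalling benefit terms from a months-old training snapshot is structurally prone to mis-pricing.

```python
from datetime import date

def cashback(amount: int, tier: str, promo_until: date, on: date) -> float:
    """Hypothetical benefit rule: base rate by tier plus a promotional bonus."""
    base = {"basic": 0.005, "gold": 0.01, "platinum": 0.02}[tier]
    promo_bonus = 0.01 if on <= promo_until else 0.0  # window expires
    return round(amount * (base + promo_bonus), 2)

# identical purchase, one month apart: the promotional window has closed
inside = cashback(100_000, "gold", date(2025, 6, 30), on=date(2025, 6, 1))   # 2000.0
outside = cashback(100_000, "gold", date(2025, 6, 30), on=date(2025, 7, 1))  # 1000.0
```

A live API call resolves every variable at query time; a training snapshot freezes them at whatever values held months ago.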
Access-Restricted Data
Information that never enters the public web — and therefore can't be crawled at all. Current wait times on CatchTable. Real-time inventory and logistics status on Coupang. Internal pricing logic and personalized offers. This is data created, validated, and managed entirely by the service operator. No amount of web search surfaces it.
Behavioral Data
The layer where discovery converts to action. Finding a restaurant is not the same as booking one. Comparing credit cards is not the same as applying. AI agents are racing to capture this layer — but reservations, applications, and payments can only execute within each vertical's own workflow and authentication architecture.
In practice, the most defensible vertical platforms don't sit in just one of these categories — they occupy several simultaneously. Baemin holds real-time status, transaction-based, access-restricted, and behavioral data at once. Cardgorilla stacks transaction-based, access-restricted, and behavioral data. The more dimensions a vertical controls, the harder it is for AI to disintermediate.
This reframes the central competitive question for the AI era: who generates and validates the data — and who controls it?
When Data Walls Come Down
Controlling data also means controlling when — and whether — to open it. History offers a clear pattern. Platform data supply has always moved through three stages.
Stage 1 — Involuntary: The Platform Collects the Data Itself
When suppliers won't move, platforms move for them. In its earliest days, Baemin's founders physically collected restaurant flyers and manually entered business information. Airbnb, recognizing that listing photo quality directly predicted booking rates, dispatched professional photographers to hosts' homes.
AI services are at this stage right now — crawling publicly available data to build their knowledge base. The reason they can't access real-time data isn't a technology ceiling. Suppliers simply haven't opened the door.
Stage 2 — Incentivized: Data Supply Generates Revenue
Early adopters move when the value proposition becomes legible. OpenTable brought restaurants onto its platform by offering free reservation management software. Restaurants joined for the operational convenience. OpenTable eventually reached 18 million monthly covers — at which point restaurants started reaching out first.
Similar dynamics are emerging in AI. AIEO and GEO services are proliferating rapidly. The perception that "not appearing in Google AI Overview is the same as not existing" is spreading fast. Google is accelerating this by linking Google Business Profile directly to Maps and AI Overview, giving operators structured incentives to self-submit and format their data. First movers are already moving.
Stage 3 — Obligatory: Not Participating Means Losing
The final stage is when competitive pressure removes optionality. When "not being on Baemin means not getting delivery orders" became conventional wisdom in Korean food service, platform participation stopped being a choice. The same logic applied to KakaoMap and Naver Maps — unregistered businesses don't get found, don't accumulate reviews, and don't build credibility.
AIEO and GEO vendors are racing to own this obligatory stage. They're betting that a moment is coming when AI invisibility is operationally equivalent to non-existence.
What determines how fast these transitions happen? Research on online service adoption offers a useful signal.
A 2014 analysis of Skype's real-world adoption data demonstrated for the first time that the probability of service adoption through social influence scales linearly with the proportion of adopters in one's network — one adoption raises surrounding adoption probability, which triggers the next, in a self-reinforcing chain.
A 2020 study refining Granovetter's collective action threshold model found that the critical mass for this chain to become explosive — what researchers call complex contagion — consistently appears around the 20–25% threshold.
These studies examined user-side adoption, not supplier-side data disclosure. But the underlying mechanism likely transfers. When the share of restaurants in a given district on Baemin crossed a certain threshold, the adoption probability for the remaining holdouts almost certainly spiked.
The same inflection logic applies to AI data openness. When AI-driven discovery reaches a critical share of consumer search behavior in a given category — and when enough competitors in that category have already opened their data — the pressure on remaining holdouts becomes structural, not just competitive.
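The threshold dynamic itself is simple enough to simulate. The toy model below is a schematic Granovetter-style cascade, not the cited studies' actual models; the agent count, threshold range, and seed shares are invented. A 5% seed of early adopters stalls, while a 25% seed tips the entire population.

```python
import random

def final_adoption_share(seed_share: float, n: int = 10_000, rng_seed: int = 42) -> float:
    # Each agent adopts once the overall adopter share reaches its personal
    # threshold, drawn here from 0.10-0.50; iterate until nothing changes.
    rng = random.Random(rng_seed)
    thresholds = [rng.uniform(0.10, 0.50) for _ in range(n)]
    adopted = [i < int(seed_share * n) for i in range(n)]  # seed the first k agents
    share = sum(adopted) / n
    while True:
        adopted_next = [a or t <= share for a, t in zip(adopted, thresholds)]
        share_next = sum(adopted_next) / n
        if share_next == share:
            return share
        adopted, share = adopted_next, share_next

final_adoption_share(0.05)  # stalls at the seed: 0.05
final_adoption_share(0.25)  # tips the full population: 1.0
```

The discontinuity is the point: below the critical mass nothing propagates, and just above it the cascade is total, which is why the remaining holdouts' pressure becomes structural rather than gradual.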
The precise timing of that inflection? Nobody knows. And that uncertainty is the most accurate description of where this market stands. The infrastructure is being assembled. But when vertical platforms open their doors is not a technology question. It's a business decision.
The Data Openness Dilemma: Scenarios for Vertical Platforms
AI search is reshaping the starting point of consumer discovery. For vertical platforms, this isn't a marketing channel shift. It's a strategic decision about how much data to expose — and to whom.
The market is already moving. Referral traffic from generative AI sources to U.S. retail sites increased 3,500% between July 2024 and May 2025. Just as Naver blog SEO once determined search visibility and Baemin listing determined delivery revenue, AI visibility is becoming a baseline survival condition — and the pace is accelerating.
But for vertical platforms, the tradeoff is real and structural.
Scenario A — Open: Discoverable, But Disintermediated
Opening an MCP server to supply real-time data to AI platforms drives short-term discoverability. New users find you through AI-powered search. But when consumers complete the full journey — discovery, evaluation, decision — inside the AI interface, there's no reason to open your app.
No app traffic means no behavioral data accumulation. The ranking and curation algorithms built over years migrate to the AI platform. You've traded away the core business logic you spent years building in exchange for visibility.
The proposed middle path is data tiering: open foundational information — location, hours, basic availability — to ensure discoverability, while keeping real-time inventory and personalized offers locked behind the app to drive conversion.
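Mechanically, tiering is a filtering policy at the data boundary. A minimal sketch, where the field names and tier assignments are hypothetical rather than any platform's actual schema:

```python
# hypothetical tier assignments for a restaurant record
PUBLIC_FIELDS = {"name", "location", "hours", "category"}             # open to AI channels
APP_ONLY_FIELDS = {"live_wait_time", "inventory", "personal_offers"}  # conversion drivers

def tiered_view(record: dict, channel: str) -> dict:
    """Expose foundational fields to AI; reserve real-time fields for the app."""
    allowed = PUBLIC_FIELDS if channel == "ai" else PUBLIC_FIELDS | APP_ONLY_FIELDS
    return {k: v for k, v in record.items() if k in allowed}

record = {
    "name": "Example Diner", "location": "Seoul", "hours": "11:00-22:00",
    "category": "korean", "live_wait_time": 35, "personal_offers": ["10% off"],
}
tiered_view(record, "ai")   # foundational fields only: discoverable
tiered_view(record, "app")  # full record, including real-time fields
```

The filter itself is trivial; the hard part, as the next paragraph notes, is having clean, standardized supplier data to filter in the first place.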
But this strategy only works if three things are in place simultaneously: long-tail supplier digitization, data structure standardization, and genuine incentives for suppliers to structure and submit their own data. Getting all three right is harder than it sounds.
Scenario B — Lock: Protected, But Invisible
Keeping data closed preserves the in-app experience and data control. But AI invisibility increasingly means absence from the consumer's starting point.
The penalty accumulates quietly — the way businesses unregistered on Google Maps slowly lost foot traffic. In high-discovery categories, where consumers actively search before deciding, the compounding effect is faster and more severe. The window for holding out is measured by how quickly AI search captures consumer behavior in your category.
Both scenarios carry costs. Open and you lose traffic. Close and you lose discovery. The question for vertical platforms is no longer whether to engage with this dilemma, but which cost they can sustain — and for how long.
The First-Mover Window Is Open
But this dilemma won't persist indefinitely. Baemin once had founders collecting restaurant flyers by hand. Then platform participation became non-negotiable. The same inflection is coming for AI. Nobody knows exactly when.
What vertical platforms need now — before that inflection arrives — is to define their relationship with AI on their own terms. Which data to open. Which data to protect. The decision should be driven by business logic, not by the pressure of AI visibility creeping up behind you.
Verticals that design a data tiering strategy now can use AI as a distribution channel while protecting their core workflow. Those that don't will find themselves pulled into the platform's logic the moment the critical threshold hits.
Two speeds are competing right now: how fast AI captures search, and how fast vertical platforms complete their data strategy. The outcome of that race will determine who remains an independent player in the next era of discovery.
About Kakao Ventures
Founded in 2012 and backed by Kakao — Korea's leading tech platform — Kakao Ventures is one of Korea's most active Seed-stage venture capital firms, with approximately $280M USD in AUM. We partner with founders before the path is fully defined, when conviction in people matters more than proof in numbers.
Our portfolio includes Lunit (AI cancer diagnostics), Rebellions (AI semiconductors), and Dunamu (operator of Upbit, one of Asia's largest crypto exchanges).
If you're building at the edge of what's possible — we'd like to hear from you.