This is a design and working prototype, not a deployed client engagement. Every dataset schema, batch script, sitemap generator, and validator below is real. The metrics that survive are runtime metrics from the build itself, not projected client outcomes.
What this system is
A programmatic inbound engine that treats content generation as a data pipeline: enrich a base dataset with structured API context, generate unique per-page introductions through a validated LLM extraction step, and deploy thousands of indexable pages through a hub-and-spoke architecture with a generated sitemap index.
It’s a go-to-market agent on the inbound surface. The agent runs the surface directly: it generates content at scale and publishes it to publicly indexed pages. Supervision is explicit: an output-QA validator gates LLM intros against length and brand-voice rules before any page ships, and a sitemap drip rate-limits indexation requests so the build doesn’t trip spam classifiers.
What motivated the build
A travel marketplace needed to capture “high-intent, low-volume” search queries like “Best boutique hotels in Austin Texas for families” across thousands of cities. The economics of writing 10,000 unique pages manually were unworkable — six-figure cost, two-year timeline, editorial bottleneck the whole way. The underlying data existed (hotels, cities, geography); the missing piece was the architecture that turned database rows into indexable, unique URL endpoints without triggering thin-content penalties.
The design goal: ship the corpus in three weeks, keep the per-page cost sub-cent, and survive Google’s quality thresholds at indexation time.
Technical architecture
graph TD
  %% Phase 1: Data Enrichment
  subgraph P1["Phase 1: Data Enrichment"]
    A["Core DB:<br/>Hotel Names & Locations"] --> B[(Airtable Master)]
    C[Weather API] -->|Zipcode Sync| B
    D[Google Places API] -->|Nearby Attractions| B
  end
  %% Phase 2: Content Generation
  subgraph P2["Phase 2: Content Generation (Node.js)"]
    B --> E["Node Script:<br/>Batch Fetch Rows"]
    E --> F["OpenAI API:<br/>Generate Unique Intro"]
    F -->|Write Back| B
  end
  %% Phase 3: Headless Deployment
  subgraph P3["Phase 3: Headless Deployment"]
    B -->|Real-time Sync| G[Whalesync]
    G --> H["Webflow CMS / Next.js"]
    H --> I((10,000 Live URLs))
  end
  style I fill:#22c55e,color:#fff
Hub-and-spoke site architecture
graph TD
A["Homepage:<br/>/best-hotels"] --> B["State Hub (x50):<br/>/best-hotels/texas"]
A --> C["State Hub:<br/>/best-hotels/california"]
A --> D["State Hub:<br/>/best-hotels/new-york"]
B --> E["City Spoke:<br/>/best-hotels/texas/austin"]
B --> F["City Spoke:<br/>/best-hotels/texas/dallas"]
B --> G["City Spoke:<br/>/best-hotels/texas/houston"]
E --> H["Internal Links:<br/>Nearby City Spokes"]
F --> H
G --> H
style A fill:#6366f1,color:#fff
style B fill:#8b5cf6,color:#fff
style C fill:#8b5cf6,color:#fff
style D fill:#8b5cf6,color:#fff
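The "Nearby City Spokes" links in the diagram need some rule for choosing neighbors. The following is a minimal sketch, not the build's actual linker: it assumes each city row carries `lat` and `lng` (the batch runner reads those fields, although the master schema does not list them) and ranks same-state cities by great-circle distance. `haversineKm`, `nearestSpokes`, and the sample coordinates are illustrative.

```javascript
// Sketch: choose the N closest same-state cities as internal-link targets.
// Assumes each city looks like { slug, state, lat, lng }; all names are hypothetical.
const EARTH_RADIUS_KM = 6371;

// Great-circle distance between two { lat, lng } points, in kilometres.
function haversineKm(a, b) {
  const toRad = (deg) => (deg * Math.PI) / 180;
  const dLat = toRad(b.lat - a.lat);
  const dLng = toRad(b.lng - a.lng);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(a.lat)) * Math.cos(toRad(b.lat)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Return spoke URLs for the `count` nearest cities in the same state.
function nearestSpokes(city, allCities, count = 3) {
  return allCities
    .filter((c) => c.state === city.state && c.slug !== city.slug)
    .map((c) => ({ slug: c.slug, km: haversineKm(city, c) }))
    .sort((x, y) => x.km - y.km)
    .slice(0, count)
    .map((c) => `/best-hotels/${city.state}/${c.slug}`);
}

const sampleCities = [
  { slug: "austin", state: "texas", lat: 30.2672, lng: -97.7431 },
  { slug: "dallas", state: "texas", lat: 32.7767, lng: -96.797 },
  { slug: "houston", state: "texas", lat: 29.7604, lng: -95.3698 },
  { slug: "san-antonio", state: "texas", lat: 29.4241, lng: -98.4936 },
  { slug: "los-angeles", state: "california", lat: 34.0522, lng: -118.2437 },
];

console.log(nearestSpokes(sampleCities[0], sampleCities, 2));
```

Restricting candidates to the same state keeps every link inside one hub's subtree; relaxing that filter would let border cities link across hubs.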
Architecture-spec table (real numbers from the build)
| Spec | Value | Notes |
|---|---|---|
| Total page count | 10,000+ | 50 state hubs × ~200 city spokes |
| Per-page LLM cost | ~$0.015 per page | GPT-4-turbo, ~50-word intro |
| Total LLM API spend | ~$150 for the full corpus | one-time generation pass |
| Batch concurrency | 5 concurrent calls | p-limit queue throttle |
| Batch cadence | ~50 rows/minute | with 1.2s inter-batch delay |
| LLM output QA rejection rate | ~3% | re-queued with stricter prompt |
| Sitemap structure | sitemap index + 50 per-state sitemaps | ~200 URLs per state sitemap |
| Indexation drip schedule | week 1: 50/day · week 2: 100/day · week 3: 500/day | manual GSC monitoring |
| Programmatic data per page | ≥30% of rendered content | unique data points per city |
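As a quick consistency check, the total spend, batch count, and approximate runtime can be derived from the figures the document gives (page count, per-page cost, batch size of 50, ~50 rows/minute, ~3% rejection rate). This is back-of-envelope arithmetic only; the variable names are illustrative and retries are ignored.

```javascript
// Back-of-envelope check using only numbers stated in this document.
const pages = 10000;
const costPerPageUsd = 0.015;
const batchSize = 50;
const rowsPerMinute = 50;
const rejectionRate = 0.03;

// Round to cents to avoid floating-point noise in the product.
const totalCostUsd = Math.round(pages * costPerPageUsd * 100) / 100;
const batchCount = Math.ceil(pages / batchSize);
const runtimeMinutes = pages / rowsPerMinute;
const expectedRejections = Math.round(pages * rejectionRate);

console.log({ totalCostUsd, batchCount, runtimeMinutes, expectedRejections });
```

At the stated cadence the generation pass takes about 200 minutes, a little over three hours, with roughly 300 intros going back through the stricter prompt.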
Master table schema (Airtable)
| Column | Type | Source | Description |
|---|---|---|---|
| city_name | string | Core DB | City name for URL slug + H1 |
| state | string | Core DB | State for hub grouping |
| zip_code | string | Core DB | Primary key for API lookups |
| hotel_count | integer | Core DB | Listed hotels in city |
| avg_temp_summer | float | Weather API | July avg temperature (°F) |
| avg_temp_winter | float | Weather API | January avg temperature (°F) |
| best_season | string | Weather API | Calculated optimal travel period |
| top_3_landmarks | array | Google Places API | Nearest attractions within 15 mi |
| landmark_distances | array | Google Places API | Miles from city center |
| ai_intro_paragraph | text | OpenAI | GPT-4 generated 50-word intro |
| slug | formula | Calculated | URL-safe {state}/{city} path |
| last_enriched | datetime | System | Timestamp of last enrichment pass |
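The `slug` column is described as a URL-safe `{state}/{city}` path. In Airtable that would be a formula field; the same transformation in JavaScript might look like the sketch below. `slugify` and `buildSlug` are hypothetical helpers, and the normalization rules (accent stripping, hyphenating every non-alphanumeric run) are assumptions rather than the build's actual formula.

```javascript
// Sketch: derive a URL-safe "{state}/{city}" path from a master-table row.
function slugify(text) {
  return text
    .normalize("NFKD") // split accented letters into base letter + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-") // collapse any other run of characters to one hyphen
    .replace(/^-+|-+$/g, ""); // trim leading and trailing hyphens
}

// Expects a row with `state` and `city_name`, as in the schema above.
function buildSlug(row) {
  return `${slugify(row.state)}/${slugify(row.city_name)}`;
}

console.log(buildSlug({ state: "Texas", city_name: "Austin" }));
```

Whatever rule is used, it has to be deterministic and collision-free within a state, because the slug doubles as the page URL and the sitemap entry.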
LLM extraction step (per-page intro)
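// Uses the openai v3 Node SDK (createChatCompletion); the API key is read from the environment.
const { Configuration, OpenAIApi } = require("openai");
const openai = new OpenAIApi(new Configuration({ apiKey: process.env.OPENAI_API_KEY }));

// cityData is expected to carry name, top_landmarks, best_season, and avg_temp_f.
async function generateIntro(cityData) {
  const prompt = `
You are a luxury travel guide. Write a strict 50-word introduction for a landing page about hotels in ${cityData.name}.
You MUST mention the following data naturally:
1. The proximity to ${cityData.top_landmarks.join(", ")}.
2. Suggest visiting during ${cityData.best_season} because the average temperature is ${cityData.avg_temp_f}°F.
Do not use generic filler words like 'bustling' or 'vibrant'.
`;
  const response = await openai.createChatCompletion({
    model: "gpt-4-turbo",
    messages: [{ role: "user", content: prompt }]
  });
  // Return only the generated text, not the whole API response object.
  return response.data.choices[0].message.content.trim();
}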
Batch runner with rate limiting
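const pLimit = require("p-limit"); // require() works with p-limit v3 and earlier; v4+ is ESM-only
const Airtable = require("airtable");
// `airtable`, `fetchWeatherData`, and `fetchNearbyLandmarks` are project helpers defined elsewhere.

const limit = pLimit(5); // at most 5 cities in flight at once
const BATCH_SIZE = 50;
const DELAY_BETWEEN_BATCHES_MS = 1200;

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Split an array into consecutive chunks of `size` items.
function chunkArray(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) chunks.push(items.slice(i, i + size));
  return chunks;
}

async function processAllCities(cities) {
  const batches = chunkArray(cities, BATCH_SIZE);
  let processed = 0;
  const errors = [];
  for (const batch of batches) {
    const promises = batch.map((city) =>
      limit(async () => {
        try {
          const weather = await fetchWeatherData(city.zip_code);
          const places = await fetchNearbyLandmarks(city.lat, city.lng);
          // Use the same three landmarks in the prompt and in the stored column.
          const topPlaces = places.slice(0, 3);
          const intro = await generateIntro({
            ...city,
            ...weather,
            top_landmarks: topPlaces.map((p) => p.name)
          });
          await airtable.update(city.record_id, {
            avg_temp_summer: weather.avg_temp_july,
            top_3_landmarks: JSON.stringify(topPlaces),
            ai_intro_paragraph: intro,
            last_enriched: new Date().toISOString()
          });
          processed++;
        } catch (err) {
          errors.push({ city: city.name, error: err.message });
          // On a rate-limit response, pause this slot before it picks up the next city.
          if (err.status === 429) {
            await sleep(err.retryAfter * 1000 || 60000);
          }
        }
      })
    );
    await Promise.all(promises);
    await sleep(DELAY_BETWEEN_BATCHES_MS);
    console.log(`Processed ${processed}/${cities.length} | Errors: ${errors.length}`);
  }
  return { processed, errors };
}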
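The runner above records a 429 and pauses, but the failed city is not retried within the same pass. One way to add retries with exponential backoff is a small wrapper around each API call. This is a sketch under assumptions: `withBackoff` is a hypothetical helper, the error is assumed to expose `status` and an optional `retryAfter` in seconds (as the runner's catch block implies), and the retry count and base delay are illustrative.

```javascript
// Sketch: retry a rate-limited async call with exponential backoff.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `wait` is injectable so the delay logic can be exercised without real sleeping.
async function withBackoff(fn, { retries = 4, baseMs = 1000, wait = sleep } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      const retryable = err.status === 429;
      // Give up on non-rate-limit errors or once the retry budget is spent.
      if (!retryable || attempt >= retries) throw err;
      // Prefer the server's hint; otherwise double the delay on each attempt.
      const delayMs = err.retryAfter ? err.retryAfter * 1000 : baseMs * 2 ** attempt;
      await wait(delayMs);
    }
  }
}

// Possible use inside the runner (generateIntro is the function shown earlier):
//   const intro = await withBackoff(() => generateIntro(cityData));
```

With the defaults, a call that keeps returning 429 waits 1s, 2s, 4s, and 8s before the error is surfaced to the caller's catch block.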
Dynamic sitemap-index generation
const { writeFile } = require("fs/promises");

// `airtable` is the same project wrapper used by the batch runner (defined elsewhere).
async function generateSitemaps() {
  const states = await airtable.getUniqueStates();
  const sitemapIndex = `<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${states.map((state) => `
  <sitemap>
    <loc>https://example.com/sitemaps/sitemap-${state.slug}.xml</loc>
    <lastmod>${new Date().toISOString()}</lastmod>
  </sitemap>`).join("")}
</sitemapindex>`;
  // Assumed output path: the original snippet builds the index but never writes it.
  await writeFile("./public/sitemap-index.xml", sitemapIndex);

  // One sitemap file per state, each listing that state's city spokes.
  for (const state of states) {
    const cities = await airtable.getCitiesByState(state.name);
    const sitemap = `<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
${cities.map((city) => `
  <url>
    <loc>https://example.com/best-hotels/${state.slug}/${city.slug}</loc>
    <lastmod>${city.last_enriched}</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.7</priority>
  </url>`).join("")}
</urlset>`;
    await writeFile(`./public/sitemaps/sitemap-${state.slug}.xml`, sitemap);
  }
}

generateSitemaps().catch(console.error);
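The drip schedule in the spec table (50/day, then 100/day, then 500/day) can be written as a function from day number to the count of URLs released so far, which a sitemap generator could use to decide which rows to include. The document describes the drip as manually monitored, so this is only a sketch: `DRIP_SCHEDULE` and `releasedCount` are hypothetical, weeks are assumed to be seven days, and holding the final rate after week 3 is an assumption.

```javascript
// Sketch: cumulative number of URLs released by a given day of the drip.
const DRIP_SCHEDULE = [
  { days: 7, perDay: 50 }, // week 1
  { days: 7, perDay: 100 }, // week 2
  { days: 7, perDay: 500 }, // week 3
];

function releasedCount(day, total, schedule = DRIP_SCHEDULE) {
  let remaining = Math.max(0, day);
  let released = 0;
  let lastRate = 0;
  for (const step of schedule) {
    const used = Math.min(remaining, step.days);
    released += used * step.perDay;
    remaining -= used;
    lastRate = step.perDay;
  }
  // Assumption: after the schedule ends, keep releasing at the final rate.
  released += remaining * lastRate;
  return Math.min(released, total);
}

// Example: only the first N URLs of an ordered list go into today's sitemaps.
const allUrls = Array.from({ length: 10000 }, (_, i) => `/best-hotels/page-${i}`);
const liveOnDay10 = allUrls.slice(0, releasedCount(10, allUrls.length));
console.log(liveOnDay10.length);
```

Under these assumptions the three-week ramp releases 4,550 URLs, and a 10,000-URL corpus would be fully released around day 32 if the 500/day rate continues.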
Failure modes (named)
- Index bloat / crawl-budget collapse. Submitting 10,000 URLs to a new domain in one sitemap can trigger spam classifiers and lead to partial indexation. The fix: a sitemap index with 50 per-state sitemaps (~200 URLs each), a hub-and-spoke linking structure so pages aren’t orphaned, and a 3-week indexation drip (50/day → 100/day → 500/day) monitored against GSC’s “discovered, currently not indexed” report.
- API rate-limit cascades. Firing 10,000 concurrent calls at Google Places and OpenAI crashes the run. A p-limit queue caps concurrency at 5 and processes rows in batches of 50, with a 1.2s inter-batch delay and a backoff pause on 429 responses (the server’s retry-after hint, or 60 seconds when none is given).
- LLM output quality drift. Across 10,000 generations, the LLM occasionally produces short (<30 words), generic (“this bustling city”), or hallucinated outputs. A post-generation QA layer rejects ~3% of intros and re-queues them with a stricter prompt. An intro is rejected if it is too short, contains a blocklist term (“vibrant,” “bustling,” “nestled,” “gem”), or omits a required data point.
- Duplicate-content penalties. Programmatic pages that share >70% of their content risk being partially deindexed. The 30% uniqueness rule: at least 30% of rendered content must be unique programmatic data per page (intro + top-3 landmarks + temperature + hotel-specific data), not template tokens.
- Observability gaps. Without trace logging, debugging a 10k-row run is impractical. The fix: structured logs per row, Slack alerts when hard-fail counts exceed a threshold, and a manual review queue for QA rejections.
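The QA gate described in the list above (length, blocklist terms, required data points) can be sketched as a pure function. The 30-word floor and the four blocklist terms come from the text; the upper word bound, the name `validateIntro`, and the shape of the `required` argument are assumptions for illustration, and hallucination checks are out of scope here.

```javascript
// Sketch: gate an LLM intro before it is written back to the master table.
const BLOCKLIST = ["vibrant", "bustling", "nestled", "gem"];
const MIN_WORDS = 30;
const MAX_WORDS = 70; // assumption: generous ceiling for a "50-word" intro

// `required` is a list of strings that must appear in the intro,
// e.g. landmark names and the recommended season.
function validateIntro(intro, required = []) {
  const reasons = [];
  const text = intro.trim();
  const lower = text.toLowerCase();
  const wordCount = text.split(/\s+/).filter(Boolean).length;

  if (wordCount < MIN_WORDS) reasons.push("too_short");
  if (wordCount > MAX_WORDS) reasons.push("too_long");

  // Whole-word match so "gem" does not fire on words that merely contain it.
  for (const term of BLOCKLIST) {
    if (new RegExp(`\\b${term}\\b`).test(lower)) reasons.push(`blocklist:${term}`);
  }
  for (const value of required) {
    if (!lower.includes(String(value).toLowerCase())) reasons.push(`missing:${value}`);
  }
  return { ok: reasons.length === 0, reasons };
}
```

Returning the list of reasons rather than a bare boolean makes it straightforward to log why a row was rejected and to tailor the stricter re-queue prompt to the specific failure.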
Where this fits in the broader system
This is a go-to-market agent on the inbound surface. The architectural patterns — validated LLM outputs, batch rate-limiting, post-generation QA gate — are the same patterns I use for go-to-market agents on the outbound and email surfaces. The class distinction matters because internal agents on the inbound surface (e.g., inbound-lead enrichment) have implicit human supervision; this one runs publicly, so the supervision is explicit.
The GTM inbound encyclopedia covers the inbound architectural pattern at a level above any specific build. The AEO encyclopedia covers the discipline this build’s pages would adopt at the schema and citation-structure level if redeployed in 2026.
Stack
Airtable (master dataset) · Google Places API + Weather API (enrichment) · Node.js + p-limit (batch runner) · OpenAI GPT-4-turbo (LLM extraction step) · Whalesync (Airtable ↔ CMS sync) · Webflow CMS / Next.js (delivery layer) · Google Search Console (indexation monitoring).
Want to talk about something like this?
If you’re scaling content as a data pipeline — programmatic pages, citation-structured AEO content, or hybrid inbound surfaces — hit me up.
Author
Fenil Parekh is a GTM engineer based in the San Francisco Bay Area. He builds internal and go-to-market AI agents (programmatic inbound at scale, signal-driven outbound, intent-targeted paid, lifecycle email) for AI-native B2B SaaS. M.S. Computer Science, ITU San Jose. Currently Lead GTM Engineer (consulting) at Marketing Boutique. Built and broken in the open.
Interlinking
- Sibling cross-link: /library/pseo-indexing-strategy — technique deep-dive on crawl-budget and sitemap architecture.
- Upstream: /library; /gtm/inbound — encyclopedia on inbound GTM engineering.
JSON-LD
TechArticle + HowTo + BreadcrumbList
{
"@context": "https://schema.org",
"@type": "TechArticle",
"@id": "https://fenil.ai/library/pseo-scaling-10k-pages#article",
"url": "https://fenil.ai/library/pseo-scaling-10k-pages",
"headline": "Programmatic SEO at 10k Pages — Case Study",
"description": "Scaling a programmatic SEO build to 10,000+ pages without tripping crawl budget, indexability, or quality thresholds. Architecture and bottlenecks.",
"image": "https://fenil.ai/assets/og/pseo-scaling-10k-pages.png",
"author": { "@id": "https://fenil.ai/#person" },
"publisher": { "@id": "https://fenil.ai/#person" },
"datePublished": "2026-05-10",
"dateModified": "2026-05-10",
"articleSection": "Case study",
"keywords": "programmatic SEO at scale, pSEO crawl budget, pSEO indexation, hub and spoke architecture, AI content generation",
"inLanguage": "en"
}