Use Case

IATO for Data Extraction

You have a list of URLs. You need specific fields off each one — author, last-modified date, price, SKU, schema.org payload. IATO turns that into a one-row-per-URL spreadsheet in minutes, not days. No scrapers to write, no Selenium, no maintenance.

The problem with one-off extraction

Every team eventually needs to pull a specific value off a list of pages. Editorial wants the byline and last-modified date for 422 articles. E-commerce ops needs current price and stock status for 8,000 SKUs. Compliance wants the policy version number from every product doc. Sales wants the contact email on every supplier listing.

The default tools fall short. Bespoke scrapers are an engineering project — by the time they're written, tested, and wrapped in error handling, the deadline has passed and you own a fragile script that breaks the next time the markup shifts. GUI scrapers like Octoparse or ParseHub start at $99/month and force you to learn a per-template visual workflow. Standard SEO crawlers capture meta tags and headings but won't pull the specific cell value buried in a sidebar div. Manual copy-paste works for ten URLs and falls apart at a hundred.

Extraction is fundamentally a "rule once, run forever" problem, but most tools force you to rebuild the rule every time.

How IATO solves it in four phases

Phase 1: Define your selectors

Pick a field, write a selector. Open Settings → Extraction Rules and create one rule per field you want. CSS, XPath, or regex — whichever maps cleanest to the markup. For schema.org-rich sites that's almost always a one-line CSS selector: .author-name, time[itemprop="dateModified"], span.price.

Choose what to capture. Each rule has a target: text content, HTML, a specific attribute (the datetime attribute on a <time> element, the content attribute on a meta tag), or just the count of matches. Toggle Match all if a page has multiple instances and you want every one — IATO expands those into numbered columns automatically.
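
As a rough mental model of what a set of rules produces, here is a stdlib-only sketch using regex rules (one of IATO's three rule types). The patterns, field names, and sample HTML are illustrative, not IATO's internals:

```python
import re

# Hypothetical rule set: one (field, pattern) pair per spreadsheet column.
# Group 1 of each pattern is the captured value.
RULES = {
    "author":   re.compile(r'class="author-name"[^>]*>([^<]+)<'),
    "modified": re.compile(r'<time[^>]*itemprop="dateModified"[^>]*datetime="([^"]+)"'),
    "price":    re.compile(r'class="price"[^>]*>([^<]+)<'),
}

def extract(html: str) -> dict:
    """Apply every rule to one page; None marks a missed match."""
    row = {}
    for field, pattern in RULES.items():
        m = pattern.search(html)
        row[field] = m.group(1).strip() if m else None
    return row

page = """
<span class="author-name">Jane Doe</span>
<time itemprop="dateModified" datetime="2024-05-01">May 1</time>
<span class="price">$19.99</span>
"""
row = extract(page)
```

One page in, one dict out; a crawl is just this applied to every URL on the list.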

Phase 2: Test before you commit

Validate against any URL. The rule edit form has an inline Run test button. Paste any URL — including the one you'll eventually crawl — and IATO fetches it, applies the rule, and shows you exactly what comes back. No crawl needed. You see the actual extracted values within seconds, fix the selector if it's wrong, and only save once it's right.

This is the difference between writing a scraper that "should work" and shipping one that does work. The test loop catches selector typos, missing fallbacks, and ambiguous matches before they multiply across hundreds of URLs.
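
The test loop amounts to: apply the rule to one real page, look at the result, only save when it's right. A minimal sketch, with a hypothetical `run_test` helper and a deliberately typo'd selector:

```python
import re

def run_test(pattern: str, html: str):
    """Mimic the inline 'Run test': apply one rule to one fetched page
    and return what would land in the spreadsheet (None = no match)."""
    matches = re.findall(pattern, html)
    return matches if matches else None

page = '<span class="byline">By <a class="author-name">Ada</a></span>'

# Typo'd class name: nothing comes back, so you fix it before crawling.
bad = run_test(r'class="authorname">([^<]+)<', page)    # -> None

# Corrected selector: the actual value, confirmed before you commit.
good = run_test(r'class="author-name">([^<]+)<', page)  # -> ["Ada"]
```

A `None` on a known-good URL means the selector is wrong, not the page.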

Phase 3: Import your URL list

Skip discovery; just crawl what you list. In the New Crawl modal, switch the mode toggle from Discover to Import URL list. Paste 5 URLs or 5,000. Or click Upload CSV and select the URL column from any spreadsheet — IATO scans every column for https:// matches, so it doesn't matter which column the URLs live in.

The crawler visits exactly those URLs. Import mode auto-sets max_depth=0, disables sitemap discovery, and turns off external-link following. The crawler fetches each URL once, applies your selectors, and stops. Nothing else gets touched.
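
The column scan is simple to picture: count https:// prefixes per column and take the winner. A stdlib sketch of how such a scan could behave (the function name and sample CSV are made up, not IATO's code):

```python
import csv
import io

def sniff_url_column(csv_text: str) -> int:
    """Guess which column holds URLs by counting https:// prefixes."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    width = max(len(r) for r in rows)
    counts = [
        sum(1 for r in rows if i < len(r) and r[i].startswith("https://"))
        for i in range(width)
    ]
    return counts.index(max(counts))

sample = (
    "Title,Owner,Link\n"
    "Pricing page,dana,https://example.com/pricing\n"
    "Docs home,lee,https://example.com/docs\n"
)
col = sniff_url_column(sample)  # URLs live in the third column
urls = [
    r[col]
    for r in csv.reader(io.StringIO(sample))
    if len(r) > col and r[col].startswith("https://")
]
```

Header rows and non-URL cells drop out automatically, since only https:// values survive the filter.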

Phase 4: Export the spreadsheet you wanted

One row per URL, one column per rule. When the crawl finishes, the Extracted Data tab shows your data in pivoted form — exactly the shape you'd build in Excel by hand. URL down the left side, each rule's value across the top.

Multi-match fields expand into numbered columns. If a rule with Match all enabled found three distinct authors on a single page, the row gains Author, Author #2, and Author #3 columns. Atomic cells, sortable in Excel, no in-cell separators to parse later.

CSV export matches the on-screen layout. Click Export CSV and download the same shape. Drop it into your master spreadsheet, run an XLOOKUP keyed on URL, done.
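
The pivot itself is easy to reason about: one column per field, widened to the largest match count seen for that field. A hedged stdlib sketch (the `pivot` function and sample data are illustrative, not the export code):

```python
import csv
import io

def pivot(results: dict) -> str:
    """Pivot {url: {field: [values]}} into one-row-per-URL CSV,
    expanding multi-match fields into numbered columns."""
    # Column plan: each field gets as many columns as its widest row needs.
    widths = {}
    for row in results.values():
        for field, values in row.items():
            widths[field] = max(widths.get(field, 1), len(values))
    header = ["URL"]
    for field, n in widths.items():
        header += [field] + [f"{field} #{i}" for i in range(2, n + 1)]
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(header)
    for url, row in results.items():
        cells = [url]
        for field, n in widths.items():
            values = row.get(field, [])
            cells += [values[i] if i < len(values) else "" for i in range(n)]
        writer.writerow(cells)
    return out.getvalue()

data = {
    "https://example.com/a": {"Author": ["Ada", "Grace", "Edsger"], "Price": ["$9"]},
    "https://example.com/b": {"Author": ["Alan"], "Price": []},
}
sheet = pivot(data)
```

Rows with fewer matches get empty cells, so every row lines up under the same header.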

What kinds of fields can you extract?

Common fields

Authors and bylines. Published and modified dates. Prices and currencies. SKUs and product IDs. Schema.org JSON-LD payloads. Open Graph tags. Twitter card metadata. Breadcrumb trails. Star ratings and review counts. Custom data-* attributes. Any cell of any table on any page.
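
JSON-LD is a good example of a field worth pulling whole: the payload lives in a script tag, and once extracted it parses as ordinary JSON. A sketch of what such a rule could return (the helper and sample page are hypothetical):

```python
import json
import re

def jsonld_payloads(html: str) -> list:
    """Pull schema.org JSON-LD blobs out of raw HTML and parse them."""
    pattern = r'<script type="application/ld\+json">(.*?)</script>'
    return [json.loads(m) for m in re.findall(pattern, html, re.DOTALL)]

page = '''<script type="application/ld+json">
{"@type": "Product", "sku": "A-123", "offers": {"price": "19.99"}}
</script>'''
payloads = jsonld_payloads(page)
```

From there, SKU, price, and rating are dictionary lookups rather than further selectors.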

Selector types

CSS for the 80% case — element class, attribute selector, descendant. XPath when you need DOM traversal that CSS can't express (parent navigation, indexed siblings, text-content matching). Regex for plucking values out of raw HTML when there's no clean container. Combine with match_all for repeating elements like comment authors or product variants.
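
To make the CSS-vs-XPath split concrete, here is a sketch using Python's stdlib ElementTree, which supports only a limited XPath subset (full engines like lxml or a browser handle parent navigation and text matching too); the markup and field names are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Well-formed snippet standing in for a fetched page.
doc = ET.fromstring(
    '<div>'
    '<ul><li class="variant">Red</li><li class="variant">Blue</li></ul>'
    '<time itemprop="dateModified" datetime="2024-05-01">May 2024</time>'
    '</div>'
)

# Repeating elements: the XPath analogue of a Match all rule.
variants = [li.text for li in doc.findall('.//li[@class="variant"]')]

# Attribute capture: target the datetime attribute, not the text.
modified = doc.find('.//time[@itemprop="dateModified"]').get("datetime")
```

The same two fields would be `.variant` and `time[itemprop="dateModified"]` as CSS rules; XPath earns its keep once the path has to climb or count siblings.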

How IATO compares

| Approach | Setup time | Cost | IATO |
| --- | --- | --- | --- |
| Custom Python scraper | Hours to days | Engineering time | Minutes |
| Octoparse / ParseHub | Per-template GUI workflow | $99–$249/mo | Free up to 500 pages |
| Manual copy-paste | Hours per 100 URLs | Human time | Seconds per 100 URLs |
| Standard SEO crawler | Captures meta only | No custom fields | Any selector, any field |
| Browser automation (Playwright) | Maintain a script | Engineering + infra | Define a rule, done |

Related resources

Extraction often runs alongside a broader audit — see our Content Audit guide and the content-audit glossary entry. For the underlying crawl mechanics, read up on Technical SEO fundamentals and crawl depth.

Stop scraping. Start extracting.

The Free Trial extracts from up to 500 pages. CSS, XPath, and regex selectors included.

Get Started — It's Free
Free Trial · Pivoted CSV export · Test rules before crawling