Dataset-attached evaluation

Same data. Same question. Different semantic context.

These four examples use static synthetic CSVs from the second `to-prompt` evaluation. Copy a question and its CSV files into your own model session, then compare the raw answer with an answer grounded by the linked rawctx package.
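
One way to set up that comparison is to build both prompts from the same question and CSV text, so that the only difference is the prepended context. The following is a minimal sketch under stated assumptions: `build_prompt`, its section labels, and the sample strings are hypothetical, and the context string stands in for whatever the linked rawctx package provides. It is not the `to-prompt` implementation.

```python
def build_prompt(question, csv_files, context=None):
    """Join optional semantic context, the question, and each CSV into one prompt.

    Hypothetical helper for illustration; the layout is an assumption.
    """
    parts = []
    if context:
        parts.append("Semantic context:\n" + context.strip())
    parts.append("Question:\n" + question.strip())
    for name, text in csv_files.items():
        parts.append(f"File: {name}\n{text.strip()}")
    return "\n\n".join(parts)


# Placeholder inputs; paste the real question and CSV contents here.
question = "Using the attached synthetic dataset, answer the question."
csv_files = {"orders.csv": "id,amount\n1,10\n"}
context = "Placeholder for the package's semantic context."

raw_prompt = build_prompt(question, csv_files)
grounded_prompt = build_prompt(question, csv_files, context=context)
```

Because both prompts share the same question and file sections, any difference between the two answers can be attributed to the context block.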

Result summary

Semantic context mattered most when the dataset had competing metric columns or lifecycle facts.

The examples were selected by impact score, oracle-quality delta, and semantic change size. This page keeps the public result summary, reusable questions, and CSVs, and leaves out operational run logistics.

Top examples

Four packages with the clearest reusable test cases.

Strong shift: GA4 Data API
Strong shift: Shopify Orders
Directional shift: HubSpot Marketing
Directional shift: Salesforce Revenue Usage

Experiment results

What changed after adding `to-prompt` context.

Strong shift

GA4 Data API

Signal: Impact 3/4: material entity and scope shift
Quality: Oracle quality 7 -> 9
Before: The raw answer grouped first-user campaign performance through firstUserSource and firstUserCampaign, then calculated session-source performance separately.
After: The grounded answer used User.newVsReturning for the first-user segment and kept Session.sessionSource, Session.sessionCampaignName, and Ecommerce.ecommercePurchases in their session and purchase scopes.

Strong shift

Shopify Orders

Signal: Impact 3/4: material metric and numeric shift
Quality: Oracle quality 6 -> 9
Before: The raw answer used totalPriceAmount for GMV and AOV, producing larger GMV figures such as web 165 and retail 290.
After: The grounded answer followed the package metric definition and used currentTotalPriceAmount, producing web 150, retail 280, and mobile 120 for GMV.
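
The metric swap can be reproduced in a few lines. This is a minimal sketch, not the evaluation's code: the rows are hypothetical and were invented so that the web and retail totals match the figures reported above, and `channel` is an assumed column name. Only `totalPriceAmount` and `currentTotalPriceAmount` are named in the results.

```python
from collections import defaultdict

# Hypothetical rows (NOT the actual orders.csv), invented so the web and
# retail totals match the reported GMV figures.
ORDERS = [
    {"channel": "web", "totalPriceAmount": 100.0, "currentTotalPriceAmount": 100.0},
    {"channel": "web", "totalPriceAmount": 65.0, "currentTotalPriceAmount": 50.0},
    {"channel": "retail", "totalPriceAmount": 290.0, "currentTotalPriceAmount": 280.0},
]


def gmv_and_aov(rows, amount_field):
    """Sum the chosen amount column per channel and divide by order count."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        totals[row["channel"]] += row[amount_field]
        counts[row["channel"]] += 1
    return {ch: {"gmv": totals[ch], "aov": totals[ch] / counts[ch]} for ch in totals}


raw = gmv_and_aov(ORDERS, "totalPriceAmount")              # raw answer's column
grounded = gmv_and_aov(ORDERS, "currentTotalPriceAmount")  # package definition
```

The aggregation is identical in both calls; only the column choice changes, which is why the shift shows up in every downstream figure. The AOV values here depend on the invented order counts and carry no meaning beyond the illustration.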

Directional shift

HubSpot Marketing

Signal: Impact 3/4 in the material-shift case
Quality: Oracle quality 8 -> 10
Before: The raw answer was unstable: it sometimes used the precomputed email_*_rate fields and sometimes recalculated rates from raw counts and emails_delivered.
After: The grounded answer consistently used email_open_rate, email_click_rate, email_bounce_rate, email_unsubscribe_rate, and email_spam_report_rate.
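
The instability comes from having two ways to arrive at a rate. The sketch below assumes a hypothetical row and a hypothetical raw-count column `email_opens`; only `email_open_rate` and `emails_delivered` are named above. The values are invented to show that the two paths can disagree, for example when a stored rate was computed with a different denominator or rounding.

```python
# Hypothetical row, NOT taken from EmailMarketing.csv.
row = {"emails_delivered": 980, "email_opens": 245, "email_open_rate": 0.245}


def open_rate_precomputed(r):
    """Grounded path: trust the precomputed rate field."""
    return r["email_open_rate"]


def open_rate_recalculated(r):
    """Raw path: recompute the rate from a raw count and emails_delivered."""
    return r["email_opens"] / r["emails_delivered"]


precomputed = open_rate_precomputed(row)
recalculated = open_rate_recalculated(row)
```

An answer that alternates between these two functions from one run to the next will report different rates for the same data, which is the behavior the grounded answer removed.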

Directional shift

Salesforce Revenue Usage

Signal: Impact 3/4 in the material-shift case
Quality: Oracle quality 8 -> 10
Before: The raw answer sometimes added commitment 1,000 and overage 240 together, presenting 1,240 as the billing amount.
After: The grounded answer treated UsageRatableSummary.TotalAmount = 240 as the final overage charge and kept the commitment policy as a separate lifecycle caveat.
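
The arithmetic behind the two readings is short enough to spell out. This is a minimal sketch using the amounts reported above; `commitment_amount` and the output keys are hypothetical names, and only `UsageRatableSummary.TotalAmount` comes from the results.

```python
# Amounts as reported above. Only TotalAmount is a field name from the
# results; commitment_amount is a hypothetical stand-in.
usage_ratable_summary = {"TotalAmount": 240}
commitment_policy = {"commitment_amount": 1000}


def raw_billing_amount(summary, policy):
    """Mistaken reading: commitment and overage added together."""
    return policy["commitment_amount"] + summary["TotalAmount"]


def grounded_billing_amount(summary, policy):
    """Grounded reading: TotalAmount is already the final overage charge;
    the commitment is reported separately as a caveat, not summed in."""
    return {
        "overage_charge": summary["TotalAmount"],
        "commitment_caveat": policy["commitment_amount"],
    }


raw_total = raw_billing_amount(usage_ratable_summary, commitment_policy)
grounded = grounded_billing_amount(usage_ratable_summary, commitment_policy)
```

Keeping the commitment as a separate field rather than an addend mirrors how the grounded answer treated it as a lifecycle caveat.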

Strong shift

GA4 Data API

Context changed first-user analysis from decoy source/campaign fields toward User.newVsReturning and session-scoped attribution.

@pasar6987/ga4-data-api@1.0.1

Question

Using the attached synthetic GA4 dataset, compare first-user segments with session source/campaign performance, then calculate purchase conversions and purchase revenue.

GA4 Data API

Question

User.csv

Session.csv

Event.csv

Ecommerce.csv

Shopify Orders

Question

orders.csv

fulfillments.csv

abandonedCheckouts.csv

HubSpot Marketing

Question

EmailMarketing.csv

Salesforce Revenue Usage

Question

UsageResource.csv

TransactionUsageEntitlement.csv

UsageEntitlementAccount.csv

UsageEntitlementBucket.csv

UsageSummary.csv

UsageBillingPeriodItem.csv

UsageRatableSummary.csv

UsageCommitmentPolicy.csv

UsageOveragePolicy.csv