CSV, Excel, and Google Sheets Ingestion

What gets ingested

One row = one record.
The first non-empty row is treated as the header (column names).
Only visible (non-hidden) Google Sheets tabs are processed.
For Google Sheets, the displayed (formatted) values are ingested, not the raw formulas.
All content is processed as UTF-8 text.

Tip: Each row is converted to a simple, readable block for the AI. Make headers self-explanatory, because they become the labels the AI sees:

{
Product name: Widget A
Price amount: 19.99
Price currency: USD
URL: https://example.com/widget-a
Tags: outdoor; waterproof
}
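
The conversion described above can be pictured with a small sketch. This is only an illustration of the idea, not the actual ingestion code: the function name `rows_to_blocks` and the exact block layout are assumptions, and the real pipeline may differ in detail.

```python
import csv
import io


def rows_to_blocks(csv_text):
    """Turn CSV text into one labeled text block per data row (illustrative only)."""
    reader = csv.reader(io.StringIO(csv_text))
    # Drop fully empty rows so the first non-empty row becomes the header.
    rows = [r for r in reader if any(cell.strip() for cell in r)]
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    blocks = []
    for row in data:
        # Each header becomes the label for the value in the same column.
        lines = [f"{name.strip()}: {value.strip()}" for name, value in zip(header, row)]
        blocks.append("{\n" + "\n".join(lines) + "\n}")
    return blocks


sample = (
    "Product name,Price amount,Price currency\n"
    "Widget A,19.99,USD\n"
)
print(rows_to_blocks(sample)[0])
```

Because every label in the block comes straight from the header row, a terse header such as "PN" would appear verbatim as the label the AI reads.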

Use a single header row with clear, human-readable column names.
Keep exactly one record per row; avoid merged cells.
Make spreadsheet and sheet/tab names meaningful.
Ensure consistent types in each column (all dates in one format, all prices as numbers, etc.).
Prefer one column per concept (split amounts, currencies, statuses, units).
Use visible sheets for data; hide or remove non-data sheets.
Share the file so the crawler account has permission to view it.

Prefer self-explanatory, natural language headers:
- Good: “Product name”, “Return policy”, “Support email”
- Avoid: “PN”, “ret_pol”, “email1”
Use a single header row (Row 1) with no blank cells.
Avoid duplicate headers within the same sheet.
Keep names short but specific. If needed, use “Title Case” or “snake_case”; just be consistent.
Avoid punctuation that looks like data (e.g., “Price ($)”); instead use two columns “Price amount”, “Price currency”.
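
The header rules above lend themselves to a quick pre-upload check. The following is a minimal sketch; the helper `check_headers` and its `min_length` threshold are hypothetical, and the "too terse" test is a crude heuristic rather than anything the ingestion service enforces.

```python
from collections import Counter


def check_headers(headers, min_length=3):
    """Return a list of human-readable problems found in a header row."""
    problems = []
    cleaned = [h.strip() for h in headers]
    for position, name in enumerate(cleaned, start=1):
        if not name:
            problems.append(f"column {position}: blank header")
        elif len(name) < min_length:
            # Assumption: very short names such as "PN" are likely abbreviations.
            problems.append(f"column {position}: '{name}' may be too terse")
    # Compare case-insensitively so "Price" and "price" count as duplicates.
    counts = Counter(name.lower() for name in cleaned if name)
    for name, count in counts.items():
        if count > 1:
            problems.append(f"'{name}' appears {count} times")
    return problems


for problem in check_headers(["Product name", "PN", "", "product name"]):
    print(problem)
```

An empty result means the header row passed these particular checks; it does not guarantee the names are self-explanatory.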

No merged cells, multi-row headers, or pivot tables in the data area.
Avoid blank top rows or columns; put headers on the first non-empty row, with data starting immediately below.
Do not place totals/subtotals within the data table; keep those on a separate sheet.
Keep one dataset per sheet/tab. If you have different entities, use separate tabs.
Name spreadsheet and sheet tabs descriptively; these names are surfaced in document titles.

To ingest a specific sheet, copy the tab URL (it contains “#gid=…”). If you share the document’s root URL without “gid”, the first visible sheet is used by default.
Hidden sheets are skipped. Hide any sheets you do not want ingested.
Ensure the crawler/service account (data@gleen.ai) has at least Viewer access to the file.
Avoid protected ranges that block read access to header or data cells.
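
To confirm that a URL points at a specific tab, you can check whether it carries a gid. The sketch below uses only the Python standard library; the function name `extract_gid` and the `FILE_ID` placeholder are hypothetical, and it assumes the gid appears either in the fragment ("#gid=") or in the query string.

```python
from urllib.parse import urlparse, parse_qs


def extract_gid(sheet_url):
    """Return the gid from a Google Sheets tab URL, or None if absent."""
    parsed = urlparse(sheet_url)
    # Look in the fragment first ("#gid=123"), then fall back to the query string.
    for part in (parsed.fragment, parsed.query):
        values = parse_qs(part).get("gid")
        if values:
            return values[0]
    return None


# FILE_ID is a placeholder, not a real document id.
tab_url = "https://docs.google.com/spreadsheets/d/FILE_ID/edit#gid=123456"
root_url = "https://docs.google.com/spreadsheets/d/FILE_ID/edit"
print(extract_gid(tab_url))   # 123456
print(extract_gid(root_url))  # None
```

A result of `None` means the URL is a root URL, so the first visible sheet would be used.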

Very large files can increase processing time and cost. Consider:
- Splitting large datasets into multiple files or tabs (e.g., by time period or category).
- Removing unused columns before upload.
- Keeping only the latest, relevant rows if historical data is not needed.
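
As a rough illustration of the last two suggestions, the sketch below trims a CSV export before upload. The helper `trim_csv` and the sample column names are made up for this example, and it assumes the newest rows sit at the bottom of the file.

```python
import csv
import io


def trim_csv(csv_text, keep_columns, max_rows=None):
    """Keep only the named columns and, optionally, the last max_rows data rows."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if max_rows is not None:
        # Assumption: the most recent rows are at the end of the file.
        rows = rows[max(len(rows) - max_rows, 0):]
    out = io.StringIO()
    writer = csv.DictWriter(
        out,
        fieldnames=keep_columns,
        extrasaction="ignore",  # silently drop columns not listed in keep_columns
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


source = (
    "Order date,Customer name,Internal note\n"
    "2025-08-24,Ann,x\n"
    "2025-08-25,Bob,y\n"
    "2025-08-26,Cy,z\n"
)
print(trim_csv(source, ["Order date", "Customer name"], max_rows=2))
```

The same idea applies to splitting by time period or category: filter the rows first, then write each subset to its own file or tab.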

Column names
- Good: “Customer name”, “Order date”, “Price amount”, “Price currency”
- Needs Fix: “CN”, “date?”, “$price”, “cur”
Values
- Good: “19.99” and “USD”
- Needs Fix: “$19.99 USD” (mixes amount and currency into one cell)
Dates
- Good: “2025-08-26”
- Needs Fix: “Aug 26th, ’25” (ambiguous)
Lists
- Good: “outdoor; waterproof”
- Needs Fix: “outdoor, waterproof / best” (multiple inconsistent separators)
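
The "Needs Fix" cases above can often be cleaned up with a few lines of script before upload. The helpers below are a hedged sketch: the names `split_price`, `to_iso_date`, and `normalize_list` are hypothetical, the price pattern only handles simple amounts followed by a three-letter code, and the date helper assumes you already know the input format.

```python
import re
from datetime import datetime


def split_price(cell):
    """Split a cell such as '$19.99 USD' into an (amount, currency) pair."""
    match = re.search(r"(\d+(?:\.\d+)?)\s*([A-Za-z]{3})", cell)
    if not match:
        return None  # leave unrecognized cells for manual review
    return match.group(1), match.group(2).upper()


def to_iso_date(cell, input_format):
    """Convert a date in a known input format to YYYY-MM-DD."""
    return datetime.strptime(cell, input_format).date().isoformat()


def normalize_list(cell):
    """Rewrite a list that mixes ';', ',' and '/' so it uses '; ' throughout."""
    parts = re.split(r"[;,/]", cell)
    return "; ".join(p.strip() for p in parts if p.strip())


print(split_price("$19.99 USD"))                     # ('19.99', 'USD')
print(to_iso_date("08/26/2025", "%m/%d/%Y"))         # 2025-08-26
print(normalize_list("outdoor, waterproof / best"))  # outdoor; waterproof; best
```

Amounts with thousands separators or values that legitimately contain commas or slashes would need more careful handling than this sketch provides.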

Missing or unclear headers; duplicate column names.
Multi-row or merged headers; blank header cells.
Merged cells anywhere in the data area.
Embedding totals/subtotals within the data table.
Hyperlinks with display text instead of a plain “URL” column.
Mixed data types in the same column (e.g., numbers, “N/A”, and dates).
Hidden sheets that you intended to ingest, or visible sheets you didn’t intend to ingest.
Very wide tables with many unused columns; remove noise.
Clear, descriptive headers and consistent values are the single biggest factor in good results.
Use one dataset per visible sheet, keep layout simple, and avoid merged cells or complex formatting.
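
Mixed data types are easy to miss by eye, so a small scan can help. The sketch below is an assumption-laden illustration: `classify` and `mixed_type_columns` are hypothetical helpers, only YYYY-MM-DD is recognized as a date, and anything that is neither a number nor such a date is treated as text.

```python
import csv
import io
from datetime import datetime


def classify(value):
    """Label a cell as 'empty', 'number', 'date' (YYYY-MM-DD only), or 'text'."""
    value = value.strip()
    if not value:
        return "empty"
    try:
        float(value)
        return "number"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return "date"
    except ValueError:
        return "text"


def mixed_type_columns(csv_text):
    """Map each column that contains more than one kind of value to those kinds."""
    kinds = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        for name, value in row.items():
            if name is None:
                continue  # cells beyond the header width; ignored in this sketch
            kind = classify(value or "")
            if kind != "empty":
                kinds.setdefault(name, set()).add(kind)
    return {name: sorted(found) for name, found in kinds.items() if len(found) > 1}


data = (
    "Price amount,Order date\n"
    "19.99,2025-08-26\n"
    "N/A,2025-08-27\n"
)
print(mixed_type_columns(data))  # {'Price amount': ['number', 'text']}
```

Columns reported here are candidates for cleanup, for example by replacing "N/A" with an empty cell.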
