
Training Data Sources

Alhena AI supports various types of data sources for building a bot's knowledge base.

Supported Data Crawling Sources in Alhena AI

Alhena AI can crawl, extract, and index data from a wide range of sources. This article lists all currently supported crawling integrations.


1. Websites & Web Pages

  • General Websites – HTML pages, blogs, wikis, etc.

  • Sitemap XML – Crawl all URLs listed in an XML sitemap.

  • Confluence Pages – Crawl pages from Atlassian Confluence.

  • Notion Pages – Crawl public Notion pages.

  • Private Notion Pages – Crawl private Notion pages, with appropriate authentication.

  • ServiceNow Documentation – Crawl content from a ServiceNow knowledge base.
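
To illustrate what crawling "all URLs listed in an XML sitemap" involves, here is a minimal sketch that extracts the `<loc>` entries from a sitemap document. It is not Alhena AI's implementation; the function name `extract_sitemap_urls` and the sample sitemap are assumptions made for illustration, and only the Python standard library is used.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(sitemap_xml: str) -> list[str]:
    """Return every <loc> URL found in a sitemap document, in document order."""
    root = ET.fromstring(sitemap_xml)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


# Hypothetical sample sitemap, used for illustration only.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs/getting-started</loc></url>
</urlset>"""

if __name__ == "__main__":
    for url in extract_sitemap_urls(SAMPLE_SITEMAP):
        print(url)
```

Checking the extracted list before adding a sitemap as a data source is a quick way to confirm which pages would be in scope.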


2. Google Drive

  • Google Docs

  • Google Sheets (spreadsheets)

  • Google Slides (presentations)

  • Google Drive Folders – Recursive crawling supported.

  • Google Drive Files – Generic support for any file inside Drive.
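
As a rough picture of what recursive folder crawling means, the sketch below walks a nested folder structure and yields every file path at any depth. The dictionary layout, `SAMPLE_TREE`, and `walk_folder` are hypothetical and exist only for illustration; a real crawler would fetch folder listings from Google Drive rather than from an in-memory structure.

```python
from typing import Iterator

# Hypothetical in-memory folder tree, standing in for Drive folder listings.
SAMPLE_TREE = {
    "name": "Support Docs",
    "files": ["Returns Policy", "Shipping FAQ"],
    "folders": [
        {"name": "Archive", "files": ["2023 Pricing"], "folders": []},
    ],
}


def walk_folder(folder: dict, prefix: str = "") -> Iterator[str]:
    """Yield the path of every file in the folder and all of its subfolders."""
    path = f"{prefix}/{folder['name']}" if prefix else folder["name"]
    for file_name in folder["files"]:
        yield f"{path}/{file_name}"
    for subfolder in folder["folders"]:
        # Recurse so that files at any depth are included.
        yield from walk_folder(subfolder, path)


if __name__ == "__main__":
    for file_path in walk_folder(SAMPLE_TREE):
        print(file_path)
```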


3. Document Files

  • PDF (.pdf)

  • Word Documents (.doc, .docx)

  • Excel Spreadsheets (.xls, .xlsx)

  • PowerPoint Presentations (.ppt, .pptx)

  • CSV / TSV (.csv, .tsv)

  • Plain Text (.txt), Markdown (.md), RST (.rst)

  • Rich Text Format (.rtf)

  • OpenDocument Files (.odt, .ods, .odp)

  • Apple iWork (.pages, .numbers, .key)

  • Email Files (.eml, .msg)

  • EPUB (.epub), Org-mode (.org)

  • Config/Data Files (.ini, .yaml, .toml, .xml, .json)

  • Images – .jpg, .jpeg, .png, .webp, etc.


4. Video & Media

  • YouTube Videos – Crawled via GeminiVideoScraper or similar.

  • Other Video Files – .mp4, .avi, .mkv, .mov, and more.


5. Social & Communication Platforms

  • Discord Messages

  • Slack Messages

  • Twitter Pages – Accounts, posts.


6. Helpdesk & Ticketing Systems

  • Zendesk Articles

  • Freshdesk Articles

  • Freshservice Articles

  • Gladly Articles


7. Ecommerce Platforms

  • Shopify Products

  • WooCommerce Products

  • Salesforce Commerce Cloud

  • Magento

  • Generic Product Pages – Custom HTML product extraction.


8. GitHub

  • Code Repositories – Source code, documentation, config files.

  • Issues

  • Discussions


9. Custom / Other Sources

  • S3 Uploaded Files

  • Custom Data Sources – Extensible scraper support.

  • PDF Crawling – Via URL or file upload.

  • Document Uploads – Supports any of the formats listed above.


Supported File Extensions (from INCLUDE_PATTERNS)

  • Source Code – .py, .js, .java, .c, .cpp, .cs, .rb, .go, .rs, .php, .swift, .kt, .ts, .scala, .pl, .r, .sh

  • Web – .html

  • Config & Data – .yaml, .yml, .xml, .ini, .toml

  • Documentation – .md, .rst, .txt

  • Presentations – .ppt, .pptx, .odp, .key

  • Spreadsheets – .xls, .xlsx, .ods, .numbers

  • Archives – .zip, .rar, .7z, .tar, .gz, .bz2, .xz, .iso

  • Media – .png, .jpg, .jpeg, .gif, .bmp, .tiff, .svg, .webp, .ico, .mp3, .wav, .ogg, .flac, .aac, .m4a, .wma, .mp4, .avi, .mkv, .flv, .mov, .wmv, .m4v, .webm, .vob, .ogv

  • Other – .pdf, .doc, .docx, .pages, .eml, .msg, .epub, .org, .tsv, .rtf, .dockerfile, etc.
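
For checking which category a given file would fall into before uploading it, a small lookup like the following can help. This is a sketch under stated assumptions: `EXTENSION_CATEGORIES` is a hypothetical, abridged mapping copied from the category lists in this section, and it does not reflect the actual structure of `INCLUDE_PATTERNS`, which this article does not show.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical, abridged mapping copied from the lists in this section.
# The real INCLUDE_PATTERNS structure is not documented here.
EXTENSION_CATEGORIES = {
    "Source Code": {".py", ".js", ".java", ".c", ".cpp", ".cs", ".rb", ".go",
                    ".rs", ".php", ".swift", ".kt", ".ts", ".scala", ".pl",
                    ".r", ".sh"},
    "Web": {".html"},
    "Config & Data": {".yaml", ".yml", ".xml", ".ini", ".toml"},
    "Documentation": {".md", ".rst", ".txt"},
    "Presentations": {".ppt", ".pptx", ".odp", ".key"},
    "Spreadsheets": {".xls", ".xlsx", ".ods", ".numbers"},
    "Archives": {".zip", ".rar", ".7z", ".tar", ".gz", ".bz2", ".xz", ".iso"},
}


def categorize(filename: str) -> Optional[str]:
    """Return the category for a filename's final extension, or None if unlisted."""
    suffix = PurePosixPath(filename).suffix.lower()  # case-insensitive match
    for category, extensions in EXTENSION_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


if __name__ == "__main__":
    for name in ("main.py", "report.XLSX", "backup.tar.gz", "notes.unknown"):
        print(name, "->", categorize(name))
```

Only the final extension is considered, so a name such as `backup.tar.gz` is matched on `.gz`.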


Summary Table

| Category | Examples / Platforms Supported |
| --- | --- |
| Websites | General, Sitemap, Confluence, Notion, ServiceNow |
| Google Drive | Docs, Sheets, Slides, Folders, Files |
| Documents | PDF, Word, Excel, PowerPoint, CSV, TSV, TXT, Markdown, RST, RTF, ODT, ODS, ODP, Pages, Numbers, Key, EML, MSG, EPUB, Org, Images |
| Video | YouTube, MP4, AVI, MKV, MOV, etc. |
| Social/Comm | Discord, Slack, Twitter |
| Helpdesk | Zendesk, Freshdesk, Freshservice, Gladly |
| Ecommerce | Shopify, WooCommerce, Salesforce Commerce Cloud, Magento, Generic Product Pages |
| GitHub | Code, Issues, Discussions |
| Custom/Other | S3, Custom sources, PDF crawling, Uploads |



Last updated 26 days ago