Document Loaders
Last updated
Document loaders serve as tools for importing data from various sources into a format that represents documents, each comprising textual content and associated metadata. These loaders are designed to handle diverse data sources, ranging from simple .txt files to the text content of web pages or even transcripts of YouTube videos.
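To make the "textual content plus metadata" shape concrete, here is a minimal sketch in plain Python. The `Document` class, its field names (`page_content`, `metadata`), and the `load_text_file` helper are assumptions made for illustration; they are not GenAI Stack's actual classes.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class Document:
    """Hypothetical stand-in for a loaded document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def load_text_file(path: str) -> list[Document]:
    """Read a .txt file and wrap its content in a single Document."""
    text = Path(path).read_text(encoding="utf-8")
    # Record where the text came from so downstream steps can cite it.
    return [Document(page_content=text, metadata={"source": path})]
```

Every loader described on this page can be thought of as a variation on this idea: read from some source, then emit one or more text-plus-metadata records.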
Some loaders also accept an optional Input component. The connected Input component specifies an Input Type and an Input Key, which determine the upload option shown to the user in the deployed app. The Input Type can be File, URL, or Text; select the type that matches the loader.
If no Input component is connected, the deployed Stack will use the data specified in the Document Loader for answering queries. Let's consider the following:
Suppose we want the user to be able to upload a File for our PDF Loader and a URL for our Web Based Loader. We would add an Input component to each loader and specify the matching Input Type for each (File and URL, respectively). Note that once an Input is connected, you no longer have to restrict the loader by providing it a fixed file or URL.
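The pairing of loaders and Input Types can be sketched as a small lookup. The enum, the dictionary, and the `is_valid_pairing` helper are hypothetical names used only for illustration; the two pairings themselves come from the example in the text.

```python
from enum import Enum


class InputType(Enum):
    FILE = "File"
    URL = "URL"
    TEXT = "Text"


# Pairings taken from the example in the text; the dictionary is illustrative.
EXPECTED_INPUT_TYPE = {
    "PDF Loader": InputType.FILE,
    "Web Based Loader": InputType.URL,
}


def is_valid_pairing(loader_name: str, chosen: InputType) -> bool:
    """Return True when the chosen Input Type matches what the loader expects."""
    return EXPECTED_INPUT_TYPE.get(loader_name) is chosen
```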
NOTE: We also need to define a separate Text Splitter for each loader before attaching it to a Vector Store. This ensures that the user can provide multiple types of input in the sidebar of the deployment page. After deployment, the sidebar shows one upload option per Input type, and the user can upload multiple inputs of each type.
Azure AI Document Intelligence (formerly known as Azure Form Recognizer) is a machine-learning-based service that extracts text (including handwriting), tables, or key-value pairs from scanned documents or images. It allows you to turn your documents into usable data and shift your focus to acting on information rather than compiling it.
Endpoint: Azure Cloud endpoint to integrate OCR
API Key: API key from Azure AI
File Path: Upload your input file
Example Usage:
These loaders can be connected with other components to create flows. For instance, the AzureAIDocumentIntelligence Loader can be connected to TextSplitters within a flow.
Dropbox is a cloud storage service provider that enables easy file sharing and access to team data from your computer, mobile device, or any web browser in a highly secure manner. DropboxLoader can load data from a list of Dropbox file paths or a single Dropbox folder path. Both paths should be relative to the root directory of the Dropbox account linked to the access token.
Steps to follow:
Create your developer app on Dropbox.
Once the app is built, navigate to permissions and select files.metadata.read and files.content.read. Then, click Submit.
From the app dashboard, click on Generate Access Token and copy your code.
Parameters:
Dropbox Access Token: The developer access token to access files from Dropbox
Dropbox Folder Path: The folder name where the data is stored in Dropbox.
Example Usage
In this loader, the folder path refers to a folder stored in your Dropbox. The format is '/' followed by your folder name. Please note that the app page also shows an app key and secret; do not copy those. Instead, remember to generate an access token.
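The folder path format ('/' followed by the folder name) can be enforced with a small helper before the value is pasted into the loader. This is a sketch; `normalize_dropbox_folder` is a hypothetical helper, not part of GenAI Stack or the Dropbox SDK.

```python
def normalize_dropbox_folder(name: str) -> str:
    """Return '/<folder>' for a folder name, tolerating stray slashes and spaces."""
    cleaned = name.strip().strip("/")
    if not cleaned:
        raise ValueError("folder name is empty")
    # The path is relative to the root of the linked Dropbox account.
    return "/" + cleaned
```

For example, both `reports` and `/reports/` normalize to `/reports`.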
WebBaseLoader extracts all text from HTML web pages and formats it into a document structure for downstream processing. Use this loader to load web resources.
Web Page: Enter the web page URL as your input.
Metadata: Metadata is used to provide the source and tag for the given input.
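As a rough illustration of "extract all text from HTML", the sketch below uses only Python's standard-library `html.parser`. It is not the loader's actual implementation; the `html_to_document` helper and the dictionary layout are assumptions, and fetching the page over the network is deliberately left out.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_document(html: str, url: str) -> dict:
    """Turn an HTML string into a text-plus-metadata record."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"page_content": "\n".join(parser.chunks), "metadata": {"source": url}}
```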
Example Usage:
The CSV Loader processes delimited text, separating the values and organizing them into a document format for downstream applications.
File Path: Upload the CSV file as your input.
Metadata: Metadata is used to provide the source and tag for the given input.
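One plausible way to organize delimited text into documents is one document per row, with each value labeled by its column header. The sketch below assumes that layout and a header row; `csv_to_documents` and the `row` metadata key are illustrative names, not the loader's confirmed behavior.

```python
import csv
import io


def csv_to_documents(csv_text: str, source: str, delimiter: str = ",") -> list[dict]:
    """Build one text-plus-metadata record per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    documents = []
    for row_number, row in enumerate(reader):
        # Render the row as "column: value" lines so the text stays readable.
        content = "\n".join(f"{key}: {value}" for key, value in row.items())
        documents.append(
            {"page_content": content, "metadata": {"source": source, "row": row_number}}
        )
    return documents
```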
Example Usage:
This document loader loads a PDF using pypdf into an array of documents, where each document contains the page content and metadata with the page number.
File Path: Upload the PDF file as your input.
Metadata: Metadata is used to provide the source and tag for the given input.
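The "one document per page" structure can be sketched as follows. The page texts are passed in already extracted (the pypdf extraction step is not shown), and both the `pages_to_documents` helper and the zero-based `page` numbering are assumptions for illustration.

```python
def pages_to_documents(page_texts: list[str], source: str) -> list[dict]:
    """Wrap each page's text in a record that carries its page number."""
    return [
        # Zero-based page numbering is an assumption of this sketch.
        {"page_content": text, "metadata": {"source": source, "page": index}}
        for index, text in enumerate(page_texts)
    ]
```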
Example Usage:
YouTubeLoader downloads YouTube transcripts and video information.
Video URL: The YouTube video URL.
Language: Language code of the transcript to extract. Check the video's subtitles/CC to see which transcripts are available. Default: en (English).
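The two parameters can be illustrated with a small sketch that pulls the video ID out of a URL and applies the default language. The helper names and the returned dictionary are hypothetical, only two common URL shapes are handled, and downloading the transcript itself is not shown.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse

YOUTUBE_HOSTS = {"youtube.com", "www.youtube.com", "m.youtube.com"}


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a watch URL or a youtu.be short link."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host == "youtu.be":
        return parsed.path.lstrip("/") or None
    if host in YOUTUBE_HOSTS and parsed.path == "/watch":
        values = parse_qs(parsed.query).get("v")
        return values[0] if values else None
    return None


def transcript_request(url: str, language: str = "en") -> dict:
    """Bundle the video ID with the transcript language (default: en)."""
    video_id = extract_video_id(url)
    if video_id is None:
        raise ValueError(f"not a recognized YouTube video URL: {url}")
    return {"video_id": video_id, "language": language}
```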
Example Usage:
FigmaFileLoader loads Figma design files in a structured format, which makes the visual design data usable for tasks such as automatic code generation.
Parameters:
Figma Access Token: The access token you generate in Figma settings.
Figma design URL: The URL of your Figma design, as shown in the browser's address bar.
Example Usage:
GitLoader loads files from a Git repository by cloning the repository from a URL.
Parameters:
Branch: The branch you need to clone.
Clone URL: The URL of the Git repository.
File extension: The file type you need to clone (.py, .txt, etc)
Metadata: Metadata is used to provide the source and tag for the given input.
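The effect of the File extension parameter can be sketched on a repository that has already been cloned to a local directory. Cloning is not shown, and `load_repo_files` is a hypothetical helper; recording the branch in the metadata is an assumption of this sketch.

```python
from pathlib import Path


def load_repo_files(repo_dir: str, extension: str, branch: str) -> list[dict]:
    """Load every file with the given extension from a local checkout."""
    root = Path(repo_dir)
    documents = []
    for path in sorted(root.rglob(f"*{extension}")):
        relative = path.relative_to(root)
        # Skip Git's own bookkeeping directory and anything that is not a file.
        if ".git" in relative.parts or not path.is_file():
            continue
        documents.append(
            {
                "page_content": path.read_text(encoding="utf-8", errors="replace"),
                "metadata": {"source": relative.as_posix(), "branch": branch},
            }
        )
    return documents
```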
Example Usage:
The GitBook loader loads GitBook data using the URL of the web page.
Parameters:
Web Page: The URL of the GitBook web page.
Metadata: Metadata is used to provide the source and tag for the given input.
Example Usage:
The HNLoader is used to load data from Hacker News, either from the main page results or the comments page. It can fetch all URLs concurrently with rate limiting and scrape data from a webpage, returning it in BeautifulSoup format.
Parameters:
Web Page: The URL of the web page of Hacker News.
Metadata: Metadata is used to provide the source and tag for the given input.
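Fetching several URLs concurrently while limiting the load on the server can be sketched with `asyncio`. This example caps the number of in-flight requests with a semaphore, which is a simple stand-in for rate limiting rather than the loader's confirmed mechanism. The `fetch` argument is a caller-supplied coroutine (an assumption of this sketch), so no real network call is made here.

```python
import asyncio
from typing import Awaitable, Callable


async def fetch_all(
    urls: list[str],
    fetch: Callable[[str], Awaitable[str]],
    max_concurrent: int = 2,
) -> list[str]:
    """Fetch every URL, allowing at most max_concurrent requests at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(url: str) -> str:
        # Wait for a free slot before starting the request.
        async with semaphore:
            return await fetch(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(guarded(url) for url in urls))
```

The returned pages could then be handed to an HTML parser to extract the text.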
Example Usage:
The TextLoader loads a simple .txt file.
Parameters:
File Path: The path to the .txt file to load.
Metadata: Metadata is used to provide the source and tag for the given input.
Example Usage:
The URL loader loads HTML documents from a URL. You can select a specific loader from the drop-down list; the default is ‘WebBaseLoader’, which works for any URL.
Parameters:
URL: The URL of the web page you want to load.
Example Usage:
The AirbyteJSONLoader is used to load local Airbyte JSON files. It initializes with a file path and can load data into Document objects. This loader is specifically designed to handle data integration from Airbyte.
Parameters:
File Path: The file path to the Airbyte JSON file.
Metadata: Metadata is used to provide the source and tag for the given input.
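A rough sketch of turning Airbyte output into a document is shown below. It assumes the file holds one JSON record per line and that each record's payload sits under an `_airbyte_data` key; both are assumptions about the file layout, and `airbyte_jsonl_to_document` is a hypothetical helper rather than the loader's real code.

```python
import json


def airbyte_jsonl_to_document(jsonl_text: str, source: str) -> dict:
    """Flatten JSON-lines records into one text-plus-metadata record."""
    lines = []
    for raw in jsonl_text.splitlines():
        if not raw.strip():
            continue
        record = json.loads(raw)
        # Fall back to the whole record when the assumed payload key is absent.
        data = record.get("_airbyte_data", record)
        lines.extend(f"{key}: {value}" for key, value in data.items())
    return {"page_content": "\n".join(lines), "metadata": {"source": source}}
```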
Example Usage:
In GenAI Stack, Unstructured currently supports loading text files, PowerPoint presentations, HTML, PDFs, images, and more. Unstructured detects the file type and extracts the same types of elements.
File Path: Upload a file of a type supported by Unstructured as your input.
Metadata: Metadata is used to provide the source and tag for the given input.
Example Usage: