Data Loader
This page outlines the core concept of a data loader, a foundational component for building applications that leverage Large Language Models (LLMs) on your own datasets. A data loader serves as a bridge between the LLM and various data sources: websites, PDF files, Google Drive documents, presentations, Markdown files, and even YouTube videos. Grounding the LLM directly in your specific data minimizes inaccuracies and reduces the likelihood of irrelevant or incorrect responses. In this guide we will use a PDF loader; you can check out more ways to upload documents in the Document Loader section.
The data loader is designed to ingest and preprocess data from a diverse set of sources, enabling the LLM to access and interpret the content effectively and tailor its responses to the context and information contained in your data. The process begins with ingestion: the data loader retrieves content from the specified data source, whether that means extracting text from a PDF, scraping a website, or processing subtitles from a YouTube video.
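To make the ingestion step concrete, here is a minimal sketch using the open-source pypdf library to extract raw text from a PDF. The library choice and file name are illustrative assumptions, not part of the component itself:

```python
from pypdf import PdfReader

# Ingestion sketch: extract raw text from a local PDF.
# "example.pdf" is a placeholder path.
reader = PdfReader("example.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)

print(f"Extracted {len(text)} characters from {len(reader.pages)} pages")
```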
While the data loader is capable of handling a variety of data formats, we'll illustrate its application using a PDF loader as an example. The PDF loader is specifically designed to extract text from PDF documents, making the content accessible to the LLM for context.
You can click the file-upload icon to upload a PDF file from your local directory, then connect the loader to any other component that accepts documents as input.

Note: You can always hover over the connector icon to see which components this component can connect to.

Here are some examples of components you can connect these loaders to.
PyPDF -> TextSplitter
In this example we connect the output of the PyPDF loader to the TextSplitter component, so that the contents of the PDF are chunked to a specific size before entering the VectorDB.
To learn more about text splitter configuration, refer to the Text Splitter Components documentation.
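In code, this flow looks roughly like the following. This is a minimal sketch assuming LangChain-style components (PyPDFLoader and RecursiveCharacterTextSplitter); the file path and chunk sizes are placeholder values:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Load the PDF ("example.pdf" is a placeholder path).
loader = PyPDFLoader("example.pdf")
docs = loader.load()

# Chunk the loaded pages to a specific size before they
# enter the vector DB (sizes here are illustrative).
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(docs)

print(f"{len(docs)} pages split into {len(chunks)} chunks")
```

Smaller chunks generally improve retrieval precision at the cost of context per result; the right chunk size depends on your documents and queries.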
PyPDF -> VectorStore
In this example we connect the loader output to the VectorDB directly, without chunking. This suits cases where you want long contexts to be stored and retrieved whole rather than as smaller chunks. To learn more about vector store configuration, refer to the Vector Store Components documentation.
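A rough sketch of this flow, again assuming LangChain-style components (PyPDFLoader, Chroma as the vector store, and OpenAIEmbeddings as one possible embedding model), stores the full pages without a splitting step:

```python
from langchain_community.document_loaders import PyPDFLoader
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

# Load the PDF ("example.pdf" is a placeholder path) and embed
# whole pages directly, skipping the text-splitting step.
docs = PyPDFLoader("example.pdf").load()
db = Chroma.from_documents(docs, OpenAIEmbeddings())

# Retrieval now returns entire pages as long contexts.
results = db.similarity_search("What is this document about?", k=2)
```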
We have shown an example with the PyPDF loader, but other loaders work in a very similar way. To learn more about loaders, refer to their component documentation.