Text Splitters


Last updated 11 months ago


In GenAI Stack, loaded documents often need to be transformed before an application can use them. The most common transformation is splitting a long text into smaller chunks that fit within a model's context window. GenAI Stack provides built-in document transformers for splitting, combining, filtering, and otherwise manipulating documents.

NOTE: All of the text splitters below share the same input and output components.

CharacterTextSplitter

Splits text on a single user-defined character. This is one of the simpler methods.

Parameters

  • Documents: Input documents to split.

  • chunk_size: The maximum number of characters in each chunk.

  • chunk_overlap: The number of characters shared between consecutive chunks, that is, how much of the end of the previous chunk is repeated at the start of the next one.

  • separator: The character on which the text is split into chunks.
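
How these three parameters interact can be sketched in plain Python. This is only an illustration, not GenAI Stack's implementation: `split_by_character` is a hypothetical helper, and it assumes that pieces are merged greedily up to chunk_size and that overlap is carried over in whole pieces rather than exact character counts.

```python
def split_by_character(text, separator="\n\n", chunk_size=1000, chunk_overlap=0):
    """Split `text` on `separator`, then merge the pieces into chunks.

    Hypothetical helper for illustration only. A single piece longer than
    chunk_size is kept whole in this sketch.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    def length(parts):
        return len(separator.join(parts))

    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces that make up the chunk being built

    for piece in pieces:
        if current and length(current + [piece]) > chunk_size:
            chunks.append(separator.join(current))
            # Drop leading pieces until what remains fits in the overlap
            # budget and leaves room for the new piece.
            while current and (
                length(current) > chunk_overlap
                or length(current + [piece]) > chunk_size
            ):
                current.pop(0)
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


with_overlap = split_by_character(
    "aaa bbb ccc ddd", separator=" ", chunk_size=7, chunk_overlap=3
)
# with_overlap == ["aaa bbb", "bbb ccc", "ccc ddd"]

without_overlap = split_by_character(
    "aaa bbb ccc ddd", separator=" ", chunk_size=7, chunk_overlap=0
)
# without_overlap == ["aaa bbb", "ccc ddd"]
```

With an overlap of 3 characters, the last word of each chunk reappears at the start of the next one, which helps preserve context across chunk boundaries.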

Example usage

The available input components for Text Splitters include Document objects (the output of loaders), other Text Splitters, the PromptRunner and SequentialLLMChain chains, and certain Utilities.

RecursiveCharacterTextSplitter

The text is divided recursively, with the goal of keeping related content together. The first separator splits the text into chunks, and these chunks are then split recursively using the subsequent separators.

Parameters

  • Documents: Input documents to split.

  • chunk_size: The maximum number of characters in each chunk.

  • chunk_overlap: The number of characters shared between consecutive chunks, that is, how much of the end of the previous chunk is repeated at the start of the next one.

  • separator: The character(s) on which the text is split into chunks.

Example usage

The recursive text splitter applies the list of separators one after another: the document is first chunked by the topmost separator (paragraphs), then by the next separator (newlines), then by sentences and words, and finally by characters.
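
The separator cascade described above can be sketched as follows. This is a simplified illustration rather than the component's actual code: `recursive_split` and its default separator list are assumptions, chunk_overlap is left out for brevity, and the sketch assumes that only pieces still larger than chunk_size are passed on to the next separator.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split `text` so that every chunk has at most `chunk_size` characters.

    Hypothetical helper: tries paragraphs first, then lines, sentences,
    words, and finally a hard cut by characters.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separator left (or the empty separator): cut by characters.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, rest)

    chunks = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep related pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: recurse with the finer-grained separators.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


doc = "First paragraph.\n\nSecond line one\nSecond line two"
paragraph_chunks = recursive_split(doc, chunk_size=20)
# paragraph_chunks == ["First paragraph.", "Second line one", "Second line two"]

hard_cut = recursive_split("abcdefghij", chunk_size=4)
# hard_cut == ["abcd", "efgh", "ij"]
```

The first paragraph fits within chunk_size and is kept whole, while the second paragraph is too long and falls through to the newline separator.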

LanguageRecursiveTextSplitter

The LanguageRecursiveTextSplitter splits text into smaller chunks based on the programming language of the text.

Parameters

  • Documents: Input documents to split.

  • chunk_size: The maximum number of characters in each chunk.

  • separator_type: Selects the programming language whose separators are used to split the code. Supported languages include Ruby, Python, Solidity, Java, and more. Defaults to Python.

Example usage

The Language Recursive Text Splitter takes Document objects and splits them using the separators for the programming language chosen in the dropdown menu (e.g., "\ndef", " \ndef" for Python).
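
To illustrate how language-specific separators keep code units intact, here is a minimal sketch. The `LANGUAGE_SEPARATORS` table and the `split_code` helper are hypothetical and do not reflect the separator lists GenAI Stack actually ships. The sketch also ignores chunk_size and only shows the effect of choosing a language.

```python
import re

# Illustrative separator lists only; the real lists may differ.
LANGUAGE_SEPARATORS = {
    "python": ["\nclass ", "\ndef "],
    "ruby": ["\nclass ", "\ndef "],
    "java": ["\nclass ", "\npublic ", "\nprivate "],
}


def split_code(code, separator_type="python"):
    """Split `code` in front of each separator of the chosen language.

    A lookahead is used so that the keyword (for example `def`) stays at
    the start of the chunk it introduces instead of being consumed.
    """
    separators = LANGUAGE_SEPARATORS[separator_type]
    pattern = "|".join(f"(?={re.escape(sep)})" for sep in separators)
    parts = re.split(pattern, code)
    return [part.strip("\n") for part in parts if part.strip()]


source = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
code_chunks = split_code(source, separator_type="python")
# code_chunks == ["import os", "def a():\n    return 1", "def b():\n    return 2"]
```

Each function definition ends up in its own chunk, which is usually more useful for retrieval than a cut at an arbitrary character position.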