Text Splitters

In GenAI Stack, loaded documents often need to be transformed before an application can use them well. Most commonly, long texts are broken into smaller chunks that fit within a model's context window. GenAI Stack provides built-in document transformers for splitting, combining, filtering, and otherwise manipulating documents according to application needs.

NOTE: All the Text Splitters below share the same input and output components.

CharacterTextSplitter

Splits text on a user-defined character. This is one of the simplest splitting methods.

Parameters

  • Documents: Input documents to split.

  • chunk_size: The maximum number of characters in each chunk.

  • chunk_overlap: The number of characters shared between consecutive chunks, that is, how much of the end of the previous chunk is repeated at the start of the next chunk.

  • separator: Specifies the character that will be used to split the text into chunks.

Example usage

Text Splitters accept the following input components: Document objects (the output of loaders), other Text Splitters, Chains (PromptRunner and SequentialLLMChain), or certain Utilities.
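
To make the interaction between separator, chunk_size, and chunk_overlap concrete, here is a minimal plain-Python sketch of character-based splitting. It does not use the GenAI Stack API; the function name split_by_character is hypothetical, and the merge strategy shown is one reasonable assumption about how such a splitter can behave.

```python
def split_by_character(text, separator="\n\n", chunk_size=100, chunk_overlap=20):
    """Illustrative sketch only: split on `separator`, then greedily merge
    the pieces into chunks of at most `chunk_size` characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    pieces = [p for p in text.split(separator) if p]
    chunks = []
    current = []  # pieces that make up the chunk being built

    for piece in pieces:
        candidate = separator.join(current + [piece])
        if current and len(candidate) > chunk_size:
            # The current chunk is full: emit it.
            chunks.append(separator.join(current))
            # Keep trailing pieces as overlap, dropping from the front until
            # the carry-over fits in chunk_overlap and leaves room for `piece`.
            while current and (
                len(separator.join(current)) > chunk_overlap
                or len(separator.join(current + [piece])) > chunk_size
            ):
                current.pop(0)
        # Note: a single piece longer than chunk_size is kept whole here.
        current.append(piece)

    if current:
        chunks.append(separator.join(current))
    return chunks


if __name__ == "__main__":
    text = "aaaa bbbb cccc dddd"
    print(split_by_character(text, separator=" ", chunk_size=9, chunk_overlap=4))
    print(split_by_character(text, separator=" ", chunk_size=9, chunk_overlap=0))
```

With an overlap of 4, each chunk repeats the last word of the previous one; with an overlap of 0, the chunks are disjoint.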

RecursiveCharacterTextSplitter

Text is divided recursively, with the goal of keeping related content close together. The first separator splits the data into chunks, and these chunks are then split recursively using the subsequent separators.

Parameters

  • Documents: Input documents to split.

  • chunk_size: The maximum number of characters in each chunk.

  • chunk_overlap: The number of characters shared between consecutive chunks, that is, how much of the end of the previous chunk is repeated at the start of the next chunk.

  • separator: Specifies the character(s) that will be used to split the text into chunks.

Example usage

The recursive text splitter applies its list of separators one after another. The document is first chunked by the topmost separator (paragraphs), then by the next separator (newlines), then by sentences and words, and finally by characters.
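
The separator cascade described above can be sketched in plain Python as follows. This is a simplified illustration, not the GenAI Stack implementation: the function name recursive_split and the default separator list are assumptions, and chunk_overlap is left out to keep the recursion easy to follow.

```python
def recursive_split(text, separators=("\n\n", "\n", ". ", " ", ""), chunk_size=100):
    """Illustrative sketch only: split on the coarsest separator that occurs
    in the text, and recurse with finer separators on oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # Pick the first separator that actually occurs in the text.
    for i, sep in enumerate(separators):
        if sep == "" or sep in text:
            remaining = separators[i + 1:]
            break
    else:
        sep, remaining = "", ()

    if sep == "":
        # Last resort: fixed-width character slices.
        return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            # Merge small neighbouring pieces so related content stays together.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too long: split this piece with the finer separators.
            chunks.extend(recursive_split(piece, remaining, chunk_size))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = "para one is short\n\nline a\nline b is a bit longer than the limit"
    for chunk in recursive_split(sample, chunk_size=20):
        print(repr(chunk))
```

In the sample, the paragraph break is tried first, the remaining long paragraph is split on newlines, and only the line that is still too long is broken at word boundaries.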

LanguageRecursiveTextSplitter

The LanguageRecursiveTextSplitter is a text splitter that splits the text into smaller chunks based on the (programming) language of the text.

Parameters

  • Documents: Input documents to split.

  • chunk_size: The maximum number of characters in each chunk.

  • separator_type: Selects the programming language whose separators are used to split the code. Supported languages include Ruby, Python, Solidity, Java, and more. Defaults to Python.

Example usage

The LanguageRecursiveTextSplitter takes Document objects and splits them using the separators for the programming language selected in the dropdown menu (for example, "\ndef", " \ndef", for Python).
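
The idea of language-specific separators can be illustrated with a short plain-Python sketch. The LANGUAGE_SEPARATORS table and the split_code function below are hypothetical and are not GenAI Stack's actual separator lists or API; chunk_size handling is omitted so that only the boundary detection is shown.

```python
import re

# Hypothetical, illustrative separator lists (assumption, not the real tables).
LANGUAGE_SEPARATORS = {
    "python": ["\nclass ", "\ndef "],
    "ruby": ["\nclass ", "\ndef "],
}


def split_code(source, separator_type="python"):
    """Illustrative sketch only: split source code so that each chunk
    starts at a language-level boundary such as a class or function."""
    separators = LANGUAGE_SEPARATORS[separator_type]
    # Zero-width lookaheads split *before* each separator, so the keyword
    # stays attached to the chunk it introduces (requires Python 3.7+).
    pattern = "|".join(f"(?={re.escape(sep)})" for sep in separators)
    return [chunk.strip("\n") for chunk in re.split(pattern, source) if chunk.strip()]


if __name__ == "__main__":
    code = "import os\n\ndef a():\n    return 1\n\nclass B:\n    pass\n"
    for chunk in split_code(code):
        print(repr(chunk))
```

Splitting at definition boundaries keeps each function or class in one chunk where possible, which is the main reason to prefer language-aware separators over generic ones for source code.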
