Splitter - semantic
Introduction
The semantic segmenter is an implementation of the Document Transformer interface that splits long documents into smaller chunks based on semantic similarity. This component is implemented according to the Eino: Document Transformer guide.
Working Principle
The semantic segmenter works in the following steps:
- It first splits the document into initial segments using basic separators (such as newlines and periods).
- It generates a semantic vector for each segment using an embedding model.
- It computes the cosine similarity between adjacent segments.
- It decides whether to place a split point between two segments based on a similarity threshold.
- It merges segments smaller than the minimum chunk size.
Usage
Component Initialization
The semantic splitter is initialized through the NewSplitter function; the main configuration parameters are as follows:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding:    embedder,                      // Required: embedding instance used to generate text vectors
	BufferSize:   2,                             // Optional: context buffer size
	MinChunkSize: 100,                           // Optional: minimum chunk size
	Separators:   []string{"\n", ".", "?", "!"}, // Optional: list of separators
	Percentile:   0.9,                           // Optional: percentile for the splitting threshold
	LenFunc:      nil,                           // Optional: custom length calculation function
})
Explanation of configuration parameters:
- Embedding: required; the embedder instance used to generate text vectors
- BufferSize: context buffer size, used to include more context information when computing semantic similarity
- MinChunkSize: minimum chunk size; chunks smaller than this are merged
- Separators: list of separators used for the initial split, applied in order
- Percentile: percentile for the splitting threshold, in the range 0-1; the larger the value, the fewer the splits
- LenFunc: custom text length calculation function; defaults to len()
Full Usage Example
package main

import (
	"context"
	"fmt"

	"github.com/cloudwego/eino-ext/components/document/transformer/splitter/semantic"
	"github.com/cloudwego/eino/components/embedding"
	"github.com/cloudwego/eino/schema"
)

func main() {
	ctx := context.Background()

	// Initialize the embedder (placeholder; substitute a concrete
	// implementation, e.g. an OpenAI embedding component).
	embedder := &embedding.SomeEmbeddingImpl{}

	// Initialize the splitter.
	splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
		Embedding:    embedder,
		BufferSize:   2,
		MinChunkSize: 100,
		Separators:   []string{"\n", ".", "?", "!"},
		Percentile:   0.9,
	})
	if err != nil {
		panic(err)
	}

	// Prepare the document to be split.
	docs := []*schema.Document{
		{
			ID: "doc1",
			Content: `This is the first paragraph, containing some important information.
This is the second paragraph, semantically related to the first.
This is the third paragraph, the topic has changed.
This is the fourth paragraph, continuing the new topic.`,
		},
	}

	// Execute the split.
	results, err := splitter.Transform(ctx, docs)
	if err != nil {
		panic(err)
	}

	// Process the split results.
	for i, doc := range results {
		fmt.Printf("Segment %d: %s\n", i+1, doc.Content)
	}
}
Advanced Usage
Custom length calculation:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding: embedder,
	LenFunc: func(s string) int {
		// Count Unicode characters (runes) instead of bytes.
		return len([]rune(s))
	},
})
Adjust splitting granularity:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding: embedder,
	// Raise the percentile to produce fewer split points.
	Percentile: 0.95,
	// Raise the minimum chunk size to avoid overly small chunks.
	MinChunkSize: 200,
})
Optimize semantic judgment:
splitter, err := semantic.NewSplitter(ctx, &semantic.Config{
	Embedding: embedder,
	// Increase the buffer size to include more context.
	BufferSize: 10,
	// Customize the separator priority.
	Separators: []string{"\n\n", "\n", ".", "!", "?", ","},
})
Related Documents
- Eino: Document Parser guide
- Eino: Document Loader guide
- Eino: Document Transformer guide
- Splitter - semantic
- Splitter - markdown
Last modified February 21, 2025: doc: add eino english docs (#1255) (4f6a3bd)