Parser - html
Basic Introduction
The HTML Document Parser is an implementation of the Document Parser interface, used to parse the content of HTML web pages into plain text. It follows the Eino: Document Parser guide and is mainly used in the following scenarios:
- When plain text content needs to be extracted from web pages
- When metadata of web pages (title, description, etc.) needs to be retrieved
Feature Introduction
The HTML parser has the following features:
- Supports selective extraction of page content through a configurable content selector (HTML selector)
- Automatically extracts web page metadata
- Secure HTML parsing
Usage
Component Initialization
The HTML parser is initialized using the NewParser function, with the main configuration parameters listed below:
import (
	"github.com/cloudwego/eino-ext/components/document/parser/html"
)

parser, err := html.NewParser(ctx, &html.Config{
	Selector: &selector, // Optional: content selector, defaults to body
})
Configuration parameter description:
- Selector: Optional parameter that specifies the content area to extract, using goquery selector syntax. For example:
  - body extracts the content of the <body> tag
  - #content extracts the content of the element with id "content"
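Because Selector is a pointer to a string, leaving it unset and setting it explicitly are two distinct cases. The sketch below illustrates that distinction without depending on eino-ext: the `config` struct is a hypothetical stand-in that only mirrors the `Selector *string` field shown above, `ptr` is a hypothetical convenience helper, and `effectiveSelector` merely restates the documented default ("defaults to body"). It is not the library's implementation.

```go
package main

import "fmt"

// config is a hypothetical stand-in mirroring the shape of html.Config
// as used in this document: a single optional *string selector.
type config struct {
	Selector *string
}

// ptr returns a pointer to v, which avoids declaring a named variable
// just to take its address.
func ptr[T any](v T) *T { return &v }

// effectiveSelector restates the documented default for illustration:
// a nil config or a nil Selector falls back to "body".
func effectiveSelector(c *config) string {
	if c == nil || c.Selector == nil {
		return "body"
	}
	return *c.Selector
}

func main() {
	fmt.Println(effectiveSelector(nil))
	fmt.Println(effectiveSelector(&config{Selector: ptr("#content")}))
}
```

With the real package, the same pointer pattern would apply when filling in html.Config, for example by passing the address of a local selector variable as in the initialization snippet above.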
Metadata Description
The parser will automatically extract the following metadata:
- html.MetaKeyTitle ("_title"): Webpage title
- html.MetaKeyDesc ("_description"): Webpage description
- html.MetaKeyLang ("_language"): Webpage language
- html.MetaKeyCharset ("_charset"): Character encoding
- html.MetaKeySource ("_source"): Document source URI
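Callers usually need a type assertion before using these metadata values as strings, and a page may omit some of them. The following is a minimal sketch, assuming the metadata map has type `map[string]any`. The `metaString` helper is hypothetical, and the local constants simply copy the documented key strings so the snippet runs without the eino-ext dependency.

```go
package main

import "fmt"

// Local copies of the documented key strings; in real code use
// html.MetaKeyTitle, html.MetaKeyDesc, and so on.
const (
	metaKeyTitle = "_title"
	metaKeyDesc  = "_description"
)

// metaString is a hypothetical helper: it returns the value stored under
// key if it is a non-empty string, otherwise the supplied fallback.
func metaString(meta map[string]any, key, fallback string) string {
	if s, ok := meta[key].(string); ok && s != "" {
		return s
	}
	return fallback
}

func main() {
	// Stand-in for the metadata map attached to a parsed document
	// (assumed shape, not produced by the real parser here).
	meta := map[string]any{
		metaKeyTitle: "Sample Page",
		"custom":     "value",
	}
	fmt.Println(metaString(meta, metaKeyTitle, "(untitled)"))
	fmt.Println(metaString(meta, metaKeyDesc, "(no description)"))
}
```

The two-value form of the type assertion keeps the lookup from panicking when a key is missing or holds a non-string value.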
Complete Usage Example
Basic Usage
package main

import (
	"context"
	"fmt"
	"strings"

	"github.com/cloudwego/eino-ext/components/document/parser/html"
	"github.com/cloudwego/eino/components/document/parser"
)

func main() {
	ctx := context.Background()

	// Initialize parser
	p, err := html.NewParser(ctx, nil) // Use default configuration
	if err != nil {
		panic(err)
	}

	// HTML content; the variable must not be named html, or it would
	// shadow the imported html package used below
	htmlContent := `
	<html lang="zh">
		<head>
			<title>Sample Page</title>
			<meta name="description" content="This is a sample page">
			<meta charset="UTF-8">
		</head>
		<body>
			<div id="content">
				<h1>Welcome</h1>
				<p>This is the main content.</p>
			</div>
		</body>
	</html>
	`

	// Parse the document
	docs, err := p.Parse(ctx, strings.NewReader(htmlContent),
		parser.WithURI("https://example.com"),
		parser.WithExtraMeta(map[string]any{
			"custom": "value",
		}),
	)
	if err != nil {
		panic(err)
	}

	// Use the parsing results; fmt.Println prints the metadata values,
	// whereas the builtin println would print interface addresses
	doc := docs[0]
	fmt.Println("Content:", doc.Content)
	fmt.Println("Title:", doc.MetaData[html.MetaKeyTitle])
	fmt.Println("Description:", doc.MetaData[html.MetaKeyDesc])
	fmt.Println("Language:", doc.MetaData[html.MetaKeyLang])
}
Using Selector
package main

import (
	"context"

	"github.com/cloudwego/eino-ext/components/document/parser/html"
)

func main() {
	ctx := context.Background()

	// Specify to only extract the content of the element with id "content"
	selector := "#content"
	p, err := html.NewParser(ctx, &html.Config{
		Selector: &selector,
	})
	if err != nil {
		panic(err)
	}

	// ... code to parse the document ...
}
Using Loader
Refer to the Eino: Document Parser guide for examples.
Related Documents
Last modified February 21, 2025: doc: add eino english docs (#1255) (4f6a3bd)