Parsing
Parsing HTML Documents
The Document struct in dom_query is designed to handle full HTML documents. You can create a Document by passing in HTML content, which can be provided in several formats: &str, String, or StrTendril.
extern crate dom_query;
use dom_query::Document;
use tendril::StrTendril;
// HTML content as a string slice
let contents_str = r#"<!DOCTYPE html>
<html><head><title>Test Page</title></head><body></body></html>"#;
let doc = Document::from(contents_str);
// HTML content as a String
let contents_string = contents_str.to_string();
let doc = Document::from(contents_string);
// HTML content as a StrTendril
let contents_tendril = StrTendril::from(contents_str);
let doc = Document::from(contents_tendril);
// Checking the root element of the `Document`
assert!(doc.root().is_document());
When parsing a full HTML document, Document recognizes a <!DOCTYPE> declaration if it appears at the start of the input. In that case, a Doctype node is added as the first child of the root Document node. If you provide an HTML snippet without a <!DOCTYPE>, no Doctype node is created.
assert!(doc.root().first_child().unwrap().is_doctype());
Parsing HTML Fragments
For cases where you need to parse only a part of an HTML document, such as a snippet or component, dom_query provides Document::fragment(). This function also accepts &str, String, or StrTendril, but unlike Document::from() it treats the input as a fragment instead of a full document.
use dom_query::Document;
use tendril::StrTendril;
// Parsing an HTML fragment from a string slice
let contents_str = r#"<div><p>Example Fragment</p></div>"#;
let fragment = Document::fragment(contents_str);
// Parsing from a String
let contents_string = contents_str.to_string();
let fragment = Document::fragment(contents_string);
// Parsing from a StrTendril
let contents_tendril = StrTendril::from(contents_str);
let fragment = Document::fragment(contents_tendril);
// Checking the root element of the fragment
assert!(!fragment.root().is_document());
assert!(fragment.root().is_fragment());
Note that Document::fragment() ignores Doctype declarations and parses only the fragment itself.
// Confirming Doctype is excluded in the fragment
assert!(!fragment.root().first_child().unwrap().is_doctype());
Document::fragment() is also used internally within the library to create new elements within the document tree.
Querying
Selecting Elements
The dom_query crate provides several selection methods to locate HTML elements in the document. Using CSS-like selectors, you can select both single and multiple elements.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Test Page</title>
</head>
<body>
<h1>Test Page</h1>
<ul>
<li>One</li>
<li><a href="/2">Two</a></li>
<li><a href="/3">Three</a></li>
</ul>
</body>
</html>"#;
let document = Document::from(html);
// Select a single element
let a = document.select("ul li:nth-child(2)");
let text = a.text().to_string();
assert_eq!(text, "Two");
// Selecting multiple elements
document.select("ul > li:has(a)").iter().for_each(|sel| {
assert!(sel.is("li"));
});
// Optionally select an element with `try_select`, which returns an `Option`
let no_sel = document.try_select("p");
assert!(no_sel.is_none());
The Selection::is method checks whether elements in the current selection match a given selector, without performing a deep search within the elements.
dom_query supports the pseudo-classes provided by the selectors crate, plus a few of its own.
See also: List of supported CSS pseudo-classes
Selecting a Single Match and Multiple Matches
To retrieve only the first match of a selector, use the select_single method. It is useful when you want a single match without iterating through all matches.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<ul class="list">
<li>1</li><li>2</li><li>3</li>
</ul>
<ul class="list">
<li>4</li><li>5</li><li>6</li>
</ul>
</body>
</html>"#.into();
// selecting a first match
let single_selection = doc.select_single(".list");
assert_eq!(single_selection.length(), 1);
assert_eq!(single_selection.inner_html().to_string().trim(),
"<li>1</li><li>2</li><li>3</li>");
// selecting all matches
let selection = doc.select(".list");
assert_eq!(selection.length(), 2);
// but when you call property methods usually
// you will get the result of the first match
assert_eq!(selection.inner_html().to_string().trim(),
"<li>1</li><li>2</li><li>3</li>");
// This creates a Selection from the first node in the selection
let first_selection = doc.select(".list").first();
assert_eq!(first_selection.length(), 1);
assert_eq!(first_selection.inner_html().to_string().trim(),
"<li>1</li><li>2</li><li>3</li>");
// This approach also creates a new Selection from the next node, each iteration
let next_selection = doc.select(".list").iter().next().unwrap();
assert_eq!(next_selection.length(), 1);
assert_eq!(next_selection.inner_html().to_string().trim(),
"<li>1</li><li>2</li><li>3</li>");
// currently, to get data from all matches you need to iterate over them:
let all_matched: String = selection
.iter()
.map(|s| s.inner_html().trim().to_string())
.collect();
assert_eq!(
all_matched,
"<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);
// Same as the previous example, but a little cheaper, because we are iterating over the nodes
// and do not create a new Selection on each iteration
let all_matched: String = doc
.select(".list").nodes()
.iter()
.map(|s| s.inner_html().trim().to_string())
.collect();
assert_eq!(
all_matched,
"<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);
Descendant selections
Elements can be selected in relation to a parent element. Here, a Document is queried for ul elements, and then descendant selectors are applied within that context.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>Test Page</title>
</head>
<body>
<h1>Test Page</h1>
<ul class="list-a">
<li>One</li>
<li><a href="/2">Two</a></li>
<li><a href="/3">Three</a></li>
</ul>
<ul class="list-b">
<li><a href="/4">Four</a></li>
</ul>
</body>
</html>"#;
let document = Document::from(html);
// selecting parent elements
let ul = document.select("ul");
ul.select("li").iter().for_each(|el| {
// a descendant select matches only within the parent selection
assert!(el.is("li"));
});
// A descendant selector may also reference elements above the parent.
// This can be useful to pin down exactly which element you want to select
let el = ul.select("body ul.list-b li").first();
let text = el.text();
assert_eq!("Four", text.to_string());
Selecting with Precompiled Matchers
For repeated queries, dom_query allows using precompiled matchers. This approach enhances performance when matching the same pattern across multiple documents.
use dom_query::{Document, Matcher};
let html1 = r#"<!DOCTYPE html>
<html><head><title>Test Page 1</title></head><body></body></html>"#;
let html2 = r#"<!DOCTYPE html>
<html><head><title>Test Page 2</title></head><body></body></html>"#;
let doc1 = Document::from(html1);
let doc2 = Document::from(html2);
// create a matcher once, reuse on different documents
let title_matcher = Matcher::new("title").unwrap();
let title_el1 = doc1.select_matcher(&title_matcher);
assert_eq!(title_el1.text(), "Test Page 1".into());
let title_el2 = doc2.select_matcher(&title_matcher);
assert_eq!(title_el2.text(), "Test Page 2".into());
let title_single = doc1.select_single_matcher(&title_matcher);
assert_eq!(title_single.text(), "Test Page 1".into());
Selecting Ancestor Elements
You can use Node::ancestors() to retrieve the sequence of ancestor nodes for a given element in the document tree, which can be helpful when you need to navigate upward from a specific node.
use dom_query::{Document, Selection};
let doc: Document = r#"<!DOCTYPE html>
<html>
<head>Test</head>
<body>
<div id="great-ancestor">
<div id="grand-parent">
<div id="parent">
<div id="child">Child</div>
</div>
</div>
</div>
</body>
</html>
"#.into();
// Select an element
let child_sel = doc.select("#child");
assert!(child_sel.exists());
// Access the selected node
let child_node = child_sel.nodes().first().unwrap();
// Get all ancestor nodes for the `#child` node
let ancestors = child_node.ancestors(None);
let ancestor_sel = Selection::from(ancestors);
// or just: let ancestor_sel = child_sel.ancestors(None);
// In this case, all ancestor nodes up to the root <html> are included
assert!(ancestor_sel.is("html")); // Root <html> is included
assert!(ancestor_sel.is("#parent")); // Direct parent is also included
// `Selection::is` performs a shallow match, so it will not match `#child` in this selection.
assert!(!ancestor_sel.is("#child"));
// You can limit the number of ancestor nodes returned by specifying `max_limit`
let limited_ancestors = child_node.ancestors(Some(2));
let limited_ancestor_sel = Selection::from(limited_ancestors);
// With a limit of 2, only `#grand-parent` and `#parent` ancestors are included
assert!(limited_ancestor_sel.is("#grand-parent"));
assert!(limited_ancestor_sel.is("#parent"));
assert!(!limited_ancestor_sel.is("#great-ancestor")); // This node is excluded due to the limit
Note that ancestors() can be called on both NodeRef and Selection. NodeRef::ancestors() returns a vector of ancestor nodes, while Selection::ancestors() returns a new Selection containing the ancestor nodes.
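To make the effect of max_limit concrete, here is a minimal std-only sketch. It is a toy model and not dom_query's implementation: the tree is a hypothetical vector of parent indices, and the ancestors function below is invented for illustration. It walks parent links starting from the nearest ancestor and stops once the optional limit is reached, which mirrors the behaviour described above.

```rust
/// Toy tree: `parents[i]` is the index of node `i`'s parent, `None` for the root.
/// Hypothetical model for illustration only; not dom_query's data structure.
fn ancestors(parents: &[Option<usize>], node: usize, max_limit: Option<usize>) -> Vec<usize> {
    let mut out = Vec::new();
    let mut current = parents[node];
    while let Some(id) = current {
        // Stop as soon as the requested number of ancestors has been collected.
        if let Some(limit) = max_limit {
            if out.len() >= limit {
                break;
            }
        }
        out.push(id);
        current = parents[id];
    }
    out
}
```

With indices 0 to 5 standing for html, body, #great-ancestor, #grand-parent, #parent and #child, calling ancestors(&parents, 5, Some(2)) in this model yields only the indices of #parent and #grand-parent, nearest first.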
Selecting with pseudo-classes (:has, :has-text, :contains)
The dom_query crate provides versatile selector pseudo-classes, built on both its own functionality and the capabilities of the selectors crate. These pseudo-classes allow targeting elements based on attributes, text content, and context within the document.
use dom_query::Document;
let html = include_str!("../test-pages/rustwiki_2024.html");
let doc = Document::from(html);
// Search for list items (`li`) within a `tr` element that contains an `a` element
// with the title "Programming paradigm"
let paradigm_selection = doc.select(
r#"table tr:has(a[title="Programming paradigm"]) td.infobox-data ul > li"#
);
println!("Rust programming paradigms:");
for item in paradigm_selection.iter() {
println!(" {}", item.text());
}
println!("{:-<50}", "");
// Select items based on `th` containing text "Influenced by" and
// the following `tr` containing `td` with list items.
let influenced_by_selection = doc.select(
r#"table tr:has-text("Influenced by") + tr td ul > li > a"#
);
println!("Rust influenced by:");
for item in influenced_by_selection.iter() {
println!(" {}", item.text());
}
println!("{:-<50}", "");
// Extract all links within a paragraph containing "foreign function interface" text.
// Since a part of the text is in a separate tag, we use the `:contains` pseudo-class.
let links_selection = doc.select(
r#"p:contains("Rust has a foreign function interface") a[href^="/"]"#
);
println!("Links in the FFI block:");
for item in links_selection.iter() {
println!(" {}", item.attr("href").unwrap());
}
println!("{:-<50}", "");
// :only-text selects an element that contains only a single text node,
// with no child elements.
// It can be combined with other pseudo-classes to achieve more specific selections.
// For example, to select a <div> inside an <a>
// that has no siblings and no child elements other than text.
println!("Single <div> inside an <a> with text only:");
for el in doc.select("a div:only-text:only-child").iter() {
println!("{}", el.text().trim());
}
Key Points:
- :has(selector): Finds elements that contain a matching element anywhere within.
- :has-text("text"): Matches elements based on their immediate text content, ignoring any nested elements. This makes it ideal for selecting nodes where the direct text is crucial for differentiation.
- :contains("text"): Selects elements containing the specified text within them, useful when searching in a block of text.
- :only-text: Selects elements that contain only a single text node, with no other child nodes.
These pseudo-classes allow for precise and expressive searches within the DOM, enabling the selection of content-rich elements based on structural or attribute-driven conditions. For a full list of supported pseudo-classes, refer to the Supported CSS Pseudo-Classes List.
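The difference between :has-text (immediate text) and :contains (text anywhere inside) is easy to mix up, so here is a small std-only sketch of the two notions. It is a toy model under stated assumptions: the Node enum and the functions immediate_text, descendant_text, has_text and contains are hypothetical names made up for this illustration and say nothing about how dom_query implements its matching.

```rust
/// Toy DOM node. Hypothetical type for illustration; dom_query uses its own tree.
enum Node {
    Text(String),
    Element(Vec<Node>),
}

/// Concatenates only the direct text children (the "immediate" text).
fn immediate_text(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element(children) => children
            .iter()
            .filter_map(|c| match c {
                Node::Text(t) => Some(t.as_str()),
                Node::Element(_) => None,
            })
            .collect(),
    }
}

/// Concatenates the text of all descendants in document order.
fn descendant_text(node: &Node) -> String {
    match node {
        Node::Text(t) => t.clone(),
        Node::Element(children) => children.iter().map(descendant_text).collect(),
    }
}

/// Model of `:has-text`: looks at the immediate text only.
fn has_text(node: &Node, needle: &str) -> bool {
    immediate_text(node).contains(needle)
}

/// Model of `:contains`: looks at the text of the whole subtree.
fn contains(node: &Node, needle: &str) -> bool {
    descendant_text(node).contains(needle)
}
```

For a node modelling <h1>Test <span>Page</span></h1>, the phrase "Test Page" is split across two elements, so in this model has_text does not find it while contains does.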
Filtering Selection
You can filter a selection based on another selection. This can be useful when you need to narrow down a selection to only include elements that are also part of another selection.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head>TEST</head>
<body>
<div class="content">
<p>Content text has a <a href="/0">link</a></p>
</div>
<footer>
<a href="/1">Footer Link</a>
</footer>
</body>
</html>
"#.into();
// Selecting all links in the document
let sel_with_links = doc.select("a[href]");
assert_eq!(sel_with_links.length(), 2);
// Selecting every element inside
let content_sel = doc.select("div.content *");
// Filter selection by content selection, so now we get only links (actually only 1 link) that are inside
let filtered_sel = sel_with_links.filter_selection(&content_sel);
assert_eq!(filtered_sel.length(), 1);
You can also use Selection::filter, Selection::try_filter (which returns an Option<Selection>), and Selection::filter_matcher to filter a selection using a pre-compiled Matcher.
Adding Selection
You can combine multiple selections. This can be useful when you want to work with a combined set of elements.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html>
<head>Test</head>
<body>
<div id="great-ancestor">
<div id="grand-parent">
<div id="parent">
<div id="first-child">Child</div>
<div id="second-child">Child</div>
</div>
</div>
</div>
</body>
</html>"#.into();
let first_sel = doc.select("#first-child");
assert_eq!(first_sel.length(), 1);
let second_sel = doc.select("#second-child");
assert_eq!(second_sel.length(), 1);
let children_sel = first_sel.add_selection(&second_sel);
assert_eq!(children_sel.length(), 2);
Additionally, there are other methods available:
- Selection::add to add a single element.
- Selection::try_add, which returns an Option<Selection>.
- Selection::add_matcher to add elements using a pre-compiled Matcher.
Node Descendants
The descendants method can be used to retrieve all descendant nodes of a given element in the document tree. This method includes both element nodes and text nodes, including whitespace nodes between elements.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<div id="great-ancestor">
<div id="grand-parent">
<div id="parent">
<div id="first-child">Child</div>
<div id="second-child">Child</div>
</div>
</div>
<div id="grand-parent-sibling"></div>
</div>
</body>
</html>"#.into();
let ancestor_sel = doc.select("#great-ancestor");
assert!(ancestor_sel.exists());
let ancestor_node = ancestor_sel.nodes().first().unwrap();
let expected_id_names = vec![
"grand-parent-sibling",
"second-child",
"first-child",
"parent",
"grand-parent",
];
// if you want to reuse descendants then use `descendants` which returns a vector of nodes
let descendants = ancestor_node.descendants();
// Descendants include not only element nodes, but also text nodes.
// Whitespace characters between element nodes are also considered as text nodes.
// Therefore, the number of descendants is usually not equal to the number of element descendants.
let descendants_id_names = descendants
.iter()
.rev()
.filter(|n| n.is_element())
.map(|n| n.attr_or("id", "").to_string())
.collect::<Vec<_>>();
assert_eq!(descendants_id_names, expected_id_names);
Key Points:
- The descendants method returns all descendant nodes, including element nodes and text nodes.
- Whitespace characters between elements are considered as text nodes.
- The number of descendants is usually greater than the number of element descendants due to the inclusion of text nodes.
- You can filter the descendants to retrieve only element nodes using an iterator and the is_element method.
This method is useful for traversing the DOM tree and accessing nodes that are nested within a specific element.
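To see why the total count differs from the element count, the following std-only sketch models the traversal on a toy tree. Everything in it is an assumption made for illustration: the Node enum, the pre-order descendants function and the helper names are hypothetical and are not dom_query's types or code.

```rust
/// Toy DOM node (hypothetical; not dom_query's `NodeRef`).
enum Node {
    Text(String),
    Element(&'static str, Vec<Node>),
}

/// Collects every descendant in document order (depth-first, pre-order).
fn descendants(node: &Node) -> Vec<&Node> {
    let mut out = Vec::new();
    if let Node::Element(_, children) = node {
        for child in children {
            out.push(child);
            out.extend(descendants(child));
        }
    }
    out
}

/// True for text nodes that consist of whitespace only.
fn is_whitespace_text(node: &Node) -> bool {
    matches!(node, Node::Text(t) if t.trim().is_empty())
}

/// Names of the element descendants, text nodes filtered out.
fn element_names(node: &Node) -> Vec<&'static str> {
    descendants(node)
        .into_iter()
        .filter_map(|n| match n {
            Node::Element(name, _) => Some(*name),
            Node::Text(_) => None,
        })
        .collect()
}
```

A parent with two child elements written on separate lines has seven descendants in this model: two elements, their two text nodes, and three whitespace-only text nodes produced by the line breaks.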
Retrieving the base URI
The base_uri method is a much faster alternative to doc.select("html > head > base").attr("href").
Currently, it does not cache the result, so each call traverses the tree again.
The result is not cached so that Document keeps implementing the Send trait.
Example
use dom_query::Document;
let contents: &str = r#"<!DOCTYPE html>
<html>
<head>
<base href="https://www.example.com/"/>
<title>Test</title>
</head>
<body>
<div id="main"></div>
</body>
</html>"#;
let doc = Document::from(contents);
// Access the base URI directly from the document
let base_uri = doc.base_uri().unwrap();
assert_eq!(base_uri.as_ref(), "https://www.example.com/");
// Access the base URI from any node
let sel = doc.select_single("#main");
let node = sel.nodes().first().unwrap();
let base_uri = node.base_uri().unwrap();
assert_eq!(base_uri.as_ref(), "https://www.example.com/");
Verifying Selection and Node Matches
The is method is useful if you need to combine several checks into one expression. It can check for a certain position in the DOM tree, a certain attribute, or a certain element name all at once. This method is available for Selection and NodeRef.
use dom_query::Document;
let contents: &str = r#"<!DOCTYPE html>
<html>
<head>
<title>Test</title>
</head>
<body>
<div id="main" dir="ltr"></div>
<div id="extra"></div>
</body>
</html>"#;
let doc = Document::from(contents);
let main_sel = doc.select_single("#main");
let extra_sel = doc.select_single("#extra");
// For `Selection`, it verifies that at least one of the nodes in the selection
// matches the selector.
assert!(main_sel.is("div#main"));
assert!(!extra_sel.is("div#main"));
// For `NodeRef`, the `is` method verifies that the node matches the selector.
let main_node = main_sel.nodes().first().unwrap();
let extra_node = extra_sel.nodes().first().unwrap();
assert!(main_node.is("html > body > div#main[dir=ltr]"));
assert!(extra_node.is("html > body > div#main + div"));
Fast Finding Child Elements
There is an experimental find method, which is accessible only from NodeRef. It may be useful for a quick search by element names across descendants.
You need to provide a path argument, which is a sequence of element names. The method returns a vector of NodeRef values that correspond to the matching elements, in the order they appear in the document tree. Since it is experimental, the API may change in the future.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<div id="main"></div>
</body>
</html>"#.into();
let main_sel = doc.select_single("#main");
let main_node = main_sel.nodes().first().unwrap();
// create 10 child blocks with links
let total_links: usize = 10;
for i in 0..total_links {
let content = format!(r#"<div><a href="/{0}">{0} link</a></div>"#, i);
main_node.append_html(content);
}
let selected_count = doc.select("html body a").nodes().len();
assert_eq!(selected_count, total_links);
// `find` currently can deal only with paths that start after the current node.
// In the following example, `&["html", "body", "div", "a"]` will fail,
// while `&["a"]` or `&["div", "a"]` are okay.
let found_count = main_node.find(&["div", "a"]).len();
assert_eq!(found_count, total_links);
HTML and Text Content Extraction
Extracting HTML and Inner HTML
Serialization enables extracting HTML content of elements, either with or without outer tags. This can be useful for accessing structured content within elements.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><div class="content"><h1>Test Page</h1></div></body>
</html>"#;
let doc = Document::from(html);
let heading_selector = doc.select("div.content");
// Serialization including the outer HTML tag
let content = heading_selector.html();
assert_eq!(content.to_string(), r#"<div class="content"><h1>Test Page</h1></div>"#);
// Serialization excluding the outer HTML tag
let inner_content = heading_selector.inner_html();
assert_eq!(inner_content.to_string(), "<h1>Test Page</h1>");
The html() and inner_html() methods return serialized content as StrTendril. If no elements match the selector, html() and inner_html() return an empty value, whereas try_html() and try_inner_html() return an Option<StrTendril>, allowing you to handle the None case.
// Using `try_html()`, which returns an Option<StrTendril>.
// If there are no matching elements, it returns None.
let opt_no_content = doc.select("div.no-content").try_html();
assert_eq!(opt_no_content, None);
// The `html()` method will return an empty `StrTendril` if there are no matches
let no_content = doc.select("div.no-content").html();
assert_eq!(no_content, "".into());
// Similarly, `inner_html()` and `try_inner_html()` work the same way
assert_eq!(doc.select("div.no-content").try_inner_html(), None);
assert_eq!(doc.select("div.no-content").inner_html(), "".into());
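One practical reason to prefer the try_ variants is that an empty result can be ambiguous. The std-only sketch below models this with a hypothetical ToySelection type (not dom_query's Selection) and assumes that a matched element with no children serializes its inner HTML to an empty string. Under that assumption, the plain accessor cannot tell "no match" apart from "matched an empty element", while the Option-returning accessor can.

```rust
/// Hypothetical stand-in for a selection: the inner HTML of each matched element.
/// Illustration only; not dom_query's `Selection`.
struct ToySelection {
    inner: Vec<String>,
}

impl ToySelection {
    /// `None` when nothing matched, otherwise the inner HTML of the first match.
    fn try_inner_html(&self) -> Option<String> {
        self.inner.first().cloned()
    }

    /// Same lookup, but a missing match collapses to an empty string.
    fn inner_html(&self) -> String {
        self.try_inner_html().unwrap_or_default()
    }
}
```

In this model a selection with no matches and a selection holding one empty element both return an empty string from inner_html, but only the first returns None from try_inner_html.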
Extracting Descendant Text
The text() method retrieves all descendant text content within the selected element, concatenating any nested text nodes into a single string.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><div><h1>Test <span>Page</span></h1></div></body>
</html>"#;
let doc = Document::from(html);
let body_selection = doc.select("body div").first();
let text = body_selection.text();
assert_eq!(text.to_string(), "Test Page");
Extracting Immediate Text
The immediate_text() method retrieves the immediate text content of the selected element, excluding any text content from its descendants.
This is useful when you need to access the text content of an element without including the text content of its child elements.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><div><h1>Test <span>Page</span></h1></div></body>
</html>"#;
let doc = Document::from(html);
let body_selection = doc.select("body div h1").first();
// accessing immediate text without descendants
let text = body_selection.immediate_text();
assert_eq!(text.to_string(), "Test ");
Accessing and Manipulating the element's attributes
The dom_query crate provides several methods for accessing and manipulating the attributes of an HTML element.
All methods listed below apply to both Selection and Node.
Getting an attribute value
You can use the attr() method to retrieve the value of a specific attribute. If the attribute does not exist, it returns None.
You can use the attr_or() method to retrieve the value of a specific attribute, falling back to a default value if the attribute does not exist.
use dom_query::Document;
let html = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><input hidden="" id="k" class="important" type="hidden" name="k" data-k="100"></body>
</html>"#;
let doc = Document::from(html);
let input_selection = doc.select("input[name=k]");
let val = input_selection.attr("data-k").unwrap();
assert_eq!(val.to_string(), "100");
// try to get an attribute that does not exist
let val_or = input_selection.attr_or("data-l", "0");
assert_eq!(val_or.to_string(), "0");
Getting the class attribute
You can use the class() method to retrieve the value of the class attribute.
If the class attribute does not exist, the method returns None.
For Selection, the class method returns the value of the class attribute of the first element in the selection.
For Node, the class method returns the value of the class attribute of the current element.
use tendril::StrTendril;
let input_selection = doc.select("input");
let class: Option<StrTendril> = input_selection.class();
assert_eq!(class, Some("important".into()));
Getting the id attribute
Everything works the same way as with the class attribute, but the method for Selection is called id(), while for Node it is called id_attr().
use tendril::StrTendril;
let input_selection = doc.select("input");
let id_attr: Option<StrTendril> = input_selection.id();
assert_eq!(id_attr, Some("k".into()));
let input_node = input_selection.nodes().first().unwrap();
let id_attr: Option<StrTendril> = input_node.id_attr();
assert_eq!(id_attr, Some("k".into()));
Removing an attribute
You can use the remove_attr() method to remove a specific attribute from the element.
If it is called on a Selection, it removes the attribute from all elements in the selection.
input_selection.remove_attr("data-k");
Removing multiple attributes
You can use the remove_attrs() method to remove multiple attributes from the element.
If it is called on a Selection, it removes all listed attributes from all elements in the selection.
input_selection.remove_attrs(&["id", "class"]);
Setting an attribute value
You can use the set_attr() method to set the value of a specific attribute.
If it is called on a Selection, it sets the attribute on all elements in the selection.
input_selection.set_attr("data-k", "200");
Checking if an attribute exists
You can use the has_attr() method to check if a specific attribute exists on the element.
If it is called on a Selection, it checks whether the attribute exists on the first element in the selection.
let is_hidden = input_selection.has_attr("hidden");
assert!(is_hidden);
Removing all attributes
You can use the remove_all_attrs() method to remove all attributes from the element.
If it is called on a Selection, it removes all attributes from all elements in the selection.
input_selection.remove_all_attrs();
assert_eq!(input_selection.html(), r#"<input>"#.into());
Operating the class attribute
For Selection, these methods operate on all elements in the selection:
- add_class() adds a class to all elements.
- remove_class() removes a class from all elements.
- has_class() returns true if at least one element in the selection has the specified class.
For Node, the same methods work in the same way but only affect the current node.
// selecting an element.
let input_selection = doc.select("input");
// adding a class to all elements in the selection.
input_selection.add_class("new-class");
// checking if at least one element in the selection has a class.
assert!(input_selection.has_class("new-class"));
assert!(input_selection.has_class("important"));
// removing a class from all elements in the selection.
input_selection.remove_class("important");
// checking if at least one element in the selection has a class.
assert!(input_selection.has_class("new-class"));
assert!(!input_selection.has_class("important"));
Manipulating the DOM
Manipulating the Selection
The dom_query crate provides various methods to manipulate the DOM. Below are some examples demonstrating how to append new HTML nodes, set new content, remove selections, and replace selections with new HTML.
use dom_query::Document;
let html_contents = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<div class="content">
<p>9,8,7</p>
</div>
<div class="remove-it">
Remove me
</div>
<div class="replace-it">
<div>Replace me</div>
</div>
</body>
</html>"#;
let doc = Document::from(html_contents);
// Select the div with class "content"
let mut content_selection = doc.select("body .content");
// Append a new HTML node to the selection
content_selection.append_html(r#"<div class="inner">inner block</div>"#);
assert!(doc.select("body .content .inner").exists());
// Set a new content to the selection, replacing existing content
let mut set_selection = doc.select(".inner");
set_selection.set_html(r#"<p>1,2,3</p>"#);
assert_eq!(doc.select(".inner").html(),
r#"<div class="inner"><p>1,2,3</p></div>"#.into());
// Remove the selection with class "remove-it"
doc.select(".remove-it").remove();
assert!(!doc.select(".remove-it").exists());
// Replace the selection with new HTML, the current selection will not change
let mut replace_selection = doc.select(".replace-it");
replace_selection.replace_with_html(r#"<div class="replaced">Replaced</div>"#);
assert_eq!(replace_selection.text().trim(), "Replace me");
// But the document will reflect the changes
assert_eq!(doc.select(".replaced").text(),"Replaced".into());
// Prepend more elements to the selection
content_selection.prepend_html(r#"<p class="third">3</p>"#);
content_selection.prepend_html(r#"<p class="first">1</p><p class="second">2</p>"#);
// Also you can insert html before selection:
let first = content_selection.select(".first");
first.before_html(r#"<p class="none">None</p>"#);
// or after:
let third = content_selection.select(".third");
third.after_html(r#"<p class="fourth">4</p>"#);
// now the added paragraphs stand in front of the `div`
assert!(doc.select(r#".content > .none + .first + .second + .third + .fourth + div:has-text("1,2,3")"#).exists());
// to set text on the selection you can use `set_html`, but `set_text` is preferable:
let p_sel = content_selection.select("p");
let total_p = p_sel.length();
p_sel.set_text("test content");
assert_eq!(doc.select(r#"p:has-text("test content")"#).length(), total_p);
Explanation:
- Append HTML:
  - The append_html method is used to add a new HTML node to the existing selection.
- Set HTML:
  - The set_html method replaces the existing content of the selection with new HTML.
- Remove Selection:
  - The remove method deletes the elements matching the selector from the document.
- Replace with HTML:
  - The replace_with_html method replaces the selected elements with new HTML. Note that the selection itself remains unchanged, but the document reflects the new content.
- Prepend HTML:
  - The prepend_html method is used to add a new HTML node at the beginning of the existing selection.
- Insert HTML Before/After:
  - The before_html method inserts HTML before each element in the selection.
  - The after_html method inserts HTML after each element in the selection.
Renaming Elements Without Changing the Contents
The dom_query crate allows you to easily rename selected elements without changing their contents. Selection::rename renames every element in the selection, while Node::rename renames a single element.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<div class="content">
<div>1</div>
<div>2</div>
<div>3</div>
<span>4</span>
</div>
</body>
</html>"#.into();
let mut sel = doc.select("div.content > div, div.content > span");
// Before renaming, there are 3 `div` and 1 `span`
assert_eq!(sel.length(), 4);
sel.rename("p");
// After renaming, there are no `div` and `span` elements
assert_eq!(doc.select("div.content > div, div.content > span").length(), 0);
// But there are four `p` elements
assert_eq!(doc.select("div.content > p").length(), 4);
Creating and Manipulating Elements
The dom_query crate allows you to create and manipulate HTML elements with ease. Below are examples demonstrating how to create new elements, set attributes, append HTML, and replace content.
use dom_query::Document;
let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
<div id="main">
<p id="first">It's</p>
</div>
</body>
</html>"#.into();
// Selecting a node we want to attach a new element
let main_sel = doc.select_single("#main");
let main_node = main_sel.nodes().first().unwrap();
// Creating a simple element
let el = doc.tree.new_element("p");
// Setting attributes
el.set_attr("id", "second");
// Setting text content
el.set_text("test");
main_node.append_child(&el);
assert!(doc.select(r#"#main #second:has-text("test")"#).exists());
// Appending a more complex element using `append_html`
main_node.append_html(r#"<p id="third">Wonderful</p>"#);
assert_eq!(doc.select("#main #third").text().as_ref(), "Wonderful");
assert!(doc.select("#first").exists());
// There are also `prepend_child` and `prepend_html` methods, which
// insert content at the beginning of the node.
main_node.prepend_html(r#"<p id="minus-one">-1</p><p id="zero">0</p>"#);
assert!(doc.select("#main > #minus-one + #zero + #first + #second + #third").exists());
// Replacing existing element content with new HTML using `set_html`
main_node.set_html(r#"<p id="the-only">Wonderful</p>"#);
assert_eq!(doc.select("#main #the-only").text().as_ref(), "Wonderful");
assert!(!doc.select("#first").exists());
// Completely replacing the contents of the node,
// including itself, using `replace_with_html`
main_node.replace_with_html(
r#"<span>Tweedledum</span> and <span>Tweedledee</span>"#
);
assert!(!doc.select("#main").exists());
assert_eq!(doc.select("span + span").text().as_ref(), "Tweedledee");
// Inserting HTML content before a certain node using `node.before_html`
let span_sel = doc.select("body > span");
let span_node = span_sel.nodes().first().unwrap();
span_node.before_html(r#"<div id="main">Main Content</div>"#);
assert!(doc.select(r#"body > #main + span:has-text("Tweedledum")"#).exists());
// Inserting HTML content after a certain node using `node.after_html`
let span_node = span_sel.nodes().last().unwrap();
span_node.after_html(r#"<div id="extra">Extra Content</div>"#);
assert!(doc.select(r#"body > span:has-text("Tweedledee") + #extra"#).exists());
// To insert nodes before or after a certain element,
// use the `node.insert_before` and `node.insert_after` methods.
// Both methods share the same behavior as `node.append_child`.
Explanation:
- Creating a Simple Element:
  - Use doc.tree.new_element() to create a new orphan element.
  - Set attributes using node.set_attr().
  - Set text content using node.set_text().
  - Use node.append_child() to append a new child element node to the selected node.
  - Use node.prepend_child() to prepend a new child element node to the selected node.
  - Use node.insert_before() to insert a new sibling element node before the selected node.
  - Use node.insert_after() to insert a new sibling element node after the selected node.
- Appending HTML:
  - Use append_html to add a more complex HTML node to the existing selection.
  - This method is more convenient for adding multiple elements to the selected node.
- Prepending HTML:
  - Use prepend_html to add new HTML nodes at the beginning of the existing selection.
  - Use prepend_child to prepend a new or an existing element node to the selected node.
- Setting New HTML Content:
  - Use set_html to replace the existing content of the selected node with new HTML.
  - It changes the inner HTML contents of the node.
- Replacing Node Contents Completely:
  - Use replace_with_html to replace the entire content of the node, including the node itself.
- Inserting HTML Before/After:
  - Use before_html to insert HTML before each element in the selection.
  - Use after_html to insert HTML after each element in the selection.
Additionally, methods like replace_with_html, set_html, append_html, prepend_html, before_html, and after_html accept a string that contains more than one element.
Text Node Normalization
Normalization merges adjacent text nodes into a single node and removes empty text nodes. This keeps the document structure compact and organized.
use dom_query::Document;
let contents = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
<div id="parent">
<div id="child">Child</div>
</div>
</body>
</html>"#;
let doc = Document::from(contents);
// Select the node with id "child"
let child_sel = doc.select_single("#child");
let child = child_sel.nodes().first().unwrap();
// Check that the node initially has only one child
assert_eq!(child.children_it(false).count(), 1);
// Create and append new text nodes
let text_1 = doc.tree.new_text(" and a");
let text_2 = doc.tree.new_text(" ");
let text_3 = doc.tree.new_text("tail");
child.append_child(&text_1);
child.append_child(&text_2);
child.append_child(&text_3);
// Verify the text and child count before normalization
assert_eq!(child.text(), "Child and a tail".into());
assert_eq!(child.children_it(false).count(), 4);
// Normalize the node
child.normalize();
// Verify the text and child count after normalization
assert_eq!(child.children_it(false).count(), 1);
assert_eq!(child.text(), "Child and a tail".into());
The normalize method follows the Node.normalize() specification.
This method is also available through the Document struct as Document::normalize(), which applies normalization to all text nodes within the document tree.
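The merging rule itself can be sketched without the crate. The block below is a toy model, not dom_query's implementation: the `Node` enum and the `normalize`, `text_content`, and `demo_child` functions are hypothetical names introduced only for illustration. It assumes that normalization drops empty text nodes, concatenates runs of adjacent text nodes, and recurses into child elements.

```rust
/// A toy DOM node (hypothetical, not a dom_query type):
/// either a text node or an element holding child nodes.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element(Vec<Node>),
}

/// Merges adjacent text nodes, drops empty text nodes,
/// and recurses into child elements.
fn normalize(children: &mut Vec<Node>) {
    let mut merged: Vec<Node> = Vec::with_capacity(children.len());
    for node in children.drain(..) {
        match node {
            // Empty text nodes are removed entirely.
            Node::Text(text) if text.is_empty() => {}
            // A text node is appended to the previous node when that node
            // is also text; otherwise it starts a new text node.
            Node::Text(text) => match merged.last_mut() {
                Some(Node::Text(prev)) => prev.push_str(&text),
                _ => merged.push(Node::Text(text)),
            },
            // Elements are kept, with their own children normalized.
            Node::Element(mut kids) => {
                normalize(&mut kids);
                merged.push(Node::Element(kids));
            }
        }
    }
    *children = merged;
}

/// Concatenates the text of all descendant text nodes.
fn text_content(children: &[Node]) -> String {
    children
        .iter()
        .map(|node| match node {
            Node::Text(text) => text.clone(),
            Node::Element(kids) => text_content(kids),
        })
        .collect()
}

/// The children of `#child` from the example above,
/// after the three extra text nodes have been appended.
fn demo_child() -> Vec<Node> {
    vec![
        Node::Text("Child".to_string()),
        Node::Text(" and a".to_string()),
        Node::Text(" ".to_string()),
        Node::Text("tail".to_string()),
    ]
}
```

Calling `normalize` on the result of `demo_child()` reduces the four text nodes to one while `text_content` stays the same, which mirrors the child counts asserted in the dom_query example.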
Supported CSS pseudo-classes in dom_query
Implemented by the selectors crate:
:empty
:first-child
:last-child
:has
:is
:where
:last-of-type
:not
:only-child
:only-of-type
:nth-child
:nth-last-child
Implemented by dom_query:
:any-link
:link
:has-text
:contains
:only-text
Notes
:has-text - checks whether any child node contains the specified text.
:contains - checks whether the combined text of all child nodes contains the specified text.
:only-text - checks whether the element contains only a single text node, with no other child nodes.
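The difference between these three checks is easiest to see on a small model. The sketch below is based only on the wording of the notes above and is not dom_query's implementation: the `Child` enum and the `has_text`, `contains`, `only_text`, and `sample_mixed` functions are hypothetical, and child elements are reduced to their text for simplicity. The crate may treat nested descendants differently.

```rust
/// A toy child node (hypothetical, not a dom_query type):
/// a text node, or a child element reduced to its text.
enum Child {
    Text(String),
    Element(String),
}

impl Child {
    /// The text carried by this child node.
    fn text(&self) -> &str {
        match self {
            Child::Text(text) | Child::Element(text) => text.as_str(),
        }
    }
}

/// `:has-text` as described above: at least one child node,
/// taken on its own, contains `needle`.
fn has_text(children: &[Child], needle: &str) -> bool {
    children.iter().any(|child| child.text().contains(needle))
}

/// `:contains` as described above: the combined text of
/// all child nodes contains `needle`.
fn contains(children: &[Child], needle: &str) -> bool {
    let combined: String = children.iter().map(Child::text).collect();
    combined.contains(needle)
}

/// `:only-text` as described above: exactly one child node,
/// and that node is a text node.
fn only_text(children: &[Child]) -> bool {
    matches!(children, [Child::Text(_)])
}

/// Models the children of `<p>Hello, <b>World</b>!</p>`.
fn sample_mixed() -> Vec<Child> {
    vec![
        Child::Text("Hello, ".to_string()),
        Child::Element("World".to_string()),
        Child::Text("!".to_string()),
    ]
}
```

In this model the phrase "Hello, World" spans two child nodes, so only the `contains` check matches it, while "World" on its own satisfies both `has_text` and `contains`.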
WASM32 Compilation
When compiling dom_query to WebAssembly (target wasm32-unknown-unknown) using wasm-pack, you may encounter runtime panics related to memory allocation, such as:
panicked at 'assertion failed: psize <= size + max_overhead'
This issue currently occurs due to compatibility problems between the latest versions of the selectors crate and the dlmalloc crate. It manifests specifically when using pseudo-classes, including the selectors crate's own pseudo-classes such as :not and :has.
If you must compile dom_query for a wasm32 application, consider using an alternative to dlmalloc. The following allocators have been tested and work successfully:
- wee_alloc
- lol_alloc
- mini-alloc (fast, but it never deallocates memory)
- alloc_cat
Solution:
- Add your chosen allocator to your Cargo.toml (alloc_cat is used in this example):
[dependencies]
alloc_cat = "1.0.0"
- Set your favorite allocator as the global allocator in your lib.rs or main.rs:
#[cfg(target_arch = "wasm32")]
#[global_allocator]
pub static ALLOC: &alloc_cat::AllocCat = &alloc_cat::ALLOCATOR;
- Build or test your WebAssembly project