Parsing

Parsing HTML Documents

The Document struct in dom_query is designed to handle full HTML documents. You can create a Document by passing in HTML content, which can be provided in several formats: &str, String, or StrTendril.

extern crate dom_query;

use dom_query::Document;
use tendril::StrTendril;

// HTML content as a string slice
let contents_str = r#"<!DOCTYPE html>
<html><head><title>Test Page</title></head><body></body></html>"#;
let doc = Document::from(contents_str);

// HTML content as a String
let contents_string = contents_str.to_string();
let doc = Document::from(contents_string);

// HTML content as a StrTendril
let contents_tendril = StrTendril::from(contents_str);
let doc = Document::from(contents_tendril);

// Checking the root element of the `Document`
assert!(doc.root().is_document());

When parsing a full HTML document, Document recognizes a <!DOCTYPE> declaration at the start of the input and adds the Doctype as the first child of the root Document node. If the input has no <!DOCTYPE>, no Doctype node is created.

assert!(doc.root().first_child().unwrap().is_doctype());

Parsing HTML Fragments

For cases where you need to parse only a part of an HTML document, such as a snippet or component, dom_query provides Document::fragment(). This function also accepts &str, String, or StrTendril, but behaves a little differently from Document::from() in that it treats the input as a fragment instead of a full document.

use dom_query::Document;
use tendril::StrTendril;

// Parsing an HTML fragment from a string slice
let contents_str = r#"<div><p>Example Fragment</p></div>"#;
let fragment = Document::fragment(contents_str);

// Parsing from a String
let contents_string = contents_str.to_string();
let fragment = Document::fragment(contents_string);

// Parsing from a StrTendril
let contents_tendril = StrTendril::from(contents_str);
let fragment = Document::fragment(contents_tendril);

// Checking the root element of the fragment
assert!(!fragment.root().is_document());
assert!(fragment.root().is_fragment());

Note that Document::fragment() ignores Doctype declarations and parses only the fragment content itself.

// Confirming Doctype is excluded in the fragment
assert!(!fragment.root().first_child().unwrap().is_doctype());

Document::fragment() is also used internally within the library to create new elements within the document tree.

Querying

Selecting Elements

The dom_query crate provides several selection methods to locate HTML elements in the document. Using CSS-like selectors, you can select both single and multiple elements.

use dom_query::Document;

let html = r#"<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Test Page</title>
    </head>
    <body>
        <h1>Test Page</h1>
        <ul>
            <li>One</li>
            <li><a href="/2">Two</a></li>
            <li><a href="/3">Three</a></li>
        </ul>
    </body>
</html>"#;
let document = Document::from(html);

// Select a single element
let li = document.select("ul li:nth-child(2)");
let text = li.text().to_string();
assert_eq!(text, "Two");

// Selecting multiple elements
document.select("ul > li:has(a)").iter().for_each(|sel| {
    assert!(sel.is("li"));
});

// Optionally select an element with `try_select`, which returns an `Option`
let no_sel = document.try_select("p");
assert!(no_sel.is_none());

The Selection::is method checks whether any element in the current selection matches a given selector, without searching the elements' descendants. dom_query supports the pseudo-classes provided by the selectors crate, plus a few of its own.

Selecting a Single Match and Multiple Matches

To retrieve only the first match of a selector, use the select_single method. It is useful when you want a single match without iterating through all matches.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
    <ul class="list">
        <li>1</li><li>2</li><li>3</li>
    </ul>
    <ul class="list">
        <li>4</li><li>5</li><li>6</li>
    </ul>
</body>
</html>"#.into();

// selecting a first match
let single_selection = doc.select_single(".list");
assert_eq!(single_selection.length(), 1);
assert_eq!(single_selection.inner_html().to_string().trim(), 
    "<li>1</li><li>2</li><li>3</li>");

// selecting all matches
let selection = doc.select(".list");
assert_eq!(selection.length(), 2);
// but property methods usually return
// the result for the first match only
assert_eq!(selection.inner_html().to_string().trim(), 
    "<li>1</li><li>2</li><li>3</li>");

// This creates a Selection from the first node in the selection
let first_selection = doc.select(".list").first();
assert_eq!(first_selection.length(), 1);
assert_eq!(first_selection.inner_html().to_string().trim(), 
    "<li>1</li><li>2</li><li>3</li>");

// This approach also creates a new Selection from the next node, each iteration
let next_selection = doc.select(".list").iter().next().unwrap();
assert_eq!(next_selection.length(), 1);
assert_eq!(next_selection.inner_html().to_string().trim(), 
    "<li>1</li><li>2</li><li>3</li>");

// currently, to get data from all matches you need to iterate over them:
let all_matched: String = selection
.iter()
.map(|s| s.inner_html().trim().to_string())
.collect();

assert_eq!(
    all_matched,
    "<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);

// Same as above, but a little cheaper: we iterate over the nodes
// and do not create a new Selection on each iteration
let all_matched: String = doc
        .select(".list").nodes()
        .iter()
        .map(|s| s.inner_html().trim().to_string())
        .collect();

assert_eq!(
    all_matched,
    "<li>1</li><li>2</li><li>3</li><li>4</li><li>5</li><li>6</li>"
);

Descendant selections

Elements can be selected in relation to a parent element. Here, a Document is queried for ul elements, and then descendant selectors are applied within that context.

use dom_query::Document;

let html = r#"<!DOCTYPE html>
<html>
    <head>
        <meta charset="utf-8">
        <title>Test Page</title>
    </head>
    <body>
        <h1>Test Page</h1>
        <ul class="list-a">
            <li>One</li>
            <li><a href="/2">Two</a></li>
            <li><a href="/3">Three</a></li>
        </ul>
        <ul class="list-b">
            <li><a href="/4">Four</a></li>
        </ul>
    </body>
</html>"#;
let document = Document::from(html);

// selecting parent elements
let ul = document.select("ul");
ul.select("li").iter().for_each(|el| {
    // a descendant selection matches only within the parent's subtree
    assert!(el.is("li"));
});

// A descendant selector may also include elements above the parent in the tree.
// This can help pinpoint the exact element you want to select
let el = ul.select("body ul.list-b li").first();
let text = el.text();
assert_eq!("Four", text.to_string());

Selecting with Precompiled Matchers

For repeated queries, dom_query allows using precompiled matchers. This approach enhances performance when matching the same pattern across multiple documents.

use dom_query::{Document, Matcher};

let html1 = r#"<!DOCTYPE html>
    <html><head><title>Test Page 1</title></head><body></body></html>"#;
let html2 = r#"<!DOCTYPE html>
    <html><head><title>Test Page 2</title></head><body></body></html>"#;
let doc1 = Document::from(html1);
let doc2 = Document::from(html2);

// create a matcher once, reuse on different documents
let title_matcher = Matcher::new("title").unwrap();

let title_el1 = doc1.select_matcher(&title_matcher);
assert_eq!(title_el1.text(), "Test Page 1".into());

let title_el2 = doc2.select_matcher(&title_matcher);
assert_eq!(title_el2.text(), "Test Page 2".into());

let title_single = doc1.select_single_matcher(&title_matcher);
assert_eq!(title_single.text(), "Test Page 1".into());

Selecting Ancestor Elements

You can use Node::ancestors() to retrieve the sequence of ancestor nodes for a given element in the document tree, which can be helpful when you need to navigate upward from a specific node.


use dom_query::{Document, Selection};

let doc: Document = r#"<!DOCTYPE html>
<html>
    <head>Test</head>
    <body>
        <div id="great-ancestor">
            <div id="grand-parent">
                <div id="parent">
                    <div id="child">Child</div>
                </div>
            </div>
        </div>
    </body>
</html>
"#.into();

// Select an element
let child_sel = doc.select("#child");
assert!(child_sel.exists());

// Access the selected node
let child_node = child_sel.nodes().first().unwrap();

// Get all ancestor nodes for the `#child` node
let ancestors = child_node.ancestors(None);
let ancestor_sel = Selection::from(ancestors);

// or just: let ancestor_sel = child_sel.ancestors(None);

// In this case, all ancestor nodes up to the root <html> are included
assert!(ancestor_sel.is("html")); // Root <html> is included
assert!(ancestor_sel.is("#parent")); // Direct parent is also included

// `Selection::is` performs a shallow match, so it will not match `#child` in this selection.
assert!(!ancestor_sel.is("#child"));

// You can limit the number of ancestor nodes returned by specifying `max_limit`
let limited_ancestors = child_node.ancestors(Some(2));
let limited_ancestor_sel = Selection::from(limited_ancestors);

// With a limit of 2, only `#grand-parent` and `#parent` ancestors are included
assert!(limited_ancestor_sel.is("#grand-parent"));
assert!(limited_ancestor_sel.is("#parent"));
assert!(!limited_ancestor_sel.is("#great-ancestor")); // This node is excluded due to the limit

Note that ancestors() can be called on both NodeRef and Selection. NodeRef::ancestors() returns a vector of ancestor nodes, while Selection::ancestors() returns a new Selection containing the ancestor nodes.
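
The effect of max_limit can be pictured with a small, std-only sketch of an upward walk over parent links. This is a toy model under stated assumptions, not dom_query's implementation: ToyTree, its fields, and sample_tree are hypothetical names, the walk is assumed to visit the nearest ancestor first, and the Document root node that sits above <html> in a real tree is left out for brevity.

```rust
/// Toy arena tree: `parent[i]` holds the index of node `i`'s parent, if any.
/// Hypothetical model for illustration; not dom_query's data structure.
struct ToyTree {
    names: Vec<&'static str>,
    parent: Vec<Option<usize>>,
}

impl ToyTree {
    /// Walks upward from `id`, nearest ancestor first (an assumption),
    /// stopping after `max_limit` ancestors when a limit is given.
    fn ancestors(&self, id: usize, max_limit: Option<usize>) -> Vec<&'static str> {
        let mut found = Vec::new();
        let mut current = self.parent[id];
        while let Some(p) = current {
            if max_limit.map_or(false, |limit| found.len() >= limit) {
                break;
            }
            found.push(self.names[p]);
            current = self.parent[p];
        }
        found
    }
}

/// Builds the chain html > body > #great-ancestor > #grand-parent > #parent > #child.
fn sample_tree() -> ToyTree {
    ToyTree {
        names: vec![
            "html",
            "body",
            "#great-ancestor",
            "#grand-parent",
            "#parent",
            "#child",
        ],
        parent: vec![None, Some(0), Some(1), Some(2), Some(3), Some(4)],
    }
}
```

In this model, sample_tree().ancestors(5, Some(2)) yields #parent and #grand-parent, which mirrors the limited selection in the example above, while passing None walks all the way up to html.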

Selecting with pseudo-classes (:has, :has-text, :contains)

The dom_query crate provides versatile selector pseudo-classes, built on both its own functionality and the capabilities of the selectors crate. These pseudo-classes allow targeting elements based on attributes, text content, and context within the document.

use dom_query::Document;

let html = include_str!("../test-pages/rustwiki_2024.html");
let doc = Document::from(html);

// Search for list items (`li`) within a `tr` element that contains an `a` element
// with the title "Programming paradigm"
let paradigm_selection = doc.select(
    r#"table tr:has(a[title="Programming paradigm"]) td.infobox-data ul > li"#
    );

println!("Rust programming paradigms:");
for item in paradigm_selection.iter() {
    println!(" {}", item.text());
}
println!("{:-<50}", "");

// Select items based on `th` containing text "Influenced by" and
// the following `tr` containing `td` with list items.
let influenced_by_selection = doc.select(
    r#"table tr:has-text("Influenced by") + tr td ul > li > a"#
    );

println!("Rust influenced by:");
for item in influenced_by_selection.iter() {
    println!(" {}", item.text());
}
println!("{:-<50}", "");

// Extract all links within a paragraph containing "foreign function interface" text.
// Since a part of the text is in a separate tag, we use the `:contains` pseudo-class.
let links_selection = doc.select(
    r#"p:contains("Rust has a foreign function interface") a[href^="/"]"#
    );

println!("Links in the FFI block:");
for item in links_selection.iter() {
    println!(" {}", item.attr("href").unwrap());
}
println!("{:-<50}", "");

// :only-text selects an element that contains only a single text node, 
// with no child elements.
// It can be combined with other pseudo-classes to achieve more specific selections.
// For example, to select a <div> inside an <a>
// that has no siblings and no child elements other than text.
println!("Single <div> inside an <a> with text only:");
for el in doc.select("a div:only-text:only-child").iter() {
    println!("{}", el.text().trim());
}

Key Points:

:has(selector): Finds elements that contain a matching element anywhere within.
:has-text("text"): Matches elements based on their immediate text content, ignoring any nested elements. This makes it ideal for selecting nodes where the direct text is crucial for differentiation.
:contains("text"): Selects elements containing the specified text within them, useful when searching in a block of text.
:only-text: Selects elements that contain only a single text node, with no other child nodes.

These pseudo-classes allow for precise and expressive searches within the DOM, enabling the selection of content-rich elements based on structural or attribute-driven conditions. For a full list of supported pseudo-classes, refer to the Supported CSS Pseudo-Classes List.

Filtering Selection

You can filter a selection based on another selection. This can be useful when you need to narrow down a selection to only include elements that are also part of another selection.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
    <head>TEST</head>
    <body>
        <div class="content">
            <p>Content text has a <a href="/0">link</a></p>
        </div>
        <footer>
            <a href="/1">Footer Link</a>
        </footer>
    </body>
</html>
"#.into();

// Selecting all links in the document
let sel_with_links = doc.select("a[href]");

assert_eq!(sel_with_links.length(), 2);

// Selecting every element inside
let content_sel = doc.select("div.content *");

// Filter selection by content selection, so now we get only links (actually only 1 link) that are inside
let filtered_sel = sel_with_links.filter_selection(&content_sel);

assert_eq!(filtered_sel.length(), 1);

You can also use Selection::filter, Selection::try_filter (which returns an Option<Selection>), and Selection::filter_matcher (which filters a selection using a pre-compiled Matcher).

Adding Selection

You can combine multiple selections. This can be useful when you want to work with a combined set of elements.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html>
    <head>Test</head>
    <body>
       <div id="great-ancestor">
           <div id="grand-parent">
               <div id="parent">
                   <div id="first-child">Child</div>
                   <div id="second-child">Child</div>
               </div>
           </div>
       </div>
    </body>
</html>"#.into();

let first_sel = doc.select("#first-child");
assert_eq!(first_sel.length(), 1);
let second_sel = doc.select("#second-child");
assert_eq!(second_sel.length(), 1);
let children_sel = first_sel.add_selection(&second_sel);
assert_eq!(children_sel.length(), 2);

Additionally, there are other methods available:

Selection::add to add a single element.
Selection::try_add which returns an Option<Selection>.
Selection::add_matcher to add elements using a pre-compiled Matcher.

Node Descendants

The descendants method can be used to retrieve all descendant nodes of a given element in the document tree. This method includes both element nodes and text nodes, including whitespace nodes between elements.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body>
       <div id="great-ancestor">
           <div id="grand-parent">
               <div id="parent">
                   <div id="first-child">Child</div>
                   <div id="second-child">Child</div>
               </div>
           </div>
           <div id="grand-parent-sibling"></div>
        </div>
    </body>
</html>"#.into();

let ancestor_sel = doc.select("#great-ancestor");
assert!(ancestor_sel.exists());

let ancestor_node = ancestor_sel.nodes().first().unwrap();

let expected_id_names = vec![
    "grand-parent-sibling",
    "second-child",
    "first-child",
    "parent",
    "grand-parent",
];

// `descendants` returns a vector of nodes, which is handy if you want to reuse them
let descendants = ancestor_node.descendants();

// Descendants include not only element nodes, but also text nodes.
// Whitespace characters between element nodes are also considered as text nodes.
// Therefore, the number of descendants is usually not equal to the number of element descendants.

let descendants_id_names = descendants
    .iter()
    .rev()
    .filter(|n| n.is_element())
    .map(|n| n.attr_or("id", "").to_string())
    .collect::<Vec<_>>();

assert_eq!(descendants_id_names, expected_id_names);

Key Points:

The descendants method returns all descendant nodes, including element nodes and text nodes.
Whitespace characters between elements are considered as text nodes.
The number of descendants is usually greater than the number of element descendants due to the inclusion of text nodes.
You can filter the descendants to retrieve only element nodes using an iterator and the is_element method.

This method is useful for traversing the DOM tree and accessing nodes that are nested within a specific element.
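
Why the descendant count exceeds the element count can be seen in a minimal, std-only model of the traversal. The sketch below is illustrative only: DemoNode, DemoKind, and the helper functions are hypothetical names, and a pre-order (document order) walk is assumed rather than taken from dom_query's source.

```rust
/// Toy node kinds. Hypothetical model for illustration; not dom_query's types.
enum DemoKind {
    Element(&'static str),
    Text(&'static str),
}

struct DemoNode {
    kind: DemoKind,
    children: Vec<DemoNode>,
}

/// Collects every node below `node` in pre-order (assumed document order).
fn descendants(node: &DemoNode) -> Vec<&DemoNode> {
    let mut out = Vec::new();
    for child in &node.children {
        out.push(child);
        out.extend(descendants(child));
    }
    out
}

/// Keeps only the tag names of element nodes, dropping text nodes.
fn element_names(nodes: &[&DemoNode]) -> Vec<&'static str> {
    nodes
        .iter()
        .filter_map(|n| match n.kind {
            DemoKind::Element(name) => Some(name),
            DemoKind::Text(_) => None,
        })
        .collect()
}

/// Counts text nodes that consist only of whitespace.
fn whitespace_text_count(nodes: &[&DemoNode]) -> usize {
    nodes
        .iter()
        .filter(|n| matches!(n.kind, DemoKind::Text(t) if t.trim().is_empty()))
        .count()
}

/// Models `<div>\n  <p>a</p>\n</div>`: whitespace between tags becomes text nodes.
fn sample_div() -> DemoNode {
    let text = |t: &'static str| DemoNode { kind: DemoKind::Text(t), children: vec![] };
    DemoNode {
        kind: DemoKind::Element("div"),
        children: vec![
            text("\n  "),
            DemoNode { kind: DemoKind::Element("p"), children: vec![text("a")] },
            text("\n"),
        ],
    }
}
```

For sample_div(), descendants returns four nodes, of which only one is an element and two are whitespace-only text nodes. This is the same effect that makes the is_element filter necessary in the dom_query example.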

Retrieving the base URI

The base_uri method is a much faster alternative to doc.select("html > head > base").attr("href"). Currently, it does not cache the result, so each call traverses the tree again. The result is not cached in order to keep Document implementing the Send trait.

Example

use dom_query::Document;

let contents: &str = r#"<!DOCTYPE html>
<html>
    <head>
        <base href="https://www.example.com/"/>
        <title>Test</title>
    </head>
    <body>
        <div id="main"></div>
    </body>
</html>"#;

let doc = Document::from(contents);

// Access the base URI directly from the document
let base_uri = doc.base_uri().unwrap();
assert_eq!(base_uri.as_ref(), "https://www.example.com/");

// Access the base URI from any node
let sel = doc.select_single("#main");
let node = sel.nodes().first().unwrap();
let base_uri = node.base_uri().unwrap();
assert_eq!(base_uri.as_ref(), "https://www.example.com/");

Verifying Selection and Node Matches

The is method is useful if you need to combine several checks into one expression. It can check for a certain position in the DOM tree, a certain attribute, or a certain element name, all at once. This method is available for Selection and NodeRef.

use dom_query::Document;

let contents: &str = r#"<!DOCTYPE html>
<html>
    <head>
        <title>Test</title>
    </head>
    <body>
        <div id="main" dir="ltr"></div>
        <div id="extra"></div>
    </body>
</html>"#;
let doc = Document::from(contents);

let main_sel = doc.select_single("#main");
let extra_sel = doc.select_single("#extra");

// For `Selection`, it verifies that at least one of the nodes in the selection
// matches the selector.
assert!(main_sel.is("div#main"));
assert!(!extra_sel.is("div#main"));

// For `NodeRef`, the `is` method verifies that the node matches the selector.
let main_node = main_sel.nodes().first().unwrap();
let extra_node = extra_sel.nodes().first().unwrap();

assert!(main_node.is("html > body > div#main[dir=ltr]"));
assert!(extra_node.is("html > body > div#main + div"));

Fast Finding Child Elements

There is an experimental find method which is accessible only from NodeRef. It may be useful to perform a quick search by element names across descendants. You need to provide a path argument, which is a sequence of element names. The method returns a vector of NodeRef that correspond to the matching elements. The elements are returned in the order they appear in the document tree. Since it is experimental, the API may change in the future.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body>
        <div id="main"></div>
    </body>
</html>"#.into();

let main_sel = doc.select_single("#main");
let main_node = main_sel.nodes().first().unwrap();

// create 10 child blocks with links
let total_links: usize = 10;
for i in 0..total_links {
    let content = format!(r#"<div><a href="/{0}">{0} link</a></div>"#, i);
    main_node.append_html(content);
}
let selected_count = doc.select("html body a").nodes().len();
assert_eq!(selected_count, total_links);

// `find` currently can deal only with paths that start after the current node. 
// In the following example, `&["html", "body", "div", "a"]` will fail,
// while `&["a"]` or `&["div", "a"]` are okay.
let found_count = main_node.find(&["div", "a"]).len();
assert_eq!(found_count, total_links);

HTML and Text Content Extraction

Extracting HTML and Inner HTML

Serialization enables extracting HTML content of elements, either with or without outer tags. This can be useful for accessing structured content within elements.

use dom_query::Document;

let html = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body><div class="content"><h1>Test Page</h1></div></body>
</html>"#;
let doc = Document::from(html);
let content_sel = doc.select("div.content");

// Serialization including the outer HTML tag
let content = content_sel.html();
assert_eq!(content.to_string(), r#"<div class="content"><h1>Test Page</h1></div>"#);

// Serialization excluding the outer HTML tag
let inner_content = content_sel.inner_html();
assert_eq!(inner_content.to_string(), "<h1>Test Page</h1>");

The html() and inner_html() methods return serialized content as StrTendril. If no elements match the selector, html() and inner_html() will return an empty value, whereas try_html() and try_inner_html() return an Option<StrTendril>, allowing for handling of None.

// Using `try_html()`, which returns an Option<StrTendril>.
// If there are no matching elements, it returns None.
let opt_no_content = doc.select("div.no-content").try_html();
assert_eq!(opt_no_content, None);

// The `html()` method will return an empty `StrTendril` if there are no matches
let no_content = doc.select("div.no-content").html();
assert_eq!(no_content, "".into());

// Similarly, `inner_html()` and `try_inner_html()` work the same way
assert_eq!(doc.select("div.no-content").try_inner_html(), None);
assert_eq!(doc.select("div.no-content").inner_html(), "".into());

Extracting Descendant Text

The text() method retrieves all descendant text content within the selected element, concatenating any nested text nodes into a single string.

use dom_query::Document;

let html = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body><div><h1>Test <span>Page</span></h1></div></body>
</html>"#;
let doc = Document::from(html);
let body_selection = doc.select("body div").first();
let text = body_selection.text();
assert_eq!(text.to_string(), "Test Page");

Extracting Immediate Text

The immediate_text() method retrieves the immediate text content of the selected element, excluding any text content from its descendants.

This is useful when you need to access the text content of an element without including the text content of its child elements.

use dom_query::Document;

let html = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body><div><h1>Test <span>Page</span></h1></div></body>
</html>"#;

let doc = Document::from(html);

let body_selection = doc.select("body div h1").first();
// accessing immediate text without descendants
let text = body_selection.immediate_text();
assert_eq!(text.to_string(), "Test ");
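
The difference between text() and immediate_text() comes down to how deep the walk goes. The following std-only sketch models that distinction on a toy tree. It is not how dom_query stores nodes: MiniNode, all_text, direct_text, and sample_h1 are hypothetical names introduced only for illustration.

```rust
/// Toy DOM node: either a text run or an element with children.
/// Hypothetical model for illustration; not dom_query's node type.
enum MiniNode {
    Text(String),
    Element(Vec<MiniNode>),
}

/// Concatenates every text run below `node`, at any depth (the `text()` idea).
fn all_text(node: &MiniNode) -> String {
    match node {
        MiniNode::Text(t) => t.clone(),
        MiniNode::Element(children) => children.iter().map(all_text).collect(),
    }
}

/// Concatenates only the text runs that are direct children
/// (the `immediate_text()` idea); nested elements are skipped.
fn direct_text(node: &MiniNode) -> String {
    match node {
        MiniNode::Text(t) => t.clone(),
        MiniNode::Element(children) => children
            .iter()
            .filter_map(|child| match child {
                MiniNode::Text(t) => Some(t.as_str()),
                MiniNode::Element(_) => None,
            })
            .collect(),
    }
}

/// Models `<h1>Test <span>Page</span></h1>`.
fn sample_h1() -> MiniNode {
    MiniNode::Element(vec![
        MiniNode::Text("Test ".to_string()),
        MiniNode::Element(vec![MiniNode::Text("Page".to_string())]),
    ])
}
```

For sample_h1(), all_text gives "Test Page" and direct_text gives "Test ", matching the results of the two dom_query examples above.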

Accessing and Manipulating the element's attributes

The dom_query crate provides several methods for accessing and manipulating the attributes of an HTML element.

All methods listed below apply to both Selection and Node.

Getting an attribute value

Use the attr() method to retrieve the value of a specific attribute; it returns None if the attribute does not exist. Use attr_or() to retrieve the value with a fallback default when the attribute is missing.

use dom_query::Document;

let html = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body><input hidden="" id="k" class="important" type="hidden" name="k" data-k="100"></body>
</html>"#;

let doc = Document::from(html);

let input_selection = doc.select("input[name=k]");

let val = input_selection.attr("data-k").unwrap();
assert_eq!(val.to_string(), "100");

// try to get an attribute that does not exist
let val_or = input_selection.attr_or("data-l", "0");
assert_eq!(val_or.to_string(), "0");

Getting the class attribute

You can use the class() method to retrieve the value of the class attribute. If the class attribute does not exist, the method returns None.

For Selection, the class method will return the value of the class attribute of the first element in the selection. For Node, the class method will return the value of the class attribute of the current element.

use tendril::StrTendril;

let input_selection = doc.select("input");
let class: Option<StrTendril> = input_selection.class();
assert_eq!(class, Some("important".into()));

Getting the id attribute

Everything works the same way as with the class attribute, but the method for Selection is called id(), while for Node, it is called id_attr().

use tendril::StrTendril;

let input_selection = doc.select("input");
let id_attr: Option<StrTendril> = input_selection.id();
assert_eq!(id_attr, Some("k".into()));

let input_node = input_selection.nodes().first().unwrap();
let id_attr: Option<StrTendril> = input_node.id_attr();
assert_eq!(id_attr, Some("k".into()));

Removing an attribute

You can use the remove_attr() method to remove a specific attribute from the element. If it is called on a Selection, it removes the attribute from all elements in the selection.

input_selection.remove_attr("data-k");

Removing multiple attributes

You can use the remove_attrs() method to remove multiple attributes from the element. If it is called on a Selection, it removes all listed attributes from all elements in the selection.

input_selection.remove_attrs(&["id", "class"]);

Setting an attribute value

You can use the set_attr() method to set the value of a specific attribute. If it is called on a Selection, it sets the attribute on all elements in the selection.

input_selection.set_attr("data-k", "200");

Checking if an attribute exists

You can use the has_attr() method to check if a specific attribute exists on the element. If it is called on a Selection, it checks whether the attribute exists on the first element in the selection.

let is_hidden = input_selection.has_attr("hidden");
assert!(is_hidden);

Removing all attributes

You can use the remove_all_attrs() method to remove all attributes from the element. If it is called on a Selection, it removes all attributes from all elements in the selection.

input_selection.remove_all_attrs();
assert_eq!(input_selection.html(), r#"<input>"#.into());

Operating the class attribute

For Selection, these methods operate on all elements in the selection:

add_class() adds a class to all elements.
remove_class() removes a class from all elements.
has_class() returns true if at least one element in the selection has the specified class.

For Node, the same methods work in the same way but only affect the current node.


// selecting an element.
let input_selection = doc.select("input");

// adding a class to all elements in the selection.
input_selection.add_class("new-class");

// checking if at least one element in the selection has a class.
assert!(input_selection.has_class("new-class"));
assert!(input_selection.has_class("important"));

// removing a class from all elements in the selection.
input_selection.remove_class("important");

// checking if at least one element in the selection has a class.
assert!(input_selection.has_class("new-class"));
assert!(!input_selection.has_class("important"));
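
A class attribute is a whitespace-separated list of tokens, so the three operations above can be pictured as simple token edits. The following std-only sketch is a toy model and not dom_query's implementation: add_class_token, remove_class_token, and any_has_class are hypothetical helper names that operate on plain strings, and the real methods may treat edge cases differently.

```rust
/// Adds `class` to a whitespace-separated class list unless it is already present.
fn add_class_token(attr: &str, class: &str) -> String {
    let mut tokens: Vec<&str> = attr.split_whitespace().collect();
    if !tokens.contains(&class) {
        tokens.push(class);
    }
    tokens.join(" ")
}

/// Removes every occurrence of `class` from a whitespace-separated class list.
fn remove_class_token(attr: &str, class: &str) -> String {
    attr.split_whitespace()
        .filter(|token| *token != class)
        .collect::<Vec<_>>()
        .join(" ")
}

/// Returns true if at least one class list in `attrs` contains `class`.
fn any_has_class(attrs: &[String], class: &str) -> bool {
    attrs
        .iter()
        .any(|attr| attr.split_whitespace().any(|token| token == class))
}
```

Starting from "important", adding "new-class" gives "important new-class", and removing "important" afterwards leaves "new-class". any_has_class only needs one matching element, which is the "at least one" rule described for has_class() on a Selection.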

Manipulating the DOM

Manipulating the Selection

The dom_query crate provides various methods to manipulate the DOM. Below are some examples demonstrating how to append new HTML nodes, set new content, remove selections, and replace selections with new HTML.

use dom_query::Document;

let html_contents = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body>
        <div class="content">
            <p>9,8,7</p>
        </div>
        <div class="remove-it">
            Remove me
        </div>
        <div class="replace-it">
            <div>Replace me</div>
        </div>
    </body>
</html>"#;

let doc = Document::from(html_contents);

// Select the div with class "content"
let mut content_selection = doc.select("body .content");

// Append a new HTML node to the selection
content_selection.append_html(r#"<div class="inner">inner block</div>"#);
assert!(doc.select("body .content .inner").exists());

// Set new content for the selection, replacing the existing content
let mut set_selection = doc.select(".inner");
set_selection.set_html(r#"<p>1,2,3</p>"#);
assert_eq!(doc.select(".inner").html(),
    r#"<div class="inner"><p>1,2,3</p></div>"#.into());

// Remove the selection with class "remove-it"
doc.select(".remove-it").remove();
assert!(!doc.select(".remove-it").exists());

// Replace the selection with new HTML, the current selection will not change
let mut replace_selection = doc.select(".replace-it");
replace_selection.replace_with_html(r#"<div class="replaced">Replaced</div>"#);
assert_eq!(replace_selection.text().trim(), "Replace me");

// But the document will reflect the changes
assert_eq!(doc.select(".replaced").text(),"Replaced".into());


// Prepend more elements to the selection
content_selection.prepend_html(r#"<p class="third">3</p>"#);
content_selection.prepend_html(r#"<p class="first">1</p><p class="second">2</p>"#);

// You can also insert HTML before the selection:
let first = content_selection.select(".first");
first.before_html(r#"<p class="none">None</p>"#);
// or after:
let third = content_selection.select(".third");
third.after_html(r#"<p class="fourth">4</p>"#);

// Now the added paragraphs precede the `div`
assert!(doc.select(r#".content > .none + .first + .second + .third + .fourth + div:has-text("1,2,3")"#).exists());

// To set text on the selection you can use `set_html`, but `set_text` is preferable:
let p_sel = content_selection.select("p");
let total_p = p_sel.length();
p_sel.set_text("test content");

assert_eq!(doc.select(r#"p:has-text("test content")"#).length(), total_p);

Explanation:

Append HTML:
- The append_html method is used to add a new HTML node to the existing selection.
Set HTML:
- The set_html method replaces the existing content of the selection with new HTML.
Remove Selection:
- The remove method deletes the elements matching the selector from the document.
Replace with HTML:
- The replace_with_html method replaces the selected elements with new HTML. Note that the selection itself remains unchanged, but the document reflects the new content.
Prepend HTML:
- The prepend_html method is used to add a new HTML node at the beginning of the existing selection.
Insert HTML Before/After:
- The before_html method inserts HTML before each element in the selection.
- The after_html method inserts HTML after each element in the selection.

Renaming Elements Without Changing the Contents

The dom_query crate lets you rename selected elements without changing their contents. Selection::rename renames every element in the selection, while Node::rename renames a single element.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body>
    <div class="content">
        <div>1</div>
        <div>2</div>
        <div>3</div>
        <span>4</span>
    </div>
</body>
</html>"#.into();

let mut sel = doc.select("div.content > div, div.content > span");
// Before renaming, there are 3 `div` and 1 `span`
assert_eq!(sel.length(), 4);

sel.rename("p");

// After renaming, there are no `div` and `span` elements
assert_eq!(doc.select("div.content > div, div.content > span").length(), 0);
// But there are four `p` elements
assert_eq!(doc.select("div.content > p").length(), 4);

Creating and Manipulating Elements

The dom_query crate allows you to create and manipulate HTML elements with ease. Below are examples demonstrating how to create new elements, set attributes, append HTML, and replace content.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
    <div id="main">
        <p id="first">It's</p>
    </div>
</body>
</html>"#.into();

// Selecting the node to which we want to attach a new element
let main_sel = doc.select_single("#main");
let main_node = main_sel.nodes().first().unwrap();

// Creating a simple element
let el = doc.tree.new_element("p");
// Setting attributes
el.set_attr("id", "second");
// Setting text content
el.set_text("test");
main_node.append_child(&el);
assert!(doc.select(r#"#main #second:has-text("test")"#).exists());

// Appending a more complex element using `append_html`
main_node.append_html(r#"<p id="third">Wonderful</p>"#);
assert_eq!(doc.select("#main #third").text().as_ref(), "Wonderful");
assert!(doc.select("#first").exists());

// There are also `prepend_child` and `prepend_html` methods, which
// insert content at the beginning of the node.
main_node.prepend_html(r#"<p id="minus-one">-1</p><p id="zero">0</p>"#);
assert!(doc.select("#main > #minus-one + #zero + #first + #second + #third").exists());

// Replacing existing element content with new HTML using `set_html`
main_node.set_html(r#"<p id="the-only">Wonderful</p>"#);
assert_eq!(doc.select("#main #the-only").text().as_ref(), "Wonderful");
assert!(!doc.select("#first").exists());

// Completely replacing the contents of the node, 
// including itself, using `replace_with_html`
main_node.replace_with_html(
    r#"<span>Tweedledum</span> and <span>Tweedledee</span>"#
);
assert!(!doc.select("#main").exists());
assert_eq!(doc.select("span + span").text().as_ref(), "Tweedledee");

// Inserting HTML content before a certain node using `node.before_html`
let span_sel = doc.select("body > span");
let span_node = span_sel.nodes().first().unwrap();
span_node.before_html(r#"<div id="main">Main Content</div>"#);
assert!(doc.select(r#"body > #main + span:has-text("Tweedledum")"#).exists());

// Inserting HTML content after a certain node using `node.after_html`
let span_node = span_sel.nodes().last().unwrap();
span_node.after_html(r#"<div id="extra">Extra Content</div>"#);
assert!(doc.select(r#"body > span:has-text("Tweedledee") + #extra"#).exists());

// To insert nodes before or after a certain element, 
// use the `node.insert_before` and `node.insert_after` methods.
// Both methods share the same behavior as `node.append_child`.

Explanation:

Creating a Simple Element:
- Use doc.tree.new_element() to create a new orphan element.
- Set attributes using node.set_attr().
- Set text content using node.set_text().
- Use node.append_child() to append a new child element node to the selected node.
- Use node.prepend_child() to prepend a new child element node to the selected node.
- Use node.insert_before() to insert a new sibling element node before the selected node.
- Use node.insert_after() to insert a new sibling element node after the selected node.
Appending HTML:
- Use append_html to add a more complex HTML node to the existing selection.
- This method is more convenient for adding multiple elements to the selected node.
Prepending HTML:
- Use prepend_html to add new HTML nodes at the beginning of the existing selection.
- Use prepend_child to prepend a new or an existing element node to the selected node.
Setting New HTML Content:
- Use set_html to replace the existing content of the selected node with new HTML.
- It changes the inner HTML contents of the node.
Replacing Node Contents Completely:
- Use replace_with_html to replace the entire content of the node, including the node itself.
Inserting HTML Before/After:
- Use before_html to insert HTML before each element in the selection.
- Use after_html to insert HTML after each element in the selection.

Additionally, methods like replace_with_html, set_html, append_html, prepend_html, before_html and after_html can specify more than one element in the provided string.

Wrapping & Unwrapping Node Elements

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
    <div id="main">
        <p id="content">It's</p>
    </div>
</body>
</html>"#.into();

let content_sel = doc.select("#content");
let content_node = content_sel.nodes().first().unwrap();

// 1. Wrapping the node with a new element
let wrapper = doc.tree.new_element("div");
wrapper.set_attr("id", "wrapper");
content_node.wrap_node(&wrapper);

assert_eq!(doc.select("#main > #wrapper > #content").length(), 1);

// 2. Wrapping the node with an HTML fragment
content_node.wrap_html(r#"<div id="sub-wrapper"><div class="adv">adv block content</div></div>"#);

// The node will be attached after `.adv` inside `#sub-wrapper`
assert_eq!(doc.select("#main > #wrapper > #sub-wrapper > .adv + #content").length(), 1);

// 3. Unwrapping the node
content_node.unwrap_node();

// Now #content is again a direct child of #wrapper
assert_eq!(doc.select("#main > #wrapper > #content").length(), 1);

// The detached #sub-wrapper is no longer connected to the document,
// so selecting it from the document finds nothing
assert!(!doc.select("#sub-wrapper").exists());

Explanation:

wrap_node(&wrapper_node) — wraps the current node inside the given node.
wrap_html("<html_fragment>") — wraps the current node using the first element found in the provided HTML fragment.
unwrap_node() — removes the node’s parent and moves the node up one level in the tree.

Text Node Normalization

Node normalization merges adjacent text nodes into a single node and removes empty text nodes. This keeps the document structure compact after text nodes have been appended, inserted, or left behind by other manipulations.

use dom_query::Document;

let contents = r#"<!DOCTYPE html>
<html>
    <head><title>Test</title></head>
    <body>
        <div id="parent">
            <div id="child">Child</div>
        </div>
    </body>
</html>"#;
let doc = Document::from(contents);

// Select the node with id "child"
let child_sel = doc.select_single("#child");
let child = child_sel.nodes().first().unwrap();

// Check that the node initially has only one child
assert_eq!(child.children_it(false).count(), 1);

// Create and append new text nodes
let text_1 = doc.tree.new_text(" and a");
let text_2 = doc.tree.new_text(" ");
let text_3 = doc.tree.new_text("tail");
child.append_child(&text_1);
child.append_child(&text_2);
child.append_child(&text_3);

// Verify the text and child count before normalization
assert_eq!(child.text(), "Child and a tail".into());
assert_eq!(child.children_it(false).count(), 4);

// Normalize the node
child.normalize();

// Verify the text and child count after normalization
assert_eq!(child.children_it(false).count(), 1);
assert_eq!(child.text(), "Child and a tail".into());

The normalize method follows the Node.normalize() specification. This method is also available through the Document struct as Document::normalize(), which applies normalization to all text nodes within the document tree.
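The two rules of normalization (drop empty text nodes, merge adjacent ones) can be illustrated without the crate. The sketch below is a simplified, library-independent model, not dom_query's implementation: the `Child` enum and the free function `normalize` are hypothetical names introduced only for this illustration.

```rust
/// A simplified child node: a text node or an element (name only).
/// Hypothetical type for illustration; dom_query uses its own tree.
#[derive(Debug, Clone, PartialEq)]
enum Child {
    Text(String),
    Element(String),
}

/// Drops empty text nodes and merges runs of adjacent text nodes,
/// mirroring the two rules of DOM `Node.normalize()`.
fn normalize(children: Vec<Child>) -> Vec<Child> {
    let mut out: Vec<Child> = Vec::new();
    for child in children {
        match child {
            // Rule 1: empty text nodes are removed.
            Child::Text(t) if t.is_empty() => {}
            Child::Text(t) => {
                // Rule 2: a text node that directly follows another
                // text node is appended to it instead of kept separate.
                if let Some(Child::Text(prev)) = out.last_mut() {
                    prev.push_str(&t);
                } else {
                    out.push(Child::Text(t));
                }
            }
            // Elements are kept as they are and end the current text run.
            other => out.push(other),
        }
    }
    out
}
```

Feeding this model the four text pieces from the example above ("Child", " and a", " ", "tail") yields a single text node, which matches the child count dropping from 4 to 1 while the text stays the same.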

Stripping Elements with `strip_elements`

dom_query lets you remove specific elements from a node while preserving their children.

use dom_query::Document;

let doc: Document = r#"<!DOCTYPE html>
<html lang="en">
<head></head>
<body>
    <div id="main">
        <p id="content">the quick brown <b>fox</b> jumps over the lazy <i>dog</i></p>
    </div>
</body>
</html>"#.into();

// 1. Select the content node
let content_sel = doc.select("#content");
let content_node = content_sel.nodes().first().unwrap();

assert_eq!(
    content_node.inner_html(),
    "the quick brown <b>fox</b> jumps over the lazy <i>dog</i>"
);

// 2. Strip specific elements (`<b>`, `<i>`) while keeping their children
content_node.strip_elements(&["b", "i"]);

// Now only text nodes remain
assert_eq!(
    content_node.inner_html(),
    "the quick brown fox jumps over the lazy dog"
);
assert_ne!(content_node.children_it(false).count(), 1);

// 3. Optional: Normalize to merge adjacent text nodes into a single node
content_node.normalize();
assert_eq!(content_node.children_it(false).count(), 1);

Explanation:

strip_elements(["tag1", "tag2", ...]) — removes all matching elements inside the node but preserves their children.
normalize() — merges adjacent text nodes into a single one, for a cleaner structure.

Note: strip_elements only removes the specified tags themselves, preserving their content inside.
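To see why stripping leaves several adjacent text nodes behind (and why normalize() is a useful follow-up), here is a small library-independent model. It is a sketch under assumptions, not the crate's code: the `Node` enum and the free functions `strip_elements` and `text_content` are hypothetical and only mimic the behavior described above.

```rust
/// A simplified tree node. Hypothetical type for illustration only.
#[derive(Debug, Clone, PartialEq)]
enum Node {
    Text(String),
    Element { name: String, children: Vec<Node> },
}

/// Returns `children` with every element named in `names` replaced
/// by its own (recursively processed) children.
fn strip_elements(children: Vec<Node>, names: &[&str]) -> Vec<Node> {
    let mut out = Vec::new();
    for node in children {
        match node {
            Node::Element { name, children } => {
                let kept = strip_elements(children, names);
                if names.contains(&name.as_str()) {
                    // Drop the tag itself and splice its children
                    // into the parent at the same position.
                    out.extend(kept);
                } else {
                    out.push(Node::Element { name, children: kept });
                }
            }
            text => out.push(text),
        }
    }
    out
}

/// Concatenates the text of all nodes, descending into elements.
fn text_content(nodes: &[Node]) -> String {
    nodes
        .iter()
        .map(|n| match n {
            Node::Text(t) => t.clone(),
            Node::Element { children, .. } => text_content(children),
        })
        .collect()
}
```

In this model, stripping `b` and `i` from the paragraph in the example turns two text nodes and two elements into four sibling text nodes. The visible text is unchanged, but the child count is not 1 until the text nodes are merged.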

Supported CSS pseudo-classes in `dom_query`

Implemented by the selectors crate:

:empty
:first-child
:last-child
:has
:is
:where
:last-of-type
:not
:only-child
:only-of-type
:nth-child
:nth-last-child

Implemented by dom_query:

:any-link
:link
:has-text
:contains
:only-text

Notes

:has-text – checks whether one of the element's child nodes has the specified text.

:contains – checks whether the combined text of all child nodes contains the specified text.

:only-text – checks whether the element contains only a single text node, with no other child nodes.
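The difference between :has-text and :contains is easiest to see when the text is split across several child nodes. The sketch below models one reading of the notes above, and it is not the crate's matching code: an element is reduced to the list of texts of its child nodes, and `has_text` and `contains_text` are hypothetical helper names.

```rust
/// One reading of `:has-text`: at least one child node, taken on its own,
/// contains `needle`. `texts` holds the text of each child node in order.
fn has_text(texts: &[&str], needle: &str) -> bool {
    texts.iter().any(|t| t.contains(needle))
}

/// One reading of `:contains`: the combined text of all child nodes
/// contains `needle`, even if it spans a node boundary.
fn contains_text(texts: &[&str], needle: &str) -> bool {
    texts.concat().contains(needle)
}
```

For markup such as `<p>Hello, <b>wor</b>ld</p>`, the child texts are "Hello, ", "wor" and "ld". Under this model "world" is found only by `contains_text`, because no single node holds the whole word, while "Hello" is found by both.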

Serializing Document to Markdown

With the markdown feature enabled, you can serialize a Document or NodeRef into Markdown format using the md method.

use dom_query::Document;

let contents = "
<style>p {color: blue;}</style>
<p>I really like using <b>Markdown</b>.</p>

<p>I think I'll use it to format all of my documents from now on.</p>";

let expected = "I really like using **Markdown**\\.\n\n\
I think I'll use it to format all of my documents from now on\\.";

let doc = Document::from(contents);

// Passing `None` into `md` will apply the default list of skipped tags: 
// `["script", "style", "meta", "head"]`.
let got = doc.md(None);

assert_eq!(got.as_ref(), expected);

// If you want to serialize the entire document without skipping any elements,
// pass `Some(&vec![])` into `md`.

When using Document::from, be aware that html5ever automatically constructs an HTML <head> element if necessary and may move <style>, <meta>, and similar tags into it. If you want to preserve the original content order (as provided), it's recommended to use Document::fragment.

let contents = "<style>p {color: blue;}</style>\
<div><h1>Content Heading</h1></div>\
<p>I really like using Markdown.</p>\
<p>I think I'll use it to format all of my documents from now on.</p>";

let expected = "p \\{color: blue;\\}\n\
I really like using Markdown\\.\n\n\
I think I'll use it to format all of my documents from now on\\.";

let doc = Document::fragment(contents);

// Here we tell `md` to skip `<div>` elements during serialization.
let got = doc.md(Some(&["div"]));

assert_eq!(got.as_ref(), expected);

Notes:

Skipped elements are completely ignored in the output.
Backslashes (\) are inserted automatically to escape special Markdown characters when needed.
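The escaping mentioned in the notes can be pictured with a minimal sketch. This is not dom_query's serializer: the character set in `SPECIAL` is an assumption chosen so the two expected strings above come out the same, and the real implementation escapes only "when needed", so it may be context-sensitive and use a different set. `escape_markdown` is a hypothetical helper name.

```rust
/// Hypothetical set of characters to escape. The set dom_query
/// actually uses, and the contexts in which it escapes, may differ.
const SPECIAL: &[char] = &['\\', '`', '*', '_', '{', '}', '[', ']', '#', '.'];

/// Prefixes every character from `SPECIAL` with a backslash so that
/// a Markdown renderer treats it as literal text.
fn escape_markdown(text: &str) -> String {
    let mut out = String::with_capacity(text.len());
    for ch in text.chars() {
        if SPECIAL.contains(&ch) {
            out.push('\\');
        }
        out.push(ch);
    }
    out
}
```

With this set, a sentence ending in a period gains a backslash before the period, and the braces in `p {color: blue;}` are escaped while the colon and semicolon are left alone, as in the expected strings shown earlier.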

WASM32 Compilation

When compiling dom_query to WebAssembly (target wasm32-unknown-unknown) using wasm-pack, you may encounter runtime panics related to memory allocation, such as:

panicked at 'assertion failed: psize <= size + max_overhead'

This issue currently occurs due to compatibility problems between the latest versions of the selectors crate and the dlmalloc crate. It manifests specifically when using pseudo-classes, including those implemented by selectors itself, such as :not and :has.

If you must compile dom_query for a wasm32 application, consider using an alternative to dlmalloc. The following allocators have been tested and work successfully:

wee_alloc
lol_alloc
mini-alloc (fast, but it never deallocates memory)
alloc_cat

Solution:

Add alloc_cat (or another of the allocators listed above) to your Cargo.toml:

[dependencies]
alloc_cat = "1.0.0"

Set your favorite allocator as the global allocator in your lib.rs or main.rs:

#[cfg(target_arch = "wasm32")]
#[global_allocator]
pub static ALLOC: &alloc_cat::AllocCat = &alloc_cat::ALLOCATOR;

Build or test your WebAssembly project

dom_query by Example