This is the second part of a series about Rust for Python users.

In this article, we will build the foundation for a Rust-powered Python library - a crate that implements CSS inlining. CSS inlining is the process of moving CSS rules from style tags to the corresponding elements in the HTML body. This way of including styles is crucial for sending HTML emails or embedding HTML pages into third-party resources.

Our goal is to build a library that will transform this HTML:

<html>
    <head>
        <style>h1 { color:blue; }</style>
    </head>
    <body>
        <h1>Big Text</h1>
    </body>
</html>

into this:

<html>
    <head>
        <style>h1 { color:blue; }</style>
    </head>
    <body>
        <h1 style="color:blue;">Big Text</h1>
    </body>
</html>

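We'll go through:

  • How CSS inlining works;
  • Starting a new project;
  • Inlining configuration;
  • Searching in an HTML document;
  • CSS parsing;
  • Modifying HTML elements;
  • Generic writers;
  • Further improvements.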

Target audience: Those who know the basic principles of Rust and are looking for practical examples. Some familiarity with trait bounds and generics is helpful.


How does CSS inlining work?

The inlining process involves many details and corner cases - merging CSS with the values of existing style attributes, loading external stylesheets, handling pseudo-selectors, and many more. This implementation will include a small feature set: moving CSS rules from style tags to the appropriate style attributes and optionally removing style tags after inlining.

The last assumption is that this transformation is fallible: in some cases, such as a malformed CSS selector, it is not clear how to query the DOM and find matching elements.

The most natural flow to satisfy these requirements might look like this:

  • Find all style tags;
  • For each CSS rule in these tags, find the elements matching its selector;
  • Insert the declarations into the style attribute of each matched element.
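
The last two steps can be sketched on a toy model before we bring in real parsers. This is only an illustration under loud assumptions: Element, Rule and apply_rules are hypothetical names invented here, the "DOM" is a flat slice, and a selector is just a tag name.

```rust
/// Toy stand-in for a DOM element: a tag name and an optional `style` attribute.
#[derive(Debug, PartialEq)]
pub struct Element {
    pub tag: String,
    pub style: Option<String>,
}

/// Toy CSS rule: a tag-name selector and its declarations (hypothetical type).
pub struct Rule<'a> {
    pub selector: &'a str,
    pub declarations: &'a str,
}

/// For each rule, find the elements whose tag equals the selector and
/// write the declarations into their `style` attribute.
pub fn apply_rules(elements: &mut [Element], rules: &[Rule]) {
    for rule in rules {
        for element in elements.iter_mut().filter(|el| el.tag == rule.selector) {
            element.style = Some(rule.declarations.to_string());
        }
    }
}
```

The real implementation below replaces the flat slice with a parsed HTML tree and the tag comparison with proper selector matching.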

These operations require an ability to navigate through an HTML document and manipulate its nodes. The most popular project that provides many high-quality components to work with HTML and CSS is Servo - a modern browser engine created by Mozilla.

Learn more about the Servo project from this YouTube video by Josh Matthews

The particular crates we are interested in are the following:

These tools give the developer browser-grade performance and a lot of flexibility in the parsing process. On the other hand, they are relatively low-level - for example, html5ever does not provide any DOM tree representation. Luckily, there is kuchiki, which conveniently wraps them into one powerful library.


Start a new project

I assume that you already have Rust and cargo installed; if not, follow the instructions at rustup.rs.

I used rustc 1.45.2 for compiling all the Rust code in this article, but earlier compiler versions should work too.

Let's start by creating a new Rust project:

$ cargo new --lib css-inline-example
     Created library `css-inline-example` package
$ cd css-inline-example && tree
.
├── Cargo.toml
└── src
   └── lib.rs

1 directory, 2 files

Then add the dependencies mentioned above to the Cargo.toml file:

[dependencies]
cssparser = "0.27.2"
kuchiki = "0.8.1"

To reflect the task requirements, we can write a stub function that will take HTML as a string slice and return its inlined version:

pub fn inline(html: &str) -> Result<String, InlineError> {
    todo!()  // panics with a "not yet implemented" message
}

#[derive(Debug)]
pub enum InlineError {}

The Debug trait makes a type printable with the '{:?}' format specifier

Since inlining is fallible, this function returns the Result type. Its Err variant includes an enum with potential error cases. We will expand this enum when we encounter different error scenarios.

To verify that the future implementation works as intended, we can include a test based on the original example at the beginning of the article:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn it_works() {
        let html = r#"<html><head>
<style>h1 { color:blue; }</style>
        </head>
        <body><h1>Big Text</h1></body>
        </html>"#;
        let expected = r#"<html><head>
<style>h1 { color:blue; }</style>
        </head>
        <body><h1 style=" color:blue; ">Big Text</h1>
        </body></html>"#;
        let inlined = inline(html).unwrap();
        assert_eq!(inlined, expected)
    }
}

kuchiki slightly alters the original formatting of the input document. I formatted the samples in the test to match the kuchiki approach, which simplifies testing.


Inlining configuration

The inlining process should be configurable in case you decide to implement more optional features: removing processed style tags, loading remote stylesheets, and so on.

The Builder pattern is one of the most convenient ways to design the configuration process. It enables creating very expressive and ergonomic APIs, especially if you have many optional configuration parameters. You can specify only the options you need, and the others will use their default values.

let inliner = CSSInliner::options()
    .remove_style_tags(true)
    // some more options?
    .build();
let processed = inliner.inline(&html);

The underlying process includes the following ingredients:

  • Mutable storage for configuration options;
  • Setters to modify the defaults;
  • Creating something that will perform inlining and use the desired configuration in it.

It implies having two structs - one for options and one for "inliner". The latter will accept options in its constructor:

#[derive(Debug)]
pub struct InlineOptions {
    pub remove_style_tags: bool,
}

#[derive(Debug)]
pub struct CSSInliner {
    options: InlineOptions,
}

impl CSSInliner {
    pub fn new(options: InlineOptions) -> Self {
        CSSInliner { options }
    }
}

To provide a set of default options, we need to implement the Default trait for the InlineOptions:

impl Default for InlineOptions {
    fn default() -> Self {
        InlineOptions {
            remove_style_tags: false,  // do not remove style tags by default
        }
    }
}

The default "inliner" should then use the default options, and the options method should return them as the starting point for the builder:

impl Default for CSSInliner {
    fn default() -> Self {
        CSSInliner::new(InlineOptions::default())
    }
}

impl CSSInliner {
    pub fn options() -> InlineOptions {
        InlineOptions::default()
    }
}

To make our API design work as expected, InlineOptions should contain setters for the configuration options and a build method that creates a new CSSInliner:

impl InlineOptions {
    pub fn remove_style_tags(mut self, remove_style_tags: bool) -> Self {
        self.remove_style_tags = remove_style_tags;
        self
    }

    pub fn build(self) -> CSSInliner {
        CSSInliner::new(self)
    }
}

To get more information regarding the Builder pattern, see this guide or look at the url crate source code
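
To see why this pays off as the number of options grows, here is a self-contained sketch of the same design with one extra option. The load_remote_stylesheets field and its setter are hypothetical and exist only for this illustration; the crate built in this article has the single remove_style_tags option.

```rust
/// Options with `bool` fields can simply derive `Default` (every field is `false`).
#[derive(Debug, Default)]
pub struct InlineOptions {
    pub remove_style_tags: bool,
    /// Hypothetical second option, added only for this illustration.
    pub load_remote_stylesheets: bool,
}

#[derive(Debug)]
pub struct CSSInliner {
    pub options: InlineOptions,
}

impl CSSInliner {
    /// Entry point of the builder: start from the defaults.
    pub fn options() -> InlineOptions {
        InlineOptions::default()
    }
}

impl InlineOptions {
    /// Each setter takes `self` by value and returns it, which allows chaining.
    pub fn remove_style_tags(mut self, value: bool) -> Self {
        self.remove_style_tags = value;
        self
    }

    pub fn load_remote_stylesheets(mut self, value: bool) -> Self {
        self.load_remote_stylesheets = value;
        self
    }

    /// Consumes the options and produces the configured inliner.
    pub fn build(self) -> CSSInliner {
        CSSInliner { options: self }
    }
}
```

A caller who writes CSSInliner::options().remove_style_tags(true).build() never has to mention the second option, and adding a third one later would not break that call.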

Now we can place the inlining logic inside the CSSInliner struct, and the original inline function will use it under the hood:

impl CSSInliner {
    pub fn inline(&self, html: &str) -> Result<String, InlineError> {
        todo!()
    }
}

pub fn inline(html: &str) -> Result<String, InlineError> {
    CSSInliner::default().inline(html)
}

Searching in an HTML document

Looking up all style tags requires parsing the input HTML document with kuchiki:

use kuchiki::{parse_html, traits::TendrilSink};

impl CSSInliner {
    pub fn inline(&self, html: &str) -> Result<String, InlineError> {
        let document = parse_html().one(html);
        for style_tag in document
            .select("style")
            .map_err(|_| InlineError::ParseError("Unknown error".to_string()))?
        {
            // ...
        }
        todo!()
    }
}

#[derive(Debug)]
pub enum InlineError {
    ParseError(String),
}

In the one method, the input HTML is transformed into a Tendril - a compact string type that behaves similarly to String but is optimized for zero-copy parsing.

The current Rust (1.45.2 at the time of writing) requires us to explicitly import the TendrilSink trait to use the one method. In general, two traits could both define a method named one, and without the import the compiler would not know which implementation to use. However, there is an RFC to mitigate this restriction.
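
The rule is easier to see in isolation. The following sketch uses only the standard library; the text module and the Shout trait are hypothetical stand-ins invented for this illustration and are not part of kuchiki or tendril.

```rust
mod text {
    /// Hypothetical trait that plays the role of `TendrilSink` here.
    pub trait Shout {
        fn shout(&self) -> String;
    }

    impl Shout for str {
        fn shout(&self) -> String {
            self.to_uppercase()
        }
    }
}

// Trait methods are only callable while the trait is in scope.
// Without this import, the `"one".shout()` call below does not compile.
use self::text::Shout;

pub fn demo() -> String {
    "one".shout()
}
```

The implementation for str exists regardless of the import; the use line only makes the method name visible at the call site.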

The select call may fail on unsupported selectors or syntax errors, but for some reason kuchiki doesn't return a meaningful error type and returns the unit type instead, so we have to map it to our error type.

An alternative would be the From trait, but it would define the () -> InlineError conversion for all cases, not only for this specific one. If the Err variant contains () in some other place, the From implementation will convert it to InlineError::ParseError, which may be wrong in that context. It is always better to return meaningful error types to avoid redundancy and ambiguity in error propagation.
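
As a small standalone sketch of the map_err approach: compile_selector below is a hypothetical function that, like select, reports failure with the unit type. It is not a kuchiki API, and its emptiness check is only a placeholder for real selector validation.

```rust
#[derive(Debug, PartialEq)]
pub enum InlineError {
    ParseError(String),
}

/// Hypothetical stand-in for an API that fails with `()`.
fn compile_selector(selector: &str) -> Result<&str, ()> {
    let trimmed = selector.trim();
    if trimmed.is_empty() {
        Err(())
    } else {
        Ok(trimmed)
    }
}

pub fn checked_selector(selector: &str) -> Result<&str, InlineError> {
    // The conversion is spelled out at the call site, so a `()` error that
    // comes from some other function is never silently turned into `ParseError`.
    let compiled = compile_selector(selector)
        .map_err(|_| InlineError::ParseError(format!("Invalid selector: {:?}", selector)))?;
    Ok(compiled)
}
```

Because the closure lives next to the call, it can also attach context (here, the offending selector) that a blanket From<()> implementation would not have.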

If you want to learn more about error handling in Rust, read this masterpiece by BurntSushi

After finding all the style tags, we need to extract their text content:

use kuchiki::{parse_html, traits::TendrilSink, NodeRef};

impl CSSInliner {
    pub fn inline(&self, html: &str) -> Result<String, InlineError> {
        // ...
        {
            if let Some(first_child) = style_tag.as_node().first_child() {
                if let Some(css_cell) = first_child.as_text() {
                    process_css(&document, css_cell.borrow().as_str())?;
                }
            }
            if self.options.remove_style_tags {
                style_tag.as_node().detach()
            }
        }
        todo!()
    }
}

fn process_css(document: &NodeRef, css: &str) -> Result<(), InlineError> {
    todo!()
}

A few notes on the kuchiki implementation details:

  • The text inside a style tag is a separate node of the tree, hence the first_child call;
  • css_cell is a RefCell, which is common in tree representations in Rust.

As we want to remove processed style tags optionally, this is the optimal place to remove them. Removing a node works by dropping all references to this node from its parents and siblings.

The process_css function will parse the provided text and insert CSS rules into the matched elements, modifying the document in place.


CSS parsing

The CSS parsing implementation is quite low-level because we don't have a high-level wrapper like kuchiki. The main benefit of the cssparser crate is its flexibility: it lets the developer decide what to parse and how - you can parse rules differently depending on the context or entirely skip rules that you don't need.

With it, we will be able to parse a list of rules like this:

h1, h2 { color:blue; }
strong { text-decoration:none }
p { font-size:2px }
p.footer { font-size: 1px}

To implement this, we need to use cssparser::RuleListParser::new_for_stylesheet, which accepts the input to parse and a parser for individual rules. The following traits should bound the parser argument:

  • QualifiedRuleParser
  • AtRuleParser

The first trait parses qualified rules that consist of two parts - a prelude and a block. In most cases, the prelude is a CSS selector, and the block is a list of declarations enclosed in curly brackets:

.button {
   padding: 3px;
   border-radius: 5px;
   border: 1px solid black;
}

Read more about CSS parsing in the W3C recommendation

The QualifiedRuleParser trait requires three associated types:

  • Prelude. CSS selector;
  • QualifiedRule. CSS selector + block;
  • Error. Additional data for custom errors.

To use CSS selectors for querying the document, we need to keep a prelude and a block separately inside the QualifiedRule type:

type QualifiedRule<'i> = (&'i str, &'i str);

We need no custom errors; hence, we can set the Error type to (). The trait itself can work for an empty struct:

use cssparser::QualifiedRuleParser;

struct CSSRuleListParser;

impl<'i> QualifiedRuleParser<'i> for CSSRuleListParser {
    type Prelude = &'i str;
    type QualifiedRule = QualifiedRule<'i>;
    type Error = ();
}

This trait requires the lifetime of the input data, and we can use the same lifetime in our types, which means that parsed qualified rules will live as long as the input.
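
To illustrate what "live as long as the input" means in practice, here is a deliberately naive sketch that does not use cssparser at all. split_rule is a hypothetical helper that handles a single rule by looking for the braces; it only shows that both halves of the tuple are slices borrowed from the original string, not copies.

```rust
/// A selector and a declaration block, both borrowed from the same input.
pub type QualifiedRule<'i> = (&'i str, &'i str);

/// Naive splitter for a single `selector { declarations }` rule.
/// Returns `None` when a brace is missing or the braces are out of order.
pub fn split_rule<'i>(input: &'i str) -> Option<QualifiedRule<'i>> {
    let open = input.find('{')?;
    let close = input.rfind('}')?;
    if close < open {
        return None;
    }
    // Both slices point into `input`; nothing is allocated or copied.
    Some((input[..open].trim(), input[open + 1..close].trim()))
}
```

The 'i lifetime in the signature lets the compiler reject any code that tries to keep the returned slices after the input string has been dropped.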

The default trait implementation ignores all qualified rules; therefore, we have to redefine this behavior. Parsing happens in two methods, and the first one is parse_prelude:

use cssparser::{ParseError, Parser, QualifiedRuleParser};

impl<'i> QualifiedRuleParser<'i> for CSSRuleListParser {
    // ... associated types
    fn parse_prelude<'t>(
        &mut self,
        input: &mut Parser<'i, 't>,
    ) -> Result<Self::Prelude, ParseError<'i, Self::Error>> {
       todo!()
    }
}

It accepts the Parser type that has two layers and behaves similarly to an iterator:

  • First, the underlying ParserInput and Tokenizer structs perform lexical analysis of the input and yield tokens like Delim or Number;
  • Then Parser processes the tokens from ParserInput and checks whether they form meaningful CSS constructions.

You can learn more about tokens in the source code

For our needs, it is enough to advance the parser to the end of the prelude and return the prelude as a string slice:

fn exhaust<'i>(input: &mut Parser<'i, '_>) -> &'i str {
    let start = input.position();  // the current parsing position
    while input.next().is_ok() {}  // parse while it is possible
    input.slice_from(start)        // take a slice from the parsed block
}

impl<'i> QualifiedRuleParser<'i> for CSSRuleListParser {
    // ...
    fn parse_prelude<'t>(
        &mut self,
        input: &mut Parser<'i, 't>,
    ) -> Result<Self::Prelude, ParseError<'i, Self::Error>> {
        Ok(exhaust(input))
    }
}

But how does the parser know when the prelude ends?

Before cssparser calls the parse_prelude function, it configures the parser to yield tokens only until a curly bracket occurs. For this reason, it is safe to call input.next() until the first Err - it won't go any further than the first { position, which bounds this parsing step only to a prelude.

See the source code of Parser::parse_until_before for more information

The second method, parse_block, works similarly: the parser stops at the closing curly bracket, and we can parse the rest of the qualified rule with the same exhaust function:

use cssparser::{ParseError, Parser, QualifiedRuleParser, SourceLocation};

impl<'i> QualifiedRuleParser<'i> for CSSRuleListParser {
    // ...
    fn parse_block<'t>(
        &mut self,
        prelude: Self::Prelude,
        _: SourceLocation,
        input: &mut Parser<'i, 't>,
    ) -> Result<Self::QualifiedRule, ParseError<'i, Self::Error>> {
        Ok((prelude, exhaust(input)))
    }
}

We finished this trait implementation. Let's deal with the second one!

The default implementation of the AtRuleParser trait ignores all at-rules, which is exactly what we need, because the spec restricts the content of style attributes:

The value of the style attribute must match the syntax of the contents of a CSS declaration block (excluding the delimiting braces)

We can't extract the content of the declaration blocks inside at-rules because all the important at-rules are conditional, and by dropping the condition we would lose the information about when the declarations should be applied.

@media screen and (max-width: 992px) {
  body {
    /* Only the content of this block is allowed by the spec */
    background-color: blue;
  }
}

Consequently, this implementation will require defining only associated types:

use cssparser::AtRuleParser;

impl<'i> AtRuleParser<'i> for CSSRuleListParser {
    type PreludeNoBlock = &'i str;
    type PreludeBlock = &'i str;
    type AtRule = QualifiedRule<'i>;
    type Error = ();
}

The only important detail here is that the RuleListParser struct adds additional restrictions on AtRule and Error types. Its source code:

impl<'i, 't, 'a, R, P, E: 'i> RuleListParser<'i, 't, 'a, P>
where
    P: QualifiedRuleParser<'i, QualifiedRule = R, Error = E>
        + AtRuleParser<'i, AtRule = R, Error = E>,
{
    pub fn new_for_stylesheet(input: &'a mut Parser<'i, 't>, parser: P) -> Self {
        // ...
    }
}

Which reads: the parser argument has a generic type P. This type P should implement traits QualifiedRuleParser and AtRuleParser where QualifiedRuleParser::QualifiedRule is the same as AtRuleParser::AtRule and QualifiedRuleParser::Error is the same as AtRuleParser::Error.

In our implementation it means that the AtRule associated type should be QualifiedRule<'i> and the Error type should be () (the same as QualifiedRule and Error in QualifiedRuleParser respectively). The ability to require the same types across different trait bounds allows developers to express more in their APIs.
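
The same kind of bound can be tried out without cssparser. In the sketch below, RuleParser, AtParser, parse_any and Lengths are hypothetical names made up for this illustration; only the shape of the where clause mirrors the RuleListParser source shown above.

```rust
pub trait RuleParser {
    type Rule;
    type Error;
    fn parse_rule(&self, input: &str) -> Result<Self::Rule, Self::Error>;
}

pub trait AtParser {
    type AtRule;
    type Error;
    fn parse_at_rule(&self, input: &str) -> Result<Self::AtRule, Self::Error>;
}

/// Accepts only parsers whose two implementations agree on the rule type `R`
/// and the error type `E`, so both branches can return the same `Result<R, E>`.
pub fn parse_any<P, R, E>(parser: &P, input: &str) -> Result<R, E>
where
    P: RuleParser<Rule = R, Error = E> + AtParser<AtRule = R, Error = E>,
{
    if input.starts_with('@') {
        parser.parse_at_rule(input)
    } else {
        parser.parse_rule(input)
    }
}

/// Toy parser: a "rule" is just the length of the input; at-rules are rejected.
pub struct Lengths;

impl RuleParser for Lengths {
    type Rule = usize;
    type Error = ();
    fn parse_rule(&self, input: &str) -> Result<usize, ()> {
        Ok(input.len())
    }
}

impl AtParser for Lengths {
    type AtRule = usize;
    type Error = ();
    fn parse_at_rule(&self, _input: &str) -> Result<usize, ()> {
        Err(())
    }
}
```

If the AtRule type of Lengths were changed to String, for example, the call to parse_any would no longer compile, which is the kind of mismatch the RuleListParser bound guards against.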

It would be nice to have some default values for those types to avoid writing them by hand! The RFC for associated type defaults was accepted, and the implementation is in progress (here is the tracking issue).

Modifying HTML elements

Now, finally, we can use our parser! There are no shortcuts for constructing a RuleListParser instance from a string slice; therefore, we need to build all pieces by hand:

use cssparser::{Parser, ParserInput, RuleListParser};

fn process_css(document: &NodeRef, css: &str) -> Result<(), InlineError> {
    let mut parse_input = ParserInput::new(css);
    let mut parser = Parser::new(&mut parse_input);
    let rules = RuleListParser::new_for_stylesheet(
        &mut parser,
        CSSRuleListParser
    );
    for rule in rules {
        // apply this rule!
    }
    Ok(())
}

The next step is iterating over parsed rules and processing them individually. The parsing result is an iterator over Result instances which can be:

  • Ok. A tuple of two string slices - a selector and a block;
  • Err. Also a tuple. It contains an instance of cssparser::ParseError and the erroneous input.

As you may see, the return type of the process_css function has InlineError in its Err variant. Hence we need to convert cssparser::ParseError into our InlineError to propagate errors. The canonical way is to use the From trait:

use cssparser::{BasicParseErrorKind, ParseError, ParseErrorKind};

impl From<(ParseError<'_, ()>, &str)> for InlineError {
    fn from(error: (ParseError<'_, ()>, &str)) -> Self {
        let message = match error.0.kind {
            ParseErrorKind::Basic(kind) => match kind {
                BasicParseErrorKind::UnexpectedToken(token) => {
                    format!("Unexpected token: {:?}", token)
                }
                BasicParseErrorKind::EndOfInput => "End of input".to_string(),
                BasicParseErrorKind::AtRuleInvalid(value) => {
                    format!("Invalid @ rule: {}", value)
                }
                BasicParseErrorKind::AtRuleBodyInvalid => {
                    "Invalid @ rule body".to_string()
                }
                BasicParseErrorKind::QualifiedRuleInvalid => {
                    "Invalid qualified rule".to_string()
                }
            },
            ParseErrorKind::Custom(_) => "Never happens".to_string(),
        };
        InlineError::ParseError(message)
    }
}

By matching all the error kinds, we can provide clear error messages for our library.
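
The mechanics behind this are easier to see with a standard library error. The sketch below assumes a hypothetical parse_px helper and converts std::num::ParseIntError instead of cssparser::ParseError; the point is only how the ? operator picks up the From implementation.

```rust
use std::num::ParseIntError;

#[derive(Debug, PartialEq)]
pub enum InlineError {
    ParseError(String),
}

impl From<ParseIntError> for InlineError {
    fn from(error: ParseIntError) -> Self {
        InlineError::ParseError(error.to_string())
    }
}

/// Hypothetical helper: parses a pixel value such as "12px".
pub fn parse_px(value: &str) -> Result<u32, InlineError> {
    // On failure, `?` calls `From::from` to turn `ParseIntError`
    // into `InlineError` before returning early.
    let number = value.trim_end_matches("px").parse::<u32>()?;
    Ok(number)
}
```

With the From implementation in place, no explicit map_err is needed at the call site.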

Now we can handle the rules and compile CSS selectors for further matching against them:

use kuchiki::Selectors;

fn process_css(document: &NodeRef, css: &str) -> Result<(), InlineError> {
    // ...
    for rule in rules {
        let (selector, block) = rule?;
        if let Ok(matching_elements) = document.select(selector) {
            for el in matching_elements {
                todo!()
            }
        }
    }
    Ok(())
}

The code above is similar to what we used before to find all style tags, but in this case, it is better to skip unsupported selectors for better future compatibility.

We need to modify each matched element and put the block value into the style attribute. Some nodes may already have non-empty style attributes, but merging with them would require additional traits from cssparser and specific merging rules. To focus on the most straightforward flow, I leave it as an exercise for the reader.

As you may see, all variables are immutable while we iterate over the parsed rules, but we still need to modify el.attributes. It is possible because el.attributes is a RefCell, which implements the Interior Mutability pattern. This Rust pattern allows you to modify an object's internal state by checking the borrowing rules at runtime.

Read more about the Interior Mutability pattern in chapter 15 of the Book.

When we want to borrow the value of a RefCell mutably, there is a choice - use borrow_mut that panics if the value is currently borrowed or try_borrow_mut that returns Result.

At the moment, it is the only place where we access attributes, so using borrow_mut is safe, but this condition may change, and then this code would panic. It could happen if, for example, we decide to handle external stylesheets via the href attributes of "link" tags.

Even if the probability is quite low, I prefer slightly safer and more explicit (but more verbose) code:

fn process_css(document: &NodeRef, css: &str) -> Result<(), InlineError> {
    // ...
            for el in matching_elements {
                if let Ok(mut attributes) = el.attributes.try_borrow_mut() {
                    attributes.insert("style", block.to_string());
                }
            }
    // ...
}
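
The difference between the two calls can be reproduced with the standard library alone. In this sketch, Element and set_style are hypothetical: a toy element that keeps its attributes in a RefCell around a plain BTreeMap, which only approximates the kuchiki types.

```rust
use std::cell::RefCell;
use std::collections::BTreeMap;

/// Toy element whose attributes sit behind a `RefCell`.
pub struct Element {
    pub attributes: RefCell<BTreeMap<String, String>>,
}

impl Element {
    pub fn new() -> Self {
        Element {
            attributes: RefCell::new(BTreeMap::new()),
        }
    }
}

/// Returns `true` if the style was written and `false` if the attributes
/// were already borrowed elsewhere. `borrow_mut` would panic in that case.
pub fn set_style(element: &Element, style: &str) -> bool {
    match element.attributes.try_borrow_mut() {
        Ok(mut attributes) => {
            attributes.insert("style".to_string(), style.to_string());
            true
        }
        Err(_) => false,
    }
}
```

Note that set_style takes a shared reference and still modifies the map; the borrow check simply moves from compile time to runtime.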

The value inside attributes (it is a RefCell) is a wrapper around a BTreeMap and provides a similar interface. Our simple inlining is done, and now it's time to serialize the output.


Generic writers

To serialize an HTML document, we need to write a textual representation of all its nodes into some sink. kuchiki supports serialization to any target that implements the std::io::Write trait (a file, for example). The simplest case is serialization into a vector of bytes, which we then need to convert to a string:

impl CSSInliner {
    pub fn inline(&self, html: &str) -> Result<String, InlineError> {
        // ...
        let mut output = Vec::new();
        document.serialize(&mut output)?;
        Ok(String::from_utf8_lossy(&output).to_string())
    }
}

document.serialize returns std::io::Error in its Err case; therefore, we have to add another From implementation and a new variant to the InlineError enum:

use std::io;

#[derive(Debug)]
pub enum InlineError {
    ParseError(String),
    IO(io::Error),
}

impl From<io::Error> for InlineError {
    fn from(error: io::Error) -> Self {
        InlineError::IO(error)
    }
}

But what if you'd like to write inlined HTML to a file or some network stream? The current approach is not flexible enough. Let's create a new method that will provide more flexibility:

impl CSSInliner {
    pub fn inline_to<W: io::Write>(&self, html: &str, target: &mut W) -> Result<(), InlineError> {
        // ... inlining implementation
        document.serialize(target)?;
        Ok(())
    }
}

And use it in the original one:

impl CSSInliner {
    pub fn inline(&self, html: &str) -> Result<String, InlineError> {
        let mut output = Vec::new();
        self.inline_to(html, &mut output)?;
        Ok(String::from_utf8_lossy(&output).to_string())
    }
}

Now it is possible to serialize inlined HTML to any target that implements the io::Write trait.
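
The pattern itself does not depend on kuchiki. As a minimal sketch using only the standard library, with write_heading and heading_to_string as hypothetical functions that stand in for inline_to and inline:

```rust
use std::io::{self, Write};

/// Writes a tiny HTML fragment into any `io::Write` target.
pub fn write_heading<W: Write>(text: &str, target: &mut W) -> io::Result<()> {
    write!(target, "<h1>{}</h1>", text)
}

/// Convenience wrapper that collects the output into a `String`,
/// the same way `inline` wraps `inline_to`.
pub fn heading_to_string(text: &str) -> io::Result<String> {
    let mut output = Vec::new();
    write_heading(text, &mut output)?;
    Ok(String::from_utf8_lossy(&output).into_owned())
}
```

The same write_heading call accepts a std::fs::File, a std::net::TcpStream or io::stdout() without any changes, because all of them implement io::Write.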

Finally, our code compiles, and we can run the test we wrote in the beginning:

$ cargo t
    Finished test [unoptimized + debuginfo] target(s) in 1.18s
     Running target/debug/deps/css_inline_example-80ecd8c1feae1ffc

running 1 test
test tests::it_works ... ok

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 0 filtered out

cargo t is an alias for cargo test, and cargo supports more of them. Also, you can define your own. Check the Cargo documentation

Ok, inlining works! Let's add a couple of improvements.


Further improvements

The Error trait improves debugging by providing access to the original cause, and the Display trait makes errors more descriptive and allows them to be formatted with the default formatter.

use std::error::Error;
use std::fmt;

impl Error for InlineError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            InlineError::IO(error) => Some(error),
            InlineError::ParseError(_) => None,
        }
    }
}

impl fmt::Display for InlineError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            InlineError::IO(error) => f.write_str(error.to_string().as_str()),
            InlineError::ParseError(error) => f.write_str(error.as_str()),
        }
    }
}
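
To show what access to the original cause buys us, here is a self-contained sketch. error_chain is a hypothetical helper that is not part of the crate, and unlike the implementation above, the Display output for the IO variant adds an "I/O error: " prefix so that the two levels are distinguishable.

```rust
use std::error::Error;
use std::fmt;
use std::io;

#[derive(Debug)]
pub enum InlineError {
    ParseError(String),
    IO(io::Error),
}

impl fmt::Display for InlineError {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            InlineError::IO(error) => write!(f, "I/O error: {}", error),
            InlineError::ParseError(message) => f.write_str(message),
        }
    }
}

impl Error for InlineError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            InlineError::IO(error) => Some(error),
            InlineError::ParseError(_) => None,
        }
    }
}

/// Collects the message of an error followed by the messages of all its causes.
pub fn error_chain(error: &dyn Error) -> Vec<String> {
    let mut messages = vec![error.to_string()];
    let mut current = error.source();
    while let Some(cause) = current {
        messages.push(cause.to_string());
        current = cause.source();
    }
    messages
}
```

Generic error reporters rely on exactly this source walk to print a "caused by" trail, and it only works if the library implements source.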

clippy is another useful thing that can improve your code. If you don't have it yet, then you can install it with rustup:

$ rustup component add clippy

I like the "pedantic" set of lints because it provides many helpful (and sometimes annoying) suggestions that may improve your code:

// lib.rs
#![warn(clippy::pedantic)]

clippy also runs cargo check under the hood, so you don't have to run both of them.


Documentation is one of the most important aspects of any great crate. To always keep it in mind, add this line to the beginning of your lib.rs file:

#![warn(missing_docs)]

And the compiler will remind you if you missed documenting any public item of your crate.

Check what the documentation looks like by running cargo doc and opening the target/doc/css_inline_example/index.html file in your browser.

See the complete CSS inlining implementation in this GitHub repo

Summary

At this point, inlining works. We implemented:

  • a high-level inline function and a configurable struct for more flexible inlining;
  • selecting elements in HTML and modifying them;
  • parser of CSS rules;
  • error handling;
  • serializing inlined HTML to a generic target;

Our Rust crate is complete; now we can start adding Python bindings to it!

Thank you,

Dmitry


❤ ❤ ❤