#164: Sanitizing HTML with HtmlSanitizeEx

Published June 1, 2023

Here we have an application that lists different albums. We’re able to edit a few different fields, including an album summary. Currently, the summary only contains plain text about our album, but what if we updated an album to include some HTML markup? Let’s update the text here to include a link and an h3 tag. Once it’s saved and the summary is rendered again, the markup is displayed on the page as literal text - it’s not being rendered as HTML.

This is because Phoenix does not consider data output in templates to be safe, so it escapes it by default. However, occasionally you may want to mark it as safe and show its “raw” contents. For these cases, Phoenix provides the Phoenix.HTML.raw/1 function. Let’s open our album’s show.html.heex template, find where we’re rendering @album.summary, and update it to be rendered with the raw function.

Template path: lib/teacher_web/templates/album/show.html.heex

...
<%= raw(@album.summary) %>
...
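To make the difference between the default escaping and raw/1 concrete, here is a minimal sketch that can be run as a standalone script outside the application. It assumes phoenix_html can be fetched with Mix.install (the version requirement is an assumption), and the sample summary string is made up for illustration.

```elixir
# Standalone script; assumes network access so Mix.install can fetch phoenix_html.
Mix.install([{:phoenix_html, "~> 3.3"}])

# Made-up sample summary containing markup.
summary = ~s(<h3>Sessions</h3>)

# Default behaviour: the string is escaped, so the browser shows the tags as text.
escaped =
  summary
  |> Phoenix.HTML.html_escape()
  |> Phoenix.HTML.safe_to_string()

# raw/1 marks the string as safe, so it is emitted unchanged.
unescaped =
  summary
  |> Phoenix.HTML.raw()
  |> Phoenix.HTML.safe_to_string()

IO.puts(escaped)
IO.puts(unescaped)
```

The first line printed is the escaped form that the template produced before the change, and the second is what raw/1 lets through untouched.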

Now if we go back to the browser our changes are displayed. The summary is now being rendered with the HTML markup we specified. While this works, it can also be dangerous. For example, let’s update our album summary. Only this time, let’s include some JavaScript to create an alert with the text “hello”. Now when we click save and reload the page, our alert is invoked. A bad actor could inject malicious code into the summary field (a cross-site scripting, or XSS, attack) and, as-is, our application would run it whenever the page loads.

To prevent this we should sanitize our HTML. To help us do that, let’s use the html_sanitize_ex package, which gives us a fast and straightforward HTML sanitizer and was extracted from the great https://elixirstatus.com project. Let’s go to Hex and grab the package. Then let’s open our Mixfile and add it to our list of dependencies.

mix.exs

...

defp deps do
  ...
  {:html_sanitize_ex, "~> 1.4"},
  ...
end

...

Then let’s go to the command line and download it.

$ mix deps.get
...
New:
  html_sanitize_ex 1.4.2
  mochiweb 2.22.0

Now there are a few different scrubbing options we can choose from, which are listed in the docs. For our example, let’s use HtmlSanitizeEx.basic_html/1, which will scrub everything but basic HTML tags. In this example, let’s sanitize the data before it’s saved to the database. To do that we’ll open our album.ex module and sanitize our summary in the changeset function.
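As a quick way to compare scrubbing options before wiring one into the changeset, the following sketch runs two of the library’s built-in functions on the same input. It is a standalone script that assumes Mix.install can fetch the package; the input string is made up for illustration, and the exact output is best checked by running it rather than taken from here.

```elixir
# Standalone script; assumes network access so Mix.install can fetch the package.
Mix.install([{:html_sanitize_ex, "~> 1.4"}])

# Made-up input mixing a heading with a script tag.
input = ~s(<h3>Sessions</h3><script>alert("hello")</script>)

# basic_html/1 keeps basic HTML tags and scrubs everything else.
basic = HtmlSanitizeEx.basic_html(input)

# strip_tags/1 removes all tags, leaving only the text content.
stripped = HtmlSanitizeEx.strip_tags(input)

IO.puts(basic)
IO.puts(stripped)
```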

To handle the scrubbing, let’s create a private function named sanitize_attrs and pattern match on the "summary" field that’s included in the map of attributes. If it exists, we’ll update the "summary" field in our attrs by running it through the HtmlSanitizeEx.basic_html function. Then let’s add a second sanitize_attrs clause to match when "summary" isn’t present in the attributes. From that clause, we’ll return the original attrs. Great, with our functions added, we just need to call sanitize_attrs from our changeset function above, passing in the attrs.

lib/teacher/recordings/album.ex

...

def changeset(album, attrs) do
  attrs = sanitize_attrs(attrs)

  album
  |> cast(attrs, [:artist, :summary, :title, :year])
  |> validate_required([:artist, :summary, :title, :year])
end

defp sanitize_attrs(%{"summary" => _summary} = attrs) do
  Map.update!(attrs, "summary", &HtmlSanitizeEx.basic_html/1)
end
defp sanitize_attrs(attrs) do
  attrs
end

...
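One thing to keep in mind with this pattern match: it only fires for maps that use the string key "summary", which is the shape of params coming from a controller. Attrs built with atom keys, for example in tests or seeds, would fall through to the second clause unsanitized. Below is a minimal sketch of a variant that handles both shapes. The module name, the scrub_key helper, and the stand-in scrub function are all hypothetical and not part of the episode’s code; in the application the function passed in would be &HtmlSanitizeEx.basic_html/1.

```elixir
defmodule SanitizeSketch do
  # Hypothetical helper, not from the episode: applies `scrub` to the summary
  # whether the attrs map uses string keys (controller params) or atom keys.
  def sanitize_attrs(attrs, scrub) when is_map(attrs) and is_function(scrub, 1) do
    attrs
    |> scrub_key("summary", scrub)
    |> scrub_key(:summary, scrub)
  end

  # Only touch the key when it is present and holds a string.
  defp scrub_key(attrs, key, scrub) do
    case attrs do
      %{^key => value} when is_binary(value) -> Map.put(attrs, key, scrub.(value))
      _ -> attrs
    end
  end
end

# Stand-in for a real sanitizer so the sketch runs without dependencies.
# It only marks the value; it does NOT sanitize anything.
scrub = fn html -> "[scrubbed] " <> html end

string_keys = SanitizeSketch.sanitize_attrs(%{"summary" => "<b>hi</b>"}, scrub)
atom_keys = SanitizeSketch.sanitize_attrs(%{summary: "<b>hi</b>"}, scrub)
no_summary = SanitizeSketch.sanitize_attrs(%{"title" => "Desire"}, scrub)

IO.inspect(string_keys)
IO.inspect(atom_keys)
IO.inspect(no_summary)
```

Guarding on is_binary/1 also means a missing or nil summary is passed through untouched and left for validate_required to report.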

Then let’s go to the command line and start our server.

$ mix phx.server
...

Now if we go back and save our album, we should expect the <script> tag to be removed from the text when it runs through the Album.changeset function. So let’s go ahead and save it. And great, it looks like it worked. If we go back to the “edit” page again, we can see the <script> tags have been removed, leaving only the alert("hello") text.

One of the nice features of html_sanitize_ex is that it makes it easy to create custom scrubbers. Let’s create one that only allows heading tags and strips all other tags.

Let’s go back to our application and create a new file named summary_scrubber.ex in “lib/teacher”. For the scrubber to work we’ll need to require HtmlSanitizeEx.Scrubber.Meta and then alias it. html_sanitize_ex provides macros in the HtmlSanitizeEx.Scrubber.Meta module to help us sanitize our HTML, and we’ll use them to build our scrubber. Let’s first include Meta.remove_cdata_sections_before_scrub to remove any CDATA sections and then Meta.strip_comments. Now we can use Meta.allow_tag_with_these_attributes for each of the tags we want to allow, which in this case is all heading tags. The empty list we pass as the second argument means no attributes are allowed on those tags. At the end of our scrubber module it’s important to include Meta.strip_everything_not_covered - this ensures any tags or attributes we haven’t explicitly allowed are stripped.

lib/teacher/summary_scrubber.ex


defmodule Teacher.SummaryScrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub()
  Meta.strip_comments()

  Meta.allow_tag_with_these_attributes("h1", [])
  Meta.allow_tag_with_these_attributes("h2", [])
  Meta.allow_tag_with_these_attributes("h3", [])
  Meta.allow_tag_with_these_attributes("h4", [])
  Meta.allow_tag_with_these_attributes("h5", [])
  Meta.allow_tag_with_these_attributes("h6", [])

  Meta.strip_everything_not_covered()
end

With that, our scrubber is ready to use. Let’s go back to our album.ex module and update the sanitize_attrs function to use our new scrubber. We’ll pass our summary as the first argument to HtmlSanitizeEx.Scrubber.scrub and our custom scrubber module - Teacher.SummaryScrubber - as the second argument.

lib/teacher/recordings/album.ex

...

defp sanitize_attrs(%{"summary" => _summary} = attrs) do
  Map.update!(attrs, "summary", fn(summary) ->
    HtmlSanitizeEx.Scrubber.scrub(summary, Teacher.SummaryScrubber)
  end)
end

...
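A custom scrubber can also be checked in isolation, without going through the browser. The sketch below is a standalone script under a few assumptions: Mix.install can fetch the package, the version is pinned to the 1.4.2 release that mix deps.get reported above, and the module name HeadingOnlyScrubber and the sample input are made up for illustration. It uses a cut-down version of the scrubber that allows only h3.

```elixir
# Standalone script; assumes network access so Mix.install can fetch the package.
# Pinned to the version shown in the episode's `mix deps.get` output.
Mix.install([{:html_sanitize_ex, "1.4.2"}])

# Hypothetical, cut-down scrubber for illustration: only <h3> is allowed.
defmodule HeadingOnlyScrubber do
  require HtmlSanitizeEx.Scrubber.Meta
  alias HtmlSanitizeEx.Scrubber.Meta

  Meta.remove_cdata_sections_before_scrub()
  Meta.strip_comments()

  Meta.allow_tag_with_these_attributes("h3", [])

  Meta.strip_everything_not_covered()
end

# Made-up input with one allowed tag and one tag that should be stripped.
input = ~s(<h3>Sessions</h3><a href="https://example.com">Bob Dylan</a>)
output = HtmlSanitizeEx.Scrubber.scrub(input, HeadingOnlyScrubber)

IO.puts(output)
```

The same kind of check could live in an ExUnit test for Teacher.SummaryScrubber, asserting that headings survive and anchor tags do not.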

Now let’s go back to the browser. When we update our album, the “sessions” heading should stay, but the “Bob Dylan” link should be stripped.

So let’s go to the “edit” page and include another heading tag. Then when we save it - perfect - only our two heading tags are kept and the link was removed. Our application is now updated to sanitize user input.