html5

HTML5-compliant Parser and Tokenizer with XPath2 support crystal-shard html5 html-parser html-tokenizer xpath2 css-selectors

HTML/XML Parsing

0.1.0 released almost 4 years ago

naqvis/crystal-html5

over 1 year ago

34 3

Ali Naqvi

Source

Crystal-HTML5

Crystal-HTML5 shard is a Pure Crystal implementation of an HTML5-compliant Tokenizer and Parser. The relevant specifications include:

Tokenization is done by creating a Tokenizer for an IO. It is the caller responsibility to ensure that provided IO provides UTF-8 encoded HTML. The tokenization algorithm implemented by this shard is not a line-by-line transliteration of the relatively verbose state-machine in the WHATWG specification. A more direct approach is used instead, where the program counter implies the state, such as whether it is tokenizing a tag or a text node. Specification compliance is verified by checking expected and actual outputs over a test suite rather than aiming for algorithmic fidelity.

Parsing is done by calling HTML5.parse with either a String containing HTML or an IO instance. HTML5.parse returns a document root as HTML5::Node instance.

Parsing a fragment is done by calling HTML5.parse_fragment with either a String containing fragment of HTML5 or an IO instance. If the fragment is the InnerHTML for an existing element, pass that element in context. HTML5.parse_fragment returns a list of HTML5::Node that were found.

Installation

Add the dependency to your shard.yml:

dependencies:
  html:
    github: naqvis/crystal-html5

Run shards install

Usage

Example 1: Process each anchor `<a>` node.

require "html5"

html = <<-HTML
<!DOCTYPE html><html lang="en-US">
<head>
<title>Hello,World!</title>
</head>
<body>
<div class="container">
<header>
	<!-- Logo -->
   <h1>City Gallery</h1>
</header>
<nav>
  <ul>
    <li><a href="/London">London</a></li>
    <li><a href="/Paris">Paris</a></li>
    <li><a href="/Tokyo">Tokyo</a></li>
  </ul>
</nav>
<article>
  <h1>London</h1>
  <img src="pic_mountain.jpg" alt="Mountain View" style="width:304px;height:228px;">
  <p>London is the capital city of England. It is the most populous city in the  United Kingdom, with a metropolitan area of over 13 million inhabitants.</p>
  <p>Standing on the River Thames, London has been a major settlement for two millennia, its history going back to its founding by the Romans, who named it Londinium.</p>
</article>
<footer>Copyright &copy; W3Schools.com</footer>
</div>
</body>
</html>
HTML5

def process(node)
  if node.element? && node.data == "a"
    # Do something with node
    href = node["href"]?
    puts "#{node.first_child.try &.data} =>  #{href.try &.val}"

    # print all attributes
    node.attr.each do |a|
      # puts "#{a.key} = \"#{a.val}\""
    end
  end
  c = node.first_child
  while c
    process(c)
    c = c.next_sibling
  end
end

doc = HTML5.parse(html)
process(doc)

# Output
# London =>  /London
# Paris =>  /Paris
# Tokyo =>  /Tokyo

Example 2: Parse an HTML or Fragment of HTML

require "html5"

def parse_html(html, context)
  if context.empty?
    doc = HTML5.parse(html)
  else
    namespace = ""
    if (i = context.index(' ')) && (i >= 0)
      namespace, context = context[...i], context[i + 1..]
    end
    cnode = HTML5::Node.new(
      type: HTML5::NodeType::Element,
      data_atom: HTML5::Atom.lookup(context.to_slice),
      data: context,
      namespace: namespace,
    )

    nodes = HTML5.parse_fragment(html, cnode)
    doc = HTML5::Node.new(type: HTML5::NodeType::Document)
    nodes.each do |n|
      doc.append_child(n)
    end
  end
  doc
end

html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = parse_html(html, "body")
process(doc)

# Output
# Foo =>  foo
# BarBaz =>  /bar/baz

Example 3: Render `HTML5::Node` to HTML

require "html5"

html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = HTML5.parse(html)
doc.render(STDOUT)

# Output
# <html><head></head><body><p>Links:</p><ul><li><a href="foo">Foo</a></li><li><a href="/bar/baz">BarBaz</a></li></ul></body></html>

Example 3: XPath Query

require "html5"

html = %(<p>Links:</p><ul><li><a href="foo">Foo</a><li><a href="/bar/baz">BarBaz</a></ul>)
doc = HTML5.parse(html)

# Find all A elements
list = html.xpath_nodes("//a")

# Find all A elements that have `href` attribute.
list = html.xpath_nodes("//a[@href]")

# Find all A elements with `href` attribute and only return `href` value.
list = html.xpath_nodes("//a/@href")
list.each {|a| pp a.inner_text}

# Find the second `a` element
a = html.xpath("//a[3]")

# Count the number of all a elements.
v = html.xpath_float("//a")

Refer to specs for more sample usages. And refer to Crystal XPath2 Shard for details of what functions and functionality is supported by XPath implementation.

Development

To run all tests:

crystal spec

Contributing

Fork it (https://github.com/naqvis/crystal-html5/fork)
Create your feature branch (git checkout -b my-new-feature)
Commit your changes (git commit -am 'Add some feature')
Push to the branch (git push origin my-new-feature)
Create a new Pull Request

Contributors

Ali Naqvi - creator and maintainer

html5:
  github: naqvis/crystal-html5
  version: ~> 0.1.0

License MIT

Crystal 0.34.0

Authors

Ali Naqvi syed.alinaqvi@gmail.com

Dependencies 1

xpath2 ~> 0.1

{'github' => 'naqvis/crystal-xpath2', 'version' => '~> 0.1'}

Development Dependencies 0

Dependents 2

lucky_flow (3 dependents)
tourmaline

Releases

Show all 15 releases

Last synced about 20 hours ago.

Edit catalog entry ?