myhtml
MyHTML
Fast HTML5 parser (Crystal binding for lexborisov's excellent myhtml and Modest libraries). This shard is used in production to parse millions of pages per day and has proven stable and fast.
WARNING: the original libraries (myhtml and Modest) have not been maintained since July 2020. I recommend switching to the successor parser, Lexbor.
Installation
Add this to your application's shard.yml:
dependencies:
  myhtml:
    github: kostya/myhtml
Then run shards install.
Usage example
require "myhtml"
html = <<-HTML
  <html>
    <body>
      <div id="t1" class="red">
        <a href="/#">O_o</a>
      </div>
      <div id="t2"></div>
    </body>
  </html>
HTML
myhtml = Myhtml::Parser.new(html)
myhtml.nodes(:div).each do |node|
  id = node.attribute_by("id")
  if first_link = node.scope.nodes(:a).first?
    href = first_link.attribute_by("href")
    link_text = first_link.inner_text
    puts "div with id #{id} have link [#{link_text}](#{href})"
  else
    puts "div with id #{id} have no links"
  end
end
# Output:
#   div with id t1 have link [O_o](/#)
#   div with id t2 have no links
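The same calls can also build a data structure instead of printing. The following is a minimal sketch that uses only the methods shown in the example above (`nodes`, `scope`, `attribute_by`); the second link, the `"(no id)"` fallback, and the variable names are illustrative assumptions, and it assumes `attribute_by` returns `nil` when the attribute is absent.

```crystal
require "myhtml"

html = <<-HTML
  <html>
    <body>
      <div id="t1" class="red">
        <a href="/#">O_o</a>
        <a href="/about">About</a>
      </div>
      <div id="t2"></div>
    </body>
  </html>
HTML

parser = Myhtml::Parser.new(html)

# Map each div's id to the hrefs of all links found inside that div.
links_per_div = {} of String => Array(String)

parser.nodes(:div).each do |div|
  # "(no id)" is a hypothetical placeholder for divs without an id attribute.
  id = div.attribute_by("id") || "(no id)"

  # scope walks the nodes nested inside this div; compact drops missing hrefs.
  hrefs = div.scope.nodes(:a).map { |a| a.attribute_by("href") }.to_a.compact

  links_per_div[id] = hrefs
end

p links_per_div
```

Collecting into a hash keeps the traversal separate from whatever is done with the result afterwards, such as printing or serializing.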
CSS selectors example
require "myhtml"
html = <<-HTML
  <html>
    <body>
      <table id="t1">
        <tr><td>Hello</td></tr>
      </table>
      <table id="t2">
        <tr><td>123</td><td>other</td></tr>
        <tr><td>foo</td><td>columns</td></tr>
        <tr><td>bar</td><td>are</td></tr>
        <tr><td>xyz</td><td>ignored</td></tr>
      </table>
    </body>
  </html>
HTML
myhtml = Myhtml::Parser.new(html)
p myhtml.css("#t2 tr td:first-child").map(&.inner_text).to_a
# => ["123", "foo", "bar", "xyz"]
p myhtml.css("#t2 tr td:first-child").map(&.to_html).to_a
# => ["<td>123</td>", "<td>foo</td>", "<td>bar</td>", "<td>xyz</td>"]
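A CSS selector can be combined with the node traversal from the first example to read a table row by row. This is a sketch under assumptions: it relies only on `css`, `scope`, `nodes` and `inner_text` as used above, and it assumes that `:td` is accepted as a tag symbol in the same way as `:div` and `:a`.

```crystal
require "myhtml"

html = <<-HTML
  <html>
    <body>
      <table id="t2">
        <tr><td>123</td><td>other</td></tr>
        <tr><td>foo</td><td>columns</td></tr>
      </table>
    </body>
  </html>
HTML

parser = Myhtml::Parser.new(html)

# Select every row of the table, then collect the text of each cell in that row.
rows = parser.css("#t2 tr").map { |row|
  row.scope.nodes(:td).map(&.inner_text).to_a
}.to_a

p rows
```

Each element of `rows` is an array holding the cell texts of one table row, which is convenient when the table needs to be exported or compared column by column.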
More Examples
Development Setup:
git clone https://github.com/kostya/myhtml.git
cd myhtml
make
crystal spec
Benchmark
Parse the Google page (600 KB) 1000 times, and run a CSS select 1000 times. Programs: myhtml-program, crystagiri-program, nokogiri-program
| Lang     | Shard      | Lib             | Parse time, s | CSS time, s | Memory, MiB |
| -------- | ---------- | --------------- | ------------- | ----------- | ----------- |
| Crystal  | lexbor     | lexbor          | 2.54          | 0.099       | 7.8         |
| Crystal  | myhtml     | myhtml(+modest) | 3.17          | 0.16        | 8.4         |
| Ruby 2.7 | Nokogiri   | libxml2         | 9.19          | 10.76       | 139.8       |
| Crystal  | Crystagiri | libxml2         | 11.27         | -           | 25.0        |