Indexing HTML with Solr/Lucene

2012-11-19

Lots of people want to index HTML files with Solr or Lucene. Tika is a great tool that can extract text from many document formats, but it is very heavy-weight. What if you just want to strip the text out of HTML files and have Lucene index it? There is no out-of-the-box solution for this, but with a tiny customization we can get it working.

For the actual text extraction, we’re going to use the excellent Jericho HTML Parser. This is a loose HTML parser that can handle all kinds of malformed markup and still give reasonable results. It’s also got a class dedicated to ripping text out of HTML documents called TextExtractor.
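
To get a feel for what TextExtractor does, here’s a minimal standalone sketch (the net.htmlparser.jericho package name assumes Jericho 3.x):

import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

public class TextExtractorDemo {
  public static void main(String[] args) {
    Source source = new Source("<html><body><h1>Hello</h1><p>Some <b>bold</b> text.</p></body></html>");
    // TextExtractor pulls just the character data out of the markup.
    String text = new TextExtractor(source).toString();
    System.out.println(text);  // roughly: Hello Some bold text.
  }
}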

To use this in Solr, we’re going to implement our own TokenizerFactory. Solr uses this class to get a Tokenizer, which will generate the tokens for your index. The important method here is Tokenizer create(Reader input). This method takes a Reader, a standard Java I/O object, opened on the text of whatever document you’re indexing. Your Tokenizer will use this Reader to get the document’s content.
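
For reference, a factory that does no preprocessing and just hands the Reader straight to StandardTokenizer might look roughly like this (a sketch assuming Solr 3.6’s org.apache.solr.analysis.BaseTokenizerFactory; PlainTokenizerFactory is just a hypothetical name for illustration):

import java.io.Reader;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class PlainTokenizerFactory extends BaseTokenizerFactory {
  public Tokenizer create(Reader input) {
    // No preprocessing: tokenize whatever text the Reader provides.
    return new StandardTokenizer(Version.LUCENE_36, input);
  }
}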

Our strategy is to return the same Tokenizer we were using before (StandardTokenizer for normal text), but to inject Jericho into the mix before passing it the input Reader. So instead of Reader -> StandardTokenizer we’re going to do Reader -> Jericho -> StandardTokenizer. Jericho’s TextExtractor makes this easy for us, because it takes a Reader of HTML and returns a Reader of plain text. So we write an HtmlTokenizerFactory with this implementation:

public Tokenizer create(Reader input) {
  try {
    // Strip the HTML first, then tokenize the resulting plain text as usual.
    return new StandardTokenizer(Version.LUCENE_36, convertReader(input));
  } catch (IOException e) {
    throw new RuntimeException(e);
  }
}

private static Reader convertReader(Reader r) throws IOException {
  // Parse the HTML, grab the <html> element, and return a Reader over just its text.
  Source s = new Source(r);
  Element elem = s.getNextElement(0, "html");
  TextExtractor te = new TextExtractor(elem);
  return CharStreamSourceUtil.getReader(te);
}

That’s it! Our HtmlTokenizerFactory returns a StandardTokenizer, but it pre-processes the HTML to extract just the text.
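
For reference, the complete class might look roughly like this. The Jericho imports assume version 3.x (package net.htmlparser.jericho), and extending BaseTokenizerFactory assumes Solr 3.6’s analysis API:

import java.io.IOException;
import java.io.Reader;

import net.htmlparser.jericho.CharStreamSourceUtil;
import net.htmlparser.jericho.Element;
import net.htmlparser.jericho.Source;
import net.htmlparser.jericho.TextExtractor;

import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.util.Version;
import org.apache.solr.analysis.BaseTokenizerFactory;

public class HtmlTokenizerFactory extends BaseTokenizerFactory {

  public Tokenizer create(Reader input) {
    try {
      return new StandardTokenizer(Version.LUCENE_36, convertReader(input));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  private static Reader convertReader(Reader r) throws IOException {
    Source s = new Source(r);
    Element elem = s.getNextElement(0, "html");
    TextExtractor te = new TextExtractor(elem);
    return CharStreamSourceUtil.getReader(te);
  }
}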

The final step is to define a new Field Type in Solr’s schema.xml file, so we can index fields that have HTML. Just add this snippet to the file:

<fieldType name="text_html" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="your.package.name.HtmlTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Here we’re using the same Filters as Solr’s out-of-the-box text_general Field Type, but you could change them to whatever you like. By converting the HTML to plain text right at the beginning, you retain all the power and flexibility of Lucene’s indexing workflow.

To actually define a Field using our new Field Type, add something like this to your schema.xml:

<field name="html_content" type="text_html" indexed="true" stored="false"/>
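
When you index a document, the raw HTML simply goes into that field, escaped as required by your update format. For example, with Solr’s XML update format a document might look like this (the id field and its value are illustrative assumptions based on the usual example schema):

<add>
  <doc>
    <field name="id">example-1</field>
    <field name="html_content">&lt;html&gt;&lt;body&gt;&lt;h1&gt;Hello&lt;/h1&gt;&lt;p&gt;Some text.&lt;/p&gt;&lt;/body&gt;&lt;/html&gt;</field>
  </doc>
</add>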

There are many improvements you can make to this approach. For instance, you could have Jericho process just the body tag instead of the whole page, perhaps also extracting certain meta tags and adding them to the token stream. Or you could follow Google’s lead and boost the weight of tokens that come from the <title>, <h1>, or similar tags. But hopefully this is a good starting point for your Solr/Lucene applications!
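
For example, a hypothetical variant of convertReader that indexes only the <title> and <body> text might look something like this (same Jericho imports as before, plus java.io.StringReader):

private static Reader convertReader(Reader r) throws IOException {
  Source s = new Source(r);
  StringBuilder text = new StringBuilder();

  // Pull in the page title first, if there is one.
  Element title = s.getNextElement(0, "title");
  if (title != null) {
    text.append(new TextExtractor(title).toString()).append(' ');
  }

  // Then extract the text of the body, skipping the rest of the page.
  Element body = s.getNextElement(0, "body");
  if (body != null) {
    text.append(new TextExtractor(body).toString());
  }

  return new StringReader(text.toString());
}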
