<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Manitou-Mail Blog &#187; code</title>
	<atom:link href="http://www.manitou-mail.org/blog/tag/code/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.manitou-mail.org/blog</link>
	<description>on the use and development of the Manitou-Mail program</description>
	<lastBuildDate>Mon, 23 Aug 2010 19:50:26 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.8.5</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Indexing HTML parts</title>
		<link>http://www.manitou-mail.org/blog/2009/08/indexing-html-parts/</link>
		<comments>http://www.manitou-mail.org/blog/2009/08/indexing-html-parts/#comments</comments>
		<pubDate>Sat, 15 Aug 2009 13:12:44 +0000</pubDate>
		<dc:creator>daniel</dc:creator>
				<category><![CDATA[Development]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[manitou-mdx]]></category>
		<category><![CDATA[plugins]]></category>

		<guid isPermaLink="false">http://www.manitou-mail.org/blog/?p=3</guid>
		<description><![CDATA[<p>While HTML integration is improving in Manitou-Mail, the current version (0.9.12) does not index the contents of HTML parts. This is generally not a problem because messages tend to carry a text version inside a multipart/alternative MIME construct, and that version gets indexed so that the message can still be retrieved by the words it [...]]]></description>
			<content:encoded><![CDATA[<p>While HTML integration is improving in Manitou-Mail, the current version (0.9.12) does not index the contents of HTML parts. This is generally not a problem because messages tend to carry a text version inside a multipart/alternative MIME construct, and that version gets indexed so that the message can still be retrieved by the words it contains. But still, some people send HTML-only messages, in which case we want to automatically extract the text from the HTML and pass it to the indexer.</p>
<p>It&#8217;s relatively easy to write a manitou-mdx Perl plugin that does just that, by using a CPAN module to do the HTML to text conversion: <a href="http://search.cpan.org/~sburke/HTML-Format-2.04/lib/HTML/FormatText.pm">HTML::FormatText</a>.</p>
<p>Apart from the usual <strong>init</strong> and <strong>process</strong> functions that are described in the <a href="http://www.manitou-mail.org/mdx/plugins-reference.html">mdx plugins reference</a>, we need to provide two functions: one that recursively descends the MIME tree to find the HTML parts, and another that converts each of them to text and passes the result to the indexer.</p>

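<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">use</span> HTML<span style="color: #339933;">::</span><span style="color: #006600;">TreeBuilder</span><span style="color: #339933;">;</span>
<span style="color: #000000; font-weight: bold;">use</span> HTML<span style="color: #339933;">::</span><span style="color: #006600;">FormatText</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">sub</span> index_contents <span style="color: #009900;">&#123;</span>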
  <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$fh</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$ctxt</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">=</span><span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span>
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$html</span><span style="color: #339933;">;</span>
  <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$text</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#123;</span>
    <span style="color: #000066;">local</span> <span style="color: #0000ff;">$/</span><span style="color: #339933;">;</span>
    <span style="color: #0000ff;">$html</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$fh</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">getline</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
&nbsp;
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">defined</span> <span style="color: #0000ff;">$html</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$tree</span> <span style="color: #339933;">=</span> HTML<span style="color: #339933;">::</span><span style="color: #006600;">TreeBuilder</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">new</span><span style="color: #339933;">;</span>
    <span style="color: #0000ff;">$tree</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">parse_content</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$html</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$formatter</span> <span style="color: #339933;">=</span> HTML<span style="color: #339933;">::</span><span style="color: #006600;">FormatText</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">new</span><span style="color: #009900;">&#40;</span>leftmargin<span style="color: #339933;">=&gt;</span><span style="color: #cc66cc;">0</span><span style="color: #339933;">,</span> rightmargin<span style="color: #339933;">=&gt;</span><span style="color: #cc66cc;">78</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
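    <span style="color: #0000ff;">$text</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$formatter</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">format</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$tree</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
    <span style="color: #0000ff;">$tree</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">delete</span><span style="color: #339933;">;</span>    <span style="color: #666666; font-style: italic;"># release the parse tree (needed with older HTML::Tree versions)</span>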
  <span style="color: #009900;">&#125;</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #000066;">defined</span> <span style="color: #0000ff;">$text</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    Manitou<span style="color: #339933;">::</span><span style="color: #006600;">Words</span><span style="color: #339933;">::</span><span style="color: #006600;">index_words</span><span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$ctxt</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'dbh'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$ctxt</span><span style="color: #339933;">-&gt;</span><span style="color: #009900;">&#123;</span><span style="color: #ff0000;">'mail_id'</span><span style="color: #009900;">&#125;</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">\$text</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #000000; font-weight: bold;">sub</span> process_parts <span style="color: #009900;">&#123;</span>
  <span style="color: #b1b100;">my</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$obj</span><span style="color: #339933;">,</span><span style="color: #0000ff;">$ctxt</span><span style="color: #009900;">&#41;</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">@_</span><span style="color: #339933;">;</span>
  <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$obj</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">is_multipart</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #b1b100;">foreach</span> <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$subobj</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$obj</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">parts</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      process_parts<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$subobj</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$ctxt</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>    <span style="color: #666666; font-style: italic;"># recurse</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
  <span style="color: #b1b100;">else</span> <span style="color: #009900;">&#123;</span>
    <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$type</span><span style="color: #339933;">=</span><span style="color: #0000ff;">$obj</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">effective_type</span><span style="color: #339933;">;</span>
    <span style="color: #b1b100;">if</span> <span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$type</span> <span style="color: #b1b100;">eq</span> <span style="color: #ff0000;">&quot;text/html&quot;</span><span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
      <span style="color: #b1b100;">my</span> <span style="color: #0000ff;">$io</span> <span style="color: #339933;">=</span> <span style="color: #0000ff;">$obj</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">bodyhandle</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">open</span><span style="color: #009900;">&#40;</span><span style="color: #ff0000;">&quot;r&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      index_contents<span style="color: #009900;">&#40;</span><span style="color: #0000ff;">$io</span><span style="color: #339933;">,</span> <span style="color: #0000ff;">$ctxt</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
      <span style="color: #0000ff;">$io</span><span style="color: #339933;">-&gt;</span><span style="color: #006600;">close</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
  <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>
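
<p>The <strong>init</strong> and <strong>process</strong> functions are not shown above, so here is a minimal sketch of how they might drive the same recursive descent. Several things in it are assumptions made for illustration only: the package name, the <strong>FakePart</strong> class (a stand-in for MIME::Entity so that the sketch runs without MIME-tools or a database), the exact arguments of <strong>init</strong>, and the <strong>mimeobj</strong> key of the context hash. Check the mdx plugins reference for the real interface. Instead of indexing, the sketch only records which parts would be handed to index_contents.</p>

<div class="wp_syntax"><div class="code"><pre class="perl" style="font-family:monospace;">
use strict;
use warnings;

# Hypothetical stand-in for MIME::Entity: it only implements the three
# methods that the recursion relies on.
package FakePart;
sub new {
  my ($class, %args) = @_;
  my $self = { %args };
  return bless $self, $class;
}
sub parts          { my $self = shift; return @{ $self->{parts} || [] }; }
sub is_multipart   { my $self = shift; return index($self->{type}, "multipart/") == 0 ? 1 : 0; }
sub effective_type { my $self = shift; return $self->{type}; }

# Hypothetical plugin package name.
package Manitou::Plugins::html_parts_sketch;

# init: assumed to receive the database handle plus optional arguments
# and to return a blessed object.
sub init {
  my ($dbh, @args) = @_;
  my $self = { found => [] };
  return bless $self, __PACKAGE__;
}

# Same descent as process_parts, but the HTML leaves are collected
# rather than converted and indexed.
sub collect_html_parts {
  my ($obj, $found) = @_;
  if ($obj->is_multipart) {
    foreach my $subobj ($obj->parts) {
      collect_html_parts($subobj, $found);    # recurse
    }
  }
  elsif ($obj->effective_type eq "text/html") {
    push @$found, $obj;
  }
}

# process: assumed to be called once per message with a context hash.
# The 'mimeobj' key name is an assumption for the top-level MIME object.
sub process {
  my ($self, $ctxt) = @_;
  collect_html_parts($ctxt->{'mimeobj'}, $self->{found});
  return 1;
}

package main;

# A message with one HTML alternative, one standalone HTML part and an image.
my $message = FakePart->new(type => "multipart/mixed", parts => [
  FakePart->new(type => "multipart/alternative", parts => [
    FakePart->new(type => "text/plain"),
    FakePart->new(type => "text/html"),
  ]),
  FakePart->new(type => "text/html"),
  FakePart->new(type => "image/png"),
]);

my $plugin = Manitou::Plugins::html_parts_sketch::init(undef);
$plugin->process({ mimeobj => $message, mail_id => 1 });
print scalar(@{ $plugin->{found} }), " HTML part(s) found\n";
</pre></div></div>

<p>For this sample message the descent finds two HTML parts: the one nested inside the multipart/alternative and the standalone one. In a real plugin, each of them would be opened through its bodyhandle and passed to index_contents, as process_parts does.</p>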

<p>The full source code and download link are available on the <a href="http://www.manitou-mail.org/wiki/doku.php/plugins:html_indexer">wiki</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.manitou-mail.org/blog/2009/08/indexing-html-parts/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
