
Search Using ASP.NET and Lucene

Getting Started

Getting Lucene to work on your ASP.NET website isn't hard, but a few tricks help. We decided to use the Lucene.Net 2.1.0 release because it makes updating the Lucene index easier via a new method on the IndexWriter object called UpdateDocument. This method deletes the specified document and then adds the new copy to the index. You can't download a Lucene.Net 2.1.0 binary; instead you will need to get the source from the project's Subversion repository and compile it yourself.
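
As a minimal sketch of what that looks like (the index path and url below are purely illustrative; the using directives are the same ones shown in GeneralSearch.cs further down, and the index is assumed to already exist):

// Minimal sketch: UpdateDocument deletes any document whose "uid" term matches,
// then adds the new copy, all in one call.
IndexWriter writer = new IndexWriter(@"C:\SearchIndex", new StandardAnalyzer(), false);
Document doc = new Document();
doc.Add(new Field("uid", "http://www.example.com/page.aspx", Field.Store.NO, Field.Index.UN_TOKENIZED));
doc.Add(new Field("contents", "the re-indexed page text", Field.Store.YES, Field.Index.TOKENIZED));
writer.UpdateDocument(new Term("uid", "http://www.example.com/page.aspx"), doc);
writer.Close();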

Don't worry, this is an easy step. Using your Subversion client (I recommend TortoiseSVN), get the source by doing a checkout from this URL: https://svn.apache.org/repos/asf/incubator/lucene.net/tags/Lucene.Net_2_1_0/

Next, go into the directory Lucene.Net_2_1_0\src\Lucene.Net. There you should find a Visual Studio solution file that matches your Visual Studio version. If you are using Visual Studio 2005, be sure to load Lucene.Net-2.1.0-VS2005.sln.

Hit compile. The resulting Lucene.Net.dll in the bin/release folder is the DLL you will need to reference from the Visual Studio project that will contain your Lucene code.

Creating the Lucene Index

Lucene creates a file-based index that it uses to quickly return search results. We had to find a way to index all the pages in our system so that Lucene could search all of our content; in our case that includes all the articles, forum posts and, of course, house plans on the website. To make this happen we query our database to get back the URLs of all of our content and then send a web spider out to pull the content down from our site. That content is then parsed and fed to Lucene.
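
Here is a rough sketch of that indexing pass. GetContentUrls() is a hypothetical placeholder for our own data-access code, and GeneralSearch is the wrapper class shown below:

// Hypothetical indexing pass: open the index once, spider every url, close the index.
GeneralSearch search = new GeneralSearch(@"D:\SearchIndex");
search.OpenWriter();
foreach (string url in GetContentUrls())    // placeholder for the database query that returns content urls
{
    search.AddWebPage(url);                 // downloads the page, strips the html and indexes it
}
search.CloseWriter();                       // optimizes and closes the index
foreach (string error in search.Errors)     // pages that could not be downloaded or indexed
{
    Console.WriteLine(error);
}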

We developed three classes to make this all work; most of the code is taken from examples or from other kind souls who shared theirs. The first class, GeneralSearch.cs, creates the index and provides the mechanism for searching it. The second class, HtmlDocument, consists of code taken from Searcharoo, a web-spidering project written in C#, and handles parsing the HTML for us. Special thanks to Searcharoo for that code; I didn't want to write it. The last class, HtmlDownloader.cs, is also borrowed from Searcharoo; its task is to download pages from the site and create an HtmlDocument from each one.

GeneralSearch.cs

using System;
using System.Collections.Generic;
using System.Data;
using Core.Utils.Html;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
namespace Core.Search
{
/// <summary>
/// Wrapper for Lucene to perform a general search
/// see:
/// http://www.codeproject.com/KB/aspnet/DotLuceneSearch.aspx
/// for more help about the methods used in this class
/// </summary>
public class GeneralSearch
{
private IndexWriter _Writer = null;
private string _IndexDirectory;
private List<string> _Errors = new List<string>();
private int _TotalResults = 0;
private int _Start = 1;
private int _End = 10;
/// <summary>
/// General constructor method.
/// </summary>
/// <param name="indexDirectory">The directory where the index 
/// is located.</param>
public GeneralSearch(string indexDirectory)
{
_IndexDirectory = indexDirectory;
}
/// <summary>
/// List of errors that occurred during indexing
/// </summary>
public List<string> Errors
{
get { return _Errors; }
set { _Errors = value; }
}
/// <summary>
/// Total number of hits returned by the search
/// </summary>
public int TotalResults
{
get { return _TotalResults; }
set { _TotalResults = value; }
}
/// <summary>
/// The number of the record where the results begin.
/// </summary>
public int Start
{
get { return _Start; }
set { _Start = value; }
}
/// <summary>
/// The number of the record where the results end.
/// </summary>
public int End
{
get { return _End; }
set { _End = value; }
}
/// <summary>
/// Returns a table with matching results or null
/// if the index does not exist.  This method will page the
/// results.
/// </summary>
/// <param name="searchText">terms to search for</param>
/// <param name="currentPage">The current results page</param>
/// <param name="hitsPerPage">The number of hits to return for each results page</param>
/// <returns>A datatable containing the number of results specified for the given page.</returns>
public DataTable DoSearch(string searchText, int hitsPerPage, int currentPage)
{
if(!IndexReader.IndexExists(_IndexDirectory))
{
return null;
}
string field = IndexedFields.Contents;
IndexReader reader = IndexReader.Open(_IndexDirectory);
IndexSearcher searcher = new IndexSearcher(reader);
Analyzer analyzer = new StandardAnalyzer();
QueryParser parser = new QueryParser(field, analyzer);
Query query = parser.Parse(searchText);
Hits hits = searcher.Search(query);
DataTable dt = new DataTable();
dt.Columns.Add(IndexedFields.Url, typeof(string));
dt.Columns.Add(IndexedFields.Title, typeof(string));
//dt.Columns.Add(IndexedFields.Summary, typeof(string));
dt.Columns.Add(IndexedFields.Contents, typeof(string));
dt.Columns.Add(IndexedFields.Image, typeof(string));
if(currentPage <= 0)
{
currentPage = 1;
}
Start = (currentPage-1) * hitsPerPage;
End = System.Math.Min(hits.Length(), Start + hitsPerPage);
TotalResults = hits.Length();
for (int i = Start; i < End; i++)
{
// get the document from index
Document doc = hits.Doc(i);
DataRow row = dt.NewRow();
row[IndexedFields.Url] = doc.Get(IndexedFields.Url);
//row[IndexedFields.Summary] = doc.Get(IndexedFields.Summary);
row[IndexedFields.Contents] = doc.Get(IndexedFields.Contents);
row[IndexedFields.Title] = doc.Get(IndexedFields.Title);
row[IndexedFields.Image] = doc.Get(IndexedFields.Image);
dt.Rows.Add(row);
}
searcher.Close();
reader.Close();
return dt;
}
/// <summary>
/// Opens the index for writing
/// </summary>
public void OpenWriter()
{
bool create = false;
if (!IndexReader.IndexExists(_IndexDirectory))
{
create = true;
}
_Writer = new IndexWriter(_IndexDirectory, new StandardAnalyzer(), create);
_Writer.SetUseCompoundFile(true);
_Writer.SetMaxFieldLength(1000000);
}
/// <summary>
/// Closes and optimizes the index
/// </summary>
public void CloseWriter()
{
_Writer.Optimize();
_Writer.Close();
}
/// <summary>
/// Loads, parses and indexes an HTML file at a given url.
/// </summary>
/// <param name="url"></param>
public void AddWebPage(string url)
{
HtmlDocument html = HtmlDownloader.Download(url);
if (null != html)
{
// make a new, empty document
Document doc = new Document();
// Store the url
doc.Add(new Field(IndexedFields.Url, url, Field.Store.YES, Field.Index.UN_TOKENIZED));
// create a uid that will let us maintain the index incrementally
doc.Add(new Field(IndexedFields.Uid, url, Field.Store.NO, Field.Index.UN_TOKENIZED));
// Add the tag-stripped contents as a Reader-valued Text field so it will
// get tokenized and indexed.
doc.Add(new Field(IndexedFields.Contents, html.WordsOnly, Field.Store.YES, Field.Index.TOKENIZED));
// Add the summary as a field that is stored and returned with
// hit documents for display.
//doc.Add(new Field(IndexedFields.Summary, html.Description, Field.Store.YES, Field.Index.NO));
// Add the title as a field so that it can be searched and is stored.
doc.Add(new Field(IndexedFields.Title, html.Title, Field.Store.YES, Field.Index.TOKENIZED));
Term t = new Term(IndexedFields.Uid, url);
_Writer.UpdateDocument(t, doc);
}
else
{
Errors.Add("Could not index " + url);
}
}
/// <summary>
/// Use this method to add a single page to the index.
/// </summary>
/// <remarks>
/// If you are adding multiple pages, call OpenWriter once, use AddPage for each page and then call CloseWriter, so the index is only opened and closed once.
/// </remarks>
/// <param name="url">The url for the given document.  The document will not be requested from
/// this url.  Instead it will be used as a key to access the document within the index and 
/// will be returned when the index is searched so that the document can be referenced by the
/// client.</param>
/// <param name="documentText">The contents of the document that is to be added to the index.</param>
/// <param name="title">The title of the document to add to the index.</param>
public void AddSinglePage(string url, string documentText, string title, string image)
{
OpenWriter();
AddPage(url, documentText, title, image);
CloseWriter();
}
/// <summary>
/// Indexes the text of the given document, but does not request the document from the specified url.
/// </summary>
/// <remarks>
/// Use this method to add a document to the index when you already know its contents and url.  This avoids
/// an HTTP download, which can take longer.
/// </remarks>
/// <param name="url">The url for the given document.  The document will not be requested from
/// this url.  Instead it will be used as a key to access the document within the index and 
/// will be returned when the index is searched so that the document can be referenced by the
/// client.</param>
/// <param name="documentText">The contents of the document that is to be added to the index.</param>
/// <param name="title">The title of the document to add to the index.</param>
/// <param name="image">Image to include with search results</param>
public void AddPage(string url, string documentText, string title, string image)
{
// make a new, empty document
Document doc = new Document();
// Store the url
doc.Add(new Field(IndexedFields.Url, url, Field.Store.YES, Field.Index.UN_TOKENIZED));
// create a uid that will let us maintain the index incrementally
doc.Add(new Field(IndexedFields.Uid, url, Field.Store.NO, Field.Index.UN_TOKENIZED));
// Add the tag-stripped contents as a Reader-valued Text field so it will
// get tokenized and indexed.
doc.Add(new Field(IndexedFields.Contents, documentText, Field.Store.YES, Field.Index.TOKENIZED));
// Add the summary as a field that is stored and returned with
// hit documents for display.
//doc.Add(new Field(IndexedFields.Summary, documentDescription, Field.Store.YES, Field.Index.NO));
// Add the title as a field so that it can be searched and is stored.
doc.Add(new Field(IndexedFields.Title, title, Field.Store.YES, Field.Index.TOKENIZED));
// Add the image as a field that is stored and returned with search results.
doc.Add(new Field(IndexedFields.Image, image, Field.Store.YES, Field.Index.TOKENIZED));
Term t = new Term(IndexedFields.Uid, url);
try
{
_Writer.UpdateDocument(t, doc);
}
catch(Exception ex)
{
Errors.Add(ex.Message);
}
}
/// <summary>
/// A list of fields available in the index
/// </summary>
public static class IndexedFields
{
public const string Url = "url";
public const string Uid = "uid";
public const string Contents = "contents";
//public const string Summary = "summary";
public const string Title = "title";
public const string Image = "image";
}
}
}
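
For completeness, here is roughly how a search page might call the class. The index path, query string parameter and the lblCount/rptResults controls are illustrative assumptions, not part of the code above:

// Illustrative code-behind for a results page.
GeneralSearch search = new GeneralSearch(Server.MapPath("~/App_Data/SearchIndex"));
DataTable results = search.DoSearch(Request.QueryString["q"], 10, 1);   // 10 hits per page, page 1
if (results != null)
{
    lblCount.Text = search.TotalResults + " results";    // assumed Label control
    rptResults.DataSource = results;                      // assumed Repeater control
    rptResults.DataBind();
}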

HtmlDocument.cs

using System;
using System.Collections;
using System.Text.RegularExpressions;
namespace Core.Utils.Html
{
/// <summary>
/// This code was taken from:
/// http://www.searcharoo.net/SearcharooV5/
/// 
/// Storage for parsed HTML data returned by ParsedHtmlData();
/// </summary>
/// <remarks>
/// Arbitrary class to encapsulate just the properties we need 
/// to index Html pages (Title, Meta tags, Keywords, etc).
/// A 'generic' search engine would probably have a 'generic'
/// document class, so maybe a future version of Searcharoo 
/// will too...
/// </remarks>
public class HtmlDocument
{
#region Private fields: _Uri, _ContentType, _RobotIndexOK, _RobotFollowOK
private int _SummaryCharacters = 350;
private string _IgnoreRegionTagNoIndex = "";
private string _All = String.Empty;
private Uri _Uri;
private String _ContentType;
private string _Extension;
private bool _RobotIndexOK = true;
private bool _RobotFollowOK = true;
private string _WordsOnly = string.Empty;
/// <summary>MimeType so we know whether to try and parse the contents, eg. "text/html", "text/plain", etc</summary>
private string _MimeType = String.Empty;
/// <summary>Html &lt;title&gt; tag</summary>
private String _Title = String.Empty;
/// <summary>Html &lt;meta http-equiv='description'&gt; tag</summary>
private string _Description = String.Empty;
/// <summary>Length as reported by the server in the Http headers</summary>
private long _Length;
#endregion
public ArrayList LocalLinks;
public ArrayList ExternalLinks;
#region Public Properties: Uri, RobotIndexOK
/// <summary>
/// http://www.ietf.org/rfc/rfc2396.txt
/// </summary>
public Uri Uri
{
get { return _Uri; }
set
{
_Uri = value;
}
}
/// <summary>
/// Whether a robot should index the text 
/// found on this page, or just ignore it
/// </summary>
/// <remarks>
/// Set when page META tags are parsed - no 'set' property
/// More info:
/// http://www.robotstxt.org/
/// </remarks>
public bool RobotIndexOK
{
get { return _RobotIndexOK; }
}
/// <summary>
/// Whether a robot should follow any links 
/// found on this page, or just ignore them
/// </summary>
/// <remarks>
/// Set when page META tags are parsed - no 'set' property
/// More info:
/// http://www.robotstxt.org/
/// </remarks>
public bool RobotFollowOK
{
get { return _RobotFollowOK; }
}
public string Title
{
get { return _Title; }
set { _Title = value; }
}
/// <summary>
/// Whether to ignore sections of HTML wrapped in a special comment tag
/// </summary>
public bool IgnoreRegions
{
get { return _IgnoreRegionTagNoIndex.Length > 0; }
}
public string ContentType
{
get
{
return _ContentType;
}
set
{
_ContentType = value.ToString();
string[] contentTypeArray = _ContentType.Split(';');
// Set MimeType if it's blank
if (_MimeType == String.Empty && contentTypeArray.Length >= 1)
{
_MimeType = contentTypeArray[0];
}
// Set Encoding if it's blank
if (Encoding == String.Empty && contentTypeArray.Length >= 2)
{
int charsetpos = contentTypeArray[1].IndexOf("charset");
if (charsetpos > 0)
{
Encoding = contentTypeArray[1].Substring(charsetpos + 8, contentTypeArray[1].Length - charsetpos - 8);
}
}
}
}
public string MimeType
{
get { return _MimeType; }
set { _MimeType = value; }
}
public string Extension
{
get { return _Extension; }
set { _Extension = value; }
}
#endregion
#region Public fields: Encoding, Keywords, All
/// <summary>Encoding eg. "utf-8", "Shift_JIS", "iso-8859-1", "gb2312", etc</summary>
public string Encoding = String.Empty;
/// <summary>Html &lt;meta http-equiv='keywords'&gt; tag</summary>
public string Keywords = String.Empty;
/// <summary>
/// Raw content of page, as downloaded from the server
/// Html stripped to make up the 'wordsonly'
/// </summary>
public string Html
{
get { return _All; }
set
{
_All = value;
_WordsOnly = StripHtml(_All);
}
}
public string WordsOnly
{
get { return this.Keywords + this._Description + this._WordsOnly; }
}
public virtual long Length
{
get { return _Length; }
set { _Length = value; }
}
public string Description
{
get
{
// ### If no META DESC, grab start of file text ###
if (String.Empty == this._Description)
{
if (_WordsOnly.Length > _SummaryCharacters)
{
_Description = _WordsOnly.Substring(0, _SummaryCharacters);
}
else
{
_Description = WordsOnly;
}
_Description = Regex.Replace(_Description, @"\s+", " ").Trim();
}
// http://authors.aspalliance.com/stevesmith/articles/removewhitespace.asp
return _Description;
}
set
{
_Description = Regex.Replace(value, @"\s+", " ").Trim();
}
}
#endregion
#region Public Methods: SetRobotDirective, ToString()
/// <summary>
/// Pass in a ROBOTS meta tag found while parsing, 
/// and set HtmlDocument property/ies appropriately
/// </summary>
/// <remarks>
/// More info:
/// * Robots Exclusion Protocol *
/// - for META tags
/// http://www.robotstxt.org/wc/meta-user.html
/// - for ROBOTS.TXT in the siteroot
/// http://www.robotstxt.org/wc/norobots.html
/// </remarks>
public void SetRobotDirective(string robotMetaContent)
{
robotMetaContent = robotMetaContent.ToLower();
if (robotMetaContent.IndexOf("none") >= 0)
{
// 'none' means you can't Index or Follow!
_RobotIndexOK = false;
_RobotFollowOK = false;
}
else
{
if (robotMetaContent.IndexOf("noindex") >= 0) { _RobotIndexOK = false; }
if (robotMetaContent.IndexOf("nofollow") >= 0) { _RobotFollowOK = false; }
}
}
/// <summary>
/// For debugging - output all links found in the page
/// </summary>
public override string ToString()
{
string linkstring = "";
foreach (object link in LocalLinks)
{
linkstring += Convert.ToString(link) + "\r\n";
}
return Title + "\r\n" + Description + "\r\n----------------\r\n" + linkstring + "\r\n----------------\r\n" + Html + "\r\n======================\r\n";
}
#endregion
/// <summary>
///
/// </summary>
/// <remarks>
/// "Original" link search Regex used by the code was from here
/// http://www.dotnetjunkies.com/Tutorial/1B219C93-7702-4ADF-9106-DFFDF90914CF.dcik
/// but it was not sophisticated enough to match all tag permutations
///
/// whereas the Regex on this blog will parse ALL attributes from within tags...
/// IMPORTANT when they're out of order, spaced out or over multiple lines
/// http://blogs.worldnomads.com.au/matthewb/archive/2003/10/24/158.aspx
/// http://blogs.worldnomads.com.au/matthewb/archive/2004/04/06/215.aspx
///
/// http://www.experts-exchange.com/Programming/Programming_Languages/C_Sharp/Q_20848043.html
/// </remarks>
public void Parse()
{
string htmlData = this.Html;    // htmlData will be munged
//xenomouse http://www.codeproject.com/aspnet/Spideroo.asp?msg=1271902#xx1271902xx
if (string.IsNullOrEmpty(this.Title))
{   // title may have been set previously... non-HTML file type (this will be refactored out, later)
this.Title = Regex.Match(htmlData, @"(?<=<title[^\>]*>).*?(?=</title>)",
RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture).Value;
}
string metaKey = String.Empty, metaValue = String.Empty;
foreach (Match metamatch in Regex.Matches(htmlData
, @"<meta\s*(?:(?:\b(\w|-)+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)/?\s*>"
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
metaKey = String.Empty;
metaValue = String.Empty;
// Loop through the attribute/value pairs inside the tag
foreach (Match submetamatch in Regex.Matches(metamatch.Value.ToString()
, @"(?<name>\b(\w|-)+\b)\s*=\s*(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> ]+)\s*)+"
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
if ("http-equiv" == submetamatch.Groups[1].ToString().ToLower())
{
metaKey = submetamatch.Groups[2].ToString();
}
if (("name" == submetamatch.Groups[1].ToString().ToLower())
&& (metaKey == String.Empty))
{ // if it's already set, HTTP-EQUIV takes precedence
metaKey = submetamatch.Groups[2].ToString();
}
if ("content" == submetamatch.Groups[1].ToString().ToLower())
{
metaValue = submetamatch.Groups[2].ToString();
}
}
switch (metaKey.ToLower())
{
case "description":
this.Description = metaValue;
break;
case "keywords":
case "keyword":
this.Keywords = metaValue;
break;
case "robots":
case "robot":
this.SetRobotDirective(metaValue);
break;
}
//                ProgressEvent(this, new ProgressEventArgs(4, metaKey + " = " + metaValue));
}
string link = String.Empty;
ArrayList linkLocal = new ArrayList();
ArrayList linkExternal = new ArrayList();
// http://msdn.microsoft.com/library/en-us/script56/html/js56jsgrpregexpsyntax.asp
// original Regex, just found <a href=""> links; and was "broken" by spaces, out-of-order, etc
// @"(?<=<a\s+href="").*?(?=""\s*/?>)"
// Looks for the src attribute of:
// <A> anchor tags
// <AREA> imagemap links
// <FRAME> frameset links
// <IFRAME> floating frames
foreach (Match match in Regex.Matches(htmlData
, @"(?<anchor><\s*(a|area|frame|iframe)\s*(?:(?:\b\w+\b\s*(?:=\s*(?:""[^""]*""|'[^']*'|[^""'<> ]+)\s*)?)*)?\s*>)"
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
// Parse ALL attributes from within tags... IMPORTANT when they're out of order!!
// in addition to the 'href' attribute, there might also be 'alt', 'class', 'style', 'area', etc...
// there might also be 'spaces' between the attributes and they may be ", ', or unquoted
link = String.Empty;
//                ProgressEvent(this, new ProgressEventArgs(4, "Match:" + System.Web.HttpUtility.HtmlEncode(match.Value) + ""));
foreach (Match submatch in Regex.Matches(match.Value.ToString()
, @"(?<name>\b\w+\b)\s*=\s*(""(?<value>[^""]*)""|'(?<value>[^']*)'|(?<value>[^""'<> \s]+)\s*)+"
, RegexOptions.IgnoreCase | RegexOptions.ExplicitCapture))
{
// we're only interested in the href attribute (although in future maybe index the 'alt'/'title'?)
//                    ProgressEvent(this, new ProgressEventArgs(4, "Submatch: " + submatch.Groups[1].ToString() + "=" + submatch.Groups[2].ToString() + ""));
if ("href" == submatch.Groups[1].ToString().ToLower())
{
link = submatch.Groups[2].ToString();
if (link != "#") break; // break if this isn't just a placeholder href="#", which implies maybe an onclick attribute exists
}
if ("onclick" == submatch.Groups[1].ToString().ToLower())
{   // maybe try to parse some javascript in here
string jscript = submatch.Groups[2].ToString();
// some code here to extract a filename/link to follow from the onclick="_____"
// say it was onclick="window.location='top.htm'"
int firstApos = jscript.IndexOf("'");
int secondApos = jscript.IndexOf("'", firstApos + 1);
if (secondApos > firstApos)
{
link = jscript.Substring(firstApos + 1, secondApos - firstApos - 1);
break;  // break if we found something, ignoring any later href="" which may exist _after_ the onclick in the <a> element
}
}
}
// strip off internal links, so we don't index same page over again
if (link.IndexOf("#") > -1)
{
link = link.Substring(0, link.IndexOf("#"));
}
if (link.IndexOf("javascript:") == -1
&& link.IndexOf("mailto:") == -1
&& !link.StartsWith("#")
&& link != String.Empty)
{
if ((link.Length > 8) && (link.StartsWith("http://")
|| link.StartsWith("https://")
|| link.StartsWith("file://")
|| link.StartsWith("//")
|| link.StartsWith(@"\\")))
{
linkExternal.Add(link);
//                        ProgressEvent(this, new ProgressEventArgs(4, "External link: " + link));
}
else if (link.StartsWith("?"))
{
// it's possible to have /?query which sends the querystring to the
// 'default' page in a directory
linkLocal.Add(this.Uri.AbsolutePath + link);
//                        ProgressEvent(this, new ProgressEventArgs(4, "? Internal default page link: " + link));
}
else
{
linkLocal.Add(link);
//                        ProgressEvent(this, new ProgressEventArgs(4, "I Internal link: " + link));
}
} // add each link to a collection
} // foreach
this.LocalLinks = linkLocal;
this.ExternalLinks = linkExternal;
} // Parse
/// <summary>
/// Stripping HTML
/// http://www.4guysfromrolla.com/webtech/042501-1.shtml
/// </summary>
/// <remarks>
/// Using regex to find tags without a trailing slash
/// http://concepts.waetech.com/unclosed_tags/index.cfm
///
/// http://msdn.microsoft.com/library/en-us/script56/html/js56jsgrpregexpsyntax.asp
///
/// Replace html comment tags
/// http://www.faqts.com/knowledge_base/view.phtml/aid/21761/fid/53
/// </remarks>
protected string StripHtml(string Html)
{
//Strips the <script> tags from the Html
string scriptregex = @"<scr" + @"ipt[^>.]*>[\s\S]*?</sc" + @"ript>";
System.Text.RegularExpressions.Regex scripts = new System.Text.RegularExpressions.Regex(scriptregex, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture);
string scriptless = scripts.Replace(Html, " ");
//Strips the <style> tags from the Html
string styleregex = @"<style[^>.]*>[\s\S]*?</style>";
System.Text.RegularExpressions.Regex styles = new System.Text.RegularExpressions.Regex(styleregex, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture);
string styleless = styles.Replace(scriptless, " ");
//Strips the <NOSEARCH> tags from the Html (where NOSEARCH is set in the web.config/Preferences class)
//TODO: NOTE: this only applies to INDEXING the text - links are parsed before now, so they aren't "excluded" by the region!! (yet)
string ignoreless = string.Empty;
if (IgnoreRegions)
{
string noSearchStartTag = "<!--" + _IgnoreRegionTagNoIndex + "-->";
string noSearchEndTag = "<!--/" + _IgnoreRegionTagNoIndex + "-->";
string ignoreregex = noSearchStartTag + @"[\s\S]*?" + noSearchEndTag;
System.Text.RegularExpressions.Regex ignores = new System.Text.RegularExpressions.Regex(ignoreregex, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture);
ignoreless = ignores.Replace(styleless, " ");
}
else
{
ignoreless = styleless;
}
//Strips the <!--comment--> tags from the Html
//string commentregex = @"<!\-\-.*?\-\->";        // alternate suggestion from antonello franzil
string commentregex = @"<!(?:--[\s\S]*?--\s*)?>";
System.Text.RegularExpressions.Regex comments = new System.Text.RegularExpressions.Regex(commentregex, RegexOptions.IgnoreCase | RegexOptions.Multiline | RegexOptions.ExplicitCapture);
string commentless = comments.Replace(ignoreless, " ");
//Strips the HTML tags from the Html
System.Text.RegularExpressions.Regex objRegExp = new System.Text.RegularExpressions.Regex("<(.|\n)+?>", RegexOptions.IgnoreCase);
//Replace all HTML tag matches with the empty string
string output = objRegExp.Replace(commentless, " ");
//Replace all _remaining_ < and > with &lt; and &gt;
output = output.Replace("<", "&lt;");
output = output.Replace(">", "&gt;");
objRegExp = null;
return output;
}
}
}
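
Used on its own the class is simple: assign the raw markup to Html, call Parse(), and read the extracted properties. The markup below is made up purely for illustration:

// Illustrative use of HtmlDocument on a hand-made page.
HtmlDocument doc = new HtmlDocument();
doc.Uri = new Uri("http://www.example.com/test.aspx");   // used to resolve "?query" style links during Parse()
doc.Html = "<html><head><title>Test</title></head><body><a href=\"about.aspx\">About</a> Hello world</body></html>";
doc.Parse();
Console.WriteLine(doc.Title);              // "Test"
Console.WriteLine(doc.WordsOnly);          // the tag-stripped page text
Console.WriteLine(doc.LocalLinks.Count);   // 1 (about.aspx)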

HtmlDownloader.cs

using System;
namespace Core.Utils.Html
{
public static class HtmlDownloader
{
private static string _UserAgent = "Mozilla/6.0 (MSIE 6.0; Windows NT 5.1; ThePlanCollection.com; robot)";
private static int _RequestTimeout = 5;
private static System.Net.CookieContainer _CookieContainer = new System.Net.CookieContainer();
/// <summary>
/// Attempts to download the Uri into the current document.
/// </summary>
/// <remarks>
/// http://www.123aspx.com/redir.aspx?res=28320
/// </remarks>
public static HtmlDocument Download(string url)
{
Uri uri = new Uri(url);
HtmlDocument doc = null;
// Open the requested URL
System.Net.HttpWebRequest req = (System.Net.HttpWebRequest)System.Net.WebRequest.Create(uri.AbsoluteUri);
req.AllowAutoRedirect = true;
req.MaximumAutomaticRedirections = 3;
req.UserAgent = _UserAgent; //"Mozilla/6.0 (MSIE 6.0; Windows NT 5.1; Searcharoo.NET)";
req.KeepAlive = true;
req.Timeout = _RequestTimeout * 1000; //prefRequestTimeout 
// SIMONJONES http://codeproject.com/aspnet/spideroo.asp?msg=1421158#xx1421158xx
req.CookieContainer = new System.Net.CookieContainer();
req.CookieContainer.Add(_CookieContainer.GetCookies(uri));
// Get the stream from the returned web response
System.Net.HttpWebResponse webresponse = null;
try
{
webresponse = (System.Net.HttpWebResponse)req.GetResponse();
}
catch(Exception ex)
{
webresponse = null;
Console.Write("request for url failed: {0} {1}", url, ex.Message);
}
if (webresponse != null)
{
webresponse.Cookies = req.CookieContainer.GetCookies(req.RequestUri);
// handle cookies (need to do this in case we have any session cookies)
foreach (System.Net.Cookie retCookie in webresponse.Cookies)
{
bool cookieFound = false;
foreach (System.Net.Cookie oldCookie in _CookieContainer.GetCookies(uri))
{
if (retCookie.Name.Equals(oldCookie.Name))
{
oldCookie.Value = retCookie.Value;
cookieFound = true;
}
}
if (!cookieFound)
{
_CookieContainer.Add(retCookie);
}
}
doc = new HtmlDocument();
doc.MimeType = ParseMimeType(webresponse.ContentType.ToString()).ToLower();
doc.ContentType = webresponse.ContentType.ToLower();
doc.Extension = ParseExtension(uri.AbsoluteUri);
string enc = "utf-8"; // default
if (webresponse.ContentEncoding != String.Empty)
{
// Use the HttpHeader Content-Type in preference to the one set in META
doc.Encoding = webresponse.ContentEncoding;
}
else if (doc.Encoding == String.Empty)
{
doc.Encoding = enc; // default
}
//http://www.c-sharpcorner.com/Code/2003/Dec/ReadingWebPageSources.asp
System.IO.StreamReader stream = new System.IO.StreamReader
(webresponse.GetResponseStream(), System.Text.Encoding.GetEncoding(doc.Encoding));
doc.Uri = webresponse.ResponseUri; // we *may* have been redirected... and we want the *final* URL
doc.Length = webresponse.ContentLength;
doc.Html = stream.ReadToEnd();
stream.Close();
doc.Parse();
webresponse.Close();
}
return doc;
}
#region Private Methods: ParseExtension, ParseMimeType, ParseEncoding
private static string ParseExtension(string filename)
{
return System.IO.Path.GetExtension(filename).ToLower();
}
private static string ParseMimeType(string contentType)
{
string mimeType = string.Empty;
string[] contentTypeArray = contentType.Split(';');
// Set MimeType if it's blank
if (mimeType == String.Empty && contentTypeArray.Length >= 1)
{
mimeType = contentTypeArray[0];
}
return mimeType;
}
private static string ParseEncoding(string contentType)
{
string encoding = string.Empty;
string[] contentTypeArray = contentType.Split(';');
// Set Encoding if it's blank
if (encoding == String.Empty && contentTypeArray.Length >= 2)
{
int charsetpos = contentTypeArray[1].IndexOf("charset");
if (charsetpos > 0)
{
encoding = contentTypeArray[1].Substring(charsetpos + 8, contentTypeArray[1].Length - charsetpos - 8);
}
}
return encoding;
}
#endregion
}
}
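
And a small sketch of how the downloader and document classes work together to crawl outward from a page (the starting url is only an example):

// Illustrative crawl of a single page and the local links it exposes.
HtmlDocument page = HtmlDownloader.Download("http://www.example.com/");
if (page != null && page.RobotIndexOK)
{
    Console.WriteLine(page.Title);
    if (page.RobotFollowOK)
    {
        foreach (object link in page.LocalLinks)
        {
            Console.WriteLine("local link: " + link);    // candidates for the next Download or AddWebPage call
        }
    }
}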
