[转载]HTMLParser1.6 源代码阅读

[转载]HTMLParser1.6 源代码阅读 – S面B面 – 博客园.

看到博客园的大牛们都喜欢发系列的文章,我也发一篇。不过我不打算写什么spring hibernate配置什么的,我只想写写自己阅读别人代码的一些笔记。欢迎大家拍砖。

 

从开始进行阅读,第一个包是:org.htmlparser.里面的类包括

Attribute.java

Node.java

NodeFactory.java

NodeFilter.java

Parser.java

PrototypicalNodeFactory.java

Remark.java

Tag.java

Text.java

可以看出都是针对基本数据结构的类。

一个一个进行分析,Attribute.java是记录网页元素的属性的,

                                <a href="EditPosts.aspx" id="TabPosts">随笔</a>

这个是博客园的,那Attribute应当可以记录 href=“MySubscibe.aspx”这样的元素。看他的构造方法:

public Attribute (String name, String assignment, String value, char quote)
    {
        setName (name);
        setAssignment (assignment);
        if (0 == quote)
            setRawValue (value);
        else
        {
            setValue (value);
            setQuote (quote);
        }
    }

看见分为是否含有引号的情况。setName,和setValue没有什么可以看的,如果不含引号的情况,value怎么设置

/**
     * Set the value of the attribute and the quote character.
     * If the value is pure whitespace, assign it 'as is' and reset the
     * quote character. If not, check for leading and trailing double or
     * single quotes, and if found use this as the quote character and
     * the inner contents of <code>value</code> as the real value.
     * Otherwise, examine the string to determine if quotes are needed
     * and an appropriate quote character if so. This may involve changing
     * double quotes within the string to character references.
     * @param value The new value.
     * @see #getRawValue
     * @see #getRawValue(StringBuffer)
     */
    public void setRawValue (String value)
    {
        char ch;
        boolean needed;
        boolean singleq;
        boolean doubleq;
        String ref;
        StringBuffer buffer;
        char quote;

        quote = 0;
        if ((null != value) && (0 != value.trim ().length ()))
        {
            if (value.startsWith ("'") && value.endsWith ("'")
                && (2 <= value.length ()))
            {
                quote = '\'';
                value = value.substring (1, value.length () - 1);
            }
            else if (value.startsWith ("\"") && value.endsWith ("\"")
                && (2 <= value.length ()))
            {
                quote = '"';
                value = value.substring (1, value.length () - 1);
            }
            else
            {
                // first determine if there's whitespace in the value
                // and while we're at it find a suitable quote character
                needed = false;
                singleq = true;
                doubleq = true;
                for (int i = 0; i < value.length (); i++)
                {
                    ch = value.charAt (i);
                    if ('\'' == ch)
                    {
                        singleq  = false;
                        needed = true;
                    }
                    else if ('"' == ch)
                    {
                        doubleq = false;
                        needed = true;
                    }
                    else if (!('-' == ch) && !('.' == ch) && !('_' == ch)
                       && !(':' == ch) && !Character.isLetterOrDigit (ch))
                    {
                        needed = true;
                    }
                }

                // now apply quoting
                if (needed)
                {
                    if (doubleq)
                        quote = '"';
                    else if (singleq)
                        quote = '\'';
                    else
                    {
                        // uh-oh, we need to convert some quotes into character
                        // references, so convert all double quotes into &#34;
                        quote = '"';
                        ref = "&quot;"; // Translate.encode (quote);
                        // JDK 1.4: value = value.replaceAll ("\"", ref);
                        buffer = new StringBuffer (
                                value.length() * (ref.length () - 1));
                        for (int i = 0; i < value.length (); i++)
                        {
                            ch = value.charAt (i);
                            if (quote == ch)
                                buffer.append (ref);
                            else
                                buffer.append (ch);
                        }
                        value = buffer.toString ();
                    }
                }
            }
        }
        setValue (value);
        setQuote (quote);
    }

如果没有设置分割字符的话,需要进行判断,首先判断value的字符是哪种?单引号,双引号,还是其他。如果是单引号开头,单引号结尾,那么分割是单引号。如果是双引号,就是双引号。如果不是这样,可能需要修复,如果里面有单引号,那么我们用双引号进行包装,里面含有双引号,我们用单引号进行包装。或者其中含有一些特别的字符(不是数字,字符,-,_,:),我们需要用引号引用起来。这样属性就可以保存下来。

属性暂时分析到这里,有兴趣的可以自己阅读剩下的部分。

Node.java

其实htmlParser就是一个词法语法分析器,学过自动机的同学应当对此很熟悉,(ps,本人自动机挂掉了。。)而HTML的元素有三种类型,text,Tag,Remark(remark是不是这种,这个不确定)。 我们进行语法解析的时候肯定要返回相应的node,那这个node应当设计成抽象类或者接口,的确,也是这样设计的。看代码

package org.htmlparser;

import org.htmlparser.lexer.Page;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.visitors.NodeVisitor;

public interface Node
    extends
        Cloneable
{
    
    String toPlainTextString ();  
    String toHtml ();
    String toHtml (boolean verbatim);
    String toString ();    
    void collectInto (NodeList list, NodeFilter filter);   
    int getStartPosition ();    
    void setStartPosition (int position);   
    int getEndPosition ();   
    void setEndPosition (int position);   
    Page getPage ();
    void setPage (Page page); 
    void accept (NodeVisitor visitor);   
    Node getParent ();
    void setParent (Node node);
    NodeList getChildren ();  
    void setChildren (NodeList children);
    Node getFirstChild ();
    Node getLastChild (); 
    Node getPreviousSibling ();      
    Node getNextSibling ();     
    String getText ();  
    void setText (String text); 
    void doSemanticAction ()
        throws
            ParserException;  
    Object clone ()
        throws
            CloneNotSupportedException;
}

这里光看这个也没办法领略精髓,大致就是一个开始克隆,可以转换成html元素,转换成string的类型,并且可以迭代的一系列方法。不过下一步我们应当从lexer中寻找相关答案,保持我们的阅读顺序,我们继续进行下一个java分析

NodeFactory

先看代码

package org.htmlparser;

import java.util.Vector;

import org.htmlparser.lexer.Page;
import org.htmlparser.util.ParserException;

public interface NodeFactory
{
   
    Text createStringNode (Page page, int start, int end)
        throws
            ParserException;
 
    Remark createRemarkNode (Page page, int start, int end)
        throws
            ParserException;
   
    Tag createTagNode (Page page, int start, int end, Vector attributes)
        throws
            ParserException;
}

就是创建上面三种HTML页面的基本元素。

NodeFilter

package org.htmlparser;

import java.io.Serializable;

/**
 * Implement this interface to select particular nodes.
 */
public interface NodeFilter
    extends
        Serializable,
        Cloneable
{

    boolean accept (Node node);
}

进行一个Node的合理性验证

Parser

终于看到重点了

package org.htmlparser;

import java.io.Serializable;
import java.net.HttpURLConnection;
import java.net.URLConnection;

import org.htmlparser.filters.TagNameFilter;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.http.ConnectionManager;
import org.htmlparser.http.ConnectionMonitor;
import org.htmlparser.http.HttpHeader;
import org.htmlparser.lexer.Lexer;
import org.htmlparser.lexer.Page;
import org.htmlparser.util.DefaultParserFeedback;
import org.htmlparser.util.IteratorImpl;
import org.htmlparser.util.NodeIterator;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;
import org.htmlparser.util.ParserFeedback;
import org.htmlparser.visitors.NodeVisitor;


public class Parser
    implements
        Serializable,
        ConnectionMonitor
{
   
    public static final double
    VERSION_NUMBER = 1.6
    ;

   
    public static final String
    VERSION_TYPE = "Release Build"
    ;

   
    public static final String
    VERSION_DATE = "Jun 10, 2006"
    ;

   
    public static final String VERSION_STRING =
            "" + VERSION_NUMBER
            + " (" + VERSION_TYPE + " " + VERSION_DATE + ")";

  
    protected ParserFeedback mFeedback;

    protected Lexer mLexer;

    public static final ParserFeedback DEVNULL =
        new DefaultParserFeedback (DefaultParserFeedback.QUIET);

    public static final ParserFeedback STDOUT = new DefaultParserFeedback ();

    static
    {
        getConnectionManager ().getDefaultRequestProperties ().put (
            "User-Agent", "HTMLParser/" + getVersionNumber ());
    
    }


    public static String getVersion ()
    {
        return (VERSION_STRING);
    }

    public static double getVersionNumber ()
    {
        return (VERSION_NUMBER);
    }

    public static ConnectionManager getConnectionManager ()
    {
        return (Page.getConnectionManager ());
    }

    public static void setConnectionManager (ConnectionManager manager)
    {
        Page.setConnectionManager (manager);
    }

    public static Parser createParser (String html, String charset)
    {
        Parser ret;

        if (null == html)
            throw new IllegalArgumentException ("html cannot be null");
        ret = new Parser (new Lexer (new Page (html, charset)));

        return (ret);
    }


    public Parser ()
    {
        this (new Lexer (new Page ("")), DEVNULL);
    }

    public Parser (Lexer lexer, ParserFeedback fb)
    {
        setFeedback (fb);
        setLexer (lexer);
        setNodeFactory (new PrototypicalNodeFactory ());
    }

    public Parser (URLConnection connection, ParserFeedback fb)
        throws
            ParserException
    {
        this (new Lexer (connection), fb);
    }

    public Parser (String resource, ParserFeedback feedback)
        throws
            ParserException
    {
        setFeedback (feedback);
        setResource (resource);
        setNodeFactory (new PrototypicalNodeFactory ());
    }

    public Parser (String resource) throws ParserException
    {
        this (resource, STDOUT);
    }

    public Parser (Lexer lexer)
    {
        this (lexer, STDOUT);
    }

    public Parser (URLConnection connection) throws ParserException
    {
        this (connection, STDOUT);
    }

    public void setResource (String resource)
        throws
            ParserException
    {
        int length;
        boolean html;
        char ch;

        if (null == resource)
            throw new IllegalArgumentException ("resource cannot be null");
        length = resource.length ();
        html = false;
        for (int i = 0; i < length; i++)
        {
            ch = resource.charAt (i);
            if (!Character.isWhitespace (ch))
            {
                if ('<' == ch)
                    html = true;
                break;
            }
        }
        if (html)
            setLexer (new Lexer (new Page (resource)));
        else
            setLexer (new Lexer (getConnectionManager ().openConnection (resource)));
    }

    public void setConnection (URLConnection connection)
        throws
            ParserException
    {
        if (null == connection)
            throw new IllegalArgumentException ("connection cannot be null");
        setLexer (new Lexer (connection));
    }

    public URLConnection getConnection ()
    {
        return (getLexer ().getPage ().getConnection ());
    }


    public void setURL (String url)
        throws
            ParserException
    {
        if ((null != url) && !"".equals (url))
            setConnection (getConnectionManager ().openConnection (url));
    }

    public String getURL ()
    {
        return (getLexer ().getPage ().getUrl ());
    }

    public void setEncoding (String encoding)
        throws
            ParserException
    {
        getLexer ().getPage ().setEncoding (encoding);
    }

    public String getEncoding ()
    {
        return (getLexer ().getPage ().getEncoding ());
    }


    public void setLexer (Lexer lexer)
    {
        NodeFactory factory;
        String type;

        if (null == lexer)
            throw new IllegalArgumentException ("lexer cannot be null");
        // move a node factory that's been set to the new lexer
        factory = null;
        if (null != getLexer ())
            factory = getLexer ().getNodeFactory ();
        if (null != factory)
            lexer.setNodeFactory (factory);
        mLexer = lexer;
        // warn about content that's not likely text
        type = mLexer.getPage ().getContentType ();
        if (type != null && !type.startsWith ("text"))
            getFeedback ().warning (
                "URL "
                + mLexer.getPage ().getUrl ()
                + " does not contain text");
    }

    public Lexer getLexer ()
    {
        return (mLexer);
    }

    public NodeFactory getNodeFactory ()
    {
        return (getLexer ().getNodeFactory ());
    }

    public void setNodeFactory (NodeFactory factory)
    {
        if (null == factory)
            throw new IllegalArgumentException ("node factory cannot be null");
        getLexer ().setNodeFactory (factory);
    }

 
    public void setFeedback (ParserFeedback fb)
    {
        if (null == fb)
            mFeedback = DEVNULL;
        else
            mFeedback = fb;
    }

    public ParserFeedback getFeedback()
    {
        return (mFeedback);
    }

    public void reset ()
    {
        getLexer ().reset ();
    }

 
    public NodeIterator elements () throws ParserException
    {
        return (new IteratorImpl (getLexer (), getFeedback ()));
    }
   
    public NodeList parse (NodeFilter filter) throws ParserException
    {
        NodeIterator e;
        Node node;
        NodeList ret;

        ret = new NodeList ();
        for (e = elements (); e.hasMoreNodes (); )
        {
            node = e.nextNode ();
            if (null != filter)
                node.collectInto (ret, filter);
            else
                ret.add (node);
        }

        return (ret);
    }

    public void visitAllNodesWith (NodeVisitor visitor) throws ParserException
    {
        Node node;
        visitor.beginParsing();
        for (NodeIterator e = elements(); e.hasMoreNodes(); )
        {
            node = e.nextNode();
            node.accept(visitor);
        }
        visitor.finishedParsing();
    }

    public void setInputHTML (String inputHTML)
        throws
            ParserException
    {
        if (null == inputHTML)
            throw new IllegalArgumentException ("html cannot be null");
        if (!"".equals (inputHTML))
            setLexer (new Lexer (new Page (inputHTML)));
    }

    public NodeList extractAllNodesThatMatch (NodeFilter filter)
        throws
            ParserException
    {
        NodeIterator e;
        NodeList ret;

        ret = new NodeList ();
        for (e = elements (); e.hasMoreNodes (); )
            e.nextNode ().collectInto (ret, filter);

        return (ret);
    }


    public void preConnect (HttpURLConnection connection)
        throws
            ParserException
    {
        getFeedback ().info (HttpHeader.getRequestHeader (connection));
    }


    public void postConnect (HttpURLConnection connection)
        throws
            ParserException
    {
        getFeedback ().info (HttpHeader.getResponseHeader (connection));
    }

    public static void main (String [] args)
    {
        Parser parser;
        NodeFilter filter;

        if (args.length < 1 || args[0].equals ("-help"))
        {
            System.out.println ("HTML Parser v" + getVersion () + "\n");
            System.out.println ();
            System.out.println ("Syntax : java -jar htmlparser.jar"
                    + " <file/page> [type]");
            System.out.println ("   <file/page> the URL or file to be parsed");
            System.out.println ("   type the node type, for example:");
            System.out.println ("     A - Show only the link tags");
            System.out.println ("     IMG - Show only the image tags");
            System.out.println ("     TITLE - Show only the title tag");
            System.out.println ();
            System.out.println ("Example : java -jar htmlparser.jar"
                    + " http://www.yahoo.com");
            System.out.println ();
        }
        else
            try
            {
                parser = new Parser ();
                if (1 < args.length)
                    filter = new TagNameFilter (args[1]);
                else
                {
                    filter = null;
                    // for a simple dump, use more verbose settings
                    parser.setFeedback (Parser.STDOUT);
                    getConnectionManager ().setMonitor (parser);
                }
                getConnectionManager ().setRedirectionProcessingEnabled (true);
                getConnectionManager ().setCookieProcessingEnabled (true);
                parser.setResource (args[0]);
                System.out.println (parser.parse (filter));
            }
            catch (ParserException e)
            {
                e.printStackTrace ();
            }
    }
}

首先我们要看看构造方法:通常,我们使用HTMLParser的时候是这样new的 Parser a = Parser.createParser(content,”UTF-8″);而这个方法就是通过new Parser(New Lexer(new Page(html,charset)));方法创建一个Parser,也就是核心是Page对象,Lexer对象和Parser对象。

看其他的构造方法,空构造方法,我们略去。

public Parser(Lexer lexer,ParserFeedback fb),里面无非就是设置词法解析器,设置异常信息接收器,和设置一个nodeFactory。

public Parser (URLConnection connection, ParserFeedback fb)网页
public Parser (String resource, ParserFeedback feedback) 对内容进行解析

public Parser (String resource)

public Parser (Lexer lexer)

public Parser (URLConnection connection)

无非就是用默认的fd,之后会对FeedBack类进行相关说明。

到这里,parser完成的无非就是对给定的url,或者content,或者urlconnection对象,创建相应的词法分析器,feedback,和根据条件创建相应的Page对象。同时,我们看到Nodefactory实际上的设置是在Lexer中进行的。这里先不管。

这里有几个我们经常用的方法

public NodeList parse (NodeFilter filter) throws ParserException 根据过滤条件返回相应的nodelist

public void visitAllNodesWith (NodeVisitor visitor) throws ParserException 运用迭代器的方式进行遍历,这里涉及了NodeVisitor,IteratorImpl 这里还没有看。暂时不知道为什么这样设计。

public void setInputHTML (String inputHTML) 可以对html元素进行解析

public NodeList extractAllNodesThatMatch (NodeFilter filter) 根据过滤条件返回满足条件的NodeList

最后是测试方法。parser类读完了,其实parser类就是一个大的入口,将与页面相关的信息传递给parser类,parser调用其他类对其进行解析,返回nodelist。nodelist里面有若干node,而实例化的node里面存有我们相用的各种信息,到这里没有看到词法分析器的影子。。

PrototypicalNodeFactory

对Text,Remark,Tag进行标准化的约束。这个类暂时不做过多介绍,等用到的时候进行相关解释。

Remark

Text

Tag 这三个接口类,实现node,为具体的Node提供接口。

至此,org.htmlParser 阅读完毕,看见还是比较简单的。为什么我要阅读这个代码,一时项目中用到了这个HTMLParser,因为不放心,想知道内部是如何实现的。二是,我自动机以前挂掉过,我想自己实现一个Parser。三是,我的时间和充裕。

下一章节,我们将介绍另一个核心包:org.htmlparser.lexer 。同时,我们要对html页面标准进行学习,不然如何自己实现一个Parser呢?

赞(0) 打赏
分享到: 更多 (0)

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏