出家如初,成佛有余

htmlparser encoding problem

Posted in Uncategorized by chuanliang on 2008/06/19

When using htmlparser to crawl certain pages (for example http://bbs.pcpop.com/O71228/1286458.html), an org.htmlparser.util.EncodingChangeException is thrown.

For example, running the following code (a JUnit test):

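    import org.htmlparser.NodeFilter;
    import org.htmlparser.Parser;
    import org.htmlparser.filters.NodeClassFilter;
    import org.htmlparser.tags.LinkTag;
    import org.htmlparser.util.NodeList;

    public void testLinkTag() {
        try {
            // Collect every <a> tag on the page.
            NodeFilter filter = new NodeClassFilter(LinkTag.class);
            Parser parser = new Parser();
            parser.setURL("http://bbs.pcpop.com/O71228/1286458.html");
            // Explicitly re-apply the encoding the parser has already detected.
            parser.setEncoding(parser.getEncoding());
            logger.fatal("Encoding is " + parser.getEncoding());
            NodeList list = parser.extractAllNodesThatMatch(filter);
            for (int i = 0; i < list.size(); i++) {
                LinkTag node = (LinkTag) list.elementAt(i);
                logger.fatal("testLinkTag() Link is :" + node.extractLink());
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }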

It throws the following exception:

org.htmlparser.util.EncodingChangeException: character mismatch (new: 涓 [0x6d93] != old:  [0x4e2d中]) for encoding change from UTF-8 to GB2312 at character offset 158
    at org.htmlparser.lexer.InputStreamSource.setEncoding(InputStreamSource.java:280)
    at org.htmlparser.lexer.Page.setEncoding(Page.java:865)
    at org.htmlparser.tags.MetaTag.doSemanticAction(MetaTag.java:150)
    at org.htmlparser.scanners.TagScanner.scan(TagScanner.java:69)
    at org.htmlparser.scanners.CompositeTagScanner.scan(CompositeTagScanner.java:160)
    at org.htmlparser.util.IteratorImpl.nextNode(IteratorImpl.java:92)
    at org.htmlparser.Parser.visitAllNodesWith(Parser.java:726)
    at ParserTestCase1.testImageVisitor(ParserTestCase1.java:71)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:585)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at junit.framework.TestSuite.runTest(TestSuite.java:208)
    at junit.framework.TestSuite.run(TestSuite.java:203)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
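The "character mismatch" in the message can be reproduced without htmlparser. The sketch below is only an illustration using the JDK's own charsets: it takes the UTF-8 bytes of U+4E2D (the "old" character in the message) and decodes them again as GB2312. The class name EncodingMismatchDemo and its decode helper are made up for this example, and it assumes the JVM ships the GB2312 charset, as full JDK builds normally do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of htmlparser.
final class EncodingMismatchDemo {

    // Decode the same byte sequence under the named charset.
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // U+4E2D is the "old" character reported in the exception message.
        byte[] utf8Bytes = "\u4e2d".getBytes(StandardCharsets.UTF_8);

        String asUtf8 = decode(utf8Bytes, "UTF-8");
        String asGb2312 = decode(utf8Bytes, "GB2312");

        // The same bytes yield different first characters under the two charsets,
        // which is the kind of mismatch the exception reports.
        System.out.printf("decoded as UTF-8:  U+%04X%n", (int) asUtf8.charAt(0));
        System.out.printf("decoded as GB2312: U+%04X%n", (int) asGb2312.charAt(0));
    }
}
```

Because the characters already read under one encoding no longer match when the same bytes are re-read under the other, the parser cannot switch encodings silently in mid-stream.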

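Analyzing pages of this kind shows that the main cause is a flaw in how org.htmlparser.tags.MetaTag handles the page's default encoding.

For the page http://bbs.pcpop.com/O71228/1286458.html, the default encoding declared in the page itself is gb2312: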

        <META http-equiv="Content-Type" content="text/html; charset=gb2312">

In the server's HTTP response, however, the declared encoding is utf-8, so browsers decode the page as UTF-8.

HTTP/1.x 200 OK
Date: Thu, 19 Jun 2008 03:16:53 GMT
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET
X-AspNet-Version: 2.0.50727
Cache-Control: private
Content-Type: text/html; charset=utf-8
Content-Length: 130386
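To spot pages like this before parsing them, the two declarations can be compared directly. The following is a rough sketch using only the JDK. The class CharsetDeclarations, its methods, and the regular expression are assumptions made for illustration and are not htmlparser APIs. The two input strings are the Content-Type values shown above.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not part of htmlparser.
final class CharsetDeclarations {

    // Matches "charset=<name>", allowing optional spaces and quotes.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"'>]+)", Pattern.CASE_INSENSITIVE);

    // Returns the lower-cased charset named in a Content-Type value, or null if none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toLowerCase(Locale.ROOT) : null;
    }

    // True when both values name a charset and the two names differ.
    static boolean conflict(String headerValue, String metaContent) {
        String fromHeader = charsetOf(headerValue);
        String fromMeta = charsetOf(metaContent);
        return fromHeader != null && fromMeta != null && !fromHeader.equals(fromMeta);
    }

    public static void main(String[] args) {
        String header = "text/html; charset=utf-8";   // from the HTTP response
        String meta = "text/html; charset=gb2312";    // from the META tag
        System.out.println("header charset: " + charsetOf(header));
        System.out.println("meta charset:   " + charsetOf(meta));
        System.out.println("conflict:       " + conflict(header, meta));
    }
}
```

The comparison is by name only, so aliases of the same charset would be reported as a conflict. A stricter check could compare java.nio.charset.Charset objects instead.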

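In htmlparser, however, even after calling parser.setEncoding(parser.getEncoding()), MetaTag does not keep the encoding already set on the Parser. When it processes the META tag it switches to the charset named there.

The fix is to change MetaTag.doSemanticAction() as follows: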

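    public void doSemanticAction ()
        throws
            ParserException
    {
        String httpEquiv;
        String charset;

        httpEquiv = getHttpEquiv ();
        if ("Content-Type".equalsIgnoreCase (httpEquiv)) {
            // Original behavior: always switch to the charset named in the META tag.
            //charset = getPage ().getCharset (getAttribute ("CONTENT"));
            //getPage ().setEncoding (charset);

            // Only honor the META charset while the page is still on the default
            // encoding. Strings are compared with equalsIgnoreCase rather than ==,
            // which would compare object references.
            if (Page.DEFAULT_CHARSET.equalsIgnoreCase (getPage ().getEncoding ())) {
                charset = getPage ().getCharset (getAttribute ("CONTENT"));
                getPage ().setEncoding (charset);
            }
        }
    }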
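The decision this patch makes can be modeled as a small pure function, which is easier to reason about than the patched class. This is a sketch under assumptions, not htmlparser code: MetaCharsetPolicy and resolve are hypothetical names, and DEFAULT_CHARSET is assumed to be "ISO-8859-1", mirroring Page.DEFAULT_CHARSET. The strings are compared with equalsIgnoreCase rather than ==.

```java
// Hypothetical model of the patched guard; not part of htmlparser.
final class MetaCharsetPolicy {

    // Assumed to mirror Page.DEFAULT_CHARSET in htmlparser.
    static final String DEFAULT_CHARSET = "ISO-8859-1";

    // Returns the encoding to use after a META Content-Type tag is seen.
    // The META charset is adopted only if the page is still on the default encoding.
    static String resolve(String currentEncoding, String metaCharset) {
        boolean stillDefault = DEFAULT_CHARSET.equalsIgnoreCase(currentEncoding);
        if (stillDefault && metaCharset != null) {
            return metaCharset;
        }
        return currentEncoding;
    }

    public static void main(String[] args) {
        // The HTTP header already set UTF-8, so the META charset is ignored.
        System.out.println(resolve("UTF-8", "GB2312"));
        // No encoding was set beyond the default, so the META charset is adopted.
        System.out.println(resolve("ISO-8859-1", "GB2312"));
    }
}
```

One consequence of this policy is that a page explicitly served as ISO-8859-1 is indistinguishable from one with no declared encoding, so its META charset would still be applied.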


  1. VOODOO said, on 2008/09/28 at 14:48

    The change you posted is only a code fragment, and it does not show what the getHttpEquiv () and getPage () methods do. Please post the complete version.

