Sunday, December 15, 2013

Web crawl data (js)

This post was last edited by tt357788906 on 2013-12-12 17:14:21
For example, given the URL http://zj.jsds.gov.cn/art/2011/12/31/art_39115_408734.html
I want to extract the title ("党组书记、局长 夏照明", i.e. "Party Secretary and Director Xia Zhaoming") and
the content. Please help.

------ Solution --------------------------------------------
The content works almost the same way: locate it by tag or by attribute match, then extract it with a regular expression.
------ For reference only ---------------------------------------
http://blog.csdn.net/jdgdf566/article/details/17039693
------ For reference only ---------------------------------------
Using htmlparser.jar, I wrote code that gets the title. The content is almost the same; use regular-expression matching.

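import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.tags.Div;
import org.htmlparser.util.NodeList;

String path = "http://zj.jsds.gov.cn/art/2011/12/31/art_39115_408734.html";
Parser parser = new Parser(path);
parser.setEncoding("gbk");
// Collect every <div>; the first one holds the title wrapped in marker comments.
NodeFilter filter = new NodeClassFilter(Div.class);
NodeList nodeList = parser.parse(filter);
String s = nodeList.elementAt(0).getChildren().toHtml();
// Strip the leading and trailing marker comments by length, leaving the title text.
System.out.println(s.subSequence("<!--<$[标题]>begin-->".length(), s.length() - "<!--<$[标题]>end-->".length()));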
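Cutting the markers off by length only works when the string starts and ends exactly with them. As a rough alternative, the text between the two marker comments could be pulled out with a regular expression. This is only a sketch: the class name `MarkerExtractor` and the method `between` are made up for illustration, and it assumes every field on the page follows the same `<!--<$[name]>begin-->` / `<!--<$[name]>end-->` pattern seen in the snippet above.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MarkerExtractor {
    // Hypothetical helper: returns the text between
    // "<!--<$[name]>begin-->" and "<!--<$[name]>end-->", if both are present.
    static Optional<String> between(String html, String name) {
        String begin = "<!--<$[" + name + "]>begin-->";
        String end = "<!--<$[" + name + "]>end-->";
        // Pattern.quote treats the markers as literal text; DOTALL lets the
        // captured text span several lines.
        Pattern p = Pattern.compile(
                Pattern.quote(begin) + "(.*?)" + Pattern.quote(end), Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? Optional.of(m.group(1).trim()) : Optional.empty();
    }

    public static void main(String[] args) {
        // \u6807\u9898 is the field name used in the marker comments above (it means "title").
        String name = "\u6807\u9898";
        String sample = "<div><!--<$[" + name + "]>begin--> sample title <!--<$[" + name + "]>end--></div>";
        System.out.println(between(sample, name).orElse("(not found)"));
    }
}
```

With the htmlparser code above, the input would be `nodeList.elementAt(0).getChildren().toHtml()` instead of the sample string.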

------ For reference only ---------------------------------------
http://blog.csdn.net/withiter/article/details/14450003
------ For reference only ---------------------------------------

This can output it.
------ For reference only ---------------------------------------
Search for "js get web content".
See:
http://wenwen.soso.com/z/q356947703.htm
http://bbs.csdn.net/topics/240067166
------ For reference only ---------------------------------------

Could you explain in more detail? Code would be best. After I added this package and copied code from the web, most of it produced errors.
------ For reference only ---------------------------------------

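// Imports needed: org.htmlparser.Parser, org.htmlparser.NodeFilter,
// org.htmlparser.filters.NodeClassFilter, org.htmlparser.tags.Div, org.htmlparser.util.NodeList
public static void main(String[] args) throws Exception {
    String path = "http://zj.jsds.gov.cn/art/2011/12/31/art_39115_408734.html";
    Parser parser = new Parser(path);
    parser.setEncoding("gbk");
    NodeFilter filter1 = new NodeClassFilter(Div.class);
    NodeList nodeList1 = parser.parse(filter1);
    // Print the plain text of each Div; size() - 1 leaves out the last Div on the page.
    for (int i = 0; i < nodeList1.size() - 1; i++) {
        NodeList children = nodeList1.elementAt(i).getChildren();
        if (children != null) { // getChildren() may return null when a Div has no children
            System.out.println(children.asString());
        }
    }
}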

=====================================
党组书记、局长     夏照明
发布时间:2011年12月31日
信息来源:市局人事处
  党组书记、局长:夏照明
  负责主持全面工作。
&nbsp;&nbsp;&nbsp; &nbsp; 
&nbsp;

  夏照明,男,汉族,1958年3月出生,江苏泰兴人,1980年7月参加工作,1985年9月加入中国共产党,研究生学历。
  历任扬州市地方税务局党组成员、副局长,扬州市地方税务局党组书记、副局长,1998年3月任扬州市地方税务局党组书记、局长,2008年6月任江苏省镇江地方税务局党组书记、局长。



------ For reference only ---------------------------------------

About the jar package: I have it working now and solved that problem myself, thanks. Could you take a look at this page: http://zj.jsds.gov.cn/col/col39115/index.html
The data there sits inside JavaScript. I need to grab the leadership entries, and the key point is to grab the corresponding links as well. Please help. Once it is solved I will open a new thread and award 100 points.
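------ For reference only ---------------------------------------

// Imports needed: java.util.regex.Matcher, java.util.regex.Pattern,
// org.htmlparser.Parser, org.htmlparser.util.NodeList
String path = "http://zj.jsds.gov.cn/col/col39115/index.html";
Parser parser = new Parser(path);
parser.setEncoding("gbk");
// A null filter returns all top-level nodes, so toHtml() yields the whole page, scripts included.
NodeList list = parser.parse(null);

// The rows are stored in a JavaScript array literal of the form ['...','...'].
Matcher m = Pattern.compile("\\['(.*?)'\\]").matcher(list.toHtml());
while (m.find()) {
    System.out.println(m.group());
}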
==========
['<tr><td width=16 align="center"><img src=\'/picture/0/110607114955809.jpg\' align=\'absmiddle\' border=\'0\'></script></span></td><td height=28 align="left"><a style=\'font-size:14px;\'  href=\'/art/2011/12/31/art_39115_408734.html\' target="_blank" class=\'bt_link\' title=\'党组书记、局长     夏照明\'>党组书记、局长     夏照明</a></td><td width="80" class=\'bt_date\'><font style=\'color:#313131;\'>2011-12-31</font></td></tr>','<tr><td width=16 align="center"><img src=\'/picture/0/110607114955809.jpg\' align=\'absmiddle\' border=\'0\'></td><td height=28 align="left"><a style=\'font-size:14px;\'  href=\'/art/2011/12/31/art_39115_408733.html\' target="_blank" class=\'bt_link\' title=\'党组副书记、副局长  施竞平\'>党组副书记、副局长  施竞平</a></td><td width="80" class=\'bt_date\'><font style=\'color:#313131;\'>2011-12-31</font></td></tr>','<tr><td width=16 align="center"><img src=\'/picture/0/110607114955809.jpg\' align=\'absmiddle\' border=\'0\'></td><td height=28 align="left"><a style=\'font-size:14px;\'  href=\'/art/2011/12/31/art_39115_408732.html\' target="_blank" class=\'bt_link\' title=\'党组成员、副局长  李 峻\'>党组成员、副局长  李 峻</a></td><td width="80" class=\'bt_date\'><font style=\'color:#313131;\'>2011-12-31</font></td></tr>','<tr><td width=16 align="center"><img src=\'/picture/0/110607114955809.jpg\' align=\'absmiddle\' border=\'0\'></td><td height=28 align="left"><a style=\'font-size:14px;\'  href=\'/art/2011/12/31/art_39115_408731.html\' target="_blank" class=\'bt_link\' title=\'党组成员、纪检组长  邵云\'>党组成员、纪检组长  邵云</a></td><td width="80" class=\'bt_date\'><font style=\'color:#313131;\'>2011-12-31</font></td></tr>','<tr><td width=16 align="center"><img src=\'/picture/0/110607114955809.jpg\' align=\'absmiddle\' border=\'0\'></td><td height=28 align="left"><a style=\'font-size:14px;\'  href=\'/art/2011/12/31/art_39115_408730.html\' target="_blank" class=\'bt_link\' title=\'党组成员、副局长  郦梅生\'>党组成员、副局长  郦梅生</a></td><td width="80" class=\'bt_date\'><font style=\'color:#313131;\'>2011-12-31</font></td></tr>','<tr><td 
width=16 align="center"><img src=\'/picture/0/110607114955809.jpg\' align=\'absmiddle\' border=\'0\'></td><td height=28 align="left"><a style=\'font-size:14px;\'  href=\'/art/2011/5/20/art_39115_408729.html\' target="_blank" class=\'bt_link\' title=\'党组成员、总经济师  高凌\'>党组成员、总经济师  高凌</a></td><td width="80" class=\'bt_date\'><font style=\'color:#313131;\'>2011-05-20</font></td></tr>']



------ For reference only ---------------------------------------
Could you send me the jar package? I found several different ones. Thank you. 357788906@qq.com

------ For reference only ---------------------------------------

No need to send it, thank you. I found it.
------ For reference only ---------------------------------------

Because I reach the detail pages from this page, I need to rewrite each href: System.out.println(m.group().replace("href=\'", "href=\'?url=zj.jsds.gov.cn")); Why does it have no effect?
------ For reference only ---------------------------------------
What exactly do you want? If the goal is to extract the data inside, what does rewriting the href have to do with it? Aren't you just grabbing the data?
String path = "http://zj.jsds.gov.cn/col/col39115/index.html";
Parser parser = new Parser(path);
parser.setEncoding("gbk");
NodeList list = parser.parse(null);
Matcher m = Pattern.compile("\\['(.*?)'\\]").matcher(list.toHtml());
String str = "";
while (m.find()) {
    System.out.println(m.group());
    str = m.group(); // keeps the most recent match
}
// In the page source the quotes are escaped as \', so the regex has to match a literal backslash before the quote.
Matcher m1 = Pattern.compile("href=\\\\'(.*?\\.html)(.*?>)(.*?)(</a>)").matcher(str);
while (m1.find()) {
    System.out.println(m1.group(1)); // link path
    System.out.println(m1.group(3)); // link text
}
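On the earlier question of why the replace(...) call had no effect, one likely cause (an assumption, since nobody in the thread confirmed it) is string escaping. In Java source, "href=\'" is just href=' with no backslash, while the matched text contains href=\' with a real backslash, so replace() finds nothing to change. The sketch below shows the difference; the class name `HrefRewrite` and the method `prefixHrefs` are made up for illustration.

```java
public class HrefRewrite {
    // Hypothetical helper: prefixes every href value in JS-escaped markup
    // (href=\'...\') with the given text.
    static String prefixHrefs(String jsEscapedHtml, String prefix) {
        // "href=\\'" in Java source is the seven characters href=\' (one real backslash).
        String needle = "href=\\'";
        return jsEscapedHtml.replace(needle, needle + prefix);
    }

    public static void main(String[] args) {
        // One link as it appears inside the JavaScript array: quotes escaped as \'
        String row = "<a href=\\'/art/2011/12/31/art_39115_408734.html\\' target=\"_blank\">";

        // "href=\'" is just href=' and does not occur in row, so row is printed unchanged.
        System.out.println(row.replace("href=\'", "href=\'?url=zj.jsds.gov.cn"));

        // With the backslash itself escaped, the target is found and rewritten.
        System.out.println(prefixHrefs(row, "?url=zj.jsds.gov.cn"));
    }
}
```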

------ For reference only ---------------------------------------

Resolved. My goal was to crawl the first-level page and then follow its links to fetch the second-level pages. Thank you, everyone.
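For following the links to the second-level pages, the hrefs in the list are site-relative (for example /art/2011/12/31/art_39115_408734.html), so they need to be resolved against the list page's URL first. A minimal sketch, reusing the href pattern from the reply above; the class name `LinkResolver` and the method `extractLinks` are made up for illustration, and the sample row is shortened from the real output.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkResolver {
    // Same pattern as in the reply above: href=\'....html\' ... >text</a>
    private static final Pattern LINK =
            Pattern.compile("href=\\\\'(.*?\\.html)(.*?>)(.*?)(</a>)");

    // Returns {absoluteUrl, linkText} pairs found in the JS-escaped markup.
    static List<String[]> extractLinks(String baseUrl, String jsEscapedHtml) {
        URI base = URI.create(baseUrl);
        List<String[]> result = new ArrayList<>();
        Matcher m = LINK.matcher(jsEscapedHtml);
        while (m.find()) {
            // URI.resolve turns a site-relative path into an absolute URL.
            String absolute = base.resolve(m.group(1)).toString();
            result.add(new String[] { absolute, m.group(3).trim() });
        }
        return result;
    }

    public static void main(String[] args) {
        String listPage = "http://zj.jsds.gov.cn/col/col39115/index.html";
        // Shortened sample row; "link text" stands in for the real title.
        String data = "['<tr><td><a href=\\'/art/2011/12/31/art_39115_408734.html\\'"
                + " target=\"_blank\">link text</a></td></tr>']";
        for (String[] link : extractLinks(listPage, data)) {
            System.out.println(link[0] + " | " + link[1]);
        }
    }
}
```

Each resulting URL can then be passed to new Parser(url) as in the first-level examples earlier in the thread.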
------ For reference only ---------------------------------------

What does this mean? When matching on key characters, the content of some pages cannot be retrieved: http://www.jsds.gov.cn/art/2013/5/24/art_50308_585582.html
------ For reference only ---------------------------------------
What do you mean? Do you want to get the content of http://www.jsds.gov.cn/art/2013/5/24/art_50308_585582.html?
Just grab it directly. Key characters? What does that mean?
