A simple, scalable, and highly efficient web crawler framework for Java.
- WebMagic is built with Maven, and Maven is the recommended way to install it. Add the following coordinates to your own project (an existing one or a new one):
```xml
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-core</artifactId>
    <version>0.7.3</version>
</dependency>
<dependency>
    <groupId>us.codecraft</groupId>
    <artifactId>webmagic-extension</artifactId>
    <version>0.7.3</version>
</dependency>
```
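Once the WebMagic dependency has been added to your project, you can start writing your first crawler. The example here scrapes job-posting information from Boss直聘 (Boss Zhipin):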
```java
package com.zby.test;

import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class spider implements PageProcessor {

    // Called once per downloaded page; each XPath targets one field of the job-detail page.
    @Override
    public void process(Page page) {
        System.out.println("Position: "
                + page.getHtml().xpath("//*[@id='main']/div[1]/div[1]/div[1]/div[2]/div[2]/h1/text()").toString());
        System.out.println("Salary: "
                + page.getHtml().xpath("//*[@id='main']/div[1]/div[1]/div[1]/div[2]/div[2]/span/text()").toString());
        System.out.println("Education: "
                + page.getHtml().xpath("//*[@id='main']/div[1]/div[1]/div[1]/div[2]/p/text()").toString());
        System.out.println("Benefits: "
                + page.getHtml().xpath("//*[@id='main']/div[1]/div[1]/div[1]/div[2]/div[3]/div[2]/span[1]/text()").toString());
    }

    // Crawl settings: wait 1000 ms between pages and 5 ms before a retry.
    @Override
    public Site getSite() {
        return Site.me().setSleepTime(1000).setRetrySleepTime(5);
    }

    public static void main(String[] args) {
        // Start from a single job-detail URL and run in the current thread.
        Spider.create(new spider())
                .addUrl("https://www.zhipin.com/job_detail/3ded0d912dac595903V53tu8E1o~.html?ka=search_list_1")
                .run();
    }
}
```
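The XPath expressions in the example above are positional: each `div[n]` step picks the n-th `div` child of the previous element. To make that easier to follow, here is a small sketch that evaluates the same path against a tiny, made-up fragment. It uses the JDK's built-in `javax.xml.xpath` engine rather than WebMagic's own selector, and both the `XPathSketch` class and the `SAMPLE` markup are assumptions for illustration, not the real page.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

// Illustration only: JDK XPath on hypothetical markup, not WebMagic's selector.
class XPathSketch {

    // Hypothetical, well-formed stand-in for the job-detail page structure.
    static final String SAMPLE =
            "<html><body><div id='main'><div><div><div>"
            + "<div>logo</div>"
            + "<div><div>status</div>"
            + "<div><h1>Java Engineer</h1><span>15-25K</span></div></div>"
            + "</div></div></div></div></body></html>";

    // Returns the string value of the first node matched by the expression,
    // or an empty string when nothing matches.
    static String select(String markup, String expression) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        try {
            return xpath.evaluate(expression, new InputSource(new StringReader(markup)));
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Bad XPath or markup: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String base = "//*[@id='main']/div[1]/div[1]/div[1]/div[2]/div[2]";
        System.out.println("Position: " + select(SAMPLE, base + "/h1/text()"));
        System.out.println("Salary: " + select(SAMPLE, base + "/span/text()"));
    }
}
```

Paths this deep tend to break as soon as the site changes its layout, so in a real crawler it may be worth anchoring on a class name or id closer to the target element.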
```java
public Site getSite() {
    return Site.me().setSleepTime(1000).setRetrySleepTime(5);
}
```
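The wait times here are meant to make the crawl less likely to be noticed by the site's backend: `setSleepTime(1000)` pauses 1000 ms between page fetches, and `setRetrySleepTime(5)` sets a 5 ms pause before a failed download is retried.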
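To show what such a wait time amounts to, here is a minimal sketch of a politeness delay written in plain Java. It is not how WebMagic implements `setSleepTime` internally; the `PoliteDelay` class and its methods are hypothetical names introduced only for this illustration.

```java
// Illustration only: a minimal politeness delay, not WebMagic's implementation.
class PoliteDelay {
    private final long sleepMillis;
    private long lastRequestAt = -1; // -1 means no request has been made yet

    PoliteDelay(long sleepMillis) {
        this.sleepMillis = sleepMillis;
    }

    // How many milliseconds must still pass at time `now` before the next request.
    long remainingWait(long now) {
        if (lastRequestAt < 0) {
            return 0;
        }
        return Math.max(0, sleepMillis - (now - lastRequestAt));
    }

    // Records that a request was sent at time `now`.
    void markRequest(long now) {
        lastRequestAt = now;
    }

    // Blocks until the interval has elapsed, then records the request time.
    void awaitTurn() {
        long wait = remainingWait(System.currentTimeMillis());
        if (wait > 0) {
            try {
                Thread.sleep(wait);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
            }
        }
        markRequest(System.currentTimeMillis());
    }

    public static void main(String[] args) {
        PoliteDelay delay = new PoliteDelay(1000); // same 1000 ms as in the example
        for (int i = 1; i <= 3; i++) {
            delay.awaitTurn();
            System.out.println("Request " + i + " would be sent now");
        }
    }
}
```

A longer interval makes the crawl slower but gentler on the target site; the right value depends on the site being crawled.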