使用jsoup爬取中国青年网的资讯(资讯列表,资讯内容)
思路分析
http://news.youth.cn/gn/网站内容分析,每一条资讯单独成行,在标签<ul class="tj3\_1">
中,内容比较格式化,只需要将页面下载下来,以行为单位进行逐行解析即可
流程
代码思路
代码(只贴 Factory 和Main)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
| import Util.MyLogger; import Util.MySpiderContextFactory; import MySpider.MySpiderContext; import java.net.URL; import java.util.List; import java.util.concurrent.ExecutorService; import java.util.concurrent.Executors;
public class ContextMain {
private static URL[] urls; public static boolean useThreads = false;
public static void main(String[] args) throws Exception{
MyLogger.log("获取所有待爬取的URL"); urls = new URL[500]; YouthNewsService youthNewsService = new YouthNewsService(); youthNewsService.init(); List<YouthNews> youthNews = youthNewsService.selectList(); MyLogger.log("队列初始大小:" + urls.length); MyLogger.log("获取文章数量:" + youthNews.size());
for (int i = 0; i < youthNews.size(); i++) { urls[i] = new URL("http://" + youthNews.get(i).getUrl()); }
long start = System.currentTimeMillis(); if(!useThreads){ MySpiderContext mySpider = MySpiderContextFactory.getNewsContextBySpider(urls); mySpider.start();
}
long end = System.currentTimeMillis();
MyLogger.log("Main ended");
} }
|
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
| package club.zby.Until;
import club.zby.Boot.Example.ExampleBoot; import club.zby.DataService.Example.exampleDataService; import club.zby.DataService.NewsContextService; import club.zby.Downloader.Example.StreamDownloader; import club.zby.Processor.Example.NewsProcessor; import club.zby.Processor.Example.YouthProcessor; import club.zby.ScheduleQueue.Example.exampleScheduleQueue; import club.zby.Spider.MySpider; import club.zby.Spider.MySpiderContext;
import java.net.URL;
public class MySpiderContextFactory {
public static MySpiderContext getSpiderContextToLocalAndNoDataService(URL[] urls) throws Exception { MySpiderContext mySpiderContext = new MySpiderContext(urls) .addBoot(new ExampleBoot()) .addDownloader(new StreamDownloader()) .addProcessor(new NewsProcessor()) .addScheduleQueue(new exampleScheduleQueue()); return mySpiderContext; }
public static MySpiderContext getSpiderDataToDataService(URL[] urls) throws Exception { MySpiderContext mySpiderContext = new MySpiderContext(urls) .addBoot(new ExampleBoot()) .addDownloader(new StreamDownloader()) .addProcessor(new NewsProcessor()) .addScheduleQueue(new exampleScheduleQueue()) .addDataService(new NewsContextService()); return mySpiderContext; } }
|
结果
资讯列表
列表中每条资讯的详细内容
完整代码,点击下面的按钮下载(需要根据entity中的实体自己建立数据)
点击此处下载
密码:0kad