Releases · code4craft/webmagic

27 Apr 05:10

code4craft

WebMagic-0.5.0

7ff83bb

WebMagic-0.5.0

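This release mainly adds monitoring and rewrites the multithreading code, which greatly improves performance when running with multiple threads. It also includes several annotation-mode improvements, multi-page support, and other features.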

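Overall project progress:

The official site webmagic.io is now live, together with detailed official documentation at http://webmagic.io/docs, which makes WebMagic easier to use.
Three new collaborators, @ccliangbo @ouyanghuangzheng @linkerlin , have joined to help maintain the project.
The official forum http://bbs.webmagic.io/ and the official QQ group 373225642 are now open; community building will get more attention from now on.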

Monitoring:

Added monitoring: with JMX you can monitor the page count and spider status, and start and stop spiders. Documentation: http://webmagic.io/docs/posts/ch4-basic-page-processor/monitor.html #98
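
As a rough illustration of the JMX approach described above, the sketch below exposes a page counter and a status flag through a standard MBean using only the JDK. The names `CrawlerStatusMBean`, `CrawlerStatus`, and the `example.crawler` object name are hypothetical and are not WebMagic's monitoring API; the linked documentation describes the real one.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.atomic.AtomicLong;
import javax.management.MBeanServer;
import javax.management.ObjectName;

class CrawlerMonitorSketch {

    /** Management interface. Standard MBeans need a public interface named <Class>MBean. */
    public interface CrawlerStatusMBean {
        long getPageCount();   // exposed as attribute "PageCount"
        String getStatus();    // exposed as attribute "Status"
        void stop();           // exposed as operation "stop"
    }

    /** Hypothetical status holder that a crawler would update while running. */
    public static class CrawlerStatus implements CrawlerStatusMBean {
        private final AtomicLong pageCount = new AtomicLong();
        private volatile String status = "Running";

        void onPageProcessed() { pageCount.incrementAndGet(); }

        @Override public long getPageCount() { return pageCount.get(); }
        @Override public String getStatus() { return status; }
        @Override public void stop() { status = "Stopped"; }
    }

    /** Registers the bean, reads it back through the MBean server, and stops it. */
    static String demo() {
        try {
            MBeanServer server = ManagementFactory.getPlatformMBeanServer();
            ObjectName name = new ObjectName("example.crawler:type=CrawlerStatus,name=demo");
            if (server.isRegistered(name)) {
                server.unregisterMBean(name);
            }
            CrawlerStatus bean = new CrawlerStatus();
            server.registerMBean(bean, name);
            try {
                for (int i = 0; i < 3; i++) {
                    bean.onPageProcessed();
                }
                Object pages = server.getAttribute(name, "PageCount");
                server.invoke(name, "stop", new Object[0], new String[0]);
                Object status = server.getAttribute(name, "Status");
                return pages + " " + status;
            } finally {
                server.unregisterMBean(name);
            }
        } catch (Exception e) {
            throw new IllegalStateException("JMX demo failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

While such a bean stays registered, a JMX client such as JConsole can read the same attributes and invoke the same operation on a running process.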

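Multithreading:

Rewrote the multithreading code and fixed a problem where the main dispatch thread could be blocked by worker threads, which greatly improves efficiency under multiple threads. All users are encouraged to upgrade. #110
Added a timeout to the wait/notify mechanism that the main thread uses while waiting for new URLs, which prevents the spider from hanging in rare cases. #111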
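
The timed wait mentioned in #111 can be sketched with plain `wait`/`notifyAll` as follows. This is a minimal illustration under assumed names (`TimedWaitSketch`, `push`, `poll`), not the code used inside Spider: the waiting thread gives up after a deadline instead of blocking forever if a notification is missed.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.TimeUnit;

class TimedWaitSketch {
    private final Queue<String> urls = new ArrayDeque<>();

    /** Called by worker threads when they discover a new URL. */
    synchronized void push(String url) {
        urls.add(url);
        notifyAll(); // wake up a dispatcher that may be waiting in poll()
    }

    /** Returns the next URL, or null if none arrives within the timeout. */
    synchronized String poll(long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        try {
            while (urls.isEmpty()) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    return null; // timed out: the caller can re-check its exit condition
                }
                wait(remainingMillis); // never wait(0), which would block indefinitely
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt for the caller
            return null;
        }
        return urls.poll();
    }

    public static void main(String[] args) {
        TimedWaitSketch queue = new TimedWaitSketch();
        System.out.println(queue.poll(50));
        queue.push("http://example.com/");
        System.out.println(queue.poll(50));
    }
}
```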

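Extraction API:

Added JSON support: you can now use page.getJson().jsonPath() to parse AJAX responses with jsonPath, and page.getJson().removePadding().jsonPath() to parse JSONP responses. #101
Fixed a Selectable caching problem that caused two reads to return inconsistent results. #73 Thanks to @seveniu for finding the problem.
A Spider can now have multiple PageProcessors, selected by URL. Thanks to @sebastian1118 for the patch. Usage example: PatternProcessorExample #86
Fixed uncommon tags not being selectable with nth-of-type (for example //div/svg[2]). #75
Fixed XPath expressions containing special characters failing to parse even when escaped. #77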
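
To make the JSONP item above more concrete: a JSONP response wraps a JSON payload in a callback call, and removing that padding leaves plain JSON for jsonPath to work on. The sketch below shows one possible way to strip the padding with a regular expression. The class and method names are hypothetical, and this is not WebMagic's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class JsonpPaddingSketch {
    // callback name, "(", payload, ")", optional ";" with optional surrounding whitespace
    private static final Pattern JSONP =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    /** Returns the JSON payload of a JSONP response, or the trimmed input if it has no padding. */
    static String removePadding(String text) {
        Matcher matcher = JSONP.matcher(text);
        return matcher.matches() ? matcher.group(1).trim() : text.trim();
    }

    public static void main(String[] args) {
        System.out.println(removePadding("callback({\"name\":\"webmagic\"});"));
        System.out.println(removePadding("{\"name\":\"webmagic\"}"));
    }
}
```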

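Annotation mode:

Annotation mode now supports inheritance: annotations on a parent class also apply to its subclasses. #103
Fixed a problem in annotation mode where a Spider using multiple Models might not work. Thanks to @ccliangbo for finding this. #85
Fixed only one URL being extracted from sourceRegion. Thanks to @jsinak for finding this. #107
Fixed a bug in the automatic type conversion Formatter; custom Formatters are now supported. If you are not familiar with Formatter, see: Type conversion of results in annotation mode #100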
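
The inheritance item (#103) comes down to a detail of Java reflection: `getDeclaredFields()` returns only the fields declared by the class itself, so a model reader has to walk up the superclass chain to see annotated fields of parent classes. The following sketch shows that walk with a hypothetical `@Extract` annotation standing in for the real extraction annotations; it is not taken from WebMagic's source.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

class AnnotationInheritanceSketch {

    /** Hypothetical stand-in for an extraction annotation. */
    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    @interface Extract {
        String value();
    }

    static class BasePage {
        @Extract("//title")
        String title;
    }

    static class BlogPage extends BasePage {
        @Extract("//div[@class='content']")
        String content;
    }

    /** Collects field name -> expression for the class and all of its superclasses. */
    static Map<String, String> extractionRules(Class<?> type) {
        // Build the chain so that the top-most superclass is visited first.
        Deque<Class<?>> chain = new ArrayDeque<>();
        for (Class<?> c = type; c != null && c != Object.class; c = c.getSuperclass()) {
            chain.push(c);
        }
        Map<String, String> rules = new LinkedHashMap<>();
        for (Class<?> c : chain) {
            for (Field field : c.getDeclaredFields()) {
                Extract extract = field.getAnnotation(Extract.class);
                if (extract != null) {
                    rules.put(field.getName(), extract.value());
                }
            }
        }
        return rules;
    }

    public static void main(String[] args) {
        System.out.println(extractionRules(BlogPage.class));
    }
}
```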

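Other components:

Downloader now supports HTTP methods other than GET, including POST, HEAD, PUT, DELETE, and TRACE. Thanks to @usenrong for the suggestion. #108
When setting a Cookie in Site, you can now specify a domain instead of only using the default domain. #109
When setScheduler() is called and the previous Scheduler already holds URLs, they are first moved to the new Scheduler so that no URLs are lost. #104
Removed log4j.xml from the release package to avoid conflicts with user programs. Thanks to @cnjavaer for finding the problem. #82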
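
The setScheduler() behaviour described in #104 can be pictured as draining the old queue into the new one before switching over. The sketch below uses a simplified, hypothetical `Scheduler` interface with string URLs; the real Scheduler works with request objects and has a different signature.

```java
import java.util.ArrayDeque;
import java.util.Queue;

class SchedulerSwapSketch {

    /** Simplified, hypothetical scheduler contract. */
    interface Scheduler {
        void push(String url);
        String poll(); // returns null when empty
    }

    static class QueueScheduler implements Scheduler {
        private final Queue<String> queue = new ArrayDeque<>();
        @Override public void push(String url) { queue.add(url); }
        @Override public String poll() { return queue.poll(); }
    }

    private Scheduler scheduler = new QueueScheduler();

    void addUrl(String url) {
        scheduler.push(url);
    }

    /** Switches schedulers, moving any pending URLs so none are lost. */
    void setScheduler(Scheduler next) {
        Scheduler old = this.scheduler;
        this.scheduler = next;
        String url;
        while ((url = old.poll()) != null) {
            next.push(url);
        }
    }

    String nextUrl() {
        return scheduler.poll();
    }

    public static void main(String[] args) {
        SchedulerSwapSketch spider = new SchedulerSwapSketch();
        spider.addUrl("http://example.com/a");
        spider.addUrl("http://example.com/b");
        spider.setScheduler(new QueueScheduler());
        System.out.println(spider.nextUrl());
        System.out.println(spider.nextUrl());
    }
}
```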


13 Mar 00:28

code4craft

webmaigc-0.4.3

63ffb5c

WebMagic-0.4.3

Bugfix:

Fix cycleRetryTimes does not work #58 #60 #62 @yxssfxwzy
Fix NullPointerException in FileCachedQueueScheduler #53 @Xuchaoo
Fix Selenium does not quit #57 @d0ngw

Enhancement:

Enhance RegexSelector group check #51 @SimpleExpress
Add XPath syntax support for contains, or/and, and "|" #64
Add text attribute select to CssSelector #66
Change logger to slf4j #55
Update HttpClient version to 4.3.3 #59
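
For readers unfamiliar with the XPath constructs named in #64 (`contains`, `or`/`and`, and the `|` union), the sketch below evaluates one expression of each kind. It uses the JDK's built-in XPath 1.0 engine over a small well-formed XML snippet rather than WebMagic's own selector, so it only illustrates the syntax; the sample markup and the `count` helper are made up for this example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class XPathSyntaxSketch {
    // Made-up sample document: two links and one span.
    static final String DOC =
            "<div>"
            + "<a class='nav top' href='/a'>A</a>"
            + "<a class='item' href='/b'>B</a>"
            + "<span id='s'>S</span>"
            + "</div>";

    /** Returns how many nodes the expression selects in the sample document. */
    static int count(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(DOC)), XPathConstants.NODESET);
            return nodes.getLength();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(count("//a[contains(@class, 'nav')]"));              // substring match
        System.out.println(count("//a[@class='item' or @href='/a']"));          // either condition
        System.out.println(count("//a[contains(@class, 'nav') and @href='/a']")); // both conditions
        System.out.println(count("//a | //span"));                              // union of two node sets
    }
}
```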


03 Dec 15:43

code4craft

webmagic-0.4.2

e8c32a3

webmagic-0.4.2

Enhancement:
#45 Remove the multi option in ExtractBy. Whether a field is multi-valued is now detected automatically from the field type.
Bugfix:
#46 Downloader thread sometimes hangs.


28 Nov 05:11

code4craft

webmagic-0.4.1

ae62356

webmagic-0.4.1

Fix some concurrency problems that kept the spider from exiting after all pages were downloaded. #36
#38 Use algorithm of https://code.google.com/p/cx-extractor/.

More support for ajax:

#39 Parsing html after page.getHtml()
#42 Add jsonpath support in annotation mode
#35 Add more http info to page
#41 Add more status monitor method to Spider


06 Nov 23:54

code4craft

webmagic-0.4.0

fdb9441

webmagic-0.4.0

Improve performance of Downloader.

Update HttpClient to 4.3.1 and rewrite the code of HttpClientDownloader #32.
Use gzip by default to reduce the transport cost #31.
Enable HTTP Keep-Alive and connection persistence, fix the wrong usage of PoolConnectionManager #30.

The performance of Downloader is improved by 90% in my test. Test code: Kr36NewsModel.java.

Add a synchronous API for small tasks #28.

        OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(100), BaiduBaike.class);
        BaiduBaike baike = ooSpider.<BaiduBaike>get("http://baike.baidu.com/search/word?word=httpclient&pic=1&sug=1&enc=utf8");
        System.out.println(baike);

More config for site

Http proxy support by Site.setHttpProxy #22.
More http header customizing support by Site.addHeader #27.
Allow disable gzip by Site.setUseGzip(false).
Move Site.addStartUrl to Spider.addUrl, because I think the start URL is more a property of the Spider than of the Site.

Code refactor in Spider

Refactor the multi-threaded part of Spider and fix some concurrency problems.
Import the Google Guava API for simpler code.
Allow adding requests with more information via Spider.addRequest() instead of addUrl #29.
Allow downloading only the start URLs, without following the URLs extracted from them, via Spider.setSpawnUrl(false).


23 Sep 05:26

code4craft

webmagic-0.3.2

cc3b787

webmagic-0.3.2

#13 Add class casting to the annotation crawler.
Allow customizing the class caster by implementing ObjectFormatter and registering it with ObjectFormatters.put().
Fix a thread pool rejection exception when calling Spider.stop().


08 Sep 15:04

code4craft

webmagic-parent-0.3.1

bfaaa04

webmagic-0.3.1

#26 Bugfix: Annotation extractor does not work.
#25 Bugfix: UrlUtils.canonicalizeUrl does not work with "../../" path.
#24 Enhancement: Add stop method to Spider.
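
To show what resolving a "../../" path (the case fixed in #25) should produce, here is a small sketch based on `java.net.URI.resolve`, which removes the dot segments while resolving a relative link against the page URL. The helper name `resolveLink` is hypothetical, and this is not the code of UrlUtils.canonicalizeUrl.

```java
import java.net.URI;

class RelativeUrlSketch {

    /** Resolves a possibly relative link against the URL of the page it was found on. */
    static String resolveLink(String pageUrl, String link) {
        return URI.create(pageUrl).resolve(link).toString();
    }

    public static void main(String[] args) {
        String page = "http://example.com/a/b/c.html";
        System.out.println(resolveLink(page, "../../d.html"));
        System.out.println(resolveLink(page, "e.html"));
        System.out.println(resolveLink(page, "http://other.example/x"));
    }
}
```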


04 Sep 03:02

code4craft

webmagic-0.3.0

77ff252

webmagic-0.3.0

Change default XPath selector from HtmlCleaner to Xsoup.

Xsoup is an XPath selector based on Jsoup written by me. It has much better performance than HtmlCleaner.

Time of processing a page is reduced from 7~9ms to 0.4ms.

If Xsoup is not stable for your usage, just use Spider.xsoupOff() to turn it off and report an issue to me!
Add cycle retry times for Site.

When cycle retry times is set, Spider puts a URL whose download failed back into the scheduler and retries it after one cycle of the queue.
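
The cycle retry idea can be sketched with a plain FIFO queue: a failed request goes back to the tail, so it is retried only after the requests already queued have had their turn. Everything in this sketch (the `Request` class, the `run` method, and the `download` callback) is a hypothetical simplification and not the Spider or Scheduler code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.function.BiPredicate;

class CycleRetrySketch {

    /** A URL together with the number of retries already spent on it. */
    static final class Request {
        final String url;
        final int retried;

        Request(String url, int retried) {
            this.url = url;
            this.retried = retried;
        }
    }

    /**
     * Processes the queue and returns a log of attempts.
     * download.test(url, retried) returns true when the download succeeds.
     */
    static List<String> run(List<String> startUrls, int cycleRetryTimes,
                            BiPredicate<String, Integer> download) {
        Deque<Request> queue = new ArrayDeque<>();
        for (String url : startUrls) {
            queue.add(new Request(url, 0));
        }
        List<String> log = new ArrayList<>();
        while (!queue.isEmpty()) {
            Request request = queue.poll();
            boolean ok = download.test(request.url, request.retried);
            log.add(request.url + (ok ? ":ok" : ":fail"));
            if (!ok && request.retried < cycleRetryTimes) {
                // Re-queue at the tail: the retry happens after the rest of the queue.
                queue.add(new Request(request.url, request.retried + 1));
            }
        }
        return log;
    }

    public static void main(String[] args) {
        // "a" fails on its first attempt only.
        System.out.println(run(Arrays.asList("a", "b"), 1,
                (url, retried) -> !(url.equals("a") && retried == 0)));
    }
}
```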


20 Aug 15:51

code4craft

webmagic-parent-0.2.1

9dc6b11

webmagic-0.2.1

ComboExtractor support for annotation.

Request priority support (using PriorityScheduler).

Complete some I18n work (comments and documents).

More convenient extractor API:

Add attribute name select for CSSSelector.
Group of regex selector can be specified.
Add OrSelector.

Add Selectors, import static Selectors.* for fluent API such as:

or(regex("<title>(.*)</title>"), xpath("//title"), $("title")).select(s);

Add JsonPathSelector for JSON parsing.


30 Aug 09:46

code4craft

version-0.2.0

781c80d

version-0.2.0

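The theme of this release is "convenience" (the previous theme was "flexibility").

Added the webmagic-extension module.

Added annotation support: a crawler can now be written as a POJO plus annotations, which fits Java development habits better. Below is the complete code for crawling a blog: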

    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
    public class OschinaBlog {

        @ExtractBy("//title")
        private String title;

        @ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
        private String content;

        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
        private List<String> tags;

        public static void main(String[] args) {
            OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
            new ConsolePageModelPipeline(), OschinaBlog.class)
            .scheduler(new RedisScheduler("127.0.0.1")).thread(5).run();
        }

    }

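Added a Spider.test(url) method for debugging while developing a crawler.

Added Redis-based distributed support.

Added XPath 2.0 syntax support (webmagic-saxon module).

Added Selenium-based browser rendering support for crawling dynamically loaded content (webmagic-selenium module).

Fixed a bug where https was not supported.

Added documentation: webmagic-0.2.0 user manual.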

