Releases: code4craft/webmagic

WebMagic-0.5.0

27 Apr 05:10

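This release mainly adds monitoring and rewrites the multithreading code, which greatly improves performance when running with multiple threads. It also includes several improvements to annotation mode, multi-page support, and other features.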

Overall project progress:

Monitoring:

Multithreading:

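  • Rewrote the multithreading code and fixed an issue where the main dispatch thread could be blocked by worker threads. This greatly improves efficiency under multithreading; all users are encouraged to upgrade. #110
  • Added a timeout to the wait/notify mechanism that the main thread uses while waiting for new URLs, preventing the crawler from hanging in rare cases. #111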
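
The effect of a timeout on a wait/notify handshake can be pictured with a small standalone sketch. This is not WebMagic's actual code: the class NewUrlSignal and its methods are hypothetical, and it uses java.util.concurrent locks to show how a timed wait lets a dispatch thread wake up and re-check its queue even if no notification ever arrives.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper, not part of WebMagic: a timed wait for "new URL" signals.
class NewUrlSignal {
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition newUrl = lock.newCondition();
    private int pending = 0; // signals that have not been consumed yet

    // Called by a worker thread after it has queued a new URL.
    void signalNewUrl() {
        lock.lock();
        try {
            pending++;
            newUrl.signalAll();
        } finally {
            lock.unlock();
        }
    }

    // Called by the dispatch thread. Returns true if a signal was consumed,
    // false if the timeout elapsed (or the thread was interrupted) first,
    // so the caller can re-check its queue instead of blocking forever.
    boolean awaitNewUrl(long timeoutMillis) {
        long nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        lock.lock();
        try {
            while (pending == 0) {
                if (nanos <= 0L) {
                    return false;
                }
                try {
                    nanos = newUrl.awaitNanos(nanos);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return false;
                }
            }
            pending--;
            return true;
        } finally {
            lock.unlock();
        }
    }
}
```

With an untimed wait, a notification that is missed would leave the waiting thread blocked indefinitely; with the timeout, the worst case is one extra wake-up per interval.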

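Extraction API:

  • Added JSON support. You can now use page.getJson().jsonPath() to parse the results of AJAX requests with jsonPath, and page.getJson().removePadding().jsonPath() to parse the results of JSONP requests. #101
  • Fixed a caching issue in Selectable that caused two successive reads to return inconsistent results. #73 Thanks to @seveniu for finding the problem.
  • A Spider can now have multiple PageProcessors, selected by URL. Thanks to @sebastian1118 for submitting the patch. Usage example: PatternProcessorExample #86
  • Fixed an issue where nth-of-type selection did not work for uncommon tags (for example, //div/svg[2]). #75
  • Fixed an issue where an XPath containing special characters failed to parse even when the characters were escaped. #77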
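
To make the JSONP case concrete, the sketch below shows what removing the padding means: a JSONP response wraps a JSON payload in a callback call, and stripping that wrapper leaves plain JSON that a jsonPath expression can then be applied to. The class JsonpPadding and its regular expression are hypothetical illustrations written against the JDK only; they are not WebMagic's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not WebMagic's implementation.
class JsonpPadding {
    // Matches "callbackName( payload )" with an optional trailing semicolon.
    private static final Pattern PADDING =
            Pattern.compile("^\\s*[\\w$.]+\\s*\\((.*)\\)\\s*;?\\s*$", Pattern.DOTALL);

    // Returns the JSON payload inside the callback call, or the trimmed
    // input unchanged when it is not wrapped in a callback.
    static String strip(String text) {
        Matcher m = PADDING.matcher(text);
        return m.matches() ? m.group(1).trim() : text.trim();
    }

    public static void main(String[] args) {
        System.out.println(strip("callback({\"name\":\"webmagic\"});"));
    }
}
```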

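Annotation mode:

  • Annotation mode now supports inheritance! Annotations on a parent class also apply to its subclasses. #103
  • Fixed an issue in annotation mode where a Spider using multiple Models might not work correctly. Thanks to @ccliangbo for finding this problem. #85
  • Fixed an issue where only one URL was extracted from a sourceRegion. Thanks to @jsinak for finding this problem. #107
  • Fixed a bug in the automatic type conversion Formatter; custom Formatters can now be defined. If you are not familiar with Formatter, see: Type conversion of results in annotation mode #100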

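Other components:

  • Downloader now supports HTTP methods other than GET, including POST, HEAD, PUT, DELETE, and TRACE. Thanks to @usenrong for the suggestion. #108
  • When setting a Cookie in Site, you can now specify a domain instead of being limited to the default domain. #109
  • When setScheduler() is called and the previous Scheduler already holds URLs, they are first moved to the new Scheduler so that no URLs are lost. #104
  • Removed log4j.xml from the release package to avoid conflicts with user applications. Thanks to @cnjavaer for finding the problem. #82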

WebMagic-0.4.3

13 Mar 00:28

Bugfix:

Enhancement:

  • Enhance RegexSelector group check #51 @SimpleExpress
  • Add XPath syntax support for contains, or/and, and "|" #64
  • Add text attribute select to CssSelector #66
  • Change logger to slf4j #55
  • Update HttpClient version to 4.3.3 #59
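
For readers unfamiliar with the XPath constructs named above, the following sketch evaluates contains, or/and, and the "|" union operator. It deliberately uses the JDK's built-in javax.xml.xpath engine on a small well-formed document rather than WebMagic's own selector, so it only illustrates the syntax; the class XPathSyntaxDemo and the sample markup are made up for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Illustration only: uses the JDK XPath engine, not WebMagic's selector.
class XPathSyntaxDemo {
    static final String XML =
            "<html><body>"
            + "<div class='post main'>first</div>"
            + "<div class='sidebar'>second</div>"
            + "<p id='intro'>third</p>"
            + "</body></html>";

    // Evaluates the expression against XML and returns the text of each matched node.
    static List<String> select(String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            NodeList nodes = (NodeList) xpath.evaluate(
                    expression, new InputSource(new StringReader(XML)), XPathConstants.NODESET);
            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getTextContent());
            }
            return texts;
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException("Invalid XPath: " + expression, e);
        }
    }

    public static void main(String[] args) {
        // contains(): substring match on an attribute value
        System.out.println(select("//div[contains(@class, 'post')]"));
        // and / or: boolean combinations inside a predicate
        System.out.println(select("//div[contains(@class, 'post') and contains(@class, 'main')]"));
        System.out.println(select("//div[@class='sidebar' or contains(@class, 'main')]"));
        // "|": union of two node sets
        System.out.println(select("//div[@class='sidebar'] | //p[@id='intro']"));
    }
}
```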

webmagic-0.4.2

03 Dec 15:43

Enhancement:
#45 Remove the multi option from ExtractBy. Whether a field is multi-valued is now detected automatically from the field type.
Bugfix:
#46 Downloader thread sometimes hangs.

webmagic-0.4.1

28 Nov 05:11

More support for AJAX:

  • #39 Parse HTML only after page.getHtml() is called
  • #42 Add jsonpath support in annotation mode
  • #35 Add more HTTP info to Page
  • #41 Add more status-monitoring methods to Spider

webmagic-0.4.0

06 Nov 23:54

Improve performance of Downloader.

  • Update HttpClient to 4.3.1 and rewrite the code of HttpClientDownloader #32.
  • Use gzip by default to reduce the transport cost #31.
  • Enable HTTP Keep-Alive and connection persistence, and fix the incorrect usage of PoolConnectionManager #30.

In my test, the performance of Downloader improved by 90%. Test code: Kr36NewsModel.java.

Add a synchronous API for small tasks #28.

        OOSpider ooSpider = OOSpider.create(Site.me().setSleepTime(100), BaiduBaike.class);
        BaiduBaike baike = ooSpider.<BaiduBaike>get("http://baike.baidu.com/search/word?word=httpclient&pic=1&sug=1&enc=utf8");
        System.out.println(baike);

More configuration options for Site

  • HTTP proxy support via Site.setHttpProxy #22.
  • More HTTP header customization via Site.addHeader #27.
  • Allow disabling gzip via Site.setUseGzip(false).
  • Move Site.addStartUrl to Spider.addUrl, because a start URL is more a property of the Spider than of the Site.

Code refactoring in Spider

  • Refactor the multithreading part of Spider and fix some concurrency problems.
  • Introduce the Google Guava API for simpler code.
  • Allow adding requests that carry more information via Spider.addRequest() instead of addUrl #29.
  • Allow downloading only the start URLs, without following extracted URLs, via Spider.setSpawnUrl(false).

webmagic-0.3.2

23 Sep 05:26
  • #13 Add class casting to the annotation crawler.
  • Allow customizing the class caster by implementing ObjectFormatter and registering it with ObjectFormatters.put().
  • Fix a thread pool rejection exception when calling Spider.stop().

webmagic-0.3.1

08 Sep 15:04
  • #26 Bugfix: Annotation extractor does not work.
  • #25 Bugfix: UrlUtils.canonicalizeUrl does not work with "../../" path.
  • #24 Enhancement: Add stop method to Spider.
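
As a point of reference for the "../../" case in #25, the sketch below shows the resolution result one would expect for such a path, using the JDK's java.net.URI. It does not reproduce UrlUtils.canonicalizeUrl; the class RelativeUrlDemo and the example.com URLs are placeholders for illustration.

```java
import java.net.URI;

// Illustration only: resolves a relative reference with the JDK, not with UrlUtils.
class RelativeUrlDemo {
    // Resolves `relative` against `baseUrl`, collapsing "." and ".." segments.
    static String resolve(String baseUrl, String relative) {
        return URI.create(baseUrl).resolve(relative).normalize().toString();
    }

    public static void main(String[] args) {
        // Each "../" climbs one directory up from /a/b/c/.
        System.out.println(resolve("http://example.com/a/b/c/page.html", "../../index.html"));
    }
}
```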

webmagic-0.3.0

04 Sep 03:02
  • Change default XPath selector from HtmlCleaner to Xsoup.

    Xsoup is an XPath selector based on Jsoup written by me. It has much better performance than HtmlCleaner.

    The time to process a page is reduced from 7~9 ms to 0.4 ms.

    If Xsoup is not stable for your usage, just use Spider.xsoupOff() to turn it off and report an issue to me!

  • Add cycle retry times for Site.

    When cycle retry times is set, Spider puts a URL whose download failed back into the scheduler and retries it after one cycle of the queue.
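
The cycle retry behaviour described above can be sketched with a plain FIFO queue: a failed URL is appended to the tail, so it is attempted again only after the URLs already waiting have had their turn. This is a simplified model under stated assumptions, not WebMagic's code; the class CycleRetryDemo, the crawl method, and the Predicate standing in for the downloader are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.function.Predicate;

// Hypothetical model of cycle retry, not WebMagic's implementation.
class CycleRetryDemo {
    // Runs the queue to exhaustion and returns the URLs in the order they were attempted.
    // `download` stands in for the downloader and returns true on success.
    static List<String> crawl(List<String> startUrls, int cycleRetryTimes,
                              Predicate<String> download) {
        Queue<String> queue = new ArrayDeque<>(startUrls);
        Map<String, Integer> retriesUsed = new HashMap<>();
        List<String> attempts = new ArrayList<>();
        String url;
        while ((url = queue.poll()) != null) {
            attempts.add(url);
            if (download.test(url)) {
                continue; // downloaded successfully
            }
            int used = retriesUsed.getOrDefault(url, 0);
            if (used < cycleRetryTimes) {
                retriesUsed.put(url, used + 1);
                queue.add(url); // tail of the queue: retried after the remaining URLs
            }
            // otherwise the retry budget is spent and the URL is dropped
        }
        return attempts;
    }
}
```

For example, with start URLs a, b, c, two cycle retries, and a downloader that always fails on b, the attempt order is a, b, c, b, b: the first retry of b happens only after c has been processed.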

webmagic-0.2.1

20 Aug 15:51

ComboExtractor support for annotation.

Request priority support (using PriorityScheduler).
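
The idea of a priority-aware scheduler can be sketched with the JDK's PriorityQueue. This is not the code of PriorityScheduler: the class PriorityQueueSketch is hypothetical, and the ordering rule used here (a larger number is polled first, with ties served in insertion order) is an assumption made for the example.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

// Hypothetical sketch of a priority-ordered URL queue, not WebMagic's PriorityScheduler.
class PriorityQueueSketch {
    private static final class Entry {
        final String url;
        final long priority;
        final long sequence; // insertion order, used to break ties

        Entry(String url, long priority, long sequence) {
            this.url = url;
            this.priority = priority;
            this.sequence = sequence;
        }
    }

    // Assumption: higher priority first; equal priorities keep FIFO order.
    private final PriorityQueue<Entry> queue = new PriorityQueue<>(
            Comparator.comparingLong((Entry e) -> e.priority).reversed()
                    .thenComparingLong(e -> e.sequence));
    private long nextSequence = 0;

    synchronized void push(String url, long priority) {
        queue.add(new Entry(url, priority, nextSequence++));
    }

    // Returns the next URL to crawl, or null when the queue is empty.
    synchronized String poll() {
        Entry e = queue.poll();
        return e == null ? null : e.url;
    }
}
```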

Complete some I18n work (comments and documents).

More convenient extractor API:

  • Add attribute name selection to CSSSelector.

  • The capture group of a regex selector can be specified.

  • Add OrSelector.

  • Add Selectors; use import static Selectors.* for a fluent API such as:

    or(regex("<title>(.*)</title>"), xpath("//title"), $("title")).select(s);
    
  • Add JsonPathSelector for JSON parsing.
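
The combination of a regex selector with a chosen capture group and an "or" fallback can be modelled in a few lines of plain Java. The sketch below is not the Selectors API: SelectorSketch, its regex and or methods, and the rule that "or" returns the first non-null result are assumptions made for illustration.

```java
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical model of selector composition, not WebMagic's Selectors class.
class SelectorSketch {
    // A selector maps input text to an extracted string, or null when nothing matches.
    // `group` chooses which capture group of the pattern is returned.
    static Function<String, String> regex(String pattern, int group) {
        Pattern compiled = Pattern.compile(pattern);
        return text -> {
            Matcher m = compiled.matcher(text);
            return m.find() ? m.group(group) : null;
        };
    }

    // Assumption: "or" tries each selector in order and returns the first non-null result.
    @SafeVarargs
    static Function<String, String> or(Function<String, String>... selectors) {
        return text -> {
            for (Function<String, String> selector : selectors) {
                String result = selector.apply(text);
                if (result != null) {
                    return result;
                }
            }
            return null;
        };
    }

    public static void main(String[] args) {
        Function<String, String> title =
                or(regex("<title>(.*)</title>", 1), regex("<h1>(.*)</h1>", 1));
        System.out.println(title.apply("<html><h1>Fallback heading</h1></html>"));
    }
}
```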

version-0.2.0

30 Aug 09:46

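The theme of this release is "convenience" (the previous theme was "flexibility").

Added the webmagic-extension module.

Added annotation support: a crawler can now be written as a POJO plus annotations, which fits Java development conventions better. The following is the complete code for crawling a blog: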

    @TargetUrl("http://my.oschina.net/flashsword/blog/\\d+")
    public class OschinaBlog {

        @ExtractBy("//title")
        private String title;

        @ExtractBy(value = "div.BlogContent",type = ExtractBy.Type.Css)
        private String content;

        @ExtractBy(value = "//div[@class='BlogTags']/a/text()", multi = true)
        private List<String> tags;

        public static void main(String[] args) {
            OOSpider.create(Site.me().addStartUrl("http://my.oschina.net/flashsword/blog"),
            new ConsolePageModelPipeline(), OschinaBlog.class)
            .scheduler(new RedisScheduler("127.0.0.1")).thread(5).run();
        }

    }

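Added a Spider.test(url) method for debugging while developing a crawler.

Added Redis-based distributed support.

Added XPath 2.0 syntax support (webmagic-saxon module).

Added Selenium-based browser rendering support for crawling dynamically loaded content (webmagic-selenium module).

Fixed a bug where HTTPS was not supported.

Updated the documentation: webmagic-0.2.0 user manual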