Facebook and Twitter Scroll down
Alex Osborne edited this page Jul 4, 2018 · 2 revisions
Configuration for capturing the content that is loaded when you scroll past the end of a page on Facebook and Twitter, using the new ExtractorMultipleRegex.
For Twitter, configure two extractors at top level:
<!-- First scroll: on a profile page, read data-max-id and request the next slice of the timeline. -->
<bean id="extractorTwitterScrollOne" class="org.archive.modules.extractor.ExtractorMultipleRegex">
  <property name="enabled" value="false"/>
  <property name="uriRegex" value="^https?://(?:www\.)?twitter\.com/([^/]+)/?(?:\?.*)?$"/>
  <property name="contentRegexes">
    <map>
      <entry key="maxId" value="data-max-id=&quot;(\d+)&quot;"/>
    </map>
  </property>
  <property name="template">
    <value>/i/profiles/show/${uriRegex[1]}/timeline/with_replies?include_available_features=1&amp;include_entities=1&amp;max_id=${maxId[1]}</value>
  </property>
</bean>
<!-- Later scrolls: on a timeline response, read "max_id" and request the following slice. -->
<bean id="extractorTwitterScrollFurther" class="org.archive.modules.extractor.ExtractorMultipleRegex">
  <property name="enabled" value="false"/>
  <property name="uriRegex" value="^https?://(?:www\.)?twitter\.com/i/profiles/show/([^/]+)/timeline/with_replies\?include_available_features=1&amp;include_entities=1&amp;max_id=\d+$"/>
  <property name="contentRegexes">
    <map>
      <entry key="maxId" value="&quot;max_id&quot;:&quot;(\d+)&quot;"/>
    </map>
  </property>
  <property name="template">
    <value>/i/profiles/show/${uriRegex[1]}/timeline/with_replies?include_available_features=1&amp;include_entities=1&amp;max_id=${maxId[1]}</value>
  </property>
</bean>
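To see what the first extractor produces, the matching steps can be sketched outside Heritrix. The Python below is only an illustration: it assumes the extractor requires the whole URI to match `uriRegex` and emits one link per `contentRegexes` match, and the HTML fragment and user name are made up. Heritrix itself uses Java regular expressions and its own template engine; for these particular patterns Python's `re` should behave the same way.

```python
import re

# Patterns copied from the extractorTwitterScrollOne bean.
URI_REGEX = re.compile(r'^https?://(?:www\.)?twitter\.com/([^/]+)/?(?:\?.*)?$')
CONTENT_REGEXES = {'maxId': re.compile(r'data-max-id="(\d+)"')}


def scroll_links(uri, content):
    """Return one relative link per maxId match, or [] if the URI does not match."""
    uri_match = URI_REGEX.fullmatch(uri)
    if uri_match is None:
        return []
    links = []
    for max_id in CONTENT_REGEXES['maxId'].finditer(content):
        # Equivalent of the bean's template, with ${uriRegex[1]} and ${maxId[1]} filled in.
        links.append(
            '/i/profiles/show/' + uri_match.group(1)
            + '/timeline/with_replies'
            + '?include_available_features=1&include_entities=1'
            + '&max_id=' + max_id.group(1)
        )
    return links


# Hypothetical page fragment and user name, made up for illustration.
sample_html = '<div class="stream-container" data-max-id="1234567890">'
links = scroll_links('https://twitter.com/example_user', sample_html)
print(links)
```

The relative link is then resolved against the page URI like any other outlink, which is why the second extractor's `uriRegex` matches the full `twitter.com/i/profiles/show/...` form.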
Insert into fetch chain:
<ref bean="extractorHtml"/>
<ref bean="extractorCss"/>
<ref bean="extractorJs"/>
+ <ref bean="extractorTwitterScrollOne"/>
+ <ref bean="extractorTwitterScrollFurther"/>
</list>
</property>
Use sheets to enable the extractors for the relevant URLs:
<bean id="twitterScrollOne" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="extractorTwitterScrollOne.enabled" value="true"/>
    </map>
  </property>
</bean>
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <value>http://(com,twitter,)/</value>
      <value>http://(com,twitter,www,)/</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>twitterScrollOne</value>
    </list>
  </property>
</bean>
<bean id="twitterScrollFurther" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="extractorTwitterScrollFurther.enabled" value="true"/>
      <entry key="extractorTwitterScrollOne.enabled" value="false"/>
    </map>
  </property>
</bean>
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <value>http://(com,twitter,)/i/profiles/show/</value>
      <value>http://(com,twitter,www,)/i/profiles/show/</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>twitterScrollFurther</value>
    </list>
  </property>
</bean>
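The two associations overlap: a timeline URL under `/i/profiles/show/` matches both the short and the long SURT prefix. The sketch below shows that overlap with a deliberately simplified SURT conversion (scheme forced to `http`, host labels reversed). It is not Heritrix's implementation, and the ordering (shorter prefix first, more specific sheet last) is an assumption, which the explicit `extractorTwitterScrollOne.enabled=false` entry in the second sheet appears to rely on.

```python
from urllib.parse import urlsplit


def to_surt(url):
    """Simplified SURT form: scheme forced to http, host labels reversed."""
    parts = urlsplit(url)
    host = ','.join(reversed(parts.hostname.split('.')))
    path = parts.path or '/'
    if parts.query:
        path += '?' + parts.query
    return 'http://(' + host + ',)' + path


# SURT prefix -> sheet name, copied from the two associations.
ASSOCIATIONS = {
    'http://(com,twitter,)/': 'twitterScrollOne',
    'http://(com,twitter,www,)/': 'twitterScrollOne',
    'http://(com,twitter,)/i/profiles/show/': 'twitterScrollFurther',
    'http://(com,twitter,www,)/i/profiles/show/': 'twitterScrollFurther',
}


def sheets_for(url):
    """Sheets whose prefix matches the URL, shortest prefix first (assumed order)."""
    surt = to_surt(url)
    hits = sorted((p for p in ASSOCIATIONS if surt.startswith(p)), key=len)
    return [ASSOCIATIONS[p] for p in hits]


print(sheets_for('https://twitter.com/example_user'))
print(sheets_for('https://twitter.com/i/profiles/show/example_user/timeline/with_replies?max_id=1'))
```

Under these assumptions a profile page gets only `twitterScrollOne`, while a timeline URL gets both sheets and the second one switches the first extractor back off.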
For Facebook, configure one extractor at top level:
<!-- On a profile page, assemble the timeline pagelet request from three values found in the page. -->
<bean id="extractorFacebookScroll" class="org.archive.modules.extractor.ExtractorMultipleRegex">
  <property name="enabled" value="false"/>
  <property name="uriRegex" value="^https?://(?:www\.)?facebook\.com/[^/?]+$"/>
  <property name="contentRegexes">
    <map>
      <entry key="jsonBlob" value="\[&quot;TimelineContentLoader&quot;,&quot;registerTimePeriod&quot;,[^,]+,[^,]+,[^,]+,\{(&quot;profile_id&quot;:[^}]+)\},false,null,(\d+),"/>
      <entry key="ajaxpipeToken" value="&quot;ajaxpipe_token&quot;:&quot;([^&quot;]+)&quot;"/>
      <entry key="timeCutoff" value="&quot;setTimeCutoff&quot;,[^,]*,\[(\d+)\]\]"/>
    </map>
  </property>
  <property name="template">
    <value>/ajax/pagelet/generic.php/ProfileTimelineSectionPagelet?ajaxpipe=1&amp;ajaxpipe_token=${ajaxpipeToken[1]}&amp;no_script_path=1&amp;data=${java.net.URLEncoder.encode('{' + jsonBlob[1] , 'UTF-8')},"time_cutoff"%3A${java.net.URLEncoder.encode(timeCutoff[1] , 'UTF-8')},"force_no_friend_activity"%3Afalse%7D&amp;__user=0&amp;__a=1&amp;__adt=${jsonBlob[2]}</value>
  </property>
</bean>
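The Facebook template is harder to read than the Twitter ones because it stitches three captured values into one partly URL-encoded `data` parameter. The sketch below rebuilds the same string in Python to make the pieces visible. The page fragment and all values in it are invented for illustration, only the first match of each regex is used (a simplification), and `urllib.parse.quote_plus` stands in for `java.net.URLEncoder.encode`; the two agree for the characters in this sample but are not identical in general.

```python
import re
from urllib.parse import quote_plus

# Patterns copied from the extractorFacebookScroll bean.
CONTENT_REGEXES = {
    'jsonBlob': re.compile(
        r'\["TimelineContentLoader","registerTimePeriod",[^,]+,[^,]+,[^,]+,'
        r'\{("profile_id":[^}]+)\},false,null,(\d+),'),
    'ajaxpipeToken': re.compile(r'"ajaxpipe_token":"([^"]+)"'),
    'timeCutoff': re.compile(r'"setTimeCutoff",[^,]*,\[(\d+)\]\]'),
}


def facebook_scroll_link(content):
    """Build the pagelet link from the first match of each regex, or return None."""
    found = {}
    for name, regex in CONTENT_REGEXES.items():
        match = regex.search(content)
        if match is None:
            return None  # every regex must match before a link is built
        found[name] = match
    data = quote_plus('{' + found['jsonBlob'].group(1))
    cutoff = quote_plus(found['timeCutoff'].group(1))
    return (
        '/ajax/pagelet/generic.php/ProfileTimelineSectionPagelet'
        + '?ajaxpipe=1&ajaxpipe_token=' + found['ajaxpipeToken'].group(1)
        + '&no_script_path=1&data=' + data
        + ',"time_cutoff"%3A' + cutoff
        + ',"force_no_friend_activity"%3Afalse%7D'
        + '&__user=0&__a=1&__adt=' + found['jsonBlob'].group(2)
    )


# Hypothetical page content; every value here is invented.
sample_page = (
    '"ajaxpipe_token":"AXtoken123" '
    '["TimelineContentLoader","registerTimePeriod",{"__elem":1},"recent",null,'
    '{"profile_id":42,"start":0,"end":99},false,null,7, '
    '"setTimeCutoff",[],[2]]'
)
link = facebook_scroll_link(sample_page)
print(link)
```

Note that the template encodes only the captured values; the literal `,"time_cutoff"` and `,"force_no_friend_activity"` pieces are passed through with their quotes and commas as written.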
Insert into fetch chain:
<ref bean="extractorCss"/>
<ref bean="extractorJs"/>
<ref bean="extractorTwitterScrollOne"/>
<ref bean="extractorTwitterScrollFurther"/>
+ <ref bean="extractorFacebookScroll"/>
</list>
</property>
Use a sheet to enable the extractor for the relevant URLs:
<bean id="enableFacebookScroll" class="org.archive.spring.Sheet">
  <property name="map">
    <map>
      <entry key="extractorFacebookScroll.enabled" value="true"/>
    </map>
  </property>
</bean>
<bean class="org.archive.crawler.spring.SurtPrefixesSheetAssociation">
  <property name="surtPrefixes">
    <list>
      <value>http://(com,facebook,</value>
    </list>
  </property>
  <property name="targetSheetNames">
    <list>
      <value>enableFacebookScroll</value>
    </list>
  </property>
</bean>
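Unlike the Twitter prefixes, the Facebook prefix `http://(com,facebook,` is left open, so it covers `facebook.com` and every subdomain; the extractor's own `uriRegex` then narrows things down to top-level profile URLs on `facebook.com` or `www.facebook.com`. The sketch below illustrates that interplay with a simplified SURT conversion (an assumption, not Heritrix's code) and made-up page names.

```python
import re
from urllib.parse import urlsplit

SURT_PREFIX = 'http://(com,facebook,'  # copied from the sheet association
URI_REGEX = re.compile(r'^https?://(?:www\.)?facebook\.com/[^/?]+$')  # from the bean


def to_surt(url):
    """Simplified SURT form: scheme forced to http, host labels reversed."""
    parts = urlsplit(url)
    host = ','.join(reversed(parts.hostname.split('.')))
    return 'http://(' + host + ',)' + (parts.path or '/')


def sheet_applies(url):
    """True if the URL falls under the open-ended Facebook SURT prefix."""
    return to_surt(url).startswith(SURT_PREFIX)


def extractor_runs(url):
    """True only if the sheet enables the extractor and uriRegex matches the URL."""
    return sheet_applies(url) and URI_REGEX.fullmatch(url) is not None


# Hypothetical URLs for illustration.
for url in ('https://www.facebook.com/examplepage',
            'https://m.facebook.com/examplepage',
            'https://www.facebook.com/examplepage/photos'):
    print(url, sheet_applies(url), extractor_runs(url))
```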