googleScholar_spider

This program is based on google_search_module, developed by:

Author: Tan Kok Hua (Guohua tan)

Email: [email protected]

More information can be obtained from: http://wp.me/p4nnkg-1i

Modified by:

Author: David Hine

Email: [email protected]

## General Information

Retrieve Google Scholar results using Python and Scrapy.

The program collects the result links from the Google Scholar main page, and each link is then crawled separately with Scrapy. googleScholar_spider also adds the ability to extract metadata such as author, publisher, and date from pages that follow the Google Scholar or Dublin Core metadata conventions.
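The first step, collecting result links from a search page, can be sketched with the standard library alone. This is only an illustration, not the project's Scrapy spider: the names `ResultLinkParser` and `extract_result_links` are hypothetical, and the `gs_rt` class on result headings is an assumption about Google Scholar's markup that may change.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect the first link inside each <h3 class="gs_rt"> heading.

    The "gs_rt" class name is an assumption about the results markup.
    """

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "h3" and "gs_rt" in classes:
            self._in_title = True
        elif tag == "a" and self._in_title and attr_map.get("href"):
            self.links.append(attr_map["href"])
            self._in_title = False  # keep only the first link per heading

    def handle_endtag(self, tag):
        if tag == "h3":
            self._in_title = False


def extract_result_links(html):
    """Return the result URLs found in a search results page."""
    parser = ResultLinkParser()
    parser.feed(html)
    return parser.links


# Hand-written sample page, not real Google Scholar output.
sample_results = """
<div><h3 class="gs_rt"><a href="http://example.org/paper1">Paper one</a></h3></div>
<div><h3 class="gs_rt"><a href="http://example.org/paper2">Paper two</a></h3></div>
<div><h3 class="other"><a href="http://example.org/ignored">Not a result</a></h3></div>
"""

if __name__ == "__main__":
    for link in extract_result_links(sample_results):
        print(link)
```

Each URL returned this way would then be handed to the spider for the second stage of the crawl.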

Dublin Core Syntax example:

<META NAME="DC.Title" CONTENT="...">

<META NAME="DC.Creator" CONTENT="...">

<META NAME="DC.Subject" CONTENT="...">

<META NAME="DC.Description" CONTENT="...">

<META NAME="DC.Publisher" CONTENT="...">

<META NAME="DC.Contributor" CONTENT="...">

<META NAME="DC.Date" CONTENT="...">

<META NAME="DC.Coverage" CONTENT="...">

<META NAME="DC.Format" CONTENT="...">

<META NAME="DC.Identifier" CONTENT="...">

<META NAME="DC.Source" CONTENT="...">

<META NAME="DC.Language" CONTENT="...">

<META NAME="DC.Type" CONTENT="...">

<META NAME="DC.Relation" CONTENT="...">

<META NAME="DC.Rights" CONTENT="...">
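As a rough illustration of how tags like these can be read, the sketch below pulls every `DC.*` meta tag out of an HTML page using Python's built-in `html.parser`. It is not the code used by the spider; the names `DublinCoreParser` and `extract_dublin_core` and the sample page are hypothetical.

```python
from html.parser import HTMLParser


class DublinCoreParser(HTMLParser):
    """Collect <meta name="DC.*" content="..."> values from an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names,
        # so <META NAME=...> is matched as well.
        if tag != "meta":
            return
        attr_map = {key: (value or "") for key, value in attrs}
        name = attr_map.get("name", "").strip()
        if name.lower().startswith("dc."):
            field = name[3:].strip().lower()
            content = attr_map.get("content", "").strip()
            # A field such as DC.Creator may appear several times.
            self.metadata.setdefault(field, []).append(content)


def extract_dublin_core(html):
    """Return a dict mapping each Dublin Core field to its list of values."""
    parser = DublinCoreParser()
    parser.feed(html)
    return parser.metadata


# Hand-written sample page for illustration only.
sample_html = """
<html><head>
<META NAME="DC.Title" CONTENT="An example article">
<META NAME="DC.Creator" CONTENT="First Author">
<META NAME="DC.Creator" CONTENT="Second Author">
<META NAME="DC.Date" CONTENT="2006-08-01">
<META NAME="keywords" CONTENT="ignored">
</head><body></body></html>
"""

if __name__ == "__main__":
    for field, values in extract_dublin_core(sample_html).items():
        print(field, values)
```

Values are stored as lists because repeatable fields, such as multiple creators, are common in practice.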

## Installation & Configuration

The script depends on Scrapy and PyYAML (used for Unicode handling). Both can be installed with pip.

pip install Scrapy

pip search yaml

pip install pyyaml

Alternatively, PyYAML can be installed from your Linux distribution's repositories:

$ sudo apt-get install python-yaml

$ sudo yum install python-yaml
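After installing, a quick check like the following can confirm that both dependencies are importable. Note that the PyYAML package is imported as `yaml`. The helper name `check_dependencies` is hypothetical and not part of the project.

```python
import importlib.util


def check_dependencies(modules=("scrapy", "yaml")):
    """Return a dict mapping each module name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in modules}


if __name__ == "__main__":
    for name, available in check_dependencies().items():
        print(name, "OK" if available else "MISSING")
```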

The code is divided into two parts. Google_Scholar.py is the main script to run. Get_google_link_results.py is the Scrapy spider, which crawls either the Google Scholar search page or the individual websites.

## Support & Documentation

Scrapy:

http://doc.scrapy.org/en/0.24/

Dublin Core:

http://scielo.sld.cu/scielo.php?script=sci_arttext&pid=S1024-94352006000400009
