avagreyyy4/pagerank_project

Pagerank Project

In this project, you will create a simple search engine for the website https://www.lawfareblog.com. This website provides legal analysis on US national security issues. You will use pagerank to return only the most important results from this website in your search engine.

Due date: Sunday, 29 September at midnight

Late Policy: You lose $2^{(i-1)}$ points, where i is the number of days late.

Collaboration Policy: Do whatever will help you learn, but be an adult. You may talk to other students and use Google/ChatGPT. Recall that you will have an in-person oral exam on this material and the exam is worth many more points. The main purpose of this project is to help prepare you for the exam.

Background

Data:

The data folder contains two files that store example "web graphs". The file small.csv.gz contains the example graph from the Deeper Inside Pagerank paper. This is a small graph, so we can manually inspect the contents of this file with the following command:

$ zcat data/small.csv.gz
source,target
1,2
1,3
3,1
3,2
3,5
4,5
4,6
5,6
5,4
6,4

Recall: The cat terminal command outputs the contents of a file to stdout, and the zcat command first decompresses a gzipped file and then outputs the decompressed contents.

In python, we can use the built-in gzip module to access gzipped files. The following python code is equivalent to the bash code above:

>>> import gzip
>>> fin = gzip.open('data/small.csv.gz', mode='rt')
>>> print(fin.read())
source,target
1,2
1,3
3,1
3,2
3,5
4,5
4,6
5,6
5,4
6,4

There are many terminal commands throughout these instructions. If you haven't used the terminal before, and so these commands are unfamiliar, that's okay. I'd be happy to explain them in office hours, or there are many tutors in the QCL available who can help. (There are no tutors for this class specifically, but anyone who has taken CSCI046 or CSCI133 with me will be able to help with the terminal.)

Furthermore, you don't "need" to understand the terminal commands in detail, since you are not required to run these commands or to create your own. The important part is to understand the English language description of what the commands are doing, and to understand that this is just how I computed what the English language text is describing.

As you can see, the graph is stored as a CSV file. The first line is a header, and each subsequent line stores a single edge in the graph. The first column contains the source node of the edge and the second column the target node. The file is assumed to be sorted alphabetically.
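The edge-list format described above can be read with the standard library alone. The following is a minimal sketch, not the loader used in pagerank.py: the helper name `read_edges` is hypothetical, and the example builds a tiny gzipped CSV in memory (the first three edges of small.csv.gz) so that it runs without the data folder.

```python
import csv
import gzip
import io


def read_edges(fileobj):
    """Return a list of (source, target) pairs from an open CSV text stream."""
    reader = csv.DictReader(fileobj)  # the first line is consumed as the header
    return [(row['source'], row['target']) for row in reader]


# Build a gzipped copy of the first few lines of the small graph in memory.
raw = 'source,target\n1,2\n1,3\n3,1\n'.encode('utf-8')
compressed = io.BytesIO(gzip.compress(raw))

# gzip.open accepts a file object as well as a path.
with gzip.open(compressed, mode='rt', newline='') as fin:
    edges = read_edges(fin)

print(edges)
```

Note that the node names are read as strings, which is what we want for the lawfare graph, where the node names are URLs.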

The second data file lawfareblog.csv.gz contains the link structure for the lawfare blog. Let's take a look at the first 10 of these lines:

$ zcat data/lawfareblog.csv.gz | head
source,target
www.lawfareblog.com/,www.lawfareblog.com/topic/interrogation
www.lawfareblog.com/,www.lawfareblog.com/upcoming-events
www.lawfareblog.com/,www.lawfareblog.com/
www.lawfareblog.com/,www.lawfareblog.com/our-comments-policy
www.lawfareblog.com/,www.lawfareblog.com/litigation-documents-related-appointment-matthew-whitaker-acting-attorney-general
www.lawfareblog.com/,www.lawfareblog.com/topic/lawfare-research-paper-series
www.lawfareblog.com/,www.lawfareblog.com/topic/book-reviews
www.lawfareblog.com/,www.lawfareblog.com/documents-related-mueller-investigation
www.lawfareblog.com/,www.lawfareblog.com/topic/international-law-loac

You can see that in this file, the node names are URLs. Semantically, each line corresponds to an HTML <a> tag that is contained in the source webpage and links to the target webpage.

We can use the following command to count the total number of links in the file:

$ zcat data/lawfareblog.csv.gz | wc -l
1610789

Since every link corresponds to a non-zero entry in the $P$ matrix, this is also the value of $\text{nnz}(P)$. (Technically, we should subtract 1 from this value since the wc -l command also counts the header line, not just the data lines.)

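To get the dimensions of $P$, we need to count the total number of nodes in the graph. The following command achieves this by: decompressing the file, extracting the first column, removing duplicate lines, then counting the results. (The uniq command only removes adjacent duplicates, which is sufficient here because the file is sorted. As with the previous command, the count includes the header line.)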

$ zcat data/lawfareblog.csv.gz | cut -f1 -d, | uniq | wc -l
25761

This matrix is large enough that computing matrix products for dense matrices takes several minutes on a single CPU. Fortunately, however, the matrix is very sparse. The following python code computes the fraction of entries in the matrix with non-zero values:

>>> 1610788 / (25760**2)
0.0024274297384360172

Thus, by using sparse matrix operations, we will be able to speed up the code significantly.
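The same density calculation can be checked by hand on the small example graph. This is only an illustrative sketch: the variable names are my own, and the dictionary of out-links is one possible sparse representation, not necessarily the one used in pagerank.py. Unlike the shell pipeline above, the sketch collects nodes from both columns, so node 2 (which has no out-links) is counted.

```python
# Edges of the small example graph (source, target).
edges = [(1, 2), (1, 3), (3, 1), (3, 2), (3, 5),
         (4, 5), (4, 6), (5, 6), (5, 4), (6, 4)]

# Every node that appears as a source or as a target.
nodes = {node for edge in edges for node in edge}

n = len(nodes)        # P is an n x n matrix
nnz = len(edges)      # one non-zero entry per distinct link
density = nnz / n**2  # fraction of entries that are non-zero

# Sparse storage: record only the links that exist.
out_links = {node: [] for node in nodes}
for source, target in edges:
    out_links[source].append(target)

print(n, nnz, round(density, 4))
```

For the small graph more than a quarter of the entries are non-zero, so sparsity buys little; for the lawfare graph the fraction is about 0.24%, which is where sparse operations pay off.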

Code:

The pagerank.py file contains code for loading the graph CSV files and searching through their nodes for key phrases. For example, you can perform a search for all nodes (i.e. urls) that mention the string corona with the following command:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose --search_query=corona

NOTE: It will take about 10 seconds to load and parse the data files. All the other computation happens essentially instantly.

Currently, the pagerank of the nodes is not being calculated correctly, and so the webpages are returned in an arbitrary order. Your task in this assignment will be to fix these calculations in order to have the most important results (i.e. highest pagerank results) returned first.

Task 1: the power method

Implement the WebGraph.power_method function in pagerank.py for computing the pagerank vector by fixing the FIXME: Task 1 annotation.

NOTE: The power method is the only data mining algorithm you will implement in class. You are implementing it because there are no standard library implementations available. Why?

  1. The runtime is heavily dependent on the data structures used to store the graph data. Different applications will need to use different data structures.
  2. It is "trivial" to implement. My solution to this homework is <10 lines of code.
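To make the first point concrete, here is a rough pure-Python sketch of the power method on a graph stored as a dictionary of out-links. It is not the solution to Task 1 and does not use the data structures in pagerank.py. The function signature, the default `alpha=0.85`, the stopping threshold `epsilon`, the uniform personalization vector, and the choice to normalize the vector to sum to 1 are all assumptions made for illustration, so the values are not expected to match the sample output in this document.

```python
def power_method(out_links, alpha=0.85, epsilon=1e-6, max_iterations=1000):
    """Approximate the pagerank vector of a graph given as {node: [targets]}."""
    nodes = sorted(out_links)
    n = len(nodes)
    v = {node: 1.0 / n for node in nodes}  # uniform personalization vector
    x = dict(v)                            # starting guess

    for _ in range(max_iterations):
        # Rank sitting on dangling nodes (no out-links) is spread uniformly.
        dangling = sum(x[node] for node in nodes if not out_links[node])
        x_new = {node: alpha * dangling / n + (1 - alpha) * v[node]
                 for node in nodes}
        # Every other node shares alpha * (its rank) equally among its targets.
        for source in nodes:
            targets = out_links[source]
            for target in targets:
                x_new[target] += alpha * x[source] / len(targets)

        residual = sum(abs(x_new[node] - x[node]) for node in nodes)
        x = x_new
        if residual < epsilon:
            break
    return x


# The small example graph from data/small.csv.gz; node 2 has no out-links.
small = {1: [2, 3], 2: [], 3: [1, 2, 5], 4: [5, 6], 5: [6, 4], 6: [4]}
pagerank = power_method(small)
ranking = sorted(pagerank, key=pagerank.get, reverse=True)
print(ranking)  # [4, 6, 5, 2, 3, 1]
```

A dictionary of Python lists is easy to read but slow for a graph with over a million edges, which is why the choice of data structure matters so much in practice.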

Part 1:

To check that your implementation is working, you should run the program on the data/small.csv.gz graph. For my implementation, I get the following output.

$ python3 pagerank.py --data=data/small.csv.gz --verbose
DEBUG:root:computing indices
DEBUG:root:computing values
DEBUG:root:i=0 residual=2.5629e-01
DEBUG:root:i=1 residual=1.1841e-01
DEBUG:root:i=2 residual=7.0701e-02
DEBUG:root:i=3 residual=3.1815e-02
DEBUG:root:i=4 residual=2.0497e-02
DEBUG:root:i=5 residual=1.0108e-02
DEBUG:root:i=6 residual=6.3716e-03
DEBUG:root:i=7 residual=3.4228e-03
DEBUG:root:i=8 residual=2.0879e-03
DEBUG:root:i=9 residual=1.1750e-03
DEBUG:root:i=10 residual=7.0131e-04
DEBUG:root:i=11 residual=4.0321e-04
DEBUG:root:i=12 residual=2.3800e-04
DEBUG:root:i=13 residual=1.3812e-04
DEBUG:root:i=14 residual=8.1083e-05
DEBUG:root:i=15 residual=4.7251e-05
DEBUG:root:i=16 residual=2.7704e-05
DEBUG:root:i=17 residual=1.6164e-05
DEBUG:root:i=18 residual=9.4778e-06
DEBUG:root:i=19 residual=5.5066e-06
DEBUG:root:i=20 residual=3.2042e-06
DEBUG:root:i=21 residual=1.8612e-06
DEBUG:root:i=22 residual=1.1283e-06
DEBUG:root:i=23 residual=6.1907e-07
INFO:root:rank=0 pagerank=6.6270e-01 url=4
INFO:root:rank=1 pagerank=5.2179e-01 url=6
INFO:root:rank=2 pagerank=4.1434e-01 url=5
INFO:root:rank=3 pagerank=2.3175e-01 url=2
INFO:root:rank=4 pagerank=1.8590e-01 url=3
INFO:root:rank=5 pagerank=1.6917e-01 url=1

Yours likely won't be identical (due to minor implementation details and weird floating point issues), but it should be similar. In particular, the nodes/urls should be ranked in the same order.

NOTE: The --verbose flag causes all of the lines beginning with DEBUG to be printed. By default, only lines beginning with INFO are printed.

NOTE: There are no automated test cases to pass for this assignment. Test cases for algorithms involving floating point computations are hard to write and understand. Minor-seeming implementation details can have large impacts on the final result. These software engineering issues are beyond the scope of this class.

Instructions for how I will grade your homework are contained in the submission section at the end of this document.

Part 2:

The pagerank.py file has an option --search_query, which takes a string as a parameter. If this argument is used, then the program returns all nodes that match the query string sorted according to their pagerank. Essentially, this gives us the most important pages related to our query.
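The filter-then-sort behavior described above can be sketched in a few lines. This is an illustration only: the function name `search`, the plain substring match, and the toy URLs and scores are all invented here and are not taken from pagerank.py or from the lawfare data.

```python
def search(pageranks, query, max_results=10):
    """Return up to max_results (url, score) pairs whose url contains query."""
    matches = [(url, score) for url, score in pageranks.items() if query in url]
    matches.sort(key=lambda pair: pair[1], reverse=True)  # highest pagerank first
    return matches[:max_results]


# Toy scores, invented purely for illustration.
toy_pageranks = {
    'example.com/iran-sanctions': 0.30,
    'example.com/about': 0.50,
    'example.com/iran-deal': 0.20,
}

results = search(toy_pageranks, 'iran')
for rank, (url, score) in enumerate(results):
    print(f'rank={rank} pagerank={score:.4e} url={url}')
```

The page with the highest score overall (`example.com/about`) is absent from the results because it does not match the query; the pagerank only orders the pages that do match.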

Again, you may not get the exact same results as me, but you should get similar results to the examples I've shown below. Verify that you do in fact get similar results.

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --search_query='corona'
INFO:root:rank=0 pagerank=1.0038e-03 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=1 pagerank=8.9224e-04 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
INFO:root:rank=2 pagerank=7.0390e-04 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=3 pagerank=6.9153e-04 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=4 pagerank=6.7041e-04 url=www.lawfareblog.com/israeli-emergency-regulations-location-tracking-coronavirus-carriers
INFO:root:rank=5 pagerank=6.6256e-04 url=www.lawfareblog.com/why-congress-conducting-business-usual-face-coronavirus
INFO:root:rank=6 pagerank=6.5046e-04 url=www.lawfareblog.com/congressional-homeland-security-committees-seek-ways-support-state-federal-responses-coronavirus
INFO:root:rank=7 pagerank=6.3620e-04 url=www.lawfareblog.com/paper-hearing-experts-debate-digital-contact-tracing-and-coronavirus-privacy-concerns
INFO:root:rank=8 pagerank=6.1248e-04 url=www.lawfareblog.com/house-subcommittee-voices-concerns-over-us-management-coronavirus
INFO:root:rank=9 pagerank=6.0187e-04 url=www.lawfareblog.com/livestream-house-oversight-committee-holds-hearing-government-coronavirus-response

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --search_query='trump'
INFO:root:rank=0 pagerank=5.7826e-03 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=5.2338e-03 url=www.lawfareblog.com/document-trump-revokes-obama-executive-order-counterterrorism-strike-casualty-reporting
INFO:root:rank=2 pagerank=5.1297e-03 url=www.lawfareblog.com/trump-administrations-worrying-new-policy-israeli-settlements
INFO:root:rank=3 pagerank=4.6599e-03 url=www.lawfareblog.com/dc-circuit-overrules-district-courts-due-process-ruling-qasim-v-trump
INFO:root:rank=4 pagerank=4.5934e-03 url=www.lawfareblog.com/donald-trump-and-politically-weaponized-executive-branch
INFO:root:rank=5 pagerank=4.3071e-03 url=www.lawfareblog.com/how-trumps-approach-middle-east-ignores-past-future-and-human-condition
INFO:root:rank=6 pagerank=4.0935e-03 url=www.lawfareblog.com/why-trump-cant-buy-greenland
INFO:root:rank=7 pagerank=3.7591e-03 url=www.lawfareblog.com/oral-argument-summary-qassim-v-trump
INFO:root:rank=8 pagerank=3.4509e-03 url=www.lawfareblog.com/dc-circuit-court-denies-trump-rehearing-mazars-case
INFO:root:rank=9 pagerank=3.4484e-03 url=www.lawfareblog.com/second-circuit-rules-mazars-must-hand-over-trump-tax-returns-new-york-prosecutors

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --search_query='iran'
INFO:root:rank=0 pagerank=4.5746e-03 url=www.lawfareblog.com/praise-presidents-iran-tweets
INFO:root:rank=1 pagerank=4.4174e-03 url=www.lawfareblog.com/how-us-iran-tensions-could-disrupt-iraqs-fragile-peace
INFO:root:rank=2 pagerank=2.6928e-03 url=www.lawfareblog.com/cyber-command-operational-update-clarifying-june-2019-iran-operation
INFO:root:rank=3 pagerank=1.9391e-03 url=www.lawfareblog.com/aborted-iran-strike-fine-line-between-necessity-and-revenge
INFO:root:rank=4 pagerank=1.5452e-03 url=www.lawfareblog.com/parsing-state-departments-letter-use-force-against-iran
INFO:root:rank=5 pagerank=1.5357e-03 url=www.lawfareblog.com/iranian-hostage-crisis-and-its-effect-american-politics
INFO:root:rank=6 pagerank=1.5258e-03 url=www.lawfareblog.com/announcing-united-states-and-use-force-against-iran-new-lawfare-e-book
INFO:root:rank=7 pagerank=1.4221e-03 url=www.lawfareblog.com/us-names-iranian-revolutionary-guard-terrorist-organization-and-sanctions-international-criminal
INFO:root:rank=8 pagerank=1.1788e-03 url=www.lawfareblog.com/iran-shoots-down-us-drone-domestic-and-international-legal-implications
INFO:root:rank=9 pagerank=1.1463e-03 url=www.lawfareblog.com/israel-iran-syria-clash-and-law-use-force

Part 3:

The webgraph of lawfareblog.com (i.e. the $P$ matrix) naturally contains a lot of structure. For example, essentially all pages on the domain have links to the root page https://lawfareblog.com/ and other "non-article" pages like https://www.lawfareblog.com/topics and https://www.lawfareblog.com/subscribe-lawfare. These pages therefore have a large pagerank. We can get a list of the pages with the largest pagerank by running

$ python3 pagerank.py --data=data/lawfareblog.csv.gz
INFO:root:rank=0 pagerank=2.8741e-01 url=www.lawfareblog.com/lawfare-job-board
INFO:root:rank=1 pagerank=2.8741e-01 url=www.lawfareblog.com/masthead
INFO:root:rank=2 pagerank=2.8741e-01 url=www.lawfareblog.com/litigation-documents-related-appointment-matthew-whitaker-acting-attorney-general
INFO:root:rank=3 pagerank=2.8741e-01 url=www.lawfareblog.com/documents-related-mueller-investigation
INFO:root:rank=4 pagerank=2.8741e-01 url=www.lawfareblog.com/topics
INFO:root:rank=5 pagerank=2.8741e-01 url=www.lawfareblog.com/about-lawfare-brief-history-term-and-site
INFO:root:rank=6 pagerank=2.8741e-01 url=www.lawfareblog.com/snowden-revelations
INFO:root:rank=7 pagerank=2.8741e-01 url=www.lawfareblog.com/support-lawfare
INFO:root:rank=8 pagerank=2.8741e-01 url=www.lawfareblog.com/upcoming-events
INFO:root:rank=9 pagerank=2.8741e-01 url=www.lawfareblog.com/our-comments-policy

Most of these pages are not very interesting, however, because they are not articles, and usually when we are performing a web search, we only want articles.

This raises the question: How can we find the most important articles while filtering out the non-article pages? The answer is to modify the $P$ matrix by removing all links to non-article pages.

This raises another question: How do we know if a link points to a non-article page? Unfortunately, this is a hard question to answer with 100% accuracy, but there are many methods that get us most of the way there. One easy-to-implement method is to compute what's called the "in-link ratio" of each node (i.e. the total number of edges with the node as a target divided by the total number of nodes), and then remove nodes from the search results with too high a ratio. The intuition is that non-article pages often appear in the menu of a webpage, and so have links from almost all of the other webpages; but article webpages are unlikely to appear on a menu and so will only have a small number of links from other webpages. The --filter_ratio parameter causes the code to remove all pages that have an in-link ratio larger than the provided value.
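As a small worked example of the in-link ratio, the sketch below computes it for the small example graph and keeps only the nodes at or below a threshold. The function name `in_link_ratios` and the threshold of 0.3 are assumptions chosen for illustration; how pagerank.py applies --filter_ratio internally may differ.

```python
from collections import Counter


def in_link_ratios(edges, nodes):
    """Map each node to (number of edges targeting it) / (number of nodes)."""
    counts = Counter(target for _, target in edges)
    return {node: counts[node] / len(nodes) for node in nodes}


# Edges of the small example graph (source, target).
edges = [(1, 2), (1, 3), (3, 1), (3, 2), (3, 5),
         (4, 5), (4, 6), (5, 6), (5, 4), (6, 4)]
nodes = sorted({node for edge in edges for node in edge})

ratios = in_link_ratios(edges, nodes)

filter_ratio = 0.3  # hypothetical threshold for this toy graph
kept = [node for node in nodes if ratios[node] <= filter_ratio]
print(kept)  # [1, 3]
```

Nodes 2, 4, 5, and 6 each have two in-links (a ratio of 2/6), so they are removed; nodes 1 and 3 have one in-link each (a ratio of 1/6) and survive.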

Using this option, we can estimate the most important articles on the domain with the following command:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2
INFO:root:rank=0 pagerank=3.4696e-01 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=2.9521e-01 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=2.9040e-01 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=1.5179e-01 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=5 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=6 pagerank=1.5071e-01 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=1.4957e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=1.4367e-01 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=1.4240e-01 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull

Notice that the urls in this list look much more like articles than the urls in the previous list.

When Google calculates their $P$ matrix for the web, they use a similar (but much more complicated) process to modify the $P$ matrix in order to reduce spam results. The exact formula they use is a jealously guarded secret that they update continuously.

In the case above, notice that we have accidentally removed the blog's most popular article (https://www.lawfareblog.com/snowden-revelations). The blog editors believed that Snowden's revelations about NSA spying are so important that they directly put a link to the article on the menu. So every single webpage in the domain links to the Snowden article, and our "anti-spam" --filter_ratio argument removed this article from the list. In general, it is a challenging open problem to remove spam from pagerank results, and all current solutions rely on careful human tuning and still have lots of false positives and false negatives.

Part 4:

Recall from the reading that the runtime of pagerank depends heavily on the eigengap of the $\bar{\bar P}$ matrix, and that this eigengap is bounded by the alpha parameter.

Run the following four commands:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose 
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose --alpha=0.99999
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose --filter_ratio=0.2
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose --filter_ratio=0.2 --alpha=0.99999

You should notice that the last command takes considerably more iterations to compute the pagerank vector. (My code takes 685 iterations for this call, and about 10 iterations for all the others.)

This raises the question: Why does the second command (with the --alpha option but without the --filter_ratio option) not take a long time to run? The answer is that the $P$ graph for https://www.lawfareblog.com naturally has a large eigengap and so is fast to compute for all alpha values, but the modified graph does not have a large eigengap and so requires a small alpha for fast convergence.
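To get a feel for how alpha limits the worst case, recall that the second-largest eigenvalue is at most alpha in magnitude, so the error shrinks by at least a factor of alpha per iteration. The sketch below computes the smallest k with alpha^k <= epsilon. The function name and the tolerance `epsilon=1e-6` are assumptions for illustration; this is an upper bound on the iteration count, and a graph with a large eigengap converges far faster.

```python
import math


def worst_case_iterations(alpha, epsilon=1e-6):
    """Smallest k such that alpha**k <= epsilon."""
    return math.ceil(math.log(epsilon) / math.log(alpha))


for alpha in (0.5, 0.85, 0.99999):
    print(f'alpha={alpha} worst-case iterations={worst_case_iterations(alpha)}')
```

With alpha=0.99999 the bound is over a million iterations, which shows how much room there is between the worst case and the roughly 10 iterations observed on the unmodified graph.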

Changing the value of alpha also gives us very different pagerank rankings. For example,

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2
INFO:root:rank=0 pagerank=3.4696e-01 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=2.9521e-01 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=2.9040e-01 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=1.5179e-01 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=5 pagerank=1.5099e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=6 pagerank=1.5071e-01 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=1.4957e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=1.4367e-01 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=1.4240e-01 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2 --alpha=0.99999
INFO:root:rank=0 pagerank=7.0149e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=7.0149e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.0552e-01 url=www.lawfareblog.com/cost-using-zero-days
INFO:root:rank=3 pagerank=3.1755e-02 url=www.lawfareblog.com/lawfare-podcast-former-congressman-brian-baird-and-daniel-schuman-how-congress-can-continue-function
INFO:root:rank=4 pagerank=2.2040e-02 url=www.lawfareblog.com/events
INFO:root:rank=5 pagerank=1.6027e-02 url=www.lawfareblog.com/water-wars-increased-us-focus-indo-pacific
INFO:root:rank=6 pagerank=1.6026e-02 url=www.lawfareblog.com/water-wars-drill-maybe-drill
INFO:root:rank=7 pagerank=1.6023e-02 url=www.lawfareblog.com/water-wars-disjointed-operations-south-china-sea
INFO:root:rank=8 pagerank=1.6020e-02 url=www.lawfareblog.com/water-wars-song-oil-and-fire
INFO:root:rank=9 pagerank=1.6020e-02 url=www.lawfareblog.com/water-wars-sinking-feeling-philippine-china-relations

Which of these rankings is better is entirely subjective, and the only way to know if you have the "best" alpha for your application is to try several variations and see what is best.

NOTE: It should be "obvious" to you that large alpha values imply that the structure of the webgraph has more influence on the final result, and small alpha values ignore the structure of the webgraph. Recall that the word "obvious" means that it follows directly from the definition, but you may still need to sit and meditate on the definition for a long period of time.

If large alphas are good for your application, you can see that there is a trade-off between quality answers and algorithmic runtime. We'll be exploring this trade-off more formally in class over the rest of the semester.

Task 2: the personalization vector

The most interesting applications of pagerank involve the personalization vector. Implement the WebGraph.make_personalization_vector function so that it outputs a personalization vector tuned for the input query. The pseudocode for the function is:

for each index in the personalization vector:
    get the url for the index (see the _index_to_url function)
    check if the url satisfies the input query (see the url_satisfies_query function)
    if so, set the corresponding index to one
normalize the vector
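The pseudocode above might look roughly like the following standalone sketch. It does not use the WebGraph class: `satisfies_query` is a hypothetical stand-in for url_satisfies_query (here assumed to require every term as a substring, with a leading `-` meaning the term must be absent), the list of URLs replaces the _index_to_url lookup, and normalizing so the entries sum to 1 and falling back to a uniform vector when nothing matches are both assumptions.

```python
def satisfies_query(url, query):
    """Hypothetical matcher: all plain terms present, all '-' terms absent."""
    for term in query.split():
        if term.startswith('-'):
            if term[1:] in url:
                return False
        elif term not in url:
            return False
    return True


def make_personalization_vector(urls, query):
    """Return a list with equal weight on every url that satisfies the query."""
    v = [1.0 if satisfies_query(url, query) else 0.0 for url in urls]
    total = sum(v)
    if total == 0:
        # Assumption: with no matches, fall back to the uniform vector.
        return [1.0 / len(urls)] * len(urls)
    return [value / total for value in v]  # normalize so the entries sum to 1


# Toy URLs, invented for illustration.
urls = ['example.com/coronavirus-response',
        'example.com/about',
        'example.com/my-corona-edition']
vector = make_personalization_vector(urls, 'corona')
print(vector)  # [0.5, 0.0, 0.5]
```

All of the teleportation probability is placed on pages that match the query, which is what makes the resulting pagerank vector specific to that query.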

Part 1:

The command line argument --personalization_vector_query will use the function you created above to augment your search with a custom personalization vector. If you've implemented the function correctly, you should get results similar to:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query='corona'
INFO:root:rank=0 pagerank=6.3127e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=6.3124e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.5947e-01 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=1.2209e-01 url=www.lawfareblog.com/brexit-not-immune-coronavirus
INFO:root:rank=4 pagerank=1.2209e-01 url=www.lawfareblog.com/rational-security-my-corona-edition
INFO:root:rank=5 pagerank=9.3360e-02 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=6 pagerank=9.1920e-02 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=7 pagerank=9.1920e-02 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=8 pagerank=7.7770e-02 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=9 pagerank=7.2888e-02 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response

Notice that these results are significantly different than when using the --search_query option:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2 --search_query='corona'
INFO:root:rank=0 pagerank=8.1320e-03 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
INFO:root:rank=1 pagerank=7.7908e-03 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=2 pagerank=5.2262e-03 url=www.lawfareblog.com/livestream-house-oversight-committee-holds-hearing-government-coronavirus-response
INFO:root:rank=3 pagerank=3.9584e-03 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=4 pagerank=3.8114e-03 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=5 pagerank=3.3973e-03 url=www.lawfareblog.com/paper-hearing-experts-debate-digital-contact-tracing-and-coronavirus-privacy-concerns
INFO:root:rank=6 pagerank=3.3633e-03 url=www.lawfareblog.com/cyberlaw-podcast-how-israel-fighting-coronavirus
INFO:root:rank=7 pagerank=3.3557e-03 url=www.lawfareblog.com/israeli-emergency-regulations-location-tracking-coronavirus-carriers
INFO:root:rank=8 pagerank=3.2160e-03 url=www.lawfareblog.com/congress-needs-coronavirus-failsafe-its-too-late
INFO:root:rank=9 pagerank=3.1036e-03 url=www.lawfareblog.com/why-congress-conducting-business-usual-face-coronavirus

Which results are better? Again, that depends on what you mean by "better." With the --personalization_vector_query option, a webpage is important only if other coronavirus webpages also think it's important; with the --search_query option, a webpage is important if any other webpage thinks it's important. You'll notice that in the latter example, many of the webpages are about Congressional proceedings related to the coronavirus. From a strictly coronavirus perspective, these are not very important webpages. But in the broader context of national security, these are very important webpages.

Google engineers spend TONs of time fine-tuning their pagerank personalization vectors to remove spam webpages. Exactly how they do this is another one of their secrets that they don't publicly talk about.

Part 2:

Another use of the --personalization_vector_query option is that we can find out what webpages are related to the coronavirus but don't directly mention the coronavirus. This can be used to map out what types of topics are similar to the coronavirus.

For example, the following query ranks all webpages by their corona importance, but removes webpages mentioning corona from the results.

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query='corona' --search_query='-corona'
INFO:root:rank=0 pagerank=6.3127e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=6.3124e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.5947e-01 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=9.3360e-02 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=4 pagerank=7.0277e-02 url=www.lawfareblog.com/fault-lines-foreign-policy-quarantined
INFO:root:rank=5 pagerank=6.9713e-02 url=www.lawfareblog.com/lawfare-podcast-mom-and-dad-talk-clinical-trials-pandemic
INFO:root:rank=6 pagerank=6.4944e-02 url=www.lawfareblog.com/limits-world-health-organization
INFO:root:rank=7 pagerank=5.9492e-02 url=www.lawfareblog.com/chinatalk-dispatches-shanghai-beijing-and-hong-kong
INFO:root:rank=8 pagerank=5.1245e-02 url=www.lawfareblog.com/us-moves-dismiss-case-against-company-linked-ira-troll-farm
INFO:root:rank=9 pagerank=5.1245e-02 url=www.lawfareblog.com/livestream-house-armed-services-holds-hearing-national-security-challenges-north-and-south-america

You can see that there are many urls about concepts that are obviously related like "covid", "clinical trials", and "quarantine", but this algorithm also finds articles about Chinese propaganda and Trump's policy decisions. Both of these articles are highly relevant to coronavirus discussions, but a simple keyword search for corona or related terms would not find these articles. The vast majority of industry data mining work is finding clever uses of standard algorithms.

Submission

  1. Create a new repo on github (not a fork of this repo). Ensure that all of the project files are copied from this folder into your new repo.

  2. As you complete the tasks above: Run the corresponding commands below, and paste their output into the code blocks. Please ensure correct markdown formatting.

    Task 1, part 1:

    $ python3 pagerank.py --data=data/small.csv.gz --verbose
    
    DEBUG:root:computing indices
    DEBUG:root:computing values
    DEBUG:root:i=0 residual=0.6277832984924316
    DEBUG:root:i=1 residual=0.11841226369142532
    DEBUG:root:i=2 residual=0.07070129364728928
    DEBUG:root:i=3 residual=0.03181539848446846
    DEBUG:root:i=4 residual=0.020496614277362823
    DEBUG:root:i=5 residual=0.01010835450142622
    DEBUG:root:i=6 residual=0.006371526513248682
    DEBUG:root:i=7 residual=0.0034228116273880005
    DEBUG:root:i=8 residual=0.002087965374812484
    DEBUG:root:i=9 residual=0.0011749409604817629
    DEBUG:root:i=10 residual=0.0007013162248767912
    DEBUG:root:i=11 residual=0.00040323587018065155
    DEBUG:root:i=12 residual=0.00023796527239028364
    DEBUG:root:i=13 residual=0.00013811791723128408
    DEBUG:root:i=14 residual=8.111781789921224e-05
    DEBUG:root:i=15 residual=4.723723031929694e-05
    DEBUG:root:i=16 residual=2.7683758162311278e-05
    DEBUG:root:i=17 residual=1.6175998098333366e-05
    DEBUG:root:i=18 residual=9.440776921110228e-06
    DEBUG:root:i=19 residual=5.51695256945095e-06
    DEBUG:root:i=20 residual=3.198200147380703e-06
    DEBUG:root:i=21 residual=1.911825393108302e-06
    DEBUG:root:i=22 residual=1.128756821344723e-06
    DEBUG:root:i=23 residual=6.653998525507632e-07
    INFO:root:rank=0 pagerank=6.6270e-01 url=4
    INFO:root:rank=1 pagerank=5.2179e-01 url=6
    INFO:root:rank=2 pagerank=4.1434e-01 url=5
    INFO:root:rank=3 pagerank=2.3175e-01 url=2
    INFO:root:rank=4 pagerank=1.8590e-01 url=3
    INFO:root:rank=5 pagerank=1.6917e-01 url=1
    

    Task 1, part 2:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --search_query='corona'

INFO:root:rank=0 pagerank=1.0038e-03 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=1 pagerank=8.9231e-04 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response
INFO:root:rank=2 pagerank=7.0396e-04 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=3 pagerank=6.9159e-04 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=4 pagerank=6.7047e-04 url=www.lawfareblog.com/israeli-emergency-regulations-location-tracking-coronavirus-carriers
INFO:root:rank=5 pagerank=6.6262e-04 url=www.lawfareblog.com/why-congress-conducting-business-usual-face-coronavirus
INFO:root:rank=6 pagerank=6.5051e-04 url=www.lawfareblog.com/congressional-homeland-security-committees-seek-ways-support-state-federal-responses-coronavirus
INFO:root:rank=7 pagerank=6.3625e-04 url=www.lawfareblog.com/paper-hearing-experts-debate-digital-contact-tracing-and-coronavirus-privacy-concerns
INFO:root:rank=8 pagerank=6.1254e-04 url=www.lawfareblog.com/house-subcommittee-voices-concerns-over-us-management-coronavirus
INFO:root:rank=9 pagerank=6.0193e-04 url=www.lawfareblog.com/livestream-house-oversight-committee-holds-hearing-government-coronavirus-response

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --search_query='trump'

INFO:root:rank=0 pagerank=5.7828e-03 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=5.2341e-03 url=www.lawfareblog.com/document-trump-revokes-obama-executive-order-counterterrorism-strike-casualty-reporting
INFO:root:rank=2 pagerank=5.1299e-03 url=www.lawfareblog.com/trump-administrations-worrying-new-policy-israeli-settlements
INFO:root:rank=3 pagerank=4.6601e-03 url=www.lawfareblog.com/dc-circuit-overrules-district-courts-due-process-ruling-qasim-v-trump
INFO:root:rank=4 pagerank=4.5936e-03 url=www.lawfareblog.com/donald-trump-and-politically-weaponized-executive-branch
INFO:root:rank=5 pagerank=4.3073e-03 url=www.lawfareblog.com/how-trumps-approach-middle-east-ignores-past-future-and-human-condition
INFO:root:rank=6 pagerank=4.0937e-03 url=www.lawfareblog.com/why-trump-cant-buy-greenland
INFO:root:rank=7 pagerank=3.7593e-03 url=www.lawfareblog.com/oral-argument-summary-qassim-v-trump
INFO:root:rank=8 pagerank=3.4510e-03 url=www.lawfareblog.com/dc-circuit-court-denies-trump-rehearing-mazars-case
INFO:root:rank=9 pagerank=3.4486e-03 url=www.lawfareblog.com/second-circuit-rules-mazars-must-hand-over-trump-tax-returns-new-york-prosecutors
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --search_query='iran'

INFO:root:rank=0 pagerank=4.5748e-03 url=www.lawfareblog.com/praise-presidents-iran-tweets
INFO:root:rank=1 pagerank=4.4176e-03 url=www.lawfareblog.com/how-us-iran-tensions-could-disrupt-iraqs-fragile-peace
INFO:root:rank=2 pagerank=2.6929e-03 url=www.lawfareblog.com/cyber-command-operational-update-clarifying-june-2019-iran-operation
INFO:root:rank=3 pagerank=1.9393e-03 url=www.lawfareblog.com/aborted-iran-strike-fine-line-between-necessity-and-revenge
INFO:root:rank=4 pagerank=1.5453e-03 url=www.lawfareblog.com/parsing-state-departments-letter-use-force-against-iran
INFO:root:rank=5 pagerank=1.5358e-03 url=www.lawfareblog.com/iranian-hostage-crisis-and-its-effect-american-politics
INFO:root:rank=6 pagerank=1.5259e-03 url=www.lawfareblog.com/announcing-united-states-and-use-force-against-iran-new-lawfare-e-book
INFO:root:rank=7 pagerank=1.4222e-03 url=www.lawfareblog.com/us-names-iranian-revolutionary-guard-terrorist-organization-and-sanctions-international-criminal
INFO:root:rank=8 pagerank=1.1788e-03 url=www.lawfareblog.com/iran-shoots-down-us-drone-domestic-and-international-legal-implications
INFO:root:rank=9 pagerank=1.1464e-03 url=www.lawfareblog.com/israel-iran-syria-clash-and-law-use-force

Task 1, part 3:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz

INFO:root:rank=0 pagerank=2.8741e-01 url=www.lawfareblog.com/about-lawfare-brief-history-term-and-site
INFO:root:rank=1 pagerank=2.8741e-01 url=www.lawfareblog.com/lawfare-job-board
INFO:root:rank=2 pagerank=2.8741e-01 url=www.lawfareblog.com/masthead
INFO:root:rank=3 pagerank=2.8741e-01 url=www.lawfareblog.com/litigation-documents-resources-related-travel-ban
INFO:root:rank=4 pagerank=2.8741e-01 url=www.lawfareblog.com/subscribe-lawfare
INFO:root:rank=5 pagerank=2.8741e-01 url=www.lawfareblog.com/litigation-documents-related-appointment-matthew-whitaker-acting-attorney-general
INFO:root:rank=6 pagerank=2.8741e-01 url=www.lawfareblog.com/documents-related-mueller-investigation
INFO:root:rank=7 pagerank=2.8741e-01 url=www.lawfareblog.com/our-comments-policy
INFO:root:rank=8 pagerank=2.8741e-01 url=www.lawfareblog.com/upcoming-events
INFO:root:rank=9 pagerank=2.8741e-01 url=www.lawfareblog.com/topics
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2

INFO:root:rank=0 pagerank=3.4697e-01 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=2.9522e-01 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=2.9040e-01 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=1.5179e-01 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=1.5100e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=5 pagerank=1.5100e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=6 pagerank=1.5072e-01 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=1.4958e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=1.4367e-01 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=1.4240e-01 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull

Task 1, part 4:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose

DEBUG:root:computing indices
DEBUG:root:computing values
DEBUG:root:i=0 residual=141.904296875
DEBUG:root:i=1 residual=0.11642323434352875
DEBUG:root:i=2 residual=0.0749533399939537
DEBUG:root:i=3 residual=0.031701844185590744
DEBUG:root:i=4 residual=0.01745055615901947
DEBUG:root:i=5 residual=0.008532337844371796
DEBUG:root:i=6 residual=0.004448112566024065
DEBUG:root:i=7 residual=0.0022480429615825415
DEBUG:root:i=8 residual=0.0011543912114575505
DEBUG:root:i=9 residual=0.0005845915293321013
DEBUG:root:i=10 residual=0.0002967763866763562
DEBUG:root:i=11 residual=0.00015026562323328108
DEBUG:root:i=12 residual=7.531535811722279e-05
DEBUG:root:i=13 residual=4.007851021015085e-05
DEBUG:root:i=14 residual=2.0689936718554236e-05
DEBUG:root:i=15 residual=1.014485224004602e-05
...
DEBUG:root:i=996 residual=4.073772288393229e-06
DEBUG:root:i=997 residual=4.073772288393229e-06
DEBUG:root:i=998 residual=4.073772288393229e-06
DEBUG:root:i=999 residual=4.073772288393229e-06
INFO:root:rank=0 pagerank=2.8741e-01 url=www.lawfareblog.com/about-lawfare-brief-history-term-and-site
INFO:root:rank=1 pagerank=2.8741e-01 url=www.lawfareblog.com/lawfare-job-board
INFO:root:rank=2 pagerank=2.8741e-01 url=www.lawfareblog.com/masthead
INFO:root:rank=3 pagerank=2.8741e-01 url=www.lawfareblog.com/litigation-documents-resources-related-travel-ban
INFO:root:rank=4 pagerank=2.8741e-01 url=www.lawfareblog.com/subscribe-lawfare
INFO:root:rank=5 pagerank=2.8741e-01 url=www.lawfareblog.com/litigation-documents-related-appointment-matthew-whitaker-acting-attorney-general
INFO:root:rank=6 pagerank=2.8741e-01 url=www.lawfareblog.com/documents-related-mueller-investigation
INFO:root:rank=7 pagerank=2.8741e-01 url=www.lawfareblog.com/our-comments-policy
INFO:root:rank=8 pagerank=2.8741e-01 url=www.lawfareblog.com/upcoming-events
INFO:root:rank=9 pagerank=2.8741e-01 url=www.lawfareblog.com/topics
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose --alpha=0.99999

DEBUG:root:computing indices
DEBUG:root:computing values
DEBUG:root:i=0 residual=141.91505432128906
DEBUG:root:i=1 residual=0.0708821713924408
DEBUG:root:i=2 residual=0.018822869285941124
DEBUG:root:i=3 residual=0.006958351004868746
DEBUG:root:i=4 residual=0.0027358438819646835
DEBUG:root:i=5 residual=0.0010345664341002703
DEBUG:root:i=6 residual=0.00037746725138276815
DEBUG:root:i=7 residual=0.00013533492165151983
DEBUG:root:i=8 residual=4.822490882361308e-05
DEBUG:root:i=9 residual=1.717261693556793e-05
DEBUG:root:i=10 residual=6.114854841143824e-06
DEBUG:root:i=11 residual=2.1757550712209195e-06
DEBUG:root:i=12 residual=7.828199954929005e-07
INFO:root:rank=0 pagerank=2.8859e-01 url=www.lawfareblog.com/snowden-revelations
INFO:root:rank=1 pagerank=2.8859e-01 url=www.lawfareblog.com/lawfare-job-board
INFO:root:rank=2 pagerank=2.8859e-01 url=www.lawfareblog.com/documents-related-mueller-investigation
INFO:root:rank=3 pagerank=2.8859e-01 url=www.lawfareblog.com/litigation-documents-resources-related-travel-ban
INFO:root:rank=4 pagerank=2.8859e-01 url=www.lawfareblog.com/subscribe-lawfare
INFO:root:rank=5 pagerank=2.8859e-01 url=www.lawfareblog.com/topics
INFO:root:rank=6 pagerank=2.8859e-01 url=www.lawfareblog.com/masthead
INFO:root:rank=7 pagerank=2.8859e-01 url=www.lawfareblog.com/our-comments-policy
INFO:root:rank=8 pagerank=2.8859e-01 url=www.lawfareblog.com/upcoming-events
INFO:root:rank=9 pagerank=2.8859e-01 url=www.lawfareblog.com/litigation-documents-related-appointment-matthew-whitaker-acting-attorney-general
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose --filter_ratio=0.2

DEBUG:root:computing indices
DEBUG:root:computing values
DEBUG:root:i=0 residual=133.37046813964844
DEBUG:root:i=1 residual=0.4985728859901428
DEBUG:root:i=2 residual=0.13418623805046082
DEBUG:root:i=3 residual=0.06922142952680588
DEBUG:root:i=4 residual=0.02340950071811676
DEBUG:root:i=5 residual=0.010187399573624134
DEBUG:root:i=6 residual=0.004906957503408194
DEBUG:root:i=7 residual=0.0022800201550126076
DEBUG:root:i=8 residual=0.0010746779153123498
DEBUG:root:i=9 residual=0.0005250612157396972
DEBUG:root:i=10 residual=0.0002696898300200701
DEBUG:root:i=11 residual=0.00014567782636731863
DEBUG:root:i=12 residual=8.228314982261509e-05
DEBUG:root:i=13 residual=4.81165261589922e-05
DEBUG:root:i=14 residual=2.8800423024222255e-05
DEBUG:root:i=15 residual=1.7394713722751476e-05
DEBUG:root:i=16 residual=1.0549636499490589e-05
DEBUG:root:i=17 residual=6.375057182594901e-06
DEBUG:root:i=18 residual=3.839084001810988e-06
DEBUG:root:i=19 residual=2.2954950509301852e-06
DEBUG:root:i=20 residual=1.3711619430978317e-06
DEBUG:root:i=21 residual=8.12082362244837e-07
INFO:root:rank=0 pagerank=3.4697e-01 url=www.lawfareblog.com/trump-asks-supreme-court-stay-congressional-subpeona-tax-returns
INFO:root:rank=1 pagerank=2.9522e-01 url=www.lawfareblog.com/livestream-nov-21-impeachment-hearings-0
INFO:root:rank=2 pagerank=2.9040e-01 url=www.lawfareblog.com/opening-statement-david-holmes
INFO:root:rank=3 pagerank=1.5179e-01 url=www.lawfareblog.com/lawfare-podcast-ben-nimmo-whack-mole-game-disinformation
INFO:root:rank=4 pagerank=1.5100e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1964
INFO:root:rank=5 pagerank=1.5100e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1963
INFO:root:rank=6 pagerank=1.5072e-01 url=www.lawfareblog.com/lawfare-podcast-week-was-impeachment
INFO:root:rank=7 pagerank=1.4958e-01 url=www.lawfareblog.com/todays-headlines-and-commentary-1962
INFO:root:rank=8 pagerank=1.4367e-01 url=www.lawfareblog.com/cyberlaw-podcast-mistrusting-google
INFO:root:rank=9 pagerank=1.4240e-01 url=www.lawfareblog.com/lawfare-podcast-bonus-edition-gordon-sondland-vs-committee-no-bull
$ python3 pagerank.py --data=data/lawfareblog.csv.gz --verbose --filter_ratio=0.2 --alpha=0.99999

DEBUG:root:computing indices
DEBUG:root:computing values
DEBUG:root:i=0 residual=133.93057250976562
DEBUG:root:i=1 residual=0.5695652365684509
DEBUG:root:i=2 residual=0.3829951286315918
DEBUG:root:i=3 residual=0.21739360690116882
DEBUG:root:i=4 residual=0.1404505968093872
DEBUG:root:i=5 residual=0.10851354897022247
DEBUG:root:i=6 residual=0.09284136444330215
DEBUG:root:i=7 residual=0.08225566148757935
DEBUG:root:i=8 residual=0.07338891178369522
DEBUG:root:i=9 residual=0.0656123012304306
DEBUG:root:i=10 residual=0.05909649282693863
DEBUG:root:i=11 residual=0.0541754774749279
DEBUG:root:i=12 residual=0.051116958260536194
DEBUG:root:i=13 residual=0.04999382048845291
DEBUG:root:i=14 residual=0.05060892924666405
DEBUG:root:i=15 residual=0.05252622812986374
DEBUG:root:i=16 residual=0.05518880486488342
DEBUG:root:i=17 residual=0.058038510382175446
DEBUG:root:i=18 residual=0.06059235706925392
DEBUG:root:i=19 residual=0.06247846782207489
DEBUG:root:i=20 residual=0.06345327198505402
DEBUG:root:i=21 residual=0.06340529769659042
DEBUG:root:i=22 residual=0.062345635145902634
DEBUG:root:i=23 residual=0.060383714735507965
DEBUG:root:i=24 residual=0.057693950831890106
...
DEBUG:root:i=683 residual=1.0142875908059068e-06
DEBUG:root:i=684 residual=1.0041906080004992e-06
DEBUG:root:i=685 residual=9.921640184984426e-07
INFO:root:rank=0 pagerank=7.0149e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=7.0149e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.0552e-01 url=www.lawfareblog.com/cost-using-zero-days
INFO:root:rank=3 pagerank=3.1757e-02 url=www.lawfareblog.com/lawfare-podcast-former-congressman-brian-baird-and-daniel-schuman-how-congress-can-continue-function
INFO:root:rank=4 pagerank=2.2040e-02 url=www.lawfareblog.com/events
INFO:root:rank=5 pagerank=1.6027e-02 url=www.lawfareblog.com/water-wars-increased-us-focus-indo-pacific
INFO:root:rank=6 pagerank=1.6026e-02 url=www.lawfareblog.com/water-wars-drill-maybe-drill
INFO:root:rank=7 pagerank=1.6023e-02 url=www.lawfareblog.com/water-wars-disjointed-operations-south-china-sea
INFO:root:rank=8 pagerank=1.6020e-02 url=www.lawfareblog.com/water-wars-song-oil-and-fire
INFO:root:rank=9 pagerank=1.6020e-02 url=www.lawfareblog.com/water-wars-sinking-feeling-philippine-china-relations
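The DEBUG lines above report one residual per iteration of the power method. As a rough illustration of where such a sequence comes from, here is a minimal pure-Python sketch run on the small graph from `data/small.csv.gz`. It is not the code in `pagerank.py`: the function name, the `epsilon` and `max_iterations` parameters, the uniform personalization vector, and the L1-normalized starting vector are all assumptions made for this example, so the printed numbers will not match the logs above.

```python
import math

def pagerank_power_method(edges, n, alpha=0.85, epsilon=1e-6, max_iterations=1000):
    """Hypothetical helper: return (ranks, residuals) for nodes 0..n-1."""
    out_links = [[] for _ in range(n)]
    for source, target in edges:
        out_links[source].append(target)

    v = [1.0 / n] * n   # uniform personalization vector (assumption)
    x = v[:]            # start from the uniform distribution (assumption)
    residuals = []
    for _ in range(max_iterations):
        # Rank sitting on dangling nodes (no out-links) is redistributed through v.
        dangling_mass = sum(x[i] for i in range(n) if not out_links[i])
        x_new = [(alpha * dangling_mass + (1.0 - alpha)) * v[j] for j in range(n)]
        # Every other node splits alpha * its rank evenly among its out-links.
        for i, targets in enumerate(out_links):
            for j in targets:
                x_new[j] += alpha * x[i] / len(targets)
        # Residual: L2 distance between successive iterates.
        residual = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_new, x)))
        residuals.append(residual)
        x = x_new
        if residual < epsilon:
            break
    return x, residuals

# Edges of data/small.csv.gz, shifted to 0-based node ids.
small_edges = [(0, 1), (0, 2), (2, 0), (2, 1), (2, 4),
               (3, 4), (3, 5), (4, 5), (4, 3), (5, 3)]

ranks, residuals = pagerank_power_method(small_edges, n=6)
for i, r in enumerate(residuals):
    print(f"i={i} residual={r}")
```

The loop stops as soon as the residual drops below `epsilon` or the iteration cap is reached, which is the same pattern visible in the logs: some runs end after a few dozen iterations, while others run all the way to `i=999`.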

Task 2, part 1:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query='corona'

INFO:root:rank=0 pagerank=6.3127e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=6.3124e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.5947e-01 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=1.2209e-01 url=www.lawfareblog.com/rational-security-my-corona-edition
INFO:root:rank=4 pagerank=1.2209e-01 url=www.lawfareblog.com/brexit-not-immune-coronavirus
INFO:root:rank=5 pagerank=9.3360e-02 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=6 pagerank=9.1920e-02 url=www.lawfareblog.com/prosecuting-purposeful-coronavirus-exposure-terrorism
INFO:root:rank=7 pagerank=9.1920e-02 url=www.lawfareblog.com/britains-coronavirus-response
INFO:root:rank=8 pagerank=7.7770e-02 url=www.lawfareblog.com/lawfare-podcast-united-nations-and-coronavirus-crisis
INFO:root:rank=9 pagerank=7.2888e-02 url=www.lawfareblog.com/house-oversight-committee-holds-day-two-hearing-government-coronavirus-response

Task 2, part 2:

$ python3 pagerank.py --data=data/lawfareblog.csv.gz --filter_ratio=0.2 --personalization_vector_query='corona' --search_query='-corona'

INFO:root:rank=0 pagerank=6.3127e-01 url=www.lawfareblog.com/covid-19-speech-and-surveillance-response
INFO:root:rank=1 pagerank=6.3124e-01 url=www.lawfareblog.com/lawfare-live-covid-19-speech-and-surveillance
INFO:root:rank=2 pagerank=1.5947e-01 url=www.lawfareblog.com/chinatalk-how-party-takes-its-propaganda-global
INFO:root:rank=3 pagerank=9.3360e-02 url=www.lawfareblog.com/trump-cant-reopen-country-over-state-objections
INFO:root:rank=4 pagerank=7.0277e-02 url=www.lawfareblog.com/fault-lines-foreign-policy-quarantined
INFO:root:rank=5 pagerank=6.9713e-02 url=www.lawfareblog.com/lawfare-podcast-mom-and-dad-talk-clinical-trials-pandemic
INFO:root:rank=6 pagerank=6.4944e-02 url=www.lawfareblog.com/limits-world-health-organization
INFO:root:rank=7 pagerank=5.9492e-02 url=www.lawfareblog.com/chinatalk-dispatches-shanghai-beijing-and-hong-kong
INFO:root:rank=8 pagerank=5.1245e-02 url=www.lawfareblog.com/us-moves-dismiss-case-against-company-linked-ira-troll-farm
INFO:root:rank=9 pagerank=5.1245e-02 url=www.lawfareblog.com/livestream-house-armed-services-committee-holds-hearing-priorities-missile-defense
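In the Task 2, part 2 output, none of the returned urls contain the string `corona`, which suggests that a leading `-` in `--search_query` excludes urls containing that term. A minimal sketch of that kind of filter is shown below. The helper name `url_matches_query` is hypothetical, and the exact matching rules (whitespace-separated terms, plain substring matching, any positive term being enough) are assumptions for illustration rather than the behavior of `pagerank.py`.

```python
def url_matches_query(url, query):
    """Hypothetical filter: '-term' excludes a url, 'term' includes it."""
    has_positive_term = False
    matched_positive = False
    for term in query.split():
        if term.startswith('-'):
            # Negated term: reject the url if it contains the term.
            if len(term) > 1 and term[1:] in url:
                return False
        else:
            has_positive_term = True
            if term in url:
                matched_positive = True
    # With no positive terms, every url that survives the exclusions matches.
    return matched_positive or not has_positive_term


urls = [
    'www.lawfareblog.com/britains-coronavirus-response',
    'www.lawfareblog.com/limits-world-health-organization',
]
print([u for u in urls if url_matches_query(u, '-corona')])
```

Applying a filter like this after ranking with the `corona` personalization vector keeps pages that are related to the topic through links but do not mention it in their url.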
  1. Ensure that all your changes to the pagerank.py and README.md files are committed to your repo and pushed to github.

  2. Get at least 5 stars on your repo. (You may trade stars with other students in the class.)

    NOTE:

    Recruiters use github profiles to determine who to hire, and pagerank is used to rank user profiles and projects. Links in this graph correspond to who has starred or followed whose repo. By getting more stars on your repo, you'll be increasing your github pagerank, which increases the likelihood that recruiters will hire you. To see an example, perform a search for data mining. Notice that the results are returned "approximately" ranked by the number of stars, but because "some stars count more than others" the results are not exactly ranked by the number of stars. (I asked you not to fork this repo because forks are ranked lower than non-forks.)

    In some sense, we are doing a "dual problem" to data mining by getting these stars. Recruiters are using data mining to find out who the best people to recruit are, and we are hacking their data mining algorithms by making those algorithms select you instead of someone else.

    If you're interested in exploring this idea further, here's a python tutorial for extracting GitHub's social graph: https://www.oreilly.com/library/view/mining-the-social/9781449368180/ch07.html ; if you're interested in learning more about how recruiters use github profiles, read this Hacker News post: https://news.ycombinator.com/item?id=19413348.

  3. Submit the url of your repo to sakai.

    The assignment is worth 8 points.

    1. There are 6 parts to the output above. (4 in Task 1 and 2 in Task 2.)
    2. Each part that you get incorrect will result in -2 points. (But you cannot go negative.)
    3. Another way of phrasing this is that the first 2 parts you complete are not worth any points, but each part after that is worth 2 points.
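The two phrasings of the grading rule above can be checked against each other with a few lines of Python. This is only an arithmetic sketch; the function names and parameters are made up for the example.

```python
def score_by_penalty(incorrect_parts, total_points=8, penalty=2):
    # Start from 8 points and lose 2 per incorrect part, never going below 0.
    return max(0, total_points - penalty * incorrect_parts)


def score_by_credit(correct_parts, free_parts=2, points_per_part=2):
    # The first 2 correct parts earn nothing; each later one earns 2 points.
    return points_per_part * max(0, correct_parts - free_parts)


TOTAL_PARTS = 6
for correct in range(TOTAL_PARTS + 1):
    incorrect = TOTAL_PARTS - correct
    print(correct, score_by_penalty(incorrect), score_by_credit(correct))
```

Both functions give the same score for every possible number of correct parts, from 0 points with two or fewer correct parts up to 8 points with all six.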
