Randomuseragent

CRAN_Status_Badge CRAN_Downloads R-CMD-check License: MIT

The goal of Randomuseragent is to provide easy access to different user-agent strings by randomly sampling from a pool of real strings.

Installation

You can install the released version of Randomuseragent from CRAN with:

install.packages("Randomuseragent")

The development version can be installed from GitHub with:

# install.packages("devtools")
devtools::install_github("fangzhou-xie/Randomuseragent")

Example

This is a basic example to get random user-agent strings:

library(Randomuseragent)

random_useragent()
> [1] "Mozilla/5.0 (Windows NT 6.1; rv:11.0) Gecko/20100101 Firefox/11.0"

filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
> [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9"   
> [2] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.59.10 (KHTML, like Gecko) Version/5.1.9 Safari/534.59.10"
> [3] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_5) AppleWebKit/537.78.2 (KHTML, like Gecko) Version/6.1.6 Safari/537.78.2"  
> [4] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.75.14 (KHTML, like Gecko) Version/7.0.3 Safari/E7FBAF"   
> [5] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17" 
> [6] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/6.2.8 Safari/537.85.17"  
> [7] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_5_8) AppleWebKit/534.50.2 (KHTML, like Gecko) Version/5.0.6 Safari/533.22.3"

Both functions accept the same set of arguments for filtering user-agent strings. Please refer to the documentation of either function for details.

Advanced Example

Calling random_useragent() is convenient, but it may not be the best approach if you care about performance. random_useragent() essentially wraps filter_useragent() and returns one random string from the filtered pool.

# call directly
random_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
> [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/6.2.8 Safari/537.85.17"

However, if you need to generate many of them by calling random_useragent() repeatedly, every call first filters all the strings that this package provides and then randomly draws one from the resulting pool. The subsetting is therefore repeated on every call, which is inefficient.

A better way is to get the string pool directly from filter_useragent() once and then sample from it yourself.

# first filter
uas <- filter_useragent(min_obs = 50000, software_name = "Safari", operating_system_name = "Mac OS X")
# then sample manually
sample(uas, 1)
> [1] "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17"
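When many strings are needed, base R's sample() can also draw them in a single call instead of a loop. The sketch below uses a small hard-coded vector (copied from the output above) as a stand-in for the result of filter_useragent(), so it runs without the package; `uas_demo` and `n_requests` are names made up for illustration.

```r
# Stand-in for the character vector returned by filter_useragent()
# (an assumption for illustration; strings copied from the example output).
uas_demo <- c(
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/600.8.9 (KHTML, like Gecko) Version/8.0.8 Safari/600.8.9",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/534.59.10 (KHTML, like Gecko) Version/5.1.9 Safari/534.59.10",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/600.5.17 (KHTML, like Gecko) Version/8.0.5 Safari/600.5.17"
)

n_requests <- 10  # hypothetical number of strings needed

# replace = TRUE is required when drawing more strings than the pool holds
batch <- sample(uas_demo, n_requests, replace = TRUE)
print(batch[1:2])
```

Drawing the whole batch up front avoids both the repeated filtering and the per-iteration overhead of lapply().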

To see the difference, time the following code chunks.

# first, call random_useragent() directly
system.time(lapply(1:5000, function(x){random_useragent()}))
>    user  system elapsed 
>   1.922   0.015   1.944
# second, build the character vector once and sample from it manually
system.time({
  ua <- filter_useragent(min_obs = 5000, software_type = "browser", operating_system_name = "Windows")
  lapply(1:5000, function(x) {sample(ua, 1)})
})
>    user  system elapsed 
>   0.023   0.000   0.023

We run each method 5000 times (note that the first chunk applies no filters on each call, whereas the second filters once up front). You should immediately see that the second method is more than 50 times faster than the first one; with the timings shown above the ratio is roughly 85 (1.944 s versus 0.023 s elapsed). That said, the first method only spends about 0.39 ms per call, on average, which is pretty fast already. The second method needs about 4.6 µs per call. This is certainly faster, but for most use cases, I don’t think it is worth going this far.
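The per-call figures can be derived directly from the elapsed times printed by system.time() above. A quick check, with the two elapsed values copied by hand from that output (they will differ on your machine):

```r
n_calls <- 5000

# Elapsed seconds copied from the system.time() output above
elapsed_direct <- 1.944  # random_useragent() called on every iteration
elapsed_pool   <- 0.023  # filter_useragent() once, then sample()

per_call_direct_ms <- elapsed_direct / n_calls * 1e3  # milliseconds per call
per_call_pool_us   <- elapsed_pool / n_calls * 1e6    # microseconds per call
speedup <- elapsed_direct / elapsed_pool

print(round(c(direct_ms = per_call_direct_ms,
              pool_us   = per_call_pool_us,
              speedup   = speedup), 2))
```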

Optional Parameters

You can type ?random_useragent to see the documentation for the parameters.

  1. min_obs: integer, minimum number of times a string was observed in the dataset. This keeps the most frequently used UAs and removes the less frequently used ones. A larger value returns fewer strings, and hence a smaller set to sample from.
  2. software_name: character vector, name of the software. For example, you can choose to use only software_name = "Chrome", or several at once with software_name = c("Safari", "Edge").
  3. software_type: character vector, one or more of "browser", "bot", "application". For web scraping applications, you would most likely choose software_type = "browser" to mimic real browser behavior.
  4. operating_system_name: character vector, name of the operating system. For example, use one or more of "Windows", "Linux", "Mac OS X", "macOS", etc.
  5. layout_engine_name: character vector, e.g. "Gecko", "Blink", etc.
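The parameters above can be combined in one call. The following is a minimal sketch, not a recommended configuration: the argument values are illustrative, and the assumption is that "Chrome" on "Windows" is common enough in the dataset to leave a non-empty pool at this threshold. The call is guarded so that the snippet does nothing when the package is not installed.

```r
# Runs only if Randomuseragent is installed; otherwise it is a no-op.
if (requireNamespace("Randomuseragent", quietly = TRUE)) {
  # Illustrative values; a stricter combination returns a smaller pool.
  pool <- Randomuseragent::filter_useragent(
    min_obs = 5000,
    software_name = "Chrome",
    software_type = "browser",
    operating_system_name = "Windows"
  )
  print(length(pool))  # size of the filtered pool
  print(head(pool, 3)) # first few strings
}
```

Raising min_obs or adding more filters shrinks the pool, so it is worth checking its length before sampling from it.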