
I am trying to build a web scraper that uses proxies, because its repetitive behavior on a large website resulted in an IP ban. I am running into trouble, however: setting a proxy using the method below does not circumvent the ban.

[EDIT: Per a commenter's concerns below, I should clarify that I am not violating the site's stated terms of service, at least not that I can find. Unfortunately, I think the repetitive behavior looks suspicious and was probably caught up in an auto-detection process designed to weed out malicious actors.]

I have been searching for a way to verify that my function is actually routing requests through the proxy I set, but I can't find any information about checking the outgoing IP specifically from within the R environment.

I am very new to web scraping in general and R in particular, so I very much appreciate any help you can give, especially spelled out as basically as possible.

I tried using...

Sys.getenv("http_proxy")

...but that seems different from what I am looking for: it reads the system-wide proxy environment variable and does not reflect a proxy set using 'set_config()'.

I also tried setting output to verbose to look for how the website views the incoming request...

set_config(verbose())

...but I either don't see or am misunderstanding the information I need.
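For what it's worth, one place the verbose output can help is curl's connection lines: if the proxy is actually in use, the "* Trying ..." / "* Connected to ..." lines should show the proxy's IP and port rather than the target host. A minimal sketch of checking this on a single request (httpbin.org/ip is just an example endpoint, and the proxy shown is the first one from the list below, which may well be dead):

```r
library(httr)

# Route one request through the proxy and dump curl's connection chatter.
# If the proxy is active, the "* Connected to ..." line printed to the
# console should name the proxy's IP and port, not the target host.
resp <- GET("https://httpbin.org/ip",
            use_proxy("212.129.52.155", port = 8080),
            verbose())
```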

Below is some reproducible code, though unless you have the ability to test it on a website from which your IP is banned, you will not be able to exactly reproduce my issue.

Required libraries:

library(httr)

Proxies and associated ports from https://free-proxy-list.net/

proxies_b <- c("212.129.52.155", #anon, https
               "180.183.128.204", #anon, https
               "51.15.103.214") #anon, https
ports_b <- c(8080, 
             8213, 
             3128)

set_config(use_proxy(proxies_b[1],
                     port = ports_b[1],
                     username = NULL, password = NULL,
                     auth = "basic"))
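One direct way to answer the "which IP does the server see?" question is to request an IP-echo service through the same configuration and compare the result with your real public address. This is a sketch, not part of the original code; httpbin.org/ip is an example echo endpoint (any similar service would do), and it assumes the global config set above is still active:

```r
library(httr)

# Ask an IP-echo service what address the request arrived from.
# With set_config(use_proxy(...)) active, this should return the proxy's
# IP; if it returns your normal public IP, the proxy is not being used.
whoami <- function() {
  resp <- GET("https://httpbin.org/ip")
  content(resp, as = "parsed")$origin
}

whoami()
```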

Sample function:

url_works <- function(url) {
  tryCatch({
    # Returns a logical based on the status code.
    identical(status_code(HEAD(url)), 200L)
  },
  error = function(e) {
    print(paste0("The URL '", url, "' returned: ", conditionMessage(e)))
    FALSE # Returns FALSE if an error
  })
}

Test the function:

url_works("https://www.google.com") # Should return TRUE
url_works("https://www.googlebug.com/") # Should return FALSE

To be clear, this function works. The trouble is that when I run it from behind a banned IP, setting a proxy has no effect, and I cannot find a function to debug why. So what I most hope to have answered is:

  1. Is there a function that will check the active proxy within the R environment, as set by 'set_config()'?

  2. Are there any reasons you can see why setting a proxy in this way would not circumvent an IP ban?

  3. Is httr in this function actually even sending its queries through the proxies, or is it still going through my normal IP?
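While debugging, the proxy can also be passed per request instead of through the global config, which sidesteps any question of whether 'set_config()' took effect and makes it easy to rotate through the proxy list. This is a sketch of that idea, reusing the proxies_b and ports_b vectors defined above (free proxies are frequently dead, so any given one may fail):

```r
library(httr)

# Variant of url_works() that takes an explicit proxy, bypassing set_config().
url_works_via <- function(url, proxy_ip, proxy_port) {
  tryCatch(
    identical(status_code(HEAD(url, use_proxy(proxy_ip, port = proxy_port))),
              200L),
    error = function(e) {
      message("The URL '", url, "' returned: ", conditionMessage(e))
      FALSE
    }
  )
}

# Try each proxy in turn until one gets through.
for (i in seq_along(proxies_b)) {
  if (url_works_via("https://www.google.com", proxies_b[i], ports_b[i])) break
}
```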

Again, I am really new to this, so your patience is appreciated!

  • I have a feeling folks are going to be reluctant to help get around an IP ban, since it sounds like you've violated someone's terms of service – camille Apr 16 at 13:42
  • What I am trying to do is not actually violating any part of the TOS for the site, at least not that I can find. Unfortunately, I think the repetitive behavior just looks generically "suspicious" and was probably flagged as such. – Roxanne Ready Apr 16 at 20:26
