Proxy Pilot (By BlazingSEO): Read the Manual Before You Buy

Do you know what Proxy Pilot does and how it works? This article explains both, and shows you how to troubleshoot Proxy Pilot.

What Is Proxy Pilot?

Description

Proxy Pilot is software that manages your proxy lists intelligently. This means you can make your system more flexible by using this “microservice,” which handles a significant workload that would otherwise live in your main code base.


Features

The current features are:

  • Specific cooldowns between each proxy attempt
  • Specific cooldown after a ban is detected
  • Ban detection, configured per site (the certificate you install on your server allows us to decrypt your traffic, in the manner of a man-in-the-middle attack, and read the resulting HTML for ban messages)
  • Optimal proxy usage (round-robin at first, and then once various cooldown timers are started, it will use the proxies that are most cooled-down)
  • Automatically downloads your proxy list from a proxy API endpoint
  • Geo-targeting, if you use multiple countries in your proxy list
  • Advanced statistics powered by ELK (see below)
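As an illustration of the cooldown and round-robin behavior described above, here is a minimal Python sketch. This is not Proxy Pilot’s actual code, and the cooldown durations are made-up values; it only models the behavior: round-robin at first, then prefer whichever proxy has cooled down the longest.

```python
import time

class ProxyRotator:
    """Illustrative model of cooldown-aware round-robin proxy selection."""

    def __init__(self, proxies, request_cooldown=2.0, ban_cooldown=300.0):
        self.request_cooldown = request_cooldown  # seconds between uses of one proxy
        self.ban_cooldown = ban_cooldown          # seconds a banned proxy sits out
        # next_available[proxy] = earliest timestamp the proxy may be used again
        self.next_available = {p: 0.0 for p in proxies}

    def get_proxy(self):
        now = time.time()
        # Pick the proxy that has been idle the longest (most "cooled down").
        proxy = min(self.next_available, key=self.next_available.get)
        if self.next_available[proxy] > now:
            raise RuntimeError("No proxies")  # everything is still cooling down
        self.next_available[proxy] = now + self.request_cooldown
        return proxy

    def report_ban(self, proxy):
        # A detected ban puts the proxy into a much longer cooldown.
        self.next_available[proxy] = time.time() + self.ban_cooldown
```

Once every proxy is inside a cooldown window, `get_proxy` raises the same kind of “No proxies” condition that Proxy Pilot reports (see the error codes below).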

Powered by ELK

The ELK stack is one that has grown massively in popularity over the past few years… and Proxy Pilot is powered by it! See the amazing statistics that you will have at your fingertips through Kibana’s amazing visualizations:



What It Is, and Is Not

One point of confusion among interested users is what Proxy Pilot does and does not do. See also our General Troubleshooting section below.

Therefore, it's important to make the distinction between how Proxy Pilot can help you, versus its inability to prevent many common anti-scraping technologies when you use your own software.

What it does:

  • All listed features above

  • Allows you to separate proxy code from your main code base and into a separate microservice

  • 100% open-sourced (coming Q4 2021)

What it does not do:

  • It does not guarantee 100% success rate if your proxy pools and settings are not appropriate.

  • Example: if you want to scrape 1M requests/hour to domain.com and only input 10 proxies into your proxy pool, you will most likely receive a ban on the target website. When this happens, all 10 of your proxies go into a “ban cooldown”, and Proxy Pilot returns a ‘No Proxies’ error message.

  • Proxy Pilot does not “charge per successful scrape”. If you’d like to offload all portions of scraping to us then we recommend you consider our Scraping Robot API. Our Scraping Robot API handles all browser management, proxy management, and ensures 100% success back to your software. Proxy Pilot is only a proxy manager, which is highly dependent on the proxies you provide it. If you provide low quality proxy IP addresses, or configure your software incorrectly, then you will get low quality results.

  • Proxy Pilot does not provide you free proxies or access to a specific proxy pool. You must provide it with the proxies you wish to use. Again, if you do not want to purchase proxies or manage them at all, then our Scraping Robot API would be recommended.


Proxy Pilot Setup Instructions

Technical Setup Explanation – How Does It Work?

If you haven’t read What Is Proxy Pilot? we recommend reading the business overview article first.

This article outlines the technical details on how to implement Proxy Pilot. First, let’s define how it works:

Technical Setup Explanation

Key components:

  1. Install a custom certificate in your software. For most software, this takes 1-2 lines of code.

    Once installed, this allows us to emulate a man-in-the-middle attack, decrypting your HTTPS traffic so we can read the HTML. Once we can read the full HTML of your requests, we can detect bans and do the appropriate retries.

  2. You connect to a central Proxy Pilot server (self-hosted or managed hosting).

    We will provide you a single proxy gateway (ip:port with IP authorization, or ip:port:user:pass for user:pass authorization). You will send all requests to this single proxy gateway, and from there the Proxy Pilot system takes over and forwards your request to the appropriate proxy.

  3. Your actual proxy list

    As mentioned in What Is Proxy Pilot? you must provide your own proxies to the system. These proxies are the ones that Proxy Pilot forwards your requests to.
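For example, the two gateway address formats from component #2 can be assembled like this (a Python sketch; the IP, port, and credentials below are placeholders, not real gateway details):

```python
# Placeholders -- substitute the gateway details provided to you.
PROXY_IP, PROXY_PORT = "203.0.113.10", 8080
PROXY_LOGIN, PROXY_PASS = "user", "pass"

# IP-authorized gateway: plain ip:port
ip_auth_gateway = f"http://{PROXY_IP}:{PROXY_PORT}"

# user:pass-authorized gateway: credentials embedded in the proxy URL
userpass_gateway = f"http://{PROXY_LOGIN}:{PROXY_PASS}@{PROXY_IP}:{PROXY_PORT}"

# Either form is handed to your HTTP client as an ordinary proxy; every
# request sent to it is forwarded by Proxy Pilot to one of your real proxies.
proxies = {"http": userpass_gateway, "https": userpass_gateway}
```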


Programming Language Implementations

Please see the following links for setup instructions in the programming language of your choice. For most languages, installing the custom certificate requires fewer than 2 lines of code, and from that point on you use the proxy gateway the same way you would use a normal proxy.

See setup instructions for the following languages:


1. Node.js (Requests)

Prerequisites

You should have the following installed:

  • Node.js and the request package (npm install request)

Example code

Required lines to use Proxy Pilot:

  • rejectUnauthorized: false
    • rejectUnauthorized: false "ignores" the certificate warnings
  • (OPTIONAL) If you wish to have a secure connection between your server and our server, you can install our certificate and use it in Node.js. For most users this is not necessary and you can simply ignore the certificate errors. You can download the certificate here.
    • // const cert = fs.readFileSync(path.resolve(__dirname, './public/ca.pem'));
    • // ca: cert
    • // tunnel: true,
  • (OPTIONAL – don't use if you are not geo-targeting)
    • // proxyHeaderWhiteList: ['X-Sprious-Region'],
    • headers: { 'X-ProxyPilot-Region': 'GB' }
const fs = require('fs');
const path = require('path');

/* Note: please replace the ./public/ca.pem with a local path to the ca.pem file on your server and simply remove lines with X-ProxyPilot-Region if you don’t need to use geo-targeting feature*/

// const cert = fs.readFileSync(path.resolve(__dirname, './public/ca.pem'));
const request = require('request');
request(
    {
        url: 'https://www.amazon.com/dp/B07HNW68ZC/',
        proxy: 'http://PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT',
        // ca: cert,
        // tunnel: true,
        followAllRedirects: true,
        timeout: 60000,
        method: "GET",
        rejectUnauthorized: false,
        gzip: true,
        // proxyHeaderWhiteList: ['X-Sprious-Region'],
        headers: {
            // 'X-ProxyPilot-Region': 'GB',
            'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
        }
    },
    (err, response, body) => {
        console.log(err, body);
    }
);

2. Node.js with Puppeteer

Prerequisites

You should have the following installed:

  • Node.js with the puppeteer and proxy-chain packages (npm install puppeteer proxy-chain)

Example code

Required lines to use Proxy Pilot:

  • const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox', '--proxy-server=' + anonymizeProxy, '--ignore-certificate-errors'],
    });

    • By ignoring the certificate errors you do not need to install the certificate on your server.
const puppeteer = require('puppeteer');
const proxyChain = require('proxy-chain');

(async () => {
    const anonymizeProxy = await proxyChain.anonymizeProxy('http://PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT');

    const browser = await puppeteer.launch({
        args: ['--no-sandbox', '--disable-setuid-sandbox', '--proxy-server=' + anonymizeProxy, '--ignore-certificate-errors'],
    });
    const page = await browser.newPage();
    await page.goto('https://www.amazon.com/p/dp/B08GL2XTV6');

    let pageTitle = await page.title();
    let element = await page.$("#priceblock_ourprice");
    let price = await (await element.getProperty('textContent')).jsonValue();

    console.log(price + ' - ' + pageTitle);
    // should be '$199.00 - LOVESHACKFANCY Women's Antonella Dress, Brilliant Blue, 4 at Amazon Women’s Clothing store'

    await browser.close();
    await proxyChain.closeAnonymizedProxy(anonymizeProxy, true)
})();

3. Curl

Prerequisites

You should have the following installed:

  • You should be able to run curl on your machine

Example code

Required lines to use Proxy Pilot:

  • -k --compressed \
    • By using the "-k" parameter in curl, it will IGNORE the custom certificate requirement to use Proxy Pilot.
  • (OPTIONAL) If you want to use the geo-targeting feature, please pass:
    • --proxy-header 'X-Sprious-Region: US' \
curl -s 'https://www.amazon.com/dp/B07HNW68ZC/' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
-k --compressed \
-x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT'

4. Python (Requests)

Prerequisites

You should have the following installed:

  • Python 3 and the requests library (pip install requests)

Example code

Required lines to use Proxy Pilot:

  • r = requests.get(url, headers=headers, proxies=proxies, verify=False)
    • verify=False "ignores" the certificate warnings
  • (OPTIONAL) # r = requests.get(url, headers=headers, proxies=proxies, verify='./public/ca.pem')
    • If you wish to have a secure connection between your server and our server, you can install our certificate and use it in Python. For most users this is not necessary and you can simply ignore the certificate errors. You can download the certificate here.
import requests

url = "https://www.amazon.com/dp/B07HNW68ZC/"
proxies = {
    "https": "http://PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT",
    "http": "http://PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT"
}
headers = {
    'user-agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36',
    'accept-encoding': 'gzip'
}

# r = requests.get(url, headers=headers, proxies=proxies, verify='./public/ca.pem')
r = requests.get(url, headers=headers, proxies=proxies, verify=False)

print(f"Response Body: {r.text}\n"
    "Request Headers:"
    f"{r.request.headers}\n\n"
    f"Response Time: {r.elapsed.total_seconds()}\n"
    f"Response Code: {r.status_code}\n"
    f"Response Headers: {r.headers}\n\n"
    f"Response Cookies: {r.cookies.items()}\n\n"
    f"Requesting {url}\n"
)

5. Firefox Browser

Prerequisites

You should have the following installed:

  • Latest Firefox version

You should download the custom certificate:

How to use Proxy Pilot in Firefox

NOTE OF CAUTION: By following the instructions below you are essentially allowing our server to read the pages that you visit. This is dangerous if you intend to use your browser for normal activity like using your bank account. Please only proceed if you intend to use our certificate with your proxy activities such as web scraping – DO NOT use it for personal use!

In Firefox you can import your certificate by following these steps:

  1. Settings → Privacy & Security → "View certificates" → "Import"

  2. Select the ca.pem file that you saved earlier

  3. Check the checkbox "Trust this to identify websites" → "OK"

  4. Click "OK"

Then you need to configure your browser to use the Proxy Pilot proxy server:

  1. Settings → General → Network Settings → “Settings”

  2. Manual proxy settings

  3. HTTP Proxy: PROXY_IP Port: PROXY_PORT

  4. Select “Also use this proxy for FTP and HTTPS”

  5. Click “OK”

Visit amazon.com. When the browser asks for login and password, enter:

  • login: PROXY_LOGIN
  • password: PROXY_PASS

Proxy Manager Error codes

HTTP Response Code / Response Content / Description:

  • 500 – Error retry limit

    Proxy Pilot reached the limit of retries trying to fetch a specific URL. You may retry your request again.

  • 500 – No proxies

    There are currently not enough proxies in your proxy pool. This might indicate either that all proxies are in cooldown or, if the geo-targeting header is specified, that there are no proxies for that region in your proxy pool. Either retry your request or check that the specified geo-targeting header matches the available proxies in your proxy pool. Read What Proxy Pilot Is, and Is Not for more information.
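On the client side, these gateway errors can be handled with a simple retry loop. Below is a sketch in Python; the attempt count, backoff values, and the fetch_with_retries helper are illustrative choices, not part of Proxy Pilot:

```python
import time

# Placeholder gateway credentials -- substitute your own.
GATEWAY = {
    "http": "http://PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT",
    "https": "http://PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT",
}

def _default_get(url):
    import requests  # imported here so the sketch loads without requests installed
    return requests.get(url, proxies=GATEWAY, verify=False, timeout=60)

def fetch_with_retries(url, get=_default_get, max_attempts=3, backoff=1.0):
    """Retry when the gateway answers 500 ('Error retry limit' / 'No proxies')."""
    for attempt in range(1, max_attempts + 1):
        r = get(url)
        if r.status_code != 500:
            return r
        # A 500 means the gateway hit its retry limit or has no usable
        # proxies right now; waiting lets cooled-down proxies come back.
        time.sleep(backoff * attempt)
    return r  # still failing; the caller can inspect the last response
```

The `get` parameter is injectable only so the loop can be exercised without a live gateway; in normal use you would call `fetch_with_retries(url)` and let it go through the proxy gateway.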


How Do Retries Work? Are You Doing the Scraping on My Behalf?

This question is a common one, given the intricacies of what’s going on in the solution. The simple answer: no, we are not scraping on your behalf.

Consider the following flow:

  1. You send a request to scrape domain.com to Proxy Pilot gateway

  2. Proxy Pilot forwards your request to proxyA

  3. proxyA returns a banned HTML page back to Proxy Pilot

  4. Proxy Pilot sees this is a ban, and then sends this same request to proxyB

  5. proxyB returns a successful HTML page to Proxy Pilot

  6. Proxy Pilot returns the successful HTML back to you (the user)

At step #4 we get a common question: is our server using its resources to do the scraping, or are your server’s compute resources doing it? The answer is that your server is still doing the actual scraping.

The best way to think about it: when your internet disconnects midway through a connection to a website, your browser shows a longer-than-usual “Loading” symbol while it attempts a retry. That is largely what happens with Proxy Pilot: as it makes retries on your behalf, your software keeps the connection tunnel open while it waits for a response from Proxy Pilot.

The only compute Proxy Pilot itself consumes is in resending the exact same request headers and body. We have proven through extensive testing that resending identical headers and body does not affect the results of your scraping (e.g., if you are using Puppeteer on Chromium).
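Conceptually, the six steps above reduce to a loop like this on the gateway. This is an illustrative Python model only: handle_request, forward, and is_banned_html are stand-ins for Proxy Pilot internals, which also apply the per-site ban rules and cooldown logic described earlier.

```python
def handle_request(request, proxies, forward, is_banned_html):
    """Model of the retry flow: try proxies in turn until one returns
    a non-banned page, resending the exact same request each time."""
    for proxy in proxies:               # proxyA, proxyB, ...
        html = forward(request, proxy)  # identical headers and body every time
        if not is_banned_html(html):
            return html                 # success goes straight back to the user
    raise RuntimeError("Error retry limit")
```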

The best way to prove this yourself with Proxy Pilot: connect to a JavaScript-heavy website (like Google Maps) with your browser. You will notice that you are able to load the page, because JavaScript is still being executed by your browser while the tunnel connection remains open.

Confusing? We agree! Please sign up for Proxy Pilot and we’re happy to give you free proxies to trial it out.


General Troubleshooting for Proxy Pilot

In this article we will discuss some steps you can take to troubleshoot unexpected issues when using your proxies via Proxy Pilot. As a reminder, Proxy Pilot is a tool that relies on proper configuration by the end user in order to work properly. If you set bad headers or cookies, use bad proxies, and so forth, then you will get poor results nonetheless.

At the core of web scraping, if you cannot load a request on your browser, using your home/work IP address, then it is unlikely you will be able to scrape a page using software + a proxy source. 

There are many ways to detect scraping software (see example1 and example2), so the more customization you add to loading a website (your software + proxies), the greater your footprint will be, and the easier it will be to detect you.

If you do not wish to worry about such anti-scraping battles, please consider our API at https://scrapingrobot.com/api/. Our Scraping Robot API was built to solve this exact issue: allowing you to focus on your core business instead of fighting with anti-scraping technologies.

If you wish to manage your own proxies, use developer resources, and pay for server compute power, then Proxy Pilot will help with (but not fully solve!) some of these common scraping issues.


Example of Bad vs Good Scraping Requests

Below you will find an example of a very bad scraping request to Amazon (or any site, really). Proxy Pilot's role is not to fix these bad requests – it is still up to the developer's code to send good requests and avoid being banned.

curl -s 'https://www.amazon.com/dp/B07HNW68ZC/' \
     -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' \
     -k --compressed -v

The reason the above request would result in a ban is not Proxy Pilot, or even your proxies; it is that a normal browser request would set more headers. Specifically, Amazon checks that the request has at least a 'User-Agent' header, and no matter which proxies you run this request through, it would most likely get blocked.

By simply adding a User-Agent header to your request you can significantly decrease ban rates:

curl -s 'https://www.amazon.com/dp/B07HNW68ZC/' \
     -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
     -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' \
     -k --compressed -v


Tip #1:  Replicate Your Request in a Browser to Confirm It Works There First

Description: As mentioned above, the best way to know whether your software is causing issues is to run your request via a browser on your local machine. Because your local machine is a pure “residential IP” and a browser is not customized software, you should be able to load pages successfully. However, if you cannot load a page in your browser using the steps below, then you are passing incorrect headers or cookies to the target URL and will need to debug on your side to find the proper headers/cookies.

Steps to troubleshoot:

  1. Open a Chrome incognito tab and make sure you clear the cookies
    As most scraping software starts with no previous browsing history and no cookies – it’s best to do it this way to replicate how your software would work

  2. Replicate the URL you're going to scrape by just pasting it into the address field

  3. Make sure it loads as expected. If it fails – you should take this into consideration when designing your scraping software
    In some cases the site might ban you even at this step simply because you have no previous browsing history (and no cookies). 

  4. In Chrome Dev Tools open a network tab and check the first request (it will likely have a ‘document' type). Check that the URL of that request matches the one you just made, and then right-click and choose ‘Copy as cURL'
    The cURL sent by a browser should appear in your clipboard and look something like this:
    curl 'https://www.amazon.com/gp/product/B08F7PTF53/' \
    -H 'authority: www.amazon.com' \
    -H 'sec-ch-ua: " Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"' \
    -H 'sec-ch-ua-mobile: ?0' \
    -H 'upgrade-insecure-requests: 1' \
    -H 'user-agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36' \
    -H 'accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9' \
    -H 'sec-fetch-site: none' \
    -H 'sec-fetch-mode: navigate' \
    -H 'sec-fetch-user: ?1' \
    -H 'sec-fetch-dest: document' \
    -H 'accept-language: en-US,en;q=0.9' \
    --compressed

Note how many headers the browser sends even from an incognito mode. With cookies the request can easily be several times bigger.

Note: cURL is provided by default with macOS and most Linux distributions, as well as the latest Windows 10 updates.

In case you are running an older version of Windows you can install curl from their official site.


Tip #2:  Replicate the Same Request from the Browser Via Proxy Pilot

Description: After step #4 in the previous tip you should have a perfect cURL request with a perfect set of headers. You can replay it via Proxy Pilot by simply adding this parameter to the cURL request: -x 'PROXY_LOGIN:PROXY_PASS@PROXY_IP:PROXY_PORT' -v -k

using the Proxy Pilot credentials provided to you earlier. This sends the same request via Proxy Pilot.

You might also consider adding the parameter
-o test.html
This saves the result into a test.html file, so you can open it with a browser and inspect its content to make sure everything is working properly.
If it returns proper content at this stage, Proxy Pilot is working fine: it takes care of managing proxies, retrying when a proxy is banned, and so on.

If the request works directly (without routing through Proxy Pilot via the -x flag) but stops working via Proxy Pilot, please inform us and let us know which curl request you were sending.


Tip #3:  Replicate the Same Behavior Via Your Software

Description: Once you’ve tested your request via the browser and via Proxy Pilot, you can apply it to your own scraping software. Integrating with Proxy Pilot is almost as simple as using regular proxies for data scraping. More details and code examples for different languages and frameworks can be found here.

Please note that if, while integrating, the same request that worked via cURL stops working with your software, the most likely reason is the set of headers. Many sites implement really sophisticated anti-scraping solutions which may take into account not only cookies and user-agents, but also the specific order of headers, compression algorithms, and browser market share (e.g., Chrome v41 is rarely used, so sending it as the user-agent would look suspicious to the target site).
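One way to reduce that risk in Python is to send the browser's header set verbatim, in the order you copied it from DevTools (dicts preserve insertion order in Python 3.7+). Below is a sketch using requests; note that requests normally adds its own default headers, so this clears them first. Whether a given site actually inspects header order is site-specific, and the values below are simply taken from the example cURL in Tip #1.

```python
import requests

# Header set copied from a real browser request (see Tip #1).
browser_headers = {
    "user-agent": ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/91.0.4472.77 Safari/537.36"),
    "accept": ("text/html,application/xhtml+xml,application/xml;q=0.9,"
               "image/avif,image/webp,image/apng,*/*;q=0.8"),
    "accept-language": "en-US,en;q=0.9",
    "upgrade-insecure-requests": "1",
}

def make_browser_like_session():
    session = requests.Session()
    session.headers.clear()                  # drop requests' default headers
    session.headers.update(browser_headers)  # send only the browser-like set
    return session
```

A session built this way is then used exactly like the Python example earlier: `make_browser_like_session().get(url, proxies=proxies, verify=False)`.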



Last Updated on September 7, 2021

