Aggregator secrets revealed: avoiding blocks and airfare mismatches
Collecting data from large-scale sites. How online retailers counter scraping. How to deal with the difficulties you will encounter.
A large share of our clients are aggregator sites, and they regularly run into the same problem: the sites they rely on try to prevent data collection. Let's talk about the problem - and how it can be solved.
Aggregator sites automatically collect information about products and prices in online stores so that the user can quickly compare them and pick the best offer. The first example that comes to mind is airline ticket aggregators: platforms that can quickly and accurately answer the question "Where is the cheapest ticket?" (although, as we will see in this article, delivering "quickly and accurately" is an adventure of its own).
Scraping
Data collection is complicated by its sheer scale: to find the lowest price, you have to scan every source within a narrow time window so the results stay comparable, and manual labor simply can't keep up. Fortunately, the process can be automated: set up a special bot (a crawler) that walks the site's pages and exports product information in the desired format. This process is called web scraping (and in practice it means roughly the same thing as "data collection").
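To make this concrete, here is a minimal sketch of such a crawler in Python. The store URL, pagination scheme, and CSS selectors are hypothetical placeholders; a real site needs its own selectors and, as discussed below, far more care about request rates.

```python
import csv
import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example-shop.test/catalog?page={}"  # hypothetical store


def scrape_page(page: int) -> list[dict]:
    """Download one catalog page and extract product names and prices."""
    response = requests.get(BASE_URL.format(page), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    products = []
    for card in soup.select(".product-card"):  # selector is an assumption
        products.append({
            "name": card.select_one(".title").get_text(strip=True),
            "price": card.select_one(".price").get_text(strip=True),
        })
    return products


def main() -> None:
    rows = []
    for page in range(1, 6):  # first five pages as a demo
        rows.extend(scrape_page(page))
    with open("products.csv", "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["name", "price"])
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    main()
```

The same loop scales to thousands of pages - which is exactly why the logic described in the next paragraph matters.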
Despite the apparent uniqueness of each website, the logic of data organization is (roughly) the same everywhere - and this is the strength of data collection: the process can be scaled.
The airline ticket aggregator is the most obvious example, so let's consider something more original. Imagine that the comments on the platforms belonging to the publishing house "Committee" (vc.ru, tjournal.ru, dtf.ru) are a mine of wisdom and hard-won life experience, so it would be good to collect them, analyze them, and then write reports based on them.
This is a good idea for everyone - except the platforms where we're taking that data from.
Countering scraping
Online stores gain nothing from such aggregators: a buyer can leave for a competitor with a lower price, so store owners use various methods to prevent scraping.
To collect information, the robot sends the target server a lot of requests. This can look like a DDoS attack, at which point the server's security mechanisms kick in and block the bot's access to the site. And if the server also has dedicated anti-scraping protection, it will notice the suspicious activity much faster and block your crawler.
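To make the defender's side concrete, here is a rough sketch of the kind of sliding-window rate limiter an anti-bot layer might use; the threshold and window length are invented for illustration.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10   # length of the sliding window (illustrative value)
MAX_REQUESTS = 50     # requests allowed per IP within the window (illustrative)

_history = defaultdict(deque)  # ip -> timestamps of recent requests


def is_suspicious(ip: str) -> bool:
    """Return True if this IP has exceeded its per-window request budget."""
    now = time.monotonic()
    timestamps = _history[ip]
    timestamps.append(now)
    # Drop timestamps that have fallen out of the window.
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > MAX_REQUESTS
```

A single crawler hammering a site from one IP trips a check like this within seconds.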
Of course, you can reduce the number of requests to lower the chance of being blocked. But that makes scraping slower, and the chance that anti-scraping mechanisms will still notice your robot remains quite high.
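On the crawler's side, "reducing the number of requests" usually means adding a randomized delay and backing off when the server starts refusing. A minimal sketch (the delay values are placeholders to tune per site):

```python
import random
import time
import requests


def polite_get(url: str, max_retries: int = 3) -> requests.Response:
    """Fetch a URL with a random pause before the request and backoff on HTTP 429."""
    for attempt in range(max_retries):
        time.sleep(random.uniform(1.0, 3.0))   # jitter between requests
        response = requests.get(url, timeout=10)
        if response.status_code == 429:        # "Too Many Requests"
            time.sleep(2 ** attempt * 5)       # exponential backoff
            continue
        response.raise_for_status()
        return response
    raise RuntimeError(f"Gave up on {url} after {max_retries} attempts")
```

The trade-off is exactly the one described above: the politer the crawler, the longer the crawl.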
In addition to blocking requests, another method is widely used: showing inaccurate prices for goods or services ("cloaking"). For example, sellers deliberately alter product card descriptions and lower or, conversely, raise the price.
One of the most common examples is the constant change in airline ticket prices. Here it is common practice to show different prices for the same flight depending on the IP address: searching for a Miami to London ticket from different parts of the world, users will get different results.
For instance, the fare for the flight when requested from an Asian IP was $446, while from an Eastern European IP it was $370. The savings are obvious: $76, or about 17% of the higher fare.
Solving the problem
Hosting providers hand out datacenter ("service") IP addresses, and checking whether an IP address belongs to such a provider's pool takes only a couple of simple steps: every IP address maps to an ASN (Autonomous System Number), which identifies the organization that owns it.
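As an example, a quick lookup can be run against the free ip-api.com endpoint, which returns the AS number and owning organization for an IP (the endpoint and field names follow its public JSON API; any service with ASN data would work the same way):

```python
import requests


def lookup_asn(ip: str) -> dict:
    """Query ip-api.com for the autonomous system behind an IP address."""
    response = requests.get(f"http://ip-api.com/json/{ip}", timeout=10)
    response.raise_for_status()
    data = response.json()
    return {
        "ip": ip,
        "asn": data.get("as"),   # e.g. "AS15169 Google LLC"
        "org": data.get("org"),
        "isp": data.get("isp"),
    }


if __name__ == "__main__":
    # A datacenter IP typically resolves to a hosting provider's ASN,
    # which is exactly what anti-bot systems look for.
    print(lookup_asn("8.8.8.8"))
```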
There are many paid and free services that analyze ASN data, and they are often integrated with anti-bot systems. Such systems either block the crawler's access to the information or deliberately serve inaccurate data (for example, artificially lowered prices meant to spark interest from potential customers).
The accuracy of the information an aggregator provides is the key criterion of its quality, so every aggregator tries to improve it with various tools.
The problem can be solved - and trust in your aggregator portal preserved - with the help of residential proxies.
A proxy is a remote server or device with its own IP address. When you connect through a proxy, its IP address is placed in front of yours, masking your own data. The server you collect data from sees the proxy's IP, not yours - so you can fool the site by constantly changing addresses and pretending to be not one user but many different ones. No online store will block its potential customers.
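In code this amounts to handing the HTTP client a proxy URL and rotating it between requests. A minimal sketch (the gateway address and credentials are placeholders for whatever your proxy provider issues):

```python
import requests

# Placeholder credentials and gateway - substitute the values from your provider.
PROXY_URL = "http://USERNAME:PASSWORD@proxy-gateway.example:8000"

PROXIES = {
    "http": PROXY_URL,
    "https": PROXY_URL,
}


def fetch_via_proxy(url: str) -> str:
    """Fetch a page so the target site sees the proxy's IP instead of ours."""
    response = requests.get(url, proxies=PROXIES, timeout=15)
    response.raise_for_status()
    return response.text


if __name__ == "__main__":
    # Many residential-proxy providers rotate the outgoing IP per request
    # on their side, so each call can appear to come from a different user.
    html = fetch_via_proxy("https://example.com/")
    print(len(html), "bytes received")
```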
Working with aggregators, we have seen that the residential type is ideal for scraping.
Conclusions
- The everyday work of aggregators could be the plot of a spy movie. The mission: get into the online store, collect price data, and stay unnoticed.
- Automated data collection is a double-edged sword: you pay for scalability and convenience with potential blocks.
- Among the existing proxy types, residential proxies suit aggregators best, since they let you pass as a real user from any country.