social scraping

How do I parse people’s data from the social network Facebook?

A specialized Facebook scraper tool that takes into account all the conditions for collecting data from the world's largest social network. Which data can be parsed?

Apifornia team

Apr 6, 2022 • 4 min read

5 things you should know before parsing data from Facebook

In fact, Facebook prohibits any parsers

Before parsing a site, you should first check its robots.txt file. Websites use robots.txt to tell "bots" whether they are allowed to crawl and index the site. You can access the file by adding "/robots.txt" to the end of the target site's address. Enter https://www.facebook.com/robots.txt in your browser to check Facebook's file for bots. The two lines at the bottom of the file say that Facebook prohibits all automatic parsers. That is, no part of the site should be visited by an automatic parser.
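The same check can be done in code. Below is a minimal sketch using Python's standard `urllib.robotparser` module. The rules in `SAMPLE_ROBOTS_TXT`, the bot name `my-parser`, and the `example.com` URLs are illustrative assumptions, not a copy of Facebook's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules that block every bot from every path.
# Illustrative sample only, not Facebook's real robots.txt.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # "my-parser" is a made-up bot name used for the example.
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-parser", "https://example.com/some/page"))
```

To check a live site instead of a string, the same class offers `set_url()` and `read()`, which download the file before `can_fetch()` is called.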

Why should we respect robots.txt?

Websites use this file to define the rules by which you or a bot should interact with them. When a website blocks access to parsers, it is best to leave that site alone. Following the rules in the "robots.txt" file helps you avoid unethical data collection as well as possible legal consequences.

Technically, the only legal way to collect data from Facebook using a parser is to obtain prior written permission.

At the very beginning of the file for bots, Facebook warns: "Scanning Facebook is forbidden unless you have explicit written permission". By clicking on the link in the second line, you can find the terms of Facebook's automatic data collection, last revised on April 15, 2010.

Like any other terms and conditions in this world, Facebook's Automatic Data Collection Terms are huge (written in unusually small letters) and full of legal terms that few people fully understand.

These terms look familiar because we see them every time we install a new application on our mobile phone or register on a website.

"By obtaining permission to... you agree to be bound by..."

"You agree that you will not..."

"You agree that any violation of these terms may result in..."

As a social media giant, Facebook has money, time and a dedicated team of lawyers. If you keep parsing Facebook while ignoring its automatic data collection terms, that is your choice, but keep in mind that it has told you to at least get "written permission". This corporation can be quite aggressive in pursuing unauthorized data collection.

But of course, you can still parse data from Facebook as you wish.

Parsing the site without observing robots.txt does not mean that you will definitely face legal difficulties for breaking the rules.

Social media data is by far the largest and most dynamic set of data on human behavior and real events. For more than a decade, researchers and business experts around the world have been collecting information from Facebook, filtering out typical samples to understand individuals, groups and society, and exploring new opportunities hidden in user data. Many users accept that the use of social data is not always bad. For example, using social data to personalize marketing is what keeps much of the Internet free and makes the advertising and content we see more relevant.

Tools that you can use to get data from Facebook

In response to public protests after the Cambridge Analytica scandal, Facebook imposed severe restrictions on access to its API in April 2018.

Application Programming Interfaces (APIs) are software interfaces designed for use by computer programs. They let people obtain data at scale through an automated process. Many companies currently provide a public API so that users, researchers and third-party application developers can access their infrastructure.

Blocking the Facebook API and radically restricting access to data, as an attempt to protect user information, is controversial. As a result, people have only one choice left. Without the API, we can only retrieve data from Facebook through its user interfaces, i.e. web pages. This is exactly where parsers come into play.

However, since the GDPR entered into force, you are more likely to face a legal claim if you try to parse personal data.

The EU's General Data Protection Regulation, more widely known as the GDPR, entered into force on 25 May 2018. It is said to be the most important change in data privacy regulation in the last 20 years, one that should lead to radical changes in everything from technology to advertising, banking and medicine. Companies and organizations that store and process large amounts of consumer data, such as technology firms like Facebook, are affected most by the GDPR. Previously, these companies were left to oversee the protection of user data themselves. Now they have to make sure that they fully comply with the law under the GDPR.

The good news is:

...GDPR only applies to personal data.

Here "personal data" refers to data that can be used to identify a particular person, directly or indirectly. This type of information is better known as personal information, and it includes a person's name, physical address, e-mail address, telephone number, IP address, date of birth, employment information and even video/audio recordings. If you do not parse personal data, the GDPR does not apply to you.

In short, if you do not have the explicit consent of a person, it is currently illegal under the GDPR to parse the personal data of an EU resident.
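One way to act on this is to drop personal fields from every record before it is stored. The sketch below assumes records arrive as flat Python dictionaries. The field names in `PERSONAL_FIELDS` and the sample record are hypothetical and simply mirror the kinds of personal data listed above. Removing these fields does not by itself guarantee that a person cannot be identified indirectly from what remains.

```python
# Hypothetical field names, chosen to mirror the categories of personal data
# named in the text; a real data set will use its own keys.
PERSONAL_FIELDS = {
    "name",
    "address",
    "email",
    "phone",
    "ip_address",
    "date_of_birth",
    "employer",
}


def strip_personal_fields(record: dict) -> dict:
    """Return a copy of record without any key listed in PERSONAL_FIELDS."""
    return {key: value for key, value in record.items() if key not in PERSONAL_FIELDS}


if __name__ == "__main__":
    # Made-up record for illustration.
    sample = {
        "name": "Jane Doe",
        "email": "jane@example.com",
        "post_text": "Great product!",
        "likes": 12,
    }
    print(strip_personal_fields(sample))
```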

You can try alternative Facebook sources for your analysis project.

As mentioned above, although Facebook prohibits all automated scanners, it is still technically possible to capture data from the site. The problem is that it's risky. In addition to the legal consequences, you may also find that obtaining data on a regular basis becomes more difficult, because Facebook blocks suspicious IP addresses and may even implement stricter blocking mechanisms in the future, making it impossible to parse data from the site. Therefore, it is recommended to look for more reliable sources of social data for business intelligence and for understanding your target market.

Four alternative data sources to Facebook

Twitter - Approximately 500 million tweets are generated every day, and Twitter is overflowing with information that can serve as an excellent source for brand monitoring and customer sentiment assessment. Unlike Facebook, Twitter allows people to receive data on a large scale through its APIs.

Reddit - With as many users as Twitter, Reddit is one of the largest sources of user-generated content (UGC) in the world. Reddit provides public APIs that can be used for a variety of purposes such as data collection, automatic comments, or even help with subreddit moderation.

VKontakte (VK) - VK is a Russian social media platform focused on Russian and other Eastern European users. It boasts more than 90 million unique visitors per month and 9 billion page views every day. As a Russian company, VK adheres to Russian law, and if you check its robots.txt file, you will find that it is quite friendly to parsers.

Instagram - Instagram, owned by Facebook, focuses more on sharing visual content, especially videos and photos. Many brands use the platform to humanize their content and to improve customer communication and brand recognition. However, following the Facebook data restrictions last year, Instagram has also imposed radical data access restrictions, making it a much less reliable data source than before.
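To give a feel for what working with a public API involves, here is a small sketch around Reddit's JSON listings. It assumes the listing layout in which posts sit under `data.children[].data`. The helper names `build_listing_url` and `extract_titles` and the sample payload are made up for illustration, and no request is actually sent. Check Reddit's current API rules before using anything like this against the live service.

```python
from urllib.parse import urlencode


def build_listing_url(subreddit: str, limit: int = 25) -> str:
    """Build the URL of a subreddit's newest-posts listing in JSON form."""
    query = urlencode({"limit": limit})
    return f"https://www.reddit.com/r/{subreddit}/new.json?{query}"


def extract_titles(payload: dict) -> list:
    """Collect post titles from a listing payload (assumed layout, see above)."""
    children = payload.get("data", {}).get("children", [])
    return [
        child["data"]["title"]
        for child in children
        if "title" in child.get("data", {})
    ]


if __name__ == "__main__":
    # Hand-written sample that imitates the assumed listing structure.
    sample_payload = {
        "kind": "Listing",
        "data": {
            "children": [
                {"kind": "t3", "data": {"title": "First post"}},
                {"kind": "t3", "data": {"title": "Second post"}},
            ]
        },
    }
    print(build_listing_url("dataisbeautiful", limit=2))
    print(extract_titles(sample_payload))
```

Keeping URL construction and response parsing in separate functions makes the parsing step easy to check against saved responses without touching the network.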