Is it legal to parse sites?

Is it legal to collect data by parsing? Which data can be taken freely, and which are protected by law? Legal facts related to data parsing.

Introduction

Do you keep an eye on your competitors' sites? Knowing how other companies work can give you a certain competitive advantage. However, you need to receive such data not just once, but on a regular basis.

Do you want to parse sites but are not sure whether it is legally acceptable? Don't worry about it. Everybody wants to, and many are unsure whether automatic data retrieval is legal. Some people simply collect the data and keep going without a second thought.

Others wonder whether pulling product descriptions from shop sites will cause legal problems. Don't worry. To put an end to the discussion, we wrote this article, which dispels the myths about the legality of parsers.

Why does parsing often seem unethical?

When you parse web pages yourself, you hardly find it offensive or unethical. However, if someone else uses your site to gain a competitive advantage or financial benefit, you are certainly furious.

That's the point. Parsing is not usually done for academic or research purposes.

People think of parsing as a process by which companies invade someone else's space to gain a competitive advantage and the corresponding financial benefit.

These are the key reasons why people find web scraping offensive, even unethical, and on the brink of legality:

Because data gives companies an immediate competitive advantage, they search the Internet for the information they need, and in doing so they naturally pursue financial goals. This creates the impression that parsing exists to make money, and people dislike anything that is misused to make money. This is why parsing strikes them as offensive and even unethical.

When companies or individuals parse, they sometimes cross the line and violate copyright or terms of service. Extracting data from web pages can look like an aggressive activity in which ethical or legal standards are not respected. For this reason, it is difficult for people to see parsing in a positive light.

From time to time, people are simply offended by the way parsing is done: for example, when bots send far more requests than normal users would, putting a lot of pressure on the site.

Each site protects its data. Those who extract data may not respect or comply with these security measures. They may circumvent them and perform their tasks without concern for privacy or security issues.

Web parsing annoys people and has earned a bad reputation. Ironically, however, anyone who finds parsing offensive needs it as much as anyone else does!

Arguments for parsing

Anyway, our world is a world of data. No matter what field you work in, you need access to information. Without data, you cannot make real progress.

If you cannot work or run your own business without data, imagine what data means to large international companies.

Imagine that your company is a multi-billion-dollar corporation that needs to develop a new marketing campaign. Can you just put one together at random?

Of course not!

You need something on which to base your policy and strategy. This is where data comes into play: you need reliable, up-to-date data relating to your area of activity, and this is where parsing becomes a huge boon.

Besides automating the data collection process, web parsing also makes the data available in the shortest possible time.

Web crawling simplifies the search for data by gathering it in one place. Moreover, data may be available but in an inconvenient format. Parsing can extract the data and save it in the format you want, such as Excel, so you can process and use it the way you need.
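To make this concrete, here is a minimal sketch of that idea: fetch a page, pull out a couple of fields, and save them in a format Excel opens directly. The URL, CSS selectors, and field names are placeholders for illustration only, and the example assumes the requests and BeautifulSoup libraries are installed.

```python
# A minimal sketch: fetch a page, extract data, save it as CSV.
# The URL and selectors below are placeholders, not a real catalog.
import csv
import requests
from bs4 import BeautifulSoup  # third-party: pip install beautifulsoup4

html = requests.get("https://example.com/catalog", timeout=30).text
soup = BeautifulSoup(html, "html.parser")

rows = []
for item in soup.select(".product"):          # placeholder selector
    name = item.select_one(".title")
    price = item.select_one(".price")
    if name and price:
        rows.append([name.get_text(strip=True), price.get_text(strip=True)])

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["name", "price"])
    writer.writerows(rows)
```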

Web page parsing is a major aid to information technology, one without which the digital world as we know it could grind to a halt.

As long as parsing stays within legitimate boundaries and provides the data you need, there should be no reason to call it wrong or illegal.

Is parsing legal or not?

Let's take a practical example to understand this. Craigslist sued a company called Instamotor for parsing its content in order to create its own listings and send emails to Craigslist users about used car sales.

Guess what happened next? Instamotor was ordered to pay Craigslist $31 million. As you can see, parsing can turn into quite a difficult experience in legal terms.

You may wonder how legal this process is, when it becomes illegal, and when it can leave you vulnerable to lawsuits like this one.

We have collected the key points that show how legitimate or illegal your data extraction process is.

Is parsing legal or not? Nine answers to this question.

1. Computer Fraud and Abuse Act (CFAA)
Legal: As long as you do not parse aggressively and do not use the data for commercial gain, you do not violate the CFAA.
Illegal: The CFAA defines cases in which unauthorized access to and use of data violate the law, in particular data extraction for commercial gain or profit. If your parsing violates the CFAA, it may be considered illegal.

2. Copyright infringement
Legal: If you do not republish the data on the Internet and do not use it for commercial purposes, you are safe. Parsing itself is not illegal, but using protected material may be considered copyright infringement.
Illegal: Companies may hold copyright in their data. Using such data for commercial purposes may cause you problems with the law.

3. Invasion of an enclosed space
Legal: As long as you do not get into an enclosed space and do not disturb the operation of the site, your activity is mostly legal.
Illegal: Retrieving data from closed sections, or disrupting the site through your activity, may result in legal action.

4. Robots.txt
Legal: As long as you follow the rules set out in Robots.txt, you are safe. If the file directly prohibits automated crawling of the site, ask the site owner for permission to parse.
Illegal: Ignoring these rules can cause problems.

5. Scan frequency
Legal: If you use a sane request rate and do not damage the site, your parsing is considered legal. Use the crawl delay described in Robots.txt; if none is specified, the rule of thumb is one request every 10-15 seconds.
Illegal: A site is built to serve people. If you flood it with requests, it may slow down or stop responding altogether, which is already illegal.

6. API or parsing
Legal: Using an API instead of bots is more than legal.
Illegal: If you retrieve the data by requesting pages and damage the site, there is a risk of legal proceedings.

7. Violation of the Terms of Service
Legal: If you follow the rules set out in the ToS, there will be no problems. If parsing is directly forbidden by the terms, request permission.
Illegal: Violating the terms is illegal.

8. Too frequent requests
Legal: Use a reasonable number of requests and limit parallel scanning of the site.
Illegal: If your actions cause a malfunction of the server hosting the site, problems are more than likely.

9. Going beyond public content
Legal: As long as you work with open data, the process is safe. If you do not republish it or use it for your own benefit, you are safe.
Illegal: If you retrieve protected data, and even more so if you use it to achieve your business goals, your activity becomes illegal.

1. Computer Fraud and Abuse Act (CFAA)

As you can see from the Craigslist case, it was not so much about the data itself. Rather, it was about unauthorized access to and abuse of the data.

This is where the Computer Fraud and Abuse Act comes into force; Craigslist won thanks to it. According to this law, unauthorized use of data from a web page can lead to legal action.

Therefore, it is necessary to ensure that you do not violate this act. Parsing is illegal if it violates the CFAA.

Tip #1. Do not violate the rules set out in the CFAA. Avoid unauthorized access and the use of data for commercial or financial gain.

2. Copyright Infringement

Copyright is a well-known concept. However, one might ask how it relates to page parsing.

When you extract data, you gain access to information that may be copyrightable.

So if you get the data and use it for commercial purposes, it can cause problems with the law.

You might think that only public data is being scanned, and there is nothing wrong with that. But that is only true at the level of extracting the information. Commercial use of this data may not be permitted under copyright law, so if parsing leads to copyright infringement, it will be considered illegal.

Tip #2. Honour copyrights and do not parse or use copyrighted data.

3. Invasion of an enclosed space

This sounds less frightening than CFAA or copyright infringement, but it is also a serious legal problem.

In fact, you know that trespassing on someone's property is illegal. You are not allowed to burgle other people's homes.

Entry into forbidden space and irresponsible behavior on digital platforms are also not encouraged.

From a parsing perspective, this means you should not damage the site or interfere with its operation in any way. When parsing, you may not see how your activity negatively affects the site and the server.

To speed up data collection, your bot can make too many requests and slow down or even stop the server. This can be treated as a violation of the owner's rights.

In any case, your parsing should not affect the site and server. If this happens, there may be legal problems.

Tip #3. Do not enter prohibited sections and do not intrude on restricted areas or data.

4. Robots.txt

There is a file named Robots.txt that you should consult from the very beginning. This document contains all the rules on how bots should interact with the site.

Some sites prohibit bots entirely. If you check carefully, you will find a directive telling you to stay away from such a site.

Robots.txt also explains what the site considers "good behavior" when it comes to access, restricted pages and scanning frequency.

So if you want your parsing to be safe in terms of the law, follow the rules set out in Robots.txt. These are clear guidelines on what you can and cannot do. As long as you follow the rules contained in it, you will be safe and protected by the law.

Tip #4. Follow the Robots.txt rules and, when parsing, respect the conditions described there.
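As a rough illustration, here is a minimal sketch of checking Robots.txt before fetching a page, using Python's standard urllib.robotparser module. The site URL and user-agent name are placeholders.

```python
# A minimal sketch: consult Robots.txt before fetching a page.
# The target site and user-agent string below are placeholders.
from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # download and parse the Robots.txt file

url = "https://example.com/products/page-1"
if rp.can_fetch("my-parser-bot", url):
    print("Allowed to fetch:", url)
else:
    print("Robots.txt forbids fetching:", url)

# If the site specifies a crawl delay, respect it as well
delay = rp.crawl_delay("my-parser-bot")  # None if no delay is specified
print("Crawl delay:", delay)
```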

5. Scanning frequency

The power of parsing is also its weakness. The advantage of automatic information retrieval lies in the speed at which it can obtain the right material.

Still, here's the problem. Sites don't like such aggressive scanning and data retrieval at high speeds.

That's why many sites set scan delay parameters to slow down bots. However, many people who parse data persistently ignore these guidelines and damage sites with their actions. This, in turn, can expose them to serious legal problems.

Tip #5. Do not parse aggressively. Stick to a reasonable scanning speed, about one request every 10-15 seconds. As long as you scan at this rate, parsing will be safe.
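For illustration, here is a minimal sketch of a throttled crawl that follows the 10-15 second rule of thumb above. The URLs and user-agent are placeholders, and the example assumes the requests library.

```python
# A minimal sketch of a throttled crawl: one request, then a pause.
# The URLs and user-agent below are placeholders for illustration.
import time
import requests

urls = [
    "https://example.com/page-1",
    "https://example.com/page-2",
]

CRAWL_DELAY = 12  # seconds; use the Crawl-delay from Robots.txt if one is given

for url in urls:
    response = requests.get(url, headers={"User-Agent": "my-parser-bot"}, timeout=30)
    print(url, response.status_code)
    time.sleep(CRAWL_DELAY)  # give the server breathing room between requests
```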

6. API or parsing

Extracting data without considering the legitimacy of the process can lead to trouble.

Instead, you can choose a safer path: for example, use an API. Most of the sites you come across already offer their users an API.

It would not be appropriate to aggressively parse data when you have an API. The reason for this is that using the API puts you in a much more advantageous position.

Reasonable use of an API means legal security.

Tip #6. Most sites have an API. Use the API instead of parsing wherever possible.
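As a sketch of what the API route can look like: the endpoint, query parameters, and response fields below are entirely hypothetical, purely for illustration. Always check the documentation of the actual site you work with.

```python
# A minimal sketch of preferring an API over page parsing.
# The endpoint, parameters, and response layout are hypothetical.
import requests

response = requests.get(
    "https://api.example.com/v1/products",   # hypothetical API endpoint
    params={"category": "laptops", "page": 1},
    headers={"Accept": "application/json"},
    timeout=30,
)
response.raise_for_status()

for product in response.json().get("items", []):   # hypothetical response layout
    print(product.get("name"), product.get("price"))
```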

7. Violation of the Terms of Service (ToS)

When it comes to parsing, people often cross lines. One of them is the Terms of Service.

Websites create and store data and protect it from parsers. The Terms of Service usually state clearly that certain data on the site may not be extracted or used.

You may think that if you take data that is publicly available you are fine, but in fact, if the Terms of Service prohibit you from retrieving it, you cross the line.

Parsing publicly available data is not illegal in itself, but you may face a situation where the company initiates action against you if it wishes.

The point is that you must comply with the Terms of Service or be prepared for legal consequences.

Tip #7. Respect the Terms of Service. If they clearly state the rules for parsing web pages, follow both their letter and spirit.

8. Too frequent requests

The business world is so dependent on data that companies are willing to do almost anything to get it. And because time is of the essence, they want the data right away.

In an effort to defeat their competitors, they are willing to take unnecessary risks and scan pages quickly, ignoring rules and regulations.

One such violation is polling servers far more frequently than necessary. People do not browse sites at this speed, and servers are not designed for such a load.

So when you access the server too often, it may fail, or at least slow down to the point where it can no longer serve web pages to real users effectively.

This will give the site owner the right to initiate legal proceedings against you on the grounds that your actions have harmed the resource.

Tip #8. Maintain a time interval between consecutive requests and limit parallel scanning of the site. Do not be too aggressive in your actions.
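As a rough sketch of limiting parallel scanning, the example below caps the number of simultaneous requests at two. The URLs and user-agent are placeholders, and the requests library is assumed.

```python
# A minimal sketch of limiting how many requests hit the server at once.
# The URLs and user-agent below are placeholders for illustration.
from concurrent.futures import ThreadPoolExecutor
import requests

urls = [f"https://example.com/page-{i}" for i in range(1, 11)]  # placeholders

def fetch(url: str) -> int:
    response = requests.get(url, headers={"User-Agent": "my-parser-bot"}, timeout=30)
    return response.status_code

# max_workers caps the number of requests in flight at the same time
with ThreadPoolExecutor(max_workers=2) as pool:
    for url, status in zip(urls, pool.map(fetch, urls)):
        print(url, status)
```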

9. Going beyond public content

As a smart Internet user, you must learn to distinguish between public and private content.

Sites store some information that is available for public use and allow everyone to access it. However, there is also some information on the site that is not intended for public viewing.

If you deliberately go beyond public content and parse data that is not open to ordinary users, you may get into trouble.

For example, if a page requires a login, its data is not open to the public. Stay away from information you can only obtain after logging in.

If you break this basic rule and direct your parsers outside of the public content, you may get into legal trouble. However, if you stick to public data, you will be safe and able to extract the data for as long as you want without having to worry about legal consequences.

Tip #9. Access only public data and do not go beyond the open pages. Going further can lead to copyright infringement and other violations.

Conclusion

It is not a question of whether or not you are going to parse sites: automatic data retrieval is inevitable.

There is no other quick and effective way to get the information you need to make decisions and grow your business.

The question, however, is how to parse in a way that does not cause legal problems. To do this, you need to maintain a good balance between your needs and the capabilities and norms of the sites.

If you violate any of the rules set by the information owner, you may face legal prosecution.

On the other hand, if you carefully extract data without in any way damaging the site, you can continue to parse the data without worrying about legal consequences.

We hope this article helps you avoid legal problems and make the right decisions.

Parse, but with respect for other people's information!