Parsing the science section on reddit.com v 1.0
Setting up a development environment for parsing. The parsing process: where data collection on Reddit starts. Some ways to improve data collection efficiency.
This work is dedicated to parsing the science section of reddit.com in order to build a database of posts. The purpose is to accumulate information that can later be used for semantic analysis and as practice in Big Data analysis. The goal breaks down into a set of tasks:
- to select the necessary libraries;
- to find the addresses to parse and work out their logic;
- to create a function that parses the necessary information;
- to organize the accumulation of information;
- to propose a way of processing the collected information and producing a useful result.
This topic is relevant given the widespread adoption of algorithms for accumulating and analyzing big data. The collected information can be used to track active areas of science, or simply to learn English from the most common words (to expand one's vocabulary), and, more generally, as a tool for personal development.
1. Development environment and libraries
A major part of this kind of work is setting up the development environment with the right settings. However, this topic is quite extensive and, more importantly, it is strictly individual for each sentient being. My setup: Python, with the PyCharm IDE or Jupyter Notebook as the development environment, and the Atom editor for taking notes. Ultimately, though, the working environment could be a plain text editor and the command line.
This research was based on some interesting and useful articles: 1, 2 and, of course, Google.
A few notes on the author's setup:
- the code was originally written and tested on Ubuntu Linux 16.04 x64, using Python 2.7 and Firefox Quantum 58.0.1 (64-bit);
- the code is now used on Arch Linux and has been modernized for Python 3.8.
At the beginning of each chapter you will find code with comments. Explanations will be provided later, if necessary. So, let's get right to the point.
Choosing the necessary libraries
This is a kind of "equipment and reagent preparation". The main objects of study are the site and the data set, for which we need the requests, json and fake_useragent libraries. As a useful by-product, it is proposed to build a cloud of the most frequently occurring words, which requires its own libraries (wordcloud, numpy) as well as libraries for working with text (nltk, collections, pymorphy). The logic of the work: get information from the site (parsing), process it, add it to the database, save the database to disk, count word frequencies in the database, and build a word cloud (the useful result).
It is worth saying that thanks to the shebang line (#!/usr/bin/env python) this code can be run directly from the command line without explicitly specifying the interpreter. The second line of the header is needed for correct handling of utf-8 and unicode (# -*- coding: utf-8 -*-).
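As an illustration, a minimal script header with the libraries listed above might look like this (the exact import list is my assumption; the author's original code is not reproduced here):

#!/usr/bin/env python
# -*- coding: utf-8 -*-

# Imports assumed from the description above, not the author's exact list.
import json                            # parsing json responses
import requests                        # downloading pages
import pandas as pd                    # accumulating the object-feature matrix
from fake_useragent import UserAgent   # spoofing a browser-like User-Agent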
2. Parsing the site
We have selected the object of study and prepared everything necessary for our experimental work (section 1). Now we need to settle on an approach for receiving and downloading data. One of the most obvious and universal ways is to parse the html code of a page directly: you specify the address of the page and fetch its source. However, this way we only get the information lying on the surface (what is visible to the user). This is usually enough, but it is often useful to read up on a site's additional features.
Having searched a bit on the Internet, we find Article 2, which describes an interesting way of loading reddit pages in json format. This is very convenient for further processing.
Let's check it out in practice. Try opening a page of interest in json format in a browser (for example, https://www.reddit.com/r/science/new/.json). Notably, the structure is clearer and there is more information about each post than on the rendered page. Thus, it makes sense to choose the more informative approach and parse the json versions of the pages.
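To see what that json actually contains, here is a minimal sketch, assuming the usual structure of a reddit listing (posts sit under data -> children, each with its fields under data); the limit parameter and User-Agent string are arbitrary examples:

import json
import requests

url = "https://www.reddit.com/r/science/new/.json?limit=100"
response = requests.get(url, headers={"User-Agent": "science-parser-test"})
listing = json.loads(response.text)

posts = listing["data"]["children"]          # one element per post
print(len(posts))                            # how many posts were returned
print(sorted(posts[0]["data"].keys()))       # available properties of a post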
First of all, let's download the data:
For a small project like this, at least at first, we will use a functional programming approach as more transparent and understandable; the main actions are therefore implemented as functions.
The collect_news function returns a dictionary that stores the characteristic data of all the news posts. The dictionary keys are the properties of the last post found (we assume that all posts share the same properties). Each key maps to a list whose length is determined by the number of downloaded posts (100 by default). Each such list consists of lists, each storing the information under this key for one post; if a post has no such property, None is stored instead. This logic is chosen so that the outer list holds objects of the same type (lists) and every key maps to a list of the same size, which we will need later to build the Data Frame.
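A minimal sketch of how such a dictionary could be assembled (the name collect_news comes from the text; the implementation details are my assumption):

def collect_news(posts):
    """Build {property: [[value or None] for each post]} from a reddit json listing."""
    # posts: the listing["data"]["children"] list from the json shown earlier
    news = {}
    keys = posts[-1]["data"].keys()                  # properties of the last post found
    for key in keys:
        # one one-element list per post; None when the post lacks this property
        news[key] = [[post["data"].get(key)] for post in posts]
    return news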
The second function, read_js, handles json parsing with the corresponding library.
Note that the json parsing assumes the file format is correct. If it is not, the function returns None and the code breaks. This is a great opportunity to show your skills, username, and rewrite the code so that it keeps working even when this exception occurs.
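A minimal sketch of such a safeguard, assuming read_js simply wraps the standard json parser (this is not the author's exact code):

import json

def read_js(raw_text):
    """Parse a json string; return None instead of crashing on malformed input."""
    try:
        return json.loads(raw_text)
    except (ValueError, TypeError):
        return None

print(read_js('{"kind": "Listing"}'))   # {'kind': 'Listing'}
print(read_js("not valid json"))        # None - the caller can skip this page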
Let's look at getting information from the site in more detail. To avoid a ban (being blocked from visiting the site), we use a request signature typical of an ordinary user; that is what the fake_useragent library is for. To begin with, sufficiently long intervals between requests (about 1 s, which can be randomized) and a relatively small number of downloaded pages will be enough.
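A sketch of such a "polite" request loop; the exact pause and the list of urls are my examples, not the author's:

import random
import time

import requests
from fake_useragent import UserAgent

ua = UserAgent()
urls = ["https://www.reddit.com/r/science/new/.json"]

for url in urls:
    # send a browser-like User-Agent instead of the default python-requests one
    response = requests.get(url, headers={"User-Agent": ua.random})
    # wait roughly 1 s, slightly randomized, between requests
    time.sleep(1 + random.random())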
Next, it is a small matter of saving the information as an "object-feature" matrix for later use.
3. Data conversion and storage
The main part is written; it remains to accumulate the data, for which the pandas library is perfect.
Along with the check for an existing database (science_reddit_df), we have written a simple logging function, log_write, which lets us observe the behavior of our code (a sketch of this helper follows the example below). The log file looks like this:
Tue Jan 23 09:35:56 2018 new news message: 155
Tue Jan 23 09:39:21 2018 new news message: 0
Tue Jan 23 22:41:39 2018 new news message: 33
Wed Jan 24 09:23:09 2018 new news message: 11
I think it's pretty clear.
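A minimal sketch of what such a log_write helper could look like; only the name and the output format come from the text, the file name is hypothetical:

import time

def log_write(message, logfile="science_reddit.log"):
    # hypothetical file name; appends a timestamped line like the ones shown above
    with open(logfile, "a") as f:
        f.write("{} {}\n".format(time.asctime(), message))

log_write("new news message: 155")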
The logic of the scientific news database looks like this: get new information; check whether a saved database can be opened; otherwise create a new database; merge the two Data Frames; delete duplicate rows (on the news id column); save the updated database.
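A sketch of that update cycle, under the following assumptions: the database is stored on disk as a pickled Data Frame named science_reddit_df, the duplicate key is the id column, and new_df is the Data Frame built from the freshly parsed posts:

import os
import pandas as pd

DB_PATH = "science_reddit_df"   # file name from the text; pickle format is my assumption

def update_database(new_df, db_path=DB_PATH):
    """Merge freshly parsed posts into the stored database and save it back."""
    if os.path.exists(db_path):
        old_df = pd.read_pickle(db_path)   # open the saved database
    else:
        old_df = pd.DataFrame()            # first run: start from an empty database
    merged = pd.concat([old_df, new_df], ignore_index=True)
    merged = merged.drop_duplicates(subset="id")   # drop posts we already have
    merged.to_pickle(db_path)
    return len(merged) - len(old_df)       # number of genuinely new rows, handy for log_write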
4. Building a word cloud
To produce a useful output from the code, I suggest building a cloud of the most frequently occurring words. To do this, you need to extract the relevant text data from the database and preprocess it.
The logic of the code is as follows: select all headlines from the database; strip the text of html tags and extra spaces; process all the text in a loop; remove everything that is not a letter*; remove extra spaces again; normalize the words (reduce them to their "normal" form); remove "stop" words (the most common ones: articles, pronouns, etc.); output one long string of words separated by spaces. A sketch of this pipeline is given after the footnote below.
* Since Python 2.7 has certain encoding problems, two encoding methods are used just in case.
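A minimal Python 3 sketch of this pipeline; it uses the nltk lemmatizer instead of pymorphy for normalization, and the function name is my assumption (nltk's stopwords and wordnet data must be downloaded once):

import re

from nltk.corpus import stopwords           # requires nltk.download("stopwords")
from nltk.stem import WordNetLemmatizer     # requires nltk.download("wordnet")

def titles_to_text(titles):
    """Turn a list of headlines into one long string of normalized words."""
    lemmatizer = WordNetLemmatizer()
    stop_words = set(stopwords.words("english"))
    words = []
    for title in titles:
        clean = re.sub(r"<[^>]+>", " ", title)       # strip html tags
        clean = re.sub(r"[^a-zA-Z]", " ", clean)     # keep letters only
        clean = re.sub(r"\s+", " ", clean).strip()   # collapse extra spaces
        for word in clean.lower().split():
            word = lemmatizer.lemmatize(word)        # reduce to its "normal" form
            if word not in stop_words:
                words.append(word)
    return " ".join(words)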
The resulting string can be used to build a word cloud with the wordcloud library. After building the cloud, we recolor it in the desired color (in my case, 50 shades of white). A nicely designed cloud can be laid over a specific mask (image). The mask should be binarized (a black-and-white format with two pixel values: 0 and 255). There are several sites for this that are easily googled; I used this one: 3, the operation is called threshold.
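A sketch of the cloud-building step; the mask file name, the grey range of the recoloring function and the dummy input string are mine, not the author's:

import random

import numpy as np
from PIL import Image
from wordcloud import WordCloud

def shades_of_white(*args, **kwargs):
    # recolor every word into a random light grey ("50 shades of white")
    return "hsl(0, 0%, {}%)".format(random.randint(60, 100))

text = "science research data climate brain study"   # in practice: the string from the preprocessing step
mask = np.array(Image.open("mask.png"))               # binarized black-and-white image
cloud = WordCloud(background_color="black", mask=mask).generate(text)
cloud.recolor(color_func=shades_of_white)
cloud.to_file("news_raw.png")                         # the image converted in the next step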
Finally, a small gift for Ubuntu users (and maybe not just for them, if you think about it). The resulting image can be converted into an image with a transparent background by a simple terminal command:
convert ~/news_raw.png -transparent black ~/news_ready.png
In a similar spirit, you can organize automatic parsing of the site through autostart:
Set a delay in seconds before execution, so that the system has time to connect to the Internet, and then run the script (to execute it you can specify "python /FULL_PATH/science_reddit.py").
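For example, in Ubuntu's Startup Applications (or a crontab @reboot entry) the command could look roughly like this; the 60-second delay is an arbitrary example:

bash -c "sleep 60 && python /FULL_PATH/science_reddit.py"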
5. Conclusion
To summarize, the main goal was to accumulate information about the scientific news that appears on reddit.com. For this purpose, it was necessary to solve the tasks listed above:
- get information from the site;
- parse the information and select key properties for each news post;
- save each news post into an "object-feature" matrix.
With the code written, these tasks were solved and the goal was achieved. Our hard drive now holds the file science_reddit_df, which grows and accumulates information each time the script is run. In addition, the task of using this information was addressed: a cloud of the most frequently used words is built from the news headlines.
5.1 Ways to improve (self-improvement tasks)
Any project can be optimized and updated. I hope my readers will also adapt this information to their own needs. At first glance, the following could be implemented:
- write an error handler (check values returned by functions and react if a value is None or not of the expected type);
- move text processing into a separate function;
- use an object-oriented approach (a class for news posts);
- implement html parsing with the Beautiful Soup library;
- write a user interface (GUI);
- rewrite the code for Python 3.