How to Extract Facebook Posts, Comments, Pages, Photos, and More?

In this tutorial, you will use Python to scrape data from any Facebook profile or page. For a predefined number of posts, the data you will scrape includes:

  • Post URLs
  • Post Media URLs
  • Post Texts

You will also scrape comments from the posts, and from every comment you will extract:

  • Profile’s Name
  • Comment Text
  • Profile URLs

There is certainly a lot more that can be scraped from Facebook, but for this tutorial this will be sufficient.

Python Packages

For this tutorial, you will need the following Python packages:

  • bs4 (BeautifulSoup)
  • collections
  • json
  • logging
  • re
  • requests
  • time

Don't forget to install the third-party packages inside a Python virtual environment for this project; this is good practice. Note that collections, json, logging, re, and time ship with Python, so only the third-party packages need installing.
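
A minimal setup sketch (assuming a Unix-like shell; on Windows, activate with venv\Scripts\activate instead). lxml is included because the script builds its BeautifulSoup objects with the 'lxml' parser:

$ python -m venv venv
$ source venv/bin/activate
$ pip install beautifulsoup4 requests lxml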

Scrape Facebook Using Requests

Facebook is loaded with JavaScript, but the requests package does not execute JavaScript; it only lets you make simple web requests such as GET and POST.

Note: In this tutorial, you will scrape Facebook's mobile version, because it lets you scrape the required data with plain requests.
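
As a quick illustration (assuming network access, and keeping in mind that Facebook's markup changes over time), a single GET against the mobile site already returns server-rendered HTML containing the login form:

import requests

# Fetch the mobile home page; no JavaScript needs to run for the
# login form (id="login_form") to be present in the response body.
r = requests.get('https://mobile.facebook.com')
print(r.status_code, 'login_form' in r.text)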

How Will This Script Extract Facebook Mobile?


First, you need to consider what exactly the script will do. The script will:

  • Read a list of Facebook profile URLs from a file.
  • Read the login credentials from another file.
  • Log in using a Session object from the requests package.
  • For each profile URL, scrape data from a predefined number of posts.

The main section of the script looks like this:

if __name__ == "__main__":

    logging.basicConfig(level=logging.INFO)
    base_url = 'https://mobile.facebook.com'
    session = requests.session()

    # Extracts credentials for the login and all of the profiles URL to scrape
    credentials = json_to_obj('credentials.json')
    profiles_urls = json_to_obj('profiles_urls.json')

    make_login(session, base_url, credentials)

    posts_data = []
    for profile_url in profiles_urls:
        # Extend rather than assign, so posts from every profile are kept
        posts_data.extend(crawl_profile(session, base_url, profile_url, 25))
    logging.info('[!] Scraping finished. Total: {}'.format(len(posts_data)))
    logging.info('[!] Saving.')
    save_data(posts_data)

You use the logging package to print a few log messages during script execution, so that you can follow what the script is doing.

Then you define base_url, which is the URL of Facebook's mobile site.

After reading the input data from the files, you log in by calling a function named make_login, which will be described shortly.

After that, for each profile URL in the input data, you extract data from a given number of posts using the crawl_profile function.

Getting Input Data

As stated previously, this script needs to read input data from two sources: a file containing the profile URLs and another containing the credentials of the Facebook account used to log in. Let's define a function that loads data from these JSON files:

def json_to_obj(filename):
    """Extracts data from JSON file and saves it on Python object
    """
    obj = None
    with open(filename) as json_file:
        obj = json.loads(json_file.read())
    return obj

This function reads JSON-formatted data and converts it into Python objects.

The files credentials.json and profiles_urls.json contain the input data the script requires.

profiles_urls.json:
[
    "https://mobile.facebook.com/profileURL1/",
    "https://mobile.facebook.com/profileURL2"
]

credentials.json:

{
    "email":"username@mail.com",
    "pass":"password"
}


Replace these with the profile URLs you want to scrape data from and the credentials of the Facebook account you will log in with.
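
As a small, optional sanity check (a hypothetical addition, not part of the original script), you can verify that both files parse into the expected shapes before attempting to log in:

# Relies on json_to_obj as defined above
credentials = json_to_obj('credentials.json')
assert {'email', 'pass'} <= set(credentials), 'credentials.json must define "email" and "pass"'

profiles_urls = json_to_obj('profiles_urls.json')
assert isinstance(profiles_urls, list) and profiles_urls, 'profiles_urls.json must be a non-empty list'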

Facebook Log In

To make the login, you need to inspect Facebook's main page in its mobile version (mobile.facebook.com) to find out the URL of the login form.


If you right-click the "Log In" button and inspect it, you can see the form to which the credentials must be sent.

The action URL of the form element with id="login_form" is what you need to log in. Let's define the function that will handle this job:

def make_login(session, base_url, credentials):
    """Returns a Session object logged in with credentials.
    """
    login_form_url = '/login/device-based/regular/login/?refsrc=https%3A'\
        '%2F%2Fmobile.facebook.com%2Flogin%2Fdevice-based%2Fedit-user%2F&lwv=100'

    params = {'email':credentials['email'], 'pass':credentials['pass']}

    while True:
        time.sleep(3)
        logged_request = session.post(base_url+login_form_url, data=params)
        
        if logged_request.ok:
            logging.info('[*] Logged in.')
            break

With the action URL of the form element, you can make a POST request using Python's requests package. If the response is OK, you have logged in successfully; otherwise, you wait a moment and try again.

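As an optional check (an assumption on my part: Facebook sets a c_user cookie for authenticated sessions, which could change at any time), you can inspect the session cookies right after logging in:

# Hypothetical post-login check; 'c_user' is assumed to hold the user id
if 'c_user' in session.cookies:
    logging.info('[*] Session carries an authenticated cookie.')
else:
    logging.warning('[!] Login may have failed; check your credentials.')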

Crawling the Facebook’s Profile or Page

Once logged in, you need to crawl a Facebook page or profile URL to scrape its public posts.

def crawl_profile(session, base_url, profile_url, post_limit):
    """Goes to profile URL, crawls it and extracts posts URLs.
    """
    profile_bs = get_bs(session, profile_url)
    n_scraped_posts = 0
    scraped_posts = list()
    posts_id = None

    while n_scraped_posts < post_limit:
        try:
            posts_id = 'recent'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        except Exception:
            posts_id = 'structured_composer_async_container'
            posts = profile_bs.find('div', id=posts_id).div.div.contents

        posts_urls = [a['href'] for a in profile_bs.find_all('a', text='Full Story')] 

        for post_url in posts_urls:
            # print(post_url)
            try:
                post_data = scrape_post(session, base_url, post_url)
                scraped_posts.append(post_data)
            except Exception as e:
                logging.info('Error: {}'.format(e))
            n_scraped_posts += 1
            if posts_completed(scraped_posts, post_limit):
                break
        
        show_more_posts_url = None
        if not posts_completed(scraped_posts, post_limit):
            show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
            profile_bs = get_bs(session, base_url+show_more_posts_url)
            time.sleep(3)
        else:
            break
    return scraped_posts


First, you store the result of the get_bs function in the profile_bs variable. The get_bs function takes the Session object and a URL:

def get_bs(session, url):
    """Makes a GET requests using the given Session object
    and returns a BeautifulSoup object.
    """
    r = None
    while True:
        r = session.get(url)
        time.sleep(3)
        if r.ok:
            break
    return BeautifulSoup(r.text, 'lxml')

The get_bs function makes GET requests through the Session object until the response status is OK, and then returns a BeautifulSoup object built from the response body.
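
Note that this loop retries forever if requests keep failing. A bounded variant (a suggested alternative, not part of the original script) could look like this:

def get_bs_with_retries(session, url, max_retries=5):
    """Like get_bs, but gives up after max_retries failed attempts."""
    for attempt in range(max_retries):
        r = session.get(url)
        time.sleep(3)
        if r.ok:
            return BeautifulSoup(r.text, 'lxml')
        logging.warning('Request failed (%d/%d): %s', attempt + 1, max_retries, url)
    raise RuntimeError('Could not fetch {}'.format(url))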

Let's break down the crawl_profile function:

Once you have the profile_bs variable, you define variables for the number of posts scraped, the scraped posts themselves, and the id of the posts container.

Then you open a while loop that repeats as long as n_scraped_posts is less than post_limit.

Within the loop, you try to find the HTML element that holds the posts. If the URL is a Facebook page, the posts live in an element with id='recent'; if it is a personal profile, they live in an element with id='structured_composer_async_container'.

Once you know which element holds the posts, you can scrape their URLs from the 'Full Story' links.

Then, for each post URL found, you call the scrape_post function and append the result to the scraped_posts list.

Once you reach the predefined number of posts, you break out of the while loop; otherwise, you follow the 'Show more' link to the next page of posts.
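
For example, a single call (using the first placeholder URL from profiles_urls.json) could look like this:

# Scrape up to 10 posts from one profile
posts = crawl_profile(session, base_url, 'https://mobile.facebook.com/profileURL1/', 10)
logging.info('Scraped {} posts.'.format(len(posts)))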

Scrape Data from Different Facebook Posts

Let's look at the function that performs the actual post scraping:

def scrape_post(session, base_url, post_url):
    """Goes to post URL and extracts post data.
    """
    post_data = OrderedDict()

    post_bs = get_bs(session, base_url+post_url)
    time.sleep(5)

    # Here we populate the OrderedDict object
    post_data['url'] = post_url

    try:
        post_text_element = post_bs.find('div', id='u_0_0').div
        string_groups = [p.strings for p in post_text_element.find_all('p')]
        strings = [repr(string) for group in string_groups for string in group]
        post_data['text'] = strings
    except Exception:
        post_data['text'] = []
    
    try:
        post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
    except Exception:
        post_data['media_url'] = ''
    

    try:
        post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
    except Exception:
        post_data['comments'] = []
    
    return dict(post_data)

The function begins by creating an OrderedDict object, which will hold the post data:

  • Comments
  • Post Media URLs
  • Post Texts
  • Post URL

First, you need the post's HTML as a BeautifulSoup object, so you use the get_bs function for that.

Since you already know the post URL, you add it to the post_data object right away.

To scrape the post text, you need to locate the post's main element, as follows:

try:
        post_text_element = post_bs.find('div', id='u_0_0').div
        string_groups = [p.strings for p in post_text_element.find_all('p')]
        strings = [repr(string) for group in string_groups for string in group]
        post_data['text'] = strings
    except Exception:
        post_data['text'] = []

You search for the div that holds the text; since this element can contain several <p> tags, you iterate over all of them and collect their strings.
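
To make the .strings traversal concrete, here is a tiny standalone example (illustrative HTML, not Facebook's actual markup):

from bs4 import BeautifulSoup

html = '<div><p>Hello <a>world</a></p><p>again</p></div>'
soup = BeautifulSoup(html, 'lxml')
# .strings yields every text node under an element, including nested tags
print([s for p in soup.find_all('p') for s in p.strings])
# ['Hello ', 'world', 'again']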

Next, you scrape the post's media URL. A Facebook post may contain a video, images, or just text:

try:
        post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
    except Exception:
        post_data['media_url'] = ''

Finally, you call the extract_comments function to scrape the remaining data:

try:
        post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
    except Exception:
        post_data['comments'] = []
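
Putting it together, a single post could be scraped like this (the post path below is a made-up placeholder; real paths come from the 'Full Story' links):

post = scrape_post(session, base_url, '/story.php?story_fbid=123&id=456')
print(post['url'], len(post['text']), len(post['comments']))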

Scraping Facebook Comments

This function is the largest one in this tutorial; it iterates in a while loop for as long as there are more comments to extract:

def extract_comments(session, base_url, post_bs, post_url):
    """Extracts all coments from post
    """
    comments = list()
    show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
    first_comment_page = True

    logging.info('Scraping comments from {}'.format(post_url))
    while True:

        logging.info('[!] Scraping comments.')
        time.sleep(3)
        if first_comment_page:
            first_comment_page = False
        else:
            post_bs = get_bs(session, base_url+show_more_url)
            time.sleep(3)
        
        try:
            comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
                .find_all('div', id=re.compile(r'^\d+'))
        except Exception:
            comments_elements = []  # no comments container found

        if len(comments_elements) != 0:
            logging.info('[!] There are comments.')
        else:
            break
        
        for comment in comments_elements:
            comment_data = OrderedDict()
            comment_data['text'] = list()
            try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass
            
            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass
            
            comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
            comments.append(dict(comment_data))
        
        show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
        if show_more_url is not None and 'View more' in show_more_url.text:
            logging.info('[!] More comments.')
            show_more_url = show_more_url['href']
        else:
            break
    
    return comments
                      

You need to keep track of whether you are scraping the first page of comments or a subsequent one, so you initialize the first_comment_page variable to True.

You also look for the 'View more comments' link, since it tells you whether to keep iterating through the loop:

show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']

Within the main loop of the function, you first check the value of first_comment_page: if it is True, you scrape the comments from the current page; otherwise, you request the URL behind 'View more comments':

        if first_comment_page:
            first_comment_page = False
        else:
            post_bs = get_bs(session, base_url+show_more_url)
            time.sleep(3)

Next, you select all of the HTML elements that contain comments. If you right-click a comment and inspect it, you will see that each comment sits inside a div with a 17-digit numeric id:


Knowing this, you can select all of those elements like so:

 try:
            comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
                .find_all('div', id=re.compile(r'^\d+'))
        except Exception:
            comments_elements = []  # no comments container found

        if len(comments_elements) != 0:
            logging.info('[!] There are comments.')
        else:
            break

If no elements can be found, there are no comments left and the loop breaks. Otherwise, for each comment, you create an OrderedDict object in which to store the comment's data:

 for comment in comments_elements:
            comment_data = OrderedDict()
            comment_data['text'] = list()
            try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass
            
            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass
            
            comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
            comments.append(dict(comment_data))


Within this loop, you scrape the comment's text by searching for the HTML element that holds it; as with the post text, the element may contain several strings, so you iterate over all of them and append each string to a list:

 try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass

After that, you grab the media URL, if the comment has one:

            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass

Then you need the commenter's profile name and profile URL, which you can find like this:

  comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]


Once you have collected all the data the comment offers, you append it to the comments list. After that, you check whether there is a 'View more comments' link:

show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
        if show_more_url is not None and 'View more' in show_more_url.text:
            logging.info('[!] More comments.')
            show_more_url = show_more_url['href']
        else:
            break

The comment-scraping loop stops when it cannot find any more comments, and the post-scraping loop stops once it reaches the given post limit.
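
Once the run finishes, posts_data holds everything that was scraped. A quick post-processing sketch (a hypothetical addition) to summarize the result:

# Count how many comments were collected across all scraped posts
total_comments = sum(len(post['comments']) for post in posts_data)
logging.info('Scraped {} posts and {} comments.'.format(len(posts_data), total_comments))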

Complete Code

import requests
import re
import json
import time
import logging
from collections import OrderedDict
from bs4 import BeautifulSoup



def get_bs(session, url):
    """Makes a GET requests using the given Session object
    and returns a BeautifulSoup object.
    """
    r = None
    while True:
        r = session.get(url)
        time.sleep(3)
        if r.ok:
            break
    return BeautifulSoup(r.text, 'lxml')


def make_login(session, base_url, credentials):
    """Returns a Session object logged in with credentials.
    """
    login_form_url = '/login/device-based/regular/login/?refsrc=https%3A'\
        '%2F%2Fmobile.facebook.com%2Flogin%2Fdevice-based%2Fedit-user%2F&lwv=100'

    params = {'email':credentials['email'], 'pass':credentials['pass']}

    while True:
        time.sleep(3)
        logged_request = session.post(base_url+login_form_url, data=params)
        
        if logged_request.ok:
            logging.info('[*] Logged in.')
            break


def crawl_profile(session, base_url, profile_url, post_limit):
    """Goes to profile URL, crawls it and extracts posts URLs.
    """
    profile_bs = get_bs(session, profile_url)
    n_scraped_posts = 0
    scraped_posts = list()
    posts_id = None

    while n_scraped_posts < post_limit:
        try:
            posts_id = 'recent'
            posts = profile_bs.find('div', id=posts_id).div.div.contents
        except Exception:
            posts_id = 'structured_composer_async_container'
            posts = profile_bs.find('div', id=posts_id).div.div.contents

        posts_urls = [a['href'] for a in profile_bs.find_all('a', text='Full Story')] 

        for post_url in posts_urls:
            # print(post_url)
            try:
                post_data = scrape_post(session, base_url, post_url)
                scraped_posts.append(post_data)
            except Exception as e:
                logging.info('Error: {}'.format(e))
            n_scraped_posts += 1
            if posts_completed(scraped_posts, post_limit):
                break
        
        show_more_posts_url = None
        if not posts_completed(scraped_posts, post_limit):
            show_more_posts_url = profile_bs.find('div', id=posts_id).next_sibling.a['href']
            profile_bs = get_bs(session, base_url+show_more_posts_url)
            time.sleep(3)
        else:
            break
    return scraped_posts

def posts_completed(scraped_posts, limit):
    """Returns True once the number of posts scraped from
    the profile has reached its limit.
    """
    return len(scraped_posts) >= limit


def scrape_post(session, base_url, post_url):
    """Goes to post URL and extracts post data.
    """
    post_data = OrderedDict()

    post_bs = get_bs(session, base_url+post_url)
    time.sleep(5)

    # Here we populate the OrderedDict object
    post_data['url'] = post_url

    try:
        post_text_element = post_bs.find('div', id='u_0_0').div
        string_groups = [p.strings for p in post_text_element.find_all('p')]
        strings = [repr(string) for group in string_groups for string in group]
        post_data['text'] = strings
    except Exception:
        post_data['text'] = []
    
    try:
        post_data['media_url'] = post_bs.find('div', id='u_0_0').find('a')['href']
    except Exception:
        post_data['media_url'] = ''
    

    try:
        post_data['comments'] = extract_comments(session, base_url, post_bs, post_url)
    except Exception:
        post_data['comments'] = []
    
    return dict(post_data)


def extract_comments(session, base_url, post_bs, post_url):
    """Extracts all coments from post
    """
    comments = list()
    show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))['href']
    first_comment_page = True

    logging.info('Scraping comments from {}'.format(post_url))
    while True:

        logging.info('[!] Scraping comments.')
        time.sleep(3)
        if first_comment_page:
            first_comment_page = False
        else:
            post_bs = get_bs(session, base_url+show_more_url)
            time.sleep(3)
        
        try:
            comments_elements = post_bs.find('div', id=re.compile('composer')).next_sibling\
                .find_all('div', id=re.compile(r'^\d+'))
        except Exception:
            comments_elements = []  # no comments container found

        if len(comments_elements) != 0:
            logging.info('[!] There are comments.')
        else:
            break
        
        for comment in comments_elements:
            comment_data = OrderedDict()
            comment_data['text'] = list()
            try:
                comment_strings = comment.find('h3').next_sibling.strings
                for string in comment_strings:
                    comment_data['text'].append(string)
            except Exception:
                pass
            
            try:
                media = comment.find('h3').next_sibling.next_sibling.children
                if media is not None:
                    for element in media:
                        comment_data['media_url'] = element['src']
                else:
                    comment_data['media_url'] = ''
            except Exception:
                pass
            
            comment_data['profile_name'] = comment.find('h3').a.string
            comment_data['profile_url'] = comment.find('h3').a['href'].split('?')[0]
            comments.append(dict(comment_data))
        
        show_more_url = post_bs.find('a', href=re.compile(r'/story\.php\?story'))
        if show_more_url is not None and 'View more' in show_more_url.text:
            logging.info('[!] More comments.')
            show_more_url = show_more_url['href']
        else:
            break
    
    return comments


def json_to_obj(filename):
    """Extracts dta from JSON file and saves it on Python object
    """
    obj = None
    with open(filename) as json_file:
        obj = json.loads(json_file.read())
    return obj


def save_data(data):
    """Converts data to JSON.
    """
    with open('profile_posts_data.json', 'w') as json_file:
        json.dump(data, json_file, indent=4)


if __name__ == "__main__":

    logging.basicConfig(level=logging.INFO)
    base_url = 'https://mobile.facebook.com'
    session = requests.session()

    # Extracts credentials for the login and all of the profiles URL to scrape
    credentials = json_to_obj('credentials.json')
    profiles_urls = json_to_obj('profiles_urls.json')

    make_login(session, base_url, credentials)

    posts_data = []
    for profile_url in profiles_urls:
        # Extend rather than assign, so posts from every profile are kept
        posts_data.extend(crawl_profile(session, base_url, profile_url, 25))
    logging.info('[!] Scraping finished. Total: {}'.format(len(posts_data)))
    logging.info('[!] Saving.')
    save_data(posts_data)

                     
Running a Script

You can run the script with the following command in CMD or a terminal:

$ python facebook_profile_scraper.py                

When it completes, you will get a JSON file containing the scraped data:

[
    {
        "url": "/story.php?story_fbid=1201918583328686&id=826604640860084&refid=17&_ft_=mf_story_key.1201918583328686%3Atop_level_post_id.1201918583328686%3Atl_objid.1201918583328686%3Acontent_owner_id_new.826604640860084%3Athrowback_story_fbid.1201918583328686%3Apage_id.826604640860084%3Aphoto_attachments_list.%5B1201918319995379%2C1201918329995378%2C1201918396662038%2C1201918409995370%5D%3Astory_location.4%3Astory_attachment_style.album%3Apage_insights.%7B%22826604640860084%22%3A%7B%22page_id%22%3A826604640860084%2C%22actor_id%22%3A826604640860084%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1573226077%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B1201918583328686%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A4%2C%22targets%22%3A%5B%7B%22actor_id%22%3A826604640860084%2C%22page_id%22%3A826604640860084%2C%22post_id%22%3A1201918583328686%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D%3Athid.826604640860084%3A306061129499414%3A2%3A0%3A1575187199%3A3518174746269382888&__tn__=%2AW-R#footer_action_list",
        "text": [
            "'Cute moments like these r my weakness'",
            "' Follow our insta page: '",
            "'https://'",
            "'instagram.com/'",
            "'_disquieting_'"
        ],
        "media_url": "/Disquietingg/?refid=52&_ft_=mf_story_key.1201918583328686%3Atop_level_post_id.1201918583328686%3Atl_objid.1201918583328686%3Acontent_owner_id_new.826604640860084%3Athrowback_story_fbid.1201918583328686%3Apage_id.826604640860084%3Aphoto_attachments_list.%5B1201918319995379%2C1201918329995378%2C1201918396662038%2C1201918409995370%5D%3Astory_location.9%3Astory_attachment_style.album%3Apage_insights.%7B%22826604640860084%22%3A%7B%22page_id%22%3A826604640860084%2C%22actor_id%22%3A826604640860084%2C%22dm%22%3A%7B%22isShare%22%3A0%2C%22originalPostOwnerID%22%3A0%7D%2C%22psn%22%3A%22EntStatusCreationStory%22%2C%22post_context%22%3A%7B%22object_fbtype%22%3A266%2C%22publish_time%22%3A1573226077%2C%22story_name%22%3A%22EntStatusCreationStory%22%2C%22story_fbid%22%3A%5B1201918583328686%5D%7D%2C%22role%22%3A1%2C%22sl%22%3A9%2C%22targets%22%3A%5B%7B%22actor_id%22%3A826604640860084%2C%22page_id%22%3A826604640860084%2C%22post_id%22%3A1201918583328686%2C%22role%22%3A1%2C%22share_id%22%3A0%7D%5D%7D%7D&__tn__=C-R",
        "comments": [
            {
                "text": [
                    "Diana Vanessa",
                    " darling ",
                    "\u2764\ufe0f"
                ],
                "profile_name": "Zeus Alejandro",
                "profile_url": "/ZeusAlejandroXd"
            },
            {
                "text": [
                    "Ema Yordanova",
                    " my love ",
                    "<3"
                ],
                "profile_name": "Sam Mihov",
                "profile_url": "/darknessBornFromLight"
            },
...
...
...
            {
                "text": [
                    "Your one and only sunshine ;3"
                ],
                "profile_name": "Edgar G\u00f3mez S\u00e1nchez",
                "profile_url": "/edgar.gomezsanchez.7"
            }
        ]
    }
]

Conclusion

It may look like a simple script, but mastering it takes experience with several subjects, including requests, regular expressions, and BeautifulSoup. We hope this tutorial has taught you more about web scraping. As an exercise, you could try scraping the same data using different selectors, or scraping the number of reactions each post gets.
