web_scrapping_2

First, we import the necessary libraries

In [1]:
from bs4 import BeautifulSoup
import requests as re  # note: this alias shadows the standard-library re (regex) module
import pandas as pd

re.get returns the contents of the webpage, i.e. its HTML code

In [2]:
r = re.get("https://www.rev.com/blog/transcript-category/donald-trump-transcripts?view=all")
#print(r.content)
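A request can fail or hang, so it may help to set a timeout and check the status code before parsing. The sketch below is one way to do that. It imports requests under its own name rather than the `re` alias used in this notebook, and `fetch_html` and its `get` parameter are hypothetical names introduced only for illustration (the `get` parameter lets the function be tried without network access).

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors."""
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content


# Example usage (needs network access):
# html = fetch_html("https://www.rev.com/blog/transcript-category/donald-trump-transcripts?view=all")
```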

Let's convert the messy HTML into a structured, searchable form with Beautiful Soup

In [3]:
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly to avoid a warning
In [4]:
#soup

Let's find all the div containers with the class name fl-post-column. Each of these divs holds the link to the page of an individual Trump speech

In [5]:
div = soup.find_all("div", {"class": "fl-post-column"})
In [6]:
link_list = []
In [7]:
for d in div:
    # the link is stored in the itemid attribute of the first meta tag in each div
    link_list.append(d.meta["itemid"])
In [8]:
link_list
Out[8]:
['https://www.rev.com/blog/transcripts/donald-trump-concedes-election-condemns-rioters-video-transcript-january-7',
 'https://www.rev.com/blog/transcripts/trump-video-telling-protesters-at-capitol-building-to-go-home-transcript',
 'https://www.rev.com/blog/transcripts/donald-trump-speech-save-america-rally-transcript-january-6',
 'https://www.rev.com/blog/transcripts/donald-trump-rally-speech-transcript-dalton-georgia-senate-runoff-election',
 'https://www.rev.com/blog/transcripts/donald-trump-georgia-phone-call-transcript-brad-raffensperger-recording',
 'https://www.rev.com/blog/transcripts/donald-trump-melania-trump-christmas-message-transcript-2020',
 'https://www.rev.com/blog/transcripts/donald-trump-video-speech-transcript-on-covid-relief-bill-december-22',
 'https://www.rev.com/blog/transcripts/donald-trump-hosts-operation-warp-speed-covid-19-vaccine-summit-transcript-december-8',
 'https://www.rev.com/blog/transcripts/donald-trump-presents-medal-of-freedom-to-dan-gable-transcript-december-7',
 'https://www.rev.com/blog/transcripts/donald-trump-georgia-rally-transcript-before-senate-runoff-elections-december-5',
 'https://www.rev.com/blog/transcripts/donald-trump-presents-medal-of-freedom-to-lou-holtz-transcript-december-3',
 'https://www.rev.com/blog/transcripts/donald-trump-speech-on-election-fraud-claims-transcript-december-2']
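The link extraction above can be tried offline on a small HTML fragment. The fragment below is made up for illustration and only mimics the structure this notebook relies on (a div with class fl-post-column containing a meta tag with an itemid attribute); the example.com URLs are placeholders. The sketch also skips divs that have no such meta tag, which the loop above does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment that mimics the assumed page structure
html = """
<div class="fl-post-column">
  <meta itemid="https://example.com/transcripts/first" />
</div>
<div class="fl-post-column">
  <meta itemid="https://example.com/transcripts/second" />
</div>
<div class="other"><meta itemid="https://example.com/ignored" /></div>
"""

soup = BeautifulSoup(html, "html.parser")
columns = soup.find_all("div", class_="fl-post-column")

# Keep only divs whose first meta tag actually carries an itemid attribute
links = [
    c.meta["itemid"]
    for c in columns
    if c.meta is not None and c.meta.has_attr("itemid")
]
print(links)
```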

Download each transcript and collect them all in a data frame

In [9]:
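rows = []
for i, link in enumerate(link_list):
    r2 = re.get(link)
    soup2 = BeautifulSoup(r2.content, "html.parser")
    # the transcript text sits in the div with id "transcription"
    content = soup2.find_all("div", {"id": "transcription"})
    transcript = content[0].text
    rows.append({"id": i, "script": transcript})
# DataFrame.append was removed in pandas 2.0, so build the frame from a list of rows
df = pd.DataFrame(rows, columns=["id", "script"])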
In [10]:
df
Out[10]:
id script
0 0 \n\n\n\n \nDonald Trump: (00:00)\nI would like...
1 1 \n\n\n\n \nDonald Trump: (00:00)\nI know your ...
2 2 \n\n\n\n \nDonald Trump: (02:44)\nThe media wi...
3 3 \n\n\n\n \nCrowd: (00:00)\n(singing).\nCrowd: ...
4 4 \n\n\n\n \nMark Meadows: (00:00)\nMr. Presiden...
5 5 \n\n\n\n \nMelania Trump: (00:00)\nThe preside...
6 6 \n\n\n\n \nDonald Trump: (00:00)\nThroughout t...
7 7 \n\n\n\n \nSpeaker 9: (06:02)\nLadies and gent...
8 8 \n\n\n\n \nPresident Trump: (00:00)\nWe presen...
9 9 \n\n\n\n \nMelania Trump: (01:30)\nHello, Geor...
10 10 \n\n\n\n \nDonald Trump: (00:00)\n… being hono...
11 11 \n\n\n\n \nPresident Donald Trump: (00:00)\nTh...
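The output above shows that every transcript starts with a run of newlines and spaces. If that is unwanted, one possible cleanup is sketched below. `clean_transcript` is a hypothetical helper, and the sample string is invented to imitate the leading whitespace seen in the output; it is not taken from a real transcript.

```python
def clean_transcript(text):
    """Strip each line and drop the empty ones, keeping one line per utterance."""
    lines = [line.strip() for line in text.splitlines()]
    return "\n".join(line for line in lines if line)


# Invented sample that imitates the leading whitespace in the scraped text
raw = "\n\n\n\n \nSpeaker: (00:00)\nHello, everyone.\n"
cleaned = clean_transcript(raw)
print(repr(cleaned))
```

On the data frame from this notebook, the same helper could be applied with df["script"].apply(clean_transcript).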

Save the data frame to a CSV file

In [11]:
df.to_csv("trump.csv")
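By default to_csv also writes the row index as an extra unnamed column, which reappears when the file is read back. The sketch below shows a round trip with index=False on a small made-up data frame; it uses an in-memory buffer instead of a file so that nothing is written to disk, and the sample rows are placeholders rather than real transcripts.

```python
import io

import pandas as pd

# Made-up rows standing in for the scraped transcripts
sample = pd.DataFrame(
    {"id": [0, 1], "script": ["first line\nsecond line", "another transcript"]}
)

buffer = io.StringIO()
sample.to_csv(buffer, index=False)  # index=False leaves out the row index column

buffer.seek(0)
restored = pd.read_csv(buffer)  # quoted multi-line fields are read back intact
print(restored)
```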