First we need to import the necessary libraries
In [1]:
from bs4 import BeautifulSoup
import requests as re
import pandas as pd
re.get gives us the contents, i.e. the HTML code, of the webpage
In [2]:
r = re.get("https://www.rev.com/blog/transcript-category/donald-trump-transcripts?view=all")
#print(r.content)
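A request can fail or hang, so it may help to wrap the download in a small helper. The sketch below is only one way to do it: `fetch_html` is a hypothetical name, and the 10 second timeout is an arbitrary assumption, not something the site requires.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes, raising on HTTP errors.

    fetch_html is a hypothetical helper; the default timeout is an assumption.
    """
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

Calling `fetch_html(url)` would then stand in for `re.get(url).content`, with the difference that a failed request stops the notebook right away instead of passing an error page on to Beautiful Soup.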
Let's convert the messy code into a beautiful form with Beautiful Soup
In [3]:
soup = BeautifulSoup(r.content, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning
In [4]:
#soup
Let's find all the div containers with class name fl-post-column. Each of these divs holds the link to the individual page of a Trump speech
In [5]:
div = soup.findAll("div", {"class":"fl-post-column"})
In [6]:
link_list = []
In [7]:
for i in range(0, len(div)):
    link_list.append(div[i].meta["itemid"])
In [8]:
link_list
Out[8]:
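To see what the `.meta["itemid"]` lookup does, here is a small sketch on a made-up HTML snippet. The snippet only imitates the structure described above (a `meta` tag with an `itemid` attribute inside each `fl-post-column` div); the markup of the real page is an assumption and may differ, and the `sample_*` names and example.com URLs are invented for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML that imitates the structure assumed above; the real page may differ
sample_html = """
<div class="fl-post-column">
  <meta itemid="https://example.com/speech-1">
</div>
<div class="fl-post-column">
  <meta itemid="https://example.com/speech-2">
</div>
<div class="other-column">
  <meta itemid="https://example.com/not-a-speech">
</div>
"""

sample_soup = BeautifulSoup(sample_html, "html.parser")
# Only divs whose class is fl-post-column are kept
sample_divs = sample_soup.find_all("div", {"class": "fl-post-column"})
# d.meta is shorthand for d.find("meta"); ["itemid"] reads that attribute
sample_links = [d.meta["itemid"] for d in sample_divs]
print(sample_links)
```

The third div has a different class, so its link is left out of `sample_links`.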
Collect all the transcripts into a data frame
In [9]:
rows = []
for i in range(0, len(link_list)):
    r2 = re.get(link_list[i])
    soup2 = BeautifulSoup(r2.content, "html.parser")
    content = soup2.findAll("div", {"id":"transcription"})
    transcript = content[0].text
    # DataFrame.append was removed in pandas 2.0, so collect the rows in a list first
    rows.append({"id": i, "script": transcript})
df = pd.DataFrame(rows, columns = ["id", "script"])
In [10]:
df
Out[10]:
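The loop above reads `content[0]`, which raises an IndexError if a page has no div with id transcription. A more defensive variant might look like the sketch below. `extract_transcript` is a hypothetical helper, the two test pages are invented, and `get_text(strip=True)` is a choice made here to trim whitespace (the loop above uses `.text` unchanged).

```python
from bs4 import BeautifulSoup


def extract_transcript(html):
    """Return the text of the div with id "transcription", or None if absent.

    extract_transcript is a hypothetical helper, not part of Beautiful Soup.
    """
    page = BeautifulSoup(html, "html.parser")
    block = page.find("div", {"id": "transcription"})
    if block is None:
        # No transcript container on this page, so let the caller skip it
        return None
    return block.get_text(strip=True)


# Invented pages, used only to show both outcomes
good_page = '<html><body><div id="transcription"><p>Hello everyone.</p></div></body></html>'
bad_page = "<html><body><p>No transcript here.</p></body></html>"

print(extract_transcript(good_page))
print(extract_transcript(bad_page))
```

Inside the scraping loop, a page for which the helper returns None could simply be skipped instead of stopping the whole run.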
Save the data in CSV form
In [11]:
df.to_csv("trump.csv")
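By default `to_csv` also writes the row index as an extra, unnamed first column. If that column is not wanted, `index=False` leaves it out. The sketch below shows this on a tiny made-up data frame (`demo` and its two rows are invented) and writes to an in-memory buffer rather than to trump.csv.

```python
import io

import pandas as pd

# Tiny invented data frame with the same columns as df
demo = pd.DataFrame({"id": [0, 1], "script": ["first speech", "second speech"]})

# Write to an in-memory buffer instead of a file so nothing is left on disk
buffer = io.StringIO()
demo.to_csv(buffer, index=False)  # index=False leaves out the row-number column

# Read the CSV text back to check which columns were stored
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```

With a real file the call would be `df.to_csv("trump.csv", index=False)`.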