To retrieve data from the Internet, you can use the _requests_ library. Other libraries exist, such as _urllib_, but we will focus on _requests_ because it simplifies many common tasks.


__requests__ provides several functions; here we will look at the _get_ function, which sends a GET request.


In [1]:
import requests 

url = requests.get("https://en.wikipedia.org/wiki/List_of_serial_killers_in_the_United_States")

Despite its name, __url__ is a Response object, not a URL string. It exposes attributes such as status_code, text, encoding, apparent_encoding, and headers.
We can use these to inspect different parts of the response.
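
Before reading `text`, it is usually worth checking that the request succeeded. The sketch below is one way to do that, not the only one: `fetch_html` is a hypothetical helper name, and the hand-built `Response` exists only so the error path can be shown without a network call.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the body of `url` as text, raising on HTTP error statuses."""
    response = requests.get(url, timeout=timeout)  # give up after `timeout` seconds
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx codes
    return response.text


# Offline illustration only: a hand-built Response stands in for a failed request.
fake = requests.Response()
fake.status_code = 404

try:
    fake.raise_for_status()
    failed = False
except requests.HTTPError:
    failed = True

print(fake.ok, failed)
```

In the notebook, `fetch_html(...)` could replace the bare `requests.get(...).text` pattern whenever a silent 404 or 500 page would otherwise be parsed as if it were real content.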

In [3]:
type(url)

requests.models.Response

In [4]:
url.apparent_encoding

'utf-8'

In [5]:
url.status_code

200

In [6]:
print(url.text[:2000])

<!DOCTYPE html>
<html class="client-nojs" lang="en" dir="ltr">
<head>
<meta charset="UTF-8"/>
<title>List of serial killers in the United States - Wikipedia</title>
<script>document.documentElement.className="client-js";RLCONF={"wgBreakFrames":!1,"wgSeparatorTransformTable":["",""],"wgDigitTransformTable":["",""],"wgDefaultDateFormat":"dmy","wgMonthNames":["","January","February","March","April","May","June","July","August","September","October","November","December"],"wgRequestId":"d30646bd-7c13-4f76-86ed-5d9278b3df9f","wgCSPNonce":!1,"wgCanonicalNamespace":"","wgCanonicalSpecialPageName":!1,"wgNamespaceNumber":0,"wgPageName":"List_of_serial_killers_in_the_United_States","wgTitle":"List of serial killers in the United States","wgCurRevisionId":1044211112,"wgRevisionId":1044211112,"wgArticleId":32568837,"wgIsArticle":!0,"wgIsRedirect":!1,"wgAction":"view","wgUserName":null,"wgUserGroups":["*"],"wgCategories":["CS1 maint: uses authors parameter","All articles with dead external links","

__BeautifulSoup__ is another library; it parses HTML and XML data. In this example, we access the Wikipedia page listing American serial killers and scrape their names.
Parsing a page with BeautifulSoup creates a soup object that has attributes such as title and body.span, along with several useful functions.
We will use the following functions from BeautifulSoup:

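- __soup.prettify()__: returns the parsed document as a string, with each tag on its own line and indented by nesting level
- __soup.find_all(...)__: returns a list of all tags matching the given name or attributes
- __soup.find(...)__: returns the first tag matching the given name or attributes, or None if nothing matches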
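
The three functions above can be tried on a small inline document before pointing them at a real page. This is a minimal sketch: the HTML snippet is made up for illustration, and the built-in `html.parser` is used so no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up document used only for illustration.
html = """
<html><body>
<h1 id="top">Demo</h1>
<p class="note">first</p>
<p>second</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                        # first <p> tag only
all_p = soup.find_all("p")                      # list of every <p> tag
notes = soup.find_all("p", {"class": "note"})   # filter by attribute
missing = soup.find("table")                    # no match -> None

print(first_p.text, len(all_p), len(notes), missing)
print(soup.prettify())                          # indented string, one tag per line
```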


For more information check the documentation:

- requests: https://requests.readthedocs.io/en/latest/user/quickstart/
- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/


In [7]:
from bs4 import BeautifulSoup

content = url.text

# 'html5lib' is a separate package and must be installed for this parser to work
soup = BeautifulSoup(content, 'html5lib')
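
The parser named in the second argument has to be installed; if it is not, BeautifulSoup raises `FeatureNotFound`. A possible way to cope, assuming a fallback to the built-in `html.parser` is acceptable, is sketched below. `make_soup` is a hypothetical helper name, and the two parsers can build slightly different trees for malformed HTML.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred="html5lib"):
    """Parse `markup` with `preferred`, falling back to the built-in parser."""
    try:
        return BeautifulSoup(markup, preferred)
    except FeatureNotFound:
        # The preferred parser is not installed; html.parser ships with Python.
        return BeautifulSoup(markup, "html.parser")


soup = make_soup("<p>hi</p>")
print(soup.p.string)
```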

In [None]:
print(soup.prettify())

In [8]:
soup.h1

<h1 class="firstHeading" id="firstHeading">List of serial killers in the United States</h1>

In [None]:
soup.title.string

In [None]:
soup.a

In [None]:
soup.find_all('a')[0:10]

In [None]:
all_headers = soup.find_all('h2')
all_headers[0:5]

Next, we extract the table, collect the killers' names, and store them in a list.

In [None]:
table = soup.find("table", {"class": "wikitable sortable"})

# tr: table row
# th: table header
# td: table cell

table
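
One caveat with the lookup above: a class value containing a space is compared against the whole `class` attribute, so a table carrying any additional class is not found. The sketch below shows a CSS selector as one alternative; the inline HTML and the extra class name `extra` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up table with one class more than the lookup string mentions.
html = """
<table class="wikitable sortable extra">
  <tr><th>Name</th></tr>
  <tr><td>Alpha</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# Exact-string match on the class attribute: fails because of the extra class.
exact = soup.find("table", {"class": "wikitable sortable"})

# CSS selector: matches any table that has both classes, in any order.
table = soup.select_one("table.wikitable.sortable")

print(exact is None, table is not None)
```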

In [None]:
table.find_all('th')[2].string


In [None]:
names = [] 

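for entry in table.find_all('tr')[1:]:  # skip the header row
    name = entry.find('a')              # first link in the row, or None
    if name is not None:                # rows without a link are skipped
        names.append(name.text)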

In [None]:
names[:20]
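
The row loop can be checked offline on a small table. This is a sketch on made-up data, assuming the name is always the first link in a row; it also shows why a guard is useful when a row has no link at all.

```python
from bs4 import BeautifulSoup

# Made-up rows; the second data row deliberately has no link.
html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Years</th></tr>
  <tr><td><a href="/wiki/A">Alpha</a></td><td>1990</td></tr>
  <tr><td>No link here</td><td>1991</td></tr>
  <tr><td><a href="/wiki/B">Beta</a></td><td>1992</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").find("table")

names = []
for row in table.find_all("tr")[1:]:   # [1:] skips the header row
    link = row.find("a")               # None when the row has no <a> tag
    if link is not None:
        names.append(link.get_text(strip=True))

print(names)
```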

Alternatively, we can let pandas parse the tables for us:

In [9]:
import pandas as pd

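# read_html returns a list of DataFrames, one per table found; [0] is the first.
# Recent pandas versions prefer a file-like object, e.g. io.StringIO(content).
df = pd.read_html(content, header=0)[0]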
df.head()
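
The same call can be tried on an inline table. This sketch assumes pandas has an HTML parser backend available (lxml, or bs4 with html5lib) and wraps the string in `StringIO`, which newer pandas versions expect; the two rows are copied from the output of this notebook and the column selection is trimmed for brevity.

```python
from io import StringIO

import pandas as pd

html = """
<table class="wikitable sortable">
  <tr><th>Name</th><th>Years active</th><th>Proven victims</th></tr>
  <tr><td>Ables, Tony</td><td>1970–1990</td><td>4</td></tr>
  <tr><td>Ackroyd, John Arthur</td><td>1976–1992</td><td>1</td></tr>
</table>
"""

# One DataFrame per <table> element; header=0 uses the first row as column names.
tables = pd.read_html(StringIO(html), header=0)
df = tables[0]
names = df["Name"].tolist()

print(len(tables), list(df.columns), names)
```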


Unnamed: 0,Name,Years active,Proven victims,Possible victims,Status,Notes,Source
0,"Ables, Tony",1970–1990,4,4+,Sentenced to death; commuted to life imprisonment,"Murdered robbery victim in 1970, and at least ...",[4]
1,"Ackroyd, John Arthur",1976–1992,1,3+,Died in prison,Suspected in multiple murders along Oregon's H...,[5]
2,"Adams, Edward J.",1920–1921,7,7,Killed by police during shootout,"Criminal who murdered seven people, including ...",[6]
3,"Alcala, Rodney",1971–1979,8,50–230,"Sentenced to death, but died in prison before ...","Sometimes called ""The Dating Game Killer"" beca...",[7]
4,"Albright, Charles",1990–1991,3,3,Sentenced to life in prison; died in 2020,"Texas man known as the ""Eyeball Killer"" becaus...",[8]


In [10]:
df['Name']

0                   Ables, Tony
1          Ackroyd, John Arthur
2              Adams, Edward J.
3                Alcala, Rodney
4             Albright, Charles
                 ...           
454          Woodfield, Randall
455    Wright, Douglas Franklin
456             Wuornos, Aileen
457           Yates, Robert Lee
458            Zarinsky, Robert
Name: Name, Length: 459, dtype: object

In [21]:
url2 = requests.get("https://en.wikipedia.org/wiki/List_of_Middle-earth_characters")

content2 = url2.text

soup2 = BeautifulSoup(content2, 'html5lib')

In [31]:
full_list = []

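# The slice [1:17] is tied to the page layout at the time of writing
# and may need adjusting if the page changes.
for item in soup2.find_all('ul')[1:17]:
    for entry in item.find_all('li'):
        full_list.append(entry)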



In [27]:
for item in full_list:
    print(item)

<li><a href="/wiki/Aragorn" title="Aragorn">Aragorn</a>: Descendant of <a href="/wiki/Isildur" title="Isildur">Isildur</a> who was a principal figure in both the Fellowship of the Ring and the <a class="mw-redirect" href="/wiki/War_of_the_Ring" title="War of the Ring">War of the Ring</a>.  Became king over the reunited kingdoms of <a href="/wiki/Gondor" title="Gondor">Gondor</a> and Arnor.</li>
<li><a href="/wiki/Arwen" title="Arwen">Arwen</a>: Daughter of <a href="/wiki/Elrond" title="Elrond">Elrond</a> <a class="mw-redirect" href="/wiki/Half-elven" title="Half-elven">Half-elven</a> and Celebrían, marries <a href="/wiki/Aragorn" title="Aragorn">Aragorn</a> at the end of the War of the Ring and becomes queen of the reunited kingdoms of <a href="/wiki/Gondor" title="Gondor">Gondor</a> and Arnor.</li>
<li><a href="/wiki/Bilbo_Baggins" title="Bilbo Baggins">Bilbo Baggins</a>: A <a href="/wiki/Hobbit" title="Hobbit">hobbit</a> adventurer.  Discovered the <a href="/wiki/One_Ring" title="One

In [32]:
for entry in full_list:
    link = entry.find('a')   # first link in the list item, or None
    if link is not None:
        print(link.text)

Aragorn
Arwen
Bilbo Baggins
Frodo Baggins
Balin
Bard the Bowman
Beorn
Boromir
Merry Brandybuck
Celebrimbor
Denethor
Eärendil and Elwing
Elendil
Elrond
Éomer
Éowyn
Faramir
Fëanor
Finrod Felagund
Finwë and Míriel
Galadriel
Samwise Gamgee
Gandalf
Glorfindel
Gimli
Goldberry
Gollum
Gríma Wormtongue
Húrin
Isildur
Legolas
Lúthien and Beren
Maedhros
Melian
Morgoth
Radagast
Saruman
Sauron
Shelob
Smaug
Théoden
Thingol
Thranduil
Thorin Oakenshield
Tom Bombadil
Pippin Took
Treebeard
Tuor and Idril
Túrin Turambar
Ungoliant
Watcher in the Water
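
Each link also carries an `href`, so the names can be paired with full article URLs. The sketch below is one way to do it, using `urljoin` from the standard library to resolve the relative paths; the inline list is a shortened, made-up version of the items printed earlier, and `PAGE_URL` and `characters` are names chosen for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://en.wikipedia.org/wiki/List_of_Middle-earth_characters"

# Shortened stand-in for the list items shown above.
html = """
<ul>
  <li><a href="/wiki/Aragorn" title="Aragorn">Aragorn</a>: Descendant of <a href="/wiki/Isildur">Isildur</a>.</li>
  <li>An entry without any link</li>
  <li><a href="/wiki/Arwen" title="Arwen">Arwen</a>: Daughter of Elrond.</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

characters = []
for item in soup.find_all("li"):
    link = item.find("a")          # only the first link names the character
    if link is None:
        continue
    name = link.get_text(strip=True)
    full_url = urljoin(PAGE_URL, link.get("href"))  # relative -> absolute
    characters.append((name, full_url))

print(characters)
```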


In [33]:
url3 = requests.get("https://en.wikipedia.org/wiki/List_of_A_Song_of_Ice_and_Fire_characters")

content3 = url3.text

soup3 = BeautifulSoup(content3, "lxml")

In [36]:
character_name = []


for item in soup3.find_all('li', {'class': 'toclevel-3'})[:-2]:
    for entry in item.find_all('span', {'class': 'toctext'}):
        character_name.append(entry.string.strip())
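
Note that `.string` is None whenever a tag contains more than one child, in which case `.strip()` would raise an AttributeError. A more forgiving option, sketched here on made-up spans, is `get_text`, which joins all nested strings.

```python
from bs4 import BeautifulSoup

# Made-up spans: the first has nested markup, the second is plain text.
html = (
    '<span class="toctext">Jon <i>Snow</i></span>'
    '<span class="toctext"> Hodor </span>'
)

spans = BeautifulSoup(html, "html.parser").find_all("span", {"class": "toctext"})

nested_string = spans[0].string                  # None: two children
plain_string = spans[1].string.strip()           # works: a single text child

# get_text strips each piece and joins them with the given separator.
names = [span.get_text(" ", strip=True) for span in spans]

print(nested_string, plain_string, names)
```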



In [37]:
character_name

['Eddard Stark',
 'Catelyn Stark',
 'Robb Stark',
 'Sansa Stark',
 'Arya Stark',
 'Bran Stark',
 'Rickon Stark',
 'Jon Snow',
 'Benjen Stark',
 'Lyanna Stark',
 'Jeyne Westerling',
 'Roose Bolton',
 'Ramsay Bolton',
 'Rickard Karstark',
 'Alys Karstark',
 'Wyman Manderly',
 'Hodor',
 'Osha',
 'Jeyne Poole',
 'Jojen and Meera Reed',
 'Aegon V Targaryen',
 'Aerys II Targaryen',
 'Rhaegar Targaryen',
 'Viserys Targaryen',
 'Daenerys Targaryen',
 'Aegon VI Targaryen',
 'Brynden Rivers',
 'Maekar I Targaryen',
 'House Blackfyre',
 'Jon Connington',
 'Jorah Mormont',
 'Missandei',
 'Daario Naharis',
 'Grey Worm',
 'Tywin Lannister',
 'Cersei Lannister',
 'Jaime Lannister',
 'Tyrion Lannister',
 'Joffrey Baratheon',
 'Myrcella Baratheon',
 'Tommen Baratheon',
 'Kevan Lannister',
 'Lancel Lannister',
 'Bronn',
 'Gregor Clegane',
 'Sandor Clegane',
 'Podrick Payne',
 'Shae',
 'Robert Baratheon',
 'Stannis Baratheon',
 'Renly Baratheon',
 'Selyse Baratheon',
 'Shireen Baratheon',
 'Gendry',
 'Ed