Named Entity Recognition on a Wikipedia Article Using spaCy
In this article we'll try to find the names of people mentioned in a Wikipedia article using the spaCy library for Python. If you plan to run the code from this article, I assume you have already installed the spacy and wikipedia packages from PyPI.
Articles are often long, and we are only interested in certain information: a summary, the major events, or the major characters associated with the topic. Here we just extract person names from different articles. Deciding whether a word is the name of a person is done with a pretrained model, and spaCy does a reasonably good job of labeling them. That is what we explore in this article.
Steps
- Search for Wikipedia articles
- Use spaCy to create a document object
- Iterate over the entities and keep the ones labeled PERSON
- Count how often each person appears and plot the counts in descending order
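The following section lists the necessary imports. wikipedia is a Python package used to fetch Wikipedia content. The en_core_web_lg model has to be downloaded separately, for example with python -m spacy download en_core_web_lg.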
import wikipedia
import spacy
from collections import Counter
import matplotlib.pyplot as plt

# Load the large pretrained English pipeline
nlp = spacy.load('en_core_web_lg')
Search on a specific page
Here we search for pages matching a given term. I have chosen Lord Krishna as our starting point. Let's see which persons occur most frequently in the Wikipedia article related to Lord Krishna.
result = wikipedia.search("Krishna")
result
Output:
['Krishna',
 'Krishna Krishna',
 'Krishna Janmashtami',
 'Krishna (Telugu actor)',
 'Krishna Vamsi',
 'Krishna Bhagavaan',
 'International Society for Krishna Consciousness',
 'Krishna-Krishna',
 'Hare Krishna',
 'Krishna (TV series)']
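A search can also come back empty, in which case indexing the first result fails. The following is a minimal sketch of a guard around the search-then-load step. The names `first_page`, `search`, and `load_page` are hypothetical; the two callables are assumed stand-ins for `wikipedia.search` and `wikipedia.page`, injected so that the logic can be tried offline.

```python
def first_page(term, search, load_page):
    """Return the page for the first search hit, or raise if there are no hits.

    `search` and `load_page` are injected callables (assumed stand-ins for
    `wikipedia.search` and `wikipedia.page`).
    """
    titles = search(term)
    if not titles:
        raise LookupError(f"no search results for {term!r}")
    return load_page(titles[0])


# Offline demonstration with fake callables instead of the real API.
fake_index = {"Krishna": ["Krishna", "Krishna Janmashtami", "Hare Krishna"]}
fake_pages = {"Krishna": "text of the Krishna article"}

page_text = first_page(
    "Krishna",
    search=lambda term: fake_index.get(term, []),
    load_page=lambda title: fake_pages[title],
)
print(page_text)
```

With the real package, the call would presumably look like `first_page("Krishna", wikipedia.search, wikipedia.page)`.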
We load the page content for the first result of our search.
page = wikipedia.page(result[0], preload=True)
We get the parsed document using spaCy.
doc = nlp(page.content)
# Optional: visualize the entities in the browser
# from spacy import displacy
# displacy.serve(doc, style="ent")
Let's look at the URL of the page for the first result of our search query.
page.url
Output:
'https://en.wikipedia.org/wiki/Krishna'
Let's explore the part-of-speech tags of the tokens in our page. For illustration I show only the first few tokens here.
max_token_display = 10
for idx, token in enumerate(doc):
    # Print the token and its part-of-speech tag
    print(token.text, "-->", token.pos_)
    if idx > max_token_display:
        break
Output:
The --> DET
Mahābhārata --> PROPN
( --> PUNCT
US --> PROPN
: --> PUNCT
, --> PUNCT
UK --> PROPN
: --> PUNCT
; --> PUNCT
Sanskrit --> ADJ
: --> PUNCT
महाभारतम् --> X
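Note that the enumerate-and-break loop prints max_token_display + 2 tokens, because the check runs after the print and uses a strict comparison. As a sketch of an alternative, `itertools.islice` takes exactly the first n items of any iterable. The helper name `head` and the `tagged` list are illustrative assumptions; with spaCy the iterable would be the `doc` object itself.

```python
from itertools import islice


def head(items, n):
    """Return the first `n` items of any iterable as a list."""
    return list(islice(items, n))


# Stand-in (text, tag) pairs copied from the start of the output shown in this section.
tagged = [
    ("The", "DET"),
    ("Mahābhārata", "PROPN"),
    ("(", "PUNCT"),
    ("US", "PROPN"),
    (":", "PUNCT"),
]

for text, tag in head(tagged, 3):
    print(text, "-->", tag)
```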
Here are some of the entity labels that spaCy assigns to spans of text in the document, together with their start and end character offsets.
for idx, ent in enumerate(doc.ents):
    print(ent.text, ent.start_char, ent.end_char, ent.label_)
    if idx > max_token_display:
        break
Output:
Mahābhārata 4 15 PERSON
US 17 19 GPE
UK 23 25 GPE
Sanskrit 29 37 LANGUAGE
महाभारतम् 39 48 CARDINAL
Mahābhāratam 50 62 PERSON
two 108 111 CARDINAL
Sanskrit 118 126 NORP
India 144 149 GPE
Rāmāyaṇa 171 179 PERSON
two 214 217 CARDINAL
the Kurukshetra War 239 258 EVENT
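Grouping the entities by label makes it easier to inspect what the model put into each category, including mislabels such as the epic titles tagged PERSON in the listing. The following is a small sketch using only the standard library. The `entities` list is copied from that listing, and `group_by_label` is a hypothetical helper name.

```python
from collections import defaultdict

# (text, label) pairs copied from the entity listing in this section.
entities = [
    ("Mahābhārata", "PERSON"), ("US", "GPE"), ("UK", "GPE"),
    ("Sanskrit", "LANGUAGE"), ("महाभारतम्", "CARDINAL"),
    ("Mahābhāratam", "PERSON"), ("two", "CARDINAL"), ("Sanskrit", "NORP"),
    ("India", "GPE"), ("Rāmāyaṇa", "PERSON"), ("two", "CARDINAL"),
    ("the Kurukshetra War", "EVENT"),
]


def group_by_label(pairs):
    """Map each label to the list of entity texts carrying it, in input order."""
    grouped = defaultdict(list)
    for text, label in pairs:
        grouped[label].append(text)
    return dict(grouped)


by_label = group_by_label(entities)
for label, texts in by_label.items():
    print(label, texts)
```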
In the section below we collect all entities labeled PERSON.
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
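Because the list keeps every surface form as a separate string, transliterations with and without diacritics are counted separately; the counts in this article contain both 'Mahābhārata' and 'Mahabharata'. One possible way to merge them, shown here as a sketch rather than as part of the original pipeline, is to strip combining marks with `unicodedata`. The helper name `strip_diacritics` is an assumption, and the approach is only meant for Latin-script names; it is not suitable for text in scripts such as Devanagari.

```python
import unicodedata


def strip_diacritics(name):
    """Return `name` with combining marks removed (for example, ā becomes a)."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


names = ["Mahābhārata", "Mahabharata", "Rāmāyaṇa", "Krishna"]
print([strip_diacritics(n) for n in names])
```

Applying this to each entity text before counting would fold the two spellings into a single key.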
Let's count how often each person name identified by spaCy occurs on the page.
person_count = Counter(persons)
print(person_count)
Next we sort the persons from most to fewest occurrences, keeping only the names that occur more than once.
person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
print(person_count)
Output:
{'Pandavas': 31, 'Krishna': 25, 'Mahābhārata': 24, 'Mahabharata': 23, 'Pandu': 17, 'Dhritarashtra': 15, 'Yudhishthira': 14, 'Bhishma': 11, 'Kunti': 11, 'Kaurava': 8, 'Satyavati': 6, 'Madri': 6, 'Gandhari': 6, 'Vyasa': 5, 'Kuru': 5, 'Pandava': 5, 'Vichitravirya': 5, 'Vidura': 5, 'Kauravas': 5, 'Rāmāyaṇa': 4, 'Bhima': 4, 'Draupadi': 4, 'Jain': 4, 'Gupta': 3, 'Janamejaya': 3, 'Jaya': 3, 'Minkowski': 3, 'Parikshit': 3, 'Devavrata': 3, 'Amba': 3, 'Karna': 3, 'Yama': 3, 'Yayati': 3, 'Jarasandha': 3, 'Motilal Banarsidass': 3, 'BCE': 2, 'Ugraśrava Sauti': 2, 'Vasu': 2, 'Oberlies': 2, 'Kālidāsa': 2, 'Mahapadma Nanda': 2, 'Adhisimakrishna': 2, 'Shakuni': 2, 'Dushasana': 2, 'Ghatotkacha': 2, 'J. L. Fitzgerald': 2, 'P. Lal': 2, 'Bibek Debroy': 2, 'Shyam Benegal': 2, 'Vasudeva': 2, 'Jaini': 2, 'Oldenberg': 2}
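`Counter.most_common()` already returns the items sorted from highest to lowest count, so the sort-and-filter step can be written a little more compactly. The following is a sketch with a small made-up `persons` list; the helper name `frequent` and its `min_count` parameter are illustrative assumptions.

```python
from collections import Counter

# Made-up sample standing in for the list of PERSON entity texts.
persons = ["Krishna", "Pandu", "Krishna", "Kunti", "Pandu", "Krishna", "Vyasa"]


def frequent(names, min_count=2):
    """Return {name: count} in descending order of count, dropping rare names."""
    ranked = Counter(names).most_common()  # sorted, highest count first
    return {name: n for name, n in ranked if n >= min_count}


print(frequent(persons))
```

With `min_count=2` this keeps the same names as the `v > 1` filter used in the article.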
Here we plot the count for each person appearing on the page as a horizontal bar chart.
fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(25.5, 25.5)
plt.barh(list(person_count.keys()), list(person_count.values()))
plt.xticks(rotation=0, fontsize=40)
plt.yticks(rotation=0, fontsize=25)
# Write each count next to the end of its bar
for i, v in enumerate(person_count.values()):
    ax.text(v + 2, i, str(v), color='black', fontsize=20)
plt.show()
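`plt.barh` draws the first label at the bottom of the axis, so with counts sorted in descending order the most frequent name ends up at the bottom of the chart. If the largest bar should be on top, one option is to reverse the data before plotting. This is a sketch; `barh_inputs` is a hypothetical helper, and the sample dictionary uses three values from the counts in this article.

```python
def barh_inputs(counts):
    """Return (labels, values) reversed so the first dict entry is drawn on top."""
    labels = list(counts)[::-1]
    values = [counts[label] for label in labels]
    return labels, values


person_count = {"Pandavas": 31, "Krishna": 25, "Pandu": 17}
labels, values = barh_inputs(person_count)
print(labels, values)
# The two lists can then be passed to plt.barh(labels, values).
```

Calling `ax.invert_yaxis()` after plotting should give the same visual result without reordering the data.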

Check another page
The following code consolidates everything and uses a different search query, 'Jesus'.
result = wikipedia.search("Jesus")
page = wikipedia.page(result[0], preload=True)
doc = nlp(page.content)
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
person_count = Counter(persons)
person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(25.5, 25.5)
plt.barh(list(person_count.keys()), list(person_count.values()))
plt.xticks(rotation=0, fontsize=40)
plt.yticks(rotation=0, fontsize=25)
for i, v in enumerate(person_count.values()):
    ax.text(v + 2, i, str(v), color='black', fontsize=20)
plt.show()

The same steps for the search term 'Mahabharat':
result = wikipedia.search("Mahabharat")
page = wikipedia.page(result[0], preload=True)
doc = nlp(page.content)
persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
person_count = Counter(persons)
person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
fig = plt.gcf()
ax = plt.gca()
fig.set_size_inches(25.5, 25.5)
plt.barh(list(person_count.keys()), list(person_count.values()))
plt.xticks(rotation=0, fontsize=40)
plt.yticks(rotation=0, fontsize=25)
for i, v in enumerate(person_count.values()):
    ax.text(v + 2, i, str(v), color='black', fontsize=20)
plt.show()

Create a function that combines all of the above
Finally, we can create a function that plots the names found on the first page returned by a search for a given term. The search term is passed as an argument. The individual steps are explained in the previous sections.
def plot_names_from_page(title="Mahabharat"):
    # Fetch and parse the first page returned for the search term
    result = wikipedia.search(title)
    page = wikipedia.page(result[0], preload=True)
    doc = nlp(page.content)
    # Count PERSON entities and keep those occurring more than once
    persons = [ent.text for ent in doc.ents if ent.label_ == 'PERSON']
    person_count = Counter(persons)
    person_count = {k: v for k, v in sorted(person_count.items(), key=lambda item: item[1], reverse=True) if v > 1}
    print(page.url)
    # Plot the counts as a horizontal bar chart
    fig = plt.gcf()
    ax = plt.gca()
    fig.set_size_inches(25.5, 25.5)
    plt.barh(list(person_count.keys()), list(person_count.values()))
    plt.xticks(rotation=0, fontsize=40)
    plt.yticks(rotation=0, fontsize=25)
    # plt.title(page.url, fontdict={'size': 20})
    for i, v in enumerate(person_count.values()):
        ax.text(v + 2, i, str(v), color='black', fontsize=20)
    plt.show()
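The counts shown earlier include strings such as 'BCE' and 'Jain' that the model tagged as PERSON. One way to deal with this, sketched here under stated assumptions, is to separate the counting from the fetching and plotting and to pass in a set of strings to skip. `person_counts`, `min_count`, and `ignore` are hypothetical names, and the entities below are stand-ins built with `SimpleNamespace`, so the function can be tried without spaCy or network access.

```python
from collections import Counter
from types import SimpleNamespace


def person_counts(ents, min_count=2, ignore=frozenset()):
    """Count PERSON entities, skipping names listed in `ignore`.

    `ents` is any iterable of objects with `.text` and `.label_`
    attributes, which is the shape of the items in `doc.ents`.
    """
    names = (e.text for e in ents if e.label_ == "PERSON" and e.text not in ignore)
    return {name: n for name, n in Counter(names).most_common() if n >= min_count}


# Stand-in entities for demonstration only.
fake_ents = [
    SimpleNamespace(text=t, label_=l)
    for t, l in [
        ("Rama", "PERSON"), ("BCE", "PERSON"), ("Rama", "PERSON"),
        ("BCE", "PERSON"), ("India", "GPE"), ("Sita", "PERSON"),
    ]
]
print(person_counts(fake_ents, ignore={"BCE"}))
```

In the function above, the returned dictionary could replace the two `person_count` lines, with `doc.ents` passed as `ents`.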
Finally, we can use the function above to get the occurrences of different names on a Wikipedia page. I tried it on articles covering a variety of topics. The first one is the Iliad by Homer. Most of the names are characters in the book; the list may also include the author's name.
plot_names_from_page('Iliad')

The following are the names from the article on the great Hindu epic Ramayana. As we might expect, the name of Lord Rama appears most often here.
plot_names_from_page('Ramayan')

plot_names_from_page('World_War_I')
Output:
https://en.wikipedia.org/wiki/World_War_I

plot_names_from_page('great depression')
Output:
https://en.wikipedia.org/wiki/Great_Depression

plot_names_from_page('higgs boson')
Output:
https://en.wikipedia.org/wiki/Higgs_boson
