Webscraping Chinese Characters Using Python and Beautiful Soup

TechJD
3 min read · Dec 29, 2019


Learn how to scrape almost 10,000 Chinese characters in seconds.

This post is part of an ongoing series focused on designing and deploying a simple flash card app for Android (and later, maybe iOS) using Python. So, if you’re interested in learning Kivy, a cross-platform Python framework for natural user interface (NUI) development, stay tuned. But first, a quick detour: before I actually started building the app, I had to find a database that could provide all of the Chinese characters I needed, and a quick and reliable way to integrate it.

Beautiful Soup

If you didn’t know already, Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping (Wikipedia). Go ahead and install it.

pip install beautifulsoup4
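Once it’s installed, here’s a quick sanity check. This snippet parses a tiny HTML table (sample data I made up, not fetched from HanziDB) and pulls the character and pinyin out of the first data row — the same pattern the scraper later in this post relies on.

```python
from bs4 import BeautifulSoup

# A tiny HTML table in the same shape as a HanziDB row (hypothetical sample)
html = """
<table>
  <tr><th>Character</th><th>Pinyin</th></tr>
  <tr><td>的</td><td>de</td></tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")
rows = page_soup.find_all("tr")[1:]            # skip the header row
first_char = rows[0].td.text                   # first <td> holds the character
first_pinyin = rows[0].find_all("td")[1].text  # second <td> holds the pinyin
print(first_char, first_pinyin)                # 的 de
```

If this prints the character and its pinyin, Beautiful Soup is working and you’re ready to point it at a real page.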

Finding a Reliable and Sufficient Database

For those who don’t know, there are a lot of Chinese characters. The number is estimated at over 50,000, even though modern dictionaries only list the approximately 20,000 that are still in use. It is said that an educated Chinese person knows about 8,000 characters, but you only need about 2,000–3,000 to be able to read a newspaper. http://www.bbc.co.uk/languages/chinese/real_chinese/mini_guides/characters/characters_howmany.shtml

My next issue was apparent. If I wanted to make flash cards, how many should I make? Of course, I would start with the most common characters, but how far should I go?

I went to hanzidb.org. It’s got a bunch of different lists, but the one I chose was “Chinese characters by frequency.”

Try copying these by hand…

The list contains 9,933 characters! I actually tried copying these by hand at first. The first page alone took me over an hour… There was no way I was going to build a new database of my own from scratch. Luckily, HanziDB has very clean and elegant HTML, which made it really simple to scrape.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

base_url = 'http://hanzidb.org/character-list/by-frequency'

# Page 1 has no query string; pages 2 through 100 use ?page=N
my_urls = [base_url]
for num in range(2, 101):
    my_urls.append(base_url + '?page=' + str(num))

dkey = 0
for my_url in my_urls:
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()

    page_soup = soup(page_html, "html.parser")

    # Skip the header row; the first <td> of each row is the character
    rows = page_soup.findAll("tr")[1:]
    hanzilist = [row.td.text for row in rows]

    # The second <td> of each row is the pinyin
    pinyin = []
    for tr in rows:
        td = tr.findAll('td')[1:2]
        if td:
            pinyin.append(td[0].text)

    # Print each entry as a numbered tuple; zip stops at the actual row
    # count, so the shorter final page doesn't raise an IndexError
    for hanzi, py in zip(hanzilist, pinyin):
        print(str(dkey) + ": ('" + hanzi + "', '" + py + "'),")
        dkey += 1
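Printing the tuples is enough to eyeball the results, but the flash card app will want them persisted somewhere it can load from. Here is a minimal sketch of one way to do that, assuming the scraped (character, pinyin) pairs have been collected into a list and written to a hypothetical hanzi.csv file (the filename and the sample pairs are my own, not from HanziDB):

```python
import csv

# Hypothetical sample of scraped (character, pinyin) pairs
pairs = [('的', 'de'), ('一', 'yī'), ('是', 'shì')]

# Write the pairs to a CSV file the flash card app can load later
with open('hanzi.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['character', 'pinyin'])  # header row
    writer.writerows(pairs)

# Read the file back, skipping the header row
with open('hanzi.csv', newline='', encoding='utf-8') as f:
    loaded = [tuple(row) for row in csv.reader(f)][1:]
print(loaded)
```

A CSV is just one option — the same list could go straight into SQLite or be serialized as JSON, depending on how the app ends up storing its deck.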

In the videos below, I break down what all of this means line by line. Check them out:

Part One
Part Two

Written by TechJD

Law, programming, and everything in-between! Coming up with fun coding projects with real-world application.
