This post is part of an ongoing series focused on designing and deploying a simple flash card app for Android (and later, maybe, iOS) using Python. So, if you’re interested in learning Kivy, a cross-platform Python framework for natural user interface (NUI) development, stay tuned. But first, a quick detour: before I actually started building the app, I had to find a database that could provide all of the Chinese characters I needed, and a quick, reliable way to integrate that database.
Beautiful Soup
If you didn’t know already, Beautiful Soup is a Python package for parsing HTML and XML documents. It creates a parse tree for parsed pages that can be used to extract data from HTML, which is useful for web scraping (Wikipedia). Go ahead and install it:
pip install beautifulsoup4
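To get a feel for how Beautiful Soup turns markup into something you can query, here’s a minimal sketch. The HTML snippet is made up for illustration, but it mirrors the kind of table we’ll scrape later:

```python
from bs4 import BeautifulSoup

# A tiny HTML snippet standing in for a scraped page (hypothetical data)
html = """
<table>
  <tr><th>Character</th><th>Pinyin</th></tr>
  <tr><td>的</td><td>de5</td></tr>
</table>
"""

page_soup = BeautifulSoup(html, "html.parser")
first_row = page_soup.find_all("tr")[1]          # skip the header row
cells = [td.text for td in first_row.find_all("td")]
print(cells)  # → ['的', 'de5']
```

Once a page is parsed, extracting table cells is just a matter of `find_all` and `.text`, which is exactly the pattern the full scraper below relies on.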
Finding a Reliable and Sufficient Database
For those who don’t know, there are a lot of Chinese characters. The total is estimated at over 50,000, even though modern dictionaries will only list the roughly 20,000 that are still in use. It is said that an educated Chinese person will know about 8,000 characters, but you only need about 2,000–3,000 to be able to read a newspaper. http://www.bbc.co.uk/languages/chinese/real_chinese/mini_guides/characters/characters_howmany.shtml
The next question is obvious: if I want to make flash cards, how many should I make? Of course, I would start with the most common characters, but how far down the list should I go?
I went to hanzidb.org. It’s got a bunch of different lists, but the one I chose was “Chinese characters by frequency.”
The list contains 9,933 characters! I actually tried copying them by hand at first. The first page alone took me over an hour, so there was no way I was going to build a new database of my own from scratch. Luckily, HanziDB has very clean and elegant HTML, which made it really simple to scrape.
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

dkey = 0
# Page 1 has no query string; pages 2-100 use ?page=N
my_urls = ['http://hanzidb.org/character-list/by-frequency']
for page in range(2, 101):
    my_urls.append('http://hanzidb.org/character-list/by-frequency?page=' + str(page))

for my_url in my_urls:
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    page_soup = soup(page_html, "html.parser")

    # The first <td> of each row is the character; skip the header row
    containers = page_soup.findAll("tr")
    hanzilist = [container.td.text for container in containers[1:]]

    # The second <td> of each row is the pinyin; the header row has
    # no <td> cells at all, so the `if td` check filters it out
    pinyin = []
    for tr in page_soup.findAll('tr'):
        td = tr.findAll('td')[1:2]
        if td:
            pinyin.append(td[0].text)

    # Print each entry as a Python dict line, e.g.  0: ('的', 'de5'),
    # zip() stops at the shorter list, so the last, partial page is safe
    for hz, py in zip(hanzilist, pinyin):
        print(str(dkey) + ": ('" + hz + "', '" + py + "'),")
        dkey += 1
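Printing dict literals and pasting them back into a source file works, but the same loop can build the dictionary directly. Here’s a minimal sketch of that final step; the `characters` name is my own, and the two hard-coded entries stand in for the lists the scraping loop fills:

```python
# Collect (character, pinyin) pairs into a dict keyed by frequency rank,
# rather than printing dict literals to paste back in by hand.
# hanzilist and pinyin would come from the scraping loop; hard-coded
# here for illustration.
hanzilist = ['的', '一']
pinyin = ['de5', 'yi1']

characters = {}
dkey = 0
for hz, py in zip(hanzilist, pinyin):
    characters[dkey] = (hz, py)
    dkey += 1

print(characters)  # → {0: ('的', 'de5'), 1: ('一', 'yi1')}
```

From there the dict can be dumped with `json` or `pickle` and shipped with the app, instead of living as a giant literal in the source.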
In the video below, I break down line by line what all this means. Check it out: