Regional Lexical Variation in Modern Written Chinese: Analysis and Characterization Using Geo-Tagged Social Media Data

2018, Master of Arts, Ohio State University, East Asian Languages and Literatures.
The current study surveys social media data to identify regional lexical variants in modern written Chinese and to characterize their geographical distribution. Geo-tagged linguistic data were obtained from a corpus of 5.1 million messages posted on the Chinese micro-blogging website Weibo. Lexical items taken from the book "Lexicon of Chinese Dialects" were searched in the corpus to generate word counts by location. A portion of the regional lexical variants listed in this book appeared in the written corpus, and closer examination of these variants revealed different patterns in their geographical distributions. This study also investigated the regional specificity of these lexical variants by calculating their cumulative frequencies across space, which led to conclusions about their usage that differ from survey results reported in previous literature. To determine whether there are regional sub-varieties of written Chinese characterized by lexical variation, a machine learning algorithm (k-means) was applied to the word frequency data gathered from the corpus to cluster locations by their use of the lexical items that most clearly signal regional differences. The cluster analysis suggests the existence of three clusters, reflecting the north-south contrast in modern written Chinese that is associated with the linguistic history of China, as well as the strong influence of Cantonese in areas around Guangdong province. Through these analyses, this study provides some insights into the lexical norms of written Chinese in contemporary China. It also contributes to the development of methods for the computational processing of Chinese texts.
Marjorie Chan (Advisor)
Marie-Catherine de Marneffe (Committee Member)
98 p.
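
The clustering step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how the features were built or which implementation was used, so everything here is an assumption for illustration: the per-location relative frequencies, the Euclidean distance, the fixed starting centroids, and the helper names `relative_frequencies` and `kmeans`. The location labels and counts are invented toy values, not data from the thesis.

```python
def relative_frequencies(counts):
    """Turn raw word counts for one location into proportions (assumed feature)."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, initial_centroids, max_iter=100):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = [list(c) for c in initial_centroids]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:  # an empty cluster keeps its previous centroid
                centroids[k] = [sum(dim) / len(members) for dim in zip(*members)]
    return labels, centroids


# Invented toy counts of three hypothetical variant words per location.
counts = {
    "loc_a": [90, 5, 5],
    "loc_b": [80, 10, 10],
    "loc_c": [5, 90, 5],
    "loc_d": [10, 85, 5],
    "loc_e": [5, 5, 90],
    "loc_f": [10, 10, 80],
}
names = list(counts)
points = [relative_frequencies(counts[n]) for n in names]

# Fixed starting centroids keep the toy run deterministic.
labels, centroids = kmeans(points, [points[0], points[2], points[4]])
clusters = dict(zip(names, labels))
print(clusters)
```

Normalizing counts to proportions is one way to keep locations with many messages from dominating the distance computation; whether the thesis did this is not stated in the abstract.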

Recommended Citations

  • Shen, J. (2018). Regional Lexical Variation in Modern Written Chinese: Analysis and Characterization Using Geo-Tagged Social Media Data [Master's thesis, Ohio State University]. OhioLINK Electronic Theses and Dissertations Center. http://rave.ohiolink.edu/etdc/view?acc_num=osu1531845935585073

    APA Style (7th edition)

  • Shen, Jingdi. Regional Lexical Variation in Modern Written Chinese: Analysis and Characterization Using Geo-Tagged Social Media Data. 2018. Ohio State University, Master's thesis. OhioLINK Electronic Theses and Dissertations Center, http://rave.ohiolink.edu/etdc/view?acc_num=osu1531845935585073.

    MLA Style (8th edition)

  • Shen, Jingdi. "Regional Lexical Variation in Modern Written Chinese: Analysis and Characterization Using Geo-Tagged Social Media Data." Master's thesis, Ohio State University, 2018. http://rave.ohiolink.edu/etdc/view?acc_num=osu1531845935585073

    Chicago Manual of Style (17th edition)