Overseas Chinese Language Resources Database-Beijing Language and Culture University Language Resource High-Quality Innovation Center

Overseas Chinese Language Resources Database

May 31, 2019

This database is the result of the ACLR research project "Construction of the Overseas Chinese Language Resources Database". The project is led by Guo Xi, chief expert of ACLR and professor at Jinan University. It comprises three sub-databases:

1. Overseas Chinese and Chinese Language Basic Information Dataset

The project has completed the collection of overseas Chinese demographic data, literature on Chinese language research, materials on other countries’ national language policies, Chinese language policy research materials, information on Chinese schools and related educational institutions, and information on overseas Chinese social organizations.

The overseas Chinese demographic dataset collects and organizes population data on overseas Chinese in 183 countries and regions on 5 continents, consisting mainly of synchronic data supplemented by diachronic data.

The Chinese schools and related educational institutions dataset contains information on 420 Chinese schools and 116 Chinese education organizations in Asia, Africa, Europe, the Americas, and Oceania. The 420 Chinese schools are distributed as follows: 162 in Asia, 158 in Europe, 23 in Oceania, 62 in North America, 8 in South America, and 7 in Africa. The 116 education organizations are distributed as follows: 44 in Asia, 35 in Europe, 7 in Oceania, 18 in North America, 8 in South America, and 4 in Africa.

For Chinese language policy research materials, we have collected documents totaling about 94,000 words on Chinese education policy in the late Qing Dynasty, the Republic of China, and the period after the founding of the People’s Republic of China. The Overseas Early Chinese Newspaper Documents Corpus has collected and compiled more than 13,400 important documents from early Chinese newspapers, classified into the categories Mandarin Language Promotion, Chinese Language Education, Overseas Chinese News, Editorials, Graphic Advertising, and Art Works. The portion of these documents directly related to current Chinese language study (1.2 million words) has been transcribed and loaded into Emeditor so that the full text can be searched by keyword.

2. Dataset for Oral History of Chinese Language Inheritance

So far, 268 respondents from 33 countries and regions have been interviewed, yielding 400 hours of audio and video recordings of key figures, 500,000 words of oral-history material, and more than 100 valuable documents and files.

The survey covers 40 major issues in Chinese language inheritance, and the data are multi-modal, including oral narrations, pictures, audio and video recordings, and physical objects. This is the first time the oral history of overseas Chinese language inheritance has been recorded in a comprehensive, systematic, in-depth, and truthful way. The interviewees are representative leaders of Chinese social groups, prominent figures in the Chinese language education sector, frontline Chinese language teachers, and managers of mainstream Chinese media. Most of the interviewees are over 70 years old, many are over 80, and the oldest is 92. At present, the first batch of interview materials has been transcribed, and proofreading is underway.

3. Multi-modal Chinese Language Corpus

This corpus contains 9 sub-corpora, and the current scale is as follows:

(1) A corpus of major overseas Chinese media (websites, newspapers), approximately 700 million words;

(2) Spoken Chinese corpus for Chinese learners, about 4 million words;

(3) A corpus of Chinese language textbooks for elementary schools, about 1 million words;

(4) Spoken language corpus of Chinese learners, about 200,000 words;

(5) Audio and video recordings of overseas spoken Chinese, about 20 GB;

(6) More than 20,000 pictures of overseas Chinese language landscapes;

(7) A database of special words used in Southeast Asian Chinese media (completed);

(8) A corpus of overseas Chinese language literary works, about 5 million words completed so far;

(9) Spoken language corpus of interviews with overseas Chinese, about 600,000 words completed so far (manually proofread).