Multilingual Intelligent Information Processing

André Salem （安德烈·萨莱姆）

2024-10-07

Paris 3 Sorbonne Nouvelle University, France

Talk title

The textometric study of chronological text corpora

Abstract

The computer revolution at the end of the last century initially facilitated the automation of pre-existing research methods in text studies (such as identifying particular forms, creating indexes, concordances, etc.). Over time, textometric methods (a body of statistical techniques applicable to large text corpora) have progressively spread across various scientific communities (see, for example, Lebart and Salem 1994).

The textometric analysis of chronological text corpora, that is, collections of texts from the same institutional source (political parties, trade unions, international organizations, etc.), highlights how these organizations' textual production evolves over time (see Salem 2021; Miao and Salem 2021). This approach identifies lexical changes between two consecutive periods in a corpus and distinguishes isolated shifts from evolutions that span several periods. It also helps to sketch out “text grammars” applicable to subsets of documents. Taking into account both the changes that occur in homologous contexts and the forms of writing that remain unchanged leads to more precise observations.
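As a rough illustration of comparing consecutive periods, the Python sketch below computes, for each pair of adjacent periods, how much the relative frequency of every form changes. It is only a toy under stated assumptions: the function names, the whitespace tokenization and the three-period sample are hypothetical, and a plain frequency difference stands in for the probabilistic measures that textometric tools typically rely on.

```python
from collections import Counter

def relative_frequencies(tokens):
    """Share of each form among all tokens of one period."""
    total = len(tokens)
    if total == 0:
        return {}
    return {form: n / total for form, n in Counter(tokens).items()}

def consecutive_changes(periods):
    """periods: chronologically ordered list of (label, tokens).

    Returns one entry per pair of adjacent periods, mapping each form
    to the change in its relative frequency (later minus earlier).
    """
    changes = []
    for (label_a, toks_a), (label_b, toks_b) in zip(periods, periods[1:]):
        freq_a = relative_frequencies(toks_a)
        freq_b = relative_frequencies(toks_b)
        forms = set(freq_a) | set(freq_b)
        diff = {f: freq_b.get(f, 0.0) - freq_a.get(f, 0.0) for f in forms}
        changes.append(((label_a, label_b), diff))
    return changes

# Hypothetical toy corpus: one short text per period.
toy_periods = [
    ("1990", "the union calls for a strike".split()),
    ("1991", "the union calls for a dialogue".split()),
    ("1992", "the union opens a dialogue".split()),
]

toy_changes = consecutive_changes(toy_periods)
for (earlier, later), diff in toy_changes:
    # Show the three forms whose relative frequency moved the most.
    top = sorted(diff.items(), key=lambda kv: abs(kv[1]), reverse=True)[:3]
    print(earlier, "->", later, top)
```

A form that moves in only one pair of periods would count as an isolated shift here, while a form that keeps moving in the same direction across several pairs would suggest a longer evolution.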

Textometric methods apply, without significant modification, to texts written in a wide variety of languages. Beyond simple word-frequency counts, they extend to more complex units, such as repeated segments: sequences of identical forms that recur at several places in a corpus. Integrating these methods into software that makes them easy to apply and combine now supports their wide dissemination for the study of texts in any language.
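The notion of repeated sequences of forms can be sketched in a few lines of Python. This is a minimal illustration, not the procedure implemented in Lexico: the function name, the length bounds and the sample sentence are assumptions, and the input is taken to be already tokenized (for a language written without spaces, a segmentation step would have to come first).

```python
from collections import Counter

def repeated_segments(tokens, min_len=2, max_len=4, min_count=2):
    """Count contiguous sequences of forms that occur at least min_count times."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        # Slide a window of n forms over the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {seg: c for seg, c in counts.items() if c >= min_count}

# Hypothetical sample text, split on whitespace.
sample_tokens = "freedom of the press and freedom of the people".split()
sample_segments = repeated_segments(sample_tokens)

for segment, count in sorted(sample_segments.items()):
    print(" ".join(segment), count)
```

On this sample, the segments found are "freedom of", "of the" and "freedom of the", each occurring twice. Because the function only compares tokens for equality, the same code applies to any language once the text has been split into forms.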

About the speaker

André Salem is an emeritus professor and doctoral advisor at Paris 3 Sorbonne Nouvelle University, and a prominent figure in French discourse lexicometry research. He holds two PhDs, one in French linguistics and the other in mathematical statistics. From 1972, he worked at the French National Center for Scientific Research (CNRS) on automated text research, later becoming an engineer at the École Normale Supérieure in Saint-Cloud and the head of its political text lexicometry group. In 1994, he joined Paris 3 Sorbonne Nouvelle University, where he founded and directed the Center for Automated Text Analysis, Discourse Studies, and Linguistics and taught mainly computer science, language science, text processing, and natural language processing. Salem developed the textometric software Lexico (http://lexi-co.com/, currently Version 5) and continues to extend it with new features for discourse analysis.

Professor André Salem has authored or co-authored five major books: The Practice of Repeated Segments (Pratique des segments répétés) (1987), Statistical Analysis of Textual Data (Analyse statistique des données textuelles) (1988), Textual Statistics (Statistique textuelle) (1994) and its expanded English version Exploring Textual Data (1997), as well as Corpus Linguistics (Les linguistiques de corpus) (1997). He has supervised 19 doctoral dissertations and published numerous academic papers. He serves as a reviewer for French academic journals such as Mots and the online journal Lexicométrica. He has also been a key organizer of the International Conference on the Statistical Analysis of Textual Data (JADT, http://www.jadt.org) for over 30 years and is highly regarded within the field.

References

Lebart Ludovic, Salem André, Statistique textuelle, Dunod, Paris, 1994.

Salem André, « Le temps lexical », Histoire et Mesure, volume XXXVI n°2 : 21-56, Paris, 2021.

Miao Jun, Salem André, « Des textes en mouvement », Histoire et Mesure, volume XXXVI n°2 : 91-124, Paris, 2021.

Lexico5, Méthodes textométriques pour l’analyse des corpus de textes, http://lexi-co.com/
