Rethinking consultation with automatic language processing
The art of synthesis - Part 1
In accompanying institutions through their consultation processes, Open Source Politics must develop expertise in every stage of such a process. In recent months, we have worked hard on the syntheses we are asked to produce in order to format and exploit the various contributions made on the digital platforms we build. The volumes of contributions are particularly large, and it is difficult to exploit all of them properly without losing sight of the coherence of the whole. We have therefore developed specific skills, focused on automatic language processing (ALP) with specialized software, in order to deliver more precise syntheses while taking into account the general structure of the contributions. Open Source Politics was notably able to test these tools during the mission carried out for the National Assembly in October 2017; we will come back to this later.
These two articles are an opportunity for us to retrace our thinking on the subject, explain our interest in automatic language processing and present our results. In this first article, we briefly retrace the history of ALP before focusing on the value of this tool for Open Source Politics. The second article will be devoted to a case study showing, in practice, the contribution of ALP; we will also come back to the use Open Source Politics has made of it in the past, and to the evolution of our thinking up to today.
The ancestor of textometry (the measurement of text) is the analysis of data, based on statistics. It is then relatively easy to trace the first uses of statistics and probabilities for real-world analysis. Historians thus underline the recurrence of statistical observations made, among others, by the scribes of ancient Egypt. For Jean-Paul Benzécri, it was the needs of the administration of the great empires, whether Egyptian, Chinese or Mesopotamian, that prompted the use of statistics.
However, it was not until the 16th and 17th centuries that the mathematisation of the discipline was undertaken, notably through the discoveries of Galileo, Pascal and Bernoulli. After these first advances, the discipline developed steadily, despite an interruption during the 19th century. We now leave the general theory of data analysis (via probabilities and statistics) to concentrate on the analysis of texts, which constitute data in the same way as the Nile height readings of the Egyptian scribes.
Descending, consciously or not, from Wittgenstein's philosophy of language and its concern with identifying the rules of word usage, automatic language processing was born, according to Catherine Fuchs and Benoît Habert, at the crossroads of two concerns from quite distant fields.
In the second half of the 20th century, the academic field was thus interested in the mathematical formalization of language because this allowed it to be described "in the manner of a machine".
At the same time, the necessities of the Cold War fostered the defence sector's interest in machine translation. These two issues attracted funding, and research in automatic language processing developed. Two types of applications can be distinguished. The first focuses on the written word, in particular:
- machine translation,
- automatic text generation (for example, articles were automatically generated by Syllabs for Le Monde during the 2015 departmental elections),
- spelling and grammar checkers,
- search engines,
- e-mail filtering (spam vs. not spam),
- information retrieval,
- conversational agents (chatbots),
- optical character recognition (OCR).
The second type of application focused on oral, video and other multimodal formats, including call management, computer-based teaching, system control by voice, and speech synthesis.
The discipline of automatic language processing has essentially developed in France since the 1970s, in line with the pioneering research of Pierre Guiraud and Charles Muller in lexical statistics. It is during this period that many ways of representing textual data emerged.
Among these, textometry (the measurement of text) belongs to a discipline called textual data analysis (TDA). Lexicometry (the measurement of the lexicon) is also part of this discipline, and logometry joins these two, completing TDA. Textometry first focused on assessing the richness of a text's vocabulary, then specialized in various procedures such as correspondence analysis and classification.
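To make these textometric measures concrete, here is a minimal sketch of two of the most basic observables mentioned above: vocabulary richness (the ratio of distinct word forms to total tokens, a classic first indicator) and word-form frequencies. This is an illustrative toy, not the actual procedure used by any of the tools named in this article.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase a text and split it into word tokens (accented letters kept)."""
    return re.findall(r"[a-zàâçéèêëîïôûù'-]+", text.lower())

def type_token_ratio(text):
    """A simple vocabulary-richness measure: distinct forms / total tokens."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def frequencies(text, n=5):
    """The n most frequent word forms, a first textometric observable."""
    return Counter(tokenize(text)).most_common(n)

sample = "The citizens propose to open the data; the data belong to the citizens."
print(round(type_token_ratio(sample), 3))
print(frequencies(sample, 1))
```

In real textometric work these raw counts are only a starting point: they feed procedures such as correspondence analysis, which the specialized software performs on the whole corpus.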
As for logometry (logos = speech; metron = measurement), this discipline has been developing in the 21st century within the framework of the digital humanities. It is a natural extension of lexicometry (measurement of the lexicon) and textometry (measurement of the text). However, its object is discourse, or logos (political, literary, media or scientific discourse, for example), in its linguistic and social dimensions. It is a computer-assisted method for analysing and interpreting discourse, used in the human and social sciences, thus combining qualitative and quantitative readings of digital corpora. It also articulates global reading (the whole discourse) and local reading (the units of discourse) to build an interpretation.
Let us recall here two definitions of the concept of "text": first, "a text is an oral or written series of words perceived as constituting a coherent whole, conveying meaning and using the structures specific to a language (conjugations, construction and association of sentences...)". Then, "a text can represent an interview, an article, a book or any other type of document. A corpus may contain one or more texts (but at least one)". From these two complementary definitions, we can clarify the link between the notion of text and that of discourse in the field of logometry. Indeed, if the concept of discourse is understood as a type of text of a personal nature according to Emile Benveniste, the concept of text is understood as an oral or written series of words that are coherent with each other. The latter is therefore to be understood in its generic form.
To summarize the notions to which the approach to processing textual data responds, here are the various elements that make it up:
- Proposals are written series of words.
- A textual corpus gathers one or more texts (of the "discourse" type) corresponding to the proposals of the consultation. The corpus is the unit, established and assembled manually, on which we work and which is then processed with the IRaMuTeQ software.
- "The text" is a hyperonym; it includes several more specific words: speech, interview, article, book, or others.
- A consultation brings together several types of discourse: "argumentative", "explanatory", "descriptive" for example.
- Discourse systematically engages its speaker, and is therefore considered "personal".
Logometry, which applies to discourse, is therefore naturally suited to the data sets produced by the various consultations carried out on the platforms deployed by Open Source Politics.
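Since the corpus described above is assembled by hand before being loaded into IRaMuTeQ, a short sketch may help picture what that constitution step produces. IRaMuTeQ expects each text to be introduced by a line of four asterisks carrying `*variable_modality` metadata tags; the field names below (`theme`, `id`) are purely illustrative, not taken from a real consultation.

```python
def to_iramuteq_corpus(proposals):
    """Assemble consultation proposals into one IRaMuTeQ-style corpus string.

    Each text is preceded by a '****' line carrying its *variable_modality
    tags, so the software can later cross results with these metadata.
    """
    lines = []
    for p in proposals:
        tags = " ".join(f"*{key}_{value}" for key, value in p["meta"].items())
        lines.append(f"**** {tags}")
        lines.append(p["text"].strip())
    return "\n".join(lines) + "\n"

proposals = [
    {"meta": {"theme": "transport", "id": "001"}, "text": "Develop cycle lanes."},
    {"meta": {"theme": "energy", "id": "002"}, "text": "Insulate public buildings."},
]
print(to_iramuteq_corpus(proposals))
```

Keeping the metadata on the starred lines is what later lets the analyst contrast, for instance, the vocabulary of one consultation theme against another.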
Software at the service of the analyst
The results of the analysis carried out with IRaMuTeQ, a free software package developed by Pierre Ratinaud at the Laboratoire d'Etudes et de Recherches Appliquées en Sciences Sociales (LERASS), open the way to different interpretations. Textual statistics allow the analyst to rely on quantitative criteria rather than on subjective interpretation alone. The software takes into account all the dimensions of the corpus, allowing the analysis to be both exhaustive and specific. This approach lets us account for both individual and collective contributions.
The challenge is to reveal the articulation of the proposals, to reveal how the proposals interact with each other. This articulation manifests itself through a spatial representation of the contributions, through graphs that make it easier to interpret the results of the consultation. You will find below examples of graphic visualization of the data integrated into our synthesis work carried out for the National Assembly.
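The graphs mentioned above are typically built from word co-occurrences: two words that often appear in the same contribution are linked by a weighted edge. As a rough, hedged illustration of that principle (not the algorithm IRaMuTeQ actually implements), here is a minimal co-occurrence count over a handful of invented proposals:

```python
from collections import Counter
from itertools import combinations

def cooccurrences(proposals, stopwords=frozenset({"the", "a", "of", "to", "and"})):
    """Count pairs of words appearing in the same proposal.

    The resulting counts can serve as edge weights in a co-occurrence
    graph of the kind used to visualise a consultation's contributions.
    """
    pairs = Counter()
    for text in proposals:
        words = sorted({w for w in text.lower().split()
                        if w.isalpha() and w not in stopwords})
        pairs.update(combinations(words, 2))
    return pairs

props = [
    "open the public data",
    "public data for transparency",
    "open data and transparency",
]
print(cooccurrences(props).most_common(3))
```

Plotting such pairs as a graph (with a library of one's choice) gives the kind of spatial representation that makes the articulation between proposals visible at a glance.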
The results produced are not only more readable and understandable, they also correspond to a point of view that we would not have been able to adopt without the tool.
Moreover, once the operation of the software is explained, we can also guarantee that its use is not a simple mathematical exploration disconnected from reality. Indeed, it is not an autonomous process: it must take into account the context of the consultation and calls for the analyst's attention. A synthesis enriched by this software cannot do without external action, since the software does not work without the involvement of the analyst, who must configure it according to their needs and their initial assumptions.
Since the processing does not take context into account by itself, the analyst must systematically reintroduce it. Nor can the tool be isolated from a prior problematization: the use of IRaMuTeQ cannot be envisaged by and for itself, detached from any upstream reflection. The outputs produced, as in the accompanying examples, are then subject to human interpretation with respect to the starting hypothesis.
Open Source Politics therefore combines a lucid interpretation of the results with an understanding of the underlying algorithms. In other words, the transparency of the IRaMuTeQ algorithms (favoured by the various manuals available online as well as free access to the code) guarantees Open Source Politics' autonomy in interpreting the results and in vouching for their reliability.