Rethinking consultation with automatic language processing

The Art of Synthesis - Part I

Because it accompanies institutions throughout their consultation processes, Open Source Politics must develop expertise in every stage of such a process. In recent months, we have worked hard on the syntheses we are asked to produce in order to format and exploit the contributions gathered on the digital platforms we build. The volumes of contributions are particularly large, and it is difficult to exploit all of them properly without losing sight of the coherence of the whole. We have therefore developed specific skills, focused on automatic language processing (ALP) and supported by specialized software, in order to produce more precise syntheses while taking into account the general structure of the contributions. Open Source Politics was notably able to test these tools and software during the mission carried out for the National Assembly in October 2017; we will come back to this later.

These two articles are an opportunity for us to retrace our thinking on the subject, explain our interest in automatic language processing and present our results. In this first article, we quickly retrace the history of ALP before focusing on the interest of this tool for Open Source Politics. The second article is devoted to a case study that shows, in practice, what ALP contributes; we also come back to the use Open Source Politics has made of it in the past and to the evolution of our thinking up to today.

Beginnings

The ancestor of textometry (the measurement of text) is data analysis, which is based on statistics. It is relatively easy to trace the first uses of statistics and probability in the analysis of the real world. Historians underline the recurrence of statistical observations made, among others, by the scribes of ancient Egypt. For Jean-Paul Benzécri, it was the needs of the administration of the great empires, whether Egyptian, Chinese or Mesopotamian, that prompted the use of statistics.

However, it was not until the 16th and 17th centuries that the mathematisation of the discipline was undertaken, notably through the discoveries of Galileo, Pascal and Bernoulli. After these first advances, the discipline developed steadily, despite an interruption during the 19th century. We now leave the general theory of data analysis (via probability and statistics) to concentrate on the analysis of texts, which constitute data in the same way as the Nile height readings of the Egyptian scribes.

Origins

A descendant (conscious or not) of Wittgenstein's philosophy of language and its concern with identifying the rules of word usage, automatic language processing was born, according to Catherine Fuchs and Benoît Habert, at the crossroads of two concerns coming from quite distant fields.

In the second half of the 20th century, academia took an interest in the mathematical formalization of language, because it allowed language to be described "in the manner of a machine".

At the same time, the necessities of the Cold War fostered the defence sector's interest in machine translation. These two issues attracted funding, and research in the field of automatic language processing developed. Two types of applications can be distinguished. The first focuses on the written word, in particular:

  • machine translation,
  • automatic text generation (for example, articles were automatically generated by Syllabs for Le Monde during the 2015 departmental elections),
  • spelling and grammar checkers,
  • search engines,
  • e-mail filtering (spam / not spam),
  • classification,
  • information retrieval,
  • conversational agents (chatbots),
  • optical character recognition (OCR).

The second type of application focuses on oral, video and other multimodal formats, including call management, computer-based teaching, voice control of systems, and speech synthesis.

Logometry

The discipline of automatic language processing has developed in France essentially since the 1970s, in line with the pioneering research of Pierre Guiraud and Charles Muller in lexical statistics. It was during this period that many ways of representing textual data emerged.

Among these, textometry (measurement of the text) is part of a discipline called textual data analysis (TDA). Lexicometry (measurement of the lexicon) is also part of this discipline, and logometry complements the two. Textometry first focused on assessing the richness of a text's vocabulary and then specialized in various procedures such as correspondence analysis and classification.
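
As a purely illustrative aside, here is a minimal sketch, in Python, of the kind of vocabulary-richness measure early textometry relied on: the type/token ratio. The sample sentence and the tokenisation rule are assumptions made for the example, not part of any specific textometric toolkit.

```python
# Minimal sketch: type/token ratio as a naive measure of vocabulary richness.
# The sample text and the tokenisation rule are illustrative assumptions.
import re

def type_token_ratio(text: str) -> float:
    """Ratio of distinct word forms (types) to total word occurrences (tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

sample = "le débat nourrit la participation et la participation nourrit le débat"
print(round(type_token_ratio(sample), 2))  # higher values indicate a richer vocabulary
```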

As for logometry (logos = speech; metron = measure), this discipline has developed in the 21st century within the framework of the digital humanities. It is a natural extension of lexicometry (measurement of the lexicon) and textometry (measurement of the text). However, its object is discourse, or logos (that is, political, literary, media or scientific discourse), in its linguistic and social dimensions. It is a computer-assisted method for analysing and interpreting discourse used in the human and social sciences, combining qualitative and quantitative readings of digital corpora. It also articulates a global reading (the whole discourse) and a local reading (the units of discourse) to build an interpretation.

Let us recall here two definitions of the concept of "text": first, "a text is an oral or written series of words perceived as constituting a coherent whole, conveying meaning and using the structures specific to a language (conjugations, construction and association of sentences...)"; then, "a text can represent an interview, an article, a book or any other type of document. A corpus may contain one or more texts (but at least one)". From these two complementary definitions, we can clarify the link between the notion of text and that of discourse in the field of logometry. Indeed, if the concept of discourse is understood, following Émile Benveniste, as a type of text of a personal nature, the concept of text covers any oral or written series of words that cohere with one another. The latter is therefore to be understood in its generic sense.

To summarize the notions involved in this approach to processing textual data, here are the elements that make it up:

  1. Proposals are written series of words.
  2. In a textual corpus, one or more texts (of the "discourse" type) are gathered, corresponding to the proposals of the consultation. This is the unit, established and constituted manually, on which we work and which is used for processing with the IRaMuTeQ software (a formatting sketch follows this list).
  3. "The text" is a hyperonym; it includes several more specific words: speech, interview, article, book, or others.
  4. A consultation brings together several types of discourse: "argumentative", "explanatory", "descriptive" for example.
  5. Discourse systematically engages the speaker, and is therefore considered "personal".
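
As announced in point 2 above, here is a minimal sketch of how such a corpus can be assembled. The CSV file, its column names and the metadata variables are hypothetical; only the convention of introducing each text with a line of four asterisks followed by starred variables comes from the IRaMuTeQ documentation.

```python
# Minimal sketch: build an IRaMuTeQ-style corpus file from a CSV of proposals.
# "contributions.csv", its columns and the *variable names are hypothetical;
# only the "**** *variable_modality" header convention comes from IRaMuTeQ.
import csv

with open("contributions.csv", newline="", encoding="utf-8") as src, \
     open("corpus_iramuteq.txt", "w", encoding="utf-8") as out:
    for i, row in enumerate(csv.DictReader(src), start=1):
        # One starred metadata line per text, then the raw proposal itself.
        out.write(f"**** *prop_{i:04d} *theme_{row['theme']}\n")
        out.write(row["proposal"].strip() + "\n\n")
```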

The logometry that applies to discourse is therefore naturally suited to the data sets of the various consultations carried out on the platforms deployed by Open Source Politics.

Software at the service of the analyst

The results of the analysis carried out with IRaMuTeQ, free software developed by Pierre Ratinaud at the Laboratoire d'Études et de Recherches Appliquées en Sciences Sociales (LERASS), open the way to different interpretations. Textual statistics allow the analyst to rely on quantitative criteria rather than on subjective impressions alone. The software takes into account all the dimensions of the corpus, allowing both exhaustiveness and specificity in the analysis. This approach allows us to account for both individual and collective contributions.

The challenge is to reveal how the proposals are articulated, how they interact with each other. This articulation manifests itself through a spatial representation of the contributions, through graphs that make it easier to interpret the results of the consultation. Below are examples of the graphic visualizations of the data integrated into the synthesis work we carried out for the National Assembly.

The results produced are not only more readable and understandable, they also correspond to a point of view that we would not have been able to adopt without the tool.
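
To give a concrete, if simplified, idea of what such a spatial view can look like, here is a rough sketch of a word co-occurrence graph built from a few invented contributions. It is not IRaMuTeQ's own rendering; scikit-learn and networkx are used here only as convenient stand-ins.

```python
# Rough sketch: a word co-occurrence graph as one possible "spatial" view of
# contributions. Not IRaMuTeQ's algorithm; the sentences are invented.
import networkx as nx
from sklearn.feature_extraction.text import CountVectorizer

contributions = [
    "ouvrir les débats parlementaires aux citoyens",
    "les citoyens veulent des débats transparents",
    "transparence des votes et des débats",
]

vec = CountVectorizer(binary=True)            # 1 if the word appears in the text
X = vec.fit_transform(contributions)          # documents x terms
cooc = (X.T @ X).toarray()                    # terms x terms co-occurrence counts
terms = vec.get_feature_names_out()

G = nx.Graph()
for i in range(len(terms)):
    for j in range(i + 1, len(terms)):
        if cooc[i, j] > 0:
            # Edge weight = number of contributions in which both words appear.
            G.add_edge(terms[i], terms[j], weight=int(cooc[i, j]))

print(sorted(G.edges(data="weight"), key=lambda e: -e[2])[:5])
```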

Moreover, once the operation of the software has been explained, we can guarantee that its use is not a simple mathematical exploration disconnected from reality. Far from being an autonomous dynamic, it takes into account the context of the consultation and calls for the analyst's attention. A synthesis enriched by this software cannot do without external action: the software does not work without the involvement of the analyst, who has to parameterize it according to their needs and their starting postulate.

If the processing does not take context into account in the first place, the analyst must systematically reintroduce it. Likewise, we cannot isolate the tool from a prior problematization: the use of IRaMuTeQ cannot be envisaged by and for itself, detached from any upstream reflection. The outputs produced, visible in the examples shown here, are subject to human interpretation with respect to the starting hypothesis.

Conclusion

Open Source Politics therefore combines a lucid interpretation of the results with an understanding of the underlying algorithms. In other words, the transparency of the IRaMuTeQ algorithms (favoured by the various manuals available online as well as free access to the code) guarantees the autonomy of Open Source Politics in interpreting the results and vouching for their reliability.

Methodological innovations used by OSP for discourse analysis

The Art of Synthesis - Part II

Automatic language processing (ALP) is a field at the crossroads of three disciplines: linguistic analysis, computer science and artificial intelligence. This field is already being put to work at Open Source Politics. In this second article, dedicated to the vision of synthesis we have adopted, we detail the reasons for our choice of software, explain more precisely how it works, develop a small case study and, finally, come back to our mission with the National Assembly in order to further clarify the interest of this type of tool for our activity.

A software choice reflecting a strategic orientation

The approach we take through logometry is tied to ALP. In the context of writing our syntheses, this statistical procedure for analysing textual data is performed with IRaMuTeQ, free software developed by Pierre Ratinaud at the Laboratoire d'Études et de Recherches Appliquées en Sciences Sociales (LERASS).

At a time when text-mining tools are multiplying and specializing in ever more specific tasks, few offer the possibility of embracing a wide variety of treatments. Many tools are paid for and do not always provide access to a satisfactory set of procedures, which is why Open Source Politics uses the open-source software IRaMuTeQ. It makes it possible to perform many logometric procedures on very large corpora. The advantages are numerous and benefit the analyst but also, and above all, the citizen: such a tool allows citizens to better visualize the data presented to them and thus to better appropriate the themes and proposals present within a consultation.

It should also be remembered that text-statistics methods make it possible, more generally, to process texts as they were written or collected, without intervening to modify them. No subjective intervention interferes during the procedure, thus guaranteeing the lexical richness of the corpus. We deal with verbatim contributions (proposals) in their raw form, which we then try to grasp and analyse through the meaning of the words and the forms of the sentences that structure them. Moreover, the discipline embodied in the software makes it possible to approach a corpus from an "objective" angle. Thus, according to Bénédicte Garnier and France Guérin-Pace, "textual statistics allows us to objectify and synthesize this qualitative information to bring out a common and diverse representation at the same time".

Objectivity comes from the calculations produced by the software, which rigorously executes the corpus processing, always in the same way, through the different procedures. However, the results produced are not sufficient on their own and require interpretation by the analyst. We therefore speak of objective processing through the software's algorithms; the final analysis integrates this processing but is intended to stay as close as possible to the context.

Case study: analysis with IRaMuTeQ

The debate on national identity initiated by the French government during the 2007-2012 term of office was statistically processed by the researchers who developed the IRaMuTeQ software. The objective was to understand and report on the depth of the debate, in contrast to the various media reports. For Pascal Marchand and Pierre Ratinaud, "the analysis by IRaMuTeQ allows us to account for the content of all contributions, without drawing randomly from the mass, nor involving our own prejudices. It's just a matter of automatically recognizing and sorting the entire vocabulary used by Internet users to obtain speech classes".

They analysed the 18,240 contributions published on the website of the Ministry of Immigration, Integration, National Identity and Solidarity Development.

Their processing included several procedures that helped to put the proposals into perspective and achieve significant results. Five themes were thus isolated and linked to the individual contributions, each offering a possible interpretation based on the calculations made by the IRaMuTeQ software.
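
For readers who want a feel for what "speech classes" means in practice, here is a deliberately simplified sketch. IRaMuTeQ itself relies on a descending hierarchical classification (the Reinert method); the TF-IDF vectors, the k-means stand-in and the toy sentences below are assumptions used only to illustrate the idea of grouping contributions by shared vocabulary.

```python
# Generic illustration of grouping contributions into vocabulary-based classes.
# IRaMuTeQ uses a descending hierarchical classification (Reinert method);
# this k-means stand-in and the toy texts are assumptions for illustration only.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

contributions = [
    "la langue française fonde notre identité",
    "défendre la langue et la culture",
    "l'immigration doit être mieux accompagnée",
    "accueillir et intégrer les personnes immigrées",
]

X = TfidfVectorizer().fit_transform(contributions)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for cls, text in zip(labels, contributions):
    print(cls, text)   # each contribution is assigned to one "speech class"
```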

This first step of analysis, as close as possible to the verbatim, constitutes a basic level of analysis of the corpus structure. It allows analysts to make a first assessment of what they have understood thanks to the software and should help them refine their exploitation of the contributions. As you will see from the extract reproduced below, the final synthesis will not be based on this first analysis alone but will show a high degree of abstraction compared with the original corpus and the first analyses.

Here we see the main interest of the ALP tool in the context of writing a synthesis: to provide analysis tools and frameworks from which the analyst can then deploy an interpretation of all the contributions, while being certain to take the entire corpus into account. The tool is a necessary but not sufficient part of the reasoning leading to the construction of a synthesis.

Starting from the groups of opinions that the software made it possible to formalize, the researchers were thus able to express polarities that they would not have noticed by browsing the website manually. It should also be noted that the Ministry's site was not subject to an open data policy: as soon as it was closed, all the data was lost, which is an excellent illustration of the need for open data access.

Thus, through the use of IRaMuTeQ, the researchers not only extracted the themes addressed but also explained the sometimes contradictory emotions that run through the corpus.

The interest of the approach

We had the opportunity to develop this new methodology during the mission with the National Assembly in October 2017. At the time, the institution had launched a consultation aimed at opening up a space for citizen expression on the theme of rebuilding the Assembly and the potential openings for citizen participation in parliamentary work. We therefore had to produce, in a relatively short time, a synthesis that best reflected the content submitted by citizens on the DemocracyOS platform deployed by Open Source Politics for the occasion.

We chose to base the synthesis on a hybridization of two methods, thus isolating the verbatim that seemed most relevant in each category. This selection process was made possible by the daily activity of the Open Source Politics team on the platform throughout the consultation.

This work gave us important intrinsic knowledge of the contributions. To this first process we added automatic language processing. We were therefore able to provide the National Assembly with visualization graphs of the contributions; this enabled us to gain a distance from the consultation process that seems imperative to us if we wish to produce a synthesis that is representative of the exchanges, objective, and unbiased by our daily involvement in the consultation. This experience thus marked the first use by Open Source Politics of ALP in the drafting of a synthesis, which was all the more fleshed out and nourished as a result.

More generally, following an online consultation, we build a synthesis that is representative of the exchanges that took place during the debates. In this synthesis, we highlight the most discriminating verbatim, which are also the most explicit in meaning with regard to the issues and themes raised during the debate. The use of a logometric procedure is not indispensable, but it gives more ways of reading the data set. Automatic language processing thus makes it possible to enrich the synthesis with procedures that cannot be reproduced by humans and increases the capacity to process large volumes of data.
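
One possible (and purely hypothetical) way to surface such discriminating verbatim, sketched below, is to cluster the contributions and keep, for each class, the contribution closest to the class centroid. This is an illustration under assumed data, not the procedure actually used by Open Source Politics.

```python
# One possible way to surface "discriminating" verbatim: keep, for each class,
# the contribution closest to the class centroid. A sketch, not OSP's procedure.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

contributions = [
    "ouvrir les séances de commission au public",
    "retransmettre toutes les commissions en vidéo",
    "créer un droit de pétition citoyenne",
    "les pétitions citoyennes doivent être débattues",
]

X = TfidfVectorizer().fit_transform(contributions)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

for cls in range(km.n_clusters):
    idx = np.where(km.labels_ == cls)[0]
    # Distance of each member to its centroid; the smallest is the most typical.
    dists = np.linalg.norm(X[idx].toarray() - km.cluster_centers_[cls], axis=1)
    print(f"class {cls}:", contributions[idx[np.argmin(dists)]])
```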

In short, here is a non-exhaustive list of elements that enhance the synthesis:

  • A unique perspective on the dataset resulting from the consultation,
  • A representation of the most revealing words,
  • A graphical visualization of the data, presented in an intelligible way,
  • An objective but humble approach: the results are proposed paths; they remain open to interpretation by citizens and reusable by anyone who wishes.

Conclusion

For the most ambitious missions, Open Source Politics follows the consultation process of its clients from the definition of the organization's expectations to the writing of the summary and the announcement of the results.

We are therefore involved daily in the follow-up of the contributions, which leaves us little distance from them. For the purpose of writing a synthesis, automatic language processing allows us to set aside our prejudices while taking the totality of the contributions into account, something that would be impossible without it.

At the end of this process, we have acquired a double competence with regard to the corpus: direct involvement on the one hand, and the distancing necessary to elaborate a balanced synthesis on the other. This synthesis can then best serve its primary objective, namely to make the citizens' contribution clear enough to facilitate the co-construction of public policies.
