Build a sentiment analysis of a Web3 community in 3 steps: a Dogamí case study

Ava Kouhana
-
September 5, 2022

The aim of this article is to explain, step by step, how to analyze both the quantity and the quality of tweets related to a keyword over a given sample.

Below is an example of what kind of analyses you could do with the code.

For who

This article is intended for anyone with Python knowledge who is a beginner in AI or NLP. If you want to become autonomous in analyzing Twitter communities with NLP while understanding the scientific approach behind it, this article is for you.

Do not worry: the core parts of the code are at the end, and every single line of code can be readily reused!

The approach

We will proceed as we would in a research paper: challenging intuitions and common knowledge about web3, formulating hypotheses, validating or rejecting them, and changing the approach whenever it proves unsuitable.

This is how we will proceed
  • Scrape data from Twitter for Dogamí, Genopets, PetaverseNetwork and Aavegotchi
  • Choose an efficient sentiment analysis and improve it to better understand the web3 ecosystem and its wording
  • Clean the datasets
  • Analyze and produce graphs over the figures found

Prerequisites:
  • Python knowledge of the libraries Pandas, Matplotlib and NumPy
  • NLP fundamentals (NLTK and basic knowledge of vector-based models and stopwords)
  • A general idea of what scraping is

Why Dogamí ?

2022 undeniably marks the rise of play-to-earn games. Dogamí is built on the Tezos blockchain and is one of the only P2E projects to have earned the trust of actors such as Sandbox, Tezos and Ubisoft, among many others. Moreover, Dogamí has more than 100k followers on Twitter and 100k active users on Discord, which makes it a case in point of a community-driven web3 company.

And in a web3 ecosystem where the community is key, is Dogamí’s community as involved and dedicated as it seems?

I • Scrape the data

First, let us scrape the tweets that mention the words “Dogamí”, “Genopets”, “PetaverseNetwork” and “Aavegotchi” between April 1st and March 31st.

To do so, you do not need API keys, but you will need to either master Python and snscrape or reuse my code in Annex A.

For this analysis, more than 50,000 tweets in total were scraped with snscrape in Python.

Once they were scraped, we got the raw data in a CSV file that looked like this.

*Preview of the CSV file scraped for Dogamí

To read the CSV file and transform it into a dataframe, you can reuse my code in Annex B.

II • Choose an efficient sentiment analysis

For this specific analysis, it is better to build on an existing library (such as NLTK, a Python NLP library) rather than code everything from scratch, because its stopwords and other built-in features are enough to get reliable results.

A VADER sentiment analysis imported from NLTK can be enough for most cases of sentiment analysis (Annex C). However, for the specific web3 jargon, it is not accurate enough.

So in this case, it is more appropriate to switch to a SentimentIntensityAnalyzer improved for web3 vocabulary, which gives 4 indicators per tweet:

● the positivity (from 0 to 1)

● the negativity (from 0 to 1)

● the neutrality (from 0 to 1)

● the compound (computed by summing the valence scores of each word, weighted by the positive, negative and neutral scores, and normalized to lie between -1 (most extreme negative) and +1 (most extreme positive)).

The next step is to store everything in a dataframe (Annex D).

Now, testing the sentiment analysis is at the heart of our work.

The aim is threefold:

A- to have an idea of the quality of the sentiment analysis

B- to identify the criteria of positive and negative tweets

C- to identify giveaways, BOT-generated tweets and neutral tweets in order to remove them from the datasets

Let us try

●1st test:

*a compound which is equal to 0 seems to be the sign of a neutral comment

●2nd test:

*a high positivity (more than 50%) seems to be the sign of a very positive tweet

●3rd test:

*the sentiment analysis recognizes the smiley: the ‘pos’ indicator increased and the neutrality decreased

From then on, we can get a sense of:

  • the quality of the sentiment analysis
  • how uninformative a tweet with a compound equal to 0 is
  • how positive a tweet with a ‘pos’ score higher than 50% is

Now, we have to test the capacity of the sentiment analysis to recognize a giveaway or a BOT-generated tweet.

Why ?

Because the aim of this article is to analyze how involved a Web3 community is. So taking into account neutral tweets, BOT-generated tweets or giveaways would be both misleading and useless. Indeed, in no instance do those kinds of tweets express any user involvement.

4th test:

*a high ‘neu’ indicator or a compound equal to 0 seems to be a good sign of a tweet that does not display great community involvement

5th test:

*to the naked eye, this tweet is at least neutral or BOT-generated, and the ‘neu’ score is very high

6th test:

*same result as the previous test

After these tests, we can assume that a telling sign of a giveaway, a neutral tweet or a BOT-generated tweet is a compound equal to 0 or a ‘neu’ score above 60%.

III • Clean the datasets

From the tests above, a tweet with either:

● a compound equal to 0,

● or a high neutrality score (more than 60%),

can be regarded as either a giveaway, a neutral tweet or a BOT-generated tweet.

To clean the dataset, keep only:

● tweets with a compound different from 0,

● tweets with a neutrality score strictly below 60%.

To remove unreliable tweets, you can use the code in Annex E.

The results

Once the dataset is cleaned, the data can be used to have a better idea of how involved a community is. When applied to Dogamí, these were the results.

Graph 1: The popularity of Dogamí

Graph 2: The community satisfaction

Graph 3: The evolution of Dogamí’s community

Graph 4: The involvement of Dogamí’s community

PS: The code for each graph is in the Annexes (F to I).

Now we have the graphs. But what can we conclude from the case study? It is common knowledge that a great result is worthless if poorly interpreted, so the value of a sentiment analysis lies in the ability to draw conclusions from it.

The case study conclusion

The analysis of Dogamí’s community from April 1st to March 31st displays pretty clear results.

First, the volume of tweets about Dogamí is larger than that of the other competitors on the market.

Secondly, after cleaning the datasets by removing the tweets which had nothing to do with community involvement, most of the remaining relevant tweets were about Dogamí.

Thirdly, Dogamí’s community is growing at a fast and sustained rate.

Fourthly, tweets about Dogamí are predominantly positive.

Therefore, Dogamí seems to tick the boxes of both a large and an engaged community.

Conclusion

When we hear about AI, it may seem rather complicated and out of reach. But the biggest impediment to learning is apprehensiveness. We saw throughout this article that there is nothing too complicated about sentiment analysis if you are well guided in the scientific approach. Now it is your turn!

The code:

A-Scraping data from Twitter related to Dogamí (same process for scraping any keyword on Twitter)
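Since the original screenshot is not reproduced here, below is a minimal sketch of what the snscrape-based scraping could look like. The date range, column names and output file names are assumptions to adapt to your own study; only the general pattern (a TwitterSearchScraper over a keyword and a date range, no API keys) reflects the approach described above.

import pandas as pd
import snscrape.modules.twitter as sntwitter

def scrape_keyword(keyword, since, until, limit=100_000):
    # Twitter search query restricted to a date range (YYYY-MM-DD)
    query = f"{keyword} since:{since} until:{until}"
    rows = []
    for i, tweet in enumerate(sntwitter.TwitterSearchScraper(query).get_items()):
        if i >= limit:
            break
        # Attribute names may differ slightly across snscrape versions
        # (e.g. 'content' was later renamed 'rawContent')
        rows.append({
            "date": tweet.date,
            "id": tweet.id,
            "content": tweet.content,
            "username": tweet.user.username,
            "likes": tweet.likeCount,
            "retweets": tweet.retweetCount,
        })
    return pd.DataFrame(rows)

# One CSV per project (the dates below are placeholders, not the article's exact range)
for kw in ["Dogami", "Genopets", "PetaverseNetwork", "Aavegotchi"]:
    scrape_keyword(kw, "2021-04-01", "2022-03-31").to_csv(f"{kw}_tweets.csv", index=False)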

B-Opening and reading the CSV file
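A possible sketch for this step, assuming the CSV produced in Annex A (the file name and the "date" column are assumptions):

import pandas as pd

# Load the scraped tweets and parse the date column so it can be resampled later
df = pd.read_csv("Dogami_tweets.csv", parse_dates=["date"])
print(df.shape)   # number of tweets and columns
print(df.head())  # first rows of the raw data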

C-NLTK analysis
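A minimal sketch of the VADER analysis with NLTK. The web3 words added to the lexicon and their valence values are illustrative assumptions, not the exact list used in the article.

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the VADER lexicon
sia = SentimentIntensityAnalyzer()

# Extend the default lexicon with some web3 jargon (illustrative values)
sia.lexicon.update({"wagmi": 2.0, "gm": 1.0, "moon": 1.5, "rug": -2.5, "fud": -1.5})

# Each tweet gets four scores: 'neg', 'neu', 'pos' and 'compound'
print(sia.polarity_scores("GM fam, the Dogami drop was amazing!"))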

D-storing the Sentiment Analysis in a dataframe
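One way to store the four indicators of every tweet in the dataframe, assuming the "content" column from Annex A and the analyzer "sia" from Annex C:

import pandas as pd

# Score every tweet and keep the four indicators as new columns
score_df = pd.DataFrame(
    [sia.polarity_scores(text) for text in df["content"].astype(str)],
    index=df.index,
)
df = df.join(score_df)  # adds 'neg', 'neu', 'pos' and 'compound'
print(df[["content", "pos", "neg", "neu", "compound"]].head())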

E-cleaning the datasets
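A sketch of the cleaning rule described in section III, assuming the columns created in Annex D:

# Keep only tweets with a non-zero compound and a neutrality score below 60%
clean_df = df[(df["compound"] != 0) & (df["neu"] < 0.60)].copy()
print(f"{len(df) - len(clean_df)} tweets removed, {len(clean_df)} kept")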

F-Graph 1
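A possible sketch for Graph 1, assuming "popularity" is measured as the raw number of scraped tweets per project (file names from Annex A):

import matplotlib.pyplot as plt
import pandas as pd

projects = ["Dogami", "Genopets", "PetaverseNetwork", "Aavegotchi"]
counts = [len(pd.read_csv(f"{kw}_tweets.csv")) for kw in projects]

plt.bar(projects, counts)
plt.ylabel("Number of tweets")
plt.title("Popularity: tweets per project")
plt.tight_layout()
plt.show()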

G-Graph 2
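A possible sketch for Graph 2, assuming "satisfaction" is the share of positive versus negative tweets (based on the compound) in the cleaned Dogamí dataset from Annex E:

import matplotlib.pyplot as plt

positive = (clean_df["compound"] > 0).sum()
negative = (clean_df["compound"] < 0).sum()

plt.pie([positive, negative], labels=["Positive", "Negative"], autopct="%1.1f%%")
plt.title("Community satisfaction")
plt.show()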

H-Graph 3
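A possible sketch for Graph 3, assuming the evolution of the community is approximated by the monthly volume of cleaned Dogamí tweets:

import matplotlib.pyplot as plt

# Count cleaned tweets per month (the 'date' column was parsed in Annex B)
monthly = clean_df.set_index("date").resample("M")["content"].count()
monthly.plot(marker="o")
plt.ylabel("Tweets per month")
plt.title("Evolution of Dogami's community")
plt.tight_layout()
plt.show()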

I-Graph 4
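A possible sketch for Graph 4, assuming "involvement" is the share of tweets that survive the cleaning step (non-zero compound, 'neu' below 60%) for each project, reusing the analyzer "sia" from Annex C:

import matplotlib.pyplot as plt
import pandas as pd

projects = ["Dogami", "Genopets", "PetaverseNetwork", "Aavegotchi"]
shares = []
for kw in projects:
    d = pd.read_csv(f"{kw}_tweets.csv")
    scores = pd.DataFrame([sia.polarity_scores(t) for t in d["content"].astype(str)])
    kept = ((scores["compound"] != 0) & (scores["neu"] < 0.60)).mean()
    shares.append(100 * kept)

plt.bar(projects, shares)
plt.ylabel("% of involved tweets")
plt.title("Community involvement per project")
plt.tight_layout()
plt.show()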

If you have any questions about the code or anything else: contact me.