The website (Footnote 2) was used to gather tweet-ids (Footnote 3); this website provides researchers with metadata from a (third-party-collected) corpus of Dutch tweets (Tjong Kim Sang and Van den Bosch, 2013). The tweet-ids make it possible to retrieve tweets from the Twitter API that are older than nine weeks (i.e., the historical limit when requesting tweets based on a search query). The R package ‘rtweet’ and its complementary ‘lookup_status’ function were used to collect the tweets in JSON format. The JSON file comprises a table with the tweets’ information, such as the creation date, the tweet text, and the source (i.e., type of Twitter client).

Data cleaning and preprocessing

The JSON files (Footnote 4) were converted into an R data frame object. Non-Dutch tweets, retweets, and automated tweets (e.g., forecast-, advertisement-, and traffic-related tweets) were removed. In addition, we excluded tweets based on three user-related criteria: (1) We removed tweets that belonged to the top 0.5 percentile of user activity, such as users who created more than 2000 tweets within four weeks, because we considered them non-representative of the normal user population. (2) Tweets from users with early access to the 280-character limit were removed. (3) Tweets from users who were not represented in both the pre-CLC and post-CLC datasets were removed; this procedure ensured a consistent user sample over time (within-group design, Nusers = 109,661). All cleaning procedures and corresponding exclusion numbers are presented in Table 2.
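The user-related criteria (1) and (3) can be illustrated with a minimal sketch. The study itself worked with R data frames; the Python below is only an illustration under stated assumptions. The function name `filter_users`, the `(user_id, text)` tuple layout, and the reading of "top 0.5 percentile" as "the most active 0.5% of users" are all hypothetical choices, not taken from the study.

```python
from collections import Counter


def filter_users(pre, post, top_fraction=0.005):
    """Apply two user-level exclusion rules to pre- and post-CLC tweets.

    pre, post: lists of (user_id, text) tuples (hypothetical layout).
    top_fraction: share of most active users to drop (assumed 0.5%).
    """
    # Count tweets per user over both periods.
    counts = Counter(user for user, _ in pre + post)

    # Rank users by activity and drop the most active share.
    ranked = sorted(counts, key=counts.get, reverse=True)
    n_drop = int(len(ranked) * top_fraction)
    too_active = set(ranked[:n_drop])

    # Keep only users present in both periods (within-group design).
    in_both = {user for user, _ in pre} & {user for user, _ in post}
    keep = in_both - too_active

    return ([t for t in pre if t[0] in keep],
            [t for t in post if t[0] in keep])
```

Criterion (2), early access to the 280-character limit, is not covered here because the text does not say how those users were identified.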

The tweet texts were converted to ASCII encoding. URLs, line breaks, tweet headers, screen names, and references to screen names were removed. URLs add to the character count when they are located within the tweet. However, URLs do not add to the character count when they are located at the end of a tweet. To prevent a misrepresentation of the actual character limit that users had to deal with, tweets with URLs (but not media URLs such as added pictures or videos) were excluded.
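A minimal sketch of this kind of text cleaning is shown below, in Python rather than the R used in the study. The regular expressions, the function names `has_url` and `clean_tweet`, and the choice to drop non-ASCII characters (rather than transliterate them) are assumptions made for illustration. Distinguishing media URLs from other URLs requires tweet metadata and is not attempted here.

```python
import re

# Assumed patterns; the study does not specify its exact expressions.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def has_url(text):
    """Return True if the tweet text contains a URL."""
    return bool(URL_RE.search(text))


def clean_tweet(text):
    """Remove screen-name references and line breaks, keep ASCII only."""
    text = MENTION_RE.sub(" ", text)                       # screen names
    text = text.replace("\r", " ").replace("\n", " ")      # line breaks
    text = text.encode("ascii", "ignore").decode("ascii")  # ASCII only
    return re.sub(r"\s+", " ", text).strip()               # tidy spaces


tweets = ["@jan Hallo\nwereld", "zie https://t.co/abc"]
# Exclude tweets that contain a URL, then clean the remainder.
cleaned = [clean_tweet(t) for t in tweets if not has_url(t)]
```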

Token and bigram analysis

The R package ‘quanteda’ (Footnote 5) was used to tokenize the tweet texts into tokens (i.e., isolated words and punctuation marks). In addition, token-frequency matrices were computed with: the frequency pre-CLC [f(token pre)], the relative frequency pre-CLC [P(token pre)], the frequency post-CLC [f(token post)], the relative frequency post-CLC [P(token post)], and T-scores. The T-test is similar to a standard T-statistic and computes the statistical difference between means (i.e., the relative word frequencies). Negative T-scores indicate a relatively higher occurrence of a token pre-CLC, whereas positive T-scores indicate a relatively higher occurrence of a token post-CLC. The T-score equation used in the analysis is presented as Eq. (1) and (2). N is the total number of tokens per dataset (i.e., pre-CLC and post-CLC). This equation is based on the method for linguistic computations by Church et al. (1991; Tjong Kim Sang, 2011).
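Since Eq. (1) and (2) are not reproduced in this text, the sketch below uses a common approximation of the T-score associated with Church et al. (1991), in which the variance of a relative frequency is estimated as f/N². Whether this matches the study's equations term by term is an assumption, and the function name `t_score` and its parameters are hypothetical. The sign convention follows the description above: positive values indicate a relatively higher occurrence post-CLC.

```python
import math


def t_score(f_pre, n_pre, f_post, n_post):
    """Approximate T-score for one token across two datasets.

    f_pre, f_post: token frequency pre- and post-CLC.
    n_pre, n_post: total number of tokens in each dataset.
    """
    p_pre = f_pre / n_pre      # relative frequency pre-CLC
    p_post = f_post / n_post   # relative frequency post-CLC

    # Variance of each relative frequency, estimated as f / N^2.
    variance = p_pre / n_pre + p_post / n_post
    if variance == 0:
        return 0.0  # token absent from both datasets
    return (p_post - p_pre) / math.sqrt(variance)
```

For example, a token that occurs 5 times in 1000 pre-CLC tokens and 20 times in 1000 post-CLC tokens yields a positive score under this approximation, while swapping the two counts flips the sign.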

Part-of-speech (POS) analysis

The R package ‘openNLP’ (Footnote 6) was used to classify and count POS categories in the tweets (i.e., adjectives, adverbs, articles, conjunctives, interjections, nouns, numerals, prepositions, pronouns, punctuation, verbs, and miscellaneous). The POS tagger operates using a maximum entropy (maxent) probability model to predict the POS category based on contextual features (Ratnaparkhi, 1996). The Dutch maxent model used for the POS classification was trained on CoNLL-X Alpino Dutch Treebank data (Buchholz and ). The openNLP POS model has been reported with an accuracy score of 87.3% when used for English social media data (Horsmann et al., 2015). An ostensible limitation of the current study is the reliability of the POS tagger. However, identical analyses were performed for the pre-CLC and post-CLC datasets, meaning the accuracy of the POS tagger is consistent across both datasets. Therefore, we assume there are no systematic confounds.