Putunga Reo Māori Tīhau: The Reo Māori Twitter Corpus
Te Reo Māori
Ko te Putunga Reo Māori Tīhau (RMT V2) nei he putunga o ngā tīhau 94,163 kua tuhia ki te reo Māori. He mea hanga hei tautoko i ngā mahi reo Māori -ā-waha o te rorohiko, hei rauemi hoki mō ngā iwi Māori puta noa i te motu. E 2,302 ngā kaituhi pīhau reo Māori nei, ko ētehi he whaiaro nō te tangata ake, ko ētehi he mea tuku nō te kamupene. Kua tautohua ngā kaituhi pīhau nei e te pae tukutuku a Indigenous Tweets nā Prof. Kevin Scannell.
Kia tikiake te putunga RMT
Nā tēneki waehere e taea ai te tikiake i ngā pīhau me ngā raraungameta. Kua hangaia te waehere nei mai i ngā waehere o Twitter arā te waehere tauira o API v2 endpoints.
Kia mōhio mai: kō ētehi o ngā tīhau kāore i reira ināianei, nō reira ka kore e taea te tikiake. Tukua tētehi īmēra ki a David Trye inā ka hiahiatia te katoa o te putunga RMT me ētehi raraungameta i tua atu, he raraungameta kua kōrerotia i tā mātou pepa.
Ka hāngai te tere o tāu tikiake ki tāu ake rate limit i tāu pūkete kaiwhanake o Twitter (arā, 300, 900 rānei ngā tono ki ia 15 meneti). Ki te whakapaua tāu tatau, ka puta mai te kōrero hē: 429 'Too many requests'.
- Tonoa tētehi Pūkete Kaiwhanake o Twitter inā kaore i a koe tētehi.
- Me matua noho a Python 3 ki tāu rorohiko. Ka whakamahia e te waehere tikiake i te
requests==2.24.0
, ā, ka whakamahia hoki terequests-oauthlib==1.3.0
. Ka whēnei ngā tono hei utaina ēnei:
pip install requests
pip install requests-oauthlib
- Tikiake me te unu i ngā kōnae katoa i te kōpake rmt. Kei tēneki kōpaki tētehi kōnae ko
rmt-corpus-v2.csv
te ingoa, kei reira ngā tīhau IDs me ngā raraungameta. Kei reira hoki ngā waehere Python e rua mō te tikiake me te whakahōputu o ngā raraunga (arā,get_tweets_with_bearer_token.py
me tejson_to_tsv.py
). - Kia whakaritea ai tāu API bearer token me tuku atu tēnei tono:
export 'BEARER_TOKEN'='<your_bearer_token>'
- Whārere
get_tweets_with_bearer_token.py
.
python get_tweets_with_bearer_token.py > output.json
Nā reira, ka tikiake te 100 tīhau i ia tono, ā, nā ēnei whakaritenga ka āhua 45 meneti pea te roa. Ka whēnei te roa i te mea, ka 30,000 tīhau (300 tono x 100 tīhau) i ia 15 meneti. I te otinga ake ka rauikatia ki te kōnae output.json heoi anō he pseudo-JSON kē; ka motumotu ngā taupū ki tētehi rārangi e whēnei ana; "Batch X, Code Y", he nama te X me te Y.
- Whārere
json_to_tsv.py
kia whakawhiti ai te kōnae putunga ki te hōputu TSV.
python json_to_tsv.py
Ko te otinga o tēnei waehere ko te kōnae rmt-corpus-final.csv
, he kōnae ka tūwheratia ki te tautono ripanga. Kua rite tahi te hōputu o ngā kōrero tīhau, arā ka tangohia ngā pūāhua motuhake, ka weteoro te HTML, ā, ka rite tahi te whakatakoto o ngā kōrero kaitīhau me ngā tūhononga. Ka tāpiritia hoki ngā raraungameta mai i te kōnae motuhake o rmt-corpus-v2.csv
. Ki raro iho nei ka whakamārama atu ngā taurangi kei ngā kōnae CSV.
He Whakamārama Raraunga: rmt-corpus-v2.csv
Koia nei ngā māramatanga (inā e mōhio ana) kei te kōnae rmt-corpus-v2.csv
, ahakoa kei a Twitter tonu te tīhau ake, kāore rānei.
Taurangi | Whakamārama |
---|---|
id | Te tautohu motuhake o Twitter mō tēnei tīhau. Ka taea te rapu tēnei tautohu ki twitter.com/user/status/XXX, inā ko XXX te tīhau tautohu. |
num_maori_words | E hia ngā kupu Māori ki te tīhau. |
total_words | E hia te katoa o ngā kupu ki te tīhau. |
percent_maori | Ko te paiheneti o ngā kupu Māori ki te tīhau (=num_maori_words / total_words *100). |
favourites | E hia ngā pānga (likes, retweets & quotes) i whakapā atu ki tēnei tīhau. |
reply_count | E hia ngā whakahoki kōrero ki tēnei tīhau. |
user.alias | He ingoakē mō kaitīhau, ka hangaia i te TX , inā ko te X tōna tūranga i te tatau whānui o āna tīhau i te putunga katoa (user.num_tweets ). |
user.status | Te tūnga pūkete (i te Tīhema 2020) o te tangata nāna te tīhau i tuku: ka 'active', 'protected', 'suspended', 'not found' rānei. |
user.followers | Tokohia āna apataki (i te Tīhema 2020). |
user.friends | E hia ngā hoa ka whaia e ia (i te Tīhema 2020). |
user.num_tweets | E hia ngā tīhau kua tukuna e ia. |
user.region | Tōna wākāinga, e ai ki a ia. Inā i taea, i whakaritea ngā rohe nei ki New Zealand regions me ngā ingoa o ngā whenua ki tāwāhi. |
year | Te tau kua tukuna te tīhau (2007-2022). |
He Whakamārama Raraunga: rmt-corpus-final.csv
Ki te ora tonu te tīhau ki a Twitter, ka tāpiri atu hoki ngā taurangi ki raro iho nei. Ki te kore te taurangi e kitea ka tuhia ‘None’.
Taurangi | Whakamārama |
---|---|
content | Ko ngā kōrero o te tīhau; kua rite tahi te hōputu o nga kōrero tīhau, arā ka tangohia ngā pūāhua motuhake, ka weteoro te HTML, ā, ka rite tahi te whakatakoto o ngā kōrero kaitīhau me ngā tūhononga. |
conversation_id | Ko te tautohu motuhake mō te kōrerorero e noho atu ai tēnei tīhau. |
in_reply_to_user_id | Mehemea he whakahoki kōrero tēnei ki te tīhau o tāngata kē, koia nei te tautohu motuhake o te tangata nāna te tīhau tuatahi i tuku. |
author_id | Ko te tautohu motuhake o Twitter mō te kaituhi o te tīhau. |
created_at | Te wā me te rā i tuku ai te tīhau, ka whēnei te hōputu: YYYY-MM-DDTHH:mm:ss.000Z . |
lang | Ko te reo o te tīhau e ai ki a Twitter. Ka kore te reo Māori i kitea i kōnei i te mea he kore mōhio o te API ki te reo Māori. |
source | Ko te pūrere, te tautono o wāho rānei i tuku i te tīhau (arā 'Twitter Web Client', 'Twitter for iPhone'). |
error | Ko te take i kore tikiake ai i te tīhau. Ki te hē mai, kā whēnei: 'Authorization Error', 'Not Found Error', 'None'. |
He rauemi i tua atu
- Ko te waehere i whakarite ai i tātari ai te putunga RMT kei kōnei project GitHub repository.
- E taea ana te tikiake kupu rārangi me te tatau o ngā kupu me ngā tohuhaki i te putunga nei (V1).
Tohutoro te putunga RMT
Ki te whakamahi koe i te putunga RMT nei, tēnā koe me tohutoro whēnei mai:
- Trye, D., Keegan, T. T., Mato, P., & Apperley, M. (2022). Harnessing Indigenous Tweets: The Reo Māori Twitter corpus. Lang Resources & Evaluation, 56, 1229-1268. doi:10.1007/s10579-022-09580-w
Kaimahi
Kaitautoko i tua:
- Te Hiku Media, NZ
- Kevin Scannell, Saint Louis University, US
Tautoko ā-pūtea
E mihi atu ana mō te tautoko ā-pūtea:
- Ngā Pae o te Māramatanga
- The University of Waikato
I whakatikahia te kōrero kei tēnei whārangi i Noema 2022. Whakamōhio mai koa mehemea ka kite koe i ētehi hē i ngā waehere i tēneki whārangi rānei.
- RMT V2: I te Noema 2022, 80,935 tīhau (86% o te putunga RMT V2) i taea ai te tikiake i a Twitter.
- RMT V1: I te Noema 2021, 72,575 tīhau (92% o te putunga RMT V1) i taea ai te tikiake i a Twitter.
English
The Reo Māori Twitter (RMT) Corpus V2 is a collection of 94,163 te reo Māori tweets, designed for linguistic analysis and to help in the development of new Natural Language Processing (NLP) resources for the Māori community. The corpus captures output from 2,302 users, including a mixture of personal and institutional accounts. These users were identified via Prof. Kevin Scannell's Indigenous Tweets website.
Download the RMT Corpus
The tweets and user metadata in the RMT Corpus can be hydrated (downloaded from Twitter) using the code provided. The source code is adapted from Twitter's sample code for API v2 endpoints.
Note: Some tweets in the corpus are no longer publicly available and, as such, cannot be downloaded. Please email David Trye if you would like access to the complete dataset, including additional metadata mentioned in our paper.
The speed at which you can download the corpus depends on the rate limit for your Twitter developer account (e.g. 300 or 900 requests per 15-minute window). If you exceed the allocated limit, a 429 'Too many requests' error will be returned.
- Apply for a Twitter developer account if you do not have one already.
- Ensure that Python 3 is installed on your machine. The code for hydrating the corpus uses
requests==2.24.0
, which in turn usesrequests-oauthlib==1.3.0
. You can install these packages as follows:
pip install requests
pip install requests-oauthlib
-
Download and extract all files in the rmt folder. This folder contains a file called
rmt-corpus-v2.csv
, which has the tweet IDs and selected metadata, as well as two Python scripts for downloading and formatting the data (namely,get_tweets_with_bearer_token.py
andjson_to_tsv.py
). -
Configure your API bearer token by running the following command in the terminal:
export 'BEARER_TOKEN'='<your_bearer_token>'
- Run
get_tweets_with_bearer_token.py
from the terminal.
python get_tweets_with_bearer_token.py > output.json
This will download the corpus in batches of 100 tweets. If you use the default settings, the script will take roughly 1 hour to run, as it will attempt to download 30,000 tweets (300 requests x 100 tweets) every 15 minutes. The resulting file, output.json
, is only pseudo-JSON (each batch is separated by a line in the form "Batch X
, Code Y
", where X
and Y
are numbers).
- Run
json_to_tsv.py
to convert the output file to TSV format.
python json_to_tsv.py
This script will produce a file called rmt-corpus-final.csv
, which you can then open in a spreadsheet application. Tweet text is formatted consistently (special characters are removed, any HTML is decoded, and user mentions and links are standardised). The tweets are also supplemented with metadata from the original rmt-corpus-v2.csv
file. A description of the variables in each of the CSV files is given below.
Data Description: rmt-corpus-v2.csv
The following information (where known) is supplied in rmt-corpus-v2.csv
, even if the tweet is no longer available on Twitter.
Data Column | Description |
---|---|
id | Twitter's unique identifier for the tweet. You can search for a tweet online by visiting twitter.com/user/status/XXX, where XXX is the tweet's ID. |
num_maori_words | The number of Māori words in the tweet. |
total_words | The total number of words in the tweet. |
percent_maori | The percentage of Māori text detected in the tweet (=num_maori_words / total_words *100). |
favourites | The number of favourites (likes, retweets & quotes) that the given tweet received. |
reply_count | The number of replies that the given tweet received. |
user.alias | An alias for the author of the tweet in the form TX , where X represents the user’s ranking based on their total number of tweets in the corpus (user.num_tweets ). |
user.status | The account status (as of Decemeber 2020) of the user who wrote the tweet: 'active', 'protected', 'suspended' or 'not found'. |
user.followers | The user's number of followers (as of December 2020). |
user.friends | The number of accounts that the user follows (as of December 2020). |
user.num_tweets | The total number of tweets in the corpus that were written by this user. |
user.region | The user's location, based on self-reported information. Where possible, the data has been aggregated into New Zealand regions and names of foreign countries. |
year | The year the tweet was written (between 2007 and 2022). |
Data Description: rmt-corpus-final.csv
If the tweet is still publicly available on Twitter, the following variables will appear alongside those mentioned above. Missing values are indicated with 'None':
Data Column | Description |
---|---|
content | The tweet content, with consistent formatting applied (special characters stripped, user mentions and links standardised). |
content_with_emojis | The tweet content with emojis included. |
conversation_id | The ID for the conversation that the tweet is part of. |
in_reply_to_user_id | If the tweet is written in reply to another, this is the ID of the user who who wrote the original tweet. |
author_id | Twitter's unique identifier for the user who wrote the tweet. |
created_at | The timestamp when the tweet was posted, in the format YYYY-MM-DDTHH:mm:ss.000Z . |
lang | The two-letter code representing the language that the tweet was (erroneously) classified as (NOT Māori, as the API does not support te reo). |
source | The device or third-party application from which the tweet was posted (e.g. 'Twitter Web Client', 'Twitter for iPhone'). |
error | The reason why the tweet could not be downloaded, if there was an error ('Authorization Error', 'Not Found Error', 'None'). |
Other Resources
- Code for cleaning and analysing the RMT Corpus is available on the project GitHub repository.
- You can download a wordlist with frequencies for all words and hashtags in Version 1 of the corpus.
Citing the RMT Corpus
If you use the RMT corpus, please cite the following paper:
- Trye, D., Keegan, T. T., Mato, P., & Apperley, M. (2022). Harnessing Indigenous Tweets: The Reo Māori Twitter corpus. Lang Resources & Evaluation, 56, 1229-1268. doi:10.1007/s10579-022-09580-w
Team
External Collaborators:
- Te Hiku Media, NZ
- Kevin Scannell, Saint Louis University, US
Funding
We graciously acknowledge the generous support of:
- Ngā Pae o te Māramatanga
- The University of Waikato
The information on this page was last checked in November 2022. Please let us know if you notice any errors in the code and/or instructions.
- RMT V2: As of 25 November 2022, 80,935 tweets (86% of the RMT Corpus V2) could be successfully downloaded from Twitter.
- RMT V1: As of 12 April 2021, 72,575 tweets (92% of the RMT Corpus V1) could be successfully downloaded from Twitter.