An open-source toolkit for analyzing line-oriented JSON data from the Twitter v1.1 API or flattened line-oriented JSON data from the Twitter v2 API using Apache Spark.
- Java 8 or 11
- Python 3
- Apache Spark
To get started with twut, you can either use it directly from Maven or download the JAR and ZIP files for Spark or PySpark.
To use twut with Apache Spark, you can use the following command to include the package:
$ spark-shell --packages "io.archivesunleashed:twut:1.1.0"
Alternatively, you can download the JAR file from the latest release and include it manually:
$ spark-shell --jars /path/to/twut-1.1.0-fatjar.jar
For Python users, download the ZIP file from the latest release and include it in your PySpark environment:
$ pyspark --py-files /path/to/twut-1.1.0.zip
You will also need to set the PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON environment variables.
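As a minimal sketch of that step, the two variables can be exported in the shell before launching PySpark. This assumes a python3 interpreter is on your PATH and that the same interpreter should serve both the driver and the workers; the PYTHON_BIN helper variable is only illustrative, and you may prefer to point at a specific virtual environment or Conda interpreter instead.

```shell
# Assumption: python3 is on PATH; substitute the interpreter you actually want.
PYTHON_BIN="$(command -v python3)"

# Interpreter used by Spark workers (executors).
export PYSPARK_PYTHON="$PYTHON_BIN"
# Interpreter used by the PySpark driver.
export PYSPARK_DRIVER_PYTHON="$PYTHON_BIN"

# Print the values so you can confirm them before starting PySpark.
echo "PYSPARK_PYTHON=$PYSPARK_PYTHON"
echo "PYSPARK_DRIVER_PYTHON=$PYSPARK_DRIVER_PYTHON"
```

With both variables exported in the same shell session, the pyspark command above picks them up at launch.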
After you have twut built or downloaded, you can follow the basic set of recipes and tutorials here.
Licensed under the Apache License, Version 2.0.
This work is primarily supported by the Andrew W. Mellon Foundation. Other financial and in-kind support comes from the Social Sciences and Humanities Research Council, Compute Canada, the Ontario Ministry of Research, Innovation, and Science, York University Libraries, Start Smart Labs, and the Faculty of Arts and David R. Cheriton School of Computer Science at the University of Waterloo.
Any opinions, findings, and conclusions or recommendations expressed are those of the researchers and do not necessarily reflect the views of the sponsors.