forked from utcompling/textgrounder
-
Notifications
You must be signed in to change notification settings - Fork 1
A system for exploring political speech in social networks such as Twitter.
License
benwing/poligrounder
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Intermediate-file format: We have an intermediate-file format as follows. Corpora in an arbitrary source format are converted directly into this format. The intermediate format contains two types of files. One contains all the data for a document, including both the text and the metadata, specified as a number of fields, separated by a tab character. There may potentially be a number of these files, if there is a large amount of data. The other type of file is a very small "schema" file, describing the fields in the data files. The data files contain one document per line, for ease in processing with Hadoop -- even though that may potentially produce very large lines. The text in these files should be pre-split into words, since the algorithm for word-splitting will in general be specific to the corpus. The text should be in raw UTF-8 format, listed as a series of words separated by spaces. Because of the pre-splitting, there should be no words containing spaces, tabs or newlines, and hence no need to escape such characters to prevent them from being interpreted as separators. The data file contains a number of "fixed" fields that are common across all corpora and have well-defined semantics, as well as some "variable" fields that are specific to each corpus. The order of the fields is not too important, since the schema file indicates which field occurs where; however, it is convenient for viewing purposes to place the field containing text last, and the other fixed fields first. The defined fixed fields are: 1. "corpus": Corpus. This could be e.g. "wikipedia-english-20111007" or "twitter-infochimps" or something, indicating the source of the documents. 2. "title": Document title. This should be unique within a given corpus, but need not be unique across all corpora. The combination of corpus and title uniquely identifies a document. If there is only a unique ID, and no title, the ID can be copied into the title field. If the title is not necessarily unique, the ID can be prepended. Examples of cases where both titles and ID's exist are Wikipedia articles and Twitter users. 3. "id": Document ID. Generally, this should also be unique within a given corpus. This field is only used for identification currently. If there isn't an ID separate from the title, either the title can be copied or an arbitrary number generated. 4. "group": Document group. This is used for grouping documents into larger documents (e.g. grouping tweets by user). This can be left blank in cases where no such concept exists. 5. "coord": Document coordinates. This is a combination of two floating-point values, specifying latitude and longitude respectively (where positive indicates north and east, negative south and west), and separated by a comma. 6. "split": Document split. This should be one of "training", "dev" or "test". 7. "text": Text, pre-split, separated by spaces, in UTF-8 format. 8. Additional fields, relevant to specific types of corpora. The schema file, which describes the fields in the data file, is currently simply a one-line file with each field name separated by a tab. Fixed fields need to be listed as well. Possibly we may add a second line specifying the field type. Possible field types might be string, integer, floating-point or boolean. (Booleans would be serialized as "yes"/"no", "true"/"false" or "on"/"off". On input we accept any of them. If we deal with output, we either need to pick one or have different boolean types, e.g "yes-boolean", "on-boolean" or "true-boolean".) Either the types should be automatically nullable (i.e. they have an additional "unknown" value corresponding to a blank field), or there should be nullable versions of each type. Counts-file format: The data file in the above format can be processed to produce a file with word counts instead of actual text. The text field is replaced by a "counts" field. The counts file contains a series of entries separated by spaces. Each entry contains one or more fields, separated by colons. Because the colon serves as a separator, it needs to be escaped when it occurs as a character in a word (e.g. in a URL or emoticon occurring as a word). To make it easy to split accurately, the colon needs to be escaped into a format that does not have a colon character in it (i.e. simply preceding by a backslash won't work). Instead, we URL-encode colons and percent signs, i.e. colon becomes %3A and percent sign becomes %25. Unigram files have entries of the form WORD:COUNT. Bigram files will have entries either of the form WORD1:WORD2:COUNT or WORD:COUNT (the latter containing unigram counts, for backoff purposes). If additional parameters need to be specified, they should be given in a format such as :PARAM:VALUE, with an initial colon.
About
A system for exploring political speech in social networks such as Twitter.
Resources
License
Stars
Watchers
Forks
Packages 0
No packages published
Languages
- Scala 39.3%
- Java 33.1%
- Python 24.8%
- Shell 2.4%
- Perl 0.4%