twitter_loctagger_it

This package allows you to map Italian Twitter users to a specific city or region in Italy.

About

This package is specifically intended to detect, extract, and geocode locations from the "location" field, which is part of a user's profile. For example, if a user wrote "vivo a Sesto Calende", the algorithm would map this user to "Sesto Calende, Varese, VA, Lombardia, Italia". Note that the "location" field can be obtained by using the Twitter API as part of the user information.

How to use

Import the package as follows:

from twitter_loctagger_it.twitter_loctagger_it import geocoder

Run the following command:

df_output = geocoder.tag_location(df_input)

Note: the function takes a pandas Series as input (with the Twitter user ID as the index) and returns a pandas DataFrame with the geocoded results.

Example

As mentioned above, this package requires as input a Series object (df_input), which looks like this:

id	location_original
244950970	Recanati,italy
17774175	fidenza
91842195	Nel Paese dei Balocchi
244754469	monsampolo del tronto

where the column location_original contains the self-reported location of each user, while the index holds the corresponding user IDs.
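A minimal sketch of how such an input could be built with pandas, using two of the rows shown above. The variable name `locations` is only illustrative, and the geocoder call is left as a comment because it needs the package to be installed.

```python
import pandas as pd

# Self-reported locations keyed by Twitter user ID (two rows from the example above).
locations = {
    244950970: "Recanati,italy",
    91842195: "Nel Paese dei Balocchi",
}

# The Series values are the raw "location" strings; the index holds the user IDs.
df_input = pd.Series(locations, name="location_original")
df_input.index.name = "id"

# With the package installed, the call from "How to use" would then be:
# from twitter_loctagger_it.twitter_loctagger_it import geocoder
# df_output = geocoder.tag_location(df_input)
```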

Then, by running the command above, the result (df_output) would be:

id	location_original	city	province	province_code	region	geographic_ripartition	state
244950970	Recanati,italy	Recanati	Macerata	MC	Marche	Centro	Italia
17774175	fidenza	Fidenza	Parma	PR	Emilia-Romagna	Nord-Est	Italia
91842195	Nel Paese dei Balocchi
244754469	monsampolo del tronto	Monsampolo del Tronto	Ascoli Piceno	AP	Marche	Centro	Italia

Caveats

The main issue is cities whose names generate false positives. For example, a nonsensical location such as "seconda stella a destra" could be matched to "Stella, Savona, SV, Liguria, Italia". To avoid this, I looked for all the cities that could match nonsensical patterns and removed them from the list of cities to be matched. These tend to be very small cities that, in my case, only generated false positives. In general, I tried to strike a balance between avoiding false positives and not losing too much geographic detail.
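The idea of excluding problematic city names can be sketched as follows. This is not the package's actual implementation: the toy `GAZETTEER`, the `EXCLUDED` set, and the `match_city` helper are all hypothetical names introduced for illustration, and whole-word matching is an assumption.

```python
import re

# Hypothetical toy gazetteer: lowercase city name -> full label.
GAZETTEER = {
    "sesto calende": "Sesto Calende, Varese, VA, Lombardia, Italia",
    "stella": "Stella, Savona, SV, Liguria, Italia",
}

# Hypothetical exclusion list: names that mostly produce false positives.
EXCLUDED = {"stella"}


def match_city(location, gazetteer, excluded=frozenset()):
    """Return the label of the first city found as a whole word, or None."""
    text = location.lower()
    for name, label in gazetteer.items():
        if name in excluded:
            continue  # skip names removed from the list of candidates
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            return label
    return None


# Without the exclusion list, the nonsensical location is matched to Stella.
naive = match_city("seconda stella a destra", GAZETTEER)
# With the exclusion list, it is left unmatched, while real cities still match.
filtered = match_city("seconda stella a destra", GAZETTEER, EXCLUDED)
real = match_city("vivo a Sesto Calende", GAZETTEER, EXCLUDED)
```

The trade-off is the one described above: users who really live in an excluded city are no longer matched.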

The second issue concerns duplicate names, i.e. cities present in more than one region. In Italy, the list is the following: "Paterno, San Teodoro, Castro, Peglio, Corvara, Livo, Samone". Again, I removed these from the possible matches.
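One way such duplicates could be detected is to group city names by region and keep those that appear in more than one. The sketch below assumes a list of (city, region) pairs; the `CITIES` sample and the `ambiguous_names` helper are hypothetical and not part of the package.

```python
from collections import defaultdict

# Hypothetical sample of (city, region) pairs; a real list would cover all of Italy.
CITIES = [
    ("Castro", "Puglia"),
    ("Castro", "Lombardia"),
    ("Samone", "Piemonte"),
    ("Samone", "Trentino-Alto Adige"),
    ("Recanati", "Marche"),
]


def ambiguous_names(cities):
    """Return the set of city names that occur in more than one region."""
    regions_by_name = defaultdict(set)
    for name, region in cities:
        regions_by_name[name].add(region)
    return {name for name, regions in regions_by_name.items() if len(regions) > 1}


duplicates = ambiguous_names(CITIES)
```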

Finally, some users write multiple cities in the "location" field, where one is possibly their hometown and the others are cities they subsequently moved to. If a user moved abroad, i.e. they write something like "Roma, London", they are matched to the Italian city only and the other one is simply ignored. If instead all cities are in Italy, the user is matched to one of them and, since there is no strict pattern in the order in which users list cities, the choice is as good as random.
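The behaviour with multiple cities can be illustrated with a small sketch. It assumes cities are separated by commas or slashes and uses a hypothetical `ITALIAN_CITIES` set and `italian_matches` helper; the package's own parsing may differ.

```python
import re

# Hypothetical set of known Italian city names (lowercase).
ITALIAN_CITIES = {"roma", "milano", "torino"}


def italian_matches(location, cities):
    """Split the field on commas or slashes and keep the parts that are known Italian cities."""
    parts = [part.strip().lower() for part in re.split(r"[,/]", location)]
    return [part for part in parts if part in cities]


# A foreign city is simply ignored, so only the Italian one remains.
abroad = italian_matches("Roma, London", ITALIAN_CITIES)
# With several Italian cities there is more than one candidate; whichever one is
# kept is arbitrary, because the order says nothing about the current residence.
several = italian_matches("Torino / Milano", ITALIAN_CITIES)
```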
