Event Recommendation data processing

Events recommendation

This article describes a project for the Design Thinking course at ZJU.

This article is still being updated…

Project Objective

Provide students with personalized information about events around campus, based on event content, user information, etc.

Dataset

  • Event Info: includes the event’s time, location, content, etc.
  • User Info: includes the user’s age, gender, event preferences, etc.

However, it is hard to collect a real-life dataset because of time and other constraints.
So…

We use a similar existing dataset from the internet.

Resource:

The dataset comes from a Kaggle competition:
https://www.kaggle.com/c/event-recommendation-engine-challenge/data

Description:

This is the original description of the dataset files from Kaggle:

There are six files in all: train.csv, test.csv, users.csv, user_friends.csv, events.csv, and event_attendees.csv.

train.csv has six columns: user, event, invited, timestamp, interested, and not_interested. test.csv contains the same columns as train.csv, except for interested and not_interested. Each row corresponds to an event that was shown to a user in our application. event is an id identifying an event in our system. user is an id representing a user in our system. invited is a binary variable indicating whether the user has been invited to the event. timestamp is an ISO-8601 UTC time string representing the approximate time (+/- 2 hours) when the user saw the event in our application. interested is a binary variable indicating whether a user clicked on the “Interested” button for this event; it is 1 if the user clicked Interested and 0 if the user did not click the button. Similarly, not_interested is a binary variable indicating whether a user clicked on the “Not Interested” button for this event; it is 1 if the user clicked the button and 0 if not. It is possible that the user saw an event and clicked neither Interested nor Not Interested, and hence there are rows that contain 0,0 as values for interested,not_interested.

users.csv contains demographic data about some of our users (including all of the users appearing in the train and test files), and it has the following columns: user_id, locale, birthyear, gender, joinedAt, location, and timezone. user_id is the id of the user in our system. locale is a string representing the user’s locale, which should be of the form language_territory. birthyear is a 4-digit integer representing the year when the user was born. gender is either male or female, depending on the user’s gender. joinedAt is an ISO-8601 UTC time string representing when the user first used our application. location is a string representing the user’s location (if known). timezone is a signed integer representing the user’s UTC offset (in minutes).

user_friends.csv contains social data about this user, and contains two columns: user and friends. user is the user’s id in our system, and friends is a space-delimited list of the user’s friends’ ids.

events.csv contains data about events in our system, and has 110 columns. The first nine columns are event_id, user_id, start_time, city, state, zip, country, lat, and lng. event_id is the id of the event, and user_id is the id of the user who created the event. city, state, zip, and country represent more details about the location of the venue (if known). lat and lng are floats representing the latitude and longitude coordinates of the venue, rounded to three decimal places. start_time is the ISO-8601 UTC time string representing when the event is scheduled to begin. The last 101 columns require a bit more explanation; first, we determined the 100 most common word stems (obtained via Porter Stemming) occurring in the name or description of a large random subset of our events. The last 101 columns are count_1, count_2, …, count_100, count_other, where count_N is an integer representing the number of times the Nth most common word stem appears in the name or description of this event. count_other is a count of the rest of the words whose stem wasn’t one of the 100 most common stems.

event_attendees.csv contains information about which users attended various events, and has the following columns: event_id, yes, maybe, invited, and no. event_id identifies the event. yes, maybe, invited, and no are space-delimited lists of user id’s representing users who indicated that they were going, maybe going, invited to, or not going to the event.

We will only use some of these files:

  • events
  • users
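
As a rough illustration of the events.csv layout described above, the sketch below splits each row into the nine metadata fields and the 101 word-stem counts. It uses only the standard library and a made-up sample row, not the real file. The function name read_events and the sample values are hypothetical, and the header names in the real download may differ from the description, so the split is done by column position rather than by name.

```python
import csv
import io

N_META = 9      # event_id, user_id, start_time, city, state, zip, country, lat, lng
N_COUNTS = 101  # count_1 .. count_100 plus count_other


def read_events(handle):
    """Yield (metadata dict, list of 101 integer counts) for each event row."""
    reader = csv.reader(handle)
    header = next(reader)
    if len(header) != N_META + N_COUNTS:
        raise ValueError(f"expected 110 columns, got {len(header)}")
    meta_names = header[:N_META]
    for row in reader:
        # Split by position: first nine fields are metadata, the rest are counts.
        meta = dict(zip(meta_names, row[:N_META]))
        counts = [int(value) for value in row[N_META:]]
        yield meta, counts


# Header built from the column names given in the Kaggle description.
header = (
    ["event_id", "user_id", "start_time", "city", "state",
     "zip", "country", "lat", "lng"]
    + [f"count_{i}" for i in range(1, 101)]
    + ["count_other"]
)

# One made-up event row (hypothetical values, location fields left empty).
sample_row = (
    ["1", "42", "2012-10-31T00:00:00.001Z", "", "", "", "", "", ""]
    + ["2", "0", "1"] + ["0"] * 97 + ["5"]
)
sample_csv = io.StringIO(",".join(header) + "\n" + ",".join(sample_row) + "\n")

events = list(read_events(sample_csv))
meta, counts = events[0]
print(len(header), meta["event_id"], len(counts), sum(counts))
```

To try this on the real data, pass an open file instead of the in-memory sample, for example open("events.csv", newline=""), assuming the file has been downloaded next to the notebook.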

CODE

The code is shown as a Jupyter notebook and will be updated continually (the update info is below).

data preprocessing code

Update info:

5.23

  • Read events.csv and take a general look at it.
  • Reduce the high-dimensional dataset to 2-D and visualize it with matplotlib.
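
The notes above do not say which method is used for the reduction to 2-D. As one possible approach, the sketch below uses PCA (computed with NumPy's SVD) and a matplotlib scatter plot. PCA is an assumption here, not necessarily what the notebook does, and the function names pca_2d and plot_2d, the output file name, and the random input data are all hypothetical stand-ins.

```python
import numpy as np


def pca_2d(features):
    """Project the rows of `features` onto their first two principal components."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def plot_2d(points, path="events_2d.png"):
    """Save a scatter plot of 2-D points to `path`."""
    # Imported here so the projection can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(points[:, 0], points[:, 1], s=5)
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
    fig.savefig(path)
    plt.close(fig)


# Synthetic stand-in for the 101 word-stem count columns (not real data).
rng = np.random.default_rng(0)
counts = rng.poisson(lam=1.0, size=(200, 101))

points = pca_2d(counts)
print(points.shape)  # (200, 2)
```

Calling plot_2d(points) then writes the scatter plot to a PNG file. With the real data, counts would be replaced by the count columns read from events.csv.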