Alvin's Big Data Notebook : Identify Web Session for Clustering

A session represents an episode of interaction between a user and a Web server. It consists of the pages the user accessed during the episode and the time spent on each of them.

Identify sessions as follows.

1. Read a record from the Web server log file.
2. Parse the record to get IP address, URL of the page, and time and date of the request.
3. If the page is a background image file, it is discarded. Image files can be detected by their file name extensions, e.g. gif or jpeg.
4. Determine which session the request belongs to from its IP address.
5. If the time elapsed since the last request from that IP address is within max_idle_time, the page is appended to the session.
6. Otherwise, the session is closed and a new session is created for the IP address. The closed session is filtered using min_time and min_page, and then output.
7. The time spent on a page is estimated to be the difference between the page's request time and the time of the next request from the same client.
8. The time spent on the last page of each session is approximated by the average time of all other pages in the session.
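
The steps above can be sketched roughly as follows in Python. This is only an illustration under several assumptions that are not in the source: records are already parsed into (ip, url, timestamp in seconds) tuples sorted by time, the list of image extensions is a guess, min_time is read as the minimum total session time, and min_page as the minimum number of pages. The function name identify_sessions and the default values are hypothetical.

```python
# Assumed list of image extensions; the source only mentions gif and jpeg.
IMAGE_EXTENSIONS = (".gif", ".jpg", ".jpeg", ".png")


def identify_sessions(records, max_idle_time=1800, min_time=0, min_page=2):
    """Group parsed log records into sessions keyed by IP address.

    records: iterable of (ip, url, timestamp_seconds), sorted by timestamp.
    Returns a list of (ip, [(url, seconds_spent), ...]).
    """
    open_sessions = {}  # ip -> list of (url, timestamp)
    closed = []

    def close(ip):
        requests = open_sessions.pop(ip)
        # Step 7: time on a page = gap until the next request in the session.
        times = [nxt[1] - cur[1] for cur, nxt in zip(requests, requests[1:])]
        # Step 8: last page gets the average of the others (0 if it is alone).
        last = sum(times) / len(times) if times else 0.0
        pages = [(url, t) for (url, _), t in zip(requests, times + [last])]
        total = sum(t for _, t in pages)
        # Step 6: filter by min_page and min_time (assumed meanings, see above).
        if len(pages) >= min_page and total >= min_time:
            closed.append((ip, pages))

    for ip, url, ts in records:
        # Step 3: discard image requests by file name extension.
        if url.lower().endswith(IMAGE_EXTENSIONS):
            continue
        current = open_sessions.get(ip)
        # Steps 5 and 6: close the session if the idle gap is too long.
        if current and ts - current[-1][1] > max_idle_time:
            close(ip)
        open_sessions.setdefault(ip, []).append((url, ts))

    # Flush the sessions that are still open at the end of the log.
    for ip in list(open_sessions):
        close(ip)
    return closed
```

Keeping one open session per IP address in a dictionary means the log is read only once; a real log would also need a parsing step for its specific format, which is left out here.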

Generalization of Sessions.

A straightforward solution is to represent each session as a vector over all pages, padding zeros for pages not in the session. However, the number of pages is large, so these vectors have very high dimensionality.

In attribute-oriented induction, the simple page in each session is replaced by its corresponding general page in the page hierarchy. Duplicate pages are then removed with their times added together.
As a result, a generalized session can then be represented by a vector, (session id, t1, t2, ..., tn), where ti is the total time the user spent on the i-th general page and its descendants.
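
A minimal sketch of this generalization step, in Python, might look like the following. The source does not say how the page hierarchy is stored, so this example assumes, purely for illustration, that the URL path acts as the hierarchy and that a general page is the first `level` path components. The names general_page and generalize_session are hypothetical, and the session id is left out of the returned vector.

```python
def general_page(url, level=1):
    """Map a page to its ancestor at the given level of the (assumed) URL hierarchy."""
    parts = [p for p in url.split("/") if p]
    return "/" + "/".join(parts[:level])


def generalize_session(pages, general_pages, level=1):
    """Turn a session into the time vector (t1, ..., tn).

    pages: list of (url, seconds_spent) for one session.
    general_pages: ordered list of the n general pages.
    Pages whose ancestor is not in general_pages are ignored.
    """
    totals = dict.fromkeys(general_pages, 0.0)
    for url, seconds in pages:
        ancestor = general_page(url, level)
        if ancestor in totals:
            # Duplicate general pages are merged by adding their times.
            totals[ancestor] += seconds
    return [totals[g] for g in general_pages]


session = [
    ("/products/tv.html", 10),
    ("/products/radio.html", 20),
    ("/support/faq.html", 15),
]
vector = generalize_session(session, ["/products", "/support", "/news"])
# vector == [30.0, 15.0, 0.0]
```

Raising `level` keeps more of the path and so yields more, finer-grained general pages; lowering it merges more pages into the same entry.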

The page hierarchy should be constructed according to the semantics of the pages. If the generalization level is set too high, over-generalization may occur, in which too many details are lost. On the other hand, if the level is set too low, the dimensionality may be too large.

Cited from: http://academic.csuohio.edu/fuy/Pub/lnai00.pdf

Saturday, 20 September 2014
