Web Usage Mining applies data mining techniques to the usage of Web resources, as recorded in Web server logs or other logs of requested URLs (plus, possibly, further parameters).
Data mining originated in the search for previously unknown regularities in customers' market baskets. Many classical mining methods were developed for this goal, such as association rules, sequential patterns, classification and clustering.
Data mining helps people discover what is hidden in very large data sources. In this sense, it helps to reduce the volume and complexity of information. Data mining is largely an unsupervised machine learning approach that relies on little or no background knowledge.
On the other hand, the growing variety and abundance of information widens the range of applications for data mining technology, but at the same time many new difficulties arise. The WWW is a new data source for data mining, and it differs fundamentally from traditional data in its meaning, organization, structure and storage. New mining methods are therefore needed.
We are currently working on usage mining from the WWW along three lines:
- To define new usage patterns and mine them from Web server logs. Users, time and pages are the three key parameters for defining usage patterns. Traditional data mining patterns are known to be unsuitable for the WWW environment. Patterns such as association rules and sequential patterns cannot fully capture user behavior on the WWW, especially revisits and backward navigation, which occur because of the links between pages. Therefore, by combining traditional mining methods with methods from other application fields, such as tree structures, directed graphs and undirected graphs, we are trying to mine patterns such as Maximum Forward Path, Circle Path, Parallel Path and Directed Tree. We also aim to define and mine more complex usage patterns from Web server logs.
- To define new usage patterns and mine them from Real Stream server logs. Real Stream server logs differ from common Web server logs in their data format and record much richer information about user behavior. Our mining works on the stream logs of the tele-Lectures in tele-TASK. Our goal is to define students' behavior patterns on the streamed lectures and to mine these patterns.
- To give a unified definition and representation format for different usage patterns. The different mined patterns contain redundant and ambiguous information. This problem can be addressed from two sides: on the one hand, we can use Fuzzy and Rough Set theory to build a value-judgment standard for the patterns, and then merge or split the mined pattern sets accordingly; on the other hand, we can describe the meta-data in a "meaningful" way that unifies the meta-data with the mined patterns in format and semantics, so that semantic information can be extracted during or after pattern mining.
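
The forward-path pattern named in the first item above can be illustrated with a small sketch. The Python below is not the project's implementation; it follows the commonly described maximal-forward-reference idea, in which a backward move (a return to a page already on the current path) closes the current forward path. It assumes that a session has already been extracted from the server log as an ordered list of page identifiers for one user. The function name `maximal_forward_paths` and the single-letter page names are hypothetical.

```python
from typing import List, Sequence


def maximal_forward_paths(session: Sequence[str]) -> List[List[str]]:
    """Split one user session into maximal forward paths.

    Assumption: `session` is the ordered list of pages requested by a
    single user in a single visit (session extraction is not shown).
    """
    paths: List[List[str]] = []
    current: List[str] = []      # pages on the current forward path
    moving_forward = False       # True if the last move extended the path

    for page in session:
        if current and page == current[-1]:
            # Reload of the same page: neither forward nor backward.
            continue
        if page in current:
            # Backward move: the user returned to an earlier page.
            if moving_forward:
                # The path up to here cannot be extended any further.
                paths.append(list(current))
            # Cut the path back to the revisited page.
            current = current[: current.index(page) + 1]
            moving_forward = False
        else:
            # Forward move: extend the current path.
            current.append(page)
            moving_forward = True

    # The session may end in the middle of a forward path.
    if moving_forward and current:
        paths.append(list(current))
    return paths


if __name__ == "__main__":
    # Hypothetical session with backward moves (D -> C -> B, H -> G, ...).
    session = list("ABCDCBEGHGWAOUOV")
    for path in maximal_forward_paths(session):
        print("".join(path))
```

For the hypothetical session in the usage example, the sketch yields the paths ABCD, ABEGH, ABEGW, AOU and AOV. Patterns such as Circle Path or Parallel Path would need a different treatment of the backward moves that are simply discarded here.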