Web Mining

By Juan C. Dürsteler

Web mining aims to discover interesting patterns in the structure, contents and usage of web sites. An indispensable tool for the webmaster, it nevertheless has a long road ahead, one in which visualisation plays an important role.

In issues 164, 165 and 166 we discussed Customer Relationship Management (CRM) and saw the importance of detecting user behaviour patterns. We also discussed the relevance of information visualisation in presenting the results. There we approached the subject from the perspective of the customers themselves (how do I find and select what I want?) and from the perspective of the business manager (what do my customers prefer, and how do they behave?).

Nevertheless, if we put ourselves in the shoes of the webmaster, understood as the person responsible for the web site and its architecture, we'll see that it's crucial to know the real structure of the site, its contents and the use the customers make of it. It may seem absurd to think that webmasters don't know the structure of their own site, especially if they contributed to its creation. I can certify from my own experience that the website one has in mind, or even in the documentation, is usually not exactly the same as the real thing, mainly due to errors and misinterpretations, especially in large websites.

Web mining can be defined as the integration of the information gathered by traditional data mining methods and techniques with information related to the web. In a simplified way we could say that it’s data mining adapted to the particularities of the web.

Conceptual Map of Web Mining.
Source: map created by the author with IHMC CmapTools v3.10

As Patricio Galeas explains on his web page about web mining, its scope covers mainly three areas within the field of knowledge discovery:

  • Web Structure Mining (WSM)
    This speciality aims to reveal the real structure of web sites by gathering structure-related data, mainly about their connectivity. Typically it takes into account two types of links: static and dynamic.
  • Web Content Mining (WCM)
    Its goal is to gather data and identify patterns related to the contents of the web and the searches performed on them. There are two main strategies:

    • Web page mining, which extracts patterns directly from the contents of web pages. In this case the data in use can be:
      • Free text
      • HTML pages
      • XML pages
      • Multimedia elements
      • Any other type of contents existing in the web site.
    • Search results mining, which aims to identify patterns in the results of search engines.
  • Web Usage Mining (WUM)
    Here the goal is to dive into the server records (log files) that store the transactions performed on the web site, in order to find patterns revealing the use the customers make of it: for example, the most visited pages or the usual visiting paths. We can also distinguish here:

    • General access pattern tracking. Here the interest lies not in the access patterns of a particular visitor but in their aggregation into trends that allow us to restructure the web site in order to make it easier for our customers to access and use.
    • Customised access pattern tracking. Here what we look for is data about the behaviour of individual visitors and their interaction with the website. This way we can establish access/purchase profiles so that we can offer a customised experience to every customer. An archetypal case is amazon.com and its purchase advice and suggestions.
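
As a rough illustration of Web Usage Mining, the following Python sketch counts the most visited pages and the most common page-to-page steps in a server log. It rests on several assumptions that are not part of the original discussion: the log is in Common Log Format, each host stands for one visitor, and the sample lines, hosts and paths are invented for the example. The aggregated counts correspond to general access pattern tracking, while the per-visitor paths hint at the customised variant.

```python
import re
from collections import Counter, defaultdict

# Hypothetical log lines in Common Log Format; hosts and paths are invented.
SAMPLE_LOG = """\
10.0.0.1 - - [10/Oct/2005:13:55:36 +0200] "GET /index.html HTTP/1.0" 200 2326
10.0.0.1 - - [10/Oct/2005:13:56:01 +0200] "GET /products.html HTTP/1.0" 200 5120
10.0.0.2 - - [10/Oct/2005:13:56:10 +0200] "GET /index.html HTTP/1.0" 200 2326
10.0.0.1 - - [10/Oct/2005:13:57:12 +0200] "GET /cart.html HTTP/1.0" 200 1044
10.0.0.2 - - [10/Oct/2005:13:58:40 +0200] "GET /products.html HTTP/1.0" 200 5120
10.0.0.3 - - [10/Oct/2005:13:59:02 +0200] "GET /missing.html HTTP/1.0" 404 312
"""

# host, two ignored fields, timestamp, request line, status code, size
LINE_RE = re.compile(
    r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST) (\S+) [^"]*" (\d{3}) \S+$'
)


def parse_log(text):
    """Return (host, path) pairs for successful (2xx) requests, in log order."""
    hits = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group(3).startswith("2"):
            hits.append((match.group(1), match.group(2)))
    return hits


def page_counts(hits):
    """General pattern: how often each page was requested."""
    return Counter(path for _, path in hits)


def transition_counts(hits):
    """General pattern: page-to-page steps, treating each host as one visitor."""
    last_page = {}
    transitions = Counter()
    for host, path in hits:
        if host in last_page:
            transitions[(last_page[host], path)] += 1
        last_page[host] = path
    return transitions


def visitor_paths(hits):
    """Customised pattern: the ordered list of pages seen by each visitor."""
    paths = defaultdict(list)
    for host, path in hits:
        paths[host].append(path)
    return dict(paths)


hits = parse_log(SAMPLE_LOG)
print(page_counts(hits).most_common(3))
print(transition_counts(hits).most_common(1))
print(visitor_paths(hits))
```

Identifying visitors by host is a deliberate simplification; a real site would more likely rely on sessions or cookies, and would also need to filter out robots and requests for images or style sheets before interpreting the counts.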

Web mining is a discipline with considerable potential. Despite the huge and growing number of web sites, the proportion of them using web mining tools to analyse their structure, contents and usage in order to improve the service to the user and the profitability of the business is still low.

On the other hand, web mining suffers from the same problem as information overload in general: we need visualisation tools to help us digest and interpret the many results it provides.

In forthcoming issues we’ll see the role that information visualisation is playing in this field.

Etzioni, O. (1996). “The World Wide Web: Quagmire or Gold Mine” Communications of the ACM, 39(11), 65-68.

Links of this issue:

http://www.infovis.net/printMag.php?num=164&lang=2 Num 164 about Customer Relationship Management (CRM)
http://www.infovis.net/printMag.php?num=165&lang=2 Num 165 about Customer Relationship Management (CRM) II
http://www.infovis.net/printMag.php?num=166&lang=2 Num 166 about Customer Relationship Management (CRM) III
http://www.galeas.de/webmining.html Patricio Galeas web page about web mining
http://www.infovis.net/printMag.php?num=105&lang=2 Num 105 Discovering the Knowledge

Source: Inf@Vis!