Semantic Web, information extraction, web scraping, web data extraction

Background

On the Web, nearly all pages are authored in some kind of markup language. Traditional markup tags exist to present content in Web browsers. For software to process the content automatically, additional metadata is necessary to describe what the content means.

Several approaches have emerged to convey the semantics of content. Microformats, one major approach, seek to re-use existing XHTML and HTML tags to convey metadata and other attributes.

Although the Microformats effort is an open community, it works more like a standards body. Any community member can publish a microformat, but it makes sense only when enough members accept it. It takes considerable time for a microformat to become a de facto standard. Unfortunately, the Web is growing rapidly and cannot wait for a new standard to be voted on. Furthermore, once a microformat has been accepted as a de facto standard, re-authoring the existing Web pages costs too much.

In fact, Web content authors annotate their published content with markup tags and attributes all the time. Most of the time, the annotations serve special presentation effects. For example, an author may give a content snippet a particular value of the HTML class attribute so that CSS displays it with a special background colour. Such display effects usually imply some semantics and are meant to make the content easier to understand. In summary, the existing annotations are worth mining for semantics.

Every author annotates published content freely to some degree. There should be an approach that recognizes and registers these free semantic annotations, and FreeFormat is such an approach. FreeFormat mines the semantics from the free annotations and formats them into semantic structures, which are managed and hosted on the Internet. As a result, a metadata repository, or knowledge base, is built up and acts as a semantics registration center. Anyone can publish semantic structures for their own content, and these structures can instruct third-party software to process the content automatically. As the repository grows, it can be viewed as a semantic layer covering the underlying native Web content.

FreeFormat, invented by GooSeeker, is intended as an effective approach to reformatting content on the Web. The MetaCamp server, implemented by GooSeeker, is a Web-based application that manages the metadata repository and provides collaborative tools with which community members share semantic structures.

Uses of FreeFormat

By content publishers

Most of the time, content publishers want to contribute to information and knowledge sharing on the Web. They want their published content to be widely processed and consolidated. FreeFormat is more convenient and cost-effective than other annotation approaches because the annotations do not have to be embedded in the content. Content publishers are free to choose publishing tools or media as they wish. For example, they can publish with any popular off-the-shelf content management system without re-coding the system's logic to embed semantic annotations. Furthermore, they are not forced to re-author legacy content, as embedded annotation approaches require.

By publishing the data schemas of their content to a central repository, called the MetaCamp base (still in closed alpha test), publishers can promote their content to more information consumers. The data schemas can be tagged, classified and published to all content consumers. Some recommendation tools are also provided to make the data schemas easier to find and reach.

FreeFormat has been implemented by MetaSeeker Toolkit V3.1.0. Content publishers can use MetaStudio, a client-side tool from the toolkit, to define and publish their data schemas.

By content consolidators

With the help of FreeFormat and the central repository, i.e. the MetaCamp base, content consolidators can save considerable effort because they no longer have to surf the Web to find the domain content they need. Using the online search and retrieval tools provided by the MetaCamp base, content consolidation software obtains the necessary metadata for the content from the central base. Consolidators can also optionally retrieve data extraction instruction files from the central base. These files direct a data extractor, e.g. DataScraper, which GooSeeker releases for free, to extract and reformat the Web content.

Technical overview

In the GooSeeker community, the word bucket denotes a structured container that specifies the data schema of a group of Web pages and stores the data extracted from those pages. A bucket resembles a set of shelves, each of which holds a piece of data with a specific meaning on a Web page. These shelves are called properties. All properties in a bucket are organized into a tree-like structure.
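
To make the bucket and property vocabulary concrete, the following is a minimal Python sketch of a bucket as a tree of properties. It is an illustration only, not GooSeeker's actual data model: the class names Property and Bucket, the paths helper and the product example are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Property:
    """One 'shelf' in a bucket: a named slot for a piece of data (hypothetical model)."""
    name: str
    value: Optional[str] = None
    children: List["Property"] = field(default_factory=list)

    def add(self, child: "Property") -> "Property":
        """Attach a child property and return it, so nested properties can be built."""
        self.children.append(child)
        return child

    def paths(self, prefix: str = "") -> Iterator[str]:
        """Yield the slash-separated path of this property and of every descendant."""
        path = f"{prefix}/{self.name}"
        yield path
        for child in self.children:
            yield from child.paths(path)


@dataclass
class Bucket:
    """A named container whose properties form a tree (hypothetical model)."""
    name: str
    root: Property


def build_example() -> Bucket:
    """Build an illustrative schema for a product page; the field names are made up."""
    root = Property("product")
    root.add(Property("title"))
    price = root.add(Property("price"))
    price.add(Property("amount"))
    price.add(Property("currency"))
    return Bucket("product_listing", root)


if __name__ == "__main__":
    for path in build_example().root.paths():
        print(path)
```

Walking the tree with paths() lists every property in depth-first order, which is one simple way to inspect a schema before any data is stored in it.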

Some metadata in HTML documents, e.g. tags and attributes, can be viewed as marks of semantic annotations. The FreeFormat approach recognizes these marks and links to them from outside the page, that is, from the semantic layer. For example, class and id, two HTML attributes, are often recognized as marks. In this paper, the word freeformat is often used to denote the marks recognized by FreeFormat.
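
The idea of treating class and id values as marks that an external schema points at can be sketched with Python's standard html.parser module. This is a simplified illustration under stated assumptions, not the algorithm MetaStudio or DataScraper uses: the MarkExtractor class, the extract function, the marks mapping format and the sample HTML are all hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}

# Assumed schema format: ("class" | "id", attribute value) -> property name.
Marks = Dict[Tuple[str, str], str]


class MarkExtractor(HTMLParser):
    """Collect text that sits inside elements carrying a registered mark."""

    def __init__(self, marks: Marks) -> None:
        super().__init__()
        self.marks = marks
        self.stack: List[Optional[str]] = []  # property name per open element
        self.data: Dict[str, List[str]] = {}

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attributes = dict(attrs)
        prop = None
        # A class attribute may hold several space-separated values.
        for cls in (attributes.get("class") or "").split():
            if ("class", cls) in self.marks:
                prop = self.marks[("class", cls)]
        element_id = attributes.get("id")
        if element_id and ("id", element_id) in self.marks:
            prop = self.marks[("id", element_id)]
        self.stack.append(prop)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        # Assign the text to the nearest enclosing element that carries a mark.
        for prop in reversed(self.stack):
            if prop is not None:
                self.data.setdefault(prop, []).append(text)
                break


def extract(html: str, marks: Marks) -> Dict[str, List[str]]:
    """Return the text found under each marked element, keyed by property name."""
    parser = MarkExtractor(marks)
    parser.feed(html)
    parser.close()
    return parser.data


if __name__ == "__main__":
    page = ('<div id="item"><h2 class="title">Blue Widget</h2>'
            '<span class="price hot">9.99</span><p>unmarked</p></div>')
    schema = {("class", "title"): "title", ("class", "price"): "price"}
    print(extract(page, schema))
```

The page itself is left untouched: the mapping from marks to property names lives outside the HTML, which mirrors how the semantic layer is described above. Text with no enclosing mark is simply ignored in this sketch.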

Unfortunately, not every piece of content on an ordinary Web page is marked with meaningful metadata. For example, some textual snippets are just a TEXT node enclosed by some kind of HTML element, e.g. a DIV. That is, some properties in a bucket may be raw, with no meaningful mark attached. A proprietary algorithm has been implemented to locate a semantic block, held by a bucket, as a whole. MetaStudio, one of the collaborative tools provided by GooSeeker, provides a GUI that helps users define data schemas and locate semantic blocks on Web pages. After a data schema has been defined, MetaStudio generates a series of instruction files, which are XML programs, and uploads them to the MetaCamp base. Auto-processing software can use these files to extract data from the Web and process it further.

For details on how to define a data schema, please refer to http://www.gooseeker.com/en/node/document/metastudio/operationv3/steps.

Products

MetaStudio: a collaborative tool for describing data schemas. It is released as a Firefox extension and is free.
DataScraper: a data scraper that continuously extracts data from the Web, following the data extraction instruction files generated by MetaStudio. It is released as a Firefox extension and is free.
MetaCamp base: a Web application that stores and manages data schemas.
MetaCamp service console: an operation support system for MetaCamp services.

Thursday, May 14, 2009
