Thursday, May 14, 2009

Background

On the Web, nearly all pages are authored in some kind of markup language. Traditional markup tags serve to present content in Web browsers. For software to process the content automatically, additional metadata are necessary to describe what the content is about.


A few approaches have emerged to convey the semantics of content. Microformats, one major approach, seek to reuse existing XHTML and HTML tags to convey metadata and other attributes.


Although Microformats is an open community, it operates more like a standards body. A microformat can be published by any community member, but it becomes meaningful only once enough members have adopted it, and turning it into a de facto standard takes considerable time. Unfortunately, the Web is booming and cannot afford to wait for a new standard to be ratified. Furthermore, once a microformat has become a de facto standard, re-authoring the existing Web pages to use it costs far too much.


In fact, every Web content author has been annotating published content with markup tags and attributes all along. Most of the time these annotations serve special presentation effects. For example, an author may annotate a content snippet with a particular value of class, an HTML attribute, so that CSS displays it against a special background colour. Such display effects almost always imply some semantics and attempt to make the content more understandable. In short, the existing annotations are worth mining for semantics.


Every author annotates published content freely to some degree, so there should be an approach that recognizes and registers these free semantic annotations. FreeFormat is exactly that approach. FreeFormat mines semantics from the free annotations and formats them into semantic structures that are managed and hosted on the Internet. The result is a metadata repository, or knowledge base, that acts as a semantics registration centre. Anyone can publish semantic structures for their own content, and those structures can instruct third-party software to process the content automatically. As the repository grows larger and larger, it can be viewed as a semantic layer covering the underlying native Web content.


FreeFormat, invented by GooSeeker, is intended to be an effective approach to reformatting content on the Web. The MetaCamp server, implemented by GooSeeker, is a Web-based application that manages the metadata repository and provides collaborative tools with which community members can share semantic structures.


Uses of FreeFormat

By content publishers

Most of the time, content publishers want to contribute to information and knowledge sharing on the Web; they want their published content to be widely used and consolidated. FreeFormat is more convenient and cost-effective than other annotation approaches because the annotations need not be embedded in the content. Publishers remain free to choose publishing tools and media at will. For example, they can publish with any off-the-shelf content management system without re-coding its logic to embed semantic annotations, and they are not forced to re-author legacy content, as embedded-annotation approaches require.


By publishing the data schemas of their content to a central repository, called the MetaCamp base (still in closed alpha test), publishers can promote their content to more information consumers. The data schemas can be tagged, classified, and published to all content consumers. Recommendation tools are also provided to make the data schemas more visible and reachable.


FreeFormat has been implemented in MetaSeeker Toolkit V3.1.0. Content publishers can use MetaStudio, a client-side tool from the toolkit, to define and publish their data schemas.


By content consolidators

With the help of FreeFormat and the central repository, i.e. the MetaCamp base, content consolidators can save a great deal of effort otherwise spent surfing the Web for the domain content they need. Using the online search and retrieval tools provided by the MetaCamp base, consolidation software obtains the necessary metadata for the content from the central base. From the same base, consolidators can optionally retrieve data extraction instruction files, which direct a data extractor, e.g. DataScraper, released by GooSeeker for free, to extract and reformat the Web content.
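
To make this flow concrete, here is a minimal Python sketch of how consolidation software might fetch an instruction file from the central base; the endpoint URL and the file name are hypothetical, invented only for illustration:

    import requests  # an assumed HTTP client; the endpoint below is not a real MetaCamp URL

    # Hypothetical URL: the actual MetaCamp retrieval interface is not documented here.
    INSTRUCTION_URL = "https://example.org/metacamp/instructions/news_item.xml"

    def fetch_instruction_file(url, path):
        """Download a data extraction instruction file (an XML program) from the central base."""
        resp = requests.get(url)
        resp.raise_for_status()
        with open(path, "wb") as out:
            out.write(resp.content)

    # The downloaded XML program would then direct a data extractor such as
    # DataScraper to extract and reformat the target Web pages.
    fetch_instruction_file(INSTRUCTION_URL, "news_item.xml")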


Technical overview

In the GooSeeker community, the word bucket denotes a structured container that specifies the data schema of a group of Web pages and stores the data extracted from those pages. A bucket is much like a set of shelves, each of which represents one piece of data with a specific meaning on a Web page. The word property names these shelves. All properties in a bucket are organized into a tree-like structure.
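
To illustrate the idea (this sketch is not GooSeeker's implementation, whose internals are not public), a bucket with its tree of properties might be modelled in Python as follows; the schema for a hypothetical news page is invented for the example:

    from dataclasses import dataclass, field

    @dataclass
    class Property:
        """One 'shelf' in a bucket: a named piece of data with optional children."""
        name: str
        value: str = ""
        children: list = field(default_factory=list)

    # A bucket for a hypothetical news page: its properties form a tree.
    bucket = Property("news_item", children=[
        Property("title"),
        Property("author"),
        Property("body", children=[Property("paragraph")]),
    ])

    def walk(prop, depth=0):
        """Print the property tree, mirroring the tree-like bucket structure."""
        print("  " * depth + prop.name)
        for child in prop.children:
            walk(child, depth + 1)

    walk(bucket)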


Some metadata in HTML documents, e.g. tags and attributes, can be viewed as marks of semantic annotation. The FreeFormat approach recognizes these marks and links them to annotations from the outside, that is, from the semantic layer. For example, class and id, two HTML attributes, are often recognized as such marks. In this paper, the word freeformat denotes the marks recognized by FreeFormat.
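
As a minimal sketch of how such marks can anchor annotations kept outside the page (the HTML fragment and the mark-to-meaning mapping are invented, and this is not GooSeeker's actual recognition algorithm):

    from lxml import html

    # Hypothetical page fragment: the class values act as freeformat marks.
    page = '''<div id="review">
                <span class="reviewer">Alice</span>
                <span class="rating">4/5</span>
              </div>'''

    # An external semantic layer maps marks to meanings without touching the page.
    semantic_layer = {"reviewer": "person who wrote the review",
                      "rating": "score given by the reviewer"}

    doc = html.fromstring(page)
    for mark, meaning in semantic_layer.items():
        for node in doc.xpath(f'//*[@class="{mark}"]'):
            print(f"{mark} ({meaning}): {node.text.strip()}")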


Unfortunately, not every piece of content on an ordinary Web page is marked with meaningful metadata. For example, some textual snippets are expressed simply as a TEXT node wrapped in some HTML element, e.g. a DIV; that is, some properties in a bucket may be raw. A proprietary algorithm has been implemented to locate a semantic block, held by a bucket, as a whole. MetaStudio, one of the collaborative tools provided by GooSeeker, offers a GUI that helps users define data schemas and locate semantic blocks on Web pages. Once a data schema has been defined, MetaStudio generates a series of instruction files, XML programs, and uploads them to the MetaCamp base. Auto-processing software can use these files to extract data from the Web and manipulate it further.
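
The proprietary locating algorithm is not public, but the general idea of addressing a raw, unmarked TEXT node relative to a located semantic block can be sketched as follows; the page fragment and the relative XPath are invented for illustration:

    from lxml import html

    # The date below carries no class or id mark: it is a raw property.
    page = '''<div class="product">
                <span class="name">Widget</span>
                <div>2009-05-14</div>
              </div>'''

    doc = html.fromstring(page)
    # Step 1: locate the semantic block as a whole, anchored by a freeformat mark.
    block = doc.xpath('//div[@class="product"]')[0]
    # Step 2: address the raw property relative to the block, by position.
    raw_date = block.xpath('./div[1]/text()')[0]
    print(raw_date)  # -> 2009-05-14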


For instructions on defining a data schema, see http://www.gooseeker.com/en/node/document/metastudio/operationv3/steps.


Products

  • MetaStudio: a collaborative tool for describing data schemas. It is released as a Firefox extension and is free.
  • DataScraper: a data scraper that continuously extracts data from the Web, directed by the data extraction instruction files generated by MetaStudio. It is released as a Firefox extension and is free.
  • MetaCamp base: a Web application that stores and manages data schemas.
  • MetaCamp service console: an operation support system for MetaCamp services.


Further reading

What is Baidu's Aladdin platform good for?

I just saw a Sina news item reporting that Baidu's Aladdin platform has gone live. I have been following it ever since the name was first floated. According to the original publicity, the platform is mainly for indexing the dark information on the Web, and the name "Aladdin" certainly whetted everyone's appetite. After visiting the site, however, I looked high and low and found no special means of mining dark content. Its functionality appears to be just sitemaps plus category tags, and I saw no tool for exporting a site map into the XML document it requires; if you want to submit a sitemap to Google, there are plenty of sitemap generators on the Web.

Judging from the introductory information on its site, its capability and sophistication lag far behind the social semantic web represented by MetaSeeker, which is based on FreeFormat technology. That is the new direction for the semantic web, and the practical one: network users describe the information structures of Web content, formatting the Web into a database and promoting information sharing and integration; yesterday I saw someone call this crowdsourcing. As for mining dark information, it falls even further behind tools and services such as MetaSeeker that grew out of information extraction.

Still, given Baidu's clout, one may conclude that this is probably just a stone thrown to test the waters, with the trump card yet to come. I will keep watching.

I would very much like to try it, but the process looks too cumbersome, nowhere near as easy as submitting a sitemap to Google. If it is merely a sitemap, the trouble is not worth it.

The community and social nature of FreeFormat technology

The article "What is FreeFormat" briefly described the technical approach. This article goes further and explains the value of the FreeFormat approach, namely its so-called community or social nature, and how it can change the purpose of Web information extraction so that extraction gives back to the greater Internet community.

The World Wide Web has brought enormous change to how people manage information and knowledge. People have gradually grown used to looking for knowledge and for answers to their questions on the Web; what once required hours or more of leafing through printed material may now take only minutes. If the content of the Web could be further processed by computers, it would certainly create even greater value. With their current algorithms and capabilities, however, computers cannot read and understand Web content the way people do. Artificial intelligence is one solution, but judging by current research progress its goal remains distant. There is another, more practical solution: restructure the existing content of the Internet. First aggregate the metadata describing the semantic structures of Web content, then use that metadata to extract the content and transform it into structured data. In this way today's unstructured information becomes structured information, like that in a relational database, making further processing by computers possible.

In fact, extracting information from the Web began as early as the last century. As Web content grew, people naturally wanted computers to process this information further and create more value. Information extraction algorithms have emerged one after another, and as computing technology developed, older algorithms were re-armed with new languages and techniques and became far more capable. Essentially, though, the foundation of all these algorithms and techniques has not changed. Web content is presented to people as HTML documents; even when the server uses advanced dynamic page technologies, the client browser still faces an HTML document. Almost all information extraction algorithms and techniques exploit the tags in HTML documents, using string regular expressions or DOM traversal to extract information from specified positions in the document. With newer technologies such as XPath, XSLT, and XQuery, the efficiency and power of information extraction have improved greatly, but there has been no qualitative change. The limitations mainly show in the three respects detailed below.
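
Before turning to those limitations, here is a minimal Python sketch, not from the original article, contrasting the two traditional techniques just mentioned, string regular expressions and DOM/XPath traversal; the HTML fragment and the class name are invented for the example:

    import re
    from lxml import html

    # A hypothetical page fragment; the class name "price" is an invented example.
    page = '<html><body><div class="item"><span class="price">$9.99</span></div></body></html>'

    # Technique 1: a string regular expression, brittle against any markup change.
    match = re.search(r'<span class="price">([^<]+)</span>', page)
    print(match.group(1) if match else None)  # -> $9.99

    # Technique 2: DOM traversal with XPath, more robust but still tied to page structure.
    doc = html.fromstring(page)
    print(doc.xpath('//span[@class="price"]/text()'))  # -> ['$9.99']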

The application of information extraction technology is fragmented

Information extraction is applied everywhere: almost every company and individual in the Internet field needs, to a greater or lesser degree, to extract information from existing Web content. Programming languages have grown so powerful that writing a piece of extraction code may take only a few hours, so nearly all of these companies and individuals have at some point written custom extraction code. Although each individual's development investment is small, the total investment across the whole industry is enormous, and this fragmented state therefore wastes considerable resources.

Information extraction work cannot be inherited or accumulated

Because extraction systems are developed in this fragmented way, no individual's work can be inherited or accumulated. Extraction code is written for a specific purpose, and as the application scenario changes the code cannot technically be reused: when a target page changes, new code is needed for the new document structure; when a developer joins a new project, the changed environment makes it likely that a brand-new extractor will be written; when a developer is replaced, the successor will very likely discard the predecessor's work and start over. The inability to inherit results is a loss in itself, and together with the cost of duplicated development it eats into corporate profits. Scaled up to the whole industry, or to society as a whole, the total loss is enormous, which runs counter to the philosophy of the Web.

Information extraction is often unwelcome

Information extraction is usually seen as information grabbing. In fact its underlying technology has much in common with that of today's search services; for example, the front ends of both use the same Web crawler techniques. However, many extraction activities differ from search services in the purposes and methods of further processing and using the information. Compared with search services, many extraction activities give nothing back to the Internet industry or to society and create no added value; many are one-directional, and some even copy other people's content illegally. As a result, information extraction is often resisted.

These weaknesses severely harm the application and development of information extraction. The crux of the problem is that the existing technology runs counter to the philosophy of the Web. The Web is a globally shared knowledge base: people spontaneously create content and add it to the knowledge base while enjoying the fruits of others' creations, a value-adding feedback loop. Evidently, current information extraction technology takes no part in this loop.

The FreeFormat approach and tools solve this problem: information extraction is no longer a controversial grabbing activity but an active participant in the Web's value-adding feedback loop. With the FreeFormat approach and tools, network users can take part in defining and sharing the semantic structures of Web content. As participation grows, the defined structures link up into a semantic network. With these structures, Web content can be effectively restructured, and because the structures are shared and published, the waste caused by duplicated work is greatly reduced. Network users no longer need to write large numbers of custom extraction programs; using this approach and its tools, they first search for an existing shared semantic structure that meets their needs, and if one exists they only need to generate their own extraction instructions.
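
To make the search-then-reuse flow concrete, here is a minimal Python sketch; the repository URL, the search endpoint, and the JSON fields are all hypothetical, invented only to illustrate the workflow described above:

    import requests  # an assumed HTTP client; the endpoint below is not a real MetaCamp URL

    REPO = "https://example.org/metacamp"  # placeholder URL for a MetaCamp-like repository

    def find_schema(keyword):
        """Search the shared repository for data schemas matching a keyword."""
        resp = requests.get(f"{REPO}/schemas", params={"q": keyword})
        resp.raise_for_status()
        return resp.json()  # assumed to be a list of schema descriptors

    schemas = find_schema("used car listings")
    if schemas:
        # Reuse an existing shared semantic structure: only the extraction
        # instructions need to be generated by this consumer.
        print("reusing shared schema", schemas[0]["id"])
    else:
        print("no shared schema found; define and publish a new one")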

Why implement Web information extraction based on FreeFormat technology

Having worked in Web information extraction for many years, I have lived through several waves: vertical search, social networking, mashups, MEME recommendation engines, and more. Each wave required large numbers of information extraction and page scraping tools. After years of custom development work I realized this field is truly evergreen: founding any of the sites above means spending heavily on extracting data. So in 2007 I began developing a general-purpose Web information extraction tool, hoping to help Internet entrepreneurs concentrate on their core business.

Years in this field have also taught me something about how such websites are run. In reality, the odds of these services conquering the market on technology alone are slim; mastering information extraction, MEME tracking, or other information processing techniques does not by itself make a service successful. Over the years I have watched too many companies rise and collapse, including once-celebrated newcomers, for instance in the now-struggling vertical search field. In fact, as the Internet grows, this field takes on more and more of the character of a media business. As a programmer curious about the media industry, I read several books on media economics and the history of the media business, and benefited greatly.

Back when vertical search was just taking off, I tried my hand at it on a wave of enthusiasm, and discovered how much my all-round competence was lacking. In matters of business management, e.g. managing capital, business development, partnerships, and marketing communications, a technical person alone cannot cope. It reminded me of the warning from the general manager when I held a senior position at a company: you must learn to play the piano with all ten fingers. It also brought to mind the book 《十年》 (Ten Years); I felt that someone who can run a television segment well might well be able to run a website.

After entering middle age, I came to feel what being "free of doubts" really means: no longer doubting one's own talents, no longer patching the short planks of the barrel, but putting the long planks to use and counting on benefactors to cover the short ones. My own pleasure lies in designing and developing new software, and here I can indulge it fully. So I distilled many years of Internet work experience into the current product, and after continuously investing several million yuan and more than two years, I am offering it free of charge to the brave souls determined to build businesses on the Internet.

The product has now reached version V3. Beyond solving the problem of extracting Web information efficiently and at low cost, it is gradually moving toward the semantic web, in the hope of giving back to the greater Internet community. Hence the concept of FreeFormat, which aims to avoid the duplicated investment in information extraction scattered across the Internet industry and to organize and share Internet content in a way that gives back to the community.

What is a semantic search engine — reading notes

I recently read Leigh Dodds's article Streams, Pools and Reservoirs, and it was eye-opening. Leigh Dodds argues that a semantic search engine and a semantically enabled search engine are two different things. He reaches this conclusion by reviewing the history of how Web content has been organized and retrieved; by analogy with the Web's earlier historical stages, he looks ahead to the characteristics of a semantic search engine built on the linked data cloud. Below I summarize the article's main points together with my own thoughts.


A review of the history of Web content organization and retrieval

The evolution of the Web can be summarized in the following stages:

  1. Content is published online; as the volume of data grows, Web content is organized with a kind of classified directory. I suspect the author means something like Yahoo's early directory.
  2. Search engines index content automatically; the author uses the phrase "create a link-base".
  3. Search engines differentiate themselves and offer value-added services.


An outlook for organizing and retrieving structured content (data sets)

Leigh Dodds believes we are already in the late part of a stage analogous to the first stage above: a large amount of structured data is described in RDF, and there is the LOD (Linking Open Data) project; semantic search engines that link data sets together are about to appear.

He describes the current stage as one in which the relationships and links between data sets are still largely maintained by hand. Quoting the original:

Not in the sense that members of the LOD community are manually entering data to link datasets together, but rather at the level of looking for opportunities to link together datasets, encouraging data publishers to co-ordinate and inter-relate their data, and by attempting to organically grow the link data web by targeting datasets that would usefully annotate or extend the current Linked Data Cloud.

Leigh Dodds therefore predicts that semantic search means automatically linking and organizing data sets, and he distinguishes a semantic search engine from a semantically enabled search engine.

In his view, a semantically enabled search engine is one that would

use techniques like natural language parsing and improved understanding of document semantics in order to provide an improved search experience for humans

whereas a semantic search engine is:

A Semantic Web search engine should offer infrastructure for machines. Simple semantic web search engines like Swoogle and Sindice provide a way to for machines to construct '''link bases''', based on some simple expressions of '''what data is of relevance''', in order to find data that is of interest to a particular user, community, or within the context of a particular application. And crucially this can be done without having to always crawl or navigate over the entire linked data web. This process can be commoditised just as it has with the web of documents

Thoughts

When I set about developing the MetaSeeker toolkit two years ago, this voice was not the mainstream; at the time most people put their emphasis on semantic recognition. I chose a different direction not out of superior vision but because, with an old programmer's modest skills, exploring artificial intelligence or ontology was out of the question. I preferred to build a practical tool that would let people building vertical search and social networking sites extract Web data at low or even zero cost. So I took the road of structuring Web content. In fact this road is not simple either; for instance, for organizing and building the data relevance mentioned in the article, I have still not found a truly effective method.



A comparison of ordinary search and semantic search

The author makes the comparison in a table, transcribed below:

Document Web          | Semantic Web Infrastructure                 | Description
Google Image Search   | Type Searching                              | Ability to discover resources of a particular type: e.g. Person, Review, Book
Google Translate      | Vocabulary Normalisation                    | Application of simple inferencing to expose data in more vocabularies than made available by the publisher
Google Custom Search  | Community Constructed Data Sets and Indexes | Ability to create and manipulate custom subsets of the linked data cloud
Google Trends         | Linked Data Analysis & Publishing Trends    | Identifying new data sources; new vocabularies; clusters of data; data analysis

From the perspective of development and implementation I did not really understand this table, that is, what the outward characteristics of the implemented target system should look like. Leigh Dodds says the last two rows mean:

to be able to easily aggregate, combine and analyse aspects of the linked data cloud

I do not understand how that would be implemented.



A reference implementation

Leigh Dodds uses the metaphor of flowing water: in an ideal Web environment, data should likewise flow, from stream to pool to reservoir, and he argues that an infrastructure is needed to keep the data flowing smoothly. As an example he cites the Talis Platform, which he believes has established an ecosystem for building data reservoirs, as discussed in his article Enabling the Linked Data Ecosystem. The system's features:

  • RSS, likened to a data stream; Leigh Dodds calls this the core search service, which I do not understand: it ought to be content push, so how can it be search?
  • SPARQL, for querying the data (see the sketch after this list)
  • Augmentation service, which adds metadata to the content pushed via RSS
  • store groups, which organize multiple data sets into larger entities
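
As an aside that is not in the original article, here is a minimal sketch of what a SPARQL query over a small RDF graph looks like, using Python's rdflib; the namespace, triples, and query are invented purely for illustration:

    from rdflib import Graph, Literal, Namespace

    # A tiny in-memory RDF graph; the namespace and triples are invented examples.
    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.book1, EX.title, Literal("Streams, Pools and Reservoirs")))
    g.add((EX.book1, EX.author, Literal("Leigh Dodds")))

    # Query the graph with SPARQL.
    results = g.query("""
        PREFIX ex: <http://example.org/>
        SELECT ?title WHERE { ?book ex:author "Leigh Dodds" ; ex:title ?title . }
    """)
    for row in results:
        print(row.title)  # -> Streams, Pools and Reservoirs
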
Thoughts

It seems I should study the Talis Platform's store groups feature to improve MetaSeeker, the shareware I have just released; perhaps that is exactly the construction of the data relevance I mentioned earlier.

What I can do for semantic web technology

Holding overly idealistic expectations of life is not a good thing. A few years ago I was drawn to semantic web technology; profound theoretical research is beyond me now, so I only want to build something small and practical.

Earlier I felt that Microformats were quite good, while also sensing that there should be something more powerful. After nearly two years of thinking and experimenting I proposed the concept of FreeFormat. Half a year of programming followed, full of hardship, during which I was troubled by severe scleritis, and at last the tool is available for free download. In truth only the information extraction features are finished; the semantic database is no more than an embryo.

Ideals do not put food on the table, and I still need to think hard about the next step. Recently an international company has been looking for a search technology expert, and I am under pressure, from family members at the very least; still, I think that if I can weather the financial tsunami, a whole new vista may open up.

In the quiet of the night I read the Bible, hoping for the Lord's guidance.