Semantic Web, information extraction, web scraping, web data extraction: What Is a Semantic Search Engine (Reading Notes)

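I recently read Leigh Dodds's article "Streams, Pools and Reservoirs" and learned a great deal from it. Dodds argues that a semantic search engine and a semantically enabled search engine are two different things. He reaches this conclusion by reviewing the history of how Web content has been organized and retrieved. By analogy with the stages the Web has already gone through, he then sketches what a semantic search engine built on the linked data cloud would look like. Below I summarize the article's main points along with my own thoughts.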

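A review of the history of Web content organization and retrieval

The evolution of the Web can be summarized in the following stages:

1. Content is published online. Once the volume of data grows, the content on the Web is organized through a categorized index. I suspect the original author has in mind something like the early Yahoo directory.
2. Search engines build indexes of the content automatically. The article's phrase for this is "create a link-base".
3. Search engines specialize and offer value-added services.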

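Outlook for organizing and retrieving structured content (data sets)

Leigh Dodds believes that we are now in the late part of something like the first stage above: a large amount of structured data is already described in RDF, there is the LOD (Linking Open Data) project, and semantic search engines that link data sets together are about to appear.

The current stage is described as one in which maintaining the relationships and links between data sets is still largely manual. Quoting the original: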

Not in the sense that members of the LOD community are manually entering data to link datasets together, but rather at the level of looking for opportunities to link together datasets, encouraging data publishers to co-ordinate and inter-relate their data, and by attempting to organically grow the link data web by targeting datasets that would usefully annotate or extend the current Linked Data Cloud.

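Leigh Dodds therefore predicts that semantic search will mean linking and organizing data sets automatically. On that basis he distinguishes a semantic search engine from a semantically enabled search engine.

In his view, semantically enabled search engines: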

use techniques like natural language parsing and improved understanding of document semantics in order to provide an improved search experience for humans

A semantic search engine, by contrast, is described as follows:

A Semantic Web search engine should offer infrastructure for machines. Simple semantic web search engines like Swoogle and Sindice provide a way to for machines to construct "link bases", based on some simple expressions of "what data is of relevance", in order to find data that is of interest to a particular user, community, or within the context of a particular application. And crucially this can be done without having to always crawl or navigate over the entire linked data web. This process can be commoditised just as it has with the web of documents
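
To make the idea of a machine-built "link base" a little more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the `TinyIndex` class, the `build_link_base` function and every `example.org` URI are hypothetical, and none of this is the API of Swoogle, Sindice or any other real service. The index stands in for what a search engine has already crawled, so a client can ask it which data sets mention the resources it cares about instead of walking the whole linked data web itself.

```python
from collections import defaultdict

# Hypothetical sample data: (data set, subject, predicate, object).
# None of these URIs come from the article.
QUADS = [
    ("http://example.org/ds/books", "ex:book1", "rdf:type", "ex:Book"),
    ("http://example.org/ds/books", "ex:book1", "dc:creator", "ex:alice"),
    ("http://example.org/ds/people", "ex:alice", "rdf:type", "foaf:Person"),
    ("http://example.org/ds/reviews", "ex:rev1", "rdf:type", "ex:Review"),
    ("http://example.org/ds/reviews", "ex:rev1", "ex:reviews", "ex:book1"),
]


class TinyIndex:
    """A toy stand-in for the index a semantic web search engine maintains."""

    def __init__(self, quads):
        self._by_resource = defaultdict(set)  # resource -> data sets mentioning it
        self._by_pattern = defaultdict(set)   # (predicate, object) -> data sets
        for dataset, s, p, o in quads:
            self._by_resource[s].add(dataset)
            self._by_resource[o].add(dataset)
            self._by_pattern[(p, o)].add(dataset)

    def datasets_mentioning(self, resource):
        """Data sets in which the resource appears as subject or object."""
        return set(self._by_resource.get(resource, set()))

    def datasets_matching(self, predicate, obj):
        """Data sets containing at least one triple with this predicate and object."""
        return set(self._by_pattern.get((predicate, obj), set()))


def build_link_base(index, relevant_resources):
    """Map each resource of interest to the data sets that say something about it."""
    return {r: sorted(index.datasets_mentioning(r)) for r in relevant_resources}


index = TinyIndex(QUADS)
link_base = build_link_base(index, ["ex:book1", "ex:alice"])
```

Here the "expression of relevance" is just a list of resources. A real engine would presumably accept richer expressions, but the division of labour is the same: the index does the crawling once, and each application derives its own small link base from it.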

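Thoughts

When I started developing the MetaSeeker toolkit two years ago, this view was not mainstream; most people were focusing on semantic recognition. I chose a different direction not because I saw further, but because, with only the skills of an old programmer, exploring artificial intelligence or ontology was out of the question. I preferred to build a practical tool that lets people who build vertical search engines and social networking sites extract Web data at low or even zero cost. So I took the path of structuring Web content. That path is not simple either: for example, I still have not found an effective way to organize and build up the data relevance that the article talks about.

Comparing ordinary search with semantic search

The author makes the comparison in a table, copied below: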

Document Web	Semantic Web Infrastructure	Description
Google Image Search	Type Searching	Ability to discover resources of a particular type: e.g. Person, Review, Book
Google Translate	Vocabulary Normalisation	Application of simple inferencing to expose data in more vocabularies than made available by the publisher
Google Custom Search	Community Constructed Data Sets and Indexes	Ability to create and manipulate custom subsets of the linked data cloud
Google Trends	Linked Data Analysis & Publishing Trends	Identifying new data sources; new vocabularies; clusters of data; data analysis
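
The first two rows are concrete enough to sketch. The Python below is a toy reading of "type searching" and "vocabulary normalisation", not something taken from the article: the triples, the `EQUIVALENT_PROPERTIES` mapping and both function names are assumptions made for illustration, and real inferencing would draw on published ontologies rather than a hand-written dictionary.

```python
# Hypothetical triples; the prefixes stand in for full URIs.
TRIPLES = [
    ("ex:alice", "rdf:type", "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:book1", "rdf:type", "ex:Book"),
    ("ex:book1", "dc:title", "Linked Data Basics"),
]

# Assumed mapping for this example only: each property is also exposed
# under the listed alternative properties.
EQUIVALENT_PROPERTIES = {
    "foaf:name": ["rdfs:label"],
    "dc:title": ["rdfs:label"],
}


def find_by_type(triples, rdf_type):
    """Type searching: return every subject declared to be of the given type."""
    return sorted({s for s, p, o in triples if p == "rdf:type" and o == rdf_type})


def normalise_vocabulary(triples, equivalents):
    """Vocabulary normalisation: add a copy of each triple for every alias property."""
    result = list(triples)
    seen = set(triples)
    for s, p, o in triples:
        for alias in equivalents.get(p, []):
            extra = (s, alias, o)
            if extra not in seen:
                seen.add(extra)
                result.append(extra)
    return result


people = find_by_type(TRIPLES, "foaf:Person")
expanded = normalise_vocabulary(TRIPLES, EQUIVALENT_PROPERTIES)
```

After normalisation, a client that only understands `rdfs:label` can read both the person's name and the book's title, even though neither publisher used that property.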

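From an implementation standpoint I did not really understand this table, in particular what the finished system should look like from the outside. Leigh Dodds says the purpose of the last two rows is: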

to be able to easily aggregate, combine and analyse aspects of the linked data cloud

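I do not understand how this would be implemented.

Reference implementation

Leigh Dodds uses a water metaphor: in an ideal Web environment data should flow in a similar way, from stream to pool to reservoir, and he argues that infrastructure is needed to keep that flow running smoothly. His example is the Talis Platform, which he sees as establishing an ecosystem for building data reservoirs and which he discusses in the article "Enabling the Linked Data Ecosystem". The features of that system are: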
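
One possible reading of "aggregate, combine and analyse", offered as a guess rather than as what Leigh Dodds had in mind, is sketched below in Python. A custom subset is simply a chosen group of data sets merged into one collection, and a very crude form of trend analysis is counting which vocabularies that subset uses. The `DATASETS` dictionary, the function names and the prefixes are all hypothetical.

```python
from collections import Counter

# Hypothetical data sets keyed by name; each holds (subject, predicate, object) triples.
DATASETS = {
    "books": [
        ("ex:book1", "dc:title", "Linked Data Basics"),
        ("ex:book1", "dc:creator", "ex:alice"),
    ],
    "people": [
        ("ex:alice", "foaf:name", "Alice"),
        ("ex:alice", "foaf:knows", "ex:bob"),
    ],
    "reviews": [
        ("ex:rev1", "rev:rating", "4"),
        ("ex:rev1", "dc:creator", "ex:bob"),
    ],
}


def custom_subset(datasets, names):
    """Combine the chosen data sets into a single list of triples."""
    return [triple for name in names for triple in datasets[name]]


def vocabulary_usage(triples):
    """Count how often each vocabulary prefix appears in predicate position."""
    return Counter(p.split(":", 1)[0] for _, p, _ in triples)


subset = custom_subset(DATASETS, ["books", "reviews"])
usage = vocabulary_usage(subset)
```

Running the same count over snapshots taken at different times would show which vocabularies are gaining or losing ground, which is roughly what the "Google Trends" row seems to point at.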

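RSS, likened to a data stream. Leigh Dodds calls it the core search service, which I do not understand: it looks like content push to me, so how is it search?
SPARQL, for querying the data
Augmentation service, which adds metadata to the content delivered over RSS
Store groups, which organize multiple data sets into a larger entity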
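
As a rough illustration of the store groups idea only, the Python sketch below treats several small stores as one queryable unit. The `Store` and `StoreGroup` classes are hypothetical and do not reflect the actual Talis Platform API, which these notes do not describe. The pattern matching is a much simplified stand-in for a SPARQL query.

```python
class Store:
    """A named collection of (subject, predicate, object) triples."""

    def __init__(self, name, triples):
        self.name = name
        self.triples = list(triples)

    def match(self, s=None, p=None, o=None):
        """Return triples matching the pattern; None acts as a wildcard."""
        return [
            t for t in self.triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)
        ]


class StoreGroup:
    """Several stores presented as one larger entity."""

    def __init__(self, stores):
        self.stores = list(stores)

    def match(self, s=None, p=None, o=None):
        """Query every member store and tag each hit with the store it came from."""
        return [
            (store.name, triple)
            for store in self.stores
            for triple in store.match(s, p, o)
        ]


# Hypothetical example data.
books = Store("books", [("ex:book1", "dc:creator", "ex:alice")])
people = Store("people", [("ex:alice", "foaf:name", "Alice")])
group = StoreGroup([books, people])

about_alice = group.match(s="ex:alice")
```

Keeping the store name with each result is a design choice in this sketch: the group behaves like one data set, yet the origin of every statement stays visible.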

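Thoughts

It looks like I should study the store groups feature of the Talis Platform and use it to improve MetaSeeker, the shareware I have just released. It may be exactly the data relevance building I mentioned earlier.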


Thursday, May 14, 2009
