Tuesday, July 29, 2008

The TF/IDF Algorithm


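(Reposted from http://www.shenzhenseo.org.cn/seo/tf-idf-suanfa.html)
Today we can absorb a huge amount of information from the web, sometimes more articles than we can finish reading. When we want to take in information but are short on time, one approach is to have a computer filter it for us or produce a summary. Suppose we have an article in hand and want the computer to find its most important keywords. How do we do that? In information retrieval (IR) there is a basic method that every beginner learns: using TF and IDF (TF: Term Frequency, IDF: Inverse Document Frequency). With these two estimates a computer can score keywords by importance, which saves us time.
  Next, let's look at what TF and IDF each are. TF stands for Term Frequency: the number of times a given keyword appears. For example, if the word "computer" or the phrase "user requirements" appears many times in an article, its term frequency is high. A term that appears many times in an article usually matters to it. In an article about "artificial intelligence", for instance, the term "artificial intelligence" is bound to appear frequently. So why do we need IDF (Inverse Document Frequency) in addition to TF (Term Frequency)?
  Let's first consider what goes wrong if we judge an article's most important keywords by frequency alone. Common words also appear very often, as often as the important words, so the computer cannot tell which words are merely common and which are important. For English, linguists have summarized a rule about this called Zipf's law.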

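  Here is a passage quoted from the Chinese Wikipedia:
  Fundamentally, Zipf's law states that in a corpus of natural language, the frequency of a word is inversely proportional to its rank in the frequency table. So the most frequent word occurs roughly twice as often as the second most frequent word, which in turn occurs twice as often as the fourth most frequent word. The law serves as a reference for anything related to power law probability distributions. It was published by the Harvard linguist George Kingsley Zipf (IPA [zɪf]).
  For example, in the Brown Corpus, "the" is the most common word, accounting for about 7% of the corpus (69,971 occurrences in roughly one million words). As Zipf's law describes, the second most frequent word, "of", accounts for about 3.5% of the corpus (36,411 occurrences), followed by "and" (28,852 occurrences). Just 135 vocabulary items account for half of the Brown Corpus.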
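  As a rough check on the figures quoted above, the following Python sketch compares the three Brown Corpus counts with an idealised Zipf curve in which the word at rank r occurs (count of the top word) / r times. The function name `zipf_prediction` is made up for illustration, and the idealised 1/r form is a simplifying assumption; real corpora only follow it approximately.

```python
# Counts quoted above for the three most frequent Brown Corpus words.
observed = {"the": 69971, "of": 36411, "and": 28852}


def zipf_prediction(top_count, rank):
    """Idealised Zipf count for a given rank: top_count / rank (an assumption)."""
    return top_count / rank


top = observed["the"]
# dict preserves insertion order, so enumerate() yields ranks 1, 2, 3.
for rank, (word, count) in enumerate(observed.items(), start=1):
    predicted = zipf_prediction(top, rank)
    print(f"rank {rank}: {word!r} observed={count} idealised={predicted:.0f}")

# Ratio of the top word to the second word; Zipf's law predicts about 2.
print(round(observed["the"] / observed["of"], 2))
```

  The ratio between "the" and "of" comes out close to 2, while "and" is noticeably more frequent than the idealised curve suggests, which is a reminder that the law is an approximation.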
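  So now we know where the problem lies. If we judge the most important keywords of an article only by how often words appear, we may find common words rather than important ones. English words such as "the", "a", and "it" appear all the time, but they are usually not the most important words in an article, even though the important words also appear frequently.
  What do we do then? This is where IDF helps. Before IDF, let's understand DF. DF is Document Frequency: given a fixed collection of N articles, the Document Frequency (DF) of a keyword is the number of those N articles in which the keyword appears. Inverse Document Frequency (IDF) is the reciprocal of DF, so multiplying a number by IDF is the same as dividing it by DF.
  With TF and IDF in hand, we can multiply TF by IDF to get a score for every keyword. The score indicates how important the keyword is in a given article. Why does this surface important words rather than merely frequent ones? TF alone ranks the most frequent word in an article first, the next most frequent second, and so on. But after multiplying by IDF, that is, dividing by DF, common words such as "the", "a", and "it" have a large DF because they appear in every article. A large DF means a small IDF after taking the reciprocal. So even though "the", "a", and "it" appear very often in an article, the product TF * IDF is low because IDF is small, and we (the computer program) no longer mistake these common words for important ones.
  What score does a truly important word get? If the article happens to be about AI, "AI" appears many times, so its TF in this article is high. But not every one of the N articles in our database is about AI. Suppose N = 100, so the database holds 100 articles, and "AI" appears in only 3 of them: its DF is 3 and its IDF is 1/3, about 0.33. A common word such as "the" appears in every article, so its DF is 100 and its IDF is 0.01. The IDF of "AI" is therefore higher than that of "the". If "AI" and "the" happen to occur the same number of times in this article, then after multiplying by IDF the score of "AI" is higher than the score of "the", and the computer judges "AI" to be an important keyword of the article and "the" not to be.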
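  The arithmetic of this example can be sketched in a few lines of Python. The DF values (3 and 100) and N = 100 come from the text; the term count of 20 is an assumed number chosen only so that both words have the same TF. The function names are illustrative. The first IDF follows the article's simplified definition (the reciprocal of DF); the second, log(N / DF), is the form more commonly used in practice and is shown only for comparison.

```python
import math


def idf_reciprocal(df):
    """IDF as defined in the article: the reciprocal of document frequency."""
    return 1.0 / df


def idf_log(n_docs, df):
    """A common alternative: log(N / DF). Not the article's definition."""
    return math.log(n_docs / df)


def tf_idf(tf, df):
    """TF * IDF using the article's reciprocal IDF."""
    return tf * idf_reciprocal(df)


N = 100   # articles in the collection (from the text)
TF = 20   # assumed: "AI" and "the" each occur 20 times in this article

score_ai = tf_idf(TF, df=3)     # "AI" appears in 3 of the 100 articles
score_the = tf_idf(TF, df=N)    # "the" appears in all 100 articles

print(f"AI : {score_ai:.2f}")
print(f"the: {score_the:.2f}")
print(f"log IDF for AI : {idf_log(N, 3):.2f}")
print(f"log IDF for the: {idf_log(N, N):.2f}")
```

  With equal term counts, "AI" scores far higher than "the" under either definition. With the logarithmic form, a word that appears in every article gets an IDF of exactly zero.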
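  So with TF * IDF we can compute how important a keyword is in a given article. In one direction, we can work out which words carry the main points of an article and use them to summarize it. In the opposite direction, we can start from a keyword, compute TF * IDF for it in every article, and compare in which article the keyword matters most, which finds the article most relevant to that keyword. Whether extracting key terms from an article or finding relevant articles from a keyword, TF * IDF is a basic and decent method. Readers who can program and have not tried it may want to give it a go. You will first need to prepare a collection of articles (a corpus), or use a web crawler to save a few pages you find interesting and strip the HTML tags to leave plain text. Then you can try the method out.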
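  Both directions can be tried on a toy corpus. The sketch below is a minimal illustration, not a production implementation: the three sentences are invented, tokenization is a plain whitespace split (which would not work for Chinese text), and IDF is the article's reciprocal-of-DF simplification. All function names are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def document_frequency(docs):
    """For each term, count how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def top_keywords(doc, df, k=3):
    """Rank the terms of one document by TF / DF, highest first."""
    tf = Counter(tokenize(doc))
    # max(..., 1) guards against terms that are missing from the corpus.
    scores = {term: count / max(df[term], 1) for term, count in tf.items()}
    return sorted(scores, key=lambda term: (-scores[term], term))[:k]


def best_document(keyword, docs, df):
    """Index of the document in which the keyword scores highest."""
    def score(doc):
        return Counter(tokenize(doc))[keyword] / max(df[keyword], 1)
    return max(range(len(docs)), key=lambda i: score(docs[i]))


docs = [
    "the ai system beats the ai benchmark",
    "the weather is nice and the sky is clear",
    "the market fell and the traders sold",
]
df = document_frequency(docs)

print(top_keywords(docs[0], df, k=1))   # most important term of the first text
print(best_document("ai", docs, df))    # which text is most about "ai"
```

  In the first text "the" and "ai" both occur twice, but "the" appears in all three texts while "ai" appears in only one, so "ai" ranks first.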
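  We can also compare humans and computers. Computers are very good, and very fast, at numerical calculation and at executing fixed steps. Humans can understand what a word means. After reading and understanding an article, a person finds its most important keywords by starting from "meaning", recalling or concluding what the key terms are.
  If we ask a computer to follow the same route, first understanding the meaning of words, then the meaning of the article, and only then concluding which keywords matter, the task becomes harder. To understand the meaning of a word, the computer first needs a semantic network or a taxonomy of knowledge (an ontology) that classifies terms by meaning, much like the kingdom, phylum, class, order, family, genus, species hierarchy in biology. Only then can it grasp how one word relates to others. To understand an article it must also understand sentences, which brings in natural language processing (NLP) problems: finding the subject, verb, object, and complement of a sentence, telling subordinate clauses from main clauses, resolving what pronouns refer to, and choosing among different parses depending on context. Only after understanding each sentence can it understand the whole article.
  TF * IDF, by contrast, is fast for a computer to calculate and is not a huge engineering effort; it does not need a large machine. This is also a good place to mention strong AI versus weak AI. From an engineering standpoint TF * IDF is a good method: it works! It saves us time, or solves one small piece of a larger problem. But the debate over strong AI raises the Chinese Room argument here: does the fact that a computer can find important keywords mean it really "understands" what they mean?
  Put simply, the Chinese Room is a person shut in a room with only two openings: slips of paper come in through one and go out through the other. In the room is a manual full of lookup tables recording which Chinese characters to output for which English words, along with some instructions. For example, if the instruction COMBINE comes in, the person writes two Chinese characters together before sending them out. From outside we pass English sentences into the room, and Chinese translations come out of the other opening. What the argument wants to examine is this: from outside, the room appears to translate English into Chinese, so it seems the room must understand Chinese. Yet the operator inside does not understand Chinese at all; he only follows the instructions and the lookup tables in the manual, mechanically.
  My view is that in the near term it may be enough to have answers that solve the problem, whether or not the computer truly understands the meaning of words. In the long run, though, if we really need computers with human-level intelligence, ones that truly understand rather than merely behave as if they do, then arguments like the Chinese Room need careful examination. Biological approaches, such as computational neuroscience, may be one direction.
  We might then ask: a neuron has only two states, firing an action potential or resting, so how can it understand meaning? A single neuron may not be able to, but when all the neurons of the brain interact, meaning may be understood. How that works is one of the questions computational neuroscience tries to answer. Interested readers can also start from the human brain and work on the strong AI problem. Or perhaps some mathematical theory will solve the problem of understanding meaning elegantly. Manifolds, for example, let a single set be viewed from different perspectives and have both global and local properties, which makes them a decent candidate. Attacking strong AI from that direction is another possibility. In short, keep up the research!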

Sunday, July 27, 2008

A Java Implementation of the MMR Model for Multi-Document Summarization



Friday, July 25, 2008

The Connection Has Been Reset


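A note on this repost, to LF: you may be an English major, but you hardly ever read English. I recommend this one. It analyzes a lot and keeps the technical explanations fairly simple.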
China’s Great Firewall is crude, slapdash, and surprisingly easy to breach. Here’s why it’s so effective anyway.

“The Connection Has Been Reset”

Illustration by John Ritter
Many foreigners who come to China for the Olympics will use the Internet to tell people back home what they have seen and to check what else has happened in the world.
The first thing they’ll probably notice is that China’s Internet seems slow. Partly this is because of congestion in China’s internal networks, which affects domestic and international transmissions alike. Partly it is because even electrons take a detectable period of time to travel beneath the Pacific Ocean to servers in America and back again; the trip to and from Europe is even longer, because that goes through America, too. And partly it is because of the delaying cycles imposed by China’s system that monitors what people are looking for on the Internet, especially when they’re looking overseas. That’s what foreigners have heard about.
They’ll likely be surprised, then, to notice that China’s Internet seems surprisingly free and uncontrolled. Can they search for information about “Tibet independence” or “Tiananmen shooting” or other terms they have heard are taboo? Probably—and they’ll be able to click right through to the controversial sites. Even if they enter the Chinese-language term for “democracy in China,” they’ll probably get results. What about Wikipedia, famously off-limits to users in China? They will probably be able to reach it. Naturally the visitors will wonder: What’s all this I’ve heard about the “Great Firewall” and China’s tight limits on the Internet?
In reality, what the Olympic-era visitors will be discovering is not the absence of China’s electronic control but its new refinement—and a special Potemkin-style unfettered access that will be set up just for them, and just for the length of their stay. According to engineers I have spoken with at two tech organizations in China, the government bodies in charge of censoring the Internet have told them to get ready to unblock access from a list of specific Internet Protocol (IP) addresses—certain Internet cafés, access jacks in hotel rooms and conference centers where foreigners are expected to work or stay during the Olympic Games. (I am not giving names or identifying details of any Chinese citizens with whom I have discussed this topic, because they risk financial or criminal punishment for criticizing the system or even disclosing how it works. Also, I have not gone to Chinese government agencies for their side of the story, because the very existence of Internet controls is almost never discussed in public here, apart from vague statements about the importance of keeping online information “wholesome.”)
Depending on how you look at it, the Chinese government’s attempt to rein in the Internet is crude and slapdash or ingenious and well crafted. When American technologists write about the control system, they tend to emphasize its limits. When Chinese citizens discuss it—at least with me—they tend to emphasize its strength. All of them are right, which makes the government’s approach to the Internet a nice proxy for its larger attempt to control people’s daily lives.
Disappointingly, “Great Firewall” is not really the right term for the Chinese government’s overall control strategy. China has indeed erected a firewall—a barrier to keep its Internet users from dealing easily with the outside world—but that is only one part of a larger, complex structure of monitoring and censorship. The official name for the entire approach, which is ostensibly a way to keep hackers and other rogue elements from harming Chinese Internet users, is the “Golden Shield Project.” Since that term is too creepy to bear repeating, I’ll use “the control system” for the overall strategy, which includes the “Great Firewall of China,” or GFW, as the means of screening contact with other countries.
In America, the Internet was originally designed to be free of choke points, so that each packet of information could be routed quickly around any temporary obstruction. In China, the Internet came with choke points built in. Even now, virtually all Internet contact between China and the rest of the world is routed through a very small number of fiber-optic cables that enter the country at one of three points: the Beijing-Qingdao-Tianjin area in the north, where cables come in from Japan; Shanghai on the central coast, where they also come from Japan; and Guangzhou in the south, where they come from Hong Kong. (A few places in China have Internet service via satellite, but that is both expensive and slow. Other lines run across Central Asia to Russia but carry little traffic.) In late 2006, Internet users in China were reminded just how important these choke points are when a seabed earthquake near Taiwan cut some major cables serving the country. It took months before international transmissions to and from most of China regained even their pre-quake speed, such as it was.
Thus Chinese authorities can easily do something that would be harder in most developed countries: physically monitor all traffic into or out of the country. They do so by installing at each of these few “international gateways” a device called a “tapper” or “network sniffer,” which can mirror every packet of data going in or out. This involves mirroring in both a figurative and a literal sense. “Mirroring” is the term for normal copying or backup operations, and in this case real though extremely small mirrors are employed. Information travels along fiber-optic cables as little pulses of light, and as these travel through the Chinese gateway routers, numerous tiny mirrors bounce reflections of them to a separate set of “Golden Shield” computers. Here the term’s creepiness is appropriate. As the other routers and servers (short for file servers, which are essentially very large-capacity computers) that make up the Internet do their best to get the packet where it’s supposed to go, China’s own surveillance computers are looking over the same information to see whether it should be stopped.
The mirroring routers were first designed and supplied to the Chinese authorities by the U.S. tech firm Cisco, which is why Cisco took such heat from human-rights organizations. Cisco has always denied that it tailored its equipment to the authorities’ surveillance needs, and said it merely sold them what it would sell anyone else. The issue is now moot, since similar routers are made by companies around the world, notably including China’s own electronics giant, Huawei. The ongoing refinements are mainly in surveillance software, which the Chinese are developing themselves. Many of the surveillance engineers are thought to come from the military’s own technology institutions. Their work is good and getting better, I was told by Chinese and foreign engineers who do “oppo research” on the evolving GFW so as to design better ways to get around it.
Andrew Lih, a former journalism professor and software engineer now based in Beijing (and author of the forthcoming book The Wikipedia Story), laid out for me the ways in which the GFW can keep a Chinese Internet user from finding desired material on a foreign site. In the few seconds after a user enters a request at the browser, and before something new shows up on the screen, at least four things can go wrong—or be made to go wrong.
The first and bluntest is the “DNS block.” The DNS, or Domain Name System, is in effect the telephone directory of Internet sites. Each time you enter a Web address, or URL—www.yahoo.com, let’s say—the DNS looks up the IP address where the site can be found. IP addresses are numbers separated by dots—for example, TheAtlantic.com’s is 38.118.42.200. If the DNS is instructed to give back no address, or a bad address, the user can’t reach the site in question—as a phone user could not make a call if given a bad number. Typing in the URL for the BBC’s main news site often gets the no-address treatment: if you try news.bbc.co.uk, you may get a “Site not found” message on the screen. For two months in 2002, Google’s Chinese site, Google.cn, got a different kind of bad-address treatment, which shunted users to its main competitor, the dominant Chinese search engine, Baidu. Chinese academics complained that this was hampering their work. The government, which does not have to stand for reelection but still tries not to antagonize important groups needlessly, let Google.cn back online. During politically sensitive times, like last fall’s 17th Communist Party Congress, many foreign sites have been temporarily shut down this way.
Next is the perilous “connect” phase. If the DNS has looked up and provided the right IP address, your computer sends a signal requesting a connection with that remote site. While your signal is going out, and as the other system is sending a reply, the surveillance computers within China are looking over your request, which has been mirrored to them. They quickly check a list of forbidden IP sites. If you’re trying to reach one on that blacklist, the Chinese international-gateway servers will interrupt the transmission by sending an Internet “Reset” command both to your computer and to the one you’re trying to reach. Reset is a perfectly routine Internet function, which is used to repair connections that have become unsynchronized. But in this case it’s equivalent to forcing the phones on each end of a conversation to hang up. Instead of the site you want, you usually see an onscreen message beginning “The connection has been reset”; sometimes instead you get “Site not found.” Annoyingly, blogs hosted by the popular system Blogspot are on this IP blacklist. For a typical Google-type search, many of the links shown on the results page are from Wikipedia or one of these main blog sites. You will see these links when you search from inside China, but if you click on them, you won’t get what you want.
The third barrier comes with what Lih calls “URL keyword block.” The numerical Internet address you are trying to reach might not be on the blacklist. But if the words in its URL include forbidden terms, the connection will also be reset. (The Uniform Resource Locator is a site’s address in plain English, say www.microsoft.com, rather than its all-numeric IP address.) The site FalunGong.com appears to have no active content, but even if it did, Internet users in China would not be able to see it. The forbidden list contains words in English, Chinese, and other languages, and is frequently revised: “like, with the name of the latest town with a coal mine disaster,” as Lih put it. Here the GFW’s programming technique is not a reset command but a “black-hole loop,” in which a request for a page is trapped in a sequence of delaying commands. These are the programming equivalent of the old saw about how to keep an idiot busy: you take a piece of paper and write “Please turn over” on each side. When the Firefox browser detects that it is in this kind of loop, it gives an error message saying: “The server is redirecting the request for this address in a way that will never complete.”
The final step involves the newest and most sophisticated part of the GFW: scanning the actual contents of each page—which stories  is featuring, what a China-related blog carries in its latest update—to judge its page-by-page acceptability. This again is done with mirrors. When you reach a favorite blog or news site and ask to see particular items, the requested pages come to you—and to the surveillance system at the same time. The GFW scanner checks the content of each item against its list of forbidden terms. If it finds something it doesn’t like, it breaks the connection to the offending site and won’t let you download anything further from it. The GFW then imposes a temporary blackout on further “IP1 to IP2” attempts—that is, efforts to establish communications between the user and the offending site. Usually the first time-out is for two minutes. If the user tries to reach the site during that time, a five-minute time-out might begin. On a third try, the time-out might be 30 minutes or an hour—and so on through an escalating sequence of punishments.
Users who try hard enough or often enough to reach the wrong sites might attract the attention of the authorities. At least in principle, Chinese Internet users must sign in with their real names whenever they go online, even in Internet cafés. When the surveillance system flags an IP address from which a lot of “bad” searches originate, the authorities have a good chance of knowing who is sitting at that machine.
All of this adds a note of unpredictability to each attempt to get news from outside China. One day you go to the NPR site and cruise around with no problem. The next time, NPR happens to have done a feature on Tibet. The GFW immobilizes the site. If you try to refresh the page or click through to a new story, you’ll get nothing—and the time-out clock will start.
This approach is considered a subtler and more refined form of censorship, since big foreign sites no longer need be blocked wholesale. In principle they’re in trouble only when they cover the wrong things. Xiao Qiang, an expert on Chinese media at the University of California at Berkeley journalism school, told me that the authorities have recently begun applying this kind of filtering in reverse. As Chinese-speaking people outside the country, perhaps academics or exiled dissidents, look for data on Chinese sites—say, public-health figures or news about a local protest—the GFW computers can monitor what they’re asking for and censor what they find.
Taken together, the components of the control system share several traits. They’re constantly evolving and changing in their emphasis, as new surveillance techniques become practical and as words go on and off the sensitive list. They leave the Chinese Internet public unsure about where the off-limits line will be drawn on any given day. Andrew Lih points out that other countries that also censor Internet content—Singapore, for instance, or the United Arab Emirates—provide explanations whenever they do so. Someone who clicks on a pornographic or “anti-Islamic” site in the U.A.E. gets the following message, in Arabic and English: “We apologize the site you are attempting to visit has been blocked due to its content being inconsistent with the religious, cultural, political, and moral values of the United Arab Emirates.” In China, the connection just times out. Is it your computer’s problem? The firewall? Or maybe your local Internet provider, which has decided to do some filtering on its own? You don’t know. “The unpredictability of the firewall actually makes it more effective,” another Chinese software engineer told me. “It becomes much harder to know what the system is looking for, and you always have to be on guard.”
There is one more similarity among the components of the firewall: they are all easy to thwart.
As a practical matter, anyone in China who wants to get around the firewall can choose between two well-known and dependable alternatives: the proxy server and the VPN. A proxy server is a way of connecting your computer inside China with another one somewhere else—or usually to a series of foreign computers, automatically passing signals along to conceal where they really came from. You initiate a Web request, and the proxy system takes over, sending it to a computer in America or Finland or Brazil. Eventually the system finds what you want and sends it back. The main drawback is that it makes Internet operations very, very slow. But because most proxies cost nothing to install and operate, this is the favorite of students and hackers in China.
A VPN, or virtual private network, is a faster, fancier, and more elegant way to achieve the same result. Essentially a VPN creates your own private, encrypted channel that runs alongside the normal Internet. From within China, a VPN connects you with an Internet server somewhere else. You pass your browsing and downloading requests to that American or Finnish or Japanese server, and it finds and sends back what you’re looking for. The GFW doesn’t stop you, because it can’t read the encrypted messages you’re sending. Every foreign business operating in China uses such a network. VPNs are freely advertised in China, so individuals can sign up, too. I use one that costs $40 per year. (An expat in China thinks: . A Chinese factory worker thinks: . Even for a young academic, it’s a couple days’ work.)
As a technical matter, China could crack down on the proxies and VPNs whenever it pleased. Today the policy is: if a message comes through that the surveillance system cannot read because it’s encrypted, let’s wave it on through! Obviously the system’s behavior could be reversed. But everyone I spoke with said that China could simply not afford to crack down that way. “Every bank, every foreign manufacturing company, every retailer, every software vendor needs VPNs to exist,” a Chinese professor told me. “They would have to shut down the next day if asked to send their commercial information through the regular Chinese Internet and the Great Firewall.” Closing down the free, easy-to-use proxy servers would create a milder version of the same problem. Encrypted e-mail, too, passes through the GFW without scrutiny, and users of many Web-based mail systems can establish a secure session simply by typing “https:” rather than the usual “http:” in a site’s address, for instance https://mail.yahoo.com. To keep China in business, then, the government has to allow some exceptions to its control efforts, even knowing that many Chinese citizens will exploit the resulting loopholes.
Because the Chinese government can’t plug every gap in the Great Firewall, many American observers have concluded that its larger efforts to control electronic discussion, and the democratization and grass-roots organizing it might nurture, are ultimately doomed. A recent item on an influential American tech Web site had the headline “Chinese National Firewall Isn’t All That Effective.” In October,  ran a story under the headline “The Great Firewall: China’s Misguided—and Futile—Attempt to Control What Happens Online.”
Let’s not stop to discuss why the vision of democracy-through-communications-technology is so convincing to so many Americans. (Samizdat, fax machines, and the Voice of America eventually helped bring down the Soviet system. Therefore proxy servers and online chat rooms must erode the power of the Chinese state. Right?) Instead, let me emphasize how unconvincing this vision is to most people who deal with China’s system of extensive, if imperfect, Internet controls.
Think again of the real importance of the Great Firewall. Does the Chinese government really care if a citizen can look up the Tiananmen Square entry on Wikipedia? Of course not. Anyone who wants that information will get it—by using a proxy server or VPN, by e-mailing to a friend overseas, even by looking at the surprisingly broad array of foreign magazines that arrive, uncensored, in Chinese public libraries.
What the government cares about is making the quest for information just enough of a nuisance that people generally won’t bother. Most Chinese people, like most Americans, are interested mainly in their own country. All around them is more information about China and things Chinese than they could possibly take in. The newsstands are bulging with papers and countless glossy magazines. The bookstores are big, well stocked, and full of patrons, and so are the public libraries. Video stores, with pirated versions of anything. Lots of TV channels. And of course the Internet, where sites in Chinese and about China constantly proliferate. When this much is available inside the Great Firewall, why go to the expense and bother, or incur the possible risk, of trying to look outside?
All the technology employed by the Golden Shield, all the marvelous mirrors that help build the Great Firewall—these and other modern achievements matter mainly for an old-fashioned and pre-technological reason. By making the search for external information a nuisance, they drive Chinese people back to an environment in which familiar tools of social control come into play.
Chinese bloggers have learned that if they want to be read in China, they must operate within China, on the same side of the firewall as their potential audience. Sure, they could put up exactly the same information outside the Chinese mainland. But according to Rebecca MacKinnon, a former Beijing correspondent for CNN now at the Journalism and Media Studies Center of the University of Hong Kong, their readers won’t make the effort to cross the GFW and find them. “If you want to have traction in China, you have to be in China,” she told me. And being inside China means operating under the sweeping rules that govern all forms of media here: guidance from the authorities; the threat of financial ruin or time in jail; the unavoidable self-censorship as the cost of defiance sinks in.
Most blogs in China are hosted by big Internet companies. Those companies know that the government will hold them responsible if a blogger says something bad. Thus the companies, for their own survival, are dragooned into service as auxiliary censors.
Large teams of paid government censors delete offensive comments and warn errant bloggers. (No official figures are available, but the censor workforce is widely assumed to number in the tens of thousands.) Members of the public at large are encouraged to speak up when they see subversive material. The propaganda ministries send out frequent instructions about what can and cannot be discussed. In October, the group Reporters Without Borders, based in Paris, released an astonishing report by a Chinese Internet technician writing under the pseudonym “Mr. Tao.” He collected dozens of the messages he and other Internet operators had received from the central government. Here is just one, from the summer of 2006:
17 June 2006, 18:35 

From: Chen Hua, deputy director of the Beijing Internet Information Administrative Bureau 

Dear colleagues, the Internet has of late been full of articles and messages about the death of a Shenzhen engineer, Hu Xinyu, as a result of overwork. All sites must stop posting articles on this subject, those that have already been posted about it must be removed from the site and, finally, forums and blogs must withdraw all articles and messages about this case.
“Domestic censorship is the real issue, and it is about social control, human surveillance, peer pressure, and self-censorship,” Xiao Qiang of Berkeley says. Last fall, a team of computer scientists from the University of California at Davis and the University of New Mexico published an exhaustive technical analysis of the GFW’s operation and of the ways it could be foiled. But they stressed a nontechnical factor: “The presence of censorship, even if easy to evade, promotes self-censorship.”
It would be wrong to portray China as a tightly buttoned mind-control state. It is too wide-open in too many ways for that. “Most people in China feel freer than any Chinese people have been in the country’s history, ever,” a Chinese software engineer who earned a doctorate in the United States told me. “There has never been a space for any kind of discussion before, and the government is clever about continuing to expand space for anything that doesn’t threaten its survival.” But it would also be wrong to ignore the cumulative effect of topics people are not allowed to discuss. “Whether or not Americans supported George W. Bush, they could not avoid learning about Abu Ghraib,” Rebecca MacKinnon says. In China, “the controls mean that whole topics inconvenient for the regime simply don’t exist in public discussion.” Most Chinese people remain wholly unaware of internationally noticed issues like, for instance, the controversy over the Three Gorges Dam.
Countless questions about today’s China boil down to: How long can this go on? How long can the industrial growth continue before the natural environment is destroyed? How long can the super-rich get richer, without the poor getting mad? And so on through a familiar list. The Great Firewall poses the question in another form: How long can the regime control what people are allowed to know, without the people caring enough to object? On current evidence, for quite a while.

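Tuesday, July 22, 2008

HowNet-Based Chinese Word Similarity, Open-Sourced in Java

I've open-sourced what I wrote over the past couple of days. I don't know whether anyone will search for this and come across these modules.
I hope others find it useful.
OK, the code is shared on Google Code; you can use SVN or download it directly.
The project computes Chinese word similarity based on HowNet (http://www.keenage.com/html/c_index.html).
The algorithm is adapted from the C++ version that ships with the shared edition of HowNet, and mainly follows the paper 《基于<知网>的词汇语义相似度计算》 ("Word Semantic Similarity Computation Based on HowNet").
Call it a Java version. It has no user interface; I figure the people who really need it would not use an interface anyway.
Project page: http://code.google.com/p/chinesewordsimilarity/ . If you download and use it, I recommend reading the paper included there, 基于<知网>的词汇语义相似度计算; the algorithms are all the ones it describes.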

Friday, July 18, 2008

Repost: The Origin of "Yindu A San" (印度阿三)


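"Hongtou A San" (红头阿三, "red-headed A San") may be an unfamiliar term to young people today, but old Shanghainese now in their sixties and seventies know it well. There are roughly these explanations for the name. All Indian constables wrapped their heads in red turbans, hence "red-headed" (in fact, those serving as traffic and patrol police wore red turbans, while those serving as guards wore yellow ones). For "A San" there are two accounts. One is that Indians were a conquered people who, in Shanghainese eyes, ranked third (san), below the Western and Chinese constables. The other is that Indians had the verbal habit of saying "I say, I say", which sounds like "A San". Put "red-headed" and "A San" together and you get "Hongtou A San".
The Indian constable was a product of Shanghai's colonial period and unique in China. In August 1843 (the 23rd year of the Daoguang reign), Western colonial powers forced Shanghai open as a treaty port, and foreign adventurers poured in, demanding to buy land and build houses. Gong Mujiu, then the Qing government's Shanghai daotai (circuit intendant), under threats and deception from the British consul Balfour, issued the foreigners' Shanghai Land Regulations as a proclamation in his own name on November 29, 1845. From then on the British Concession became the British invaders' "state within a state" in Shanghai.
In 1849 (the 29th year of the Daoguang reign), the French consul Montigny invoked the British precedent, and the Shanghai daotai Lin Gui, yielding to colonial pressure, fixed the boundaries of the French Concession on April 6, 1849.
Since the Shanghai concessions were a "state within a state", they naturally needed armed forces and other instruments of repression: the Shanghai Volunteer Corps, naval sailors, and constables. The Volunteer Corps was the main armed force of the concessions; its chief duty was defending them, and it generally was not responsible for public order. The naval sailors were the Corps's backing. The constables were the police. At first all constables were Westerners, hence "Western constables". The French Concession's Municipal Council stipulated explicitly: "The personnel of the police station shall consist entirely of Frenchmen, or of foreigners who declare obedience to the French consulate and thereafter fall under French jurisdiction." At its peak the British Concession had 160 Western constables. They had to be paid high salaries or none could be recruited, so costs were heavy. Western constables also had various limitations. On plainclothes investigations their looks could not be disguised; when making inquiries in public places the language barrier was hard to overcome; and the concessions had secret societies that Western constables could rarely penetrate, nor could they find suitable informers. Case-solving was inefficient and public order in the concessions was worrying. In short, after the police station opened in 1854 the Western constables could cope at first. Later, as criminal cases rose with the booming population, the drawbacks of an all-Western force became more obvious, so from 1870 Chinese were allowed to serve as constables, known as Chinese constables. The change worked so well that the proportions gradually reversed: Western constables dwindled and Chinese constables multiplied. In 1883, for example, the Anglo-American International Settlement had 200 constables, of whom as many as 170 were Chinese. As more and more Chinese constables were hired, the colonial authorities feared losing control, and from 1884 they began to "import" Indian constables, the "Hongtou A San", from the British colony. The men "exported" from India were carefully selected: they had to be Sikhs, every one tall and burly with a full curly beard, intimidating at sight. But they shared the Western constables' weaknesses, so most served as patrolmen, prison guards, and traffic police. Coming from a British colony, the "Hongtou A San" strictly speaking ranked below the Shanghainese living in the concessions, but they were the loyal "watchdogs" of the British. Trading on their masters' power, they swung their batons all day and made the Shanghainese suffer; for street vendors and rickshaw pullers in particular, a blow from a baton or a boot was routine. To keep the "Hongtou A San" loyal, the colonial authorities paid them twice what Chinese constables earned, provided housing, and built a three-story Indian temple inside the Gordon Road police station of the day (after 1949, the Jiangning Road Public Security Sub-bureau). The Indian constables disappeared when the concessions ended.
The French Concession did much the same as the British one. The constables it "imported" were Annamese, that is, Vietnamese. To Shanghainese eyes their build and skin resembled those of Cantonese people, so they were given no nickname.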
===================
Chatting with ZKM, who said that anyone Indian gets called "A San". That really is how it goes.

Thursday, July 17, 2008

QQ Signatures


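March 2, 2005, 22:42
I will always be that fish, swimming in your sea. I miss that morning, that breakfast, that movie~~~~~~~~~~

Some stories aren't finished yet; then let them go
Those feelings, after all these years, are hard to tell true from false

March 24, 2005, 21:23
ILoveThreeThingsTheSunTheMoonAndYouTheSunForTheDayTheMoonForTheNightAndYouForever.

Always some baffling thing; I can no longer remember what I meant at the time.

April 20, 2005, 19:42
Spring 2005, and I am in documents with no adjectives… the gloom starts here…

I think I had to write a pile of assorted software documents. I remember having to simulate a small project and write the requirements specification and all the documents for each phase of development, right through to the final detailed design specification. Anyway, it was a miserable business at the time.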

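August 31, 2005, 14:22
You may be disappointed if you fail, but you are doomed if you don't try.

Self-encouragement.

July 3, 2006, 17:59
you can't turn back, you're in chains

What was this about? I don't remember.

August 30, 2006, 19:56
I splurged, really, really splurged...

And what was this? Still can't remember.

September 1, 2006, 20:25
Male, unmarried

September 24, 2006, 14:36
joy is like pain  it feels like a miracle

Lyrics? Seems like lyrics.

October 5, 2006, 21:21
I raise my head to gaze at the bright moon; at this moment we share it from the ends of the earth

I was probably pretty lonely around then.

October 7, 2006, 01:41
show me the meaning of being lonely

Possibly very lonely by this point.

October 30, 2006, 23:32
Whatever

No longer lonely? A turn for the better, or something else?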

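December 9, 2006, 21:40
Actually, I can swim breaststroke too

I think I had changed my screen name around then: Freestyle Frog.

January 20, 2007, 23:54
Today I suddenly thought of you

It's 2007 now; probably just before winter break.

January 20, 2007, 23:54
Choose Control Panel, then edit your profile

No idea what that meant.

January 22, 2007, 03:35
Reviewing multimedia. It's a damn closed-book exam this time!

Final exams, sure enough.

March 5, 2007, 16:06
Freestyle Frog

A screen name, I think. It comes from One Piece.

March 5, 2007, 16:54
Keep It Simple, Sweet

From a book found online. The original saying is Keep it simple, stupid.

March 14, 2007, 19:35
I hope my 2007 is as carefree as a pig's. That's basically going to be hard.

Winter break was over and the new year was properly under way, I suppose.

March 14, 2007, 19:41
j=j++;

Only a few minutes after the previous signature, which shows how hard a good QQ signature is to come by.

September 1, 2007, 16:58
MSN: wuyq101 at gmail.com

Started using a new email address and MSN.

January 2, 2008, 21:37
Finding a boyfriend for someone: a female master's student at Beihang, gentle and refined. If interested, contact zhaolp0419 at gmail.com. Serious inquiries only.

A labmate. We were always joking about it; now there is a result, just not the one anyone expected. Humans are animals that can swallow anything.
Love included.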

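May 9, 2008, 20:26
.o0┈給妳の懓o.┈"壹直﹎ヤ╭很侒静ヤ

Imitating the post-90s kids; these characters were actually produced with a "Martian script" generator. Later, at the urging of a small handful of people with ulterior motives, I changed it.
They said it was too disgusting. "Generation gap" is one hell of a term.

May 14, 2008, 18:38
We are with you

The earthquake.

May 22, 2008, 16:25
/Pray/ Light up hope

The earthquake.

June 11, 2008, 16:52
頑張れ

Watched Lost, watched Tokyo Love Story, and watched another Japanese drama whose name I forget, about a baseball player. 頑張れ (hang in there)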

Tuesday, July 15, 2008

Fuck Your Grandpa


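Hello everyone:

First, I apologize for the lab's China Netcom line being down for four days!

The provider of the lab's Netcom line is Beijing Dianxintong. The outage happened because fiber-optic cables near the Shining Building (世宁大厦) did not comply with certain Olympic regulations and were therefore deliberately cut, a great many of them and very messily, and Dianxintong's engineers cannot promise a repair time. Over these four days the lab's email, DNS resolution, and website were all affected, disrupting everyone's normal study and work. Please bear with us!

The Netcom line has still not been restored, so you cannot reach ordinary overseas websites. The lab's network egress has, however, been switched to the education network, to keep access to literature and similar resources as unaffected as possible. The lab's email, DNS resolution, and website have all been restored; please check, and contact me if anything still does not work.

I will keep following Dianxintong's repair progress and push them to fix the cable soon.

Please pass this on to one another. Thank you!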
===================================================
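The email arrived only after four days; what were they doing before that? As a user, why can my line simply be cut without a word of notice? Before these "Nuisance Games", why was there no rule saying this did not meet the standard? They keep insisting that things should not be politicized, yet what about this is not politicized? People in Beijing right now have almost no personal rights. And all day long it's "Beijing Welcomes You". Is everyone really that welcome? It doesn't seem so. Real-name registration, clearing out people from other provinces, being forced to drink from your water bottle to get into the subway. Ordinary people have never been trusted, only used. The sad part is that everyone is so willing to be used. After all the brainwashing education, it is sad to the point that nobody speaks up and nobody objects.
The only Beihang canteen whose food I could stomach has also been given over to the Olympics. At the other canteens, what with rising prices, and "with ordinary people all saying the effect on daily life is small", only one phrase describes the food: hard to swallow. If it isn't deadly salty, it looks deadly awful.
The New Main Building has been violated by the Olympics all along too. It has almost nothing to do with the Games; just because there is a weightlifting venue next door, it has to be ringed with layer upon layer of iron fencing, and you have to show ID to go in or out. It's my own university. Hold your Olympics, but why check my ID? Do you have nothing better to do? Are you sick?
And those terribly foolish Olympic volunteers (some of them, that is): why waste a summer vacation standing at the campus gate chatting and playing cards with the security guards? Wouldn't the time be better spent reading, chasing girls, playing games online, or robbing a bank?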

Monday, July 14, 2008

oooo


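Finally done. Time to relax. I've been watching The Smiling, Proud Wanderer nonstop these past few days and only squeezing in a bit of coding (which isn't right; if LF were here, LF would probably needle me about it). Just now, three minutes after the deadline, I finally committed anyway. 365$. I only started writing at 8 p.m.; before that I wanted to sleep, and I got KLW to come with me to the supermarket on the second floor of Building D to buy some snacks: coffee, green tea, Moonlight Treasure Box, that sort of thing. But the supermarket was closed, probably violated by the Olympics. Who the hell made Beihang one of the venues? I say that without a shred of pride, unlike what the newspapers and TV say about pride and the nation ##$$#$.. Bullshit. I just want to tear down those iron fences. Beijing will probably require real names for everything; buses whose numbers start with 9 already do. Fart on Tiananmen Square and you'd probably have to register your real name too. That's about the size of it.
With "Olympics-avoidance" (避运, a pun on the word for condom) catching on, I unfortunately belong to the crowd that got caught by the Games anyway. A heavy calamity. If I get the chance, I'd better escape.
The midterm review is coming up. It weighs on my mind: tedious and exasperating. I want to get it done but can't focus.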

Wednesday, July 9, 2008

Feces, Zhou Guoping, and an Eye Mask


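I just went to the toilet. Number two. I passed three pieces in all (I don't know which measure word to use, so "pieces" will do). The first was thick, dry, dark, with a greenish tinge. The second was normal, starting to turn yellow, and stank. The third was the wrap-up, with a higher water content. As a veteran youth with hemorrhoids, I finished with a single drop of blood as garnish.
-----------------------------Dividing line: back from the toilet-------------------------------------
Grossed out yet? Not my problem. Milan Kundera said that refusing shit is a kind of kitsch. So is dwelling on it deliberately. I did it on purpose.
From Zhou Guoping:
A Diary Written for Oneself
Reading Tolstoy's diary. Nine months into his marriage he wrote: "The self that I like and understand, the self that sometimes reveals itself entirely, delighting and frightening me, where is it now? I have become a small and insignificant man. I have been this way ever since I married the woman I love. Almost everything written in this notebook is lies and hypocrisy. The very thought that she is at this moment behind me, watching me write, diminishes and destroys my truthfulness... For her sake (she will read my diary) it is not that I fail to write the truth, but that from the many things I could write I choose, picking out things I would not write for myself alone."
       Exactly right, and it is my feeling too. A diary written for oneself alone is almost a necessary condition for keeping one's self truthful. Especially for someone living a family life, a certain restraint in word and deed is unavoidable, to keep the household peaceful and to spare loved ones' feelings. But to be just as restrained in a diary, in the soliloquy of the heart, is hypocrisy, and the last chance for truth is lost. This of course has nothing to do with whether one loves: even before the person one loves most, one cannot lay bare every true mood and thought at every moment, and one tends to choose to write what the other can, in the end, accept......
comment: So of course I meant what I said last night.
---------------------------------Afternoon nap-------------------------------------------
Napping, wearing a girlish eye mask.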
Sunday, July 6, 2008

Repost: http://btr.blogbus.com/logs/24080982.html


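Do you know what the most painful thing is? he asked.
It was probably a few seconds before I realized he had said the word "pain". Like the lag between lightning and thunder, the sound ran several seconds ahead of awareness.
Until then he had been silent. The thick background noise of the cha chaan teng covered his silence. Nothing abrupt. Nothing at all unreasonable. It almost made one forget that a seemingly reasonable explanation was drowning out the real reason for the silence. He just cut up Swiss chicken wings one by one with knife and fork, picking the intricate bones clean with almost unbelievable skill. He pushed the ice cubes in his iced milk tea down with the straw, waited for them to float up, and pushed them down again, now and then taking a sip, one sip. Only when he picked up the piggy bun was there a faint smile, but the smile concealed more than it showed. He was not aware of it.
I did not understand pain spoken of in a cha chaan teng. I almost wanted to ask: if you want to talk about pain, why choose a cha chaan teng? I think every place probably has a vocabulary of its own, and any word not on the list is out of place there and seems absurd when it turns up.
The sky grew darker and darker. The coming rainstorm was like a balloon being blown bigger and bigger.
It's when the rain has passed and the sky has cleared and you find that everything that happened before was nothing much, he said, picking up his own question in my silence. That is to say...
The rainstorm pounced on the streets it had long since warned. Some people broke into a theatrical run; others kept walking slowly in the rain.
That is to say, when you find that the things that once shook heaven and earth now seem not worth mentioning, when you find that the pain that kept you from eating and sleeping does not amount to a tenth of the word pain, you don't feel the relief you expected. You feel even more pain, because that subtraction, the one that returns everything to calm, seems about to cancel everything that is happening now. Do you understand?
Mm. I nodded.
Are you women like this too? His question was like a hand reaching for help.
Mm. I made the sound at a noncommittal volume. The rainstorm had stopped at some point; the ground was still wet. Cars sped back and forth with the sound of waves breaking on a beach. But if you are still in pain now, that means you are not truly at peace with it, and the past is not truly insignificant to you, is it? I said.
He went back into silence. He ate the shrimp wontons one after another, as if language would slip in the moment he paused.
Later, when we said goodbye at the door of the cha chaan teng, the sky had cleared again. That night he sent me a text message. You're probably right, he wrote, just as we all love rainbows.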