Subject: Urgently seeking a Java programmer for hire!!! QQ: 418711109
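Program description
I. Purpose: extract the useful information from a set of documents. The documents are in XML format.
II. Requirements:
1. Extract the title (the article's title), i.e. the content between <title> and </title>.
For example: <title> List of companies of Canada </title>
Here "List of companies of Canada" must be extracted.
2. Extract the article's id, i.e. the content between the first <id> and </id> that appear in the document.
For example:
<header>
<title>Default</title>
<id>8000</id>
<revision>
<id>242931647</id>
Here the id inside <header>, 8000, must be extracted because it is the first one to appear. The second id, 242931647, must not be extracted.
3. Extract the anchor (the anchor text, i.e. the link keyword).
4. Extract the link (the link target).
For example, whenever the following format appears:
<link xlink:type="simple" xlink:href="../485/7485.xml">
companies</link>
we extract the content highlighted in green (the text between the tags, here: companies) as the anchor,
and the content highlighted in yellow (the xlink:href value, here: ../485/7485.xml) as the link.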
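To make requirements 1 to 4 concrete, here is a minimal sketch of how the four fields could be pulled out with regular expressions. The class name WikiXmlExtractor and its method names are made up for illustration, and the sketch assumes that <title> and <id> carry no attributes and that anchors contain no nested tags. A real XML parser would be another option; regular expressions are used here only because they keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of the original request.
class WikiXmlExtractor {

    // Matches <link ... xlink:href="...">anchor</link>.
    // "<weblink ...>" is not matched because "link" must follow "<" directly.
    private static final Pattern LINK = Pattern.compile(
            "<link\\b[^>]*\\bxlink:href=\"([^\"]*)\"[^>]*>(.*?)</link>",
            Pattern.DOTALL | Pattern.CASE_INSENSITIVE);

    // Returns the trimmed text of the FIRST <tag>...</tag> pair, or null if none.
    // Assumption: the tag has no attributes (true for <title> and <id> in the sample).
    static String firstTagText(String xml, String tag) {
        Pattern p = Pattern.compile(
                "<" + Pattern.quote(tag) + ">(.*?)</" + Pattern.quote(tag) + ">",
                Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(xml);
        return m.find() ? m.group(1).trim() : null;
    }

    // Returns one {anchor, link} pair per <link> element, in document order.
    static List<String[]> extractLinks(String xml) {
        List<String[]> rows = new ArrayList<>();
        Matcher m = LINK.matcher(xml);
        while (m.find()) {
            rows.add(new String[] { m.group(2).trim(), m.group(1) });
        }
        return rows;
    }

    public static void main(String[] args) {
        String xml = "<header>\n<title>Default</title>\n<id>8000</id>\n"
                + "<revision>\n<id>242931647</id>\n</revision>\n</header>\n"
                + "<link xlink:type=\"simple\" xlink:href=\"../485/7485.xml\">\n"
                + "companies</link>";
        // Title and first id of the document.
        System.out.println(firstTagText(xml, "title") + "\t" + firstTagText(xml, "id"));
        // One line per anchor/link pair.
        for (String[] row : extractLinks(xml)) {
            System.out.println(row[0] + "\t" + row[1]);
        }
    }
}
```

Run on the small sample in main, this prints the pair Default / 8000 followed by the pair companies / ../485/7485.xml. Because the search for <id> stops at the first match, the revision id 242931647 is never returned.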
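III. Notes: the extracted content is not case-sensitive. All documents are stored in the folder named pages.
The path the documents are read from must not be changed.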
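As a rough sketch of the reading step, the code below collects every .xml file under the pages folder. The class name PagesReader is hypothetical. The sketch also assumes that the files may sit in subfolders (the sample link ../485/7485.xml hints at that) and that they are UTF-8 encoded; neither is stated in the request.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper class, not part of the original request.
class PagesReader {

    // Fixed input folder, relative to the working directory, as the request demands.
    static final Path PAGES_DIR = Paths.get("pages");

    // Lists all regular files ending in .xml below root, including subfolders.
    static List<Path> listXmlFiles(Path root) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase().endsWith(".xml"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    // Reads a whole file into a String. Assumption: the files are UTF-8.
    static String readFile(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (!Files.isDirectory(PAGES_DIR)) {
            System.err.println("Folder not found: " + PAGES_DIR.toAbsolutePath());
            return;
        }
        for (Path file : listXmlFiles(PAGES_DIR)) {
            System.out.println(file + " (" + readFile(file).length() + " chars)");
        }
    }
}
```

Keeping the folder name in a single constant means the read path stays fixed in one place.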
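IV. Output format: produce two tables.
First table:
title id
Default 8000
##### ####
##### #####
(The first column holds the titles of all documents; the second column holds the id that belongs to each title.)
Second table:
anchor link
companies ../485/7485.xml
###### #######
####### ########
(The first column holds all anchors; the second column holds the link that belongs to each anchor.)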
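The two tables could be produced along the following lines. This is only a sketch: the class name TableFormatter is made up, and the tab character as column separator is an assumption, chosen because titles and anchors can themselves contain spaces.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of the original request.
class TableFormatter {

    // Builds a two-column table: one header line, then one line per row.
    // Assumption: columns are separated by a tab.
    static String formatTable(String firstHeader, String secondHeader, List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        sb.append(firstHeader).append('\t').append(secondHeader).append('\n');
        for (String[] row : rows) {
            sb.append(row[0]).append('\t').append(row[1]).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // First table: one {title, id} row per document.
        List<String[]> titles = new ArrayList<>();
        titles.add(new String[] { "Default", "8000" });
        System.out.print(formatTable("title", "id", titles));

        // Second table: one {anchor, link} row per <link> element.
        List<String[]> links = new ArrayList<>();
        links.add(new String[] { "companies", "../485/7485.xml" });
        System.out.print(formatTable("anchor", "link", links));
    }
}
```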
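Below is a complete sample document. I have marked the content to be extracted in different colors:
Gray background: title
Pink background: id
Green background: anchor
Yellow background: link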
<!--
generated by CLiX/Wiki2XML [MPI-Inf, MMCI@UdS] $LastChangedRevision: 92 $ on 16.04.2009 15:24:05[mciao0825]
-->
<article>
<header>
<title>Default</title>
<id>8000</id>
<revision>
<id>242931647</id>
<timestamp>2008-10-04T09:48:59Z</timestamp>
<contributor>
<username>Cyfal</username>
<id>4637213</id>
</contributor>
</revision>
<categories>
<category>All disambiguation pages</category>
<category>Disambiguation pages</category>
</categories>
</header>
<bdy>
<p>
<b>default</b>
, as in failing to meet an obligation, may refer to:
<list>
<entry level="1" type="bullet">
<link xlink:type="simple" xlink:href="../gan/Byron_C$enter=2C_M$ichigan.xml">
Default (law)</link>
</entry>
<entry level="1" type="bullet">
<link xlink:type="simple" xlink:href="../838/58838.xml">
Default (finance)</link>
</entry>
</list>
</p>
<p>
<b>default</b>
, as a result when no action is taken, may refer to:
<list>
<entry level="1" type="bullet">
<information wordnetid="105816287" confidence="0.8">
<datum wordnetid="105816622" confidence="0.8">
<link xlink:type="simple" xlink:href="../316/957316.xml">
Default (computer science)</link>
</datum>
</information>
— also contains consumer electronics usage
</entry>
<entry level="1" type="bullet">
<link xlink:type="simple" xlink:href="../639/889639.xml">
Default logic</link>
</entry>
</list>
</p>
<p>
It may also refer to:
<list>
<entry level="1" type="bullet">
<musical_organization wordnetid="108246613" confidence="0.8">
<group wordnetid="100031264" confidence="0.8">
<link xlink:type="simple" xlink:href="../344/9159344.xml">
Default (band)</link>
</group>
</musical_organization>
, a Canadian post-grunge and alternative rock band
</entry>
<entry level="1" type="bullet">
<link xlink:type="simple" xlink:href="../734/3841734.xml">
defaults (software)</link>
, a command line utility for plist (preference) files
</entry>
</list>
</p>
<p>
<table style="background:none">
<row>
<col style="vertical-align:middle;">
<image width="30px" src="Disambig_gray.svg">
</image>
</col>
<col style="vertical-align:middle;">
<it>
This page lists articles associated with the same title. If an
<weblink xlink:type="simple" xlink:href="http://localhost:18088/wiki/index.php?title=Special:Whatlinkshere/Default&namespace=0">
internal link</weblink>
led you here, you may wish to change the link to point directly to the intended article.''
</it>
</col>
</row>
</table>
</p>
</bdy>
</article>