The code is as follows:

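/**
 * Build the content of a Word document from HTML code.
 * Creates a document that is essentially an MHT file. The function parses the content
 * and downloads the image resources referenced in the page from their remote locations.
 * This function depends on the class MhtFileMaker.
 * The function parses img tags and extracts the value of the src attribute. The src value
 * must be enclosed in quotes, otherwise it cannot be extracted.
 *
 * @param string $content      HTML content
 * @param string $absolutePath Absolute path of the web page. If the image paths in the HTML
 *                             content are relative, supply this parameter so that the function
 *                             can complete them into absolute paths. It must end with "/".
 * @param bool   $isEraseLink  Whether to remove the links from the HTML content
 * by www.jb51.net
 */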

function getWordDocument($content, $absolutePath = "", $isEraseLink = true)
{
    $mht = new MhtFileMaker();
    if ($isEraseLink)
    {
        // Remove links, keeping only the text inside each <a> element
        $content = preg_replace('/<a\s*.*?\s*>(\s*.*?\s*)<\/a>/i', '$1', $content);
    }

    $images = array();
    $files = array();
    $matches = array();
    // This algorithm requires the src attribute value to be enclosed in quotes.
    // Note: the original pattern used [.\n]*? after "<img"; inside a character class the dot
    // is literal, so the space in "<img src" never matched. [^>]*? is used here instead.
    if (preg_match_all('/<img[^>]*?src\s*?=\s*?[\"\'](.*?)[\"\'](.*?)\/>/i', $content, $matches))
    {
        $arrPath = $matches[1];
        for ($i = 0; $i < count($arrPath); $i++)
        {
            $path = $arrPath[$i];
            $imgPath = trim($path);
            if ($imgPath != "")
            {
                $files[] = $imgPath;
                if (substr($imgPath, 0, 7) == 'http://')
                {
                    // Absolute link: no prefix is added
                }
                else
                {
                    $imgPath = $absolutePath . $imgPath;
                }
                $images[] = $imgPath;
            }