[Repost] Exploring open-source projects for working with PDF documents: iTextSharp and PDFBox

mikel
C#
2010-04-17

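It has been a long time since I last wrote up my notes to share with everyone, partly because I have been busy and partly because I have been too lazy to summarize things in time. Practice is the source of experience and summarizing is the basis for improvement, so either way I should reflect on that. Today I mainly studied two PDF libraries, iTextSharp and PDFBox. My starting point was implementing search over PDF documents: I need to extract the text content of a PDF and then search it with regular-expression matching.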
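The search half of that plan does not depend on any PDF library, so it can be sketched on its own. The following is only a minimal illustration, assuming the text has already been extracted into a string by some other component; the class name PdfTextSearch and the method FindMatches are made up for this example and do not come from iTextSharp or PDFBox.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not part of iTextSharp or PDFBox.
public static class PdfTextSearch
{
    // Returns the position and value of every regex match in text that was
    // already extracted from a PDF by some other component.
    public static List<KeyValuePair<int, string>> FindMatches(string extractedText, string pattern)
    {
        var hits = new List<KeyValuePair<int, string>>();
        if (string.IsNullOrEmpty(extractedText))
        {
            return hits; // nothing was extracted, so nothing can match
        }

        foreach (Match m in Regex.Matches(extractedText, pattern, RegexOptions.IgnoreCase))
        {
            hits.Add(new KeyValuePair<int, string>(m.Index, m.Value));
        }
        return hits;
    }

    public static void Main()
    {
        // Stand-in for the text returned by a PDF text extractor.
        string text = "iTextSharp creates PDF files. PDFBox 提取 PDF 中的中文文本。";
        foreach (var hit in FindMatches(text, @"中文|pdfbox"))
        {
            Console.WriteLine("{0}: {1}", hit.Key, hit.Value);
        }
    }
}
```

Because .NET strings and regular expressions are Unicode-aware, Chinese search terms need no special handling at this stage. The hard part is getting correct Chinese text out of the PDF in the first place.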

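The file search approach described in 《类似 Windows Search的文件搜索系统》 ("A File Search System Similar to Windows Search") works well, but it cannot search Chinese text in PDFs, because the iTextSharp library it calls does not handle Chinese well: the GetPageContent() method of the PdfReader class does not return Chinese characters correctly. In my tests this was not a simple encoding problem. So I urgently needed a way to extract text from PDFs.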

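I started with iTextSharp.dll, which can be downloaded from http://sourceforge.net/projects/itextsharp/ . It comes with many simple examples of generating PDF documents (see the iTextSharp examples download). While studying them I found that Chinese content is not output. Searching online showed that the cause is a missing font. There are two ways to solve this: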

1. Point to a font file on the system yourself and create the font used in the PDF from it. See: http://unruledboy.cnblogs.com/Skins/ChinaHeart/Controls/archive/2005/08/30/225984.html

// Requires: using System; using System.IO;
//           using iTextSharp.text; using iTextSharp.text.pdf;
Document document = new Document(PageSize.A4, 50, 50, 50, 50);
try
{
    PdfWriter writer = PdfWriter.GetInstance(document, new FileStream("Chap11.pdf", FileMode.Create));

    // The next line would encrypt the generated PDF document
    //writer.SetEncryption(PdfWriter.STRENGTH40BITS,"654321", "654321", PdfWriter.AllowCopy);
    document.Open();

    // Point to the font file and create the font from it
    BaseFont baseFont = BaseFont.CreateFont(
        "C:\\WINDOWS\\FONTS\\SIMHEI.TTF",
        BaseFont.IDENTITY_H,
        BaseFont.NOT_EMBEDDED);
    iTextSharp.text.Font font = new iTextSharp.text.Font(baseFont, 9);

    // Use that font for the content being written
    document.Add(new Paragraph(" This document is Top Secret! ", font));
    document.Close();
}
catch (Exception de)
{
    // Print the message as well as the stack trace so the cause is visible
    Console.WriteLine(de.Message);
    Console.WriteLine(de.StackTrace);
}

2. Download the extension font libraries iTextAsianCmaps.dll and iTextAsian.dll from http://sourceforge.net/projects/itextsharp/ ; they add support for Asian fonts.


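/// <summary>
/// Creates a Chinese font (so that Chinese text can be output)
/// </summary>
/// <returns>A font backed by the STSong-Light CJK font</returns>
public static iTextSharp.text.Font CreateChineseFont()
{
    // Register the Asian font resources so CreateFont can find them
    BaseFont.AddToResourceSearch("iTextAsian.dll");
    BaseFont.AddToResourceSearch("iTextAsianCmaps.dll");
    BaseFont baseFT = BaseFont.CreateFont("STSong-Light", "UniGB-UCS2-H", BaseFont.EMBEDDED);

    iTextSharp.text.Font font = new iTextSharp.text.Font(baseFT);
    return font;
}

"UniGB-UCS2-H" and "UniGB-UCS2-V" are the Simplified Chinese encodings (horizontal and vertical writing, respectively). "STSong-Light" is the font name. BaseFont.EMBEDDED requests that the font be embedded in the document.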

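Next, I tried to specify a font when using iTextSharp's reader classes, but unfortunately there is no method for that. See: http://www.cnblogs.com/diction/articles/1120984.html (the text extraction there does not support Chinese). Even if there were one, it would be inflexible, because you cannot know in advance which fonts a PDF document uses, and a single PDF may use several. Further searching showed that iTextSharp's real strength is creating PDF documents.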

Need is the driving force behind study and work

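My original goal was to find a way to extract PDF content as text, so I turned to the article "How to parse PDF files", which describes how to extract text from PDF documents and the reasoning behind the whole solution. I will repost that article separately so that readers who cannot reach overseas sites can read it too. PDFBox can be downloaded from http://sourceforge.net/projects/pdfbox/files/ . After unpacking, the archive contains a lot of material, and all the required DLLs are in the Bin folder.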

"PDFBox is a Java PDF Library. This project will allow access to all of the components in a PDF document. More PDF manipulation features will be added as the project matures. This ships with a utility to take a PDF document and output a text file. "

PDFBox is an open-source Java project. It uses the open-source IKVM.NET project (http://www.ikvm.net/) so that the Java class library can be called from .NET.

IKVM.NET is an implementation of Java for Mono and the Microsoft .NET Framework. It includes the following components:

A Java Virtual Machine implemented in .NET
A .NET implementation of the Java class libraries
Tools that enable Java and .NET interoperability

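Learning IKVM.NET will be very helpful for using Java class libraries from .NET in the future. In effect, IKVM.Runtime.dll wraps the runtime environment for the Java class libraries.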

The DLLs that need to be referenced are: FontBox-0.1.0-dev.dll, IKVM.GNU.Classpath.dll, IKVM.Runtime.dll, and PDFBox-0.7.3.dll.

A PDFBox usage example follows. See: http://www.cnblogs.com/wuhenke/archive/2010/04/16/1713949.html

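// Requires: using org.pdfbox.pdmodel; using org.pdfbox.util;
private static string parseUsingPDFBox(string filename)
{
    PDDocument doc = PDDocument.load(filename);
    try
    {
        // PDFTextStripper walks the pages and returns their text
        PDFTextStripper stripper = new PDFTextStripper();
        return stripper.getText(doc);
    }
    finally
    {
        doc.close(); // release the file once the text has been read
    }
}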

PDFBox is very powerful and well worth studying properly when time allows.

References:

http://www.codeproject.com/kb/cpp/ExtractPDFText.aspx?df=100&forumid=47947

http://www.codeproject.com/KB/string/pdf2text.aspx

http://www.cnblogs.com/hardrock/

http://www.ikvm.net/