Extract metadata from pdf file using Python script

Metadata is stored in a document by the authoring application and can include the user name, comments, creation date and modification date. Today we will learn a Python script to extract metadata from a PDF file.
But before that you have to install the pypdf Python module. To do so, open a terminal and type:
pip install pypdf

pypdf offers the ability to extract document information and to split, merge, crop, encrypt and decrypt documents.

# Written for Python 2 and the legacy pyPdf API (print statement, file()).
import pyPdf
import optparse
from pyPdf import PdfFileReader

def printMeta(fileName):
    # Open the PDF in binary mode and read its document information.
    pdfFile = PdfFileReader(file(fileName, 'rb'))
    docInfo = pdfFile.getDocumentInfo()
    print '[*] PDF MetaData For: ' + str(fileName)
    # Each key is a metadata field name such as /Author or /CreationDate.
    for metaItem in docInfo:
        print '[+] ' + metaItem + ':' + docInfo[metaItem]

def main():
    parser = optparse.OptionParser('usage %prog -F <PDF file name>')
    parser.add_option('-F', dest='fileName', type='string',
                      help='specify PDF file name')

    (options, args) = parser.parse_args()
    fileName = options.fileName
    if fileName is None:
        # No -F argument was given: show the usage message and stop.
        print parser.usage
        exit(0)
    else:
        printMeta(fileName)

if __name__ == '__main__':
    main()

First we import the pyPdf module, then the optparse module. There are two functions:
(1)main
(2)printMeta

(1)main :-
The first lines define the usage message shown to the user and the -F argument used to supply the file name. Whatever file name the user supplies is saved to the fileName variable; if no file name was supplied, the script prints the usage message and stops. Note that the script only checks whether -F was given, not whether the file actually exists.

If a file name was supplied with the correct argument, main calls the second function, printMeta.
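
The argument handling described above can also be sketched with argparse, the standard-library successor to optparse. This is only a minimal sketch for Python 3, not the script's original code: the function names build_parser, resolve_file_name and main(argv) are hypothetical, and the os.path.isfile check is an added assumption, since the original script only tests whether -F was supplied.

```python
import argparse
import os


def build_parser():
    # Mirrors the original -F option; argparse builds the usage message itself.
    parser = argparse.ArgumentParser(description="Print PDF metadata")
    parser.add_argument("-F", dest="fileName", type=str,
                        help="specify PDF file name")
    return parser


def resolve_file_name(argv):
    """Return the supplied file name, or None if it is missing or not a file."""
    options = build_parser().parse_args(argv)
    if options.fileName is None:
        return None  # -F was not supplied
    if not os.path.isfile(options.fileName):
        return None  # added assumption: also reject paths that do not exist
    return options.fileName


def main(argv):
    file_name = resolve_file_name(argv)
    if file_name is None:
        # Same behaviour as the original: show usage and stop.
        build_parser().print_usage()
        return 0
    print("Would extract metadata from: " + file_name)
    return 0
```

In a real script this would be called as sys.exit(main(sys.argv[1:])), with the final print replaced by a call to the metadata function.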

(2)printMeta(fileName):

pdfFile = PdfFileReader(file(fileName, 'rb')) :- opens the PDF file in binary read mode and saves the resulting reader object to pdfFile.

docInfo = pdfFile.getDocumentInfo() :- gets the document information from the PDF file and saves it to docInfo.

print '[*] PDF MetaData For: ' + str(fileName) :- prints [*] PDF MetaData For: followed by the file name.

for metaItem in docInfo:
    print '[+] ' + metaItem + ':' + docInfo[metaItem]

The loop above prints every metadata item that was extracted from the document and saved to docInfo, one per line.
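
On Python 3, the same steps can be sketched with current pypdf releases, where PdfReader and its metadata attribute take the place of PdfFileReader and getDocumentInfo(). The sketch below assumes pypdf 3 or later is installed; the helper names format_metadata and print_meta are hypothetical and are not part of the original script.

```python
def format_metadata(file_name, doc_info):
    """Build the report lines from a mapping of metadata keys to values."""
    lines = ["[*] PDF MetaData For: " + str(file_name)]
    # doc_info may be None when the PDF has no document information.
    for key, value in (doc_info or {}).items():
        lines.append("[+] " + str(key) + ":" + str(value))
    return lines


def print_meta(file_name):
    # Imported here so format_metadata can be used without pypdf installed.
    from pypdf import PdfReader  # assumption: pypdf >= 3

    reader = PdfReader(file_name)  # PdfReader accepts a file path directly
    for line in format_metadata(file_name, reader.metadata):
        print(line)
```

Calling print_meta('filename.pdf') prints one line per metadata entry, with keys such as /Author or /CreationDate.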

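Usage of script:-

chmod +x script_name

./script_name -F filename.pdf

Running the script directly like this only works if its first line is a shebang such as #!/usr/bin/env python; otherwise run it with the interpreter, for example python script_name -F filename.pdf.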

One thought on “Extract metadata from pdf file using Python script”

  1. pdfFile = PdfFileReader(file(fileName, ‘rb’))

    gives an error :

    File “”, line 6
    pdfFile = PdfFileReader(file(fileName, ‘r’))
    ^
    SyntaxError: invalid syntax
