One of the projects I worked on was to import a large number of XML files into a PostgreSQL database (stored as XML). I chose Python to do it. Here are the steps:
Database
The data source is the SCOPUS database, which has a silly number of entries (approx. 20,000,000), most consisting of two XML files (article and citations). Publications with no citations simply don't have the citedby.xml file.
Therefore, the initial database has a single table named scopus_data with the following fields:
- id - a string of 20 characters; this is also the primary key
- article - the column where we store the article XML
- citedby - the column where we store the citations XML
The exercise was to fill up the table with data.
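For reference, here is a minimal DDL sketch of that table. The post never shows the actual schema, so the xml column type is an assumption (it is what the xpath() queries further down require):

CREATE_STRING = """
create table if not exists "scopus_data" (
    id      varchar(20) primary key,  -- article identifier
    article xml,                      -- the article XML
    citedby xml                       -- the citations XML
)
"""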
Connect
I'm connecting to the DB using psycopg2. Since this was just a test of import speed, I chose to hard-code the connection parameters:
import psycopg2

def connect():
    conn = psycopg2.connect(
        host='127.0.0.1', port=5555, database='scopus',
        user='scopus', password='scopus')
    return conn
Insertion statements
Once the connection tests were done, I built the insertion commands:
INSERT_STRING = """
insert into "scopus_data" (article, citedby, id)
values (%s, %s, %s)
"""

UPDATE_STRING = """
update "scopus_data" set article=%s, citedby=%s where id=%s
"""

UPSERT_STRING = """
WITH up_insert AS
    (UPDATE "scopus_data" SET article=%s, citedby=%s WHERE id=%s RETURNING *)
INSERT INTO "scopus_data" (article, citedby, id)
SELECT %s, %s, %s
WHERE NOT EXISTS (SELECT * FROM up_insert)
"""
Inserting a record
Once the upsert string is defined (and correct), we can design the insert function:
def insert_xml(cur, identifier, article, citedby):
    # Insert XML into record; the upsert string has six placeholders
    # (three for the UPDATE, three for the INSERT), so each parameter
    # is passed twice
    cur.execute(UPSERT_STRING,
                (article, citedby, identifier, article, citedby, identifier))
    cur.connection.commit()
This method will be quite slow because we commit after every record. A better approach is to batch several inserts together and commit once per batch.
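A minimal sketch of that idea (insert_many and BATCH_SIZE are hypothetical names, not part of the original script):

BATCH_SIZE = 1000  # assumed value; tune for your setup

def insert_many(conn, records):
    # records is an iterable of (article, citedby, identifier) tuples
    cur = conn.cursor()
    count = 0
    for record in records:
        # the upsert needs each parameter twice (update part + insert part)
        cur.execute(UPSERT_STRING, record + record)
        count += 1
        if count % BATCH_SIZE == 0:
            conn.commit()
    conn.commit()  # flush the last partial batch
    cur.close()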
Load records
To load a record from the physical medium into the DB, I defined a simple loader function:
import os
import xml.dom.minidom

def load_one_record(cursor, root):
    identifier = os.path.split(root)[-1]
    article = open(os.path.join(root, "%s.xml" % identifier)).read()
    # This is somewhat optional: pretty-print the article XML
    article_dom = xml.dom.minidom.parseString(article)
    article = article_dom.toprettyxml()
    # Load the citations, or create a default XML when the file is missing
    try:
        citations = open(os.path.join(root, "citedby.xml")).read()
    except IOError:
        citations = '<?xml version="1.0" encoding="UTF-8"?><cited-by><count>0</count></cited-by>'
    # Debug output
    print "%s - (art: %d, ref: %d)" % (identifier, len(article), len(citations))
    # Add the data to the DB
    insert_xml(cursor, identifier, article, citations)
This takes advantage of the fact that the last component of the path is the article ID. Otherwise, I'd have to look for the ID in the XML file itself (or extract it from the XML file's name).
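For illustration (the path below is hypothetical; only the leaf directory name matters):

import os

root = "/data/scopus/ABC123XYZ"  # assumed on-disk layout; the leaf is the article ID
print os.path.split(root)[-1]    # prints 'ABC123XYZ'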
To load the sample from the disk, I just did an os.walk():
import argparse

def load_scopus_sample(connection):
    """
    Load the scopus sample from the disk
    """
    parser = argparse.ArgumentParser(
        description='Load SCOPUS XML files into PostgreSQL.')
    parser.add_argument(
        '-d', '--dir', action='store', type=str, default='.',
        help='the directory (default .)')
    args = parser.parse_args()
    cursor = connection.cursor()
    # Only leaf directories (no subdirectories) hold records
    for root, dirs, files in os.walk(args.dir):
        if len(dirs) == 0:
            load_one_record(cursor, root)
    cursor.close()
Then, I wrapped everything in a main() function:
def main():
    conn = connect()
    load_scopus_sample(conn)
    # Print the number of records
    cur = conn.cursor()
    cur.execute('select count(*) from "scopus_data";')
    print cur.fetchall()
    # Perform an XPath query on the citedby column
    cur = conn.cursor()
    cur.execute(SELECT_XPATH_1)
    print cur.fetchall()
    cur.close()
    conn.close()
Bonus: XML queries through Python
PostgreSQL is almost as nice as Oracle when it comes to XML, but not nearly as quick (or so they say). Still, you can run XPath queries against the XML columns and get the results in Python just like the results of any other query. Below are a couple of examples.
The first one gets the eid of each citing document from the citedby column and presents the result as a list:
SELECT_XPATH_1 = """
select unnest(xpath('//cited-by/citing-doc/eid/text()', citedby)) from "scopus_data"
"""
The second example gets all the references from an article XML. Since the article XML uses namespaces, they must be passed to xpath():
NAMESPACES_STRING = """{
    {xocs,http://www.elsevier.com/xml/xocs/dtd},
    {ce,http://www.elsevier.com/xml/ani/common},
    {ait,http://www.elsevier.com/xml/ani/ait},
    {cto,http://www.elsevier.com/xml/cto/dtd},
    {xsi,http://www.w3.org/2001/XMLSchema-instance}
}"""

SELECT_XPATH_2 = """
select id, unnest(xpath('//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
                        article, '%s')) from "scopus_data"
""" % NAMESPACES_STRING
Conclusion
This proved to work, but it was quite slow.
PostgreSQL is not as powerful as Oracle with XML (Oracle offers, for example, XML indexing), so queries would be quite slow.
To speed up queries, the data would need to be transformed into a relational model, so the effort of loading the XML (and practically duplicating the data) is not really justified.
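To give an idea of what that transformation means, here is a sketch that shreds the citing-document eids into a separate relational table (the citations table and its columns are hypothetical):

SHRED_CITATIONS = """
insert into "citations" (id, citing_eid)
select id,
       unnest(xpath('//cited-by/citing-doc/eid/text()', citedby))::text
from "scopus_data"
"""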
I attempted to tweak/fix the loading process via:
- cursor.setinputsizes(cx_Oracle.CLOB, cx_Oracle.CLOB, 20)
- cursor.prepare(UPDATE_STATEMENT)
(Note that these are cx_Oracle idioms: psycopg2's setinputsizes() does nothing and its cursor has no prepare() method.)