A while ago I had to process XML loaded in a PostgreSQL database. The task was to extract references from a publication structure. Fortunately, PostgreSQL allows you to use XPath
in plpgsql, but there are some things you need to be aware.
First, you need to note all namespaces:
nsarray := ARRAY [
ARRAY['xocs','http://www.elsevier.com/xml/xocs/dtd'],
ARRAY['ce', 'http://www.elsevier.com/xml/ani/common'],
ARRAY['ait', 'http://www.elsevier.com/xml/ani/ait'],
ARRAY['cto', 'http://www.elsevier.com/xml/cto/dtd'],
ARRAY['xsi', 'http://www.w3.org/2001/XMLSchema-instance']
];
Then, you can use the xpath()
function:
xpath(
'//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
entry,
nsarray
)
...but that's no fun, particularly when you are almost guaranteed to get multiple hits. It's better to use:
FOR xp in SELECT * from unnest(xpath(
'//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
entry.article, nsarray)) LOOP
-- Do something with the record
--
refids := refids + 1;
END LOOP;
A full implementation looks like this:
CREATE OR REPLACE FUNCTION cs_refresh_mviews() RETURNS integer AS $$
DECLARE
entry RECORD;
xp RECORD;
count INTEGER;
refids INTEGER;
nsarray varchar array;
begin
nsarray := ARRAY [
ARRAY['xocs','http://www.elsevier.com/xml/xocs/dtd'],
ARRAY['ce', 'http://www.elsevier.com/xml/ani/common'],
ARRAY['ait', 'http://www.elsevier.com/xml/ani/ait'],
ARRAY['cto', 'http://www.elsevier.com/xml/cto/dtd'],
ARRAY['xsi', 'http://www.w3.org/2001/XMLSchema-instance']
];
count := 0;
refids := 0;
FOR entry IN SELECT * FROM "ERC_ERIS_XML_DATA" LOOP
FOR xp in SELECT * from unnest(xpath(
'//xocs:doc/xocs:item/item/bibrecord/tail/bibliography/reference/@id',
entry.article, nsarray)) LOOP
refids := refids + 1;
END LOOP;
count := count + 1;
IF (count % 100) = 0 THEN
raise notice 'ID: (%)', entry.id;
END IF;
END LOOP;
raise notice 'Rows processed: (%)', count;
raise notice 'REFS processed: (%)', refids;
RETURN 1;
end;
$$ LANGUAGE plpgsql;
select cs_refresh_mviews() as output;
It gets all XML documents stored in a table with the column name article. For each document, get the references and increment a value. At the end we'll get messages on how many items we had in the table and how many references.
For more information, go to the documentation.
Member discussion: