Here are some basic examples of how to use Golem to extract data from CML files. All of the following use a dictionary for the CASTEP code:
>>> d = golem.Dictionary("castepDict.xml")
>>> namespace = "http://www.castep.org/cml/dictionary/"
In general, to get the data associated with some term within a document:
def getTerm(doc, term):
    # Look up the dictionary entry for this term, then search the document.
    entry = d["{%s}%s" % (namespace, term)]
    data = entry.findin(doc)
    assert len(data) == 1
    return data[0].getvalue()
findin always returns a list, so we need to explicitly extract the first element of the result. Note that the list may be empty, or may contain more than one match; in either case, this function will raise an AssertionError.
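Because assert statements are skipped when Python runs with the -O flag, you may prefer an explicit check. The following is a minimal sketch of one way to do that; only is a hypothetical helper name, not part of Golem, and it works on any list:

```python
def only(results, what="term"):
    """Return the single item in results, or raise ValueError.

    Hypothetical helper (not part of Golem): it makes the
    "exactly one match" requirement explicit without using assert.
    """
    if len(results) != 1:
        raise ValueError(
            "expected exactly one match for %s, found %d" % (what, len(results))
        )
    return results[0]
```

With this in place, the lookup could be written as only(entry.findin(doc), term).getvalue(), assuming entry is the dictionary entry for the term.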
Sometimes, there are many instances of a given concept within a CML file. A typical example is unit cell dimensions: if you are examining the output of a geometry optimization, the unit cell will typically be output once for every step of the calculation, plus once at the start and once at the end.
What if you only want the final unit cell - the converged structure?
def lastcell(doc):
    ucell_d = d["{%s}ucell" % namespace]
    return ucell_d.findin(doc)[-1].getvalue()
findin always returns a list in document order - so the last element in the list will be the one you want.
Sometimes, however, you can’t rely on knowledge of document order to find the particular instance of a concept you’re looking for. Usually, this is because the location (or presence) of one concept in a document depends on the position or value of another. For example, trying to find the initial volume of the system being simulated:
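def initial_volume(doc):
    is_d = d["{%s}initialsystem" % namespace]
    vol_d = d["{%s}volume" % namespace]
    # Restrict the search for volume to the initialsystem section.
    i = is_d.findin(doc)[0]
    # findin returns a list, so take the first match before getvalue.
    return vol_d.findin(i)[0].getvalue()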
First, we find the section of the document corresponding to initialsystem; then we find volume within this, and finally get its value.
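The same "find the section, then search inside it" idea can be tried without Golem or a CASTEP output file. The sketch below uses Python's standard xml.etree.ElementTree module on a made-up fragment; the element names, the title attributes and the numbers are illustrative assumptions, not the structure of real CASTEP CML output:

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema"}

# Made-up fragment for illustration only: two sections, each with
# its own volume. Real CASTEP CML is structured differently.
FRAGMENT = """
<cml xmlns="http://www.xml-cml.org/schema">
  <module title="initial">
    <property title="volume"><scalar>100.0</scalar></property>
  </module>
  <module title="final">
    <property title="volume"><scalar>95.5</scalar></property>
  </module>
</cml>
""".strip()


def scoped_volume(root, section_title):
    """Find the named section first, then look for volume only inside it."""
    section = root.find("cml:module[@title='%s']" % section_title, NS)
    scalar = section.find("cml:property[@title='volume']/cml:scalar", NS)
    return float(scalar.text)


root = ET.fromstring(FRAGMENT)
initial = scoped_volume(root, "initial")
final = scoped_volume(root, "final")
```

Searching the whole document for volume would match both sections; scoping the search to one section picks out the value that belongs to it, whatever the document order.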
Sometimes, getvalue will not work, such as when the term one is looking up has no defined template (XSLT transform) from its CML serialization to an object the Golem library understands. In those circumstances, you will see an error message along the lines of:
>>> kpd = d["{http://www.castep.org/cml/dictionary/}kpoint"]
>>> kps = kpd.findin("/MgRh_18-08.32.31/MgRh-geomopt1.cml")
>>> kps
[<golem.EntryInstance object at 0x202ce90>]
>>> kps[0].getvalue()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.4/lib/python2.4/site-packages/golem-1.0-py2.4.egg/golem/__init__.py", line 1388, in __getattr__
    return etree._ElementTree.__getattribute__(self.tree, name)
AttributeError: 'lxml.etree._ElementTree' object has no attribute 'getvalue'
In these circumstances, you can locate the fragment of CML containing the relevant data using Golem, but you’ll have to extract the data from it yourself, as here with <kpoint>. Golem makes use of the lxml library (http://codespeak.net/lxml); the objects retrieved by findin store the XML they find as obj.tree (as lxml ElementTrees). The following has been reformatted for legibility’s sake:
>>> from lxml import etree
>>> etree.tostring(kps[0].tree)
'<kpointList xmlns="http://www.xml-cml.org/schema"
xmlns:castep="http://www.castep.org/cml/dictionary/"
xmlns:castepunits="http://www.castep.org/cml/units/"
dictRef="castep:kpoint" title="k-Point List">
<kpoint id="kpt1"
coords="4.583333333333e-1 4.583333333333e-1 4.583333333333e-1"
weight="4.629629629630e-3"/>\n [...]
</kpointList>'
So you can interrogate the data using the methods the lxml API gives you, such as XPath:
def kpoints(doc):
    kpoint_d = d["{%s}kpoint" % namespace]
    kpxml = kpoint_d.findin(doc)[0]
    kps = kpxml.tree.xpath("cml:kpoint",
                           namespaces={"cml": "http://www.xml-cml.org/schema"})
    kpoint_grid = [[[float(y) for y in x.xpath("@coords")[0].split()],
                    float(x.xpath("@weight")[0])] for x in kps]
    return kpoint_grid
We got one piece of data above which happened to be a child of initialsystem. What if we want all of initialsystem’s children? We can get that by iterating over every concept that is a <golem:childOf> initialsystem as follows:
def getInit(doc):
    v = {}
    i = d["{%s}initialsystem" % namespace]
    ikids = i.getChildren()
    for ent in ikids:
        nodes = ent.findin(doc)
        if len(nodes) == 1:  # check that there's only one bit of data
            try:
                val = ent.getvalue(nodes[0])
                v[ent.id] = val
            except AttributeError:
                # we couldn't convert this
                pass
    return v
which returns the results as a dict, v, skipping any data we were unable to convert.