* Purpose
Search the full text of articles for specific keywords and output the keywords found together with the PubMed ID and PubMed Central ID.
* How
Download the full-text XML data of articles from PubMed Central and use Python's xml.etree.ElementTree library for text mining.
* Tool
Python 2.7.3 (NOTE: Python 2.4 does not include the ElementTree library)
* Files
- all_pmcid.txt: get
- words.list: list of keywords
1. Get the full-text article as an XML file from PubMed Central.
See the previous post.
- http://bioinfomemo.blogspot.jp/2013/10/get-full-text-article-from-pubmed.html
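2. Remove all text-decoration tags
When a paragraph contains a text-decoration tag such as <bold> or <italic>, the .text attribute in xml.etree ends at that tag, so the rest of the paragraph is not read. Remove these tags before parsing.
NOTE: Validate the XML format after removing the tags (validator links follow the script).
--------------------Shell script---------------------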
#!/bin/bash
# Strip decoration and layout tags (opening and closing, with or without attributes).
# Input: xml/*   Output: del_tag/xml/*
TAGS='ext-link|xref|bold|italic|sup|p|supplementary-material|title|caption|media|sec|table-wrap|table-wrap-foot|table|label|thead|tbody|tr|th|td|fn|fig'
# The output directory must exist before redirecting into it.
mkdir -p del_tag/xml
for f in xml/*
do
    sed -E "s#</?(${TAGS})( [^>]*)?>##g" "$f" > "del_tag/$f"
done
---------------------------------------------------
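The reason for stripping inline tags can be seen in a small example. This is only a sketch written for Python 3 (the post itself used Python 2.7.3), and the sample paragraph is made up for illustration. It also shows `itertext()`, which is an alternative way to collect the text and is not what the shell script above does.

```python
from xml.etree import ElementTree

# Hypothetical one-paragraph sample, not taken from a real article.
sample = "<body><p>BRCA1 is a <italic>tumor</italic> suppressor gene.</p></body>"
p = ElementTree.fromstring(sample).find("p")

# .text holds only the text before the first child tag.
text_only = p.text
# itertext() walks the element and all of its descendants.
full_text = "".join(p.itertext())

print(repr(text_only))
print(repr(full_text))
```

Here `text_only` ends right before `<italic>`, while `full_text` contains the whole sentence.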
- http://memopad.bitter.jp/w3c/xml/xml_validator.html
- http://openlab.ring.gr.jp/k16/htmllint/htmllint.html
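As a rough local alternative to the online validators, a well-formedness check can be sketched with ElementTree itself. The function name `is_well_formed` is hypothetical, the code assumes Python 3, and it only checks that the XML parses; it does not validate against a DTD.

```python
from xml.etree import ElementTree


def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses, else (False, error message)."""
    try:
        ElementTree.fromstring(xml_text)
        return True, None
    except ElementTree.ParseError as err:
        return False, str(err)


# A tag that is opened but never closed is reported as an error.
ok, message = is_well_formed("<article><body>text</article>")
print(ok, message)
```

Running this over each file in del_tag/xml before merging would catch files that the tag removal left in a broken state.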
3. Merge the XML files into one file and reformat it for the ElementTree library.
(1) Delete the <!DOCTYPE> line and add a newline at the end of every XML file. Merge all XML files into one file, wrap it with <articleset></articleset>, and add <!DOCTYPE> as the first line.
--------------------Shell script---------------------
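#!/bin/bash
# Input: del_tag/xml/PMC*.xml   Output: all2.xml
mkdir -p ins_end
# Save the first line (<!DOCTYPE>) of the first file as the shared header.
cat del_tag/xml/PMC*.xml | head -1 > header.txt
# Delete the first line of every file and append a newline at its end.
for f in del_tag/xml/*
do
    sed -e '1d' -e '$s/$/\n/' "$f" > "ins_end/$(basename "$f")"
done
cat ins_end/* > all.xml
# Wrap all articles in a single <articleset> root element.
sed -i -e '1i\<articleset>' all.xml
sed -i -e '$s/$/<\/articleset>/' all.xml
cat header.txt all.xml > all2.xml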
---------------------------------------------------
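The same merge can also be sketched in Python. This is a minimal version under stated assumptions: Python 3, UTF-8 files, and a first line in every file that is the `<!DOCTYPE>` line, as the shell script above assumes. The function name `merge_articles` and its parameters are hypothetical.

```python
import glob
import os


def merge_articles(src_dir, out_path):
    """Concatenate the XML files in src_dir under one <articleset> root.

    Returns the number of files merged.
    """
    paths = sorted(glob.glob(os.path.join(src_dir, "*.xml")))
    if not paths:
        raise ValueError("no XML files found in " + src_dir)
    with open(out_path, "w", encoding="utf-8") as out:
        for n, path in enumerate(paths):
            with open(path, encoding="utf-8") as f:
                lines = f.read().splitlines()
            if n == 0:
                # Reuse the first line of the first file as the header.
                out.write((lines[0] if lines else "") + "\n")
                out.write("<articleset>\n")
            # Skip each file's own first line and end it with a newline.
            out.write("\n".join(lines[1:]) + "\n")
        out.write("</articleset>\n")
    return len(paths)
```

For example, `merge_articles("del_tag/xml", "all2.xml")` would write the merged file to the current directory. Keeping the output outside `src_dir` avoids picking it up as input on a second run.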
4. Text mining with Python's xml.etree.ElementTree.
----------------------Python-----------------------
### READ KEYWORDS (ONE PER LINE)
with open('words.list', 'r') as f:
    words_list = set(f.read().splitlines())

### EXTRACT PMCID, PMID AND BODY TEXT FROM XML
from xml.etree import ElementTree

XMLFILE = "all2.xml"  # merged file written by the script in step 3
tree = ElementTree.parse(XMLFILE)
root = tree.getroot()

art = []
for e in root.iter("article"):
    body = e.find('.//body')
    if body is None:
        # Skip articles that have no <body> element.
        continue
    # Step 2 removes the <p> tags, so collect all text under <body>.
    art.append({
        "pmcid": e.findtext('.//article-id[@pub-id-type="pmcid"]'),
        "pmid": e.findtext('.//article-id[@pub-id-type="pmid"]'),
        "text": "".join(body.itertext())
    })

### EXTRACT KEYWORDS FROM BODY TEXT
result = []
for a in art:
    # Split on whitespace and keep the tokens that are in the keyword list.
    for word in a['text'].split():
        if word in words_list:
            result.append({
                "pmid": a['pmid'],
                "pmcid": a['pmcid'],
                "word": word
            })
# output
# ['PMID', 'PMCID', 'Keyword']

### UNIQUE RESULT
seen = set()
uniq_result = []
for d in result:
    # Sort the items so equal dicts always produce the same tuple.
    t = tuple(sorted(d.items()))
    if t not in seen:
        seen.add(t)
        uniq_result.append(d)
---------------------------------------------------
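The script above stops once `uniq_result` is built. One possible way to save it as a tab-separated file with the PMID, PMCID and Keyword columns is sketched below. It assumes Python 3; the function name `write_results`, the output file name and the sample row are hypothetical.

```python
import csv


def write_results(rows, out_path):
    """Write dicts with 'pmid', 'pmcid' and 'word' keys as tab-separated text."""
    with open(out_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        writer.writerow(["PMID", "PMCID", "Keyword"])
        for row in rows:
            writer.writerow([row["pmid"], row["pmcid"], row["word"]])


# Hypothetical row in the same shape as the entries of uniq_result.
example_rows = [{"pmid": "123", "pmcid": "PMC456", "word": "BRCA1"}]
write_results(example_rows, "result.tsv")
```

In the real pipeline the call would be `write_results(uniq_result, "result.tsv")`.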