Top_Banner

May 23, 2018

Xpath query to fetch data from a Webpage

Don't need to be an coding expert to use Xpath query . . .!!
                to fetch data from any web page.

Internet as a source of information has penetrated too deep in our lives, so much that we always look at google for solutions to any of the issues whether related to personal or professional life. Web is flooded with plenty of resources on any topic related to human being. However, main content on any web page is either Static or Dynamic. Structured data on Static pages as are rarely modified after it published on web. Therefore, no need to get any dynamic update for such pages, use simple tools to grab the data and work forever.

However, Dynamic pages are actively maintained and update too frequently with real time data that is published in a pre-defined defined and standard page format.  Such pages and content includes :

          ¤  Temperature or Weather forecasts
          ¤  Gold or Currency rates
          ¤  Real time Stocks and Commodity Price quotes
          ¤  Daily Net Asset Values for Mutual funds.
          ¤  Social Media Profile performance such as Likes, page views, Followers, Retweets  etc .
          
I started it all to get 'Daily Net Asset Values'  for Schemes in my Mutual funds Portfolio.

Although it is possible to get Live quotes (Delayed) for NSE/BSE stock through a simple and built-in function in google sheets, but it doesn't work for mutual funds as of now. 

See it here : [ Auto-updated Equity Stocks Portfolio in Google Sheets ]

Equity stock Portfolio works well for any stock supported by Google Finance. But it is not possible to get any data other than a list of pre-defined attributes.  Another issue as state earlier, It does not support mutual funds. Therefore, after studying a lot of resources, I figured out a pre-defined function in google sheet that is an excellent way to get daily NAVs directly from AMFI official website. 

See it here :  [Auto-update ] Mutual Funds NAVs Daily from AMFI website . . . 

AMFI provides NAVs in text format. So, above method did not Fully meet the desired intent to fetch additional fund performance data for better comparison of schemes. It would be great to have past returns, risk measures, expense ratios, assets and exit loads if any. I had to go into a little deeper concepts and not so common functions in google sheets or Libre office calc. In the meantime, I need to mention it in very strong words that I never took Google sheets so seriously despite its huge capabilities and such a strong built-in functions, nothing less than any of popular spreadsheet applications in market today.

Whole stuff is based on just one pre-defined function i.e importxml
Syntax  :  [ =importxml ( " URL ", " Xpath_query ") ]

However, things are not so simple as it seems. Xpath_query  is the critical part in the above syntax. Let see it details.

What is Xpath ?
Xpath stands for XML Path Language.
Most of us have a very good understanding of a Folder Tree or  Root Directory Structure in any of conventional file system such as windows explorer. Xpath uses same analogy of Parent-Child relationships and folder path structure to navigate to a specific file or sub-folders. 

Have a look at a basic XML document structure,


XML documents too has a "path like" syntax to identify and navigate nodes.  
[elements] are the building blocks of any xml document. These so arranged in an hierarchy that includes multiple nodes as well as sub-elements, to create a structured data-set on a landing page. It is very easy to notice common elements such as <div><span> . As you explore that complete code, some other common elements are <table>, <tbody>, <tr> , <td> , <a>, <li>, <ul> etc. 

[attributes] are like <tag>  or custom names assigned to independent nodes, It helps to directly jump over a specific node or element by-passing all elements from the beginning to the target element in a xml document. It is same like calling someone by name.  These two attributes are  too common <class>, <id> than others.

Syntax :         "// <#element> [ @ <#attribute> = '<#name>' ]" 

Highlighted part is termed as Predicate. It is used to find a specific node or node that contains a specific value. for example, Xpath_query to fetch  "42" in above image :

                       "// div [ @class = 'today_nowcard-temp']/span"

Here, <span> is the target element that contains the desired value. Predicate part would direct look for the class value equal to defined <#name>
if you closely look at above syntax,  "//. . . ./ . . " doesn't it look like a folder path ??


Simple Method to get a ready made Xpath query?
It requires some practice, little patience and a number of trials to master the coding  for xpath query using above Syntax method. But once you do it,  It feels like a pro in Google sheets than ever before. 

Cheers!! We have another simple way to get Xpath query
if you are not comfortable enough to generate a manual syntax at you own. 



Have you even noticed,  an option inspect ] given at extreme  bottom, if you right click at "selected data" on a page in google chrome. It will split the screen into two parts.  Bottom part would have the selected data highlighted in xml document. Here, It is required to inspect the highlighted element structure to generate a Xpath query.

        Or

Proceed to next image for the simple version,



Right click on the Highlighted target element in the bottom window, Select 'Copy XPath' as shown in image above.



Lets move on how to use this Xpath query with two example videos.

● Auto-Update Portfolios with Stock Quotes and NAVs ● 



● Fetch Latest Temperature using Xpath query in Google Sheets ● 





        Resources :              

1 comment:

  1. Hi,
    very interesting guides, but I still can't load data from this site :
    www.tennis24.com
    https://www.sofascore.com/it/calcio/livescore


    I want to extract match list of a day
    Can you help me with an example?

    ReplyDelete