LISTSERV 16.5 - CODE4LIB Archives

Thank you all.
I was able to use Pandas and Plotly's Dash to develop the ngram viewer.
Eric's suggestion was apt and helped with the transposed arrangement of the data.
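
For anyone reading the archive later, here is a minimal sketch of what the transposed arrangement could look like in pandas. It assumes the table from the original question is loaded as a wide DataFrame; the `trend` helper is a hypothetical name used only for illustration, and the Dash layout and callback are left out. It is not necessarily the exact code used in the viewer.

```python
import pandas as pd

# Sample rows copied from the original question: one row per n-gram,
# one column per year (wide layout).
wide = pd.DataFrame(
    {
        "ngrams": ["library", "technology"],
        "2009": [22, 3],
        "2010": [3, 4],
        "2011": [32, 43],
        "2012": [32, 32],
        "2013": [35, 30],
        "2014": [21, 43],
        "2015": [21, 32],
    }
)

# Transpose so that each year becomes a row and each n-gram a column.
# This shape is convenient for plotting one line per search term.
by_year = wide.set_index("ngrams").T
by_year.index = by_year.index.astype(int)
by_year.index.name = "year"


def trend(term: str) -> pd.Series:
    """Return the year-wise counts for one term (hypothetical helper)."""
    if term not in by_year.columns:
        raise KeyError(f"{term!r} is not in the n-gram table")
    return by_year[term]


print(trend("library"))
```

A search box in a front end would then only need to pass the typed keyword to something like `trend` and plot the returned series against its index.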

Owen's experience matches my approach too. Thanks for sharing your
experiences with me.

Thank you all

On Fri, Nov 8, 2019, 3:27 PM Owen Stephens <[log in to unmask]> wrote:

> I was involved in some work on calculating and visualising this kind of
> word/phrase frequency on a project a few years ago - this was based on
> queries to a corpus indexed in ElasticSearch, and charting words/phrase
> frequency against each other using javascript - see
> https://ukmhl.historicaltexts.jisc.ac.uk/ngram for an example - but
> although it looks superficially similar it isn't anywhere close to as
> sophisticated as the Google n-gram viewer.
>
> You probably already know, but I think it's worth stating, that the
> Google n-gram viewer (https://books.google.com/ngrams) is not simply
> visualising the frequency of word/phrase occurrences, but the frequency
> as a percentage of the frequency of all n-grams of the same size in the
> corpus https://books.google.com/ngrams/info. The Google n-gram viewer
> goes well beyond this as well, supporting ways of being more specific
> (e.g. you can limit by part of speech, and by language of the texts
> analysed). This suggests a sophisticated linguistic parsing of a large
> corpus with the ability to answer complex questions quickly at scale -
> something we weren't able to do in our project.
>
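
The normalisation Owen describes above (occurrences of a phrase as a percentage of all n-grams of the same size in that year) can be sketched roughly as follows. The toy corpus, the whitespace tokenisation and the function names are all assumptions made for illustration; this is not how Google implements it.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of size n in a token list, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def relative_frequency(texts_by_year, phrase):
    """Percentage of same-sized n-grams per year that equal the phrase."""
    phrase = phrase.lower()
    n = len(phrase.split())
    result = {}
    for year, texts in texts_by_year.items():
        counts = Counter()
        for text in texts:
            # Naive tokenisation: lower-case and split on whitespace.
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = 100.0 * counts[phrase] / total if total else 0.0
    return result


# Invented toy corpus, keyed by publication year.
toy_corpus = {
    1900: ["phthisis is common", "the phthisis ward"],
    1920: ["tuberculosis is common", "the tuberculosis ward is full"],
}

print(relative_frequency(toy_corpus, "phthisis"))
print(relative_frequency(toy_corpus, "tuberculosis"))
```

The denominator is the total number of n-grams of the same size in that year, so years with more or longer texts do not automatically score higher.
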
> In the project I was involved in, we are simply showing the percentage
> of texts in the corpus in which a word appears, not the frequency as a
> percentage of all same sized n-grams - which means our viewer is more
> about reflecting general book topics than it is about linguistic
> analysis. You can also see issues with the measurement at either end of
> the graph where there seem to be spikes in usage - but this actually
> reflects that the collection simply lacks large numbers of texts from
> those years, which means a term only has to appear in a few books to get
> a high percentage. Despite this, the tool is still useful (IMO) within
> those constraints - in the example search it is possible to see that the
> term "tuberculosis" rises in frequency, while the term "phthisis" (for
> the same condition) drops off - a trend also shown by Google n-gram
> viewer:
>
> https://books.google.com/ngrams/graph?content=tuberculosis%2Cphthisis&year_start=1800&year_end=1930&corpus=15&smoothing=3&share=&direct_url=t1%3B%2Ctuberculosis%3B%2Cc0%3B.t1%3B%2Cphthisis%3B%2Cc0
>
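
The simpler measure Owen describes (percentage of texts in which a word appears) and the spikes at the sparse ends of the graph can be illustrated with a small sketch. The toy data, the `min_texts` threshold and the function names are assumptions for illustration only, not the project's actual code.

```python
def percent_of_texts(texts_by_year, term):
    """Map each year to (percentage of texts containing term, number of texts)."""
    term = term.lower()
    shares = {}
    for year, texts in texts_by_year.items():
        # A text counts once, however many times the term occurs in it.
        hits = sum(1 for text in texts if term in text.lower().split())
        share = 100.0 * hits / len(texts) if texts else 0.0
        shares[year] = (share, len(texts))
    return shares


def sparse_years(shares, min_texts=50):
    """Years whose percentage rests on fewer than min_texts texts."""
    return sorted(year for year, (_, size) in shares.items() if size < min_texts)


# Invented toy data: 1800 has only two texts, 1850 has two hundred.
toy_corpus = {
    1800: ["a case of phthisis", "notes on botany"],
    1850: ["a case of phthisis"] * 20 + ["notes on botany"] * 180,
}

shares = percent_of_texts(toy_corpus, "phthisis")
print(shares)
print(sparse_years(shares))
```

In this toy data a single book is enough to give 1800 a much higher percentage than 1850, which is the kind of spike described above; reporting the number of texts next to each percentage makes such years easy to spot.
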
> Finally if you want to do more sophisticated analysis it may be worth
> looking at specialist tools - e.g. AntConc
> http://www.laurenceanthony.net/software/antconc/
>
> Hope some of that is helpful
>
> Owen
>
>
> Fitchett, Deborah wrote on 08/11/2019 01:10:
> > You might be interested in Chart.js (https://www.chartjs.org/) - it
> > does the visualisation part, if you could do the search part.
> >
> > Deborah
> >
> > -----Original Message-----
> > From: Code for Libraries <[log in to unmask]> On Behalf Of Vinit Kumar
> > Sent: Thursday, 7 November 2019 6:17 PM
> > To: [log in to unmask]
> > Subject: [CODE4LIB] n-gram visualisation
> >
> > Dear Code4Libers,
> >
> > I have data with the following structure:
> > ngrams      2009  2010  2011  2012  2013  2014  2015
> > library       22     3    32    32    35    21    21
> > technology     3     4    43    32    30    43    32
> > and so on
> >
> > Is it possible to visualise this data in a similar manner to the Google
> > N-gram viewer, where one can put a keyword in a search bar and the
> > visualisation displays the year-wise trend of that keyword in the
> > corpus, based on the data structured as above?
> > Any pointers or tools would be of help.
> > Thanking you in anticipation.
> >
> >
> > --
> > Regards
> > Vinit Kumar, Ph.D.
> > Assistant Professor,
> > Department of Library and Information Science
> > Babasaheb Bhimrao Ambedkar University, Rae Bareilly Road, Lucknow, India 226025
> > +919454120174
> >
> >
>