Old-fashioned Way About Data Science

The usual Pythonic way to get a data series is sometimes not the fastest. In some cases, 3-6 Linux CLI tools are enough to fetch the necessary data from the Web and parse it properly. Let’s use this approach to count the number of RFCs released per year and draw the graph in a terminal.

Stat over RFCs

The important question at the start is always “Where can I get the data source?”. Fortunately, in 2024 you can find an organized list of all RFCs 1.

Outdated implementation, data source is not available anymore

At first glance it seems that we need to parse HTML or XML data, but we don't. Actually, we need to parse simple text/plain data, which should be in the same format for all entries. Here comes our first helper: [w3m](http://w3m.sourceforge.net/) [2](#f2), a CLI text-based browser.

```bash
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | head -50
RFC Index
... avoided for brevity ...
See the RFC Editor Web page for more information.
RFC Index
Num Information
0001 Host Software S. Crocker [ April 1969 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0001)
0002 Host software B. Duvall [ April 1969 ] (TXT, PDF, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0002)
0003 Documentation conventions S.D. Crocker [ April 1969 ] (TXT, HTML) (Obsoleted-By RFC0010) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0003)
```

That's how we find strictly formed entries as follows:

> `NUMBER DESCRIPTION [ DATE ] OTHER_TECHNICAL_INFO`

and we can extract the date information using [awk](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/awk.html) [3](#f3), a DSL designed for text processing. I prepared a test file from a small fragment of the collected data:

```bash
~> cat test-entries
0506 FTP command naming problem M.A. Padlipsky [ June 1973 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0506)
0507 Not Issued
0508 Real-time data transmission on the ARPANET L. Pfeifer, J. McAfee [ May 1973 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0508)
0509 Traffic statistics (April 1973) A.M. McKenzie [ April 1973 ] (TXT, HTML) (Status: UNKNOWN) (Stream: Legacy) (DOI: 10.17487/RFC0509)
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries
June 1973
May 1973
April 1973
```

Please note that `^[0-9][0-9][0-9][0-9]` is used instead of `^[0-9]{4}` because [mawk](https://invisible-island.net/mawk/) [4](#f4) [doesn't support](https://github.com/ThomasDickey/original-mawk/issues/25) [5](#f5) repetition.

Now that we are able to extract human-readable dates, let's convert them to numeric format with the [date](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/date.html) [6](#f6) utility, which [xargs](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/xargs.html) [7](#f7) runs in an implicit loop, once per input line:

```bash
~> # Years extraction
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%Y"
1973
1973
1973
~> # Months extraction
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m"
06
05
04
```

The next simple step is sorting (the [sort](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html) [8](#f8) tool) and counting unique values (the [uniq](https://pubs.opengroup.org/onlinepubs/9699919799/utilities/uniq.html) [9](#f9) tool):

```bash
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%Y" | sort -n | uniq -c
      3 1973
~> awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' test-entries | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m" | sort -n | uniq -c
      1 04
      1 05
      1 06
```

We're ready to put it all together over pipes:

```bash
~> w3m -cols 1024 -dump https://www.rfc-editor.org/rfc-index.html | awk -F'[\\[|\\]]' '/^[0-9][0-9][0-9][0-9]/ && !/Not Issued/{print $2}' | xargs -I {} env TZ=Europe/London date -d'01{}' +"%Y" | sort -n | uniq -c
      1 1968
     25 1969
     58 1970
    182 1971
    134 1972
    162 1973
     60 1974
     24 1975
     11 1976
     20 1977
      8 1978
      7 1979
     17 1980
     29 1981
     37 1982
     49 1983
     39 1984
     41 1985
...
```


Over the years, the old, easily parsable HTML-to-text RFC index format has disappeared, and the only practical way to gather all issued RFCs is now the XML-based index 14. In some ways this is even a bit easier, since XML is a structured, stable, and mature machine-to-machine format.

The classic helper for parsing XML on the command line is xmllint 15. Let’s fetch the XML-based index with curl 16 and show the first 5 lines of the document:

~> curl -s https://www.rfc-editor.org/rfc-index.xml | head -5
<?xml version="1.0" encoding="UTF-8"?>
<rfc-index xmlns="https://www.rfc-editor.org/rfc-index"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:schemaLocation="https://www.rfc-editor.org/rfc-index
                               https://www.rfc-editor.org/rfc-index.xsd">

First of all, we need to analyze the XML format and find where the required data is located. A short investigation shows that the path is the following:

rfc-index
└── rfc-entry
    └── date
        ├── day
        ├── month
        └── year

The day field is optional.

It’s also important to note the custom XML namespace defined in the document’s 2nd line. Because it is a default namespace, unprefixed XPath 17 expressions such as //rfc-entry will not match any element in xmllint. To avoid overengineering by fighting the XML namespace, it is easier to remove it altogether:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | head -5
<?xml version="1.0" encoding="UTF-8"?>
<rfc-index >
  <bcp-entry>
    <doc-id>BCP0001</doc-id>
  </bcp-entry>

Looks much better, and now is the time to try to extract all the years:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c
      1 1968
     25 1969
     58 1970
    182 1971
    134 1972
    162 1973
     60 1974
     24 1975
     11 1976
     20 1977
      8 1978
      7 1979
     17 1980
     29 1981
     37 1982
     49 1983
     39 1984
     41 1985
...
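
An alternative to stripping the namespace is to match elements by local-name() in the XPath expression, which compares names without their namespace. The sketch below is not part of the pipeline above: it runs against a small hand-made sample, and the `count_years` helper name is an assumption made for illustration. It only counts the matching nodes, since the way xmllint separates several text() results can differ between libxml2 versions.

```shell
#!/bin/sh
# Hand-made sample that mimics the namespaced layout of rfc-index.xml
# (the entries are illustrative, not the real index).
sample=$(mktemp)
cat > "$sample" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<rfc-index xmlns="https://www.rfc-editor.org/rfc-index">
  <rfc-entry>
    <doc-id>RFC0001</doc-id>
    <date><month>April</month><year>1969</year></date>
  </rfc-entry>
  <rfc-entry>
    <doc-id>RFC0506</doc-id>
    <date><month>June</month><year>1973</year></date>
  </rfc-entry>
</rfc-index>
EOF

# local-name() compares the element name without its namespace,
# so the xmlns attribute can stay in the document.
count_years() {
    xmllint --xpath "count(//*[local-name()='rfc-entry']/*[local-name()='date']/*[local-name()='year'])" "$1"
}

count_years "$sample"
```

The expressions get longer, so for a one-off pipeline the sed approach shown above is arguably still the more readable choice.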

Drawing in terminal

Bare numbers are not comfortable to analyze, so we’ll use the gnuplot 10 utility to draw graphs with the following configuration 11:

gnuplot -e "set term dumb size 170, 35; set xtics 3; plot '-' with lines notitle"

It’ll read the STDIN stream and draw a 170x35 graph right in the terminal. Putting it all together one more time:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c | gnuplot -e "set term dumb size 150, 35; set xtics 3; plot '-' with lines notitle"
line 57: warning: Too many axis ticks requested (>2e+02)


  2030 +------------------------------------------------------------------------------------------------------------------------------------------+
       |                                                                                                                                          |
       |                                                                                                                                          |
       |                            ***************************                                                                                   |
  2020 |-+                                                     ******************                                                               +-|
       |                                                      *****************                                                                   |
       |                                                                       ***********************                                            |
       |                                                                                          *********                                       |
       |                                                                                  ************************************                    |
  2010 |-+                                                                                    ***************************                       +-|
       |                                                                                       *******************************                    |
       |                                                                                           ***********************************************|
       |                                                             ******************************                                               |
  2000 |-+                                                       ***************************                                                    +-|
       |                                                                ****************                                                          |
       |                                       *************************                                                                          |
       |                                        ****************                                                                                  |
       |                      ******************                                                                                                  |
  1990 |-+            ********                                                                                                                  +-|
       |      ********                                                                                                                            |
       |         ****                                                                                                                             |
       |         ******                                                                                                                           |
  1980 |-+ ******                                                                                                                               +-|
       | *****                                                                                                                                    |
       |  **********                                                                                                                              |
       |            *************************************                                                                                         |
       |                                    *******************                                                                                   |
  1970 |************************************                                                                                                    +-|
       |                                                                                                                                          |
       |                                                                                                                                          |
       |                                                                                                                                          |
  1960 +------------------------------------------------------------------------------------------------------------------------------------------+

As you can see, something went wrong. gnuplot treats the first column as the X-axis and the second as the Y-axis, while uniq -c prints the count first and the year second. We need to swap the columns:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c | awk '{print $2" "$1}'
1968 1
1969 25
1970 58
1971 182
1972 134
1973 162
1974 60
1975 24
1976 11
1977 20
1978 8
1979 7
1980 17
1981 29
1982 37
1983 49
1984 39
1985 41
...
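
The count-and-swap step can also be sketched as a single awk pass that emits "year count" directly. This is only an alternative under the assumption that the input is one year per line; the `count_by_year` name and the sample input are made up for illustration.

```shell
#!/bin/sh
# Count occurrences of each input line (a year) in one awk pass and
# print "year count", i.e. already in the X-then-Y order gnuplot reads.
# awk's "for (y in seen)" order is unspecified, hence the final sort.
count_by_year() {
    awk '{ seen[$1]++ } END { for (y in seen) print y, seen[y] }' | sort -n
}

# Made-up sample input: one year per line, unsorted.
printf '%s\n' 1973 1969 1973 1970 1973 | count_by_year
```

Here the sort runs on the aggregated rows instead of on every input line, which is a small saving on an index with thousands of entries.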

Now rerun the plot with the swapped columns:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 150, 35; set xtics 3; plot '-' with lines notitle"


  500 +-------------------------------------------------------------------------------------------------------------------------------------------+
      |      +       +      +      +       +      +       +      +      +       +      +      +       +      +       +      +      +       +      |
      |                                                                                                                                           |
  450 |-+                                                                                          *                                            +-|
      |                                                                                            *                                              |
      |                                                                                            **                                             |
  400 |-+                                                                                         * *                                           +-|
      |                                                                                           *  *         **                                 |
      |                                                                                           *  *       **  *                                |
  350 |-+                                                                                         *  *       *   *                              +-|
      |                                                                                          *    *     *     *                               |
      |                                                                                          *    *     *      *   *                          |
  300 |-+                                                                                       *      *   *        * * *****                   +-|
      |                                                                                        *        ****        * *      *                    |
      |                                                                             **        *                      *       *                    |
  250 |-+                                                                        *** *       *                                *                 +-|
      |                                                                         *     *    **                                  *        *         |
      |                                                                        *      *   *                                     *      * *        |
      |                                                                       *        ***                                       *    *   *       |
  200 |-+                                                                   **         *                                          * **     *    +-|
      |      *                                                     ****    *                                                       *        **    |
      |      **   *                                                *   *  *                                                                   *   |
  150 |-+   *  * * *                                              *    * *                                                                    * +-|
      |     *   *  *                                              *     *                                                                      *  |
      |     *       *                                            *                                                                             *  |
  100 |-+   *       *                                         ****                                                                              *-|
      |    *         *                                       *                                                                                    |
      |    *         *                                     **                                                                                     |
   50 |-+**           *                   **   **    ******                                                                                     +-|
      | *             *              *****  ***  * **                                                                                             |
      |*     +       + *******    ***      +      *       +      +      +       +      +      +       +      +       +      +      +       +      |
    0 +-------------------------------------------------------------------------------------------------------------------------------------------+
     1968   1971    1974   1977   1980    1983   1986    1989   1992   1995    1998   2001   2004    2007   2010    2013   2016   2019    2022   2025

Now it looks correct.
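
Where gnuplot is not installed, a rough horizontal bar chart can be approximated with awk alone. This is a fallback sketch, not a replacement for the plot above: the `bars` helper, its optional width argument, and the `#` bar character are all assumptions chosen for illustration.

```shell
#!/bin/sh
# Read "label value" pairs and print one bar per line, scaled so that
# the largest value gets `width` characters (default 40, an arbitrary choice).
bars() {
    awk -v width="${1:-40}" '
        { label[NR] = $1; value[NR] = $2 + 0; if (value[NR] > max) max = value[NR] }
        END {
            for (i = 1; i <= NR; i++) {
                n = (max > 0) ? int(value[i] * width / max) : 0
                bar = ""
                for (j = 0; j < n; j++) bar = bar "#"
                printf "%s %5d %s\n", label[i], value[i], bar
            }
        }'
}

# The first rows of the year/count table from the article.
printf '%s\n' '1968 1' '1969 25' '1970 58' '1971 182' | bars
```

Appending `| bars` to any pipeline stage that prints "year count" pairs gives a quick visual check before reaching for gnuplot.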

Drawing per-month statistics requires an extra transformation step. We need to convert the month names:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/month/text()" - | sort -n | uniq -c | awk '{print $2" "$1}'
April 848
August 860
December 630
February 802
January 828
July 696
June 847
March 878
May 796
November 659
October 820
September 751

to numeric format with the date 6 utility, which xargs 7 invokes once per input line:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/month/text()" - | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m" | sort -n | uniq -c | awk '{print $2" "$1}'
01 828
02 802
03 878
04 848
05 796
06 847
07 696
08 860
09 751
10 820
11 659
12 630
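
Because xargs -I starts one date process per input line, this step is the slowest part of the pipeline on several thousand entries. As a possible alternative (not what the pipeline above uses), the month names can be mapped to numbers inside a single awk process. The `month_to_number` helper name and the sample input are illustrative assumptions, and the sketch only handles full English month names.

```shell
#!/bin/sh
# Map full English month names to zero-padded numbers in one awk process.
month_to_number() {
    awk '
        BEGIN {
            n = split("January February March April May June July August September October November December", names, " ")
            for (i = 1; i <= n; i++) num[names[i]] = i
        }
        # Lines that are not a known month name are skipped.
        $1 in num { printf "%02d\n", num[$1] }'
}

# Sample input: one month name per line.
printf '%s\n' June May April | month_to_number
```

The function is meant to sit where the `xargs ... date` stage is, followed by the same `sort -n | uniq -c` steps.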

The monthly statistics show the expected dips in July and November/December. The most productive months are March (878) and August (860), with April and June close behind:

~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/month/text()" - | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m" | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 150, 35; set xtics 1; plot '-' with lines notitle"


  900 +-------------------------------------------------------------------------------------------------------------------------------------------+
      |            +           +            +            +            +           +            +            +            +           +            |
      |                        ***                                                                                                                |
      |                      **   ****                                                                                                            |
      |                     *         ****                                                     *                                                  |
  850 |-+                 **              ****                       **                       * *                                               +-|
      |                  *                    **                   **  *                     *   *                                                |
      |**              **                       **               **     *                    *    **                                              |
      |  ****         *                           **           **       *                   *       *                    *                        |
      |      ****   **                              **       **          *                 *         *                 ** *                       |
  800 |-+        ***                                  **   **             *               *           *              **   *                     +-|
      |                                                 ***                *              *            *           **      *                      |
      |                                                                    *             *              **       **         *                     |
      |                                                                     *           *                 *    **            *                    |
      |                                                                      *         *                   * **              *                    |
  750 |-+                                                                     *        *                    *                 *                 +-|
      |                                                                       *       *                                        *                  |
      |                                                                        *     *                                         *                  |
      |                                                                         *   *                                           *                 |
      |                                                                          *  *                                            *                |
      |                                                                          * *                                             *                |
  700 |-+                                                                         *                                               *             +-|
      |                                                                                                                            *              |
      |                                                                                                                             *             |
      |                                                                                                                             *             |
      |                                                                                                                              ***          |
  650 |-+                                                                                                                               ****    +-|
      |                                                                                                                                     ****  |
      |                                                                                                                                         **|
      |                                                                                                                                           |
      |            +           +            +            +            +           +            +            +            +           +            |
  600 +-------------------------------------------------------------------------------------------------------------------------------------------+
      1            2           3            4            5            6           7            8            9            10          11           12

Moreover, we can plot the same graph for IETF 12 Internet-Drafts 13 to discover how rapidly their numbers are growing:

~> curl -s https://mirror.funkfreundelandshut.de/ietf/internet-drafts/all_id.txt | awk '/^draft/{print $2}' | awk -F- '!/RFC/{print $1}' | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 150, 35; set xtics 2; set ytics 20; plot '-' notitle smooth csplines"


  2400 +------------------------------------------------------------------------------------------------------------------------------------------+
  2360 |-+     +      +       +       +       +      +       +       +       +      +       +       +      +       +       +       +      +     +-|
  2280 |-+                                                                                                                                      +-|
  2200 |-+                                                                                                                                      +-|
  2120 |-+                                                                                                                                      +*|
  2040 |-+                                                                                                                                      +*|
  1960 |-+                                                                                                                                      +*|
  1880 |-+                                                                                                                                      +*|
  1800 |-+                                                                                                ***                                   *-|
  1740 |-+                                                                                             ***   ***                                *-|
  1660 |-+                                                                                    **     **         **                             *+-|
  1580 |-+                                               *******                           ***  *****             **                           *+-|
  1500 |-+                                               *      *                         *                         *                          *+-|
  1420 |-+                                              *        ****   *******         **                           *                         *+-|
  1340 |-+                                              *            ***       **     **                              **                      * +-|
  1260 |-+                                            **                         *****                                  *******               * +-|
  1180 |-+                                           *                                                                         *              * +-|
  1120 |-+                                         **                                                                           *          ***  +-|
  1040 |-+                                       **                                                                              ***     **     +-|
   960 |-+                                     **                                                                                   *****       +-|
   880 |-+                                    *                                                                                                 +-|
   800 |-+                                   *                                                                                                  +-|
   720 |-+                                ***                                                                                                   +-|
   640 |-+                               *                                                                                                      +-|
   580 |-+                             **                                                                                                       +-|
   500 |-+                          ***                                                                                                         +-|
   420 |-+                         *                                                                                                            +-|
   340 |-+                    *****                                                                                                             +-|
   260 |-+                ****                                                                                                                  +-|
   180 |-+           *****                                                                                                                      +-|
   100 |-+     ****** +       +       +       +      +       +       +       +      +       +       +      +       +       +       +      +     +-|
    20 +------------------------------------------------------------------------------------------------------------------------------------------+
      1988    1990   1992    1994    1996    1998   2000    2002    2004    2006   2008    2010    2012   2014    2016    2018    2020   2022    2024

References

1. RFC Editor
2. Text-based web browser w3m
3. “Aho, Weinberger and Kernighan” domain-specific language
4. mawk - awk implementation originally written by Mike Brennan
5. Built-in regex’s do not support brace-expressions
6. date - write the date and time
7. xargs - construct argument lists and invoke utility
8. sort - sort, merge, or sequence check text files
9. uniq - report or filter out repeated lines
10. gnuplot - portable command-line driven graphing utility
11. gnuplot documentation
12. Internet Engineering Task Force
13. Internet-Drafts
14. XML-based RFCs index
15. xmllint - command line XML tool
16. curl - command line tool and library for transferring data with URL syntax
17. Evaluate XPath in the Linux Command Line