The usual Pythonic way to get a data series is not always the fastest. In some cases, three to six Linux CLI tools are enough to fetch the necessary data from the Web and parse it properly. Let’s use this approach to count the number of RFCs released per year and draw the graph in a terminal.
The important question at the start is always “Where can I get the data?”. Fortunately, in 2024 you can find an organized list of all RFCs [1].
Over the years, the old, easily parsable HTML-to-TXT-based RFC index format has gone, and the only practical way to gather all issued RFCs is now an XML-based index [14]. In some ways this is even a bit easier, since XML is a structured, stable, and mature M2M format.
The classic helper for parsing XML on the command line is xmllint [15]. Let’s fetch the XML-based index with curl [16] and show the first 5 lines of the document:
~> curl -s https://www.rfc-editor.org/rfc-index.xml | head -5
<?xml version="1.0" encoding="UTF-8"?>
<rfc-index xmlns="https://www.rfc-editor.org/rfc-index"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="https://www.rfc-editor.org/rfc-index
https://www.rfc-editor.org/rfc-index.xsd">
First of all, we need to analyze the XML format and find where the required data is located. After a short investigation, the path turns out to be the following:
rfc-index
└── rfc-entry
    └── date
        ├── day
        ├── month
        └── year
Here, day is an optional field.
It’s also important to note the custom XML namespace defined on the document’s 2nd line: it prevents xmllint from matching elements with plain XPath [17] expressions. To avoid overengineering by fighting the XML namespace, it is easier to remove it altogether:
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | head -5
<?xml version="1.0" encoding="UTF-8"?>
<rfc-index >
<bcp-entry>
<doc-id>BCP0001</doc-id>
</bcp-entry>
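The sed expression above is greedy: it removes everything from the first xmlns=" to the last quote on line 2, so the xsi attributes disappear as well. As a minimal sketch of a narrower variant, assuming the root element sits on line 2 as in the formatted output, the following removes only the default namespace. The helper name strip_default_ns and the sample document are illustrative, not part of the original pipeline.

```shell
# Invented sample shaped like the formatted index (root element on line 2).
sample='<?xml version="1.0" encoding="UTF-8"?>
<rfc-index xmlns="https://www.rfc-editor.org/rfc-index" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <rfc-entry><date><year>1969</year></date></rfc-entry>
</rfc-index>'

# strip_default_ns (hypothetical helper): remove only the default
# xmlns="..." attribute on line 2; [^"]* stops at the first closing quote,
# so the xmlns:xsi declaration is left untouched.
strip_default_ns() {
  sed '2 s/ xmlns="[^"]*"//'
}

printf '%s\n' "$sample" | strip_default_ns
```

Another option, not shown here, is to keep the namespace and match elements by name with the XPath 1.0 local-name() function.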
Looks much better, and now is the time to try to extract all the years:
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c
1 1968
25 1969
58 1970
182 1971
134 1972
162 1973
60 1974
24 1975
11 1976
20 1977
8 1978
7 1979
17 1980
29 1981
37 1982
49 1983
39 1984
41 1985
...
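Each command in this walkthrough downloads the index again. While experimenting, it may be kinder to the server to keep a local copy. The following is a minimal sketch of that idea; the helper name cached_fetch and the cache path in the usage comment are assumptions made for illustration.

```shell
# cached_fetch (hypothetical helper): download URL into FILE only when FILE
# is missing or empty, then print FILE to stdout for the rest of the pipeline.
cached_fetch() {
  url=$1
  file=$2
  [ -s "$file" ] || curl -s -o "$file" "$url" || return 1
  cat "$file"
}

# Usage (only the first call touches the network):
# cached_fetch https://www.ietf.org/rfc/rfc-index.xml /tmp/rfc-index.xml | xmllint --format - | head -5
```

The cache never expires in this sketch, so the file has to be deleted by hand to pick up newly published RFCs.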
Bare numbers aren’t very convenient for analysis, so we’ll use the gnuplot [10] utility to draw graphs with the following configuration [11]:
gnuplot -e "set term dumb size 170, 35; set xtics 3; plot '-' with lines notitle"
It reads the data from STDIN and draws a 170x35-character graph right in the terminal. Putting it all together (here with a slightly narrower 150x35 canvas):
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c | gnuplot -e "set term dumb size 150, 35; set xtics 3; plot '-' with lines notitle"
line 57: warning: Too many axis ticks requested (>2e+02)
[Terminal plot: the Y-axis shows years from 1960 to 2030, the X-axis shows the per-year counts]
As you can see, something went wrong: the years ended up on the Y-axis. This is because gnuplot takes the first column as the X-axis and the second as the Y-axis, while uniq -c prints the count first. We need to swap the two columns:
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c | awk '{print $2" "$1}'
1968 1
1969 25
1970 58
1971 182
1972 134
1973 162
1974 60
1975 24
1976 11
1977 20
1978 8
1979 7
1980 17
1981 29
1982 37
1983 49
1984 39
1985 41
...
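The counting and the column swap can also be folded into one awk pass. This is a sketch of an alternative, not the pipeline used in this article; the helper name count_by_key is made up, and the input is an invented sample standing in for the xmllint output.

```shell
# count_by_key (hypothetical helper): count identical input lines and print
# "key count" pairs, which is the column order gnuplot expects.
count_by_key() {
  awk '{ count[$1]++ } END { for (k in count) print k, count[k] }' | sort -n
}

# Invented sample: one year per line, unsorted.
printf '%s\n' 1969 1968 1969 1970 1969 | count_by_key
```

awk does not guarantee any order when iterating over an array, which is why sort -n comes after the counting here instead of before it.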
And rerun the plot:
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/year/text()" - | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 150, 35; set xtics 3; plot '-' with lines notitle"
[Terminal plot: the X-axis shows years from 1968 to 2025 in steps of 3, the Y-axis shows the number of RFCs from 0 to 500]
Now it looks correct.
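The swap can also be left to gnuplot itself: the using clause of the plot command selects which input column feeds which axis. Below is a minimal sketch with three rows taken from the uniq -c output shown earlier; the variable name plot_script is an assumption, and the plot is skipped when gnuplot is not installed.

```shell
# plot_script (hypothetical variable name): "using 2:1" tells gnuplot to
# take X from the second column and Y from the first one.
plot_script="set term dumb size 150, 35; set xtics 3; plot '-' using 2:1 with lines notitle"

# Rows in "count year" order, exactly as uniq -c prints them.
if command -v gnuplot >/dev/null 2>&1; then
  printf '%s\n' '25 1969' '58 1970' '182 1971' | gnuplot -e "$plot_script"
else
  echo "gnuplot is not installed; skipping the plot" >&2
fi
```

With this variant the awk step is no longer needed for plotting, although the swapped text output is still easier to read as a table.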
Plotting the statistics by month requires one additional transformation step. We need to convert the month names:
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/month/text()" - | sort -n | uniq -c | awk '{print $2" "$1}'
April 848
August 860
December 630
February 802
January 828
July 696
June 847
March 878
May 796
November 659
October 820
September 751
to numeric format, using the date [6] utility in an implicit loop provided by xargs [7]:
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/month/text()" - | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m" | sort -n | uniq -c | awk '{print $2" "$1}'
01 828
02 802
03 878
04 848
05 796
06 847
07 696
08 860
09 751
10 820
11 659
12 630
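With xargs -I, a separate date process is started for every input line, and parsing free-form strings with -d is a GNU date feature. A lookup table in awk does the same conversion in a single process. This is a sketch of that alternative, assuming full English month names as in the index; the helper name month_to_number and the sample input are invented.

```shell
# month_to_number (hypothetical helper): map full English month names to
# zero-padded numbers using a table built once in the BEGIN block.
# Lines that are not a known month name are silently dropped.
month_to_number() {
  awk 'BEGIN {
         n = split("January February March April May June July August September October November December", names, " ")
         for (i = 1; i <= n; i++) num[names[i]] = sprintf("%02d", i)
       }
       $1 in num { print num[$1] }'
}

# Invented sample standing in for the xmllint month output.
printf '%s\n' March January December March | month_to_number | sort -n | uniq -c | awk '{print $2" "$1}'
```

For the four sample lines this prints 01 1, 03 2 and 12 1, one pair per line.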
The monthly statistics show the expected dips in July and in November/December; the most productive months are March and August, closely followed by April and June:
~> curl -s https://www.ietf.org/rfc/rfc-index.xml | xmllint --format - | sed '2 s/xmlns=".*"//g' | xmllint --xpath "//rfc-entry/date/month/text()" - | xargs -I {} env TZ=Europe/London date -d'01 {}' +"%m" | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 150, 35; set xtics 1; plot '-' with lines notitle"
[Terminal plot: the X-axis shows months from 1 to 12, the Y-axis shows the number of RFCs from 600 to 900]
Moreover, we can plot the same graph for IETF [12] Internet-Drafts [13] to see how rapidly their numbers are growing:
~> curl -s https://mirror.funkfreundelandshut.de/ietf/internet-drafts/all_id.txt | awk '/^draft/{print $2}' | awk -F- '!/RFC/{print $1}' | sort -n | uniq -c | awk '{print $2" "$1}' | gnuplot -e "set term dumb size 150, 35; set xtics 2; set ytics 20; plot '-' notitle smooth csplines"
[Terminal plot: the X-axis shows years from 1988 to 2024 in steps of 2, the Y-axis shows the number of Internet-Drafts from 20 to 2400]
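The two awk stages in the Internet-Drafts command can be merged into one. The sketch below assumes that each relevant line starts with the draft name and carries a YYYY-MM-DD date in the second whitespace-separated field. The sample lines are invented for illustration and are not copied from all_id.txt, and the helper name draft_years is hypothetical.

```shell
# draft_years (hypothetical helper): for lines starting with "draft", skip
# those whose second field contains "RFC" and print the part of that field
# before the first "-", i.e. the year.
draft_years() {
  awk '/^draft/ && $2 !~ /RFC/ { split($2, d, "-"); print d[1] }'
}

# Invented sample lines (layout is an assumption, see above).
printf 'draft-example-alpha-00\t1998-03-01\tExpired\ndraft-example-beta-02\t1998-11-20\tExpired\n# a comment line\ndraft-example-gamma-01\t2001-07-04\tActive\n' |
  draft_years | sort -n | uniq -c | awk '{print $2" "$1}'
```

For this sample the result is 1998 2 followed by 2001 1, ready to be piped into the same gnuplot command.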
1. RFC Editor
2. Text-based web browser w3m
3. “Aho, Weinberger and Kernighan” domain-specific language
4. awk originally written by Mike Brennan
5. Built-in regex’s do not support brace-expressions
6. date - write the date and time
7. xargs - construct argument lists and invoke utility
8. sort - sort, merge, or sequence check text files
9. uniq - report or filter out repeated lines
10. gnuplot - portable command-line driven graphing utility
11. gnuplot documentation
12. Internet Engineering Task Force
13. Internet-Drafts
14. XML-based RFCs index
15. xmllint - command line XML tool
16. curl - command line tool and library for transferring data with URL syntax
17. Evaluate XPath in the Linux Command Line