Towards an LMAP Specification of ProbeAPI.

In an effort to bring ProbeAPI closer to the internet measurement community, we have been paying close attention to the new LMAP specification for internet measurement platforms. LMAP is being defined with the goal of standardizing large-scale measurement systems so that consistent measurements can be performed among diverse entities. These entities may differ in implementation details, but complying with the standard makes their components, results and instructions comparable.

“Amongst other things, standardisation enables meaningful comparisons of measurements made of the same Metric at different times and places, and provides the operator of a Measurement System with criteria for evaluation of the different solutions that can be used for various purposes including buying decisions (such as buying the various components from different vendors). Today’s systems are proprietary in some or all of these aspects.” – RFC 7594, July 2015

In order to find out how compliant ProbeAPI is with this standard, we started comparing its design and implementation against an LMAP system. In this post we focus on the general outline of the system: its main components, their roles and the data flow. A detailed comparison of the data model and measurement methods will have to wait for a dedicated post, since both are extensive topics.

The general working scheme of ProbeAPI includes most components from the LMAP specification in very similar roles:

The user makes a measurement request through the API. The API, hosted in the cloud, then passes the testing instructions to the Controller Interface, which forwards them to the Bootstrapper and Controller outside the cloud. The Bootstrapper is in charge of integrating the probes into the system and updates the database to keep track of disconnecting probes. It is implemented using an XMPP server, which uses a lean protocol and allows all the probes relevant to a particular measurement to receive the message simultaneously.

The probes themselves report their online status directly to the API, while the Bootstrapper keeps track of the ones that disconnect. The probes receive the measurement instructions from the Controller. After carrying them out, they will send the results directly to the API to be delivered to the user.

LMAP Scheme for ProbeAPI

The Controller and Bootstrapper component combines two roles: the Controller, which is an element inside the scope of LMAP, and the Bootstrapper, which lies outside the LMAP scope.

When a new probe comes online, it generates its own unique ID, which is sent together with the results so that they can be separated not only by Probe-ID but also by ASN or country. It then calls the login method of the cloud interface so that it is counted as online. When a probe logs off, the Bootstrapper service records its disconnection in the database.

Interaction Diagram for MA-Login, Measurement Instruction and MA-Logout.

A measurement instruction is sent over the Control Protocol as an XMPP message, which can contain, for example, the following information:

  • <Task-ID> Task-ID
  • <MA-ID> Probe-ID
  • <suppression> TimeOut
  • <instruction> Command
  • <parameter> host_address
  • <parameter> ttl
  • <parameter> count
  • <parameter> timeout
  • <parameter> sleep
  • <parameter> BufferSize
  • <parameter> fragment
  • <parameter> resolve
  • <parameter> ipv6only

The API generates a Task-ID, which is passed to the probe with each measurement, so that the results are easily recognized when they are collected. Failure information from the Measurement Agents is included in the results.
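As a rough illustration of the fields listed above, the instruction can be modelled as a small data structure. This is only a sketch: the class name, the field types, the UUID-based Task-ID and the "Task-ID" key on result records are assumptions made for illustration, not ProbeAPI's actual wire format.

```python
from dataclasses import dataclass, field
import uuid


@dataclass
class MeasurementInstruction:
    """Hypothetical in-memory form of an instruction sent to a probe."""
    task_id: str          # <Task-ID>, generated by the API
    ma_id: str            # <MA-ID>, the Probe-ID of the target agent
    suppression_ms: int   # <suppression>, the timeout
    command: str          # <instruction>, e.g. "ping"
    parameters: dict = field(default_factory=dict)  # <parameter> entries


def new_instruction(probe_id, command, suppression_ms, **parameters):
    # A random UUID stands in for however the API really creates Task-IDs.
    return MeasurementInstruction(
        task_id=str(uuid.uuid4()),
        ma_id=probe_id,
        suppression_ms=suppression_ms,
        command=command,
        parameters=parameters,
    )


def results_for_task(results, task_id):
    # Results carry the Task-ID, so they can be matched to their request.
    return [r for r in results if r.get("Task-ID") == task_id]


instr = new_instruction("probe-1", "ping", 8000,
                        host_address="example.com", ttl=128, count=3)
```

Keeping the Task-ID on both the instruction and the result record is what lets the collector pair them up without any extra state on the probe.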

Here is an example of the results header obtained for httpget measurements:

  • HTTPGet_Status
  • HTTPGet_Destination
  • HTTPGet_TimeToFirstByte
  • HTTPGet_TotalTime
  • HTTPGet_ContentLength
  • HTTPGet_DownloadedBytes
  • Network_NetworkName
  • Network_LogoURL
  • Network_CountryCode
  • Network_NetworkID
  • DateTimeStamp
  • Country_Flag<url>
  • Country_Name
  • Country_State
  • Country_StateCode
  • Country_CountryCode
  • Probe-ID
  • ASN_Name
  • ASN_ID
  • Location_Latitude
  • Location_Longitude
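Because every record carries the probe, ASN and country fields listed above, results can be separated by any of them after collection. The following sketch assumes each result is a flat dictionary keyed by those header names; the sample records and AS numbers are made up.

```python
from collections import defaultdict


def group_results(results, key):
    """Group flat result records by one header field, e.g. 'ASN_ID'."""
    groups = defaultdict(list)
    for record in results:
        groups[record[key]].append(record)
    return dict(groups)


# Made-up records using a subset of the header fields listed above.
sample = [
    {"Probe-ID": "p1", "ASN_ID": "AS100", "Country_CountryCode": "PL", "HTTPGet_TotalTime": 120},
    {"Probe-ID": "p2", "ASN_ID": "AS100", "Country_CountryCode": "PL", "HTTPGet_TotalTime": 180},
    {"Probe-ID": "p3", "ASN_ID": "AS200", "Country_CountryCode": "DE", "HTTPGet_TotalTime": 90},
]

by_asn = group_results(sample, "ASN_ID")
mean_total_time = {
    asn: sum(r["HTTPGet_TotalTime"] for r in records) / len(records)
    for asn, records in by_asn.items()
}
```

The same helper works for "Country_CountryCode" or "Probe-ID" by changing the key argument.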

The measurements currently available are:

ICMP (ms), HTTP-GET (ms), Page-Loading time (ms), DNS Query Time.

The API itself doesn’t offer scheduling functions yet, but they are being implemented. ProbeAPI’s measurements are active, and each MA normally measures one flow per instruction. The report data can be presented raw or formatted as JSON. Reports are currently immediate, although there are plans to implement scheduling for reports as well.

There is also no Subscriber Parameter DB, since this information is delivered directly with the results from the probes. AS-Number, Country, AS-Name and Geographic Location are provided directly with the results.

A study on the coverage of ProbeAPI and RIPE Atlas

RIPE Atlas has been successful in establishing a fairly extensive network of measurement probes. They are placed in different environments, such as server rooms, volunteers’ offices, universities and households. Since placing a probe requires a physical device to be installed, the deployment and growth rate of the network is limited by the available physical distribution capacity and the cost of producing enough devices. On the flip side, being a hardware-based measuring platform not only guarantees stable availability of the probes, but also provides a dedicated piece of hardware that allows any customization the measurements may require.

Top 20 Atlas by Users

Although Atlas has already achieved an impressive number of deployed probes, there are still large networks in need of coverage.

ASN       Country (ISO 2-letter code)   Users (APNIC Labs estimate)   RIPE Atlas probes (online)
AS4134    CN                            336 million                   2
AS4837    CN                            204 million                   0
AS9829    IN                            66 million                    0
AS7922    US                            55 million                    336
AS17974   ID                            47 million                    1
AS8151    MX                            39 million                    4
AS24560   IN                            33 million                    5
AS8452    EG                            33 million                    0
AS4713    JP                            30 million                    8
AS7018    US                            29 million                    40
AS9121    TR                            27 million                    8
AS3320    DE                            26 million                    206
AS28573   BR                            24 million                    20
AS45595   PK                            23 million                    1
AS9299    PH                            22 million                    5
AS9808    CN                            21 million                    0
AS701     US                            20 million                    80
AS45899   VN                            19 million                    1
AS18881   BR                            19 million                    8
AS4766    KR                            18 million                    8

In this respect ProbeAPI can provide much relief. Because it is software-based, it has complementary features that offer strategic flexibility. For example, its deployment has a very low cost: it only requires installing a piece of software on a Windows computer. Being able to measure real users’ connectivity is a big advantage, but the normal usage of computers also makes ProbeAPI instances very volatile: personal computers go online and offline for many reasons during normal use.

Both graphs show that there are still large networks with little coverage from ProbeAPI. ASNs 4134 (China Telecom), 4837 (China-168) and 9829 (India Telecom) are good examples of large networks with a comparatively small number of probes.

Nevertheless, ProbeAPI’s easy deployment gives us the possibility to be present in networks where too few or no physical probes have been installed. In our measurements, the number of available probes in ProbeAPI at a given moment is around 84000. During a normal usage day, more than 290000 came online. Although not all probes are online all the time, the number of available probes at a given moment is almost 8 times RIPE Atlas’ active probe count. This counterbalances the volatility of ProbeAPI’s instances, but for longer measurements from a static set of probes, the stability of Atlas probes is an important factor to take into account.

It is important to note that this comparison does not intend to establish the technical superiority of one system over the other. Quite the contrary: during this analysis we realized that Atlas and ProbeAPI may contribute complementary features for measuring networks. For low-coverage networks that are physically or politically hard to reach, a software solution like ProbeAPI may be a viable way to first expand general measurement coverage. Once a region starts installing more Atlas probes, longer measurements with fixed sets of probes become available thanks to Atlas’ more stable probes.

At this stage, ProbeAPI’s end-user perspective provides a convenient view of last-mile conditions. Combining the stability and precision of Atlas’ probes with the large number of measurements possible from the end-user perspective, we can get a detailed portrait of the network’s condition.

Currently there are around 74000 active probes from ProbeAPI and Atlas monitoring the same ASes. ProbeAPI has around 14000 probes measuring networks where Atlas isn’t present. On the other side, Atlas has around 1700 probes where ProbeAPI is absent. Combined they give a grand total of around 95000 active probes able to measure networks serving almost 2.9 billion users.
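A breakdown like this one can be derived from per-AS probe counts with simple set operations. The sketch below shows one way to do it, assuming each system's coverage is available as a mapping from AS number to active probe count, and reading "monitoring the same ASes" as the combined probes of both systems in shared ASes. The numbers are illustrative only, not measured data.

```python
def coverage_split(probeapi, atlas):
    """Split per-AS probe counts into shared, ProbeAPI-only and Atlas-only totals.

    Both arguments map an AS number to the count of active probes in that AS.
    """
    shared = probeapi.keys() & atlas.keys()
    return {
        "shared": sum(probeapi[a] + atlas[a] for a in shared),
        "probeapi_only": sum(n for a, n in probeapi.items() if a not in shared),
        "atlas_only": sum(n for a, n in atlas.items() if a not in shared),
    }


# Illustrative numbers only, not measured data.
probeapi_counts = {"AS1": 50, "AS2": 10, "AS3": 7}
atlas_counts = {"AS1": 5, "AS4": 2}

split = coverage_split(probeapi_counts, atlas_counts)
total_probes = sum(split.values())
```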

 Conclusion

The software-based design of ProbeAPI helps us achieve a vast coverage, even achieving the impressive number of 4000+ available probes for a single AS. Of course, the natural instability of the probes is an inherent constraint of ProbeAPI’s architecture, but that is the trade-off in exchange for a very extended and fast growing measurement network.

On the other hand, RIPE Atlas is designed around physical devices installed in diverse locations by hosts. This physical design brings the inherent stability that a physically independent device can provide. Probes can be placed strategically at points of the net other than end users, where measurements can reveal valuable information about the net’s conditions. All this requires some host recruiting, so the distribution process is naturally slower than a software one.

There are essential architectural differences between ProbeAPI and RIPE Atlas. Both systems were designed with a similar set of measurement features in mind, but their differences in design end up opening different doors, which in turn gives us the possibility of observing the net from a large number of diverse vantage points.

Testing Google Cloud Platform CDN Interconnect with CloudFlare on ProbeAPI

As many of you might have already heard, Google has introduced a new cooperation program with four CDN providers: CloudFlare, Fastly, Highwinds and Level3. The Google Cloud Platform CDN Interconnect Program gives CDN providers access to route their traffic through Google’s private high-speed links, so they can serve their customers through reliable, low-latency routes thanks to Google’s infrastructure.

After reading the news, the ProbeAPI team was curious to find out how much of a performance gain, if any, can be expected from this service. Taking advantage of the large number of probes available at ProbeAPI, we set up an experiment to put this interesting new infrastructure to the test.

We used Amazon’s S3 and Google Storage as cloud storage providers and CloudFlare as CDN. To test similar routes, we chose the server in Singapore for our Amazon S3 bucket and the Asia server for the Google Storage bucket. We chose a maximum of 100 Probes in the USA as destination for our transfer tests.

After connecting and configuring the buckets to make them accessible through the CDN, we put the files to be tested in the buckets: several randomly generated 1.1MB PDF files, since PDF is one of the file extensions cached by CDNs.

Our objective is to measure the transfer times of those files and find out how long it takes the CDN to cache them for each storage provider. In other words, we want to compare the transfer time of a cached file with that of an uncached file from each bucket. The difference between those transfer times gives us the time it takes the CDN to cache a file, assuming that the delay is caused by the caching and that a previously cached file transfers faster.

We test two files per bucket; let’s call them A and B. First we ran a pre-test: an HTTP GET test from ProbeAPI requesting only file A from both buckets. This pre-test caused file A to be cached on CloudFlare’s US servers. We were then ready to run the real test: an HTTP GET test using ProbeAPI with files A and B on both buckets. Because file A was cached by the pre-test, it is expected to transfer faster than file B, which has to be transferred from Asia to the US CDN servers just as file A was during the pre-test. Since this takes some extra time, we can calculate the overhead caused by the caching.
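The overhead calculation itself is a simple difference between uncached and cached transfer times. A minimal sketch follows, assuming the per-probe total times are aggregated with the median (our choice for this illustration, since it is robust against a few slow probes) and using hypothetical timings.

```python
from statistics import median


def caching_overhead(uncached_ms, cached_ms):
    """Estimate caching overhead as median(uncached) - median(cached)."""
    return median(uncached_ms) - median(cached_ms)


# Hypothetical per-probe total transfer times in milliseconds.
file_a_cached = [210, 190, 200]       # file A, warmed by the pre-test
file_b_uncached = [900, 1100, 1000]   # file B, first request through the CDN

overhead_ms = caching_overhead(file_b_uncached, file_a_cached)
```

Running the same calculation once per storage provider gives two overhead figures that can be compared directly.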

Results

After just a few tests with ProbeAPI, the first thing that stands out is the large speedup in transferring files preloaded in the CDN cache, especially for the Amazon servers. There was also a noticeable improvement in uncached Amazon performance after the 5th test. Either network conditions changed suddenly, or some load-balancing mechanism reduced the caching overhead considerably because that route was being used repeatedly by hundreds of probes across the US.

Now getting to the point. We can observe Amazon’s amazing speedup when the file is already cached, even surpassing uncached Google Storage performance after some tests, which is already fast by itself.

Google’s buckets performed well altogether, and that’s where we can clearly see the power of this infrastructure. The overhead introduced by the caching process when using Google + CloudFlare is minimal compared to the one introduced by Amazon + CloudFlare. We attribute this to the new partnership with Google: CloudFlare can now use Google’s infrastructure to transfer data from the datacenter to the CDN very quickly.

Caching Overhead Asia to US

We decided to run the tests once more, using the same methodology but transferring files from US servers to probes located in the US as well. This is a very likely scenario, thus making this set of measurements very interesting.

CDN Comparison US to US

Here we can observe the expected scenario again: uncached files take longer to deliver than files already cached in the CDN. In this case the difference is (also expectedly) less dramatic, due to the US-US traffic routing. The uncached files take similar amounts of time to load, although there’s still a noticeable overhead improvement when measuring transfers from Google’s buckets.

Even with your content being available locally in the US itself, the benefits of this CDN-Google partnership are still evident and relevant.

Analysis and Conclusion

We are living in exciting times, with the Internet becoming ever faster and adopting more sophisticated connectivity year after year. This is one example of how the Net is adopting optimized structures. Some years ago we wouldn’t have dreamed of having our content available practically locally everywhere in the world; that’s one for CDNs.

With this cooperation, not only is that possible, but your newest and updated content also reaches its destination much faster. The major beneficiaries of Google’s Interconnect platform are therefore those whose content is constantly changing, with updates and new files that need to be distributed rapidly for virtually seamless availability all over the world.

Even with your content travelling shorter distances, as our US-US test showed, the benefits of serving customers faster and more reliably are still very noticeable and could be critical in certain scenarios, e.g. during a flash crowd, when your content (or part of it) becomes highly popular overnight. This would be a critical situation where you want to serve everybody without decreasing the quality of your service. The best part is that it works automatically. Such a scenario, which haunted administrators in the past, is becoming less and less fearsome thanks to CDNs, and now even updated content can reach its destination with little overhead.

API Changes: State targeting for USA, new API limits and web reputation check

Added support for targeting probes with state level accuracy

We have added an optional StateCode input parameter to the following methods:

StartDIGTestByCountry
StartHttpGetTestByCountry
StartPageLoadTestByCountry
StartTracertTestByCountry
StartPingTestByCountry

Currently only probes located in the USA can be targeted at state level. If you request CountryCode=US, the API response will also contain StateCode (e.g. VA) and State (e.g. Virginia).

Added new limits on the number of probes you can request tests from

To avoid abuse, we have added limits on how many probes you can request results from in a single API call. This is controlled by the probeLimit parameter, which is required.

Here are the maximum values allowed for different tests.

PING – 100
DIG – 100
TRACEROUTE – 100
PAGELOAD – 20
HTTPGET – 20

If you require those limits to be lifted, please contact us.
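On the client side, it can be convenient to check these limits before sending a request. The following is a minimal sketch under our own assumptions: the helper name, the lower bound of 1 and the way the query string is assembled are illustrative, and only the parameter names CountryCode, StateCode and probeLimit come from the description above.

```python
from urllib.parse import urlencode

# Maximum probeLimit values per test type, as listed above.
MAX_PROBE_LIMIT = {"PING": 100, "DIG": 100, "TRACEROUTE": 100, "PAGELOAD": 20, "HTTPGET": 20}


def build_query(test_type, country_code, probe_limit, state_code=None):
    """Return a query string for a ...ByCountry call, checking probeLimit first."""
    limit = MAX_PROBE_LIMIT[test_type]
    if not 1 <= probe_limit <= limit:
        raise ValueError(f"probeLimit for {test_type} must be between 1 and {limit}")
    params = {"CountryCode": country_code, "probeLimit": probe_limit}
    if state_code is not None:
        # State-level targeting is currently limited to probes in the USA.
        if country_code != "US":
            raise ValueError("StateCode is only supported for CountryCode=US")
        params["StateCode"] = state_code
    return urlencode(params)


query = build_query("PING", "US", 50, state_code="VA")
```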

Added web reputation services for HTTP tests

To avoid abuse and protect the probes from accessing risky content, we have added white and black lists that will be checked for all HTTP tests (e.g. HTTP GET or Page load). The following errors can occur:

StatusCode: 221
StatusDescription: Destination blocked
Message: “This URL destination has been marked as risky by our web classification engine. We are unable to test this URL.”

StatusCode: 222
StatusDescription: Destination not classified
Message: “This URL destination has not been classified by our web classification engine. We are unable to test this URL at this time. Please contact support to whitelist this URL”
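A client can distinguish the two cases to decide what to do next. This sketch assumes the result is available as a dictionary with an integer StatusCode key; that shape and the hint texts are our own illustration rather than a documented format.

```python
# StatusCode values documented above for the web reputation check.
REPUTATION_ERRORS = {
    221: "Destination blocked",
    222: "Destination not classified",
}


def check_reputation(result):
    """Return None if the URL was testable, otherwise a hint for the caller."""
    description = REPUTATION_ERRORS.get(result.get("StatusCode"))
    if description is None:
        return None
    if result["StatusCode"] == 222:
        # Unclassified destinations can be whitelisted on request.
        return description + ": contact support to whitelist this URL"
    return description + ": this URL cannot be tested"
```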

API improvements

We have deployed a few new improvements to the API.

Changes to default timeout and commandTimeout

We have changed the default timeout and commandTimeout parameters to maximize the number of returned results and minimize timeouts on probes. You can still set any timeout you want by specifying those parameters. The values below are in milliseconds.

Ping
timeout = 8000
commandTimeout = 1000

DIG
timeout = 4000
commandTimeout = 1000

Traceroute
timeout = 52000
commandTimeout = 1000

HTTP GET
timeout = 8000
commandTimeout = timeout - 3000

Pageload
timeout = 45000
commandTimeout = timeout - 3000
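The defaults above can be written down as a small lookup. In this sketch the function name and the lowercase test keys are hypothetical, and we assume that the derived commandTimeout for HTTP GET and Pageload follows a caller-supplied timeout as well; the API may behave differently.

```python
# Default timeouts in milliseconds, taken from the list above.
DEFAULT_TIMEOUT_MS = {"ping": 8000, "dig": 4000, "traceroute": 52000,
                      "httpget": 8000, "pageload": 45000}
FIXED_COMMAND_TIMEOUT_MS = {"ping": 1000, "dig": 1000, "traceroute": 1000}


def effective_timeouts(test, timeout=None, command_timeout=None):
    """Return (timeout, commandTimeout) after applying the defaults."""
    if timeout is None:
        timeout = DEFAULT_TIMEOUT_MS[test]
    if command_timeout is None:
        # HTTP GET and Pageload derive commandTimeout from timeout.
        command_timeout = FIXED_COMMAND_TIMEOUT_MS.get(test, timeout - 3000)
    return timeout, command_timeout
```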

Global error handler. Supported errors in the HTTP Header

To make the results returned by the API easier to understand, we now return more information about possible API problems directly in the HTTP header.

Some of the new Status Codes:
200 – OK
500 – Unexpected Error from the API side
520 – Errors related to empty results

If the HTTP status code is anything other than 200, the response body also contains a [Message] element with a more user-friendly description of the error.
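A caller can use the status code to decide whether the body holds results or an error description. The sketch below assumes the body is JSON and that the Message element appears as a top-level "Message" key; both are assumptions, and the exception class is hypothetical.

```python
import json


class ProbeApiError(Exception):
    """Raised when the API answers with a status code other than 200."""


def parse_response(status_code, body_text):
    """Return the decoded body on 200, otherwise raise with the Message text."""
    body = json.loads(body_text) if body_text else {}
    if status_code == 200:
        return body
    message = body.get("Message", "no message provided")
    raise ProbeApiError(f"HTTP {status_code}: {message}")
```

For example, a 520 response with a Message of "No results" would surface as ProbeApiError("HTTP 520: No results").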

Changes in the data output model

We now return only the objects that are in use for the requested test. For example, objects such as PING and HTTPGET are omitted from DIG results, and the same rule applies to the other test types.

For example, based on results for DIG:

OLD RESPONSE:

"StartDIGTestByCountryResult": [ { "ASN": { "AsnID": "AS12741", "AsnName": "Netia SA" }, "CDNEdgeNode": null, "Country": { "CountryCode": "PL", "CountryFlag": "http:\/\/speedcheckerapi.blob.core.windows.net\/bsc-img-country-logos\/pl.png", "CountryName": "Poland" }, "DIGDns": [ { "AdditionalInformation": null, "Destination": "www.broadbandspeedchecker.co.uk", "QueryTime": "13", "Status": "NoError" } ], "DateTimeStamp": "\/Date(1435158434263+0000)\/", "HTTPGet": null, "ID": 31658310, "Location": { "Latitude": 50.320525, "Longitude": 19.132328 }, "LoginDate": null, "Network": { "LogoURL": "https:\/\/speedcheckerapi.blob.core.windows.net\/bsc-img-providers\/logo_0039_ef7487217cc7_60.jpg", "NetworkID": "226", "NetworkName": "Netia SA" }, "PAGELoad": null, "Ping": null, "PingTime": null, "TRACERoute": null } ]

NEW RESPONSE:

"StartDIGTestByCountryResult": [ { "ASN": { "AsnID": "AS197480", "AsnName": "SerczerNET Malgorzata Nienaltowska" }, "Country": { "CountryCode": "PL", "CountryFlag": "http:\/\/speedcheckerapi.blob.core.windows.net\/bsc-img-country-logos\/pl.png", "CountryName": "Poland" }, "DIGDns": [ { "Status": "OK", "Destination": "www.broadbandspeedchecker.co.uk", "QueryTime": "1" } ], "DateTimeStamp": "\/Date(1435150848253+0000)\/", "ID": 17854019, "Location": { "Latitude": 53.29718, "Longitude": 23.28092 }, "Network": { "LogoURL": null, "NetworkID": "3362", "NetworkName": "SerczerNET Malgorzata Nienaltowska" } } ]

New properties available for probe results

For all probe results we added a new property: Status

Status: OK, Timeout, TtlExpired (+ potentially other statuses added in the future)

For Ping and Traceroute we also added a PingTimeArray property, which is useful for calculating standard deviations and other statistics over the individual pings that were performed.

Ping response

"Ping": [ { "Status": "OK", "Destination": "www.interia.pl", "Hostname": "www.interia.pl", "IP": "217.74.65.27", "PingTime": 12, "PingTimeArray": [ "13", "11", "12" ] } ]
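As a quick illustration of the statistics mentioned above, the individual ping times can be fed to Python's statistics module. The entry below copies the relevant fields from the example response; the helper name is ours, and the population standard deviation is just one reasonable choice.

```python
from statistics import mean, pstdev

# The Ping entry from the example response above (fields abridged).
ping = {"Status": "OK", "Destination": "www.interia.pl",
        "PingTime": 12, "PingTimeArray": ["13", "11", "12"]}


def ping_stats(entry):
    """Return (mean, population standard deviation) of the individual pings."""
    samples = [int(value) for value in entry["PingTimeArray"]]  # values arrive as strings
    return mean(samples), pstdev(samples)


avg, spread = ping_stats(ping)
```

Note that the array holds strings, so the values need converting before any arithmetic.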

Traceroute response

"TRACERoute": [ { "Destination": "www.interia.pl", "HostName": "www.interia.pl", "IP": "217.74.65.27", "Tracert": [ { "HostName": "", "IP": "10.27.1.1", "PingTimeArray": [ "1", "0", "0" ], "Ping1": "1", "Ping2": "0", "Ping3": "0", "Status": "OK" }, { "HostName": "", "IP": "178.217.192.254", "PingTimeArray": [ "42", "31", "33" ], "Ping1": "42", "Ping2": "31", "Ping3": "33", "Status": "OK" }, { "HostName": "interia.tpix.pl", "IP": "195.149.232.12", "PingTimeArray": [ "12", "14", "11" ], "Ping1": "12", "Ping2": "14", "Ping3": "11", "Status": "OK" }, { "HostName": "", "IP": "217.74.64.187", "PingTimeArray": [ "11", "11", "11" ], "Ping1": "11", "Ping2": "11", "Ping3": "11", "Status": "OK" }, { "HostName": "www.interia.pl", "IP": "217.74.65.27", "PingTimeArray": [ "12", "11", "13" ], "Ping1": "12", "Ping2": "11", "Ping3": "13", "Status": "OK" } ] } ]

New parameters available for test methods

We have deployed a few new improvements to the API.

1. Support for new parameters

PING Test methods

ttl – Max allowed hops for packet (default=128, max=255)
bufferSize – Size of the Packet data (default=32, max=65500)
fragment – Fragmentation of sending packets. (default=1)
resolve – IP addresses will be resolved to domain names for each hop (default=0)
ipv4only – Force using IPv4. If no IPv4 IP address is returned will return error (default=0)
ipv6only – Force using IPv6. If no IPv6 IP address is returned will return error (default=0)
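The documented defaults and maxima lend themselves to a small client-side check. This is a sketch rather than part of the API: the helper is hypothetical, and treating ipv4only and ipv6only as mutually exclusive is our own assumption.

```python
# Defaults and maxima for the PING parameters listed above.
PING_DEFAULTS = {"ttl": 128, "bufferSize": 32, "fragment": 1,
                 "resolve": 0, "ipv4only": 0, "ipv6only": 0}
PING_MAXIMUMS = {"ttl": 255, "bufferSize": 65500}


def ping_parameters(**overrides):
    """Merge caller overrides into the documented defaults and check the maxima."""
    unknown = set(overrides) - set(PING_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    params = {**PING_DEFAULTS, **overrides}
    for name, maximum in PING_MAXIMUMS.items():
        if params[name] > maximum:
            raise ValueError(f"{name} must not exceed {maximum}")
    if params["ipv4only"] and params["ipv6only"]:
        # Assumption: forcing both address families at once is contradictory.
        raise ValueError("ipv4only and ipv6only cannot both be set")
    return params
```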

DIGDNS Test methods

commandTimeout – DNS query timeout in milliseconds (default=1000ms)
retries – Total number of retries (default=0)

TRACEROUTE Test methods

maxFailedHops – Stop the command execution after maximum errors in a row (e.g. stop after 5 ping timeouts, default=0)
ttlStart – First Hop from which the trace route should start (default=1)
bufferSize – Size of the Packet data (default=32, max=65500)
fragment – Fragmentation of sending packets (default=1)
resolve – IP addresses will be resolved to domain names for each hop (default=0)
ipv4only – Force using IPv4. If no IPv4 IP address is returned will return error (default=0)
ipv6only – Force using IPv6. If no IPv6 IP address is returned will return error (default=0)

PAGELOAD Test methods

commandTimeout – Maximum allowed time for pageload in milliseconds

HTTPGET Test methods

commandTimeout – Maximum allowed time to send HTTP GET request and receive the response in milliseconds
maxBytes – Max bytes to download from response stream

2. Changes in how timeouts are handled by the API

Each method has two main timeout parameters:

timeout – Amount of time in milliseconds in which API responds with result
commandTimeout – timeout for the actual test command. For most tests it is obvious that commandTimeout correlates with the total timeout of the API method. However, that is not the case for Ping and Traceroute. Ping methods execute 3 ping commands by default. Traceroute executes even more, depending on parameters such as TTL, count etc.
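To see why the two values differ for Ping, consider the worst case in which every ping times out. The sketch below assumes the pings run one after another and ignores any other overhead, which may not match how the probes actually execute them. The default of 1000 ms is the Ping commandTimeout default mentioned elsewhere on this blog.

```python
def worst_case_ping_ms(command_timeout_ms=1000, count=3):
    """Upper bound on time spent in ping commands if every ping times out."""
    return command_timeout_ms * count


def fits_in_timeout(timeout_ms, command_timeout_ms=1000, count=3):
    # True when all pings can time out and the API can still answer in time.
    return worst_case_ping_ms(command_timeout_ms, count) <= timeout_ms
```

Under these assumptions, three pings at 1000 ms each need at most 3000 ms, so an API timeout shorter than that could cut the test off before the last ping finishes.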
